If you use Excel at work, you're probably frustratingly familiar with this dialog box:
You've probably also worked on a spreadsheet that was painfully slow to edit, where every click was accompanied by 10 seconds of your screen freezing and the spinning wheel of death.
Usually, large files or workbooks full of formulas are to blame. Excel can handle just over a million rows (1,048,576 per worksheet), but when you're working with a large dataset or doing heavy-duty analysis (applying formulas to a bunch of cells, linking multiple spreadsheets, or connecting to other workbooks), it slows down well before you hit the row limit.
Excel has another weakness that can lead to inefficiency: its structure is too flexible. That might sound kind of crazy—flexibility is one of the reasons people love Excel. Since each cell is its own entity, you have a lot of freedom to add footnotes, merge cells together, or plot out a needlepoint pattern.
However, if one cell can be manipulated easily, it's harder to trust the integrity of the spreadsheet as a whole. Excel's flexibility makes enforcing consistency and accuracy in large datasets nearly impossible. No matter how vigilant you are, no matter how many times you comb through a spreadsheet for typos and broken formulas, you might still miss something.
There's often a better tool for the job.
- Oh, hey, SQL
- Spreadsheet, meet relational database
- The leap from formulas to queries
- Free workbook: The Excel User's Quick Start Guide to SQL
- Next steps for learning SQL
- You're on your way!
Oh, hey, SQL
The data you're working with in Excel had to come from somewhere. That somewhere is a database. Even when you pull data from something that doesn't feel technical (think Google Analytics, Stripe, or Salesforce), behind the scenes you're querying a database.
How many web visits did we get in January? What's our payment funnel abandonment rate for the product we just launched? Which sales reps are building more pipeline than others? These are all queries, in human, not computer, speak. As an Excel power user, you can probably think of how you might write these questions as formulas if armed with the right dataset.
So how do you directly query a database instead? In most cases, people use SQL (short for Structured Query Language). SQL tells a database what data you want to view and how to manipulate it with calculations.
By moving some of your initial calculations to SQL, you can pare down the amount of data you export. And with a smaller dataset, you're less likely to run into Excel performance issues.
As you get more comfortable with SQL, you can move more and more of your analysis into SQL until you reach the point where Excel becomes the exception, not the rule. SQL databases can handle enormous amounts of data without suffering performance issues and have an orderly structure that protects the integrity of your data.
Learning a new language might sound daunting—as can using tools that simply feel more technical. But as an Excel user, you already know more about SQL than you might expect.
Spreadsheet, meet relational database
A database is an organized collection of data. There are many different kinds of databases, but the specific type of database SQL can communicate with is known as a relational database.
Just as an Excel workbook is composed of spreadsheets, a relational database is composed of tables, like the one below.
Tables have rows and columns just like a spreadsheet, but in a table, you can't interact with each cell (or “value,” in database terminology) individually. If you want to exclude Ralph Abernathy's hometown from the top row in the table above, you can't just delete it. You have to exclude the entire row or the entire “hometown” column.
The reason you can't change cells on the fly is that a database has a rigid structure. The values in each row are tied together as a single unit. Each column must have a unique name and can only contain a specific type of data ('Integer', 'Text', 'Date', etc.).
Excel's flexible structure might sound pretty good right now, but hang on. Because a database's structure is so strict, it's easier to protect your data's integrity. In other words, it's much less likely that you'll end up with inconsistencies and errors. And that means you can place a lot more trust in your data.
The leap from formulas to queries
The most common way to manipulate data in Excel is by using formulas. A formula consists of one or more functions that tell Excel what to do with the data in a cell. For instance, you can add numerical values together using SUM(A1:A5) or average them using AVERAGE(A1:A5).
The SQL equivalent of a formula is a query. The query to return the table above looks like this:
SELECT player_name, hometown, state, weight FROM benn.college_football_players
SELECT and FROM are the two fundamental ingredients of any SQL query: SELECT specifies the columns of data you want and FROM indicates which table they live in. You can actually choose to display all the columns by adding an asterisk (*) after SELECT, like this:
SELECT * FROM benn.college_football_players
This query would show you all the columns in the benn.college_football_players table, so you can get an idea of what the entire dataset looks like. Once you know what you need, you can quickly cut the columns to reduce the size of the dataset.
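When you only want a feel for the columns, you may not need every row back. Here is a minimal sketch, assuming your database supports the LIMIT clause (PostgreSQL, MySQL, and SQLite do; SQL Server uses SELECT TOP instead):

```sql
-- Preview the first 10 rows to see what the columns contain.
-- LIMIT is not available in every SQL dialect, so check your database's docs.
SELECT *
FROM benn.college_football_players
LIMIT 10
```

Which 10 rows come back is up to the database unless you also specify an order, so treat this as a quick peek rather than a representative sample.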
Like formulas, queries are composed of functions that specify data manipulations. Queries can also contain clauses, operators, expressions, and a few other components, but we're not going to go into the details here. What you need to know is that you can use SQL to manipulate data in pretty much any way you do in Excel.
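Adding up or averaging a range of cells, as in the formula example earlier, has a close counterpart in SQL. The sketch below assumes the weight column holds numbers; the output names total_weight and average_weight are just illustrative labels.

```sql
-- Total and average of the weight column across every row in the table,
-- roughly what SUM and AVERAGE would do over a whole column in Excel.
SELECT SUM(weight) AS total_weight,
       AVG(weight) AS average_weight
FROM benn.college_football_players
```

Instead of pointing at a cell range like A1:A5, you name the column, and the function runs over all the rows the query returns.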
Take the IF function, for instance. You use IF to create conditional statements that filter data or add new data based on rules you define. Here's what an IF function looks like when you apply it to a cell:
=IF(logical_test, value_if_true, [value_if_false])
You could read that as IF <some condition is met> THEN <display this value> OTHERWISE <display a different value>. The OTHERWISE portion, which is shown as [value_if_false], is optional.
The SQL equivalent of IF is CASE. They have very similar syntax:
CASE WHEN <condition 1 is met> THEN <display value 1> ELSE <display a different value> END
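To make the template concrete, here is a sketch of the simplest case: one condition and two possible outcomes, much like a single IF with both a value_if_true and a value_if_false. The 250 cutoff and the column name heavy_or_not are made up for illustration.

```sql
-- One condition, two outcomes: the SQL counterpart of a single IF.
SELECT player_name,
       weight,
       CASE WHEN weight > 250 THEN 'over 250'
            ELSE '250 or under'
       END AS heavy_or_not
FROM benn.college_football_players
```

The AS keyword names the new column, the way you might type a header above a formula column in a spreadsheet.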
CASE statements are considerably easier to read than IF statements because SQL queries have multiple lines. This structure is ideal for an IF statement with multiple conditions. For instance, if you want to add two categories based on existing data in Excel, you have to nest one IF statement inside another IF statement. When you add a lot of conditions, things get ugly quickly. But in SQL, you can just add a new condition as another line.
In SQL, let's add multiple conditions using the college football data from above. We're going to add a column that divides the football players into four weight groups. Here's the query:
SELECT player_name,
       weight,
       CASE WHEN weight > 250 THEN 'over 250'
            WHEN weight > 200 THEN '201-250'
            WHEN weight > 175 THEN '176-200'
            ELSE '175 or under'
       END AS weight_group
FROM benn.college_football_players
And here's what the results table looks like:
Not too hard, right? The IF statement for this would be a nightmare.
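Once the groups exist, you can also summarize them before exporting anything, much as a pivot table would. The following is a sketch rather than a prescribed approach: it wraps the weight-group query in a subquery (the alias grouped is arbitrary) and counts the players in each group.

```sql
-- Count how many players fall into each weight group.
-- The inner query builds the weight_group label; the outer query tallies it.
SELECT weight_group,
       COUNT(*) AS players
FROM (
    SELECT CASE WHEN weight > 250 THEN 'over 250'
                WHEN weight > 200 THEN '201-250'
                WHEN weight > 175 THEN '176-200'
                ELSE '175 or under'
           END AS weight_group
    FROM benn.college_football_players
) AS grouped
GROUP BY weight_group
ORDER BY players DESC
```

This returns at most four rows instead of one per player, which is the kind of pared-down result that is easy to export. Note that any row with a missing weight fails every WHEN test and lands in the ELSE group.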
You might be thinking, but what about charts and graphs? What about the graphics that make my report a report? One option is to manipulate data in SQL, export it, and build charts in Excel.
If you want to skip the export step, though, some SQL programs (like Mode) allow you to build charts on top of your query results. These charts are tied directly to your database, so anytime you rerun a query, your results and visualizations automatically refresh.
Free workbook: The Excel User's Quick Start Guide to SQL
We've put together a workbook of six go-to Excel tasks and their SQL counterparts. Each lesson contains an example dataset and detailed instructions. You'll learn how to perform the following Excel tasks in SQL:
- Grouped aggregation (pivot tables)
Next steps for learning SQL
As you gain SQL knowledge, it's helpful to know where to focus your learning and how to navigate your company's data.
Pick tutorials geared toward data analysis
There are a lot of SQL resources, but not all of them are focused on data analysis.
Engineers and database administrators use SQL to create, update, and delete tables in databases. They can upload a whole new table or delete one permanently from the database. These are very different tasks from how you'll use SQL (at least until you fall so in love with data that you make a career switch to analytics).
Don't get bogged down in SQL tutorials that are designed for database management. Home in on query-focused tutorials. Here are some SQL lessons to start out with:
- Retrieving data
- Filtering data and making simple calculations
- Using multiple filters at once
- Ordering your results
- Aggregating data
- Counting unique values in a column
- Conditional logic
- Joining datasets
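To give a sense of where these lessons lead, here is a sketch of one query that touches several of them at once, using only the columns shown earlier. The filter values and output column names are illustrative assumptions, and joins are left out because they need a second table.

```sql
-- Filtering (WHERE with two conditions), aggregating (COUNT, AVG),
-- counting unique values (COUNT DISTINCT), and ordering (ORDER BY).
SELECT state,
       COUNT(*) AS players,
       COUNT(DISTINCT hometown) AS distinct_hometowns,
       AVG(weight) AS average_weight
FROM benn.college_football_players
WHERE weight > 200
  AND state IS NOT NULL
GROUP BY state
ORDER BY players DESC
```

Each clause maps to something you likely already do in Excel: WHERE is a filter, GROUP BY is the row field of a pivot table, and ORDER BY is a sort.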
If you find yourself in tutorials talking about things like:
...you've gone down the wrong rabbit hole.
Practice with your company's data
If you're doing analysis at work, there's really no substitute for using your company's data to learn SQL. You can explore your company's data structure and learn technical concepts at the same time. Whatever you pick up will be immediately applicable to your job.
To do that, you need to understand how your company's data is organized. Where's the product and marketing data stored? If you want to look at the accounts that churned last month, which table(s) should you be querying?
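One way to start exploring is to ask the database itself what it contains. The sketch below assumes your database exposes the standard information_schema views (PostgreSQL, MySQL, and SQL Server do; some others do not) and that you have permission to read them.

```sql
-- List the tables you can see, organized by schema.
-- The NOT IN filter hides system schemas; pg_catalog is specific to PostgreSQL.
SELECT table_schema,
       table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
ORDER BY table_schema, table_name;

-- List the columns and data types of one table, here the example table from earlier.
SELECT column_name,
       data_type
FROM information_schema.columns
WHERE table_schema = 'benn'
  AND table_name = 'college_football_players'
ORDER BY ordinal_position;
```

If these views aren't available, your SQL tool's schema browser or your analytics team can show you the same information.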
Most businesses have an analytics team that works with your company's data on a daily basis. These folks will be able to answer your questions or point you to helpful documentation. There's a mutually beneficial relationship here: if you're querying data on your own, you're reducing the analytics team's backlog of data requests. Guaranteed brownie points.
A number of Mode customers have even set up regular training sessions during which analysts teach colleagues SQL using their internal data. Feedback has been overwhelmingly positive from both groups of folks! Let us know if you're thinking about setting this up at your company. We're happy to share some learnings.