Learn Python for business analysis using real-world data. No coding experience necessary.
The Collaborative Data Science Platform
Writing Subqueries in SQL
Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
In this lesson we'll cover:
- Subquery basics
- Using subqueries to aggregate in multiple stages
- Subqueries in conditional logic
- Joining subqueries
- Subqueries and UNIONs
In this lesson, you will continue to work with the same San Francisco Crime data used in a previous lesson.
Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you'd need to do each aggregation in a distinct step.
Subqueries can be used in several places within a query, but it's easiest to start with the
FROM statement. Here's an example of a basic subquery:
SELECT sub.* FROM ( SELECT * FROM tutorial.sf_crime_incidents_2014_01 WHERE day_of_week = 'Friday' ) sub WHERE sub.resolution = 'NONE'
Let's break down what happens when you run the above query:
First, the database runs the "inner query"—the part between the parentheses:
SELECT * FROM tutorial.sf_crime_incidents_2014_01 WHERE day_of_week = 'Friday'
If you were to run this on its own, it would produce a result set like any other query. It might sound like a no-brainer, but it's important: your inner query must actually run on its own, as the database will treat it as an independent query. Once the inner query runs, the outer query will run using the results from the inner query as its underlying table:
SELECT sub.* FROM ( <<results from inner query go here>> ) sub WHERE sub.resolution = 'NONE'
Subqueries are required to have names, which are added after parentheses the same way you would add an alias to a normal table. In this case, we've used the name "sub."
A quick note on formatting: The important thing to remember when using subqueries is to provide some way to for the reader to easily determine which parts of the query will be executed together. Most people do this by indenting the subquery in some way. The examples in this tutorial are indented quite far—all the way to the parentheses. This isn't practical if you nest many subqueries, so it's fairly common to only indent two spaces or so.
Write a query that selects all Warrant Arrests from the
tutorial.sf_crime_incidents_2014_01 dataset, then wrap it in an outer query that only displays unresolved incidents.
The above examples, as well as the practice problem don't really require subqueries—they solve problems that could also be solved by adding multiple conditions to the
WHERE clause. These next sections provide examples for which subqueries are the best or only way to solve their respective problems.
Using subqueries to aggregate in multiple stages
What if you wanted to figure out how many incidents get reported on each day of the week? Better yet, what if you wanted to know how many incidents happen, on average, on a Friday in December? In January? There are two steps to this process: counting the number of incidents each day (inner query), then determining the monthly average (outer query):
SELECT LEFT(sub.date, 2) AS cleaned_month, sub.day_of_week, AVG(sub.incidents) AS average_incidents FROM ( SELECT day_of_week, date, COUNT(incidnt_num) AS incidents FROM tutorial.sf_crime_incidents_2014_01 GROUP BY 1,2 ) sub GROUP BY 1,2 ORDER BY 1,2
If you're having trouble figuring out what's happening, try running the inner query individually to get a sense of what its results look like. In general, it's easiest to write inner queries first and revise them until the results make sense to you, then to move on to the outer query.
Write a query that displays the average number of monthly incidents for each category. Hint: use
tutorial.sf_crime_incidents_cleandate to make your life a little easier.
Subqueries in conditional logic
You can use subqueries in conditional logic (in conjunction with
CASE). The following query returns all of the entries from the earliest date in the dataset (theoretically—the poor formatting of the date column actually makes it return the value that sorts first alphabetically):
SELECT * FROM tutorial.sf_crime_incidents_2014_01 WHERE Date = (SELECT MIN(date) FROM tutorial.sf_crime_incidents_2014_01 )
The above query works because the result of the subquery is only one cell. Most conditional logic will work with subqueries containing one-cell results. However,
IN is the only type of conditional logic that will work when the inner query contains multiple results:
SELECT * FROM tutorial.sf_crime_incidents_2014_01 WHERE Date IN (SELECT date FROM tutorial.sf_crime_incidents_2014_01 ORDER BY date LIMIT 5 )
Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the
IN case) rather than as a table.
You may remember that you can filter queries in joins. It's fairly common to join a subquery that hits the same table as the outer query rather than filtering in the
WHERE clause. The following query produces the same results as the previous example:
SELECT * FROM tutorial.sf_crime_incidents_2014_01 incidents JOIN ( SELECT date FROM tutorial.sf_crime_incidents_2014_01 ORDER BY date LIMIT 5 ) sub ON incidents.date = sub.date
This can be particularly useful when combined with aggregations. When you join, the requirements for your subquery output aren't as stringent as when you use the
WHERE clause. For example, your inner query can output multiple results. The following query ranks all of the results according to how many incidents were reported in a given day. It does this by aggregating the total number of incidents each day in the inner query, then using those values to sort the outer query:
SELECT incidents.*, sub.incidents AS incidents_that_day FROM tutorial.sf_crime_incidents_2014_01 incidents JOIN ( SELECT date, COUNT(incidnt_num) AS incidents FROM tutorial.sf_crime_incidents_2014_01 GROUP BY 1 ) sub ON incidents.date = sub.date ORDER BY sub.incidents DESC, time
Write a query that displays all rows from the three categories with the fewest incidents reported.Try it out See the answer
Subqueries can be very helpful in improving the performance of your queries. Let's revisit the Crunchbase Data briefly. Imagine you'd like to aggregate all of the companies receiving investment and companies acquired each month. You could do that without subqueries if you wanted to, but don't actually run this as it will take minutes to return:
SELECT COALESCE(acquisitions.acquired_month, investments.funded_month) AS month, COUNT(DISTINCT acquisitions.company_permalink) AS companies_acquired, COUNT(DISTINCT investments.company_permalink) AS investments FROM tutorial.crunchbase_acquisitions acquisitions FULL JOIN tutorial.crunchbase_investments investments ON acquisitions.acquired_month = investments.funded_month GROUP BY 1
Note that in order to do this properly, you must join on date fields, which causes a massive "data explosion." Basically, what happens is that you're joining every row in a given month from one table onto every month in a given row on the other table, so the number of rows returned is incredibly great. Because of this multiplicative effect, you must use
COUNT(DISTINCT) instead of
COUNT to get accurate counts. You can see this below:
The following query shows 7,414 rows:
SELECT COUNT(*) FROM tutorial.crunchbase_acquisitions
The following query shows 83,893 rows:
SELECT COUNT(*) FROM tutorial.crunchbase_investments
The following query shows 6,237,396 rows:
SELECT COUNT(*) FROM tutorial.crunchbase_acquisitions acquisitions FULL JOIN tutorial.crunchbase_investments investments ON acquisitions.acquired_month = investments.funded_month
If you'd like to understand this a little better, you can do some extra research on cartesian products. It's also worth noting that the
FULL JOIN and
COUNT above actually runs pretty fast—it's the
COUNT(DISTINCT) that takes forever. More on that in the lesson on optimizing queries.
Of course, you could solve this much more efficiently by aggregating the two tables separately, then joining them together so that the counts are performed across far smaller datasets:
SELECT COALESCE(acquisitions.month, investments.month) AS month, acquisitions.companies_acquired, investments.companies_rec_investment FROM ( SELECT acquired_month AS month, COUNT(DISTINCT company_permalink) AS companies_acquired FROM tutorial.crunchbase_acquisitions GROUP BY 1 ) acquisitions FULL JOIN ( SELECT funded_month AS month, COUNT(DISTINCT company_permalink) AS companies_rec_investment FROM tutorial.crunchbase_investments GROUP BY 1 )investments ON acquisitions.month = investments.month ORDER BY 1 DESC
Note: We used a
FULL JOIN above just in case one table had observations in a month that the other table didn't. We also used
COALESCE to display months when the
acquisitions subquery didn't have month entries (presumably no acquisitions occurred in those months). We strongly encourage you to re-run the query without some of these elements to better understand how they work. You can also run each of the subqueries independently to get a better understanding of them as well.
Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.Try it out See the answer
Subqueries and UNIONs
For this next section, we will borrow directly from the lesson on UNIONs—again using the Crunchbase data:
SELECT * FROM tutorial.crunchbase_investments_part1 UNION ALL SELECT * FROM tutorial.crunchbase_investments_part2
It's certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point (Excel can only handle ~1M rows per spreadsheet). The two tables used above can be thought of as different parts of the same dataset—what you'd almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a subquery:
SELECT COUNT(*) AS total_rows FROM ( SELECT * FROM tutorial.crunchbase_investments_part1 UNION ALL SELECT * FROM tutorial.crunchbase_investments_part2 ) sub
This is pretty straightforward. Try it for yourself:
Write a query that ranks investors from the combined dataset above by the total number of investments they have made.Try it out See the answer
Write a query that does the same thing as in the previous problem, except only for companies that are still operating. Hint: operating status is in
SQL Window Functions