Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
The SQL UNION operator
SQL joins allow you to combine two datasets side-by-side, but
UNION allows you to stack one dataset on top of the other. Put differently,
UNION allows you to write two separate
SELECT statements, and to have the results of one statement display in the same table as the results from the other statement.
Let’s try it out with the Crunchbase investment data, which has been split into two tables for the purposes of this lesson. The following query will display all results from the first portion of the query, then all results from the second portion in the same table:
UNION only appends distinct values. More specifically, when you use
UNION, the dataset is appended, and any rows in the appended table that are exactly identical to rows in the first table are dropped. If you’d like to append all the values from the second table, use
UNION ALL. You’ll likely use
UNION ALL far more often than
UNION. In this particular case, there are no duplicate rows, so
UNION ALL will produce the same results:
SQL has strict rules for appending data:
- Both tables must have the same number of columns
- The columns must have the same data types in the same order as the first table
While the column names don’t necessarily have to be the same, you will find that they typically are. This is because most of the instances in which you’d want to use
UNION involve stitching together different parts of the same dataset (as is the case here).
Since you are writing two separate
SELECT statements, you can treat them differently before appending. For example, you can filter them differently using different
Sharpen your SQL skills
Write a query that appends the two
crunchbase_investments datasets above (including duplicate values). Filter the first dataset to only companies with names that start with the letter "T", and filter the second to companies with names starting with "M" (both not case-sensitive). Only include the
For a bit more of a challenge:
Write a query that shows 3 columns. The first indicates which dataset (part 1 or 2) the data comes from, the second shows company status, and the third is a count of the number of investors.
you will have to use the
table as well as the investments tables. And you'll want to group by status and dataset.