Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
In this lesson we'll cover:
- Using SQL DISTINCT for viewing unique values
- Using DISTINCT in aggregations
- DISTINCT performance
- Practice problems
You'll occasionally want to look at only the unique values in a particular column. You can do this using
SELECT DISTINCT syntax. To select unique values from the
month column in the Apple stock prices dataset, you'd use the following query:
SELECT DISTINCT month FROM tutorial.aapl_historical_stock_price
If you include two (or more) columns in a
SELECT DISTINCT clause, your results will contain all of the unique pairs of those two columns:
SELECT DISTINCT year, month FROM tutorial.aapl_historical_stock_price
Note: You only need to include
DISTINCT once in your
SELECT clause—you do not need to add it for each column name.
DISTINCT can be particularly helpful when exploring a new data set. In many real-world scenarios, you will generally end up writing several preliminary queries in order to figure out the best approach to answering your initial question. Looking at the unique values on each column can help identify how you might want to group or filter the data.
You can use
DISTINCT when performing an aggregation. You'll probably use it most commonly with the
In this case, you should run the query below that counts the unique values in the
SELECT COUNT(DISTINCT month) AS unique_months FROM tutorial.aapl_historical_stock_price
The results show that there are 12 unique values (other examples may be less obvious). That's a small enough number that you might be able to aggregate by month and interpret the results fairly early. For example, you might follow this up by taking average trade volumes by month to get a sense of when Apple stock really moves:
SELECT month, AVG(volume) AS avg_trade_volume FROM tutorial.aapl_historical_stock_price GROUP BY month ORDER BY 2 DESC
Okay, back to
DISTINCT. You'll notice that
DISTINCT goes inside the aggregate function rather than at the beginning of the
SELECT clause. Of course, you can
AVG the distinct values in a column, but there are fewer practical applications for them. For
MIN, you probably shouldn't ever use
DISTINCT because the results will be the same as without
DISTINCT, and the
DISTINCT function will make your query substantially slower to return results.
It's worth noting that using
DISTINCT, particularly in aggregations, can slow your queries down quite a bit. We'll cover this in greater depth in a later lesson.