October 1, 2015
NaN minute read
Understanding how your data is distributed is critical for any analysis. Box-and-whisker plots are great for summarizing distributions and comparing distributions across datasets. Here's how to create them in SQL.
Datasets are often summarized with a single value, such as an average or median. While succinct and easy to understand, these “measures of central tendency”—such as the chart below showing the median ages of Olympic athletes—obscure the bigger picture.
Skeleton, the sport for people who've decided they've lived long enough.
To better understand the shape of your data, box-and-whisker plots (or simply, box plots) quickly display other descriptors. Flowing Data's Nathan Yau does a great job of introducing box plots. As he notes, box plots typically graph six data points:
The lowest value, excluding outliers
The first quartile (this is the 25th percentile, or median of all the numbers below the median)
The median value (equivalent to the 50th percentile)
The third quartile (this is the 75th percentile, or median of all the numbers above the median)
The highest value, excluding outliers
The visual output looks like this ("IQR" stands for interquartile range, and is the difference between Q1 and Q3. More on that in a bit.):
And when plotted by a computer rather than a human, you can begin to see how box plots are helpful for making comparisons across datasets:
These additions highlight many details the previous graph missed. While Skeleton athletes have the highest median age, curlers tend to skew older. Over half of the speedskaters are between the ages of 23 and 28. And despite being among the sports with younger athletes, alpine skiing includes the oldest athlete in the 2014 Olympics (55-year old Prince Hubertus of Hohenlohe-Langenburg).
Because box plots are based on medians and percentiles (functions that require data to be sorted in a particular order), most databases can't directly calculate the values needed to create a box plot.
The steps below show how to find these values using window functions. The example below uses common table expressions, but you can also write a series of subqueries if that's more your style.
Step 0: Set up your data
If your data is already in a table, you're ready to move on to Step 1.
If not, format your data with one row per entry, containing the value you want to summarize.
This example uses a table that has one row per Olympic athlete and includes each athlete's sport and age. To make it easier to change variables (for graphing athletes by height or looking at age by country), I aliased the sport column as
series and the age column as
value in the common table expression.
WITH raw_data AS ( SELECT sport_name AS series, age AS value FROM benn.olympic_athletes )
Step 1: Order and count your data
If you have nine rows in your dataset, the
median is the fifth highest value. The
first quartile is calculated using the second and third highest values and the
third quartile uses the seventh and eighth highest (the specifics of how quartiles are calculated can vary).
These calculations require:
The total number of points in your dataset
The order of each datapoint
A second common table expression calculates the count and order of the data defined in Step 0.
details AS ( SELECT series, value, ROW_NUMBER() OVER (PARTITION BY series ORDER BY value) AS row_number, SUM(1) OVER (PARTITION BY series) AS total FROM raw_data )
ROW_NUMBER function determines the relative ranking of each datapoint. The
SUM function finds the total number of points for each grouping.
If you have a single grouping (say, for instance, you want to construct a box plot for all athletes, regardless of sport), drop
PARTITION BY series from each line. Like so:
details AS ( SELECT series, value, ROW_NUMBER() OVER (ORDER BY value) AS row_number, SUM(1) OVER () AS total FROM ages )
Step 2: Calculate the median and quartiles
Once you've sorted your data, calculating the median and quartiles requires a bit of math in a third common table expression. Roughly, the
first quartile (q1) is the median of the values below the median, while the
third quartile (q3) is the median of the values above the median.
quartiles AS ( SELECT series, value, AVG(CASE WHEN row_number >= (FLOOR(total/2.0)/2.0) AND row_number <= (FLOOR(total/2.0)/2.0) + 1 THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS q1, AVG(CASE WHEN row_number >= (total/2.0) AND row_number <= (total/2.0) + 1 THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS median, AVG(CASE WHEN row_number >= (CEIL(total/2.0) + (FLOOR(total/2.0)/2.0)) AND row_number <= (CEIL(total/2.0) + (FLOOR(total/2.0)/2.0) + 1) THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS q3 FROM details )
As before, if you'd rather display your data in a single grouping, drop
PARTITION BY series from each line (but leave the parentheses).
Step 3: Calculate the min and max
The final step calculates the min and max values of your dataset and aggregates the dataset into a single row.
SELECT series, MIN(value) AS minimum, AVG(q1) AS q1, AVG(median) AS median, AVG(q3) AS q3, MAX(value) AS maximum FROM quartiles GROUP BY 1
To see it all together, check out this query.
If you're using Postgres, you can add outliers—like in the embedded example above—to your box plot by replacing the final step with the query below.
To determine which points are outliers, you must first determine the interquartile range (IQR). It's the difference between
q3. Outliers are values that fall 1.5x IQR below
q1 or 1.5x IQR above
This query returns outlier values as a string of comma-separated values. It also replaces the minimum and maximum values with non-outlier values.
SELECT series, ARRAY_TO_STRING(ARRAY_AGG(CASE WHEN value < q1 - ((q3-q1) * 1.5) THEN value::VARCHAR ELSE NULL END),',') AS lower_outliers, MIN(CASE WHEN value >= q1 - ((q3-q1) * 1.5) THEN value ELSE NULL END) AS minimum, AVG(q1) AS q1, AVG(median) AS median, AVG(q3) AS q3, MAX(CASE WHEN value <= q3 + ((q3-q1) * 1.5) THEN value ELSE NULL END) AS maximum, ARRAY_TO_STRING(ARRAY_AGG(CASE WHEN value > q3 + ((q3-q1) * 1.5) THEN value::VARCHAR ELSE NULL END),',') AS upper_outliers FROM quartiles GROUP BY 1
Redshift offers a couple useful functions that make calculating medians and quartiles easier. As you'll see below, two built-in functions eliminate the need for the second CTE (
details, explained in Step 1) and dramatically shorten Step 2.
MEDIAN window function calculates—you guessed it—the median.
PERCENTILE_CONT window function finds any specified percentiles, offering a quick way to calculate the first quartile (25th percentile) and third quartile (75th percentile).
Unfortunately, Redshift doesn't support
ARRAY_AGG functions, so you can't use the method above to add outliers.
WITH raw_data AS ( SELECT sport_name AS series, age AS value FROM benn.olympic_athletes ), quartiles AS ( SELECT series, PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY series) AS q1, MEDIAN(value) OVER (PARTITION BY series) AS median, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY series) AS q3 FROM raw_data ) SELECT series, MIN(value) AS minimum, AVG(q1) AS q1, AVG(median) AS median, AVG(q3) AS q3, MAX(value) AS maximum FROM quartiles GROUP BY 1
If you want to visualize your query results as an actual box plot, you can copy the visualization code from the embedded example.
Create a report in Mode that returns the same columns as above.
maximum are required,
upper_outliers are optional.
Replace your report's HTML (when editing your report, click “Report” and open the HTML pane) with this report's HTML. Voilà!
Like last week, we've compiled the key syntax for calculating box plot values in Redshift into one bite-size image.
Work-related distractions for data enthusiasts.