July 23, 2020
NaN minute read
The data team at Mode loves SQL. We code it, write about it, and empower people to use it all over the world. So, in that vein, we thought it would be fun to use some SQL to analyze some SQL! That said, we don’t believe in analysis for the sake of analysis, so we did some research to help our readers better understand one of the most powerful tools in SQL—window functions.
We’re going to borrow PostgreSQL’s definition:
"A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result."
For those not familiar with the concept of window functions, I recommend you take a look at the Mode tutorial on this subject before moving on.
Okay, so we were curious to understand which window functions are most commonly used, so we built a window (pun intended) into our customer’s usage of these types of calculations.
We built a dataset of all queries created since January 1, 2019 and used some regex magic to extract window functions. Then we counted up uses of each window function in our dataset and identified the most frequently used. The chart below shows the percent of all window function use attributed to the top five most used.
Given that these are the most frequently used window functions, we thought it would be useful to provide examples of how to use them—both for folks who are new to SQL, and for SQL users like me, who, sometimes forget a little syntax.
One of the most common use cases for the
SUM window function is calculating a running sum. Let’s look at an example:
SELECT o.occurred_at, SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) as running_gloss_orders FROM demo.orders o
To break down the syntax here,
SUM(o.gloss_qty) defines the aggregation—we’re going to be taking a sum of
gloss_qty. Next, the
OVER(ORDER BY o.occurred_at) clause tells your SQL engine to go row-by-row and take the sum of every record of
gloss_qty in order by
occurred_at up until the row it’s currently looking at.
We can also create separate running sums for different groups within our dataset by adding a
PARTITION BY as seen below:
SELECT o.occurred_at, o.account_id, SUM(o.gloss_qty) OVER(PARTITION BY o.account_id ORDER BY o.occurred_at) as running_gloss_orders FROM demo.orders o
Ah, row number. This is my favorite window function. I commonly use
ROW_NUMBER when I want to return the most recent record of a table I’m working with.
ROW_NUMBER simply returns the number of the given row starting with 1, which is the first record as defined by the
ORDER BY. No need to specify a variable in the parenthesis here!
WITH order_ranks as ( SELECT o.id, o.account_id, o.gloss_qty, o.gloss_amt_usd, ROW_NUMBER() OVER(PARTITION BY o.account_id ORDER BY o.occurred_at DESC) as acct_order_rank FROM demo.orders o ) SELECT * FROM order_ranks WHERE acct_order_rank = 1
DESC in the
ORDER BY —this tells the SQL engine to order rows by
occurred_at in descending order (i.e. latest record first.) The query above returns the most recent record in the orders table for each account. This query also makes use of CTEs (if those are a foreign concept, read our latest post on them).
DENSE_RANK works similar to
ROW_NUMBER but differs in how it handles cases where two rows for the field specified in your
ORDER BY have identical values.
DENSE_RANK will assign rows with identical
ORDER_BY values the same number, whereas
ROW_NUMBER will produce distinct values.
SELECT o.id, o.account_id, DATE_TRUNC('day',o.occurred_at) as occurred_day, o.gloss_qty, o.gloss_amt_usd, DENSE_RANK() OVER(PARTITION BY o.account_id ORDER BY DATE_TRUNC('day',o.occurred_at)) as dense_rank FROM demo.orders o
You can see in the results below that the 2nd and 3rd rows have the same dense_rank value of 2. Oh, and you’ll notice that the window function above uses a value derived from another SQL function
DATE_TRUNC —nesting SQL functions inside a window function is fair game.
LAG function is a favorite among time series junkies. This function allows you to compare a row to any of the rows preceding it. So for instance, if you wanted to find out how one order of
gloss_qty compared to the previous orders, this is the function to use.
SELECT o.id, o.occurred_at, o.gloss_qty, LAG(gloss_qty,1) OVER(ORDER BY o.occurred_at) as order_lag_1, LAG(gloss_qty,2) OVER(ORDER BY o.occurred_at) as order_lag_2, LAG(gloss_qty,3) OVER(ORDER BY o.occurred_at) as order_lag_3 FROM demo.orders o
Keep in mind that the first n rows of your data will be
NULL (where n is the number of “lags” you specify). You can see this in the results of our query:
And finally, we can top off this list with
MAX works in the same way as
SUM so I’m going to spice up the example with some more advanced syntax. Let’s say that I want to see how a current order stacks up against the highest order quantity among the previous 5 orders for
gloss_qty. We can do this by adding in some additional syntax in our
ORDER BY clause.
SELECT o.id, o.occurred_at, o.gloss_qty, MAX(gloss_qty) OVER(ORDER BY o.occurred_at ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) as max_order FROM demo.orders o
After we specify the field to order by, we add a definition for our window size:
ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING. This clause is essentially saying, "look back across the previous 5 orders (not including the current order) and take the maximum value."
This syntax is flexible and can define any window across your dataset. The syntax typically looks like:
ORDER BY [order_var] ROWS BETWEEN window_start AND window_end
window_end take on one of the following values:
UNBOUNDED PRECEDING (i.e. all rows before the current row)
[VALUE] PRECEDING (where [VALUE] = # of rows behind the current row to consider)
[VALUE] FOLLOWING (where [VALUE] = # of rows ahead of the current row to consider)
UNBOUNDED FOLLOWING (i.e. all rows after the current row)
Window functions are incredibly powerful tools that enable quick and flexible analysis, all in the context of the SQL editor. Now, go find your favorite window function and analyze away!
Curious about how Mode works? Sit back and watch the video—no reps ;)
Work-related distractions for data enthusiasts.