mode helix
NEXT WEBINARHow companies should be thinking about data governanceRegister now

July 23, 2020 5 minute read

The Top 5 Most Popular Window Functions and How to Use Them

Image of author

Nate Coleman

Data Scientist

The data team at Mode loves SQL. We code it, we write about it, and empower people to use it all over the world. So, in that vein, we thought it would be fun to use some SQL to analyze some SQL! That said, we don’t believe in analysis for the sake of analysis, so we did some research to help our readers better understand one of the most powerful tools in SQL—window functions.

What is a window function?

We’re going to borrow PostgreSQL’s definition:

"A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result."

For those not familiar with the concept of window functions, I recommend you take a look at the Mode tutorial on this subject before moving on.

Okay, so we were curious to understand which window functions are most commonly used, so we built a window (pun intended) into our customer’s usage of these types of calculations.

We built a dataset of all queries created since January 1, 2019 and used some regex magic to extract window functions. Then we counted up uses of each window function in our dataset and identified the most frequently used. The chart below shows the percent of all window function use attributed to the top five most used.

Window functions 1

Given that these are the most frequently used window functions, we thought it would be useful to provide examples of how to use them—both for folks who are new to SQL, and for SQL users like me, who, sometimes forget a little syntax.

1. Sum

One of the most common use cases for the SUM window function is calculating a running sum. Let’s look at an example:

SELECT o.occurred_at,
       SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) as running_gloss_orders
  FROM demo.orders o

To break down the syntax here, SUM(o.gloss_qty) defines the aggregation—we’re going to be taking a sum of gloss_qty. Next, the OVER(ORDER BY o.occurred_at) clause tells your SQL engine to go row-by-row and take the sum of every record of gloss_qty in order by occurred_at up until the row it’s currently looking at.

We can also create separate running sums for different groups within our dataset by adding a PARTITION BY as seen below:

SELECT o.occurred_at,
         SUM(o.gloss_qty) OVER(PARTITION BY o.account_id 
                          ORDER BY o.occurred_at) as running_gloss_orders
  FROM demo.orders o

2. Row Number

Ah, row number. This is my favorite window function. I commonly use ROW_NUMBER when I want to return the most recent record of a table I’m working with. ROW_NUMBER simply returns the number of the given row starting with 1, which is the first record as defined by the ORDER BY. No need to specify a variable in the parenthesis here!

    order_ranks as (
              ROW_NUMBER() OVER(PARTITION BY o.account_id 
                           ORDER BY o.occurred_at DESC) as acct_order_rank
      FROM demo.orders o

    SELECT *
     FROM  order_ranks
    WHERE  acct_order_rank = 1

Note the DESC in the ORDER BY —this tells the SQL engine to order rows by occurred_at in descending order (i.e. latest record first.) The query above returns the most recent record in the orders table for each account. This query also makes use of CTEs (if those are a foreign concept, read our latest post on them).

3. Dense Rank

DENSE_RANK works similar to ROW_NUMBER but differs in how it handles cases where two rows for the field specified in your ORDER BY have identical values. DENSE_RANK will assign rows with identical ORDER_BY values the same number, whereas ROW_NUMBER will produce distinct values.

         DATE_TRUNC('day',o.occurred_at) as occurred_day,
         DENSE_RANK() OVER(PARTITION BY o.account_id 
                      ORDER BY DATE_TRUNC('day',o.occurred_at)) as dense_rank
    FROM demo.orders o

You can see in the results below that the 2nd and 3rd rows have the same dense_rank value of 2. Oh, and you’ll notice that the window function above uses a value derived from another SQL function DATE_TRUNC —nesting SQL functions inside a window function is fair game.

Window functions 2

4. Lag

The LAG function is a favorite among time series junkies. This function allows you to compare a row to any of the rows preceding it. So for instance, if you wanted to find out how one order of gloss_qty compared to the previous orders, this is the function to use.

            LAG(gloss_qty,1) OVER(ORDER BY o.occurred_at) as order_lag_1,
            LAG(gloss_qty,2) OVER(ORDER BY o.occurred_at) as order_lag_2,
            LAG(gloss_qty,3) OVER(ORDER BY o.occurred_at) as order_lag_3
     FROM   demo.orders o

Keep in mind that the first n rows of your data will be NULL (where n is the number of “lags” you specify). You can see this in the results of our query:

Window functions 3

5. Max

And finally, we can top off this list with MAX. MAX works in the same way as SUM so I’m going to spice up the example with some more advanced syntax. Let’s say that I want to see how a current order stacks up against the highest order quantity among the previous 5 orders for gloss_qty. We can do this by adding in some additional syntax in our ORDER BY clause.

         MAX(gloss_qty) OVER(ORDER BY o.occurred_at 
                        ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) as max_order
FROM demo.orders o

After we specify the field to order by, we add a definition for our window size: ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING. This clause is essentially saying, "look back across the previous 5 orders (not including the current order) and take the maximum value."

This syntax is flexible and can define any window across your dataset. The syntax typically looks like:

ORDER BY [order_var] ROWS BETWEEN window_start AND window_end

where window_start and windown_end take on one of the following values:

  • UNBOUNDED PRECEDING (i.e. all rows before the current row)
  • [VALUE] PRECEDING (where [VALUE] = # of rows behind the current row to consider)
  • [VALUE] FOLLOWING (where [VALUE] = # of rows ahead of the current row to consider)
  • UNBOUNDED FOLLOWING (i.e. all rows after the current row)

Window functions are incredibly powerful tools that enable quick and flexible analysis, all in the context of the SQL editor. Now, go find your favorite window function and analyze away!

Looks like you've got a thing for cutting-edge data news.

Get the latest.