Pandas .groupby(), Lambda Functions, & Pivot Tables
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
In this lesson, you'll learn how to group, sort, and aggregate data to examine subsets and trends. Specifically, you’ll learn how to:
- Sample and sort data with
- Create Lambda functions
- Group data by columns with
- Plot grouped data
- Group and aggregate data with
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you'll be using records of United States domestic flights from the US Department of Transportation. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Clone. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets.head(n=5). Run this code so you can see the first five rows of the dataset.
datasets is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
In this lesson, you'll use records of United States domestic flights from the US Department of Transportation. It includes a record of each flight that took place from January 1-15 of 2015. Each record contains a number of values:
FlightDate Flight Date (yyyymmdd) UniqueCarrier Unique Carrier Code FlightNum Flight Number (Flights on different days may have the same flight number) Origin Origin Airport Dest Destination Airport DepDelay Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. ArrDelay Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. Cancelled Cancelled Flight Indicator (1=Yes) CarrierDelay Carrier Delay, in Minutes WeatherDelay Weather Delay, in Minutes NASDelay National Air System Delay, in Minutes SecurityDelay Security Delay, in Minutes LateAircraftDelay Late Aircraft Delay, in Minutes
AA American Airlines Inc. OO SkyWest Airlines Inc. DL Delta Air Lines Inc. NK Spirit Air Lines HA Hawaiian Airlines Inc. WN Southwest Airlines Co. B6 JetBlue Airways US US Airways Inc. AS Alaska Airlines Inc. MQ Envoy Air F9 Frontier Airlines Inc. VX Virgin America EV ExpressJet Airlines Inc. UA United Air Lines Inc.
For more visual exploration of this dataset, check out this estimator of which flight will get you there the fastest on FiveThirtyEight.
After following the steps above, go to your notebook and import NumPy and Pandas, then assign your DataFrame to the
data variable so it's easy to keep track of:
import pandas as pd import numpy as np
data = datasets # assign SQL query results to the data variable data = data.fillna(np.nan)
.sample() method lets you get a random set of rows of a DataFrame. Set the parameter
n= equal to the number of rows you want. Sampling the dataset is one way to efficiently explore what it contains, and can be especially helpful when the first few rows all look similar and you want to see diverse data.
Grab a sample of the flight data to preview what kind of data you have. Re-run this cell a few times to get a better idea of what you're seeing:
Now that you have a sense for what some random records look like, take a look at some of the records with the longest delays. This is likely a good place to start formulating hypotheses about what types of flights are typically delayed.
The values in the
arr_delay column represent the number of minutes a given flight is delayed. Sort by that column in descending order to see the ten longest-delayed flights. Note that values of
0 indicate that the flight was on time:
Wow. The longest delay was 1444 minutes—a whole day! The worst delays occurred on American Airlines flights to DFW (Dallas-Fort Worth), and they don't seem to have been delayed due to weather (you can tell because the values in the
weather_delay column are
January can be a tough time for flying—snowstorms in New England and the Midwest delayed travel at the beginning of the month as people got back to work. But how often did delays occur from January 1st-15th?
To quickly answer this question, you can derive a new column from existing data using an in-line function, or a lambda function.
In the previous lesson, you created a column of boolean values (True or False) in order to filter the data in a DataFrame. Here, it makes sense to use the same technique to segment flights into two categories: delayed and not delayed.
The technique you learned in the previous lesson calls for you to create a function, then use the
.apply() method like this:
def is_delayed(x): return x > 0
data['delayed'] = data['arr_delay'].apply(is_delayed)
For very short functions or functions that you do not intend to use multiple times, naming the function may not be necessary. The function used above could be written more quickly as a lambda function, or a function without a name.
The following code does the same thing as the above cell, but is written as a lambda function:
data['delayed'] = data['arr_delay'].apply(lambda x: x > 0)
Let's break it down:
lambda— this is a lambda function
x:— the parameter name within the function
x > 0— what to do with the parameter
Your biggest question might be, What is x? The
.apply() method is going through every record one-by-one in the
data['arr_delay'] series, where x is each record. Just as the
def function does above, the
lambda function checks if the value of each
arr_delay record is greater than zero, then returns
This might be a strange pattern to see the first few times, but when you’re writing short functions, the lambda function allows you to work more quickly than the def function.
Now that you have determined whether or not each flight was delayed, you can get some information about the aggregate trends in flight delays.
Count the values in this new column to see what proportion of flights are delayed:
False 103037 True 98627 Name: delayed, dtype: int64
value_counts() method actually returns the two numbers, ordered from largest to smallest. You can use them to calculate the percentage of flights that were delayed:
not_delayed = data['delayed'].value_counts() # first value of the result above delayed = data['delayed'].value_counts() # second value of the result above total_flights = not_delayed + delayed # total count of flights print float(delayed) / total_flights # converting to float to get a float result
51% of flights had some delay. That's pretty high! Better bring extra movies.
You might have noticed in the example above that we used the
float() function. Try to answer the following question and you'll see why:
What is 4 divided by 3?
One? That's weird.
This calculation uses whole numbers, called integers. When you use arithmetic on integers, the result is a whole number without the remainder, or everything after the decimal. This can cause some confusing results if you don't know what to expect. Turn at least one of the integers into a float, or numbers with decimals, to get a result with decimals.
In Python, if at least one number in a calculation is a float, the outcome will be a float.
Besides being delayed, some flights were cancelled. Those flights had a delay of "0", because they never left. What percentage of the flights in this dataset were cancelled? A percentage, by definition, falls between 0 and 1, which means it's probably not an
data['cancelled'].value_counts() # count the (0, 1) values
0.0 196873 1.0 4791 Name: cancelled, dtype: int64
not_delayed, delayed = data['cancelled'].value_counts() print delayed / (delayed + not_delayed), '<- without conversion' print float(delayed) / (delayed + not_delayed), '<- _with_ conversion!'
0 <- without conversion 0.02375733894 <- _with_ conversion!
Across all flights, about 2.38% were cancelled.
Python will also infer that a number is a float if it contains a decimal, for example:
If half of the flights were delayed, were delays shorter or longer on some airlines as opposed to others? To compare delays across airlines, we need to group the records of airlines together. Boolean indexing won't work for this—it can only separate the data into two categories: one that is true, and one that is false (or, in this case, one that is delayed and one that is not delayed). What we need here is two categories (delayed and not delayed) for each airline.
.groupby() function allows us to group records into buckets by categorical values, such as carrier, origin, and destination in this dataset.
Since you already have a column in your data for the
unique_carrier, and you created a column to indicate whether a flight is
delayed, you can simply pass those arguments into the
groupby() function. This will create a segment for each unique combination of
delayed. In other words, it will create exactly the type of grouping described in the previous two paragraphs:
group_by_carrier = data.groupby(['unique_carrier','delayed'])
groupby() as splitting the dataset data into buckets by carrier (
‘unique_carrier’), and then splitting the records inside each carrier bucket into delayed or not delayed (
‘delayed’). The result is assigned to the
What happens next gets tricky. If you just look at the
group_by_carrier variable, you'll see that it is a
DataFrameGroupBy object. You can think of that as instructions on how to group, but without instructions on how to display values:
<pandas.core.groupby.DataFrameGroupBy object at 0x10c6df610>
You need to provide instructions on what values to display. For example:
You're grouping all of the rows that share the same carrier, as well as all the rows that share the same value for
delayed. You need to tell the function what to do with the other values. You could do any number of things:
- average the number of minutes of delay
- sum the total number of minutes of delay
You've already started down the path of simply determining the proportion of flights that are delayed or not, so you might as well finish the problem. Instead of averaging or summing, use
.size() to count the number of rows in each grouping:
unique_carrier delayed AA False 8912 True 9841 AS False 3527 True 2104 B6 False 4832 True 4401 DL False 17719 True 9803 EV False 10596 True 11371 F9 False 1103 True 1848 HA False 1351 True 1354 MQ False 4692 True 8060 NK False 1550 True 2133 OO False 9977 True 10804 UA False 7885 True 8624 US False 7850 True 6353 VX False 1254 True 781 WN False 21789 True 21150 dtype: int64
That's exactly what you're looking for! It's a little hard to read, though. Pandas has a handy
.unstack() method—use it to convert the results into a more readable format and store that as a new variable,
count_delays_by_carrier = group_by_carrier.size().unstack() count_delays_by_carrier
.groupby() is a tough but powerful concept to master, and a common one in analytics especially. For example, a marketing analyst looking at inbound website visits might want to group data by channel, separating out direct email, search, promotional content, advertising, referrals, organic visits, and other ways people found the site. The analyst might also want to examine retention rates among certain groups of people (known as cohorts) or how people who first visited the site around the same time behaved. Learn more about retention analysis among cohorts in this blog post.
See Wes McKinney's blog post on groupby for more examples and explanation.
This is very similar to the GROUP BY clause in SQL, but with one key difference:
Retain data after aggregating: By using
.groupby(), we retain the original data after we've grouped everything. You can still access the original dataset using the
data variable, but you can also access the grouped dataset using the new
group_by_carrier. When using SQL, you cannot directly access both the grouped/aggregated dataset and the original dataset (technically you can, but it would not be straightforward).
Nevertheless, here’s how the above grouping would work in SQL, using
SELECT unique_carrier, COUNT(CASE WHEN arr_delay <= 0 OR arr_delay IS NULL THEN 'not_delayed' END) AS not_delayed, COUNT(CASE WHEN arr_delay > 0 THEN 'delayed' END) AS delayed FROM tutorial.us_flights GROUP BY unique_carrier
Though Southwest (WN) had more delays than any other airline, all the airlines had proportionally similar rates of delayed flights. You can see this by plotting the delayed and non-delayed flights.
Use a new parameter in
.plot() to stack the values vertically (instead of allowing them to overlap) called
count_delays_by_carrier.plot(kind='barh', stacked=True, figsize=[16,6], colormap='winter')
If you need a refresher on making bar charts with Pandas, check out this earlier lesson.
You can customize plots a number of ways. Here's a quick guide to common parameters:
figsize=: (width, height) in inches
colormap=: a long list of color palettes, including: '
title=: a string
stacked=: stack the values vertically (instead of allowing them to overlap)
Here's the full list of plot parameters for DataFrames.
How many flights were delayed longer than 20 minutes?
Bonus Question: What proportion of delayed flights does this represent?View Solution
A pivot table is composed of counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you would choose the rows and columns to aggregate on, and the values for those rows and columns. It allows us to summarize data as grouped by different values, including values in categorical columns.
You can define how values are grouped by:
index=("Rows" in Excel)
We define which values are summarized by:
values=the name of the column of values to be aggregated in the ultimate table, then grouped by the Index and Columns and aggregated according to the Aggregation Function
We define how values are summarized by:
aggfunc=(Aggregation Function) how rows are summarized, such as
Let's create a
.pivot_table() of the number of flights each carrier flew on each day:
flights_by_carrier = data.pivot_table(index='flight_date', columns='unique_carrier', values='flight_num', aggfunc='count') flights_by_carrier.head()
In this table, you can see the
count of flights (
flight_num) flown by each
unique_carrier on each
flight_date. This is extremely powerful, because you don't have to write a separate function for each carrier—this one function handles counts for all of them.
.pivot_table() does not necessarily need all four arguments, because it has some smart defaults. If we pivot on one column, it will default to using all other numeric columns as the index (rows) and take the average of the values. For example, if we want to pivot and summarize on
In the table above, we get the average of values by day, across all numberic columns. That was quick!
Several columns in the dataset indicate the reasons for the flight delay. You now know that about half of flights had delays—what were the most common reasons? Was there a lot of snow in January? Did the planes freeze up?
To find out, you can pivot on the date and type of delay,
delays_list, summing the number of minutes of each type of delay:
delays_list = ['carrier_delay','weather_delay','late_aircraft_delay','nas_delay','security_delay'] flight_delays_by_day = data.pivot_table(index='flight_date', values=delays_list, aggfunc='sum') flight_delays_by_day
The results in this table are the sum of minutes delayed, by type of delay, by day. How do each of the flight delays contribute to overall delay each day?
Let's build an area chart, or a stacked accumulation of counts, to illustrate the relative contribution of the delays. You can pass the arguments
stacked=True to create the stacked area chart,
colormap='autumn' to give it vibrant color, and
figsize=[16,6] to make it bigger:
flight_delays_by_day.plot(kind='area', figsize=[16,6], stacked=True, colormap='autumn') # area plot
It looks like late aircraft caused a large number of the delays on the 4th and the 12th of January. One hypothesis is that snow kept planes grounded and unable to continue their routes. In the next lesson, we'll dig into which airports contributed most heavily to delays.
Which airlines contributed most to the sum total minutes of delay? Pivot the daily sum of delay minutes by airline.
Bonus Points: Plot the delays as a stacked bar chart.View Solution
That was a ton of new material! In this Python lesson, you learned about:
- Sampling and sorting data with
- Lambda functions
- Grouping data by columns with
- Plotting grouped data
- Grouping and aggregate data with
In the next lesson, you'll learn about data distributions, binning, and box plots.
Python Histograms, Box Plots, & Distributions