mode helix
NEWIntroducing Helix—the first instant, responsive data engine.Learn more

Python Histograms, Box Plots, & Distributions

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

In this lesson, you'll learn how to:

  • Discern the distribution of a dataset
  • Describe the shape of the data with basic statistics
  • Make histograms
  • Compare distributions with histograms
  • Make box plots

Using flight data, you'll learn how to better compare trends among airlines, adjusting your analysis based on how many flights an airline flies. By the end, you'll know which airlines and airports are more or less reliable—and maybe even make it to Thanksgiving on time this year!

Loading data into Mode Python notebooks

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you'll be using records of United States domestic flights from the US Department of Transportation. To access the data, you’ll need to use a bit of SQL. Here’s how:

  1. Log into Mode or create an account.
  2. Navigate to this report and click Clone. This will take you to the SQL Query Editor, with a query and results pre-populated.
  3. Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
  4. The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets[0] is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Getting oriented with the data

This lesson uses data about US flight delays. For a data dictionary with more information, click here.

After following the steps above, go to your notebook and import NumPy and Pandas, then assign your DataFrame to the data variable so it's easy to keep track of:

Input

import pandas as pd
import numpy as np
    
Input

data = datasets[0] # assign SQL query results to the data variable
data = data.fillna(np.nan)
    
Input

data.head() # our flight data, including delay minute counts by type, and total delay upon arrival
    
Output
flight_date unique_carrier flight_num origin dest arr_delay cancelled distance carrier_delay weather_delay late_aircraft_delay nas_delay security_delay actual_elapsed_time
0 2015-01-02 AA 1 JFK LAX -19.0 0.0 2475.0 NaN NaN NaN NaN NaN 381.0
1 2015-01-03 AA 1 JFK LAX -39.0 0.0 2475.0 NaN NaN NaN NaN NaN 358.0
2 2015-01-04 AA 1 JFK LAX -12.0 0.0 2475.0 NaN NaN NaN NaN NaN 385.0
3 2015-01-05 AA 1 JFK LAX -8.0 0.0 2475.0 NaN NaN NaN NaN NaN 389.0
4 2015-01-06 AA 1 JFK LAX 25.0 0.0 2475.0 0.0 0.0 0.0 25.0 0.0 424.0

Distributions

Flying can be a frustrating and long endeavor in the U.S., especially around holidays and hurricane season when flights are delayed hours, sometimes days. As consumers, we're generally aware of those trends, but what about the day-to-day trends of air travel delays? Which airlines are doing better at getting you there on time, and which airports are the worst to fly out of?

One of the best ways to answer questions like these is to look at the distributions of the relevant variables. You can think of the distribution of a dataset or variable as a list of possible values, and some indication as to how frequently each value occurs. For a quick refresher on distributions, check out this lesson.

Before you look at the distributions of delays across airlines, start by exploring which airlines have the most delays.

First, build the series indicating whether or not flights are delayed, just as you did in the previous lesson:

Input

data['delayed'] = data['arr_delay'].apply(lambda x: x > 0) #from previous lesson
    

Now count the number of delayed flights for each airline. Since you're only after one value for each airline, you don't have to use .groupby(), .size(), and .unstack() as in the previous lesson. Instead, just filter the dataset, the count up the rows for each carries using .value_counts():

Input

delayed_flights = data[data['delayed'] == True] #filter to only rows where delayer == True
delayed_flights['unique_carrier'].value_counts() #count the number of rows for each carrier
    
Output

WN    21150
EV    11371
OO    10804
AA     9841
DL     9803
UA     8624
MQ     8060
US     6353
B6     4401
NK     2133
AS     2104
F9     1848
HA     1354
VX      781
Name: unique_carrier, dtype: int64
    

It might be a good idea to look at the proportion of each airline's flights that were delayed, rather than just the total number of each airlines delayed flights.

ABC of Analysis: Always Be Curious: A great analyst will always be skeptical and curious. When you reach a point in your analysis where you might have an answer, go a little further. Consider what might be affecting your results and what might support a counterargument. The best analysis will be presented as a very refined version of the broader investigation. Always be curious!

Proportion of flights delayed

To calculate the proportion of flights that were delayed, complete these 4 steps:

1. Group by carrier and delayed

Group flights by unique_carrier and delayed, get the count with .size() (as you did in the previous lesson, use .unstack() to return a DataFrame:

Input

data.groupby(['unique_carrier','delayed']).size().unstack()
    
Output
delayed False True
unique_carrier
AA 8912 9841
AS 3527 2104
B6 4832 4401
DL 17719 9803
EV 10596 11371
F9 1103 1848
HA 1351 1354
MQ 4692 8060
NK 1550 2133
OO 9977 10804
UA 7885 8624
US 7850 6353
VX 1254 781
WN 21789 21150

Since you're going to use this for further analysis, it will be helpful to use .reset_index() to clean up the index. While you're at it, assign the new DataFrame to a variable.

Input

delayed_by_carrier = data.groupby(['unique_carrier','delayed']).size().unstack().reset_index()
delayed_by_carrier[:5]
    
Output
delayed unique_carrier False True
0 AA 8912 9841
1 AS 3527 2104
2 B6 4832 4401
3 DL 17719 9803
4 EV 10596 11371

2. Create total flight count column

Create a simple derived column that sums delayed and on-time flights for each carrier. If you want a refreshed on derived columns, return to this lesson:

Input

delayed_by_carrier['flights_count'] = (delayed_by_carrier[False] + delayed_by_carrier[True])
delayed_by_carrier[:5]
    
Output
delayed unique_carrier False True flights_count
0 AA 8912 9841 18753
1 AS 3527 2104 5631
2 B6 4832 4401 9233
3 DL 17719 9803 27522
4 EV 10596 11371 21967

3. Calculate the proportion of total flights delayed

Divide the number of delayed flights by the total number of flights to get the proportion of delayed flights. Store this as another derived column:

Input

delayed_by_carrier['proportion_delayed'] = delayed_by_carrier[True] / delayed_by_carrier['flights_count']
delayed_by_carrier[:4]
    
Output
delayed unique_carrier False True flights_count proportion_delayed
0 AA 8912 9841 18753 0.524769
1 AS 3527 2104 5631 0.373646
2 B6 4832 4401 9233 0.476660
3 DL 17719 9803 27522 0.356188

4. Sort by the proportion of flights delayed

Finally, output the whole table, sorted by proportion of flights delayed by carrier:

Input

delayed_by_carrier.sort_values('proportion_delayed', ascending=False)
    
Output
delayed unique_carrier False True flights_count proportion_delayed
7 MQ 4692 8060 12752 0.632058
5 F9 1103 1848 2951 0.626228
8 NK 1550 2133 3683 0.579147
0 AA 8912 9841 18753 0.524769
10 UA 7885 8624 16509 0.522382
9 OO 9977 10804 20781 0.519898
4 EV 10596 11371 21967 0.517640
6 HA 1351 1354 2705 0.500555
13 WN 21789 21150 42939 0.492559
2 B6 4832 4401 9233 0.476660
11 US 7850 6353 14203 0.447300
12 VX 1254 781 2035 0.383784
1 AS 3527 2104 5631 0.373646
3 DL 17719 9803 27522 0.356188

This paints a very different picture - Southwest is squarely in the middle of the pack when it comes to the percentage of flights delayed.

Basic statistics: mean, median, percentiles

But what about the length of time passengers are delayed on each airline? It's plausible that the proportion of flight delays may be lower for some airlines, but the length of those delays may be much longer. Passengers care more about longer flight delays than shorter ones, and they might choose a different airline given a record of long delays. For passengers, which airlines seem more reliable? Or in more measured terms, How many minute are flights delayed on average, for each airline?

Mean

The mean, or the average, gives you a general idea of how many minutes flights were delayed for each airline. .pivot_table() calculates the mean of the aggregated values by default. You can pivot on the column unique_carrier to see the mean delay time aggregated by airline:

Input

data.pivot_table(columns='unique_carrier', values='arr_delay').sort_values(ascending=False)
    
Output

unique_carrier
MQ    35.627406
F9    28.836953
NK    22.779670
OO    19.031663
EV    18.358520
UA    16.094772
AA    15.616299
B6    13.576129
WN    11.273536
US     7.671557
HA     6.458937
DL     4.118949
VX     3.833908
AS     1.731951
Name: arr_delay, dtype: float64
    

Note that because .pivot_table calculated mean by default, the above is effectively the same as if you were to explicitly pass the argument like this (and it produces the exact same results):

Input

data.pivot_table(columns='unique_carrier', values='arr_delay', aggfunc=np.mean).sort_values(ascending=False)
    
Output

unique_carrier
MQ    35.627406
F9    28.836953
NK    22.779670
OO    19.031663
EV    18.358520
UA    16.094772
AA    15.616299
B6    13.576129
WN    11.273536
US     7.671557
HA     6.458937
DL     4.118949
VX     3.833908
AS     1.731951
Name: arr_delay, dtype: float64
    

Basic statistics with .describe()

.describe() is a handy function when you’re working with numeric columns. You can use .describe() to see a number of basic statistics about the column, such as the mean, min, max, and standard deviation. This can give you a quick overview of the shape of the data.

Before using describe(), select the arr_delay series for all Southwest flights:

Input

southwest = data[data['unique_carrier'] == 'WN']['arr_delay']
    
Output

62030    18.0
62031    11.0
62032     9.0
62033    44.0
62034    42.0
Name: arr_delay, dtype: float64
    

In case that was confusing, here's what just happened, evaluated inside-out, then left to right:

  1. data['unique_carrier'] == 'WN'] creates a boolean index that returns True for rows that represent Southwest flights
  2. Wrapping that in data[...] applies the boolean index to the DataFrame data.
  3. ['arr_delay'] reduces the columns to just the ['arr_delay'] column (and the index).

You can now run .describe() on this new object you've created to get basic statistics:

Input

southwest.describe()
    
Output

count    42020.000000
mean        11.273536
std         36.438970
min        -55.000000
25%         -9.000000
50%          1.000000
75%         19.000000
max        535.000000
Name: arr_delay, dtype: float64
    

This is a feature that really sets python apart from SQL or Excel. It would take a lot of work to get this information in either of those tools, but here it is as easy as adding the .describe() method.

Here's a quick breakdown of the above as it relates to this particular dataset:

  • count: there are 42,020 rows in the dataset, which is filtered to only show Southwest (WN).
  • mean: the average delay.
  • std: the standard deviation. More on this below.
  • min: the shortest delay in the dataset. In this case, the flight was very early.
  • 25%: the 25th percentile. 25% of delays were lower than -9.00.
  • 50%: the 50th percentile, or the median. 50% of delays were lower than 1.00.
  • 75%: the 75th percentile. 75% of delays were lower than 19.00.
  • max: the longest delay in the dataset: 535.00.
Get the basic statistics for arrival delays for flights originating in Chicago. Use the .describe() method.

Histograms

If all of Southwest's flights are delayed five minutes, but American Airlines' flights are sometimes delayed a full day, you might say that the uncertainty of American Airline's delays are higher. It's also possible that when the flights aren't late, they're early, and it’s arguable that that's a point in favor of the airline. There’s a lot of nuance to this problem that would be best explored visually.

You can visually represent the distribution of flight delays using a histogram. Histograms allow you to bucket the values into bins, or fixed value ranges, and count how many values fall in that bin.

Let's look at a small example first. Say you have two bins:

A = [0:10]
B = [10:20]

which represent fixed ranges of 0 to 10 and 10 to 20, respectively.

And we have the data:

[1, 2, 5, 9, 12, 20]

Therefore, our bins contain:

A = [1, 2, 5, 9]
B = [12, 20]

So the histogram would look like this:

Input

pd.Series([1, 2, 5, 9, 12, 20]).plot(kind='hist', bins=[0,10,20]) # `bins` defines the start and end points of bins
    
Output

      
    

Something very important to note is that histograms are not bar charts. In a bar chart, the height of the bar represents a numerical value (such as number of delayed flights), but each bar itself represents a category—something that cannot be counted, averaged, or summed (like airline).

In a histogram, the height of the bars represents some numerical value, just like a bar chart. The bars themselves, however, cannot be categorical—each bar is a group defined by a quantitative variable (like delay time for a flight).

One way to quickly tell the difference is that histograms do not have space between the bars. Because they represent values on a continuous spectrum, a space between bars would have a specific meaning: it would indicate a bin that is empty. Bar charts can have spaces between bars. Because they show categories, the order and spacing is not as important.

Check out this blog post by Flowing Data for a great explanation of histograms.

numpy .arange()

Say you’re interested in analyzing length of delays and you want to put these lengths into bins that represent every 10 minute period. You can use the numpy method .arange() to create a list of numbers that define those bins. The bins of ten minute intervals will range from 50 minutes early (-50) to 200 minutes late (200). The first bin will hold a count of flights that arrived between 50 and 40 minutes early, then 40 and 30 minutes, and so on.

Input

bin_values = np.arange(start=-50, stop=200, step=10)
print bin_values
    
Output

[-50 -40 -30 -20 -10   0  10  20  30  40  50  60  70  80  90 100 110 120
 130 140 150 160 170 180 190]
    

Let's plot the distribution of Southwest flight delays as a histogram using the bin_values variable created above:

Input

wn_carrier = data[data['unique_carrier'] == 'WN']
wn_carrier['arr_delay'].hist(bins=bin_values, figsize=[14,6])
    
Output

      
    

Great! You can see that the vast majority of Southwest flights weren't more than 30 minutes late. Though the airline has many flights, the majority of them aren't late enough to make you regret going on vacation.

Plot Virgin America's flight delays at five-minute intervals from -50 minutes to 200 minutes.
Let's dig into the Virgin America flights that were unexpectedly delayed. Which flights were delayed between 20-25 minutes? Was there a given reason? What hypotheses might you make about why there are more flights in that bin as opposed to the 15-20 minute bucket? Select the flights using boolean indexing, then count the origin airports for those flights.

Comparing distributions with histograms

Seeing one distribution is helpful to give us a shape of the data, but how about two?

Compare the distribution of two airlines with a similar number of total flights, US Airways and Envoy Air:

Input

bin_values = np.arange(start=-50, stop=200, step=10)
us_mq_airlines_index = data['unique_carrier'].isin(['US','MQ']) # create index of flights from those airlines
us_mq_airlines = data[us_mq_airlines_index] # select rows
group_carriers = us_mq_airlines.groupby('unique_carrier')['arr_delay'] # group values by carrier, select minutes delayed
group_carriers.plot(kind='hist', bins=bin_values, figsize=[12,6], alpha=.4, legend=True) # alpha for transparency
    
Output

unique_carrier
MQ    Axes(0.125,0.125;0.775x0.775)
US    Axes(0.125,0.125;0.775x0.775)
Name: arr_delay, dtype: object
      
    

The two distributions look similar, but not the same (the third color is where they overlap). You can use .describe() to see key statistics about the carriers:

Input

group_carriers.describe()
    
Output

unique_carrier       
MQ              count    11275.000000
                mean        35.627406
                std         58.444090
                min        -51.000000
                25%         -2.000000
                50%         17.000000
                75%         53.000000
                max        788.000000
US              count    13972.000000
                mean         7.671557
                std         34.672795
                min        -59.000000
                25%        -10.000000
                50%         -1.000000
                75%         13.000000
                max        621.000000
dtype: float64
    

Looks like delays on Envoy Airlines are more distributed than delays on US Airways, meaning that the values are more spread out.

Standard deviation

One of the measures you see above is std, standard deviation, which describes how flight delays are dispersed. In comparing the histograms, you can see that US Airways' delays are most concentrated between -20 to 20 minutes, while Envoy Air's flight delays are more distributed from 0 to 200 minutes. You can say that Envoy Air's delays are more dispersed than US Airways' delays, meaning that for a given flight on either airline, you would be less certain about the length of the delay for the Envoy flight.

The standard deviation measure is based on the statistical concept of normal distribution, or a common expected shape of distribution among various types of data. The value for standard deviation defines a range above and below the mean for which a certain percentage of the data lie. You can see in this visualization that, for a normal distribution:

  • 34.1% of records fall between the mean and one standard deviation higher.
  • 34.1% of records fall between the mean and one standard deviation lower.

So you might say that 68.2% of data falls within one standard deviation of the mean.

Image from Wikipedia.

It's easiest to understand this by taking a look at just the standard deviations from the output above:

Input

print 'Envoy Air', group_carriers.describe()['MQ']['std'], 'minutes from the mean', group_carriers.describe()['MQ']['mean']
print 'US Airways', group_carriers.describe()['US']['std'], 'minutes from the mean', group_carriers.describe()['US']['mean']
    
Output

Envoy Air 58.4440903966 minutes from the mean 35.627405765
US Airways 34.6727946615 minutes from the mean 7.67155740052
    

You can see that Envoy Air has a mean of about 35 minutes and a standard deviation of about 58 minutes. This means that 62.8% of the Envoy Air flights are between 23 minutes early (35 - 58 = -23) and 93 minutes late (35 + 58 = 93).

By contrast, US Airways has a lower mean, indicating that its flights, on average, are less delayed. US Airways also has a lower standard deviation, meaning delays are typically closer to the mean. More specifically, 62.8% of US Airways flights fall between 27 minutes early (7 - 34 = -27) and 41 minutes late (7 + 34 = 41).

Based on this information, you might say that US Airways delays have lower dispersion than Envoy Air.

Where not to vacation in the middle of January

Let's look at another very real dimension of this data—flight delays segmented by airport. Airports suffer from various challenges, such as weather, runway design, and congestion that impact on-time flight departures, so let's find out how they stack up. We don't want to get stuck in the mountains over New Years (or do we?).

To make things easier, you can look at the 20 highest volume airports by origin:

Input

hi_volume = data['origin'].value_counts()[:20]
hi_volume
    
Output

ATL    12678
ORD    10046
DFW     9854
DEN     7441
LAX     7434
IAH     5762
PHX     5610
SFO     5551
LAS     4902
MCO     4318
LGA     4075
DTW     4048
CLT     3959
MSP     3789
EWR     3754
SLC     3740
BOS     3738
SEA     3639
JFK     3609
FLL     3052
Name: origin, dtype: int64
    
Input

hi_volume_airports_names = hi_volume.index.tolist()
print hi_volume_airports_names
    
Output

['ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'IAH', 'PHX', 'SFO', 'LAS', 'MCO', 'LGA', 'DTW', 'CLT', 'MSP', 'EWR', 'SLC', 'BOS', 'SEA', 'JFK', 'FLL']
    

Filtering a list of values with .isin()

To create a boolean index where you’re looking for values matching anything in a list, you must use .isin() with the desired list of matches.

You can create a boolean index filtering for the records that originated ('origin') in the 20 highest volume airports.

Input

hi_volume_airports = data[data['origin'].isin(hi_volume_airports_names)]
hi_volume_airports.head()
    
Output
flight_date unique_carrier flight_num origin dest arr_delay cancelled distance carrier_delay weather_delay late_aircraft_delay nas_delay security_delay actual_elapsed_time delayed
0 2015-01-02 AA 1 JFK LAX -19.0 0.0 2475.0 NaN NaN NaN NaN NaN 381.0 False
1 2015-01-03 AA 1 JFK LAX -39.0 0.0 2475.0 NaN NaN NaN NaN NaN 358.0 False
2 2015-01-04 AA 1 JFK LAX -12.0 0.0 2475.0 NaN NaN NaN NaN NaN 385.0 False
3 2015-01-05 AA 1 JFK LAX -8.0 0.0 2475.0 NaN NaN NaN NaN NaN 389.0 False
4 2015-01-06 AA 1 JFK LAX 25.0 0.0 2475.0 0.0 0.0 0.0 25.0 0.0 424.0 True

For comparison, the above use of .isin() in Python is equivalent to the following SQL query:

SELECT * FROM tutorial.flights
  WHERE origin IN ('ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'IAH', 'PHX', 'SFO', 'LAS', 'MCO', 'LGA', 'DTW', 'CLT', 'MSP', 'EWR', 'SLC', 'BOS', 'SEA', 'JFK', 'FLL');

Box plots

Now that you have DataFrame of flights originating from high volume airports, you might ask: where did the longest flight delays originate in January 2015?

You can create a pivot table that pivots the flight date on the airport, where the values are the mean of the flight delays for that day.

Input

hi_volume_airports_pivots = hi_volume_airports.pivot_table(index='flight_date', columns='origin', values='arr_delay')
hi_volume_airports_pivots
    
Output
origin ATL BOS CLT DEN DFW DTW EWR FLL IAH JFK LAS LAX LGA MCO MSP ORD PHX SEA SFO SLC
flight_date
2015-01-02 3.327536 3.590580 0.509317 20.526899 36.049598 -6.842809 8.316993 -0.543307 12.156187 3.688742 13.709512 16.500000 -2.947712 5.834734 1.193333 4.590062 16.547325 12.254717 8.371429 4.534161
2015-01-03 15.428112 30.471616 13.768340 51.186292 37.604138 22.738007 37.370229 15.666667 39.844037 31.882979 18.550685 26.117338 15.606426 17.511364 20.027586 37.995702 19.783843 13.771812 11.773364 13.465190
2015-01-04 21.423343 26.867857 23.325077 52.495238 38.360104 35.771626 53.617978 25.293651 20.464286 55.445578 19.564767 28.159016 32.450704 39.847025 19.461279 83.225619 20.180085 10.291262 19.251092 15.503125
2015-01-05 3.095000 11.208609 6.051672 29.899200 28.705263 24.696594 22.674051 13.711864 8.450505 19.554422 17.229381 15.788618 34.984177 14.929204 23.874564 63.916667 13.665217 5.418060 13.225806 2.003356
2015-01-06 6.361725 43.310580 13.294964 15.344029 11.534626 35.078616 43.104530 23.425926 3.622642 43.359073 13.330579 7.234004 61.165049 29.996785 9.435088 42.356183 12.156658 4.372180 8.582716 0.581481
2015-01-07 0.944276 10.651316 4.869565 33.301095 10.428762 13.403727 22.030508 11.254464 10.490476 15.536680 7.498652 5.442446 46.063973 8.977918 -1.666667 38.479361 7.348028 9.467925 5.289216 2.977941
2015-01-08 3.033099 6.807692 10.484568 14.569873 11.217450 20.593060 15.419463 2.558442 1.571121 2.749091 8.597911 6.171329 3.575221 9.152648 47.264605 96.695578 8.000000 8.738351 5.141487 12.619718
2015-01-09 1.833499 21.045603 5.742331 21.551237 8.591810 34.665653 22.632107 1.808696 7.611354 43.294964 4.487245 8.144112 42.325581 8.758410 6.834459 46.355837 2.160550 7.464029 9.425178 3.878893
2015-01-10 -5.473046 3.763547 -1.658915 2.822014 5.501582 2.584906 0.422680 -5.172269 0.937888 1.259259 2.564706 2.709746 -11.311475 0.273273 8.542857 16.635209 2.213483 -2.761506 0.621622 2.718894
2015-01-11 -2.118085 -2.569767 5.789286 16.045977 19.767313 5.808725 -1.670543 -3.008734 17.064904 -2.964158 40.793103 24.195531 -7.576923 -2.242991 2.264493 22.578704 11.557143 6.381132 27.650633 5.946043
2015-01-12 42.375375 8.254777 14.975524 22.791444 19.114820 24.692771 8.219780 8.960699 22.710526 4.297101 12.710526 10.982175 16.641509 21.563863 1.274510 31.676056 5.371230 7.318519 27.918719 7.051546
2015-01-13 2.812957 -9.384106 0.086505 9.789279 7.248656 -2.710692 -2.901024 -7.118721 1.415274 -13.214559 -2.937853 -1.553506 -0.883234 -1.462295 -5.660959 23.323259 2.083990 3.267176 11.153652 0.528090
2015-01-14 -1.400000 -3.091216 -1.681250 -0.638838 2.690160 -1.903727 -5.456446 3.360360 -0.530120 -14.911877 -3.695418 -2.958559 0.002994 1.885350 -7.691030 2.735369 -1.161593 -1.134831 1.324455 -5.717949

You can use .describe() to see the mean and dispersion for each airport:

Input

hi_volume_airports_pivots.describe()
    
Output
origin ATL BOS CLT DEN DFW DTW EWR FLL IAH JFK LAS LAX LGA MCO MSP ORD PHX SEA SFO SLC
count 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000
mean 7.049522 11.609776 7.350537 22.283364 18.216483 16.044343 17.213870 6.938287 11.216083 14.613638 11.723369 11.302481 17.699715 11.925022 9.627240 39.274123 9.223535 6.526833 11.517644 5.083884
std 12.798122 15.004838 7.499172 16.171575 12.854437 15.286101 18.718574 10.452380 11.488504 22.619487 11.574100 10.193057 23.428830 12.647029 14.971524 28.195169 7.051518 4.795902 8.742399 5.910367
min -5.473046 -9.384106 -1.681250 -0.638838 2.690160 -6.842809 -5.456446 -7.118721 -0.530120 -14.911877 -3.695418 -2.958559 -11.311475 -2.242991 -7.691030 2.735369 -1.161593 -2.761506 0.621622 -5.717949
25% 0.944276 3.590580 0.509317 14.569873 8.591810 2.584906 0.422680 -0.543307 1.571121 1.259259 4.487245 5.442446 -0.883234 1.885350 1.193333 22.578704 2.213483 4.372180 5.289216 2.003356
50% 3.033099 8.254777 5.789286 20.526899 11.534626 20.593060 15.419463 3.360360 8.450505 4.297101 12.710526 8.144112 15.606426 8.977918 6.834459 37.995702 8.000000 7.318519 9.425178 3.878893
75% 6.361725 21.045603 13.294964 29.899200 28.705263 24.696594 22.674051 13.711864 17.064904 31.882979 17.229381 16.500000 34.984177 17.511364 19.461279 46.355837 13.665217 9.467925 13.225806 7.051546
max 42.375375 43.310580 23.325077 52.495238 38.360104 35.771626 53.617978 25.293651 39.844037 55.445578 40.793103 28.159016 61.165049 39.847025 47.264605 96.695578 20.180085 13.771812 27.918719 15.503125

It's tough to compare airports simply by looking at a big table of numbers. This will be easier if you group the records for each airport and overlay them, as you did with Envoy Air and US Airways:

Input

airport_bins = numpy.arange(-10,100,3)
hi_volume_airports_pivots.plot(kind='hist', bins=airport_bins, figsize=[12,6], alpha=.4, legend=True)
    
Output

      
    

Well, that's also tough to read. The method of overlaying distributions has limits, at least when you want to compare many distributions at once. Luckily, there's a one-dimensional way of visualizing the shape of distributions called a box plot.

Box plots are composed of the same key measures of dispersion that you get when you run .describe(), allowing it to be displayed in one dimension and easily comparable with other distributions. The components of box plots are:

Stephen Few, Information Dashboard DesignInformation Dashboard Design, Stephen Few

Try using box plots to compare the day-to-day distribution of delays at each airport:

Input

hi_volume_airports_pivots = hi_volume_airports.pivot_table(index='flight_date', columns='origin', values='arr_delay')
hi_volume_airports_pivots.plot(kind='box', figsize=[16,8])
    
Output

      
    

As you can see, it's visually simpler to compare many distributions with box plots. Airports like JFK had significant dispersion of delays, while LGA was evenly distributed around the most frequent average delay. ORD, however, was almost twice as delayed all the time, compared to every other high volume airport. Expect snow delays in Chicago in January!

Leading into this analysis, we posed a few key questions. To answer the last, Which airports are the worst to fly out of?, you can now say that you will (almost certainly) be delayed if you are flying out of Chicago in January, based on 2015 data. If you can help it, avoid connecting flights in Chicago.

Visualize the average arrival delay by date and carrier using box plots.

Lesson summary

As you've seen in this lesson, placing data in contrast enables you to better understand it. While it’s clear that all airlines and airports suffered from some delay, you can use statistics to quickly draw out abnormal trends and occurrences across the data. The deviation of data from a trend is often clearly revealed in visualization, which allows yous to visually identify unusual events, and dig deeper.

In this lesson, you learned how to:

  • Discern the distribution of a dataset
  • Describe the shape of the data with basic statistics
  • Make histograms
  • Compare distributions with histograms
  • Make box plots

Next Lesson

Resources for Further Learning

Looks like you've got a thing for cutting-edge data news.

Get the latest.

Get started