Python Tutorial
Learn Python for business analysis using real-world data. No coding experience necessary.
Start Now
Mode Studio
The Collaborative Data Science Platform
Creating Pandas DataFrames & Selecting Data
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
Goals of this lesson
In this lesson, you'll learn how to create and use a DataFrame, a Python data structure that is similar to a database or spreadsheet table. You'll learn how to:
- Describe a pandas DataFrame
- Create a pandas DataFrame with data
- Select columns in a
DataFrame
- Select rows in a
DataFrame
- Select both columns and rows in a
DataFrame
The Python data analysis tools that you'll learn throughout this tutorial are very useful, but they become immensely valuable when they are applied to real data (and real problems). In this lesson, you'll be using tools from pandas, one of the go-to libraries for data manipulation, to conduct analysis of web traffic, which can help drive valuable decisions for a business.
Pandas DataFrames
Pandas has a few powerful data structures:
- A table with multiple columns is a DataFrame.
- A column of a DataFrame, or a list-like object, is a Series.
A DataFrame is a table much like in SQL or Excel. It's similar in structure, too, making it possible to use similar operations such as aggregation, filtering, and pivoting. However, because DataFrames are built in Python, it's possible to use Python to program more advanced operations and manipulations than SQL and Excel can offer. As a bonus, the creators of pandas have focused on making the DataFrame operate very quickly, even over large datasets.
DataFrames are particularly useful because powerful methods are built into them. In Python, methods are associated with objects, so you need your data to be in the DataFrame to use these methods. DataFrames can load data through a number of different data structures and files, including lists and dictionaries, csv files, excel files, and database records (more on that here).
Loading data into a Mode Python Notebook
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets[0].head(n=5)
. Run this code so you can see the first five rows of the dataset.
datasets
is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Creating a Pandas DataFrame
Prepping a DataFrame
In Mode Python Notebooks, the first cell is automatically populated with the following code to access the data produced by the SQL query:
datasets[0].head(n=5)
The datasets
object is a list, where each item is a DataFrame corresponding to one of the SQL queries in the Mode report. So datasets[0]
is a dataframe object within the datasets
list. You can see that the above command produces a table showing the first 5 rows of the results of your SQL query.
Mode is able to do this because it comes pre-loaded with pandas. Still, you should get in the habit of giving librarias aliases so that you can refer to them easily throughout your code. Pandas is typically aliased as pd
:
import pandas as pd
You should also assign the DataFrame as a variable. Since you'll only be working with one DataFrame in this lesson, you can keep it simple and just call it data
:
data = datasets[0] # assign SQL query results to the data variable
One final step before we're ready to start analysis: text cleanup. There are a few missing values in this dataset (in SQL you'd refer to them as null
). For the sake of making this easier to look at, use the fillna()
method to replace missing values with empty strings:
data = data.fillna('') # replace missing values with strings for easier text processing
About this dataset
As mentioned above, in this lesson you'll be working with web traffic data from a nonprofit called Watsi. Every row in this dataset corresponds to a person visiting a watsi.org page (this is known as a pageview). The general flow of pageviews is referred to as web traffic.
Every pageview (row in the dataset) is composed of:
'referrer'
The url that referred the user to the site (if available). For example, if someone arrived at the page through a Facebook link,referrer
would be https://www.facebook.com'timestamp'
The time the event occurred'title'
The title of the page the user visited on the Watsi website'url'
The url the user visited. For example, https://watsi.org/team/the-meteor-chef'user_agent'
The software the user used to accessed the site, including platform, browser, and extensions'user_id'
A unique id for each user (normally they’d be numbers—we've turned them into anonymous names instead)'referrer_domain'
The domain of the url that referred the user to the site. For example, "facebook.com"'website_section'
The section of the website visited. For example, the section of https://watsi.org/team/the-meteor-chef is "team"'platform'
The device platform the user visited from. Possible values are"Desktop"
and"Mobile"
Get context for the data
Through their website, Watsi enables direct funding of medical care. Take the time to understand what that looks like in practice. Visit some of the URLs you see in this dataset to familiarize yourself with the structure of the site and content, such as Mary's patient profile. Google Watsi and consider why people might engage with the service. Context is important - it'll help you make educated inferences in your analysis of the data.
Data sampling
This dataset contains 5,000 rows, which were sampled from a 500,000 row dataset spanning the same time period. Throughout these analyses, the number of events you count will be about 100 times smaller than they actually were, but the proportions of events will still generally be reflective of that larger dataset. In this case, a sample is fine because our purpose is to learn methods of data analysis with Python, not to create 100% accurate recommendations to Watsi.
Selecting columns in a DataFrame
As you learned in the previous lesson, you can select a value in a list or dictionary using brackets:
cities[0]
(gets item at place 0 in the list "cities")city_population['Tokyo']
(gets values associated with the key'Tokyo'
in the dictionarycity_population
)
Similarly, you can use brackets to select a column in the DataFrame:
data['url']
0 https://watsi.org/
1 https://watsi.org/team/the-meteor-chef
2 https://watsi.org/gift-cards
3 https://watsi.org/
4 https://watsi.org/
Name: url, dtype: object
Selecting the column gives you access to the whole column, but will only show a preview. Below the column, the column name and data type (dtype) are printed for easy reference.
The url
column you got back has a list of numbers on the left. This is called the index, which uniquely identifies rows in the DataFrame. You will use the index to select individual rows, similar to how you selected rows from a list in an earlier lesson. A unique identifier is often necessary to refer to specific records in the dataset. For example, the DMV uses license plates to identify specific vehicles, instead of "Blue 1999 Honda Civic in California," which may or may not uniquely identify a car.
Selecting columns will be important to much of the analysis you do throughout the tutorials, especially in grouping and counting events.
Selecting rows in a DataFrame
Selecting rows is useful for exploring the data and getting familiar with what values you might see. You can select rows by using brackets and row indexes. For example, you can select the first three rows of the DataFrame with the following code:
data[:3]
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
0 | https://www.google.com/ | 2016-02-05 00:48:23 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | CHAROLETTE S | Desktop | ||
1 | https://themeteorchef.com/snippets/making-use-... | 2016-02-24 23:12:10 | Watsi | The Meteor Chef | https://watsi.org/team/the-meteor-chef | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | WARREN Q | themeteorchef.com | team | Desktop |
2 | https://watsi.org/ | 2015-12-25 17:59:35 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... | MITCHEL O | watsi.org | gift-cards | Desktop |
The ":3" between the brackets effectively means "up to index 3". Similarly, you could select everything "from index 4 up to (but not including) index 7":
data[4:7]
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
4 | https://watsi.org/fund-treatments | 2016-02-14 19:30:08 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... | ANDREE N | watsi.org | Desktop | |
5 | http://blog.watsi.org/ | 2015-10-15 06:04:40 | Watsi | Fund a medical treatment on Watsi. 100... | https://watsi.org/fund-treatments | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | SHAREN Y | watsi.org | fund-treatments | Desktop |
6 | 2015-12-25 10:23:43 | Watsi | Redeem your Watsi Gift Card | https://watsi.org/redeem | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) G... | BRICE Z | redeem | Desktop |
Finally, you can select "everything from index 4997 onward":
data[4997:]
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
4997 | http://aplus.com/a/kid-has-flexible-neck?c=654... | 2016-01-03 02:48:38 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like ... | NOELLE P | aplus.com | iPhone | |
4998 | https://watsi.org/fund-treatments?page=2 | 2016-02-07 23:47:53 | Watsi | Success! Sarah from Kenya raised $1,12... | https://watsi.org/profile/6705ce017f7e-sarah | Mozilla/5.0 (iPad; CPU OS 9_2 like Mac OS X) A... | JERICA F | watsi.org | profile | iPad |
4999 | https://watsi.org/ | 2015-11-17 16:38:25 | Watsi | Fund a medical treatment on Watsi. 100... | https://watsi.org/fund-treatments?page=4 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5... | MARIANNA I | watsi.org | fund-treatments | Desktop |
Selecting a specific row
To select a specific row, you must use the .iloc
or .loc
method, with the row's index in brackets. Which to use depends on what you are trying to return.
.iloc
is integer index-based. If you .iloc[1]
, it will return to you the row at the 1st index regardless of the index’s name. .loc
however is label-based, meaning that .iloc[1]
will only return the row at the first index if "1" is the index’s label. In the case of this dataframe .iloc[1]
and .loc[1]
will return the same row.
data.iloc[1]
referrer https://themeteorchef.com/snippets/making-use-...
timestamp 2016-02-24 23:12:10
title Watsi | The Meteor Chef
url https://watsi.org/team/the-meteor-chef
user_agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id WARREN Q
referrer_domain themeteorchef.com
website_section team
platform Desktop
Name: 1, dtype: object
data.loc[1]
referrer https://themeteorchef.com/snippets/making-use-...
timestamp 2016-02-24 23:12:10
title Watsi | The Meteor Chef
url https://watsi.org/team/the-meteor-chef
user_agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id WARREN Q
referrer_domain themeteorchef.com
website_section team
platform Desktop
Name: 1, dtype: object
This is different from selecting columns. When selecting a column, you'll use data[]
, and when selecting a row, you'll use data.iloc[]
or data.loc[]
. To learn more about the differences between .iloc
and .loc
, check out pandas documentation.
Selecting rows and columns in a DataFrame
Just as you can select from rows or columns, you can also select from both rows and columns at the same time. For example, you can select the first three rows of the title
column by naming both the column and rows in square brackets:
data['title'][:3]
0 Watsi | Fund medical treatments for people aro...
1 Watsi | The Meteor Chef
2 Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
Think about this as listing the row and column selections one after another. Putting together a column selection and a row selection:
data['title']
data[:3]
You get the combined selection:
data['title'][:3]
The brackets selecting the column and selecting the rows are separate, and the selections are applied from left to right (in this last example, the column is selected, then it is filtered down to the first 3 rows).
In fact, selecting the rows and then the column yields the same result:
data[:3]['title']
0 Watsi | Fund medical treatments for people aro...
1 Watsi | The Meteor Chef
2 Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
Lesson summary:
In this lesson, you learned to:
- Create a pandas
DataFrame
with data - Select columns in a
DataFrame
- Select rows in a
DataFrame
- Select both columns AND rows in a
DataFrame
In the next lesson, you'll learn how to count values and plot a bar chart.
Next Lesson
Pandas .values_count() & .plot()