Python Tutorial
Learn Python for business analysis using real-world data. No coding experience necessary.
Start Now
Mode Studio
The Collaborative Data Science Platform
Filtering Data in Python with Boolean Indexes
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
Goals of this lesson
This lesson builds on the previous lesson, Counting values and basic plotting.
You'll learn how to:
- Filter and segment data using boolean indexing
- Partially match text with .str.contains()
Filtering data will allow you to select events following specific patterns, such as finding pages with high pageview counts. You'll then be able to dig deeper into the source of such traffic.
Loading data into Mode Python notebooks
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets[0].head(n=5)
. Run this code so you can see the first five rows of the dataset.
datasets[0]
is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Getting oriented with the data
This lesson uses data from Watsi. For a data dictionary with more information, click here.
Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head()
method to preview the first five rows.
import pandas as pd
data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson
data.head()
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
0 | https://www.google.com/ | 2016-02-05 00:48:23 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | CHAROLETTE S | Desktop | ||
1 | https://themeteorchef.com/snippets/making-use-... | 2016-02-24 23:12:10 | Watsi | The Meteor Chef | https://watsi.org/team/the-meteor-chef | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | WARREN Q | themeteorchef.com | team | Desktop |
2 | https://watsi.org/ | 2015-12-25 17:59:35 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... | MITCHEL O | watsi.org | gift-cards | Desktop |
3 | 2016-02-05 21:19:30 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... | MICHEL O | Desktop | |||
4 | https://watsi.org/fund-treatments | 2016-02-14 19:30:08 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... | ANDREE N | watsi.org | Desktop |
Filtering data with boolean indexing
In the previous lesson, you examined the makeup of platforms used to access Watsi’s site, pages visited, and sites that referred visitors. But what about the behavior of people who visit the site? What pages did people visit on mobile? What pages and sections of the website did referred visitors land on?
In the previous lesson, you saw that visitors most often landed on the homepage. Next you might want to know:
When people came to the homepage, where did they come from?
To answer this, you'll need to break down traffic into these subsections based on their referrers—the websites they were on before going to Watsi. This practice is called segmentation.
You'll need the records of pageviews on the homepage and the referrer domains for those specific pageviews. To select the records of pageviews that went to the homepage, you'll need to identify which rows went to the homepage based on the 'title'
column, then select them as a new DataFrame.
From a technical standpoint, you'll need to create a list indicating which rows to select—rows for which the title
is the homepage. You can use a boolean index, a Series composed of True
or False
values that correspond to rows in the dataset. The True
/False
values describe which rows you want to select, namely only the True
rows.
To create the boolean index representing pageviews of the homepage, you can compare each value of "title" to see if it's the title of the homepage, resulting in a Series of booleans:
(data['title'] == 'Watsi | Fund medical treatments for people around the world')
0 True
1 False
2 False
3 True
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 True
20 True
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 True
...
4970 False
4971 False
4972 False
4973 True
4974 True
4975 False
4976 False
4977 False
4978 False
4979 False
4980 False
4981 False
4982 False
4983 True
4984 True
4985 False
4986 False
4987 True
4988 False
4989 False
4990 False
4991 True
4992 False
4993 False
4994 False
4995 False
4996 True
4997 True
4998 False
4999 False
Name: title, dtype: bool
You'll want to assign the boolean index to a variable name: homepage_index
:
homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # copied from above
The next step is to use the boolean index to filter your data. You can do this similarly to how you select columns or rows: use the boolean index inside square brackets to select the records from the DataFrame for which the boolean index reads True
. Store the filtered dataset under a new variable name, watsi_homepage
:
watsi_homepage = data[homepage_index] # selects the "True" rows recorded in the boolean index
Now that you've got a filtered DataFrame called watsi_homepage
, take a look at where traffic came from. The fastest way to do this is to summarize the values in the referrer
column using .value_counts()
:
watsi_homepage['referrer'].value_counts()[:15]
451
https://www.google.com/ 153
https://www.google.co.in/ 53
https://watsi.org/ 22
https://www.reddit.com/ 21
https://watsi.org/fund-treatments 18
http://blog.watsi.org/ 17
https://watsi.org/about 17
https://www.google.com.au/ 16
https://www.google.ca/ 15
https://www.google.com.ph/ 15
https://www.google.co.uk/ 14
http://www.forbes.com/sites/emmajohnson/2015/02/26/best-charity-sites-to-teach-kids-about-giving/ 10
https://t.co/2chd5GTmw1 10
https://www.google.com 10
Name: referrer, dtype: int64
Links are messy
You might have noticed a few key things:
- Referral links are messy! Some have long paths, subdomains, query strings, or referral strings that make it hard to easily see that two visitors came from the same website. For example, google.com, google.co.in, and all of the other local Google domains should really be counted simply as “google” for purposes of this analysis.
- Referrals are from internal sources (watsi.org), and external sources like Google, Facebook, Reddit, and others.
- Quite a large number of referrals are missing. The top referrer is an empty one. This means that people came directly to the site, but it's impossible to be sure what actually led them to it.
Because Google is represented by so many different unique values for referrer
, it's difficult to simply say, "30% of hits came from Google searches" based on this data. Normally you'd need to clean it up a bit before you could do that.
You’ll learn about data cleanup later on in this tutorial. For now, you can instead use the column called referrer_domain
, which is just the extracted domains:
watsi_homepage['referrer_domain'].value_counts()[:10]
452
google 422
watsi.org 169
reddit.com 36
facebook.com 24
t.co 22
vessel.com 11
forbes.com 10
adzerk.net 9
thedoctorschannel.com 7
Name: referrer_domain, dtype: int64
Much easier to digest. It looks like for the Watsi homepage, direct traffic (no referrer) was a bigger contender than any other single source. Hopefully that's a positive sign of organic sharing on social media.
Practice Problem
Select all the pageviews originating from the Reddit domain, and see where traffic is landing within Watsi.
View SolutionPartially matching text with .str.contains()
Say you want to select all the records where the referrer link contains the text "medical". These links don't equal "medical", but they do include it. You can use a Series string method .str.contains()
to create a boolean index of records containing "medical".
medical_referrer_index = data['referrer'].str.contains('medical')
medical_referrals = data[medical_referrer_index]
medical_referrals
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
222 | http://www.inc.com/abigail-tracy/new-crowdfund... | 2015-12-07 23:15:52 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (iPad; CPU OS 9_1 like Mac OS X) A... | BERNADETTE W | inc.com | iPad | |
568 | http://www.google.pl/url?sa=t&source=web&cd=1&... | 2016-03-19 10:03:28 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-... | KRISTIAN E | Android | ||
3115 | https://www.reddit.com/r/Anarcho_Capitalism/co... | 2015-12-19 00:33:04 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | JAIMEE U | reddit.com | gift-cards | Desktop |
3886 | http://poormedicalcare.weebly.com/help-our-cau... | 2016-03-19 18:30:13 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK... | STEPANIE V | weebly.com | Desktop | |
3980 | http://my.parallaxsearch.com/web?qs=crowd+fund... | 2016-03-11 05:22:01 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | ELLIE U | parallaxsearch.com | Desktop | |
4859 | http://www.google.co.in/search?site=&oq=crowdf... | 2016-03-14 18:59:07 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; S... | MARRY D | Android |
If you’re already familiar with SQL, this is equivalent to writing something like this:
SELECT * FROM watsi
WHERE referrer LIKE "%medical%";
By default, .str.contains()
is case sensitive. To disregard case, use .str.contains(case=False)
.
The output has cut off the links. Use the method .tolist()
to print the full referrer strings. .tolist()
will turn the pandas Series into a plain Python list.
medical_referrals['referrer'].tolist()
['http://www.inc.com/abigail-tracy/new-crowdfunding-site-for-medical-students.html',
'http://www.google.pl/url?sa=t&source=web&cd=1&rct=j&q=medical%20treatment%20grants&ved=0ahUKEwjRo5DGv8zLAhUiS5oKHdOAC58QFggaMAA&url=https%3A%2F%2Fwatsi.org%2F&usg=AFQjCNEkfSK0vlcYRRpj89LfLVEvb8rV6A&sig2=dKNwFh-vKOj-UFZJ63zciw',
'https://www.reddit.com/r/Anarcho_Capitalism/comments/3xddbx/watsi_a_cool_new_charity_that_crowdfunds_medical/',
'http://poormedicalcare.weebly.com/help-our-cause.html',
'http://my.parallaxsearch.com/web?qs=crowd+funding+for+medical+treatment',
'http://www.google.co.in/search?site=&oq=crowdfunding+for+me&aqs=mobile-gws-lite.0.0l5&q=crowdfunding+for+medical+treatment']
Interesting. Some of the pageviews with referrer links containing "medical" were searches. Looking at these links, you can see a query string (containing the ? character). Query strings are how searches are stored in urls. For example, in the last link, the user searched "crowdfunding for medical treatment". Digging into this kind of data could help you understand more about how users are finding Watsi through search.
Practice Problem
Find the users who visited the site on a windows phone using `user_agent`. Output the full string values.
View SolutionLesson summary:
In this lesson, you learned to:
- Filter data using boolean indexing
- Partially match text with
.str.contains()
In the next lesson, you'll learn to process columns to create new columns of new data, or derived data.
Next Lesson
Deriving New Columns & Defining Python Functions