Filtering Data in Python with Boolean Indexes
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
This lesson builds on the previous lesson, Counting values and basic plotting.
You'll learn how to:
- Filter and segment data using boolean indexing
- Partially match text with .str.contains()
Filtering data will allow you to select events following specific patterns, such as finding pages with high pageview counts. You'll then be able to dig deeper into the source of such traffic.
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Clone. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets.head(n=5). Run this code so you can see the first five rows of the dataset.
datasets is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the
.head() method to preview the first five rows.
import pandas as pd
data = datasets # assign SQL query results to the data variable data = data.fillna('') # replace missing values with '' as in the previous lesson
|0||https://www.google.com/||2016-02-05 00:48:23||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4...||CHAROLETTE S||Desktop|
|1||https://themeteorchef.com/snippets/making-use-...||2016-02-24 23:12:10||Watsi | The Meteor Chef||https://watsi.org/team/the-meteor-chef||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||WARREN Q||themeteorchef.com||team||Desktop|
|2||https://watsi.org/||2015-12-25 17:59:35||Watsi | Give the gift of health with a Watsi G...||https://watsi.org/gift-cards||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1...||MITCHEL O||watsi.org||gift-cards||Desktop|
|3||2016-02-05 21:19:30||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...||MICHEL O||Desktop|
|4||https://watsi.org/fund-treatments||2016-02-14 19:30:08||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2...||ANDREE N||watsi.org||Desktop|
In the previous lesson, you examined the makeup of platforms used to access Watsi’s site, pages visited, and sites that referred visitors. But what about the behavior of people who visit the site? What pages did people visit on mobile? What pages and sections of the website did referred visitors land on?
In the previous lesson, you saw that visitors most often landed on the homepage. Next you might want to know:
When people came to the homepage, where did they come from?
To answer this, you'll need to break down traffic into these subsections based on their referrers—the websites they were on before going to Watsi. This practice is called segmentation.
You'll need the records of pageviews on the homepage and the referrer domains for those specific pageviews. To select the records of pageviews that went to the homepage, you'll need to identify which rows went to the homepage based on the
'title' column, then select them as a new DataFrame.
From a technical standpoint, you'll need to create a list indicating which rows to select—rows for which the
title is the homepage. You can use a boolean index, a Series composed of
False values that correspond to rows in the dataset. The
False values describe which rows you want to select, namely only the
To create the boolean index representing pageviews of the homepage, you can compare each value of "title" to see if it's the title of the homepage, resulting in a Series of booleans:
(data['title'] == 'Watsi | Fund medical treatments for people around the world')
0 True 1 False 2 False 3 True 4 True 5 False 6 False 7 False 8 False 9 True 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 True 20 True 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 True ... 4970 False 4971 False 4972 False 4973 True 4974 True 4975 False 4976 False 4977 False 4978 False 4979 False 4980 False 4981 False 4982 False 4983 True 4984 True 4985 False 4986 False 4987 True 4988 False 4989 False 4990 False 4991 True 4992 False 4993 False 4994 False 4995 False 4996 True 4997 True 4998 False 4999 False Name: title, dtype: bool
You'll want to assign the boolean index to a variable name:
homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # copied from above
The next step is to use the boolean index to filter your data. You can do this similarly to how you select columns or rows: use the boolean index inside square brackets to select the records from the DataFrame for which the boolean index reads
True. Store the filtered dataset under a new variable name,
watsi_homepage = data[homepage_index] # selects the "True" rows recorded in the boolean index
Now that you've got a filtered DataFrame called
watsi_homepage, take a look at where traffic came from. The fastest way to do this is to summarize the values in the
referrer column using
451 https://www.google.com/ 153 https://www.google.co.in/ 53 https://watsi.org/ 22 https://www.reddit.com/ 21 https://watsi.org/fund-treatments 18 http://blog.watsi.org/ 17 https://watsi.org/about 17 https://www.google.com.au/ 16 https://www.google.ca/ 15 https://www.google.com.ph/ 15 https://www.google.co.uk/ 14 http://www.forbes.com/sites/emmajohnson/2015/02/26/best-charity-sites-to-teach-kids-about-giving/ 10 https://t.co/2chd5GTmw1 10 https://www.google.com 10 Name: referrer, dtype: int64
You might have noticed a few key things:
- Referral links are messy! Some have long paths, subdomains, query strings, or referral strings that make it hard to easily see that two visitors came from the same website. For example, google.com, google.co.in, and all of the other local Google domains should really be counted simply as “google” for purposes of this analysis.
- Referrals are from internal sources (watsi.org), and external sources like Google, Facebook, Reddit, and others.
- Quite a large number of referrals are missing. The top referrer is an empty one. This means that people came directly to the site, but it's impossible to be sure what actually led them to it.
Because Google is represented by so many different unique values for
referrer, it's difficult to simply say, "30% of hits came from Google searches" based on this data. Normally you'd need to clean it up a bit before you could do that.
You’ll learn about data cleanup later on in this tutorial. For now, you can instead use the column called
referrer_domain, which is just the extracted domains:
452 google 422 watsi.org 169 reddit.com 36 facebook.com 24 t.co 22 vessel.com 11 forbes.com 10 adzerk.net 9 thedoctorschannel.com 7 Name: referrer_domain, dtype: int64
Much easier to digest. It looks like for the Watsi homepage, direct traffic (no referrer) was a bigger contender than any other single source. Hopefully that's a positive sign of organic sharing on social media.
Say you want to select all the records where the referrer link contains the text "medical". These links don't equal "medical", but they do include it. You can use a Series string method
.str.contains() to create a boolean index of records containing "medical".
medical_referrer_index = data['referrer'].str.contains('medical') medical_referrals = data[medical_referrer_index] medical_referrals
|222||http://www.inc.com/abigail-tracy/new-crowdfund...||2015-12-07 23:15:52||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (iPad; CPU OS 9_1 like Mac OS X) A...||BERNADETTE W||inc.com||iPad|
|568||http://www.google.pl/url?sa=t&source=web&cd=1&...||2016-03-19 10:03:28||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-...||KRISTIAN E||Android|
|3115||https://www.reddit.com/r/Anarcho_Capitalism/co...||2015-12-19 00:33:04||Watsi | Give the gift of health with a Watsi G...||https://watsi.org/gift-cards||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||JAIMEE U||reddit.com||gift-cards||Desktop|
|3886||http://poormedicalcare.weebly.com/help-our-cau...||2016-03-19 18:30:13||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...||STEPANIE V||weebly.com||Desktop|
|3980||http://my.parallaxsearch.com/web?qs=crowd+fund...||2016-03-11 05:22:01||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||ELLIE U||parallaxsearch.com||Desktop|
|4859||http://www.google.co.in/search?site=&oq=crowdf...||2016-03-14 18:59:07||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; S...||MARRY D||Android|
If you’re already familiar with SQL, this is equivalent to writing something like this:
SELECT * FROM watsi WHERE referrer LIKE "%medical%";
.str.contains() is case sensitive. To disregard case, use
The output has cut off the links. Use the method
.tolist() to print the full referrer strings.
.tolist() will turn the pandas Series into a plain Python list.
['http://www.inc.com/abigail-tracy/new-crowdfunding-site-for-medical-students.html', 'http://www.google.pl/url?sa=t&source=web&cd=1&rct=j&q=medical%20treatment%20grants&ved=0ahUKEwjRo5DGv8zLAhUiS5oKHdOAC58QFggaMAA&url=https%3A%2F%2Fwatsi.org%2F&usg=AFQjCNEkfSK0vlcYRRpj89LfLVEvb8rV6A&sig2=dKNwFh-vKOj-UFZJ63zciw', 'https://www.reddit.com/r/Anarcho_Capitalism/comments/3xddbx/watsi_a_cool_new_charity_that_crowdfunds_medical/', 'http://poormedicalcare.weebly.com/help-our-cause.html', 'http://my.parallaxsearch.com/web?qs=crowd+funding+for+medical+treatment', 'http://www.google.co.in/search?site=&oq=crowdfunding+for+me&aqs=mobile-gws-lite.0.0l5&q=crowdfunding+for+medical+treatment']
Interesting. Some of the pageviews with referrer links containing "medical" were searches. Looking at these links, you can see a query string (containing the ? character). Query strings are how searches are stored in urls. For example, in the last link, the user searched "crowdfunding for medical treatment". Digging into this kind of data could help you understand more about how users are finding Watsi through search.
In this lesson, you learned to:
- Filter data using boolean indexing
- Partially match text with
In the next lesson, you'll learn to process columns to create new columns of new data, or derived data.
Deriving New Columns & Defining Python Functions