SQL Tutorial

Python Tutorial

Learn Python for business analysis using real-world data. No coding experience necessary.

Start Now

Mode Studio

The Collaborative Data Science Platform

Filtering Data in Python with Boolean Indexes

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

This lesson builds on the previous lesson, Counting values and basic plotting.

You'll learn how to:

Filter and segment data using boolean indexing
Partially match text with .str.contains()

Filtering data will allow you to select events following specific patterns, such as finding pages with high pageview counts. You'll then be able to dig deeper into the source of such traffic.

Loading data into Mode Python notebooks

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:

Log into Mode or create an account.
Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets[0] is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Getting oriented with the data

This lesson uses data from Watsi. For a data dictionary with more information, click here.

Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head() method to preview the first five rows.

Input


import pandas as pd

Input


data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson

Input


data.head()

Output

	referrer	timestamp	title	url	user_agent	user_id	referrer_domain	website_section	platform
0	https://www.google.com/	2016-02-05 00:48:23	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4...	CHAROLETTE S	google		Desktop
1	https://themeteorchef.com/snippets/making-use-...	2016-02-24 23:12:10	Watsi \| The Meteor Chef	https://watsi.org/team/the-meteor-chef	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	WARREN Q	themeteorchef.com	team	Desktop
2	https://watsi.org/	2015-12-25 17:59:35	Watsi \| Give the gift of health with a Watsi G...	https://watsi.org/gift-cards	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1...	MITCHEL O	watsi.org	gift-cards	Desktop
3		2016-02-05 21:19:30	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...	MICHEL O			Desktop
4	https://watsi.org/fund-treatments	2016-02-14 19:30:08	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2...	ANDREE N	watsi.org		Desktop

Filtering data with boolean indexing

In the previous lesson, you examined the makeup of platforms used to access Watsi’s site, pages visited, and sites that referred visitors. But what about the behavior of people who visit the site? What pages did people visit on mobile? What pages and sections of the website did referred visitors land on?

In the previous lesson, you saw that visitors most often landed on the homepage. Next you might want to know:

When people came to the homepage, where did they come from?

To answer this, you'll need to break down traffic into these subsections based on their referrers—the websites they were on before going to Watsi. This practice is called segmentation.

You'll need the records of pageviews on the homepage and the referrer domains for those specific pageviews. To select the records of pageviews that went to the homepage, you'll need to identify which rows went to the homepage based on the 'title' column, then select them as a new DataFrame.

From a technical standpoint, you'll need to create a list indicating which rows to select—rows for which the title is the homepage. You can use a boolean index, a Series composed of True or False values that correspond to rows in the dataset. The True/False values describe which rows you want to select, namely only the True rows.

To create the boolean index representing pageviews of the homepage, you can compare each value of "title" to see if it's the title of the homepage, resulting in a Series of booleans:

Input


(data['title'] == 'Watsi | Fund medical treatments for people around the world')

Output


0        True
1       False
2       False
3        True
4        True
5       False
6       False
7       False
8       False
9        True
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20       True
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29       True
        ...  
4970    False
4971    False
4972    False
4973     True
4974     True
4975    False
4976    False
4977    False
4978    False
4979    False
4980    False
4981    False
4982    False
4983     True
4984     True
4985    False
4986    False
4987     True
4988    False
4989    False
4990    False
4991     True
4992    False
4993    False
4994    False
4995    False
4996     True
4997     True
4998    False
4999    False
Name: title, dtype: bool

You'll want to assign the boolean index to a variable name: homepage_index:

Input


homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # copied from above

The next step is to use the boolean index to filter your data. You can do this similarly to how you select columns or rows: use the boolean index inside square brackets to select the records from the DataFrame for which the boolean index reads True. Store the filtered dataset under a new variable name, watsi_homepage:

Input


watsi_homepage = data[homepage_index] # selects the "True" rows recorded in the boolean index

Now that you've got a filtered DataFrame called watsi_homepage, take a look at where traffic came from. The fastest way to do this is to summarize the values in the referrer column using .value_counts():

Input


watsi_homepage['referrer'].value_counts()[:15]

Output


                                                                                                     451
https://www.google.com/                                                                              153
https://www.google.co.in/                                                                             53
https://watsi.org/                                                                                    22
https://www.reddit.com/                                                                               21
https://watsi.org/fund-treatments                                                                     18
http://blog.watsi.org/                                                                                17
https://watsi.org/about                                                                               17
https://www.google.com.au/                                                                            16
https://www.google.ca/                                                                                15
https://www.google.com.ph/                                                                            15
https://www.google.co.uk/                                                                             14
http://www.forbes.com/sites/emmajohnson/2015/02/26/best-charity-sites-to-teach-kids-about-giving/     10
https://t.co/2chd5GTmw1                                                                               10
https://www.google.com                                                                                10
Name: referrer, dtype: int64

Links are messy

You might have noticed a few key things:

Referral links are messy! Some have long paths, subdomains, query strings, or referral strings that make it hard to easily see that two visitors came from the same website. For example, google.com, google.co.in, and all of the other local Google domains should really be counted simply as “google” for purposes of this analysis.
Referrals are from internal sources (watsi.org), and external sources like Google, Facebook, Reddit, and others.
Quite a large number of referrals are missing. The top referrer is an empty one. This means that people came directly to the site, but it's impossible to be sure what actually led them to it.

Because Google is represented by so many different unique values for referrer, it's difficult to simply say, "30% of hits came from Google searches" based on this data. Normally you'd need to clean it up a bit before you could do that.

You’ll learn about data cleanup later on in this tutorial. For now, you can instead use the column called referrer_domain, which is just the extracted domains:

Input


watsi_homepage['referrer_domain'].value_counts()[:10]

Output


                         452
google                   422
watsi.org                169
reddit.com                36
facebook.com              24
t.co                      22
vessel.com                11
forbes.com                10
adzerk.net                 9
thedoctorschannel.com      7
Name: referrer_domain, dtype: int64

Much easier to digest. It looks like for the Watsi homepage, direct traffic (no referrer) was a bigger contender than any other single source. Hopefully that's a positive sign of organic sharing on social media.

Practice Problem

Select all the pageviews originating from the Reddit domain, and see where traffic is landing within Watsi.

View Solution

Partially matching text with `.str.contains()`

Say you want to select all the records where the referrer link contains the text "medical". These links don't equal "medical", but they do include it. You can use a Series string method .str.contains() to create a boolean index of records containing "medical".

Input


medical_referrer_index = data['referrer'].str.contains('medical')
medical_referrals = data[medical_referrer_index]
medical_referrals

Output

	referrer	timestamp	title	url	user_agent	user_id	referrer_domain	website_section	platform
222	http://www.inc.com/abigail-tracy/new-crowdfund...	2015-12-07 23:15:52	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (iPad; CPU OS 9_1 like Mac OS X) A...	BERNADETTE W	inc.com		iPad
568	http://www.google.pl/url?sa=t&source=web&cd=1&...	2016-03-19 10:03:28	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-...	KRISTIAN E	google		Android
3115	https://www.reddit.com/r/Anarcho_Capitalism/co...	2015-12-19 00:33:04	Watsi \| Give the gift of health with a Watsi G...	https://watsi.org/gift-cards	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	JAIMEE U	reddit.com	gift-cards	Desktop
3886	http://poormedicalcare.weebly.com/help-our-cau...	2016-03-19 18:30:13	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...	STEPANIE V	weebly.com		Desktop
3980	http://my.parallaxsearch.com/web?qs=crowd+fund...	2016-03-11 05:22:01	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	ELLIE U	parallaxsearch.com		Desktop
4859	http://www.google.co.in/search?site=&oq=crowdf...	2016-03-14 18:59:07	Watsi \| Fund medical treatments for people aro...	https://watsi.org/	Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; S...	MARRY D	google		Android

If you’re already familiar with SQL, this is equivalent to writing something like this:

SELECT * FROM watsi
WHERE referrer LIKE "%medical%";

By default, .str.contains() is case sensitive. To disregard case, use .str.contains(case=False).

The output has cut off the links. Use the method .tolist() to print the full referrer strings. .tolist() will turn the pandas Series into a plain Python list.

Input


medical_referrals['referrer'].tolist()

Output


['http://www.inc.com/abigail-tracy/new-crowdfunding-site-for-medical-students.html',
 'http://www.google.pl/url?sa=t&source=web&cd=1&rct=j&q=medical%20treatment%20grants&ved=0ahUKEwjRo5DGv8zLAhUiS5oKHdOAC58QFggaMAA&url=https%3A%2F%2Fwatsi.org%2F&usg=AFQjCNEkfSK0vlcYRRpj89LfLVEvb8rV6A&sig2=dKNwFh-vKOj-UFZJ63zciw',
 'https://www.reddit.com/r/Anarcho_Capitalism/comments/3xddbx/watsi_a_cool_new_charity_that_crowdfunds_medical/',
 'http://poormedicalcare.weebly.com/help-our-cause.html',
 'http://my.parallaxsearch.com/web?qs=crowd+funding+for+medical+treatment',
 'http://www.google.co.in/search?site=&oq=crowdfunding+for+me&aqs=mobile-gws-lite.0.0l5&q=crowdfunding+for+medical+treatment']

Interesting. Some of the pageviews with referrer links containing "medical" were searches. Looking at these links, you can see a query string (containing the ? character). Query strings are how searches are stored in urls. For example, in the last link, the user searched "crowdfunding for medical treatment". Digging into this kind of data could help you understand more about how users are finding Watsi through search.

Practice Problem

Find the records with a referrer link containing "crowdfund".

View Solution

Practice Problem

Find the users who visited the site on a windows phone using `user_agent`. Output the full string values.

View Solution

Create a boolean index:

str.contains('IEMobile') (method)
in column data['user_agent'] (Series)

Then select the output:

data[idx] selects the data with the boolean index (DataFrame)
['user_agent'] selects the user_agent column (Series)
.tolist() translates the selected values to a list (list)

Input


idx = data['user_agent'].str.contains('IEMobile')
data[idx]['user_agent'].tolist()

Output


['Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 635) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 535 Dual SIM) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 535 Dual SIM) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 532; Orange) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; Microsoft; Lumia 640 XL Dual SIM)',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 630) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 1520) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 929) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 535 Dual SIM) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537',
 'Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 435 Dual SIM) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537']

Lesson summary:

In this lesson, you learned to:

Filter data using boolean indexing
Partially match text with .str.contains()

In the next lesson, you'll learn to process columns to create new columns of new data, or derived data.

Next Lesson

Deriving New Columns & Defining Python Functions

Mode is the intelligence layer for your data stack

Get a firsthand look at Mode's key features & workflows

SQL Tutorial

Python Tutorial

Mode Studio

Filtering Data in Python with Boolean Indexes

Goals of this lesson

Loading data into Mode Python notebooks

Getting oriented with the data

Filtering data with boolean indexing

Links are messy

Practice Problem

Partially matching text with `.str.contains()`

Practice Problem

Practice Problem

Lesson summary:

SQL Tutorial

Python Tutorial

Mode Studio

Filtering Data in Python with Boolean Indexes

Goals of this lesson

Loading data into Mode Python notebooks

Getting oriented with the data

Filtering data with boolean indexing

Links are messy

Practice Problem

Partially matching text with .str.contains()

Practice Problem

Practice Problem

Lesson summary:

Partially matching text with `.str.contains()`