ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Mode Studio

The Collaborative Data Science Platform

Filtering Data in Python with Boolean Indexes

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

This lesson builds on the previous lesson, Counting values and basic plotting.

You'll learn how to:

Filtering data will allow you to select events following specific patterns, such as finding pages with high pageview counts. You'll then be able to dig deeper into the source of such traffic.

Loading data into Mode Python notebooks

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:

  1. Log into Mode or create an account.
  2. Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
  3. Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
  4. The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets[0] is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Getting oriented with the data

This lesson uses data from Watsi. For a data dictionary with more information, click here.

Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head() method to preview the first five rows.


import pandas as pd

data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson

referrer timestamp title url user_agent user_id referrer_domain website_section platform
0 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop
1 2016-02-24 23:12:10 Watsi | The Meteor Chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q team Desktop
2 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O gift-cards Desktop
3 2016-02-05 21:19:30 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... MICHEL O Desktop
4 2016-02-14 19:30:08 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... ANDREE N Desktop

Filtering data with boolean indexing

In the previous lesson, you examined the makeup of platforms used to access Watsi’s site, pages visited, and sites that referred visitors. But what about the behavior of people who visit the site? What pages did people visit on mobile? What pages and sections of the website did referred visitors land on?

In the previous lesson, you saw that visitors most often landed on the homepage. Next you might want to know:

When people came to the homepage, where did they come from?

To answer this, you'll need to break down traffic into these subsections based on their referrers—the websites they were on before going to Watsi. This practice is called segmentation.

You'll need the records of pageviews on the homepage and the referrer domains for those specific pageviews. To select the records of pageviews that went to the homepage, you'll need to identify which rows went to the homepage based on the 'title' column, then select them as a new DataFrame.

From a technical standpoint, you'll need to create a list indicating which rows to select—rows for which the title is the homepage. You can use a boolean index, a Series composed of True or False values that correspond to rows in the dataset. The True/False values describe which rows you want to select, namely only the True rows.

To create the boolean index representing pageviews of the homepage, you can compare each value of "title" to see if it's the title of the homepage, resulting in a Series of booleans:


(data['title'] == 'Watsi | Fund medical treatments for people around the world')

0        True
1       False
2       False
3        True
4        True
5       False
6       False
7       False
8       False
9        True
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20       True
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29       True
4970    False
4971    False
4972    False
4973     True
4974     True
4975    False
4976    False
4977    False
4978    False
4979    False
4980    False
4981    False
4982    False
4983     True
4984     True
4985    False
4986    False
4987     True
4988    False
4989    False
4990    False
4991     True
4992    False
4993    False
4994    False
4995    False
4996     True
4997     True
4998    False
4999    False
Name: title, dtype: bool

You'll want to assign the boolean index to a variable name: homepage_index:


homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # copied from above

The next step is to use the boolean index to filter your data. You can do this similarly to how you select columns or rows: use the boolean index inside square brackets to select the records from the DataFrame for which the boolean index reads True. Store the filtered dataset under a new variable name, watsi_homepage:


watsi_homepage = data[homepage_index] # selects the "True" rows recorded in the boolean index

Now that you've got a filtered DataFrame called watsi_homepage, take a look at where traffic came from. The fastest way to do this is to summarize the values in the referrer column using .value_counts():



                                                                                                     451                                                                              153                                                                             53                                                                                    22                                                                               21                                                                     18                                                                                17                                                                               17                                                                            16                                                                                15                                                                            15                                                                             14     10                                                                               10                                                                                10
Name: referrer, dtype: int64

You might have noticed a few key things:

  • Referral links are messy! Some have long paths, subdomains, query strings, or referral strings that make it hard to easily see that two visitors came from the same website. For example,,, and all of the other local Google domains should really be counted simply as “google” for purposes of this analysis.
  • Referrals are from internal sources (, and external sources like Google, Facebook, Reddit, and others.
  • Quite a large number of referrals are missing. The top referrer is an empty one. This means that people came directly to the site, but it's impossible to be sure what actually led them to it.

Because Google is represented by so many different unique values for referrer, it's difficult to simply say, "30% of hits came from Google searches" based on this data. Normally you'd need to clean it up a bit before you could do that.

You’ll learn about data cleanup later on in this tutorial. For now, you can instead use the column called referrer_domain, which is just the extracted domains:



google                   422                169                36              24                      22                11                10                 9      7
Name: referrer_domain, dtype: int64

Much easier to digest. It looks like for the Watsi homepage, direct traffic (no referrer) was a bigger contender than any other single source. Hopefully that's a positive sign of organic sharing on social media.

Practice Problem

Select all the pageviews originating from the Reddit domain, and see where traffic is landing within Watsi.

View Solution

Partially matching text with .str.contains()

Say you want to select all the records where the referrer link contains the text "medical". These links don't equal "medical", but they do include it. You can use a Series string method .str.contains() to create a boolean index of records containing "medical".


medical_referrer_index = data['referrer'].str.contains('medical')
medical_referrals = data[medical_referrer_index]
referrer timestamp title url user_agent user_id referrer_domain website_section platform
222 2015-12-07 23:15:52 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (iPad; CPU OS 9_1 like Mac OS X) A... BERNADETTE W iPad
568 2016-03-19 10:03:28 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-... KRISTIAN E google Android
3115 2015-12-19 00:33:04 Watsi | Give the gift of health with a Watsi G... Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... JAIMEE U gift-cards Desktop
3886 2016-03-19 18:30:13 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK... STEPANIE V Desktop
3980 2016-03-11 05:22:01 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... ELLIE U Desktop
4859 2016-03-14 18:59:07 Watsi | Fund medical treatments for people aro... Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; S... MARRY D google Android

If you’re already familiar with SQL, this is equivalent to writing something like this:

WHERE referrer LIKE "%medical%";

By default, .str.contains() is case sensitive. To disregard case, use .str.contains(case=False).

The output has cut off the links. Use the method .tolist() to print the full referrer strings. .tolist() will turn the pandas Series into a plain Python list.




Interesting. Some of the pageviews with referrer links containing "medical" were searches. Looking at these links, you can see a query string (containing the ? character). Query strings are how searches are stored in urls. For example, in the last link, the user searched "crowdfunding for medical treatment". Digging into this kind of data could help you understand more about how users are finding Watsi through search.

Practice Problem

Find the records with a referrer link containing "crowdfund".

View Solution

Practice Problem

Find the users who visited the site on a windows phone using `user_agent`. Output the full string values.

View Solution

Lesson summary:

In this lesson, you learned to:

  • Filter data using boolean indexing
  • Partially match text with .str.contains()

In the next lesson, you'll learn to process columns to create new columns of new data, or derived data.

Next Lesson

Deriving New Columns & Defining Python Functions

Get more from your data

Your team can be up and running in 30 minutes or less.