ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Mode Studio

The Collaborative Data Science Platform

Filtering Data in Python with Boolean Indexes

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

This lesson builds on the previous lesson, Counting values and basic plotting.

You'll learn how to:

Filtering data will allow you to select events following specific patterns, such as finding pages with high pageview counts. You'll then be able to dig deeper into the source of such traffic.

Loading data into Mode Python notebooks

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:

  1. Log into Mode or create an account.
  2. Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
  3. Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
  4. The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets[0] is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Getting oriented with the data

This lesson uses data from Watsi. For a data dictionary with more information, click here.

Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head() method to preview the first five rows.

Input

import pandas as pd
    
Input

data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson
    
Input

data.head()
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop
1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop
2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards Desktop
3 2016-02-05 21:19:30 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... MICHEL O Desktop
4 https://watsi.org/fund-treatments 2016-02-14 19:30:08 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... ANDREE N watsi.org Desktop

Filtering data with boolean indexing

In the previous lesson, you examined the makeup of platforms used to access Watsi’s site, pages visited, and sites that referred visitors. But what about the behavior of people who visit the site? What pages did people visit on mobile? What pages and sections of the website did referred visitors land on?

In the previous lesson, you saw that visitors most often landed on the homepage. Next you might want to know:

When people came to the homepage, where did they come from?

To answer this, you'll need to break down traffic into these subsections based on their referrers—the websites they were on before going to Watsi. This practice is called segmentation.

You'll need the records of pageviews on the homepage and the referrer domains for those specific pageviews. To select the records of pageviews that went to the homepage, you'll need to identify which rows went to the homepage based on the 'title' column, then select them as a new DataFrame.

From a technical standpoint, you'll need to create a list indicating which rows to select—rows for which the title is the homepage. You can use a boolean index, a Series composed of True or False values that correspond to rows in the dataset. The True/False values describe which rows you want to select, namely only the True rows.

To create the boolean index representing pageviews of the homepage, you can compare each value of "title" to see if it's the title of the homepage, resulting in a Series of booleans:

Input

(data['title'] == 'Watsi | Fund medical treatments for people around the world')
    
Output

0        True
1       False
2       False
3        True
4        True
5       False
6       False
7       False
8       False
9        True
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20       True
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29       True
        ...  
4970    False
4971    False
4972    False
4973     True
4974     True
4975    False
4976    False
4977    False
4978    False
4979    False
4980    False
4981    False
4982    False
4983     True
4984     True
4985    False
4986    False
4987     True
4988    False
4989    False
4990    False
4991     True
4992    False
4993    False
4994    False
4995    False
4996     True
4997     True
4998    False
4999    False
Name: title, dtype: bool
    

You'll want to assign the boolean index to a variable name: homepage_index:

Input

homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # copied from above
    

The next step is to use the boolean index to filter your data. You can do this similarly to how you select columns or rows: use the boolean index inside square brackets to select the records from the DataFrame for which the boolean index reads True. Store the filtered dataset under a new variable name, watsi_homepage:

Input

watsi_homepage = data[homepage_index] # selects the "True" rows recorded in the boolean index
    

Now that you've got a filtered DataFrame called watsi_homepage, take a look at where traffic came from. The fastest way to do this is to summarize the values in the referrer column using .value_counts():

Input

watsi_homepage['referrer'].value_counts()[:15]
    
Output

                                                                                                     451
https://www.google.com/                                                                              153
https://www.google.co.in/                                                                             53
https://watsi.org/                                                                                    22
https://www.reddit.com/                                                                               21
https://watsi.org/fund-treatments                                                                     18
http://blog.watsi.org/                                                                                17
https://watsi.org/about                                                                               17
https://www.google.com.au/                                                                            16
https://www.google.ca/                                                                                15
https://www.google.com.ph/                                                                            15
https://www.google.co.uk/                                                                             14
http://www.forbes.com/sites/emmajohnson/2015/02/26/best-charity-sites-to-teach-kids-about-giving/     10
https://t.co/2chd5GTmw1                                                                               10
https://www.google.com                                                                                10
Name: referrer, dtype: int64
    

You might have noticed a few key things:

  • Referral links are messy! Some have long paths, subdomains, query strings, or referral strings that make it hard to easily see that two visitors came from the same website. For example, google.com, google.co.in, and all of the other local Google domains should really be counted simply as “google” for purposes of this analysis.
  • Referrals are from internal sources (watsi.org), and external sources like Google, Facebook, Reddit, and others.
  • Quite a large number of referrals are missing. The top referrer is an empty one. This means that people came directly to the site, but it's impossible to be sure what actually led them to it.

Because Google is represented by so many different unique values for referrer, it's difficult to simply say, "30% of hits came from Google searches" based on this data. Normally you'd need to clean it up a bit before you could do that.

You’ll learn about data cleanup later on in this tutorial. For now, you can instead use the column called referrer_domain, which is just the extracted domains:

Input

watsi_homepage['referrer_domain'].value_counts()[:10]
    
Output

                         452
google                   422
watsi.org                169
reddit.com                36
facebook.com              24
t.co                      22
vessel.com                11
forbes.com                10
adzerk.net                 9
thedoctorschannel.com      7
Name: referrer_domain, dtype: int64
    

Much easier to digest. It looks like for the Watsi homepage, direct traffic (no referrer) was a bigger contender than any other single source. Hopefully that's a positive sign of organic sharing on social media.

Practice Problem

Select all the pageviews originating from the Reddit domain, and see where traffic is landing within Watsi.

View Solution

Partially matching text with .str.contains()

Say you want to select all the records where the referrer link contains the text "medical". These links don't equal "medical", but they do include it. You can use a Series string method .str.contains() to create a boolean index of records containing "medical".

Input

medical_referrer_index = data['referrer'].str.contains('medical')
medical_referrals = data[medical_referrer_index]
medical_referrals
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
222 http://www.inc.com/abigail-tracy/new-crowdfund... 2015-12-07 23:15:52 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (iPad; CPU OS 9_1 like Mac OS X) A... BERNADETTE W inc.com iPad
568 http://www.google.pl/url?sa=t&source=web&cd=1&... 2016-03-19 10:03:28 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-... KRISTIAN E google Android
3115 https://www.reddit.com/r/Anarcho_Capitalism/co... 2015-12-19 00:33:04 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... JAIMEE U reddit.com gift-cards Desktop
3886 http://poormedicalcare.weebly.com/help-our-cau... 2016-03-19 18:30:13 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK... STEPANIE V weebly.com Desktop
3980 http://my.parallaxsearch.com/web?qs=crowd+fund... 2016-03-11 05:22:01 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... ELLIE U parallaxsearch.com Desktop
4859 http://www.google.co.in/search?site=&oq=crowdf... 2016-03-14 18:59:07 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; S... MARRY D google Android

If you’re already familiar with SQL, this is equivalent to writing something like this:

SELECT * FROM watsi
WHERE referrer LIKE "%medical%";

By default, .str.contains() is case sensitive. To disregard case, use .str.contains(case=False).

The output has cut off the links. Use the method .tolist() to print the full referrer strings. .tolist() will turn the pandas Series into a plain Python list.

Input

medical_referrals['referrer'].tolist()
    
Output

['http://www.inc.com/abigail-tracy/new-crowdfunding-site-for-medical-students.html',
 'http://www.google.pl/url?sa=t&source=web&cd=1&rct=j&q=medical%20treatment%20grants&ved=0ahUKEwjRo5DGv8zLAhUiS5oKHdOAC58QFggaMAA&url=https%3A%2F%2Fwatsi.org%2F&usg=AFQjCNEkfSK0vlcYRRpj89LfLVEvb8rV6A&sig2=dKNwFh-vKOj-UFZJ63zciw',
 'https://www.reddit.com/r/Anarcho_Capitalism/comments/3xddbx/watsi_a_cool_new_charity_that_crowdfunds_medical/',
 'http://poormedicalcare.weebly.com/help-our-cause.html',
 'http://my.parallaxsearch.com/web?qs=crowd+funding+for+medical+treatment',
 'http://www.google.co.in/search?site=&oq=crowdfunding+for+me&aqs=mobile-gws-lite.0.0l5&q=crowdfunding+for+medical+treatment']
    

Interesting. Some of the pageviews with referrer links containing "medical" were searches. Looking at these links, you can see a query string (containing the ? character). Query strings are how searches are stored in urls. For example, in the last link, the user searched "crowdfunding for medical treatment". Digging into this kind of data could help you understand more about how users are finding Watsi through search.

Practice Problem

Find the records with a referrer link containing "crowdfund".

View Solution

Practice Problem

Find the users who visited the site on a windows phone using `user_agent`. Output the full string values.

View Solution

Lesson summary:

In this lesson, you learned to:

  • Filter data using boolean indexing
  • Partially match text with .str.contains()

In the next lesson, you'll learn to process columns to create new columns of new data, or derived data.

Next Lesson

Deriving New Columns & Defining Python Functions

Get more from your data

Your team can be up and running in 30 minutes or less.