Python Tutorial
Learn Python for business analysis using real-world data. No coding experience necessary.
Start Now
Mode Studio
The Collaborative Data Science Platform
Deriving New Columns & Defining Python Functions
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
Goals of this lesson
This lesson builds on the pandas DataFrame data type you learned about in a previous lesson. You’ll learn how to:
- Derive new columns from existing data
- Write and test functions using
in
,if
,else
, andelif
- Define functions using parameters and arguments
- Apply functions to
DataFrames
using.apply()
- Select multiple columns
Loading data into Mode Python notebooks
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets[0].head(n=5)
. Run this code so you can see the first five rows of the dataset.
datasets[0]
is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Getting oriented with the data
This lesson uses data from Watsi. For a data dictionary with more information, click here.
Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head()
method to preview the first five rows.
import pandas as pd
data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson
data.head()
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | |
---|---|---|---|---|---|---|---|---|---|
0 | https://www.google.com/ | 2016-02-05 00:48:23 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | CHAROLETTE S | Desktop | ||
1 | https://themeteorchef.com/snippets/making-use-... | 2016-02-24 23:12:10 | Watsi | The Meteor Chef | https://watsi.org/team/the-meteor-chef | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | WARREN Q | themeteorchef.com | team | Desktop |
2 | https://watsi.org/ | 2015-12-25 17:59:35 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... | MITCHEL O | watsi.org | gift-cards | Desktop |
3 | 2016-02-05 21:19:30 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... | MICHEL O | Desktop | |||
4 | https://watsi.org/fund-treatments | 2016-02-14 19:30:08 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... | ANDREE N | watsi.org | Desktop |
Deriving new columns from existing data
As you remember from the previous lesson, people used different platforms (iPhone, Windows, OSX, etc) to view pages on Watsi's site. Count the values in the platform column to get an idea of the distribution (for a quick refresher on distributions, check out this lesson:
data['platform'].value_counts()
Desktop 3934
iPhone 515
Android 388
iPad 154
Opera Mini 7
IEMobile 1
BlackBerry 1
Name: platform, dtype: int64
But say that instead, you want to compare Mobile and Desktop, treating all mobile devices as one way of interacting with Watsi’s site. Maybe you have a thesis about how people are more likely to search for Watsi on their desktop computer, but not on their phone. To do this, you need to create a new value for every row with one of two possible values: “Mobile” or “Desktop.” You can do this by creating a derived column based on the values in the platform
column. This new column is what’s known as a derived column because it’s been created using data from one or more existing columns.
To get the feel for this, start by creating a new column that is not derived from another column.
Creating a new column
Creating a column is much like creating a new key-value pair in a dictionary. By assigning values to the new column name, you add a column to the DataFrame:
data['new'] = 2 # the value for all rows
data[:3] # let's see that new column
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | new | |
---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.google.com/ | 2016-02-05 00:48:23 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | CHAROLETTE S | Desktop | 2 | ||
1 | https://themeteorchef.com/snippets/making-use-... | 2016-02-24 23:12:10 | Watsi | The Meteor Chef | https://watsi.org/team/the-meteor-chef | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | WARREN Q | themeteorchef.com | team | Desktop | 2 |
2 | https://watsi.org/ | 2015-12-25 17:59:35 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... | MITCHEL O | watsi.org | gift-cards | Desktop | 2 |
Make sure you scroll all the way to the right to check out the new column you just made.
Just as you saw with dictionaries in the first lesson, assigning values to an existing column will overwrite that column:
data['new'] = 'overwritten'
data[:3]
referrer | timestamp | title | url | user_agent | user_id | referrer_domain | website_section | platform | new | |
---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.google.com/ | 2016-02-05 00:48:23 | Watsi | Fund medical treatments for people aro... | https://watsi.org/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... | CHAROLETTE S | Desktop | overwritten | ||
1 | https://themeteorchef.com/snippets/making-use-... | 2016-02-24 23:12:10 | Watsi | The Meteor Chef | https://watsi.org/team/the-meteor-chef | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... | WARREN Q | themeteorchef.com | team | Desktop | overwritten |
2 | https://watsi.org/ | 2015-12-25 17:59:35 | Watsi | Give the gift of health with a Watsi G... | https://watsi.org/gift-cards | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... | MITCHEL O | watsi.org | gift-cards | Desktop | overwritten |
This is a simple example—you’ve just set the value for every row to be the same. In reality, you’ll almost never have use for a column where the values are all the same number. Instead, you’ll use functions to determine the value in each row of your new column.
Writing and testing Python functions
Say you wanted to compare just two categories—mobile and desktop. You can define mobile platforms in this list of strings:
mobile = ['iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry']
You'll use this list to filter values in the platform column. But first, you’ll need to learn a few tools for comparing values.
The in
operator
The handy Python operator in
allows you to evaluate whether something exists in a list. The evaluation returns a boolean. Here's how you check if "iPad", "Desktop", and "Monty Python" are mobile platforms:
print 'iPad' in mobile
print 'Desktop' in mobile
print 'Monty Python' in mobile
True
False
False
This is very similar to the IN
operator in SQL, where you might use:
SELECT * FROM tutorial.watsi_events
WHERE platform IN ('iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry');
if
control statements
Python has control statements, or pieces of logic, that will help you create your own functions. For example, you can check if the "Opera Mini" platform is in the mobile list and then print something if it returns a boolean of True.
if 'Opera Mini' in mobile:
print 'great success.'
great success.
If statements must result in a True
or False
. If the if statement results in True
, as in the above case, it will execute the code after the colon. Otherwise, it does not execute the code after the colon, like this:
if 'The Marriage of Figaro' in mobile:
print 'great success.'
'The Marriage of Figaro'
is not in the mobile
list, so the above statement evaluates to False
, skips the code indented after the colon, and nothing is printed. This is very similar to how the CASE statement works in SQL.
This little bit of logic opens up a world of possibilities. Using an if statement, you can write a function that decides what to do based on the values you find.
else
and elif
control statements
If a value is not found in the mobile
list, you might want to do something else with it. If the if
statement evaluates to false, as the last one did, you might want the function to take a different action.
if 'The Marriage of Figaro' in mobile:
print 'great success.'
else:
print 'that is immobile.'
that is immobile.
The code after else:
will execute when the if
statement returns False
. The keyword elif
, similarly, would evaluate if nothing before it had returned True
. The statement runs from top to bottom, and if a statement evaluates to True
, it executes the code after the colon, and then does not look at any other elif
or else
statements in the series. For example:
operas = ['The Marriage of Figaro', 'The Magic Flute', 'La traviata']
if 'The Marriage of Figaro' in mobile: print 'grave success.' elif 'The Marriage of Figaro' in operas: print 'that is a gravely beautiful piece.' else: print 'that is immobile.'
that is a gravely beautiful piece.
As you can see, the else
statement was not executed because the elif
statement evaluated to True and ran the print
statement 'that is a gravely beautiful piece.'
Defining functions
Functions are reusable code blocks that you can use to perform a single action. Functions can take in values (called "parameters" or "arguments") and perform logic. For more on the basics of functions, click here.
In the last statement you wrote, you performed logic using the if
statement. Here's how you might rewrite it to take an argument:
def is_in_mobile(platform): #define the "is_in_mobile" function, which accepts an argument called "platform"
if platform in mobile:
print 'great success.'
Let's break this down:
def
— the keyword that signals to Python that you're writing a functionis_in_mobile
— the name of your function, which can be anything you like, evenmy_hovercraft_is_full_of_eels()
(platform):
— the parameter(s)
Now you can give the function a value, and it will execute the code you defined. Check to see if the BlackBerry phone is in the list mobile:
is_in_mobile('BlackBerry')
great success.
Parameters and arguments
The parameter is a very important part of the function. Think of it as a temporary variable name you use when you define the function, but that gets replaced when you run the function. In the above example, platform
is the parameter. Functions can have many parameters—just look at the .plot()
function you used in an earlier lesson.
When you run the function, the thing that replaces the parameter is called the argument. In the above example, 'BlackBerry'
is the argument.
Here’s another example of a function in action, this time adding on an else
statement:
def is_in_mobile(platform):
if platform in mobile:
print 'great success.'
else:
print ':('
is_in_mobile('SnackBerry')
:(
return statements
Let's add another layer by writing a function that will allow you to label records as either 'mobile'
or 'desktop'
. To do this, you’ll use return
statements.
A return
statement is simple—it tells the computer "this is the result. Make it available for further use and end the if
statement here." A return
statement is different from a print
statement, because when it executes, return
makes the value available to store as a variable or to use in another function. print
simply makes the value appear on the screen.
The function below takes in a platform
argument and checks if the platform
is in the mobile
list. Note that after each of these if
/else
statements, there’s a return
statement.
If platform is in the mobile
list, it returns "Mobile"
and terminates there. If the platform is't in the mobile
list, the function continues to the next evaluation—whether platform is equal to "Desktop"
—and so forth.
Try it out by first writing a function that accepts the platform
argument:
def filter_desktop_mobile(platform):
if platform in mobile:
return 'Mobile'
elif platform == 'Desktop':
return 'Desktop'
else:
return 'Not Known'
Now try running that function with 'Android'
as the argument. This will effectively replace the word platform
in the above function with 'Android'
and then return the result. In this case, the returned result will be printed because it is the only output from the cell above:
filter_desktop_mobile('Android')
'Mobile'
The real use of return
as opposed to print
is the fact that you can assign the value to a variable name. For example:
what_is_android = filter_desktop_mobile('Android')
print what_is_android
Mobile
Functions best practices
Keep functions simple, do one thing.
Generally, functions should only do one logical thing. This is up to your interpretation, of course, but ask any seasoned programmer or data scientist for their advice (and war stories), and you'll find out that keeping it simple is the key to sanity.
Indent with spaces
As you saw above, the code inside for and if statements is indented. In other languages such a SQL and JavaScript, whitespace only matters for readability. But in Python, tabs and spaces can change what the code means. The notebook will also help automatically indent your code, to the customary 4-space indentation.
Test all the cases (if you can)
You can test your function to make sure it does what you expect. Testing is a big part of analysis, and helps you ensure that your code is working as expected. Go ahead and test some of the possible cases:
print filter_desktop_mobile('iPhone')
print filter_desktop_mobile('Desktop')
print filter_desktop_mobile('Paper Plane')
Mobile
Desktop
Not Known
Success! The function did what was expected, given some likely values.
Applying functions to DataFrames
The .apply()
method allows you to apply a function to a column of a DataFrame. You can put the values of the existing platform
column through the filter_desktop_mobile
function you wrote and get a resulting Series:
data['platform'].apply(filter_desktop_mobile)
0 Desktop
1 Desktop
2 Desktop
3 Desktop
4 Desktop
5 Desktop
6 Desktop
7 Desktop
8 Desktop
9 Mobile
10 Mobile
11 Desktop
12 Mobile
13 Desktop
14 Desktop
15 Mobile
16 Mobile
17 Desktop
18 Desktop
19 Desktop
20 Mobile
21 Desktop
22 Desktop
23 Desktop
24 Desktop
25 Desktop
26 Mobile
27 Desktop
28 Desktop
29 Mobile
...
4970 Desktop
4971 Desktop
4972 Desktop
4973 Desktop
4974 Desktop
4975 Desktop
4976 Desktop
4977 Desktop
4978 Desktop
4979 Mobile
4980 Desktop
4981 Mobile
4982 Desktop
4983 Desktop
4984 Desktop
4985 Desktop
4986 Desktop
4987 Desktop
4988 Desktop
4989 Desktop
4990 Desktop
4991 Mobile
4992 Mobile
4993 Desktop
4994 Desktop
4995 Desktop
4996 Desktop
4997 Mobile
4998 Mobile
4999 Desktop
Name: platform, dtype: object
This series looks as expected—just "Desktop"
and "Mobile"
values. You can store these values in a new column using the following code:
data['platform_type'] = data['platform'].apply(filter_desktop_mobile)
Selecting multiple columns
To select multiple columns, you can pass a list of column names you want to select into the square brackets:
data[['platform','platform_type']][14:18] # rows with different values to make sure it worked
platform | platform_type | |
---|---|---|
14 | Desktop | Desktop |
15 | Android | Mobile |
16 | iPhone | Mobile |
17 | Desktop | Desktop |
Looks like the function worked.
Now count the values and use a bar chart to see how these the platforms stack up:
data['platform_type'].value_counts().plot(kind='bar')
Practice Problem
Store the length of each row's referrer
value in a new
column. Hint: We used a method to measure length in a
previous lesson.
Practice Problem
Create a derived column from referrer_domain
that filters
domain types of 'organization' (for '.org') and 'company' (for '.com'),
labeling any others as 'other'. Then plot a bar chart of their relative
frequency. Hint: Use the in
keyword
creatively.
For extra bonus points, select the records that were not referred from Watsi.org, and plot their relative frequency. Hint: Think about what values are not equal to.
View SolutionLesson summary:
In this lesson, you learned about:
- Derive new columns from existing data
- Write and test functions using
in
,if
,else
, andelif
- Define functions using parameters and arguments
- Apply functions to DataFrames using
.apply()
- Select multiple columns
In the next lesson, you'll learn about grouping data for comparison.
Next Lesson
Pandas .groupby(), Lambda Functions, & Pivot Tables