Deriving New Columns & Defining Python Functions
Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
This lesson builds on the pandas DataFrame data type you learned about in a previous lesson. You’ll learn how to:
- Derive new columns from existing data
- Write and test functions using
- Define functions using parameters and arguments
- Apply functions to
- Select multiple columns
Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
- Log into Mode or create an account.
- Navigate to this report and click Clone. This will take you to the SQL Query Editor, with a query and results pre-populated.
- Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
- The first input cell is automatically populated with
datasets.head(n=5). Run this code so you can see the first five rows of the dataset.
datasets is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the
.head() method to preview the first five rows.
import pandas as pd
data = datasets # assign SQL query results to the data variable data = data.fillna('') # replace missing values with '' as in the previous lesson
|0||https://www.google.com/||2016-02-05 00:48:23||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4...||CHAROLETTE S||Desktop|
|1||https://themeteorchef.com/snippets/making-use-...||2016-02-24 23:12:10||Watsi | The Meteor Chef||https://watsi.org/team/the-meteor-chef||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||WARREN Q||themeteorchef.com||team||Desktop|
|2||https://watsi.org/||2015-12-25 17:59:35||Watsi | Give the gift of health with a Watsi G...||https://watsi.org/gift-cards||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1...||MITCHEL O||watsi.org||gift-cards||Desktop|
|3||2016-02-05 21:19:30||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...||MICHEL O||Desktop|
|4||https://watsi.org/fund-treatments||2016-02-14 19:30:08||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2...||ANDREE N||watsi.org||Desktop|
As you remember from the previous lesson, people used different platforms (iPhone, Windows, OSX, etc) to view pages on Watsi's site. Count the values in the platform column to get an idea of the distribution (for a quick refresher on distributions, check out this lesson:
Desktop 3934 iPhone 515 Android 388 iPad 154 Opera Mini 7 IEMobile 1 BlackBerry 1 Name: platform, dtype: int64
But say that instead, you want to compare Mobile and Desktop, treating all mobile devices as one way of interacting with Watsi’s site. Maybe you have a thesis about how people are more likely to search for Watsi at their desktop computer, but not on their phone. To do this, you need to create a new value for every row with one of two possible values: “Mobile” or “Desktop.” You can do this by creating a derived column based on the values in the
platform column. This new column is what’s known as a derived column because it’s been created using data from one or more existing columns.
To get the feel for this, start by creating a new column that is not derived from another column.
Creating a column is much like creating a new key-value pair in a dictionary. By assigning values to the new column name, you add a column to the DataFrame:
data['new'] = 2 # the value for all rows
data[:3] # let's see that new column
|0||https://www.google.com/||2016-02-05 00:48:23||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4...||CHAROLETTE S||Desktop||2|
|1||https://themeteorchef.com/snippets/making-use-...||2016-02-24 23:12:10||Watsi | The Meteor Chef||https://watsi.org/team/the-meteor-chef||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||WARREN Q||themeteorchef.com||team||Desktop||2|
|2||https://watsi.org/||2015-12-25 17:59:35||Watsi | Give the gift of health with a Watsi G...||https://watsi.org/gift-cards||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1...||MITCHEL O||watsi.org||gift-cards||Desktop||2|
Make sure you scroll all the way to the right to check out the new column you just made.
Just as you saw with dictionaries in the first lesson, assigning values to an existing column will overwrite that column:
data['new'] = 'overwritten' data[:3]
|0||https://www.google.com/||2016-02-05 00:48:23||Watsi | Fund medical treatments for people aro...||https://watsi.org/||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4...||CHAROLETTE S||Desktop||overwritten|
|1||https://themeteorchef.com/snippets/making-use-...||2016-02-24 23:12:10||Watsi | The Meteor Chef||https://watsi.org/team/the-meteor-chef||Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...||WARREN Q||themeteorchef.com||team||Desktop||overwritten|
|2||https://watsi.org/||2015-12-25 17:59:35||Watsi | Give the gift of health with a Watsi G...||https://watsi.org/gift-cards||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1...||MITCHEL O||watsi.org||gift-cards||Desktop||overwritten|
This is a simple example—you’ve just set the value for every row to be the same. In reality, you’ll almost never have use for a column where the values are all the same number. Instead, you’ll use functions to determine the value in each row of your new column.
Say you wanted to compare just two categories—mobile and desktop. You can define mobile platforms in this list of strings:
mobile = ['iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry']
You'll use this list to filter values in the platform column. But first, you’ll need to learn a few tools for comparing values.
The handy Python operator
in allows you to evaluate whether something exists in a list. The evaluation returns a boolean. Here's how you check if "iPad", "Desktop", and "Monty Python" are mobile platforms:
print 'iPad' in mobile print 'Desktop' in mobile print 'Monty Python' in mobile
True False False
This is very similar to the
IN operator in SQL, where you might use:
SELECT * FROM tutorial.watsi_events WHERE platform IN ('iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry');
Python has control statements, or pieces of logic, that will help you create your own functions. For example, you can check if the "Opera Mini" platform is in the mobile list and then print something if it returns a boolean of True.
if 'Opera Mini' in mobile: print 'great success.'
If statements must result in a
False. If the if statement results in
True, as in the above case, it will execute the code after the colon. Otherwise, it does not execute the code after the colon, like this:
if 'The Marriage of Figaro' in mobile: print 'great success.'
'The Marriage of Figaro' is not in the
mobile list, so the above statement evaluates to
False, skips the code indented after the colon, and nothing is printed. This is very similar to how the CASE statement works in SQL.
This little bit of logic opens up a world of possibilities. Using an if statement, you can write a function that decides what to do based on the values you find.
If a value is not found in the
mobile list, you might want to do something else with it. If the
if statement evaluates to false, as the last one did, you might want the function to take a different action.
if 'The Marriage of Figaro' in mobile: print 'great success.' else: print 'that is immobile.'
that is immobile.
The code after
else: will execute when the
if statement returns
False. The keyword
elif, similarly, would evaluate if nothing before it had returned
True. The statement runs from top to bottom, and if a statement evaluates to
True, it executes the code after the colon, and then does not look at any other
else statements in the series. For example:
operas = ['The Marriage of Figaro', 'The Magic Flute', 'La traviata']
if 'The Marriage of Figaro' in mobile: print 'grave success.' elif 'The Marriage of Figaro' in operas: print 'that is a gravely beautiful piece.' else: print 'that is immobile.'
that is a gravely beautiful piece.
As you can see, the
else statement was not executed because the
elif statement evaluated to True and ran the
'that is a gravely beautiful piece.'
Functions are reusable code blocks that you can use to perform a single action. Functions can take in values (called "parameters" or "arguments") and perform logic. For more on the basics of functions, click here.
In the last statement you wrote, you performed logic using the
if statement. Here's how you might rewrite it to take an argument:
def is_in_mobile(platform): #define the "is_in_mobile" function, which accepts an argument called "platform" if platform in mobile: print 'great success.'
Let's break this down:
def— the keyword that signals to Python that you're writing a function
is_in_mobile— the name of your function, which can be anything you like, even
(platform):— the parameter(s)
Now you can give the function a value, and it will execute the code you defined. Check to see if the BlackBerry phone is in the list mobile:
The parameter is a very important part of the function. Think of it as a temporary variable name you use when you define the function, but that gets replaced when you run the function. In the above example,
platform is the parameter. Functions can have many parameters—just look at the
.plot() function you used in an earlier lesson.
When you run the function, the thing that replaces the parameter is called the argument. In the above example,
'BlackBerry' is the argument.
Here’s another example of a function in action, this time adding on an
def is_in_mobile(platform): if platform in mobile: print 'great success.' else: print ':('
Let's add another layer by writing a function that will allow you to label records as either
'desktop'. To do this, you’ll use
return statement is simple—it tells the computer "this is the result. Make it available for further use and end the
if statement here." A
return statement is different from a
return makes the value available to store as a variable or to use in another function.
The function below takes in a
platform argument and checks if the
platform is in the
mobile list. Note that after each of these
else statements, there’s a
If platform is in the
mobile list, it returns
"Mobile" and terminates there. If the platform is't in the
mobile list, the function continues to the next evaluation—whether platform is equal to
"Desktop"—and so forth.
Try it out by first writing a function that accepts the
def filter_desktop_mobile(platform): if platform in mobile: return 'Mobile' elif platform == 'Desktop': return 'Desktop' else: return 'Not Known'
Now try running that function with
'Android' as the argument. This will effectively replace the word
platform in the above function with
'Android' and then return the result. In this case, the returned result will be printed because it is the only output from the cell above:
The real use of
return as opposed to
what_is_android = filter_desktop_mobile('Android') print what_is_android
Keep functions simple, do one thing.
Generally, functions should only do one logical thing. This is up to your interpretation, of course, but ask any seasoned programmer or data scientist for their advice (and war stories), and you'll find out that keeping it simple is the key to sanity.
You can test your function to make sure it does what you expect. Testing is a big part of analysis, and helps you ensure that your code is working as expected. Go ahead and test some of the possible cases:
print filter_desktop_mobile('iPhone') print filter_desktop_mobile('Desktop') print filter_desktop_mobile('Paper Plane')
Mobile Desktop Not Known
Success! The function did what was expected, given some likely values.
.apply() method allows you to apply a function to a column of a DataFrame. You can put the values of the existing
platform column through the
filter_desktop_mobile function you wrote and get a resulting Series:
0 Desktop 1 Desktop 2 Desktop 3 Desktop 4 Desktop 5 Desktop 6 Desktop 7 Desktop 8 Desktop 9 Mobile 10 Mobile 11 Desktop 12 Mobile 13 Desktop 14 Desktop 15 Mobile 16 Mobile 17 Desktop 18 Desktop 19 Desktop 20 Mobile 21 Desktop 22 Desktop 23 Desktop 24 Desktop 25 Desktop 26 Mobile 27 Desktop 28 Desktop 29 Mobile ... 4970 Desktop 4971 Desktop 4972 Desktop 4973 Desktop 4974 Desktop 4975 Desktop 4976 Desktop 4977 Desktop 4978 Desktop 4979 Mobile 4980 Desktop 4981 Mobile 4982 Desktop 4983 Desktop 4984 Desktop 4985 Desktop 4986 Desktop 4987 Desktop 4988 Desktop 4989 Desktop 4990 Desktop 4991 Mobile 4992 Mobile 4993 Desktop 4994 Desktop 4995 Desktop 4996 Desktop 4997 Mobile 4998 Mobile 4999 Desktop Name: platform, dtype: object
This series looks as expected—just
"Mobile" values. You can store these values in a new column using the following code:
data['platform_type'] = data['platform'].apply(filter_desktop_mobile)
To select multiple columns, you can pass a list of column names you want to select into the square brackets:
data[['platform','platform_type']][14:18] # rows with different values to make sure it worked
Looks like the function worked.
Now count the values and use a bar chart to see how these the platforms stack up:
referrervalue in a new column. Hint: We used a method to measure length in a previous lesson.
Create a derived column from
referrer_domain that filters domain types of 'organization' (for '.org') and 'company' (for '.com'), labeling any others as 'other'. Then plot a bar chart of their relative frequency. Hint: Use the
in keyword creatively.
For extra bonus points, select the records that were not referred from Watsi.org, and plot their relative frequency. Hint: Think about what values are not equal to.
In this lesson, you learned about:
- Derive new columns from existing data
- Write and test functions using
- Define functions using parameters and arguments
- Apply functions to DataFrames using
- Select multiple columns
In the next lesson, you'll learn about grouping data for comparison.
Pandas .groupby(), Lambda Functions, & Pivot Tables