ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Additional Resources

template

Defining Metrics: A Template 
for Working With Stakeholders

Guide

10 Things Modern Business 
Intelligence Should Enable Today

video

Watch a Product Tour of Mode

Get started with Mode

Set Operations in SQL and Python: a Comparison

Image of author
David Wallace, Data Science Evangelist

March 22, 2018

6 minute read

2018-03-22-set-operations-python-sql-og-intersection

As an analyst or data scientist, developing a deep understanding of multiple analytical programming languages can create an advantage in today’s data landscape. The prevailing dialogue around a multi-lingual approach, especially with SQL and Python, typically portrays the languages as functionally discrete. But there are many functions that can be accomplished by both SQL and Python.

Exploring the overlapping functionality of two programming languages can help those of us familiar with one become more adept with the other. We can make smarter decisions about how to combine and leverage each, and get better at choosing the right tool for every task.

We’ve written previously about the differences in how Python and SQL execute window functions and group by operations. Here we'll focus on Set Operations, which allow us to combine and compare distinct data sets. These are common tasks in analysis, and understanding how to do them in both SQL or Python will help you choose the best tool for the job.

To see how all the examples mentioned in this post are implemented in practice, check out this example report.

Set Operations

Analytical problems often require the examination of multiple distinct sets of data. In the search for answers, it's often useful to compare or combine distinct sets of data to create one, new set of data.

For example, you may have two separate tables containing transactional data that you'd like to turn into a single transactions table. Or you may want to create a new transactions table that only contains transactions that can be found in both transaction tables.

In SQL, these types of problems can be solved using set operators. Set operators allow you to compare and conditionally concatenate the result sets of two SQL statements. Examples of SQL set operators include:

  • UNION: Returns the resulting rows from both query statements being compared. By default, UNION will not return duplicate rows if identical rows exist in the result sets of both SQL statements. You can supply the UNION set operator with an optional ALL keyword if you'd like to return duplicate rows.
  • INTERSECT: Returns rows that exist in the result sets of both query query statements being compared.
  • EXCEPT (or MINUS): Returns rows that exist in one, but not both result sets of the query statements being compared.

Though there is no specific category of Set Operators in Python, there are a variety of functions available in Python that accomplish these same things.

To illustrate how you would use these set operations in practice, consider a scenario where you are a data scientist for a two-sided marketplace. In your marketplace, there are buyers and sellers, but they are not mutually exclusive. Currently, buyers and sellers live in separately in the database under the tables buyers and sellers, respectively. Let's run through a few different scenarios where set operations may come into play:

UNION and UNION ALL

Let's say you wanted to combine all of the users that exist in the buyers and sellers tables into a single new "users" table.

In SQL, you can use the UNION set operator with the optional ALL keyword to achieve this:

-- SQL
select user_id
from modeanalytics.buyers
union all
select user_id
from modeanalytics.sellers

In Python, the UNION ALL set operation performed above can be replicated using the pandas .concat() function.

The pandas .concat() method concatenates pandas objects (e.g. Dataframes, Series, etc.) along a chosen axis. Assuming you have stored the user_id fields from the buyers and sellers database tables in two pandas dataframe objects (buyers and sellers), you could replicate the UNION ALL set operation performed above using the following Python code:

# Python
users = pd.concat([buyers, sellers])

It's important to note that when using either UNION with the ALL keyword in SQL, or the pandas .concat() method in Python, you will return duplicate rows for users that exist in both tables (i.e. users that are both buyers and sellers). But let's say you wanted to return only a single instance of a user that exists in both tables.

In SQL, remove the ALL keyword from the UNION set operator:

-- SQL
select user_id
from modeanalytics.buyers
union
select user_id
from modeanalytics.sellers

In Python, chain the .drop_duplicates() method on the concatenated pandas object:

# Python
users = pd.concat([buyers, sellers]).drop_duplicates()

INTERSECT

What if instead, you wanted to create a table of unique users that are both buyers and sellers?

In SQL, you can use the INTERSECT set operator:

-- SQL
select user_id
from modeanalytics.buyers
intersect
select user_id
from modeanalytics.sellers

In Python, you can use the pandas .merge() method on a dataframe:

# Python
buyers_and_sellers = buyers.merge(sellers)

By default, the pandas .merge() method will attempt an "inner" merge using all of the columns from both dataframes. This default behavior essentially replicates the behavior of our INTERSECT SQL set operator.

EXCEPT

Now let's say you want to return all users that are buyers, but not sellers.

In SQL, you can use the EXCEPT set operator:

-- SQL
select user_id
from modeanalytics.buyers
except
select user_id
from modeanalytics.sellers

It's important to note that SQL's EXCEPT set operator will return rows that exist in the first table, but not the second. If you instead wanted to return a table containing users who are sellers, but not buyers, you would need to reverse the order of your SELECT statements:

-- SQL
select user_id
from modeanalytics.sellers
except
select user_id
from modeanalytics.buyers

In Python, you could leverage pandas boolean indexing techniques in tandem with pandas .isin() indexing to replicate the functionality of the SQL EXCEPT set operator:

# Python
buyers_not_sellers = buyers[buyers.user_id.isin(sellers.user_id) == False]
sellers_not_buyers = sellers[sellers.user_id.isin(buyers.user_id) == False]`

Blur The Line

At this point, you've seen how to replicate a significant amount of SQL set operator functionality in Python. Ultimately, the choice of programming paradigm is up to you, and may even change on a project-by-project basis. Only by learning about the functional overlap of analytical languages can you make these decisions effectively. But that’s the benefit of being able to use multiple data analysis languages; you can customize a hybrid approach to suit your needs as they evolve.

Get our weekly data newsletter

Work-related distractions for data enthusiasts.

Additional Resources

template

Defining Metrics: A Template 
for Working With Stakeholders

Guide

10 Things Modern Business 
Intelligence Should Enable Today

video

Watch a Product Tour of Mode

Get started with Mode