As an analyst or data scientist, developing a deep understanding of multiple analytical programming languages can create an advantage in today’s data landscape. The prevailing dialogue around a multi-lingual approach, especially with SQL and Python, typically portrays the languages as functionally discrete. But there are many functions that can be accomplished by both SQL and Python.
Exploring the overlapping functionality of two programming languages can help those of us familiar with one become more adept with the other. We can make smarter decisions about how to combine and leverage each, and get better at choosing the right tool for every task.
We’ve written previously about the differences in how Python and SQL execute window functions and group by operations. Here we'll focus on Set Operations, which allow us to combine and compare distinct data sets. These are common tasks in analysis, and understanding how to do them in both SQL or Python will help you choose the best tool for the job.
To see how all the examples mentioned in this post are implemented in practice, check out this example report.
Analytical problems often require the examination of multiple distinct sets of data. In the search for answers, it's often useful to compare or combine distinct sets of data to create one, new set of data.
For example, you may have two separate tables containing transactional data that you'd like to turn into a single transactions table. Or you may want to create a new transactions table that only contains transactions that can be found in both transaction tables.
In SQL, these types of problems can be solved using set operators. Set operators allow you to compare and conditionally concatenate the result sets of two SQL statements. Examples of SQL set operators include:
UNION: Returns the resulting rows from both query statements being compared. By default,
UNIONwill not return duplicate rows if identical rows exist in the result sets of both SQL statements. You can supply the
UNIONset operator with an optional
ALLkeyword if you'd like to return duplicate rows.
INTERSECT: Returns rows that exist in the result sets of both query query statements being compared.
MINUS): Returns rows that exist in one, but not both result sets of the query statements being compared.
Though there is no specific category of Set Operators in Python, there are a variety of functions available in Python that accomplish these same things.
To illustrate how you would use these set operations in practice, consider a scenario where you are a data scientist for a two-sided marketplace. In your marketplace, there are buyers and sellers, but they are not mutually exclusive. Currently, buyers and sellers live in separately in the database under the tables
Let's run through a few different scenarios where set operations may come into play:
UNION and UNION ALL
Let's say you wanted to combine all of the users that exist in the
sellers tables into a single new "users" table.
In SQL, you can use the
UNION set operator with the optional
ALL keyword to achieve this:
-- SQL select user_id from modeanalytics.buyers union all select user_id from modeanalytics.sellers
In Python, the
UNION ALL set operation performed above can be replicated using the pandas
.concat() method concatenates pandas objects (e.g. Dataframes, Series, etc.) along a chosen axis. Assuming you have stored the
user_id fields from the
sellers database tables in two pandas dataframe objects (
sellers), you could replicate the
UNION ALL set operation performed above using the following Python code:
# Python users = pd.concat([buyers, sellers])
It's important to note that when using either
UNION with the
ALL keyword in SQL, or the pandas
.concat() method in Python, you will return duplicate rows for users that exist in both tables (i.e. users that are both buyers and sellers). But let's say you wanted to return only a single instance of a user that exists in both tables.
In SQL, remove the
ALL keyword from the
UNION set operator:
-- SQL select user_id from modeanalytics.buyers union select user_id from modeanalytics.sellers
In Python, chain the
.drop_duplicates() method on the concatenated pandas object:
# Python users = pd.concat([buyers, sellers]).drop_duplicates()
What if instead, you wanted to create a table of unique users that are both buyers and sellers?
In SQL, you can use the
INTERSECT set operator:
-- SQL select user_id from modeanalytics.buyers intersect select user_id from modeanalytics.sellers
In Python, you can use the pandas
.merge() method on a dataframe:
# Python buyers_and_sellers = buyers.merge(sellers)
By default, the pandas
.merge() method will attempt an "inner" merge using all of the columns from both dataframes. This default behavior essentially replicates the behavior of our
INTERSECT SQL set operator.
Now let's say you want to return all users that are buyers, but not sellers.
In SQL, you can use the
EXCEPT set operator:
-- SQL select user_id from modeanalytics.buyers except select user_id from modeanalytics.sellers
It's important to note that SQL's
EXCEPT set operator will return rows that exist in the first table, but not the second. If you instead wanted to return a table containing users who are sellers, but not buyers, you would need to reverse the order of your
-- SQL select user_id from modeanalytics.sellers except select user_id from modeanalytics.buyers
# Python buyers_not_sellers = buyers[buyers.user_id.isin(sellers.user_id) == False] sellers_not_buyers = sellers[sellers.user_id.isin(buyers.user_id) == False]`
Blur The Line
At this point, you've seen how to replicate a significant amount of SQL set operator functionality in Python. Ultimately, the choice of programming paradigm is up to you, and may even change on a project-by-project basis. Only by learning about the functional overlap of analytical languages can you make these decisions effectively. But that’s the benefit of being able to use multiple data analysis languages; you can customize a hybrid approach to suit your needs as they evolve.