As an analyst or data scientist, developing a deep understanding of multiple analytical programming languages can create an advantage in today’s data landscape. The prevailing dialogue around a multi-lingual approach, especially with SQL and Python, typically portrays the languages as functionally discrete. But there are many functions that can be accomplished by both SQL and Python.
Exploring the overlapping functionality of two programming languages can help those of us familiar with one become more adept with the other. We can make smarter decisions about how to combine and leverage each, and get better at choosing the right tool for every task.
We’ve written previously about the differences in how Python and SQL execute window functions and group by operations. Here we'll focus on Set Operations, which allow us to combine and compare distinct data sets. These are common tasks in analysis, and understanding how to do them in both SQL or Python will help you choose the best tool for the job.
To see how all the examples mentioned in this post are implemented in practice, check out this example report.
Set Operations
Analytical problems often require the examination of multiple distinct sets of data. In the search for answers, it's often useful to compare or combine distinct sets of data to create one, new set of data.
For example, you may have two separate tables containing transactional data that you'd like to turn into a single transactions table. Or you may want to create a new transactions table that only contains transactions that can be found in both transaction tables.
In SQL, these types of problems can be solved using set operators. Set operators allow you to compare and conditionally concatenate the result sets of two SQL statements. Examples of SQL set operators include:
UNION
: Returns the resulting rows from both query statements being compared. By default,UNION
will not return duplicate rows if identical rows exist in the result sets of both SQL statements. You can supply theUNION
set operator with an optionalALL
keyword if you'd like to return duplicate rows.INTERSECT
: Returns rows that exist in the result sets of both query query statements being compared.EXCEPT
(orMINUS
): Returns rows that exist in one, but not both result sets of the query statements being compared.
Though there is no specific category of Set Operators in Python, there are a variety of functions available in Python that accomplish these same things.
To illustrate how you would use these set operations in practice, consider a scenario where you are a data scientist for a two-sided marketplace. In your marketplace, there are buyers and sellers, but they are not mutually exclusive. Currently, buyers and sellers live in separately in the database under the tables buyers
and sellers
, respectively.
Let's run through a few different scenarios where set operations may come into play:
UNION and UNION ALL
Let's say you wanted to combine all of the users that exist in the buyers
and sellers
tables into a single new "users" table.
In SQL, you can use the UNION
set operator with the optional ALL
keyword to achieve this:
-- SQL
select user_id
from modeanalytics.buyers
union all
select user_id
from modeanalytics.sellers
In Python, the UNION ALL
set operation performed above can be replicated using the pandas .concat()
function.
The pandas .concat()
method concatenates pandas objects (e.g. Dataframes, Series, etc.) along a chosen axis. Assuming you have stored the user_id
fields from the buyers
and sellers
database tables in two pandas dataframe objects (buyers
and sellers
), you could replicate the UNION ALL
set operation performed above using the following Python code:
# Python
users = pd.concat([buyers, sellers])
It's important to note that when using either UNION
with the ALL
keyword in SQL, or the pandas .concat()
method in Python, you will return duplicate rows for users that exist in both tables (i.e. users that are both buyers and sellers). But let's say you wanted to return only a single instance of a user that exists in both tables.
In SQL, remove the ALL
keyword from the UNION
set operator:
-- SQL
select user_id
from modeanalytics.buyers
union
select user_id
from modeanalytics.sellers
In Python, chain the .drop_duplicates()
method on the concatenated pandas object:
# Python
users = pd.concat([buyers, sellers]).drop_duplicates()
INTERSECT
What if instead, you wanted to create a table of unique users that are both buyers and sellers?
In SQL, you can use the INTERSECT
set operator:
-- SQL
select user_id
from modeanalytics.buyers
intersect
select user_id
from modeanalytics.sellers
In Python, you can use the pandas .merge()
method on a dataframe:
# Python
buyers_and_sellers = buyers.merge(sellers)
By default, the pandas .merge()
method will attempt an "inner" merge using all of the columns from both dataframes. This default behavior essentially replicates the behavior of our INTERSECT
SQL set operator.
EXCEPT
Now let's say you want to return all users that are buyers, but not sellers.
In SQL, you can use the EXCEPT
set operator:
-- SQL
select user_id
from modeanalytics.buyers
except
select user_id
from modeanalytics.sellers
It's important to note that SQL's EXCEPT
set operator will return rows that exist in the first table, but not the second. If you instead wanted to return a table containing users who are sellers, but not buyers, you would need to reverse the order of your SELECT
statements:
-- SQL
select user_id
from modeanalytics.sellers
except
select user_id
from modeanalytics.buyers
In Python, you could leverage pandas boolean indexing techniques in tandem with pandas .isin()
indexing to replicate the functionality of the SQL EXCEPT
set operator:
# Python
buyers_not_sellers = buyers[buyers.user_id.isin(sellers.user_id) == False]
sellers_not_buyers = sellers[sellers.user_id.isin(buyers.user_id) == False]`
Blur The Line
At this point, you've seen how to replicate a significant amount of SQL set operator functionality in Python. Ultimately, the choice of programming paradigm is up to you, and may even change on a project-by-project basis. Only by learning about the functional overlap of analytical languages can you make these decisions effectively. But that’s the benefit of being able to use multiple data analysis languages; you can customize a hybrid approach to suit your needs as they evolve.