SQL Tutorial
Basic SQL
Intermediate SQL
Putting it together
SQL Aggregate Functions
SQL COUNT
SQL SUM
SQL MIN/MAX
SQL AVG
SQL GROUP BY
SQL HAVING
SQL CASE
SQL DISTINCT
SQL Joins
SQL INNER JOIN
SQL Outer Joins
SQL LEFT JOIN
SQL RIGHT JOIN
SQL Joins Using WHERE or ON
SQL FULL OUTER JOIN
SQL UNION
SQL Joins with Comparison Operators
SQL Joins on Multiple Keys
SQL Self Joins
Advanced SQL
SQL Analytics Training
Python Tutorial
Learn Python for business analysis using real-world data. No coding experience necessary.
Start Now
Mode Studio
The Collaborative Data Science Platform
SQL Joins
Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
In this lesson we'll cover:
Intro to SQL joins: relational concepts
Up to this point, we've only been working with one table at a time. The real power of SQL, however, comes from working with data from multiple tables at once. If you remember from a previous lesson, the tables you've been working with up to this point are all part of the same schema in a relational database. The term "relational database" refers to the fact that the tables within it "relate" to one another—they contain common identifiers that allow information from multiple tables to be combined easily.
To understand what joins are and why they are helpful, let's think about Twitter.
Twitter has to store a lot of data. Twitter could (hypothetically, of course) store its data in one big table in which each row represents one tweet. There could be one column for the content of each tweet, one for the time of the tweet, one for the person who tweeted it, and so on. It turns out, though, that identifying the person who tweeted is a little tricky. There's a lot to a person's Twitter identity—a username, a bio, followers, followees, and more. Twitter could store all of that data in a table like this:
Let's say, for the sake of argument, that Twitter did structure their data this way. Every time you tweet, Twitter creates a new row in its database, with information about you and the tweet.
But this creates a problem. When you update your bio, Twitter would have to change that information for every one of your tweets in this table. If you've tweeted 5,000 times, that means 5,000 changes. If many people on Twitter are making lots of changes at once, that's a lot of computation to support. Instead, it's much easier for Twitter to store everyone's profile information in a separate table. That way, whenever someone updates their bio, Twitter would only have to change one row of data instead of thousands.
In an organization like this, Twitter now has two tables. The first table—the users table—contains profile information, and has one row per user. The second table—the tweets table—contains tweet information, including the username of the person who sent the tweet. By matching—or joining—that username in the tweets table to the username in the users table, Twitter can still connect profile information to every tweet.
The anatomy of a join
Unfortunately, we can't use Twitter's data in any working examples (for that, we'll have to wait for the NSA's SQL Tutorial), but we can look at a similar problem.
In the previous lesson on conditional logic, we worked with a table of data on college football players—benn.college_football_players
. This table included data on players, including each player's weight and the school that they played for. However, it didn't include much information on the school, such as the conference the school is in—that information is in a separate table, benn.college_football_teams
.
Let's say we want to figure out which conference has the highest average weight. Given that information is in two separate tables, how do you do that? A join!
SELECT teams.conference AS conference,
AVG(players.weight) AS average_weight
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY teams.conference
ORDER BY AVG(players.weight) DESC
There's a lot of new stuff happening here, so we'll go step-by-step.
Aliases in SQL
When performing joins, it's easiest to give your table names aliases. benn.college_football_players
is pretty long and annoying to type—players
is much easier. You can give a table an alias by adding a space after the table name and typing the intended name of the alias. As with column names, best practice here is to use all lowercase letters and underscores instead of spaces.
Once you've given a table an alias, you can refer to columns in that table in the SELECT
clause using the alias name. For example, the first column selected in the above query is teams.conference
. Because of the alias, this is equivalent to benn.college_football_teams.conference
: we're selecting the conference
column in the college_football_teams
table in benn
's schema.
Practice Problem
Write a query that selects the school name, player name, position, and weight for every player in Georgia, ordered by weight (heaviest to lightest). Be sure to make an alias for the table, and to reference all column names in relation to the alias.
Try it out See the answerJOIN and ON
After the FROM
statement, we have two new statements: JOIN
, which is followed by a table name, and ON
, which is followed by a couple column names separated by an equals sign.
Though the ON
statement comes after JOIN
, it's a bit easier to explain it first. ON
indicates how the two tables (the one after the FROM
and the one after the JOIN
) relate to each other. You can see in the example above that both tables contain fields called school_name
. Sometimes relational fields are slightly less obvious. For example, you might have a table called schools
with a field called id
, which could be joined against school_id
in any other table. These relationships are sometimes called "mappings." teams.school_name
and players.school_name
, the two columns that map to one another, are referred to as "foreign keys" or "join keys." Their mapping is written as a conditional statement:
ON teams.school_name = players.school_name
In plain English, this means:
Join all rows from the
players
table on to rows in theteams
table for which theschool_name
field in theplayers
table is equal to theschool_name
field in theteams
table.
What does this actually do? Let's take a look at one row to see what happens. This is the row in the players table for Wake Forest wide receiver Michael Campanaro:
During the join, SQL looks up the school_name
—in this case, "Wake Forest"—in the school_name
field of the teams
table. If there's a match, SQL takes all five columns from the teams
table and joins them to ten columns of the players
table. The new result is a fifteen column table, and the row with Michael Campanaro looks like this:
Two columns are cut off from the image, but you can see the full result here.
When you run a query with a join, SQL performs the same operation as it did above to every row of the table after the FROM
statement. To see the full table returned by the join, try running this query:
SELECT *
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
Note that SELECT *
returns all of the columns from both tables, not just from the table after FROM
. If you want to only return columns from one table, you can write SELECT players.*
to return all the columns from the players table.
Once you've generated this new table after the join, you can use the same aggregate functions from a previous lesson. By running an AVG
function on player weights, and grouping by the conference
field from the teams table, you can figure out each conference's average weight.
Next Lesson
SQL INNER JOIN