Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
In this lesson we'll cover:
Up to this point, we've only been working with one table at a time. The real power of SQL, however, comes from working with data from multiple tables at once. If you remember from a previous lesson, the tables you've been working with up to this point are all part of the same schema in a relational database. The term "relational database" refers to the fact that the tables within it "relate" to one another—they contain common identifiers that allow information from multiple tables to be combined easily.
To understand what joins are and why they are helpful, let's think about Twitter.
Twitter has to store a lot of data. Twitter could (hypothetically, of course) store its data in one big table in which each row represents one tweet. There could be one column for the content of each tweet, one for the time of the tweet, one for the person who tweeted it, and so on. It turns out, though, that identifying the person who tweeted is a little tricky. There's a lot to a person's Twitter identity—a username, a bio, followers, followees, and more. Twitter could store all of that data in a table like this:
Let's say, for the sake of argument, that Twitter did structure their data this way. Every time you tweet, Twitter creates a new row in its database, with information about you and the tweet.
But this creates a problem. When you update your bio, Twitter would have to change that information for every one of your tweets in this table. If you've tweeted 5,000 times, that means 5,000 changes. If many people on Twitter are making lots of changes at once, that's a lot of computation to support. Instead, it's much easier for Twitter to store everyone's profile information in a separate table. That way, whenever someone updates their bio, Twitter would only have to change one row of data instead of thousands.
In an organization like this, Twitter now has two tables. The first table—the users table—contains profile information, and has one row per user. The second table—the tweets table—contains tweet information, including the username of the person who sent the tweet. By matching—or joining—that username in the tweets table to the username in the users table, Twitter can still connect profile information to every tweet.
Unfortunately, we can't use Twitter's data in any working examples (for that, we'll have to wait for the NSA's SQL Tutorial), but we can look at a similar problem.
In the previous lesson on conditional logic, we worked with a table of data on college football players—
benn.college_football_players. This table included data on players, including each player's weight and the school that they played for. However, it didn't include much information on the school, such as the conference the school is in—that information is in a separate table,
Let's say we want to figure out which conference has the highest average weight. Given that information is in two separate tables, how do you do that? A join!
SELECT teams.conference AS conference, AVG(players.weight) AS average_weight FROM benn.college_football_players players JOIN benn.college_football_teams teams ON teams.school_name = players.school_name GROUP BY teams.conference ORDER BY AVG(players.weight) DESC
There's a lot of new stuff happening here, so we'll go step-by-step.
When performing joins, it's easiest to give your table names aliases.
benn.college_football_players is pretty long and annoying to type—
players is much easier. You can give a table an alias by adding a space after the table name and typing the intended name of the alias. As with column names, best practice here is to use all lowercase letters and underscores instead of spaces.
Once you've given a table an alias, you can refer to columns in that table in the
SELECT clause using the alias name. For example, the first column selected in the above query is
teams.conference. Because of the alias, this is equivalent to
benn.college_football_teams.conference: we're selecting the
conference column in the
college_football_teams table in
FROM statement, we have two new statements:
JOIN, which is followed by a table name, and
ON, which is followed by a couple column names separated by an equals sign.
ON statement comes after
JOIN, it's a bit easier to explain it first.
ON indicates how the two tables (the one after the
FROM and the one after the
JOIN) relate to each other. You can see in the example above that both tables contain fields called
school_name. Sometimes relational fields are slightly less obvious. For example, you might have a table called
schools with a field called
id, which could be joined against
school_id in any other table. These relationships are sometimes called "mappings."
players.school_name, the two columns that map to one another, are referred to as "foreign keys" or "join keys." Their mapping is written as a conditional statement:
ON teams.school_name = players.school_name
In plain English, this means:
Join all rows from the
playerstable on to rows in the
teamstable for which the
school_namefield in the
playerstable is equal to the
school_namefield in the
What does this actually do? Let's take a look at one row to see what happens. This is the row in the players table for Wake Forest wide receiver Michael Campanaro:
During the join, SQL looks up the
school_name—in this case, "Wake Forest"—in the
school_name field of the
teams table. If there's a match, SQL takes all five columns from the
teams table and joins them to ten columns of the
players table. The new result is a fifteen column table, and the row with Michael Campanaro looks like this:
Two columns are cut off from the image, but you can see the full result here.
When you run a query with a join, SQL performs the same operation as it did above to every row of the table after the
FROM statement. To see the full table returned by the join, try running this query:
SELECT * FROM benn.college_football_players players JOIN benn.college_football_teams teams ON teams.school_name = players.school_name
SELECT * returns all of the columns from both tables, not just from the table after
FROM. If you want to only return columns from one table, you can write
SELECT players.* to return all the columns from the players table.
Once you've generated this new table after the join, you can use the same aggregate functions from a previous lesson. By running an
AVG function on player weights, and grouping by the
conference field from the teams table, you can figure out each conference's average weight.
SQL INNER JOIN