mode helix
NEWIntroducing Helix—the first instant, responsive data engine.Learn more

SQL Joins

Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.

Intro to SQL joins: relational concepts

Up to this point, we've only been working with one table at a time. The real power of SQL, however, comes from working with data from multiple tables at once. If you remember from a previous lesson, the tables you've been working with up to this point are all part of the same schema in a relational database. The term "relational database" refers to the fact that the tables within it "relate" to one another—they contain common identifiers that allow information from multiple tables to be combined easily.

To understand what joins are and why they are helpful, let's think about Twitter.

Twitter has to store a lot of data. Twitter could (hypothetically, of course) store its data in one big table in which each row represents one tweet. There could be one column for the content of each tweet, one for the time of the tweet, one for the person who tweeted it, and so on. It turns out, though, that identifying the person who tweeted is a little tricky. There's a lot to a person's Twitter identity—a username, a bio, followers, followees, and more. Twitter could store all of that data in a table like this:

Let's say, for the sake of argument, that Twitter did structure their data this way. Every time you tweet, Twitter creates a new row in its database, with information about you and the tweet.

But this creates a problem. When you update your bio, Twitter would have to change that information for every one of your tweets in this table. If you've tweeted 5,000 times, that means 5,000 changes. If many people on Twitter are making lots of changes at once, that's a lot of computation to support. Instead, it's much easier for Twitter to store everyone's profile information in a separate table. That way, whenever someone updates their bio, Twitter would only have to change one row of data instead of thousands.

In an organization like this, Twitter now has two tables. The first table—the users table—contains profile information, and has one row per user. The second table—the tweets table—contains tweet information, including the username of the person who sent the tweet. By matching—or joining—that username in the tweets table to the username in the users table, Twitter can still connect profile information to every tweet.

The anatomy of a join

Unfortunately, we can't use Twitter's data in any working examples (for that, we'll have to wait for the NSA's SQL Tutorial), but we can look at a similar problem.

In the previous lesson on conditional logic, we worked with a table of data on college football players—benn.college_football_players. This table included data on players, including each player's weight and the school that they played for. However, it didn't include much information on the school, such as the conference the school is in—that information is in a separate table, benn.college_football_teams.

Let's say we want to figure out which conference has the highest average weight. Given that information is in two separate tables, how do you do that? A join!

SELECT teams.conference AS conference,
       AVG(players.weight) AS average_weight
  FROM benn.college_football_players players
  JOIN benn.college_football_teams teams
    ON teams.school_name = players.school_name
 GROUP BY teams.conference
 ORDER BY AVG(players.weight) DESC

There's a lot of new stuff happening here, so we'll go step-by-step.

Aliases in SQL

When performing joins, it's easiest to give your table names aliases. benn.college_football_players is pretty long and annoying to type—players is much easier. You can give a table an alias by adding a space after the table name and typing the intended name of the alias. As with column names, best practice here is to use all lowercase letters and underscores instead of spaces.

Once you've given a table an alias, you can refer to columns in that table in the SELECT clause using the alias name. For example, the first column selected in the above query is teams.conference. Because of the alias, this is equivalent to benn.college_football_teams.conference: we're selecting the conference column in the college_football_teams table in benn's schema.

Write a query that selects the school name, player name, position, and weight for every player in Georgia, ordered by weight (heaviest to lightest). Be sure to make an alias for the table, and to reference all column names in relation to the alias.

JOIN and ON

After the FROM statement, we have two new statements: JOIN, which is followed by a table name, and ON, which is followed by a couple column names separated by an equals sign.

Though the ON statement comes after JOIN, it's a bit easier to explain it first. ON indicates how the two tables (the one after the FROM and the one after the JOIN) relate to each other. You can see in the example above that both tables contain fields called school_name. Sometimes relational fields are slightly less obvious. For example, you might have a table called schools with a field called id, which could be joined against school_id in any other table. These relationships are sometimes called "mappings." teams.school_name and players.school_name, the two columns that map to one another, are referred to as "foreign keys" or "join keys." Their mapping is written as a conditional statement:

ON teams.school_name = players.school_name

In plain English, this means:

Join all rows from the players table on to rows in the teams table for which the school_name field in the players table is equal to the school_name field in the teams table.

What does this actually do? Let's take a look at one row to see what happens. This is the row in the players table for Wake Forest wide receiver Michael Campanaro:

During the join, SQL looks up the school_name—in this case, "Wake Forest"—in the school_name field of the teams table. If there's a match, SQL takes all five columns from the teams table and joins them to ten columns of the players table. The new result is a fifteen column table, and the row with Michael Campanaro looks like this:

Two columns are cut off from the image, but you can see the full result here.

When you run a query with a join, SQL performs the same operation as it did above to every row of the table after the FROM statement. To see the full table returned by the join, try running this query:

SELECT *
  FROM benn.college_football_players players
  JOIN benn.college_football_teams teams
    ON teams.school_name = players.school_name

Note that SELECT * returns all of the columns from both tables, not just from the table after FROM. If you want to only return columns from one table, you can write SELECT players.* to return all the columns from the players table.

Once you've generated this new table after the join, you can use the same aggregate functions from a previous lesson. By running an AVG function on player weights, and grouping by the conference field from the teams table, you can figure out each conference's average weight.

Next Lesson

SQL INNER JOIN

Looks like you've got a thing for cutting-edge data news.

Get the latest.

Get started