ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

SQL Joins Using WHERE or ON

Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.

In this lesson we'll cover:

Filtering in the ON clause

Normally, filtering is processed in the WHERE clause once the two tables have already been joined. It's possible, though that you might want to filter one or both of the tables before joining them. For example, you only want to create matches between the tables under certain circumstances.

Using Crunchbase data, let's take another look at the LEFT JOIN example from an earlier lesson (this time we'll add an ORDER BY clause):

SELECT companies.permalink AS companies_permalink,
       companies.name AS companies_name,
       acquisitions.company_permalink AS acquisitions_permalink,
       acquisitions.acquired_at AS acquired_date
  FROM tutorial.crunchbase_companies companies
  LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
    ON companies.permalink = acquisitions.company_permalink
 ORDER BY 1

Compare the following query to the previous one and you will see that everything in the tutorial.crunchbase_acquisitions table was joined on except for the row for which company_permalink is '/company/1000memories':

SELECT companies.permalink AS companies_permalink,
       companies.name AS companies_name,
       acquisitions.company_permalink AS acquisitions_permalink,
       acquisitions.acquired_at AS acquired_date
  FROM tutorial.crunchbase_companies companies
  LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
    ON companies.permalink = acquisitions.company_permalink
   AND acquisitions.company_permalink != '/company/1000memories'
 ORDER BY 1

What's happening above is that the conditional statement AND... is evaluated before the join occurs. You can think of it as a WHERE clause that only applies to one of the tables. You can tell that this is only happening in one of the tables because the 1000memories permalink is still displayed in the column that pulls from the other table:

Filtering in the WHERE clause

If you move the same filter to the WHERE clause, you will notice that the filter happens after the tables are joined. The result is that the 1000memories row is joined onto the original table, but then it is filtered out entirely (in both tables) in the WHERE clause before displaying results.

SELECT companies.permalink AS companies_permalink,
       companies.name AS companies_name,
       acquisitions.company_permalink AS acquisitions_permalink,
       acquisitions.acquired_at AS acquired_date
  FROM tutorial.crunchbase_companies companies
  LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
    ON companies.permalink = acquisitions.company_permalink
 WHERE acquisitions.company_permalink != '/company/1000memories'
    OR acquisitions.company_permalink IS NULL
 ORDER BY 1

You can see that the 1000memories line is not returned (it would have been between the two highlighted lines below). Also note that filtering in the WHERE clause can also filter null values, so we added an extra line to make sure to include the nulls.

Sharpen your SQL skills

For this set of practice problems, we're going to introduce a new dataset: tutorial.crunchbase_investments. This table is also sourced from Crunchbase and contains much of the same information as the tutorial.crunchbase_companies data. It it structured differently, though: it contains one row per investment. There can be multiple investments per company—it's even possible that one investor could invest in the same company multiple times. The column names are pretty self-explanatory. What's important is that company_permalink in the tutorial.crunchbase_investments table maps to permalink in the tutorial.crunchbase_companies table. Keep in mind that some random data has been removed from this table for the sake of this lesson.

It is very likely that you will need to do some exploratory analysis on this table to understand how you might solve the following problems.

Practice Problem

Write a query that shows a company's name, "status" (found in the Companies table), and the number of unique investors in that company. Order by the number of investors from most to fewest. Limit to only companies in the state of New York.

Try it out See the answer

Practice Problem

Write a query that lists investors based on the number of companies in which they are invested. Include a row for companies with no investor, and order from most companies to least.

Try it out See the answer

Next Lesson

SQL FULL OUTER JOIN

Get more from your data

Your team can be up and running in 30 minutes or less.