Welcome to the Advanced SQL Tutorial! If you skipped the beginning tutorials, you should take a quick peek at this page to get an idea of how to get the most out of this tutorial. For convenience, here’s the gist:
- Open another window to Mode. Sign up for an account if you don’t have one.
- For each lesson, start by running
SELECT * on the relevant dataset so you get a sense of what the raw data looks like. Do this in that window you just opened to Mode.
- Run all of the code blocks in the lesson in Mode in the other window. You’ll learn more if you really examine the results and understand what the code is doing.
For this lesson, we’ll use the same Crunchbase data from a previous lesson. It was collected on Feb 15, 2014, and large portions of the data were dropped for the sake of this tutorial. In this example, we’ll also use a modified version of this data with date formats cleaned up to work better with SQL.
In previous lessons, you learned that certain functions work on some data types, but not others. For example,
COUNT works with any data type, but
SUM only works for numerical data (if this doesn’t sound familiar, you should revisit this lesson). This is actually more complicated than it appears: in order to use
SUM, the data must appear to be numeric, but it must also be stored in the database in a numeric form.
You might run into this, for example, if you have a column that appears to be entirely numeric, but happens to contain spaces or commas. Yes, it turns out that numeric columns cannot contain commas—If you upload data to Mode with commas in a column full of numbers, Mode will treat that column as non-numeric. Generally, numeric column types in various SQL databases do not support commas or currency symbols. To make things more complicated, SQL databases can store data in many different formats with different levels of precision.
INTEGER data type, for example, only stores whole numbers—no decimals. The
DOUBLE PRECISION data type, on the other hand, can store between 15 and 17 significant decimal digits (almost certainly more than you need unless you’re a physicist). There are a lot of data types, so it doesn’t make sense to list them all here. For the complete list, click here.
Here is the list of exact data types stored in Mode:
||With these rules
||Any characters, with a maximum field length of 1024 characters.
||Stores year, month, day, hour, minute and second values as YYYY-MM-DD hh:mm:ss.
||Numerical, with up to 17 significant digits decimal precision.
||Only TRUE or FALSE values.
“Imported as” refers to the types that is selected in the import flow (see image below), “Stored as” refers to the official SQL data type, and the third column explains the rules associated with the SQL data type.
Changing a column’s data type
It’s certainly best for data to be stored in its optimal format from the beginning, but if it isn’t, you can always change it in your query. It’s particularly common for dates or numbers, for example, to be stored as strings. This becomes problematic when you want to sum a column and you get an error because SQL is reading numbers as strings. When this happens, you can use
CONVERT to change the data type to a numeric one that will allow you to perform the sum.
You can actually achieve this with two different type of syntax. For example,
CAST(column_name AS integer) and
column_name::integer produce the same result.
You could replace
integer with any other data type that would make sense for that column—all values in a given column must fit with the new data types.
Mode Community (the site you’re using to complete this tutorial) performs implicit conversion in certain circumstances, so data types are rarely likely to be problematic. However, if you’re accessing an internal database (your employer’s, for example), you may need to be careful about managing data types for some functions.
Sharpen your SQL skills
founded_at_clean columns in the
tutorial.crunchbase_companies_clean_date table to strings (varchar format) using a different formatting function for each one.