SQL Date Format
Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
If you live in the United States, you're probably used to seeing dates formatted as MM-DD-YYYY or a similar, month-first format. It's an odd convention compared to the rest of the world's standards, but it's not necessarily any worse than DD-MM-YYYY. The problem with both of these formats is that when they are stored as strings, they don't sort in chronological order. For example, here's a date field stored as a string. Because the month is listed first, the
ORDER BY statement doesn't produce a chronological list:
SELECT permalink, founded_at FROM tutorial.crunchbase_companies_clean_date ORDER BY founded_at
You might think that converting these values from
date might solve the problem, but it's actually not quite so simple. Mode (and most relational databases) format dates as YYYY-MM-DD, a format that makes a lot of sense because it will sort in the same order whether it's stored as a date or as a string. Excel is notorious for producing date formats that don't play nicely with other systems, so if you're exporting Excel files to CSV and uploading them to Mode, you may run into this a lot.
Here's an example from the same table, but with a field that has a cleaned date. Note that the cleaned date field is actually stored as a string, but still sorts in chronological order anyway:
SELECT permalink, founded_at, founded_at_clean FROM tutorial.crunchbase_companies_clean_date ORDER BY founded_at_clean
The lesson on data cleaning provides some examples for converting poorly formatted dates into proper date-formatted fields.
Assuming you've got some dates properly stored as a
time data type, you can do some pretty powerful things. Maybe you'd like to calculate a field of dates a week after an existing field. Or maybe you'd like to create a field that indicates how many days apart the values in two other date fields are. These are trivially simple, but it's important to keep in mind that the data type of your results will depend on exactly what you are doing to the dates.
When you perform arithmetic on dates (such as subtracting one date from another), the results are often stored as the
interval data type—a series of integers that represent a period of time. The following query uses date subtraction to determine how long it took companies to be acquired (unacquired companies and those without dates entered were filtered out). Note that because the
companies.founded_at_clean column is stored as a string, it must be cast as a timestamp before it can be subtracted from another timestamp.
SELECT companies.permalink, companies.founded_at_clean, acquisitions.acquired_at_cleaned, acquisitions.acquired_at_cleaned - companies.founded_at_clean::timestamp AS time_to_acquisition FROM tutorial.crunchbase_companies_clean_date companies JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions ON acquisitions.company_permalink = companies.permalink WHERE founded_at_clean IS NOT NULL
In the example above, you can see that the
time_to_acquisition column is an interval, not another date.
You can introduce intervals using the
INTERVAL function as well:
SELECT companies.permalink, companies.founded_at_clean, companies.founded_at_clean::timestamp + INTERVAL '1 week' AS plus_one_week FROM tutorial.crunchbase_companies_clean_date companies WHERE founded_at_clean IS NOT NULL
The interval is defined using plain-English terms like '10 seconds' or '5 months'. Also note that adding or subtracting a
date column and an
interval column results in another
date column as in the above query.
You can add the current time (at the time you run the query) into your code using the
SELECT companies.permalink, companies.founded_at_clean, NOW() - companies.founded_at_clean::timestamp AS founded_time_ago FROM tutorial.crunchbase_companies_clean_date companies WHERE founded_at_clean IS NOT NULL
Data Wrangling with SQL