Interesting Data Sets

A robust data set is usually the first step toward answering a question. We've collected articles featuring wacky and useful data sets for training machine learning models, practicing an analytical language, or finding compelling insights.

Level 5 Dataset

Lyft open-sourced their autonomous driving dataset from their Level 5 self-driving fleet, including raw sensor camera and LiDAR inputs. - Lyft

Earth Engine Data Catalog

Google Earth's public data archive includes more than forty years of historical imagery and scientific datasets, spanning climate, weather, and night-time light. - Google Developers

Where Does the U.S. Government Keep Its Data?

The U.S. federal government's statistical work doesn't end with the Census Bureau. In fact, there are 13 principal agencies that are key to data collection. Here's a list of all the data they publish (in API form where possible). - Sam Tyner

Election integrity data archive

Twitter published the full dataset of 9 million tweets from Russian troll farms and 1 million tweets from Iranian ones. Unbox your 8TB drive and get crackin'. - Twitter

More Cool Public Datasets and Lots of Ideas for Exploring Them

In the spirit of encouraging data discovery and exploration, here are 5 public datasets, along with some questions you might ask and interesting visualizations you could make for each. - Mode

The Strawberry Capital of the World is the early death capital of the U.S.: lessons from a landmark dataset

The U.S. National Center for Health Statistics has released the most detailed local health data ever. See how your neighborhood stacks up against the national average life expectancy. - Wonkblog

Why We’re Sharing 3 Million Russian Troll Tweets

In concert with two Clemson University professors, FiveThirtyEight has opened up the fullest empirical record to date of the “troll factory” Internet Research Agency's actions on social media. - FiveThirtyEight

Census Oddities

So many analyses are built on data from the U.S. Census and American Community Survey, but those datasets have their own quirks you need to watch out for. - Carto Blog

US House PSCI Social Media Ads

Last Thursday, Democratic members of the House Intelligence Committee released 8.8 gigabytes of information about Facebook ads paid for by Russians attempting to interfere in American politics. The data has since been converted to a CSV, so you can explore it for yourself.

Need a ratings boost? Make a Halloween episode.

This analysis of over 24,000 episode ratings from 184 television shows proves that Halloween TV episodes aren’t just filler. - Kaylin Walker

The Anatomy of a Thousand Typefaces

Say goodbye to endlessly scrolling through the font menu in your word processor. Instead, use this database of typefaces, classified by characteristics like width, spacing, and stroke contrast. - Florian Schulz

9 Elements of Deal-Closing Sales Demos, According to New Data

Forward this one to your sales team. This is yet another good example of a company using its proprietary dataset (in this case, recordings of sales calls) to tell stories and generate interest in its brand.

Quick, Draw! The Data

These doodles are a unique data set that can help developers train new neural networks, help researchers see patterns in how people around the world draw, and help artists create things we haven’t begun to think of. - Google

We’re Sharing A Vast Trove Of Federal Payroll Records

BuzzFeed, via the Freedom of Information Act, got its hands on a dataset comprising four decades of salaries, titles, and demographic details about millions of U.S. government employees, as well as how they moved through the federal bureaucracy. - BuzzFeed

3 Million Instacart Orders, Open Sourced

Instacart has released an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 users. Download the data and dig in. - Engineering at Instacart

Executive Office of the President Open Data Archive Backup

Data downloaded from the White House website on January 20, 2017. - Maxwell Ogden

TrumpWorld Data

BuzzFeed put together a dataset to shed light on Trump’s giant network of businesses, investments, and corporate connections. Right now, it includes more than 1,700 people and organizations. Explore the data yourself via GitHub or Google Sheets. - BuzzFeed


Follow this brand new Twitter account for tons of open, online datasets. - Twitter

The DataRefuge Project

DataRescue events create trustworthy copies of federal climate and environmental data, while the Internet Archive,, and a consortium of major research libraries holds these copies. - PPEH Lab

Academic Torrents

Getting your hands on interesting data can be a chore. Some clever folks at the University of Massachusetts put together a platform for distributing datasets and research papers with BitTorrent technology. - Academic Torrents

20 Weird & Wonderful Datasets for Machine Learning

Getting your hands on a robust dataset is the hardest part of machine learning. Finding interesting datasets is tougher still. From UFO sightings to beautiful Flickr photos, you’re sure to find something to train your model. - Oliver Cameron

San Francisco Housing Construction History

When someone mentions San Francisco’s housing shortage, they usually cite a limited dataset containing San Francisco Chronicle rental listings from 1979-2001. Eric Fischer took it upon himself to collect decades of new information by transcribing Chronicle rental ads from 1948-1979 and Craigslist rental listings from 2001 onward. - Eric Fischer

Zika Data Guide

It’s surprisingly hard to find data on the Zika virus outbreak. That’s why BuzzFeed’s Jeremy Singer-Vine put together a collection of links to Zika datasets for people to contribute to and use for reference. - BuzzFeed

A terrifying and hilarious map of squirrel attacks on the U.S. power grid

Explore this nutty dataset in detail. - Wonkblog

Yahoo News Feed

A collection of 110 billion Yahoo News user actions, and the largest publicly released machine learning dataset to date. - Yahoo Labs