# Learning Data Science

Many people have landed jobs as data scientists without any formal training because the internet is abundant in free resources for learning data science. This section includes tutorials for analytical languages such as SQL, Python, and R, career advice, and how-to posts about performing common tasks like A/B testing and

## The Mind at Work: Guido van Rossum on How Python Makes Thinking in Code Easier

A wonderful read for any Pythonista.
- *Work in Progress*

## Emails from R: Blastula 0.3

This new R package makes it easy for you to send coworkers emails showcasing beautiful plots (and emoji subject lines!).
- *R Studio Blog*

## Calculating New and Returning Customers in R

This step-by-step tutorial sets you up with a simple and clean calculation for new and returning customers.
- *Towards Data Science*

## A New palette() for R

R got a glow up! Here’s how to take the new color palette for a spin.
- *R Developer Blog*

## How to Read and Write Data Files in Python

Bookmark this and save yourself a Google search.
- *End-to-End Machine Learning Library*

## The Problem with “Biased Data”

If you asked one hundred people what “biased data” means to them, you might just get back one hundred different answers. To make progress, we need an agreed-upon language and framework for understanding bias in machine learning.
- *Harini Suresh*

## Data Science Foundations: Know Your Data. Really, Really, Know It

Really knowing your data means more than just understanding the data layout or organization. You need to go all the way down to get a look at how the data is collected and generated, too.
- *Towards Data Science*

## Character Encodings — The Pain That Won’t Go Away

Go deep with this series on how character encoding quirks can thwart your analyses.
- *Better Programming*

## Data Science Archetypes

Are you a Generalist? A Detective? An Oracle? A Maker?
- *End-to-End Machine Learning Library*

## We’ll Do It Live: Updating Machine Learning Models on Flask/uWSGI with No Downtime

It’s trickier than you may think. This tutorial with code examples will walk you through the nitty gritty.
- *WW Tech Blog*

## Questions to Ask About Your Data

Print this comic out and keep it handy for anytime you’re doing exploratory analysis.
- *Julia Evans*

## What Data Patterns Can Lie Behind a Correlation Coefficient?

To interpret a correlation coefficient, you're gonna need the corresponding scatterplot.
- *Jan Vanhove*
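
As a rough illustration of the point (not taken from the linked post), the sketch below computes Pearson's r for two series from Anscombe's quartet, a classic textbook dataset. The `pearson_r` helper is a name made up for this example. The two series give nearly the same coefficient even though one is roughly linear and the other follows a curve, which is something only the scatterplot shows.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Anscombe's quartet, sets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(round(r_linear, 2), round(r_curved, 2))
```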

## List of Time Series Databases

If you’re building a product to support many large-scale time-series users, you’ll need to shop around for the right database. This list will get you started, with open-source and proprietary options.
- *Misframe*

## Exploring Your Data With Just 1 Line of Python

Short, sweet, and to the point.
- *Towards Data Science*

## almanac

This new package allows R users to do things like construct a business calendar, which they can then use to shift dates forward and skip over weekends and holidays.
- *Davis Vaughan*
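
almanac itself is an R package, so the following is only a Python sketch of the underlying idea, not almanac's API. The `next_business_day` function and the holiday set are invented for illustration: a date that lands on a weekend or a listed holiday is pushed forward to the next working day.

```python
from datetime import date, timedelta

def next_business_day(day, holidays=frozenset()):
    """Return `day` if it is a business day, else the next one after it."""
    while day.weekday() >= 5 or day in holidays:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day

# Hypothetical holiday calendar with a single Monday holiday.
holidays = {date(2020, 5, 25)}

# Saturday 2020-05-23 skips the weekend and the Monday holiday.
shifted = next_business_day(date(2020, 5, 23), holidays)
print(shifted)
```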

## How Much Have You Spent on Amazon? Analyzing Amazon Data

The Director of Data Science at HelioCampus gives this tutorial a hearty recommendation: “This is the kind of project I mean when I talk about project-driven learning: it has relevance to you, so there's motivation to find out the answers beyond just learning a technical skill. And plenty of variations of Qs to ask.”
- *Dataquest*

## How to Spot Red Flags in a Data Science Job Opportunity

What signs should you look for to detect work-life balance problems, a lack of data science understanding, or a wimpy manager?
- *Towards Data Science*

## Data Integrity in Survey Collection

This thread recounts how one research study was infiltrated by bots (it happens more often than you might think!) and suggests tips for ensuring better data quality in online surveys.
- *Melissa Simone*

## janitor

janitor has simple functions for examining and cleaning dirty data, so you can work faster and save your thinking for the fun stuff. Built with beginning and intermediate R users in mind!
- *Sam Firke*

## loadtest: an R Package for Load Testing

“As APIs become more accessible to the data science community, so should engineering best practices around those APIs. However, most load testing tools are crafted for engineers or testing specialists–so we fixed that.”
- *T-Mobile Tech*

## How Much Do Data Scientists Make?

Who knew Walmart and Ancestry.com paid so well?
- *Towards Data Science*

## Where to Learn Statistics

Pick a handful of these resources to try out, and get started!
- *End-to-End Machine Learning*

## Best Practices for Analyzing Large-scale Health Data From Wearables and Smartphone Apps

If you’re working with health data, you need to be mindful of the privacy, selection bias, and policy implications.
- *Nature*

## Teacups, Giraffes, and Statistics

Whether you’re starting from scratch or want to deepen your familiarity with statistics, this tutorial is worth checking out for the playful approach and delightful illustrations.
- *Teacups, Giraffes, and Statistics*

## Mastering Shiny

If you’ve been wanting to try out Shiny, a framework for creating web applications using R code, here’s your chance! From academia to big pharma to Silicon Valley, Shiny is now used in almost as many niches and industries as R itself.
- *Hadley Wickham*

## Reproducible Data Workflows With Drake

Learn how to use drake, an R package that provides a powerful, flexible workflow management tool for reproducible data analysis pipelines.
- *Garrick Aden-Buie*

## Pandas Tricks

New tips every weekday morning that will help you to work faster, write better code, and impress your friends!
- *Kevin Markham*

## Introducing the Funneljoin Package

Do you work with data consisting of events with their time and associated user? Often find yourself asking “first this then that” questions? You probably have a problem funneljoin can help with.
- *Hooked on Data*

## Tidylo: Tidy log odds ratio weighted by uninformative prior

Use this R package in your everyday workflow when you want to compare how the frequency of some feature differs across some set or group.
- *Julia Silge*

## Data Helpers

Check out this list of data professionals who have volunteered to answer questions, promote, or mentor newcomers in data science, engineering, and analysis.
- *Angela Bassa*

## Practical Psychology for Data Scientists

How to recognize and sidestep eight common cognitive biases.
- *Towards Data Science*

## Why You Swipe Right

Using two months of swipes from his Tinder profile, one man evaluated his dating preferences. You might feel a bit voyeuristic, but this full series is worth a read for its examination of race and online dating.
- *Ajay Sharma*

## Crushed It! Landing a Data Science Job

How to crack your first (or fourth) round of data science interviews.
- *Erin Shellman*

## Python is Weird (an Unabashedly Biased Intro to Python for R Users)

Most Pandas tutorials start with a solid assumption that you know Python and you’re completely devoted to the religious tenet of being Pythonic. This one doesn’t, and it’ll help you wrap your head around this new way of thinking.
- *Eric R. Scott*

## Instagram Data Analysis Using Panoply and Mode

A thorough walkthrough, from connecting a database through the final visualizations.
- *Towards Data Science*

## Type Stable Estimation

This paper argues that code objects in statistical software should match up with the actual mathematical objects involved in formal data modeling.
- *Aleatoric*

## Build Your Career in Data Science

While there are lots of good blog posts on individual topics, there really isn't one place people can go to get a better understanding of a data science career. Until this book!
- *Manning Publications*

## Lyft Data Scientist Shares Five Pieces of Career Advice

A nice quick read that covers topics like starting a new role, stakeholder management, and building a consultancy.
- *Towards Data Science*

## Matrices as Tensor Network Diagrams

This framework is a great way to wrap your head around matrices (and it makes proofs cleaner and simpler!).
- *Math3ma*

## Find Your Slow with profvis

profvis identifies problem patches in your R code that are slowing everything down, shaving satisfying seconds off the running time of your project.
- *Megan Stodel*

## Javascript Statistics Snippets

Ever wanted to do something simple—like generate random values from a distribution—without importing a whole new library? This repo's got your back.
- *Nick Strayer*

## gpt-2-simple

This Python package allows you to easily retrain OpenAI's GPT-2 text-generating model on new texts, like Buzzfeed article titles.
- *Max Woolf*

## Follow-up: I Found Two Identical Packs of Skittles, Among 468 Packs With a Total of 27,740 Skittles

Analyzing packs of Skittles (or sometimes M&Ms) seems to be a very common exercise in introductory statistics. But what's the likelihood of identifying two identical packs of Skittles? And how does that likelihood stack up against reality?
- *Possibly Wrong*

## How to Filter in R: A Detailed Introduction to the dplyr Filter Function

There are many ways to filter in R. Consider dplyr filter for its user-friendly syntax, how easy it is to work with, and how nicely it plays with the other dplyr functions.
- *Michael Toth*

## Why software projects take longer than you think – a statistical model

“How much time do you need?” has a whole new meaning now.
- *Erik Bernhardsson*

## A Simple Approach To Templated SQL Queries In Python

As Instacart's former VP of Data Science put it: “Almost every data product I’ve built has had parameterized SQL in it, and this is a great guide for how to do it well!”
- *Towards Data Science*
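
One common way to parameterize SQL from Python is to pass values through the database driver's placeholders instead of formatting them into the query string. The sketch below uses the standard library's `sqlite3` module with a named placeholder. It is not necessarily the approach the linked guide takes, and the `orders` table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 5.0), ("ana", 15.0)],
)

# :min_amount is filled in by the driver, not by string formatting.
QUERY = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= :min_amount
    GROUP BY customer
    ORDER BY customer
"""

rows = conn.execute(QUERY, {"min_amount": 10.0}).fetchall()
print(rows)
conn.close()
```

The same query text can then be reused with different parameter values, which keeps the SQL readable and avoids quoting mistakes.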

## Excel Error, but Could Happen in Any Tool

JD Long sums up this post perfectly: it “walks through an analysis where the results depend on how null values are handled. Good reminder to 1) understand if data has nulls 2) be thoughtful about handling nulls 3) compare groups with/without nulls.”
- *Junk Charts*
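
To make the point concrete, here is a small sketch with made-up numbers (not the data from the linked post). The same column yields different averages depending on whether missing values are dropped or counted as zero.

```python
from statistics import mean

# Hypothetical survey scores; None marks a missing response.
scores = [4, None, 5, None, 3]

# Option 1: exclude nulls from the calculation.
observed = [s for s in scores if s is not None]
mean_dropping_nulls = mean(observed)

# Option 2: treat nulls as zero.
zero_filled = [s if s is not None else 0 for s in scores]
mean_nulls_as_zero = mean(zero_filled)

print(mean_dropping_nulls, mean_nulls_as_zero)
```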

## Escaping Excel Hell with Python and Pandas

Tune into this fun discussion about introducing the Excel users in your life to Python.
- *TalkPython*

## 10 things R can do that might surprise you

R is building on its solid data analysis foundations and is rapidly becoming an all-purpose connective language for data science.
- *Simply Statistics*

## Advice for New Data Scientists

While this post is intended primarily for data scientists embedded in product teams, many of the tips can be generalized to any new hire in a tech role.
- *Airbnb Engineering & Data Science*

## Using Deep Learning to “Read Your Thoughts” — With Keras and EEG

Saying a word in one’s mind, even if not spoken aloud, can result in the firing of the nerves controlling the muscles involved in speech. With some readily available equipment, you can train a model to classify these sub-vocalized words in less than a day.
- *Justin Alvey*

## Tidy Tuesday Screencast: Tidying and Analyzing US PhDs in R

If you spend a lot of time importing Excel spreadsheets, don't miss this episode: it focuses on the process of importing, cleaning, and tidying messy data.
- *David Robinson*

## Journey to Data Science

Need a dose of inspiration? This thread of folks who recently became data scientists will give you the warm fuzzies.
- *Twitter*

## SQL: One of the Most Valuable Skills

SQL is permanent. SQL is flexible. SQL can be your super power.
- *Craig Kerstiens*

## The Ultimate List of Data Science Podcasts

When they say “ultimate,” they really mean it. Fire one of these up the next time you're exercising at the gym, commuting to work, or doing chores.
- *Real Python*

## Minimally Sufficient Pandas

Limiting Pandas to a small subset can keep your focus on the actual data analysis and not on the syntax. This detailed guide offers a single approach to completing a variety of common data analysis tasks.
- *Dunder Data*

## Learning From Eight Years of Data Science Mistakes

This talk covers mistakes made during analyses (including communication when delivering results), team and infrastructure mistakes, plus some advice for incoming data scientists.
- *rstudio::conf 2019*

## Data science curriculum roadmap

This set of topic recommendations is a good starting point for data-centric academic programs looking to revise their curriculum or start a completely new one.
- *Brandon Rohrer*

## Rstudio::conf 2019: lessons learned

Couldn't make Rstudio::conf? Get caught up on what you missed with this summary of five major themes from the talks there.
- *Brooke Watson*

## Statistics: P values are just the tip of the iceberg

Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one.
- *Nature*

## Solving the Model Representation Problem With broom

An introduction to the broom package, which aims to create a framework for representing statistical models, estimation methods, and fits with R objects.
- *Alex Hayes*

## Going Off the Map: Exploring purrr’s Other Functions

Learn how to use some of purrr’s lesser known functions to write cleaner and more concise code.
- *Hooked on Data*

## Preparing for a Tech Talk, Part 1: Motivation

This series covers the process of preparing for a tech talk, from conceiving the idea to the actual day of the presentation. Up first: why and how to pick a topic.
- *Overreacted*

## How Do I? …

The ultimate reference material for R folks: a searchable table of 190+ R-stats tasks with code snippets.
- *Sharon Machlis*

## Selection Effects

Why do you often feel like you’re in the slower of two lanes during rush hour? Or like the bus is taking forever to get here? These scenarios (and many others!) can be explained by selection bias.
- *Carl T. Bergstrom*

## The ‘knight on an infinite chessboard’ puzzle: efficient simulation in R

A nice walkthrough of solving a chess conundrum, with an eye on keeping the simulation fast and interpretable.
- *Variance Explained*

## How to Develop the Five Soft Skills That Will Make You a Great Analyst

Soft skills tend to be more difficult to learn than hard skills, which is exactly why we all need to work on them. Here's a framework for assessing yourself and improving those skills.
- *Mode*

## Level up from `cron` to Airflow with R on Your Macbook

You can use Airflow in the same way you might use cron to schedule and execute jobs. Here's how to get it up and running.
- *Cerebral Mastication*

## Real-time Process for Completing a Task in R

“I'm sitting down to start a task in R. I don't entirely know how to complete it. I'm going to try to document my process in this thread in real time.” This thread is really enlightening (and a relief for anyone who feels like half their job is Googling).
- *We are R-Ladies*

## The Lesser Known Stars of the Tidyverse

A walkthrough of how to use some more obscure R packages and functions in exploratory analysis.
- *Hooked on Data*

## Battling the Bots

An interesting profile of a musicologist who turned his background identifying fraud in music composition into a job examining how online propaganda campaigns work.
- *Foreign Policy*

## Tidy Tuesday

Through this weekly data project, members of the RStats community get a new dataset on which to practice their wrangling and data visualization skills. Catch up on what's been done so far (https://twitter.com/hashtag/tidytuesday?src=hash) or get ready for tomorrow's dataset.
- *R for Data Science*

## Tidyeval

Trying to wrap your head around non-standard evaluation? Check out this tutorial for tidy evaluation in R.
- *Ian Lyttle*

## Some Important Data Science Tools that aren’t Python, R, SQL or Math

Data scientists don’t exist in a vacuum. Here are some of the tools you'll need to be capable of building production-ready applications.
- *Towards Data Science*

## How Becoming Not a Data Scientist Made Me a Better Data Scientist

Working as a software engineer helped Joel Grus understand how to write better code... as a data scientist.
- *Joel Grus*

## The hacker's guide to uncertainty estimates

Estimating uncertainty is easier said than done. This post covers a whole arsenal of tricks, including confidence intervals, Monte Carlo methods, and inverse Hessians.
- *Erik Bernhardsson*
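
For a flavor of what a resampling-based estimate can look like, here is a minimal percentile-bootstrap sketch for the uncertainty of a mean. It is an illustration under stated assumptions rather than code from the linked post: the `bootstrap_ci` name, the sample values, and the choice of 2,000 resamples are all invented for this example.

```python
import random
from statistics import mean

def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = sorted(
        mean(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical measurements.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0]
lower, upper = bootstrap_ci(sample)
print(lower, mean(sample), upper)
```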

## Scipy Lecture Notes

Sebastian Raschka surfaced this well-maintained hub of knowledge with a ringing endorsement: “I think this is really an under-appreciated resource. Probably the most comprehensive guide out there, it's free, and constantly updated!”
- *Scipy Lecture Notes*

## Chromebook Data Science

These free MOOCs exist so anyone with the ability to read, write, and do basic math can get into data science using nothing but a web browser and an internet connection.
- *Simply Statistics*

## JOINs in SQL, Python, and R

Though SQL has long been the industry standard for accessing relational data, nowadays, it’s more and more common to do this same work in a scripting language like Python or R. Here's how.
- *Mode*

## Strata speaker slides & videos

Catch up on all the presentations from the Strata Data Conference in NYC last week.
- *O'Reilly Conferences*

## Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity

The New York Times anonymous op-ed has spurred a bevy of data scientists to try to uncover the author using natural language processing. Here’s one attempt that serves as a nice R tutorial to boot.
- *Variance Explained*

## What You Need to Know Before Considering a PhD

This post suggests pondering the practical experience offered by industry jobs and the disproportionately high rate of depression amongst graduate students before you apply for a PhD program in machine learning or data science.
- *fast.ai*

## 4 Things You Should Stop Doing in SQL and Start Doing in Python

Each language has its strengths and we’ve often pondered the distinctions, but there are some actions in SQL that are simply more efficient in Python.
- *Mode*

## What Data Scientists Really Do, According to 35 Data Scientists

Here are the common themes that have emerged from speaking with data scientists both in and outside tech.
- *Harvard Business Review*

## Get More From Your Salesforce Data: 4 SQL Queries to Write First

This post walks through a sample report replicating common Salesforce CRM reporting in SQL, so you can more easily audit, adjust, and extend that analysis.
- *Mode*

## Guidelines For A/B Testing

There are many ways A/B Testing can go wrong, but most of them won’t be obvious. Here are 12 guidelines that will help you guard against some common mistakes and set you up for success.
- *Hooked on Data*

## The Podcast of Small Differences

The first episode of this new “data science flavored” podcast covers what the two hosts—of physics and economics backgrounds—wished they knew on day one of their first data science jobs.
- *Otis Anderson & Ian Blumenfeld*

## Knowing Your Blindspot as an Analyst

“I thought that being the one with access to data made me the arbiter of truth, and that I was right by default when talking with someone who wasn’t using quantitative information to back up their ideas. I was wrong.”
- *Mode*

## Partitioning the Variation in Data

“Why do things vary?” is one of the fundamental questions you can ask during any exploratory data analysis. Here's how to gauge if the variation you're witnessing is fixed or random.
- *Simply Statistics*

## Analyzing IMDb Data The Intended Way, with R and ggplot2

IMDb has made their official dataset more accessible to analyze just for fun. Check out the data with this step-by-step tutorial, chock-full of code examples.
- *Max Woolf*

## Speed up your R Work

A tutorial on how to speed up work in R by partitioning data and process-level parallelization with rqdatatable, data.table, and dplyr.
- *Win-Vector Blog*

## Red Flags In Data Science Interviews

Companies will never straight up tell you they are bad to work for. Here are 12 warning signs to look out for when you’re interviewing at a company with multiple data scientists or analysts.
- *Hooked on Data*

## Add Constrained Optimization To Your Toolbelt

Stitch Fix shows how they use constrained optimization to get work to stylists and warehouses in a manner that’s fair and efficient, without cutting corners on client experience. For those fluent in Python: you should be able to model your own business problem by the end of this post.
- *Multithreaded*

## A year as a Data Scientist right after college: An honest review

A fresh-faced data scientist shares his experience in the workforce—what lived up to his expectations, and what didn’t.
- *Towards Data Science*

## Advice For Applying To Data Science Jobs

This is the most thorough, well-organized post on the data science job application process we’ve ever seen. Bookmark it immediately.
- *Hooked on Data*

## Trustworthy Data Analysis

The manner in which you present the results of an analysis is part of the analysis and plays a large role in determining whether people trust your work or not.
- *Simply Statistics*

## Rethinking Academic Data Sharing

“There is a reasonable debate going on regarding whether companies should be able to share [personal] data and for what purposes. Academics have to realize that they are also part of this debate and that any decisions made in that domain will likely affect them.”
- *Simply Statistics*

## UTC is Enough for Everyone, Right?

Although this post is aimed at programmers, there's a lot in here for analysts as well, especially with regard to how to properly store time in databases.
- *Zach Holman*

## Seven Strategies for Optimizing Numerical Code

Some advice on how to use reporting as a means to create strong stakeholder relationships in your organization.
- *Locally Optimistic*

## 7 R Data Science Influencers to Follow

Whether you’re new to the R community, or you’re already an active package-creator or analyst, listening in on R Twitter conversations is a great way to stay up to speed.
- *Mode*

## Exploring The Structure and Dependencies of An R Package

pkgnet is an R library designed for the analysis of… R libraries! With a graph representation of a package and its dependencies, you can prioritize functions to unit test and examine recursive dependencies you take on by using a given package.
- *UptakeOpenSource*

## A Shiny App to Visualize and Share My Dogs’ Medical History

What’s a digital nomad and R-user to do when she needs to share her dogs’ medical records with multiple vets? Build a Shiny app, of course! Here’s how.
- *Jenna Allen*

## One Analyst’s Guide for going from Good to Great

Junior analysts, hearken! This guide is perfect for breaking through your skill plateau.
- *Fishtown Analytics*

## Lumpers and Splitters: Tensions in Taxonomies

“As data scientists tasked with segmenting clients and products, we find ourselves in the same boat with species taxonomists, straddling the line between lumping individuals into broad groups and splitting into small segments.”
- *MultiThreaded*

## 5 Data Scientists on Making the Leap from Academia to Industry

We asked five leaders who transitioned from research backgrounds to data science jobs to share their thoughts on the process, from how they landed their first job to things they wished they'd known.
- *Mode*

## Stats 337: Readings in Applied Data Science

An excellent reading list!
- *Hadley Wickham*

## Set Operations in SQL and Python: a Comparison

Set operations take center stage in the latest installment of our Bridge the Gap series. Learn how to compare and combine data sets in both SQL and Python, so you can choose the best tool for the job.
- *Mode*

## Data-driven unit testing for data scientists and quant developers alike

The key to good unit testing is paying attention to the data in your tests and focusing on testing the most important parts of your model or system. These guidelines will help you streamline your unit testing and avoid ambiguous results.
- *Cartesian Faith*

## SQL for Data Analysis

By the end of this free and beginner-friendly course, you’ll be able to write efficient SQL queries to successfully handle a variety of data analysis tasks.
- *Udacity*

## How to rewrite your SQL queries in Pandas, and more

A phrasebook that you'll come back to time and time again. Bookmark it!
- *codeburst.io*

## Semantics of timezone-aware datetime arithmetic

One reason why you can't "just use UTC" all the time is that you often need "wall time" semantics—the relationship between two times as displayed by the clock on the wall, regardless of the absolute elapsed duration between them. Here's how to deal with that in Python.
- *Paul Ganssle*

## Conversations with Future Data Scientists

Ryan Swanstrom put together a YouTube playlist of his answers to questions from aspiring data scientists like “How do I transition to data science?” or “Why should I start a data science project?” Most of these videos are under the 2-minute mark.
- *Data Science 101*

## Resources for Data Science Job Seekers

Here’s what you need to help you nail the job hunt and land a role you’ll love.
- *Mode*

## Data Science at the Command Line

Clear your weekend. O’Reilly has put this hands-on-guide online, for free!
- *O’Reilly Media*

## Bridge the Gap: Window Functions in Python and SQL

When we understand how Python and SQL overlap, we can make smarter decisions about which to use and when. Our new Bridge the Gap series explores just that, starting with a tool that most of us use every day.
- *Mode*

## My Journey Into Data Science and Bio-Informatics — Part 1: Programming

One year ago, the author of this post had never executed a single line of code. Today, he works on a team trying to understand the underlying genetic alterations of neuroblastoma, a devastating tumor that affects young children. These are the resources and courses he used to get there.
- *O’Reilly Media*

## Introducing DataFramed, a Data Science Podcast

Here’s something new for your ears. This podcast promises to explore what modern data science looks like in practice via in-depth conversations with practitioners.
- *DataCamp*

## Imposter Syndrome in Data Science

“I’ve accepted that I will never be able to learn everything there is to know in data science — I will never know every algorithm, every technology, every cool package, or even every language — and that’s okay.”
- *Caitlin Hudon*

## Myths and mistakes of PyCon proposals

Here are some tips for getting your proposal accepted from a bona fide member of the PyCon Program Committee.
- *Irina Truong*

## Why old-school PostgreSQL is so hip again

How did a 21-year-old piece of technology become the world’s fourth most popular database?
- *InfoWorld*

## Don’t Ignore Bears: The Pitfalls of Summarizing Data with Medians

Some folks are big fans of the median as a summary statistic. But it has some big downsides—as all statistics do.
- *Towards Data Science*
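
A toy illustration of one such downside, using invented numbers rather than the post's own example: when a rare event is large, the median can report that nothing happened while the mean still registers it.

```python
from statistics import mean, median

# Hypothetical daily losses: 99 uneventful days and one very costly day.
daily_losses = [0] * 99 + [10_000]

typical_day = median(daily_losses)   # what a "typical" day looks like
average_day = mean(daily_losses)     # cost per day averaged over the period

print(typical_day, average_day)
```

Neither number is wrong; they answer different questions, which is why it helps to report more than one summary.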

## It Came from the Data Lake

Do you really need a data lake for that project? Or can you replace Hadoop with your laptop? Check out this presentation to learn how to use Python to process larger data sets (5-10 GB) on your local machine.
- *Vicki Boykis*

## An Interactive Tutorial on Numerical Optimization

People often implement numerical optimization algorithms in machine learning projects without much thought as to how they work. This post aims to change that with interactive visual representations of each algorithm.
- *Ben Frederickson*

## Changepoint Analysis of Time Series Data

Learn how you can use the changepoint R package to identify when a video switches from one scene to the next.
- *Uru*

## Causal Inference With pandas.DataFrames

There's now a causality package in Python to make causal inference more accessible so analysts and data scientists can incorporate it into their day-to-day. The intro is worth reading, even if you're not a Python user.
- *Adam Kelleher*

## How do you convince other people to use R?

Tired of being the lone R user in your organization? Try out these arguments on your colleagues.
- *Simply Statistics*

## Python's strftime directives

This reference for changing date/time formats in Python is so handy that one Twitter user (https://twitter.com/kscottz/status/922627756914962433) said of its creator: “This person has saved the world a thousand years of human effort. This person deserves a beer.”
- *strftime.org*
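
For a quick sense of how the directives work, here is a short sketch using the standard library's `datetime` module. The date and the format strings are arbitrary examples, and only locale-independent directives are used.

```python
from datetime import datetime

moment = datetime(2017, 10, 24, 9, 5)

# %Y = four-digit year, %m = month, %d = day, %H:%M = 24-hour time.
iso_like = moment.strftime("%Y-%m-%d %H:%M")

# %y = two-digit year.
us_style = moment.strftime("%m/%d/%y")

# strptime parses a string back into a datetime using the same directives.
parsed = datetime.strptime(iso_like, "%Y-%m-%d %H:%M")
print(iso_like, us_style, parsed == moment)
```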

## Landing a Data Science Gig in New York City

Trying to break into the NYC data science job market? Sans a PhD? This guide was tailor-made for you.
- *Ground Truth*

## R for Journalists

This site is a great launch pad for anyone who's new to R, journalist or not. Each post provides step-by-step instructions and code for making a visualization with data about a current event.
- *R for Journalists*

## RStudio Community

RStudio recently opened up a forum. It's a great place to hang out with other R users, talk with R package developers during open office hours, or ask newbie questions if you're intimidated by Stack Overflow.
- *RStudio*

## Fast GeoSpatial Analysis in Python

If you get frustrated by the sluggishness of Python's GeoSpatial stack, check out this experiment. Combining Cython, Dask, and GeoPandas sped up the mapping of 120 million geospatial data points by 30x.
- *Matthew Rocklin*

## Becoming a 10x Data Scientist

Whether or not you believe 10x developers exist, data scientists can learn a ton from seasoned developers who are considered incredibly prolific and proficient.
- *Algorithmia*

## Practical Data Science for Stats

Many aspects of day-to-day analytics work are missing from the conventional statistics literature and curriculum. This bookmark-worthy collection aims to solve that problem, with tons of preprints on modern analytical workflows.
- *PeerJ*

## Python Cheat Sheet for Data Science: Intermediate

A handy reference for Pythonistas who have been around the block a few times.
- *Dataquest*

## Buggy Python Code: The 10 Most Common Mistakes That Python Developers Make

This list ain’t for rookies. Here are some of the subtle, harder-to-catch errors that have even advanced Python users tearing their hair out.
- *Toptal*

## Giving Your First Data Science Talk

Here’s why you should consider giving a talk and how to prep. Our favorite insight: your audience is the you from six months (or one year or five years) ago.
- *Hooked on Data*

## Using optaplanner to plan water supplies

Does your job involve a lot of resource planning? Learn how to use OptaPlanner—an open-source constraint satisfaction solver—in the most Silicon Valley way possible: planning out water logistics for Burning Man.
- *Richard Weiss*

## Craft Your Python Like Poetry

The Python style guide PEP 8 specifies line length at 79 characters, but that doesn't mean you should wrap lines when they hit an arbitrary length. If you need to sharpen your poetic sensibilities, these code examples will teach you how to write readable, beautiful Python.
- *Trey Hunner*

## You Say Data, I Say System

Every spreadsheet or database view or visualization is the result of an entire system of decisions: how to collect, compute, and represent the data. This article provides an excellent framework for being mindful of the choices that shape the end product you see on your screen.
- *Hacker Noon*

## I have data. I need insights. Where do I start?

What to do when your boss dumps a bunch of data in your lap and says “tell me something interesting.”
- *Towards Data Science*

## Py 2.0

Check this out if you’ve got an iPhone and want to learn to code Python, SQL, HTML—actually, pretty much any language—on the go.
- *Product Hunt*

## 29 common beginner Python errors on one page

Beginner or not, you’ll want to print out this flowchart and keep it at your desk.
- *Python for Biologists*

## How to Call B.S. on Big Data: A Practical Guide

One of our favorite tips in here: 'If you’d ask [a question] at a car dealership, you should ask it online, too.'
- *The New Yorker*

## 4 steps to conducting a proper root cause analysis

Whip out this guide the next time your boss asks you a question like “Why is revenue down?”
- *Outlier AI*

## The Hitchhiker’s Guide to d3.js

Intimidated by the long list of functions in d3’s API documentation? Paralyzed by choosing from dozens of d3 tutorials? Start here.
- *Ian Johnson*

## Methodologies as Vanity Metrics

“When you work on learning new methods (Now I know Random Forest! Now I know K-L Divergence! Now I know Deep Learning!) it feels good—you’re exercising your brain, you know something you didn’t before—and it’s easy to think you’re progressing. But methods don’t in and of themselves drive value.”
- *Ian Blumenfeld*

## Profiling a Dataset of Craft Beers

Learn how to summarize a dataset with descriptive statistics using this fun Python tutorial.
- *Jean-Nicholas Hould*

## Setting up SQL for beginners is hard

SQL’s human-language-like syntax and declarative nature make it the perfect language for people with no coding experience. But getting data available in the right structure presents a major barrier to entry. Here’s how to quickly build a stack for teaching SQL to others.
- *Vicki Boykis*

## Alternatives to a Degree to Prove Yourself in Deep Learning

Why blogging might be the best way to land a job offer.
- *fast.ai*

## The Etymology of Trig Functions

Way more engaging than your high school math class.
- *Matthew Conlen*

## How to ask questions data science can solve

Asking the right questions is half the battle. This post takes a different approach to formulating questions, by mapping them to the tools of the trade.
- *Towards Data Science*

## 1,000+ Women in Data Science

Your Twitter feed just got so much better.
- *Renee Teate*

## Group-by From Scratch

What’s the best way to split-apply-combine in Python? Although pandas groupby() is the widely accepted default answer, there are situations where built-in Python operations or NumPy and SciPy operations are more effective.
- *Jake VanderPlas*
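
To make the idea of a built-in split-apply-combine concrete, here is a minimal sketch of a grouped sum that uses only the standard library. It is not taken from the linked post; the function name `group_sum` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_sum(keys, values):
    """Sum values per key: a tiny split-apply-combine without pandas."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value  # "apply" and "combine" happen in one pass
    return dict(totals)


# Hypothetical data: one key per observation, one value per observation.
keys = ["a", "b", "a", "c", "b", "a"]
values = [1, 2, 3, 4, 5, 6]

result = group_sum(keys, values)
print(result)
```

A dictionary accumulator like this avoids building a DataFrame at all, which can matter for small inputs or for environments where pandas is not available.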

## Taking Prophet for a Spin

Been meaning to try Prophet? Check out this walkthrough of Facebook’s Bayesian-influenced time series forecasting package (for both R and Python!).
- *Fast Forward Labs*

## What’s Wrong With My Time Series

When you want to test a model’s predictive power, cross validation is usually the way to go. However, since data points in a time series are dependent on each other, randomly selecting subsets for training and testing won’t do. Check out these other ways to determine error sources in time series.
- *MultiThreaded*
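
One common alternative to random splitting is walk-forward (expanding window) validation, where every test block comes strictly after its training block. The sketch below illustrates that general technique and is not the method from the linked post; `walk_forward_splits` and its parameters are hypothetical names.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs where test always follows train."""
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        test_end = train_end + fold_size
        # Training data grows over time; the test block is the next slice.
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical series of 10 time-ordered observations.
splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because no test index ever precedes a training index, the model is never evaluated on data older than what it was fit on.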

## Mathematicians becoming data scientists: Should you? How to?

Tips for determining if you’ll actually like the work data scientists do and positioning your mathematics background as an asset when you’re interviewing.
- *Quomodocumque*

## How to change careers and become a data scientist - one quant’s experience

One quant shares her story of switching from energy trading to data science: the resources she used, the classes she took, her decision to move to the Bay Area, and her advice for handling tech culture shock.
- *fast.ai*

## The Zero Bug

Hidden errors can be worse than visible errors. This post presents a fallacy that plagues many data analysts: common data aggregation tools usually can’t “count to zero” from examples.
- *Win-Vector Blog*
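
A small sketch of the "count to zero" problem, assuming a hypothetical list of stores and sales records: aggregating only the observed rows silently omits any category with no rows, so the zero has to be restored from a known list of categories.

```python
from collections import Counter

# Hypothetical full list of categories and the rows actually observed.
known_stores = ["north", "south", "east"]
sales = ["north", "north", "south"]  # "east" sold nothing, so it has no rows

observed_counts = Counter(sales)

# Aggregating only what was observed drops "east" entirely.
naive = dict(observed_counts)

# Reindexing against the known categories recovers the explicit zero.
complete = {store: observed_counts[store] for store in known_stores}

print(naive)
print(complete)
```

The fix depends on having the category list from somewhere other than the data being counted, such as a dimension table.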

## I ranked every Intro to Data Science course on the internet, based on thousands of data points

There are a ton of data science training options online, but which one is the best?
- *freeCodeCamp*

## Unlearning descriptive statistics

If you’ve ever used an arithmetic mean, a Pearson correlation, or a standard deviation to describe a dataset, this post is for you.
- *Stijn Debrouwere*

## Guide to Encoding Categorical Values in Python

There are a ton of ways to turn categorical variables from text attributes into numerical values. Here’s how to implement the many options offered by pandas and scikit-learn on your own datasets.
- *Practical Business Python*

## Intro to Data Science for Academics

From Reed College to Revenue at Twitter, one data scientist shares his insights on how academics can be successful in industry—by finding ways to create value in every corner of the business.
- *Noah Pepper*

## Data Science for Beginners

“These videos are basic but useful, whether you’re interested in doing data science or you work with data scientists.”
- *Microsoft Azure*

## The best R package for learning to “think about visualization”

Spoiler alert: it’s ggplot2.
- *Sharp Sight Labs*

## My Experience as a Freelance Data Scientist

Itching to strike out on your own? Read up on the pros and cons before you give your two weeks’ notice.
- *Greg Reda*

## Matching to estimate the causal effects of firing an NFL coach

To fire or not to fire? When a football team gives their coach the boot, are they better off for it? (Bonus: a nice primer on causal inference.)
- *StatsbyLopez*

## How These Three Women Made Mid-Career Pivots Into Data Science

How do we narrow the gender gap in data science? Early STEM education for girls isn’t the only solution. Here are the journeys of three women who switched from creative jobs to data roles mid-career.
- *Fast Company*

## What’s the state of the job market in data science and machine learning?

“Th[e] proliferation of courses, resources, books and startups would hint that machine learning is becoming more and more accessible to the average programmer and that the market is on track to getting saturated quickly. Is this the current trend?”
- *Hacker News*

## What library do you use for information theory in Python?

This thread is a goldmine if you’re looking to calculate entropy, mutual information, or any other information theory metric.
- *Randy Olson*

## The Game Theory of the Yankee Swap

Want to get the best present at this year’s White Elephant gift exchange? Prep for total domination with these Python models.
- *Ben Casselman*

## Time Series Analysis in Python - Linear Models to GARCH

A well-written, comprehensive primer on the time series models available in Python.
- *BlackArbs*

## How the Circle Line rogue train was caught with data

When a series of signal interferences led to massive disruptions on a Singapore subway line, a team of data scientists stepped in to solve the mystery… with Python!
- *Data.gov.sg*

## Building a Financial Model with Pandas

Expand your knowledge of Python and Pandas and analyze your mortgage payment options. Two birds, one stone.
- *Practical Business Python*

## Text Analysis and Visualization

Ever wanted to try text analysis in Python, but didn’t know where to start? Here’s your launch pad.
- *Irene Ros*

## 8 Data Science Skills That Every Employee Needs

A nice primer to share with your colleagues.
- *Amplitude*

## Is Bayesian A/B Testing Immune to Peeking? Not Exactly

A common A/B testing mistake is to monitor the test and stop it when the p-value reaches a certain threshold. Many have suggested that using Bayesian methods eliminates this “peeking problem,” but all is not as it appears.
- *Variance Explained*

## PostgreSQL Date Functions (and 7 Ways to Use Them in Business Analysis)

PostgreSQL date functions (like DATE_TRUNC, EXTRACT, and AGE) make wrangling timestamps much easier. Here are 7 examples of applying these date functions to business scenarios.
- *Mode*

## How to Master Anti Joins and Apply Them to Business Problems

Learn how to perform an anti join using LEFT JOIN and WHERE, plus three examples of applying anti joins to business scenarios.
- *Mode*
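
The LEFT JOIN and WHERE pattern can be sketched with Python's built-in `sqlite3` module. The `users` and `orders` tables and their contents are hypothetical, chosen only to show which rows survive the anti join.

```python
import sqlite3

# Hypothetical tables: users, and orders placed by some of those users.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti join: LEFT JOIN keeps every user, and the WHERE clause then keeps
# only the users for whom no matching order row was found.
anti_join_sql = """
    SELECT u.id, u.name
    FROM users AS u
    LEFT JOIN orders AS o ON o.user_id = u.id
    WHERE o.user_id IS NULL
    ORDER BY u.id
"""
users_without_orders = conn.execute(anti_join_sql).fetchall()
print(users_without_orders)
conn.close()
```

The `IS NULL` check should be on the join column from the right-hand table, since that column is only null when the join found no match.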

## What Would It Take To Turn Blue States Red?

Explore this interactive data visualization to see how small voting shifts among different demographics can impact the Presidential election.
- *FiveThirtyEight*

## Farmers Markets

Can you find real maple syrup outside of Vermont? Or seafood in the midwest? Or pet food anywhere? Check out these interactive visualizations to see what you’re most likely to find at a farmers market near you.
- *Susie Lu*

## On Average

Does the average person actually exist? Probably not, as it turns out. Learn how the concept of “average” influences product design, and why that’s not always a good thing.
- *99% Invisible*

## Asking good questions is hard (but worth it)

Although this framework is written from a programmer’s perspective, it’s a great read for analysts and the folks who ask them questions day-in and day-out.
- *Julia Evans*

## Goodbye, Ivory Tower. Hello, Silicon Valley Candy Store.

Some economists are trading in their professorships for tech jobs: 'Instead of thinking about national or global trends, they are studying the data trails of consumer behavior to help digital companies make smart decisions that strengthen their online marketplaces in areas like advertising, movies, music, travel and lodging.'
- *New York Times*

## Postgres Data Types to Redshift Data Types

Switching from one flavor of SQL to another can be a major pain. This table translates Postgres data types to their equivalents in Redshift. Definitely worth starring on GitHub.
- *Rob Story*

## The Three Faces of Bayes

The term “Bayesian” can refer to a variety of philosophies and ideas. Read this article before the next quant-heavy cocktail party you attend, so you’ll know what’s what.
- *Slackpropagation*

## R Psychologist

Puzzled by p-values? Confounded by confidence intervals? Stumped by significance testing? This site is a bevy of interactive visualizations illustrating tricky statistical concepts. Even if you’re a statistical genius, it’s worth a visit to play around.
- *Kristoffer Magnusson*

## 3 Reasons Counting is the Hardest Thing in Data Science

Counting isn’t technically difficult; the real challenge lies in managing relationships and office politics that surround the task.
- *Dayne Batten*

## Forget Python vs. R: how they can work together

Apparently we can all get along. The folks at Civis Analytics share the benefits of using both languages and give an example of how you can use C as a bridge to both Python and R. (Slides and a video from the original SciPy talk are also available.)
- *Civis Analytics*

## 70+ Resources for Transitioning to a Data Science Career

Considering a career in data science? Time to read up. Here's a list of tutorials, tips for interviewing, and stories from people who've made it.
- *Mode*

## Top 20 Pandas, NumPy, and SciPy Functions on Github

Some of the most popular Python functions, visualized in Python.
- *Alexander Galea*

## Ethics for powerful algorithms

Contrary to a ProPublica investigation, COMPAS, a proprietary algorithm used to predict criminal recidivism and inform parole decisions, isn’t statistically biased against black people. However, that doesn’t mean COMPAS isn’t deeply unfair. This is the first of four posts digging into data science ethics.
- *Abe Gong*

## Build Algorithms Like You Give a Damn

Discussions at the 2016 WrangleConf focused on data science ethics and strategies for combatting harm by opening communication, recognizing bias, and fighting indifference.
- *Mode*

## Understanding Bias: A Pre-requisite For Trustworthy Results

“What causes bias? How can we correct it, and how does our picture of how the world works factor in to that?”
- *Adam Kelleher*

## A visual guide to Bayesian thinking

The best single source we’ve found for demystifying how Bayes’ Rule works, the intuition behind it, and how you can use it to inform your thinking.
- *Julia Galef*

## Practical advice for analysis of large, complex data sets

“This document has been read more than anything else I’ve done at Google over the last eleven years. Even four years after the last major update, I find that there are multiple Googlers with the document open any time I check.”
- *The Unofficial Google Data Science Blog*

## The Theorem Every Data Scientist Should Know

Quick! Define the Central Limit Theorem. Scratching your head? You’re not alone. And yet, this theorem is key to what data scientists do every day: make statistical inferences about data.
- *Jean-Nicholas Hould*
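
For a quick feel of what the theorem says, the sketch below repeatedly samples from a heavily skewed distribution and checks that the sample means cluster around the true mean with a spread close to sigma / sqrt(n). The sample size, number of repetitions, and choice of an exponential distribution are all assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 50    # observations per sample (assumed value)
NUM_SAMPLES = 2000  # number of repeated samples (assumed value)

# Exponential distribution with rate 1: mean 1, standard deviation 1, skewed.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
predicted_spread = 1.0 / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means:   {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f}")
print(f"predicted spread:       {predicted_spread:.3f}")
```

Even though individual draws are far from normal, the distribution of their means is approximately normal, which is what justifies many everyday confidence intervals and tests.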

## Thinking in SQL vs Thinking in Python

Using a new language requires a new mindset. Our chief analyst shares his learnings from adding Python to his SQL workflow.
- *Mode*

## If Correlation Doesn’t Imply Causation, Then What Does?

This tweet sums up our feelings on this article exactly: 'Love that it gives a framework for thinking about correlations that isn’t just ¯\_(ツ)_/¯'
- *Adam Kelleher*

## Building a data science portfolio

Much like writers and designers, data scientists are now expected to provide portfolios when they apply for jobs. Here’s what you need to know to get started.
- *Dataquest*

## Escaping Excel Hell with Python & Pandas

A great presentation on the problems that arise from spreadsheet analysis and how you can ditch Excel by learning some Python.
- *Chris Moffitt*

## Scientific Python Cheat Sheet

For those moments when you forget how to make a contour line plot in matplotlib or write a function in pure Python.
- *Institut de Physique du Globe de Paris*

## 10 Useful Python Data Visualization Libraries for Any Discipline

While many Python data visualization libraries are narrowly focused on accomplishing a certain task, these libraries can be used regardless of your field.
- *Mode*

## What SQL Analysts Need to Know About Python

Here's some info on the importance of Python and how to use it in day-to-day analysis.
- *Segment*

## Easier data analysis in Python with pandas

A series of video tutorials for pandas newbies who know some Python. Each video answers a student-posed question using real-world data.
- *Data School*

## PyData London Conference Presentations

A few weekends ago PyData hosted a conference in London, and they just released videos and slides of a bunch of the presentations.
- *PyData*

## Modern Pandas

This tutorial is great for experienced Python users looking to stay sharp on pandas. One Twitter user summed it up perfectly as “the abbreviated Strunk & White of data analysis.”
- *Tom Augspurger*

## Spreadsheet Thinking vs. Database Thinking

This is a great read for anyone who’s new to working with relational databases.
- *eagereyes*

## SQL Joins Visualizer

Many a learner has embarked on the quest to learn SQL, only to be thwarted by the task of mastering joins. Never again. Click the type of join you want to execute and this site will generate the right code.
- *SQL Joins Visualizer*

## 10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Forgive the click-baity title. This is actually a really well-done roundup of the statistical and machine learning methods data scientists use daily, with Python and R scripts for each.
- *Takashi J. Ozaki*

## An Introduction to Inference

A good first step for those who work with data frequently and want to learn more about Bayesian statistical methods. From the author: 'It will be a bit mathy, but nothing beyond Khan-level probability.'
- *Vincent D. Warmerdam*

## 6 Lesser Known Python Data Analysis Libraries

You’ve heard of NumPy and Pandas and matplotlib. Now check out these other handy libraries for dealing with data.
- *Jyotiska Khasnabish*

## This is the difference between statistics and data science

Another blog post trying to define data science? We know. We know. BUT! This one presents an interesting angle: the difference between a data scientist and a statistician comes down to product knowledge.
- *Mixpanel*

## How to Find Correlative Metrics For Conversion Optimization

A thorough walk-through of how to find correlative metrics and leverage them for conversion. It’s jam-packed with examples and advice from experts, plus a handy list of tools.
- *ConversionXL*

## Lift analysis - A data scientist’s secret weapon

Learn how to spot flaws in machine learning models with lift analysis (and why you should add it to your list of evaluation metrics).
- *Andy Goldschmidt*

## Not So Standard Deviations: Episode 11 - Start and Stop

If you haven’t listened to NSSD yet, you’re missing out on an inside look at how data scientists work in industry and academia. In this episode, statisticians Hilary Parker and Dr. Roger Peng discuss their methods for tackling the beginning and ending parts of analyses (discussion starts at 20:43).
- *Not So Standard Deviations*

## A Practical Guide to Anonymizing Datasets with Python & Faker

Sometimes you just want to show off an analysis or chart you built for your company… without revealing your company’s data. Now you can.
- *District Data Labs*

## Writing Data—an introduction to choosing & using data formats

JSON, CSV, or HDF5? This guide outlines the perks and pitfalls of file formats for alphanumeric data.
- *Build Things Together*

## Friction Between Programming Professionals and Beginners

In many technical forums, there’s a pattern of beginners asking a vague question and forum veterans responding with snarky or curt replies. Here are some suggestions both parties can use to keep conversations productive.
- *Programming for Beginners*

## Practical skills that practical data scientists need

Last week, Noah Lorang of Basecamp wrote that, most of the time, data scientists don’t need AI to solve business problems. They just need simple arithmetic. In this post, he elaborates on the skills he uses and questions he asks every day.
- *Signal v. Noise*

## Data scientists mostly just do arithmetic and that’s a good thing

The vast majority of the time, businesses don’t need machine learning to solve their problems. They need accurate, actionable data and people who consider context, know basic math, write SQL, and understand what makes businesses tick.
- *Signal v. Noise*

## The Elements of Python Style

This document goes beyond PEP8 to cover the core of what I think of as great Python style. It is opinionated, but not too opinionated. It goes beyond mere issues of syntax and module layout, and into areas of paradigm, organization, and architecture.
- *Andrew Montalenti*

## The Art of Naming Things

Nothing’s worse than when you open a new dataset only to find it’s full of indecipherable labels. This two-part article provides suggestions to keep your naming convention consistent, concise, and informative while preventing data loss and a whole lot of headaches.
- *Penn State*

## LowClass Python—Style Guide for Data Scientists

This style guide is meant for use by advanced beginner to advanced intermediate developers of scientific code in Python. In other words, non-professional programmers...for example, data scientists.
- *Columbia University Applied Data Science*

## Guess the Correlation

How good are you at gauging the correlation between two variables in a scatter plot? Find out!
- *Omar Wagih*

## A menagerie of messed up data analyses and how to avoid them

Don’t let mistakes botch your analyses. This post outlines six examples and offers advice for taking proactive measures against them.
- *Simply Statistics*

## Writing More Legible SQL

It’s easy to get lazy when writing SQL. Here are a few tips for cleaning up your queries so others can actually read your work.
- *Craig Kerstiens*

## How to Make the Leap from Excel to SQL

Learning SQL is easier when you have Excel in your toolbelt. And moving your analysis into SQL will seriously speed up your workflow.
- *Mode*

## AMA Data Scientist—Jake Porway of DataKind

Highlights of the discussion include advice for budding data scientists, ethical challenges, and opportunities to do good with data.
- *Reddit*

## Getting to the “Plateau of Productivity” with Python

Using the Gartner Hype Cycle as a framework, this post provides a load of context and tips for anyone who wants to pursue Python. As an added benefit, you could apply this structure to learning any technical language or tool.
- *Practical Business Python*

## The Missing 11th of the Month

According to Google’s Ngrams database, the 11th is mentioned significantly less than other monthly ordinals. But why? We don’t want to spoil the conclusion, but this post is a good reminder of why you shouldn’t blindly trust data.
- *Dr. David Hagen*

## Not Even Scientists Can Easily Explain P-values

We want to know if results are right, but a p-value doesn’t measure that. It can’t tell you the magnitude of an effect, the strength of the evidence or the probability that the finding was the result of chance.
- *FiveThirtyEight*

## Blinded by Statistical Significance

Putting too much stock in an arbitrary threshold may lead to bad decisions.
- *KelloggInsight Blog*

## The Field Guide to Data Science

Booz Allen just released the second edition of The Field Guide to Data Science, which walks you through how to use data to generate value for your organization. The guide includes practical advice, tested processes, and insights that are helpful for anyone who touches data, whether you’re a senior exec, a practitioner, or a newbie.
- *Booz Allen Hamilton*

## Big Data Still Requires Humans To Make Meaningful Connections

It’s easy to get swept up in the exciting opportunities big data presents and forget that data alone isn’t a solution—it’s a tool to help solve problems. This article hits on a sentiment we’ve been hearing a lot lately—“we still need humans to help make sense of the data we are collecting.”
- *TechCrunch*