Data Munging Articles

Need to anonymize a dataset? Get dates and times in a particular format? Replace null values? All of these tasks fall under the umbrella of data munging, the process of cleaning and formatting data so that it is easier to consume and more convenient to analyze. These tutorials will help you get started with data munging (aka “data wrangling”), whether you do it manually or with the help of tools like Python libraries and R packages.

Go Collect Some $#*(&% Data

“Collecting data has somehow fallen out of the conversation. This isn’t healthy, both for our own profession, but also for the world at large.”-Counting Stuff

Practicing Data Prep with Wikipedia Data

A lot of publicly available datasets are unnaturally clean and don’t match the level of mess you’d encounter in a real production environment. Enter Wikipedia’s voluminous data set, full of weird activity from all over the internet.-Counting Stuff

Fuzzy Name Matching

People’s names are messy. Is Allison Joy Lee the same person as Alli Lee and AJ Lee? With this matcher, deduplicating contacts becomes much easier.-Andrew Zamler-Carhart
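
To make the idea concrete, here is a minimal sketch of fuzzy name comparison using only Python's standard-library `difflib`. It is not the matcher from the linked article; the helper names (`normalize`, `name_similarity`, `likely_same_person`) and the 0.6 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name, drop non-letter characters, collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_person(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag a pair as a probable duplicate; the threshold is an arbitrary assumption."""
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Compare a few contact names against a reference spelling.
    for candidate in ["Alli Lee", "AJ Lee", "Robert Smith"]:
        score = name_similarity("Allison Joy Lee", candidate)
        print(f"{candidate}: {score:.2f}")
```

Plain character similarity handles nicknames and truncations reasonably well, but initials such as "AJ" share few characters with the full name, so a real deduplication pass would likely need extra rules for them and a threshold tuned on actual contact data.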

Research quality data and research quality databases

“Creating research quality data is the way that you refine and structure data to make it conducive to doing science. It means that the data is no longer as general purpose, but it means you can use it much, much more efficiently for the purpose you care about—getting answers to your questions.”-Simply Statistics

Data preparation in the age of deep learning

“When companies are spending millions or more dollars on training data, it's absolutely essential that they do it in a smart way.”-O’Reilly Data Show Podcast

Real-world data cleanup with Pandas and Python

Cleaning data is a tedious yet essential part of every analyst’s day. Learn how to use Python and Pandas to make sure your data is clean, without worrying about overlooking any potential issues.-TrendCT

Handy Python Libraries for Formatting and Cleaning Data

These Python libraries will make the crucial task of data cleaning a bit more bearable—from anonymizing datasets to wrangling dates and times.-Mode

How to Treat Missing Values in Your Data

There’s nothing worse than opening up a new dataset only to discover it’s missing a ton of values. This two-part post evaluates techniques for handling missing data.-CleverTap
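
As a rough illustration of two common options, dropping incomplete rows versus filling gaps with the column mean, here is a small standard-library sketch. These are generic techniques and not necessarily the ones the linked post evaluates; the function names and the sample `age` column are hypothetical.

```python
from statistics import mean


def drop_missing(rows, column):
    """Listwise deletion: keep only the rows where `column` has a value."""
    return [row for row in rows if row.get(column) is not None]


def impute_mean(rows, column):
    """Mean imputation: replace missing values with the mean of observed ones."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = mean(observed)
    # Build new dicts so the input rows are left unmodified.
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical sample data with one missing value.
    people = [{"age": 30}, {"age": None}, {"age": 40}]
    print(drop_missing(people, "age"))
    print(impute_mean(people, "age"))
```

Dropping rows is simple but shrinks the sample, while mean imputation keeps every row at the cost of understating the column's variance, so the right choice depends on how much data is missing and why.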

What every data scientist should know about data anonymization

You can uniquely identify a person with surprisingly little data. This PyData Berlin presentation walks you through the process of anonymizing a data set as well as best practices.-Katharina Rasch

A Practical Guide to Anonymizing Datasets with Python & Faker

Sometimes you just want to show off an analysis or chart you built for your company… without revealing your company’s data. Now you can.-District Data Labs

The Quartz guide to bad data

This comprehensive reference is intended for journalists, but it’s a worthwhile read for anyone working with data. Familiarize yourself with common issues—ambiguous field names, inconsistent date formats, biased samples—so you can catch data quality problems early.-Quartz
