# Data Cleaning
Data cleaning is a process by which inaccurate, poorly formatted, or otherwise messy data is organized and corrected. For example, if you conduct a survey and ask people for their phone numbers, people may enter their numbers in different formats. Before the data can be used, those phone numbers need to be standardized so that they’re all formatted the same way.
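To make that concrete, here’s a minimal Python sketch of the kind of standardization rule involved. It assumes ten-digit, US-style numbers and a default country code of 1; a production pipeline would more likely lean on a dedicated library like phonenumbers.

```python
import re

def standardize_phone(raw: str, default_country_code: str = "1") -> str | None:
    """Normalize a free-text phone number to a single +<country><digits> format.

    A minimal sketch: assumes US-style numbers; anything else is flagged
    for manual review by returning None.
    """
    digits = re.sub(r"\D", "", raw or "")  # strip spaces, dashes, parens, dots
    if len(digits) == 10:                  # e.g. "555.867.5309"
        return f"+{default_country_code}{digits}"
    if len(digits) == 11 and digits.startswith(default_country_code):
        return f"+{digits}"                # e.g. "1 (555) 867-5309"
    return None                            # can't standardize this one

# "(555) 867-5309", "555.867.5309", and "+1 555 867 5309" all become "+15558675309"
```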
Data can be messy like this for lots of reasons. Addresses can be formatted inconsistently; duplicate records need to be identified and reconciled; some records may use different terms, like “Closed won” and “Closed Won,” to represent what should be the same value; null values need to be handled correctly; and so on.
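Each of those problems tends to come down to applying a simple rule. As an illustration (the records and field names here are hypothetical), the sketch below normalizes variant field values, keeps nulls explicit, and collapses duplicates on a normalized key:

```python
def normalize_stage(value: str | None) -> str | None:
    """Map variant spellings to one canonical label; keep nulls explicit."""
    if value is None or not value.strip():
        return None
    return value.strip().lower()  # "Closed won" and "Closed Won" -> "closed won"

records = [
    {"id": 1, "email": "ada@example.com", "stage": "Closed won"},
    {"id": 2, "email": "ADA@example.com", "stage": "Closed Won"},  # duplicate of id 1
]

# Deduplicate on a normalized key (lowercased email), keeping the first record seen
deduped: dict[str, dict] = {}
for rec in records:
    key = rec["email"].lower()
    deduped.setdefault(key, {**rec, "stage": normalize_stage(rec["stage"])})

print(list(deduped.values()))  # one record, with stage "closed won"
```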
### An example: How data can be cleaned
Data can be cleaned in a number of ways. Sometimes, it’s done manually in SQL queries, in Python scripts, or in Excel. Sometimes, people use tools like Trifacta that are designed to programmatically clean data. And sometimes, it’s incorporated into ETL processes that clean data as they extract and load it into a warehouse.
## Opinion: Data cleaning, data prep, and data modeling are all slightly different
Data cleaning often gets conflated with two other related terms: data prep, and data modeling. We think of these words as meaning three different, albeit overlapping, things.
Before defining these terms, however, it’s useful to describe what makes data messy:
- It’s wrong. Sometimes, especially when data is manually entered by people, the data is just wrong. For example, Salesforce data is often the source of truth for revenue data. This data, however, is created by sales reps filling out fields in Salesforce. People input dates and quantities incorrectly or create duplicates by accident. Errors also pop up in machine-generated data, especially if data from test data sources gets mixed in with production data.
- It’s inconsistent. People express the same ideas in different ways. For example, when defining a customer’s plan, one rep may input “Monthly” and another may choose “month to month.” Or when logging someone’s phone number, one system may log it with the country code and the other may not.
- It’s built for machines. A lot of machine-generated data is created in a format that’s useful for a machine but not a human. For example, high-volume event data is typically logged in a nested structure that makes it easy to store by tucking certain fields inside one another. This structure is often good for machines, but it makes the data harder for people to analyze (see the flattening sketch after this list).
- It’s scattered all over the place. Even if the data itself is clean, it may be scattered across a number of sources. To make it useful, you need to centralize it so that it can be combined.
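On the “built for machines” point, here’s a small sketch of what un-nesting that kind of event data can look like. The event shape is made up, and warehouses and transformation tools have their own ways of doing this; the point is just that nested fields become flat, analyzable columns:

```python
def flatten(event: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts into a single level of keys."""
    flat = {}
    for key, value in event.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

event = {
    "event": "page_view",
    "user": {"id": 42, "device": {"os": "iOS", "version": "17.2"}},
}
print(flatten(event))
# {'event': 'page_view', 'user_id': 42, 'user_device_os': 'iOS', 'user_device_version': '17.2'}
```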
To solve these problems, companies typically first centralize their data into a single data warehouse before cleaning it up in any other way. If the source data is messy (e.g., if phone numbers are inconsistent in Salesforce), the data in the warehouse will be messy too. This architecture is commonly referred to as ELT (extract, load, and transform), as opposed to ETL (extract, transform, and load).
Next, they prep the centralized data. Once the data is centralized, data teams use tools like dbt or Airflow to transform raw data into something more suitable for analysis. This process fixes issues like inconsistent phone number formats, duplicate records, and machine-oriented data structures that need to be made more human-friendly. It also introduces more complex business logic, like defining which purchases should be included in revenue calculations (for example, we may want to exclude returns or purchases made with gift cards). This type of enrichment is less about cleaning messy data, and more about refining raw material into a more valuable alloy.
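Here’s a tiny sketch of that kind of revenue rule; the purchase fields (amount, is_return, payment_method) are stand-ins for illustration, not a real schema:

```python
purchases = [
    {"amount": 120.0, "is_return": False, "payment_method": "card"},
    {"amount": -40.0, "is_return": True,  "payment_method": "card"},
    {"amount": 25.0,  "is_return": False, "payment_method": "gift_card"},
]

def counts_toward_revenue(purchase: dict) -> bool:
    """Business logic, not a formatting fix: exclude returns and gift cards."""
    return not purchase["is_return"] and purchase["payment_method"] != "gift_card"

revenue = sum(p["amount"] for p in purchases if counts_toward_revenue(p))
print(revenue)  # 120.0 -- only the plain card purchase counts
```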
Unfortunately, these steps are often mixed together, so it’s difficult to precisely define each step. That said, roughly speaking, this is how we’d talk about it:
- Data cleaning is correcting errors or inconsistencies, or restructuring data to make it easier to use. This includes things like standardizing dates and addresses, making sure field values (e.g., “Closed won” and “Closed Won”) match, parsing area codes out of phone numbers, and flattening nested data structures. It’s about applying rules, not business logic. We also tend to think of data cleaning as something that applies to an individual table rather than an entire schema.
- Data modeling is transforming data with business logic so that you can make sense of it. Rather than correcting errors, this is more about connecting and reshaping clean data into business concepts, metrics, and so on. For example, at Mode, we use Salesforce opportunities to define revenue. But we can’t compute revenue directly from the values on the record because we account for revenue differently on contracts of different lengths. We model our revenue data by adding a new field that recomputes revenue using our accounting logic (see the sketch after this list).
- Data prep is an umbrella term for anything that alters a dataset so that it can be analyzed or graphed. Data modeling is data prep, and data cleaning is data prep. One-off futzing with data in Excel is also data prep, though not a particularly reliable version of it.
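To make the cleaning-versus-modeling split concrete, here’s a hedged sketch contrasting the two. The straight-line spread of contract value over its term is a hypothetical stand-in for an accounting rule, not how Mode actually computes revenue:

```python
from datetime import date

def clean_close_date(raw: str) -> date:
    """Cleaning: a formatting rule, no business logic involved."""
    return date.fromisoformat(raw.strip())

def monthly_recognized_revenue(total_amount: float, term_months: int) -> float:
    """Modeling: accounting logic deciding how contract value becomes revenue."""
    return total_amount / term_months

opportunity = {"close_date": " 2024-03-01 ", "amount": 12000.0, "term_months": 12}
cleaned_date = clean_close_date(opportunity["close_date"])  # date(2024, 3, 1)
monthly_rev = monthly_recognized_revenue(
    opportunity["amount"], opportunity["term_months"]
)  # 1000.0 recognized per month on a 12-month contract
```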
Related terms: ETL tools, Transformation tools