Data Engineering Articles

Data engineers empower company initiatives by building tools, infrastructure, frameworks, and services to get data “in shape” for analysts to query. This section includes articles about building and maintaining scalable data infrastructure, data modeling, piping data from one database to another (ETL), integrating data generated from SaaS tools into a single data warehouse, and optimizing data processing and storage.

Everyone Should Care About Data Storage

From data warehouses and lakes to real-time applications: they’re all part of the journey to making data useful.-Sarah's Newsletter

The Future History of Data Engineering

While Data Engineering is growing rapidly, so too are the forces that will undermine the need for Data Engineers. Most businesses' data engineering needs, which 10 years ago would have required endless and extensive self-built ETL pipelines, databases and tools, have been solved or will shortly be solved by managed services.-Group by 1

From Data Engineer to SysAdmin: Put Down the K8s Cluster, Your Pipelines Can Run Without It

“I’ve been operating Kubernetes... in a data engineering team for almost three years now, and I’d be wary of using it if I had the choice in the future.”-Jonathon Belotti

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective

At many other organizations, teams manage data infrastructure costs by setting budgets and other heavy guardrails to limit spending. But that doesn’t fly at Netflix. Instead, they provide decision-makers with cost transparency and as much efficiency context as possible.-The Netflix Tech Blog

Scalable User Privacy

Spotify built a service to make sure every part of their complex and diverse data processing ecosystem is compliant with privacy standards.-Spotify Labs

Deleting Data Distributed Throughout Your Microservices Architecture

At Twitter, data deletion isn’t an event. It’s a process that hunts down data beyond the reach of an API call: in backend systems, caches, and offline datasets.-Twitter Engineering

How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

“We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.”-Zhamak Dehghani

Data Engineering Hierarchy of Needs

A company’s level of data infrastructure maturity depends on its needs. Unless you’re operating on the level of Netflix, you’re probably better served by focusing on ETL, rather than optimizing analyses.-Music and Tech

Observability for Data Engineering

Standard monitoring tools don’t cut it because data pipelines behave very differently than software applications and infrastructure.-Databand

“Oh, You’re a Data… Something?”: The Misunderstood Role of a Data Engineer

This one goes out to every data engineer who's been mis-introduced as a data scientist.-ssense-tech

Using AWK and R to parse 25tb

Check out these lessons a PhD candidate learned by setting up a query system for some big data his lab generated. He sped up queries by 4,800 times using old-school techniques.-Live Free or Dichotomize

Comparing Python and SQL for Building Data Pipelines

“Just because Python could do the job doesn’t mean it should.”-Towards Data Science

Does my Startup Data Team Need a Data Engineer?

Data engineering is no longer about building pipelines. Data engineers' tasks have shifted, and so too should your plans for when to hire them and what to hire them for.-Fishtown Analytics

Putting the Power of Kafka into the Hands of Data Scientists

How Stitch Fix’s data engineers built a robust, centralized data integration platform tailored to the needs of their data scientists.-MultiThreaded

Capturing Data Evolution in a Service-Oriented Architecture

How Airbnb built a scalable, performant, reliable, lossless Change Data Capture service that enabled propagating and reacting to data mutations in real time.-Airbnb Engineering & Data Science

Optimizing BigQuery: Cluster your tables

Learn how to use this new BigQuery feature, and you could be running the same query for one-tenth the cost.-Felipe Hoffa

Goodbye Microservices: From 100s of problem children to 1 superstar

How and why Segment transitioned their data infrastructure from a microservice architecture to a single, monolithic service.-Segment

A Beginner’s Guide to Data Engineering — The Series Finale

Complexity increases as your team grows. Data engineering is no exception to this rule. Tackle that complexity by identifying and automating ETL patterns that are regularly present in people’s workflows.-Robert Chang

The Future of Data Engineering is the Convergence of Disciplines

Jasmine Tsai, Director of Engineering for Clover Health's data platform, covers how Clover built their ideal data infrastructure, the skillsets in play on the data team, and the future of the data engineering industry itself.-Mode

Give meaning to 100 billion analytics events a day

How one digital advertising company employs Kafka, Dataflow and BigQuery to ingest and transform a large stream of events.-Teads Engineering Blog

3 Industry Leaders on the Future of Data Tooling

There’s a bright future ahead for data engineering, one in which the tools and technology we depend on are increasingly designed with depth and cohesion in mind. Here are the tools three leaders in the data engineering field are most excited about.-Mode

A Beginner’s Guide to Data Engineering — Part II

The second part of this series covers data modeling, data partitioning, and ETL best practices, all with code examples from Airbnb's open source ETL tool Airflow.-Towards Data Science

Down with Pipeline debt / Introducing Great Expectations

This new Python library aims to help you beat down pipeline debt, a type of technical debt that infests backend data systems, by conducting automated tests of data (instead of code) that happen at batch time (instead of compile or deploy time).-Great Expectations

What is a Senior Data Visualization Engineer?

“It differs from an analyst role in that the focus is not on a question but rather on an audience that typically needs something more than a single report and who expects views into the data that generate more than just the expected insights.”-Elijah Meeks

Scaling Event Tables with Redshift Spectrum

As Mode’s customer base grew, we reached a point where our infrastructure wasn’t capable of handling the exponentially increasing volume of event data. Here’s how we saved Redshift performance by offloading 75% of our event data to S3 in less than a week.-Mode

A Beginner’s Guide to Data Engineering — Part I

The perfect primer for aspiring data scientists who need to learn the basics to evaluate job opportunities or early-stage founders who are about to build the company’s first data team.-Robert Chang

Selecting a Cloud Provider

Since its inception, Etsy has hosted its site and services in self-managed data centers. Now the company is switching over to Google Cloud Platform. Their CTO shares what went into their five-month-long evaluation process.-Code as Craft

The Missing Layers of the Analytics Stack

Collect, transform, analyze. These are the three pillars that support the modern analytics stack. Looking ahead, new layers may be added to streamline current sticking points, like data cleansing and anomaly detection.-Fishtown Analytics

Apache Airflow for the confused

Do you need a clear explanation about this task orchestration tool, sans the technical language? This post unpacks the jargon with a very apropos metaphor—air traffic controllers.-NYC Capital Planning

Big Data Processing at Spotify: The Road to Scio (Part 1)

Using Scio, a Scala API built in-house, Spotify is able to run the majority of their workloads with a single system, with little operational overhead.-Spotify Labs

What, exactly, is dbt?

Go deep on dbt, a command line tool that handles the T (transform) in ETL.-Fishtown Analytics

Segment vs Fivetran vs Stitch: Which Data Ingest Should You Use?

Choosing a pipeline tool comes down to which of these criteria is your top priority: harnessing an open source framework, handling high volumes of data with minimal downtime, or getting your data into third-party tools.-Stephen Levin

ZATA: How we used Kubernetes and Google Cloud to expose our Big Data platform as a set of RESTful web services

An inside look at zulily's data platform, which makes data accessible to analysts, systems, and applications without sacrificing speed or storage options.-Tech @ zulily

How Stitch Consolidates A Billion Records Per Day

Ever wanted to know how the people who make ETL tools set up their data infrastructure? Wonder no more.-StackShare

Choosing an ETL tool for your analytics stack

In the market for an ETL solution? Here are the criteria we employed when we evaluated ETL vendors for our own use here at Mode.-Mode

Airflow and the Future of Data Engineering: A Q&A

“[F]uture startups will be catapulted up the data maturity curve with access to better, cheaper, more accessible analytics software and services.”-Astronomer

The Rise of the Data Engineer

An in-depth manifesto for data science’s younger sibling.-Maxime Beauchemin

The State of Data Engineering

What makes a data engineer, well, a data engineer? And why does it feel like everyone is looking to hire one? This new study of LinkedIn data reveals that the number of data engineers doubled from 2013 to 2015, but demand still far outpaces supply.-Stitch Data

Goods: Organizing Google’s datasets

Most companies store their data in a central repository where everyone can go to publish or retrieve a dataset. Google manages their data in a different way: they’ve built (surprise!) a crawling engine to index datasets and gather metadata about them. This gives folks the freedom to make and use datasets however they like.

When to use unstructured datatypes in Postgres–Hstore vs. JSON vs. JSONB

PostgreSQL has supported NoSQL for a while now, but when should you use relational mode and when should you use non-relational mode? And if you use NoSQL, which data type should you pick?-Citus Data

Non-Mathematical Feature Engineering techniques for Data Science

This article is worth Pocketing for the straightforward, plain-English explanation of feature engineering alone. (And the best practices for pre-processing data ain’t bad either.)-Sachin Joglekar

Bridging the Gap Between Data Science and Data Engineering

Josh Wills, Director of Data Engineering at Slack, shares his thoughts on how data engineers and data scientists work best together.-Hakka Labs

The Purpose of Platforms in Data Science

How do you scale your data science org without hiring more people? Optimize for technical efficiency. In Uber’s case, that means data engineers building self-serve platforms to address specific problems in data scientists’ workflows.-Kevin Novak

Building Thumbtack’s Data Infrastructure

In this post, Thumbtack data engineer Nate Kupp sheds light on the company’s process for evaluating tools to add to their tech stack. It’s a goldmine for startups contemplating how to build a sustainable data infrastructure.-Thumbtack Engineering

Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department

Here’s one suggestion for fixing the sometimes hairy relationships between data scientists and engineers: optimize for autonomy, not technical efficiency.-Stitchfix

Choosing a Database for Analytics

A comprehensive rundown of criteria to consider when you’re ready to dedicate a database to analytics. Use this guide to evaluate your options depending on the type and size of your data, the state of your engineering resources, and your need to analyze data in real time.-Segment
