Data Engineering Articles
Data engineers empower company initiatives by building tools, infrastructure, frameworks, and services to get data “in shape” for analysts to query. This section includes articles about building and maintaining scalable data infrastructure, data modeling, piping data from one database to another (ETL), integrating data generated from SaaS tools into a single data warehouse, and optimizing data processing and storage.
Everyone Should Care About Data Storage
From data warehouses to lakes to real-time applications: they’re all part of the journey to making data useful.-Sarah's Newsletter
The Future History of Data Engineering
While Data Engineering is growing rapidly, so too are the forces that will undermine the need for Data Engineers. Most businesses' data engineering needs, which 10 years ago would have required endless and extensive self-built ETL pipelines, databases and tools, have been solved or will shortly be solved by managed services.-Group by 1
From Data Engineer to SysAdmin: Put Down the K8s Cluster, Your Pipelines Can Run Without It
“I’ve been operating Kubernetes... in a data engineering team for almost three years now, and I’d be wary of using it if I had the choice in the future.”-Jonathon Belotti
Byte Down: Making Netflix’s Data Infrastructure Cost-Effective
At many other organizations, teams manage data infrastructure costs by setting budgets and other heavy guardrails to limit spending. But that doesn’t fly at Netflix. Instead, they provide decision-makers with cost transparency and as much efficiency context as possible.-The Netflix Tech Blog
Scalable User Privacy
Spotify built a service to make sure every part of their complex and diverse data processing ecosystem is compliant with privacy standards.-Spotify Labs
Deleting Data Distributed Throughout Your Microservices Architecture
At Twitter, data deletion isn’t an event. It’s a process that hunts down data beyond the reach of an API call: in backend systems, caches, and offline datasets.-Twitter Engineering
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
“We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.”-Zhamak Dehghani
Data Engineering Hierarchy of Needs
A company’s level of data infrastructure maturity depends on its needs. Unless you’re operating on the level of Netflix, you’re probably better served by focusing on ETL, rather than optimizing analyses.-Music and Tech
Observability for Data Engineering
Standard monitoring tools don’t cut it because data pipelines behave very differently than software applications and infrastructure.-Databand
“Oh, You’re a Data… Something?”: The Misunderstood Role of a Data Engineer
This one goes out to every data engineer who's been mis-introduced as a data scientist.-ssense-tech
Using AWK and R to parse 25tb
Check out these lessons a PhD candidate learned by setting up a query system for some big data his lab generated. He sped up queries by a factor of 4,800 using old-school techniques.-Live Free or Dichotomize
Comparing Python and SQL for Building Data Pipelines
“Just because Python could do the job doesn’t mean it should.”-Towards Data Science
Does my Startup Data Team Need a Data Engineer?
Data engineering is no longer about building pipelines. Data engineers' tasks have shifted, and so too should your plans for when to hire them and what to hire them for.-Fishtown Analytics
Putting the Power of Kafka into the Hands of Data Scientists
How Stitch Fix’s data engineers built a robust, centralized data integration platform tailored to the needs of their data scientists.-MultiThreaded
Capturing Data Evolution in a Service-Oriented Architecture
How Airbnb built a scalable, performant, reliable, lossless Change Data Capture service that enabled propagating and reacting to data mutations in real time.-Airbnb Engineering & Data Science
Optimizing BigQuery: Cluster your tables
Learn how to use this new BigQuery feature, and you could be running the same query for one-tenth the cost.-Felipe Hoffa
Goodbye Microservices: From 100s of problem children to 1 superstar
How and why Segment transitioned their data infrastructure from a microservice architecture to a single, monolithic service.-Segment
A Beginner’s Guide to Data Engineering — The Series Finale
Complexity increases as your team grows. Data engineering is no exception to this rule. Tackle that complexity by identifying and automating ETL patterns that are regularly present in people’s workflows.-Robert Chang
The Future of Data Engineering is the Convergence of Disciplines
Jasmine Tsai, Director of Engineering for Clover Health's data platform, covers how Clover built their ideal data infrastructure, the skillsets in play on the data team, and the future of the data engineering industry itself.-Mode
Give meaning to 100 billion analytics events a day
How one digital advertising company employs Kafka, Dataflow and BigQuery to ingest and transform a large stream of events.-Teads Engineering Blog
3 Industry Leaders on the Future of Data Tooling
There’s a bright future ahead for data engineering, one in which the tools and technology we depend on are increasingly designed with depth and cohesion in mind. Here are the tools three leaders in the data engineering field are most excited about.-Mode
A Beginner’s Guide to Data Engineering — Part II
The second part of this series covers data modeling, data partitioning, and ETL best practices, all with code examples from Airbnb's open source ETL tool Airflow.-Towards Data Science
Down with Pipeline debt / Introducing Great Expectations
This new Python library aims to help you beat down pipeline debt, a type of technical debt that infests backend data systems, by conducting automated tests of data (instead of code) that happen at batch time (instead of compile or deploy time).-Great Expectations
What is a Senior Data Visualization Engineer?
“It differs from an analyst role in that the focus is not on a question but rather on an audience that typically needs something more than a single report and who expects views into the data that generate more than just the expected insights.”-Elijah Meeks
Scaling Event Tables with Redshift Spectrum
As Mode’s customer base grew, we reached a point where our infrastructure wasn’t capable of handling the exponentially increasing volume of event data. Here’s how we saved Redshift performance by offloading 75% of our event data to S3 in less than a week.-Mode
A Beginner’s Guide to Data Engineering — Part I
The perfect primer for aspiring data scientists who need to learn the basics to evaluate job opportunities or early-stage founders who are about to build the company’s first data team.-Robert Chang
Selecting a Cloud Provider
Since its inception, Etsy has hosted its site and services in self-managed data centers. Now the company is switching over to Google Cloud Platform. Their CTO shares what went into their five-month-long evaluation process.-Code as Craft
The Missing Layers of the Analytics Stack
Collect, transform, analyze. These are the three pillars that support the modern analytics stack. Looking ahead, new layers may be added to streamline current sticking points, like data cleansing and anomaly detection.-Fishtown Analytics
Apache Airflow for the confused
Do you need a clear explanation about this task orchestration tool, sans the technical language? This post unpacks the jargon with a very apropos metaphor—air traffic controllers.-NYC Capital Planning
Big Data Processing at Spotify: The Road to Scio (Part 1)
Using Scio, a Scala API built in-house, Spotify is able to run the majority of their workloads with a single system, with little operational overhead.-Spotify Labs
What, exactly, is dbt?
Go deep on dbt, a command line tool that handles the T (transform) in ETL.-Fishtown Analytics
Segment vs Fivetran vs Stitch: Which Data Ingest Should You Use?
Choosing a pipeline tool comes down to which of these criteria is your top priority: harnessing an open source framework, handling high volumes of data with minimal downtime, or getting your data into third-party tools.-Stephen Levin
ZATA: How we used Kubernetes and Google Cloud to expose our Big Data platform as a set of RESTful web services
An inside look at zulily's data platform, which makes data accessible to analysts, systems, and applications without sacrificing speed or storage options.-Tech @ zulily
How Stitch Consolidates A Billion Records Per Day
Ever wanted to know how the people who make ETL tools set up their data infrastructure? Wonder no more.-StackShare
Choosing an ETL tool for your analytics stack
In the market for an ETL solution? Here are the criteria we employed when we evaluated ETL vendors for our own use here at Mode.-Mode
Airflow and the Future of Data Engineering: A Q&A
“[F]uture startups will be catapulted up the data maturity curve with access to better, cheaper, more accessible analytics software and services.”-Astronomer
The Rise of the Data Engineer
An in-depth manifesto for data science’s younger sibling.-Maxime Beauchemin
The State of Data Engineering
What makes a data engineer, well, a data engineer? And why does it feel like everyone is looking to hire one? This new study of LinkedIn data reveals that the number of data engineers doubled from 2013 to 2015, but demand still far outpaces supply.-Stitch Data
Goods: Organizing Google’s datasets
Most companies store their data in a central repository where everyone can go to publish or retrieve a dataset. Google manages their data in a different way: they’ve built (surprise!) a crawling engine to index datasets and gather metadata about them. This gives folks the freedom to make and use datasets however they like.
When to use unstructured datatypes in Postgres–Hstore vs. JSON vs. JSONB
PostgreSQL has supported NoSQL for a while now, but when should you use the relational mode and when should you use the non-relational mode? And if you use NoSQL, which data type should you pick?-Citus Data
Non-Mathematical Feature Engineering techniques for Data Science
This article is worth Pocketing for the straightforward, plain-English explanation of feature engineering alone. (And the best practices for pre-processing data ain’t bad either.)-Sachin Joglekar
Bridging the Gap Between Data Science and Data Engineering
Josh Wills, Director of Data Engineering at Slack, shares his thoughts on how data engineers and data scientists work best together.-Hakka Labs
The Purpose of Platforms in Data Science
How do you scale your data science org without hiring more people? Optimize for technical efficiency. In Uber’s case, that means data engineers building self-serve platforms to address specific problems in data scientists’ workflows.-Kevin Novak
Building Thumbtack’s Data Infrastructure
In this post, Thumbtack data engineer Nate Kupp sheds light on the company’s process for evaluating tools to add to their tech stack. It’s a goldmine for startups contemplating how to build a sustainable data infrastructure.-Thumbtack Engineering
Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department
Here’s one suggestion for fixing the sometimes hairy relationships between data scientists and engineers: optimize for autonomy, not technical efficiency.-Stitchfix
Choosing a Database for Analytics
A comprehensive rundown of criteria to consider when you’re ready to dedicate a database to analytics. Use this guide to evaluate your options depending on the type and size of your data, the state of your engineering resources, and your need to analyze data in real-time.-Segment