Editor’s note, April 2021: A lot has changed since this post was written in 2017. We’ve added short notes throughout where things have changed, but left the original text, so that you can see how our thinking has evolved. We’ve also added a general update at the end.
For most companies, essential data starts its life in all sorts of silos. Many businesses are running on a growing range of SaaS platforms, each of which holds a partial view of how a company is doing. With CRM data siloed in Salesforce, user data in an application database, and help tickets tucked away in Zendesk, it's nearly impossible to put together a complete picture of what's going on. Throw in a billing system, an NPS platform, and an email or marketing automation service provider, and you've got a whole mess of data and not many useful insights.
The solution for managing and making use of all that data? Piping it into a data warehouse that organizes and stores it under one roof, making it easy to combine, query, and analyze. Getting this data into a warehouse you control, and can connect to any analytics tool of choice, is vital to keeping your business growing quickly.
Building your own ETL process (aka Extract, Transform, and Load) from scratch, however, can be a massive undertaking. Engineering a custom pipeline from a tool like Salesforce into a data warehouse is daunting enough; engineering a new pipeline for every SaaS service you use is a fool's errand.
Fortunately, most companies can buy, rather than build, their way to a solution. A number of SaaS ETL tools have emerged that do nearly all the heavy lifting for you, right out of the box. Because these tools underpin your data infrastructure, it's important to get it right from the get-go. We hope to offer a bit of guidance, drawing from our experience evaluating ETL vendors for our own use here at Mode. Here's what we'll cover:
- When to make the leap to ETL
- ETL approaches: Syncers versus transformers
- How to compare ETL tools
- Mode's ETL choices
- The takeaway: Know thyself
Early on at Mode, our data was stored in a few places. In a setup that's probably familiar to lots of web—and particularly SaaS—companies, our initial data infrastructure was made up of several data sources:
- An application database: We set up a replica and began querying it directly using Mode from the moment our software could execute queries. This database was our primary source of truth for signups, user actions, and any state-based application data.
- A web event stream: We initially started with two event streams: one we logged ourselves directly in the application database and one collected by Segment and piped into Google Analytics. This second event stream was then written into a Redshift warehouse, which we queried when we needed answers GA couldn't provide.
- Third-party tools: In those days, Salesforce, Marketo, and Intercom contained the most important data to our business. When we first started using these tools, we relied on Salesforce's and Marketo's built-in reporting. Because Intercom didn't have customer-facing analytics at the time, we initially did no reporting with it.
This setup was fine for a while. We could answer most of our questions, and weren't ready to make a big investment to solve problems we didn't yet have or to integrate with third-party tools we were still evaluating.
As our business grew, however, our questions became more complex and having data in lots of different places didn't cut it. When we started asking questions based on what we could answer rather than what we wanted to answer, we knew it was time to look for a new solution.
We had a few goals in mind when we started looking for an ETL solution. First, we knew we wanted to buy, not build. With a vendor, we could be up and running in days compared to months—and at a fraction of the cost. Moreover, vendors take care of maintenance, make updates as APIs change, and keep on top of changing data structures, freeing up our engineering team to work on more mission-critical projects.
Second, because our data sources would undoubtedly evolve (and already have—we've added new event logs and additional third-party tools), we wanted an ETL process flexible enough to solve future needs.
And finally, we were focused on finding tools that moved data into Amazon Redshift, which we had already started using to warehouse our event stream data.
(Update: We’ve since migrated to Snowflake.)
When we first started evaluating ETL vendors, our initial thought was that they were all pretty similar. But as with so many other technical tools, there's more than meets the eye. Though most can pull data from lots of sources and most can push that data into various warehouses, the tools take different approaches to solving the problem. Generally, we've found that most data pipelines are one of two types: data syncers or data transformers.
Data syncers have streamlined setups and workflows aimed at helping you quickly move data from one source to another. The UI for these pipelines tends to be focused on showing the status of these syncs, and target simplicity over functionality. When we were looking, Blendo, Fivetran, Segment, and Stitch fell into this category.
(Update: Though all of these products still exist, Fivetran and Stitch are clear leaders in this specific category)
By contrast, data transformers not only sync data, but can also transform and enrich it. These products often have more involved applications, more complex pricing structures based on compute time or data throughput, and tend to support a more diverse set of data sources, including APIs, log files, S3, and other unstructured sources. By offering a transformation layer, they provide more functionality than the syncers, but are often more expensive, both monetarily and in setup and support costs. At the time of our evaluation, Alooma, DataVirtuality, Etleap, Keboola, and Xplenty were the main contenders here.
(Update: All of these products also exist; however, they’ve evolved in different directions and overlap less than they did in 2017.)
Figuring out which type of vendor you need is the first step to figuring out which specific vendor to choose. If you're mostly interested in moving data out of third party sources and into a centralized warehouse, you probably want a syncer. However, if you're working with massive log files or generic file stores like S3, a transformer might make your analysis more efficient.
When we started to look for a ETL vendor, we thought our evaluation would come down to two things: supported integrations and price. Find the vendors with the right integrations and buy the one with the lowest cost.
We knew from the outset that we were leaning toward syncers rather than transformers because most of our data was in existing databases or third-party sources and just needed to be more accessible. Moreover, to make debugging our pipelines easier, we didn't want the tool to do a lot of heavy transformations (which we prefer to handle in the database directly). As such, we didn't want to pay the price—whether financially or mentally—for a complex ingestion tool if we were using it as a simple pipe.
But as we dug deeper, we realized that beyond syncing data, our needs were far more complex. Here are the additional criteria we used to evaluate vendors, ranked from most to least important (to us). Your priority order may shake out differently.
- Security: Mode meets strict security and compliance standards such as HIPAA. Though “security is everyone's first priority” at nearly every company, at Mode, it's legally bound to be so. As such, we have non-negotiable security requirements.
- Data quality: The data itself must be fundamentally trustworthy. Mappings can't be garbled and rows can't be dropped. Our chosen ETL would sit behind data that drives product, marketing, and sales decisions. Triggers for reengagement and sales emails would be built on this data. Compensation plans would be based on this data. While we'll surely make the wrong decisions from time to time, or occasionally send the wrong email to the wrong lead, we want these errors to be of our own making, not because data was missing or incorrect. Moreover, poor data quality is particularly dangerous because you rarely recognize it for a long time. Not having an ETL tool is a problem; having an ETL tool that drops a nearly-but-not-quite random 5% of your data is dangerous.
- Data completeness: Not all third-party integrations are the same. One tool may pull daily ad interactions from Facebook, while another might only get weekly data. Or one tool provides details on Salesforce opportunity histories while the other only shows the current state of your opportunities. When it comes to data completeness, it's not a question of whether or not a tool is successful in its mission of moving data around, but whether or not the tool syncs the thing we want. For example, a tool might purposely omit UTM parameters from Facebook ad data, an oversight we'd find frustrating when tracking campaign performance. At minimum, we wanted to make sure the tool pulled the data we needed (and generally, the more data it provided, the better).
- Data usability: The data, not the web application, is the primary way you'll interact with your ETL tool. This is the true surface area of the product. If it's painful to query—even because of “simple” problems, like table and column names that are hard to understand, hard to type, or aggressively abbreviated—you'll feel this pain on a daily and hourly basis. Tools that sync extraneous data, like multiple timestamps with little clarity about how they're different, can also make data hard to work with, especially for people less familiar with the schema.
- Support: Regardless of how much you emphasize data quality, the tool will break from time to time. Something will happen to your network configuration and it won't connect. You'll see some data that doesn't look right. You'll hit rate limits with a third-party vendor and won't be sure how to deal with it. Because our data is operationally important, quick response times help put us at ease.
- Integrations: This fell further down our list than we expected, in part because no tool had the full set of integrations we needed. Rather than evaluating tools based on what they had, we instead evaluated on what we thought they'd have soon. We got a sense of roadmap from talking to the vendors, from seeing which vertical they currently supported (primarily e-commerce? SaaS services?), and from getting a sense of how quickly they'd add new integrations.
- Price: This matters for obvious reasons, but not necessarily in obvious ways. The pricing structure matters more than the list price. In general, we're leery of vendors that set prices based on the amount of data they process. We knew that, as we grew, we'd gather a lot more data (potentially 10x a year or more), both from adding more sources and from logging more detailed information from those additional sources. We didn't want a tool that would discourage us from logging data, from bringing in data from more third parties, or from updating our data as frequently as we'd like. This discourages us from capturing the data that could help us make better informed decisions, something that just doesn't sit right with us.
- Additional features: This is a bit of wild card. Some features could be game-changers; others could be insignificant. In our case, we were particularly interested being notified when sync goes awry and in API functionality that would let us easily connect internal tools. While it wasn't a concern for us, we imagine that the ability to connect with Google Sheets or import unusual data sources is something that many companies would find attractive.
- Application surface area: The application itself didn't matter that much to us. If the tool does a good job syncing data how you want it when you want it, you won't spend much time in the actual application. Though a better UI would be a plus, this was a dimension we were comfortable trading down on.
- Setup: For the same reasons as the application surface area, we didn't put a lot of emphasis here. A 10x worse setup process takes 100 minutes instead of 10. Debugging a single data quality problem could take all day.
Ultimately, we chose Etleap, Segment, Fivetran and Stitch.
Despite being a transformer, we chose Etleap as our primary ETL provider because they excelled in the top five dimensions we evaluated. They met the required security bar; they provided reliable data (which we can vouch for and have heard from other customers of theirs); they had the most complete data; and they have excellent, quick support. Though their data wasn't quite as usable as that from other tools, they had a fairly limited integration list, and the application isn't as clean as others, we were happy to make tradeoffs on these features for the areas they were great at (and they've made big strides in these areas since we chose the platform).
We also doubled down on Segment in part because we were already using their core product (the single tracking API) and because they supported a couple of critical integrations that Etleap didn't. It was as strong as any of the other products in the first group of tools, so we added their ETL service to our plan.
Finally, Fivetran offered the largest list of integrations, some of which we couldn't find elsewhere. We brought them on as a complementary tool to Etleap and Segment.
When we completed our evaluation, Stitch had just launched and their open source project, Singer, wasn't yet available. We're huge fans of what the folks at Stitch are building. They now support more than 65 integrations, the most of any ETL vendor we’re aware of, and Singer makes it easy for anyone to add more if your data lives in a product that is not very widely used. They doubled down on data completeness, and rolled out support for additional warehouses including BigQuery and Snowflake. Their application also provides a lot of options for controlling which tables and columns to sync. We're keeping an eye on Stitch as our needs continue to evolve.
The most important takeaway here, however, isn't the vendors we chose, but how essential use-case was when evaluating ETL vendors. Though we didn't set out to find a transformer, we ended up with one because it handled the vagaries of our use-case better than the other vendors we looked at. At the same time, the syncers we chose offered integrations that were more relevant given our data sources. Using them in combination provides us the functionality we need, at least for now.
Likewise, your company's use-case should drive your evaluation when choosing an ETL tool. Start by figuring out whether you're looking for a syncer or a transformer, then dive into the details of what you really need. Opting for the more complicated UI of an ETL platform that focuses on data transformation is worthwhile if you're working with a lot of unstructured data or if you're looking to do a lot of data enrichment. If you prefer to build most of your analysis elsewhere—say, in a platform like Mode, perhaps?—a tool that focuses on data syncing should serve you just fine.
April 2021 update: Where are we now?
If we were to buy an ETL tool today, we’d overwhelmingly follow the same framework outlined here, with one minor change to our evaluation criteria: we’d also assess the tool’s syncing speed. The ETL tools discussed here usually load data into warehouses incrementally, only updating records that have changed since the last sync. When source systems change, however, you sometimes have to run full refreshes. This isn’t that common, but it’s not a rare occurrence either. If the ETL tool takes a long time to do this, your entire data pipeline, including the reports that people need to make day-to-day decisions, can get backed up.
More generally, because of changes in the market, even though our evaluation would look similar, we’d make (and have made) different choices than those discussed here.
First, Snowflake dramatically lowered the cost of data storage relative to Redshift, making it more affordable for us to write as much raw data as we want into our warehouse. Second, as dbt has matured, there’s less need for an ETL tool to be responsible for any transformation. And finally, Stitch and Fivetran have become the clear market leaders among the syncing tools.
Because of this, our data stack is now made up of Snowflake, dbt, Stitch, and Fivetran (we use both because neither supports all the sources we need). We still use Segment, but only to log and record events, and not to sync data from third-party applications into our warehouse.