Jasmine Tsai shares lessons from building and leading Clover Health’s data platform.
Jasmine Tsai got her first taste of data engineering through a common gateway: she was tasked with a data project as a software engineer. Tsai was part of a team that rewrote Change.org’s non-profit subscriptions system from scratch as a stream processing system. This project introduced her to working with data problems.
Years later, after moving into data engineering full time, Tsai has emerged as a leading voice in the community, especially on the subject of healthcare data. She's now Director of Engineering for Clover Health's Data Platform, and is known for sharing her experiences through blogs as well as speaking engagements at industry events like WrangleConf.
We sat down with Tsai to discuss her experience growing into the field of data engineering and building Clover Health's data platform, and to get her take on what the future holds for data engineers and their teams.
The Spectrum of Data Flexibility
When Tsai first started taking on data engineering problems at Clover, in 2016, she found that the landscape of tools available was ill-suited to her needs.
On one side of the spectrum, there are the legacy data systems built for the healthcare industry. Data in the healthcare industry is notoriously fragmented, reflecting the extremely decentralized nature of the industry itself. Healthcare datasets are often so unwieldy that sophisticated analysis is nearly impossible. To handle this complexity, conventional data systems in the healthcare industry are highly structured and inflexible. This creates a heavy administrative burden on data teams and makes it hard to answer questions quickly.
On the other side of the spectrum, there are new-school tools built for (and by) the current generation of software startups. These include things like ETL or data-pipeline-as-a-service tools. These services were built to analyze a relatively narrow set of data types. They're fast and flexible, but they can't handle the immensely fragmented data that Clover Health deals with on a daily basis.
“We found that there were either graphical, enterprise-focused, old-school tools, with highly-defined data sources, or fast-moving Silicon Valley data engineering tools, with required methodology that didn't fit what we needed,” Tsai said.
Tsai and the Clover team realized that for their system, they would need to be very selective about using tools from either side of that spectrum, and would have to build their own tools to fill out the rest.
“We had to take the assumptions and tools from each end of the spectrum and decide for ourselves which ones we should actually use,” Tsai said.
They incorporated some of the structure of the old-school healthcare systems, so that their new toolset could handle a vast variety of data types. But they built with the aim of helping the data platform team stay flexible and agile.
“Because we have so many complex data sources, we need to manage the ingest stage really, really well. The enterprise world has tools to do this, but they're too slow-moving for us,” Tsai said. “So we had to build our own ingest system that takes inspiration from more modern software, like actually using code as the UI, but that still incorporates a lot of the structured mentality of the old-school tools.”
What resulted from this was a system of Clover's own making, with one foot in the old world and one foot in the new.
“We had to define our own view of data engineering entirely,” Tsai said.
For other companies inadequately served by the data tools available to them, Tsai recommends proceeding carefully before building their own.
“Don't assume every data engineering problem is the same. Imagine the ideal landscape of what questions you actually want to answer with your data before you make any tooling decisions. Determine how much of that functionality can be achieved through tweaking existing tools. Do as much as possible by fitting those tools to the shape of your problem. Only then, when you see where real gaps exist, should you consider building your own thing to fill those gaps.”
The Spectrum of Data Skillsets
The data team at Clover has grown significantly since Tsai started there, and she's grown into a leadership role with it. As the field of data engineering has evolved, so have the team's needs and structure.
Just as there's a spectrum of data tools, there is a similar spectrum of data skills that Tsai has had to navigate and manage in order to create a successful team at Clover. On one side of the spectrum, there's infrastructure engineering: all the work that goes into building the foundations of an organization's data apparatus. On the other, there's cutting-edge research in subjects like AI and ML. In the middle of the spectrum, there are practitioners like embedded analysts and product-focused data scientists. These individuals are proficient in both how data is acquired and delivered, and how it is analyzed and used.
At its edges, this spectrum is widening. The value of specialists who can bring cutting-edge expertise to bear is growing. Infrastructure is getting more complex. More resources are being dedicated to the development of data tools, which in turn means more specialist knowledge is required to manage a data stack. The volume of relevant data is increasing, as is the complexity and variety of data sources. It's no longer a rudimentary challenge to abstract away pipeline complexity. As Tsai says, “Data infrastructure is becoming way more hard-core.”
But in the middle of the spectrum, the distinctions are blurring. Data engineers are learning more about analytics so they can create better pipelines. Analysts are learning more sophisticated data science techniques to deliver better insights. Data scientists are joining engineering teams to integrate machine learning into actual products. And as a leader helping these interdisciplinarians define their careers, Tsai has found there isn't a clear blueprint for managing generalists.
“Hybrid roles are very valuable, but that value is hard to define. They don't fit in the data science track and they don't fit in the engineering track,” Tsai said. “We've been discussing a new track internally called Data Insights, which would fall somewhere in between. These people tend to have a lot of product insights, and they can prototype really quickly. It's not necessarily machine learning, it's not necessarily infrastructure; it's more about the value you can generate through breadth and flexibility.”
Tsai has used this spectrum to help plan the growth of the data team at Clover, and to help her team plan the growth of their own careers. For other organizations looking to build out their data teams with this framework, Tsai recommends first considering exactly what you want data to do for your business.
“If a product has data at the core of its thesis, you'll definitely need two specialists, like an optimization-oriented data engineer and a data scientist,” Tsai said. “On the other hand, if data is going to be used to help improve other facets of the business, and it's not going to be the core of the product, then it's probably best to start with a hybrid role that can get both engineering and analytics off the ground.”
The Convergence of Engineering Disciplines
Tsai believes the future of data engineering is intertwined with the future of all engineering. This is because many of the biggest opportunities for the data engineering field in the near future will be in areas where data engineering overlaps with other fields, especially software engineering.
“Sometimes it can feel like data engineering is on the fringe of engineering. There are great tools being developed in this field, but then you see back-end engineers that are unfamiliar with data engineering that will set out to build their own things for data problems that have already been solved. Somehow these communities are just not well connected,” Tsai said. “And the same thing happens with data engineering not borrowing enough from other fields. A lot of the time, complex pipelines could be made a lot simpler with an application layer, something that we could learn just by taking a look at the stacks in web applications.
“I think the future of data engineering is going to rely on moving the community closer to other back-end engineering streams. We need to connect the conversations.”
As long as data engineering is considered a niche specialty, Tsai says, the knowledge divide will remain. But data is relevant in every facet of software, and therefore every facet of software engineering could benefit from more cross-pollination with data engineering.
“Take application engineering, for example. The task queues you would build into an application are basically all mini data pipelines. It's a problem when data processing is considered a specialty skillset, even though there are tons of places where it can be applied. Having some basic data processing skills will help anyone.”
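To make that comparison concrete, here is a minimal sketch of our own (not Clover's code) showing how an ordinary application task queue already has the extract, transform, load shape of a small data pipeline. The task format, the field names, and the helper functions `extract`, `transform`, `load`, and `run_worker` are all hypothetical, chosen only for illustration.

```python
import queue


def extract(task):
    # Pull the raw payload out of a queued task (hypothetical task shape).
    return task["payload"]


def transform(raw):
    # Normalize the record: tidy the email, convert the amount to integer cents.
    return {
        "email": raw["email"].strip().lower(),
        "amount_cents": int(round(float(raw["amount"]) * 100)),
    }


def load(record, sink):
    # Append to an in-memory list; a real worker would write to a database.
    sink.append(record)


def run_worker(tasks, sink):
    # Drain the queue, pushing every task through extract -> transform -> load.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break
        load(transform(extract(task)), sink)
        tasks.task_done()
    return sink


def demo():
    # Two hypothetical signup tasks, as an application might enqueue them.
    tasks = queue.Queue()
    tasks.put({"payload": {"email": "  Ada@Example.org ", "amount": "12.50"}})
    tasks.put({"payload": {"email": "bob@example.org", "amount": "3"}})
    return run_worker(tasks, [])
```

Swap the in-memory queue for a message broker and the list for a warehouse table, and the same three steps describe what many data pipelines do at larger scale.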
But as organizations become more sophisticated in their use of data, data engineering will become a bigger priority, and more people will enter the practice from adjacent fields. With them will come new and valuable perspectives.
“We need to question all our assumptions. The more people with varied expertise who come in and ask, 'Why are you doing it this way?', the better.”
Who Should Be a Data Engineer?
If one of the biggest opportunities for data engineering to grow comes from incorporating new perspectives, it'll be vital to attract as many new people to the field as possible. So we asked Tsai what she would tell someone considering the data engineering practice for themselves.
“I would say, you don't have to have hands-on data science expertise to get started with data engineering. You can start with just software engineering skills and take on a data science project to learn about the problems you'll need to solve as a data engineer,” Tsai said. “If you enjoy designing cool systems, data engineering is awesome. You get to eliminate repetitive work. You get to make everyone else's work less painful and more productive. You have the opportunity to make huge scale improvements that impact large groups of people.”
But it's not all just system-building.
“Some of the best work you do will not start with a system-based approach. Just set out to answer a question, or solve a problem,” Tsai said. “There is always more than one way to accomplish a goal.”