January 5, 2023
NaN minute read
Do we really need another semantic layer?
This was the question we got from a handful of customers after we announced Datasets last week. In the simplest terms, Datasets are tables, defined by queries in Mode. When a data team creates a Dataset, it creates an automatically-updating table in Mode that anyone can build visualizations on top of. Rather than every report needing to be powered directly by SQL, reports can now be powered by Datasets—no new query required.
Offering a means to create dashboards without SQL has been our most requested capability for years, so many customers were excited about this new feature. But it also raised an obvious question: How should customers use it with dbt?
Thousands of data teams are already using dbt to define tables in their warehouse; many are now also defining metrics in dbt’s Semantic Layer as well. On the surface, Datasets—SQL queries that define tables—also look like a semantic layer. Does that mean Datasets are incompatible with dbt, or that customers will need to duplicate logic across dbt and Mode?
To say it as clearly as possible: No, and Datasets isn’t a semantic layer. We didn’t build it to be a store for complex business logic, or to be a repository of canonical SQL queries that define key business entities. As long-time believers of dbt’s vision—and the separation between semantic layers and BI tools—we’d encourage all of our customers to use dbt as the place for these things. Instead, we built Datasets to complement dbt in three ways: access, curation, and speed.
Clean and well-modeled data is necessary for any organization to be data-driven—but it’s not sufficient. People around the business need access to that data, including (and often, especially) people who aren’t comfortable writing SQL. This creates an obvious gap: How do data teams expose the tables that they’ve created and curated in dbt to people who want to explore them, but don’t want to do it in a SQL client?
Datasets bridge this gap. Analysts can create a dataset by writing a simple query, and can then share it or put it into a public Collection so that the right teams can access it and use drag-and-drop visualization tools on top. If data teams already have well-defined business concepts in dbt (as models or as recently-proposed “entities”), the query can be as simple as
`SELECT * FROM dim_customers`. In these cases, we think of Datasets as a means for making dbt models explorable, and for allowing data teams to create subsets of core models without cluttering up their dbt projects, and not as another layer to add more business logic on top of what is already centralized in dbt.
Even in the perfect dbt project, there’s not a perfect mapping between dbt’s models and the datasets that business teams want to explore. In some cases, a model in dbt might include far more columns than a business user wants to look at (think of a customer table that includes all of the various fields from a Salesforce account object). In other cases, a team might want to only look at a subset of data, like all of the leads that came in through a holiday promotion. Or, different department heads might want to look at hiring pipeline metrics for their organization but aren’t permitted to see the same data for other teams.
In these cases, it’s not practical or permissible for a Mode Dataset to exactly mirror the customers, leads, or applicants' models in dbt. Though data teams could resolve this by creating slices of each model in dbt—e.g.,
`SELECT * FROM leads WHERE lead_source = "holiday-promo"`—that creates a lot of unnecessary clutter.
With Datasets, data teams can create these subsets directly in Mode without having to overload dbt with every possible slice or permutation. The hardened and complex semantic concepts can still live exclusively in dbt, and Datasets are an easy way to share and curate different cuts with people who need to see and explore it.
Sometimes urgent one-off questions come up, and we have to ship data to a stakeholder quickly. There’s a fire; our holiday promotion was mispriced; the head of marketing needs to see a list of every prospect who got the discount code to see if we need to contact them about the error.
In these cases, we don’t have time to build that dataset in dbt. Nor do we want to; it’s an issue that we (hopefully) will only have to deal with once. But we do want to get the data to the marketing team, so we typically write a SQL query, send the result as a CSV, and let them explore it in Excel.
We also built Datasets to help in these instances. Rather than sending a CSV, an analyst could turn the same query into a Dataset, and the marketing team could build their triage reports directly on top of it. If they needed updated data, or additional columns about each prospect, the analyst can tweak the Dataset, and everything else will automatically update.
In these cases, the Dataset query would likely be more complicated and include true semantic logic. But it’s not permanent logic that data teams want to persist in a governance layer; it’s temporary and meant to be created quickly, without the usual procedural guardrails that most teams require when updating models in dbt.
We’ve long believed that a modern semantic layer would be one that’s independent and universal, and a modern BI tool would be one that integrates with but isn’t dependent on, that semantic layer. We’ve been thrilled to partner with dbt Labs as they work towards the former vision. And for our part, we’re committed to the latter vision—and believe Datasets will be a key part of that collaboration.
If you’re ready to learn more, Join our live webinar event: Don’t Get Lost in the Semantics on January 26th to hear Anna Filippova, Director of Community & Data at dbt Labs, and Benn Stancil, Co-Founder and Chief Analytics Officer at Mode discuss the return of semantic layers, what they mean for the data industry today, and how to unleash their power for your teams in 2023.
Why logical layers matter, and how to use them.
Work-related distractions for data enthusiasts.