April 17, 2023•Updated on July 24, 2023
Without data governance, there is no self-serve analytics.
Data governance builds trust and a data-driven culture.
There are three types of data governance: 1) data quality, 2) business logic governance, and 3) data security
Every organization should have both a global and a local governance layer (read more about that in the second half of this article)
In this article, we're unpacking data governance—what it means, why it's important, and an approach to a data governance strategy that will help any company scale.
Data governance, as the name suggests, is a system of government for data. Though “government” is also hard to define, we all have a good sense of what it means—it’s a collection of institutions and laws that put guardrails around society. Data governance is similar. Anything that keeps data more organized and under control—from technical applications that test if data is accurate to processes that control which dashboards are marked as trustworthy—is part of the “data government.”
Data governance is important because data is important.
“Is this data up to date?” “Is this lead defined the same in my report as yours?” “Is this salary report only viewable by the finance team?” All of these questions lead back to data governance—the processes and systems we use to trust data, and get on the same page.
Data is only as useful as it is accurate, consistent, and discoverable—and without well-established data governance practices, it’s rarely any of those things.
Businesses need data governance so that data can be trusted and used throughout the organization to inform decisions.
Without data governance, there is no self-service analytics.
Though data governance can take a lot of different forms, the overarching goal is the same: To make sure that an organization has reliable, secure, and well-understood data.
The benefits of data governance practices include:
Maintaining data quality. That data accurately measures what it’s supposed to be measuring.
Maintaining data security. That the right people have access to the right data.
Maintaining data usage. That people throughout the organization continue to trust and rely on data.
Sometimes, fast-growing organizations struggle to embrace data governance—favoring speed over process. But it can be helpful to remind stakeholders that data governance processes will save your coworkers, who are already spread thin, time and headaches in the future.
Here are a few examples of specific outcomes that data governance aims to ensure:
The data team and finance team calculate revenue in the same way.
Some people have access to a finance dashboard, and some people don’t.
Only the data team has access to the database.
The logic that calculates sales commission payouts is always correct.
Data is well organized and discoverable in a data warehouse.
Every report that references signups uses the same definition of a signup.
Eventually, all poorly governed data leads to the same place: People lose trust in data. They no longer trust that it accurately reflects what’s happening in the world; they no longer believe it can help them make decisions; and they no longer trust that it’s secure and can be safely shared.
Daniel Sternberg, Head of Data at Notion, talked about how to prevent losing trust in data in our webinar: Small Teams, Big Impact, which you can watch here.
Before this happens, a number of other problems often emerge first:
Data changes unexpectedly. Historical values change suddenly and unexpectedly, and data teams are frequently asked to validate that some dashboard is still right.
Dashboards don’t match. Two reports that are supposed to show the same metric are different. And even small differences matter—if someone tells me that their birthday is on March 8th, and tells you that their birthday is March 9th, we wouldn’t say, “Well, it must be one or the other, and both days are pretty close anyway.” We’d instead wonder if the person was lying, and couldn’t be trusted at all.
People start questioning if reports and dashboards are accurate. Instead of trusting data tools by default, people are skeptical by default.
People create their own versions of everything. When people can’t trust the dashboards they’re given, they’ll start figuring out their own solutions, from creating their own dashboards to managing their departments with their own Excel workbooks. (Use our Metrics definition template to get everyone on the same page.)
People argue about what’s happening, rather than what to do about it. Meetings that are supposed to be about making decisions devolve into disputes about what’s actually happening and who has accurate data.
Data gets leaked to people who aren’t supposed to see it. Someone sees internal finance or HR data that they aren’t supposed to see; a customer is sent another customer’s data by accident.
People can’t find the data they need. Data can also be too locked down. If people have to jump through too many administrative hoops to find what they need, they’ll eventually stop looking.
Every organization has these problems. As one data leader once told us, “Nobody looks in the mirror and says 'I do governance well.'” But, just remember that every effort adds up to saved time down the road.
If your data team is spending more of their time trying to make up for the challenges above, and less time working on strategic business problems, it's likely a sign that they need to pause and invest more in data governance.
Because data governance encompasses so many practices, different folks at organizations tend to care more about different subsets of challenges.
Data engineers tend to think about data quality; analysts and analytics engineers will often emphasize business logic governance; and IT leaders usually stress data security and access. Let’s take a closer look at each.
Data engineers, who are responsible for sourcing raw data and building data pipelines, are often primarily focused on data quality. High-quality data accurately measures what it says it’s measuring, in a number of ways.
High-quality data means:
It’s correct. The values—the price of an item, the date of sale, a customer’s name—should be accurate. Inaccurate values can find their way into a dataset for lots of reasons, though the most common are bugs in logging software, or manual input errors, like recording the close date of a new contract incorrectly.
It’s complete. A table of new customer orders should include one record for every order; no orders should be missing. (And there shouldn’t be duplicates either.)
It’s clean. Order records shouldn’t have missing values; phone numbers should be well formatted; Philadelphia should be spelled one way, not 57.
It’s understandable. A human should be able to look at a table and understand what it means. For example, if a table has four timestamps like `time_id`, `occurred_at`, `created_at`, and `ts`, people are likely to get confused about what each column means. High-quality data would either remove duplicate columns, or use better, more differentiated column names.
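To make these dimensions concrete, here is a minimal sketch of automated quality checks in Python. The records and field names are hypothetical; in practice these checks would run against warehouse tables, often through a testing framework.

```python
import re

# Hypothetical order records; in practice these would come from a warehouse query.
orders = [
    {"order_id": 1, "city": "Philadelphia", "phone": "215-555-0100"},
    {"order_id": 2, "city": "Philadelphia", "phone": "215-555-0101"},
    {"order_id": 2, "city": "Philly", "phone": None},  # duplicate id, variant spelling, missing phone
]

PHONE_FORMAT = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def quality_report(records, canonical_cities=frozenset({"Philadelphia"})):
    """Run simple completeness and cleanliness checks on order records."""
    ids = [r["order_id"] for r in records]
    return {
        # Completeness: no record should appear twice.
        "has_duplicate_ids": len(ids) != len(set(ids)),
        # Cleanliness: no missing values, consistent spellings, well-formatted phones.
        "missing_phones": sum(1 for r in records if not r["phone"]),
        "nonstandard_cities": sorted({r["city"] for r in records} - canonical_cities),
        "malformed_phones": sum(
            1 for r in records if r["phone"] and not PHONE_FORMAT.match(r["phone"])
        ),
    }

report = quality_report(orders)
```

Checks like these catch problems at the source, before a broken dashboard erodes anyone's trust.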
Though data engineers are often responsible for producing high-quality data, data quality is important to everyone, from the data analyst who’s trying to answer a question for a business partner, to the CEO who needs to rely on data to understand how their company is performing.
Data quality alone, however, isn’t enough to make data usable and trustworthy. In most businesses, raw data needs to be transformed into metrics. Metrics that we use every day—ARR, a lead, a website visit—all had to be defined from raw data.
For example, suppose a company wants to track and report on how many users visit their website every week. Though this seems like a simple metric, there are lots of details that are important to clarify:
How do we define a user? Does the person have to be logged in? If someone visits the site from multiple devices, do they count as multiple users?
How do we define a week? Does it start on Monday or Sunday? Which time zone do we use?
How do we define visiting the website? Does it count if they use the company’s mobile app? Are users who programmatically access the website via its API considered visitors?
To create a “weekly website visitors” metric, we have to answer all of these questions, and then encode those answers somewhere such that all of our reports and dashboards compute those metrics correctly. Those answers and that encoding is typically what’s referred to as business logic.
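Encoded in code, those decisions become explicit. Here is a hedged sketch in Python; the event shape and the specific choices (logged-in users only, no API traffic, Monday-start weeks in UTC) are illustrative assumptions, not the only valid definition.

```python
from datetime import date, datetime, timedelta, timezone

def week_start(ts):
    """Truncate a UTC timestamp to the Monday date that starts its week."""
    return ts.date() - timedelta(days=ts.weekday())

def weekly_website_visitors(events):
    """Count distinct visitors per week.

    Each event is (user_id, utc_timestamp, channel). Illustrative choices:
    - count only logged-in users (user_id is not None),
    - include web and mobile-app traffic but exclude API access,
    - bucket weeks Monday-to-Sunday in UTC.
    """
    weeks = {}
    for user_id, ts, channel in events:
        if user_id is None or channel == "api":
            continue
        weeks.setdefault(week_start(ts), set()).add(user_id)
    return {week: len(visitors) for week, visitors in weeks.items()}

events = [
    ("u1", datetime(2023, 4, 17, 9, tzinfo=timezone.utc), "web"),  # Monday
    ("u1", datetime(2023, 4, 18, 9, tzinfo=timezone.utc), "app"),  # same user, same week
    ("u2", datetime(2023, 4, 23, 9, tzinfo=timezone.utc), "web"),  # Sunday, same week
    (None, datetime(2023, 4, 17, 9, tzinfo=timezone.utc), "web"),  # anonymous: excluded
    ("u3", datetime(2023, 4, 17, 9, tzinfo=timezone.utc), "api"),  # API access: excluded
]
counts = weekly_website_visitors(events)
```

Every branch in this sketch is a business decision; change any one of them and the metric changes, which is exactly why the definition needs to live in one governed place.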
Use this template to define metrics with stakeholders.
Even if companies have high-quality data, it’s important that everyone in the organization consistently uses the same business logic when computing the same metrics. If they don’t, dashboards that are supposed to show the same thing won’t match and people won’t trust data.
When analysts and analytics engineers, who are often responsible for defining metrics and building dashboards, talk about data governance, this is often the kind of data governance they care most about.
Data governance can also refer to the layers of permissions and access controls that manage who can see what data. For example, a company may keep data on employees’ salaries, and only people in HR should have access to it. Public companies have business performance data that could be used to pre-emptively—and illegally—trade shares; it’s important to know and control who has access to this data to prevent insider trading. And many companies want to make sure employees can’t see their customers’ personal information unless absolutely necessary.
Security and IT teams are usually responsible for making sure that companies handle these cases correctly. If a Chief Security Officer is concerned about data governance, they’re likely worried about someone having access to data they shouldn’t see.
Each of these dimensions of governance is important. For the remainder of this post, we’ll focus on the type of governance that data teams spend the most time trying to get right: Business logic governance.
To implement data governance that can meet all of your company's data needs, you need a combination of both global and local layers of governance. This provides value and coverage across a variety of data use cases and takes into account governance across multiple layers of your modern data stack.
Global governance is any form of governance that sits outside of your BI tool, enabling it to be applied anywhere in your modern data stack. For example, keeping logic in dbt.
Local governance, on the other hand, is any form of governance that happens locally in specific tools in your stack. For example, defining temporary metrics in your BI tool.
For a data governance strategy, we recommend creating two layers of business logic governance: A global layer, where data is cleaned, prepared, and core key business metrics are calculated; and a local layer that’s used for ad hoc questions, exploratory analysis, and infrequently used metrics.
A global governance layer should keep all of your business logic governed in a single, central place—without requiring all of your data to be consumed in one place too. In other words, a global governance layer should ensure that governed data can be accessed anywhere (it’s application agnostic).
New tools are starting to make this possible, with dbt being the most popular. dbt allows data teams to transform data inside of their data warehouse. Because most data tools can connect to data warehouses, that means that tables that dbt cleans and models can be used by every other tool that connects to the warehouse—and there we have global governance, with flexible access.
This route could give us the best of both worlds: Easy governance and flexible access.
dbt also released a semantic layer that allows teams to define how to join tables together and how to compute metrics on top of those tables. This pushes the abilities of dbt’s global governance layer further, enabling data teams to also globally manage key metrics.
Tools that integrate with dbt’s semantic layer (learn about Mode's integration) can read the business logic stored in dbt directly, ensuring that every metric is calculated consistently and reliably across time and across tools.
It's infeasible to canonize every metric the business will need in a global layer.
Imagine that a data team defined all of its marketing leads in the global governance layer. There’s a table called (creatively) `leads`; it has one row per lead, and includes a bunch of information about each person.
The marketing team is planning a new campaign, and wants a list of all the leads that bought an item from your store and returned it in less than seven days. It’s an easy ask: All an analyst needs to do is join the leads table to the purchases, filter the purchases to those that were returned in seven days, and remove all the leads that didn’t have such a return.
However, as simple as it is, it doesn’t make sense to put this logic in the global governance layer. This logic is for one temporary campaign; we might need it for a few months, but it would clutter up our database to add a new table called `leads_who_returned_purchases_in_seven_days` or a column on the leads table called `purchases_returned_in_seven_days`. It’s also impractical to update a centralized governance layer like dbt every time the marketing team asks for a new list of emails.
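For illustration only, the analyst's ad hoc query might look something like this, sketched against an in-memory SQLite database. The table and column names are assumptions; a real version would run against the warehouse's governed `leads` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical governed tables, stood up locally for the sketch.
conn.executescript("""
    CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE purchases (
        purchase_id INTEGER PRIMARY KEY,
        lead_id INTEGER,
        purchased_on TEXT,
        returned_on TEXT  -- NULL if never returned
    );
    INSERT INTO leads VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO purchases VALUES
        (10, 1, '2023-04-01', '2023-04-05'),  -- returned in 4 days
        (11, 2, '2023-04-01', '2023-04-20'),  -- returned after 19 days
        (12, 2, '2023-04-10', NULL);          -- never returned
""")

# Join leads to purchases, keep only purchases returned within seven days,
# and drop leads without such a return.
rows = conn.execute("""
    SELECT DISTINCT l.lead_id, l.email
    FROM leads l
    JOIN purchases p ON p.lead_id = l.lead_id
    WHERE p.returned_on IS NOT NULL
      AND julianday(p.returned_on) - julianday(p.purchased_on) <= 7
""").fetchall()
```

Note that the query builds on the governed `leads` table rather than redefining what a lead is, which is the point of the next section.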
This highlights the limitations of global governance layers. While it’s best to canonize core metrics like leads and revenue in a central layer, it’s infeasible to canonize every metric the business will need in its lifetime.
Most teams recognize this and simply accept that some work will be largely ungoverned. The urgent questions from the CEO will get answered in one-off SQL queries and emails. Business leaders will export data from a dashboard and play with it in Excel. These types of behaviors are typically seen as inevitable and, unfortunately, inherently ungovernable.
We believe that this type of work can be governed—it just needs to be governed in a different way.
Local governance refers to layers of business logic that exist in a single tool and can’t be easily accessed by other tools. For example, data models configured directly in a BI tool and queries in an analytics platform like Mode are both forms of local governance.
Local governance should be used for temporary metrics
Local governance should be used for temporary metrics or metrics that don’t need to be rigorously standardized or accessed by everyone in the organization.
Local governance should be used for ad hoc work
Local governance should also be used for ad hoc requests. Asks for temporary datasets—like the leads list example above—or one-off questions that help a business leader make an important decision shouldn’t be governed in the same way as core KPI dashboards. They should be governed locally, in the tool you're using.
This is because with ad hoc work, flexibility and speed are paramount, and rules and procedures that control how people work with data are often directly opposed to these goals.
Still, when done right, good governance can accelerate ad hoc work. If an analyst is asked to help figure out why retention rates are falling, their analysis should start with the governed concept of retention rates. If analysts have to re-create the retention rate metric first—say, they have to write a SQL query that produces the same result as a dashboard in a BI tool—they end up wasting valuable time.
Local governance should happen on top of global governance
That’s the first key to local governance—ad hoc analysis should happen on top of global governance layers, not independent of them. If a data team has carefully defined key entities and metrics somewhere, analysts should never have to recreate those entities. They may want to expand on them and combine them with new information, as in the example with the leads table; but they should never have to rebuild the leads table from scratch.
Furthermore, one of the hardest parts about governing ad hoc analysis is that it’s often scattered haphazardly around a company: saved SQL queries here, charts pasted in emails there, Excel files everywhere.
This makes ad hoc analysis hard to find and replicate. This pattern happens all the time: An analyst helps a team make an important decision and saves their work in local files. Six months later, the team needs to make a related decision and asks for an updated analysis. The analyst recently left the company, so a new analyst has to recreate the old work. In doing so, they often use slightly different business logic (remember all the ways that a company could define weekly website visitors?), which creates the same distrust that all data governance problems create.
To solve this, ad hoc work should be well-organized, discoverable, and shared in central repositories. That’s the second key to ad hoc governance—no critical business decision should be made based on analysis that lives across multiple files or on individuals’ computers.
Managing business logic used to be relatively simple: companies would buy a BI tool and could define metrics directly in it.
Today’s data tooling is considerably more complicated, and governance now has to reach many more tools. Data teams often source data from dozens of different sources and load it into one or more analytical warehouses. Transformation tools like dbt and Airflow manipulate and aggregate that data. Data is then consumed by dozens of additional tools, including BI tools, analytical applications, and data science and machine learning platforms. Data is also often written back into operational tools to help automate things like marketing email campaigns.
This complexity presents a problem: It’s very easy to define business logic in every tool, but this kind of duplication often leads to mistakes where things don’t match. Moreover, as logic changes—e.g., the sales team introduces a new segment, or the finance team updates how it recognizes revenue—every tool has to be updated. This is next to impossible to get right, even for the best teams. The combination of global and local governance described above helps teams manage this complexity.
Because data governance is so intertwined with how people work, good governance, like legal governance, can’t be managed through a single tool. Data leaders need to establish good governance processes as well.
For example, to ensure that the right people have access to the right dashboards, data teams may set up permission groups in an identity provider and apply those permissions to dashboards. This requires both a technical solution—permissions in the dashboard tool—and a process to make sure new employees are added to the right groups.
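The technical half of that setup can be sketched in a few lines. The group and dashboard names below are hypothetical; in practice, the groups would be synced from an identity provider such as Okta and enforced by the BI tool itself.

```python
# Hypothetical permission groups, as they might be synced from an identity provider.
GROUPS = {
    "finance": {"ana", "ben"},
    "data-team": {"carol"},
}

# Each dashboard lists the groups allowed to view it.
DASHBOARD_ACL = {
    "salary-report": {"finance"},
    "signups-overview": {"finance", "data-team"},
}

def can_view(user, dashboard):
    """Return True if the user belongs to any group allowed on the dashboard."""
    allowed = DASHBOARD_ACL.get(dashboard, set())
    return any(user in GROUPS.get(group, set()) for group in allowed)
```

The code is the easy part; the process half, making sure new employees land in the right groups, is what keeps it correct over time.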
Similarly, making sure you always compute sales commissions correctly requires both a technical solution for computing commissions and a procedural solution for making sure the data team is aware of if and when the sales team adjusts quotas.
A company’s governance is only as good as the people and processes that administer it. No tool can solve data governance alone, but they can help. Good tools—and tools used in the right ways—can complement one another, and make it easier to follow good governance practices.
Mode, a modern BI tool that centers data teams, supports the governance model outlined above in three ways:
Mode integrates with global governance solutions. Most BI tools have semantic layers built in, and require data teams to use those layers to get the most out of the BI tool. Mode, by contrast, integrates directly with dbt. Metrics that are defined in dbt’s Semantic Layer are automatically accessible in Mode, allowing teams to get the full benefits of a BI tool while taking advantage of a global governance layer.
Mode includes a lightweight—and optional—governance layer. While dbt is great for defining core metrics and key business entities, it’s not ideal for creating datasets like the leads example. Mode includes a lightweight governance layer for creating reusable datasets, which can be used to extend a governance layer without adding clutter to core models.
Mode centralizes ad hoc analysis. Mode’s code-first workflows enable data teams to do exploratory, one-off analysis directly inside of a BI tool. Because everything created in Mode is stored in a central, searchable repository, people can always return to old work and see the logic behind it. There’s no need to search through emails and Slack to find prior analyses, and analysts never have to recreate old work from scratch.
Ultimately, people’s willingness to use data depends enormously on trust. Even simple data governance practices can go a long way toward getting everyone in your organization to use data.
When people trust data, they'll use it—increasing the value of your company's data investment.