June 30, 2023
No one wants to go over data compute, especially when every dollar counts. But when a BI tool is being used across an entire organization, it’s easy for overages to sneak up.
There are ways to prevent it, though. Drawing on the insight that comes with being on the BI tool side of the equation, we’ve compiled tips to help teams stay within their compute bandwidth.
But first, what is data compute? Let's make sure we're all on the same page.
In general, data compute refers to the processing power that tools use in the execution of your queries. Most cloud-based applications track this in one form or another.
In almost all of your cloud warehouses or database tools, more complex queries require more compute (or processing power). Complex queries might scan rows multiple times, perform recursive logic on very large tables, or require joining multiple tables that haven’t been properly indexed with primary keys. Depending on the tool, these operations could be very expensive.
Data compute at Mode
At Mode, every paying customer has a monthly allocation in gigabytes (1 gigabyte is 1 billion bytes) of uncompressed data that can be ingested into Mode to power Modern BI at your organization. Essentially, this is the total amount of data that can be returned by successfully completed queries in reports and datasets during the calendar month.
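To make the arithmetic concrete, here is a minimal sketch of how usage against a monthly allocation could be tallied. The function name, the result sizes, and the 100 GB allocation are hypothetical and chosen only for illustration; the one fact taken from above is that a gigabyte is counted as 1 billion bytes.

```python
BYTES_PER_GB = 1_000_000_000  # 1 gigabyte = 1 billion bytes, as defined above


def allocation_used(result_bytes, allocation_gb):
    """Return (gigabytes used, fraction of the monthly allocation used).

    result_bytes: sizes, in bytes, of the results returned by successfully
    completed queries during the month (hypothetical input).
    """
    used_gb = sum(result_bytes) / BYTES_PER_GB
    return used_gb, used_gb / allocation_gb


# Hypothetical month: three query results of 250 MB, 1.5 GB, and 4 GB
# measured against an assumed 100 GB allocation.
used_gb, fraction = allocation_used(
    [250_000_000, 1_500_000_000, 4_000_000_000], allocation_gb=100
)
print(f"{used_gb:.2f} GB used, {fraction:.1%} of the allocation")
```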
Depending on the packages of your BI tool, data compute overages can result in additional fees.
At Mode, there are operational costs associated with the infrastructure that supports our ability to provide timely, on-demand access to data. Your monthly compute allocation is one of the ways that we track usage in Mode, and depending on your specific plan, data compute overages can result in additional fees.
Depending on the SQL skill of your users, the permissions you grant them, and the culture of data usage in your company, it’s possible that access to data may result in unexpected bandwidth charges—either by a cloud warehouse provider or even your contractual limits in your BI tool.
Actions that can lead to unnecessary compute overages include:
Business users re-running reports that are already up to date
Low-impact reports running automatically
Parameterized reports that pull excessive volumes of data
Poorly optimized query result schemas.
To manage data compute in your organization, it’s best to start with a checklist like this and then adapt it to your users' behavior, your needs, and your business overall.
Minimize the result size of queries that power your most frequently run dashboards.
Restrict query access to your data warehouse connections to users who will be generating queries and datasets that provide relevant business insights in your BI tools. You can use Access Control & Permissions in Mode to accomplish this.
Look for ways to distribute reports or datasets that can be re-used by different teams across an organization. In Mode, you can add more people to different Collections and Datasets, letting them access more reports that might help them find their answer without querying.
Monitor resource usage from your database with your BI tool to keep abreast of how your organization is using your resources. Mode’s Discovery Database enables you to track resource usage & identify the most valuable reports in your organization.
Adjust scheduled reports & datasets with automatable integrations like your BI tool’s API. Mode’s core API and our Webhooks integration can be used to automatically adjust and remove schedules for potentially expensive, low-value query runs.
Mode is an extremely powerful platform for tech-savvy users who aren’t necessarily members of your data team. These users, however, don’t always have the context on your desired use cases, or the experience writing SQL queries, to justify the economic impact that this empowerment creates.
As such, we find that the best way to manage data compute in Mode is to selectively enable users to query your database when they have a business need. There are two features we’d like to briefly highlight to enable this.
Datasets enable your data team to create governed, sharable, reusable datasets that can be given to your users to explore and answer their own questions that aren’t answered by existing reports. We recommend creating Datasets for commonly requested tables and making these available to your users to replace commonly submitted queries.
Each connection that you add to your workspace has a default level of permissions. These permissions can be overridden either by the group a user is in or on a per-user basis. The three tiers are:
Query: Users with this permission level will be able to submit and return queries against this data source connection. They can do this in reports or by creating a Dataset.
View: Users with this permission level will be able to view reports that are powered by this data source connection, as well as view and use datasets that are powered by this data source connection, but they will not be able to submit queries against it.
Restricted: Users with this permission will not be able to see reports or datasets or even the Data Source Connection in their schema browser in Mode.
We recommend that you do not use “Query” as the default level of permissions; instead, use “Restricted” and then assign specific users or groups of users the appropriate permissions. The data team or folks who have critical business metrics needs—and may regularly submit potentially very expensive queries to your cloud warehouse—are the only ones who should have Query access.
For many Mode users, the above recommendations will be a good start. For many data leaders at large and mid-market companies, a more robust strategy may be required to get the most out of your monthly data compute budget.
It’s important to point out that these tools may require additional resources within your organization. To achieve the recommendations below, you may need the following skills:
SQL (Brush up with our SQL School)
One of Python/NodeJS or another scripting language
Access to an external notebook or scheduler (We strongly recommend you do not enter API credentials in a Mode notebook environment)
Optionally, a tool for Webhooks integration like Zapier
Mode’s Discovery Database is a daily-updating resource that tracks all activity by users in your Mode workspace, including query runs and report views. It should be closely monitored when possible.
We recommend either creating a query like this example: Daily Data Usage that will track the amount of data compute used in your workspace daily, or cloning the Database Resource Monitor example report. You may also check your Workspace Stats page, in your Workspace Settings in Mode, for similar insights.
In addition to the above, we have a few relevant endpoints with the Mode API, as well as example scripts that will help you achieve virtually any goal.
Identify frequently-run, but low-traffic reports
You can run a query against the Discovery Database to identify all schedules in your workspace, along with information about the number of views and query run failures that are associated with that report. This option lets you identify infrequently viewed or failing schedules without prohibiting the access you've granted in the organization to your data source connection.
You might find that there's a report in someone’s personal collection that is scheduled to run every 15 minutes but has been viewed only once in the last 90 days. This report might have contributed significantly to your compute usage without providing any value to your organization.
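The scenario above is easy to quantify. The following sketch, which uses only the Python standard library, estimates how many times a schedule fires in a given window and how many views each run earns. The function names are hypothetical, and in practice the view counts would come from your own Discovery Database query.

```python
from datetime import timedelta


def expected_runs(interval: timedelta, window: timedelta) -> int:
    """Number of scheduled runs that fit in the window."""
    return window // interval


def views_per_run(views: int, interval: timedelta, window: timedelta) -> float:
    """Views earned per scheduled run; 0.0 if the schedule never fires."""
    runs = expected_runs(interval, window)
    return views / runs if runs else 0.0


# The example above: a 15-minute schedule, viewed once in 90 days.
window = timedelta(days=90)
every_15_minutes = timedelta(minutes=15)
print(expected_runs(every_15_minutes, window))  # 8640
print(views_per_run(1, every_15_minutes, window))
```

A report that runs 8,640 times for a single view is a strong candidate for a slower schedule or no schedule at all.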
Identify queries that have undesirable schedules
We recommend building a customized Discovery Database query that identifies undesirable schedules (you decide what counts as “undesirable”) and then using some code, like in these example functions, to remove those schedules. This will have the effect of preventing runaway data usage for queries.
For most of our customers, rules like “this report hasn’t been viewed in X days,” or “this report is in a personal collection,” or “the queries in this report fail more than 10% of the time” are appropriate signifiers of a schedule that should be deleted.
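Rules like these translate naturally into a small filter. The sketch below is illustrative only: the `ScheduleStats` fields are hypothetical stand-ins for whatever columns your own Discovery Database query returns, and the 90-day default is an assumed value for “X”. It only flags schedules and leaves the actual deletion to your API script.

```python
from dataclasses import dataclass


@dataclass
class ScheduleStats:
    # Hypothetical fields; map them to the columns your own query returns.
    report_token: str
    days_since_last_view: int
    in_personal_collection: bool
    query_runs: int
    failed_runs: int


def reasons_to_delete(s, max_idle_days=90, max_failure_rate=0.10):
    """Return the list of rules a schedule violates (empty if none)."""
    reasons = []
    if s.days_since_last_view > max_idle_days:
        reasons.append("not viewed recently")
    if s.in_personal_collection:
        reasons.append("personal collection")
    # Guard against division by zero for schedules that have never run.
    if s.query_runs and s.failed_runs / s.query_runs > max_failure_rate:
        reasons.append("high failure rate")
    return reasons


stale = ScheduleStats("abc123", 120, False, 100, 11)
print(reasons_to_delete(stale))  # ['not viewed recently', 'high failure rate']
```

Returning the reasons, rather than a bare yes/no, makes it easier to notify report owners before their schedules are removed.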
An additional example is available here: Delete Schedules using API (an example report that you can clone), which contains a Discovery Database query and some example code in the Mode Notebook. We recommend cloning the report into your workspace and pointing the query at your Discovery Database connection, and then copying the code in the Notebook to your preferred tool for managing scheduled scripts.
Establish more control over the query schedules that people can create
You can implement even finer control over who can create various run schedules by preventing users from creating reports with “15-minute,” “30-minute,” or “hourly” run schedules. You can apply this restriction to all users, to certain users, or to certain connections.
We have an additional example here: Update Schedules Using API (another example report you can clone), which is similar to the above example. You can clone this report, update the Discovery Database query to match your preferred rules, and then modify the example Notebook code to enforce your specific preferences.
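As a rough sketch of what such a rule could look like before it is wired up to the API, the function below decides whether a given schedule frequency is acceptable. Everything here is an assumption made for illustration: the function name, the frequency labels as plain strings, and the idea of an exemption list and a set of restricted connections. It is not how Mode represents schedules internally.

```python
# Frequencies the policy treats as "too fast" (labels taken from the text above).
FAST_FREQUENCIES = {"15-minute", "30-minute", "hourly"}


def schedule_allowed(frequency, user, connection,
                     exempt_users=frozenset(), restricted_connections=None):
    """Return True if this user may keep this schedule on this connection.

    exempt_users: users allowed to create fast schedules (hypothetical).
    restricted_connections: connections the rule applies to;
        None means the rule applies to every connection.
    """
    if frequency not in FAST_FREQUENCIES:
        return True
    if user in exempt_users:
        return True
    if restricted_connections is not None and connection not in restricted_connections:
        return True
    return False


print(schedule_allowed("hourly", "ana", "warehouse"))                        # False
print(schedule_allowed("hourly", "ana", "warehouse", exempt_users={"ana"}))  # True
```

Schedules that fail this check could then be slowed down or removed by your update script.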
To summarize the above recommendations related to the Mode API:
Use the Mode Discovery Database to identify schedules that do not provide your organization value, like those that are running too frequently or those that generate reports that are infrequently viewed.
Use the Mode API to modify or delete those schedules.
Run the above logic on a regular cadence to proactively curate your Mode workspace.
Your workspace may prefer even more advanced recommendations than that. If that’s the case, we recommend taking advantage of Mode’s Webhooks capability to check live running reports.
In this case, you can create a subscription to a Target URL (via the Webhooks page in your Workspace settings), and then every time a specified action is taken in Mode, that Target URL will receive a standard POST message with information about the event that has occurred. Many of our customers use Zapier, which enables you to build quick integrations between tools like Mode and Slack that enable custom webhooks integration.
To use this for managing data compute, we would want to send an alert when a report finishes running. At the target URL, which can be a Zapier zap, a Google Apps Script project, or your internal custom tooling, we can then use the Mode API to process the event and determine whether it is a report running on a schedule, whether it is querying a large amount of data, and whether the report runner is on a permit/allow list for certain actions. Then, also using the Mode API, we can force changes, like deleting or updating the schedule, using the examples from the previous section as guidance.
The primary advantage of using Webhooks is that it is slightly more proactive than Discovery-Database-based workflows, as the Discovery Database updates only on a daily basis. The downside is that they likely require dedicated engineering/IT resources to implement.
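The decision logic at the target URL can stay small. The sketch below assumes the webhook event has already been enriched, via your own API calls, into a plain dictionary. The keys `is_scheduled`, `runner`, and `result_bytes` are hypothetical names rather than fields of Mode’s actual webhook payload, and the returned strings are placeholders for whatever follow-up action your tooling takes.

```python
def decide_action(event, allow_list, max_result_bytes):
    """Decide what to do with a finished report run.

    event: dict with hypothetical keys 'is_scheduled', 'runner',
        and 'result_bytes', assembled by your own tooling.
    allow_list: users permitted to run large scheduled reports.
    max_result_bytes: threshold above which a run counts as large.
    """
    if not event.get("is_scheduled"):
        return "ignore"  # manual runs are out of scope here
    if event.get("runner") in allow_list:
        return "ignore"  # runner is explicitly permitted
    if event.get("result_bytes", 0) > max_result_bytes:
        return "update_schedule"  # large, scheduled, and not allow-listed
    return "ignore"


event = {"is_scheduled": True, "runner": "bo", "result_bytes": 2_000_000_000}
print(decide_action(event, allow_list={"ana"}, max_result_bytes=1_000_000_000))
# update_schedule
```

Keeping the decision in a pure function like this makes it straightforward to test the rules before connecting them to anything that changes schedules.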