Data-Informed
Eliminating Data Silos, Part 1: Creating a Data Dictionary that Actually Works
Our recent market survey showed us that data silos remain a huge problem. If you're one of the 86% of teams with data silos, this blog series is for you! We'll be giving actionable advice on technical and cultural issues before assembling all these posts into a single reference ebook.
As our recent survey indicated, data silos remain a widespread problem. One way to reduce data silos is to build a data dictionary—a document that defines all the data points your organization collects and uses. Data dictionaries will help you make sure that everyone at your company uses key business terms consistently and that there’s a single measure (or set of measures) for them. The goal? To reduce confusion and keep your data reliable.
Below, we’ll give some advice on how to create a data dictionary that actually does what it’s supposed to do, which is to keep all teams on the same page when it comes to data. Many companies fail at this, because they:
Don’t want to put the time in to make a data dictionary
Don’t get input and buy-in from key stakeholders across departments
Don't keep their data references up to date
We don't want you to have these problems, which is why we're giving you this guide.
First, what is a data dictionary?
Put simply, a data dictionary is a structured repository that gives teams a comprehensive description of your company’s data and how it’s used. Data dictionaries usually take the form of a table that starts with data terms, then adds the terms’ definitions, calculation formulas, and source information.
At Heap, we decided to call our data dictionary a Business Metrics Dictionary, because we realized it was most helpful (for us) to focus on high-level business terms (like market-qualified lead (MQL), annual recurring revenue (ARR), churn rate, renewal rate, etc.) rather than getting in the weeds with technical information.
Ultimately, you’ll want to build your data dictionary into a backend database. That will help determine your data structure, data elements, and validation rules. Just remember that you’ll need to get alignment across teams on your terms and formulas before building it. (Again, lack of alignment is the easiest way to make sure your data dictionary won’t work as it’s supposed to!)
What’s actually important when creating a data dictionary
When you’re building a data dictionary, you should always start by outlining all data points that are part of your organization’s regular operations. This ensures that the relevance, meaning, and quality of the data are the same for everyone across your company.
After identifying key terms, you’ll need to list, define, and provide the source of each data point and add in any other details that might be necessary for teams to understand how to use it. At Heap, we found that the following elements were the most important to explain in our data dictionary, because they captured key pieces of information teams would regularly need to reference:
Business term. This is the term that teams will agree to use when referencing a certain data element. For example, if you’re referencing a timestamp for when an opportunity goes into negotiation phase, you can name this business term “Negotiation.”
Term category. This is the category your data point falls into. Examples include sales, revenue, growth, retention, product, etc.
Definition. Use simple language and common words to define and describe what the data point is, when it’s used, and what teams need to know when referencing it. A CEO or brand-new employee should be able to read this definition and understand what the term means.
Formula. If applicable, this is your chance to show teams exactly how each data point is calculated. As with definitions, you’ll want to use as much plain English as possible when writing out formulas so that anyone can understand them.
Source used. This is where your data point comes from. For example, data might come from your own database, or from a third-party tool like Salesforce.
Source name. This is the actual name of the column in your source/database. An example could be “Negotiation_Salesforce.”
Source owner. This is typically the team that owns or manages the data (e.g. sales, product, finance, customer success, etc.).
Entries in your data dictionary might then look like this:
CAC (Customer Acquisition Cost)
Marketing
This acronym stands for “customer acquisition cost.” It measures the average amount of money our company spends on user acquisition.
CAC = Total cost (marketing + sales) divided by the number of new customers
internal_database_name
customer_acquisition_cost
Marketing
Negotiation
Sales
Timestamp to indicate that an opportunity has received verbal signoff and entered negotiation phase.
N/A
Salesforce
Negotiation_Salesforce
Sales
Daily Active Users (DAU)
Product
This is the number of unique users who engage with your product in a one-day window.
DAU = [unique users] + [unique returning users] over the course of one day
Heap
daily_active_users
Product
You may also want to add columns in your data dictionary for data types, domain values, date created, date last updated, relationships to other data points, validation rules, who approved the data point, and other elements that might be important for your organization.
How to build a data dictionary that actually works: 5 steps
Now that you know what a data dictionary entails, it’s time to start building one! This is where preparation and collaboration—two things many companies fail to take into account when starting this process—come into play. To ensure that your data dictionary becomes a go-to tool for all teams, do the following:
Step 1: Assemble a list of key terms
Time to complete: 2 weeks
First, create a Google Sheet. This will serve as a list of all business terms to include in your data dictionary. Consider building separate tabs that separate terms by department.
A big part of this step is getting input from relevant teams about what terms should be included and how they should be defined. This step is one of the most important for making a data dictionary that works. But it’s one where many companies drop the ball. (Why? Because it takes both time and extra cross-functional communication.)
At Heap, we realized it was essential to consult the following teams:
Data
Engineering
Sales
Product
Marketing
Finance
Operations
While your needs might differ, building a list of terms will allow you to:
Identify key terms you’ve seen used inconsistently across teams to describe actions, situations, or entities that your data describes. Examples might include total revenue vs. ARR (annual recurring revenue), gross sales vs. net sales, registrations vs. installations, etc.
Circulate your list among as many stakeholders as possible so that no terms are missed.
Ensure each term is unique so there won’t be confusion with others.
Once we finished our business term list at Heap, we followed steps 2-5 (see below) to build out data dictionary entries in blocks of 15-30 terms per quarter. We did this until the project was complete.
Breaking the work up like this ensures that when you reach the feedback step, team members can spend minimal time for review. Building your data dictionary in manageable blocks also gives colleagues an opportunity to add more business terms that need to be defined.
Step 2: Define the terms
Time to complete: 3 weeks
Once you have all your terms, it’s time to define them! Using plain language, clearly and succinctly describe each term so its meaning won’t confuse the teams who will eventually be referencing it. Keep your wording conversational. Avoid jargon. A brand-new employee should be able to read this definition and understand what the term means it.
Here are a few examples:
ARR
Short for “Annual Recurring Revenue,” which refers to all types of recurring revenue projected over one year. This can include subscriptions, membership fees, or licensing fees.
DAY_ID
Timestamp to indicate the date an action started.
MAU
Short for “Monthly Active Users,” which counts the number of users who have logged in to our web or mobile app each month.
When defining terms, don’t forget to include columns in your Google Sheet for any necessary source information that shows where the term came from and what team manages it.
Once all terms have been identified and defined it’s time to…
3. Write up formulas
Time to complete: 2 weeks
Does your data take the form of a string or a raw number, or does it need to be calculated with a formula? If it’s the latter, you’ll need to note the formula clearly and accurately, calling on other named data points as needed. Again, it’s best to incorporate plain English into these formulas if possible so that anyone can understand them later on. Here’s an example of how you might write out the formula for calculating customer lifetime value (CLTV):
CLTV = AOV (Average Order Value) × purchase frequency × customer lifespan
When figuring out your formulas, you should always ask yourself:
Where does this data come from, and what team owns it?
What tool gathers the data?
How is it calculated, and what should the formula look like?
What other terms or formulas might this formula impact?
It’s best to finish this step before getting feedback from teams. Not only do formulas add extra clarification to the definition, but defining them ahead of feedback means stakeholders only have to review the document once vs. multiple times.
4. Get feedback
Time to complete: 1-2 weeks
Once you’ve identified business terms, defined them, written up their formulas, and added any necessary source information, it’s time to get feedback. Ask key stakeholders to leave comments in your Google Sheet as they review your terms and definitions. You’ll want to get answers to the following questions:
Do these terms properly reflect the company’s data?
Will the names chosen conflict with any other data points that need tracking?
Have any terms been forgotten?
Are the definitions accurate?
Are the formulas accurate?
Have all relationships between terms been listed out?
During this phase, ask each stakeholder for a detailed review of each term and formula that pertains to them—they may be able to surface issues or use cases that you and other stakeholders haven’t thought of. If they come back with notes for improvement, work with them to finalize the formulas. Generally, you can give these stakeholders a week or so to make corrections or return notes to you.
Remember: Once you’ve incorporated all feedback from all teams, get final signoff from leadership. This is a key step many companies forget! Whether you’re getting signoff from your CEO, your CTO, or some other department head who has final authority, it will solidify your new dictionary as a go-to source of truth for everything related to your company’s data. The terms you decide on should be the terms all teams across your company use going forward—no exceptions!
5. Maintain your data dictionary
While some teams continue to keep their data dictionaries in a simple spreadsheet, Heap uses Tableau. While any business intelligence tool that supports this function will work, just remember that linking your data dictionary to your actual data can help you save time if you need to make updates down the line.
Also, if any of your business terms change over time, make sure you update the terminology in your data dictionary while also alerting different departments and key stakeholders—including leadership—about the update. Similarly, if your company adds new business terms, you’ll want to add them to your data dictionary by following the process outlined above.
----------------------------
Once you’ve built your data dictionary and your leadership has signed off on it as a source of truth, congrats! You’ve just taken a big first step in reducing data silos at your organization. This will have real impact on teams while also improving efficiency and encouraging collaboration across departments.
Getting started is easy
Start today and elevate your analytics from reactive reporting to proactive insights. What are you waiting for?