Data-Informed
Eliminating Data Silos, Part 2: Building a Tech Stack & Data Warehouse for a Silo-Free Future
Our recent market survey showed us that data silos remain a huge problem. If you're one of the 86% of teams with data silos, this blog series is for you! We'll be giving actionable advice on technical and cultural issues before assembling all these posts into a single reference ebook.
----
In our previous blog on reducing data silos, we showed how to build a data dictionary that actually works. In this second installment, we’ll show how you should optimize your tech stack and data warehouse. Where do you start when you have hundreds of options for tools that capture, store, and transform data?
In our recent market survey, we found that companies struggle to get the tools in their tech stack to work together. In our experience, the top points of failure tend to be:
Not setting up permissions properly. Failing to manage user roles and permissions across your tech stack can create data silos and put up roadblocks across teams and departments.
Leaving gaps in their stacks. Data silos occur when there isn’t seamless integration between tools and the data they gather. The good news? Reverse ETL, which we cover in the next post, will help you fix this problem.
Failing to build in ways to test their data. When you can’t test your data for functionality, performance, and reliability, it’s easy to end up with messy data. We’ll cover this in a future post as well.
You don’t have to fall into these traps! There are always ways to set up your tech stack and warehouse so that tools work together and your data stays clean, no matter how you’re capturing it.
What is a tech stack?
For data and engineering teams, a tech stack is the set of tools that helps your part of the company do what it needs to do. Modern tech stacks also often include tools that measure product and user analytics, along with other digital tools that integrate with your company data for purposes of marketing, operations, sales, customer service, and other functions.
Remember: Before you build your tech stack, it’s essential to know which business questions teams at your company will need to answer on a day-to-day basis. When setting up a tech stack with the elimination of data silos in mind, make sure you understand where your proposed setup might leave blind spots for certain teams.
What is a data warehouse?
A data warehouse is a data management system designed to support business intelligence activities, particularly data analytics. It’s a catch-all storage system for data (including historical data) and a means for teams to query that data. Today, data warehouses typically include:
A relational database. This lets you store and manage your data.
An extract, transform, and load (ETL) or extract, load, and transform (ELT) solution. Both terms describe the process of pulling data from its source of origin and transforming it so that it lands in your warehouse in a structure that’s easy to analyze and blend with other data. Today, it’s far more common to load all raw data into your warehouse first and transform it there (see the sketch at the end of this section).
Data mining, reporting, and analysis capabilities. Your warehouse should enable you to search all your data from every source, perform statistical analytics, and report findings back to teams who depend on them.
Setting up your data warehouse properly will help you reduce silos by allowing you to integrate all your data from multiple sources and access it whenever you need to. To do this—and to make sure even non-technical teams can access the data they require to do their jobs—you need to set up your tech stack properly.
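To make the ELT pattern concrete, here’s a minimal sketch in Python. The `run_query` helper is a hypothetical stand-in for your warehouse client, and the table and column names are purely illustrative; the point is the ordering: land the raw data untouched, then transform it inside the warehouse.

```python
# Minimal ELT sketch: load raw data first, then transform it inside the warehouse.
# `run_query` is a hypothetical stand-in for your warehouse client; the table and
# column names are illustrative, not from any particular schema.

def run_query(sql: str) -> None:
    """Placeholder: swap in your warehouse connector's execute call."""
    print(f"Executing:\n{sql}\n")

# 1. Load: land the raw export as-is, with no reshaping on the way in.
run_query("""
    CREATE TABLE IF NOT EXISTS raw.salesforce_accounts AS
    SELECT * FROM external_source.salesforce_accounts_export
""")

# 2. Transform: build an analysis-friendly model from the raw table, in-warehouse.
run_query("""
    CREATE TABLE IF NOT EXISTS analytics.accounts AS
    SELECT
        id   AS salesforce_account_id,
        name AS account_name,
        CAST(created_date AS DATE) AS created_date
    FROM raw.salesforce_accounts
""")
```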
Documentation: the most important step of setting up a tech stack
When setting up your tech stack to prevent data silos, you’ll want to keep efficiency in mind. As such, you’ll need to document everything. From the beginning.
If you’re moving quickly, it can be tempting to skip documentation. This is a mistake! If you’re committed to eliminating silos, you need to commit to documenting everything. If you don’t, we promise you’ll see chaos down the line.
Here’s the “what,” “who,” “where,” and “when” of documentation:
What
Document everything! At the very least, you want to document the following:
Commonly used tables. Document these column by column, explaining what each represents.
Data sources. Document where each source comes from and how its data gets to the warehouse. Is it coming from a platform like Salesforce? A production database? A billing system?
Code. Your code will be read more than it is edited. Make sure it’s both readable and thoroughly documented so engineers know exactly what they’re looking at.
Components and versions. Document the software and platforms you’re using (including versions). This includes SaaS tools, operating systems, cloud providers, etc.
Configurations and deployment. Document every configuration for every tool, including API keys, connection strings, and environment variables. Describe the deployment process for each.
Dependencies and integrations. Map out how different tools interact with each other, and note any dependencies, API integrations, and data mapping techniques.
Monitoring and metrics. Define key performance indicators (KPIs) for your tech stack, focusing on performance, stability, and resource use. Also note how your monitoring tools are used and how they track these KPIs.
Maintenance and update processes. Outline any regular maintenance tasks that must be performed, including backups, upgrades, security patches, testing processes, and rollback procedures.
For your data warehouse, make sure to document the following:
Data model and schema. Clearly define the data model, including entity relationships, attributes, and data types. Also document the schema for each table or view.
Data ingestion and processing. Document how you ingest data from different sources, and make sure to note any processes for data cleaning, normalization, or enrichment.
Data quality and governance. Define metrics and methods for measuring data quality, and outline policies for governance, including access controls, logging, and audit trails.
Querying and reporting. Note frequently used queries for accessing and analyzing warehouse data, and document reporting tools and dashboards used to present data insights to key stakeholders.
Security and compliance. Document security measures for data access, encryption, and authorization within the warehouse, along with relevant data privacy regulations your teams need to be aware of.
Once complete, your documentation should be stored in a table and look something like this:
| Field | Example |
| --- | --- |
| Schema | sales_data |
| Name | transaction_date |
| Primary key | salesforce_account_id_date |
| A record in this table represents… | [Explain what the record represents] |
| Source code | [Source code URL] |
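If you want a head start on this kind of table documentation, most warehouses expose column metadata through information_schema. Here’s a hedged sketch in Python (assuming a DB-API connection such as psycopg2 or the Snowflake connector; the schema and table names are illustrative) that pulls a column-by-column skeleton you can then fill in by hand:

```python
# Sketch: pull column metadata from information_schema as a documentation skeleton.
# Assumes `conn` is a DB-API connection (e.g. psycopg2 or the Snowflake connector,
# both of which use %s placeholders); schema and table names are illustrative.

def documentation_skeleton(conn, schema: str, table: str) -> list:
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
        """,
        (schema, table),
    )
    # Each row becomes a documentation stub that a human still needs to complete.
    return [
        {"column": name, "type": dtype, "represents": "TODO: explain this column"}
        for name, dtype in cursor.fetchall()
    ]

# Example (hypothetical names):
# rows = documentation_skeleton(conn, "sales_data", "transactions")
```

The output is only a starting point: the “represents” field still needs a human to explain what each column actually means.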
Who
Your data sources should have defined owners (and owners should be documented!). Typically, whoever is responsible for the data should also be responsible for documentation.
Start by defining the business owner: the person or team using the data most often, who will usually be the best owner (e.g. the revenue operations team that creates data in Salesforce). You may also want to define a technical owner: the person or team creating the data by logging things that happen in a product (e.g. engineers). It’s fine if your business owner and technical owner are the same.
Where
Documentation should happen in whatever tool your company typically uses to store information about any of its defined processes (at Heap we use Confluence). If a documentation tool has already caught on with one or more teams at your company, it’s best to use this tool so all the information that teams need lives in one place.
When
Any time a data asset is created or changed. A great way to make sure this happens is to incentivize people (e.g. give a fun gift or gift card to the person who does the most documenting per quarter).
Another fun idea: “documentathons” (oh yeah) or hack weeks focused on documentation.
How to align your data warehouse with your tech stack
When setting up your data warehouse and tech stack to work together seamlessly, take the following steps:
1. Pick the right warehouse
When you're considering which data warehouse service to use, you’ll have a lot of options, including Snowflake, Google Cloud BigQuery, Microsoft Azure Synapse, ClickHouse, Dremio, and others. They all function similarly, with the biggest differences being how they price their services. Make sure the billing setup for the warehouse you choose fits your usage patterns.
In the current market, Snowflake narrowly leads BigQuery, and a majority of Heap customers use Snowflake. Regardless of the warehouse service you choose, you’ll want to consider how your warehouse handles data volume at scale, query speed, concurrent queries, monitoring, and workload management.
Also consider the ecosystem you already operate in. If you’re already using Google Cloud Platform, for example, your existing setup creates a form of vendor lock-in: a warehouse from a different vendor can be harder to integrate with the services you already run, and transferring data between vendors can add cost.
When setting up your warehouse, use clear names for schemas and tables, stay organized, and make it very clear which data is scratch data and which is production data. It’s also helpful to avoid complex permission structures.
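As a small illustration of keeping scratch and production data visibly separate, you could encode the split in the schema names themselves. This is only a sketch; the names and the `run_query` helper are assumptions, not a prescribed convention:

```python
# Sketch: make the scratch vs. production split visible in the schema names.
# The names and the `run_query` helper are assumptions, not a prescribed convention.

def run_query(sql: str) -> None:
    """Placeholder for your warehouse client."""
    print(sql)

# Production models live in a clearly named schema...
run_query("CREATE SCHEMA IF NOT EXISTS analytics_prod;")

# ...while experiments and ad hoc work stay quarantined in a scratch schema
# that everyone understands can be rebuilt or dropped at any time.
run_query("CREATE SCHEMA IF NOT EXISTS analytics_scratch;")
```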
2. Set up version control. For everything.
Everything that can be version controlled should be! Version control acts as a source of truth by offering complete visibility, traceability, and ongoing monitoring of future changes to your code, data, and other projects. This becomes essential when different teams (such as system admins or developers) are working with the same database.
Platforms like GitHub, GitLab, Bitbucket, and others allow for collaboration and code sharing. The repositories you set up can store and version your product code, open source libraries, data science code, scripts, notebooks, and documentation. When it comes to your data, a Git workflow is essential because it keeps a history of every change to your data-related projects.
Always treat version control as a real product, not just a “scratch pad” for your work. Also, clearly define your release processes (setup, what information is included, etc.) so that everyone can follow them and help each other.
3. Use a dependency checklist
In tech stacks, architecture is often illustrated by a stack diagram, where higher-level components depend on lower-level components. When setting up your tech stack, your biggest dependency issue will likely be whether your ingestion tool supports the SaaS products your teams use on a daily basis. While most mainstream SaaS products will integrate successfully, some may not. Run through the following checklist:
Integration methods. Does your ingestion tool effectively communicate with RESTful APIs or other integration methods provided by the SaaS applications?
Transformation capabilities. Do you know what your ingestion tool’s data transformation capabilities are, and whether the data can be adapted to the structures of your target SaaS applications?
Data formats and protocols. Do you know the supported data formats and protocols of both the ingest tool and the SaaS application so you can ensure seamless data transfer?
Authentications and authorizations. Do you know how your ingest tool handles authentication and authorization when interacting with SaaS applications?
Rate limits and scalability abilities. Do you know whether your ingest tool can handle any rate limits on API requests imposed by the SaaS applications and how scalable it is when working with large volumes of data?
Function failure. Do you know how your ingest tool handles errors and retries when a function fails? (See the sketch after this checklist.)
Monitoring and logging methods. Do you know how best to implement monitoring and logging to track the status of data ingestion processes, bringing visibility to API calls, error rates, and other relevant metrics?
Security compliance. Do you know whether the ingest tool complies with all necessary security standards and regulations when dealing with sensitive data from SaaS applications?
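To make the rate-limit and retry questions concrete, here’s a hedged sketch of the kind of logic you’d want your ingestion tool (or your own glue code) to implement. The endpoint, header names, and retry budget are illustrative assumptions, not any specific vendor’s API:

```python
# Sketch: how an ingestion job might respect SaaS rate limits and retry failures.
# Uses the `requests` library; the endpoint, header names, and retry budget are
# illustrative assumptions, not any specific vendor's API.
import time
import requests

def fetch_with_retries(url: str, token: str, max_attempts: int = 5) -> dict:
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        if response.ok:
            return response.json()
        if response.status_code == 429:
            # Rate limited: honor Retry-After if the API provides it as seconds.
            retry_after = response.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        elif response.status_code >= 500:
            # Transient server error: back off exponentially and try again.
            wait = 2 ** attempt
        else:
            # Other client errors (bad auth, bad request, etc.): retrying won't help.
            response.raise_for_status()
        time.sleep(wait)
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")

# Example call against a hypothetical endpoint:
# accounts = fetch_with_retries("https://api.example-saas.com/v1/accounts", token="...")
```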
4. Use service accounts
Create service accounts for each tool so you know where any changes come from. Service accounts are non-human accounts, often with elevated privileges, that you should use to run applications, automated services, and other background processes. Because service accounts let you assign an identity and permissions to business-critical programs or processes that do important background tasks, they often have extensive access to your systems. This means you need to be very careful when it comes to security, monitoring, and proper maintenance!
Because service accounts are also not linked to individuals, companies need to know who is in control of them at all times. To implement them safely, do the following:
Use a dedicated tool. HashiCorp Vault, Akeyless, Microsoft Entra ID, and Egnyte are among the top identity-based secrets and encryption management systems.
Integrate with existing IAM (identity access and management) systems. This can automate updates and improve data accuracy.
Document everything (as always). Make sure to note all relevant details about each service account, including its purpose, permissions, owner, and associated systems.
Define roles and permissions. Define different access levels based on user roles and responsibilities. This will help you ensure that information is accessible only to authorized personnel.
Make sure your accounts are searchable. Because you will likely have numerous service accounts, make sure whatever platform or system you use makes it easy to find specific information about each one.
Keep everything up to date. Regularly review and update your documentation to reflect any changes made to your service accounts.
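Here’s a minimal sketch of what running a background job under a dedicated service account can look like in practice. The secret lookup and the job itself are hypothetical; in reality the credential would come from your secrets manager (such as one of the tools above), never from code or an individual’s login:

```python
# Sketch: run a background sync under a dedicated service account rather than a
# personal login. The secret lookup and job below are hypothetical; in practice
# the credential comes from your secrets manager, never from code or a person.
import os

def get_service_account_token(account_name: str) -> str:
    """Placeholder: fetch the credential from your secrets manager or environment."""
    token = os.environ.get(f"{account_name.upper()}_TOKEN")
    if not token:
        raise RuntimeError(f"No credential found for service account '{account_name}'")
    return token

def nightly_crm_sync() -> None:
    # The non-human identity makes every resulting change traceable to this job.
    token = get_service_account_token("svc_crm_sync")
    # ...call your ingestion tool's API with this token so warehouse changes are
    # attributed to 'svc_crm_sync' rather than to an individual engineer.
    print(f"Running CRM sync as svc_crm_sync (credential loaded: {bool(token)})")

# Triggered by your scheduler or orchestrator, not from an engineer's laptop:
# nightly_crm_sync()
```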
5. Restrict your “golden tables”
“Golden tables” are considered the definitive and authoritative source for specific data within an organization. Since you’ll have multiple tools and users that need to work with your data, it’s important to set up permissions properly on all levels of access to your golden tables. Here’s a good method to follow:
Inventory and catalog your data. Create a list of tables, their data elements, and their sensitivity levels. Consider making this part of your data dictionary.
Assign data ownership. Designate individuals as primary owners responsible for each golden table’s integrity and security, and make sure to consult them when assessing access requests.
Implement role-based access control (RBAC). Establish clear roles based on job functions and data needs (e.g. analyst, developer, manager). Map permissions to roles by assigning each one an appropriate access level (e.g. read, write, update, delete), aligning with data sensitivity. (See the sketch after this list.)
Operate on the “need-to-know” principle. Not every employee will need to see every bit of data your company has, but you may find unexpected crossover in needs between teams. For example, if you lock down company financial data too much, your product or customer success teams may not be able to use it for their analysis, which means you’ve created a silo that prevents them from doing their jobs. Make sure to communicate across teams to ensure permission settings are optimized for each role based on need, but avoid giving unnecessary access.
Build an access request and approval process. Establish a structured process for employees to request and justify data access. Involve data owners and relevant stakeholders in reviewing and approving access requests, and make sure all employees go through basic data safety training.
Use data minimization techniques. Always restrict access to the minimum data required for each employee’s task, and consider data masking and obfuscation to protect sensitive information during analysis.
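Here’s what the RBAC step might look like in practice, as a hedged sketch. GRANT syntax varies by warehouse (this is close to Snowflake’s), and the roles, privileges, and table name are illustrative assumptions:

```python
# Sketch: role-based grants on a "golden table." GRANT syntax varies by warehouse
# (this is close to Snowflake's); the roles, privileges, and table name are
# illustrative assumptions, and `run_query` stands in for your warehouse client.

GOLDEN_TABLE = "analytics.revenue_golden"

ROLE_PRIVILEGES = {
    "analyst": ["SELECT"],                            # read-only, for analysis
    "data_engineer": ["SELECT", "INSERT", "UPDATE"],  # maintains the table
    "auditor": ["SELECT"],                            # read-only, for compliance review
}

def run_query(sql: str) -> None:
    """Placeholder for your warehouse client."""
    print(sql)

for role, privileges in ROLE_PRIVILEGES.items():
    run_query(f"GRANT {', '.join(privileges)} ON TABLE {GOLDEN_TABLE} TO ROLE {role};")
```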
6. Take snapshots
Snapshots should be performed at least once per day, so there’s always a record of your most recent data. They’ll come in handy if, for example, account statuses change between days or months and you want to see what an account looked like at an earlier date.
Snapshots should be stored as tables in your warehouse, and each one should be consistently named (e.g. “[table_name]_snapshot”). dbt has a great built-in tool for snapshotting, but you can also build an incremental table that appends a row for each record daily, capturing the state of your data at the end of each day, with the primary key of the table you’re snapshotting used to uniquely identify records. (This pattern is known as a “slowly changing dimension,” or SCD.) Over time, the incremental data builds a historical record of changes to the data in the original table, which allows for historical analysis and reporting.
These snapshots should be stored alongside your raw data and, as a safeguard, given a generous retention policy in case any data gets wiped. Daily snapshotting is easiest, but if you have large data volumes, these snapshots can become expensive to store and costly to query.
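As a closing illustration, here’s a minimal sketch of the incremental snapshot approach described above. The table and column names are assumptions, and `run_query` stands in for your warehouse client or orchestrator; dbt’s built-in snapshot feature achieves the same result with less code:

```python
# Sketch of the incremental daily snapshot described above. The table and column
# names are assumptions, and `run_query` stands in for your warehouse client;
# dbt's built-in snapshot feature achieves the same result with less code.

def run_query(sql: str) -> None:
    """Placeholder for your warehouse client."""
    print(sql)

# Run once per day from your scheduler, after the source table has been updated.
run_query("""
    INSERT INTO analytics.accounts_snapshot
    SELECT
        salesforce_account_id,   -- primary key of the table being snapshotted
        account_status,
        plan_tier,
        CURRENT_DATE AS snapshot_date
    FROM analytics.accounts
""")
```

To see what an account looked like at an earlier date, filter the snapshot table on snapshot_date and the account’s primary key.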