Eliminating Data Silos, Part 4: How To Use Data Integrity Testing to Keep Data Clean and Reliable
Our recent market survey showed us that data silos remain a huge problem. If you're one of the 86% of teams with data silos, this blog series is for you! We'll be giving actionable advice on technical and cultural issues before assembling all these posts into a single reference ebook.
Check out Post 1 on Creating a Data Dictionary that Actually Works.
Check out Post 2 on Building a Tech Stack & Data Warehouse for a Silo-Free Future.
Check out Post 3 on Using Reverse ETL to banish silos forever.
-------
Q: Once you’ve built your data dictionary, set up a working tech stack, and implemented a reverse ETL pipeline, what’s the final step to ensuring your data stays clean? A: Data integrity testing!
In this post, we’ll cover:
What data integrity testing is and why it matters to your business.
The types of data integrity you should always test for.
Things you should always consider with data integrity tests.
How to build a robust framework for data integrity testing.
What is data integrity testing?
Data integrity testing is the process of systematically evaluating the quality, accuracy, consistency, and completeness of your data. The goal is to make sure your data stays trustworthy and fit for its intended use.
Integrity testing involves executing a series of tests that uncover errors, inconsistencies, and deviations from defined expectations. Unfortunately, many companies make mistakes with the process. Here are a few common ones:
Insufficient test coverage. If you rely only on simple validations and basic checks, you can miss deeper issues like referential or logical inconsistencies.
Inadequate test frequency and monitoring. If you don’t run integrity tests often enough, you might miss real-time insights or allow errors to persist for extended periods.
Overlooking human error. Don’t forget that human errors can also infiltrate your data! These can include mistakes in data entry, data transcription, validation, and more.
What kinds of data integrity should you test for?
In general, data integrity testing should focus on two kinds of data integrity: physical integrity and logical integrity.
Physical integrity
Physical integrity tests focus on detecting physical errors (or corruption) in your hardware and IT infrastructure. Common types of physical data integrity tests include:
Disk integrity checks. These include 1) verifying the integrity of file system structures (and finding missing files, corrupt directories, or metadata inconsistencies); 2) checking physical disk surfaces for bad sectors; and 3) generating unique checksum values for files and checking them later to make sure they haven’t been altered or corrupted (see the checksum sketch after this list).
Data transmission tests. Data needs to journey from point A to point B, and your physical hardware’s ability to transmit data should be tested regularly. This can be done with 1) error-correcting codes (ECC) that ensure data arrives intact at its destination; 2) data validation rules that enforce data type, format, and range checks upon entry or exit from a system; or 3) packet checksums to identify any packets lost or corrupted during transmission.
Backup and recovery validation. Verify the completeness and readability of backup files to ensure that they can be restored successfully in the event of data loss. Also periodically test your restore processes to make sure your data can be recovered accurately and quickly when needed.
Data corruption detection. Proactively scan your warehouse to identify and correct data corruption before it causes errors or data loss. Using an error-logging mechanism to track and record data errors for analysis and troubleshooting will also help you find patterns and root causes of corruption.
Data monitoring and alerts. Make sure to implement monitoring tools to track disk health metrics, input/output errors, and other indicators of possible physical issues. Also make sure to set up trigger notifications to alert you when issues are detected so that you can course correct.
Security controls. As a part of your data governance, you should always take measures to prevent unauthorized data access, particularly when it could lead to intentional or unintentional corruption. You should also encrypt your data to prevent unauthorized access or changes in the event physical security measures are breached.
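To make the checksum idea from the disk integrity checks above concrete, here is a minimal Python sketch that records SHA-256 checksums for a set of files and later flags anything that has changed or gone missing. The "backups" directory and "checksums.json" manifest are placeholder names; adapt the paths to your own storage layout.

```python
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot_checksums(directory: Path, manifest: Path) -> None:
    """Record a checksum for every file under `directory` in a JSON manifest."""
    checksums = {str(p): file_checksum(p) for p in directory.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(checksums, indent=2))

def verify_checksums(manifest: Path) -> list[str]:
    """Return paths whose current checksum no longer matches the recorded one."""
    recorded = json.loads(manifest.read_text())
    return [
        path for path, expected in recorded.items()
        if not Path(path).is_file() or file_checksum(Path(path)) != expected
    ]

if __name__ == "__main__":
    # "backups" is a placeholder directory; point this at whatever you archive.
    snapshot_checksums(Path("backups"), Path("checksums.json"))
    print("Corrupted or missing files:", verify_checksums(Path("checksums.json")) or "none")
```

Running the verification step on a schedule, and alerting on any mismatch, covers both the corruption-detection and backup-validation points above.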
Logical integrity
There are a few types of logical data integrity you should always be testing for. They are:
Domain integrity. This helps you ensure that each data column’s value conforms to its defined data type, format, and range of accepted values. You should always test for incorrect data types (e.g. text in a numeric field), values outside allowed ranges (e.g. negative age for people), and invalid data formats (e.g. incorrect date patterns).
Entity integrity. This helps you make sure that each record in any given table has a unique identifier (also known as a primary key) and that no critical fields are null. You should always implement tests for duplicate primary keys, missing primary keys, and null values in columns that should not allow them.
Referential integrity. This is a crucial concept in relational databases, because it ensures consistency and accuracy in data relationships between different tables. It involves interactions between primary keys and foreign keys (a column in another table that references the primary key in the referenced table). You should always test for orphaned records (references to non-existent records) and violations of foreign key constraints, which define what happens when records are deleted or updated.
Logical consistency. This verifies that logical dependencies and relationships between data pieces comply with defined rules and business logic. Always test for contradictory or conflicting data and violations of any business-critical rules.
Data completeness. This ensures that all necessary data elements are present and have valid values. You should always test for things like missing or incomplete records and null values in required fields. (The sketch after this list pulls several of these logical integrity checks together.)
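Here is a minimal pandas sketch that runs several of these logical integrity checks against two hypothetical tables, customers and orders. The table and column names are invented for illustration; swap in your own schema.

```python
import pandas as pd

# Hypothetical tables standing in for a customers table and an orders table.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 29, 41]})
orders = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "customer_id": [1, 2, 2, 99],        # 99 has no matching customer
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "not-a-date"],
})

issues = {}

# Domain integrity: dates must parse and ages must be non-negative.
parsed_dates = pd.to_datetime(orders["order_date"], errors="coerce")
issues["invalid_order_dates"] = orders[parsed_dates.isna()]
issues["negative_ages"] = customers[customers["age"] < 0]

# Entity integrity: primary keys must be present and unique.
issues["duplicate_order_ids"] = orders[orders["order_id"].duplicated(keep=False)]
issues["missing_order_ids"] = orders[orders["order_id"].isna()]

# Referential integrity: every order must point at an existing customer.
issues["orphaned_orders"] = orders[~orders["customer_id"].isin(customers["customer_id"])]

# Completeness: required fields must not be null.
issues["incomplete_orders"] = orders[orders[["customer_id", "order_date"]].isna().any(axis=1)]

for name, rows in issues.items():
    if not rows.empty:
        print(f"{name}: {len(rows)} offending row(s)")
```

The same checks translate directly into SQL or into the testing framework of your choice; what matters is that each type of logical integrity gets an explicit, repeatable test.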
What should you be testing?
So … given the above, what sorts of things should you be testing for? A good list would include:
Granularity tests
Granularity tests allow you to inspect your data at different scales:
Row level. Row-level tests check specific data points for accuracy, validity, and adherence to business rules. Are timestamps correct? Are IDs formatted consistently? This is where you catch inconsistencies at their root.
Table level. This is where you analyze relationships between data elements within a table. For example, this could involve checking for foreign key constraints between user ID and session data or verifying aggregate calculations like total page views.
Aggregate level. Aggregate metrics summarize data from a larger data set into a single, high-level value. This is the bird’s eye view of your data, and from here, you’ll spot trends, patterns, issues, or anomalies. Testing these metrics helps you verify their integrity and accuracy and ensures that the calculations used to generate them correctly reflect the underlying data. (The sketch after this list shows one check at each level.)
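As a rough illustration, the pandas sketch below runs one check at each level of granularity against a hypothetical page-view table and a daily summary derived from it. The tables, columns, and ID format are made up for the example.

```python
import pandas as pd

# Hypothetical event-level data and a reported daily summary table.
page_views = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "day": ["2024-03-01", "2024-03-01", "2024-03-01", "2024-03-02"],
    "views": [3, 2, 5, 1],
})
daily_summary = pd.DataFrame({
    "day": ["2024-03-01", "2024-03-02"],
    "total_views": [9, 1],               # 2024-03-01 is wrong on purpose: the rows sum to 10
})

# Row level: every user_id should match the expected "u<number>" format.
bad_ids = page_views[~page_views["user_id"].str.fullmatch(r"u\d+")]

# Table level: view counts should never be negative anywhere in the table.
negative_views = page_views[page_views["views"] < 0]

# Aggregate level: recompute the daily totals and compare against the summary.
recomputed = page_views.groupby("day", as_index=False)["views"].sum()
merged = daily_summary.merge(recomputed, on="day", how="left")
mismatches = merged[merged["total_views"] != merged["views"]]

print("Malformed user IDs:", len(bad_ids))
print("Negative view counts:", len(negative_views))
print("Days where the aggregate disagrees with the rows:\n", mismatches)
```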
Things that should never be true
If you have data points that should never be true but show up in your data anyway, you have a problem. For example, you might find negative customer IDs, events (like shipments) recorded before their order date, or duplicate unique identifiers. You should always define data constraints and validation rules to prevent impossible values, use timestamp checks to make sure data records adhere to a logical sequence, and enforce primary and alternate key constraints to prevent duplicate records (see the combined sketch below).
Things that should always be true
Conversely, some things should always be true. For example, customers should always have valid email addresses; inventory levels should always be zero or positive; non-null values should always exist for required fields.
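The sketch below encodes a few of these "never true" and "always true" rules as checks over a toy pandas table. The columns and the email pattern are simplified placeholders, not production-grade validation.

```python
import pandas as pd

# A toy table mixing a few fields purely to illustrate the rules below.
records = pd.DataFrame({
    "customer_id": [101, 102, -7],
    "email": ["a@example.com", "not-an-email", "c@example.com"],
    "inventory": [4, 0, -2],
    "order_date": pd.to_datetime(["2024-02-01", "2024-02-03", "2024-02-05"]),
    "ship_date": pd.to_datetime(["2024-02-02", "2024-02-01", "2024-02-06"]),
})

# Things that should never be true: each mask marks rows violating the rule.
never_true = {
    "negative_customer_ids": records["customer_id"] < 0,
    "shipped_before_ordered": records["ship_date"] < records["order_date"],
    "duplicate_customer_ids": records["customer_id"].duplicated(keep=False),
}

# Things that should always be true: each mask marks rows where they are not.
always_true_violations = {
    "invalid_emails": ~records["email"].str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "negative_inventory": records["inventory"] < 0,
    "missing_required_fields": records[["customer_id", "email"]].isna().any(axis=1),
}

for rule, mask in {**never_true, **always_true_violations}.items():
    if mask.any():
        print(f"Rule violated: {rule} ({int(mask.sum())} row(s))")
```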
Tolerance/thresholds for large data
Once you have a large amount of data, certain margins of error may be acceptable for certain data points. If this is the case, set percentage-based thresholds for acceptable error rates or outlier occurrences. Analyze data variability using standard deviation or other statistical measures to define realistic thresholds, and implement dynamic thresholds that adjust based on data volume, trends, and historical benchmarks.
When testing massive data sets, you should always sample a statistically significant subset. Prioritize testing for critical data elements and metrics that significantly influence teams’ decision-making processes, and test incrementally, dividing large data sets into smaller batches for testing in stages. This reduces resource needs and processing time.
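A lightweight way to apply these ideas, sketched below with made-up numbers: derive a dynamic threshold from recent history, sample a reproducible subset of a large table, and split the rest into batches for staged testing.

```python
import pandas as pd

# Hypothetical daily error rates; in practice, pull these from your test results.
daily_error_rate = pd.Series([0.012, 0.010, 0.015, 0.011, 0.013, 0.045, 0.012])

# Dynamic threshold: mean plus two standard deviations of recent history.
threshold = daily_error_rate.mean() + 2 * daily_error_rate.std()
breaches = daily_error_rate[daily_error_rate > threshold]
print(f"Threshold: {threshold:.3f}; days over threshold: {list(breaches.index)}")

# Sampling: test a random, reproducible 10% subset instead of every row.
big_table = pd.DataFrame({"value": range(1_000_000)})
sample = big_table.sample(frac=0.10, random_state=42)

# Incremental testing: divide the full table into batches to test in stages.
batch_size = 250_000
batches = [big_table.iloc[i:i + batch_size] for i in range(0, len(big_table), batch_size)]
print(f"Sampled {len(sample):,} rows; queued {len(batches)} batches for staged testing")
```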
Accuracy, conformity, consistency, and timeliness
These four aspects are crucial for clean data. Here’s what to watch out for with each:
Accuracy. Ensure all expected data is present, check for adherence to expected data types, formats, and ranges, and verify that data aligns across different sources, systems, and time periods.
Conformity. Your data should also conform to all business rules, so enforce compliance with business logic and constraints. Also make sure you adhere to industry or organizational data standards (e.g. ISO, HIPAA, etc.) and uphold defined data quality and privacy regulations.
Consistency. Next, always identify and address any unintended duplication of records, and verify uniqueness by enforcing constraints for key identifiers (e.g. customer IDs, product codes, etc.). Make sure your primary key and foreign key relationships are intact by verifying that links between tables are valid.
Timeliness. Finally, always keep your data fresh and up-to-date, and make sure it reflects the current state of operations. Check for latency by measuring and minimizing delays in data processing and availability, and make data retention a priority by enforcing policies for data archiving and deletion based on business needs and compliance requirements. (A freshness sketch follows this list.)
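For the timeliness point in particular, a freshness check can be as simple as comparing each table's last successful load time against an agreed maximum age. The sketch below uses an invented load log and a 24-hour rule; both are assumptions to adapt.

```python
import pandas as pd

# Hypothetical load log: one row per table with its latest successful refresh.
load_log = pd.DataFrame({
    "table": ["orders", "customers", "sessions"],
    "last_loaded_at": pd.to_datetime([
        "2024-03-01 08:55:00+00:00",
        "2024-03-01 02:10:00+00:00",
        "2024-02-27 23:40:00+00:00",
    ]),
})

# Freshness rule: every table should have been refreshed within the last 24 hours.
now = pd.Timestamp("2024-03-01 09:00:00", tz="UTC")   # use pd.Timestamp.now(tz="UTC") in practice
max_age = pd.Timedelta(hours=24)

load_log["age"] = now - load_log["last_loaded_at"]
stale = load_log[load_log["age"] > max_age]
print("Stale tables:\n", stale[["table", "age"]])
```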
Anomalies
If you have data points that deviate significantly from expected patterns or trends, dig into them. These can typically be classified as outliers (data points that fall outside the typical range); inliers (data points that seem normal and expected but actually imply subtle shifts or hidden patterns); and collective anomalies (groups of data points that exhibit unusual relationships or patterns that aren’t observed anywhere else in the data set).
You should test for and identify anomalies in the following ways:
Statistical methods. Use standard deviation, z-scores, or other statistical measures to define thresholds for outliers. Data points falling outside these thresholds are potential anomalies (see the sketch after this list).
Machine learning algorithms. Leverage anomaly detection algorithms that automatically identify unusual patterns and clusters in data, even in complex data sets. Many of your tools will already offer built-in anomaly detection capabilities, but you can also consider open-source tools like PyOD and AnomalyDetectionTS.
Visual inspection. Never underestimate the power of human analysis! Visualize data through charts and graphs to identify unusual trends or suspicious patterns that statistical methods might miss.
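As a small example of the statistical route, the sketch below flags daily order counts whose z-score exceeds 2.5. The numbers and the threshold are illustrative; on small samples a single extreme value also inflates the standard deviation, so tune the cutoff against your own data.

```python
import numpy as np

# Hypothetical daily order counts with one clearly unusual day.
daily_orders = np.array([120, 118, 131, 125, 122, 119, 430, 127, 124, 121], dtype=float)

# Z-score method: how many standard deviations is each point from the mean?
mean = daily_orders.mean()
std = daily_orders.std()
z_scores = (daily_orders - mean) / std

# Flag anything more than 2.5 standard deviations away as a potential anomaly.
threshold = 2.5
anomalies = np.where(np.abs(z_scores) > threshold)[0]

for i in anomalies:
    print(f"Day {i}: value={daily_orders[i]:.0f}, z-score={z_scores[i]:.2f}")
```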
Data drift
Data drift refers to the phenomenon of the statistical properties of a data set changing over time. This can happen due to changes in the data source, evolution of user behavior, or system updates or changes. It can be detrimental to your models and analytics, because it can lead to reduced accuracy, misleading insights, and system performance issues.
To test for data drift, you can use several methods, including:
Statistical comparison. You can flag significant deviations by comparing key statistical measures like mean, median, standard deviation, and distributions of various features over time.
Control charts. These can help you visualize changes in statistical properties over time, making it easier to identify trends and shifts that might indicate drift.
Distance metrics. These are powerful tools for quantifying the change between data distributions over time. Popular metrics include Wasserstein Distance (WD), Kullback-Leibler Divergence (KL-Divergence), Jensen-Shannon Distance (JSD), and Total Variation Distance (TVD). (See the sketch after this list.)
Machine learning-based approaches. These go beyond distance metrics to capture complex and nuanced changes in your data. Apart from anomaly detection algorithms, key approaches include change point detection algorithms and drift detection for specific data types. While this approach can run in real time and offers adaptability and customization opportunities, it does have potential pitfalls. These include the necessity for effective model training, possible false positives, and challenges with interpreting results.
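Here is a rough sketch of the distance-metric approach using SciPy, comparing a baseline sample against more recent data that has drifted. The synthetic distributions and the alert thresholds are placeholders you would calibrate against historical, non-drifted comparisons.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical feature values: last month's baseline vs. this week's data,
# where the recent data has drifted toward higher values.
baseline = rng.normal(loc=50, scale=10, size=5_000)
recent = rng.normal(loc=58, scale=12, size=5_000)

# Wasserstein distance works directly on the raw samples.
wd = wasserstein_distance(baseline, recent)

# Jensen-Shannon distance compares binned probability distributions.
bins = np.histogram_bin_edges(np.concatenate([baseline, recent]), bins=30)
p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(recent, bins=bins, density=True)
jsd = jensenshannon(p, q)

print(f"Wasserstein distance: {wd:.2f}")
print(f"Jensen-Shannon distance: {jsd:.3f}")

# A simple drift rule: alert when either metric crosses a calibrated threshold.
if wd > 5 or jsd > 0.1:
    print("Possible data drift detected; investigate before trusting downstream models.")
```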
How to build a robust framework for data integrity testing
Now that you know what to watch out for when it comes to data integrity testing, it’s time to build a robust testing framework! Here’s what to do:
1. Define your objectives and scope
Defining your objectives and scope should be the first step of any data integrity testing efforts, because it will help you ensure that everyone is on the same page. Make sure you:
Identify critical data assets. Determine which data sets are most essential for business operations and decision-making.
Prioritize areas of focus. Consider factors like risk exposure, regulatory compliance requirements, and potential impact on business outcomes.
Establish clear goals. Define what you aim to achieve with data integrity testing. This might include error reduction, data quality improvement, and better compliance.
2. Gather business requirements and rules
Next, it’s time to communicate across teams to figure out what your business requirements and rules need to be. This should include:
Engaging all stakeholders. Collaborate with data owners, users, and subject matter experts to understand business processes, data discrepancies, and expected data characteristics.
Documenting business rules. Define the constraints, relationships, and logic that must be upheld in your data to ensure that its integrity and meaningfulness remain intact.
Identifying key metrics. Determine the specific data quality metrics that will be used to measure and track progress toward your objectives.
3. Design comprehensive test cases
Once you’ve collaborated with teams across your company to figure out the requirements for data testing, build comprehensive designs for your test cases. Make sure to:
Cover different levels of granularity. Build tests that examine individual data elements, entire tables, and aggregate levels to uncover issues at various scales.
Address multiple dimensions of integrity. As discussed above, include tests for accuracy, completeness, consistency, timeliness, validity, and adherence to business rules.
Use a variety of testing techniques. Incorporate static tests (e.g. data profiling, validation rules, etc.); dynamic tests (e.g. data flow analysis, process monitoring, etc.); and exploratory testing to uncover unforeseen issues.
4. Choose appropriate tools and technologies
Within your data integrity testing framework, you’ll need specific tools to help you do the job, including:
Data quality tools. Use tools specifically designed for data profiling, cleansing, validation, and testing (e.g. DBT, Great Expectations, Informatica Data Quality, etc.).
Data integration platforms. Integrate data testing into your data pipelines or ELT processes for automated execution and early error detection. Top tools for this include Fivetran, FME, and Informatica PowerCenter.
Business intelligence (BI) tools. Use the data validation and testing capabilities built into your BI tools to ensure the integrity of data used for analysis and reporting. Top BI tools include Microsoft Power BI, Tableau, Looker Studio, and Domo.
5. Integrate with data pipelines and workflows
Now it’s time to integrate your tools! Here are three things to do to ensure that this step goes smoothly:
Automate testing. Incorporate data integrity tests into your data ingestion, transformation, and loading processes to catch errors early and prevent downstream issues.
Trigger tests based on events. Set up tests to run automatically when data is updated, new data sources are added, or specific events occur (see the sketch after this list).
Integrate with DevOps practices. Adopt continuous integration and continuous delivery (CI/CD) principles to ensure data quality throughout the development lifecycle.
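A minimal sketch of what the automation can look like, assuming a pandas batch and invented column names: run the checks before loading and fail loudly if any of them trip, which is also how a CI/CD job would surface the problem.

```python
import pandas as pd

def run_integrity_checks(df: pd.DataFrame) -> list[str]:
    """Run a handful of integrity checks and return a list of failure messages."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if df["customer_id"].isna().any():
        failures.append("null customer_id in required field")
    if (df["amount"] < 0).any():
        failures.append("negative order amounts")
    return failures

def load_to_warehouse(df: pd.DataFrame) -> None:
    """Placeholder for your real load step (warehouse write, reverse ETL sync, etc.)."""
    print(f"Loaded {len(df)} rows")

def on_new_batch(df: pd.DataFrame) -> None:
    """Event-triggered entry point: run checks first, and fail before loading."""
    failures = run_integrity_checks(df)
    if failures:
        # Raising here keeps bad data from reaching downstream tools; in CI/CD,
        # the raised exception (or a non-zero exit code) fails the job the same way.
        raise ValueError("Integrity checks failed: " + "; ".join(failures))
    load_to_warehouse(df)

if __name__ == "__main__":
    batch = pd.DataFrame({
        "order_id": [1, 2, 2],
        "customer_id": [10, None, 12],
        "amount": [25.0, 40.0, -5.0],
    })
    try:
        on_new_batch(batch)
    except ValueError as err:
        print(err)
```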
6. Establish review and correction processes
In the event you come across bad data, you’ll need a process in place to remedy whatever issues are happening. Here’s how to build a process:
Categorize possible errors. These categories should include data inaccuracy, incompleteness, inconsistency, duplication, and outlier situations. Set thresholds by defining acceptable error rates or deviation levels for each data element in the context of historical trends and business impact.
Implement data quality checks. These should include automated tests for the above issues, periodic manual reviews, and alerts to trigger immediate action when specific error thresholds are crossed or anomalies are detected (see the sketch after this list).
Establish a review workflow. Identify roles and responsibilities across teams; categorize possible errors based on severity and potential business impact; and document all identified errors, assigned reviewers, correction actions, and resolution outcomes in a single centralized log that relevant teams can access.
Develop correction strategies. Establish a process for cleaning erroneous data by editing, replacing, or deleting it, depending on severity and how important the data is. Also build processes for data enrichment (e.g. filling in missing information using alternative data sources or machine learning techniques); data harmonization (through data mapping, standardization processes, and source system configuration changes); and data archiving (with clear annotations and timestamps to ensure your history is saved for visibility during future audits).
Continuously improve. Continue analyzing root causes; review and update your processes as needed; and train and educate relevant data stakeholders on best practices, error identification, and correction techniques.
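As a sketch of the threshold-and-alert part of this workflow, the snippet below compares observed error rates per category against acceptable limits and appends any breach to a review log. The categories, rates, and reviewer role are placeholders; in practice the log would live in a shared table or ticketing system.

```python
from datetime import datetime, timezone

# Acceptable error rates per category, set against historical trends and
# business impact (the numbers here are placeholders).
THRESHOLDS = {"inaccuracy": 0.01, "incompleteness": 0.02, "duplication": 0.005}

# Results from the latest test run: observed error rate per category.
observed = {"inaccuracy": 0.004, "incompleteness": 0.035, "duplication": 0.001}

review_log = []   # stand-in for a centralized, shared log

for category, rate in observed.items():
    if rate > THRESHOLDS[category]:
        review_log.append({
            "detected_at": datetime.now(timezone.utc).isoformat(),
            "category": category,
            "observed_rate": rate,
            "threshold": THRESHOLDS[category],
            "assigned_reviewer": "data-quality-oncall",   # placeholder role
            "status": "open",
        })
        print(f"ALERT: {category} error rate {rate:.1%} exceeds threshold "
              f"{THRESHOLDS[category]:.1%}; logged for review.")
```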
And don’t forget about your data dictionary! Every bit of relevant data at your company should already be defined—just make sure that all teams treat this dictionary as the source of truth when disputes arise over a data element. By forcing everyone to think about your data in the same way, you’ll break down silos and ensure that your data is clean, correct, and accessible to everyone who needs it.
----
Data integrity is an ongoing process, not a one-time project. By implementing these detailed recommendations, you can establish robust review and correction processes that effectively catch bad data, fix it, and ensure that data silos remain a thing of the past.