The cost of bad data is pretty high. If you’re using bad data in your analytics, then you’re likely to make bad decisions based on it. Given that data is now considered to be every organization’s most valuable asset, it’s worth investing the time to make sure it’s as accurate and reliable as possible. And looking at your data integrity is a great place to start.
How does data integrity differ from data quality?
First of all, let’s consider the difference between data integrity and data quality. Data integrity is the completeness, accuracy, consistency, timeliness and compliance of the data within your systems. Data integrity includes: physical integrity, which means that the data is stored on a safe, reliable physical platform; logical integrity, which means that data is accurate and correct, and unchanged, even if it is used in different contexts in a relational database; and compliance, which means that data meets necessary compliance standards, such as the standards for GDPR.
Data integrity is a component of data quality. You can’t have high quality data if you don’t have data integrity. Data quality includes all of the characteristics of data integrity as well as:
- Reasonability Does the data reasonably meet the criteria established for it?
- Uniqueness Have duplicate records been removed?
- Validity Does the data meet the criteria established for the particular field and/or table in which it appears?
- Accessibility Can users get to the data they need when they need it?
Here are some characteristics of data integrity to consider.
Is your data complete? There are several different ways to answer this question. It’s pretty easy to tell if one record is complete, such as having the name, email address, and phone number for one customer. It’s harder to determine if you have a complete set of data. Do we have a record in our database for every single customer of ours?
Data that is accurate is in the correct form and provides the appropriate information within context. Achieving data accuracy is the responsibility of every person in your organization who enters data, updates it, moves it, or downloads it.
Is data consistent throughout the organization? There can be many answers to this question. Consistency can be very low level: Are our customers’ phone numbers in our database all formatted the same way? Consistency can also be very high level: Are different groups within our organization using the same data sets? Making sure data is consistent is a big job that spans all of your processes from data entry to data integration.
Data must be collected promptly to be valuable. Data that’s a day or even an hour old isn’t as valuable as data that was collected a few minutes ago or better yet in real time. Data that was collected years ago may be valuable, but only in certain contexts, such as when measuring progress over the years.
Does data meet all of the applicable compliance standards for your organization? If your organization is required to comply with GDPR or other privacy regulations, it’s important to make sure that you store your data safely and that personally identifiable information (PII) is protected.
Here are a few ways you can perform data integrity analysis:
Test your data regularly for incomplete or redundant entries
If you test frequently, you can see start to see patterns. An increasing number of incomplete or redundant entries means that your data integrity is getting worse. A decreasing number of these issues means that your data is getting better.
Look for missing data
Missing data or empty values within a field indicates there’s a problem with your data integrity. Missing data means that your data is failing in terms of completeness. Again, if you’re testing for this issue on a regular basis, you can determine if your data integrity is getting better or worse.
Watch for increasing storage costs
If your data storage costs are increasing, but the amount of data that you are using in your organization is about the same, then you may have an issue. Rising costs in this case indicate that it’s time to deduplicate your data.
Want to Learn More About Data Integrity?
Given the increasing volume and complexity of data, and the speed and scale needed to handle it, the only place you can compete effectively—and cost-effectively—is in the cloud. Matillion provides a complete data integration and transformation solution that is built for cloud data platforms.
Matillion Data Loader is a free SaaS-based data integration tool that seamlessly loads valuable business data into your cloud data warehouse. With a code-free, wizard-based pipeline builder to common data sources like Salesforce, Google Analytics, and more, Matillion Data Loader can help make your first data migration (and every one after) quick and easy. It’s also free. Sign up today to try Matillion Data Loader and kickstart your data migration project.
Matillion ETL software is cloud-native data integration and transformation, built to support leading cloud data warehouse environments, including Snowflake, Delta Lake on Databricks, Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse.
Request a demo to learn more about how you can unlock the potential of your data with Matillion ETL’s cloud-based approach to data transformation.