Data Quality and MDM

David Loshin, in Master Data Management, 2009

5.3 Dimensions of Data Quality

We must have some yardstick for measuring the quality of master data. Similar to the way that data quality expectations for operational or analytical data silos are specified, master data quality expectations are organized within defined data quality dimensions to simplify their specification and measurement/validation. This provides an underlying structure to support the expression of data quality expectations that can be reflected as rules employed within a system for validation and monitoring. By using data quality tools, data stewards can define minimum thresholds for meeting business expectations and use those thresholds to monitor data validity with respect to those expectations, which then feeds into the analysis and ultimate elimination of root causes of data issues whenever feasible.

Data quality dimensions are aligned with the business processes to be measured, such as measuring the quality of data associated with data element values or presentation of master data objects. The dimensions associated with data values and data presentation lend themselves well to system automation, making them suitable for employing data rules within the data quality tools used for data validation. These dimensions include (but are not limited to) the following:

Uniqueness

Accuracy

Consistency

Completeness

Timeliness

Currency

5.3.1 Uniqueness

Uniqueness refers to requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures. Asserting uniqueness of the entities within a data set implies that no entity logically exists more than once within the MDM environment and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set. For example, in a master product table, each product must appear once and be assigned a unique identifier that represents that product across the client applications.

The dimension of uniqueness is characterized by stating that no entity exists more than once within the data set. When there is an expectation of uniqueness, data instances should not be created if there is an existing record for that entity. This dimension can be monitored two ways. As a static assessment, it implies applying duplicate analysis to the data set to determine if duplicate records exist, and as an ongoing monitoring process, it implies providing an identity matching and resolution service inlined within the component services supporting record creation to locate exact or potential matching records.
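The static assessment described above can be sketched as a simple duplicate analysis. The following is a minimal illustration, not a production matching service; the record structure and the `product_id` key field are hypothetical:

```python
from collections import Counter

def find_duplicates(records, key_field):
    """Return key values that appear more than once in the data set."""
    counts = Counter(r[key_field] for r in records)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical master product records; product_id is assumed to be the unique key.
products = [
    {"product_id": "P001", "name": "Widget"},
    {"product_id": "P002", "name": "Gadget"},
    {"product_id": "P001", "name": "Widget (duplicate entry)"},
]

duplicates = find_duplicates(products, "product_id")
```

A real identity-resolution service would also look for approximate matches (name and address similarity, for example), not just exact key collisions.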

5.3.2 Accuracy

Data accuracy refers to the degree to which data correctly represent the “real-life” objects they are intended to model. In many cases, accuracy is measured by how the values agree with an identified source of correct information (such as reference data). There are different sources of correct information: a database of record, a similar corroborative set of data values from another table, dynamically computed values, or perhaps the result of a manual process. Accuracy is actually quite challenging to monitor, not just because one requires a secondary source for corroboration, but because real-world information may change over time. If corroborative data are available as a reference data set, an automated process can be put in place to verify the accuracy, but if not, a manual process may be instituted to contact existing sources of truth to verify value accuracy. The amount of effort expended on manual verification is dependent on the degree of accuracy necessary to meet business expectations.
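When a corroborative reference data set is available, the automated verification mentioned above amounts to comparing each record's value against the reference and measuring the agreement rate. A minimal sketch, with hypothetical vendor records and reference data:

```python
def accuracy_rate(records, reference, key_field, value_field):
    """Fraction of corroborated records whose value agrees with the reference."""
    matched = 0
    checkable = 0
    for r in records:
        ref_value = reference.get(r[key_field])
        if ref_value is None:
            continue  # no corroborating value; candidate for manual verification
        checkable += 1
        if r[value_field] == ref_value:
            matched += 1
    return matched / checkable if checkable else None

# Hypothetical reference data set: vendor id -> verified phone number.
reference = {"V1": "555-0100", "V2": "555-0101"}
records = [
    {"vendor_id": "V1", "phone": "555-0100"},  # agrees with reference
    {"vendor_id": "V2", "phone": "555-9999"},  # disagrees
    {"vendor_id": "V3", "phone": "555-0102"},  # no reference value available
]
rate = accuracy_rate(records, reference, "vendor_id", "phone")
```

Records with no reference value (like V3 here) are the ones that fall through to the manual process the text describes.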

5.3.3 Consistency

Consistency refers to data values in one data set being consistent with values in another data set. A strict definition of consistency specifies that two data values drawn from separate data sets must not conflict with each other. Note that consistency does not necessarily imply correctness.

The notion of consistency with a set of predefined constraints can be even more complicated. More formal consistency constraints can be encapsulated as a set of rules that specify consistency relationships between values of attributes, either across a record or message, or along all values of a single attribute. However, there are many ways that process errors may be replicated across different platforms, sometimes leading to data values that may be consistent even though they may not be correct.

An example of a consistency rule verifies that, within a corporate hierarchy structure, the sum of the number of customers assigned to each customer representative does not exceed the number of customers for the entire corporation.
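That rule can be expressed directly as an executable assertion. A minimal sketch (the representative assignments and totals are invented for illustration):

```python
def rep_assignments_consistent(customers_per_rep, total_customers):
    """Consistency rule: the sum of customers assigned to each
    representative must not exceed the corporate customer count."""
    return sum(customers_per_rep.values()) <= total_customers

# Hypothetical assignments within one corporate hierarchy.
assignments = {"rep_a": 40, "rep_b": 55}

consistent = rep_assignments_consistent(assignments, total_customers=100)
inconsistent = rep_assignments_consistent(assignments, total_customers=90)
```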

Consistency Contexts

Between one set of attribute values and another attribute set within the same record (record-level consistency)

Between one set of attribute values and another attribute set in different records (cross-record consistency)

Between one set of attribute values and the same attribute set within the same record at different points in time (temporal consistency)

Across data values or data elements used in different lines of business or in different applications

Consistency may also take into account the concept of “reasonableness,” in which some range of acceptability is imposed on the values of a set of attributes.

5.3.4 Completeness

The concept of completeness implies the existence of non-null values assigned to specific data elements. Completeness can be characterized in one of three ways. The first is asserting mandatory value assignment—the data element must have a value. The second expresses value optionality, essentially only forcing the data element to have (or not have) a value under specific conditions. The third is in terms of data element values that are inapplicable, such as providing a “waist size” for a hat.
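The three characterizations of completeness can be checked mechanically. Below is a hedged sketch of a rule checker covering all three; the record layout and the hat/waist-size rule follow the example in the text, while the field names and predicate style are assumptions:

```python
def completeness_violations(record, mandatory, conditional, inapplicable):
    """Return completeness-rule violations for a single record.

    mandatory:    fields that must always have a value
    conditional:  {field: predicate} -- field must have a value
                  when predicate(record) is true
    inapplicable: {field: predicate} -- field must be empty
                  when predicate(record) is true
    """
    violations = []
    for f in mandatory:
        if record.get(f) in (None, ""):
            violations.append(f"missing mandatory value: {f}")
    for f, applies in conditional.items():
        if applies(record) and record.get(f) in (None, ""):
            violations.append(f"missing conditional value: {f}")
    for f, applies in inapplicable.items():
        if applies(record) and record.get(f) not in (None, ""):
            violations.append(f"inapplicable value present: {f}")
    return violations

# Hypothetical product record: waist_size is inapplicable for hats.
record = {"sku": "H-17", "category": "hat", "waist_size": 34}
violations = completeness_violations(
    record,
    mandatory=["sku", "category"],
    conditional={},
    inapplicable={"waist_size": lambda r: r["category"] == "hat"},
)
```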

5.3.5 Timeliness

Timeliness refers to the time expectation for accessibility and availability of information. Timeliness can be measured as the time between when information is expected and when it is readily available for use. In the MDM environment, this concept is of particular interest, because synchronization of data updates to application data with the centralized resource supports the concept of the common, shared, unique representation. The success of business applications relying on master data depends on consistent and timely information. Therefore, service levels specifying how quickly the data must be propagated through the centralized repository should be defined so that compliance with those timeliness constraints can be measured.
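Because timeliness is defined as the gap between when information is expected and when it is available, a service-level check reduces to comparing that gap against an agreed maximum. A minimal sketch, with invented timestamps and service level:

```python
from datetime import datetime, timedelta

def timeliness_lag(expected_at, available_at):
    """Time between when information was expected and when it became available."""
    return max(available_at - expected_at, timedelta(0))

def meets_service_level(expected_at, available_at, max_lag):
    """True if the propagation delay stayed within the agreed service level."""
    return timeliness_lag(expected_at, available_at) <= max_lag

# Hypothetical: updates are expected in the repository by 09:00,
# and the service level allows a 30-minute propagation window.
expected = datetime(2024, 1, 1, 9, 0)
available = datetime(2024, 1, 1, 9, 20)
ok = meets_service_level(expected, available, max_lag=timedelta(minutes=30))
```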

5.3.6 Currency

Currency refers to the degree to which information is up to date with the world that it models and whether it is correct despite possible time-related changes. Currency may be measured as a function of the expected frequency rate at which the master data elements are expected to be updated, as well as verifying that the data are up to date, which potentially requires both automated and manual processes. Currency rules may be defined to assert the “lifetime” of a data value before it needs to be checked and possibly refreshed. For example, one might assert that the telephone number for each vendor must be current, indicating a requirement to maintain the most recent values associated with the individual's contact data. An investment may be made in manual verification of currency, as with validating accuracy. However, a more reasonable approach might be to adjust business processes to verify the information at transactions with counterparties in which the current values of data may be established.
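A currency rule asserting the "lifetime" of a data value, as described above, can be sketched as a staleness check against the last verification date. The 180-day lifetime for vendor phone numbers below is an invented example of such a rule:

```python
from datetime import date, timedelta

def is_current(last_verified, lifetime_days, today):
    """A value is current while its age is within the asserted lifetime."""
    return (today - last_verified) <= timedelta(days=lifetime_days)

# Hypothetical rule: vendor phone numbers must be re-verified every 180 days.
stale = not is_current(date(2024, 1, 1), 180, today=date(2024, 12, 1))
```

Values flagged as stale would then be routed to whatever refresh process the business has chosen, such as verification at the next counterparty transaction.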

5.3.7 Format Compliance

Every modeled object has a set of rules bounding its representation, and conformance refers to whether data element values are stored, exchanged, and presented in a format that is consistent with the object's value domain, as well as consistent with similar attribute values. Each column has metadata associated with it: its data type, precision, format patterns, use of a predefined enumeration of values, domain ranges, underlying storage formats, and so on. Parsing and standardization tools can be used to validate data values against defined formats and patterns to monitor adherence to format specifications.
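Format validation against a defined pattern is straightforward to automate. As an illustration, the sketch below checks values against a hypothetical US ZIP code pattern; a real deployment would use the patterns defined in the column's metadata:

```python
import re

# Hypothetical format rule: five digits, optionally followed by
# a hyphen and four more digits (ZIP+4).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def conforms(value, pattern=ZIP_PATTERN):
    """True if the value matches the defined format pattern."""
    return bool(pattern.match(value))

results = [conforms(v) for v in ["02139", "02139-4307", "2139", "ABCDE"]]
```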

5.3.8 Referential Integrity

Assigning unique identifiers to those data entities that ultimately are managed as master data objects (such as customers or products) within the master environment simplifies the data's management. However, the need to index every item using a unique identifier introduces new expectations any time that identifier is used as a foreign key across different data applications. There is a need to verify that every assigned identifier is actually assigned to an entity existing within the environment. Conversely, for any “localized” data entity that is assigned a master identifier, there must be an assurance that the master entity matches that identifier. More formally, this is referred to as referential integrity. Rules associated with referential integrity often are manifested as constraints against duplication (to ensure that each entity is represented once, and only once) and reference integrity rules, which assert that all values used for all keys actually refer back to an existing master record.
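The reference integrity rule above (every key value must refer back to an existing master record) can be sketched as a set difference between foreign-key values and master identifiers. The order and customer records here are hypothetical:

```python
def referential_integrity_violations(local_records, master_ids, fk_field):
    """Return foreign-key values that do not refer to any existing master record."""
    return sorted({r[fk_field] for r in local_records} - master_ids)

# Hypothetical master customer identifiers and localized order records.
master_customer_ids = {"C1", "C2", "C3"}
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C9"},  # dangling reference
]

dangling = referential_integrity_violations(orders, master_customer_ids, "customer_id")
```

In a relational database the same constraint would normally be enforced declaratively with a FOREIGN KEY clause; a check like this is useful when data crosses application boundaries where no such constraint exists.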


URL: https://www.sciencedirect.com/science/article/pii/B9780123742254000059

Important Roles of Data Stewards

David Plotkin, in Data Stewardship, 2014

Understanding the Dimensions of Data Quality

As the Data Stewards move forward to define the data quality rules, it is helpful to understand the dimensions of data quality. Data quality dimensions are ways to categorize types of data quality measurements. In general, different dimensions require different measurements. For example, measuring completeness requires analyzing whether there is a value (any value) in a field, whereas measuring validity requires comparing existing format, patterns, data types, ranges, values, and more to the defined set of what is allowable.

While many dimensions have been listed and defined in various texts on data quality, there are key dimensions for which Business Data Stewards can relatively easily define data quality rules. Once the conformance to those rules has been measured by a data profiling tool, data quality can be assessed by Data Stewards. As mentioned earlier, the data quality necessary for business purposes can be defined, and data profiling will tell you when you’ve achieved that goal.

The data quality dimensions that are most often used when defining data quality rules are listed in Table 7.1.

Gauging Metadata Quality

Many of the same or similar dimensions of data quality can be used to gauge the quality of the metadata collected and provided by Business Data Stewards. Measuring the metadata quality by these dimensions helps to quantify the progress and success of the metadata component of Data Stewardship. The metadata quality dimensions may include:

Completeness: Are all the required metadata fields populated? For example, a derived data element may have a definition but not a derivation, and is thus not complete.

Validity: Does the content meet the requirements for the particular type of metadata? For example, if the metadata is a definition, does the definition meet your published standards and make sense?

Accessibility: Is all that metadata you’ve worked so hard on easily available and usable to those who need it? Do they understand how to access metadata (e.g., clicking a link to view an associated definition)? Do they understand the value of using the metadata?

Timeliness: Is the metadata available and up to date at the time someone is attempting to use it to understand the data he or she is accessing?

Consistency: Is all the metadata consistent, or does it conflict with other metadata? For example, does a usage business rule demand that data be used when the creation business rules state it hasn’t been created yet?

Accuracy: Is the metadata from a verifiable source, and does it describe the data it is attached to correctly? For example, if the metadata is a definition, does the definition correctly define the data element in question?
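The metadata completeness check described above (for example, a derived element needing both a definition and a derivation) can be sketched as a required-fields rule per metadata type. The type names and required fields below are invented for illustration:

```python
# Hypothetical required metadata fields per metadata type.
REQUIRED_FIELDS = {
    "derived_element": ["definition", "derivation"],
    "base_element": ["definition"],
}

def metadata_complete(entry):
    """True if all required metadata fields for the entry's type are populated."""
    required = REQUIRED_FIELDS.get(entry["type"], [])
    return all(entry.get(f) for f in required)

# A derived element with a definition but no derivation is not complete.
entry = {"type": "derived_element", "definition": "Net revenue", "derivation": None}
complete = metadata_complete(entry)
```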


URL: https://www.sciencedirect.com/science/article/pii/B9780124103894000076

Dimensions of Data Quality

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

8.2.4 Classifying Dimensions

The classifications for the practical data quality dimensions are the following:

1. Accuracy
2. Lineage
3. Structural consistency
4. Semantic consistency
5. Completeness
6. Consistency
7. Currency
8. Timeliness
9. Reasonableness
10. Identifiability

The relationships between the dimensions are shown in Figure 8.2.

Figure 8.2. Practical dimensions of data quality.

The main consideration in isolating critical dimensions of data quality is to provide universal metrics for assessing the level of data quality in different operational or analytic contexts. Using this approach, the data stewards can work with the line-of-business managers to:

Define data quality rules that represent the validity expectations,

Determine minimum thresholds for acceptability, and

Measure against those acceptability thresholds.

In other words, the assertions that correspond to these thresholds can be transformed into rules used for monitoring the degree to which measured levels of quality meet defined and agreed-to business expectations. In turn, metrics that correspond to these conformance measures provide insight into examining the root causes that are preventing the levels of quality from meeting those expectations.
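The three steps above (define validity rules, set acceptability thresholds, measure against them) can be sketched as threshold-based conformance monitoring. The age rule and the 95% threshold below are invented examples:

```python
def conformance(values, rule):
    """Fraction of values that satisfy a validity rule."""
    values = list(values)
    return sum(1 for v in values if rule(v)) / len(values) if values else 1.0

def meets_threshold(values, rule, threshold):
    """True if the measured conformance meets the agreed acceptability threshold."""
    return conformance(values, rule) >= threshold

# Hypothetical validity expectation for a customer age attribute.
def valid_age(a):
    return 0 <= a <= 120

ages = [34, 41, -2, 58, 29]  # -2 violates the validity rule
ok = meets_threshold(ages, valid_age, threshold=0.95)
```

When a measure falls below its threshold, as here, that signal is what feeds the root-cause analysis the text describes.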


URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000087

Key Concepts

Danette McGilvray, in Executing Data Quality Projects (Second Edition), 2021

Data Quality Dimensions as Used in the Ten Steps

There is no universal standard list of data quality dimensions, though there are similarities between them. Each of the data quality dimensions as used in the Ten Steps Process is defined in Table 3.3. The dimensions included are those features and aspects of data that provide the most practically useful information about data quality, that most organizations are interested in, and that are feasible to assess, improve, and manage within the usual constraints of most organizations. They have been proven through years of experience and are also based on knowledge learned from other experts.

Table 3.3. Data Quality Dimensions in the Ten Steps Process – Names, Definitions, and Notes

Data Quality Dimensions – what they are and how to use them. A data quality dimension is a characteristic, aspect, or feature of data. Data quality dimensions provide a way to classify information and data quality needs. Dimensions are used to define, measure, improve, and manage the quality of data and information. Instructions for assessing the quality of data using the following dimensions are included in Step 3 – Assess Data Quality in Chapter 4: The Ten Steps Process.
Notes: The data quality dimensions in The Ten Steps methodology are categorized roughly by the techniques or approach used to assess each dimension. This helps to better scope and plan a project by providing input when estimating the time, money, tools, and human resources needed to do the data quality work. Differentiating the data quality dimensions in this way helps to: 1) match dimensions to business needs and data quality issues, 2) prioritize which dimensions to assess and in which order, 3) understand what you will (and will not) learn from assessing each data quality dimension, and 4) better define and manage the sequence of activities in your project plan within time and resource constraints.

Substep: Data Quality Dimension Name, Definition, and Notes
3.1 Perception of Relevance and Trust. The subjective opinion of those who use information and/or those who create, maintain, and dispose of data regarding: 1) Relevance – which data is of most value and importance to them, and 2) Trust – their confidence in the quality of the data to meet their needs.
Notes: It is often said that perception is reality, and people act in accordance with their perceptions. Users' feelings about relevance and trust influence whether they readily accept and use the organization's data and information (or not). If users believe the data quality is poor, they are less likely to use the organization's data sources and more likely to create their own spreadsheets or databases to manage their data. This leads to the proliferation of “spreadmarts” with duplicate and inconsistent data, often without adequate access and security controls in place.
This dimension gathers the opinions of those who use and/or manage the data and information through a formalized survey (individual interviews, group workshops, online surveys, etc.). While in contact with users, it makes sense to ask questions about both the value/business impact and trust in data quality. Because the reason for surveying the users could be prompted from either the data quality or business impact viewpoint, it is included as both a Data Quality Dimension in Step 3.1 and as a Business Impact Technique in Step 4.7.
Opinions about the quality of the data can be compared to other data quality assessment results showing the actual data quality. This allows you to uncover and address any gaps between perception and reality.
An assessment of the Perception of Relevance and Trust could be conducted on an annual or biennial basis much as employee satisfaction surveys are conducted. Results are compared for trends over time and to ascertain: 1) If the data being managed is still relevant to the users, and 2) If actions to prevent data quality issues are leading to expected increases in trust of the data.
3.2 Data Specifications. Data Specifications include any information and documentation that provide context, structure, and meaning to data. Data Specifications provide information needed to make, build, produce, assess, use, and manage data and information. Examples include metadata, data standards, reference data, data models, and business rules. Without the existence, completeness, and quality of data specifications it is difficult to produce high-quality data and harder to measure, understand, and manage the quality of data content.
Notes: Data specifications provide the standard against which to compare data quality assessment results. They also provide instructions for manually entering data, designing data load programs, updating information, and developing applications.
3.3 Data Integrity Fundamentals. The existence (completeness/fill rate), validity, structure, content, and other basic characteristics of data.
Notes: Most other dimensions of quality build on what is learned in Data Integrity Fundamentals. If you don't know anything else about your data, you need to know what is learned in this assessment. Data Integrity Fundamentals uses the technique of data profiling to assess basic characteristics of data such as completeness/fill rate, validity, lists of values and frequency distributions, patterns, ranges, maximum and minimum values, precision, and referential integrity. What other lists may call out as separate dimensions of data quality (e.g., completeness and validity) are part of this overarching dimension because all can be assessed using the same technique of data profiling.
Note that this dimension assesses the quality, which is different from how you may choose to report results on a report or dashboard. You may want to report completeness and validity separately, for example, or you may combine results and report quality by a particular data field or dataset. Don't confuse the reporting categorization and presentation with the detail you need to know and learn from this data quality dimension.
3.4 Accuracy. The correctness of the content of the data as compared to an agreed-upon and accessible authoritative source of reference.
Notes: The word accuracy is often used by some as a synonym for data quality in general. Data professionals know they are not the same. Data accuracy requires comparing the data to the real-world object it represents (the authoritative source of reference). Sometimes it is not possible to access the real-world object represented by the data, and in those cases a carefully chosen substitute for it may be used as the authoritative source of reference. The assessment of accuracy can be a manual and time-consuming process. Following are examples of what is learned through assessing Data Integrity Fundamentals using the technique of data profiling compared to what is learned in an Accuracy assessment:

Data profiling reveals if an item number contains a valid code indicating a make or buy part, but only an accuracy assessment using someone familiar with item #123 can determine if that particular item actually is a make item or a buy item.

Data profiling shows if a customer record has a valid pattern for a postal code, but only an accuracy assessment that contacts the customer or uses a secondary authoritative source such as a postal service list can tell if it is his or her postal code.

Data profiling displays the value for the number of products on hand in an inventory database and confirms it is the correct data type. But only by completing an accuracy assessment with someone manually counting or scanning products on the shelf and comparing that number to the record in the inventory system can you know if the inventory count in the database is an accurate reflection of the inventory on hand.

3.5 Uniqueness and Deduplication. The uniqueness (positive) or unwanted duplication (negative) of data (fields, records, or datasets) existing within or across systems or datastores.
Notes: There are many hidden costs associated with duplicate records. For example, duplicate vendor records with the same name and different addresses make it difficult to ensure that payment is sent to the correct address. When purchases by one company are associated with duplicate master records, the credit limit for that company can unknowingly be exceeded. This can expose the business to unnecessary credit risks.
Identifying duplicates requires different processes and tools than Data Integrity Fundamentals, Accuracy, or the other data quality dimensions.
3.6 Consistency and Synchronization. Equivalence of data stored or used in various datastores, applications, and systems.
Notes: Equivalence is the degree to which data stored in multiple places are conceptually equal. The same technique of data profiling used in Data Integrity Fundamentals can be used on the same data as it appears in multiple datastores or datasets. Results are compared for consistency. Consistency and Synchronization is called out as a separate data quality dimension because each datastore that is compared adds time and work to your project. If it is necessary, it should be done, but knowing that you are assessing Consistency and Synchronization, in addition to Data Integrity Fundamentals, will help you better plan for the additional resources and time required.
3.7 Timeliness. Data and information are current and ready for use as specified and in the time frame in which they are expected.
Notes: This data quality dimension is easily added onto Consistency and Synchronization if the factor of how long it takes for the data to move from one datastore to another is a concern. It is called out separately here because you may or may not want to look at Timeliness. It is easy to add in or take out as needed.
As the world changes, there will always be a gap between when the real-world object changes and when data representing it is updated in a database and ready to be utilized. This assessment can also examine the timing of data throughout its life cycle and determine if data is being updated in time to meet business needs.
3.8 Access. The ability to control how authorized users can view, modify, use, or otherwise process data and information.
Notes: Access is often summed up as enabling the right individuals to access the right resources, at the right times, under the right circumstances. Defining the right individuals, right resources, right times, and right circumstances are business decisions, which need to balance enabling access against protecting sensitive data and information. Other functions such as Information Security or a separate Access Management team often manage access. Your data quality team should work closely with them if your project addresses this data quality dimension of Access.
3.9 Security and Privacy. Security is the ability to protect data and information assets from unauthorized access, use, disclosure, disruption, modification, or destruction (US Department of Commerce, n.d.). Privacy for an individual is the ability to have some control over how data about them as a person is collected and used. For an organization, it is the ability to comply with how people want their data to be collected, shared, and used.
Notes: Securing and protecting data require specialized expertise. This dimension does NOT replace that. However, data professionals should find and work with information security teams and others responsible for privacy in their organizations to ensure: 1) Appropriate security and privacy safeguards are in place for the data with which they are concerned, 2) They understand the security and privacy requirements with which the data they manage must comply. The data quality dimension of Access (Step 3.8) is focused on access for authorized users. Security also includes access, but it is focused on unauthorized access.
The Information Security world (InfoSec) often talks about protecting through the CIA Triad: Confidentiality, Integrity, and Availability.
3.10 Presentation Quality. The format, appearance, and display of data and information support their collection and uses.
Notes: Presentation quality applies to user interfaces, reports, surveys, dashboards, etc. Presentation quality can affect the quality of the data when it is collected. For example, does a dropdown list contain valid options or codes? Are questions framed in such a way that users understand what is being asked and can provide correct responses? Are the various methods for collecting the same data presented in a consistent manner? Presenting information in a way that is understandable from the user’s perspective increases the quality of data from the time of collection. In this respect, good presentation quality is an excellent way to prevent data quality errors.
Presentation quality also applies when using the data. For example, is the user interpreting a report correctly? Does it have descriptive column headings and include dates? Do the graphics, reports, or user interfaces correctly reflect the data behind them? If the data in the datastore is of high quality itself but is misunderstood when shown to a user, you still do not have data quality.
3.11 Data Coverage. The comprehensiveness of data available compared to the total data universe or population of interest.
Notes: This dimension looks at how the datastore reflects the total population of interest to the business. For example, a database should contain all customers in North and South America, but there are concerns that the database really reflects only a portion of the company’s customers. Coverage in this example is the percentage of customers actually captured in the datastore compared to the population of all customers that should be in it. This dimension is more in-depth than general decisions made about population, coverage, and selection criteria when developing the project data capture and assessment plans.
3.12 Data Decay. Rate of negative change to the data.
Notes: Knowing data decay rates helps determine whether mechanisms should be put into place to maintain the data and what the frequency of those updates should be. This dimension is an example where just the idea of the dimension can spur you to action, often without the need for an in-depth assessment. Volatile data requiring a high reliability level requires more frequent updates than data with a lower decay rate or data where the quality level may not need to be so high. If essential data is already known to change quickly, then efforts should also quickly move to finding solutions. Determine how to become aware of the change in the real world. Develop processes where data can be updated within the organization as soon and as close to the real-world change as possible.
3.13 Usability and Transactability. Data produces the desired business transaction, outcome, or use for which it was intended.
Notes: This dimension is a way to determine if the data is fit for purpose and is a final checkpoint of data quality. Even if the right people have defined business requirements and prepared data to meet them, the data must produce the expected outcome or use – or we still do not have high-quality data. Can the data be used to complete a transaction? Can the invoice be created? Can a sales order be completed? Can a lab order be produced? Can an insurance claim be initiated? Can an item master record be used properly when building a Bill of Materials? Can a report be correctly generated? If the answer is no, we still do not have high-quality data.
3.14 Other Relevant Data Quality Dimensions. Other characteristics, aspects, or features of data and information that are deemed to be important for your organization to define, measure, improve, monitor, and manage.
Notes: There may be characteristics of data that are specific to an organization, uses of the data, or aspects of a data quality issue that are not already covered in the list of data quality dimensions. For example, within one corporate finance group, the main concern was reconciliation between systems. Some could call this completeness, but the word “reconciliation” is clearer to them and specific to their context. If another dimension is chosen, it can still be used within the context of the full Ten Steps Process, just like any of the others.

Instructions for conducting an assessment for each of the dimensions can be found in Step 3 – Assess Data Quality in Chapter 4, with a separate substep for each of the dimensions as noted in the table. The dimensions are numbered for reference purposes and do not propose the order for completing the assessments. Suggestions for helping you select the DQ dimensions most relevant to your needs and to include in your project are discussed below.

Many lists of data quality dimensions exist and may use different terms or categories for different purposes than shown here. Once again, the dimensions used in the Ten Steps are categorized roughly by how you would assess each dimension. For example, the dimension of Data Integrity Fundamentals includes what other lists may call out as separate dimensions (e.g., completeness and validity). Both, along with other characteristics of data, are part of the overarching dimension called Data Integrity Fundamentals because all can be assessed using the technique of data profiling.

Warning! Do not get overwhelmed by the list of data quality dimensions. You will not assess all of them. Having them categorized this way actually makes it easier for you to select the ones most relevant to your business needs and data quality issues.
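As an illustration of the data profiling technique that underpins Data Integrity Fundamentals (Step 3.3), the sketch below computes a few of the basic column characteristics the text lists: fill rate, distinct values, frequency distribution, and min/max. It is a toy single-column profile, not a substitute for a profiling tool:

```python
from collections import Counter

def profile_column(values):
    """Basic profile of one column: fill rate, distinct count,
    frequency distribution, and min/max of the populated values."""
    populated = [v for v in values if v not in (None, "")]
    return {
        "fill_rate": len(populated) / len(values) if values else 0.0,
        "distinct": len(set(populated)),
        "frequencies": Counter(populated),
        "min": min(populated) if populated else None,
        "max": max(populated) if populated else None,
    }

# Hypothetical column with one missing value.
profile = profile_column([10, 20, 20, None, 30])
```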


URL: https://www.sciencedirect.com/science/article/pii/B9780128180150000098

Data Quality Management

Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015

Assess and Define

The actions of assessing and defining are what make DQM different from other solution design activities, such as application development. In typical application development, business units have a very clear understanding of the functionality required for them to perform their jobs. New requirements, although vulnerable to misinterpretation, are usually very specific and can be stated without much need for research. Software defects might sometimes be difficult to replicate, diagnose, and fix, but again, the functionality to be delivered is mostly clear.

With data, the landscape is a bit different. One of the mantras of data quality is fitness for purpose: data must have the quality required to fulfill a business need. The business is clearly the owner of the data, and data’s fitness for purpose is the ultimate goal. However, there are some issues. First, the business does not know what it does not know; much of the time, company personnel have no idea of the extent of their data quality problems. Second, fitness for purpose can be difficult to quantify. In some situations you simply want to get the data as good as possible to minimize associated costs, but it is not clear how much the data can be improved. Third, the business might not fully understand the ripple effects of a data change.

In a multi-domain MDM implementation, there are certain predefined data-quality expectations, such as the ability to link inconsistent data from disparate sources, survive the best information possible, and provide a single source of truth. Still, the current quality state is mostly unknown, which tremendously increases the challenge of scoping the work and determining the results. The business might have a goal, but achieving it can be rather difficult.

The only way to really understand what can be accomplished is to perform an extensive data-quality assessment. The importance of data profiling cannot be overstated: to correct data problems efficiently and effectively, you must clearly understand what each issue is, its extent, and its impact. To get there, data must be analyzed fully and methodically. For an extensive discussion of data profiling, consult Chapter 8.

A recommended approach to organizing data-quality problems is to categorize them into dimensions. Data-quality dimensions allow complex areas of data quality to be subdivided into groups, each with its own particular way of being measured. Data-quality dimensions are distinguishable characteristics of a data element, and the actual distinction should be business-driven. Since it is possible to characterize a particular data element in many ways, there is no single definitive list of data-quality dimensions. They can be defined in many ways, with slight variations or even overlapping context. The following list represents the types of dimensions and definitions generally used:

Completeness: Level of data missing or unusable.

Conformity: Degree of data stored in a nonstandard format.

Consistency: Level of conflicting information.

Accuracy: Degree of agreement with an identified source of correct information.

Uniqueness: Level of nonduplicates.

Integrity: Degree of data corruption.

Validity: Level of data matching a reference.

Timeliness: Degree to which data is current and available for use in the expected time frame.
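These definitions lend themselves to simple automated measurement. As an illustrative sketch (not taken from the book; the records, field names, and rules are hypothetical), the following Python function scores a batch of records against three of the dimensions above:

```python
import re

def score_dimensions(records, key_field, required_fields, date_field):
    """Score a list of record dicts against three sample dimensions.

    Returns fractions in [0, 1]; higher is better for each dimension.
    """
    n = len(records)
    # Completeness: share of records with no missing required values.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    # Conformity: share of records whose date field matches YYYY-MM-DD.
    conform = sum(
        bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", r.get(date_field, "")))
        for r in records
    )
    # Uniqueness: share of distinct key values (1.0 means no duplicates).
    unique = len({r.get(key_field) for r in records})
    return {
        "completeness": complete / n,
        "conformity": conform / n,
        "uniqueness": unique / n,
    }

records = [
    {"id": "P1", "name": "Widget", "created": "2015-03-01"},
    {"id": "P2", "name": "",       "created": "03/01/2015"},  # empty name, bad date
    {"id": "P1", "name": "Widget", "created": "2015-03-01"},  # duplicate key
]
scores = score_dimensions(records, "id", ["id", "name"], "created")
```

In practice each dimension would be measured by rules in a data-profiling tool rather than hand-written code, but the structure is the same: a rule per dimension, evaluated per record, aggregated into a score.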

Data-quality dimensions are also very useful when reporting data-quality metrics as part of the Monitoring and Control phase, as will be discussed later in this section.

At the end of this step, it is possible to have a clearly stated set of requirements on what needs to be accomplished for each specific data-quality activity. Remember that most of the time when the process is initiated, there is not yet enough information to produce a complete set of requirements. This step complements the initiation by adding much-needed analysis.

URL: https://www.sciencedirect.com/science/article/pii/B9780128008355000099

Entity Identity Resolution

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

16.8.1 Quality of the Sources

Different data sources exhibit different characteristics in terms of their conformance to business expectations defined using the types of data quality dimensions described in Chapter 8. In general, when a data set is subjected to an assessment and assigned a quality score, those relative scores become an input to survivorship decisions: given a record from a data set with a high quality score matched against a record from a data set with a lower quality score, it is reasonable to prefer values from the first record when the two disagree.
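That selection rule can be sketched as a simple field-level survivorship function, assuming each matched record carries a quality score for its source (the scores and field names below are hypothetical, not from the book):

```python
def survive(record_a, score_a, record_b, score_b):
    """Merge two matched records field by field.

    Where values conflict, keep the value from the record whose
    source data set earned the higher quality score.
    """
    preferred, other = (
        (record_a, record_b) if score_a >= score_b else (record_b, record_a)
    )
    merged = {}
    for field in set(preferred) | set(other):
        value = preferred.get(field)
        # Fall back to the lower-scored source when the preferred one is empty.
        merged[field] = value if value not in (None, "") else other.get(field)
    return merged

crm = {"name": "Acme Corp", "phone": "555-0100", "city": ""}
legacy = {"name": "ACME", "phone": "555-0199", "city": "Boston"}
merged = survive(crm, 0.92, legacy, 0.71)  # CRM source scored higher
```

Note the fallback: even a higher-scored source may have gaps, so empty values still survive from the lower-scored record.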

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000166

Bringing It All Together

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

20.2.6 Data Quality Assessment

Without clearly defined measurements of the business impacts attributable to poor data quality, it is difficult to develop an appropriate business case for introducing data quality management improvement. Chapter 11 looked at an assessment process that quickly results in a report documenting potential data quality issues. That report enumerates a prioritized list of clearly identified data quality issues, recommendations for remediation, and suggestions for instituting data quality inspection and control for ongoing monitoring. It also proposes opportunities for improvement, providing an objective assessment of critical data that can determine whether current levels of data quality are sufficient to meet business expectations and, if not, support evaluating the value proposition and feasibility of data quality improvement.

This process essentially employs data profiling techniques to review data, identify potential anomalies, contribute to the specification of data quality dimensions and corresponding metrics, and recommend operational processes for data quality inspection. The process involves these five steps:

1. Planning for data quality assessment

2. Evaluating business processes

3. Preparing for data profiling

4. Profiling and analyzing

5. Synthesizing results

Connecting potential data issues to business impacts and considering the scope of those relationships allow the team to prioritize discovered issues. The data quality assessment report documents the drivers, the process, the observations, and recommendations from the data profiling process, as well as recommendations relating to any discovered or verified anomalies that have critical business impact, including tasks for identifying and eliminating the root cause of the anomaly. Some suggestions might include:

Inspection and monitoring

Additional (deeper) review of the data

One-time cleansing

Review of data model

Review of application code

Review of operational process use of technology

Review of business processes

The data quality assessment is the lynchpin of the data quality program, because it baselines the existing state, provides hard evidence as to the need, and suggests measures for inspecting data quality moving forward. In turn, the process can be used to evaluate whether the costs associated with data quality improvement would provide a reasonable return on the investment. Either way, this repeatable assessment provides tangible results that can either validate perceptions of poor data quality or demonstrate that the data meets business needs.

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000208

Metrics and Performance Improvement

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

6.5.2 Building a Control Chart

Our next step is building a control chart. The control chart is made up of data points consisting of individual or aggregated measures associated with a periodic sample, enhanced with the center line and the upper and lower control limits.

These are the steps for building a control chart for measuring data quality:

1. Select one or more data quality dimensions to chart. Use Pareto analysis to help determine the variables or attributes that most closely represent the measured problem, since the most grievous offenders are a good place to start.

2. If the goal is to find the source of particular problems, make sure to choose the right variables for charting. For example, if the dimension being charted is timeliness, consider making the charted variable the “number of minutes late” rather than “time arrived.” Keep in mind while choosing variables that the charted results should help in locating and diagnosing any problems.

3. Determine the proper location within the information chain to attach the measurement probe. This choice should reflect the following characteristics:

a. It should be early enough in the information processing chain that detecting and correcting a problem at that point can prevent incorrectness further along the data flow.

b. It should be at a location in the information chain that is easily accessed and retooled, so as not to cause too much disruption in implementing the charting process.

c. It should not be at a place where observing the sample can modify the data being observed.

4. Decide which kind of control chart is to be used. The choices are:

a. A variables chart, which measures individual measurable characteristics; a variables chart provides a lot of information about each item being produced.

b. An attributes chart, which measures the percentage or number of items that vary from the expected; an attributes chart summarizes information about the entire process, focusing on cumulative rather than individual effects.

5. Choose a center line and control limits for the chart. The center line can be the average of past measurements, the average of data not yet measured or collected, or a predefined expected standard. The upper control limit (UCL) is set at three standard deviations (+3σ) above the center line, and the lower control limit (LCL) is set at three standard deviations (−3σ) below it.

6. Choose the sample. The sample may consist of individual data values, or a collection of data values to be summarized. The sampler should take care to sample at a point in the process, and at a time, where the act of sampling is unlikely to change the data being measured.

7. Choose a method for collecting and logging the sample data. This can range from asking people to read gauges and write down the answers in a notebook to an integrated mechanism for measuring and logging sample results.

8. Plot the chart and calculate the center line and control limits based on history.
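Steps 5 and 8 reduce to a small calculation: take the mean of the historical samples as the center line and set the limits at ±3σ. A minimal sketch, using hypothetical sample data (daily counts of late-arriving records):

```python
from statistics import mean, pstdev

def control_limits(samples):
    """Center line and ±3-sigma control limits from historical measurements."""
    center = mean(samples)
    sigma = pstdev(samples)  # population standard deviation of the samples
    return center - 3 * sigma, center, center + 3 * sigma

def out_of_control(samples, new_points):
    """Flag new measurements that fall outside the control limits."""
    lcl, _, ucl = control_limits(samples)
    return [x for x in new_points if x < lcl or x > ucl]

# Hypothetical history: daily counts of late-arriving records.
history = [12, 14, 11, 13, 12, 15, 13, 12, 14, 13]
violations = out_of_control(history, [13, 29, 12])
```

Points inside the limits reflect common-cause variation; a point like the spike above signals a special cause worth diagnosing.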

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000063

The Important Roles of Data Stewards

David Plotkin, in Data Stewardship (Second Edition), 2021

Understanding the Dimensions of Data Quality

As the Business Data Stewards move forward to define the data quality rules, it is helpful to understand the dimensions of data quality. Data Quality dimensions are ways to categorize types of data quality measurements. In general, different dimensions require different measurements. For example, measuring completeness requires analyzing whether there is a value (any value) in a field, whereas measuring validity requires comparing existing data formats, patterns, data types, ranges, values, and more to a defined set of what is allowable.
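The contrast drawn above can be sketched for a single field: completeness asks only whether any value is present, while validity compares populated values against a defined set of what is allowable (the US ZIP-code pattern here is just an illustrative rule, not from the book):

```python
import re

def measure_field(values, pattern):
    """Measure one field's completeness and validity.

    Completeness: fraction of entries with any value at all.
    Validity: fraction of populated entries matching the allowed pattern.
    """
    populated = [v for v in values if v not in (None, "")]
    completeness = len(populated) / len(values)
    valid = sum(bool(re.fullmatch(pattern, v)) for v in populated)
    validity = valid / len(populated) if populated else 0.0
    return completeness, validity

zips = ["02118", "94107", "", "ABCDE", None]
completeness, validity = measure_field(zips, r"\d{5}")
```

The two measures diverge deliberately: "ABCDE" counts toward completeness (a value is present) but against validity (it breaks the pattern).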

While many dimensions have been listed and defined in various texts on data quality, there are key dimensions for which the Business Data Stewards can relatively easily define data quality rules. Once the conformance to those rules has been measured by a data profiling tool, the data quality can be assessed by Business Data Stewards. As mentioned previously, the data quality necessary for business purposes can be defined, and data profiling will tell you when you’ve achieved that goal.

The data quality dimensions that are often used when defining data quality rules are listed in Table 7.1.

URL: https://www.sciencedirect.com/science/article/pii/B9780128221327000073

Data Quality and Measurement

Laura Sebastian-Coleman, in Measuring Data Quality for Ongoing Improvement, 2013

Data Quality Dimensions

Data and information quality thinkers have adopted the word dimension to identify those aspects of data that can be measured and through which its quality can be quantified. While different experts have proposed different sets of data quality dimensions (see the companion web site for a summary), almost all include some version of accuracy and validity, completeness, consistency, and currency or timeliness among them.

If a quality is “a distinctive attribute or characteristic possessed by someone or something” then a data quality dimension is a general, measurable category for a distinctive characteristic (quality) possessed by data. Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. They allow us to understand quality in relation to a scale and in relation to other data measured against the same scale or different scales whose relation is defined. A set of data quality dimensions can be used to define expectations (the standards against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset.

Measuring the quality of data requires understanding expectations and determining the degree to which the data meets them. Often this takes the form of knowing data’s distinctive characteristics and assessing the degree to which they are present in data being measured.

Put in these terms, measurement seems simple. But there are at least two complications: First, most organizations do not do a good job of defining data expectations. Very few make clear, measurable assertions about the expected condition of data. Second, without clear expectations, it is not possible to measure whether the data meets them. This doesn’t mean you need measurement to know you have poor quality data. As David Loshin has pointed out, “the definition of poor quality data is like Supreme Court Justice Potter Stewart’s definition of obscenity: We know it when we see it” (Loshin, 2001, p. 101).

Let’s take the first complication. In defining requirements, many information technology projects focus on what data needs to be made available or what functionality needs to be put in place to process or access the data. Consequently, data requirements often focus on data sources, modeling, source-to-target mapping, and the implementation of business intelligence tools. Very few projects define requirements related to the expected condition of the data. In fact, many people who work in information technology (IT) think the content of the data they manage is irrelevant to the system built. With limited knowledge of the data, they have no basis for knowing whether or not the data meets expectations. More challenging still, data can be used for different purposes. Each of those purposes may include a different set of expectations. In some cases, expectations may be in conflict with one another. It is risky not to recognize these differences. Failure to do so means that data might be misused or misunderstood.

The second complication goes to the purpose of this book: It can be hard to measure expectations, but it is impossible to measure them if you cannot define them. Data is not a thing with hard and fast boundaries (Redman, 2007), but data does come to us in forms (records, tables, files, messages) that have measurable characteristics. Fields are specified to contain particular data. Files are expected to contain particular records. We can align these characteristics with the expectations of data consumers or the requirements of representation. From the combination, we can establish specific measurements of these characteristics.
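As a sketch of turning such form-level expectations into specific measurements, the following compares a batch of records against a field specification and an expected record count (the specification and the records are hypothetical, introduced only for illustration):

```python
import re

# Hypothetical specification: the pattern each field is expected to match.
FIELD_SPEC = {
    "customer_id": r"C\d{6}",
    "state": r"[A-Z]{2}",
}

def check_records(records, spec, expected_count=None):
    """Compare a batch of records against its specification.

    Returns per-field mismatch counts, plus whether the batch
    contained the expected number of records.
    """
    mismatches = {field: 0 for field in spec}
    for r in records:
        for field, pattern in spec.items():
            if not re.fullmatch(pattern, str(r.get(field, ""))):
                mismatches[field] += 1
    count_ok = expected_count is None or len(records) == expected_count
    return {"mismatches": mismatches, "count_ok": count_ok}

batch = [
    {"customer_id": "C000123", "state": "MA"},
    {"customer_id": "123",     "state": "Mass"},  # violates both rules
]
result = check_records(batch, FIELD_SPEC, expected_count=2)
```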

URL: https://www.sciencedirect.com/science/article/pii/B9780123970336000043

What characteristics define data quality?

There are five traits that you'll find within data quality: accuracy, completeness, reliability, relevance, and timeliness. Is the information correct in every detail? How comprehensive is the information? Does the information contradict other trusted resources?

What is a characteristic of data quality quizlet?

Characteristics of data quality are based on four domains: data applications, data collection, data warehousing, and data analysis. The characteristics are: accuracy, accessibility, comprehensiveness, consistency, currency, definition, granularity, precision, relevancy, and timeliness.

Which data quality traits refers to the availability and accessibility of the data?

The last factor in determining the quality of the data is timeliness. This trait refers to the availability and accessibility of the selected data.

What are the characteristics of data integrity?

Data integrity requires that data be complete, accurate, consistent, and in context. Data integrity is what makes the data actually useful to its owner.