David Loshin, in Master Data Management, 2009

We must have some yardstick for measuring the quality of master data. Similar to the way that data quality expectations for operational or
analytical data silos are specified, master data quality expectations are organized within defined data quality dimensions to simplify their specification and measurement/validation. This provides an underlying structure to support the expression of data quality expectations that can be reflected as rules employed within a system for validation and monitoring. By using data quality tools, data stewards can define minimum thresholds for meeting business expectations and use those
thresholds to monitor data validity with respect to those expectations, which then feeds into the analysis and ultimate elimination of root causes of data issues whenever feasible. Data quality dimensions are aligned with the business processes to be measured, such as measuring the quality of data associated with data element values or presentation of master data objects. The dimensions associated with data values and data presentation lend themselves well to system automation, making
them suitable for employing data rules within the data quality tools used for data validation. These dimensions include (but are not limited to) the following:
•Uniqueness
•Accuracy
•Consistency
•Completeness
•Timeliness
•Currency

Uniqueness refers to requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures. Asserting uniqueness of the entities within a data set implies that no entity logically exists more than once within the MDM environment and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set. For example, in a master product table, each product must appear once and be assigned a unique identifier that represents that product across the client applications. The dimension of uniqueness is characterized by stating that no entity exists more than once within the data set. When there is an expectation of uniqueness, data instances should not be created if there is an existing record for that entity. This dimension can be monitored in two ways: as a static assessment, it implies applying duplicate analysis to the data set to determine whether duplicate records exist; as an ongoing monitoring process, it implies providing an identity matching and resolution service inlined within the component services supporting record creation to locate exact or potential matching records.
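For illustration only, here is a minimal Python sketch of the static duplicate analysis described above, using pandas; the master product table and its column names (product_id, product_name) are hypothetical, not drawn from the text.

```python
# A minimal sketch of static duplicate analysis for a master product table.
# The table contents and column names are invented for the example.
import pandas as pd

products = pd.DataFrame({
    "product_id":   [101, 102, 103, 103, 104],
    "product_name": ["Widget", "Gadget", "Sprocket", "Sprocket", "Gizmo"],
})

# Uniqueness of the key: each product_id should access one and only one entity.
dup_keys = products[products.duplicated(subset=["product_id"], keep=False)]
print("Records sharing a key:")
print(dup_keys)

# Exact-match duplicate analysis across all attributes.
exact_dups = products[products.duplicated(keep=False)]
print("Exact duplicate records:")
print(exact_dups)
```

The ongoing monitoring counterpart would wrap similar matching logic in an identity resolution service invoked at record creation, typically using approximate rather than exact matching.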
Data accuracy refers to the degree to which data correctly represent the "real-life" objects they are intended to model. In many cases, accuracy is measured by how well the values agree with an identified source of correct information (such as reference data). There are different sources of correct information: a database of record, a similar corroborative set of data values from another table, dynamically computed values, or perhaps the result of a manual process. Accuracy is quite challenging to monitor, not only because one requires a secondary source for corroboration, but because real-world information may change over time. If corroborative data are available as a reference data set, an automated process can be put in place to verify accuracy; if not, a manual process may be instituted to contact existing sources of truth to verify value accuracy. The amount of effort expended on manual verification depends on the degree of accuracy necessary to meet business expectations.
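As a sketch of the automated verification path, the following compares master values against a corroborative reference data set; the vendor table, column names, and values are invented for the example.

```python
# A sketch of automated accuracy verification against a reference data set.
# Tables, columns, and values are assumptions for illustration.
import pandas as pd

master = pd.DataFrame({
    "vendor_id": [1, 2, 3],
    "phone":     ["555-0100", "555-0199", "555-0123"],
})
reference = pd.DataFrame({          # the identified "source of correct information"
    "vendor_id": [1, 2, 3],
    "phone":     ["555-0100", "555-0150", "555-0123"],
})

merged = master.merge(reference, on="vendor_id", suffixes=("_master", "_ref"))
merged["accurate"] = merged["phone_master"] == merged["phone_ref"]
print(f"Accuracy vs. reference: {merged['accurate'].mean():.0%}")
print(merged[~merged["accurate"]])   # records that would need manual review
```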
Consistency refers to data values in one data set being consistent with values in another data set. A strict definition of consistency specifies that two data values drawn from separate data sets must not conflict with each other. Note that consistency does not necessarily imply correctness. The notion of consistency with a set of predefined constraints can be even more complicated. More formal consistency constraints can be encapsulated as a set of rules that specify consistency relationships between values of attributes, either across a record or message, or along all values of a single attribute. However, there are many ways that process errors may be replicated across different platforms, sometimes leading to data values that are consistent even though they are not correct. An example of a consistency rule verifies that, within a corporate hierarchy structure, the sum of the number of customers assigned to each customer representative should not exceed the number of customers for the entire corporation.

Consistency Contexts
•Between one set of attribute values and another attribute set within the same record (record-level consistency)
•Between one set of attribute values and another attribute set in different records (cross-record consistency)
•Between one set of attribute values and the same attribute set within the same record at different points in time (temporal consistency)
•Across data values or data elements used in different lines of business or in different applications

Consistency may also take into account the concept of "reasonableness," in which some range of acceptability is imposed on the values of a set of attributes.
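A minimal sketch of the corporate hierarchy example above, with invented counts and names:

```python
# A sketch of the cross-record consistency rule described in the text: customers
# assigned across representatives should not exceed the corporate total.
# The numbers and field names are illustrative assumptions.
customers_per_rep = {"rep_a": 120, "rep_b": 85, "rep_c": 40}
corporate_total_customers = 230

assigned = sum(customers_per_rep.values())
status = "consistent" if assigned <= corporate_total_customers else "INCONSISTENT"
print(f"Assigned {assigned} of {corporate_total_customers} customers: {status}")
```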
The concept of completeness implies the existence of non-null values assigned to specific data elements. Completeness can be characterized in one of three ways. The first is asserting mandatory value assignment: the data element must have a value. The second expresses value optionality, forcing the data element to have (or not have) a value only under specific conditions. The third concerns data element values that are inapplicable, such as providing a "waist size" for a hat.
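The three characterizations can be expressed as simple checks. The sketch below is illustrative only; the record layout and the specific trousers/hat rules are assumptions extending the text's "waist size" example.

```python
# A sketch of the three completeness characterizations: mandatory values,
# conditional optionality, and inapplicable attributes. Records are invented.
records = [
    {"product_type": "trousers", "name": "Chino",  "waist_size": 34},
    {"product_type": "hat",      "name": "Fedora", "waist_size": 7},
    {"product_type": "trousers", "name": None,     "waist_size": None},
]

for r in records:
    # Mandatory: every record must have a name.
    if r["name"] is None:
        print(f"{r}: missing mandatory 'name'")
    # Conditional: trousers must carry a waist size...
    if r["product_type"] == "trousers" and r["waist_size"] is None:
        print(f"{r}: 'waist_size' required for trousers")
    # Inapplicable: ...while a hat must not have one.
    if r["product_type"] == "hat" and r["waist_size"] is not None:
        print(f"{r}: 'waist_size' is inapplicable for a hat")
```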
Timeliness refers to the time expectation for accessibility and availability of information. Timeliness can be measured as the time between when information is expected and when it is readily available for use. In the MDM environment, this concept is of particular interest, because synchronization of application data updates with the centralized resource supports the concept of the common, shared, unique representation. The success of business applications relying on master data depends on consistent and timely information. Therefore, service levels specifying how quickly the data must be propagated through the centralized repository should be defined so that compliance with those timeliness constraints can be measured.
Currency refers to the degree to which information is up to date with the world that it models and whether it is correct despite possible time-related changes. Currency may be measured as a function of the expected frequency rate at which the master data elements are expected to be updated, as well as by verifying that the data are up to date, which potentially requires both automated and manual processes. Currency rules may be defined to assert the "lifetime" of a data value before it needs to be checked and possibly refreshed. For example, one might assert that the telephone number for each vendor must be current, indicating a requirement to maintain the most recent values associated with the individual's contact data. An investment may be made in manual verification of currency, as with validating accuracy. However, a more reasonable approach might be to adjust business processes so that the information is verified during transactions with counterparties, at which point the current values of data may be established.
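A small sketch of such a lifetime rule follows, assuming an arbitrary 180-day lifetime for vendor telephone numbers and an invented record layout:

```python
# A sketch of a currency rule asserting a "lifetime" for contact data:
# values older than the allowed age are flagged for re-verification.
# The 180-day lifetime and the record layout are assumptions.
from datetime import date, timedelta

PHONE_LIFETIME = timedelta(days=180)

vendors = [
    {"vendor_id": 1, "phone": "555-0100", "phone_verified_on": date(2010, 11, 3)},
    {"vendor_id": 2, "phone": "555-0150", "phone_verified_on": date(2008, 1, 15)},
]

today = date.today()
for v in vendors:
    if today - v["phone_verified_on"] > PHONE_LIFETIME:
        print(f"Vendor {v['vendor_id']}: phone exceeded its lifetime; re-verify")
```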
Every modeled object has a set of rules bounding its representation, and conformance refers to whether data element values are stored, exchanged, and presented in a format that is consistent with the object's value domain, as well as consistent with similar attribute values. Each column has metadata associated with it: its data type, precision, format patterns, use of a predefined enumeration of values, domain ranges, underlying storage formats, and so on. Parsing and standardization tools can be used to validate data values against defined formats and patterns to monitor adherence to format specifications.
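In place of a full parsing and standardization tool, a minimal sketch of format and domain validation might look like the following; the phone pattern and the status enumeration are assumptions for illustration.

```python
# A sketch of format conformance checking against a defined pattern and a
# predefined enumeration of values. Pattern and values are invented examples.
import re

PHONE_PATTERN = re.compile(r"^\d{3}-\d{4}$")            # assumed format pattern
VALID_STATUS = {"active", "inactive", "pending"}        # assumed value domain

rows = [("555-0100", "active"), ("5550100", "retired")]
for phone, status in rows:
    if not PHONE_PATTERN.match(phone):
        print(f"{phone!r} does not conform to the phone format pattern")
    if status not in VALID_STATUS:
        print(f"{status!r} is outside the value domain {sorted(VALID_STATUS)}")
```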
Assigning unique identifiers to those data entities that ultimately are managed as master data objects (such as customers or products) within the master environment simplifies the data's management. However, the need to index every item using a unique identifier introduces new expectations any time that identifier is used as a foreign key across different data applications. There is a need to verify that every assigned identifier is actually assigned to an entity existing within the environment. Conversely, for any "localized" data entity that is assigned a master identifier, there must be an assurance that the master entity matches that identifier. More formally, this is referred to as referential integrity. Rules associated with referential integrity often are manifested as constraints against duplication (to ensure that each entity is represented once, and only once) and reference integrity rules, which assert that all values used for all keys actually refer back to an existing master record.
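Both kinds of rule can be checked mechanically. The sketch below is a simplified illustration with invented tables, not a prescription from the text.

```python
# A sketch of the two referential integrity rules described above: no duplicate
# master identifiers, and every foreign key resolves to an existing master
# record. Table contents are invented.
master_ids = [101, 102, 103]
orders = [  # "localized" records carrying the master identifier as a foreign key
    {"order_id": 1, "product_id": 101},
    {"order_id": 2, "product_id": 999},   # dangling reference
]

# Constraint against duplication: each entity represented once, and only once.
assert len(master_ids) == len(set(master_ids)), "duplicate master identifiers"

# Reference integrity: every key value must refer back to a master record.
known = set(master_ids)
dangling = [o for o in orders if o["product_id"] not in known]
print("Orders with unresolved master identifiers:", dangling)
```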
Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123742254000059

Important Roles of Data Stewards
David Plotkin, in Data Stewardship, 2014

Understanding the Dimensions of Data Quality
As the Data Stewards move forward to define the data quality rules, it is helpful to understand the dimensions of data quality. Data quality dimensions are ways to categorize types of data quality measurements. In general, different dimensions require different measurements. For example, measuring completeness requires analyzing whether there is a value (any value) in a field, whereas measuring validity requires comparing existing formats, patterns, data types, ranges, values, and more to the defined set of what is allowable. While many dimensions have been listed and defined in various texts on data quality, there are key dimensions for which Business Data Stewards can relatively easily define data quality rules. Once the conformance to those rules has been measured by a data profiling tool, data quality can be assessed by Data Stewards. As mentioned earlier, the data quality necessary for business purposes can be defined, and data profiling will tell you when you've achieved that goal. The data quality dimensions that are most often used when defining data quality rules are listed in Table 7.1.

Gauging Metadata Quality
Many of the same or similar dimensions of data quality can be used to gauge the quality of the metadata collected and provided by Business Data Stewards. Measuring metadata quality along these dimensions helps to quantify the progress and success of the metadata component of Data Stewardship. The metadata quality dimensions may include:
■Completeness: Are all the required metadata fields populated? For example, a derived data element may have a definition but not a derivation, and is thus not complete.
■Validity: Does the content meet the requirements for the particular type of metadata? For example, if the metadata is a definition, does the definition meet your published standards and make sense?
■Accessibility: Is all that metadata you've worked so hard on easily available and usable to those who need it? Do they understand how to access metadata (e.g., clicking a link to view an associated definition)? Do they understand the value of using the metadata?
■Timeliness: Is the metadata available and up to date at the time someone is attempting to use it to understand the data he or she is accessing?
■Consistency: Is all the metadata consistent, or does it conflict with other metadata? For example, does a usage business rule demand that data be used when the creation business rules state it hasn't been created yet?
■Accuracy: Is the metadata from a verifiable source, and does it describe the data it is attached to correctly? For example, if the metadata is a definition, does the definition correctly define the data element in question?

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780124103894000076

Dimensions of Data Quality
David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

8.2.4 Classifying Dimensions
The classifications for the practical data quality dimensions are the following:
1.Accuracy
2.Lineage
3.Structural consistency
4.Semantic consistency
5.Completeness
6.Consistency
7.Currency
8.Timeliness
9.Reasonableness
10.Identifiability
The relationships between the dimensions are shown in Figure 8.2.
Figure 8.2. Practical dimensions of data quality.
The main consideration in isolating critical dimensions of data quality is to provide universal metrics for assessing the level of data quality in different operational or analytic contexts. Using this approach, the data stewards can work with the line-of-business managers to:
•Define data quality rules that represent the validity expectations,
•Determine minimum thresholds for acceptability, and
•Measure against those acceptability thresholds.
In other words, the assertions that correspond to these thresholds can be transformed into rules used for monitoring the degree to which measured levels of quality meet defined and agreed-to business expectations. In turn, metrics that correspond to these conformance measures provide insight into examining the root causes that are preventing the levels of quality from meeting those expectations.
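A minimal sketch of such threshold monitoring follows, with an invented rule (five-digit postal codes) and an assumed 95% acceptability threshold:

```python
# A sketch of measuring a rule's conformance rate against an agreed minimum
# threshold. The rule, data, and threshold value are illustrative assumptions.
def conformance(values, rule):
    """Fraction of values satisfying a data quality rule."""
    return sum(1 for v in values if rule(v)) / len(values)

postal_codes = ["02134", "10001", "ABCDE", "94105"]
rate = conformance(postal_codes, lambda v: v.isdigit() and len(v) == 5)

ACCEPTABILITY_THRESHOLD = 0.95
verdict = "meets" if rate >= ACCEPTABILITY_THRESHOLD else "falls below"
print(f"Conformance {rate:.0%}: {verdict} business expectations")
```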
Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000087

Key Concepts
Danette McGilvray, in Executing Data Quality Projects (Second Edition), 2021

Data Quality Dimensions as Used in the Ten Steps
There is no universal standard list of data quality dimensions, though there are similarities between them. Each of the data quality dimensions as used in the Ten Steps Process is defined in Table 3.3. The dimensions included are those features and aspects of data that provide the most practically useful information about data quality, that most organizations are interested in, and that are feasible to assess, improve, and manage within the usual constraints of most organizations. They have been proven through years of experience and are also based on knowledge learned from other experts.

Table 3.3. Data Quality Dimensions in the Ten Steps Process – Names, Definitions, and Notes
Data Quality Dimensions – what they are and how to use them. A data quality dimension is a characteristic, aspect, or feature of data. Data quality dimensions provide a way to classify information and data quality needs. Dimensions are used to define, measure, improve, and manage the quality of data and information. Instructions for conducting an assessment for each of the dimensions can be found in Step 3 – Assess Data Quality in Chapter 4: The Ten Steps Process, with a separate substep for each dimension as noted in the table. The dimensions are numbered for reference purposes; the numbering does not propose an order for completing the assessments. Suggestions for selecting the DQ dimensions most relevant to your needs, and for deciding which to include in your project, are discussed below. Many lists of data quality dimensions exist and may use different terms or categories for different purposes than shown here. Once again, the dimensions used in the Ten Steps are categorized roughly by how you would assess each dimension. For example, the dimension of Data Integrity Fundamentals includes what other lists may call out as separate dimensions (e.g., completeness and validity). Both, along with other characteristics of data, are part of the overarching dimension called Data Integrity Fundamentals because all can be assessed using the technique of data profiling. Warning! Do not get overwhelmed by the list of data quality dimensions. You will not assess all of them. Having them categorized this way actually makes it easier for you to select the ones most relevant to your business needs and data quality issues.

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780128180150000098

Data Quality Management
Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015

Assess and Define
The actions of assessing and defining are what make DQM different from other solution design activities, such as application development. With typical application development, business units have a very clear understanding of the functionality required for them to perform their jobs. New requirements, although vulnerable to misinterpretation, are usually very specific and can be stated without much need for research. Also, software defects sometimes might be difficult to replicate, diagnose, and fix, but again, the functionality to be delivered is mostly clear. With data, the landscape is a bit different. One of the mantras for data quality is fitness for purpose: data must have the quality required to fulfill a business need. The business is clearly the owner of the data, and data's fitness for purpose is the ultimate goal. However, there are some issues. First, the business does not know what it does not know. Much of the time, company personnel have no idea of the extent of the problem they face when it comes to data quality or the lack thereof. Second, fitness for purpose can be difficult to quantify. There are certain situations where you just want to get your data as good as possible to minimize any associated costs, but it is not clear how much the data can be improved. Third, the business might not fully understand the ripple effects of a data change. In a multi-domain MDM implementation, there are certain predefined data quality expectations, such as the ability to link inconsistent data from disparate sources, survive the best information possible, and provide a single source of truth. Still, the current quality state is mostly unknown, which tremendously increases the challenge of scoping the work and determining the results. The business might have a goal, but achieving it can be rather difficult. The only way to really understand what can be accomplished is to perform an extensive data quality assessment. The importance of data profiling cannot be overstated.
To efficiently and effectively correct data problems, it is a must to clearly understand what the issue is, its extent, and its impact. To get there, data must be analyzed fully and methodically. For an extensive discussion of data profiling, consult Chapter 8. A recommended approach to organizing data quality problems is to categorize them into dimensions. Data quality dimensions allow complex areas of data quality to be subdivided into groups, each with its own particular way of being measured. Data quality dimensions are distinguishable characteristics of a data element, and the actual distinction should be business-driven. Since it is possible to characterize a particular data element in many ways, there is no single definitive list of data quality dimensions. They can be defined in many ways, with slight variations or even overlapping context. The following list represents the type of dimensions and definitions generally used:
•Completeness: Level of data missing or unusable.
•Conformity: Degree of data stored in a nonstandard format.
•Consistency: Level of conflicting information.
•Accuracy: Degree of agreement with an identified source of correct information.
•Uniqueness: Level of nonduplicates.
•Integrity: Degree of data corruption.
•Validity: Level of data matching a reference.
•Timeliness: Degree to which data is current and available for use in the expected time frame.
Data quality dimensions are also very useful when reporting data quality metrics as part of the Monitoring and Control phase, as will be discussed later in this section. At the end of this step, it is possible to have a clearly stated set of requirements on what needs to be accomplished for each specific data quality activity. Remember that most of the time when the process is initiated, there is not yet enough information to produce a complete set of requirements. This step complements the initiation by adding much-needed analysis.
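As a rough illustration, the sketch below computes simple scores for three of the dimensions listed above (completeness, conformity, uniqueness) using pandas; the data set and the rules behind each score are invented for the example.

```python
# A sketch of reporting simple per-dimension profiling metrics.
# The data set and the rules behind each score are assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@x.com", None, "b@x", "c@x.com"],
})

metrics = {
    # Completeness: share of non-missing email values.
    "completeness": df["email"].notna().mean(),
    # Conformity: share of emails matching a crude standard format.
    "conformity": df["email"].str.contains(r"^[^@]+@[^@]+\.[^@]+$", na=False).mean(),
    # Uniqueness: share of non-duplicated customer identifiers.
    "uniqueness": (~df["customer_id"].duplicated(keep=False)).mean(),
}
for dim, score in metrics.items():
    print(f"{dim:>12}: {score:.0%}")
```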
Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780128008355000099

Entity Identity Resolution
David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

16.8.1 Quality of the Sources
Different data sources exhibit different characteristics in terms of their conformance to business expectations defined using the types of data quality dimensions described in chapter 8. In general, if a data set is subjected to an assessment and is assigned a quality score, those relative scores become one data point. In other words, given a record from a data set with a high quality score matched with a record from a data set with a lower quality score, it would be reasonable to select values from the first record over the second when they vary.

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000166

Bringing It All Together
David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

20.2.6 Data Quality Assessment
The absence of clearly defined measurements of business impacts attributable to poor data quality prevents the development of an appropriate business case for introducing data quality management improvement. Chapter 11 looked at an assessment process that quickly results in a report documenting potential data quality issues. That report enumerates a prioritized list of clearly identified data quality issues, recommendations for remediation, and suggestions for instituting data quality inspection and control for data quality monitoring. It also proposes opportunities for improvement, providing an objective assessment of critical data and a determination of whether the levels of data quality are sufficient to meet business expectations and, if not, an evaluation of the value proposition and feasibility of data quality improvement. This process essentially employs data profiling techniques to review data, identify potential anomalies, contribute to the specification of data quality dimensions and corresponding metrics, and recommend operational processes for data quality inspection. The process involves these five steps:
1.Planning for data quality assessment
2.Evaluating business processes
3.Preparing for data profiling
4.Profiling and analyzing
5.Synthesizing results
Connecting potential data issues to business impacts and considering the scope of those relationships allow the team to prioritize discovered issues. The data quality assessment report documents the drivers, the process, the observations, and the recommendations from the data profiling process, as well as recommendations relating to any discovered or verified anomalies that have critical business impact, including tasks for identifying and eliminating the root cause of the anomaly. Some suggestions might include:
•Inspection and monitoring
•Additional (deeper) review of the data
•One-time cleansing
•Review of the data model
•Review of application code
•Review of the operational process use of technology
•Review of business processes
The data quality assessment is the lynchpin of the data quality program, because it baselines the existing state, provides hard evidence as to the need, and suggests measures for inspecting data quality moving forward. In turn, the process can be used to evaluate whether the costs associated with data quality improvement would provide a reasonable return on the investment. Either way, this repeatable assessment provides tangible results that can either validate perceptions of poor data quality or demonstrate that the data meets business needs.

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000208

Metrics and Performance Improvement
David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

6.5.2 Building a Control Chart
Our next step is building a control chart. The control chart is made up of data points consisting of individual or aggregated measures associated with a periodic sample, enhanced with the center line and the upper and lower control limits. These are the steps for building a control chart for measuring data quality:
1.Select one or more data quality dimensions that will be charted. Make use of the Pareto analysis to help determine the variables or attributes that most closely represent the measured problem, since trying to track down the most grievous offenders is a good place to start. If the goal is to find the source of particular problems, make sure to determine the right variables for charting. For example, if the dimension being charted is timeliness, consider making the charted variable the "number of minutes late" instead of the "time arrived." Keep in mind while choosing variables that the result of charting should help in determining the source and diagnosis of any problems.
2.Determine the proper location within the information chain to attach the measurement probe.
This choice should reflect the following characteristics:
a.It should be early enough in the information processing chain that detection and correction of a problem at that point can prevent incorrectness further along the data flow.
b.It should be in a location in the information chain that is easily accessed and retooled, so as not to cause too much chaos in implementing the charting process.
c.It should not be in a place such that observation of the sample can modify the data being observed.
3.Decide which kind of control chart is to be used. The choices are:
a.Variables chart, which measures individual measurable characteristics; a variables chart will provide a lot of information about each item being produced.
b.Attributes chart, which measures the percentage or number of items that vary from the expected; an attributes chart summarizes information about the entire process, focusing on cumulative effects rather than individual effects.
4.Choose a center line and control limits for the chart. The center line can be the average of past measurements, the average of data that has not yet been measured or collected, or a predefined expected standard. The upper control limit (UCL) is set at three standard deviations (+3σ) above the center line, and the lower control limit (LCL) is set at three standard deviations (−3σ) below the center line.
5.Choose the sample. The sample may consist of individual data values or a collection of data values measured for the purpose of summarization. The sampler should be careful not to take a sample at a point in the process, or a point in time, where the act of taking the sample could itself change the data.
6.Choose a method for collecting and logging the sample data. This can range from asking people to read gauges and write down the answers in a notebook to having an integrated mechanism for measuring and logging sample results.
7.Plot the chart and calculate the center line and control limits based on history.
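As an illustrative sketch of the final step, the following computes a center line and three-sigma control limits from historical samples and checks new measurements against them; the "percent of records failing validation" series is invented.

```python
# A sketch of computing the center line and +/- 3 sigma control limits from
# historical samples, then flagging out-of-control points. Data is invented.
import statistics

history = [2.1, 1.8, 2.4, 2.0, 1.9, 2.2, 2.0, 2.1]   # past periodic measures

center = statistics.mean(history)                     # center line
sigma = statistics.pstdev(history)
ucl = center + 3 * sigma                              # upper control limit
lcl = max(center - 3 * sigma, 0.0)                    # lower limit, floored at 0

print(f"center={center:.2f}  UCL={ucl:.2f}  LCL={lcl:.2f}")
for x in [2.3, 6.5]:                                  # new samples to check
    status = "OK" if lcl <= x <= ucl else "OUT OF CONTROL"
    print(f"sample {x}: {status}")
```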
Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000063

The Important Roles of Data Stewards
David Plotkin, in Data Stewardship (Second Edition), 2021

Understanding the Dimensions of Data Quality
As the Business Data Stewards move forward to define the data quality rules, it is helpful to understand the dimensions of data quality. Data quality dimensions are ways to categorize types of data quality measurements. In general, different dimensions require different measurements. For example, measuring completeness requires analyzing whether there is a value (any value) in a field, whereas measuring validity requires comparing existing data formats, patterns, data types, ranges, values, and more to a defined set of what is allowable. While many dimensions have been listed and defined in various texts on data quality, there are key dimensions for which the Business Data Stewards can relatively easily define data quality rules. Once the conformance to those rules has been measured by a data profiling tool, the data quality can be assessed by Business Data Stewards. As mentioned previously, the data quality necessary for business purposes can be defined, and data profiling will tell you when you've achieved that goal. The data quality dimensions that are often used when defining data quality rules are listed in Table 7.1.

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780128221327000073

Data Quality and Measurement
Laura Sebastian-Coleman, in Measuring Data Quality for Ongoing Improvement, 2013

Data Quality Dimensions
Data and information quality thinkers have adopted the word dimension to identify those aspects of data that can be measured and through which its quality can be quantified. While different experts have proposed different sets of data quality dimensions (see the companion web site for a summary), almost all include some version of accuracy and validity, completeness, consistency, and currency or timeliness among them. If a quality is "a distinctive attribute or characteristic possessed by someone or something," then a data quality dimension is a general, measurable category for a distinctive characteristic (quality) possessed by data. Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. They allow us to understand quality in relation to a scale and in relation to other data measured against the same scale, or against different scales whose relation is defined. A set of data quality dimensions can be used to define expectations (the standards against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset. Measuring the quality of data requires understanding expectations and determining the degree to which the data meets them. Often this takes the form of knowing data's distinctive characteristics and assessing the degree to which they are present in the data being measured. Put in these terms, measurement seems simple. But there are at least two complications. First, most organizations do not do a good job of defining data expectations: very few make clear, measurable assertions about the expected condition of data. Second, without clear expectations, it is not possible to measure whether the data meets them. This doesn't mean you need measurement to know you have poor quality data. As David Loshin has pointed out, "the definition of poor quality data is like Supreme Court Justice Potter Stewart's definition of obscenity: We know it when we see it" (Loshin, 2001, p. 101). Let's take the first complication. In defining requirements, many information technology projects focus on what data needs to be made available or what functionality needs to be put in place to process or access the data. Consequently, data requirements often focus on data sources, modeling, source-to-target mapping, and the implementation of business intelligence tools. Very few projects define requirements related to the expected condition of the data. In fact, many people who work in information technology (IT) think the content of the data they manage is irrelevant to the system being built. With limited knowledge of the data, they have no basis for knowing whether or not the data meets expectations. More challenging still, data can be used for different purposes, and each of those purposes may carry a different set of expectations. In some cases, expectations may conflict with one another. It is risky not to recognize these differences: failure to do so means that data might be misused or misunderstood. The second complication goes to the purpose of this book: it can be hard to measure against expectations, but it is impossible to measure them if you cannot define them.
Data is not a thing with hard and fast boundaries (Redman, 2007), but data does come to us in forms (records, tables, files, messages) that have measurable characteristics. Fields are specified to contain particular data. Files are expected to contain particular records. We can align these characteristics with the expectations of data consumers or the requirements of representation. From that combination, we can establish specific measurements of these characteristics.

Read full chapter URL: https://www.sciencedirect.com/science/article/pii/B9780123970336000043