Data quality issues tend to surface when an end user can’t access a data source or the data presents unusually, setting off a cascade of both investigation and finger-pointing. An attorney tries to review an email attachment in an eDiscovery platform, only to find that it isn’t there. A transaction monitoring analyst opens an OFAC alert, but the language in the alert is illegible nonsense. End users in the litigation, banking, and consumer spaces are often frustrated by such issues. The next questions are: how did the data issue occur, how can it be fixed, and who is going to manage that process?
Anecdotally, more and more regulators are pushing for data ownership, and affected private companies are responding by clearly defining who owns data quality risk, from the testers and quality assurance staff to the actual users or data owners (where there is a difference).
In an era where data quality issues are being specifically identified as a root cause in enforcement actions, the more highly regulated the entity is and/or the more data it manages, the more there is a need for identifying, tracking, and reflecting the appropriate stewardship of those risks. Here, we will discuss some of the fundamentals of building out a data quality issue management (DQIM) tracking system, as well as some of the common pitfalls.
Putting a Name to Your Data Quality Issues
There is an old adage which says that “at the root of every conflict is a misunderstanding.” This holds as true for data quality as it does for interpersonal relationships. The first challenge in establishing a DQIM is agreeing on the taxonomy and nomenclature of your company’s data issues. This challenge is multifaceted from the start:
- There is a litany of industry-accepted terms which are often used synonymously even though they mean (to varying degrees) very different things. In the world of anti-money laundering compliance, for example, the concepts of Customer Identification Program, Know Your Customer, Customer Due Diligence, and Enhanced Due Diligence are sometimes used interchangeably. Despite being related thematically, each holds a distinct meaning;
- Not all practitioners and companies agree on what their actual data quality dimensions are, or should be, resulting in some data quality dimensions being overlooked, orphaned, or conflated with the next closest facet;
- Industry and institutional terminology related to data quality is often “refreshed”, but the “legacy” (i.e. retired or outdated) terminology continues to be used for convenience. That usage is then overlooked or adopted by the staff responsible for oversight, resulting in potential confusion.
The Basics of a DQIM Dashboard
It is therefore up to the data owners to agree on, document, and then broadcast what they believe the most current and pertinent data quality dimensions are for their businesses, and to use those definitions as the foundation for their DQIM dashboard.
In terms of pure data quality issues (as opposed to secondary issues related to data), the principal data quality dimensions in a DQIM table/platform should include:
Completeness refers to whether a data value/element under consideration is present where the data dictionary/glossary requires it to be.
Accuracy relies on the data element being objectively correct, whereas Validity is more akin to a subjective definition of data standards or values set by the company.
Consistency refers to the definition of those data elements as they may be used across systems, processes, or data sets, as distinguished from Uniqueness, meaning that the data element isn’t recorded more than once across those same venues.
Lastly, Timeliness (also called “Currency”) refers to whether the data element is current with the real-world state it is meant to describe.
Many practitioners also include Reasonableness and Conformity as data quality dimensions: respectively, whether the data looks incorrect or incomplete when compared against expected values, and whether it aligns with metadata requirements.
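To make the dimensions above concrete, here is a minimal Python sketch of how a few of them might be tested against a small set of records. The field names, the list of valid country codes, and the one-year freshness threshold are all assumptions made for illustration, not a prescribed standard. Accuracy and Consistency are left out because they generally require an external source of truth or a second system to compare against.

```python
from datetime import date, timedelta

# Hypothetical customer records; field names are assumptions for illustration.
records = [
    {"customer_id": "C001", "name": "Ana Ruiz", "country": "US", "updated": date(2024, 5, 1)},
    {"customer_id": "C002", "name": "", "country": "US", "updated": date(2024, 5, 3)},
    {"customer_id": "C002", "name": "Li Wei", "country": "XX", "updated": date(2021, 1, 15)},
]

REQUIRED_FIELDS = ("customer_id", "name", "country")  # assumed data dictionary requirement
VALID_COUNTRIES = {"US", "GB", "CA"}                  # assumed company-defined standard
MAX_AGE = timedelta(days=365)                         # assumed freshness threshold


def check_record(record, as_of):
    """Return the data quality dimensions that a single record fails."""
    failures = []
    # Completeness: every required field must be present and non-empty.
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        failures.append("completeness")
    # Validity: the value must conform to the company-defined standard.
    if record.get("country") not in VALID_COUNTRIES:
        failures.append("validity")
    # Timeliness: the record must have been refreshed recently enough.
    if as_of - record["updated"] > MAX_AGE:
        failures.append("timeliness")
    return failures


def find_duplicates(rows, key):
    """Return key values that appear more than once (a Uniqueness failure)."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


as_of = date(2024, 6, 1)
for rec in records:
    print(rec["customer_id"], check_record(rec, as_of))
print("duplicate ids:", find_duplicates(records, "customer_id"))
```

Each failure is reported by dimension name, which is what allows an issue to be categorized the moment it is detected rather than after the fact.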
The Data Quality Issue Ownership Structure
What – At the outset, the DQIM platform should align to the core definitions of the data quality dimensions above, with additional considerations added as needed based on the complexity of the institution. These definitions are necessary because, ideally, every single data quality issue that is raised should be categorically aligned to one of these data quality dimensions. Every tagged data quality issue should state which dimension it falls under (for example, completeness). The data tag should also include what was happening when the issue was discovered, along with who identified it. Lastly, the “what” should identify the vertical within the organization in which the issue arose, so that it can be properly attributed to a steward (see the “who,” below).
Where – This is a fairly straightforward consideration within the DQIM: Where is the issue based? Is it an issue with the core data, or rather in the end-user platform? For example, many financial institutions rely on a DOS-based operating system to warehouse customer profiles which may have multiple user pages. As those DOS-based systems are then interfacing with more sophisticated HTML-based systems, data may be lost in the ETL process. Correctly identifying the scene of the “crime” will then categorize the steps that need to be taken to address it (the “how”).
When – The “when” of a data quality issue is a critical question because it helps determine the depth of the issue itself, meaning how long the issue has been going on. That timeline and the nature of the data quality issue (the “what”) then drive the pace and urgency of the remediation efforts needed. For example, a relatively recent issue (say, one week old) in which embedded screenshots were not ported from the client’s email into the eDiscovery platform can be addressed more easily than one that has gone on for years, with the affected data already produced to various stakeholders.
Why – This is the purely diagnostic query: a good DQIM system will try to establish the root cause of the failure. Many times this is attributed to an issue with the data conversion or with the data architecture itself. Data shift is much easier to identify because nothing in the host system will be where it is supposed to be, whereas a data completeness issue could be more closely linked to the ETL process.
Who – Second only to the “how,” the DQIM must identify the issue owner, by role rather than by name. There is far too much turnover in most companies to say that an individual person holds the responsibility for remediation, so ultimately the role holder or team lead should be identified as the issue owner. That issue owner (again, by role) is the party responsible for shepherding the issue, even if indirectly, through the remediation process.
How – Lastly, and possibly most significantly, the remediation itself needs to address all of the other questions that preceded it. The “how” should describe, as a measurable and iterative process, the steps that will be taken and the completion milestones at each stage.
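The six questions above map naturally onto the fields of a single issue record. The following Python sketch shows one possible shape for such a record; the class names, field names, and sample values are hypothetical and would need to be adapted to whatever tracking platform is actually in use.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List


class Dimension(Enum):
    COMPLETENESS = "completeness"
    ACCURACY = "accuracy"
    VALIDITY = "validity"
    CONSISTENCY = "consistency"
    UNIQUENESS = "uniqueness"
    TIMELINESS = "timeliness"


@dataclass
class Milestone:
    description: str
    due: date
    done: bool = False


@dataclass
class DataQualityIssue:
    # What: the dimension, a summary, who found it, and the affected vertical.
    dimension: Dimension
    summary: str
    identified_by: str
    business_vertical: str
    # Where: core data store, ETL step, or end-user platform.
    location: str
    # When: how long the issue ran before it was caught.
    first_occurred: date
    discovered: date
    # Why: the root cause, once established.
    root_cause: str
    # Who: the owner is recorded by role, not by individual name.
    owner_role: str
    # How: measurable remediation steps.
    milestones: List[Milestone] = field(default_factory=list)

    def days_undetected(self) -> int:
        """Depth of the issue: days between first occurrence and discovery."""
        return (self.discovered - self.first_occurred).days

    def progress(self) -> float:
        """Fraction of remediation milestones completed (0.0 if none defined)."""
        if not self.milestones:
            return 0.0
        return sum(m.done for m in self.milestones) / len(self.milestones)


# Hypothetical issue modeled on the eDiscovery screenshot example.
issue = DataQualityIssue(
    dimension=Dimension.COMPLETENESS,
    summary="Embedded screenshots missing from ingested emails",
    identified_by="Reviewing attorney",
    business_vertical="Litigation support",
    location="ETL into review platform",
    first_occurred=date(2024, 3, 4),
    discovered=date(2024, 3, 11),
    root_cause="Inline images skipped during conversion",
    owner_role="eDiscovery Processing Team Lead",
    milestones=[
        Milestone("Identify affected document set", date(2024, 3, 15), done=True),
        Milestone("Re-ingest affected emails", date(2024, 3, 22)),
    ],
)
print(issue.days_undetected(), issue.progress())
```

Keeping `owner_role` as a role title means the record survives staff turnover, and the milestone list gives the “how” a measurable completion figure that a dashboard can display.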
Quality Control for the Quality Controllers
Ultimately, the features and utility of the platform will drive the success of the DQIM. If, for example, the platform allows for comprehensive data categorization as the information is moved from host to review platform, that will make data quality issue spotting much easier. Initial data tagging should, in an optimal setting, capture the nature of the data before, during, and after the ETL process to streamline the identification of those underlying issues. This should enable a more comprehensive DQIM dashboard as it is built and expanded.
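As a rough illustration of tagging data before and after the ETL process, the Python sketch below compares source rows with what arrived in the target and labels each discrepancy with a data quality dimension. The `reconcile` function, the field names, and the sample documents are hypothetical; a real platform would more likely expose this kind of comparison through its own tooling.

```python
def reconcile(source_rows, target_rows, key, fields):
    """Compare source and target rows by key and tag likely data quality issues.

    Empty values (None, "", [], 0) are treated as absent.
    """
    target_by_key = {row[key]: row for row in target_rows}
    findings = []
    for src in source_rows:
        tgt = target_by_key.get(src[key])
        if tgt is None:
            # The whole record never reached the target system.
            findings.append((src[key], "completeness", "record missing after load"))
            continue
        for name in fields:
            if src.get(name) and not tgt.get(name):
                # The value existed at the source but was lost in transit.
                findings.append((src[key], "completeness", f"{name} dropped"))
            elif src.get(name) != tgt.get(name):
                # The value arrived but differs from the source.
                findings.append((src[key], "consistency", f"{name} changed"))
    return findings


# Hypothetical emails before (host) and after (review platform) the ETL step.
host = [
    {"doc_id": "E1", "subject": "Q1 report", "attachments": ["q1.xlsx"]},
    {"doc_id": "E2", "subject": "Invoice", "attachments": ["invoice.pdf"]},
    {"doc_id": "E3", "subject": "Minutes", "attachments": []},
]
review = [
    {"doc_id": "E1", "subject": "Q1 report", "attachments": ["q1.xlsx"]},
    {"doc_id": "E2", "subject": "Invoice", "attachments": []},
]

for finding in reconcile(host, review, "doc_id", ["subject", "attachments"]):
    print(finding)
```

Running a comparison like this at load time means a missing attachment is flagged by the pipeline rather than discovered later by the reviewing attorney.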