Understanding and Measuring Data Quality

28/4/2022

Published by Unbox.ai

Introduction
Modern companies now unanimously recognize the value of data for driving business growth. However, high-quality data is much more valuable than data assets of poor quality. As companies accumulate petabytes of data from various sources, it becomes imperative to focus on the quality of data and filter out bad data.

Data is the fundamental building block for predictive machine learning models. Although having access to greater amounts of data is beneficial, it doesn’t always translate to better-performing machine learning models. Sampling training data that passes quality checks and meets certain acceptance criteria can significantly boost the accuracy of the model predictions.

In this article, you’ll learn more about why high-quality data is essential for building robust machine learning models, expanding on the various parameters that define data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. You’ll also explore a few mechanisms you can implement to measure and improve the quality of your data.

What is data quality?
Data quality is a measure of how suitable the data is for its intended applications in data analytics, data science, or machine learning. There are several dimensions along which data quality is measured, which include the following:

Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Validity

Measuring the quality of data in terms of the above parameters is critical for organizations to assess whether their in-house data is suitable for downstream applications.

Why is data quality important?
Data quality is an important determinant of the quality of decision-making within an organization. Poor-quality data leads to inaccurate analytics and machine learning models, which might adversely impact various business operations as well as customer experience. Decisions and business strategies based on flawed data can have massive consequences.
Typical data-quality issues include data security and data that is incomplete, duplicated, inconsistent, incorrect, missing, poorly defined, poorly organized, or stale.

In the context of data science use cases, the consequences of using poor-quality data can be immense—machine learning models trained on low-quality data invariably generate weak or inaccurate predictions, which are not easy to troubleshoot.

Deep-learning models in particular are very data-hungry, and their state-of-the-art performance is driven by the massive amounts of data on which they are trained. Even so, recent work suggests that training models with less data reflects real-world scenarios better, and this approach is increasingly becoming the norm.

The cost of bad data to organizations is also enormous: according to an IBM estimate, poor-quality data costs the US alone about USD 3.1 trillion per year. It is therefore paramount for organizations to invest in proper measurement and evaluation of data quality before building data-driven applications or devising new business strategies.

Determining data quality
Several organizations, from the IMF to the World Bank, have formulated Data Quality Assessment Frameworks (DQAF) to establish clear guidelines for measuring the quality of data in terms of accuracy, completeness, consistency, timeliness, uniqueness, and validity.
This section will focus on each of these data-quality dimensions and discuss how they define the quality of data.

Accuracy
Accuracy, as the term implies, is a pivotal aspect of data quality—it means that the information is correct. Naturally, inaccurate information can cause many significant problems for a business.

For instance, consider an example in which the time of financial transactions is incorrectly recorded due to a failure to update to daylight saving time. In such a scenario, the timing offset could lead to inaccurate analysis and reporting of core business metrics like daily sales and revenue.

Such data inaccuracies can lead to potentially damaging consequences of incorrect financial and tax filings that could result in financial penalties by regulatory bodies.
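
To make the daylight saving example concrete, here is a minimal sketch of how a stale UTC offset can move a sale to the wrong business day. The fixed offsets, the timestamp, and the function name are assumptions chosen for illustration; a production system would rely on a proper time zone database instead.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets for a US Eastern store:
# UTC-5 in winter, UTC-4 while daylight saving time is in effect.
STANDARD = timezone(timedelta(hours=-5))
DAYLIGHT = timezone(timedelta(hours=-4))

def local_business_date(utc_ts: datetime, tz: timezone) -> str:
    """Return the local calendar date that a UTC timestamp falls on."""
    return utc_ts.astimezone(tz).date().isoformat()

# A sale recorded at 04:30 UTC on 2 July (summer, so DST applies).
sale_utc = datetime(2022, 7, 2, 4, 30, tzinfo=timezone.utc)

correct_date = local_business_date(sale_utc, DAYLIGHT)  # offset that actually applied
stale_date = local_business_date(sale_utc, STANDARD)    # offset never updated for DST

print(correct_date, stale_date)
```

With the stale offset, the sale is attributed to the previous day, so daily sales totals for both days come out wrong even though no individual record looks suspicious.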

Completeness
Completeness refers to how comprehensive the data is and whether it contains all the fields and values necessary to make it fit for the intended purpose. Incomplete data often contains empty or missing values across rows or columns and may be unusable for further analysis.

For instance, if a customer’s email address is missing, then this customer may not feature in any marketing campaigns, resulting in a potential loss of business for the company.
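
One simple way to quantify completeness is the share of records that have a usable value for each required field. The sketch below assumes records are plain dictionaries and treats `None`, empty strings, and absent keys as missing; the function name and sample data are made up for illustration.

```python
from typing import Any

def completeness_by_field(records: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Share of records (0.0 to 1.0) with a non-empty value for each field."""
    if not records:
        return {field: 0.0 for field in fields}
    result = {}
    for field in fields:
        # A value counts as filled unless it is None, an empty string, or absent.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / len(records)
    return result

customers = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "email": ""},
    {"name": "Linus", "email": None},
    {"name": "Edsger"},
]
scores = completeness_by_field(customers, ["name", "email"])
print(scores)
```

A score like this can be compared against an acceptance threshold that you define for each field, for example requiring that email is filled for at least 95% of customers before the data set feeds a campaign.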

Consistency
Consistency is another fundamental trait of data quality, as it can affect the usage of the entire data set. If a data set has millions of records but some rows store a customer’s name as “CustomerName” while the remaining rows store the same information as “FirstName” and “LastName” separately, it might lead to inaccurate results and analysis.

Another common example of inconsistent data is related to the underlying format or units of specific data fields. For instance, data like time is often kept in inconsistent formats, and units of money may be recorded differently from country to country.
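
As an illustration of resolving inconsistent time formats, the following sketch converts timestamps written in several layouts into a single ISO 8601 representation. The list of formats is an assumption about what the source systems produce; you would replace it with the formats actually found in your data.

```python
from datetime import datetime

# Hypothetical formats observed in the source systems; extend as needed.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y%m%dT%H%M%S"]

def to_iso(raw: str) -> str:
    """Parse a timestamp in any known format and return it as ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")

raw_values = ["2022-04-28 14:05:00", "28/04/2022 14:05", "20220428T140500"]
normalized = [to_iso(v) for v in raw_values]
print(normalized)
```

Note that a string such as 04/05/2022 is ambiguous between day-first and month-first conventions, so the expected format should come from knowledge of the source system rather than from guessing.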

Timeliness
Timeliness refers to how recent and up-to-date the information is. For a number of applications, timely data is essential as it captures the current trends and patterns in customer behavior or business health.

Data tends to lose its value over time and can drastically affect the quality of business decisions as well as predictions from machine learning models trained on older data. It can cost organizations lost time and money, in addition to reputational damage.
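
A basic timeliness check flags records that have not been updated within an acceptable window. The sketch below assumes each record carries a timezone-aware `updated_at` field and that 30 days is the freshness threshold; both are illustrative choices, not fixed rules.

```python
from datetime import datetime, timedelta, timezone

def stale_records(records, max_age: timedelta, now: datetime):
    """Return the records whose 'updated_at' is older than max_age."""
    return [r for r in records if now - r["updated_at"] > max_age]

# 'now' is passed in explicitly so the check is reproducible.
now = datetime(2022, 4, 28, tzinfo=timezone.utc)
records = [
    {"id": 1, "updated_at": datetime(2022, 4, 27, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2021, 12, 1, tzinfo=timezone.utc)},
]
stale_ids = [r["id"] for r in stale_records(records, timedelta(days=30), now)]
print(stale_ids)
```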

Uniqueness
Uniqueness refers to the lack of duplication or overlap within a data set or across data sets. Modeling redundant information can often lead to spurious correlations or results that can adversely affect statistical analysis as well as model predictions.
Thus, uniqueness is a critical dimension of data quality that is important to build trust in the data for downstream use cases.
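
Duplicates are often not byte-for-byte identical, so a uniqueness check usually normalizes the key fields before comparing them. The sketch below assumes that trimming whitespace and lowercasing is enough normalization for an email key; real data sets may need stricter or fuzzier matching.

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records by a normalized key and return groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        # Normalize each key field: convert to string, trim, lowercase.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

customers = [
    {"id": 1, "email": "Ada@Example.com"},
    {"id": 2, "email": "ada@example.com "},
    {"id": 3, "email": "grace@example.com"},
]
duplicates = find_duplicates(customers, ["email"])
duplicate_ids = [r["id"] for rows in duplicates.values() for r in rows]
print(duplicate_ids)
```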

Validity
For several data fields, validation checks are important. For instance, a US mobile phone number is usually ten digits long (excluding the country code), and zip codes in the US should have five digits. When data does not conform to standard formats or business-specific rules, it is said to be invalid. Invalid data can cause grave errors in downstream analytics and necessitates careful scrutiny of every data column before using it.

Truncation of data also leads to data-validity problems. For instance, a user may mistakenly input six digits for a US zip code, which gets truncated to five digits. While such an input may pass data-validation checks, it is ultimately inaccurate.

Additional sources of data-validity errors arise due to mismatched data formats. For instance, a data type like zip code may be inconsistently saved in numeric or string format.
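
The validity rules described in this section can be expressed as small format checks. The sketch below assumes zip codes and phone numbers are stored as digit-only strings; the patterns and function names are illustrative and would need to be adapted to your own business rules.

```python
import re

ZIP_RE = re.compile(r"\d{5}")     # five-digit US zip code
PHONE_RE = re.compile(r"\d{10}")  # ten-digit phone number, digits only

def is_valid_zip(value) -> bool:
    """Valid only if the value is a string of exactly five digits."""
    return isinstance(value, str) and ZIP_RE.fullmatch(value) is not None

def is_valid_phone(value) -> bool:
    """Valid only if the value is a string of exactly ten digits."""
    return isinstance(value, str) and PHONE_RE.fullmatch(value) is not None

checks = {
    "zip ok": is_valid_zip("02139"),
    "zip six digits": is_valid_zip("021390"),
    "zip stored as number": is_valid_zip(2139),   # leading zero was lost
    "zip truncated": is_valid_zip("021390"[:5]),  # passes, yet may be inaccurate
    "phone ok": is_valid_phone("5551234567"),
}
print(checks)
```

The truncated value passes the format check, which shows why validity checks alone cannot guarantee accuracy: it is better to reject the six-digit input at entry time than to truncate it silently.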

Improving data quality
There are numerous methods for improving data quality.
The first step often involves data profiling—that is, doing an initial assessment of the current state of the data sets. Defining what is good data is also critical to establishing guardrails around selecting data for further usage.

Furthermore, a number of checks for data validation, completeness, consistency, and timeliness can be defined and have to be met by all current and new data sets.
Data standardization across the organization helps to meet data-quality standards so that every stakeholder across different divisions has the same understanding of the various data sets and fields.

Implementing a robust data governance framework can also help businesses improve the quality of organizational data.

Finally, recent advances in machine learning and deep learning can also be used to assess and improve the quality of data in a more scalable and reproducible fashion. For example, in one deep-learning study, a data-quality assessment framework grounded in statistics and deep learning was used to identify outliers in a data set of salary information published by the state of Arkansas, USA.
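
The framework used in that study is not reproduced here. As a much simpler statistical stand-in, the sketch below flags outliers with the interquartile range (IQR) rule. The salary figures are made up, and the multiplier of 1.5 is a common convention rather than a requirement.

```python
import statistics

def iqr_outliers(values, k: float = 1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up salaries; the last one looks like a data-entry error.
salaries = [42_000, 45_000, 47_500, 51_000, 53_000, 55_000, 58_000, 61_000, 640_000]
outliers = iqr_outliers(salaries)
print(outliers)
```

A flagged value is only a candidate for review. Whether it is a genuine error or a legitimate extreme still has to be decided with domain knowledge.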

As the size of organizational data is bound to increase exponentially in the coming years, companies ought to allocate dedicated resources and investments in new techniques from fields like machine learning and deep learning to measure and provide statistical insights into the quality of their data.

Conclusion
In this article, you’ve learned what data quality is and why it is important for organizations to measure and evaluate the quality of their in-house data. Poor-quality data can have significant consequences for a business in terms of inaccurate analytics, predictive machine learning models trained on bad data, as well as ill-informed business decisions and strategies.

Data quality can be measured in terms of a number of parameters such as accuracy, completeness, consistency, timeliness, uniqueness, and validity. Each of these data-quality dimensions is important, and organizations can improve the quality of their data by having robust data profiling, standardization, and validation checks in place. More recently, advances from machine learning and deep learning can also be harnessed to quantitatively define and evaluate the quality of data.

The Metric Layer & how it fits into the Modern Data Stack

25/4/2022

Published by Transform

Introduction
A metric layer is a centralized repository for key business metrics. This “layer” sits between an organization’s data storage and compute layer and the downstream tools where metric logic usually lives, such as business intelligence tools.

A metric layer is a semantic layer where data teams can centrally define and store business metrics (or key performance indicators) in code. It then becomes a source of truth for metrics, which means people who analyze data in downstream tools like Hex, Mode, or Tableau will all be working with the same metric logic in their analyses.

The metric layer is a relatively new concept in the modern data stack, mainly because until recently, it was only available to companies with large or sophisticated data teams. Now it is more readily available to all organizations with metric platforms like Transform.

In this article, you’ll learn what a metric layer is, how to use your data warehouse as a data source for the metric layer, and how to get value from this central metric repository by consuming metrics in downstream tools.

How a Metric Layer fits into a Modern Data Stack
The modern data stack is composed of a number of elements organized in the order of how data flows:

Managed ETL (or ELT) pipeline that ingests data from a variety of data sources
Data storage solution in the form of a data warehouse or data lake on-premise or in the cloud
Data transformation pipeline that processes stored data using languages like SQL and YAML for downstream business operations, analytics, and data science solutions
BI or data visualization platform
Data governance framework
Metric layer / metric store

One central benefit of a metric layer is that it sits between the data warehouse and downstream analytics tools. People can access metrics in business intelligence (BI) tools like Tableau, Mode, and Hex, bringing metrics consistency across all business analysis.

Use cases for the Metric Layer
The formulation and implementation of metric layers were pioneered by prominent tech companies like Airbnb, Spotify, Slack, and Uber. Airbnb designed a metric layer called Minerva to serve as a single source of truth (SSOT) metric platform. They did this by standardizing the way metrics are created, calculated, served, and used across the organization.

Uber built uMetric, a standardized metric platform that underlies the entire lifecycle of a metric, from definition, discovery, planning, and calculation to quality and consumption. These pillars not only enable rapid metric computation for business decisions, but also help create useful features for training ML models and promote data democratization.

A new component in the Modern Data Stack
With the emergence of big data, predictive analytics, and data science, most companies have access to enormous amounts of valuable data. Many organizations have evolved their data stack to simplify computation, transformation, and access to key business metrics, which can accelerate data-driven decision-making.

However, as Benn Stancil noted in his popular Substack blog, there was no central repository for defining metrics. This causes confusion and misalignment across an organization.

"The core problem is that there’s no central repository for defining a metric. Without that, metric formulas are scattered across tools, buried in hidden dashboards, and recreated, rewritten, and reused with no oversight or guidance."
—Benn Stancil, The missing piece of the modern data stack

Another common issue is “dashboard sprawl” where metric logic is spread across different tools and data artifacts. Since this logic is different for every tool, teams often end up with different numbers for the same metrics and no one knows where to find the “correct” metric to answer their most important business questions.

This problem led to the metric layer becoming a new artifact in the modern data stack. With a single shared store of metrics definitions and values, the metric layer ensures consistent and accurate analysis and reporting of metrics.
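
To illustrate the idea of a single shared store of metric definitions, here is a minimal sketch of a metric registry in plain Python. It is not the API of Transform, MetricFlow, Minerva, or uMetric; the `Metric` class, the `register` and `query` functions, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    compute: Callable[[list[dict]], float]

# Hypothetical central registry: exactly one definition per metric name.
REGISTRY: dict[str, Metric] = {}

def register(metric: Metric) -> None:
    """Add a metric definition, refusing to silently overwrite an existing one."""
    if metric.name in REGISTRY:
        raise ValueError(f"Metric {metric.name!r} is already defined")
    REGISTRY[metric.name] = metric

def query(name: str, rows: list[dict]) -> float:
    """Every downstream tool calls this instead of re-implementing the formula."""
    return REGISTRY[name].compute(rows)

register(Metric(
    name="mrr",
    description="Monthly recurring revenue: sum of active subscription fees",
    compute=lambda rows: sum(r["monthly_fee"] for r in rows if r["active"]),
))

subscriptions = [
    {"monthly_fee": 50.0, "active": True},
    {"monthly_fee": 20.0, "active": False},
    {"monthly_fee": 30.0, "active": True},
]
dashboard_value = query("mrr", subscriptions)  # e.g. called from a BI dashboard
notebook_value = query("mrr", subscriptions)   # e.g. called from a notebook
print(dashboard_value, notebook_value)
```

Because both consumers go through the same definition, they cannot drift apart, which is the property the metric layer is meant to provide at organizational scale.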

A metric layer not only centralizes key business data but also helps improve the efficiency of data teams by removing the need for repeated analytics. This helps data stakeholders become key advocates and enablers of data-driven decision-making and data democratization across the entire organization.

Reutilization of metrics in diverse contexts and external tools
One of the benefits of having a single metrics repository is that it can be connected to a variety of tools, for example CRMs, BI tools, tools developed in-house, and data quality and experimentation tools.

A centralized architecture ensures that no matter how a tool’s internal logic is configured, the end result will be based on the same metric logic and consistent across tools and applications. For instance, MetricFlow, the metric layer behind Transform, has an API that enables users to express requests for their Transform metrics directly within SQL expressions.

Core metrics like Net Promoter Score (NPS), Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), customer lifetime value (LTV), and Annual Recurring Revenue (ARR) capture the health of the business and need to be accurate for reporting and decision-making. With a metric layer, it’s possible to see the lineage of each metric, how it’s built, what the data source is, and how it’s consumed. By unifying metrics extraction and data analytics on these metrics, the metric layer provides the much-needed consistency that is lacking in modern data stacks.

Enhancing transparency between technical and non-technical teams with a single interface
A single interface for metrics information gives data stakeholders across an organization (in development, sales, marketing, and more) the same view and understanding of the key metrics used to track goals. This consistency allows all of these teams to speak the same language regardless of the tools they use to compute the metrics. This is a tremendous benefit of a metric layer and promotes stronger data democratization and governance across the entire organization.

Transform is unique in that it has the addition of a metrics catalog on top of MetricFlow, its open source metric layer. The metrics catalog is a central location where both data teams and non-technical users can interact with, build context, collaborate on, and share key metrics.

Tracking changes is easier
Businesses are constantly evolving, creating new metrics or changing the definition of existing ones. Without a metric layer, each data stakeholder has to manually keep track of changes in the data warehouse to update their metric definitions and logic.

However, with the combination of a metric layer and a metrics catalog, tracking changes becomes easier: metric owners are alerted anytime the lineage or definition of a metric changes. This enables data stakeholders to make better sense of data, especially when a new metric definition leads to anomalous or unexpected results.
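
One possible way to detect that a metric definition has changed is to compare a fingerprint of each definition between two snapshots. The sketch below is an assumption about how such a check could work, not a description of how Transform implements its alerts; the definition fields (`sql`, `filter`) are made up.

```python
import hashlib
import json

def fingerprint(definition: dict) -> str:
    """Stable hash of a metric definition; key order does not affect it."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_metrics(previous: dict, current: dict) -> list[str]:
    """Names of metrics that are new or whose definition differs from the previous snapshot."""
    return sorted(
        name for name, definition in current.items()
        if name not in previous or fingerprint(previous[name]) != fingerprint(definition)
    )

previous = {"mrr": {"sql": "SUM(monthly_fee)", "filter": "active"}}
current = {
    "mrr": {"sql": "SUM(monthly_fee)", "filter": "active AND NOT trial"},
    "arr": {"sql": "SUM(monthly_fee) * 12", "filter": "active"},
}
to_alert = changed_metrics(previous, current)
print(to_alert)
```

The resulting list could then drive notifications to the owners of the affected metrics.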

Dig into the Metric Layer
A metric layer reduces the problem of disparate results when the same metric is computed by different teams using a wide variety of BI tools. And it makes data-driven analytics more precise and promotes faster and more accurate decision-making.

If you’re looking for a streamlined and centralized metric layer, MetricFlow is now open source. You can explore the project on GitHub. Find more information about Transform’s metric layer and its benefits in the product documentation.

Team Development Tips for Engineering and Product Leaders

18/4/2022

Published by StatusHero

Introduction
Teams are the building blocks of successful organizations. The success of modern technology companies is driven to a large extent by their engineering and product teams. It is crucial for new engineering and product team leaders to maximize the productivity of their respective teams while ensuring a strong sense of team spirit, motivation, and alignment to the larger mission of the company, as well as fostering an inclusive and open culture that is collaborative, meritocratic, and respectful of each team member. Effective team development and management is therefore critical for engineering and product leaders, and ensuring robust team development at scale remains a big challenge in the face of changing work conditions.

Despite the importance of team building and development, not many leaders are trained to succeed and hone their leadership skills. In many cases, individual contributors who progress or transition to the managerial track may have neither the aptitude for developing teams nor the necessary experience or training in this vital aspect of their new role. Although team development is more an art than a science, this topic has received significant interest from the industry as well as academia, leading to structured team development theories and strategies.

In this article, you’ll explore a list of curated tips for engineering and product leaders to better manage the development of your teams and accelerate your learning journey on the leadership track. This particular set of tips focuses on building team cohesion, facilitating the five stages of team development, and providing structures for effective teamwork and communication that foster an open and collaborative team culture.

Regular Check-Ins
One of the fundamental responsibilities of a team leader is to have periodic check-ins with team members, both individually and as a group. These meetings serve as an opportunity to assess each team member’s work performance, their attitude and motivation toward their respective projects, and even their sense of belonging and identity within the team and the organization at large. These regular one-on-one meetings with direct reports also help to bring to light any professional or personal concerns that the manager can then try to address, whether on their own or with the support of colleagues from the human resources department.

Group meetings are also essential to allow team members to gather and discuss work issues as a group and voice any concerns that may affect the entire team’s output, productivity, efficiency, or morale. Such group meetings also provide a window for colleagues to learn more about the work and progress made by other members in the team, as well as provide a collaborative atmosphere in which they are encouraged to share their opinions or suggestions. Holding regular retrospectives is a great way to foster discussion and collaboration.

As you can see, both individual and group meetings serve as a vital opportunity for team leaders to check the pulse of each member and the team as a whole to assess whether any interventions are necessary to uplift productivity and motivation. Sometimes, these kinds of meetings can be conducted as a retreat or simply at an off-site location to enable team members to bond in a fun environment and encourage more open communication about the team’s development and progress.

Structured Work
Team members benefit immensely from a high-level structure to guide their work and appropriately allocate their time and resources to the various projects they are involved in. Ideally, all employees should be assigned projects that suit their particular skill set and interests and should be empowered to take ownership for the success of their projects. With individual owners for each team project, the role of the manager is to simply serve each colleague in terms of offering strategic guidance, providing additional resources or bandwidth, and removing any technical or organizational blocks that may otherwise impede their progress.

In addition to a clear and structured assignment of work projects, teams also benefit from having a structured work cycle. For instance, engineering teams usually employ an Agile methodology and a regular Scrum cycle to plan their work in sprints and evaluate their progress.

Using these proven methodologies helps team members plan their work effectively and encourages colleagues and managers to weigh in on project planning and management. Over time, if these processes are followed diligently, teams become vastly more organized and productive, leading to more successful projects and deliverables.

Five Stages of Team Development
According to research by renowned psychologist Bruce Tuckman, there are five distinct stages in a team’s development. These include the following:

Forming
This is the first stage in a team’s development, in which team leaders introduce individual team members, highlight their respective experience and skills, and facilitate interactions among the team. Knowing each other’s core strengths helps team members better understand who to reach out to for help or collaborate with to execute their projects successfully. Ideally, this stage should be revisited each time a new colleague joins the team to ensure that they feel welcome and to stimulate effective onboarding.

Storming
Storming is the next stage in a team’s development, which involves team members openly sharing their ideas for current work or new projects in front of the entire team. Team leaders can facilitate this by organizing meetings or events such as hackathons. During this brainstorming stage, it is important that each individual is allowed to freely express their opinions even if they are in conflict with others’. This provides leaders an opportunity to provide high-level clarity and showcase their leadership by effectively resolving any conflicts and motivating team members to disagree and commit for the greater good of the team.

Norming
During this stage, the team has crossed the initial hurdles and resolved differing opinions, allowing them to begin to hit their stride and work more productively as a unit. With a clear roadmap and a better sense of team success, individual employees begin to appreciate each other’s strengths, accommodate each other’s weaknesses, and collaborate more effectively. Team leaders should congratulate themselves for attaining the norming stage but also be aware of the need to maintain the team’s motivation and momentum toward achieving their goals.

Performing
By this stage, a team benefits from high levels of cohesion and trust in each other. Teams are more efficient and can self-sustain their progress and velocity with little oversight or push from the team leaders. This enables them to take on more challenging and audacious projects and push the team’s limits in a positive manner. During this stage, team leaders can step in to hone individual team members’ strengths and help them develop and strive for the next step in their careers. Sincere team leaders leverage their coaching and mentorship skills to empower individuals to progress toward their peak efficiency and realize their full potential at work.

Adjourning
By this stage, teams have completed their projects. This is an excellent opportunity to discuss what went well, what did not go so well, and how to improve and implement new strategies for future team projects. This is a good time to celebrate individual and team successes and to congratulate employees in a public forum, motivating them to strive for even greater success in the future. Team leaders should also take the feedback from the team and leverage it to improve their team building and development methods.

Conclusion
Developing teams of engineers and product managers is a critical responsibility for the leaders and managers of modern technology companies. When teams operate at their best, the organization as a whole benefits from their productivity and positive momentum.

In this article, you’ve learned several tips and strategies that engineering and product team leaders can absorb and implement in their respective teams. These include conducting regular check-ins with individual employees as well as the entire team, providing a structured framework for carrying out their work and executing projects successfully, and following the principles from the five stages of team development.

Essentially, leaders should strive to build a team where the whole is greater than the sum of its parts. This requires not only substantial care, attention, and effort from the leaders but also a high level of empathy and understanding of each individual in the team. Teams with strong, empathetic, servant leaders rise above other teams in an organization, attracting better and more strategic projects and opportunities for collaboration, ultimately resulting in a win for every team member as well as the team leader.
