Data is the cornerstone of businesses from large enterprises to small D2C brands, and huge amounts of it can be collected from websites, mobile apps, chat messages, call centers, business transactions, surveys, and social media platforms, among other channels. All this data represents a gold mine of information that can offer customer insights and lead to new ideas for features or products.
However, making sense of the data is easier said than done. The information originates from various channels and in multiple formats. It can be logged erroneously and contain other errors, including missing values. Because it comes from multiple domains, it can include unstructured data like text, images, audio, and video.
That is why data preparation is essential. This involves cleaning, curating, transforming, and storing data sets for downstream applications including data analytics and data visualization, as well as predictive intelligence based on machine learning and deep learning models. Data can only provide value once it has been processed from its raw form, and effective data preparation can maximize that value.
This article will explain the process of data preparation, especially in terms of data labeling, and will provide a checklist for data engineers to follow.
What Is Data Preparation?
Data preparation is not an entirely new process in technology companies. Data-driven operations previously focused on statistical analysis of business data from structured tables. Deep learning has grown over the past decade along with the global penetration of mobile phones, widely available internet access, and cheaper cloud storage. Today an estimated 2.5 quintillion bytes of data are generated daily.
Every user interaction with online companies is recorded, from someone clicking an ad or adding a product to a shopping cart to sharing a photo on a social media app. User-generated data is generally unstructured data: images, text, audio, or video. Such data can be used to train sophisticated deep learning models to predict what users want to type in a text, which branded products are featured in an image, and what kind of customer service will be provided in a phone conversation.
For deep learning models to make sense of this data, all data samples need to be labeled. Data labeling tells the machine learning models what knowledge they need to acquire via supervised learning to power smart applications. This makes labeling critical in preparing data sets for training machine learning models.
However, data labeling can also be the chief source of errors, limiting potential improvements in model performance. Machine learning models can only be as accurate as the labeled data, which represents the models’ entire knowledge for the particular use case.
For example, the source image data set in a face recognition program requires a label for every face shown in every image. During the labeling process for this data set, every image is reviewed by human subject matter experts, crowdsourced labelers on platforms like Amazon Mechanical Turk, or algorithms.
Labeling helps clean and prepare the data set by removing noisy or unusable data. In this case, images that don’t contain any faces, or that show unreadable faces due to poor lighting or angles, should be removed because they won’t be helpful in training a face recognition model. This step also ensures the inclusion of images that are most helpful for the desired use case.
Once the data set is reviewed and annotated, it can be used for all subsequent face recognition applications instead of going back to the raw data set. This saves time and effort for data engineers, as well as data scientists who might build novel models using the same data set.
Additionally, multiple labels and metadata can be applied to each image during the labeling process so that they’re available for new use cases. A tag that identifies the face as that of a man, woman, or child can be used for different computer vision applications. This can potentially give the data set more flexibility for the future.
The labeling can be built upon in subsequent versions of the data set. Once the face recognition model is live in production, new images can be labeled to help the model overcome data drift and augment its performance in the face of changing data distributions. This continued labeling and organizing keeps the models more robust and consistent.
Data Preparation Steps
There are certain best practices to follow when preparing data sets for deep learning applications. The following is a checklist for data engineers working with unstructured data:
(1) Check data formats
Samples in a data set, especially if collected via web scraping or crowdsourcing, may come in multiple data formats. For example, an image could be a JPEG, PNG, or TIFF, while an audio file could be a WAV, MP3, or FLAC. Check whether the data set samples are in different formats, so that you can standardize the format across all samples.
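As a minimal sketch of this check, the snippet below counts file extensions so that mixed formats stand out. The file names are hypothetical, and an extension is only a hint at the real format, so a production check would likely inspect file contents as well.

```python
from collections import Counter
from pathlib import PurePath


def count_formats(filenames):
    """Count files per lowercase extension so mixed formats are easy to spot."""
    return Counter(PurePath(name).suffix.lower() for name in filenames)


# Hypothetical file names standing in for a scraped image data set.
samples = ["a.jpg", "b.JPG", "c.png", "d.tiff", "e.jpg"]
format_counts = count_formats(samples)

# More than one key means the data set needs format standardization.
for extension, count in format_counts.most_common():
    print(extension, count)
```

If more than one extension appears, convert the minority formats to the dominant one (or to whichever format the training pipeline expects) before moving on.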
(2) Verify data types
Certain deep learning applications are based on multimodal data including text, images, audio, video, and structured metadata. For example, a model that predicts what video a user might watch next is trained using multiple data types. Verify the type of each data sample, then index and store the samples separately by type. Note that even a single kind of value, such as a number, might be stored as different types like int, float, or string.
(3) Verify data dimensions
It’s crucial to check the dimensionality of the samples in a data set. For example, a set of images containing faces may be gathered from different cameras, each associated with different default image dimensions.
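One possible way to run this check is sketched below. It assumes the (width, height) of each image has already been read into a dictionary by whatever image library you use; the sample names and sizes are hypothetical.

```python
from collections import Counter


def find_nonstandard(dimensions, target=None):
    """Return the target size and the names of samples that do not match it.

    dimensions maps a sample name to its (width, height).
    If target is None, the most common size in the set is used.
    """
    if target is None:
        target = Counter(dimensions.values()).most_common(1)[0][0]
    mismatched = sorted(name for name, dims in dimensions.items() if dims != target)
    return target, mismatched


# Hypothetical sizes from images captured by two different cameras.
image_sizes = {
    "img1": (640, 480),
    "img2": (640, 480),
    "img3": (1920, 1080),
}
target_size, to_resize = find_nonstandard(image_sizes)
print(target_size, to_resize)
```

Samples flagged this way can then be resized, cropped, or padded to the target dimensions, depending on what the model expects.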
(4) Identify what data needs to be labeled
Once you’ve completed the above steps, you can begin data labeling. It may not be feasible in some situations to label each data sample, because manual labeling can be prohibitively expensive and time-consuming. In this case, choose an appropriate number of data samples for labeling. For common machine learning classification use cases, you need to sample data for labeling from each category.
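The sketch below illustrates sampling from each category when labeling everything is too expensive. It assumes each sample already carries a provisional category (for example, its source or a rough guess); the sample IDs and categories are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_category(samples, per_category, seed=0):
    """Pick up to per_category sample IDs from each category.

    samples is a list of (sample_id, category) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_category = defaultdict(list)
    for sample_id, category in samples:
        by_category[category].append(sample_id)
    return {
        category: rng.sample(ids, min(per_category, len(ids)))
        for category, ids in by_category.items()
    }


# Hypothetical unlabeled pool with provisional categories.
pool = [(f"cat_{i}", "cat") for i in range(10)] + [(f"dog_{i}", "dog") for i in range(3)]
to_label = sample_per_category(pool, per_category=5)
print({category: len(ids) for category, ids in to_label.items()})
```

Sampling a fixed number per category keeps rare categories from being crowded out by common ones in the labeled subset.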
(5) Determine what type of labeling to perform
The same data sample can be labeled in multiple ways depending on the use case. For instance, an image containing people and cars may be labeled for faces, for segmenting people or cars, or for the vehicle registration plates.
(6) Decide who will label the data
Data labeling can be performed manually by domain experts, crowdsourced from non-experts, or done programmatically using rule-based or model-based algorithms. Determine which annotators will label what kind of data, depending on their expertise or level of training. If a data set will be labeled using software, then the required configuration parameters, protocols, and performance metrics need to be established so that labeling is consistent.
(7) Review data for errors and mistakes
Usually, the first round of data labeling contains errors. To improve the data quality and eradicate errors, more experienced annotators should conduct a second or third level of review. Depending on cost, time, and available resources, each data sample can also be independently labeled by multiple annotators; the most commonly provided label can be assigned as the final label.
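Assigning the most commonly provided label is a majority vote, which can be sketched as follows. Returning None on a tie, so the sample can be escalated to a more experienced reviewer, is an assumption of this sketch rather than a rule from the checklist.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    # A tie for first place is left unresolved so a reviewer can decide.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_label


# Hypothetical labels from three independent annotators per image.
annotations = {
    "img1": ["face", "face", "no_face"],
    "img2": ["face", "no_face"],
}
final_labels = {name: majority_label(votes) for name, votes in annotations.items()}
print(final_labels)
```

Using an odd number of annotators per sample reduces how often ties occur for two-class labeling tasks.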
(8) Split the data set into training and testing segments
Once a data set is labeled, split it into separate train and test subsets for training and evaluating the model, respectively. Depending on the use case and the amount of available data, the ratio might be 80:20, 90:10, or even 99:1. To obtain more reliable results, k-fold cross-validation is recommended: the data set is divided into k folds, each fold serves once as the test set while the remaining folds are used for training, and the final results are averaged across all folds.
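A minimal sketch of k-fold splitting using only the standard library is shown below; in practice a library such as scikit-learn offers the same functionality. The function name and the fixed seed are choices made for this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample lands in exactly one test fold, so every labeled sample is used for evaluation once and for training in the other rounds.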
Without the protection of systematic data preparation and labeling checks, you may find that poor quality data damages the accuracy and performance of any analysis or models based on that data. If you follow the above guide, you will be able to ensure your data is good quality and labeled accurately.
Metrics are widely used by data, product, strategy, and business teams to capture and summarize data about various aspects of user behavior, product performance, and the health of the business. Metrics like annual recurring revenue (ARR), gross merchandise volume (GMV), customer acquisition cost (CAC), lifetime value (LTV), and net promoter score (NPS) are common parlance in product startups and large tech companies.
Technical and business stakeholders need the information collected in metrics to make sense of their product and business performance so that they can make data-driven decisions. This makes tracking metrics essential to detect potential issues, plan new business initiatives, ensure growth, and share pertinent information with regulatory bodies as well as shareholders.
A change in growth metrics can deeply impact investor confidence and the perception of the company in public markets. For instance, the stock prices of Meta and Netflix recently plummeted after they reported declines in key growth metrics like daily active users (DAU) and number of subscribers, respectively. For tech companies at this scale, staying on top of metrics is critical and requires a sophisticated approach to data engineering, data governance, and data democratization.
In this article, you’ll learn about how metrics are defined, used, and managed at different types of large tech companies.
How Do Large Companies Define and Use Metrics?
Though large companies are equally reliant on metrics to drive their decision-making, what they measure and how they measure it will vary by company. The following are examples of the metrics strategies used at Uber, Airbnb, Spotify, and Netflix.
Uber’s core business is a marketplace that connects riders with drivers in real time at a global scale. Its product teams rely most heavily on metrics related to trips taken and driver experience, such as “driver acceptance rate” and “completed trips.” It also uses map data to determine driver ETA and pickup and dropoff spots.
Because disparate versions of the same metrics were being used across business teams, leading to ineffective and poor decision-making, Uber implemented changes to improve metric standardization. The company built a unified metric platform called uMetric to enforce a strict one-to-one mapping from business logic to metrics without any discrepancies.
uMetric is built on engineering solutions that democratize data and provide a clear understanding of the entire metric lifecycle so that the data can be better used in machine learning models. The platform enables access to metrics across their entire lifecycle, from definition, discovery, and planning to computation, quality, and consumption.
Clear and unambiguous definition of metrics is a key pillar of the platform, and metrics can be defined by any author without any duplication. In uMetric, a metrics definition model is designed on the following core principles:
Using this definition model is not enough to ensure metric standardization, however. Additional policies and solutions focused on data governance, data quality, and access control are necessary to scale the platform across the company.
Similar to Uber, the vacation rental marketplace Airbnb built a metrics platform called Minerva to achieve metric consistency and serve as the ground truth for data analytics, reporting, and experimentation.
Airbnb built its foundation of data on lodgings and vacation rentals on tables referred to as `core_data`. As the company grew, though, teams built separate tables on top of `core_data` without any information about data lineage or correspondence between these tables. This led to conflicting results and insights, which confounded data-driven decision-makers.
Minerva was designed to solve these problems. It takes facts and dimension tables as inputs, optimizes the data through denormalization, then sends the data to downstream applications. Minerva acts as the metric store for more than 30,000 metrics produced by more than 200 stakeholders across the organization. As uMetric does, Minerva supports the end-to-end lifecycle of a metric from definition to deprecation and powers the whole tech stack of Airbnb.
Metrics, dimensions, and metadata are defined and stored in a central GitHub repository that is accessible by any stakeholder in the company. Once defined, metrics can be used anywhere via dashboarding tools or A/B experimentation frameworks. All the metrics defined in Minerva are indexed in Dataportal, Airbnb’s internal data catalog. A deeper dive into the metrics is facilitated by another tool called Metric Explorer, which is designed for both technical and non-technical users.
Minerva powers several downstream applications:
The Spotify global audio streaming service also developed an in-house metrics catalog, but as part of a modern A/B testing experimentation platform in order to create custom metrics at scale.
Spotify’s metrics catalog runs SQL pipelines to ingest metrics into a data warehouse. This enables the collected metrics to be almost instantly stored, managed, and served to the experimentation platform. A key feature of the metrics catalog is that it enables self-service. Teams can write SQL queries to define metrics, and the rest is taken care of by the managed system.
To address the problem of lack of metrics standardization and metrics duplication, Spotify built a Metrics Hub. In addition to providing a single source of truth, the hub also focused on creating symmetry between offline and online use of metrics. This feature makes it easy to take any metric definition and deploy it seamlessly in different environments to power experimentation and machine learning use cases.
In typical A/B testing experiments, users are split into distinct groups. Consider a hypothetical example in which Spotify wants to A/B test whether podcasts are more popular in the 30- to 39-year age group or the 20- to 29-year age group. This experiment requires a set of user-level input metrics like demographics, daily or weekly listening time, number of songs listened to, and number of podcasts listened to. Spotify’s metric pipeline integrates these metrics with the experimental group each user belongs to. This data is combined and stored in a data warehouse, then accessed with an API that allows users to query data without needing to understand the underlying storage.
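To make the idea concrete, here is a small sketch of joining user-level metrics with experiment group assignments using an in-memory SQLite database. The table names, columns, and values are hypothetical and are not Spotify's actual schema or pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_metrics (user_id INTEGER PRIMARY KEY, podcasts_listened INTEGER);
    CREATE TABLE assignments (user_id INTEGER PRIMARY KEY, experiment_group TEXT);
    INSERT INTO user_metrics VALUES (1, 4), (2, 6), (3, 1), (4, 3);
    INSERT INTO assignments VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'B');
    """
)

# Combine each user's metric with their experiment group, then summarize per group.
rows = conn.execute(
    """
    SELECT a.experiment_group, AVG(m.podcasts_listened)
    FROM user_metrics AS m
    JOIN assignments AS a ON a.user_id = m.user_id
    GROUP BY a.experiment_group
    ORDER BY a.experiment_group
    """
).fetchall()
conn.close()

for group, avg_podcasts in rows:
    print(group, avg_podcasts)
```

A real experimentation platform would add statistical tests on top of these per-group summaries; the join itself is the part this sketch illustrates.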
A metrics catalog enables multiple stakeholders to access and analyze data, which helps an organization to more efficiently and quickly improve the customer experience.
As a global entertainment platform that serves real-time video content to millions of users, Netflix needs to mine numerous insights on metrics like user engagement, viewership, and video streaming quality. It uses the data it gathers to make recommendations to users based on factors like watch history and demographics.
Netflix powers multiple experiments in parallel through a centralized A/B experimentation platform. Similar to Spotify, this platform has a metrics catalog at its core.
A centralized metrics repository built using Python, Metrics Repo is home to diverse user-level as well as technical metrics like streaming time, play delay, and retention rate. Metrics Repo provides a unified platform for metric definitions that are typically defined and engineered differently by various business teams. In this modular architecture, data scientists can add metric definitions directly and join data tables to perform metric computations.
Analytical reports can be calculated on demand without affecting the underlying metrics. Metrics Repo serves as a single source of truth for statistical analysis and causal inference based on these metrics and visualization of corresponding results and insights.
This architecture provides a transparent metric lineage and definition, ensuring greater trust in the experimental results. This is critical for enabling rapid mining of insights, development of new products and strategies, and executive-level decision-making.
Metrics provide a data-driven summary of key business goals and operational performance. Product managers, data analysts, and business leaders use them to assess and track the growth of the business, as well as devise new products and strategies. Because metrics are so crucial to the health and growth of a business, stakeholders need a clearly defined way to collect and measure metrics in order to improve their decision-making.
You’ve learned about how data teams define and use metrics at four top tech companies: Uber, Airbnb, Spotify, and Netflix. Uber and Airbnb built an internal metrics platform that manages the entire lifecycle of their metrics. Spotify and Netflix, meanwhile, built metrics catalogs to form a central pillar of a modular and scalable experimentation platform. These different solutions achieve the same goal of making necessary data cohesive, consistent, and actionable.
Data culture refers to an organizational culture of using data to derive insights and make informed business decisions. Companies can build a strong data culture by arming themselves with data and the right set of people, policies, and technologies.
A data culture helps companies become more competitive and resourceful by leveraging data. And data-driven companies make better, faster, and more objective business decisions. They promote greater employee engagement and retention, and drive better financial outcomes in terms of revenue, profitability, and operational efficiency.
In this article, you'll learn about data culture, what its importance is for modern organizations, and how you can build a strong data culture at your company.
Why Do You Need a Strong Data Culture?
Without a solid data culture, organizations will inevitably fail to harness the power of data. As previously stated, data culture refers to a set of beliefs and practices that companies use to cultivate and drive more data-driven decisions.
Traditionally, businesses relied on the instinct and gut of a select few leaders to make strategic business decisions. However, with the accumulation and collection of massive volumes of customer and business data, domain expertise and instinct can now be complemented with data-driven insights to make more informed decisions.
There are several advantages to building a strong data culture. Some of these include the following:
Every business sector, from product to finance to HR, creates and collects a lot of data from external customers or internal operations. For business heads and decision-makers, it's no longer feasible to stay on top of the ever-increasing volumes of data to better understand and evaluate the current state of their organization. However, with data analysts and scientists embedded across each department, it is possible to tap business insights in real time and respond quickly to changes in business performance.
A strong data culture also promotes greater employee engagement and retention. When employees see that decisions are made on the basis of data and not driven just by the highest-paid executives, they feel that they can contribute more insights to influence decision-making. In the long term, this facilitates attracting the best talent in the market who can be incentivized to have a greater say in making key business decisions using data.
Moreover, there are also strong financial outcomes associated with building and promoting a data culture. Companies with data-driven cultures benefit from increased revenue, better customer services, and more operational efficiencies leading to improved profitability.
How to Build a Strong Data Culture
Building a strong data culture is a long-term endeavor that requires patient support and encouragement from leadership. Companies with strong data-driven cultures have executives who lead by example and establish clear expectations that decisions will be objective and based on data.
Data leaders can lead from the front by establishing clear goals and guidelines, investing in technology and training, as well as identifying and rewarding employee behaviors that embody a data-led culture. Beyond leadership setting a tone for the whole organization, let's take a look at a few other components that can help build a strong data culture.
1 Bring Business and Data Science Together
One of the first steps in building a data culture is to build a strong data science team consisting of data analysts, data engineers, and data scientists. Having quality in-house data talent is a competitive advantage that reaps multiple benefits, including building a robust culture focused on data.
Once a data science team is up and running, it needs to be strategically embedded across various departments of the business. This helps business professionals interact with data professionals more regularly and better understand how the power of data analytics and data science can improve business efficiencies and impact profitability and growth.
At the same time, this setting enables data professionals to better understand how the business works and build intuition for developing better data and machine learning–powered tools and products. This creates a positive flywheel where both business and data science teams learn to collaborate better and benefit from their respective skill sets.
By bringing business and data science together, everyone in the organization learns to appreciate the value of data and use data-driven insights to improve the quality of their decisions, products, and services.
2 Leverage Data When Creating Goals and Deadlines
Driving strategic business goals and metrics by leveraging data is a key aspect of encouraging a data-led culture. When goal-setting exercises are conducted objectively and leaders regularly use data and metrics from previous business quarters or external data about competitors or the overall market, everyone in the organization will start to embrace similar data-driven approaches. Leveraging data for setting new targets also enables every stakeholder in the organization to understand and anticipate their future goals and prioritize their work accordingly.
Data-led goal setting is a more democratic and fair-minded process that encourages ownership of respective goals by every employee, as opposed to arbitrary, instinct-led, unilateral decisions made by the leadership.
3 Ensure Everybody Has Access to Data
A fundamental step toward attaining a data culture is to democratize access to data across the organization. Data culture is a difficult goal when employees in different parts of a business struggle to obtain data.
If you don't give your employees access to your data, they won't be able to utilize it when making decisions. This disenfranchises the data analysts, engineers, and scientists disproportionately, as their day-to-day work is impacted the most. Without a motivated team of data professionals, the downstream benefits of data are unlikely to materialize across various business departments.
A strong foundation of data governance and data democratization is a prerequisite to achieving the business goals associated with a robust data culture.
4 Keep Your Data Technology Up-to-Date
A critical aspect of building a data culture is employing modern tools and technologies to make it easier for employees to access, analyze, and share data-driven insights. Building a modern data stack with newer components like a metrics layer simplifies data-based operations and analytics for everyone, especially nontechnical business stakeholders.
Technology, like data warehouses and metrics layers; data analytics tools, like Tableau or Power BI; and customer relationship management (CRM) tools, like Salesforce, are indispensable for modern businesses. Building the data architecture in a cloud environment like Amazon Web Services further improves access to data and reduces the need for multiple tools with a steep learning curve.
The right use of tools for data, collaboration, and customer service goes a long way in fostering the use of technology to drive a strong data-led culture.
5 Provide Training for Employees
Having supportive leadership and access to data and technology is of little use if employees are not data literate and able to extract insights from data. This requires further investment in terms of learning and development to empower employees with the necessary skills to explore, understand, and share data-driven insights across the organization.
In addition to reducing the skills gap, it also encourages people from nontechnical backgrounds to become more data savvy, collaborate better with data experts, and build more comprehensive data products and solutions to benefit the business.
6 Reward Data-Oriented Decisions and Behavior
The primary challenge to becoming a data-driven organization is not technical but cultural. A strong data culture is based on a robust foundation of people, policies, and technology. However, once the initial foundation is in place, data leaders need to maintain and bolster the spirit of data-driven decision-making by incentivizing and rewarding behaviors that embody the culture.
At the same time, decisions and behaviors that do not represent a holistic data-led process ought to be called out and questioned until every single employee is on board with the philosophy of using data for every decision. This includes encouraging experimentation to answer key business questions for which data does not exist yet or when the current set of data does not yield compelling evidence.
In this article, you learned about the importance of a data culture for businesses. Building a strong data culture is a formidable task, and it is a top priority for a majority of CEOs.
Data-driven companies are in a better position to attract and retain talent, make faster decisions with more conviction, and drive stronger growth and profitability to meet their business goals. According to research by McKinsey & Company, data-driven companies are able to achieve their goals faster and realize at least 20 percent more earnings.
Today, data is at the core of many companies, and it's of the highest importance for running a successful business. Companies process huge amounts of data daily, which they must store, categorize, track, and organize by cataloging, and that's where data governance comes in.
Data governance is a set of processes that promote better management of business data, unlocking the true value of data by ensuring that it's more accessible, reliable, secure, and compliant. For modern data-driven organizations, a strong data governance framework is not only important but essential for the best use of data in business decisions. A strong data governance framework usually encompasses functions such as managing data access and data ownership, tracing data lineage, managing duplicate or false data, and classifying and assuring data quality. All of these are the pillars of a successful data governance process.
However, implementing a robust data governance framework is no small feat. If not done systematically, it can lead to a huge loss of organizational time, resources, and effort. Companies that have made significant progress in building data governance frameworks and cultivated a strong and inclusive data culture have done so incrementally, aligning incentives and creating deep collaboration across cross-functional teams that own the data governance roadmap. Organizations are more likely to be successful if they can bring together people, processes, and technology to build their framework.
In this article, you'll learn about best practices for implementing data governance in an organization. Companies can leverage existing best practices and build on them to fast-track their own data governance efforts.
What Are the Challenges of Implementing Data Governance?
Before you plan your data governance strategy, you need to look out for some common challenges.
One major challenge for organizations is building a strong business use case for investing staff and resources in a data governance framework. Those that haven't yet embraced digital transformation and the better, faster decision-making possible with deeper data analysis might not see the long-term business value of data governance. It's important to unite relevant stakeholders across the organization to take on the challenge.
Even when organizations do launch a governance framework, they may fail to achieve its true potential. Poor data leadership and ownership may be an obstacle, for example. Data governance also requires collaboration and consistent enforcement across departments to succeed. For example, the finance department could collaborate with the accountancy department to create a cross-practice team to communicate and transfer data more transparently.
So, without buy-in from technology teams and collaborative data ownership that helps break down organizational silos, the program is unlikely to come to fruition.
Additionally, a good data governance framework relies on high-quality data. The primary goal of data governance is to make data more accessible, secure, and reliable for stakeholders to consume for their own use cases. However, if the quality of the data at the source is poor, implementing data governance becomes much more difficult.
Data Governance Best Practices
The following are best practices that have been adopted successfully by numerous organizations, such as Collibra, IBM, Informatica, Select Star, and more, in building comprehensive data governance frameworks.
1 Build a Strong Business Use Case
The goal of data governance is to enable every stakeholder to use the data to make business decisions relevant to their department, whether that's sales, marketing, finance, or human resources. This means that you need the support and alignment of all users and departments right from the beginning. Without cross-functional support, building a strong business case for investing in a long-term mission like data governance is less likely to succeed.
Data governance generates significant business benefits that can make the advantages of the process clear to the leadership. It saves time, improves security, and yields more reliable and accurate data, making it easier to make data-driven decisions. When these business benefits are made clear to the leadership, it's easier to get approval for needed staff, budget, and resources for the project.
2 Identify Data Stewards and Owners
Clearly defined roles and owners are necessary to build the data governance framework in a structured manner. Knowing which stakeholders own certain responsibilities also helps with clear lines of communication. Exact roles may differ across organizations, but the following are common choices:
3 Start Small
Creating a strong data governance framework requires the right mix of people, processes, and technology to come together. It's crucial to start small and aim for quick incremental wins rather than overpromising and underdelivering. Creating governance guidelines requires specific expertise; you could hire this expertise, but empowering and upskilling people within your existing team might be more successful as they already know your data.
Those responsible for data governance then need to gradually build trust and seek alignment from various cross-functional departments before the framework policies can be enshrined as organization-wide processes.
For governance-based processes to be adopted and diligently followed, your data stewards need to implement regular checks and audits and guide team members and departments that might not be familiar with good data governance practices. This guidance has two dimensions: cultural guidance and technological guidance concerning the required tools.
When data stewards implement processes, they should also implement the right tools for advanced actions such as automation. Once every cross-functional team understands when and how to use governance principles in their day-to-day work with the help of the tools, you can automate some of the processes.
4 Define and Measure Metrics
Data governance is a long-term investment. However, it's important to measure progress over shorter time frames to ensure that key milestones are being achieved without delays or hurdles. Monitoring metrics such as the percentage of data assets with an assigned owner, the number of questions or errors reported to the data team, or the number and types of dashboards in use across the organization can help you reach those milestones in the long term.
In other words, a clear roadmap with specified deliverables, timelines, and metrics that are shared among all the owners ensures that progress can be evaluated in achievable, measurable steps. You need to be able to periodically check the progress of your governance framework to ensure that it's still on track.
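As a rough illustration of tracking one such metric, the sketch below computes the share of data assets that have an assigned owner. The `DataAsset` class, the asset names, and the owner names are hypothetical and exist only for this example; in practice the inventory would likely come from a data catalog.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataAsset:
    name: str
    owner: Optional[str]  # None means no owner has been assigned yet


def ownership_coverage(assets: List[DataAsset]) -> float:
    """Return the percentage of assets that have an assigned owner."""
    if not assets:
        return 0.0
    owned = sum(1 for asset in assets if asset.owner)
    return 100.0 * owned / len(assets)


# Hypothetical inventory used only for illustration.
assets = [
    DataAsset("orders", "sales-analytics"),
    DataAsset("customers", "crm-team"),
    DataAsset("web_events", None),
    DataAsset("refunds", "finance"),
]

coverage = ownership_coverage(assets)
print(f"{coverage:.1f}% of data assets have an owner")
```

Recomputing a number like this every quarter gives stakeholders a simple, comparable signal of whether the framework is on track.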
This image shows a detailed roadmap for establishing a data governance program over a period of two years. Individual tasks can be defined for each business quarter and for different aspects of the framework, such as data insights, data quality, data standards, and data governance and management.
For example, improving data quality can be broken down into multiple milestones per business quarter. The goal for the first quarter may be hiring a data engineering team, while the next quarters may focus on establishing reference data repositories, data cleaning, and building data stores and data warehouses. This structured approach keeps cross-functional teams informed on the overall plan and ensures continued progress.
5 Establish Strong Communication Channels
Frequent and effective communication is the key to aligning stakeholders and collaborating across teams. Everyone should understand the desired goals and keep others informed on their progress in implementing them.
Additionally, your data stewards must be as transparent as possible to earn trust across the organization and emphasize the impact of investment in data governance to the executive leadership as well as to the downstream users of the framework. They can create a single channel for communication, such as a linked data catalog where users can search data assets and collaborate on them.
This kind of communication is pivotal both during the implementation phase and after the framework is established. A single channel for communication will help drive strong adoption rates, resolve queries, and allow you to share updates to the governance policies as data and compliance requirements evolve.
6 Contextualize Data
Data contextualization involves adding relevant information to data to make it actionable. Contextualization helps users interpret the data better and enables organizations to make smarter decisions.
This helps a data governance process work more efficiently as contextualized data has clearer meanings and allows decision makers to have enriched information regarding the actions they should take. Moreover, it can help improve how the organization handles data in its data governance environment.
7 Build a Long-Term Strategy for Data Governance
Achieving a strong data governance framework can be a moving target. You need to ensure that stakeholders know this is a long-term investment. Data governance is a continuous process that consists of many smaller projects and deliverables. Ramping up speed and complexity over time helps to scale your efforts. While the overall framework may take several years, smaller milestones can be set and achieved over shorter time frames, like a business quarter.
For instance, a useful set of milestones to accomplish in the first quarter of working on a data governance framework may include establishing data management policies and standards, hiring a data engineering team, and drafting a data management strategy together with all relevant stakeholders.
As long as they see incremental progress, stakeholders will learn to trust the process and be invested in the success of the project.
8 Expose the Data through Documentation
Knowing exactly what your data represents is a critical component of data governance. Users should have a single, centralized platform where they can find documentation related to their data. This documentation should be continuously updated, reviewed, and revised and should also be directly tied to the actual data assets. These actions will ensure that your users can trust and rely on your documentation, as it will always be up to date and accurate.
Strong data governance should expose the data through process-oriented documentation that is directly connected to the data.
9 Data Lineage and Usage
Knowing the source of data, where your data is flowing, and who is accessing it is important. With data governance, you have to build trust in your data, ensure the data is used properly in your organization, and troubleshoot issues when they arise.
Data lineage helps automatically identify sensitive information and propagate some data governance-related policies. Data lineage also informs reports, issue logs, and audit logs, which show that the data governance requirements are met.
For example, data lineage can prevent teams from relying on a dashboard that was supposed to be deprecated, or keep two business units from building the same metric from different underlying data.
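To make the idea concrete, here is a minimal sketch of a lineage lookup, assuming lineage is stored as a simple mapping from each asset to the assets built directly from it. The asset names and the `downstream` helper are hypothetical; real data catalogs expose lineage through their own interfaces.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets built directly from it.
lineage = {
    "raw_orders": ["orders_clean"],
    "orders_clean": ["revenue_metric", "orders_dashboard_v1"],
    "revenue_metric": ["finance_dashboard"],
    "orders_dashboard_v1": [],
    "finance_dashboard": [],
}


def downstream(graph, start):
    """Return every asset reachable from `start`, using breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Which dashboards and metrics are affected if orders_clean changes or is deprecated?
affected = downstream(lineage, "orders_clean")
print(sorted(affected))
```

A lookup like this is what lets a data steward notify the right teams before an upstream table is changed or retired.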
Successful Data Governance Frameworks
Several large global companies have successfully implemented data governance frameworks. The following are some examples.
PwC, a global professional services company, has created a data governance framework consisting of the following components:
ING, a Dutch multinational banking and financial services corporation, leveraged IBM Cloud Pak to improve data governance for its users in a hybrid cloud environment.
There are also several third-party companies that assist larger organizations with their data governance strategy and implementation, such as Collibra, Informatica, and Alation, and data catalogs that provide tools and insights required for implementing a data governance practice on your own, such as Select Star and Atlan.
Outcomes of Strong Data Governance
Implementing a strong data governance strategy will lead to outcomes such as improved data quality, decreased data management costs, and better data analytics, which, in turn, lead to better decision-making throughout the organization. The following list provides an overview of the outcomes of effective data governance:
For an organization, the time it takes to achieve these outcomes is closely related to the strength of its data governance implementation processes. Over time, these all contribute to one overarching outcome: organizational success.
Data governance is an essential requirement for modern organizations to drive greater adoption of data and empower business decision-making. Organizations can find it difficult to extract the full value of their data assets, especially as the amount of data keeps growing. Data governance frameworks lay down clear policies and guidelines for improving the quality of data and democratizing its usage across a business.
If you can navigate the challenges involved and follow the above best practices in creating and implementing your data governance framework, you can accelerate your organization's understanding and usage of data and deliver data-driven decision-making to your organization.
Traditional machine learning is based on training models on data sets that are stored in a centralized location like an on-premises server or cloud storage. For domains like healthcare, privacy and compliance issues complicate the collection, storage, and sharing of critical patient and medical data. This poses a considerable challenge for building machine learning models for healthcare.
Federated learning is a technique that enables collaborative machine learning without the need for centralized training data. A shared machine learning model is trained by keeping all the training data on a device, thereby ensuring higher levels of privacy and security compared to the traditional machine learning setup where data is stored in the cloud.
This technique is especially useful in domains with high security and privacy constraints like healthcare, finance, or governance. Users benefit from the power of personalized machine learning models without compromising their sensitive data.
This article describes federated learning and its various applications with a special focus on healthcare.
How Does Federated Learning Work?
This section discusses in detail how federated learning works for a hypothetical use case of a number of healthcare institutions working collaboratively to build a deep learning model to analyze MRI scans.
In a typical federated learning setup, there’s a centralized server, for instance, in the cloud, that interacts with multiple sources of training data, such as hospitals in this example. The centralized server houses a global deep learning model for the specific use case that is copied to each hospital to train on its own data set.
Each hospital in this setup trains its copy of the global deep learning model locally for a few iterations on its internal data set.
It then sends its model update to the cloud server using encrypted communication protocols, where the update is averaged with the updates from other hospitals to improve the shared global model. The updated parameters are then shared with the participating hospitals so that they can continue local training.
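The training round described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production setup: a two-parameter linear model stands in for the deep learning model, the three "hospitals" hold synthetic data, and the encrypted transport is omitted entirely. Weighting each hospital's model by its number of samples is an assumption in the style of federated averaging, and all function names and hyperparameters are hypothetical.

```python
import random


def local_update(weights, data, lr=0.1, steps=5):
    """Run a few gradient-descent steps on one hospital's private data."""
    w, b = weights
    for _ in range(steps):
        # Gradients of the mean squared error for the model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(client_weights, client_sizes):
    """Average client models, weighted by each client's number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


rng = random.Random(0)
true_w, true_b = 2.0, -1.0

# Three hypothetical hospitals, each with a private data set of a different size.
hospitals = []
for n in (50, 80, 120):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    hospitals.append([(x, true_w * x + true_b + rng.gauss(0.0, 0.1)) for x in xs])

global_weights = [0.0, 0.0]
for _ in range(20):  # communication rounds
    # Only model parameters leave each hospital; the raw data never does.
    updates = [local_update(global_weights, data) for data in hospitals]
    sizes = [len(data) for data in hospitals]
    global_weights = federated_average(updates, sizes)

print(global_weights)
```

After enough rounds, the global parameters should settle close to the values used to generate the synthetic data, even though the server only ever sees model parameters.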
In this fashion, the global model can learn the intricacies of the diverse data sets stored across various partner hospitals and become more robust and accurate. At the same time, the collaborating hospitals never have to send their confidential patient data outside their premises, which helps ensure that they don’t violate strict regulatory requirements like HIPAA. The data from each hospital is secured within its own infrastructure.
This unique federated learning setup is easily scalable and can accommodate new partner hospitals; it also remains unaffected if any of the existing partners decide to exit the arrangement.
Use Cases for Federated Learning in Healthcare
Federated learning has immense potential across many industries, including mobile applications, healthcare, and digital health. It has already been used successfully for healthcare applications, including health data management, remote health monitoring, medical imaging, and COVID-19 detection.
As an example of its use for mobile applications, Google used this technique to improve Smart Text Selection on Android mobile phones. In this use case, it enables users to select, copy, and use text quickly by predicting the desired word or sequence of words based on user input. Each time a user taps to select a piece of text and corrects the model’s suggestion, the global model receives precise feedback that’s used to improve the model.
Federated learning is also relevant for autonomous vehicles, where it can improve real-time decision-making and the collection of data about traffic and roads. Self-driving cars require continuous updates, and this information can be pooled effectively from many vehicles using federated learning.
Privacy and Security
With increased focus on data privacy laws from governments and regulatory bodies, protecting user data is of utmost importance. Many companies store customer data, including personally identifiable information such as names, addresses, mobile numbers, and email addresses.
Apart from these static data types, user interactions with companies, such as chats, emails, and phone calls, also carry sensitive details that need to be protected from hackers and malicious attacks.
Privacy-enhancing technologies like differential privacy, homomorphic encryption, and secure multi-party computation have advanced significantly and are used for data management, financial transactions, and healthcare services, as well as data transfer between multiple collaborative parties.
Many startups and large tech companies are investing heavily in privacy technologies like federated learning to ensure that customers have a pleasant user experience without their personal data being compromised.
In the healthcare industry, federated learning is a promising technology that allows, for example, hospitals to build more accurate models from electronic health records (EHRs) without sharing the records themselves. Data processing is decentralized and distributed among multiple end-points instead of being managed from a central server, which preserves privacy and helps hospitals comply with strict HIPAA standards.
Simply put, federated learning allows training of machine learning models without the need to collect raw data in a central location; instead, the data used by each end-point (in this example, hospitals) remains local. By combining the above with differential privacy, hospitals can even provide a quantifiable measure of data anonymization.
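One common way to combine federated learning with differential privacy is to clip each model update and add random noise before it leaves the hospital. The sketch below illustrates only that clip-and-noise step. The function name, `clip_norm`, and `noise_multiplier` are hypothetical, and converting a given noise level into a formal privacy guarantee requires a separate privacy accounting step that is not shown here.

```python
import math
import random


def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip an update to a maximum L2 norm, then add Gaussian noise."""
    rng = rng if rng is not None else random.Random()
    norm = math.sqrt(sum(v * v for v in update))
    # Scale the update down only if its norm exceeds the clipping threshold.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    # The noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * clip_norm
    return [v * scale + rng.gauss(0.0, sigma) for v in update]


update = [3.0, 4.0]  # L2 norm of 5, larger than clip_norm
private_update = privatize_update(
    update, clip_norm=1.0, noise_multiplier=0.5, rng=random.Random(0)
)
print(private_update)
```

Clipping bounds how much any single hospital's update can influence the global model, and the noise masks the remaining contribution; a larger `noise_multiplier` means stronger privacy but a less accurate aggregate.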
Federated Learning vs. Distributed Learning and Edge Computing
Federated learning is often confused with distributed learning. In the context of deep learning, distributed training is used to train large, deep neural networks across a number of GPUs or machines. However, distributed learning relies on centralized training data shared across multiple nodes to increase the speed of model training.
Federated learning, on the other hand, is based on decentralized data stored across a number of devices and produces a central, aggregate model. A fascinating example of the potential of this technology is using federated learning-based Person Movement Identification (PMI) through wearable devices for smart healthcare systems.
Edge computing is a related concept where the data and the model reside on the same individual device. Edge computing doesn’t train models that learn from data stored across multiple devices, as in the case of federated learning. Instead, a centrally trained model is deployed on an edge device, where it runs on data collected from that device. For example, edge computing is applied in the context of Amazon Alexa devices, where a wake word detection model is stored on the device to detect every utterance of “Alexa.”
AI and Healthcare
Federated machine learning has a strong appeal for healthcare applications. Patient and medical data is highly regulated and needs to adhere to strict security and privacy standards. By keeping data within each participating healthcare institution, organizations can ensure that confidential patient data doesn’t leave their ecosystem; they can also benefit from machine learning models trained on data across a number of healthcare institutions.
Large hospital networks can now work together and learn from their combined data, without moving it, to build AI models for a variety of medical use cases. With federated learning, smaller community and rural hospitals with fewer resources and lower budgets can also benefit and provide better health outcomes to more of the population.
This technique also helps to capture a greater variety of patient traits, including variations in age, gender, and ethnicity, which may vary significantly from one geographic region to another. Machine learning models based on such diverse data sets are likely to be less biased and to produce more accurate results. In turn, the expert feedback of trained medical professionals can help to further improve the accuracy of the various AI models.
Federated learning, therefore, has the potential to introduce massive innovations and discoveries in the healthcare industry and bring novel AI-driven applications to market and patients faster.
Federated learning enables secure, private, and collaborative machine learning where the training data doesn’t leave the user device or organizational infrastructure. It harnesses diverse data from various sources and produces an aggregate model that’s more accurate.
This technique has introduced significant improvements in information sharing and increased the efficacy of collaborative machine learning between hospitals. It overcomes the challenges of working with highly sensitive medical data while leveraging the power of state-of-the-art machine learning and deep learning.
Copyright © 2022, Sundeep Teki