Published by Unbox.ai
Machine learning models, especially deep neural networks, are trained using large amounts of data. However, for many machine learning use cases, real-world data sets do not exist or are prohibitively costly to buy and label. In such scenarios, synthetic data represents an appealing, less expensive, and scalable solution.
Additionally, several real-world machine learning problems suffer from class imbalance—that is, where the distribution of the categories of data is skewed, resulting in disproportionately fewer observations for one or more categories. Synthetic data can be used in such situations to balance out the underrepresented data and train models that generalize well in real-world settings.
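The rebalancing idea described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production technique: it oversamples the minority class by jittering real samples with Gaussian noise, and the function and parameter names (`oversample_minority`, `noise_scale`) are illustrative.

```python
import random

def oversample_minority(samples, labels, target_label, noise_scale=0.05):
    """Balance a data set by generating jittered copies of the minority class."""
    majority_count = max(labels.count(l) for l in set(labels))
    minority = [s for s, l in zip(samples, labels) if l == target_label]
    synthetic = []
    while labels.count(target_label) + len(synthetic) < majority_count:
        base = random.choice(minority)
        # Add small Gaussian noise to each feature of a real minority sample
        synthetic.append([x + random.gauss(0, noise_scale) for x in base])
    return synthetic

random.seed(0)
samples = [[1.0, 2.0]] * 8 + [[5.0, 5.0]] * 2  # 8 majority, 2 minority samples
labels = ["a"] * 8 + ["b"] * 2
new_points = oversample_minority(samples, labels, "b")
print(len(new_points))  # 6 synthetic minority samples restore the balance
```

In practice, more sophisticated schemes (such as SMOTE, which interpolates between minority neighbors) are used, but the principle is the same: augment the underrepresented class until the distribution of labels is balanced.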
Synthetic data is now increasingly used for various applications, such as computer vision, image recognition, speech recognition, and time-series data, among others. In this article, you will learn about synthetic data, its benefits, and how it is generated for different use cases.
What is synthetic data?
Synthetic data is a form of data augmentation that is commonly used to address overfitting in deep learning models. It's generated with algorithms as well as machine learning models so that it has statistical properties similar to those of real-world data sets. For data-hungry deep learning models, the availability of large training data sets is a massive bottleneck that can often be solved with synthetic data.
Additionally, synthetic data can be used for myriad business problems where real-world data sets are missing or underrepresented. Several industries—like consumer tech, finance, healthcare, manufacturing, security, automotive, and robotics—are already benefiting from the use of synthetic data. It helps avoid the key bottleneck in the machine learning lifecycle of the unavailability of data and allows teams to continue developing and iterating on innovative data products.
For example, building products related to natural language processing (NLP), like search or language translation, is often problematic for low-resource languages. Synthetic data generation has been successfully used to generate parallel training data for training deep learning models for neural machine translation.
Generating synthetic data for machine learning
There are several standard approaches for generating synthetic data, ranging from simple statistical sampling to deep generative models.
Types of synthetic data
Synthetic data can be classified into different types based on its usage and data format. Generally, it falls into one of two broad categories, depending on whether the data set is fully synthetic or only partially synthetic (with sensitive fields replaced by generated values).
Popular types of synthetic data, classified according to the data format, include text, images, audio, video, time series, and tabular data.
Synthetic text finds its use in applications like language translation, content moderation, and product reviews. Synthetic images are used extensively for purposes like training self-driving cars, while synthetic audio and video data is used for applications including speech recognition, virtual assistants, and digital avatars. Synthetic time-series data is used in financial services to represent the temporal aspect of financial data, like stock prices. Finally, synthetic tabular data is used in domains like e-commerce and fraud detection.
Techniques for generating synthetic data
Generating synthetic data can be very simple, such as adding noise to data samples, and can also be highly sophisticated, requiring the use of state-of-the-art models like generative adversarial networks. In this section, you’ll review two chief methods for generating synthetic data for machine learning and deep learning applications.
Statistical distribution-based methods
In statistics, data samples can be assumed to be generated from a probability distribution with certain characteristic statistical features like mean, variance, skew, and so on. For instance, in the case of anomaly detection, one assumes that the nonanomalous samples belong to a certain statistical distribution while the anomalous or outlier samples do not correspond to this data distribution.
Consider a hypothetical machine learning example: predicting the salaries of data scientists with a given number of years of experience at top tech companies. In the absence of real-world salary data, a topic often considered taboo, synthetic salary data can be generated from a distribution defined by the few public salary reports on platforms like Glassdoor, LinkedIn, or Quora. Recruiters and hiring teams can then use this data to benchmark their own salary levels and adjust the offers they make to new hires.
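The statistical approach above can be sketched with the standard library alone. The mean and standard deviation below are illustrative assumptions standing in for figures estimated from public reports, not real benchmarks.

```python
import random

# Suppose a handful of public reports suggest salaries are roughly normally
# distributed around $150,000 with a standard deviation of $25,000.
# (These figures are illustrative assumptions, not real benchmarks.)
MEAN, STD = 150_000, 25_000

random.seed(42)  # fixed seed so the synthetic data set is reproducible
synthetic_salaries = [round(random.gauss(MEAN, STD)) for _ in range(1000)]

sample_mean = sum(synthetic_salaries) / len(synthetic_salaries)
print(f"Generated {len(synthetic_salaries)} salaries, mean = {sample_mean:,.0f}")
```

With enough samples, the synthetic data set's statistics converge on those of the assumed distribution, which is exactly the property that makes it a usable stand-in for missing real-world data.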
Deep learning-based methods
As the complexity of the data increases, statistical-sampling-based methods are no longer a good choice for synthetic data generation. Neural networks, especially deep neural networks, are capable of making better approximations of complex, nonlinear data like faces or speech. A neural network essentially represents a transformation from a set of inputs to a complex output, and this transformation can be applied to synthetic inputs to generate synthetic outputs. Two popular neural network architectures for generating synthetic data are variational autoencoders and generative adversarial networks, which will be discussed in detail in the next sections.
Variational autoencoders
Variational autoencoders (VAEs) are generative models that belong to the autoencoder class of unsupervised models. They learn the underlying distribution of a data set and subsequently generate new data based on the learned representation.
VAEs consist of two neural networks: an encoder that learns an efficient latent representation of the source data distribution and a decoder that aims to transform this latent representation back into the original space. The advantage of using VAEs is that the quality of the generated samples can be quantified objectively using the reconstruction error between the original distribution and the output of the decoder. VAEs can be trained efficiently through an objective function that minimizes the reconstruction error.
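The objective function the text alludes to is commonly written as the evidence lower bound (ELBO), which the VAE maximizes during training:

```latex
\mathcal{L}(\theta, \phi; x) =
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction term}}
  - \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{regularization term}}
```

Here the encoder defines the approximate posterior q over latent codes z, the decoder defines the likelihood p of reconstructing x from z, and the KL-divergence term keeps the learned latent distribution close to a simple prior (typically a standard Gaussian).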
VAEs represent a strong baseline approach for generating synthetic data. However, VAEs suffer from a few disadvantages. They are not able to learn efficient representations of heterogeneous data and are not straightforward to train and optimize. These problems can be overcome using generative adversarial networks.
Generative adversarial networks
GANs are a relatively new class of generative deep learning models. Like VAEs, GANs are based on simultaneously training two neural networks but via an adversarial process.
A generative model, G, is used to learn the latent representation of the original data set and generate samples. The discriminator model, D, is a supervised model that learns to distinguish whether a random sample came from the original data set or was generated by G. The objective of the generator G is to maximize the probability of the discriminator D making a classification error. This adversarial training process, similar to a zero-sum game, continues until the discriminator can no longer distinguish between the original and synthetic data samples from the generator.
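The adversarial game described above is commonly formalized, following Goodfellow et al., as a minimax objective over the two networks:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator D maximizes this value by assigning high probability to real samples x and low probability to generated samples G(z), while the generator G minimizes it by producing samples the discriminator misclassifies.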
GANs originally became popular for synthesizing images for a variety of computer vision problems, including image recognition, text-to-image and image-to-image translation, super-resolution, and so on. Recently, GANs have proven to be highly versatile and useful for generating synthetic text as well as private or sensitive data like patient medical records.
Synthetic data generation with Openlayer
Openlayer is a machine learning debugging workspace that helps individual data scientists and enterprise organizations alike to track and version models, uncover errors, and generate synthetic data. It is primarily used to augment underrepresented portions or classes in the original training data set. Synthetic data is generated from existing data samples, and data-augmentation tests are conducted to verify whether the model's predictions on the synthetic data are consistent with its predictions on the original data.
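The idea behind such a data-augmentation test can be sketched generically. Note this is an illustrative sketch of the concept, not Openlayer's actual API; the function name `augmentation_consistency` and the toy model are invented for the example.

```python
def augmentation_consistency(model_predict, originals, synthetics):
    """Fraction of synthetic samples whose prediction matches the original's.

    A generic sketch of a data-augmentation test, not Openlayer's API.
    """
    matches = sum(
        model_predict(o) == model_predict(s)
        for o, s in zip(originals, synthetics)
    )
    return matches / len(originals)

# Toy model: classify a sample by the sign of its first feature
model = lambda x: "pos" if x[0] >= 0 else "neg"
originals = [[0.9], [-1.2], [2.5]]
synthetics = [[1.0], [-1.0], [2.4]]  # lightly perturbed copies
print(augmentation_consistency(model, originals, synthetics))  # 1.0
```

A score near 1.0 suggests the model is robust to the perturbations used to create the synthetic samples; a low score flags predictions that flip under small, label-preserving changes.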
In this article, you learned about synthetic data for machine learning and deep learning applications. In the absence of real-world data, as well as other pertinent issues like privacy concerns or the high costs of data acquisition and labeling, synthetic data presents a versatile and scalable solution. Synthetic data has found mainstream acceptance in a number of domains and for a variety of data types, including text, audio, video, time series, and tabular data.
You explored these different types of synthetic data and the various methods for generation. These include statistical approaches as well as neural network–based methods like variational autoencoders and generative adversarial networks. Then you walked through a brief tutorial for generating synthetic data using deep learning methods. Finally, you saw the utility of third-party synthetic data generation products such as Openlayer, which can help companies rapidly scale their synthetic data requirements and accelerate model development and deployment.
Published by Earthly.dev
Bash (Bourne Again Shell) scripts give you the ability to turn a series of manual commands into an easily runnable and repeatable script. This can be especially useful when working with files.
For programmers, Bash enables you to efficiently search for particular keywords or phrases by reading each line separately. Bash can also be used to read files for a variety of purposes, like shell scripting, searching, text processing, build processes, logging data, and automating administrative tasks. When you're done with this article, you'll be able to use Bash to read files line by line, use custom delimiters, assign variables, and more.
👉 Here is the full article
Published by Domino Data Lab
Data governance refers to the process of managing enterprise data with the aim of making data more accessible, reliable, usable, secure, and compliant across an organization. It is a critical feature of organizational data management and promotes better data quality and data democratization.
A well-planned data-governance framework is fundamental for any data-driven organization that aims to harness the business value of its data and downstream capabilities that drive robust decision-making. It covers and details best practices for data processes, roles, policies, standards, and metrics.
Naturally, data-governance frameworks vary from one organization to the next. There are a number of published examples of strong data-governance frameworks recommended by companies like PwC, HubSpot, and ING.
However, most frameworks share a set of commonly accepted best practices.
In this article, you’ll learn more about data-governance frameworks and their essential components, exploring use cases and best practices for choosing a data-governance framework for your organization.
The Importance of Data Governance
Without effective data governance, an organization's data science teams cannot extract the full value of its data. Weak data governance leads to poor data practices and decision-making, causing organizations to lose their competitive edge in an increasingly data-driven corporate environment.
Crucially, poor governance can also impair compliance with regulatory standards like GDPR, HIPAA, SOX, CCPA, and the like, which can have massive consequences for businesses. Hefty fines for violating such laws can cause a dent in a company’s bottom line. For instance, in 2021, Amazon received a whopping GDPR fine of $877 million.
Strong data governance ensures that data is readily available, reliable, consistent, and of high quality to empower businesses to create value from the data. It encompasses processes, people, organizational bodies, and a set of policies that all work together to determine best practices of managing data.
The Benefits of Data Governance
The benefits of data governance are manifold, including better data quality, easier and safer data access, stronger regulatory compliance, and more reliable decision-making.
Investing in a robust data-governance framework yields significant returns and helps accelerate an organization’s digital- and data-transformation journey.
How to Choose a Data Governance Framework
There are several foundations of a modern data-governance framework. The primary focus areas include data quality, accessibility, security, and compliance. However, the success of a data-governance framework cannot be realized until people, processes, and technology are combined effectively.
Designing an effective data-governance framework also includes creating a clear set of organizational roles, responsibilities, and data stakeholders for enforcing and managing the governance policy. In this section, you’ll read about the core aspects of building an optimal data-governance framework.
Data accuracy is a central pillar of data quality and refers to error-free and reliable information. Inaccurate data is often the result of poor data-entry practices, poor regulation of data accessibility, and poor data quality standards. It is critical to improve data accuracy standards, as clean, consolidated, and accurate data can lead to a twofold improvement in return on investment.
There are many ways to test and improve existing data accuracy standards. A real-time data quality audit can identify issues like duplicates, missing values, incomplete information, data being stored in multiple locations, and so on. Some of these common issues can be fixed through tools that automate data quality and accuracy checks, while other issues require manual intervention by data teams.
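The kinds of automated checks described above can be sketched in a few lines. This is a minimal illustration of a data quality audit; the function name `audit` and the record layout are invented for the example, and real tooling covers far more cases.

```python
def audit(records, required_fields):
    """Flag duplicate and incomplete records in a list of record dicts."""
    seen, duplicates, incomplete = set(), [], []
    for i, rec in enumerate(records):
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates.append(i)  # exact duplicate of an earlier record
        seen.add(key)
        if any(rec.get(f) in (None, "") for f in required_fields):
            incomplete.append(i)  # missing or empty required field
    return {"duplicates": duplicates, "incomplete": incomplete}

records = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # duplicate entry
    {"name": "Grace", "email": ""},               # missing email
]
report = audit(records, required_fields=["name", "email"])
print(report)  # {'duplicates': [1], 'incomplete': [2]}
```

Checks like these can run automatically on every ingest, leaving only the flagged records for manual review by data teams.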
Data relevance refers to whether the data is pertinent and applicable for deriving business value (i.e., whether it is fit for the purpose it was originally intended for). For instance, fields like customers' sexual orientation or military records are often requested on forms alongside relevant fields like customer name, email, and other contact details. However, in many cases, such fields have no material consequence for the business.
In terms of data governance, every data set needs to be rigorously assessed for its relevance and potential value to the organization. As organizations accumulate more and more data, the cost of data storage, maintenance, privacy, and security increases. Therefore, not having robust data-relevance checks can have a significant financial impact as well.
Scope of Policy
In the context of modern data-driven organizations, data governance has a broad scope and covers everything from metadata and data storage to accessibility, quality, security, ownership, and organizational roles and policies for various processes, people, and technology.
Formulating a comprehensive data-governance policy that covers such a wide scope requires concerted efforts from a variety of data stakeholders as well as the executive leadership. As this requires significant investment in talent and technology—where ROI may not be evident in the short term—buy-in and support from leadership is critical.
Data Compliance Standards
Adherence to data compliance standards like the GDPR, HIPAA, SOX, and the like is another crucial element of data governance. As organizations store a lot of confidential user and business data—including personally identifiable information like names, contact details and addresses, passwords, credit card details, and so on—failure to adequately secure this data from adversarial attacks and data breaches, or even mistaken internal access, has massive business consequences.
Lack of compliance with data privacy and security regulations can result in tremendous fines. Perhaps even more significant is the reputational damage to an organization that fails to protect its customer and enterprise data, leading to greater expenses and the possibility of significant loss of revenue and future business.
Data Access Policy
A data access policy sets out which employees can access which of an organization's data assets, so that those whose daily work relies on data face minimal friction while unauthorized use is prevented. Under a data-access policy, only a certain set of employees are authorized to access and use data for specific use cases.
This policy is applicable to all the data assets across business units and encompasses data of all types and formats. Clear guidelines on who has access and who does not also help improve data security, promote better adherence to data compliance regulations, and reduce wastage of organizational time and resources.
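At its core, such a policy is a mapping from roles to the data assets they may use. The sketch below is a deliberately simplified, hypothetical illustration; real access policies involve identity providers, fine-grained permissions, and audit logging.

```python
# Hypothetical role-to-data-set mapping; real policies are far richer.
ACCESS_POLICY = {
    "analyst": {"sales", "marketing"},
    "data_engineer": {"sales", "marketing", "logs"},
    "intern": set(),
}

def can_access(role, dataset):
    """Return True if the role is authorized for the data set."""
    return dataset in ACCESS_POLICY.get(role, set())

print(can_access("analyst", "sales"))  # True
print(can_access("intern", "logs"))    # False
```

Centralizing the mapping in one place, rather than scattering ad hoc permission checks through the codebase, is what makes the policy auditable and enforceable.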
Organizational Roles for Managing a Data Governance Policy
Data governance is the mandate of several stakeholders across the organization, from executive sponsors and data owners to the data stewards and engineers who handle data day to day.
Stages of Implementing Policy
Implementing a well-planned and comprehensive data-governance framework takes time and resources, and it typically involves multiple steps.
It is important to start small and slowly build traction, gaining the confidence of the leadership as well as the various data stakeholders with every step in the process. Involving a variety of stakeholders as discussed above and testing the governance framework in a small-sized team helps to identify key practices and standards that can then be scaled to the level of the whole organization.
In this article, you learned about data-governance frameworks for organizations. Data governance is a fundamental requirement for data-driven companies that helps them manage in-house data assets effectively and create consistent business value. The importance of data governance for modern organizations is great, and the negative consequences of poor governance practices are significant. Choosing a data-governance framework is not straightforward, but exploring the core pillars of strong data governance, including data accuracy, relevance, compliance, and security, is an important first step toward modernizing your organization's digital transformation journey.
In the context of data science and machine learning, a robust data-governance framework is essential to maintain the quality, relevance, timeliness, completeness, accuracy, and validity of the data used to train machine learning models.
Copyright © 2022, Sundeep Teki
All rights reserved. No part of these articles may be reproduced, distributed, or transmitted in any form or by any means, including electronic or mechanical methods, without the prior written permission of the author.
This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the blog owner and do not represent those of people, institutions or organizations that the owner may or may not be associated with in professional or personal capacity, unless explicitly stated.