Sundeep Teki

How to Generate Synthetic Data for Machine Learning Projects

16/6/2022


 
Published by Unbox.ai
Introduction
Machine learning models, especially deep neural networks, are trained using large amounts of data. However, for many machine learning use cases, real-world data sets do not exist or are prohibitively costly to buy and label. In such scenarios, synthetic data represents an appealing, less expensive, and scalable solution.

Additionally, several real-world machine learning problems suffer from class imbalance—that is, where the distribution of the categories of data is skewed, resulting in disproportionately fewer observations for one or more categories. Synthetic data can be used in such situations to balance out the underrepresented data and train models that generalize well in real-world settings.
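
As a concrete illustration, the rebalancing step can be performed with synthetic minority-class samples. The following is a minimal sketch using SMOTE from the imbalanced-learn package (assumed to be installed); the data, class sizes, and labels are made up purely for illustration.

```python
# Minimal sketch: rebalance a skewed data set with synthetic minority samples.
# Assumes the imbalanced-learn package is installed; the data below is random.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(950, 4)),            # majority class (label 0)
               rng.normal(loc=3.0, size=(50, 4))])   # minority class (label 1)
y = np.array([0] * 950 + [1] * 50)

# SMOTE interpolates between nearby minority samples to create new synthetic ones
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))       # [950  50] -> [950 950]
```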

Synthetic data is now increasingly used for various applications, such as computer vision, image recognition, speech recognition, and time-series data, among others. In this article, you will learn about synthetic data, its benefits, and how it is generated for different use cases.

What is synthetic data?
Synthetic data is a form of data augmentation commonly used to address overfitting in deep learning models. It is generated with algorithms or machine learning models so that it has statistical properties similar to those of real-world data sets. For data-hungry deep learning models, the limited availability of large training data sets is a massive bottleneck that synthetic data can often resolve.

Additionally, synthetic data can be used for myriad business problems where real-world data sets are missing or underrepresented. Several industries—like consumer tech, finance, healthcare, manufacturing, security, automotive, and robotics—are already benefiting from the use of synthetic data. It removes a key bottleneck in the machine learning lifecycle, the unavailability of data, and allows teams to continue developing and iterating on innovative data products.

For example, building products related to natural language processing (NLP), like search or language translation, is often problematic for low-resource languages. Synthetic data generation has been successfully used to generate parallel training data for training deep learning models for neural machine translation.

Generating synthetic data for machine learning
There are several standard approaches for generating synthetic data. These include the following:
  • Statistical approaches based on sampling from the source data distribution
  • Deep neural network–based methods such as variational autoencoders and generative adversarial networks
The choice of methods for synthetic data generation depends on the type of data to be generated, with statistical methods being more common for numerical data and deep learning methods being commonly used for unstructured data like images, text, audio, and video. In the following sections, you’ll learn more about the different types of synthetic data and then explore some techniques for generating it.

Types of synthetic data
Synthetic data can be classified into different types based on their usage and the data format. Generally, it falls into one of two categories:
  • Partially synthetic data, where only a specific set of the training data is generated artificially
  • Fully synthetic data, where the entire training data set consists of synthetic data
Partially synthetic data finds its application in use cases where sensitive data needs to be replaced in the original training data set. Fully synthetic data sets are used in domains like finance and healthcare, where privacy and compliance concerns restrict the use of original data.

Popular types of synthetic data, classified according to the data type, include the following:
  • Synthetic text
  • Synthetic media including images, audio, and video
  • Synthetic time-series data
  • Synthetic tabular data

Synthetic text finds its use in applications like language translation, content moderation, and product reviews. Synthetic images are used extensively for purposes like training self-driving cars, while synthetic audio and video data is used for applications including speech recognition, virtual assistants, and digital avatars. Synthetic time-series data is used in financial services to represent the temporal aspect of financial data, like stock prices. Finally, synthetic tabular data is used in domains like e-commerce and fraud detection.

Techniques for generating synthetic data
Generating synthetic data can be as simple as adding noise to data samples or as sophisticated as training state-of-the-art models like generative adversarial networks. In this section, you’ll review two chief methods for generating synthetic data for machine learning and deep learning applications.

Statistical methods
In statistics, data samples can be assumed to be generated from a probability distribution with certain characteristic statistical features like mean, variance, and skew. For instance, in the case of anomaly detection, one assumes that the nonanomalous samples belong to a certain statistical distribution while the anomalous or outlier samples do not.

Consider a hypothetical machine learning example of predicting the salaries of data scientists with a certain number of years of experience at top tech companies. In the absence of real-world salary data, a topic often considered taboo, synthetic salary data can be generated from a distribution defined by the few real-world salaries publicly reported on platforms like Glassdoor, LinkedIn, or Quora. Recruiters and hiring teams can use this data to benchmark their own salary levels and adjust the offers made to new hires.
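
As a rough sketch of this statistical approach, one could sample salaries from an assumed distribution whose parameters are informed by those public reports. The log-normal form and all numbers below are illustrative assumptions, not real benchmarks.

```python
# Minimal sketch: sample synthetic salaries from an assumed log-normal distribution.
# The median, growth rate, and spread are hypothetical, not real salary data.
import numpy as np

rng = np.random.default_rng(42)

def synthetic_salaries(years_of_experience, n_samples=1000):
    median = 120_000 * (1.08 ** years_of_experience)   # assumed 8% growth per year
    return rng.lognormal(mean=np.log(median), sigma=0.25, size=n_samples)

samples = synthetic_salaries(years_of_experience=5)
print(round(np.median(samples)), round(samples.std()))
```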

Deep learning-based methods
As the complexity of the data increases, statistical-sampling-based methods are not a good choice for synthetic data generation. Neural networks, especially deep neural networks, are capable of making better approximations of complex, nonlinear data like faces or speech. A neural network essentially represents a transformation from a set of inputs to a complex output, and this transformation can be applied on synthetic inputs to generate synthetic outputs. Two popular neural network architectures for generating synthetic data are variational autoencoders and generative adversarial networks, which will be discussed in detail in the next sections.

Variational autoencoders
Variational autoencoders are generative models that belong to the autoencoder class of unsupervised models. They learn the underlying distribution of a data set and subsequently generate new data based on the learned representation.

VAEs consist of two neural networks: an encoder that learns an efficient latent representation of the source data distribution and a decoder that aims to transform this latent representation back into the original space. The advantage of using VAEs is that the quality of the generated samples can be quantified objectively using the reconstruction error between the original distribution and the output of the decoder. VAEs can be trained efficiently through an objective function that minimizes the reconstruction error.
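
To make the encoder and decoder concrete, here is a minimal VAE sketch in PyTorch. The input dimension, layer sizes, and latent size are arbitrary placeholders, and the sketch is illustrative rather than a production implementation.

```python
# Minimal VAE sketch in PyTorch; dimensions and layer sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

INPUT_DIM, LATENT_DIM = 784, 16

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder maps the input to the mean and log-variance of the latent distribution
        self.enc = nn.Linear(INPUT_DIM, 128)
        self.mu = nn.Linear(128, LATENT_DIM)
        self.logvar = nn.Linear(128, LATENT_DIM)
        # Decoder maps a latent sample back to the input space
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, INPUT_DIM), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * epsilon
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, synthetic samples come from decoding draws from the prior:
# synthetic = model.dec(torch.randn(100, LATENT_DIM))
```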

VAEs represent a strong baseline approach for generating synthetic data. However, VAEs suffer from a few disadvantages. They are not able to learn efficient representations of heterogeneous data and are not straightforward to train and optimize. These problems can be overcome using generative adversarial networks.

Generative adversarial networks
GANs are a relatively new class of generative deep learning models. Like VAEs, GANs are based on simultaneously training two neural networks but via an adversarial process.
A generative model, G, is used to learn the latent representation of the original data set and generate samples from it. The discriminator model, D, is a supervised model that learns to distinguish whether a random sample came from the original data set or was generated by G. The objective of the generator G is to maximize the probability of the discriminator D making a classification error. This adversarial training process, similar to a zero-sum game, continues until the discriminator can no longer distinguish the original data samples from the synthetic samples produced by the generator.
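
A minimal sketch of this adversarial training loop in PyTorch is shown below; the network sizes, noise dimension, and data dimension are placeholders, and stable GAN training in practice requires considerably more care.

```python
# Minimal GAN training-loop sketch in PyTorch; sizes and data are placeholders.
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 32, 784

G = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(),
                  nn.Linear(128, DATA_DIM), nn.Tanh())        # generator
D = nn.Sequential(nn.Linear(DATA_DIM, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())            # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                        # real: (batch, DATA_DIM) tensor
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1. Train the discriminator to separate real from generated samples
    fake = G(torch.randn(batch, NOISE_DIM)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2. Train the generator to fool the discriminator
    fake = G(torch.randn(batch, NOISE_DIM))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```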

GANs originally became popular for synthesizing images for a variety of computer vision problems, including image recognition, text-to-image and image-to-image translation, super-resolution, and so on. Recently, GANs have proven to be highly versatile and useful for generating synthetic text as well as private or sensitive data like patient medical records.


Synthetic data generation with Openlayer
Openlayer is a machine learning debugging workspace that helps individual data scientists and enterprise organizations alike to track and version models, uncover errors, and generate synthetic data. Its synthetic data generation is primarily used to augment underrepresented portions or classes in the original training data set. Synthetic data is generated from existing data samples, and data-augmentation tests are conducted to verify whether the model’s predictions on the synthetic data match those on the original data.

Conclusion
In this article, you learned about synthetic data for machine learning and deep learning applications. When real-world data is scarce, or when privacy concerns or the high costs of data acquisition and labeling stand in the way, synthetic data presents a versatile and scalable solution. Synthetic data has found mainstream acceptance in a number of domains and for a variety of data types, including text, audio, video, time series, and tabular data.

You explored these different types of synthetic data and the various methods for generation. These include statistical approaches as well as neural network–based methods like variational autoencoders and generative adversarial networks. Then you walked through a brief tutorial for generating synthetic data using deep learning methods. Finally, you saw the utility of third-party synthetic data generation products such as Openlayer, which can help companies rapidly scale their synthetic data requirements and accelerate model development and deployment.

Related Blogs
  • Best Practices for Improving Machine Learning Models
  • Developing AI/ML Projects for Business - Best Practices
  • Knowledge Distillation: Principles, Algorithms & Applications 

Using Bash to Read Files

13/6/2022


 
Published by Earthly.dev
Preview:
Bash (Bourne Again Shell) scripts give you the ability to turn a series of manual commands into an easily runnable and repeatable script. This can be especially useful when working with files.

For programmers, Bash enables you to efficiently search for particular keywords or phrases by reading each line separately. Bash can also be used to read files for a variety of purposes, like shell scripting, searching, text processing, build processes, logging data, and automating administrative tasks. When you’re done with this article, you’ll be able to use Bash to read files line by line, use custom delimiters, assign variables, and more.


 👉 Here is the full article

Choosing a Data Governance Framework for your Organization

7/6/2022


 
Published by Domino Data Lab
Data governance refers to the process of managing enterprise data with the aim of making data more accessible, reliable, usable, secure, and compliant across an organization. It is a critical feature of organizational data management and promotes better data quality and data democratization.

A well-planned data-governance framework is fundamental for any data-driven organization that aims to harness the business value of its data and downstream capabilities that drive robust decision-making. It covers and details best practices for data processes, roles, policies, standards, and metrics.

Naturally, data-governance frameworks vary from one organization to the next. Here are a few examples of strong data-governance frameworks recommended at companies like PwC, HubSpot, and ING.

However, there are a set of commonly accepted best practices, as listed below:
  • Start small but with a broad perspective and long-term view of data governance in mind.
  • Build a strong business case, including organizational benefits like higher revenue, better cross-functional efficiency, and improved customer experience.
  • Agree on a set of metrics to track the adoption, implementation, and value over time.
  • Communicate early and often, and create organization-wide awareness and alignment.
  • Get buy-in from leadership to invest in data governance for the long term, with the understanding that its benefits may not be evident in the short term.
  • Identify relevant data-focused roles and responsibilities for the creation, implementation, and execution of the data-governance framework.
For more examples of recommended best practices, check out these data-governance tips from Tableau and Snowflake.

In this article, you’ll learn more about data-governance frameworks and their essential components, exploring use cases and best practices for choosing a data-governance framework for your organization.

The Importance of Data Governance
Without effective data governance, an organization's data science teams fail to extract the full value of their data. Weak data governance leads to poor data practices and decision-making, causing organizations to lose their competitive edge in an increasingly data-driven corporate environment.

Crucially, poor governance can also impair compliance with regulatory standards like GDPR, HIPAA, SOX, CCPA, and the like, which can have massive consequences for businesses. Hefty fines for violating such laws can put a dent in a company’s bottom line. For instance, in 2021, Amazon received a whopping GDPR fine of $877 million.

Strong data governance ensures that data is readily available, reliable, consistent, and of high quality, empowering businesses to create value from it. It encompasses processes, people, organizational bodies, and a set of policies that all work together to determine best practices for managing data.

The Benefits of Data Governance
The benefits of data governance are manifold, some of which include the following:
  • Improved data quality and more efficient and reliable downstream analytics and data science
  • More robust data-driven business decision-making
  • Greater organizational awareness about accessibility and usage of data across departments
  • Better compliance with local, national, and international regulatory regimes
  • Increased trust in data and promotion of a data-driven organizational culture

Investing in a robust data-governance framework yields significant returns and helps accelerate an organization’s digital- and data-transformation journey. 

How to Choose a Data Governance Framework
A modern data-governance framework rests on several foundations. The primary focus areas include data quality, accessibility, security, and compliance. However, the success of a data-governance framework cannot be realized until people, processes, and technology are combined effectively.

Designing an effective data-governance framework also includes creating a clear set of organizational roles, responsibilities, and data stakeholders for enforcing and managing the governance policy. In this section, you’ll read about the core aspects of building an optimal data-governance framework.

Data Accuracy
Data accuracy is a central pillar of data quality and refers to error-free and reliable information. Inaccurate data is often the result of poor data entry practices, poor regulation of data accessibility, and poor data quality standards. It is critical to improve data accuracy standards, as clean, consolidated, and accurate data can lead to a twofold improvement in return on investment.

There are many ways to test and improve existing data accuracy standards. A real-time data quality audit can identify issues like duplicates, missing values, incomplete information, data being stored in multiple locations, and so on. Some of these common issues can be fixed through tools that automate data quality and accuracy checks, while other issues require manual intervention by data teams.
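
As an example of the automated checks mentioned above, a lightweight audit can be scripted with pandas; the file name and columns below are hypothetical.

```python
# Minimal data-quality audit sketch with pandas; file and columns are hypothetical.
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    report = pd.DataFrame({
        "missing_pct": df.isna().mean().round(3),   # share of missing values per column
        "n_unique": df.nunique(),                   # cardinality per column
    })
    report["constant_column"] = report["n_unique"] <= 1   # columns carrying no signal
    return report

df = pd.read_csv("customers.csv")                   # hypothetical source file
print("duplicate rows:", df.duplicated().sum())
print(data_quality_report(df))
```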

Data Relevance
Data relevance refers to whether the data is pertinent and applicable for deriving business value (i.e., whether it is fit for the purpose it was originally intended for). For instance, information about customers’ sexual orientation and military records is often requested on forms alongside relevant data fields like customer name, email, and other contact details. However, in many cases, data fields like sexual orientation and military records have no material consequence for the business.

In terms of data governance, every data set needs to be rigorously assessed for its relevance and potential value to the organization. As organizations accumulate more and more data, the cost of data storage, maintenance, privacy, and security increases. Therefore, not having robust data-relevance checks can have a significant financial impact as well.

Scope of Policy
In the context of modern data-driven organizations, data governance has a broad scope and covers everything from metadata and data storage to accessibility, quality, security, ownership, and organizational roles and policies for various processes, people, and technology.

Formulating a comprehensive data-governance policy that covers such a wide scope requires concerted efforts from a variety of data stakeholders as well as the executive leadership. As this requires significant investment in talent and technology—where ROI may not be evident in the short term—buy-in and support from leadership is critical.

Data Compliance Standards
Adherence to data compliance standards such as GDPR, HIPAA, and SOX is another crucial element of data governance. As organizations store a lot of confidential user and business data—including personally identifiable information like names, contact details and addresses, passwords, credit card details, and so on—failure to adequately secure this data from adversarial attacks and data breaches, or even mistaken internal access, has massive business consequences.

Lack of compliance with data privacy and security regulations can result in tremendous fines. Perhaps even more significant is the reputational damage to an organization that fails to protect its customer and enterprise data, leading to greater expenses and the possibility of significant loss of revenue and future business.

Data Access Policy
A data access policy sets out which employees can access an organization’s data, so that those whose daily work relies on data face minimal friction and interference. Under a data-access policy, only a designated set of employees is authorized to access and use data for specific use cases.

This policy is applicable to all the data assets across business units and encompasses data of all types and formats. Clear guidelines on who has access and who does not also help improve data security, promote better adherence to data compliance regulations, and reduce wastage of organizational time and resources.

Organizational Roles for Managing a Data Governance Policy
Data governance is the mandate of several stakeholders, such as the following:
  • Chief Data Officer: This is the senior executive who owns the implementation and execution for the data-governance framework. They take care of approvals, funding, recruitment, adoption of technology, and crucially, creating greater awareness about data governance across the organization.
  • Steering Committee: This consists of business leaders and data stakeholders and is responsible for creating data-governance policies and data standards. It has the final say on policy approvals and also helps resolve any disputes or conflicts among the data or business teams.
  • Data Stewards: They are responsible for the day-to-day management of data. They typically have strong domain expertise and oversee the management of data sets, playing a hands-on role. They also ensure that the policies advocated by the steering committee are actually implemented and complied with across the organization.

Stages of Implementing Policy
Implementing a well-planned and comprehensive data-governance framework takes time and resources. It involves multiple steps, typically including the following:
  • Identifying roles and responsibilities
  • Defining data domains
  • Setting up data workflows
  • Establishing data controls
  • Identifying data sources
  • Defining policies and standards

It is important to start small and slowly build traction, gaining the confidence of the leadership as well as the various data stakeholders with every step in the process. Involving a variety of stakeholders, as discussed above, and testing the governance framework in a small team helps to identify key practices and standards that can then be scaled to the whole organization.

Conclusion
In this article, you learned about data-governance frameworks for organizations. Data governance is a fundamental requirement for data-driven companies that helps them manage in-house data assets effectively and create consistent business value. The importance of data governance for modern organizations is great, and the negative consequences of poor governance practices are significant. Choosing a data-governance framework is not straightforward, but exploring the core pillars of strong data governance, including data accuracy, relevance, compliance, and security, is an important first step in your organization's digital transformation journey.

In the context of data science and machine learning, a robust data-governance framework is essential to maintain the quality, relevance, timeliness, completeness, accuracy, and validity of the data used to train machine learning models.

Related Blogs
  • What are Best Practices for Data Governance? 
  • Why Data Democratization is important to your business?
  • How to ensure Data Quality through Governance?
  • Why is a Strong Data Culture Important to your Business? 
  • Understanding and Measuring Data Quality
  • How Big Tech Companies Define Business Metrics?

