Data is the cornerstone of businesses from large enterprises to small D2C brands, and huge amounts of it can be collected from websites, mobile apps, chat messages, call centers, business transactions, surveys, and social media platforms, among other channels. All this data represents a gold mine of information that can offer customer insights and lead to new ideas for features or products.
However, making sense of the data is easier said than done. The information originates from various channels and in multiple formats. It can be logged erroneously and contain other errors, including missing values. Because it comes from multiple domains, it can include unstructured data like text, images, audio, and video.
That is why data preparation is essential. This involves cleaning, curating, transforming, and storing data sets for downstream applications including data analytics and data visualization, as well as predictive intelligence based on machine learning and deep learning models. Data can only provide value once it has been processed from its raw form, and effective data preparation can maximize that value.
This article will explain the process of data preparation, especially in terms of data labeling, and will provide a checklist for data engineers to follow.
What Is Data Preparation?
Data preparation is not an entirely new process in technology companies. Data-driven operations previously focused on statistical analysis of business data from structured tables. The deep learning model has grown over the past decade along with the global penetration of mobile phones, widely available internet access, and cheaper cloud storage facilities. Today an estimated 2.5 quintillion bytes of data are being generated daily.
Every user interaction with online companies is recorded, from someone clicking an ad or adding a product to a shopping cart to sharing a photo on a social media app. User-generated data is generally unstructured data: images, text, audio, or video. Such data can be used to train sophisticated deep learning models to predict what users want to type in a text, which branded products are featured in an image, and what kind of customer service will be provided in a phone conversation.
For deep learning models to make sense of this data, all data samples need to be labeled. Data labeling tells the machine learning models what knowledge they need to acquire via supervised learning to power smart applications. This makes labeling critical in preparing data sets for training machine learning models.
However, data labeling can also represent the chief source of errors, affecting potential improvement in model performance. Machine learning models can only be as accurate as the labeled data, which represents the models’ entire knowledge for the particular use case.
For example, the source image data set in a face recognition program requires a label for every face shown in every image. During the labeling process for this data set, every image is reviewed by human subject matter experts, crowdsourced labelers on platforms like Amazon Mechanical Turk, or algorithms.
Labeling helps clean and prepare the data set by removing noisy or unusable data. In this case, images that don’t contain any faces, or that show unreadable faces due to poor lighting or angles, should be removed because they won’t be helpful in training a face recognition model. This step also ensures the inclusion of images that are most helpful for the desired use case.
Once the data set is reviewed and annotated, it can be used for all subsequent face recognition applications instead of going back to the raw data set. This saves time and effort for data engineers, as well as data scientists who might build novel models using the same data set.
Additionally, multiple labels and metadata can be applied to each image during the labeling process so that they’re available for new use cases. A tag that identifies the face as that of a man, woman, or child can be used for different computer vision applications. This can potentially give the data set more flexibility for the future.
The labeling can be built upon in subsequent versions of the data set. Once the face recognition model is live in production, new images can be labeled to help the model overcome data drift and augment its performance in the face of changing data distributions. This continued labeling and organizing keeps the models more robust and consistent.
Data Preparation Steps
There are certain best practices to follow when preparing data sets for deep learning applications. Following is a checklist for data engineers when working with unstructured data:
(1) Check data formats
Samples in a data set, especially if collected via web scraping or crowdsourcing, may come in multiple data formats. For example, an image could be a JPEG, PNG, or TIFF, while an audio file could be a WAV, MP3, or FLAC. Check whether the data set samples are in different formats, so that you can standardize the format across all samples.
(2) Verify data types
Certain deep learning applications are based on multimodal data including text, images, audio, video, and structured metadata. For example, a model that predicts what video a user might watch next is trained using multiple data types. It verifies the type of each data sample, then indexes and stores them separately. Note that an individual data type like numbers might also belong to different types like int, float, or string.
(3) Verify data dimensions
It’s crucial to check the dimensionality of the samples in a data set. For example, a set of images containing faces may be gathered from different cameras, each associated with different default image dimensions.
(4) Identify what data needs to be labeled
Once you’ve completed the above steps, you can begin data labeling. It may not be feasible in some situations to label each data sample, because manual labeling can be prohibitively expensive and time-consuming. In this case, choose an appropriate number of data samples for labeling. For common machine learning classification use cases, you need to sample data for labeling from each category.
(5) Determine what type of labeling to perform
The same data sample can be labeled in multiple ways depending on the use case. For instance, an image containing people and cars may be labeled for faces, for segmenting people or cars, or for the vehicle registration plates.
(6) Decide who will label the data
Data labeling can be performed manually by domain experts, crowdsourced from non-experts, or done programmatically using rule-based or model-based algorithms. Determine which annotators will define what kind of data, depending on their expertise or level of training. If a data set will be labeled using software, then the required configuration parameters, protocols, and performance metrics need to be established so that labeling is consistent.
(7) Review data for errors and mistakes
Usually, the first round of data labeling contains errors. To improve the data quality and eradicate errors, more experienced annotators should conduct a second or third level of review. Depending on cost, time, and available resources, each data sample can also be independently labeled by multiple annotators; the most commonly provided label can be assigned as the final label.
(8) Split the data set into training and testing segments
Once a data set is labeled, split it into separate train and test subsets for training and evaluating the model, respectively. Depending on the use case and the amount of available data, the ratio might be 80:20, 90:10, or even 99:1. To obtain more reliable results, k-fold cross-validation is recommended. Multiple training and test sets are sampled randomly, and the final results are averaged across all the different folds.
Without the protection of systematic data preparation and labeling checks, you may find that poor quality data damages the accuracy and performance of any analysis or models based on that data. If you follow the above guide, you will be able to ensure your data is good quality and labeled accurately.
Copyright © 2022, Sundeep Teki
All rights reserved. No part of these articles may be reproduced, distributed, or transmitted in any form or by any means, including electronic or mechanical methods, without the prior written permission of the author.
This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the blog owner and do not represent those of people, institutions or organizations that the owner may or may not be associated with in professional or personal capacity, unless explicitly stated.