Data Labeling: The Unsung Hero Combating Data Drift

27/10/2021

Introduction
Data drift is a common problem for production machine learning systems. It occurs when the statistical characteristics of the training (source) and test (target) data begin to differ significantly. As illustrated in the image below, the orange curve depicting the original data distribution shifts to the purple curve, representing a change in statistical properties like the mean and variance.

Understanding data drift is fundamental to maintaining the predictive power of your production machine learning systems. For instance, a data science team may have started working on a machine learning use case in 2019, using training data from 2018, but by the time the model is ready to go into production, it’s 2020. There could be a huge change in the distribution between the source data from 2018 and the live data coming from 2020.

Any time a machine learning model is ready to be shipped, it needs to be rigorously tested on live data. It’s critical that you detect data drift before deploying a model to production.

In this article, I’ll illustrate the various types of data drift and how data drift impacts model performance along with several examples. I’ll also address data labeling, one of the popular ways to tackle data drift, and how to perform data labeling efficiently.

Why Data Drift Happens?
In real-world situations, data drift can occur due to a variety of reasons:

seasonal trends
launch of a new feature or product that causes a change in the data or features used to train the machine learning model
rare black swan events like the COVID-19 pandemic that disrupted normal customer behavior in specific domains like food, travel, and hospitality

Continuing with the COVID-19 example, a model trained on data prior to the onset of global lockdowns, say from January to February 2020 will yield poor predictions on data in March and April 2020 after the lockdowns started. Thus, the original trained model is no longer relevant or practically useful and needs to be retrained.

Even small changes in the data structure or format of the source data can have significant consequences for machine learning models. For instance, a change in the format of a data field, like an IP address or hostname or ID, can often go undetected for a long time without effective root cause analysis.

Types of Data Drift
There are different types of data drift, but the two principal ones are:

Covariate drift refers to data drift associated with a shift in the independent variables. It happens when a few features change while still maintaining the same relationship between the feature and the target variable.

Covariate drift primarily occurs due to sample selection bias, which is a systematic bias in the selection of training data that results in a nonuniform and nonrepresentative training dataset. Nonstationary environments, where the training environment differs from the test environment, also cause covariate drift.

Concept drift, on the other hand, occurs when the relationship between the independent variables and the target variable changes.

Consider a product recommendation machine learning model in the context of e-commerce, where the original model is trained on user activity and transactions from users located in the US. Now imagine that the e-commerce company is going to launch in a new locale or market with the same product catalog as in the US. The original recommendation model will perform poorly when applied to users from the new market with significantly different online shopping behavior, financial literacy, or internet access for e-commerce.

In this example, the online shopping behavior of the users is markedly distinct. Even if the same features are used to train the machine learning model, it might underperform significantly. In such cases, concept drift is the root cause of data drift, and the personalization model needs to be reworked and include new features that better capture the new user behavior.

Overcoming Drift with Data Labeling
To overcome data drift, you need to retrain the model using all available data, including data from before and after drift occurred. New data needs to be labeled accurately before including it in the new training dataset.

Data labeling refers to the process of providing meaningful labels to target variables in the context of supervised machine learning where the target could be an image or text or an audio snippet.

In the context of data drift, data labeling is crucial to countering data drift, and thereby directly affects the performance of machine learning models in production.
Data labeling is integral to supervised machine learning where a model is fed input data along with relevant labels depending on the use case. For example, for a model learning to detect product placement in videos, the model is fed a video with products highlighted in the video.

Typically, data labeling is a manual exercise that’s both costly and time-consuming. It’s often outsourced to vendors in developing countries associated with low cost of labor. Annotators need to be trained to use labeling software, understand the machine learning use case and the annotation framework, and deliver highly accurate labels at a high velocity and throughput.

In such a scenario, labeling errors can occur, which exacerbates the problem of data drift if data from the new test or target distribution isn’t labeled accurately. In practice, several controversial labeling errors have occurred that cause reputational damage to the company, for instance, when Google Photos labeled two Black people as “gorillas.”

Big technology companies like Google and Facebook are grappling with such issues in their automated data labeling algorithms. Labeling errors can be made by human annotators, and also by machine learning models. Once trained, the predictions made by machine learning models on new data are often reused to augment the original training data to further improve the models. In such scenarios, data labeling errors can compound resulting in imperfect models that often yield such bizarre and controversial results.

Data labeling helps alleviate data drift by incorporating data from the changed distribution into the original training dataset. If enough new data is labeled, then it is possible to drastically reduce data drift by simply dropping the older data and only using the newly labeled data.

Therefore, proper and efficient data labeling is a crucial exercise with significant commercial impact, depending on the nature of the machine learning application. For example, incorrect data labels in a fraud detection use case can result in monetary loss every time the fraud detection machine learning model makes an incorrect prediction. Inaccurate data labels not only impact the performance of the machine learning model but also indirectly contribute to data drift. Any systematic data labeling errors may compound the problem as the model’s predictions on new data are typically leveraged to augment the training dataset.

Data labeling can be improvised and performed effectively through the use of intuitive software that enables human annotators to label data with high speed and low cognitive load. For additional improvement in data labeling, you can implement inter-annotator agreement; a particular training example is assigned a label that’s selected by a majority of the annotators. For example, if four out of seven annotators assign “Label1” to a particular data sample and the other three annotators assign it “Label2,” then the data sample would be tagged with “Label1.”

Strong operational practices including auditing of randomly selected labels for accuracy can improve the process and provide feedback about systematic labeling errors.

You can also use machine learning to aid data labeling with a model trained on a sample of data that’s labeled by humans to generate predictions on new or unlabeled data. These noisy labels can then be leveraged to build better machine learning models by incorporating the data samples associated with high probability and sending the data samples with low probability back to human annotators for more accurate labels. This process can be repeated iteratively to improve the overall performance of the model with minimal human data labeling efforts.

Conclusion
Data drift can have a negative impact on the performance of machine learning models as data distribution changes. This can cause a machine learning model’s predictive accuracy to go down over time if not countered effectively.

Data labeling is one technique to reduce data drift by applying labels to data from the new or changed distribution that the model does not predict well. This helps the machine learning model to incorporate this new knowledge during the training process to improve its performance.

There are several tools available today that enable annotators to label data efficiently. For example, Label Studio is an open-source data labeling tool that provides a platform for labeling different data types, including images, text, audio as well as multi-domain data. It’s already used by leading technology companies including Facebook, NVIDIA, Intel, so check it out if you’re looking for a robust, open-source solution for reducing data drift.

Comments

ML Engineer vs Data Scientist

21/10/2021

Comments

Published by Neptune.ai

Introduction
In 2010, DJ Patil and Thomas Davenport famously proclaimed Data Scientist (DS) to be the “Sexiest Job of the 21st century”. The progress in data science and machine learning over the last decade has been monumental. Data science has successfully empowered global businesses and organizations with predictive intelligence and data-driven decision-making to the extent that data science is no longer considered a fringe topic. Data science is now a mainstream profession and data science professionals are in high demand a cross all kinds of organizations from big tech companies to more traditional businesses.

A decade earlier the focus of data science was more on algorithmic development and modeling to extract robust insights from data. However, as data science has evolved over the decade, it has become clearer that data science involves more than just modeling. The machine learning lifecycle, from raw data through to deployment, now relies on specialized experts including data engineers, data scientists, machine learning engineers along with product and business managers.

The role of a machine learning engineer is gaining prominence across companies as they realized that the value of data science cannot be realized until a model is successfully deployed to production. Whilst a lot of tools and technologies such as Cloud APIs, AutoML, and a number of Python-based libraries have made the job of a data scientist easier, the MLOps of putting models into production and monitoring their performance is still quite unstructured.

For a detailed look at the respective skills, responsibilities, and tech stack of various profiles, ranging from a data scientist to a data science manager, refer to my previous article on how to build effective machine learning teams in the industry [2].

There are four core steps in executing a data science project:

Problem formulation – translating a business problem into a data science problem
Data engineering – preparing the data and pipelines to process raw data for modeling
Modeling – designing and experimenting with algorithms and models for the use case
Deployment – productionizing the model after testing and monitoring its performance.

In large tech companies and startups, there is a more established process of going about data science, and the work is clearly demarcated along the lines. Thus, it is common to expect professionals across various sub-domains to focus on their respective areas of specialization and collaborate with each other when required. However, in smaller organizations that do not have the luxury of having a large data science team, the first few data science hires are expected to work across these distinct functions as “full-stack” data scientists.

Thus, the definition and scope of a data scientist vs. a machine learning engineer is very contextual and depends upon how mature the data science team is. For the remainder of the article, I will expand on the roles of a data scientist and a machine learning engineer as applicable in the context of a large and established data science team.
In this article, I will:

review and compare the evolving roles and responsibilities of a data scientist and a machine learning engineer in the machine learning industry;
discuss the scope of each role, similarities and differences, and how to ensure strong communication and collaboration between these two core profiles without which data science projects are bound to fail.

Differences between Data Scientist & Machine Learning Engineer
In this section, I will discuss the primary differences in skills, responsibilities, day-to-day tasks, tech stack amongst other things.

The chief responsibility of a data scientist is to develop solutions using machine learning or deep learning models for various business problems. It is not always necessary to create novel algorithms or models as these tasks are research-intensive and can take up considerable time. In most cases, it is sufficient to use existing algorithms or pre-trained models, and optimize them in the context of the problem statement. However, in more innovative and R&D-focused teams or companies, scientists may be required to produce novel research and model artifacts.

On the contrary, the main goal of machine learning engineers is to take the models prepared by the data scientists and take them to production. This involves multiple aspects including model optimization to make it compatible with the custom deployment constraints and building MLOps infrastructure for experimentation, A/B testing, model management, containerization, deployment, and monitoring the model performance once deployed.

These factors translate into the underlying differences in skills, responsibilities, and tech stack for the respective roles as shown in the following tables.

Similarities, interference & handover
Similarities between Data Scientist and ML Engineer As evident from Tables 1-3, there is a partial overlap between the skills and responsibilities of data scientists and machine learning engineers. The tech stack is also quite similar and whilst data scientists are expected to mostly code in Python, machine learning engineers also need to know C++ for porting the model artifacts into a more efficient and faster format.

What machine learning engineers might lack in terms of subject matter expertise compared to data scientists, they make up for it in terms of knowledge of engineering tools and frameworks like Kubernetes that data scientists are less familiar with.

Data scientists usually have a STEM background or even advanced degrees like a Ph.D. in diverse fields like biology, economics, physics, mathematics amongst others. On the other hand, machine learning engineers generally have professional experience as software engineers.

While data scientists primarily deal with algorithmic and model development, machine learning engineers’ key focus is on scalable software engineering relevant to model deployment and monitoring, the remaining tasks are often common to both profiles.

In a few cases, these tasks might be shared depending on the size and maturity of the data science team, and things might work smoothly. However, more often than not, especially in larger teams and organizations, this can create considerable conflict and friction especially when data scientists and machine learning engineers work in different teams and report to different managers.

The handover processIt is possible to draw a clear line between the respective mandates of data scientists and machine learning engineers. Typically, data scientists will develop one or more candidate machine learning models and hand over these to the machine learning engineers following a specific contract.

The contract should specify:

the model accuracy,
latency,
memory,
the number of parameters,
the machine learning or deep learning framework used,
model versions,
the model predictions,
and the ground labels for the validation or test set amongst other parameters.

A structured handover contract ensures that the machine learning engineers have all necessary information to work on model optimization, any further experimentation, and deployment processes. After the handover, the data scientists become free to focus on the next machine learning use cases to take to production.

The collaboration between data scientists and machine learning engineers continues post-deployment and becomes critical especially when the models break in production. As the data scientists have greater insight into the working of the model, they are better positioned to troubleshoot and fix the models.

At the same time, some model failures are related to cracks in the underlying infrastructure developed by machine learning engineers, which they are in the best position to resolve. Continuous refinement of the model based on live data received by the model via active learning also falls under the domain of data scientists.

Communication & Collaboration between Data Scientists & ML Engineers
The success of a data science team is contingent on strong collaboration across the varied profiles [2]. Data scientists and machine learning engineers collaborate continuously during model development, deployment, and post-deployment monitoring and refinement. Ideally, if these two profiles ought to be part of the same team and report to the same leadership. In such a context, collaboration becomes easier and also fosters strong collegiality and learning from each other.

However, when data scientists and machine learning engineers are part of different teams and report to different leadership, the collaboration is not as strong as it should be. In such organizational settings, data scientists and machine learning engineers do not get to interact directly as much and rely on team productivity and project management tools like Slack, Teams, JIRA, Asana, etc.

For a lot of repetitive and common use cases, the use of such collaboration tools is actually a boon and saves the team a lot of time and effort. However, the transactional nature of relying on tools whose atomic units are tickets or tasks does not create a sense of team bonding and collaboration. In data science teams that rely heavily on such tools, this is a common grievance.

For more complex tasks or projects, in-person or video collaboration is a must and should not be ignored by the leadership. It is often in these settings that the technical professionals might learn of new use cases or clients from the business leaders, and the business professionals in turn might learn of a new technical breakthrough that could solve up-and-coming business use cases. The same holds true for data scientists and machine learning engineers as well, where each party could learn of either a new algorithm, or a model, or a new framework to make data science more effective and productive.

Current industry trends If a new version of the Harvard Business Review article in [1] were to be published in 2021, it would claim “machine learning engineer” as the sexiest job of the 2020s. While data science and model development is still a lucrative role across industry and academia, in recent years the focus in the industry has slightly shifted to building scalable and reliable infrastructure to serve data science models to millions of customers.
As of today, the machine learning engineer role is in much greater demand than that of a data scientist across the tech industry.

Industry leaders have learned that while it is great to have large, complex machine learning and deep learning models achieve state-of-the-art performance on academic benchmarks or training data, these do not yield any commercial value to the business until deployed and serving customer requests reliably and quickly at a high level of accuracy.
As more enterprises are becoming data-driven companies and establishing data science and machine learning teams or organizations [3], it is imperative for them to measure and achieve the required levels of ROI.
Big tech customer-focused companies that were early to venture into and invest in AI have already built strong teams of scientists and are now looking to enhance the production capabilities and commercialize the R&D artifacts developed by the data and research scientists.
Whilst top data scientists especially those with advanced degrees like PhDs will always be in high demand, we are currently witnessing a job market that is seeking skilled machine learning engineers whose supply is limited compared to data scientists.

The transition from Data Scientist to Machine Learning Engineer
There are numerous online courses on learning platforms like Coursera, Udacity, Udemy, etc. but there is a relative paucity of instructors and content focused on machine learning engineering practices. While building data science models can occur in a sandbox environment like Kaggle where the models are not made to serve real-world predictions, it is only possible to learn scalable model deployment, monitoring, and related machine learning engineering tasks in a real-world industry setting. As machine learning engineering and MLOps is a more applied discipline, there are fewer experts who have the required skillset to build and maintain robust infrastructure.

At the same time, existing data scientists, lured by the promise of greater potential impact, better compensation, and long-term career prospects are also seeking to transition into MLE roles.

As illustrated in tables 1, 2, and 3, there is considerable overlap between the two roles. However, machine learning engineers focus on the “engineering” aspects of taking models to production while data scientists focus on developing the right set of models for specific business problems. The most relevant skills that data scientists need to learn to become an effective machine learning engineer is software engineering including the ability to write optimized code, preferably in C++, rigorous testing, and understand and build and operate existing or custom tools and platforms for reliable model deployment and management.

It is definitely possible for data scientists to learn C++ and best practices in software engineering and software testing, as well as onboard new tools and technologies like Docker, Kubernetes, ONNX, and model serving platforms from multiple sources. However, since companies require machine learning engineers to have prior relevant experience, it becomes practically infeasible for data scientists to justify a machine learning profile if they do not have real-world hands-on experience in industry settings.

Given the chicken-and-egg nature of this problem, the best avenue for existing data scientists to transition to machine learning engineering is with their current employer. If data scientists express interest in machine learning engineering to their managers and are allowed to shadow or even assist and collaborate with machine learning engineers on specific projects, it becomes easier to make an internal transition within the same company. This represents a challenge for fresh graduates without any prior industry experience, and a similar internal transition route from data science or software engineering to machine learning engineering is the recommended pathway.

As the industry matures and companies evolve their machine learning systems and associated processes like hiring and upskilling, it will become easier for more candidates to make the transition from data science to machine learning engineering.

For more complex tasks or projects, in-person or video collaboration is a must and should not be ignored by the leadership. It is often in these settings that the technical professionals might learn of new use cases or clients from the business leaders, and the business professionals in turn might learn of a new technical breakthrough that could solve up-and-coming business use cases. The same holds true for data scientists and machine learning engineers as well, where each party could learn of either a new algorithm, or a model, or a new framework to make data science more effective and productive.

Conclusion
AI is a cornerstone of modern enterprise. This AI-revolution has accelerated significantly over the last decade and resulted in huge unmet demand for data science professionals. Data science as a discipline has also evolved, creating distinct profiles focused on data, modeling, engineering as well as product and customer success management. Of these profiles, machine learning engineers play a critical role in taking the models developed by data scientists based on the data prepared by data engineers and for use cases identified and developed by product or business managers to fruition.

Currently, the demand for machine learning engineers is similar to the demand for data scientists a decade ago. Such changes in the scope and nature of profiles in the AI industry will continue to happen, and present new challenging opportunities to engineers, scientists as well as business professionals to get their foot in the door.

References
[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
[2] https://neptune.ai/blog/how-to-build-machine-learning-teams-that-deliver
[3] https://neptune.ai/blog/building-ai-ml-projects-for-business-best-practices

Comments

Knowledge Distillation: Principles, Algorithms, Applications

7/10/2021

Comments

Published by Neptune.ai

Introduction
Large-scale machine learning and deep learning models are increasingly common. For instance, GPT-3 is trained on 570 GB of text and consists of 175 billion parameters. However, whilst training large models helps improve state-of-the-art performance, deploying such cumbersome models especially on edge devices is not straightforward.

Additionally, the majority of data science modeling work focuses on training a single large model or an ensemble of different models to perform well on a hold-out validation set which is often not representative of the real-world data.

This discord between training and test objectives leads to the development of machine learning models that yield good accuracy on curated validation datasets but often fail to meet performance, latency, and throughput benchmarks at the time of inference on real-world test data.

Knowledge distillation helps overcome these challenges by capturing and “distilling” the knowledge in a complex machine learning model or an ensemble of models into a smaller single model that is much easier to deploy without significant loss in performance.

In this blog, I will:

describe knowledge distillation in detail, its underlying principle, training schemes, and algorithms;
dive deeper into applications of Knowledge Distillation in deep learning for images, text, and audio.

What is knowledge distillation?
Knowledge distillation refers to the process of transferring the knowledge from a large unwieldy model or set of models to a single smaller model that can be practically deployed under real-world constraints. Essentially, it is a form of model compression that was first successfully demonstrated by Bucilua and collaborators in 2006 [2].

Knowledge distillation is performed more commonly on neural network models associated with complex architectures including several layers and model parameters. Therefore, with the advent of deep learning in the last decade, and its success in diverse fields including speech recognition, image recognition, and natural language processing, knowledge distillation techniques have gained prominence for practical real-world applications [3].

The challenge of deploying large deep neural network models is especially pertinent for edge devices with limited memory and computational capacity. To tackle this challenge, a model compression method was first proposed [2] to transfer the knowledge from a large model into training a smaller model without any significant loss in performance. This process of learning a small model from a larger model was formalized as a “Knowledge Distillation” framework by Hinton and colleagues [1].

As shown in Figure 1, in knowledge distillation, a small “student” model learns to mimic a large “teacher” model and leverage the knowledge of the teacher to obtain similar or higher accuracy. In the next section, I will delve deeper into the knowledge distillation framework and its underlying architecture and mechanisms.

Diving deeper into knowledge distillation
A knowledge distillation system consists of three principal components: the knowledge, the distillation algorithm, and the teacher-student architecture [3].

Knowledge In a neural network, knowledge typically refers to the learned weights and biases. At the same time, there is a rich diversity in the sources of knowledge in a large deep neural network. Typical knowledge distillation uses the logits as the source of teacher knowledge, whilst others focus on the weights or activations of intermediate layers. Other kinds of relevant knowledge include the relationship between different types of activations and neurons or the parameters of the teacher model themselves.
The different forms of knowledge are categorized into three different types: Response-based knowledge, Feature-based knowledge, and Relation-based knowledge. Figure 2 illustrates these three different types of knowledge from the teacher model. I will discuss each of these different knowledge sources in detail in the following section.

1. Response-based knowledge
As shown in Figure 2, response-based knowledge focuses on the final output layer of the teacher model. The hypothesis is that the student model will learn to mimic the predictions of the teacher model. As illustrated in Figure 3, This can be achieved by using a loss function, termed the distillation loss, that captures the difference between the logits of the student and the teacher model respectively. As this loss is minimized over training, the student model will become better at making the same predictions as the teacher.

In the context of computer vision tasks like image classification, the soft targets comprise the response-based knowledge. Soft targets represent the probability distribution over the output classes and typically estimated using a softmax function. Each soft target’s contribution to the knowledge is modulated using a parameter called temperature. Response-based knowledge distillation based on soft targets is usually used in the context of supervised learning.

2. Feature-based knowledge
A trained teacher model also captures knowledge of the data in its intermediate layers, which is especially pertinent for deep neural networks. The intermediate layers learn to discriminate specific features and this knowledge can be used to train a student model. As shown in Figure 4, the goal is to train the student model to learn the same feature activations as the teacher model. The distillation loss function achieves this by minimizing the difference between the feature activations of the teacher and the student models.

3. Relation-based knowledge
In addition to knowledge represented in the output layers and the intermediate layers of a neural network, knowledge that captures the relationship between feature maps can also be used to train a student model. This form of knowledge, termed as relation-based knowledge is depicted in Figure 5. This relationship can be modeled as correlation between feature maps, graphs, similarity matrix, feature embeddings, or probabilistic distributions based on feature representations.

Training There are three principal types of methods for training student and teacher models, namely offline, online and self distillation. The categorization of the distillation training methods depends on whether the teacher model is modified at the same time as the student model or not, as shown in Figure 6.

1. Offline distillation
Offline distillation is the most common method, where a pre-trained teacher model is used to guide the student model. In this scheme, the teacher model is first pre-trained on a training dataset, and then knowledge from the teacher model is distilled to train the student model. Given the recent advances in deep learning, a wide variety of pre-trained neural network models are openly available that can serve as the teacher depending on the use case. Offline distillation is an established technique in deep learning and easier to implement.

2. Online distillation
In offline distillation, the pre-trained teacher model is usually a large capacity deep neural network. For several use cases, a pre-trained model may not be available for offline distillation. To address this limitation, online distillation can be used where both the teacher and student models are updated simultaneously in a single end-to-end training process. Online distillation can be operationalized using parallel computing thus making it a highly efficient method.

3. Self-distillation
As shown in Figure 6, in self-distillation, the same model is used for the teacher and the student models. For instance, knowledge from deeper layers of a deep neural network can be used to train the shallow layers. It can be considered a special case of online distillation, and instantiated in several ways. Knowledge from earlier epochs of the teacher model can be transferred to its later epochs to train the student model.

Architecture
The design of the student-teacher network architecture is critical for efficient knowledge acquisition and distillation. Typically, there is a model capacity gap between the more complex teacher model and the simpler student model. This structural gap can be reduced through optimizing knowledge transfer via efficient student-teacher architectures.

Transferring knowledge from deep neural networks is not straightforward due to their depth as well as breadth. The most common architectures for knowledge transfer include a student model that is:

a shallower version of the teacher model with fewer layers and fewer neurons per layer,
a quantized version of the teacher model,
a smaller network with efficient basic operations,
a smaller networks with optimized global network architecture,
the same model as the teacher.

In addition to the above methods, recent advances like neural architecture search can also be employed for designing an optimal student model architecture given a particular teacher model.

Algorithms for knowledge distillation
In this section, I will focus on the algorithms for training student models to acquire knowledge from teacher models.

1. Adversarial distillation
Adversarial learning as conceptualized recently in the context of generative adversarial networks, is used to train a generator model that learns to generate synthetic data samples as close as possible to the true data distribution and a discriminator model that learns to discriminate between the authentic and synthetic data samples. This concept has been applied to knowledge distillation to enable the student and teacher models to learn a better representation of the true data distribution.

To meet the objective of learning the true data distribution, adversarial learning can be used to train a generator model to obtain synthetic training data to use as such or to augment the original training dataset. A second adversarial learning based distillation method focuses on a discriminator model to differentiate the samples from the student and the teacher models based on either logits or feature maps. This method helps the student mimic the teacher well. The third adversarial learning-based distillation technique focuses on online distillation where the student and the teacher models are jointly optimized.

2. Multi-Teacher distillation
In multi-teacher distillation, a student model acquires knowledge from several different teacher models as shown in Figure 7. Using an ensemble of teacher models can provide the student model with distinct kinds of knowledge that can be more beneficial than knowledge acquired from a single teacher model.

The knowledge from multiple teachers can be combined as the average response across all models. The type of knowledge that is typically transferred from teachers is based on logits and feature representations. Multiple teachers can transfer different kinds of knowledge as discussed in section 2.1.

3. Cross-modal distillation
Figure 8 shows the cross-modal distillation training scheme. Here, the teacher is trained in one modality and its knowledge is distilled into the student that requires knowledge from a different modality. This situation arises when data or labels are not available for specific modalities either during training or testing thus necessitating the need to transfer knowledge across modalities.

Cross-modal distillation is used most commonly in the visual domain. For example, the knowledge from a teacher trained on labeled image data can be used for distillation for a student model with an unlabeled input domain like optical flow or text or audio. In this case, features learned from the images from the teacher model are used for supervised training of the student model. Cross-modal distillation is useful for applications like visual question answering, image captioning amongst others.

4. Others
Apart from the distillation algorithms discussed above, there are several other algorithms that have been applied for knowledge distillation.

Graph-based distillation captures intra-data relationships using graphs instead of individual instance knowledge from the teacher to the student. Graphs are used in two ways – as a means of knowledge transfer, and to control transfer of the teacher’s knowledge. In graph-based distillation, each vertex of the graph represents a self-supervised teacher which may be based on response-based or feature-based knowledge like logits and feature maps respectively.
Attention-based distillation is based on transferring knowledge from feature embeddings using attention maps.
Data-free distillation is based on synthetic data in the absence of a training dataset due to privacy, security or confidentiality reasons. The synthetic data is usually generated from feature representations of the pre-trained teacher model. In other applications, GANs are also used to generate synthetic training data.
Quantized distillation is used to transfer knowledge from a high-precision teacher model (e.g. 32-bit floating point) to a low-precision student network (e.g. 8-bit).
Lifelong distillation is based on the learning mechanisms of continual learning, lifelong learning and meta-learning where previously learnt knowledge is accumulated and transferred into future learning.
Neural architecture search-based distillation is used to identify suitable student model architectures that optimize learning from the teacher models.

Applications of knowledge distillation
Knowledge distillation has been successfully applied to several machine learning and deep learning use cases like image recognition, NLP, and speech recognition. In this section, I will highlight existing applications and the future potential of knowledge distillation techniques.

1. Vision
The applications of knowledge distillation in the field of computer vision are plenty. State-of-the-art computer vision models are increasingly based on deep neural networks that can benefit from model compression for deployment. Knowledge distillation has been successfully employed for use cases like:

image classification,
face recognition,
image segmentation,
action recognition,
object detection,
lane detection,
pedestrian detection,
facial landmark detection,
pose estimation,
video captioning,
image retrieval,
shadow detection,
text-to-image synthesis,
video classification,
visual question answering, amongst others [3].

Knowledge distillation can also be used for niche use cases like cross-resolution face recognition where an architecture based on a high-resolution face teacher model and a low-resolution face student model can improve model performance and latency. As knowledge distillation can take advantage of different kinds of knowledge including cross-modal data, multi-domain, multi-task and low-resolution data, a wide variety of distilled student models can be trained for specific visual recognition use cases.

2. NLP
The application of knowledge distillation for NLP applications is especially important given the prevalence of large capacity deep neural networks like language models or translation models. State-of-the-art language models contain billions of parameters, for example, GPT-3 contains 175 billion parameters. This is several orders of magnitude greater than a previous state-of-the-art language model, BERT, which contains 110 million parameters in the base version.

Knowledge distillation is therefore highly popular in NLP to obtain fast, lightweight models that are easier and computationally cheaper to train. Other than language modeling, knowledge distillation is also used for NLP use cases like:

neural machine translation,
text generation,
question answering,
document retrieval,
text recognition [3].

Using knowledge distillation, efficient and lightweight NLP models can be obtained that can be deployed with lower memory and computational requirements. Student-teacher training can also be used to address multilingual NLP problems where knowledge from multilingual models can be transferred and shared by each other.

Case study: DistilBERT
DistilBERT is a smaller, faster, cheaper and lighter BERT model [4] developed by Hugging Face. Here, the authors pre-trained a smaller BERT model that can be fine-tuned on a variety of NLP tasks with reasonably strong accuracy. Knowledge distillation was applied during the pre-training phase to obtain a distilled version of BERT model that is smaller by 40% (66 million parameters vs. 110 million parameters) and faster by 60% (410s vs. 668s for inference on the GLUE sentiment analysis task) whilst retaining a model performance that is equivalent to 97% of the original BERT model accuracy. In DistilBERT, the student has the same architecture as BERT and was obtained using a novel triplet loss that combined losses related to language modeling, distillation and cosine-distance loss.

3. Speech
State-of-the-art speech recognition models are also based on deep neural networks. Modern ASR models are trained end-to-end and based on architectures that include convolutional layers, sequence-to-sequence models with attention, and recently transformers as well. For real-time, on-device speech recognition, it becomes paramount to obtain smaller and faster models for effective performance.
There are several use cases of knowledge distillation in speech:

speech recognition,
spoken language identification,
audio classification,
speaker recognition,
acoustic event detection,
speech synthesis,
speech enhancement,
noise-robust ASR,
multilingual ASR,
accent detection [10].

Case study: Acoustic Modeling by Amazon Alexa
Parthasarathi and Strom (2019) leveraged student-teacher training to generate soft targets for 1 million hours of unlabeled speech data where the training dataset consisted only of 7000 hours of labeled speech. The teacher model produced a probability distribution over all the output classes. The student model also produced a probability distribution over the output classes given the same feature vector and the objective function optimized the cross-entropy loss between these two distributions. Here, knowledge distillation helped simplify the generation of target labels on a large corpus of speech data.

Conclusions
Modern deep learning applications are based on cumbersome neural networks with large capacity, memory footprint, and slow inference latency. Deploying such models to production is an enormous challenge. Knowledge distillation is an elegant mechanism to train a smaller, lighter, faster, and cheaper student model that is derived from a large, complex teacher model. Following the conceptualization of knowledge distillation by Hinton and colleagues (2015), there has been a massive increase in the adoption of knowledge distillation schemes for obtaining efficient and lightweight models for production use cases. Knowledge distillation is a complex technique based on different types of knowledge, training schemes, architectures and algorithms. Knowledge distillation has already enjoyed tremendous success in diverse domains including computer vision, natural language processing, speech amongst others.

References
[1] Distilling the Knowledge in a Neural Network. Hinton G, Vinyals O, Dean J (2015) NIPS Deep Learning and Representation Learning Workshop. https://arxiv.org/abs/1503.02531
[2] Model Compression. Bucilua C, Caruana R, Niculescu-Mizil A (2006) https://dl.acm.org/doi/10.1145/1150402.1150464
[3] Knowledge distillation: a survey. You J, Yu B, Maybank SJ, Tao D (2021) https://arxiv.org/abs/2006.05525
[4] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019) Sanh V, Debut L, Chammond J, Wolf T. https://arxiv.org/abs/1910.01108v4
[5] Lessons from building acoustic models with a million hours of speech (2019) Parthasarathi SHK, Strom N. https://arxiv.org/abs/1904.01624

Comments

Data Labeling: The Unsung Hero Combating Data Drift

ML Engineer vs Data Scientist

Knowledge Distillation: Principles, Algorithms, Applications

Archives

Categories