Index

5/3/2024

Comments

Reach out - [email protected] for your AI B2B Marketing and Thought Leadership content requirements.

AI: Leadership & Best Practices

AI: Data & Governance

AI: Use cases

Team development

Misc

Comments

How to choose a Vector Database?

1/3/2024

Comments

Vector databases have recently gained prominence with the rise of large language models and generative AI. A vector database is a data store for unstructured text in the form of vector embeddings for various AI models and applications. Embeddings are a high dimensional vector representation of text that conveys rich semantic information and represent an efficient way of capturing unstructured data like text.

The rising popularity of large language models like GPT-4, Gemini, Claude-2, Llama-2, Mixtral and others have fuelled tremendous interest in generative AI across the industry to build applications based on these models. Vector databases are specialized for handling vector data that is used to train or fine-tune these foundational models for domain and company specific use cases. Unlike traditional scalar-based databases, vector databases offer optimized storage and querying capabilities for vector embeddings.

Although several vector databases are now available in the market like Pinecone, Chroma, Qdrant amongst others, deciding which vector database to choose for enterprise use cases is not a straightforward decision.

In this article, you will learn how to decide which vector database to choose for your organization based on criteria like performance, reliability, scalability, cost-efficiency, developer experience, security, technical support amongst others.

Key Considerations
In this section, you will learn in detail about each of the key factors that should be considered to make your final selection of a vector database. These include data and use case characteristics, performance, functionality, enterprise-readiness, developer experience, and future roadmap.

1. Data and Use Case
It is important to work backwards from the specific business use case that you are planning to solve by leveraging organizational data and the latest techniques from the field of generative AI. For instance, if your business objective is to build an enterprise knowledge management chatbot like McKinsey’s Lilli, you will need to organize and prepare all the in-house text data such as documents, emails, chat messages etc.

The use case defines several aspects of the data, including its size, frequency, data type, growth in the volume of data over time, data freshness and consequently the nature of the underlying vector embeddings to be stored in the vector database. These vectors may be sparse, dense, and also span multiple modalities depending on the use case.

Additionally, careful planning and scoping of the use case also helps you understand other crucial aspects such as the number of users, the number of queries per day, the peak number of queries at any given instant, as well as the query patterns of the users.

Vector databases utilize indexing and vector search powered by k-nearest neighbors (kNN) or approximate nearest neighbor (ANN) algorithms. This empowers a vector db to perform similarity search and identify the most similar vectors in the database. This capability underlies enterprise use cases based on natural language processing such as question-answering, document analysis, recommender systems, image and voice recognition etc.

2. Performance

2.1 Query latency and query per second (QPS)
The primary performance metrics of a vector db are the query latency, i.e., the time it takes to run a query and get the result and the query per second that defines the throughput in terms of the number of queries processed in a second. These parameters are critical for ensuring a seamless user experience for several applications that require real-time results such as chatbots. Typical QPS values range from ~50-300 and the average query latency from 25-100 ms depending on the underlying hardware.

2.2 Scalability
Scalability measures the ability of the vector database to grow and expand further to support the requirements of its customers. The scale can be measured in terms of the number of embeddings that can be supported and in terms of horizontal scaling of existing resources and vertical scaling of additional servers. Typically, most existing vector db companies provide scale-out capabilities up to a billion vectors without any performance degradation. If the resources can scale automatically, then you can be rest assured that your application will always be up and running.

2.3 Accuracy
A vector database is as good as its accuracy of retrieving the right set of results based on the user queries. Here, the choice of vector search algorithms to identify data sources with similar embeddings as the embedding of the user query is pivotal. There are several different algorithms used for powering vector search such as kNN, ANN, FAISS, NGT. These algorithms generate approximate results and the best vector databases provide a good trade-off between speed and accuracy.

3. Functionality

3.1 Filtering on metadata
In practice, filtering vector search results based on the metadata helps reduce the search space, thus providing for faster and more accurate search results. Typical metadata includes information like dates, versions, tags and the ability of a vector database to store multiple metadata fields allows for a better search experience.

3.2 Integrations
Integrating a vector database into the existing data and engineering infrastructure in your organization is critical to faster adoption and lesser time to value. The ability of vector databases to seamlessly integrate with essential infrastructure elements like the cloud infrastructure, underlying large language models, databases etc. is a key factor to consider.

3.3 Cost-efficiency
While performance metrics and functionality are core to a technology, the cost should be reasonable and fit your budget. The pricing of vector databases is a function of the number of ‘write’ operations such as update and delete and the number of queries. Other factors that affect the cost include the dimensionality of the embedding, the number of vectors stored in the database, and the size of the metadata.

Depending on your use case and requirements, it is essential to estimate the overall cost of running your application at scale on a monthly or quarterly basis and evaluate the overall costs relative to your budget and the expected revenue from running the AI applications.

4. Enterprise-readiness

4.1 Security and compliance
For most enterprise companies, it is imperative that any external vendor they employ meets strict security and compliance requirements. These requirements include SOC2, GDPR, HIPAA, ISO compliance and others, depending on the domain in which the company operates. The data privacy and security standards have gone up in the light of recent cybersecurity attacks and breaches of customer data, and you should ensure that any vector db vendor meets your specific security and compliance requirements.

4.2 Cloud setup
Several modern companies have undergone digital transformation and house their entire data and infrastructure in the cloud vs on-premise. You may choose to manage and maintain your infrastructure via a self-hosted setup or go for a fully managed SaaS platform. The benefit of a fully managed system is that it automates clusters with minimal requirements for you to provision and scale clusters or take care of operational issues.

4.3 Availability
Availability, i.e. the ability of your vector db to run without any interruptions, issues or downtime is essential to not adversely impact user experience. Most vector database providers vouch for specific SLAs which should meet the requirements for your applications. Typical values include 99.9% for uptime SLA and a few hours to a few business days for response time SLA depending on the severity of the production issue.

4.4 Technical support
More often than not, you might be stuck facing some issues with your vector db and need some hands-on support from the vendor to help troubleshoot the issue. Does the company provide you with a dedicated team who can be available at a short notice to get on a call and figure out how to solve the problem? The quality of responsiveness and customer support experience provided by a vector db company is valuable and helps you develop a stronger sense of trust in the company.

4.5 Open source vs Closed source
Some vector db companies are closed source and operate under a proprietary license such as Pinecone. At the same time, there are a host of vector db companies that are open source under the Apache 2.0 license such as Qdrant or Chroma while also offering a fully managed service. This can also influence your choice of the vector db provider.

5. Developer experience

5.1 Community
Software and AI engineers are the core professionals who will work on the vector db and integrate it in the company’s infrastructure and deploy your generative AI application to production. Therefore, the quality of experience that developers have with a vector db solution is integral in shaping your final decision. Having an open-source community on Slack or Discord helps build more engagement and trust with developers than commercial vendor support. It provides your developers an opportunity to learn from developers at other companies as well and discuss and solve issues by leveraging the wisdom of the community.

5.2 Onboarding
Onboarding a new technology is challenging as it determines the time your developer team takes to properly understand the product, integrate it, troubleshoot any issues, and become an expert in using the vector database. The availability of APIs and SDKs as well as clear product demos and documentation goes a long way in reducing the barriers to understanding a new vector database so that your developers can build with speed and confidence.

5.3 Time to value
Similar to the time to onboard a new vector db, another important factor is the time to business value. If a vector db provider vouches for a fast deployment of a production-ready application, then you can realize value sooner, and meet your business goals faster as well. A long gestation time from onboarding to business value is a deterrent for many fast-moving companies and startups especially in the current frantic race to adopt and ship generative AI applications.

5.4 Documentation
The quality of the vector database’s documentation determines the time to onboard, time to value, and trust in the provider’s expertise and product. Clear instructions with tutorials, examples and case studies help your developers understand and master the vector db faster.

5.5 User education
Similar to community-based offerings, expert technical content such as blogs, demos and videos focused on the existing as well as new features are helpful for your team to understand and build faster. In addition to text and video content, other offerings like user testimonials, workshops, conferences also help educate your team and build more trust in the vector db provider.

6. Future roadmap
A final factor to consider is the product roadmap of the vector database provider. Vector databases are an emerging technology that will need to continuously evolve alongside the advances in generative AI models, chip design and hardware, and novel enterprise use cases across domains.

Therefore, the vector db vendor should show the potential for evaluating long-term and future industry trends such as sophisticated vectorization techniques for a wider variety of data types, hybrid databases, optimized hardware accelerators for AI applications such as GPUs and TPUs, distributed vector dbs, real-time and streaming data based applications, as well as industry-specific solutions that might require advance data privacy and security.

Conclusion
Vector databases are an essential ingredient for modern generative AI applications built on unstructured data such as text. Their popularity has increased in parallel to the developments in the generative AI field such as large language models, large image models etc. to serve as the underlying database for handling high-dimensional data stored as vector embeddings.

In this article, you learned about several important pillars to help your decision making about the choice of the vector database. These factors include data and use case considerations, performance-based requirements such as query speed and scalability, functionality requirements such integrations and cost-efficiency, enterprise-readiness including security and compliance, and developer experience including community and documentation.

Several vector database companies have emerged to build this foundational infrastructure. There is no single ‘best’ vendor of vector db and the ultimate choice is highly contingent on your organization’s business goals. Therefore, a data-driven approach guided by the factors listed in this article will help you select the most optimal vector db for your organization.

Comments

Mixtral of Experts

19/2/2024

Comments

Figure 1 - Mixture of Experts layer.

1. Introduction
Mistral is a pioneering French AI startup that launched their own foundational large language model, called Mistral 7B in September 2023. As of the date of launch, it was the best 7 billion parameter language model, outperforming even larger language models like Llama 2 of size 13 billion parameters across multiple benchmarks. In addition to its performance, Mistral 7B is also popular as the model is open-sourced under the Apache 2.0 license with the model weights available for download.

Mixtral 8x7B (hereafter, referred to as “Mixtral”) is the latest model released by Mistral in January 2024 and represents a significant extension of their prior work on Mistral 7B. It is a 7B Sparse Mixture of Experts (SMoE) language model with stronger capabilities than Mistral 7B. It uses 13B active parameters during inference out of a total of 47B parameters, and supports multiple languages, code, and 32k context window.

In this blog, you will learn about the details of the Mixtral language model architecture, its performance on various standard benchmarks vis-a-vis state-of-the-art large language models like Llama 1 and 2 and GPT3.5, as well as potential use cases and applications.

2. Mixtral
Mixtral is a mixture-of-experts network, similar to [GPT4]. While GPT4 is said to constitute 8 expert models of 222B parameters each, Mixtral is a mixture of 8 experts of 7B parameters each. Thus, Mixtral only requires a subset of the total parameters during decoding, thus allowing faster inference speed at low batch sizes and higher throughput at large batch sizes.

2.1 Sparse Mixture of Experts
Figure 1 illustrates the Mixture of Experts (MoE) layer. Mixtral has 8 experts, and each input token is routed to two experts with different sets of weights. The final output is a weighted sum of the outputs of the expert networks, where the weights are determined by the output of the gating network. The number of experts (n) and the top K experts are hyperparameters that are set to 8 and 2 respectively. The number of experts, n determines the total or sparse parameter count while K determines the number of active parameters used for processing each input token.

The MoE layer is applied independently per input token in lieu of the feed-forward sub-block of the original Transformer architecture. Each MoE layer can be run independently on a single GPU using a model parallelism distributed training strategy.

2.2 Mistral 7B
Mixtral’s core architecture is similar to Mistral 7B, and therefore, a review of its architecture is relevant for a more comprehensive understanding of Mixtral. Mistral 7B is based on the Transformer architecture. In comparison to Llama, it has a few novel features that contribute to it surpassing Llama 2 (13B) on various benchmarks.

2.2.1 Grouped-Query Attention
Grouped-Query Attention (GQA) is an extension of multi-query attention, which uses multiple query heads but single key and value heads. Popular language models like PaLM employ multi-query attention.

GQA represents an interpolation between multi-head and multi-query attention with single key and value heads per subgroup of query heads. As shown in figure 2, GQA divides query heads into G groups, each of which shares a single key and query head. It is different to multi-query attention which shares single key and value heads across all query heads. GQA is an important feature as it significantly accelerates the speed of inference and also reduces the memory requirements during decoding. This enables the models to scale to higher batch sizes and higher throughput, which is a critical requirement for real-time AI applications.

2.2.2 Sliding Window Attention
Sliding window attention (SQA), introduced in the Longformer architecture exploits the stacked layers of a Transformer to attend to information beyond the typical window size. SWA is designed to attend to a much longer sequence of tokens than vanilla attention, and also offers significant reductions in computational cost.

The combination of GQA and SWA collectively enhance the performance of Mistral 7B and therefore Mixtral relative to other language models like the Llama series.

Figure 2. A comparison of the configuration of key, value and query heads for GQA vs. multi-head and multi-query attention.

3. Performance

3.1 Standard benchmarks
The authors of Mixtral benchmarked the performance of the model on a range of standard benchmarks and evaluated the accuracy of Mixtral versus leading language models like Llama 1, Llama 2, and GPT3.5 as shown in figure 3, table 1, and table 2.

In summary, Mixtral is better than much larger language models with up to 70B parameters like Llama 2 70B while only using 13B (~18.5%) of the active parameters during inference. Mixtral’s performance is especially superior in tasks focused on mathematics, code generation, as well as multilingual comprehension.

3.2 Multilingual understanding
Table 3 shows the performance of Mixtral versus Llama models on multilingual benchmarks. As Mixtral was pretrained with a significantly higher proportion of multilingual data, it is able to outperform Llama 2 70B on multilingual tasks in French, German, Spanish, and Italian while being comparable in English.

3.3 Long-range performance
As shown in figure 4, the input context length of language models has increased by several orders of magnitude in the last few years - from 512 tokens for the BERT model to 200k tokens for Claude 2. However, most large language models struggle to efficiently use the longer context. Nelson and colleagues showed that current language models do not robustly make use of information in long input contexts, and their performance is typically highest when the relevant information for tasks such as question-answering or key-value retrieval occurs at the beginning or the end of the input context, with significantly degraded performance when the the models need to access information in the middle of long contexts.

Mixtral, which has a context size of 32k tokens, overcomes this deficit of large language models and shows 100% retrieval accuracy regardless of the context length or the position of the key to be retrieved in a long context.

The perplexity, a metric that captures the capability of a language model to predict the next word given the context, decreases monotonically as the context length increases. Lower perplexity implies higher accuracy, and the Mixtral model is therefore capable of extremely good performance on tasks based on long context lengths as shown in figure 5.

Figure 3. Performance of Mixtral in comparison to Llama 1 and 2 models of different sizes.

Figure 4. Evolution of the context length of large language models.

Figure 5. Long range performance of Mixtral.

Table 1. Mixtral outperforms Llama 2 70B model on almost all benchmarks while using less than 1/5th of the active parameters during inference.

Table 2. Mixtral outperforms or matches the performance of Llama 2 70B and GPT-3.5 on most benchmarks.

Table 3. Mixtral’s performance on multilingual benchmarks for French, German, Spanish and Italian versus Llama 1 and 2 models.

4. Instruction Fine-tuning
Instruction tuning refers to the process of further training large language models on a curated dataset containing (instruction, output) pairs of training samples. Instruction tuning is a computationally efficient method for extending the capabilities of large language models in diverse domains without extensive retraining or architectural changes.

“Mixtral - Instruct” model was fine-tuned on an instruction dataset followed by Direct Preference Optimization (DPO) on a paired feedback dataset. DPO is a technique to optimize large language models to adhere to human preferences without explicit reward modeling or reinforcement learning. As of January 26, 2024, on the standard LMSys Leaderboard, Mixtral - Instruct continues to be the best performing open-source large language model. This leaderboard is a crowdsourced open platform for evaluating large language models that ranks models following the Elo ranking system in chess. Mixtral - Instruct only ranks below proprietary models like OpenAI’s GPT-4, Google’s Bard and Anthropic’s Claude models, while being a significantly small model.

This extremely strong performance of Mixtral - Instruct and with an open-source friendly Apache 2.0 license opens up the possibility for tremendous adoption of Mixtral for both commercial and non-commercial applications. It represents a much more powerful alternative to Llama 2 70B that is already being used as the foundational model for extending large language models to other languages like Hindi or Tamil that are spoken widely but not adequately represented in the training dataset of these large language models.

Figure 6. Mixtral is the best performing open-source large language model on the LMSys Leaderboard.

5. Use Cases
Mixtral represents the numero uno of open-source large language models as it clearly outperforms the previous best open-source model, Llama 2 70B, by a significant margin, while providing for faster and cheaper inference.

At the time of writing this article, Mixtral has been available in the open-source for less than two months and we are yet to see many examples of how it is being used in the industry. However, there are some early movers, like the Brave browser that has already incorporated Mixtral in its AI-based browser assistant, Leo. Mixtral is also incorporated by Brave for powering its [programming-related queries in Brave Search. It is only a matter of time before Mixtral witnesses widespread adoption across industry for a variety of use cases and challenges the hegemony of proprietary models like OpenAI’s GPT-4 and the likes.

6. Conclusion
Mixtral is a cutting-edge, mixture-of-experts model with state-of-the-art performance among open-source models. It consistently outperforms Llama 2 70B on a variety of benchmarks while having 5x fewer active parameters during inference. It thus allows for a faster, more accurate and cost-effective performance for diverse tasks including mathematics, code generation, as well as multilingual understanding. Mixtral - Instruct also outperforms proprietary models such as Gemini-Pro, Claude-2.1, GPT-3.5 Turbo on human evaluation benchmarks.

Mixtral thus represents a powerful alternative to the much larger and more compute intensive Llama 2 70B as the de facto best open-source model, and will facilitate development of new methods and applications benefitting a wide variety of domains and industries.

Comments

How to Automate MLOps?

31/12/2022

Comments

Published by Pachyderm

MLOps refers to the practice of delivering machine-learning models through repeatable and efficient workflows. It consists of a set of practices that focuses on various aspects of the machine-learning lifecycle, from the raw data to serving the model in production.
Despite the routine nature of many of these MLOps tasks, it’s not uncommon for several steps to still be processed manually, incurring massive ongoing maintenance costs. Your organization can benefit tremendously from automating MLOps to achieve efficiency, reliability, and cost-effectiveness at scale. For example, automation could:

Decrease the time to realize business value and impact from MLOps
Improve operational excellence, including transparency
Minimize risk due to human errors
Accelerate MLOps efforts at scale
Improve performance
Increase trust and collaboration among data science and partner teams

However, many companies lack the capabilities, talent, and infrastructure to drive machine-learning models to production reliably and efficiently. This not only means wasted time and resources but also hinders adoption and trust in AI. The sooner that companies of any size, enterprise and startups alike, invest in automating their MLOps processes to expedite delivery of machine-learning models, the sooner they can meet their business goals.
So, let’s talk about six methods for automating MLOps that can help streamline the continuous delivery of machine-learning models to production.

1. Automated Data-driven Pipelines
Delivering a machine-learning model involves numerous steps, from processing the raw data to serving the model to production. Machine-learning pipelines consist of several connected components that can execute automatically in an independent and modular fashion. For instance, different pipelines can focus on data processing, model training, and model deployment.

When it comes to machine learning, data is as or more important than code; pipelines track changes in training data and automatically trigger pipelines for processing new or changed data. Such automated data-driven pipelines kickstart further iterations of data processing and model training based on the new datasets.

Without automated pipelines, the data science team executes these steps manually. This inevitably leads to manual errors, production delays, and lack of visibility of the overall pipeline for relevant stakeholders. Manually built pipelines are harder to troubleshoot when defects creep into production, and so compound technical debt for the MLOps team.
Automating pipelines can significantly reduce manual effort and free up organizational time, resources, and bandwidth so your MLOps team can focus on other challenges.

2. Automated Version Control
In the realm of software engineering, version control refers to the tracking of changes in code, making it easier to monitor, troubleshoot and collaborate among large teams. In machine learning, the need for version control applies to data as well as code.

Version control is especially critical for machine-learning applications in domains like healthcare and finance that have a higher burden of model explainability, data privacy, and compliance. Automating version control for machine learning ensures that the history of the different moving parts—code, data, configurations, models, pipelines—is centrally maintained and fully automated. Through automated version control, your MLOps team has a more efficient ability to trace bugs, roll back changes that didn’t work, and collaborate with greater transparency and reliability.

3. Automated Deployment
Large data science organizations develop multiple models trained on structured and unstructured data for various use cases. Some of these models need to make predictions in real-time at ultra-low latencies while others may be invoked less often or serve as inputs to other models. All these models need to be periodically retrained to improve performance and mitigate challenges due to data drift.

Deploying models manually in such a complex business environment is highly inefficient and time consuming. Manual deployment is cumbersome and can cause serious errors that impacts model serving and the quality of model predictions. This often leads to poor customer experience and customer churn.

Deployment of models to production involves several steps. It starts with choosing multiple environments and services for staging the model, selecting appropriate servers that can handle the production traffic, and pushing the model forward to production. It then includes monitoring model performance and data drift, automating model retraining with more recent data and inputs, and ensuring the reliability of the models through better testing and security.

Automating these steps yields several benefits:

Reduce errors due to manual operations
Increase velocity of deployment of original and retrained models
Reduce time to take a model from training to production
improve visibility of the entire deployment pipeline for troubleshooting
Improve tracking and monitoring of model performance after it’s been deployed
Reduce time taken to improve and retrain an existing model in production
Improve data and model governance in the face of increasing data compliance laws
Improve productivity for data scientists and machine learning engineers
Increase organizational awareness of the impact of deployed AI/ML models
Increase the efficiency of machine learning program management

4. Automated Feature Selection for Model Training
Classical machine-learning models are trained on data with hundreds to thousands of features, ie, key variables in the dataset that are often correlated with model performance. Choosing a set of features that significantly account for the predictive power of the trained models is therefore essential.

Feature selection by hand is cumbersome and requires significant subject matter expertise. Automating feature selection not only helps train the machine-learning model faster on a smaller dataset but also makes the model easier to interpret. Selecting fewer features but with high feature importance is critical in the preparation of training data.

Automated feature selection helps reduce the size of the model to make faster predictions, or to increase the speed of training your machine learning or deep learning model. Feature selection can be automated using either unsupervised learning techniques, like principal component analysis, or supervised methods using statistical tests like f-test, t-test, or chi-squared tests.

5. Automated Data Consistency Checks
A central focus of data-centric AI is the quality of data used to train machine-learning models. Data quality determines the accuracy of the models, which in turn impacts business decision-making. So the underlying data must have minimal errors, inconsistencies, or missing values.

Simplify the challenge of ensuring data quality and consistency by automating unit tests that check data types, expected values, missing cells, column and row names, and counts. Consider extending your automation to the analysis and reporting of the statistical properties of relevant features. If the training dataset consists of a few thousand to millions of samples and hundreds to thousands of features, you can’t manually evaluate every row and column for data consistency. Automated routines that test for different types of data inconsistencies makes it easier to eliminate poor quality data.

6. Automated Script Shortcuts
Processing data and training machine-learning models involves a lot of boilerplate code. Automate the creation of scripts for common tasks to save time and effort while providing better visibility and version control.

Typically, data scientists and machine-learning engineers create their own unique automations and shortcuts, which are seldom shared among the larger team. However, having a centralized repository of script shortcuts reduces the need to improvise, and perhaps even avoids a team member reinventing the wheel.

Save these shortcuts as executable bash scripts for different use cases like downloading data from data lakes or uploading model artifacts in backup folders.

Automate MLOps with Pachyderm
Fortunately, you don’t have to build these MLOps automation features in-house from scratch. Pachyderm is a software platform that integrates with all the major cloud providers to continuously monitor changes in data at the level of individual files. Whenever any existing file is modified or new files are added to a training dataset, Pachyderm triggers events for pipelines and launches a new iteration of data transformation, testing data quality, or model training.

Pachyderm can take care of automated version control and lineage for data as well as [deployment](https://www.pachyderm.com/events/how-to-build-a-robust-ml-workflow-with-pachyderm-and-seldon/. It also enables autoscaling and parallel processing on Kubernetes, orchestrating server resources for deployment at scale.

Conclusion
With a lot of the machine learning lifecycle still handled manually across the industry, consider automating any of the six MLOps tasks we covered here in order to achieve efficiency and reliability at scale:

Data-driven pipelines
Version control
Deployment
Feature selection for model training
Data consistency checks
Script shortcuts

A data science organization’s level of automation across its machine-learning lifecycle indicates its maturity. The velocity of training and delivering new machine-learning models to production increases significantly with that maturity, leading to faster realization of business impact.

Pachyderm, a leading enterprise-grade data science platform, helps make explainable, repeatable, and scalable machine learning systems a reality. Its automated data pipeline and versioning tools can power complex data transformations for machine learning while remaining cost effective.

Comments

Federated Machine Learning for Healthcare

2/11/2022

Comments

Introduction
Traditional machine learning is based on training models on data sets that are stored in a centralized location like an on-premise server or cloud storage. For domains like healthcare, privacy and compliance issues complicate the collection, storage, and sharing of critical patient and medical data. This poses a considerable challenge for building machine learning models for healthcare.

Federated learning is a technique that enables collaborative machine learning without the need for centralized training data. A shared machine learning model is trained by keeping all the training data on a device, thereby ensuring higher levels of privacy and security compared to the traditional machine learning setup where data is stored in the cloud.

This technique is especially useful in domains with high security and privacy constraints like healthcare, finance, or governance. Users benefit from the power of personalized machine learning models without compromising their sensitive data.

This article describes federated learning and its various applications with a special focus on healthcare.

How Does Federated Learning Work?
This section discusses in detail how federated learning works for a hypothetical use case of a number of healthcare institutions working collaboratively to build a deep learning model to analyze MRI scans.

In a typical federated learning setup, there’s a centralized server, for instance, in the cloud, that interacts with multiple sources of training data, such as hospitals in this example. The centralized server houses a global deep learning model for the specific use case that is copied to each hospital to train on its own data set.

Each hospital in this setup trains the global deep learning model locally for a few iterations on its internal data set and sends the updated version of the model back to the centralized server.

Each model update is then sent to the cloud server using encrypted communication protocols, where it’s averaged with the updates from other hospitals to improve the shared global model. The updated parameters are then shared with the participating hospitals so that they can continue local training.

In this fashion, the global model can learn the intricacies of the diverse data sets stored across various partner hospitals and become more robust and accurate. At the same time, the collaborating hospitals never have to send their confidential patient data outside their premises, which helps ensure that they don’t violate strict regulatory requirements like HIPAA. The data from each hospital is secured within its own infrastructure.

This unique federated learning setup is easily scalable and can accommodate new partner hospitals; it also remains unaffected if any of the existing partners decide to exit the arrangement.

Use Cases for Federated Learning in Healthcare
Federated learning has immense potential across many industries, including mobile applications, healthcare, and digital health. It has already been used successfully for healthcare applications, including health data management, remote health monitoring, medical imaging, and COVID-19 detection.

As an example of its use for mobile applications, Google used this technique to improve Smart Text Selection on Android mobile phones. In this use case, it enables users to select, copy, and use text quickly by predicting the desired word or sequence of words based on user input. Each time a user taps to select a piece of text and corrects the model’s suggestion, the global model receives precise feedback that’s used to improve the model.

Federated learning is also relevant for autonomous vehicles to improve real-time decision-making and real-time data collection about traffic and roads. Self-driving cars require real-time updates, and the above types of information can be effectively pooled from several vehicles in real time using federated learning.

Privacy and Security
With increased focus on data privacy laws from governments and regulatory bodies, protecting user data is of utmost importance. Many companies store customer data, including personally identifiable information such as names, addresses, mobile numbers, email addresses, etc.

Apart from these static data types, user interactions with companies such as chat, emails, and phone calls also carry sensitive details that need to be protected from hackers and malicious attacks.

Privacy-enhancing technologies like differential privacy, homomorphic encryption, and secure multi-party computation have advanced significantly and are used for data management, financial transactions, and healthcare services, as well as data transfer between multiple collaborative parties.

Many startups and large tech companies are investing heavily in privacy technologies like federated learning to ensure that customers have a pleasant user experience without their personal data being compromised.

In the healthcare industry, federated learning is a promising technology that allows, for example, hospitals to share electronic health records (EHR) to create more accurate models. Privacy is preserved without violating strict HIPAA standards by decentralizing the data processing, which is distributed among multiple end-points instead of being managed from a central server.

Simply put, federated learning allows training of machine learning models without the need to collect raw data in a central location; instead, the data used by each end-point (in this example, hospitals) remains local. By combining the above with differential privacy, hospitals can even provide a quantifiable measure of data anonymization.

Federated Learning vs. Distributed Learning and Edge Computing
Federated learning is often confused with distributed learning. In the context of deep learning, distributed training is used to train large, deep neural networks across a number of GPUs or machines. However, distributed learning relies on centralized training data shared across multiple nodes to increase the speed of model training.

Federated learning, on the other hand, is based on decentralized data stored across a number of devices and produces a central, aggregate model. A fascinating example of the potential of this technology is using federated learning-based Person Movement Identification (PMI) through wearable devices for smart healthcare systems.

Edge computing is a related concept where the data and model are centralized in the same individual device. Edge computing doesn’t train models that learn from data stored across multiple devices, as in the case of federated learning. Instead, a centrally trained model is deployed on an edge device, where it runs on data collected from that device. For example, edge computing is applied in the context of Amazon Alexa devices, where a wake word detection model is stored on the device to detect every utterance of “Alexa.”

AI and Healthcare
Federated machine learning has a strong appeal for healthcare applications. By design, patient and medical data is highly regulated and needs to adhere to strict security and privacy standards. By collating data from participating healthcare institutions, organizations can ensure that confidential patient data doesn’t leave their ecosystem; they can also benefit from machine learning models trained on data across a number of healthcare institutions.

Large hospital networks can now work together and pool their data to build AI models for a variety of medical use cases. With federated learning, smaller community and rural hospitals with fewer resources and lower budgets can also benefit and provide better health outcomes to more of the population.

This technique also helps to capture a greater variety of patient traits, including variations in age, gender, and ethnicity, which may vary significantly from one geographic region to another. Machine learning models based on such diverse data sets are likely to be less biased and more likely to produce more accurate results. In turn, the expert feedback of trained medical professionals can help to further improve the accuracy of the various AI models.

Federated learning, therefore, has the potential to introduce massive innovations and discoveries in the healthcare industry and bring novel AI-driven applications to market and patients faster.

Conclusion
Federated learning enables secure, private, and collaborative machine learning where the training data doesn’t leave the user device or organizational infrastructure. It harnesses diverse data from various sources and produces an aggregate model that’s more accurate.

This technique has introduced significant improvements in information sharing and increased the efficacy of collaborative machine learning between hospitals. It circumvents and overcomes the challenges of working with highly sensitive medical data while leveraging the power of state-of-the-art machine learning and deep learning.

Related Blogs

Comments

Building Artificial Intelligence Products

19/10/2022

Comments

Link to the video

Comments

AI & Web3

19/10/2022

Comments

Encrypt_AI_Web3
File Size:	349 kb
File Type:	pdf

Download File

Web3 is the third generation of the internet based on emerging technologies like blockchains, tokens, DAOs, digital assets, decentralised finance that has the potential to give back control of digital assets back to the users with greater trust and transparency.

Typical web3 applications focus on DAOs, DeFi, Stablecoins, Privacy and digital infrastructure, the creator economy amongst others. The web3 ecosystem represents a promising green space for creators, developers, and various types of tech and non-tech professionals as well.

In my talk (video and slides shared above) for Crater's Encrypt 2022 hackathon, I describe how AI can be leveraged to build commercially viable web3 applications for India. I cover a number of relevant AI/ML datasets, models, resources and applications for these domains, recognized by the Ministry of Electronics and Information Technology's National Strategy on Blockchain:

Transfer of land records/property
E-Voting
Electronics health record management
Identity management
Smart Energy Grid
Agricultural and Pharmaceutical Supply Chains
Blockchain for Social Good

Related Blogs

Comments

Top 10 MLOps tools

16/9/2022

Comments

Source: MLOps Org

Machine learning operations (MLOps) refer to the emerging field of delivering machine learning models through repeatable and efficient workflows. The machine learning lifecycle is composed of various elements, as shown in the figure below. Similar to the practice of DevOps for managing the software development lifecycle, MLOps enables organizations to smooth the path to successful AI transformation by providing an engineering and technological backbone to underlying machine learning processes.

MLOps is a relatively new field, as the commercial use of AI at scale is itself a fairly new practice. MLOps is modeled on the existing field of DevOps, but in addition to code, it incorporates additional components, such as data, algorithms, and models. It includes various capabilities that allow the modern machine learning team, comprising data scientists, machine learning engineers, and software engineers, to organize the building blocks of machine learning systems and take models to production in an efficient, reliable, and reproducible fashion.

MLOps tools
MLOps is carried out using a diverse set of tools, each catering to a distinct component of the machine learning pipeline. Each tool under the MLOps umbrella is focused on automation and enabling repeatable workflows at scale. As the field of machine learning has evolved over the last decade, organizations are increasingly looking for tools and technologies that can help extract the maximum return from their investment in AI. In addition to cloud providers, like AWS, Azure, and GCP, there are a plethora of start-ups that focus on accommodating varied MLOps use cases.

In this article, I will cover tools for the following MLOps categories:

Metadata management
Versioning
Experiment tracking
Model deployment
Monitoring

In the following section, I will list a selection of MLOps tools from the above categories. It is important to note that although a particular tool might be listed under a specific category, the majority of these tools have evolved from their initial use case into a platform for providing multiple MLOps solutions across the entire ML lifecycle.

Metadata Management
Building machine learning models involves many parameters associated with code, data, metrics, model hyperparameters, A/B testing, and model artifacts, among others. Reproducing the entire ML workflow requires careful storage and management of the above metadata.

Featureform
Featureform is a virtual feature store. It can integrate with various data platforms, and it enables the management and governance of the data from which features are built. With a unique, feature-first approach, Featureform has built a product called Embeddinghub, which is a vector database for machine learning embeddings. Embeddings are high-dimensional representations of different kinds of data and their interrelationships, such as user or text embeddings, that quantify the semantic similarity between items.

MLflow
MLflow is an open-source platform for the machine learning lifecycle that covers experimentation and deployment, and it also includes a central model registry. It has four principal components: Tracking, Projects, Models, and Model Registry. In terms of metadata management, the MLflow Tracking API is used for logging parameters, code, metrics, and model artifacts.

Versioning
For machine learning systems, versioning is a critical feature. As the pipeline consists of various data sets, labels, experiments, models, and hyperparameters, it is necessary to version control each of these parameters for greater accessibility, reproducibility, and collaboration across teams.

Pachyderm
Pachyderm provides a data layer for the machine learning lifecycle. It offers a suite of services for data versioning that are organized by data repository, commit, branch, file, and provenance. Data provenance captures the unique relationships between the various artifacts, like commits, branches, and repositories.

DVC
DVC, or Data Version Control, is an open-source version control system for machine learning projects. It includes version control for machine learning data sets, models, and any intermediate files. It also provides code and data provenance to allow for end-to-end tracking of the evolution of each machine learning model, which promotes better reproducibility and usage during the experimentation phase.

Experiment Tracking
A typical machine learning system may only be deployed after hundreds of experiments. To optimize the model performance, data scientists perform numerous experiments to identify the most appropriate set of data and model parameters for the success criteria. Managing these experiments is paramount for staying on top of the data science modeling efforts of individual practitioners, as well as the entire data science team.

Comet
Comet is a machine learning platform for managing and optimizing the entire machine learning lifecycle, from experiment tracking to model monitoring. Comet streamlines the experimentation workflow for data scientists and enables clear tracking and visualization of the results of each experiment. It also allows side-by-side comparisons of experiments so users can easily see how model performance is affected.

Weights & Biases
Weights & Biases is another popular machine learning platform that provides a host of services, including [experiment tracking](https://wandb.ai/site/experiment-tracking). It facilitates tracking and visualization of every experiment, allows rerunning previous model checkpoints, and can monitor CPU and GPU usage in real time.

Model Deployment
Once a machine learning model is built and tests have found it to be robust and accurate enough to go to production, the model is deployed. This is an extremely important aspect of the machine learning lifecycle, and if not managed well, it can lead to errors and poor performance in production. AI models are increasingly being deployed across a range of platforms, from on-premises servers to the cloud to edge devices. Balancing the trade-offs for each kind of deployment and scaling the service up or down during critical periods are very difficult tasks to achieve manually. A number of platforms provide model deployment capabilities that automate the entire process of taking a model to production.

Seldon
Seldon is a model deployment software that helps enterprises manage, serve, and scale machine learning models in any language or framework on Kubernetes. It’s focused on expediting the process to take a model from proof of concept to production, and it’s compatible with a variety of cloud providers.

Kubeflow
Kubeflow is an open-source system for productionizing models on the Kubernetes platform. It simplifies machine learning workflows on Kubernetes and provides greater portability and scalability. It can run on any hardware and infrastructure on which Kubernetes is running, and it is a very popular choice for machine learning engineers when deploying models.

Monitoring
Once a model is in production, it is essential to monitor its performance and log any errors or issues that may have caused the model to break in production. Monitoring solutions enable setting thresholds as indicators for robust model performance and are critical in solving for known issues, like data drift. These tools can also monitor the model predictions for bias and explainability.

Fiddler
Fiddler is a machine learning model performance monitoring software. To ensure expected model performance, it monitors data drift, data integrity, and anomalies in the data. Additionally, it provides model explainability solutions that help identify, troubleshoot, and understand underlying problems and causes of poor performance.

Evidently
Evidently is an open-source machine learning model monitoring solution. It measures model health, data drift, target drift, data integrity, and feature correlations to provide a holistic view of model performance.

Conclusion
MLOps is a growing field that focuses on organizing and accelerating the entire machine learning lifecycle through best practices, tools, and frameworks borrowed from the DevOps philosophy of software development lifecycle management. With machine learning, the need for tooling is much greater, as machine learning is built on foundational blocks of data and models, as well as code.

To bring reliability, maturity, and scale to machine learning processes, a diverse set of MLOps tools are being increasingly used. These tools are developed for optimizing the nuts and bolts of machine learning operations, including metadata management, versioning, model building and experiment tracking, model deployment, and monitoring in production.

Over the past decade, the field of AI and machine learning has grown rapidly, with organizations embracing AI and recognizing its critical importance for transforming their business. The field of MLOps is still young, but the creation and adoption of tools will further empower organizations in their journey of AI transformation and value creation.

Related Blogs

Comments

AWS Redshift Pricing

30/8/2022

Comments

Published by CloudForecast

Introduction
Amazon Redshift is a widely used cloud data warehouse that is used by many businesses, like Nasdaq, GE, and Zynga, to process analytical queries and analyze exabytes of data across databases, data lakes, data warehouses, and third-party data sets.

There are multiple use cases for Redshift, including enhancing business intelligence capabilities, increasing developer and analyst productivity,
and building machine learning models for predictive insights, like demand forecasting.

Amazon Redshift can be leveraged by modern data-driven organizations to vastly improve their data warehousing and analytics capabilities. However, the pricing for Redshift services can be challenging to understand, with multiple criteria that define the total cost.

In this article, you’ll learn about Amazon Redshift and its pricing structure, with suggestions for how to optimize costs.

What Is Amazon Redshift?
Essentially, Amazon Redshift provides analytics over multiple databases and offers high scalability in a secure and compliant fashion.

Additionally, there is a serverless option called Amazon Redshift Serverless that makes it even easier to rapidly scale analytics setup without requiring a managed data warehouse infrastructure. It helps with data democratization and assists various data stakeholders to extract data insights by simply loading and querying data in the warehouse.

Amazon Redshift Pricing
In this section, you’ll learn about Amazon Redshift’s capabilities as it pertains to usage and pricing.

Free Tier
For new enterprise users, the AWS Free Tier provides a free two-month trial of the DC2.Large node. This free service includes 750 hours per month, which is sufficient to run a single DC2.Large node with 160GB of compressed solid-state drives (SSD).

On-Demand Pricing
When you launch an Amazon Redshift cluster, you select a number of nodes in a specific region as well as their instance type to run your data warehouse. In on-demand pricing, a simple hourly rate applies based on the previous configuration and is billed as long as the cluster is live. The typical hourly rate for a DC2.Large node is $0.25 USD per hour.

Redshift Serverless Pricing
With Amazon Redshift Serverless, costs accrue only when the data warehouse is active and is measured in units of Redshift Processing Units (RPUs). You’re charged in terms of RPU-hours on a per-second basis. The serverless configuration also includes concurrency scaling and Amazon Redshift Spectrum, and the cost for these services is already included.

Managed Storage Pricing
Amazon Redshift charges for the data stored in a managed storage at a specific rate per GB-month. Its usage is calculated on an hourly basis as a function of the total amount of data and starts as low as $0.024 USD per GB with the RA3 node. The cost of a managed storage also varies according to the particular AWS region in which the data is stored.

For example, consider the cost of a managed storage pricing where 100TB of data is stored with an RA3 node type for thirty days in the US East region, where the cost is $0.024 USD per GB-month.

The total usage for thirty days in GB-hours is as follows:
100TB × 1024GB/TB (converting TB to GB) × 30 days × 24 hours/day = 73,728,000 GB-hours

Then you can convert GB-hours to GB-months:
73,728,000 GB-hours / (24 × 30) hours per month = 102,400 GB-months

Finally, you can calculate the total cost of 102,400 GB-months at $0.024 USD/GB-month in the US East region:
102,400 GB-months × $0.024 USD = $2,457.60 USD

Spectrum Pricing
With Amazon Redshift Spectrum, users can run SQL queries directly on the data in the S3 buckets. Here, the cost is based on the number of bytes scanned by the Spectrum utility.
The pricing of Redshift Spectrum is $5 USD per terabyte of data scanned.

Concurrency Scaling Pricing
With Concurrency Scaling, Amazon Redshift can be scaled to multiple concurrent users and queries. For every twenty-four hours that your main cluster is live, you accrue a one-hour credit. Any additional usage is charged on a per-second, on-demand rate that depends on the number of types of nodes in the main cluster.

Reserved Instance Pricing
Reserved instances are designated for stable production workloads and are less expensive than clusters run on an on-demand basis. Significant cost savings can be achieved through long-term usage and commitment to Amazon Redshift in the span of a few years.

Pricing for reserved instances can either be paid all up front, partially up front, or monthly over the course of a year with no up-front charges.

Amazon Redshift Cost Optimization Considerations
Before you begin using Amazon Redshift, you need to be aware of your current costs.
AWS Cost ExplorerThe AWS Pricing Calculator provides a configurable tool to estimate the cost of using Amazon Redshift.

For instance, the annual cost of one node of the DC2.8xlarge instance in the US East (Ohio) region on an on-demand basis is as follows:
1 instance × $4.80 USD hourly × 730 hours in a month × 12 months = $42,048 USD

The cost for the same Amazon Redshift configuration for a reserved instance for a one-year term paid up front is $27,640 USD.

AWS Tags
Using AWS cost allocation tags can help you decode and manage your AWS costs. Tagsenable AWS resources to be labeled in the form of key-value pairs and can include various types, like technical, business, security, and automation. Once the tags are activated in the Billing and Cost Management console, a cost allocation report can be generated based on the specific resources tagged. Tags can be user-defined or AWS-generated.

Amazon Redshift Cost Optimization
Optimizing Amazon Redshift costs comes down to effective planning, prudent usage and allocation of resources, and regular monitoring of the usage and associated costs.

Optimizing Queries
The analytical queries made on the data stored in Amazon Redshift can be optimized to run more efficiently. Queries can be compute-intensive, can be storage-intensive, or can take a long time to execute.

There are a number of query tuning techniques that can be used to optimize your queries. Tables with skewed data or missing statistics, and queries with nested loops and long wait times, typically affect query performance and can be improved as illustrated in this AWS developer guide.

Here is a commonly used weak query that selects all the columns in a table:
SELECT * FROM USERS

The previous query can be very inefficient and slow if the table consists of thousands of columns, especially if only a few columns are relevant for the necessary analysis. This query can be optimized by specifying and retrieving the exact column names like the following:
SELECT Firstname, Lastname, DOB FROM USERS

Cluster Limits and Quotas
Usage limits on Amazon Redshift clusters can be programmed using the AWS Command Line Interface (CLI) tool. Limits can be imposed on concurrency scaling in terms of time and spectrum in terms of data scanned. Daily, weekly, or monthly periods can be used.

A number of limits and quotas are defined for Redshift resources that can also be applied to constrain the overall costs associated with Redshift.

Data Type
Amazon Redshift costs can also be managed by storing data in a compressed, partitioned, and columnar data format, like Apache Parquet, since fewer data is scanned.

Conclusion
Amazon Redshift is a powerful and cost-effective cloud-native data warehouse that provides scalable and performant data analytics and processing capabilities. It also comes with a serverless configuration that allows any data stakeholder to run data queries without the need to provision and manage the data warehouse infrastructure.

Amazon Redshift has multiple aspects affecting its pricing, including on-demand or reserved capabilities, serverless, managed storage pricing, Redshift Spectrum pricing, concurrency scaling pricing, and reserved instance pricing. Keeping on top of the various Amazon Redshift costs is not straightforward but can be made easier by AWS cost monitoring tools, like CloudForecast.

CloudForecast helps manage AWS costs through daily cost management reports, monthly financial reports, untagged AWS resources discovery, and idle and underutilized resources visibility for cost-saving opportunities.

Related blog

AWS Lambda Pricing and Cost Optimization Guide

Comments

AWS Lambda Pricing

28/8/2022

Comments

Published by CloudForecast

Introduction
Companies are increasingly moving their production code to serverless functions using AWS Lambda, which has gained popularity for its better code maintenance, low-cost hosting charges, and automatically scaled and optimized performance. But without careful oversight, Lambda can become an expensive choice for your project.

Lambda, offered by market-leading AWS, offers many benefits. Lambda is one example of serverless functions, or single-purpose, programmatic functions hosted and maintained by cloud providers like AWS, Azure, or GCP to ensure near-perfect runtime and scaling to any incoming network request volume. Companies can use Lambda, an event-driven compute service, to run any type of application or backend service without worrying about provisioning or managing servers.

Lambda adapts to a variety of use cases across startups and enterprises alike. It can process data at scale, run interactive web and mobile backend services, enable powerful machine learning models, and build in-house event-driven applications.

It also specifies limits for the amount of compute and storage resources used to run and store serverless functions. These limits apply to a number of resources, such as the number of concurrent executions; storage for uploaded functions as well as quotas for function configuration; deployment and execution parameters like memory allocation; timeout; environment variables; layers; and burst concurrency.

The key to using Lambda is keeping your costs in check. This article will review Lambda’s pricing structure to show how costs can be efficiently managed without compromising on operational excellence and execution of Lambda functions. It will also discuss tools like CloudForecast that can help engineering teams monitor and reduce their serverless computing costs on AWS.

Understanding AWS Lambda Pricing
AWS Lambda pricing is based on the amount of memory allocated to the serverless function and the amount of time the code runs, rounded to the nearest millisecond. The key variables that determine Lambda costs are the type of architecture, the number of requests, the time frame for which the requests apply, the duration of each request (in milliseconds), and the amount of memory allocated to the Lambda function.

Each Lambda request starts when code executes in response to an event trigger from services like Amazon’s Simple Notification Service or calls from Amazon API Gateway or via the AWS SDK. The cost for each compute and storage resource is calculated depending on the function configuration.

AWS offers a free tier that allows one million free requests per month and 400,000 GB-seconds of compute time per month powered by x86 and Graviton2 processors. It also offers a flexible pricing model called the Compute Savings Plan, based on guaranteed usage (measured in dollars per hour) for a one- or three-year term.

AWS Lambda does offer an attractive feature called Provisioned Concurrency that enables greater control over start-up latency when Lambda functions are triggered. Provisioned concurrency solves the problem of variable start-up latency when a Lambda service is triggered on demand and scales up to meet the needs of the application workloads. This overhead in starting a Lambda function is referred to as cold start, and the magnitude of this problem is a function of the time taken to set up the execution environment and the duration for the code to be initialized.

As illustrated in this official AWS example, with provisioned concurrency enabled, the percentage of requests served within a given time remains fairly constant—especially for the slowest five percent of the requests—in comparison to a scenario with provisioned concurrency disabled. At scale, this can have a massive impact not only on the costs but also on the user experience.

While the first factor is controlled by AWS, the second factor falls to the developer. The code initialization duration is predominantly responsible for cold start latency. Provisioned concurrency solves for cold start by enabling Lambda functions to be initialized for high workloads in milliseconds.

AWS provides a pricing calculator to estimate the cost of using Lambda for your applications. The below estimate provides pricing calculations for a sample application with the following settings:

Architecture: x86
Number of requests: 1 million per minute
Duration of each request: 5 ms
Amount of memory allocated: 512 MB

The same pricing calculator can also provide an estimate for provisioned concurrency. In this case, in addition to the above parameters, the cost is a function of the amount of concurrency specified and the period of time the configuration is active.

Controlling AWS Lambda Costs
AWS Lambda does offer options for controlling costs, but as the above example showed, the cost of function calls can quickly scale up as part of the organizational application workload. If the configuration is not carefully monitored and fine-tuned for current applications, Lambda can become prohibitively expensive.

You can keep AWS Lambda costs down by focusing on three important factors:

the memory of the Lambda function
the number of times the function is invoked
the length of time the function is invoked

The cost of a Lambda function invocation is multiplied by its execution time and memory size, so reducing either factor by even a small amount can have a significant impact on billable costs. It’s important to ensure you have the correct configuration. Periodic monitoring of the actual values of the memory size and the number and duration of function calls can help confirm whether the current configuration is fine-tuned for the current workload.

AWS Lambda logs are ingested into Amazon CloudWatch, so mining these logs can help optimize the configuration and the costs. External tools like CloudForecast can also monitor usage and costs.

Avoiding high maximum execution time also helps save costs. It’s common to have a buffer of execution time beyond what’s specified, but the additional costs incurred by Lambda functions add up, making it prudent to change the value of the “duration of each request” parameter as needed.

Lambda Step Functions can also help manage costs. Step functions are state machines with a visual workflow that allow developers to coordinate different tasks like calling various Lambda functions. Using step functions is a more efficient way to poll for the status of tasks. Typically, long polling increases the costs of Lambda functions as they are waiting idle, and step functions help alleviate the total costs based on the number of state transitions to execute the application, instead of the execution time of a workflow.

Another tactical method to control Lambda costs is to evaluate whether your application can be run asynchronously. Running async workloads prevents idle downtime in which the AWS Lambda functions wait for external applications to complete. If the overall architecture can be analyzed for idle instances and reconfigured for asynchronous execution, the costs of Lambda functions can be drastically reduced.

The frequency at which Lambda functions are invoked can also impact the usage and costs. Where applications like Kinesis are used as a Lambda function trigger, increasing the batch size can reduce the frequency at which the Lambda function needs to be invoked, thus reducing the total number of executions.

Writing optimized production code always helps, and its lower execution time can reduce Lambda costs. You can, for instance, record and analyze the Duration metric in CloudWatch for slow execution times.

For some applications, EC2 spot instances may be cheaper and more effective than Lambda functions. This is especially true for an application architecture in which the traffic is predictable and sustained, making a reliable EC2 spot instance a more suitable alternative.

Conclusion
AWS Lambda and serverless functions have had a tremendous impact on the efficient execution of software, data, and machine learning applications in the cloud. Lambda can help you achieve savings on your engineering costs, but it’s possible to reduce your costs even more by optimizing the configuration of your applications and fine-tuning your resources.

Doing this work manually can require careful logging and monitoring of your application in production settings. Instead, you can use tools to automate and dynamically adjust Lambda function settings to reduce costs in a more cost- and time-efficient manner. One of those tools is CloudForecast, which can manage and optimize the cost of using AWS services like Lambda.

CloudForecast provides an out-of-the-box solution for engineering teams to monitor their monthly budget and move toward a more responsible use of Lambda functions. Its detailed reports suggest ways to reduce AWS costs, and it can also provide reports for your finance and accounting teams.

To learn more about how CloudForecast can help with your AWS Lambda costs, check its official blog.

Related blog

AWS Redshift Pricing and Cost Optimization Guide

Comments

How to Hire Data Science Teams?

29/7/2022

Comments

Data science teams are an integral part of early-stage or growth-stage start-ups as midlevel and enterprise companies. A data science team can include a wide range of roles that take care of the end-to-end machine learning lifecycle from project conceptualization to execution, delivery, and monitoring:

Data engineer
Data scientist
Machine learning engineer
Product manager
Project manager
Data science manager

The manager of a data science team in an enterprise organization has multiple responsibilities, including the following:

Hiring a data science team
Cross-functional stakeholder management
Career development and mentorship
Performance appraisals
Ownership of the entire data science program

As the data science manager, it’s critical to have a structured, efficient hiring process, especially in a highly competitive job market where the demand outstrips the supply of data science and machine learning talent. A transparent, thoughtful, and open hiring process sends a strong signal to prospective candidates about the intent and culture of both the data science team and the company, and can make your company a stronger choice when the candidates are selecting an offer.

In this blog, you’ll learn about key aspects of the process of hiring a top-class data science team. You’ll dive into the process of recruitment, interviewing, and evaluating candidates to learn how to find the ones who can help your business improve its data science capabilities.

Benefits of an Efficient Hiring Process
Recent events have accelerated organizations’ focus on digital and AI transformation, resulting in a very tight labor market when you’re looking for data sciencedigital skills, like machinelike data science and machine learning, statistics, and programming.

A structured, efficient hiring process enables teams to move faster, make better decisions, and ensure a good experience for the candidates. Even if candidates don’t get an offer, a positive experience interacting with the data science and the recruitment teams makes them more likely to share good feedback on platforms like Glassdoor, which might encourage others to interview at the company.

Hiring Data Science Teams
A good hiring process is a multistep process, and in this section, you’ll look at every step of the process in detail.

Building a Funnel for Talent
Depending on the size of the data science team, the hiring manager may have to assume the responsibility of reaching out to candidates and building a pipeline of talent. In larger organizations, managers can work with in-house recruiters or even third-party recruitment agencies to source talent.

It’s important for the data science managers to clearly convey the requirements for the recruited candidates, such as the number of candidates desired and the profiles of those candidates. Candidate profiles might include things like previous experience, education or certifications, skill set or tech stack, and experience with specific use cases. Using these details, recruiters can then start their marketing, advertising, and outreach campaigns on platforms, like LinkedIn, Glassdoor, Twitter, HackerRank, and LeetCode.

In several cases, recruiters may identify candidates who are a strong fit but who may not be on the job market or are not actively looking for new roles. A database of all such candidates ought to be maintained so that recruiters can proactively reach out to them at a more suitable time and reengage the candidates.

Another trusted source of identifying good candidates is through employee referrals. An in-house employee referral program that incentivizes current employees to refer candidates from their network is often an effective way to attract the specific types of talent you’re looking for.

The data science leader should also publicize their team’s work through channels, like conferences or workshops, company blogs, podcasts, media, and social media. By investing dedicated time and energy in building up the profile of the data science team, it’s more likely that candidates will reach out to your company seeking data science opportunities.

When looking for a diverse set of talent, the search an be difficult as data science is a male dominated field. As a result, traditional recruiting paths will continue to reflect this bias. Reaching out and building relationships with groups such as Women in Data Science, can help broad the pipeline of talent you attract.

Defining Roles and Responsibilities
Good candidates are more likely to apply for roles that have a clear job description, including a list of potential data science use cases, a list of required skills and tech stack, and a summary of the day-to-day work, as well as insights into the interviewing process and time lines. Crafting specific, accurate job descriptions is a critical—if often overlooked—aspect of attracting candidates. The more information and clarity you provide up front, the more likely it is that candidates have sufficient information to decide if it’s a suitable role for them and if they should go ahead with the application or not. If you’re struggling with creating this, you can start with an existing job description template and then customize it in accordance with the needs of the team and company.

It's also critical to not over populate a job description with every possible skill or experience you hope a candidate brings. That will narrow your potential applicant pool. Instead focus on those skills and experiences that are absolutely critical. The right candidate will be able to pick up other skills on the job.

It can be useful for the job description to include links to any recent publications, blogs, or interviews by members of the data science team. These links provide additional details about the type of work your team does and also offer candidates a glimpse of other team members.

Here are some job description templates for the different roles in a data science team:

Interviewing process
When compared to software engineering interviews, the interview process for data science roles is still very unstructured, and data science candidates are often uncertain about what the interview process involves. The professional position of data scientist has only existed for a little over a decade, and in that time, the role has evolved and transformed, resulting in even newer, more specialized roles, such as data engineer, machine learning engineer, applied scientist, research scientist, and product data scientist.

Because of the diversity of roles that could be considered data science, it’s important for a data science manager to customize the interviewing process depending on the specific profile they’re seeking. Data scientists need to have expertise in multiple domains, and one or more second-round interviews can be tailored around these core skills:

Programming
Statistics
Mathematics
Machine learning
Deep learning
Product sense
Leadership

Given how tight the job market is for data science talent, it’s important to not over complicate the process. The more steps in the process, the longer it will take and the higher the likelihood you will lose viable candidates to other offers. So be thoughtful in your approach and evaluate it periodically to align with the market.

Types of Data Science Interviews
Interviews are often a multistep process and can involve multiple steps of assessments.

Screening Interviews
To save time, one or more screening rounds can be conducted before inviting candidates for second-round interviews. These screening interviews can take place virtually and involve an assessment of essential skills, like programming and machine learning, along with a deep dive into the candidate’s experience, projects, career trajectory, and motivation to join the company. These screening rounds can be conducted by the data science team itself or outsourced to other companies, like HackerRank, HackerEarth, Triplebyte, or Karat.

Onsite Interviews
Once candidates have passed the screening interviews, the top candidates will be invited to a second interview, either virtually or in person. The data science manager has to take the lead in terms of coordinating with internal interviewers to confirm the schedule for the series of interviews that will assess the candidate’s skills, as described earlier. On the day of the second-round interviews, the hiring manager needs to help the candidate feel welcome and explain how the day will proceed. Some companies like to invite candidates to lunch with other team members, which breaks the ice by allowing the candidate to interact with potential team members in a social setting.

Each interview in the series should start by having the interviewer introduce themself and provide a brief summary of the kind of work they do. Depending on the types of interviews and assessments the candidate has already been through, the rest of the interview could focus on the core skill set to be evaluated or other critical considerations. Wherever possible, interviewers should offer the candidate hints if they get stuck and otherwise try to make them feel comfortable with the process. The last five to ten minutes of each interview should be reserved for the candidate to ask questions to the interviewer. This is a critical component of second-round interviews, as the types of questions a candidate asks offer a great deal of information about how carefully they’ve considered the role.

Before the candidate leaves, it’s important for the recruiter and hiring manager to touch base with the candidate again, inquire about their interview experience, and share time lines for the final decision.

Technical Assessment
It is common for there to be some sort of case study or technical assessment to get a better understanding of a candidate’s approach to problem solving, dealing with ambiguity and practical skills. This provides the company with good information about how the candidate may perform in the role It also is an opportunity to show the candidate what type of data and problems they may work on when working for you.

Evaluating candidates
After the second-round interviews and technical assessment, the hiring manager needs to coordinate a debrief session. In this meeting, every interviewer shares their views based on their experience with the candidate and offers a recommendation if the candidate should be hired or not.

After obtaining the feedback from each member of the interview panel, the hiring manager also shares their opinion. If the candidate unanimously receives a strong hire or a strong no-hire signal, then the hiring manager’s decision is simple.

However, there may be candidates who perform well in some interviews but not so well in others, and who elicit mixed feedback from the interview panel. In cases like this, the hiring manager has to make a judgment call on whether that particular candidate should be hired or not. In some cases, an offer may be extended if a candidate didn’t do well in one or more interviews but the panel is confident that the candidate can learn and upskill on the job, and is a good fit for the team and the company.

If multiple candidates have interviewed for the same role, then a relative assessment of the different candidates should be considered, and the strongest candidate or candidates, depending on the number of roles to be filled, should be considered.

While most of the interviews focus on technical data science skills, it’s also important for interviewers to use their time with the candidate to assess soft skills, like communication, clarity of thought, problem-solving ability, business sense, and leadership values. Many large companies place a very strong emphasis on behavioral interviews, and poor performance in this interview can lead to a rejection, even if the candidate did well on the technical assessments.

Job Offer
After the debrief session, the data science manager needs to make their final decision and share the outcome, along with a compensation budget, with the recruiter. If there’s no recruiter involved, the manager can move directly to making the candidate an offer.

It’s important to move quickly when it comes to making and conveying the decision, especially if candidates are interviewing at multiple companies. Being fast and flexible in the hiring process gives companies an edge that candidates appreciate and take into consideration in their decision-making process.

Once the offer and details of compensation have been sent to the candidate, it’s essential to close the offer quickly to prevent candidates from using your offer as leverage at other companies. Including a deadline for the offer can sometimes work to the company’s advantage by incentivizing candidates to make their decision faster. If negotiations stretch and the candidate seems to lose interest in the process, the hiring manager should assess whether the candidate is really motivated to be part of the team. Sometimes, it may move things along if the hiring manager steps in and has another brief call with the candidate to help remove any doubts about the type of work and projects. However, additional pressure on the candidates can often work to your disadvantage and may put off a skilled and motivated candidate in whom the company has already invested a lot of time and money.

Conclusion
In this article, you’ve looked at an overview of the process of hiring a data science team, including the roles and skills you might be hiring for, the interview process, and how to evaluate and make decisions about candidates. In a highly competitive data science job market, having a robust pipeline of talent, and a fast, fair, and structured hiring process can give companies a competitive edge.

Related Blogs

Comments

The Case for Reproducible Data Science

18/7/2022

Comments

Published by Domino Data Lab

Reproducibility is a cornerstone of the scientific method and ensures that tests and experiments can be reproduced by different teams using the same method. In the context of data science, reproducibility means that everything needed to recreate the model and its results such as data, tools, libraries, frameworks, programming languages and operating systems, have been captured, so with little effort the identical results are produced regardless of how much time has passed since the original project.

Reproducibility is critical for many aspects of data science including regulatory compliance, auditing, and validation. It also helps data science teams be more productive, collaborate better with nontechnical stakeholders, and promote transparency and trust in machine learning products and services.

In this article, you’ll learn about the benefits of reproducible data science and how to ingrain reproducibility in every data science project. You’ll also learn how to cultivate an organizational culture that promotes greater reproducibility, accountability, and scalability.

What does it mean to be reproducible?
Machine learning systems are complex, incorporating code, data sets, models, hyperparameters, pipelines, third-party packages, model training and development configurations across machines, operating systems, and environments. To put it simply, reproducing a data science experiment is difficult if not impossible if you can’t recreate the exact same conditions used to build the model. To do that, all artifacts have to be captured and versioned in an accessible repository. That way when a model needs to be reproduced, the exact environment, using the exact training data and code, within the exact package combination can be recreated easily. Too often it's an archeological expedition that can take weeks or months (or potentially never) when the artifacts are not captured at the time of creation.

While the focus on reproducibility is a phenomenon in data science, it has been a cornerstone of scientific research across all kinds of industries, including clinical and life sciences, healthcare, and finance. If your company is unable to produce consistent experimental results, that can significantly impact your productivity, waste valuable resources, and impair decision-making.

Situations Where Reproducibility Matters
In data science, reproducibility is especially vital for data scientists to apply the experimental findings to their own work.

Regulatory Compliance
In highly regulated industries like insurance, finance and life sciences, all aspects of a model have to be documented and captured to provide full transparency, justification and validation on how models are developed and used inside an organization. This includes the type of algorithm being used, why the algorithm has been selected and how the model has been implemented within the business. A big part of complying involves being able to exactly reproduce the results of a model at any time. Without a system for capturing the artifacts, code, data, environment, packages and tools used to build a model this can be a time consuming, difficult task.

Model Validation
In all industries models should be validated prior to deployment to ensure the results are repeatable, understood and the model will achieve its intended purpose. Too often this is a time intensive process with validation teams having to piece together the environment, tools, data and other artifacts that were used to create the model, which slows down moving a model into production. When an organization is able to reproduce a model instantly, validators can focus on their core function of ensuring the model is robust and accurate.

Collaboration
Data science innovation happens when teams are able to collaborate and compound knowledge. It doesn’t happen when they have to spend time painstakingly recreating a prior experiment or accidentally duplicate work. When all work is easily reproducible, and easily searched, it's easy to build on prior work to innovate. It also means that as team staffing changes, institutional knowledge doesn’t disappear.

Ingraining Reproducibility in Data Science Projects
Instilling a culture of reproducibility in data science across an organization requires a long-term strategy, technology investment, and buy-in from data and engineering leadership. In this section, you’ll learn about a few established best practices for conducting and promoting reproducible data science work in your industry.

Version Control
Version control refers to the process of tracking and managing changes to artifacts, like code, data, labels, models, hyperparameters, experiments, dependencies, documentation, as well as environments for training and inference.

The building blocks of version control for data science are more complex than software projects, making reproducibility that much more difficult and challenging. For code, there are multiple platforms, like GitHub, GitLab, and Bitbucket, that can be used to store, update, and track code, like Python scripts, Jupyter Notebooks, and configuration files, in common repositories.

However that isn’t sufficient. Datasets need to be captured and versioned as well. So do the environments, tools and packages. This is because code may or may not run the same on a different version of Python or R, for example. Data may have changed even if pulled with the same parameters. Similarly capturing different versions of models and corresponding hyperparameters for each experiment is important to reproduce and replicate the results of a winning model that might be deployed to production.

Reproducing end-to-end data science experiments is a complex, technical challenge that can be achieved much more efficiently using platforms like Domino’s Enterprise MLOps platform which eliminates all manual work and ensures reproducibility at scale.

Scalable Systems
Building accurate and reproducible data science models requires robust and scalable infrastructure for data storage and warehousing, data pipelines, feature stores, model stores, deployment pipelines, and experiment tracking. For machine learning models that serve predictions in real time, the importance of reproducibility is even higher in order to quickly resolve bugs and performance issues.

End-to-end machine learning pipelines involve multiple components, and an organizational strategy for reproducible data science work must carefully plan for the tooling and infrastructure to enable it. Engineering reproducible workflows requires sophisticated tooling to encompass code, data, models, dependencies, experiments, pipelines, and runtime environments.

For many organizations, it makes sense to buy (vs. build) such scalable workflows focused on reproducible data science.

Conclusion
Reproducible research is a cornerstone of scientific research. Reproducibility is especially significant for cross-functional disciplines like data science that involve multiple artifacts, like code, data, models, and hyperparameters, as well as a diverse set of practitioners and stakeholders. Reproducing complex experiments and results is, therefore, essential for teams and organizations when making important decisions like which models to deploy, identifying root causes when the models break down, and building trust in data science work.

Reproducing data science results requires a complex set of processes and infrastructure that is not easy or necessary for many teams and companies to build in-house.

Related Blogs

Comments

How to Generate Synthetic Data for Machine Learning Projects

16/6/2022

Comments

Published by Unbox.ai

Introduction
Machine learning models, especially deep neural networks, are trained using large amounts of data. However, for many machine learning use cases, real-world data sets do not exist or are prohibitively costly to buy and label. In such scenarios, synthetic data represents an appealing, less expensive, and scalable solution.

Additionally, several real-world machine learning problems suffer from class imbalance—that is, where the distribution of the categories of data is skewed, resulting in disproportionately fewer observations for one or more categories. Synthetic data can be used in such situations to balance out the underrepresented data and train models that generalize well in real-world settings.

Synthetic data is now increasingly used for various applications, such as computer vision, image recognition, speech recognition, and time-series data, among others. In this article, you will learn about synthetic data, its benefits, and how it is generated for different use cases.

What is synthetic data?
Synthetic data is a form of data augmentation that is commonly used to address overfitting deep learning models. It’s generated with algorithms as well as machine learning models to have similar statistical properties as the real-world data sets. For data-hungry deep learning models, the availability of large training data sets is a massive bottleneck that can often be solved with synthetic data.

Additionally, synthetic data can be used for myriad business problems where real-world data sets are missing or underrepresented. Several industries—like consumer tech, finance, healthcare, manufacturing, security, automotive, and robotics—are already benefiting from the use of synthetic data. It helps avoid the key bottleneck in the machine learning lifecycle of the unavailability of data and allows teams to continue developing and iterating on innovative data products.

For example, building products related to natural language processing (NLP), like search or language translation, is often problematic for low-resource languages. Synthetic data generation has been successfully used to generate parallel training data for training deep learning models for neural machine translation.

Generating synthetic data for machine learning
There are several standard approaches for generating synthetic data. These include the following:

Statistical approaches based on sampling from the source data distribution
Deep neural network–based methods such as variational autoencoders and generative adversarial networks

The choice of methods for synthetic data generation depends on the type of data to be generated, with statistical methods being more common for numerical data and deep learning methods being commonly used for unstructured data like images, text, audio, and video. In the following sections, you’ll learn more about the different types of synthetic data and then explore some techniques for generating it.

Types of synthetic data
Synthetic data can be classified into different types based on their usage and the data format. Generally, it falls into one of two categories:

Partially synthetic data, where only a specific set of the training data is generated artificially
Fully synthetic data, where the entire training data set consists of synthetic data

Partially synthetic data finds its application in use cases where sensitive data needs to be replaced in the original training data set. Fully synthetic data sets are used in domains like finance and healthcare, where privacy and compliance concerns restrict the use of original data.

Popular types of synthetic data, classified according to the data type, include the following:

Synthetic text
Synthetic media including images, audio, and video
Synthetic time-series data
Synthetic tabular data

Synthetic text finds its use in applications like language translation, content moderation, and product reviews. Synthetic images are used extensively for purposes like training self-driving cars, while synthetic audio and video data is used for applications including speech recognition, virtual assistants, and digital avatars. Synthetic time-series data are used in financial services to represent the temporal aspect of financial data, like stock price. Finally, synthetic tabular data is used in domains like e-commerce and fraud.

Techniques for generating synthetic data
Generating synthetic data can be very simple, such as adding noise to data samples, and can also be highly sophisticated, requiring the use of state-of-the-art models like generative adversarial networks. In this section, you’ll review two chief methods for generating synthetic data for machine learning and deep learning applications.

Statistical methods
In statistics, data samples can be assumed to be generated from a probability distribution with certain characteristic statistical features like mean, variance, skew, and so on. For instance, in the case of anomaly detection, one assumes that the nonanomalous samples belong to a certain statistical distribution while the anomalous or outlier samples do not correspond to this data distribution.

Consider a hypothetical machine learning example of predicting the salaries of data scientists with certain years of experience at top tech companies. In the absence of real-world salary data, which is a topic considered taboo, synthetic salary data can be generated from a distribution defined by the few real-world salary public reports on platforms like Glassdoor, LinkedIn, or Quora. This can be used by recruiters and hiring teams to benchmark their own salary levels and adjust the salary offers to new hires.

Deep learning-based methods
As the complexity of the data increases, statistical-sampling-based methods are not a good choice for synthetic data generation. Neural networks, especially deep neural networks, are capable of making better approximations of complex, nonlinear data like faces or speech. A neural network essentially represents a transformation from a set of inputs to a complex output, and this transformation can be applied on synthetic inputs to generate synthetic outputs. Two popular neural network architectures for generating synthetic data are variational autoencoders and generative adversarial networks, which will be discussed in detail in the next sections.

Variational autoencoders
Variational autoencoders are generative models that belong to the autoencoder class of unsupervised models. They learn the underlying distribution of a data set and subsequently generate new data based on the learned representation.

VAEs consist of two neural networks: an encoder that learns an efficient latent representation of the source data distribution and a decoder that aims to transform this latent representation back into the original space. The advantage of using VAEs is that the quality of the generated samples can be quantified objectively using the reconstruction error between the original distribution and the output of the decoder. VAEs can be trained efficiently through an objective function that minimizes the reconstruction error.

VAEs represent a strong baseline approach for generating synthetic data. However, VAEs suffer from a few disadvantages. They are not able to learn efficient representations of heterogeneous data and are not straightforward to train and optimize. These problems can be overcome using generative adversarial networks.

Generative adversarial networks
GANs are a relatively new class of generative deep learning models. Like VAEs, GANs are based on simultaneously training two neural networks but via an adversarial process.
A generative model, G, is used to learn the latent representation of the original data set and generate samples. The discriminator model, D, is a supervised model that learns to distinguish whether a random sample came from the original data set or is generated by G. The objective of the generator G is to maximize the probability of the discriminator D, making a classification error. This adversarial training process, similar to a zero-sum game, is continued until the discriminator can no longer distinguish between the original and synthetic data samples from the generator.

GANs originally became popular for synthesizing images for a variety of computer-visionproblems, including image recognition, text-to-image and image-to-image translation, super resolution, and so on. Recently, GANs have proven to be highly versatile and useful for generating synthetic text as well as private or sensitive data like patient medical records.

Synthetic data generation with Openlayer
Openlayer is a machine learning debugging workspace that helps individual data scientists and enterprise organizations alike to track and version models, uncover errors, and generate synthetic data. It is primarily used to augment underrepresented portions or classes in the original training data set. Synthetic data is generated from existing data samples, and data-augmentation tests are conducted to verify whether the model’s predictions on the synthetic data are the same as for the original data.

Conclusion
In this article, you learned about synthetic data for machine learning and deep learning applications. In the absence of real-world data, as well as other pertinent issues like privacy concerns or the high costs of data acquisition and labeling, synthetic data presents a versatile and scalable solution. Synthetic data has found mainstream acceptance in a number of domains and for a variety of data types, including text, audio, video, time series, and tabular data.

You explored these different types of synthetic data and the various methods for generation. These include statistical approaches as well as neural network–based methods like variational autoencoders and generative adversarial networks. Then you walked through a brief tutorial for generating synthetic data using deep learning methods. Finally, you saw the utility of third-party synthetic data generation products such as Openlayer, which can help companies rapidly scale their synthetic data requirements and accelerate model development and deployment.

Related Blogs

Comments

Surefire Ways to Identify Data Drift

26/5/2022

Comments

Published by Unbox.ai

Data drift refers to the phenomenon where the distribution of live, real-world data differs or “drifts” from the distribution of data used to train a machine learning model. When data drift occurs, the performance of machine learning models in production degrades, resulting in inaccurate predictions. This reduction in the model’s predictive power can adversely impact the expected business value from the investment in training. If data drift is not identified in time, the machine learning model may become stale and eventually useless.

In this article, you’ll learn more about data drift, exploring why and in what ways it occurs, its impact, and how it can be mitigated and prevented.

The Importance of Detecting Data Drift
Machine learning models operate in a dynamic environment but are trained on data from a fixed, statistical distribution. Data drift can occur due to a variety of reasons, including seasonal variations, new product features, changes in customer behavior, or even rare events like the Covid-19 pandemic.

Data drift is a critical challenge for production machine learning systems. It occurs when the statistical distribution of the target or real-world data diverges significantly from the statistical properties of the data on which the model was trained. This hurts model performance on new unseen data during real-world inference, leading to inaccurate predictions, poor customer experience, and monetary and reputational costs for the business.

If undetected, data drift can cause multiple problems besides the obvious loss in model performance. It leads to greater MLOps challenges and technical burden for teams such as identifying data drift, conducting root cause analysis for input features correlated with data drift, data labeling, active learning, retraining, and redeploying the updated models to production. This is a significant investment of time and resources that can be avoided if machine learning models are closely monitored and a strategy for detecting and fixing data drift is in place.

How to Identify Data Drift
It is common to assume that a loss in model performance may be due to data drift. However, before arriving at this conclusion, it is important to assess data quality. Target data distributions could change due to a new set of users, a feature or product update, or even something as simple as a bug or formatting error in the code or data. After data-quality issues are ruled out, data drift can be examined in more detail.

Fundamentally, data drift implies a change in the statistical distribution of target data from that of the training data. Thus, the simplest way to identify data drift is to compare summary statistics (like mean, variance, Kullback–Leibler divergence, and so on) of a carefully sampled subset of the target data against the training data. Other statistical measures include comparing the number of outliers in the two distributions or using the Kolmogorov–Smirnov test. Analyzing the correlations between input features and model predictions for both the data distributions can also shed some light.

Model-based machine learning techniques can also be used to identify data drift. A sample of data from the reference or training distribution can be labeled as 0, and an equivalent sample from the target distribution can be labeled as 1. Based on this input data, a simple binary classification model can be trained to discriminate between the two types of data distributions. If the model can distinguish between the two data sets, this implies that data drift is present. Alternatively, if the model fails to discriminate between the data sets, then no data drift is evident. Using a machine learning–based approach captures nonlinear relationships better and can help catch data drift where the above statistical methods might fail.

What are the different kinds of Data Drift?
The change in the statistical distribution between the target and training data manifests in different forms of data drift that are observed in real-world machine learning systems:

Covariate shift (feature drift)
Covariate drift refers to a data drift that is correlated with a shift in the independent variables or input features. The relationship between the features and target variables is unchanged, but the change in a few features leads to covariate drift. Covariate drift can occur due to sample selection bias and is frequently observed in nonstationary environments.

Concept drift
Concept drift is associated with a change in the relationship between the independent variables and the target variables. For instance, a particular machine learning model may suffer from concept drift when it is launched in a new geography where the customer behavior is markedly different from the behavioral data from the original geography that was used to train the models. Although the set of input features and the data distributions may remain the same, the model may not make any useful predictions and is rendered obsolete.

How can you mitigate and prevent data drift?
Once data drift is confirmed as real and significant after rigorous analysis and statistical tests as described above, it is important to address it sooner than later. Here are a few strategies for doing so.

Data labeling
Labeling the new target data is the first step toward addressing data drift. A carefully selected batch of the test data can be sampled and sent to subject matter experts for data annotation. Thereafter, this labeled target data from the modified data distribution can be incorporated into the original data distribution to ameliorate the impact of data drift.

Periodic model training
With newly labeled target data, the model can be retrained on data from both the original distribution and the test distribution. As the new model is now trained to recognize data from the modified target distribution, it typically does a better job in production than the original model. However, a model might need to be trained multiple times, depending on the rate of data drift, to capture the new patterns from the test data.

Model recalibration
With repeated model retraining, the training pipeline, model architecture, and hyperparameters may remain the same, with the only difference being the change in training data. However, if data drift is not taken care of with periodic retraining, it might be prudent to train the model from scratch with a fresh approach and insights learned from efforts focused on evaluating and mitigating data drift.

The new model may be trained differently from the original one in a number of ways:

Reweighting the more recent data samples;
Conducting error analysis to identify any segments on which the model fails consistently;
Testing different model architectures depending on the machine learning use case;
Evaluating model-adaptation strategies to fit the model to a new target domain.

Continuous monitoring
Continuous monitoring of machine learning model performance is critical to keep track of the quality of the model in production. Model performance metrics like true positives, false positives, precision, recall, F1-score, and AUC-ROC curves can be periodically assessed. After thresholds for such performance metrics are carefully selected, alerts can be triggered using platforms like Grafana or Prometheus or by using third-party-managed MLOps platforms.

Apart from output metrics, other things to monitor include any data issues or inconsistencies, bias in training data, and explainability metrics.

Conclusion
The phenomenon of data drift afflicts most machine learning models in production, arising from various reasons due to the dynamic nature of real-world data, seasonal trends, changes in product features, software- or data-related issues, changes in customer behavior due to new competition or legislation, and even rare black swan events like the Covid-19 pandemic.

Data drift can be of different types, depending on whether the relationship between the independent features and the target variables changes or not. This article has equipped you to know what data drift looks like and provided a list of best practices for identifying and mitigating it before it becomes a major MLOps challenge and renders the machine learning model unfit for its intended business purpose.

Related Blogs

Comments

Data Labeling and Relabeling in Data Science

6/5/2022

Comments

Published by Unbox.ai

Introduction
Supervised machine learning models are trained using data and their associated labels. For example, to discriminate between a cat and a dog present in an image, the model is fed images of cats or dogs and a corresponding label of “cat” or “dog” for each image. Assigning a category to each data sample is referred to as data labeling.

Data labeling is essential to imparting machines with knowledge of the world that is relevant for the particular machine learning use case. Without labels, models do not have any explicit understanding of the information in a given data set. A popular example that demonstrates the value of data labeling is the ImageNet data set. More than a million images were labeled with hundreds of object categories to create this pioneering data set that heralded the deep-learning era.

In this article, you’ll learn more about data labeling and its use cases, processes, and best practices.

Why is data labeling important?
Labeled data is necessary to build discriminative machine learning models that classify a data sample into one or more categories. Once a machine learning model is trained using data and corresponding labels, it can predict the label of a new unseen data sample. Data labeling is a crucial process as it directly impacts the accuracy of the model. If a significant proportion of the training data set is mislabeled, it will cause the model to make inaccurate predictions.

Data labeling of production data is also important to counter data drift. The model can be continuously improved by incorporating the newly labeled samples from the real-world data distribution into the training data set.

Poorly labeled data can also introduce bias in the data set, which can cause the models to consistently make inaccurate predictions on a subset of real-world data. Mislabelingcan severely impact the fairness and accuracy of models and warrants additional efforts to detect and eliminate labeling errors. Relabeling helps to address mislabeled samples, improving the data quality and, consequently, the accuracy of the machine learning models.

How is data labeling performed?
Again, data labeling helps train supervised machine learning models that learn from data and their corresponding labels. For example, the following text, sourced from the Large Movie Review Dataset, can be annotated in a number of ways depending on the use case:

I saw this movie in NEW York city. I was waiting for a bus the next morning, so it was 2 or 3 in the morning. It was raining, and did not want to wait at the PORT AUTHORTY. So I went across the street and saw the worst film of my life. It was so bad, that I chose to stay and see the whole movie,I have yet to see anything else that bad since. The year was 69,so call me crazy. I stayed only because I could not belive it.........1.

Use case: Sentiment analysis

Label: [Negative]

2. Use case: Named entity recognition

Label (Place): [NEW York city], [PORT AUTHORTY]

3. Use case: Spelling correction

Label (Typo): [belive], [AUTHORTY]

For the named entity recognition use case, data annotators have to review the entire text and identify and label any mention of places.

Typically, data annotation is outsourced to vendors who contract subject matter experts relevant for the specific machine learning use case. The team of annotators are assigned different batches of data to label on a daily basis for the duration of the project, using simple tools like Excel or more sophisticated labeling platforms like Label Studio. Labelers’ performance is evaluated in terms of metrics like overall accuracy and throughput—i.e., the number of samples labeled in a day.

If the same set of data samples are assigned to multiple annotators, then the labels given by each annotator can be combined through a majority vote. Inter-annotator agreementhelps to reduce bias and mislabeling errors.

For several use cases, data labeling can be extremely painstaking and time-consuming, which may lead to labeling fatigue. To counter this, labels assigned to each annotator undergo one or more rounds of review to catch any systematic errors. Once a batch of data is labeled, reviewed, and validated, it is shared with the data science team, who review select samples for labeling accuracy and verification and then provide feedback to the annotators. This iterative and collaborative process ensures that the final labels are of high quality and accuracy to use for training machine learning models.

How is data relabeling performed?
The repetitive and manual nature of data labeling is often fraught with errors. This necessitates the need to identify and relabel samples that were erroneously labeled the first time around. Relabeling is an expensive but necessary process as it is imperative to have a training data set of high quality. Unlike labeling, relabeling is usually done on a smaller sample of the entire data set and can be completed much faster if the samples are mislabeled in a unique way or associated with the same annotator.

Once a trained model is deployed, its predictions on real-world data can be evaluated. A detailed error-analysis process can sometimes reveal systematic prediction errors. Many times, these characteristic errors may be correlated with a certain type of data sample or feature. In such cases, having another look at similar samples in the training data can help identify mislabeled samples. More often than not, labeling errors on a certain segment of the training data can be captured through such error analysis and corrected with relabeling.

Best practices for data labeling
Data labeling can be prohibitively expensive and time-consuming for large data sets. As model development is contingent on the availability of good-quality labeled data, poor labeling can affect the timelines and prolong the time to build and deploy machine learning models.

A good practice for data scientists is to curate a comprehensive data-annotation framework for each use case before starting the data-labeling process. Clear, structured guidelines with examples and edge cases provide much-needed clarity for annotators to do their job with greater speed and accuracy. In the absence of domain experts within the company, external experts can be sought to discuss and conceptualize guidelines and best practices for labeling specific types of data.

As labeling of large data sets by domain experts can be quite expensive, in specific cases, data labeling can be crowdsourced to thousands of users on platforms like Amazon Mechanical Turk. Typically, labeling by crowdsourced users is fast but often noisy and less accurate. Still, crowdsourcing can be a significantly quicker method of collecting the first set of labels before doing one or more rounds of relabeling to eliminate errors.

Error analysis is another recommended practice to diagnose model prediction errors and iteratively improve model performance. Error analysis can be done manually by the data scientists or with greater speed and reproducibility using machine learning debugging platforms like Openlayer.

Another good practice, in the context of very large data sets for deep learning applications, is to leverage machine learning to obtain a first pass of labels using techniques like the following:

Conclusion
Machine learning and deep-learning models are typically trained on large data sets. To train such models, a label for each data sample is necessary to teach the model about the information in the data set. Labeling, therefore, is an integral aspect of the machine learning lifecycle and directly influences the quality and performance of models in production.

In this article, you’ve seen the importance, process, and best practices for efficient data labeling and relabeling. Mislabeled data samples introduce noise and bias in the data set that adversely impact the performance of the model. Identifying mislabeled examples through error analysis is a proven technique to improve the quality of training data that can be accelerated using machine learning debugging and testing platforms like Openlayer.

Related Blogs

Comments

Understanding and Measuring Data Quality

28/4/2022

Comments

Published by Unbox.ai

Introduction
Modern companies now unanimously recognize the value of data for driving business growth. However, high-quality data is much more valuable than data assets of poor quality. As companies accumulate petabytes of data from various sources, it becomes imperative to focus on the quality of data and filter out bad data.

Data is the fundamental building block for predictive machine learning models. Although having access to greater amounts of data is beneficial, it doesn’t always translate to better-performing machine learning models. Sampling training data that passes quality checks and meets certain acceptance criteria can significantly boost the accuracy of the model predictions.

In this article, you’ll learn more about why high-quality data is essential for building robust machine learning models, expanding on the various parameters that define data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. You’ll also explore a few mechanisms you can implement to measure and improve the quality of your data.

What is data quality?
Data quality is a measure of how suitable the data is for its intended applications in data analytics, data science, or machine learning. There are several dimensions along which data quality is measured, which include the following:

Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Validity

Measuring the quality of data in terms of the above parameters is critical for organizations to assess whether their in-house data is suitable for downstream applications.

Why is data quality important?
Data quality is an important determinant of the quality of decision-making within an organization. Poor-quality data leads to inaccurate analytics and machine learning models, which might adversely impact various business operations as well as customer experience. Decisions and business strategies based on flawed data can have massive consequences.
Typical data-quality issues include data security and data that is incomplete, duplicated, inconsistent, incorrect, missing, poorly defined, poorly organized, or stale.

In the context of data science use cases, the consequences of using poor-quality data can be immense—machine learning models trained on low-quality data invariably generate weak or inaccurate predictions, which are not easy to troubleshoot.

Deep-learning models in particular are very data-hungry, and their state-of-the-art performance is driven by the massive amounts of data on which they are trained. In this context, recent work has shown that training models with less data reflects real-world scenarios better and is increasingly becoming the norm.

The cost of bad data to organizations is also enormous—as per an IBM study, the yearly cost of poor-quality data in the US alone is equal to USD 3.1 trillion. Therefore, it is paramount for organizations to invest in proper measurement and evaluation of data quality before building data-driven applications or devising new business strategies.

Determining data quality
Several organizations, from IMF to World Bank, have formulated Data Quality Assessment Frameworks (DQAF) to establish clear guidelines for measuring the quality of data in terms of accuracy, completeness, consistency, timeliness, uniqueness, and validity.
This section will focus on each of these data-quality dimensions and discuss how they define the quality of data.

Accuracy
Accuracy, as the term implies, is a pivotal aspect of data quality—it means that the information is correct. Naturally, inaccurate information can cause many significant problems for a business.

For instance, consider an example in which the time of financial transactions is incorrectly recorded due to a failure to update to daylight saving time. In such a scenario, the timing offset could lead to inaccurate analysis and reporting of core business metrics like daily sales and revenue.

Such data inaccuracies can lead to potentially damaging consequences of incorrect financial and tax filings that could result in financial penalties by regulatory bodies.

Completeness
Completeness refers to how comprehensive the data is and whether it contains all the fields and values necessary to make them fit for the intended purpose. Incomplete data often contains empty or missing values across rows or columns and is unusable for further analysis.

For instance, if a customer’s email address is missing, then this customer may not feature in any marketing campaigns, resulting in a potential loss of business for the company.

Consistency
Consistency is another fundamental trait of data quality, as it can affect the usage of the entire data set. If a data set has millions of records but some rows store a customer’s name as “CustomerName” while the remaining rows store the same information as “FirstName” and “LastName” separately, it might lead to inaccurate results and analysis.

Another common example of inconsistent data is related to the underlying format or units of specific data fields. For instance, data like time is often kept in inconsistent formats, and units of money may be recorded differently from country to country.

Timeliness
Timeliness refers to how recent and up-to-date the information is. For a number of applications, timely data is essential as it captures the current trends and patterns in customer behavior or business health.

Data tends to lose its value over time and can drastically affect the quality of business decisions as well as predictions from machine learning models trained on older data. It can cost organizations lost time and money, in addition to reputational damage.

Uniqueness
Uniqueness refers to the lack of duplication or overlap within a data set or across data sets. Modeling redundant information can often lead to spurious correlations or results that can adversely affect statistical analysis as well as model predictions.
Thus, uniqueness is a critical dimension of data quality that is important to build trust in the data for downstream use cases.

Validity
For several data fields, validation checks are important. For instance, a mobile phone number is usually ten digits long, and zip codes in the US should have five digits. When data does not conform to standard formats or business-specific rules, it is said to be invalid. Invalid data can cause grave errors in downstream analytics and necessitates careful scrutiny of every data column before using it.

Truncation of data also leads to data-validity problems. For instance, a user may mistakenly input six digits for a US zip code, which gets truncated to five digits. While such an input may pass data-validation checks, it is ultimately inaccurate.

Additional sources of data-validity errors arise due to mismatched data formats. For instance, a data type like zip code may be inconsistently saved in numeric or string format.

Improving data quality
There are numerous methods for improving data quality.
The first step often involves data profiling—that is, doing an initial assessment of the current state of the data sets. Defining what is good data is also critical to establishing guardrails around selecting data for further usage.

Furthermore, a number of checks for data validation, completeness, consistency, and timeliness can be defined and have to be met by all current and new data sets.
Data standardization across the organization helps to meet data-quality standards so that every stakeholder across different divisions has the same understanding of the various data sets and fields.

Implementing a robust data governance framework can also help businesses improve the quality of organizational data.

Finally, recent advances in machine learning and deep learning can also be used to identify and improve the quality of data in a more scalable and reproducible fashion. For example, in the deep-learning study, a data-quality assessment framework grounded in statistics and deep learning was used to identify outliers in a data set of salary information published by the state of Arkansas, USA.

As the size of organizational data is bound to increase exponentially in the coming years, companies ought to allocate dedicated resources and investments in new techniques from fields like machine learning and deep learning to measure and provide statistical insights into the quality of their data.

Conclusion
In this article, you’ve learned what data quality is and why it is important for organizations to measure and evaluate the quality of their in-house data. Poor-quality data can have significant consequences for a business in terms of inaccurate analytics, predictive machine learning models trained on bad data, as well as ill-informed business decisions and strategies.

Data quality can be measured in terms of a number of parameters such as accuracy, completeness, consistency, timeliness, uniqueness, and validity. Each of these data-quality dimensions are important, and organizations can improve the quality of their data by having robust data profiling, standardization, and validation checks in place. More recently, advances from machine learning and deep learning can also be harnessed to quantitatively define and evaluate the quality of data.

Related Blogs

Comments

Benefits of FAANG companies for Data Science & Machine learning roles

7/3/2022

Comments

I receive several messages about the benefits of joining FAANG and similar companies and startups in the context of Data Science, Machine Learning & AI roles.

Here’s my take, in no particular order:

1. 𝐁𝐫𝐚𝐧𝐝. FAANG+ are not only the top technology companies but also the biggest companies by market cap -> great brand to add to your profile, top compensation and benefits.

2. 𝐒𝐜𝐨𝐩𝐞. The scope of AI/ML applications in these companies is tremendous as they have tons of data. You can get to work on multiple use cases, driven by statistics, machine learning, deep learning, unsupervised / semi-supervised / self-supervised, reinforcement learning etc. Internal team transfers facilitate expanding your breadth of ML experience.

3. 𝐁𝐚𝐫. The AI/ML work is cutting edge, as most of these companies invest heavily in R&D and create game-changing techniques and models. They also invest heavily in platform, cloud, services etc. that make it easier to build and deploy ML products.

4. 𝐑&𝐃. You can do both research on moon-shot projects if that’s your cup of tea, as well as more immediate business-driven data science projects with monthly or quarterly deliverables.

5. 𝐏𝐞𝐨𝐩𝐥𝐞. You get to work with the creme-de-al-creme in terms of talent, ideas, vision, and execution. Your own level will rise if you are surrounded by some of the brightest folks, and also get to collaborate with their clients and collaborators from academia, startups as well.

6. 𝐍𝐞𝐭𝐰𝐨𝐫𝐤. After FAANG, people go on to do many diverse things — from building a startup to doing cutting-edge research to non-profits to venture capital amongst others. You can find quality partners for the next steps of your career journey.

7. 𝐒𝐲𝐬𝐭𝐞𝐦𝐬. Processes and systems for AI/ML/Data are more mature and streamlined than smaller/newer companies which can facilitate your speed and execution of your projects.

8. 𝐂𝐮𝐥𝐭𝐮𝐫𝐞. The culture, on average, is more professional as these companies invest heavily in their employees and regularly come up with new employee-friendly policies to make it a great place to work.

9. 𝐅𝐫𝐞𝐞𝐝𝐨𝐦. After FAANG, you will be in demand and recruiters and hiring managers will seek you out if you’ve proved your chops whilst at the company. You will have more opportunities to sample from and greater freedom in terms of deciding your career and life trajectory, as you can also move internally to different countries.

10. 𝐈𝐦𝐩𝐚𝐜𝐭. Given the scale at which these companies operate, the scope for real-world measurable impact is enormous.

There are some downsides, caveats and exceptions as well, but on average these factors make FAANG and similar tech companies a very attractive proposition to launch, build and grow your career in data science and machine learning.

Comments

Why Data Democratization is important to your business?

1/3/2022

Comments

Introduction
"Data democratization" has become a buzzword for a reason. Modern organizations rely extensively on data to make informed decisions about their customers, products, strategy, and to assess the health of the business. But even with an abundance of data, if your business can’t access or leverage this data to make decisions, it’s not useful. To that end, data democratization, or the process of making data accessible to everyone, is quintessential to data-driven organizations.

Providing data access to everyone also implies that there are few if any roadblocks or gatekeepers who control this access. When stakeholders from different departments—like sales, marketing, operations, and finance—are permitted and incentivized to use this data to better understand and improve their business function, the whole organization benefits.

Successful data democratization requires constant effort and discipline. It’s founded on an organization-wide cultural shift that embraces a data-first approach and empowers every stakeholder to comfortably use data and make better data-driven decisions. As Transform co-founder James Mayfield put it, organizations should think about "democratizing insights, not data."

In this article, I will provide a detailed overview of data democratization, why organizations should invest in it, and how to actually implement it in practice.

Why democratize access to data?
Historically, data used to be kept in silos, usually under the purview of the IT or Analytics departments. When any stakeholder from outside these departments required data for their work, they had to go through these data gatekeepers to access the necessary assets. This philosophy has been the norm for decades but is no longer relevant for modern data-driven organizations.

Removing these types of bottlenecks is a necessary first step toward data democratization. Guidelines for data democratization can be noted in a data governance framework to improve access and provide high-quality data for downstream analytics. Improving access is just the first step of an ongoing process where every individual employee is encouraged and trained to make use of data. The more people who can make decisions based on data, the more the organization stands to benefit from a variety of perspectives and ideas.

Companies have been dedicating huge investments in data infrastructure and tooling in order to build an analytics advantage over their competitors. The dream is to “democratize data” and get employees to change their ways of working and start making decisions informed by data, not gut feelings. By investing in data education and helping analysts influence, then building modern tools to support metrics, we will continue making progress toward that goal of truly democratized data" —James Mayfield, co-founder, Transform

While data analytics and business intelligence efforts are traditionally the domain of data experts, organizations can empower non-technical stakeholders to perform basic data operations via in-house training programs, workshops, and self-service tools that can simplify their onboarding and learning process. They can also use software that surfaces data in an easy-to-consume format for business stakeholders.

Data democratization has multiple downstream benefits. It leads to greater data literacy, which can facilitate not only greater data-driven decision-making but also potentially lead to creation of new products or services based on insights mined from the data. Therefore, greater democratization, usage, and adoption of a data-driven approach can unlock massive commercial value and new growth levers for businesses.

How do you actually democratize data?
Implementing data democratization is a hard challenge and an ongoing process. To be successful, it needs support, buy-in, and a lot of patience from the leadership. Apart from conceptualizing and implementing curated data governance frameworks and policies, organizations can leverage tools to enable data democratization at scale.

Tools to enable data democratization

The Data Catalog
A data catalog is a collection of metadata that, combined with data management and search tools, helps data stakeholders find and acquire data for downstream analytics. A data catalog provides a managed and scalable data discovery and metadata management capabilities which are fundamental requirements of attaining higher levels of data democratization in an organization.

The Data Mart
A data mart is a subset of a data warehouse focused on a specific business vertical or data domain. Data marts enable specific users to access specific data that empowers them to quickly access these datasets without wasting time searching for the same in the data warehouse. For instance, individual departments like sales, marketing, operations, and finance can have their respective data marts for accelerating their domain-specific data-driven decision making.

The Metrics Catalog
A metrics catalog is a new layer in the modern data stack. It is a centralized store for all of your organizations’ most important metrics (or key performance indicators) and it's uniquely positioned between the data warehouse and downstream tools. As a self-service place for business KPIs, every stakeholder in the organization has access to track their own metrics and share context with others.

By capturing core business metrics in this fashion and this location in the modern data stack, a metrics catalog provides immense visibility and transparency into an organization's most critical metrics and metric lineage for all stakeholders in an organization. This new concept of a metrics catalog can have a significant role to play in democratizing data to everyone.

As a single source of ground truth for business data, a metrics catalog enables diverse stakeholders to base all key decisions on the same foundation. It also allows for disparate teams to use the same metrics, ask questions, and keep everyone aligned and on track. This greatly enhances the level of data democratization within an organization.

Challenges for data democratization
Although the benefits of data democratization are pretty evident, there are also numerous challenges. Some challenges are common, like data being kept in silos and unclear data ownership. The informational silos problem is antithetical to data democratization, and can adversely impact an organization's ability to leverage data for improvising its business performance and decision making.

Different teams have ownership of different types of data, which contributes to the problem of information silos. When a particular team has exclusive access to specific data assets, they not only hinder other teams from accessing the data but also guard their analysis and insights derived from the same data. This often leads to duplication of efforts across teams, causing a massive waste of organizational time and resources. As each individual team or department hoards its own data and analyses, it contributes to the adoption of the same undemocratic processes across other teams further compounding the challenges in promoting data democratization.

With greater access to the organizational data assets, there is also a challenge of data security, privacy, and potential misuse of the data. It increases the number of gaps in the organization which might become vulnerable to adversarial attacks and data breachers. This is why it’s important to have a balance between data security and data access—including having stronger safeguards around who can access and analyze personally-identifiable information and customer data.

Looking ahead
If implemented well, data democratization can provide an immense competitive edge that will only compound over time as organizations mature in their digital transformation journey.

Several tools and data artifacts can aid in better implementation and adoption of best practices and policies that help in democratizing data. A metrics catalog is one relatively new tool that provides a centralized store of business critical information accessible to multiple stakeholders. It captures essential business metrics and provides a simplified interface that is agnostic of the separate analytics, CRM, and BI platforms used by various teams in the organization. Learn more about how a metrics store can promote data democratization and governance at Transform.co.

Comments

How to ensure Data Quality through Governance

29/12/2021

Comments

Introduction
Data governance is a fundamental pillar of modern digital businesses. It refers to a framework of processes and guidelines that companies use to ensure all enterprise data assets are managed and utilized appropriately.

Even if an organization has large investments in data infrastructure and teams, without a structured data governance framework, organizations will struggle to harness the full value of their data.

A strong framework provides a clear set of guidelines for all employees who access and consume data in downstream applications. It also contributes to greater trust in the authenticity and quality of data and allows data stakeholders to focus on core data tasks instead of worrying about whether the data was created, processed, stored accurately, and in compliance with national or domain-related legislations like GDPR, HIPAA, CCPA, and data localization laws. Given recent data breaches, the importance of a structured data governance framework cannot be emphasized enough.

In this article, you’ll learn how to ensure data quality through better data governance mechanisms, leading to an increase in data informed decision-making. You’ll also learn how a clear data governance framework contributes to improved data quality and value creation across the entire organization.

Why do you need data governance?
The digital revolution is founded on data and the idea that data can generate insights that are critical for decision-making and long-term planning. With the emergence of cloud technologies, it’s easier for businesses to see the importance of data and store it in a more accessible, scalable, and secure way.

A data governance framework is a set of rules and processes for collecting, storing, and using data. This diagram shows a simplified outline for how to think about building a data governance framework for your organization.However, collecting and storing data is just the tip of the iceberg. Without a clear and robust governance framework, you can’t fully understand the value of your data. High-quality data will help you make the best possible decision for your company.

A data governance framework consists of several layers, stakeholders, business goals, and structured processes with a focus on information and project management. This accountability means organizations can build high-quality data products with confidence.

This is evident in the case of top technology companies like Google and Amazon that have invested early and massively in data and data-driven technologies. They benefited from investing and enforcing a data governance framework that lowers the organizational threshold, velocity, and efficiency with which businesses can adapt to change.

So, why is data governance important?
Investing in data governance leads to many benefits including:

Empowering data-driven business decision-making.
Increasing awareness, access, and utilization of data to achieve key business goals.
Facilitating compliance with local, national, or domain-specific regulations.
Promoting good practices among the various stakeholders, processes, and teams.
Influencing people and builds a culture of acceptance and trust in organizational data.
Reducing the threshold for adoption of new tools and technologies to help enforce data governance.
Increasing ownership and accountability of maintaining data quality across the organization.

Even though the process of defining and operating a framework can seem daunting, the proven benefits will make a lasting impact on your business.

Ensure data quality through governance
A major outcome of a solid data governance framework, if carried out properly, is improved data quality. When organizations follow these guidelines, it leads to a clearer understanding of their data assets and increases accountability.

First, think about your data lineage. Record the source of each data set and the date/time that it is accessed. It’s also critical to understand the teams that are accessing the data including the applications they’re using.This ensures compliance and prevents data breaches.

You can test data quality by asking different stakeholder teams to provide the value for a common business metric. More often than not, different teams will have conflicting answers for the same metric. This can be the result of a flaw in your data governance strategy, fuzzy guidelines, or scattered metrics logic across downstream tools.

Create policies that ensure data accuracy
Maintaining accurate data across the organization is difficult but rewarding. Once a new data asset is created, either internal or external, it needs to be systematically logged and entered into the appropriate databases.

Consistently using data governance best practices for completeness, relevance, reliability, and lifecycle can lead to better data quality and accuracy.

Develop practices to test data completeness
Data completeness refers to the wholeness of the data. Data is complete when there are no missing values, records, or duplicates. Basic automated checks to validate the number of rows and columns, dimensionality, missing and null values, and data format mismatch can help identify missing elements.

Adopt technologies to check data relevance
Data relevance refers to the utility of data in providing critical insights. It’s important to remember that not all data is useful or relevant to particular business problems, and identifying the right set of input data can help focus subsequent analytics and modeling efforts.

Track relevance with data reliability
Data reliability is an indicator of how useful and relevant it is over time. It builds upon the concepts of completeness and relevance, and is more likely to be used and reused by teams for their work. This lays the foundation for multiple use cases and business insights.
Stay compliant with data depreciation and lifecycleData timeliness and lifecycle management provides clear timelines for the validity and deprecation of data, ensuring that it’s used only when relevant and compliant with privacy laws. This regulates the lifecycle before it is depreciated or deleted permanently.

Standardizing metrics as part of your data governance strategy
Let’s take a look at how you can standardize your metrics through metrics catalogs and policies and build into a data governance strategy that ensures data quality.

Catalog metrics in a metrics storeStandard metrics like annual recurring revenue (ARR), gross merchandise value (GMV), customer acquisition cost (CAC), customer lifetime value (LTV), and net promoter score (NPS) are common. Once you've defined your metrics, these metrics can be stored in a metrics catalog for greater ease of access, use, and re-use across the organization.

A metrics catalog has several advantages. It reduces valuable organizational time and effort to reproduce the underlying analysis, and it creates a centralized metrics store that facilitates better understanding and decision-making.

As depicted in the figure below, a metrics store is a centralized and governed place for organizations to store key metrics, creating a repository for stakeholders to access key metrics in a repeatable way, regardless of where people access their data.

Policies and practices for sign-off
Before creating a metric, there needs to be a clear policy on the steps that people use to analyze and validate their business metrics. Data quality policies should not be treated as an administrative exercise but regarded as an important milestone in this stage of data transformation.

In addition to assigning an owner for each of your critical metrics, you should also think about executive sponsorship for the organization’s most important, “north-star” metrics. A stamp of approval from the C-suite or an executive sponsor conveys the importance of the data policy framework to the entire organization but can also be used to negotiate and expedite resolutions when conflicts arise.

Conclusion
In this article, you’ve learned about data quality as an index that can be used for many attributes of data in an organization. A data governance framework creates a set of best practices that improve data accuracy and relevance.

A data governance framework also makes it possible to distribute high-quality data to your teams in the most efficient way possible. Building a metrics store is a critical part of this process because metrics are the language that you use to express whether you achieved your organizational goals. A metrics store, like the Transform Metrics Store, centralizes all of this knowledge in one place for easy access and collaboration.

To learn more about the metrics catalog and other solutions, visit Transform.co.

Comments

Data Labeling: The Unsung Hero Combating Data Drift

27/10/2021

Comments

Introduction
Data drift is a common problem for production machine learning systems. It occurs when the statistical characteristics of the training (source) and test (target) data begin to differ significantly. As illustrated in the image below, the orange curve depicting the original data distribution shifts to the purple curve, representing a change in statistical properties like the mean and variance.

Understanding data drift is fundamental to maintaining the predictive power of your production machine learning systems. For instance, a data science team may have started working on a machine learning use case in 2019, using training data from 2018, but by the time the model is ready to go into production, it’s 2020. There could be a huge change in the distribution between the source data from 2018 and the live data coming from 2020.

Any time a machine learning model is ready to be shipped, it needs to be rigorously tested on live data. It’s critical that you detect data drift before deploying a model to production.

In this article, I’ll illustrate the various types of data drift and how data drift impacts model performance along with several examples. I’ll also address data labeling, one of the popular ways to tackle data drift, and how to perform data labeling efficiently.

Why Data Drift Happens?
In real-world situations, data drift can occur due to a variety of reasons:

seasonal trends
launch of a new feature or product that causes a change in the data or features used to train the machine learning model
rare black swan events like the COVID-19 pandemic that disrupted normal customer behavior in specific domains like food, travel, and hospitality

Continuing with the COVID-19 example, a model trained on data prior to the onset of global lockdowns, say from January to February 2020 will yield poor predictions on data in March and April 2020 after the lockdowns started. Thus, the original trained model is no longer relevant or practically useful and needs to be retrained.

Even small changes in the data structure or format of the source data can have significant consequences for machine learning models. For instance, a change in the format of a data field, like an IP address or hostname or ID, can often go undetected for a long time without effective root cause analysis.

Types of Data Drift
There are different types of data drift, but the two principal ones are:

Covariate drift refers to data drift associated with a shift in the independent variables. It happens when a few features change while still maintaining the same relationship between the feature and the target variable.

Covariate drift primarily occurs due to sample selection bias, which is a systematic bias in the selection of training data that results in a nonuniform and nonrepresentative training dataset. Nonstationary environments, where the training environment differs from the test environment, also cause covariate drift.

Concept drift, on the other hand, occurs when the relationship between the independent variables and the target variable changes.

Consider a product recommendation machine learning model in the context of e-commerce, where the original model is trained on user activity and transactions from users located in the US. Now imagine that the e-commerce company is going to launch in a new locale or market with the same product catalog as in the US. The original recommendation model will perform poorly when applied to users from the new market with significantly different online shopping behavior, financial literacy, or internet access for e-commerce.

In this example, the online shopping behavior of the users is markedly distinct. Even if the same features are used to train the machine learning model, it might underperform significantly. In such cases, concept drift is the root cause of data drift, and the personalization model needs to be reworked and include new features that better capture the new user behavior.

Overcoming Drift with Data Labeling
To overcome data drift, you need to retrain the model using all available data, including data from before and after drift occurred. New data needs to be labeled accurately before including it in the new training dataset.

Data labeling refers to the process of providing meaningful labels to target variables in the context of supervised machine learning where the target could be an image or text or an audio snippet.

In the context of data drift, data labeling is crucial to countering data drift, and thereby directly affects the performance of machine learning models in production.
Data labeling is integral to supervised machine learning where a model is fed input data along with relevant labels depending on the use case. For example, for a model learning to detect product placement in videos, the model is fed a video with products highlighted in the video.

Typically, data labeling is a manual exercise that’s both costly and time-consuming. It’s often outsourced to vendors in developing countries associated with low cost of labor. Annotators need to be trained to use labeling software, understand the machine learning use case and the annotation framework, and deliver highly accurate labels at a high velocity and throughput.

In such a scenario, labeling errors can occur, which exacerbates the problem of data drift if data from the new test or target distribution isn’t labeled accurately. In practice, several controversial labeling errors have occurred that cause reputational damage to the company, for instance, when Google Photos labeled two Black people as “gorillas.”

Big technology companies like Google and Facebook are grappling with such issues in their automated data labeling algorithms. Labeling errors can be made by human annotators, and also by machine learning models. Once trained, the predictions made by machine learning models on new data are often reused to augment the original training data to further improve the models. In such scenarios, data labeling errors can compound resulting in imperfect models that often yield such bizarre and controversial results.

Data labeling helps alleviate data drift by incorporating data from the changed distribution into the original training dataset. If enough new data is labeled, then it is possible to drastically reduce data drift by simply dropping the older data and only using the newly labeled data.

Therefore, proper and efficient data labeling is a crucial exercise with significant commercial impact, depending on the nature of the machine learning application. For example, incorrect data labels in a fraud detection use case can result in monetary loss every time the fraud detection machine learning model makes an incorrect prediction. Inaccurate data labels not only impact the performance of the machine learning model but also indirectly contribute to data drift. Any systematic data labeling errors may compound the problem as the model’s predictions on new data are typically leveraged to augment the training dataset.

Data labeling can be improvised and performed effectively through the use of intuitive software that enables human annotators to label data with high speed and low cognitive load. For additional improvement in data labeling, you can implement inter-annotator agreement; a particular training example is assigned a label that’s selected by a majority of the annotators. For example, if four out of seven annotators assign “Label1” to a particular data sample and the other three annotators assign it “Label2,” then the data sample would be tagged with “Label1.”

Strong operational practices including auditing of randomly selected labels for accuracy can improve the process and provide feedback about systematic labeling errors.

You can also use machine learning to aid data labeling with a model trained on a sample of data that’s labeled by humans to generate predictions on new or unlabeled data. These noisy labels can then be leveraged to build better machine learning models by incorporating the data samples associated with high probability and sending the data samples with low probability back to human annotators for more accurate labels. This process can be repeated iteratively to improve the overall performance of the model with minimal human data labeling efforts.

Conclusion
Data drift can have a negative impact on the performance of machine learning models as data distribution changes. This can cause a machine learning model’s predictive accuracy to go down over time if not countered effectively.

Data labeling is one technique to reduce data drift by applying labels to data from the new or changed distribution that the model does not predict well. This helps the machine learning model to incorporate this new knowledge during the training process to improve its performance.

There are several tools available today that enable annotators to label data efficiently. For example, Label Studio is an open-source data labeling tool that provides a platform for labeling different data types, including images, text, audio as well as multi-domain data. It’s already used by leading technology companies including Facebook, NVIDIA, Intel, so check it out if you’re looking for a robust, open-source solution for reducing data drift.

Comments

ML Engineer vs Data Scientist

21/10/2021

Comments

Published by Neptune.ai

Introduction
In 2010, DJ Patil and Thomas Davenport famously proclaimed Data Scientist (DS) to be the “Sexiest Job of the 21st century”. The progress in data science and machine learning over the last decade has been monumental. Data science has successfully empowered global businesses and organizations with predictive intelligence and data-driven decision-making to the extent that data science is no longer considered a fringe topic. Data science is now a mainstream profession and data science professionals are in high demand a cross all kinds of organizations from big tech companies to more traditional businesses.

A decade earlier the focus of data science was more on algorithmic development and modeling to extract robust insights from data. However, as data science has evolved over the decade, it has become clearer that data science involves more than just modeling. The machine learning lifecycle, from raw data through to deployment, now relies on specialized experts including data engineers, data scientists, machine learning engineers along with product and business managers.

The role of a machine learning engineer is gaining prominence across companies as they realized that the value of data science cannot be realized until a model is successfully deployed to production. Whilst a lot of tools and technologies such as Cloud APIs, AutoML, and a number of Python-based libraries have made the job of a data scientist easier, the MLOps of putting models into production and monitoring their performance is still quite unstructured.

For a detailed look at the respective skills, responsibilities, and tech stack of various profiles, ranging from a data scientist to a data science manager, refer to my previous article on how to build effective machine learning teams in the industry [2].

There are four core steps in executing a data science project:

Problem formulation – translating a business problem into a data science problem
Data engineering – preparing the data and pipelines to process raw data for modeling
Modeling – designing and experimenting with algorithms and models for the use case
Deployment – productionizing the model after testing and monitoring its performance.

In large tech companies and startups, there is a more established process of going about data science, and the work is clearly demarcated along the lines. Thus, it is common to expect professionals across various sub-domains to focus on their respective areas of specialization and collaborate with each other when required. However, in smaller organizations that do not have the luxury of having a large data science team, the first few data science hires are expected to work across these distinct functions as “full-stack” data scientists.

Thus, the definition and scope of a data scientist vs. a machine learning engineer is very contextual and depends upon how mature the data science team is. For the remainder of the article, I will expand on the roles of a data scientist and a machine learning engineer as applicable in the context of a large and established data science team.
In this article, I will:

review and compare the evolving roles and responsibilities of a data scientist and a machine learning engineer in the machine learning industry;
discuss the scope of each role, similarities and differences, and how to ensure strong communication and collaboration between these two core profiles without which data science projects are bound to fail.

Differences between Data Scientist & Machine Learning Engineer
In this section, I will discuss the primary differences in skills, responsibilities, day-to-day tasks, tech stack amongst other things.

The chief responsibility of a data scientist is to develop solutions using machine learning or deep learning models for various business problems. It is not always necessary to create novel algorithms or models as these tasks are research-intensive and can take up considerable time. In most cases, it is sufficient to use existing algorithms or pre-trained models, and optimize them in the context of the problem statement. However, in more innovative and R&D-focused teams or companies, scientists may be required to produce novel research and model artifacts.

On the contrary, the main goal of machine learning engineers is to take the models prepared by the data scientists and take them to production. This involves multiple aspects including model optimization to make it compatible with the custom deployment constraints and building MLOps infrastructure for experimentation, A/B testing, model management, containerization, deployment, and monitoring the model performance once deployed.

These factors translate into the underlying differences in skills, responsibilities, and tech stack for the respective roles as shown in the following tables.

Similarities, interference & handover
Similarities between Data Scientist and ML Engineer As evident from Tables 1-3, there is a partial overlap between the skills and responsibilities of data scientists and machine learning engineers. The tech stack is also quite similar and whilst data scientists are expected to mostly code in Python, machine learning engineers also need to know C++ for porting the model artifacts into a more efficient and faster format.

What machine learning engineers might lack in terms of subject matter expertise compared to data scientists, they make up for it in terms of knowledge of engineering tools and frameworks like Kubernetes that data scientists are less familiar with.

Data scientists usually have a STEM background or even advanced degrees like a Ph.D. in diverse fields like biology, economics, physics, mathematics amongst others. On the other hand, machine learning engineers generally have professional experience as software engineers.

While data scientists primarily deal with algorithmic and model development, machine learning engineers’ key focus is on scalable software engineering relevant to model deployment and monitoring, the remaining tasks are often common to both profiles.

In a few cases, these tasks might be shared depending on the size and maturity of the data science team, and things might work smoothly. However, more often than not, especially in larger teams and organizations, this can create considerable conflict and friction especially when data scientists and machine learning engineers work in different teams and report to different managers.

The handover processIt is possible to draw a clear line between the respective mandates of data scientists and machine learning engineers. Typically, data scientists will develop one or more candidate machine learning models and hand over these to the machine learning engineers following a specific contract.

The contract should specify:

the model accuracy,
latency,
memory,
the number of parameters,
the machine learning or deep learning framework used,
model versions,
the model predictions,
and the ground labels for the validation or test set amongst other parameters.

A structured handover contract ensures that the machine learning engineers have all necessary information to work on model optimization, any further experimentation, and deployment processes. After the handover, the data scientists become free to focus on the next machine learning use cases to take to production.

The collaboration between data scientists and machine learning engineers continues post-deployment and becomes critical especially when the models break in production. As the data scientists have greater insight into the working of the model, they are better positioned to troubleshoot and fix the models.

At the same time, some model failures are related to cracks in the underlying infrastructure developed by machine learning engineers, which they are in the best position to resolve. Continuous refinement of the model based on live data received by the model via active learning also falls under the domain of data scientists.

Communication & Collaboration between Data Scientists & ML Engineers
The success of a data science team is contingent on strong collaboration across the varied profiles [2]. Data scientists and machine learning engineers collaborate continuously during model development, deployment, and post-deployment monitoring and refinement. Ideally, if these two profiles ought to be part of the same team and report to the same leadership. In such a context, collaboration becomes easier and also fosters strong collegiality and learning from each other.

However, when data scientists and machine learning engineers are part of different teams and report to different leadership, the collaboration is not as strong as it should be. In such organizational settings, data scientists and machine learning engineers do not get to interact directly as much and rely on team productivity and project management tools like Slack, Teams, JIRA, Asana, etc.

For a lot of repetitive and common use cases, the use of such collaboration tools is actually a boon and saves the team a lot of time and effort. However, the transactional nature of relying on tools whose atomic units are tickets or tasks does not create a sense of team bonding and collaboration. In data science teams that rely heavily on such tools, this is a common grievance.

For more complex tasks or projects, in-person or video collaboration is a must and should not be ignored by the leadership. It is often in these settings that the technical professionals might learn of new use cases or clients from the business leaders, and the business professionals in turn might learn of a new technical breakthrough that could solve up-and-coming business use cases. The same holds true for data scientists and machine learning engineers as well, where each party could learn of either a new algorithm, or a model, or a new framework to make data science more effective and productive.

Current industry trends If a new version of the Harvard Business Review article in [1] were to be published in 2021, it would claim “machine learning engineer” as the sexiest job of the 2020s. While data science and model development is still a lucrative role across industry and academia, in recent years the focus in the industry has slightly shifted to building scalable and reliable infrastructure to serve data science models to millions of customers.
As of today, the machine learning engineer role is in much greater demand than that of a data scientist across the tech industry.

Industry leaders have learned that while it is great to have large, complex machine learning and deep learning models achieve state-of-the-art performance on academic benchmarks or training data, these do not yield any commercial value to the business until deployed and serving customer requests reliably and quickly at a high level of accuracy.
As more enterprises are becoming data-driven companies and establishing data science and machine learning teams or organizations [3], it is imperative for them to measure and achieve the required levels of ROI.
Big tech customer-focused companies that were early to venture into and invest in AI have already built strong teams of scientists and are now looking to enhance the production capabilities and commercialize the R&D artifacts developed by the data and research scientists.
Whilst top data scientists especially those with advanced degrees like PhDs will always be in high demand, we are currently witnessing a job market that is seeking skilled machine learning engineers whose supply is limited compared to data scientists.

The transition from Data Scientist to Machine Learning Engineer
There are numerous online courses on learning platforms like Coursera, Udacity, Udemy, etc. but there is a relative paucity of instructors and content focused on machine learning engineering practices. While building data science models can occur in a sandbox environment like Kaggle where the models are not made to serve real-world predictions, it is only possible to learn scalable model deployment, monitoring, and related machine learning engineering tasks in a real-world industry setting. As machine learning engineering and MLOps is a more applied discipline, there are fewer experts who have the required skillset to build and maintain robust infrastructure.

At the same time, existing data scientists, lured by the promise of greater potential impact, better compensation, and long-term career prospects are also seeking to transition into MLE roles.

As illustrated in tables 1, 2, and 3, there is considerable overlap between the two roles. However, machine learning engineers focus on the “engineering” aspects of taking models to production while data scientists focus on developing the right set of models for specific business problems. The most relevant skills that data scientists need to learn to become an effective machine learning engineer is software engineering including the ability to write optimized code, preferably in C++, rigorous testing, and understand and build and operate existing or custom tools and platforms for reliable model deployment and management.

It is definitely possible for data scientists to learn C++ and best practices in software engineering and software testing, as well as onboard new tools and technologies like Docker, Kubernetes, ONNX, and model serving platforms from multiple sources. However, since companies require machine learning engineers to have prior relevant experience, it becomes practically infeasible for data scientists to justify a machine learning profile if they do not have real-world hands-on experience in industry settings.

Given the chicken-and-egg nature of this problem, the best avenue for existing data scientists to transition to machine learning engineering is with their current employer. If data scientists express interest in machine learning engineering to their managers and are allowed to shadow or even assist and collaborate with machine learning engineers on specific projects, it becomes easier to make an internal transition within the same company. This represents a challenge for fresh graduates without any prior industry experience, and a similar internal transition route from data science or software engineering to machine learning engineering is the recommended pathway.

As the industry matures and companies evolve their machine learning systems and associated processes like hiring and upskilling, it will become easier for more candidates to make the transition from data science to machine learning engineering.

For more complex tasks or projects, in-person or video collaboration is a must and should not be ignored by the leadership. It is often in these settings that the technical professionals might learn of new use cases or clients from the business leaders, and the business professionals in turn might learn of a new technical breakthrough that could solve up-and-coming business use cases. The same holds true for data scientists and machine learning engineers as well, where each party could learn of either a new algorithm, or a model, or a new framework to make data science more effective and productive.

Conclusion
AI is a cornerstone of modern enterprise. This AI-revolution has accelerated significantly over the last decade and resulted in huge unmet demand for data science professionals. Data science as a discipline has also evolved, creating distinct profiles focused on data, modeling, engineering as well as product and customer success management. Of these profiles, machine learning engineers play a critical role in taking the models developed by data scientists based on the data prepared by data engineers and for use cases identified and developed by product or business managers to fruition.

Currently, the demand for machine learning engineers is similar to the demand for data scientists a decade ago. Such changes in the scope and nature of profiles in the AI industry will continue to happen, and present new challenging opportunities to engineers, scientists as well as business professionals to get their foot in the door.

References
[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
[2] https://neptune.ai/blog/how-to-build-machine-learning-teams-that-deliver
[3] https://neptune.ai/blog/building-ai-ml-projects-for-business-best-practices

Comments

Knowledge Distillation: Principles, Algorithms, Applications

7/10/2021

Comments

Published by Neptune.ai

Introduction
Large-scale machine learning and deep learning models are increasingly common. For instance, GPT-3 is trained on 570 GB of text and consists of 175 billion parameters. However, whilst training large models helps improve state-of-the-art performance, deploying such cumbersome models especially on edge devices is not straightforward.

Additionally, the majority of data science modeling work focuses on training a single large model or an ensemble of different models to perform well on a hold-out validation set which is often not representative of the real-world data.

This discord between training and test objectives leads to the development of machine learning models that yield good accuracy on curated validation datasets but often fail to meet performance, latency, and throughput benchmarks at the time of inference on real-world test data.

Knowledge distillation helps overcome these challenges by capturing and “distilling” the knowledge in a complex machine learning model or an ensemble of models into a smaller single model that is much easier to deploy without significant loss in performance.

In this blog, I will:

describe knowledge distillation in detail, its underlying principle, training schemes, and algorithms;
dive deeper into applications of Knowledge Distillation in deep learning for images, text, and audio.

What is knowledge distillation?
Knowledge distillation refers to the process of transferring the knowledge from a large unwieldy model or set of models to a single smaller model that can be practically deployed under real-world constraints. Essentially, it is a form of model compression that was first successfully demonstrated by Bucilua and collaborators in 2006 [2].

Knowledge distillation is performed more commonly on neural network models associated with complex architectures including several layers and model parameters. Therefore, with the advent of deep learning in the last decade, and its success in diverse fields including speech recognition, image recognition, and natural language processing, knowledge distillation techniques have gained prominence for practical real-world applications [3].

The challenge of deploying large deep neural network models is especially pertinent for edge devices with limited memory and computational capacity. To tackle this challenge, a model compression method was first proposed [2] to transfer the knowledge from a large model into training a smaller model without any significant loss in performance. This process of learning a small model from a larger model was formalized as a “Knowledge Distillation” framework by Hinton and colleagues [1].

As shown in Figure 1, in knowledge distillation, a small “student” model learns to mimic a large “teacher” model and leverage the knowledge of the teacher to obtain similar or higher accuracy. In the next section, I will delve deeper into the knowledge distillation framework and its underlying architecture and mechanisms.

Diving deeper into knowledge distillation
A knowledge distillation system consists of three principal components: the knowledge, the distillation algorithm, and the teacher-student architecture [3].

Knowledge In a neural network, knowledge typically refers to the learned weights and biases. At the same time, there is a rich diversity in the sources of knowledge in a large deep neural network. Typical knowledge distillation uses the logits as the source of teacher knowledge, whilst others focus on the weights or activations of intermediate layers. Other kinds of relevant knowledge include the relationship between different types of activations and neurons or the parameters of the teacher model themselves.
The different forms of knowledge are categorized into three different types: Response-based knowledge, Feature-based knowledge, and Relation-based knowledge. Figure 2 illustrates these three different types of knowledge from the teacher model. I will discuss each of these different knowledge sources in detail in the following section.

1. Response-based knowledge
As shown in Figure 2, response-based knowledge focuses on the final output layer of the teacher model. The hypothesis is that the student model will learn to mimic the predictions of the teacher model. As illustrated in Figure 3, This can be achieved by using a loss function, termed the distillation loss, that captures the difference between the logits of the student and the teacher model respectively. As this loss is minimized over training, the student model will become better at making the same predictions as the teacher.

In the context of computer vision tasks like image classification, the soft targets comprise the response-based knowledge. Soft targets represent the probability distribution over the output classes and typically estimated using a softmax function. Each soft target’s contribution to the knowledge is modulated using a parameter called temperature. Response-based knowledge distillation based on soft targets is usually used in the context of supervised learning.

2. Feature-based knowledge
A trained teacher model also captures knowledge of the data in its intermediate layers, which is especially pertinent for deep neural networks. The intermediate layers learn to discriminate specific features and this knowledge can be used to train a student model. As shown in Figure 4, the goal is to train the student model to learn the same feature activations as the teacher model. The distillation loss function achieves this by minimizing the difference between the feature activations of the teacher and the student models.

3. Relation-based knowledge
In addition to knowledge represented in the output layers and the intermediate layers of a neural network, knowledge that captures the relationship between feature maps can also be used to train a student model. This form of knowledge, termed as relation-based knowledge is depicted in Figure 5. This relationship can be modeled as correlation between feature maps, graphs, similarity matrix, feature embeddings, or probabilistic distributions based on feature representations.

Training There are three principal types of methods for training student and teacher models, namely offline, online and self distillation. The categorization of the distillation training methods depends on whether the teacher model is modified at the same time as the student model or not, as shown in Figure 6.

1. Offline distillation
Offline distillation is the most common method, where a pre-trained teacher model is used to guide the student model. In this scheme, the teacher model is first pre-trained on a training dataset, and then knowledge from the teacher model is distilled to train the student model. Given the recent advances in deep learning, a wide variety of pre-trained neural network models are openly available that can serve as the teacher depending on the use case. Offline distillation is an established technique in deep learning and easier to implement.

2. Online distillation
In offline distillation, the pre-trained teacher model is usually a large capacity deep neural network. For several use cases, a pre-trained model may not be available for offline distillation. To address this limitation, online distillation can be used where both the teacher and student models are updated simultaneously in a single end-to-end training process. Online distillation can be operationalized using parallel computing thus making it a highly efficient method.

3. Self-distillation
As shown in Figure 6, in self-distillation, the same model is used for the teacher and the student models. For instance, knowledge from deeper layers of a deep neural network can be used to train the shallow layers. It can be considered a special case of online distillation, and instantiated in several ways. Knowledge from earlier epochs of the teacher model can be transferred to its later epochs to train the student model.

Architecture
The design of the student-teacher network architecture is critical for efficient knowledge acquisition and distillation. Typically, there is a model capacity gap between the more complex teacher model and the simpler student model. This structural gap can be reduced through optimizing knowledge transfer via efficient student-teacher architectures.

Transferring knowledge from deep neural networks is not straightforward due to their depth as well as breadth. The most common architectures for knowledge transfer include a student model that is:

a shallower version of the teacher model with fewer layers and fewer neurons per layer,
a quantized version of the teacher model,
a smaller network with efficient basic operations,
a smaller networks with optimized global network architecture,
the same model as the teacher.

In addition to the above methods, recent advances like neural architecture search can also be employed for designing an optimal student model architecture given a particular teacher model.

Algorithms for knowledge distillation
In this section, I will focus on the algorithms for training student models to acquire knowledge from teacher models.

1. Adversarial distillation
Adversarial learning as conceptualized recently in the context of generative adversarial networks, is used to train a generator model that learns to generate synthetic data samples as close as possible to the true data distribution and a discriminator model that learns to discriminate between the authentic and synthetic data samples. This concept has been applied to knowledge distillation to enable the student and teacher models to learn a better representation of the true data distribution.

To meet the objective of learning the true data distribution, adversarial learning can be used to train a generator model to obtain synthetic training data to use as such or to augment the original training dataset. A second adversarial learning based distillation method focuses on a discriminator model to differentiate the samples from the student and the teacher models based on either logits or feature maps. This method helps the student mimic the teacher well. The third adversarial learning-based distillation technique focuses on online distillation where the student and the teacher models are jointly optimized.

2. Multi-Teacher distillation
In multi-teacher distillation, a student model acquires knowledge from several different teacher models as shown in Figure 7. Using an ensemble of teacher models can provide the student model with distinct kinds of knowledge that can be more beneficial than knowledge acquired from a single teacher model.

The knowledge from multiple teachers can be combined as the average response across all models. The type of knowledge that is typically transferred from teachers is based on logits and feature representations. Multiple teachers can transfer different kinds of knowledge as discussed in section 2.1.

3. Cross-modal distillation
Figure 8 shows the cross-modal distillation training scheme. Here, the teacher is trained in one modality and its knowledge is distilled into the student that requires knowledge from a different modality. This situation arises when data or labels are not available for specific modalities either during training or testing thus necessitating the need to transfer knowledge across modalities.

Cross-modal distillation is used most commonly in the visual domain. For example, the knowledge from a teacher trained on labeled image data can be used for distillation for a student model with an unlabeled input domain like optical flow or text or audio. In this case, features learned from the images from the teacher model are used for supervised training of the student model. Cross-modal distillation is useful for applications like visual question answering, image captioning amongst others.

4. Others
Apart from the distillation algorithms discussed above, there are several other algorithms that have been applied for knowledge distillation.

Graph-based distillation captures intra-data relationships using graphs instead of individual instance knowledge from the teacher to the student. Graphs are used in two ways – as a means of knowledge transfer, and to control transfer of the teacher’s knowledge. In graph-based distillation, each vertex of the graph represents a self-supervised teacher which may be based on response-based or feature-based knowledge like logits and feature maps respectively.
Attention-based distillation is based on transferring knowledge from feature embeddings using attention maps.
Data-free distillation is based on synthetic data in the absence of a training dataset due to privacy, security or confidentiality reasons. The synthetic data is usually generated from feature representations of the pre-trained teacher model. In other applications, GANs are also used to generate synthetic training data.
Quantized distillation is used to transfer knowledge from a high-precision teacher model (e.g. 32-bit floating point) to a low-precision student network (e.g. 8-bit).
Lifelong distillation is based on the learning mechanisms of continual learning, lifelong learning and meta-learning where previously learnt knowledge is accumulated and transferred into future learning.
Neural architecture search-based distillation is used to identify suitable student model architectures that optimize learning from the teacher models.

Applications of knowledge distillation
Knowledge distillation has been successfully applied to several machine learning and deep learning use cases like image recognition, NLP, and speech recognition. In this section, I will highlight existing applications and the future potential of knowledge distillation techniques.

1. Vision
The applications of knowledge distillation in the field of computer vision are plenty. State-of-the-art computer vision models are increasingly based on deep neural networks that can benefit from model compression for deployment. Knowledge distillation has been successfully employed for use cases like:

image classification,
face recognition,
image segmentation,
action recognition,
object detection,
lane detection,
pedestrian detection,
facial landmark detection,
pose estimation,
video captioning,
image retrieval,
shadow detection,
text-to-image synthesis,
video classification,
visual question answering, amongst others [3].

Knowledge distillation can also be used for niche use cases like cross-resolution face recognition where an architecture based on a high-resolution face teacher model and a low-resolution face student model can improve model performance and latency. As knowledge distillation can take advantage of different kinds of knowledge including cross-modal data, multi-domain, multi-task and low-resolution data, a wide variety of distilled student models can be trained for specific visual recognition use cases.

2. NLP
The application of knowledge distillation for NLP applications is especially important given the prevalence of large capacity deep neural networks like language models or translation models. State-of-the-art language models contain billions of parameters, for example, GPT-3 contains 175 billion parameters. This is several orders of magnitude greater than a previous state-of-the-art language model, BERT, which contains 110 million parameters in the base version.

Knowledge distillation is therefore highly popular in NLP to obtain fast, lightweight models that are easier and computationally cheaper to train. Other than language modeling, knowledge distillation is also used for NLP use cases like:

neural machine translation,
text generation,
question answering,
document retrieval,
text recognition [3].

Using knowledge distillation, efficient and lightweight NLP models can be obtained that can be deployed with lower memory and computational requirements. Student-teacher training can also be used to address multilingual NLP problems where knowledge from multilingual models can be transferred and shared by each other.

Case study: DistilBERT
DistilBERT is a smaller, faster, cheaper and lighter BERT model [4] developed by Hugging Face. Here, the authors pre-trained a smaller BERT model that can be fine-tuned on a variety of NLP tasks with reasonably strong accuracy. Knowledge distillation was applied during the pre-training phase to obtain a distilled version of BERT model that is smaller by 40% (66 million parameters vs. 110 million parameters) and faster by 60% (410s vs. 668s for inference on the GLUE sentiment analysis task) whilst retaining a model performance that is equivalent to 97% of the original BERT model accuracy. In DistilBERT, the student has the same architecture as BERT and was obtained using a novel triplet loss that combined losses related to language modeling, distillation and cosine-distance loss.

3. Speech
State-of-the-art speech recognition models are also based on deep neural networks. Modern ASR models are trained end-to-end and based on architectures that include convolutional layers, sequence-to-sequence models with attention, and recently transformers as well. For real-time, on-device speech recognition, it becomes paramount to obtain smaller and faster models for effective performance.
There are several use cases of knowledge distillation in speech:

speech recognition,
spoken language identification,
audio classification,
speaker recognition,
acoustic event detection,
speech synthesis,
speech enhancement,
noise-robust ASR,
multilingual ASR,
accent detection [10].

Case study: Acoustic Modeling by Amazon Alexa
Parthasarathi and Strom (2019) leveraged student-teacher training to generate soft targets for 1 million hours of unlabeled speech data where the training dataset consisted only of 7000 hours of labeled speech. The teacher model produced a probability distribution over all the output classes. The student model also produced a probability distribution over the output classes given the same feature vector and the objective function optimized the cross-entropy loss between these two distributions. Here, knowledge distillation helped simplify the generation of target labels on a large corpus of speech data.

Conclusions
Modern deep learning applications are based on cumbersome neural networks with large capacity, memory footprint, and slow inference latency. Deploying such models to production is an enormous challenge. Knowledge distillation is an elegant mechanism to train a smaller, lighter, faster, and cheaper student model that is derived from a large, complex teacher model. Following the conceptualization of knowledge distillation by Hinton and colleagues (2015), there has been a massive increase in the adoption of knowledge distillation schemes for obtaining efficient and lightweight models for production use cases. Knowledge distillation is a complex technique based on different types of knowledge, training schemes, architectures and algorithms. Knowledge distillation has already enjoyed tremendous success in diverse domains including computer vision, natural language processing, speech amongst others.

References
[1] Distilling the Knowledge in a Neural Network. Hinton G, Vinyals O, Dean J (2015) NIPS Deep Learning and Representation Learning Workshop. https://arxiv.org/abs/1503.02531
[2] Model Compression. Bucilua C, Caruana R, Niculescu-Mizil A (2006) https://dl.acm.org/doi/10.1145/1150402.1150464
[3] Knowledge distillation: a survey. You J, Yu B, Maybank SJ, Tao D (2021) https://arxiv.org/abs/2006.05525
[4] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019) Sanh V, Debut L, Chammond J, Wolf T. https://arxiv.org/abs/1910.01108v4
[5] Lessons from building acoustic models with a million hours of speech (2019) Parthasarathi SHK, Strom N. https://arxiv.org/abs/1904.01624

Comments

Developing AI Projects for Business - Best Practices

7/5/2021

Comments

Published by Neptune.ai

Introduction
Only 10% of AI/ML projects have created positive financial impact according to a recent survey of 3,000 executives.

Given these odds, it seems that building a profit generating ML project requires a lot of work across the entire organization, from planning to production.

In this article, I’ll share best practices for businesses to ensure that their investments in Machine Learning and Artificial Intelligence are actually profitable, and create significant value for the entire organization.

Best practices for identifying AI use cases
Most AI projects fail at the very first hurdle – poor understanding of the business problems that can be solved with AI. This is the main bottleneck in successful deployment of AI.
This problem is compounded by the early stages of organizational intuition for AI, and for how it can be leveraged to solve critical business problems [2].

What does this mean? Well, not every problem can be feasibly solved with AI. To understand if your particular problem can, you need tried and tested practices and approaches.

AI use cases
AI has transformed industries. It automates routine and manual processes, and provides crucial predictive insights to almost all business functions. Table 1 shows a list of some of the business use cases that have been successfully addressed using AI.

Brainstorming appropriate business problems should ideally be done together with business leaders, product managers, and any available subject matter experts. The list of business problems sourced across the organization should then be vetted, and analyzed for potential solutions using AI.

Not every business problem should be solved with AI. Oftentimes, a rule-based or engineered solution is good enough. Additionally, a lot of business problems can be mined from customer reviews or feedback, which typically points to broken business processes that need to be fixed.

In table 2, you can see a checklist of questions, both technical and commercial, to determine whether a business problem is relevant for AI.

KPIs and Metrics
As part of the planning process, the appropriate model and business metric for each potential use case should be discussed. Work backwards from the expected outcome, and it’ll be easier to crystallize which particular metric to optimize.
To illustrate this, in table 3 I prepared a list of AI use cases and corresponding model and business metrics. For the success of an AI project, it’s ultimately important to ensure the business metric and goals are achieved.

Prioritization
We have a set of business problems. They’ve been reviewed and documented after careful consideration of the criteria listed in Table 2, and analysis of appropriate business metrics as in Table 3. The candidate list of use cases needs to be prioritized, or ranked, in terms of impact and relevance to the overarching business strategy and goals.

From a detailed written document describing comprehensive facets of the business use case and potential AI-based solutions, it’s useful to have objective criteria to quantify all the proposed use cases on the same scale. Here, it’s crucial for product managers and business leaders to have their own intuition about how AI works in practice, or rely on the judgment of a product-focused technical or domain expert. Whilst it’s easy to rank projects on certain success criteria, it’s not so straightforward to rate the risk associated with AI projects.

A balanced metric ought to consider and weigh the likelihood and impact of a successful outcome of the AI projects versus the risk of it failing or not generating enough impact. Risks to the project might be related to organizational aspects, domain-specific aspects of the AI problem, or related to external factors beyond the remit of the business. Once a suitable balanced metric is defined, it aligns all stakeholders and leadership, who are then able to form their own subjective views based on the objective scores.

A lot of factors need to be considered before a ‘yes’ or ‘no’ decision is made for a particular AI project, as well as the number of AI-relevant projects selected for a defined period. Securing buy-in from the leadership is difficult. Certain final executive decisions might appear subjective or not data-driven, but it’s still absolutely critical to go through the aforementioned planning process to present each AI project in the best light possible, and maximize the likelihood of the AI project being selected for execution.

Best practices for planning AI use cases
As part of the planning process with cross-functional teams, it’s important for organizations to have a streamlined mechanism for defining the AI product vision or roadmap, the bandwidth, specific roles and responsibilities of individual contributors and managers in each team, as well as the technical aspects (data pipelines, modeling stack, infrastructure for production and maintenance).

In this section, I’ll describe the details of specific planning steps essential to build a successful AI product.

AI product requirements
For each identified use case, it’s necessary to draw the roadmap for how the product will evolve from its baseline version to a more mature product over time. In Table 4, I outline a set of essential questions and criteria to fulfil for creating a comprehensive AI roadmap for each use case.

PR-FAQ (Press Release – Frequently Asked Questions) and PRD (Product Requirements Document) are two critical documents that are generally prepared during the initial stages of product ideation and conception. Pioneered by Amazon, these two documents serve as the north star for all concerned teams to align themselves with and build and scale the product accordingly. It’s absolutely essential that all stakeholder teams contribute meaningfully to these documents and share their specific domain expertise to craft a meticulous document for executive review.

It’s necessary for all stakeholder team managers to review and contribute to the document, so that any team- or domain-specific intrinsic biases of product development are laid bare and addressed accordingly. Typically, teams should rely on data-driven intuition for product development. In the absence of in-house data, intuition for the AI product can be borrowed from work done by other companies or research in the same field [2, 4].
Data requirementsAs the roadmap is defined and finalized after stakeholder meetings, it’s always beneficial to have an MVP or a basic prototype of the AI product ready to validate initial assumptions and present to the leadership. This exercise also helps to streamline the data and engineering pipelines necessary to acquire, clean and process the data and train the model to obtain the MVP.

The MVP should not be a highly sophisticated model. It should be basic enough to successfully transform the input data to a model prediction, and trained on a minimal set of training data. If the MVP is hosted as an API, each of the cross-functional stakeholder teams can explore the product and build intuition for how the AI product might be better developed for the end customer.

From a data perspective, the machine learning team can dive deeper into the minimal training data, and do a careful analysis of the data as listed in Table 5.

Model requirements
After systematic assessment of the data quality, features, statistics, labels and other checks as listed in Table 5, the Machine Learning team can start building the prototype / MVP model. The best approach at the early stages of product development is to act with speed rather than accuracy. The initial (baseline) model should be simple enough to demonstrate that the model works, the data and modeling pipelines are bug-free, and the model metrics indicate that the model performs significantly better than chance.

Machine learning use cases and products have become increasingly complex over the years. Whilst linear regression and binary or multi-class classification models were once too common, there are newer classes of models that are faster to train, and generalize better on real-world test data. For the ML scientist or engineer, no two use cases may be built using an identical tech stack of tools and libraries. Depending on the characteristics of the data relevant for the AI use case (see Table 2), the data science team must define the modeling stack specific to each use case (see Table 6 below).

Best practices for executing AI use cases
After identifying and planning for promising AI use cases, the next step is to actually execute the projects. It might seem that execution is a straightforward process, where the machine learning team gets to weave their magic. But, simply ‘building models’ is not enough for successful deployment. Model building has to be done in a collaborative and iterative fashion:

involving feedback from users of the product as well as cross-functional teams,
incorporating any new or revised feature requests from product teams,
updating initial hypotheses for the use case based on any changes in the business or operating environment,
only then launching the product to users.

Shipping the model to production is a major milestone to celebrate, document and share within the organization – but the work doesn’t stop there. It’s crucial to monitor how the model performs on real world data from customers, and periodically apply fixes or update the model so that it doesn’t get stale along with changes in:

distribution of data,
nature of use cases,
customer behaviour,

In the next section, I will discuss the best practices for the operational aspects of executing and deploying AI models successfully and realizing the proposed commercial value.

Reviews and feedback
Once the AI project has kickstarted, it’s essential for the machine learning team to have both periodic as well as ad-hoc review meetings with stakeholders, including product teams and business leadership. The documents prepared during the planning phase (PR-FAQ and PRD) serve as the context in which any updates or changes should be addressed.

The goal of regular meetings is to assess the state of progress vis-a-vis the product roadmap, and address any changes in:

product or business strategy,
organizational structure,
resources allocated to the project.

While planning is important, most corporate projects don’t go as initially planned. It’s important to be nimble and agile, respond to any new information (regarding technical, product or business aspects), and re-align towards a common path forward. For example, the 2020 lockdowns severely impacted the economy. In light of such high-impact unexpected events, it’s critical to adapt and change strategy for AI use cases as well.
In addition to regular internal feedback, it’s good to keep in touch with the end users of the product throughout the AI lifecycle. In the initial stages (user research, definition of target user personas and their demographics), and especially in product design and interaction with the model predictions. A core group of users from the target segment should be maintained to obtain regular feedback across all stages of product development.

Once an MVP is ready, users can be very helpful in providing early feedback that can often bring to light several insights and uncover any biases or shortcomings. When the AI model is ready to be shipped and different model versions are to be evaluated, user feedback can again be very insightful. User insights about the design, ease of use, perceived speed and overall user flow can help the product team to refine the product strategy as needed.
Building iterativelyFrom the technical perspective, the model building process is usually an iterative one. After establishing a robust baseline, the team gets insight into how far the model performance is from the established acceptance criteria. In the early stages of model building, the focus should primarily be on accuracy rather than latency.

At each stage of model development, a comprehensive analysis of model errors on the validation set can reveal important insights into the model shortcomings, and how to address them. The errors should also be reviewed in conjunction with subject matter experts, to evaluate any errors in data annotation as well as any specific patterns in the errors.

If the model is prone to a particular kind of error, it might need additional features. Or it might need to be changed to a model based on a different objective function, or underlying principle, to overcome these errors. This repetitive process helps the machine learning team to consolidate their intuition about the use case, think outside the box, and propose new creative ideas or algorithms to achieve the desired metrics.
During the course of model building, machine learning practitioners should systematically document every experiment and the corresponding results. A structured approach is helpful not only for the particular use case, but also helps build organizational knowledge that can be helpful to onboard new hires, or serve as shining examples of successful AI deployment.

Deployment and maintenance
Once the candidate machine learning model is ready and benchmarked thoroughly on the validation and test sets, errors analyzed, and the acceptance criteria met, the model may be taken to production. There’s a huge difference between the model training and deployment environments. The format in which the model is trained may not be compatible with taking the model to production, and need to be appropriately serialized and converted to the right format.

In an environment that simulates the production settings, model accuracy and latency should be validated again on the hold-out dataset. Deployment should be done incrementally by surfacing the model to a small portion of real-world traffic or input to the model, ideally to be tested first by internal or core user groups.

Once the deployment pipeline has been rigorously tested and vetted by the MLOps team, more traffic can be directed to the model. In scenarios where one or more candidate models are available, A/B testing of these models should be done systematically, and evaluated for statistically significant differences to determine the winning model.

Post-deployment, it’s important to ensure that all the input-output pairs are collected and archived appropriately within the data ecosystem. The launched model should be periodically assessed and the distribution of the real-world data compared with the distribution of the training data to assess for data and model drifts. In such cases, an active learning pipeline that feeds some of the real-world test samples back into the original training dataset helps to alleviate the shortcomings of the deployed model.
Finally, once the model production environment and all pipelines are stable, the machine learning and product teams should evaluate the business metrics and KPIs to assess whether the metrics meet the predefined success criteria or not. In case it does, then only can the use case be deemed to be a success and a summary of the overall use case and results should be documented and shared internally with every stakeholder and the business leadership.

Wrapping up
If machine learning, product and business teams in startups and enterprises adopt a systematic approach and follow the best practices as laid out in this article, then the likelihood of successful AI outcomes can only increase.
Adequate upfront preparation is crucial. Without it, teams won’t be able to rectify any errors or respond to changes, nor realize the massive commercial potential that AI can deliver.

References

Comments

How to build Machine Learning Teams that Deliver

18/4/2021

Comments

Published by Neptune.ai

Introduction
In this article, I have documented the best practices and approaches to build a productive Machine Learning team that creates positive business impact and generates economic value within corporate entities, be it startup or enterprise.
If you do Machine Learning, either as an individual contributor or team manager, I’ll help you understand your current team structure and how to improve internal processes, systems and culture. We’ll explore how to build truly disruptive ML teams that drive successful outcomes.

Why build an ML team?
Artificial Intelligence (AI) is predicted to create global economic value of nearly USD 13 Trillion by 2030 [1]. Most companies across diverse industries and sectors have realized the potential value of AI, and are well on the way to becoming an AI-first entity. From tech companies building cutting-edge AI products like self-driving cars or smart speakers, to traditional enterprises leveraging AI for non-glamorous use cases like fraud detection or customer service automation, the potential of AI to deliver commercial impact is beyond doubt.
The adoption of AI in industry is accelerated by a number of trends:

Exponentially growing amount of data from internet-enabled devices, sensors, web platforms and so on, combined with drop in costs of storing, accessing and sharing data.
Cheaper computational costs of training AI models with the advent of GPU and cloud.
Innovation in algorithmic research and deep learning models that build upon the foundations of data and compute.

Given this paradigm shift in the last decade, the cost of building AI products and services has come down substantially. The door is now open for a diverse variety of players like enterprises, startups, entrepreneurs, students or hobbyists to innovate and create transformational AI.

In the following section, I will describe the challenges in building Machine Learning teams for startups and enterprises respectively.

Challenges for startups
Startups, in the early stages of operations, are typically bootstrapped and have limited budgets to deploy for building machine learning teams.
If your startup has a core product or service founded on AI, then it’s imperative to hire machine learning talent early on to build the MVP, and raise funding to hire more talent and scale the product.

On the other hand, for startups whose core product or service is focused on other domains like finance, healthcare or education, AI will either be incidental to the core operations, or not essential until product-market fit is achieved.
The main challenges of building ML teams in startups are:

Limited funding or budget for AI.
Scarce availability of training data to build machine learning models.
Lack of labelled data or funds to outsource labeling to third-party vendors.
No structured data warehousing, data pipelines or machine learning deployment infrastructure.
Require more hands-on machine learning talent that can operate across the entire machine learning lifecycle – from data engineering, algorithmic and model development to deploying and monitoring machine learning models in production instead of specialized talent to focus individually on the various aspects of the machine learning lifecycle.
Cannot hire the best machine learning talent in the market and match salaries offered by leading technology companies.
To compete with established players, faster machine learning development cycles are imperative. However, this can often lead to friction and inefficiencies in the absence of structured internal processes and management.

In the face of such daunting challenges of machine learning work combined with general organizational challenges at startups [2], it becomes even more important for startups to hire and build the right machine learning team from the very beginning.

Challenges for enterprise
Unlike startups, big organizations and enterprises don’t suffer from lack of funding or budget to seed a machine learning team. The challenges in an enterprise are unique from one entity to another, but generally arise due to the size of the organization, internal bureaucracy and slower decision making processes – things that tend to benefit startups and help them ship products faster.

Although today, it might appear that technology companies are ubiquitous, they’re still a minority compared to the vast number of traditional enterprises focused on diverse sectors like finance, FMCG, retail, healthcare, education and so on. Technology companies have a headstart when it comes to machine learning and AI, and their strong early focus and investment in AI R&D will ensure their dominance compared to their traditional counterparts.

However, there are numerous challenges that traditional enterprises face in adopting and onboarding AI across the organization [3], which more often than not result in failed AI projects and reduced trust in the capacity and potential of AI [4]:

Inability to develop a roadmap and vision for using AI to solve business challenges.
Obstacles in facilitating cross-functional and interdisciplinary collaboration to define valid AI use cases with well-defined KPIs.
Slow decision-making processes that are not founded on data but instinct and gut.
Risk-averse cultural mindset that is not conducive to building digital-first organizations.
Typical AI use cases are domain-agnostic that impact every aspect of the organization e.g. customer service automation, which do not excite or attract most data scientists.
View AI from a cost-saving perspective rather than a potential revenue-generating stream.
Inability to mobilize internal teams quickly to build a new AI product or enter a new market that has been validated by fast-moving startups.

Having detailed the core challenges faced by startups and enterprise, in the next section, I will describe the composition of a typical machine learning team and the skills and tools used by various profiles who work across the machine learning lifecycle. A better understanding of the skills and abilities of diverse machine learning teams is essential for hiring managers and teams to build out a complete and productive machine learning team.

Profiles in a Machine Learning team
Modern machine learning teams are truly diverse. Yet, at the core, they involve candidates who have strong analytical skills and the ability to understand data from different domains, train and deploy predictive models, and derive business or product insights from the same.

SCOPING
The first stage of scoping out an AI use case requires AI experts along with business or domain experts. Plenty of successful AI projects start with a deep understanding of the potential business problems that can be solved with AI, and require the combined intuition and understanding of seasoned technical and business experts. In this stage, the usual collaborators involve business leaders, product managers, AI team managers and perhaps one or more senior data scientists with deep, hands-on experience with the underlying data.

DATA
The second stage is focused on acquiring data, cleaning, processing from the raw form to structured format and storing it in specific on-premise databases or cloud repositories. In this stage, the role of the data engineer is prominent, alongside data scientists. The business and product managers serve a helpful role in providing access to the data, metadata and any preliminary business insights based on rudimentary analytics.

MODELING
The third stage involves core data science and machine learning modeling using the datasets prepared in the previous stage. In this stage data scientists, applied or research scientists are predominant in training initial models, refining them based on test set performance and feedback from cross-functional stakeholders, developing new algorithms if needed, and finally producing one or more candidate models that meet the required accuracy and latency benchmarks to take the models to production.

DEPLOYMENT
The final stage of the machine learning lifecycle is focused on deploying trained models to production, where they serve predictions from the inputs received from end users. In this stage, machine learning engineers take the models developed by the data/applied/research scientists and prepare them for production. If the models meet the predefined accuracy and latency benchmarks, the models are good to go live. Otherwise, ML engineers work on optimizing the model size, performance, latency and throughput. Models go through systematic A/B testing procedures before deciding which version(s) of the models are best suited for deployment.

Next, I prepared detailed profiles for the different types of experts you may need for your ML team.

Data Engineer
Skills

Database
Programming
Querying languages
Data Pipelines
Architecture
Analytics
Data manipulation, transformation and preprocessing
Cloud services
Workflow management

Responsibilities

Build data pipelines, architectures and infrastructure
Clean and process datasets for data science modeling
Build internal tools to optimize data engineering workflow
Aggregate disparate datasets for defined use cases
Support data scientists with data related requirements

Tech stack

SQL, MySQL
Java
Python
C++
Scala
Spark
Hadoop
Kafka
Databases: Postgres, MongoDB, Cassandra, Redis, Hive, Storm
Cloud: AWS/Azure/GCP, EC2, EMR, RDS, Redshift

Data Scientist
Skills

Problem solving
Programming
Statistics
Data Analytics
Data Visualization
Data Science
- Supervised machine learning
- Unsupervised machine learning
Written and verbal communication to work with cross-functional teams
Presentation skills to present results and insights to leadership
Derive statistically valid insights from data science models to improve product development, marketing or business strategies

Responsibilities

Identify and validate business use cases that can be solved with AI
Analyze and visualize data at different stages of the modeling pipeline
Develop custom algorithms and data science models
Identify additional datasets or generate synthetic data
Develop data annotation strategies and validate the same
Coordinate with cross-functional teams to seek feedback on models, share results and implement models
Develop custom tools or libraries to optimize the entire data science modeling workflow

Tech stack

Python
R
Java
Jupyter notebooks
Visualization: Matplotlib, Seaborn, Bokeh, Plotly etc.
SQL
Spark
Git, Github/Bitbucket
Cloud: AWS/Azure/GCP; SageMaker, S3, Boto
Machine learning: Scikit-learn, Fast.ai, AllenNLP, OpenCV, HuggingFace
Deep learning: TensorFlow, PyTorch, MXNet, JAX, Chainer etc.
Hyperparameter tuning: Neptune, Comet, Weights & Biases

Machine Learning Engineer
Skills

Data structures
Data modeling
Programming
Software engineering
ML frameworks like TensorFlow, PyTorch, Scikit-learn etc.
Statistics
Conceptual knowledge of ML to understand use cases and interact with data scientists and other stakeholders

Responsibilities

Deploy models to production
Optimize models for better latency and throughput
A/B testing of candidate models
Inference testing on variety of hardware: edge, CPU, GPU
Monitoring model performance, maintenance, debugging
Maintaining model versions, experiments and metadata

Tech stack

Linux
Cloud: AWS/Azure/GCP; SageMaker, S3, EC2, Boto
Machine learning: Scikit-learn, Fast.ai, AllenNLP, OpenCV, HuggingFace
Deep learning: TensorFlow, PyTorch, MXNet, JAX, Chainer etc.
Serving: TensorFlow Serving, TensorRT, TorchServe, MXNet Model Server
Python
C++
Scala
Bash
Git, Github/Bitbucket

Research Scientist
Skills

PhD in a quantitative discipline like Computer Science, AI, Physics, Biology, Economics etc.
Scientific mindset and first-principles thinking
Depth and breadth of state-of-the-art approaches in science
Prior experience in conducting academic or industry research
Creative problem solving
Machine Learning
Deep Learning
Design and develop ML prototypes and models
Written and verbal communication to work with cross-functional teams
Presentation skills to present results and insights to leadership
Derive statistically valid insights from data science models to improve product development, marketing or business strategies

Responsibilities

Conduct research for novel ML use cases and applications
Build initial ML prototypes and models
Conduct systematic experiments across multiple models and hyperparameter combinations
Create or augment datasets
Clean, process, analyze and visualize data and model performance
Keep up-to-date with new research literature and state-of-the-art machine learning and deep learning approaches
Evangelize new ML approaches and ideas
Mentor software, data and ML engineers

Tech stack

Python
Jupyter notebooks
Machine learning: Scikit-learn, Fast.ai, AllenNLP, OpenCV, HuggingFace
Deep learning: TensorFlow, PyTorch, MXNet, JAX, Chainer etc.
Linux
Cloud: AWS/Azure/GCP; SageMaker, S3, EC2, Boto
Hyperparameter tuning: Neptune, Comet, Weights & Biases
Git, Github/Bitbucket

Product Manager + Business Leader
Skills

Product design, marketing
Data analytics
Subject matter expertise in one or more domains
Program management
Business development
Understanding of product roadmaps and end to end project delivery
Understanding of software, architecture, data and machine learning best practices
Basic knowledge of fundamental machine learning concepts, process, metrics and deployment
Excellent written and verbal communication skills to work across customers, business and technical teams
Interpersonal skills including persuasion and getting work done from stakeholder teams
Awareness of distinction between managing and building machine learning vs. software products

Responsibilities

Create detailed product, feature roadmaps with milestones, deliverables, metrics and business impact
Conduct surveys with customers of the proposed products to streamline design, UX and reduce friction
Balance multiple priorities from stakeholders, customers to define and deliver the product
Work with software engineering and machine learning teams to iterate and improve models as per the roadmap
Take ownership of the product and ensure delivery of features and the entire product under tight deadlines

Tech stack

SQL to pull and analyze data to build intuition for the product
Excel
Work management tools
Productivity tools
Scheduling tools

Data Science / Machine Learning Manager
Skills

Complete understanding of machine learning lifecycle from conception to production
Written and verbal communication skills
Interpersonal and persuasion skills to get buy-in from leaders, work with cross-functional teams
Vision to create and collaborate on product features and roadmap with product and business managers
Practical and theoretical understanding of machine learning and deep learning concepts, deployment and continual improvement of ML products
Mentorship to individual machine learning contributors
Program management alongside product and project managers

Responsibilities

Own machine learning products and roadmap
Create vision for novel machine learning products
Align stakeholders towards proposed product vision
Understanding of machine learning metrics vs. business metrics
Evangelize capabilities and successes of machine learning team within the company as well as in the wider ecosystem
Hire machine learning talent who specialize across the machine learning lifecycle
Collaborate and dive deep with product and business teams to identify potential use cases that can be solved with machine learning
Understand how the business and its products and services work from the point of view of the customer
Understand the business revenue streams and come up with original ideas for machine learning projects that improve the product or service, reduce costs, and automate manual processes

Tech stack

Python
SQL
Excel
Productivity, collaboration and communication tools

Building productive and impactful Machine Learning teams
We explored the typical composition of a Machine Learning team, which includes a variety of different profiles specialized in specific aspects of building machine learning projects. However, the reality on the ground is that having a solid machine learning team is not a guarantee that the team will create and deliver massive business impact. The reality on the ground is that the vast majority of corporate AI projects fail, and a lot of these projects fail despite having a great machine learning team.

In this section, I will dive deeper into the cultural, procedural and collaborative aspects of building impactful machine learning teams from first-principles. The success of a machine learning team is founded on several factors related to systems, processes, and culture. When built the wrong way, this will inevitably lead to failed projects and erosion of trust and confidence in the team, as well as machine learning as a business capability and competitive edge.

1. Working on the right AI use cases
For a brand new machine learning team to deliver impact in an organization, it’s paramount that the team starts off on the right foot. Early traction is critical to build trust in the organization, evangelize the potential of AI across business verticals, and leverage early successes to deliver riskier or moonshot projects with greater impact.

2. Planning for success – measuring impact
As part of the process of selecting and defining the right AI use cases, it’s fundamental to critically assess and evaluate the business impact and return on the investment in the particular machine learning project. The best approach for evaluation is by defining a set of metrics that address several aspects of the project and its potential impact.

Technical metrics
For classification models:

Accuracy
Precision
Recall
F1 score
AUC

For regression models:

Root mean squared error
Adjusted R2
Mean absolute error

For deep learning models (depends on the particular application):

Perplexity, cosine similarity, Jaccard similarity, BLEU (NLP)
Word Error Rate (speech recognition)
Intersection Over Union, Average Precision, Mean average precision (computer vision)

Business metrics
Business metrics are defined by first-principles, and are often downstream metrics that are impacted by the machine learning models. For measuring outcomes, it’s crucial to a priori identify the relevant business metrics and track the effect of the machine learning models on the same during A/B testing, deployment, and continuously monitor live models.
Standard business metrics aim to capture levels of trust, satisfaction, faults, and SLAs, among others.

Once a candidate set of machine learning projects is scoped, defined and formulated from conception to production with associated set of metrics, each project needs to be evaluated by leadership teams from the perspective of high-level organizational goals to be achieved in a defined time period. Leaders need to balance the business impact (on the opline or bottomline), budget, team bandwidth, time savings, efficiency savings, and the urgency for delivering projects in the short-term vs. the long term. Executives need to incorporate multiple factors to arrive at a carefully considered decision to give the green signal for one or more machine learning projects.

3. Structured processes – Agile, Sprints
Once a project is defined and has the go ahead from the leadership team, it is important to ensure that systems and structured processes are in place to ensure that the machine learning team can work unhindered and execute the project in a timely fashion as per the agreed plan.

Key operational infrastructure like data warehouse, database management systems, data ETL pipelines, metadata storage and management platforms, data annotation frameworks and availability of labeled data, access to compute on-prem or in the cloud, licensed as well as open source tools and softwares that streamline the model training process, machine learning experiment, results and metadata management tools, A/B testing platforms, model deployment infrastructure and solutions, continuous model monitoring and dashboards are integral for a smooth data processing, model building, and deployment workflow. However, the existence of such key skeletal infrastructure for machine learning varies from one organization to another depending on how mature the machine learning organization or the company is.

Apart from the infrastructure, processes related to planning tasks of the individual contributors of the project using sprints and agile frameworks need to be hardwired and accessible to all stakeholders of the project. While Agile processes have worked well for software projects, machine learning projects are different and may not be that well suited to the same frameworks. Although similarities like iterative model building and refining based on feedback exist, machine learning projects are more sophisticated, as the fundamental blocks include data and models in addition to code.

While software engineering best practices like code review and versioning are very well established, the same rigor and structure is not always applied to data and machine learning models. Documentation is another aspect that is even more critical to keep track of multiple hypotheses, experiments, results and all the moving parts associated with machine learning projects.
In the absence of well entrenched tools and best practices, most data science work tends to be highly inefficient where data scientists end up spending a lot of time on routine chores that can be automated. It’s imperative that managers try to reduce such barriers to more efficient and productive work, so that the machine learning teams can focus exclusively on their work.

4. Clear communication within and across teams
Communication is an essential skill for data scientists. Machine learning is a more intricate discipline and the end results might often be too obscure for generalist and non-technical managers of data science, product or business teams to comprehend easily. However, communication is just the tip of the iceberg, and many more interpersonal skills like persuasion, empathy, collaboration are exercised on a regular basis whilst working in cross-functional teams.

Writing emails of results or updates or slide presentations to stakeholders and leadership, live demos, expounding the project for product review documents, writing up the entire project for a blog meant for lay audience or for a journal or conference meant for a technical audience, requires strong writing skills. Typical data scientists may be more proficient in writing code than words, so the organization should invest in corporate training programs for data scientists that include training in written and spoken communication skills.

Oral communication skills can’t be underestimated either, and are increasingly important in remote-first organizations. Effective stakeholder management involves building rapport and trust and establishing clear channels of communication, which is much harder to do if a data scientist is not able to speak and communicate clearly in an engaging and delightful manner. Although a lot of workplace productivity apps have created digital channels of reduced in-person communication, the power of live in-person communication with peers, stakeholders and leaders often gets the job done faster.

Clear communication destroys information silos, so that each stakeholder is aware, updated and aligned with the progress of various machine learning projects. Regular meetings are important to have checks and balances, in addition to documented progress in tools to ensure that projects are moving in the right direction.

5. Effective collaboration with business
Machine learning teams are typically part of the engineering or technology organizations in a company. While this makes natural sense for effective collaboration across colleagues from data, analytics, engineering functions, regular interaction with business teams is a must. Given the fact that most machine learning models are built on historical ‘business’ data that can change in a predictable manner due to new product or feature launches or seasonality patterns, as well as in an unpredictable manner, for instance, during Covid-19 lockdowns, machine learning teams must have a real-time awareness of how the business data is changing on the ground.

Not only is it important to adjust the underlying hypotheses in the face of massive changes in customer behavior or new product launches, but also to correct the planned course of action if initial assumptions are violated or the data changes too dramatically for the machine learning models to be relevant or have the same impact as before.
Business teams are in the best position to give feedback on early prototypes based on their domain expertise, validate new assumptions or ideas by doing customer research and surveys, and evaluating the impact of deployed machine learning models. For these reasons, the partnership between machine learning and business teams needs to be mutually beneficial and symbiotic.

Leaders of machine learning teams need to build close ties with business teams and encourage team members to do the same.

6. Creating a culture of innovation
For long-term success of machine learning teams, apart from working on the right use cases and facilitating collaborative work across the organization, it’s imperative to build a culture that embraces and rewards innovation. Here, leadership should lead by example and encourage innovation and R&D across different business verticals.
For a machine learning team, it’s critical to make a mark in the ecosystem through patent applications, journal or conference publications, outreach and dissemination via meetups, workshops, seminars by leading experts, collaboration with startups and academic organizations as needed, and so on. Most organizations don’t focus on building such a thriving culture that promotes exchange and cross-fertilization of new ideas and technologies, which can often impact current organizational processes and thinking in a substantial way.

Leaders also need to build strong diverse teams and hire new talent, from entry level graduates to experienced engineers and scientists. The inflow of new talent brings in novel ideas that can positively impact the work culture. Otherwise stasis sets in, teams can become narrow-minded, and decline in their capacity to innovate and launch impactful products. Meritocratic executive decisions strongly impact culture, both in terms of promoting talent that demonstrates a consistent track record of exceptional bar-raising work, as well as letting go of non-performing individuals or managers. The appropriate balance and culture in a team is an ongoing process, but it’s important for leaders to ensure that at no point in time, the members of a machine learning team are unmotivated and uninspired by the systems, processes, and culture within the organization.

7. Celebrating and sharing AI success stories
Finally, given the low odds of success for AI projects at present, it’s important to make sure that any AI success stories are widely shared within the organization to attract the attention of other business teams who could potentially partner with the machine learning team. Furthermore, given the immense popularity of AI as a discipline, success stories might also attract potential new team members from within the company who feel motivated to upskill in machine learning and become a data scientist.

It’s important to recognize the effort of the core contributors to the success of AI projects in a public manner within the company and not behind closed doors. It helps to build morale and confidence and foster a meritocratic culture within the team that will help them in their career development. Additionally, wherever possible, the leadership should take steps to share such AI success stories widely within the broader ecosystem in which the company operates, for instance, via company blogs, social media posts, podcasts or talks at meetups, workshops or conferences.

For a machine learning team to continue to deliver strong performance and results, it’s critical to build a portfolio of successful projects starting from simpler ones to gradually more sophisticated ones with an ever increasing scope and commercial impact. The success of a machine learning team acts as a trigger and accelerates the digital and AI transformation of a company. In the highly competitive digital economy, companies that have invested early and invested a lot in AI have emerged as the early winners, for instance, the big tech companies. Thus, impactful machine learning teams act as a lever in the journey towards embracing and onboarding AI and transforming the company into a forward-looking, data-driven, AI-first company.

References

Comments

Why Corporate AI Projects fail? Part 2/4

11/4/2021

Comments

Published in BecomingHuman.ai

tldr: Poor processes and culture can derail the success of many an exceptional AI team

In part 1, I introduced a four-pronged framework for analysing the principal factors underlying the failure of corporate AI projects:

Intuition (Why)
Process/Culture (How)
People (Who)
Systems (What)

and addressed in detail how a lack of organizational intuition for AI can undermine any efforts to embrace and adopt AI successfully.

In the second part of the blog series, I will focus on core aspects of organizational processes and culture that companies should inculcate to ensure that their AI teams are successful and deliver significant business impact.

Figure 1. The Facade Culture (source: Dilbert.com)

Culture

Organizational culture is the foundation on which a company is built and shapes its future outcomes related to commercial impact and success, hiring and retention, as well as the spirit of innovation and creativity. Whilst organizational behaviour and culture have been studied for decades, it needs to be relooked in the context of new-age tech startups and enterprises. The success of such cutting-edge AI-first companies is highly correlated with the scale of innovation through new products and technology, which necessitates an open and progressive work culture.

Typically, new startups on the block, especially those building a core AI product or service, are quick to adopt and foster a culture that promotes creativity, rapid experimentation and calculated risk-taking. Being lean and not burdened by any legacy, most tech startups are quick to shape the company culture in the image of the founders’ vision and philosophy (for better or worse). However, the number of tech companies that have become infamous for the lack of an inclusive and meritocratic culture are far too many.

There are innumerable examples, from prominent tech startups like Theranos, Uber to big tech companies like Google and Facebook, where an open and progressive culture has at times taken a back seat. However, with the increasing focus on sustainability, diversity and inclusion, and ESG including better corporate governance, it is imperative for tech companies to improve organizational culture and not erode employee, consumer or shareholder trust or face real risks to the business from financial as well as regulatory authorities as recently experienced by BlackRock and Deliveroo.

Here is a ready reckoner of some of the ways AI companies tend to lose sight of culture:

Leadership is not clear that the goal of AI is to augment human productivity, reduce costs and improve efficiencies, and not to replace employees. Without human partners in the loop, AI-based products and services cannot succeed
Hiring non-technical leaders with little intuition of AI, and/or technical leaders with poor product or business mindset to lead AI teams or orgs
Poor organization alignment about the long-term vision and purpose of AI across key stakeholders in multiple business verticals
Lack of investment in organization-wide training in AI especially for business teams resulting in poor intuition about the commercial potential of AI
Lack of allocation of budget for investing in AI, or investing too late whereby early movers and competitors have already embraced and adopted AI across the org
Poor management of AI teams with no roadmap, vision, or hiring strategy
Talking up the promise of AI with no well-defined use cases and projects in sight
Treating AI like software and expecting things to work the same way as for building and delivering software products
No structured processes and mechanisms for embracing AI: from defining business-relevant AI use cases to deploying them in the real world

Figure 2. How to scope AI projects (source: DeepLearning.ai)

Process
There are several processes that are integral for ensuring a successful AI outcome across the entire lifecycle from conception to production. However, from first-principles, the primary process that needs to be streamlined and managed well is identifying the right use cases for AI that have the potential to create significant commercial impact. In this blog, I will focus only on this particular aspect and expound on the other processes in separate blogs.

What can go wrong in identifying the right set of AI use cases?

Focus on the technical aspects of AI instead of the business problem
Defining a use case without checking availability of data for the problem
Poor intuition for the feasibility of the AI use case without diving deep into the data and consulting business or domain experts
No clearly defined acceptance criteria, success metrics or KPIs for the use case
No shared alignment across cross-functional teams for the proposed use cases
Poor project management, and misallocation of time and bandwidth across AI and cross-functional teams
Poor prioritization and ranking of proposed use cases
No clearly defined project milestones and timelines to validate the use case once the project has started
Poor brainstorming of potential AI solutions for the use case

So, having listed a variety of issues that can go wrong in identifying an AI use case, how should one ideally go about scoping AI projects systematically? As per Figure 2, the strategy to scope an AI use case involves 5 steps: from identifying a business problem to brainstorming AI solutions to assessing feasibility and value to determining milestones and finally budgeting for resources.

The scoping process starts with a careful dissection of business, not AI problems, that need to be solved for creating commercial value. As discussed above, if not done right, the rest of the AI journey in an organization is bound to fail.

Secondly, it is important to brainstorm potential AI solutions across AI, engineering and product teams to shortlist a set of approaches and techniques that are practically feasible instead of going with the latest or most sophisticated AI model or algorithm.

Thirdly, AI teams should assess the feasibility of shortlisted methods by creating a quick prototype, validating the approach based on literature survey or discussions with domain experts within the company or partner with external collaborators accordingly. If a particular method does not appear to be feasible, then teams should consider the alternative approaches until they are ruled out.

Once the initial efforts have validated the use case, its feasibility and potential approaches, it is critical to define key business metrics, KPIs, acceptance or success criteria. These are not composed of the typical AI model metrics like precision, accuracy of F-1 score, but KPIs need to be defined that are directly correlated with the impact of the AI models on business goals e.g. retention, NPS, customer satisfaction amongst others.

The final step involves program management of the entire project from allocating time, bandwidth of individual contributors in the AI as well as partner teams, budget for collecting or labeling data, hiring data scientists or buying software or infrastructure to setup and streamline the entire AI lifecycle.

Tldr part 2:
Before you head out to build AI, first ask what are the business problems that are big enough and suitable for an AI-based solution? What business metrics and objectives ought to be targeted? Scope out the problem systematically to ensure the best chance of success.

Build on the initial successes of AI and foster a meritocratic and open culture of innovation and cross-functional collaboration to build AI that solves a variety of business use cases.

Comments

<<Previous

Index

How to choose a Vector Database?

Mixtral of Experts

How to Automate MLOps?

Federated Machine Learning for Healthcare

Building Artificial Intelligence Products

AI & Web3

Top 10 MLOps tools

AWS Redshift Pricing

AWS Lambda Pricing

How to Hire Data Science Teams?

The Case for Reproducible Data Science

How to Generate Synthetic Data for Machine Learning Projects

Surefire Ways to Identify Data Drift

Data Labeling and Relabeling in Data Science

Understanding and Measuring Data Quality

Benefits of FAANG companies for Data Science & Machine learning roles

Why Data Democratization is important to your business?

How to ensure Data Quality through Governance

Data Labeling: The Unsung Hero Combating Data Drift

ML Engineer vs Data Scientist

Knowledge Distillation: Principles, Algorithms, Applications

Developing AI Projects for Business - Best Practices

How to build Machine Learning Teams that Deliver

Why Corporate AI Projects fail? Part 2/4

Archives

Categories