Consumer technology companies like Amazon, Yelp, and Airbnb are focused on providing an impeccable customer experience, and reviews are integral to that experience. Reviews from previous customers can signal trust and reliability (e.g., total number of reviews or average star rating), empowering first-time buyers or new customers in their decision-making. Millions of reviews are shared on platforms like Amazon for e-commerce products, on Airbnb for travel and hospitality, on Glassdoor for company and employment experience, and on Google for third-party businesses.
However, the internet has become rife with fake reviews. Fake reviews and inflated ratings provide a tainted picture of a product or service and are designed to trick customers away from or toward certain purchases. As these reviews are an important input factor for search and ranking algorithms, they can have a massive influence on product discovery and sales. This provides a strong incentive for bad actors to try to manipulate the system by improving the ratings of their products through fake reviews.
There is a booming market for fake reviews, which are purchased via multiple social media and community platforms. The problem is enormous: nearly four percent of all reviews are fake, translating into a global economic impact of USD 152 billion.
E-commerce companies like Amazon spend upwards of a billion dollars and employ tens of thousands of workers to combat online fraud and abuse. Some companies use sophisticated technologies including AI to detect and delete fake reviews, but their accuracy is limited (less than forty percent) and it often takes more than one hundred days to remove those reviews. During that time, fraudulent sellers can make strong short-term revenues and profits.
Apart from the short-term commercial losses, there is a longer-term problem; fake reviews erode customer trust and safety, causing customers to avoid online purchases. Catching fake reviews is therefore paramount for a majority of online marketplaces and businesses.
Characteristics of Fake Reviews
Fake reviews have several telltale characteristics. For instance, as they are based on a fraudulent experience with the product or service, fake reviews will often focus on a poor customer experience without specific details about that product or service.
Another sign is the repetition of positive or negative keywords and text. As it is difficult to fabricate a review, fake reviewers keep emphasizing certain keywords and details to paint a terrible customer experience. Such reviews accentuate extreme details without providing a balanced perspective.
Fake reviewers also excessively use emoticons and exclamation points in an attempt to appeal to the customers’ emotions. Genuine reviewers tend to focus more on information and provide thoughtful, actionable feedback for other customers about the product experience.
One clear giveaway is the reviewer’s name and avatar. Fake reviews are usually submitted using an account with a dubious username, avatar, or email address. If a reviewer seems like they could be illegitimate, check whether they have shared any reviews previously, how often, and for which products or businesses. Fake accounts are often created for one-time use, and fake reviewers can submit multiple reviews in a short span of time, sometimes on the same day.
Sometimes fake reviewers post a poor rating without any comments to describe their experience. Genuine reviewers take the time and effort to write useful feedback.
Spotting these characteristics can help you find many fake reviews, but scamsters are always devising more sophisticated techniques to replace those that have already been detected through algorithms, AI, or human reviewers.
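As a starting point, the telltale signs described above can be expressed as simple rule-based checks. The following is a minimal sketch in Python, not a production detector: the function name `review_flags`, the thresholds, and the stopword list are all assumptions chosen for illustration, and a real system would combine many more signals.

```python
import re
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
MAX_EXCLAMATIONS = 3
MAX_KEYWORD_REPEATS = 3

# A tiny, illustrative stopword list so that common words are not counted as keywords.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "was", "it", "this",
             "i", "to", "of", "in", "for", "my"}


def review_flags(text: str, reviews_by_author_today: int = 1) -> list[str]:
    """Return the heuristic warning signs that a single review triggers."""
    flags = []
    words = re.findall(r"[a-z']+", text.lower())

    # A rating posted without any written comment.
    if not words:
        flags.append("rating_without_comment")

    # Heavy use of exclamation points to appeal to emotion.
    if text.count("!") > MAX_EXCLAMATIONS:
        flags.append("excessive_exclamation")

    # The same keyword repeated over and over.
    counts = Counter(w for w in words if w not in STOPWORDS)
    if counts and counts.most_common(1)[0][1] > MAX_KEYWORD_REPEATS:
        flags.append("repeated_keywords")

    # Several reviews submitted by the same account on the same day.
    if reviews_by_author_today > 1:
        flags.append("burst_of_reviews")

    return flags
```

A review that triggers several flags is only a candidate for closer inspection; none of these checks proves on its own that a review is fake.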
One solution to this problem is fingerprinting technology, which can identify unique users of your website regardless of VPNs, cookie blockers, private browsing, or other tools. It uses data including the browser and device used, usage patterns, IP addresses, and geolocation to create a unique identifier for each site visitor, making it easier to spot users who are trying to hide their identity or commit fraud.
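To make the idea of a unique identifier concrete, here is a deliberately simplified sketch that hashes a handful of visitor signals into one ID. The function name `visitor_fingerprint` and the signal keys are hypothetical. Commercial fingerprinting products use many more signals and tolerate small changes between visits, whereas this exact-match hash changes as soon as any one signal changes.

```python
import hashlib
import json


def visitor_fingerprint(signals: dict[str, str]) -> str:
    """Combine visitor signals into a single stable identifier.

    The signals are serialized in sorted key order so that the same
    set of values always produces the same hash.
    """
    canonical = json.dumps(signals, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; the keys are assumptions, not a standard schema.
fingerprint = visitor_fingerprint({
    "user_agent": "Mozilla/5.0",
    "ip": "203.0.113.7",
    "timezone": "Asia/Kolkata",
})
```

Two reviews submitted under different usernames but with the same fingerprint would be worth a closer look.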
Fake reviews have undermined the revenue and growth of online sellers and small businesses. These reviews can boost the sales of a poor product by exaggerating its positive rating, or damage the sales of competitor products via negative reviews. While there are ways to catch these fake reviewers in the act, it’s an increasingly sophisticated scam and a headache for businesses.
Fingerprinting technology can help you find and remove fake reviews as well as protect your business from all types of online fraud. This helps ensure that your customers will have a safe and reliable online shopping experience.
Recently, the Government of India issued a draft framework of standards to counter fake reviews in order to reduce their prevalence on e-commerce platforms.
Published by CloudForecast
Amazon Redshift is a widely used cloud data warehouse that many businesses, like Nasdaq, GE, and Zynga, rely on to process analytical queries and analyze exabytes of data across databases, data lakes, data warehouses, and third-party data sets.
There are multiple use cases for Redshift, including enhancing business intelligence capabilities, increasing developer and analyst productivity, and building machine learning models for predictive insights, like demand forecasting.
Amazon Redshift can be leveraged by modern data-driven organizations to vastly improve their data warehousing and analytics capabilities. However, the pricing for Redshift services can be challenging to understand, with multiple criteria that define the total cost.
In this article, you’ll learn about Amazon Redshift and its pricing structure, with suggestions for how to optimize costs.
Here is the full article
Published by CloudForecast
Companies are increasingly moving their production code to serverless functions using AWS Lambda, which has gained popularity for its better code maintenance, low-cost hosting charges, and automatically scaled and optimized performance. But without careful oversight, Lambda can become an expensive choice for your project.
Lambda, from market leader AWS, offers many benefits. It is one example of serverless functions: single-purpose, programmatic functions hosted and maintained by cloud providers like AWS, Azure, or GCP to ensure near-perfect runtime and scaling to any incoming network request volume. Companies can use Lambda, an event-driven compute service, to run any type of application or backend service without worrying about provisioning or managing servers.
Lambda adapts to a variety of use cases across startups and enterprises alike. It can process data at scale, run interactive web and mobile backend services, enable powerful machine learning models, and build in-house event-driven applications.
It also specifies limits for the amount of compute and storage resources used to run and store serverless functions. These limits apply to a number of resources, such as the number of concurrent executions; storage for uploaded functions as well as quotas for function configuration; deployment and execution parameters like memory allocation; timeout; environment variables; layers; and burst concurrency.
The key to using Lambda is keeping your costs in check. This article will review Lambda’s pricing structure to show how costs can be efficiently managed without compromising on operational excellence and execution of Lambda functions. It will also discuss tools like CloudForecast that can help engineering teams monitor and reduce their serverless computing costs on AWS.
Here is the full article.
Data science teams are an integral part of early-stage and growth-stage start-ups as well as midlevel and enterprise companies. A data science team can include a wide range of roles that take care of the end-to-end machine learning lifecycle from project conceptualization to execution, delivery, and monitoring:
The manager of a data science team in an enterprise organization has multiple responsibilities, including the following:
As the data science manager, it’s critical to have a structured, efficient hiring process, especially in a highly competitive job market where the demand outstrips the supply of data science and machine learning talent. A transparent, thoughtful, and open hiring process sends a strong signal to prospective candidates about the intent and culture of both the data science team and the company, and can make your company a stronger choice when the candidates are selecting an offer.
In this blog, you’ll learn about key aspects of the process of hiring a top-class data science team. You’ll dive into the process of recruitment, interviewing, and evaluating candidates to learn how to find the ones who can help your business improve its data science capabilities.
Benefits of an Efficient Hiring Process
Recent events have accelerated organizations’ focus on digital and AI transformation, resulting in a very tight labor market when you’re looking for digital skills like data science and machine learning, statistics, and programming.
A structured, efficient hiring process enables teams to move faster, make better decisions, and ensure a good experience for the candidates. Even if candidates don’t get an offer, a positive experience interacting with the data science and the recruitment teams makes them more likely to share good feedback on platforms like Glassdoor, which might encourage others to interview at the company.
Hiring Data Science Teams
A good hiring process is a multistep process, and in this section, you’ll look at every step of the process in detail.
Building a Funnel for Talent
Depending on the size of the data science team, the hiring manager may have to assume the responsibility of reaching out to candidates and building a pipeline of talent. In larger organizations, managers can work with in-house recruiters or even third-party recruitment agencies to source talent.
It’s important for the data science managers to clearly convey the requirements for the recruited candidates, such as the number of candidates desired and the profiles of those candidates. Candidate profiles might include things like previous experience, education or certifications, skill set or tech stack, and experience with specific use cases. Using these details, recruiters can then start their marketing, advertising, and outreach campaigns on platforms, like LinkedIn, Glassdoor, Twitter, HackerRank, and LeetCode.
In several cases, recruiters may identify candidates who are a strong fit but who may not be on the job market or are not actively looking for new roles. A database of all such candidates ought to be maintained so that recruiters can proactively reach out to them at a more suitable time and reengage the candidates.
Another trusted source of identifying good candidates is through employee referrals. An in-house employee referral program that incentivizes current employees to refer candidates from their network is often an effective way to attract the specific types of talent you’re looking for.
The data science leader should also publicize their team’s work through channels, like conferences or workshops, company blogs, podcasts, media, and social media. By investing dedicated time and energy in building up the profile of the data science team, it’s more likely that candidates will reach out to your company seeking data science opportunities.
When looking for a diverse set of talent, the search can be difficult, as data science is a male-dominated field. As a result, traditional recruiting paths will continue to reflect this bias. Reaching out to and building relationships with groups such as Women in Data Science can help broaden the pipeline of talent you attract.
Defining Roles and Responsibilities
Good candidates are more likely to apply for roles that have a clear job description, including a list of potential data science use cases, a list of required skills and tech stack, and a summary of the day-to-day work, as well as insights into the interviewing process and timelines. Crafting specific, accurate job descriptions is a critical, if often overlooked, aspect of attracting candidates. The more information and clarity you provide up front, the easier it is for candidates to decide whether the role suits them and whether they should go ahead with the application. If you’re struggling to write one, you can start with an existing job description template and then customize it in accordance with the needs of the team and company.
It's also critical not to overpopulate a job description with every possible skill or experience you hope a candidate brings. That will narrow your potential applicant pool. Instead, focus on the skills and experiences that are absolutely critical. The right candidate will be able to pick up other skills on the job.
It can be useful for the job description to include links to any recent publications, blogs, or interviews by members of the data science team. These links provide additional details about the type of work your team does and also offer candidates a glimpse of other team members.
Here are some job description templates for the different roles in a data science team:
When compared to software engineering interviews, the interview process for data science roles is still very unstructured, and data science candidates are often uncertain about what the interview process involves. The professional position of data scientist has only existed for a little over a decade, and in that time, the role has evolved and transformed, resulting in even newer, more specialized roles, such as data engineer, machine learning engineer, applied scientist, research scientist, and product data scientist.
Because of the diversity of roles that could be considered data science, it’s important for a data science manager to customize the interviewing process depending on the specific profile they’re seeking. Data scientists need to have expertise in multiple domains, and one or more second-round interviews can be tailored around these core skills:
Given how tight the job market is for data science talent, it’s important not to overcomplicate the process. The more steps in the process, the longer it will take and the higher the likelihood that you will lose viable candidates to other offers. So be thoughtful in your approach and evaluate it periodically to align with the market.
Types of Data Science Interviews
Interviewing is usually a multistep process that can involve several rounds of assessment.
To save time, one or more screening rounds can be conducted before inviting candidates for second-round interviews. These screening interviews can take place virtually and involve an assessment of essential skills, like programming and machine learning, along with a deep dive into the candidate’s experience, projects, career trajectory, and motivation to join the company. These screening rounds can be conducted by the data science team itself or outsourced to other companies, like HackerRank, HackerEarth, Triplebyte, or Karat.
Once candidates have passed the screening interviews, the top candidates will be invited to a second interview, either virtually or in person. The data science manager has to take the lead in terms of coordinating with internal interviewers to confirm the schedule for the series of interviews that will assess the candidate’s skills, as described earlier. On the day of the second-round interviews, the hiring manager needs to help the candidate feel welcome and explain how the day will proceed. Some companies like to invite candidates to lunch with other team members, which breaks the ice by allowing the candidate to interact with potential team members in a social setting.
Each interview in the series should start by having the interviewer introduce themself and provide a brief summary of the kind of work they do. Depending on the types of interviews and assessments the candidate has already been through, the rest of the interview could focus on the core skill set to be evaluated or other critical considerations. Wherever possible, interviewers should offer the candidate hints if they get stuck and otherwise try to make them feel comfortable with the process. The last five to ten minutes of each interview should be reserved for the candidate to ask the interviewer questions. This is a critical component of second-round interviews, as the types of questions a candidate asks offer a great deal of information about how carefully they’ve considered the role.
Before the candidate leaves, it’s important for the recruiter and hiring manager to touch base with the candidate again, inquire about their interview experience, and share timelines for the final decision.
It is common to include some sort of case study or technical assessment to get a better understanding of a candidate’s approach to problem solving, ability to deal with ambiguity, and practical skills. This provides the company with good information about how the candidate may perform in the role. It is also an opportunity to show the candidate what type of data and problems they may work on when working for you.
After the second-round interviews and technical assessment, the hiring manager needs to coordinate a debrief session. In this meeting, every interviewer shares their views based on their experience with the candidate and offers a recommendation on whether the candidate should be hired.
After obtaining the feedback from each member of the interview panel, the hiring manager also shares their opinion. If the candidate unanimously receives a strong hire or a strong no-hire signal, then the hiring manager’s decision is simple.
However, there may be candidates who perform well in some interviews but not so well in others, and who elicit mixed feedback from the interview panel. In cases like this, the hiring manager has to make a judgment call on whether that particular candidate should be hired or not. In some cases, an offer may be extended if a candidate didn’t do well in one or more interviews but the panel is confident that the candidate can learn and upskill on the job, and is a good fit for the team and the company.
If multiple candidates have interviewed for the same role, then the candidates should be assessed relative to one another, and the strongest candidate or candidates, depending on the number of roles to be filled, should be selected.
While most of the interviews focus on technical data science skills, it’s also important for interviewers to use their time with the candidate to assess soft skills, like communication, clarity of thought, problem-solving ability, business sense, and leadership values. Many large companies place a very strong emphasis on behavioral interviews, and poor performance in this interview can lead to a rejection, even if the candidate did well on the technical assessments.
After the debrief session, the data science manager needs to make their final decision and share the outcome, along with a compensation budget, with the recruiter. If there’s no recruiter involved, the manager can move directly to making the candidate an offer.
It’s important to move quickly when it comes to making and conveying the decision, especially if candidates are interviewing at multiple companies. Being fast and flexible in the hiring process gives companies an edge that candidates appreciate and take into consideration in their decision-making process.
Once the offer and details of compensation have been sent to the candidate, it’s essential to close the offer quickly to prevent candidates from using your offer as leverage at other companies. Including a deadline for the offer can sometimes work to the company’s advantage by incentivizing candidates to make their decision faster. If negotiations stretch and the candidate seems to lose interest in the process, the hiring manager should assess whether the candidate is really motivated to be part of the team. Sometimes, it may move things along if the hiring manager steps in and has another brief call with the candidate to help remove any doubts about the type of work and projects. However, additional pressure on the candidates can often work to your disadvantage and may put off a skilled and motivated candidate in whom the company has already invested a lot of time and money.
In this article, you’ve looked at an overview of the process of hiring a data science team, including the roles and skills you might be hiring for, the interview process, and how to evaluate and make decisions about candidates. In a highly competitive data science job market, having a robust pipeline of talent, and a fast, fair, and structured hiring process can give companies a competitive edge.
Published by Domino Data Lab
Reproducibility is a cornerstone of the scientific method and ensures that tests and experiments can be reproduced by different teams using the same method. In the context of data science, reproducibility means that everything needed to recreate the model and its results, such as data, tools, libraries, frameworks, programming languages, and operating systems, has been captured, so that identical results can be produced with little effort regardless of how much time has passed since the original project.
Reproducibility is critical for many aspects of data science including regulatory compliance, auditing, and validation. It also helps data science teams be more productive, collaborate better with nontechnical stakeholders, and promote transparency and trust in machine learning products and services.
In this article, you’ll learn about the benefits of reproducible data science and how to ingrain reproducibility in every data science project. You’ll also learn how to cultivate an organizational culture that promotes greater reproducibility, accountability, and scalability.
Here is the full article.
Machine learning models, especially deep neural networks, are trained using large amounts of data. However, for many machine learning use cases, real-world data sets do not exist or are prohibitively costly to buy and label. In such scenarios, synthetic data represents an appealing, less expensive, and scalable solution.
Additionally, several real-world machine learning problems suffer from class imbalance—that is, where the distribution of the categories of data is skewed, resulting in disproportionately fewer observations for one or more categories. Synthetic data can be used in such situations to balance out the underrepresented data and train models that generalize well in real-world settings.
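One simple way to create synthetic observations for an underrepresented class is to interpolate between pairs of existing minority-class samples. The sketch below is a simplified illustration in the spirit of SMOTE, but it picks pairs at random instead of using nearest neighbours; the function name `synthesize_minority` and its parameters are assumptions made for this example.

```python
import random


def synthesize_minority(samples: list[list[float]], n_new: int, seed: int = 0) -> list[list[float]]:
    """Create n_new synthetic points by interpolating between random pairs of samples."""
    if len(samples) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)  # seeded so the output is repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct existing samples
        t = rng.random()               # position along the line from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Illustrative minority-class feature vectors.
minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
extra = synthesize_minority(minority, n_new=5)
```

Each synthetic point lies on the line segment between two real samples, so it stays within the range of the observed minority data.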
Synthetic data is now increasingly used for various applications, such as computer vision, image recognition, speech recognition, and time-series data, among others. In this article, you will learn about synthetic data, its benefits, and how it is generated for different use cases.
👉 Here is the full article
Data drift refers to the phenomenon where the distribution of live, real-world data differs or “drifts” from the distribution of data used to train a machine learning model. When data drift occurs, the performance of machine learning models in production degrades, resulting in inaccurate predictions. This reduction in the model’s predictive power can adversely impact the expected business value from the investment in training. If data drift is not identified in time, the machine learning model may become stale and eventually useless.
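One common way to quantify how far live data has moved from training data for a single numeric feature is the population stability index (PSI), which compares the share of values falling into each bin. The following is a minimal sketch under stated assumptions: equal-width bins derived from the training data, a small `eps` to avoid taking the logarithm of zero, and the function name `population_stability_index` are all choices made for illustration.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a training sample (expected) and a live sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all training values are equal

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values into the end bins
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [float(v) for v in range(100)]
live_same = list(training)
live_shifted = [v + 50 for v in training]
```

Identical distributions give a PSI of zero, and the value grows as the live distribution diverges from the training distribution.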
In this article, you’ll learn more about data drift, exploring why and in what ways it occurs, its impact, and how it can be mitigated and prevented.
👉 Here is the full article
Published by Colabra
Effective communication skills are pivotal to success in science. From maximizing productivity at work through efficient teamwork and collaboration to preventing the spread of misinformation during global pandemics like COVID-19, the importance of strong communication skills cannot be emphasized enough.
However, scientists often struggle to communicate their work clearly for various reasons. Firstly, most academic institutes do not prioritize training scientists in essential soft skills like communication. With negligible organizational or departmental training and little to no feedback from professors and peers, scientists fail to fully appreciate the real-world importance and consequences of poor communication skills. The long scientific training period in the academic ivory tower is spent conversing with fellow scientists, with minimal interaction with non-technical professionals and the general public. Thus, the lingua franca among scientists is predominantly interspersed with jargon, leading to poor communication with non-scientists.
This article will describe best practices and frameworks for professional scientists and non-scientists in commercial scientific enterprises to communicate effectively.
👉 Here is the full article
Supervised machine learning models are trained using data and their associated labels. For example, to discriminate between a cat and a dog present in an image, the model is fed images of cats or dogs and a corresponding label of “cat” or “dog” for each image. Assigning a category to each data sample is referred to as data labeling.
Data labeling is essential to imparting machines with knowledge of the world that is relevant for the particular machine learning use case. Without labels, models do not have any explicit understanding of the information in a given data set. A popular example that demonstrates the value of data labeling is the ImageNet data set. More than a million images were labeled with hundreds of object categories to create this pioneering data set that heralded the deep-learning era.
In this article, you’ll learn more about data labeling and its use cases, processes, and best practices.
👉 Here is the full article
Modern companies now unanimously recognize the value of data for driving business growth. However, high-quality data is much more valuable than data assets of poor quality. As companies accumulate petabytes of data from various sources, it becomes imperative to focus on the quality of data and filter out bad data.
Data is the fundamental building block for predictive machine learning models. Although having access to greater amounts of data is beneficial, it doesn’t always translate to better-performing machine learning models. Sampling training data that passes quality checks and meets certain acceptance criteria can significantly boost the accuracy of the model predictions.
In this article, you’ll learn more about why high-quality data is essential for building robust machine learning models, expanding on the various parameters that define data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. You’ll also explore a few mechanisms you can implement to measure and improve the quality of your data.
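Some of these dimensions can be measured with very little code. The sketch below computes three of them (completeness, uniqueness, and validity) as simple ratios over a list of records. The function name `quality_report`, its parameters, and the sample records are hypothetical and only meant to show the shape of such a check.

```python
from typing import Any, Callable


def quality_report(rows: list[dict[str, Any]], required: list[str], key: str,
                   validators: dict[str, Callable[[Any], bool]]) -> dict[str, float]:
    """Return completeness, uniqueness, and validity as fractions between 0 and 1."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to assess")
    # Completeness: share of rows with every required field populated.
    complete = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    # Uniqueness: share of distinct key values among all rows.
    unique_keys = len({r.get(key) for r in rows})
    # Validity: share of rows whose fields pass every supplied check.
    valid = sum(all(check(r.get(f)) for f, check in validators.items()) for r in rows)
    return {
        "completeness": complete / n,
        "uniqueness": unique_keys / n,
        "validity": valid / n,
    }


# Illustrative records: one missing email, one duplicate id, one impossible age.
records = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "", "age": 41},
    {"id": 2, "email": "c@example.com", "age": -5},
]
report = quality_report(
    records,
    required=["id", "email", "age"],
    key="id",
    validators={"age": lambda v: isinstance(v, int) and 0 <= v <= 120},
)
```

Scores like these can serve as acceptance criteria before a data set is used for training.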
👉 Here is the full article
Published by Transform
A metric layer is a centralized repository for key business metrics. This “layer” sits between an organization’s data storage and compute layer and the downstream tools where metric logic lives, such as business intelligence tools.
A metric layer is a semantic layer where data teams can centrally define and store business metrics (or key performance indicators) in code. It then becomes a source of truth for metrics, which means people who analyze data in downstream tools like Hex, Mode, or Tableau will all be working with the same metric logic in their analyses.
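To illustrate what defining metrics centrally in code can look like, here is a toy registry in Python. It is only a sketch of the idea and does not reflect the API of Transform or any other metric platform; `Metric`, `register`, `query`, and the two example metrics are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    compute: Callable[[list[dict]], float]  # the single, shared definition of the metric logic


# Central registry: every downstream consumer looks metrics up here.
METRICS: dict[str, Metric] = {}


def register(metric: Metric) -> None:
    """Add a metric, refusing duplicates so each name has exactly one definition."""
    if metric.name in METRICS:
        raise ValueError(f"metric {metric.name!r} is already defined")
    METRICS[metric.name] = metric


def query(name: str, rows: list[dict]) -> float:
    """Compute a registered metric over the given rows."""
    return METRICS[name].compute(rows)


register(Metric("total_revenue", "Sum of order amounts",
                lambda orders: sum(o["amount"] for o in orders)))
register(Metric("average_order_value", "Mean order amount",
                lambda orders: sum(o["amount"] for o in orders) / len(orders)))

orders = [{"amount": 10.0}, {"amount": 30.0}]
revenue = query("total_revenue", orders)
```

Because every consumer calls `query` instead of reimplementing the calculation, a change to a metric definition is made in one place.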
The metric layer is a relatively new concept in the modern data stack, mainly because until recently, it was only available to companies with large or sophisticated data teams. Now it is more readily available to all organizations with metric platforms like Transform.
In this article, you’ll learn what a metric layer is, how to use your data warehouse as a data source for the metric layer, and how to get value from this central metric repository by consuming metrics in downstream tools.
👉 Here is the full article
I receive several messages about the benefits of joining FAANG and similar companies and startups in the context of Data Science, Machine Learning & AI roles.
Here’s my take, in no particular order:
1. 𝐁𝐫𝐚𝐧𝐝. FAANG+ are not only the top technology companies but also the biggest companies by market cap -> great brand to add to your profile, top compensation and benefits.
2. 𝐒𝐜𝐨𝐩𝐞. The scope of AI/ML applications in these companies is tremendous as they have tons of data. You can get to work on multiple use cases, driven by statistics, machine learning, deep learning, unsupervised / semi-supervised / self-supervised, reinforcement learning etc. Internal team transfers facilitate expanding your breadth of ML experience.
3. 𝐁𝐚𝐫. The AI/ML work is cutting edge, as most of these companies invest heavily in R&D and create game-changing techniques and models. They also invest heavily in platform, cloud, services etc. that make it easier to build and deploy ML products.
4. 𝐑&𝐃. You can do both research on moon-shot projects if that’s your cup of tea, as well as more immediate business-driven data science projects with monthly or quarterly deliverables.
5. 𝐏𝐞𝐨𝐩𝐥𝐞. You get to work with the crème de la crème in terms of talent, ideas, vision, and execution. Your own level will rise if you are surrounded by some of the brightest folks, and you also get to collaborate with their clients and with collaborators from academia and startups.
6. 𝐍𝐞𝐭𝐰𝐨𝐫𝐤. After FAANG, people go on to do many diverse things — from building a startup to doing cutting-edge research to non-profits to venture capital amongst others. You can find quality partners for the next steps of your career journey.
7. 𝐒𝐲𝐬𝐭𝐞𝐦𝐬. Processes and systems for AI/ML/Data are more mature and streamlined than at smaller/newer companies, which can speed up the execution of your projects.
8. 𝐂𝐮𝐥𝐭𝐮𝐫𝐞. The culture, on average, is more professional as these companies invest heavily in their employees and regularly come up with new employee-friendly policies to make it a great place to work.
9. 𝐅𝐫𝐞𝐞𝐝𝐨𝐦. After FAANG, you will be in demand and recruiters and hiring managers will seek you out if you’ve proved your chops whilst at the company. You will have more opportunities to sample from and greater freedom in terms of deciding your career and life trajectory, as you can also move internally to different countries.
10. 𝐈𝐦𝐩𝐚𝐜𝐭. Given the scale at which these companies operate, the scope for real-world measurable impact is enormous.
There are some downsides, caveats and exceptions as well, but on average these factors make FAANG and similar tech companies a very attractive proposition to launch, build and grow your career in data science and machine learning.
"Data democratization" has become a buzzword for a reason. Modern organizations rely extensively on data to make informed decisions about their customers, products, and strategy, and to assess the health of the business. But even with an abundance of data, if your business can’t access or leverage this data to make decisions, it’s not useful.
To that end, data democratization, or the process of making data accessible to everyone, is essential to data-driven organizations.
Providing data access to everyone also implies that there are few if any roadblocks or gatekeepers who control this access. When stakeholders from different departments—like sales, marketing, operations, and finance—are permitted and incentivized to use this data to better understand and improve their business function, the whole organization benefits.
Successful data democratization requires constant effort and discipline. It’s founded on an organization-wide cultural shift that embraces a data-first approach and empowers every stakeholder to comfortably use data and make better data-driven decisions. As Transform co-founder James Mayfield put it, organizations should think about "democratizing insights, not data."
In this article, I will provide a detailed overview of data democratization, why organizations should invest in it, and how to actually implement it in practice.
👉 Here is the full article
Copyright © 2022, Sundeep Teki
All rights reserved. No part of these articles may be reproduced, distributed, or transmitted in any form or by any means, including electronic or mechanical methods, without the prior written permission of the author.
This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the blog owner and do not represent those of people, institutions or organizations that the owner may or may not be associated with in professional or personal capacity, unless explicitly stated.