Published in Towards Data Science
Electronic means of communication have helped to eliminate time and distance barriers to sharing and broadcasting information. However, despite all its advantages, faster means of communication have also resulted in the extensive spread of misinformation. The world is currently going through the deadly COVID-19 pandemic and fake news regarding the disease, its cures, its prevention, and causes have been broadcast widely to millions of people. The spread of fake news and misinformation during such precarious times can have grave consequences leading to widespread panic and amplification of the threat of the pandemic itself.
As per a recent BBC report from August 2020, at least 800 people may have died around the world because of coronavirus-related misinformation in the first three months of this year. It is therefore of paramount importance to limit the spread of fake news and ensure that accurate knowledge is disseminated to the public.
In this blog, we explore the problem of fake news detection related to COVID-19 and describe our approach to tackle it using Natural Language Processing. This is based on our recent paper — ‘Two Stage Transformer Model for COVID-19 Fake News Detection and Fact Checking’, accepted at the NLP for Internet Freedom Workshop, co-located with COLING2020.
Our NLP solution:
We built a topical fake news detection system capable of verifying claims as well as providing explanations, all in real-time. Developing a solution for such a task involves generating a database of factual explanations, which constitutes our knowledge base, that serves as ground truth for any given claim. We computed the entailment between any given claim and explanation to verify whether the claim is true or not. Querying for claim-explanation pairs for each explanation in our knowledge base is computationally expensive and slow, so we propose generating a set of candidate explanations that are contextually similar to the claim. We achieved this by using a model trained with relevant and irrelevant claim-explanation pairs and using a similarity metric between the two to match them.
Previous research on fake news detection
Previous work on fake news detection has primarily focused on evaluating the relationship measured via a textual entailment task between a header and the body of the article. Researchers have explored the use of simple classifier models with TF-IDF features and cosine similarity metric to classify fake news. Several baselines with such methods exist on standard datasets like FNC-1 and FEVER.
Transformer based pre-trained models achieved state of the art results in several NLP subtasks, their ease of fine-tuning makes them adaptable to newer tasks. In further related work, the authors proposed a model based on the BERT architecture to detect fake news by analyzing the contextual relationship between the headline and the body text of news. They further enhanced their model performance by pre-training with domain-specific news and articles.
The use of social media has also been extensively studied for stopping misinformation for Covid-19. In a related work to this, authors developed an Infodemic Risk Index (IRI) after analyzing Twitter posts across various languages and calculated the rate at which a particular user from a locality comes across unreliable posts from different classes of users like verified humans, unverified humans, verified bots and unverified bots.
But none of these mentioned works tackles the problem of misinformation by reasoning out the given fake claim with an explanation.
The use of an existing misinformation dataset would not serve as a reliable knowledge base for training and evaluating the models due to the recent and uncommon nature i.e., the vocabulary used to describe the disease and the terms associated with the COVID-19 pandemic.
It was therefore important to generate real and timely datasets to ensure accurate and consistent evaluation of the methods.
To overcome this drawback, we manually curated a dataset specific to COVID-19. Our proposed dataset consists of 5500 claim and explanation pairs. There are multiple sources on the web that are regularly identifying and debunking fake news on COVID-19. We collected data from “Poynter”, a fact checking website which collects fake news and debunks or fact-checks them with supporting articles from more than 70 countries.
For each fact check, we collected only the ”claim” and the corresponding “explanation” from this database which were rated as ’False’ or ’Misleading’. In this way, we collected about 5500 false-claim and explanation pairs. We further manually rephrased a few of these false claims to generate a true claim, as the ones that aligned with the explanation, so as to create an equal proportion of true-claim and explanation pairs.
The architecture consists of a two stage model, we will refer to the first model as “Model A” and the second model as “Model B”. The objective of Model A is to fetch the candidate “true facts” or explanations for a given claim, which are then evaluated for entailment using Model B.
Model A is trained on all claim-explanation pairs, as we have a lot more of them, and the task of the Model A is to pick out candidate claims for a given explanation. Model A is trained on a Next Sentence Prediction (NSP) task.
Through our experiments, we find that, on this trained model, if we generate embeddings for a single sentence (either claim or explanation individually) and compare matching [claim, explanation] embeddings using the cosine similarity metric, there is a distinction in the distribution of similarity scores between related and unrelated [claim, explanation] pairs.
Therefore, for faster near real-time performance, we cache the embeddings for all our explanations (knowledge base) beforehand and compute the cosine similarity between the claim and the cached embeddings of the explanations. We fetch the top explanations for any given claim exceeding a certain threshold of sentence similarity as there could be several explanations relevant for a given claim.
The second part of the pipeline is to identify the veracity of a given claim. Model A fetches the candidate explanations while Model B is used to verify whether the given claim aligns with our set of candidate explanations or not. To train Model B, we use a smaller subset of “false claim” and “explanation” pairs from our original dataset, and cross-validate each sample with “true claim” or in other words, claims that align with the factual explanation.
However, this small annotated data is not sufficient to train the model effectively. Therefore, the parameters of the Model A, which was trained on a much larger dataset were used as initial parameters for Model B, and fine-tuned further using our cross-validated dataset. Model B is also trained for the sequence classification task. Essentially Model B computes the entailment between its input claim, explanations pairs.
We trained and evaluated both Model A and Model B using several approaches based on classical NLP methods as well as more sophisticated pre-trained Transformer models. The flow of the Model A + Model B pipeline is shown in the above figure.
Transformer based Models:
We trained and evaluated three Transformer based pre-trained models for both Model A and Model B using the training strategy described before. As our focus was to ensure that the proposed pipeline can be deployed effectively in a near real-time scenario, we restricted our experiments to models that could efficiently be deployed using inexpensive compute. We chose the following three models — BERT(base), ALBERT, and MobileBERT.
Model A was trained on 5000 claim-explanation pairs on the sequence classification task to optimize the softmax cross entropy loss. This trained model was then validated on a test set comprising 1000 unseen claim-explanation pairs. The training data structure here looks like this.
[claim, relevant explanation, 1], [claim, irrelevant explanation, 0]
Model B was trained on a smaller subset of 800 cross-validated [claim, explanation, label] data, on the same sequence classification task, where the label was assigned based on whether the claim aligned with the explanation — 1 or not — 0. This was validated on 200 unseen data-points. The loss function used was softmax cross-entropy. The training data structure here looks like:
[true claim, relevant explanation, 1] [false claim, relevant explanation, 0]
For baselining we implement classical NLP approaches in our use-case and compare those results with transformer based models. We implement GLoVeand TF-IDF architectures for the classical ones.
For evaluating the performance of the overall pipeline model, we first evaluate the performance of Model A in its ability to retrieve relevant explanations. For this we use Mean Reciprocal Rank(MRR) and Mean Recall@10, that is, the proportion of claims for which the relevant explanation was present in the top 10 most contextual explanation by cosine similarity and their mean inverse rank.
Once, Model A has retrieved relevant explanations, we evaluate the performance of Model B on computing the veracity of the claim. Here, we only used explanations that exceed an empirically defined threshold in cosine similarity between the query claim and the explanation. Through our experiments, we found that a threshold of the mean standard deviation of cosine similarity over the validation data worked well for picking relevant explanations. For evaluating the accuracy, we take a mean of the output probabilities for each claim, explanationᵢ.
This is expected due to the lower parameter size of the TF-IDF and GloVe models. Among the Transformer based models, MobileBERT had the least latency per claim as expected while ALBERT consumed the least memory. The best performing BERT+ALBERT model utilized a memory of 1398MB and fetched relevant explanations of each claim in 2.471 seconds. The model latencies and memory usage were evaluated on an Intel Xeon — 2.3GHz Single-core — 2 thread CPU.
We however do acknowledge that our models could still make errors of two kinds:
Model A might not fetch a relevant explanation which automatically means that the prediction provided by Model B is irrelevant,
Model A might have fetched the correct explanation(s) but Model B classifies it incorrectly. We show some of the errors our models made in this table.
In this work, we have demonstrated the use and effectiveness of pre-trained Transformer based language models in retrieving and classifying fake news in a highly specialized domain of COVID-19. Our proposed two stage model performs significantly better than other baseline NLP approaches. Our knowledge base, which we prepare through collecting factual data from reliable sources from the web can be dynamic and change to a large extent, without having to retrain our models again for as long as the distribution is consistent. All of our proposed models can run in near real-time with moderately inexpensive compute. Our work is based on the assumption that our knowledge base is accurate and timely.
This assumption might not always be true in a scenario such as COVID-19 where “facts” are changing as we learn more about the virus and its effects. Therefore a more systematic approach is needed for retrieving and classifying claims using this dynamic knowledge base.
Our future work consists of weighting our knowledge base on the basis of the duration of the claims and benchmarking each claim against novel sources of ground truth.
Our model performance can be further boosted by better pre-training, through domain specific knowledge. In one of the more recent work, the authors propose a novel semantic textual similarity dataset specific to COVID-19. Pre-training our models using such specific datasets could help in a better understanding of the domain and ultimately better performance. Fake news and misinformation is an increasingly important and difficult problem to solve, especially in an unforeseen situation like the COVID-19 pandemic.
Leveraging state of the art machine learning and deep learning algorithms along with preparation and curation of novel datasets can help address the challenge of fake news related to COVID-19 and other public health crises.
Copyright © 2024, Sundeep Teki
All rights reserved. No part of these articles may be reproduced, distributed, or transmitted in any form or by any means, including electronic or mechanical methods, without the prior written permission of the author.
This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the blog owner and do not represent those of people, institutions or organizations that the owner may or may not be associated with in professional or personal capacity, unless explicitly stated.