Data Science against disinformation: How Artificial Intelligence and Machine Learning can Fact-Check claims of digital election campaigns

SEP 01, 2021

Going digital allows election campaigns to promote their candidates more effectively and economically. Campaigners can collect and analyse users’ public data on online social networks to target potential voters, approach them with personalised messages, and convince them to vote for specific candidates.

While this approach is generally sound, what usually happens in election campaigns reminds us of the famous quote attributed to Sergey Nechayev: “The end justifies the means.” Some candidates do not hesitate to resort to any means, ethical or unethical, to increase their election chances or harm their rivals’ reputations. One of these unethical means is to propagate disinformation, which is defined as false information that is spread deliberately to deceive. As an example, Donald Trump once tweeted, “I WON THIS ELECTION, BY A LOT!”

Traditionally, journalists have had to go through hours of archival data collection and analysis to fact-check such a claim. Although the above example tweet was easy to refute based on the US 2020 election statistics, fact-checking claims is not always that easy. Consider a hypothetical election campaigner who claims that, under his last four-year presidency, “the country became the first economic power of Europe.” Here, our journalist will have a harder fact-checking task. They first need to pick this particular sentence out of the candidate’s long speech, because it is the one that contains a claim. Next, they need to come up with working definitions for “being the first economic power” and “Europe”. Then, they need to collect data on the economic indexes of different European countries during a specific period. Finally, they must analyse the data to see whether or not the numbers support the candidate’s original claim. This process, which is usually tedious, time-consuming, and error-prone, has to be repeated for every candidate and every claim.

This is where data science, artificial intelligence, and machine learning-based approaches can come into the picture to facilitate this fact-checking process for our journalist. Data science approaches can (semi)automate each of the above tasks. First, we can train a claim detection classifier that processes all sentences of each candidate’s speech and automatically selects those that contain a claim. Second, a keyword extraction approach can automatically extract the most important phrases of the selected sentence, such as “the country”, “the first economic power”, and “Europe”. Third, an information retrieval system can automatically search for and retrieve all the archival datasets, structured or unstructured, related to these keywords. These datasets could be an unstructured economic report containing the specified keywords or a structured table of economic indexes of different countries that is annotated with similar keywords. Finally, a trained model can take all these collected datasets together with the candidate’s original claim and estimate the truth score of the claim based on the collected data.
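To make the first three steps more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the cue-word list stands in for a trained claim detection classifier, the stopword filter stands in for a real keyword extraction approach, and the two-document corpus is made up. All names (CLAIM_CUES, detect_claims, extract_keywords, retrieve) are hypothetical, and the final truth-scoring step is left out on purpose.

```python
import re

# Hypothetical cue words standing in for a trained claim detection classifier.
CLAIM_CUES = {"became", "first", "highest", "lowest", "increased", "decreased", "won"}
# Tiny, made-up stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "our", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def detect_claims(sentences):
    """Step 1: keep the sentences that contain at least one cue word."""
    return [s for s in sentences if CLAIM_CUES & set(tokenize(s))]


def extract_keywords(sentence):
    """Step 2: keep non-stopword tokens, in order, without duplicates."""
    seen, keywords = set(), []
    for token in tokenize(sentence):
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


def retrieve(keywords, corpus, top_k=2):
    """Step 3: rank documents by the number of keywords they share."""
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(set(keywords) & set(tokenize(text)))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken alphabetically by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Made-up speech and archive, for illustration only.
speech = [
    "Thank you all for coming today.",
    "The country became the first economic power of Europe.",
]
corpus = {
    "gdp_table": "Annual economic output of countries in Europe",
    "sports_report": "Football results of the national league",
}

claims = detect_claims(speech)
keywords = extract_keywords(claims[0])
datasets = retrieve(keywords, corpus)
print(claims, keywords, datasets)
```

In this toy run, only the second sentence is kept as a claim, and only the economic table is retrieved for it. A real system would replace each function with a trained model or a proper search index, but the hand-offs between the steps would look much the same.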

Seems like magic? Wait a second! We are not yet at the point where this whole process can be automated as described above. Although we can technically build all the described systems, in practice their performance might not be that impressive. The main reason is that training a smart approach for each of the mentioned steps usually requires a large set of training examples. For example, to train the claim detection classifier, we need to provide thousands (or even millions) of example sentences that do or do not contain a claim.

Without enough training data, these data science approaches might not always generate a correct and complete result set. The claim detection classifier might miss some of the claims or wrongly mark some ordinary sentences as claims. Similarly, the keyword extraction approach might miss some keywords or extract non-keyword phrases. The same is true for the information retrieval system, which might miss some relevant datasets or retrieve some irrelevant ones. The final task, estimating the truth score of the claim based on the collected data, is perhaps the most challenging step, as this score estimation can be subjective. If we, human beings, cannot agree on the truth score of a claim based on the current facts and data, how can we expect machines to do this task for us accurately?
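The two kinds of error described above, missing relevant items and returning wrong ones, are commonly measured with recall and precision. The short Python sketch below shows the arithmetic on made-up numbers; the sentence ids and the function name precision_recall are assumptions for illustration only.

```python
def precision_recall(predicted, actual):
    """Compare what a system returned with what it should have returned."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of returned items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of correct items that were returned.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical sentence ids: the detector flagged 1, 2 and 5,
# while human annotators marked 1, 2, 3 and 4 as claims.
flagged = [1, 2, 5]
true_claims = [1, 2, 3, 4]

precision, recall = precision_recall(flagged, true_claims)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the detector wrongly marked one ordinary sentence (lower precision) and missed two real claims (lower recall). The same two numbers can be computed for the keyword extraction and information retrieval steps.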

Despite the natural challenges of this fact-checking process and the imperfections of data science approaches, these approaches can still support humans in the process. That is why such systems are usually called “decision support systems”: they are not going to completely take over a human’s role, at least not yet! Instead, they support human decision-making, which in our example means the journalist’s. This way, the journalist can save hours of groundwork, since they already have some initial data and results with a minimal amount of effort. The journalist can then take these data as an initial seed for further investigation.