Meta Is Building an AI to Fact-Check Wikipedia—All 6.5 Million Articles
August 26, 2022
Most people older than 30 probably remember doing research with good old-fashioned encyclopedias. You’d pull a heavy volume from the shelf, check the index for your topic of interest, then flip to the appropriate page and start reading. It wasn’t as easy as typing a few words into the Google search bar, but on the plus side, you knew that the information you found in the pages of the Britannica or the World Book was accurate and true.
Not so with internet research today. The overwhelming multitude of sources is confusing enough, but add the proliferation of misinformation and it’s a wonder any of us believe a word we read online.
Wikipedia is a case in point. As of early 2020, the site’s English version was averaging about 255 million page views per day, making it the eighth-most-visited website on the internet. As of last month, it had moved up to spot number seven, and the English version currently has over 6.5 million articles.
But as high-traffic as this go-to information source may be, its accuracy leaves something to be desired; the page about the site’s own reliability states, “The online encyclopedia does not consider itself to be reliable as a source and discourages readers from using it in academic or research settings.”
Meta, formerly Facebook, wants to change this. In a blog post published last month, the company’s employees describe how AI could help make Wikipedia more accurate.
Though tens of thousands of people participate in editing the site, the facts they add aren’t necessarily correct; even when citations are present, they’re not always accurate or even relevant.
Meta is developing a machine learning model that scans these citations and cross-references their content with Wikipedia articles to verify not only that the topics line up, but also that the specific figures cited are accurate.
This isn’t just a matter of picking out numbers and making sure they match; Meta’s AI will need to “understand” the content of cited sources (though “understand” is a misnomer, as complexity theory researcher Melanie Mitchell would tell you, because AI is still in the “narrow” phase, meaning it’s a tool for highly sophisticated pattern recognition, while “understanding” is a word used for human cognition, which is still a very different thing).
Meta’s model will “understand” content not by comparing text strings and making sure they contain the same words, but by comparing mathematical representations of blocks of text, which it arrives at using natural language understanding (NLU) techniques.
“What we have done is to build an index of all these web pages by chunking them into passages and providing an accurate representation for each passage,” Fabio Petroni, Meta’s Fundamental AI Research tech lead manager, told Digital Trends. “That is not representing word-by-word the passage, but the meaning of the passage. That means that two chunks of text with similar meanings will be represented in a very close position in the resulting n-dimensional space where all these passages are stored.”
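To make the idea of passages sitting "close" to each other in a vector space concrete, here is a minimal sketch in Python. It is not Meta's code: the three-dimensional vectors are hand-picked, hypothetical stand-ins for what a trained language model would produce (real representations have hundreds of dimensions), and the passages and the helper name nearest_passage are invented for illustration. Cosine similarity is used here as one common way to measure closeness; the article does not say which measure Meta's system uses.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index: passage text -> hand-picked toy vector.
# In a real system a neural encoder would compute these vectors.
PASSAGE_INDEX = {
    "The site gets roughly 255 million views a day.": [0.90, 0.10, 0.20],
    "Daily traffic is about a quarter of a billion page loads.": [0.85, 0.15, 0.25],
    "The recipe calls for two cups of flour.": [0.10, 0.90, 0.30],
}


def nearest_passage(query_vector, index):
    """Return the passage whose vector is most similar to the query vector."""
    return max(index, key=lambda p: cosine_similarity(query_vector, index[p]))


if __name__ == "__main__":
    traffic_a, traffic_b, recipe = PASSAGE_INDEX.values()
    # The two traffic passages share almost no words but have nearby vectors.
    print(cosine_similarity(traffic_a, traffic_b))
    print(cosine_similarity(traffic_a, recipe))
```

The two traffic passages use different wording yet score far higher against each other than either does against the recipe passage, which is the property Petroni describes.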
The AI is being trained on a set of four million Wikipedia citations, and besides picking out faulty citations on the site, its creators would like it to eventually be able to suggest accurate sources to take their place, pulling from a massive index of data that’s continuously updating.
One big issue left to work out is how to build in a grading system for sources’ reliability. A paper from a scientific journal, for example, would receive a higher grade than a blog post. The amount of content online is so vast and varied that you can find “sources” to support just about any claim. Separating the misinformation from the disinformation (the former is simply incorrect, while the latter is deliberately deceptive), the peer-reviewed from the non-peer-reviewed, and the fact-checked from the hastily slapped together is no small task, but it is a very important one when it comes to trust.
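As a purely illustrative sketch of what such a grading system might look like, the Python below weights each candidate source's relevance by a grade for its source type. Everything in it is an assumption: the grade values, the source categories, the Candidate fields, the example.org URLs, and the choice to multiply relevance by grade are all hypothetical, and the article gives no detail on how Meta would actually score reliability.

```python
from dataclasses import dataclass

# Hypothetical reliability grades on a 0-1 scale (illustrative values only).
RELIABILITY_GRADE = {
    "journal": 1.0,
    "news": 0.7,
    "blog": 0.3,
}


@dataclass
class Candidate:
    url: str
    source_type: str
    relevance: float  # assumed to come from an upstream similarity model, 0-1


def weighted_score(candidate):
    """Relevance scaled by the source-type grade; unknown types score 0."""
    return candidate.relevance * RELIABILITY_GRADE.get(candidate.source_type, 0.0)


def rank_candidates(candidates):
    """Sort candidates from highest to lowest weighted score."""
    return sorted(candidates, key=weighted_score, reverse=True)


if __name__ == "__main__":
    ranked = rank_candidates([
        Candidate("https://example.org/blog-post", "blog", 0.9),
        Candidate("https://example.org/journal-paper", "journal", 0.6),
    ])
    for c in ranked:
        print(c.url, round(weighted_score(c), 2))
```

Under these made-up weights, a moderately relevant journal paper outranks a highly relevant blog post, which is the kind of trade-off a real grading scheme would have to tune.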
Meta has open-sourced its model, and those who are curious can see a demo of the verification tool. Meta’s blog post noted that the company isn’t partnering with Wikimedia on this project, and that it’s still in the research phase and not currently being used to update content on Wikipedia.
If you imagine a not-too-distant future where everything you read on Wikipedia is accurate and reliable, wouldn’t that make doing any sort of research a bit too easy? There’s something valuable about checking and comparing various sources ourselves, is there not? It was a big leap to go from paging through heavy books to typing a few words into a search engine and hitting “Enter”; do we really want Wikipedia to move from a research jumping-off point to a gets-the-last-word source?