For those deeply involved with Wikidata, the richness of its interconnected data is both a blessing and a challenge when it comes to programmatic access. While the standard wbgetentities API endpoint is fundamental, retrieving the complete set of properties, including labels and values, for a given item often leads to a cascade of recursive API calls. For example, suppose we fetch all properties for Q42 (Douglas Adams) using the wbgetentities API: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42. If we look up the "country of citizenship" (P27) claim in the response, it only provides the target QID (Q145), necessitating further queries to resolve both P27 and Q145 into human-readable labels.
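To make the cascade concrete, here is a minimal Python sketch using the requests library; get_entity and get_label are illustrative helpers, not part of any client library:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(entity_id):
    # Fetch the full entity document for a Q-item or a property.
    params = {"action": "wbgetentities", "ids": entity_id, "format": "json"}
    return requests.get(API, params=params).json()["entities"][entity_id]

def get_label(entity_id, lang="en"):
    # A second round trip just to turn an ID like P27 or Q145 into text.
    return get_entity(entity_id)["labels"][lang]["value"]

# First call: all claims for Douglas Adams; values are bare QIDs.
adams = get_entity("Q42")
claim = adams["claims"]["P27"][0]
target = claim["mainsnak"]["datavalue"]["value"]["id"]  # "Q145"

# Two more calls to resolve the property and its value into labels.
print(get_label("P27"), "->", get_label(target))
# country of citizenship -> United Kingdom
```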
[Read More]
An Experiment in Detecting Wikipedia Edit Policy Violations with LLMs
Wikipedia, the world’s largest online encyclopedia, relies on a massive community of volunteers to maintain its accuracy and neutrality. But with so many editors, how do you ensure edits adhere to Wikipedia’s strict policies? I decided to explore whether Large Language Models (LLMs) could be used to automatically detect policy violations in Wikipedia edits. Here’s what I found.
Wikipedia has well-defined policies to ensure content quality. These include:
WP:NPOV (Neutral Point of View): Avoiding bias and presenting information objectively.
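As a rough illustration of the kind of check this experiment involves, here is a sketch that asks an OpenAI-style chat model to flag NPOV issues in a single edit; the model name and prompt are assumptions, not necessarily the post's actual setup:

```python
from openai import OpenAI  # assumes the openai Python client

client = OpenAI()

PROMPT = """You are a Wikipedia policy reviewer. Given an edit, decide whether it
violates WP:NPOV (Neutral Point of View). Answer VIOLATION or OK, then give one reason.

Edit: {edit}"""

def check_npov(edit_text: str) -> str:
    # Classify a single edit against one policy.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": PROMPT.format(edit=edit_text)}],
    )
    return response.choices[0].message.content

print(check_npov("He is undoubtedly the greatest scientist who ever lived."))
```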
[Read More]
Natural Language based question answering system for Wikipedia and Wikidata
This is a blog post version of a paper titled "Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering", available at https://arxiv.org/abs/2501.11301.
In the world of Large Language Models (LLMs) and question answering systems, hallucination - where models generate plausible but incorrect information - remains a significant challenge. This is particularly problematic when dealing with encyclopedic knowledge sources like Wikipedia, where accuracy is paramount. Today, I’ll discuss a novel approach that addresses this challenge through question-to-question retrieval.
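The core idea, roughly sketched below, is to match an incoming question against pre-curated questions rather than generating an answer; the embedding model and the tiny in-memory store are illustrative stand-ins for the paper's actual pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical pre-built store: questions generated per article,
# each paired with a curated answer.
qa_store = [
    ("Where was Douglas Adams born?", "Cambridge, England"),
    ("When was Douglas Adams born?", "11 March 1952"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
stored_embeddings = model.encode([q for q, _ in qa_store])

def answer(user_question: str) -> str:
    # Match the user's question against stored *questions*, not article text,
    # and return the curated answer verbatim -- nothing is generated,
    # so nothing can be hallucinated.
    query = model.encode(user_question)
    scores = util.cos_sim(query, stored_embeddings)[0]
    return qa_store[int(scores.argmax())][1]

print(answer("What is Douglas Adams' birthplace?"))  # -> Cambridge, England
```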
[Read More]
Wikimania 2023
I attended Wikimania 2023, the annual conference of people working on Wikipedia and other Wikimedia projects. This year's conference was held in Singapore.
State of Machine Learning on the Wikimedia projects
I presented a talk titled "State of Machine Learning on the Wikimedia projects". Machine learning is used in many Wikimedia projects, and this talk was a roundup of the various projects that use it. I talked about:
How machine learning is used in our projects, and its benefits and impact.
[Read More]
sentencex: Empowering NLP with Multilingual Sentence Extraction
Sentence segmentation is a fundamental process in natural language processing. It involves breaking down a given text into individual sentences, a task that finds applications in various contexts. Whether you need to split a paragraph into sentences for further analysis or present sentence boundaries in a user-friendly frontend application, sentence segmentation is crucial.
At first glance, identifying sentence boundaries might seem straightforward – just look for a period or full stop.
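For a sense of how sentencex is used in practice, here is a small sketch; the expected splits in the comment show what the library aims for, though exact output may vary by version:

```python
from sentencex import segment

# A period after "Dr." or "a.m." should not end the sentence;
# segment() takes a language code and text, and returns an iterator of sentences.
text = "Dr. Smith arrived at 10 a.m. She greeted everyone."
print(list(segment("en", text)))
# Expected (roughly): ['Dr. Smith arrived at 10 a.m.', 'She greeted everyone.']
```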
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 4
I wrote about the exploration of natural language querying for Wikipedia in the previous three blog posts.
In Part 1, I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article's QA set.
In Part 2, I tried using distilbert-base-cased-distilled-squad with ONNX optimization to answer the questions.
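For reference, a plain (non-ONNX) version of that setup might look like the following, using the standard transformers question-answering pipeline with the model named above; the ONNX-optimized variant would be wired in similarly:

```python
from transformers import pipeline

# Extractive QA: the model selects an answer span from the given context.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="Who created The Hitchhiker's Guide to the Galaxy?",
    context="The Hitchhiker's Guide to the Galaxy is a comedy science fiction "
            "series created by Douglas Adams.",
)
print(result["answer"], result["score"])  # e.g. Douglas Adams 0.98
```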
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 3
I wrote about the exploration of natural language querying for Wikipedia in the previous two blog posts.
In Part 1, I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article's QA set.
In Part 2, I tried using distilbert-base-cased-distilled-squad with ONNX optimization to answer the questions.
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 2
A few days back, I posted an experiment on natural language querying for Wikipedia by generating questions and answers. I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article's QA set.
As a continuation of that experiment, I explored various options for answering questions.
[Read More]
Natural language question answering in Wikipedia - an exploration
In this blog post, I explain the prospects of providing questions and answers as an additional content format in Wikipedia, along with a human-in-the-loop approach for doing so, demonstrated with a prototype.
Introduction
Wikipedia is a hub for curiosity, with people visiting the site in search of answers to their questions. However, they typically arrive at Wikipedia via intermediaries such as search engines, which direct them to the relevant article. While Wikipedia's keyword-based search function can be helpful, it may not be sufficient for addressing more complex natural language queries.
[Read More]
One million Wikipedia articles by translation
I am happy to share some news from my work at the Wikimedia Foundation. The Wikipedia article translation system, known as Content Translation, reached a milestone of creating one million articles. This has been my major project at WMF since 2015, and I am the lead engineer for it. The Content Translation system helps Wikipedia editors quickly translate and publish articles from one language wiki to another. This way, the knowledge gap between different languages is reduced.
[Read More]