Integrating Knowledge Graphs (KGs) with Large Language Models (LLMs) is a well-explored research field. KGs are vast, structured databases that store factual associations as graph edges. They can help LLMs with tasks like question answering by drawing on the structured, often up-to-date information within KGs, thereby mitigating the risk of hallucination. For instance, an LLM that can query Wikidata—a prominent KG project—instead of solely depending on its training data becomes significantly more reliable and useful.
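To make this concrete, here is a minimal sketch, assuming the public Wikidata SPARQL endpoint, of the kind of lookup tool an LLM could call instead of relying on its parametric memory; the function name and query are illustrative, not taken from any of the posts below.

```python
# A minimal sketch of the "query Wikidata instead of guessing" idea:
# a hypothetical tool an LLM could call to look up a fact, here the
# country of citizenship (P27) of Douglas Adams (Q42).
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def lookup_citizenship(qid: str = "Q42") -> list[str]:
    query = f"""
    SELECT ?countryLabel WHERE {{
      wd:{qid} wdt:P27 ?country .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kg-llm-demo/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [b["countryLabel"]["value"] for b in bindings]

print(lookup_citizenship())  # e.g. ['United Kingdom']
```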
[Read More]
qrender: Render Wikidata items in different formats
Wikidata is a rich knowledge graph, but its raw data format can be challenging for both humans and AI to process effectively. This blog post explores how I addressed these challenges by creating qrender, a tool for rendering Wikidata items in more human-readable and AI-friendly formats.
In my previous article about qjson, I explained the importance of retrieving all information about a Wikidata item. I wrote qjson as an easy API to fetch all such information in one call instead of multiple SPARQL queries or API calls.
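As a rough illustration of what "rendering" means here (a hypothetical sketch of the idea, not qrender's or qjson's actual code), one can collapse the raw wbgetentities JSON for an item into a short text block; note that without extra lookups the property and value IDs remain unresolved, which is exactly the problem these tools address.

```python
# A rough sketch of the kind of rendering qrender aims at (not its actual
# code): turn the raw wbgetentities JSON for an item into a short,
# human-readable text block. Label resolution is omitted for brevity.
import requests

API = "https://www.wikidata.org/w/api.php"

def render_item(qid: str, lang: str = "en") -> str:
    data = requests.get(API, params={
        "action": "wbgetentities",
        "ids": qid,
        "format": "json",
    }).json()["entities"][qid]

    label = data["labels"].get(lang, {}).get("value", qid)
    description = data["descriptions"].get(lang, {}).get("value", "")
    lines = [f"{label} ({qid}): {description}"]
    for pid, claims in list(data["claims"].items())[:5]:
        value = claims[0]["mainsnak"].get("datavalue", {}).get("value")
        lines.append(f"  {pid}: {value}")  # values are still raw IDs/objects
    return "\n".join(lines)

print(render_item("Q42"))
```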
[Read More]
qjson: Fetching all properties of a Wikidata item in a single API call
For those deeply involved with Wikidata, the richness of its interconnected data is both a blessing and a challenge when it comes to programmatic access. While the standard wbgetentities API endpoint is fundamental, retrieving the complete set of properties, including labels and values, for a given item often leads to a cascade of recursive API calls. For example, suppose we fetch all properties for Q42 using the wbgetentities API: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42. If we look up the “country of citizenship” (P27) for Q42 (Douglas Adams) in the response, the initial response only provides the target QID (Q145), necessitating further queries to resolve both P27 and Q145 into human-readable labels.
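In practice the cascade looks roughly like this (an assumed illustration of the problem, not qjson's code): one call for the item, then one more call per property and per value just to obtain labels.

```python
# Illustration of the cascade described above: the first wbgetentities call
# returns only raw IDs, so each property and value needs a further call
# just to get a human-readable label.
import requests

API = "https://www.wikidata.org/w/api.php"

def get_label(entity_id: str, lang: str = "en") -> str:
    data = requests.get(API, params={
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels",
        "languages": lang,
        "format": "json",
    }).json()
    return data["entities"][entity_id]["labels"][lang]["value"]

# Call 1: fetch the item itself.
item = requests.get(API, params={
    "action": "wbgetentities", "ids": "Q42", "format": "json",
}).json()["entities"]["Q42"]

target_qid = item["claims"]["P27"][0]["mainsnak"]["datavalue"]["value"]["id"]

# Calls 2 and 3: resolve the property and the value into labels.
print(get_label("P27"), "→", get_label(target_qid))
# country of citizenship → United Kingdom
```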
[Read More]
An Experiment in Detecting Wikipedia Edit Policy Violations with LLMs
Wikipedia, the world’s largest online encyclopedia, relies on a massive community of volunteers to maintain its accuracy and neutrality. But with so many editors, how do you ensure edits adhere to Wikipedia’s strict policies? I decided to explore whether Large Language Models (LLMs) could be used to automatically detect policy violations in Wikipedia edits. Here’s what I found.
Wikipedia has well-defined policies to ensure content quality. These include:
WP:NPOV (Neutral Point of View): Avoiding bias and presenting information objectively.
[Read More]
Natural Language based question answering system for Wikipedia and Wikidata
This is a blog post version of a paper titled “Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering”, available at https://arxiv.org/abs/2501.11301.
In the world of Large Language Models (LLMs) and question answering systems, hallucination - where models generate plausible but incorrect information - remains a significant challenge. This is particularly problematic when dealing with encyclopedic knowledge sources like Wikipedia, where accuracy is paramount. Today, I’ll discuss a novel approach that addresses this challenge through question-to-question retrieval.
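In rough terms, and as a sketch of my understanding rather than the paper's reference implementation, the idea is to embed pre-generated questions for each article, match an incoming question against them, and return the curated answer of the closest match verbatim; the embedding model and data layout below are assumptions.

```python
# A minimal sketch of question-to-question retrieval. Pre-generated questions
# are embedded once; an incoming question is matched to the nearest stored
# question and its curated answer is returned verbatim, so nothing is
# generated and nothing can be hallucinated.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

qa_store = [
    ("Who wrote The Hitchhiker's Guide to the Galaxy?", "Douglas Adams"),
    ("When was Douglas Adams born?", "11 March 1952"),
]
stored_vectors = model.encode([q for q, _ in qa_store], normalize_embeddings=True)

def answer(user_question: str) -> str:
    vector = model.encode([user_question], normalize_embeddings=True)[0]
    scores = stored_vectors @ vector          # cosine similarity (normalized)
    best = int(np.argmax(scores))
    return qa_store[best][1]                  # return the stored answer as-is

print(answer("Who is the author of the Hitchhiker's Guide?"))  # Douglas Adams
```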
[Read More]
Wikimania 2023
I attended Wikimania 2023, the annual conference of people working on Wikipedia and other Wikimedia projects. This year’s conference was held in Singapore.
State of Machine Learning on the Wikimedia projects
I presented a talk titled “State of Machine Learning on the Wikimedia projects”. Machine learning is used in many Wikimedia projects, and this talk was a roundup of the various projects that use it. I talked about:
How machine learning is used in our projects, its benefits, and its impact.
[Read More]
sentencex: Empowering NLP with Multilingual Sentence Extraction
Sentence segmentation is a fundamental process in natural language processing. It involves breaking down a given text into individual sentences, a task that finds applications in various contexts. Whether you need to split a paragraph into sentences for further analysis or present sentence boundaries in a user-friendly frontend application, sentence segmentation is crucial.
At first glance, identifying sentence boundaries might seem straightforward – just look for a period or full stop.
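A quick sketch shows why the naive approach falls apart, and what a language-aware segmenter is expected to do instead; the sentencex call follows its segment(language, text) API as I recall it, so treat that part as an assumption.

```python
# Why "just split on full stops" breaks down: abbreviations, initials and
# decimal-style times all contain periods that are not sentence boundaries.
text = "Dr. J. Smith arrived at 10.30 a.m. He was late."

naive = [s.strip() for s in text.split(".") if s.strip()]
print(naive)
# ['Dr', 'J', 'Smith arrived at 10', '30 a', 'm', 'He was late']

# A language-aware segmenter is expected to keep those periods inside their
# sentences; with sentencex the call is segment(language_code, text).
from sentencex import segment
print(list(segment("en", text)))
```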
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 4
I wrote about the exploration of natural language querying for Wikipedia in the previous three blog posts.
In Part 1, I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article’s QA set.
In Part 2, I tried using distilbert-base-cased-distilled-squad with ONNX optimization to answer the questions.
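For context, extractive QA with that model looks roughly like this through the standard transformers pipeline; the post additionally applied ONNX optimization, which this plain sketch omits.

```python
# Extractive question answering with the model mentioned above, via the
# standard transformers pipeline. The example question and context are
# illustrative only.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "Douglas Adams was an English author best known for "
    "The Hitchhiker's Guide to the Galaxy."
)
result = qa(question="What is Douglas Adams known for?", context=context)
print(result["answer"], result["score"])
```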
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 3
I wrote about the exploration of natural language querying for Wikipedia in the previous two blog posts.
In Part 1, I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article’s QA set.
In Part 2, I tried using distilbert-base-cased-distilled-squad with ONNX optimization to answer the questions.
[Read More]
Natural language question answering in Wikipedia - an exploration - Part 2
A few days back I posted an experiment on natural language querying for Wikipedia by generating questions and answers. I suggested that building such a collection of questions and answers can help natural language answering. One missing piece was actually suggesting an answer for a new question that is not part of an article’s QA set.
As a continuation of that experiment, I was exploring various options for answering questions.
[Read More]