qrender: Render Wikidata items in different formats

Wikidata is a rich knowledge graph, but its raw data format can be challenging for both humans and AI to process effectively. This blog post explores how I addressed these challenges by creating qrender, a tool for rendering Wikidata items in more human-readable and AI-friendly formats.

In my previous article about qjson, I explained the importance of retrieving all information about a Wikidata item. I wrote qjson as an easy API to fetch all that information in one API call instead of multiple SPARQL queries or API calls.
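
For a rough idea of what that single call looks like, here is a minimal sketch in Rust using the reqwest and serde_json crates. The crate choices and helper are mine; qjson itself is just an HTTP endpoint:

    // Fetch everything qjson knows about one item in a single call.
    // Assumed Cargo dependencies: reqwest (with the "blocking" and
    // "json" features) and serde_json; error handling is kept minimal.
    fn fetch_item(qid: &str) -> Result<serde_json::Value, reqwest::Error> {
        let url = format!("https://qjson.toolforge.org/{qid}.json");
        reqwest::blocking::get(url)?.json()
    }

    fn main() -> Result<(), reqwest::Error> {
        let item = fetch_item("Q42")?; // Q42: Douglas Adams
        println!("{}", serde_json::to_string_pretty(&item).unwrap());
        Ok(())
    }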

One of the reasons I needed such a tool was to make Wikidata accessible to LLMs. I was experimenting with an approach to connect the Wikidata knowledge graph with LLMs, specifically an LLM capable of calling a tool depending on the context (aka agentic AI). I will write about that experiment in another article; in this article I will explain why qjson was not sufficient for achieving my goal and how I had to build another layer on top of it.

As you can observe from the output of a qjson API call, the JSON is quite raw. See https://qjson.toolforge.org/Q405.json for example. It has all the information, but it has multiple issues:

  • The PIDs are not sorted, so related properties appear in random order. This is acceptable for a program, but quite difficult for a human to comprehend. If related information is scattered across multiple parts of a JSON document, it is hard for both humans and LLMs. For example, a person will have a first name, last name, nickname, family name, and so on; it would be much better if they appeared together. Another example is the set of date of birth, date of death, place of birth, and place of death.
  • The JSON format contains a lot of repeated content in the form of keys. The format requires that, but it makes the overall content that needs to go to an LLM unnecessarily large (see the illustration after this list).
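
To make the second point concrete, here is a made-up illustration of the key overhead (the field names are mine, not qjson's actual schema). Every statement repeats the same structural keys:

    "claims": [
      { "property": "P569", "property_label": "date of birth",  "value": "1952-03-11" },
      { "property": "P19",  "property_label": "place of birth", "value": "Q350" },
      { "property": "P570", "property_label": "date of death",  "value": "2001-05-11" }
    ]

whereas a plain-text rendering carries only the content:

    date of birth: 1952-03-11
    place of birth: Cambridge
    date of death: 2001-05-11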

When I gave the JSON output from qjson to LLMs and asked questions based on it, I noticed underwhelming results, and that is how I arrived at the above reasoning about the limitations of the data format. So I started experimenting with alternate formats that would improve the results. I tried plain text and markdown as candidates. These formats are not a literal transformation of the JSON; the data is transformed into a format that resembles how such information is presented to a casual human reader.

Please take a look at the following text format output. It is clean and contains the same information as the JSON, but in a compact and organized way.

# names

given name:
        Noël (series ordinal: 2)
        Q463035 (series ordinal: 1, object of statement has role: http://www.wikidata.org/entity/Q3409033)
family name:
        Adams
name in native language:
        Douglas Adams
short name:
        Douglas Adams
birth name:
        Douglas Noël Adams
pseudonym:
        David Agnew (statement is subject of: http://www.wikidata.org/entity/Q11036149)

# employer

employer:
        The Digital Village (start time: 1996-01-01T00:00:00Z)
        BBC

# languages

native language:
        English
languages spoken, written or signed:
        English

However, there are two challenges here. First, how do we group the Wikidata properties (roughly 13K) in a useful semantic fashion? The second challenge is that, compared to JSON, we lose all PIDs and QIDs in this format, which we need to retain for linking purposes.

Grouping Wikidata properties

This page: https://www.wikidata.org/wiki/MediaWiki:Wikibase-SortedProperties lists all Wikidata properties and groups them into high-level categories such as People, Banking, Math, Medicine, and so on. It is a good start, but we need more granularity within this classification. As explained earlier, we need grouping within all the properties of a person (names, awards, life, family, jobs, etc.). There is no easy or ready-made solution for this grouping other than creating one yourself, so I started manually building such semantic groupings of properties. The result is https://github.com/santhoshtr/qrender/blob/master/groups.toml - a configuration-driven grouping system.

It is not complete, but it is good enough for my experiments. It is a TOML file, which makes it easy to edit and expand as well.
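
To give an idea of its shape, here is a hypothetical fragment in the spirit of groups.toml; the group names mirror the example output above, but the exact schema and property lists of the real file may differ:

    # given name, family name, name in native language, short name,
    # birth name, pseudonym
    [names]
    properties = ["P735", "P734", "P1559", "P1813", "P1477", "P742"]

    # date of birth, place of birth, date of death, place of death
    [life]
    properties = ["P569", "P19", "P570", "P20"]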

Formats

The text format is a good start, but it cannot represent links. We also need some semantic representation of group titles and lists. Markdown is better suited for this, so I wrote an alternate renderer with markdown output, again based on the qjson output.
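
This also addresses the second challenge from above: the label becomes the link text and the QID survives in the URL, so nothing is lost for linking. A minimal sketch of such a rendering helper (the function name and signature are mine, not qrender's):

    /// Render an item as a markdown link: the label stays readable for
    /// humans and LLMs, while the QID survives in the URL for linking.
    fn entity_link(qid: &str, label: &str) -> String {
        format!("[{label}](https://www.wikidata.org/wiki/{qid})")
    }

    // entity_link("Q42", "Douglas Adams")
    //   => [Douglas Adams](https://www.wikidata.org/wiki/Q42)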

The output is too large to include here, so please visit https://hackmd.io/@sthottingal/S1JXCdM4ex to see the markdown and preview. Here is another example for Q405 - Moon: https://hackmd.io/@sthottingal/ryVFeYMNex

Since I was using a Handlebars-based templating system to generate this markdown, I also added wikitext and HTML as additional formats. They might be useful for someone, but for the above-mentioned LLM use case, I observed that markdown works relatively well.
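 
For the curious, this is roughly what the handlebars crate looks like in use; one template per output format over the same data. The template and data below are illustrative, not qrender's actual templates:

    use handlebars::Handlebars;
    use serde_json::json;

    fn main() {
        let mut hb = Handlebars::new();
        // One template string per output format; this one emits markdown.
        hb.register_template_string(
            "md",
            "# {{group}}\n\n{{#each values}}- {{this}}\n{{/each}}",
        )
        .unwrap();

        let data = json!({
            "group": "names",
            "values": ["Douglas Adams", "Douglas Noël Adams"]
        });
        // Prints "# names" followed by one list item per value.
        println!("{}", hb.render("md", &data).unwrap());
    }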

A few days after I wrote qrender, I came across this paper - KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs https://knowledge-nlp.github.io/naacl2025/papers/39.pdf?s=35, which articulates a similar approach for efficient LLM integration with KGs. The paper also evaluates various data formats.

qrender

Source code: https://github.com/santhoshtr/qrender. This tool is written in the Rust programming language.

I think qrender bridges the gap between Wikidata’s raw data and its practical use in AI and human applications. It is incomplete, and I plan to improve the other renderers, especially the HTML renderer, as per my Wikipedia factoids concept, in future iterations.

Disclaimer

I work at the Wikimedia Foundation. However, this project, its exploration, and the opinions expressed are entirely my own and do not reflect my employer’s views. This is not an official Wikimedia Foundation project.
