Exploratory research to define metrics about Wikidata usage.
This was one of the most fun and challenging projects I’ve worked on, and I’m extremely proud of the result!
Wikidata is a massive, crowdsourced knowledge graph. Content from Wikidata can be used on Wikipedia and other Wikimedia sites; however, until recently, there was no good way to track the scale and nature of that use.
Wikidata usage is complex, both technically and socially. Over the years, online communities have developed a huge number of technical methods for accessing it, as well as editorial policies and practices that vary greatly from one language and wiki type to another.
In this project, we defined metrics about Wikidata usage and wrote code to calculate them based on raw data in the Wikimedia Foundation’s data lake, using Python and Spark. A key requirement was that the metrics support longitudinal analyses. Despite complex issues with data sources, the metrics we defined can fulfill this need.
Along the way, we found that Wikipedia contributors seem to have conflicting mental models about article content.
This work was carried out for Wikimedia Deutschland's Wikidata For Wikimedia Projects team.
Approach
One of our starting points was the idea that, despite the highly technical methods needed to add Wikidata to articles, editors’ activities remain communicative and social at their core. So, we viewed the data sources for the metrics as trace data, that is, mechanical traces of human activity.
At the same time, we found the best way to obtain metrics about Wikidata usage was to analyze wiki articles as if they were computer code. Unexpectedly, this helped shed light on the communicative processes involved.
Results
Initial measurements showed widespread usage of Wikidata; in one way or another, it contributes to the content of tens of millions of wiki pages. We also found that Wikidata usage seems to be increasing, varies among wikis and wiki types, and is unevenly distributed across wiki pages.
The plots below show this via the key indicator we proposed: the number of Wikidata property references that appear in the source code used to generate each wiki article. (This tells us approximately how much Wikidata was used in the article’s content.)
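As a rough illustration of this indicator, a property reference can be approximated by matching Wikidata property IDs (such as P569) in an article's wikitext. This is only a sketch: the actual pipeline is more involved, and the function name and sample wikitext below are hypothetical.

```python
import re

# Wikidata property IDs look like "P" followed by digits (e.g. P31, P569).
PROPERTY_ID = re.compile(r"\bP\d+\b")

def count_property_refs(wikitext: str) -> int:
    """Return the number of Wikidata property IDs found in an article's wikitext."""
    return len(PROPERTY_ID.findall(wikitext))

# Hypothetical wikitext that pulls two properties from Wikidata.
sample = "{{Infobox person|birth_date={{#property:P569}}|occupation={{#property:P106}}}}"
print(count_property_refs(sample))  # 2
```

A count like this gives an approximate per-article measure of how much Wikidata contributed to the rendered content, which is what makes it usable as a longitudinal indicator.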
Other results included product recommendations, suggestions for improving data collection, a conceptual framework, and ideas for future research.



Code
Code to calculate these metrics is available here. Results are generated by running a series of Jupyter notebooks on the internal Wikimedia Foundation data lake. (It would also be possible to calculate the same metrics using public sources.)
Queries use Spark SQL. Due to the size and complexity of the data, we create temporary tables with the output of intermediate steps, then calculate metrics per wiki page, and aggregate. A final notebook generates plots and an ODF text document.
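The staged approach described above might look roughly like the following Spark SQL; the table and column names here are illustrative assumptions, not the actual schema of the data lake.

```sql
-- Hypothetical sketch: intermediate temporary views, then per-page
-- metrics, then aggregation per wiki.
CREATE OR REPLACE TEMPORARY VIEW page_property_refs AS
SELECT wiki_db, page_id, property_id
FROM parsed_page_source;  -- output of an intermediate parsing step

CREATE OR REPLACE TEMPORARY VIEW per_page_metrics AS
SELECT wiki_db, page_id, COUNT(*) AS property_ref_count
FROM page_property_refs
GROUP BY wiki_db, page_id;

SELECT wiki_db,
       SUM(property_ref_count) AS total_refs,
       AVG(property_ref_count) AS mean_refs_per_page
FROM per_page_metrics
GROUP BY wiki_db;
```

Materializing intermediate steps as temporary views keeps each stage small enough to debug independently, which matters at the scale of the data involved.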
Publications and Talk
For more information, please see the full report. A summary paper and talk were presented at the 12th Annual Wiki Workshop.