Integrating Wikidata into an audiovisual archive

45,000 entities linked
data enrichment

Sound and Vision has a large thesaurus containing more than half a million entities. This thesaurus, the Common Thesaurus for Audiovisual Archives (GTAA), has grown over time and has become increasingly difficult to manage internally. Research showed that external, publicly available sources had more complete information about the entities covered in the thesaurus. Sound and Vision therefore set itself the goal of re-using more of this knowledge and reducing the maintenance work needed for the GTAA. In a first project, Jesse de Vos and his colleagues decided to link the terms in the thesaurus to the corresponding items in Wikidata, the open, structured knowledge base that supports Wikipedia.

'The GTAA was solely used to standardize catalogue descriptions of our collection. By linking GTAA to Wikidata, the thesaurus could be used in novel ways: to provide context and additional information, to test the quality of the terms and to create a connecting layer with other heritage collections and other types of information.'

In an initial attempt, the 137,000 personal names in the GTAA were uploaded to the Wikidata Mix'n'match tool. This tool automatically suggests matching Wikidata items for the entries in an uploaded dataset and then lets users confirm or reject these suggestions. However, in the vast majority of cases it proved impossible to make a match based solely on the personal name and the limited additional information that the GTAA occasionally contains (such as a person's occupation). A matching Wikidata item was suggested for only 10,000 personal names. Over the course of three years, 8,000 of these were confirmed by the community of Sound and Vision employees and Wikipedians.

'It was clear that we needed another approach. I took inspiration from CultuurLINK, which enables cultural heritage institutions to link their thesauri to the GTAA, and decided to discuss my idea with Spinque, which had developed this application. They immediately engaged with the idea and expressed their interest in investigating the possibility of linking our data to an external data source such as Wikidata.'

It was decided to use the Sound and Vision catalogue as a source for additional information about the people in the GTAA. After all, GTAA terms are used to describe the items in the catalogue. The idea was to determine where personal names are used in the catalogue and to extract other, hopefully related, terms in their immediate vicinity. The personal names and the additional terms would subsequently be used to find matching Wikidata persons based on all information contained in the entry. Spinque Desk was used to put this idea into practice:

'What I like about Spinque Desk is its flexibility. We were able to include all the necessary datasets, even a truthy dump of Wikidata, and we could subsequently use various building blocks to search and access them. The probabilistic results gave us the opportunity to determine which matching Wikidata persons we accepted as automatic suggestions and which we did not.'
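The idea of matching on catalogue context with a score threshold can be illustrated with a minimal sketch. This is not the Spinque Desk strategy itself: the toy records, the placeholder identifiers (Q_EXAMPLE_1, Q_EXAMPLE_2), the function names and the simple overlap score are all assumptions made for illustration, and a real probabilistic ranking would be considerably more refined.

```python
from collections import Counter

# Hypothetical toy data: each catalogue record lists the persons and the
# other thesaurus terms used to describe it.
CATALOGUE = [
    {"persons": ["Jansen, Anna"], "terms": ["politician", "parliament"]},
    {"persons": ["Jansen, Anna"], "terms": ["politician", "elections"]},
    {"persons": ["Jansen, Anna", "Bakker, Piet"], "terms": ["football"]},
]

# Hypothetical candidate persons sharing the same name. The ids are
# placeholders, not real Wikidata identifiers.
CANDIDATES = [
    {"id": "Q_EXAMPLE_1", "keywords": {"politician", "parliament"}},
    {"id": "Q_EXAMPLE_2", "keywords": {"speed skater", "sport"}},
]


def context_terms(name, catalogue):
    """Count the terms that co-occur with a personal name in the catalogue."""
    counts = Counter()
    for record in catalogue:
        if name in record["persons"]:
            counts.update(record["terms"])
    return counts


def score(candidate, context):
    """Share of the co-occurring term mentions that the candidate also has."""
    total = sum(context.values())
    if total == 0:
        return 0.0
    overlap = sum(n for term, n in context.items() if term in candidate["keywords"])
    return overlap / total


def suggest(name, catalogue, candidates, threshold=0.5):
    """Return (candidate id, score) pairs at or above the threshold, best first."""
    context = context_terms(name, catalogue)
    scored = sorted(((score(c, context), c["id"]) for c in candidates), reverse=True)
    return [(cid, s) for s, cid in scored if s >= threshold]


if __name__ == "__main__":
    print(suggest("Jansen, Anna", CATALOGUE, CANDIDATES))
```

Raising the threshold yields fewer but more reliable automatic suggestions. Lowering it leaves more of the decision to the community members who confirm or reject each match.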

Based on this approach, over 45,000 matches were automatically suggested, and 26,000 of these have since been accepted by the community. The additional context also makes it easier to match items manually. Using the approved matches, Sound and Vision is able to improve and enrich the data on Dutch media history on Wikidata and, conversely, to enrich the data in its own catalogue. For example, based on a maker's date of death, it can automatically determine when a work enters the public domain. Based on a person's date of birth, sex and occupation, it can also offer researchers advanced queries over the collection, for example to return all TV programmes in which female politicians born in the 1970s appear.
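Both examples can be sketched in a few lines once person data is available alongside the catalogue. The sketch below assumes a copyright term of the maker's life plus 70 years, running until the end of the calendar year, and a work with a single maker. The record layout, field names and placeholder identifiers are hypothetical and do not reflect the actual Sound and Vision database.

```python
from datetime import date

# Assumption: protection lasts for the maker's life plus 70 years and
# expires at the end of that calendar year.
PROTECTION_YEARS = 70

# Hypothetical person data, keyed by placeholder identifiers.
PERSONS = {
    "Q_EXAMPLE_1": {"sex": "female", "occupations": {"politician"}, "born": date(1974, 3, 5)},
    "Q_EXAMPLE_2": {"sex": "male", "occupations": {"politician"}, "born": date(1971, 9, 1)},
    "Q_EXAMPLE_3": {"sex": "female", "occupations": {"journalist"}, "born": date(1976, 1, 20)},
}

# Hypothetical catalogue entries that refer to those persons.
PROGRAMMES = [
    {"title": "Debate night", "persons": ["Q_EXAMPLE_1", "Q_EXAMPLE_2"]},
    {"title": "Evening news", "persons": ["Q_EXAMPLE_3"]},
]


def public_domain_date(death_date, protection_years=PROTECTION_YEARS):
    """First day on which a work by a single maker is in the public domain."""
    return date(death_date.year + protection_years + 1, 1, 1)


def is_female_politician_born_in_70s(person):
    """Predicate combining sex, occupation and decade of birth."""
    return (
        person["sex"] == "female"
        and "politician" in person["occupations"]
        and 1970 <= person["born"].year <= 1979
    )


def programmes_with(programmes, persons, predicate):
    """Titles of programmes in which at least one linked person matches."""
    return [
        p["title"]
        for p in programmes
        if any(predicate(persons[q]) for q in p["persons"] if q in persons)
    ]


if __name__ == "__main__":
    print(public_domain_date(date(1950, 6, 1)))
    print(programmes_with(PROGRAMMES, PERSONS, is_female_politician_born_in_70s))
```

Works with several makers, or makers whose date of death is unknown, would need additional rules that this sketch leaves out.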

Spinque puts Jesse in charge of his search and enables him to collaborate closely with Wikidata to enrich the Sound and Vision archive and to improve its services.

About the client

Sound and Vision is a combined archive, museum and knowledge institute on media culture. It collects, preserves and provides access to Dutch audiovisual heritage for as many users as possible: media professionals, researchers, teachers and the general public.

Jesse de Vos manages the Sound and Vision services aimed at academic researchers and heritage professionals. To improve these services, Jesse aims to make use of existing, community-driven data sources such as Wikidata and Discogs.


Find Jesse's side of the story on the Sound and Vision Blog (in Dutch):


Behind the screens

Spinque Desk

For this project, Spinque Desk was used to design a strategy that searches for matching Wikidata persons for all personal names in the Sound and Vision thesaurus.

The strategy first searches the Sound and Vision catalogue for terms related to each personal name. This combined information is then used to search for matching Wikidata persons.

If a match is found, the person data from Wikidata is imported into the Sound and Vision database, and the additional information is presented wherever the personal name is used.

What we can do for you

In this project, Spinque Desk was used to enrich the thesaurus of Sound and Vision, which is just one of the many ways in which this application can be used.

To which dataset could you link the entities in your domain in order to enrich them? Let us know, we are happy to think along with you!
