Content-based dataset search with Spinque PSSA

Posted on 08/10/2015 by Wouter Alink

Within the COMSODE project, Spinque applied its Search by Strategy approach to Open Data. We successfully extended the approach to Linked Open Data and demonstrated the technology through several search applications. At the end of the project we are very pleased with the result, and we are grateful for the chance to help move the Open Data community a bit forward.

There was still one more thing we wanted to do. During the project an additional request for search functionality came up: can we use Spinque to find datasets? Although this was not anticipated in the project proposal, we wanted to support it. Spinque's Search by Strategy approach is, however, not the best match for this purpose. This type of search functionality asks for Spinque PSSA. Let me explain...

The Comsode project has published over 150 datasets with the Open Data Node. These datasets are available in the public CKAN at http://data.comsode.eu/dataset. With the built-in search functionality from CKAN we can find specific datasets, e.g. the datasets about 'inspections'. This search functionality uses the metadata associated with each dataset, typically a title and a description. Proper metadata will increase the findability of the datasets, but it will always remain a limited reflection of the content itself. For example, when searching for 'praha' (or 'prague') we do not find any datasets, even though a large number of the datasets are from the Czech Republic and do contain such keywords.

To find the datasets that contain information about Prague we need to search in the data itself. With CKAN we can only index the metadata and not the content. In fact, searching the content is not straightforward with a traditional search engine either. First, we are dealing with a lot of data. Second, the data comes in various formats, such as RDF and CSV. Spinque's Search by Strategy is also not the best match for this case. The Search by Strategy approach is intended to support information specialists with the modeling of tailored search functionality, and it therefore requires some knowledge of the data format, structure and semantics. Search by Strategy is like a precision drill that allows you to carve out something very specific. For content-based search of datasets we need a hammer! It should work on everything without any thinking. We need something that works directly on the raw data, something that can handle extremely large data and that is lightning fast!

Meet Spinque PSSA!

PSSA is Spinque's Ctrl-F solution for large datasets. It works like the Ctrl-F in your favorite document editor: exact string matching, without tokenization, stemming or other features. It doesn't even interpret the character encoding (it can search Mandarin text as easily as text in Western scripts). It reports all occurrences of a string in a fraction of a second (often within 25 ms). This speed opens up new ways to interactively explore very large datasets.
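To make the matching behaviour concrete, here is a minimal Python sketch of exact, encoding-agnostic matching on raw bytes. It only illustrates the semantics described above: the function name `find_all` is made up for this example, and the simple linear scan says nothing about how PSSA itself is implemented or how it reaches its speed.

```python
def find_all(data: bytes, pattern: bytes) -> list[int]:
    """Return the byte offsets of every exact occurrence of pattern in data.

    No tokenization, stemming or case folding: the bytes either match or
    they do not. Overlapping occurrences are reported as well.
    """
    if not pattern:
        return []
    positions = []
    pos = data.find(pattern)
    while pos != -1:
        positions.append(pos)
        # Continue one byte further so overlapping matches are not skipped.
        pos = data.find(pattern, pos + 1)
    return positions


if __name__ == "__main__":
    # The text is never decoded; the query is simply encoded the same way.
    raw = "Praha je praha. 布拉格 praha".encode("utf-8")
    print(len(find_all(raw, b"praha")))              # matching is case-sensitive
    print(find_all(raw, "布拉格".encode("utf-8")))    # non-Western script works the same
```

Because nothing is interpreted, 'Praha' and 'praha' are different strings here, just as they are for a plain Ctrl-F with case sensitivity switched on.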

We indexed all the Comsode datasets with Spinque PSSA. When searching for 'praha' you can now instantly find all datasets containing this word. The result includes the number of word occurrences per dataset:


[
    {
        group: 'rdf/CTIA_1-coi-kontroly.trig',
        fileoffset: '1697014',
        count: 13124
    },
    {
        group: 'rdf/CUZK_52-pracoviste-resortu.trig',
        fileoffset: '196866802',
        count: 24
    },
    {
        group: 'rdf/CZ_MZP_01-cenia-cz-pollution.trig',
        fileoffset: '199794811',
        count: 1670
    },
    {
        group: 'rdf/MFCR_3-ares-rzp.trig',
        fileoffset: '236577546',
        count: 22637
    },
    {
        group: 'rdf/MICR_1-gov-cz-organy.trig',
        fileoffset: '1591550612',
        count: 6221
    },
    {
        group: 'rdf/MICR_3-gov-cz-agendy.trig',
        fileoffset: '1665949269',
        count: 247
    },
    {
        group: 'rdf/eh_3-medicinal-products-sukl-cz.ttl',
        fileoffset: '2075119200',
        count: 293
    }
]

Each result contains the filename (as the group), the offset at which that file starts, and the number of occurrences. The dataset from the Czech Chamber of Commerce (MFCR_3-ares-rzp.trig) contains the most occurrences of 'praha' (22637), followed by the dataset from the Czech inspection authority (CTIA_1-coi-kontroly.trig) with 13124.
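As a small illustration, the result can be ranked by occurrence count in a few lines of Python. The list below copies the response shown above into Python dictionaries; the variable and function names are chosen for this example only.

```python
# The response shown above, copied into Python dictionaries.
results = [
    {"group": "rdf/CTIA_1-coi-kontroly.trig", "fileoffset": "1697014", "count": 13124},
    {"group": "rdf/CUZK_52-pracoviste-resortu.trig", "fileoffset": "196866802", "count": 24},
    {"group": "rdf/CZ_MZP_01-cenia-cz-pollution.trig", "fileoffset": "199794811", "count": 1670},
    {"group": "rdf/MFCR_3-ares-rzp.trig", "fileoffset": "236577546", "count": 22637},
    {"group": "rdf/MICR_1-gov-cz-organy.trig", "fileoffset": "1591550612", "count": 6221},
    {"group": "rdf/MICR_3-gov-cz-agendy.trig", "fileoffset": "1665949269", "count": 247},
    {"group": "rdf/eh_3-medicinal-products-sukl-cz.ttl", "fileoffset": "2075119200", "count": 293},
]


def rank_by_count(items):
    """Return the datasets ordered from most to fewest occurrences."""
    return sorted(items, key=lambda r: r["count"], reverse=True)


if __name__ == "__main__":
    for r in rank_by_count(results):
        print(f'{r["count"]:>6}  {r["group"]}')
```

Note that the offsets arrive as strings in the response, so they need an `int(...)` conversion before they can be compared or used as numeric bounds.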

The index is not really split into datasets. It is one long array, in which the offset indicates where a new file (a dataset) starts. Using the offsets as constraints, we can search within a single dataset. The offset of the first dataset in the result is 1697014, and the offset of the second dataset is 196866802. So if you want to find actual snippets, but only within the first dataset, you use 1697014 as the start offset and 196866802 as the end offset. You can change the size of the returned text snippets through the prefix and suffix parameters. The prefix parameter indicates the number of characters returned before the match and the suffix parameter the number of characters after the match.
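The idea of one long array with file offsets can be sketched in Python with a toy in-memory model. Everything here is an assumption made for illustration: `build_index`, `owner` and `snippets` are hypothetical helpers, not PSSA functions, and the prefix and suffix are counted in bytes because the data is never decoded.

```python
from bisect import bisect_right


def build_index(files):
    """Concatenate (name, content) pairs into one byte array.

    Returns the array plus parallel lists of file names and start offsets.
    """
    names, starts, parts, pos = [], [], [], 0
    for name, content in files:
        names.append(name)
        starts.append(pos)
        parts.append(content)
        pos += len(content)
    return b"".join(parts), names, starts


def owner(names, starts, position):
    """Map a position in the long array back to the file that contains it."""
    return names[bisect_right(starts, position) - 1]


def snippets(data, pattern, start=0, end=None, prefix=10, suffix=10):
    """Return the text around each match that lies fully inside [start, end)."""
    end = len(data) if end is None else end
    found = []
    pos = data.find(pattern, start, end)
    while pos != -1:
        lo = max(start, pos - prefix)                  # do not leak into the previous file
        hi = min(end, pos + len(pattern) + suffix)     # nor into the next one
        found.append(data[lo:hi])
        pos = data.find(pattern, pos + 1, end)
    return found


if __name__ == "__main__":
    data, names, starts = build_index([
        ("a.trig", b"sidlo Praha 1; praha 2"),
        ("b.ttl", b"Brno only"),
        ("c.trig", b"okres praha-vychod"),
    ])
    # Search only inside the first file: its start offset up to the next start.
    print(snippets(data, b"praha", starts[0], starts[1], prefix=3, suffix=2))
    print(owner(names, starts, data.find(b"praha", starts[2])))
```

In this sketch the snippet is clipped at the dataset boundaries, which is a design choice of the example. When the end offset is taken from the next dataset in a search result rather than from the next file in the index, the matches are the same, since any file in between contains no occurrence of the query.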

With these two requests we could extend CKAN with content-based dataset search. First, find all datasets containing the query. Optionally, order them by the number of occurrences. Then, for each dataset, show snippets containing the query.
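The flow could be sketched as follows. This is only an outline under stated assumptions: `count_fn` and `snippet_fn` are hypothetical callables standing in for the two requests (their real endpoints and parameter names are not specified here), and `count_fn` is assumed to return a list shaped like the result shown earlier.

```python
def content_search(query, count_fn, snippet_fn, max_snippets=3):
    """Two-step content-based dataset search.

    count_fn(query) is assumed to return a list of
    {"group", "fileoffset", "count"} entries; snippet_fn(query, start, end)
    is assumed to return the snippets found between two offsets.
    """
    results = count_fn(query)

    # The end offset of a dataset is the start offset of the next dataset
    # in the result; the last one is left open-ended (None).
    by_offset = sorted(results, key=lambda r: int(r["fileoffset"]))
    end_of = {}
    for cur, nxt in zip(by_offset, by_offset[1:] + [None]):
        end_of[cur["group"]] = int(nxt["fileoffset"]) if nxt else None

    # Optional step: show the datasets with the most occurrences first.
    hits = []
    for r in sorted(results, key=lambda r: r["count"], reverse=True):
        found = snippet_fn(query, int(r["fileoffset"]), end_of[r["group"]])
        hits.append({
            "dataset": r["group"],
            "count": r["count"],
            "snippets": found[:max_snippets],
        })
    return hits
```

Passing the two requests in as functions keeps the sketch independent of any particular HTTP client, and makes it easy to try out with stand-in functions before wiring it into a CKAN extension.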