Should I wait or grab a cup of coffee?

Posted on 16/10/2019 by Laurens Kuiper

While creating a search strategy in Spinque Desk, you often want to check the output at a particular point in your evolving strategy. Right now, all you see is "X operations left" while the system builds a preview of your results. This is accurate, but not very informative. I recently did an internship at Spinque where I created a prediction model to help users decide what to do while waiting: "Should I wait, grab a cup of coffee, or go for a break?"

Nowadays, when downloading or processing data, we have become used to seeing a progress bar that tells us exactly how much work is done and how much is left. In cases such as downloading files or processing images, the estimate is simple: 5 out of 10 means 50% done, and if that took 3 minutes, the rest will probably take 3 minutes more. For Spinque, it is not that easy. You have probably noticed that our current way of displaying progress is not very informative. Sometimes "70 operations left" becomes 30 in a matter of seconds, then takes minutes to become 29, before finishing quickly. This is because the amount of data processed varies greatly from one operation to the next.
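To make the problem concrete, here is a minimal Python sketch of the naive progress-bar estimate. The function name `linear_eta` and the operation durations are made up for illustration; they are not taken from Spinque's system.

```python
def linear_eta(done, total, elapsed):
    """Remaining time, assuming every unit of work costs the average so far."""
    if done <= 0:
        raise ValueError("need at least one finished unit to extrapolate")
    return elapsed / done * (total - done)


# Uniform work: 5 of 10 units done in 180 seconds, so 180 seconds remain.
uniform_estimate = linear_eta(5, 10, 180)

# Uneven work (hypothetical durations in milliseconds): 40 quick operations,
# one very slow operation, then 29 quick ones again.
durations = [100] * 40 + [120_000] + [100] * 29
elapsed = sum(durations[:40])             # time spent on the first 40
estimate = linear_eta(40, len(durations), elapsed)
actual = sum(durations[40:])              # what the remaining 30 really cost

print(uniform_estimate, estimate, actual)
```

With these invented numbers the linear estimate after 40 operations is 3 seconds, while the remaining work actually takes about 2 minutes, because one operation dominates the total.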

During my internship, I learned how the back-end system of Spinque works, created a prediction model using machine learning techniques, and finally integrated it into Spinque Desk, which you will hopefully see soon! This blog post is my story about diving into Spinque's databases, sifting through the cache, learning what "X operations left" actually means, and how I applied my skills as a Data Science student to solve this problem.

Predicting what, exactly?

Anyone who uses machine learning in practice will agree that understanding what you are trying to predict is the first step in building a model. Therefore, I made it my job to understand the inner workings of Spinque's back-end system as quickly as possible. Roberto set me up with a copy of a production database, which I could use however I wanted without the risk of accidentally deleting a lot of data. This was reassuring because, unlike in university projects, real clients are involved.

Spinque acknowledges that its clients, which range from webshops to historical archives, have very different search needs. In Spinque Desk you can build custom search strategies, which define how a search engine operates. At any point during the creation of a strategy, you can choose to preview the results. Previewing results requires an index to be built, which is similar to an inverted index at the back of an encyclopedia. Building the index takes time, depending on the size of the data set and the complexity of the strategy, both of which can vary greatly.

Spinque's approach to search engines is to integrate IR and DB: information retrieval and databases. The clients' data is stored in a database, and a search index is built using database operations (SQL). How to build this index is described in SpinQL, Spinque's own domain-specific language based on Probabilistic Relational Algebra. In fact, a search strategy is nothing more than a graphical representation of a sequence of SpinQL expressions. Each expression is an 'operation', which is exactly what you are waiting for! The SpinQL compiler compiles these expressions to SQL, which is then executed by the database management system MonetDB.

Just dump it into a model

Now that we've got some data and know what to predict, we can throw it all into a big neural network and get some results! ... Right? Well, it turned out to be a bit more difficult. Unlike many datasets used for education or competitions that you might find on Kaggle or in the UCI Machine Learning Repository, this data is nowhere near suitable to feed to a prediction model. This is one of the first lessons many data science students learn when working in 'the real world'. Take the following SpinQL expression, for example, which took around 500 milliseconds to execute.

  _cachedrel_3 := Select [$1="TextField"](Unite (Project SUM[$1,$2](_cachedrel_1), Project SUM[$1,$2](_cachedrel_2)));

This describes how table _cachedrel_3 was made, and we know exactly how long that took, but it does not tell us 'why' it took this long. I decided to write a script that splits each expression up into simple expressions, like so:

  _cachedrel_3_3 := Project SUM[$1,$2](_cachedrel_1);
  _cachedrel_3_2 := Project SUM[$1,$2](_cachedrel_2);
  _cachedrel_3_1 := Unite (_cachedrel_3_3, _cachedrel_3_2);
  _cachedrel_3 := Select [$1="TextField"](_cachedrel_3_1);
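The splitting step could look roughly like the Python sketch below. This is not the script used during the internship. It assumes a simplified syntax inferred only from the example above: an operator head, optional parameters in square brackets, and comma-separated arguments in parentheses. The helper names `parse` and `flatten` and the numbering scheme are my own choices, and quoted strings containing brackets or commas are not handled.

```python
from collections import deque


def parse(expr):
    """Split 'Head(arg, ...)' into (head, [args]); None for a bare table name."""
    expr = expr.strip()
    square = 0        # nesting depth of [...] parameter lists
    open_idx = None
    for i, ch in enumerate(expr):
        if ch == "[":
            square += 1
        elif ch == "]":
            square -= 1
        elif ch == "(" and square == 0:
            open_idx = i
            break
    if open_idx is None:
        return None
    if not expr.endswith(")"):
        raise ValueError(f"unbalanced expression: {expr!r}")
    head, body = expr[:open_idx], expr[open_idx + 1:-1]
    args, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == "," and depth == 0:   # comma between top-level arguments
            args.append(body[start:i].strip())
            start = i + 1
    args.append(body[start:].strip())
    return head, args


def flatten(statement):
    """Rewrite one nested assignment as a list of single-operator assignments."""
    target, _, expr = statement.strip().rstrip(";").partition(":=")
    target = target.strip()
    counter = 0
    steps = []
    queue = deque([(target, expr)])
    while queue:
        name, text = queue.popleft()
        parsed = parse(text)
        if parsed is None:               # right-hand side is just a table name
            steps.append(f"{name} := {text.strip()};")
            continue
        head, args = parsed
        new_args = list(args)
        for i in reversed(range(len(args))):
            if parse(args[i]) is not None:   # nested operator: give it a name
                counter += 1
                sub = f"{target}_{counter}"
                new_args[i] = sub
                queue.append((sub, args[i]))
        steps.append(f"{name} := {head}({', '.join(new_args)});")
    return steps[::-1]                   # sub-expressions before their users


nested = ('_cachedrel_3 := Select [$1="TextField"](Unite ('
          'Project SUM[$1,$2](_cachedrel_1), Project SUM[$1,$2](_cachedrel_2)));')
for line in flatten(nested):
    print(line)
```

Run on the nested expression from earlier, this sketch produces the same four simplified expressions as listed above, in an order where every intermediate table is created before it is used.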

Each one of these expressions will be a data point for the prediction model. This is a step in the right direction, but models require each data point to have a number of features and a value to predict. It took around 500 milliseconds to execute the sequence of these four expressions, but there is no way to know how long each one took unless these simplified expressions are executed again, one at a time.

So this was my next step. I executed each of the simplified expressions again, all while measuring and querying important values like the size of the tables, the range of values within them, and other statistics. This is the 'why' behind the duration of each operation. These values were saved to a large dataset on which models could be trained.
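In Python, the measuring step might be sketched as follows. The callables `execute` and `table_stats` are hypothetical placeholders for "run this statement on the database" and "look up statistics about its input tables"; they do not correspond to a real Spinque or MonetDB API, and the feature name in the usage example is invented.

```python
import time


def measure(statement, execute, table_stats):
    """Run one simplified statement; return its duration and input statistics."""
    features = table_stats(statement)     # hypothetical, e.g. {"input_rows": 1200}
    start = time.perf_counter()
    execute(statement)                    # hypothetical database call
    duration_ms = (time.perf_counter() - start) * 1000.0
    return {"statement": statement, "duration_ms": duration_ms, **features}


def build_dataset(statements, execute, table_stats):
    """One row per simplified statement: features plus the value to predict."""
    return [measure(s, execute, table_stats) for s in statements]


# Usage with stand-in callables instead of a real database connection.
rows = build_dataset(
    ["_cachedrel_3_3 := Project SUM[$1,$2](_cachedrel_1);"],
    execute=lambda statement: None,
    table_stats=lambda statement: {"input_rows": 1200},
)
```

Each resulting row holds the features (the 'why') together with the measured duration, which is the value the model is trained to predict.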

Now that the dataset was made, I trained and evaluated the models, and found that for expressions that contain a simple operator ('Project', 'Sort'), duration was easy to predict, resulting in a low error rate for the model. However, for more complex operators, especially 'Join', this was not the case. My professor at the university provided me with some literature, which clarified that cardinality estimation is still a very difficult problem within the database research community. In short, this means that it is hard to predict how large some tables will be, and that I should not expect to get accurate results. Nevertheless, I decided to dive into the topic, worked on feature engineering, and tweaked the model until I managed to improve the predictions.
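A small Python illustration of why 'Join' is hard: the size of an equi-join result depends on how the key values overlap, not just on how large the inputs are. The function `join_size` and the toy key columns are invented for this example.

```python
from collections import Counter


def join_size(left_keys, right_keys):
    """Number of rows an equi-join on these two key columns would produce."""
    right_counts = Counter(right_keys)
    return sum(right_counts[key] for key in left_keys)


n = 1000
one_to_one = join_size(range(n), range(n))         # every key matches exactly once
disjoint = join_size(range(n), range(n, 2 * n))    # no key matches at all
all_equal = join_size([0] * n, [0] * n)            # every row matches every row

print(one_to_one, disjoint, all_equal)
```

All three joins take two inputs of 1000 rows, yet the output ranges from 0 rows to 1,000,000 rows. A model that only sees the input sizes cannot tell these cases apart, which is why features about the values inside the tables matter.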

Putting everything together

Predictions for single, simple expressions can be made using the trained models, but not yet for a sequence of expressions. Up until this point, I had worked in Python, a programming language that I am very comfortable with and that has many machine learning libraries available. However, for the model to be used in Spinque Desk, it has to be written in Java. I decided that this was the time to make the switch, since I had shown as a proof of concept that predictions could be made.

I began rewriting and restructuring my scripts as Java classes. Since the code was already written, the hardest part was dealing with library dependency issues. Once these issues were fixed and the code was rewritten, I worked on making sequential prediction possible and on integrating it so it could be seen in Spinque Desk. Michiel and Wouter helped me set up a local version of Spinque Desk, which allowed me to do a lot of debugging and tweaking.

The result is a model that can predict the duration of a sequence of cache expressions up to a decent level of certainty, which also shows the client 'how sure' it is of its prediction. It will be added in a later release of Spinque Desk (hopefully soon!).
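One simple way to combine per-operation predictions into a prediction for a whole sequence is sketched below in Python. This is an assumption for illustration, not necessarily how the integrated model works: it supposes each operation comes with a predicted mean and standard deviation, and that the prediction errors of different operations are independent, so that their variances add up.

```python
import math


def predict_sequence(predictions):
    """Combine (mean_ms, std_ms) pairs for each operation into one estimate.

    Assumes independent errors: means add up, and so do variances.
    """
    total_mean = sum(mean for mean, _ in predictions)
    total_std = math.sqrt(sum(std * std for _, std in predictions))
    return total_mean, total_std


# Hypothetical predictions for two operations, in milliseconds.
mean_ms, std_ms = predict_sequence([(100, 30), (400, 40)])
print(f"about {mean_ms} ms, give or take {std_ms:.0f} ms")
```

The spread of the combined estimate is one possible way to express 'how sure' a prediction is: the wider it is relative to the mean, the less the number should be trusted.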

Reflection

One of the most important experiences as a student who will soon enter the professional world is doing an internship. At the Radboud University in Nijmegen where I am currently studying, it is actually required to do one in most master's programmes. Having finished the greater part of the curriculum except for the research internship and thesis, I was on the lookout for an interesting company to work at.

I came into contact with Spinque after attending a guest lecture at our university given by Roberto and Michiel as part of a course on Information Retrieval. On my first visit to their office in Utrecht, we agreed on a six-month, half-time internship during which I would try to develop the prediction model. My activities there would be supervised by Roberto, although I was assured that I was free to ask any of my colleagues questions. To my surprise, I had to sign a non-disclosure agreement to ensure the anonymity of their clients.

Looking back at my time at Spinque, I have to say I really enjoyed it. All of my colleagues were very kind, and willing to answer any questions that I had. Even though I had to travel for an hour and a half to get there, I usually looked forward to going to work. Halfway through my time there, we had a company outing where I got to know all of my colleagues better, which I really enjoyed.

I learned a lot about working in a professional environment and about how many of the skills that I have acquired while studying can be applied in a real-life situation. Many of the datasets that I worked with at the university are easier to predict, so I was confident that predictions for Spinque would go well, but I found them to be much harder than expected. Nevertheless, the predictions became better over time, and good enough that the model will be added to Spinque Desk, where it can help clients, which I am proud of.

So, what's next for me? Working at Spinque went so well that I have signed a contract with them. I will be doing my master's thesis and working half-time. My internship has helped me find a direction for my master's thesis, which I was unsure about before. Databases are not covered in detail at my university, but by working with them I have realised that they are actually very interesting! I hope to learn more about IR and databases in the coming year.