Data Science at FMP: Dataset Recommendation


Problem Outline

One of the problems we recently tackled in the data science team was how we could introduce our users to new datasets they may not be familiar with. We wanted these datasets to be personal to each user and relevant to their research areas. A recommendation system of some sort was the obvious way to solve this problem.

Recommendation Systems

Recommendation systems come with several possible methodologies. Two of the most widely known methodologies are Collaborative Filtering (which can be split into two branches, user-based or item-based) and Content-Based Filtering.

User-based collaborative filtering finds users with similar preferences and recommends items that those similar users rated highly but that the target user has not yet viewed. E.g., “Users who liked similar items to you also liked…”

For item-based collaborative filtering, instead of finding similar users we find items that have been rated similarly by different users and recommend them based on a user’s previous preferences. E.g., “Users that liked this item also liked…”

An important part of the data required for these techniques is ratings, so you can tell which items users liked.
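
As a rough illustration of the user-based variant, here is a minimal sketch in plain Python. The users, items and ratings are invented for the example, and cosine similarity is just one common choice of similarity measure. It is not the code we ran.

```python
from math import sqrt

# Hypothetical ratings (1-5 scale): user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "E": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend_user_based(target, ratings, top_n=2):
    """Score items the target has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommendations = recommend_user_based("alice", ratings)
print(recommendations)  # ['D', 'E']
```

Bob's tastes are much closer to Alice's than Carol's are, so the item Bob rated highly comes first. Without ratings none of these scores can be computed, which is why ratings matter so much for this family of techniques.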

Another popular technique is Content-Based Filtering. This technique requires less information about each individual user and more about the items themselves. The recommendations are created by finding items that have similar characteristics, e.g., movies of the same genre, with the same actors, or directors. Then given a user’s previous preferences you can recommend items similar to items they previously enjoyed. A broad and consistent range of metadata is needed for each of the items and again ratings are useful for this methodology.
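
To make the idea concrete, here is a small sketch of content-based filtering. The movie titles and tags are made up, and Jaccard similarity over tag sets is an assumption chosen for simplicity. Real systems often use richer feature representations.

```python
# Hypothetical item metadata: title -> set of descriptive tags
catalogue = {
    "Alien Dawn":   {"sci-fi", "thriller", "director:lee"},
    "Star Harbor":  {"sci-fi", "adventure", "director:lee"},
    "Quiet Garden": {"drama", "romance", "director:kim"},
    "Night Signal": {"thriller", "horror", "director:park"},
}

def jaccard(a, b):
    """Share of tags two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_similar(liked, catalogue, top_n=2):
    """Rank unseen items by their best similarity to any liked item."""
    scores = {}
    for title, tags in catalogue.items():
        if title in liked:
            continue
        scores[title] = max(jaccard(tags, catalogue[l]) for l in liked)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

similar = recommend_similar(["Alien Dawn"], catalogue)
print(similar)  # ['Star Harbor', 'Night Signal']
```

The quality of the output depends entirely on the quality of the tags: two items with heavily overlapping metadata will always score as similar, whether or not a user would see them as related.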

Within our domain at Findmypast, a Content-Based Filtering system would not be appropriate. Datasets that share a lot of metadata do not necessarily make good recommendations for one another. E.g., Kentucky Birth Records would most likely share just as much metadata with New South Wales Births as with Kentucky Death Records, even though the latter is obviously a much better recommendation.

An early proof of concept used an adapted version of user-based collaborative filtering. Although the results were promising, we do not have any user ratings for datasets, so there was limited scope to improve this methodology without setting up a ratings system.

Due to these domain and data challenges, we decided to try another approach known as a Sequential Recommender System. This technique stems from the natural language processing field, where a simple use case is predicting the next word in a sentence. A model that has been pre-trained on a corpus can output, for any point in a sequence, the probability of each word it knows occurring next. E.g., if we fed the model “I am going…”, the word with the highest probability of coming next could be “to”. You can take this logic and apply it to many domains. For example, if a user has recently looked at several datasets from one geographical area, it is likely that the next dataset in the sequence will be from that area. Provided that the user hasn’t viewed that dataset previously, it would likely make a good recommendation.
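
The next-word idea can be shown with a deliberately simple stand-in: a model that only counts which word follows which in a tiny made-up corpus. This is far cruder than a neural network, which can take the whole preceding sequence into account, but the output has the same shape: a probability for each known word.

```python
from collections import Counter, defaultdict

# Tiny invented corpus, for illustration only
sentences = [
    "i am going to work",
    "i am going to sleep",
    "i am going home",
]

# Count how often each word follows each other word
transitions = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        transitions[current][following] += 1

def next_word_probabilities(word):
    """Turn follow-on counts for `word` into probabilities."""
    counts = transitions[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probabilities("going")
best = max(probs, key=probs.get)
print(best)  # to
```

Swap words for dataset tokens and sentences for a user's viewing history, and the same approach predicts the dataset a user is likely to view next.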

Our Methodology

To implement this technique, some pre-processing is required, including:

Converting all the dataset and newspaper labels in our collection into integer tokens (tokenisation).
Transforming our view data into sequences for each user.
Splitting these long sequences into smaller more relevant sequences.
Maximising the amount of training data by including every dataset view as the end of a sequence.
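
The steps above can be sketched in plain Python as follows. The view log, the 24-hour gap used to split sequences, and the choice to reserve token 0 for padding are all assumptions made for this example. They are not a description of our production pipeline.

```python
from collections import defaultdict

# Hypothetical view log: (user_id, time_in_hours, dataset_label)
views = [
    ("u1", 0, "Kentucky Births"),
    ("u1", 1, "Kentucky Deaths"),
    ("u1", 2, "Kentucky Marriages"),
    ("u1", 100, "New South Wales Births"),
    ("u1", 101, "New South Wales Deaths"),
    ("u2", 5, "Kentucky Births"),
    ("u2", 6, "Kentucky Marriages"),
]
MAX_GAP_HOURS = 24  # assumed rule: a longer gap starts a new sequence

# 1. Tokenisation: map each label to an integer (0 is kept free for padding)
labels = sorted({label for _, _, label in views})
token_of = {label: i + 1 for i, label in enumerate(labels)}

# 2. One time-ordered sequence of (time, token) per user
per_user = defaultdict(list)
for user, ts, label in sorted(views, key=lambda v: (v[0], v[1])):
    per_user[user].append((ts, token_of[label]))

# 3. Split each long sequence wherever the gap between views is too large
def split_sessions(events, max_gap):
    sessions, current, last_ts = [], [], None
    for ts, token in events:
        if last_ts is not None and ts - last_ts > max_gap:
            sessions.append(current)
            current = []
        current.append(token)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

# 4. Every view after the first becomes the target of its own example
def training_examples(session):
    return [(session[:i], session[i]) for i in range(1, len(session))]

examples = []
for events in per_user.values():
    for session in split_sessions(events, MAX_GAP_HOURS):
        examples.extend(training_examples(session))

print(examples)  # [([1], 2), ([1, 2], 3), ([4], 5), ([1], 3)]
```

A sequence of three views yields two training examples rather than one, which is how the last step increases the amount of training data.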

Once the data processing is complete, we use a TensorFlow neural network to predict the next dataset to be viewed. This neural network consists of a Long Short-Term Memory (LSTM) layer and an Attention layer. For each input sequence, the output is an array of probabilities, one for each dataset in the training data. We then select the datasets with the highest probabilities as potential recommendations. Any datasets that a user has seen recently are filtered out.
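
The selection step at the end can be sketched like this. The probabilities are invented, and the sketch assumes the array is indexed by dataset token with index 0 reserved for padding. The network itself is not shown.

```python
def top_recommendations(probabilities, recently_viewed, k=3):
    """Return the k most probable dataset tokens the user has not seen recently."""
    candidates = [
        (p, token)
        for token, p in enumerate(probabilities)
        if token != 0 and token not in recently_viewed  # skip padding and seen items
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [token for _, token in candidates[:k]]

# Hypothetical model output for one user: index = dataset token
probabilities = [0.0, 0.40, 0.25, 0.05, 0.20, 0.10]

recs = top_recommendations(probabilities, recently_viewed={1, 4}, k=2)
print(recs)  # [2, 5]
```

Tokens 1 and 4 have the highest and third-highest probabilities but are dropped because the user viewed them recently, so the next best candidates are returned instead.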

Relevancy Scores

A problem with this methodology is that the model is scored on whether it correctly predicts what a user will view next. However, there will be datasets that are good recommendations even though the user didn’t view them. For this reason, it is hard to measure the realistic performance of the model using standard metrics. This led us to develop some custom relevancy metrics to judge the model. These metrics compare the metadata of the datasets in each sequence with that of the recommendation. The metadata we use includes the geographic location and record types of each dataset. We can use these metrics to measure the impact of changes to the models without having to qualitatively judge recommendations by hand.
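
One possible shape for such a metric is sketched below. The metadata values and the scoring formula (the average, across fields, of the share of viewed datasets matching the recommendation) are assumptions for illustration only. They are not the exact metrics we use.

```python
# Hypothetical metadata for a handful of datasets
metadata = {
    "Kentucky Births": {"location": "Kentucky", "record_type": "births"},
    "Kentucky Deaths": {"location": "Kentucky", "record_type": "deaths"},
    "Kentucky Marriages": {"location": "Kentucky", "record_type": "marriages"},
    "New South Wales Births": {"location": "New South Wales", "record_type": "births"},
}

def relevancy_score(sequence, recommendation, fields=("location", "record_type")):
    """Average, over metadata fields, of the share of viewed datasets
    whose value for that field matches the recommendation's."""
    rec = metadata[recommendation]
    field_scores = []
    for field in fields:
        matches = sum(1 for d in sequence if metadata[d][field] == rec[field])
        field_scores.append(matches / len(sequence))
    return sum(field_scores) / len(fields)

sequence = ["Kentucky Births", "Kentucky Deaths"]
score_local = relevancy_score(sequence, "Kentucky Marriages")
score_remote = relevancy_score(sequence, "New South Wales Births")
print(score_local, score_remote)  # 0.5 0.25
```

A score like this can be averaged over many sequences to compare two versions of a model, even when neither recommendation was the dataset the user actually viewed next.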

Next steps

This system is currently in production and saves recommendations for each of our users daily. We are now designing the user journey for serving these recommendations to customers, and will make them available in the app and on the site soon.

Get in touch

We are always looking for new ideas and collaborations, so if you’re interested, contact us or check out our current vacancies.