3 min read

ML-based recommendation system trained right: the Booking.com experience


Andrew Mende, Booking’s product manager for machine learning

Nowadays, every major e-commerce site has a recommendation system boosting sales and customer satisfaction. Such systems are constantly improved and upgraded. The problem is that each new recommendation model is typically trained on datasets generated by its predecessors.

According to Andrew Mende, product manager for machine learning at Booking.com, this approach is flawed. But there is a cure. He told us how to prevent predictive recommendation models from deteriorating into simple popularity-based algorithms.

Andrew is a former Yandex PM (Yandex is the Russian Google, as Andrew puts it). Throughout his career, he has managed big datasets, seeking to improve product performance.


Case: you land at Booking.com looking for a hotel in Paris, planning a romantic weekend. Booking.com is aware of your intent, but it has 40,000 hotel listings in Paris, and visitors rarely browse beyond the first 40. So what should Booking.com do to get you excited?

It could show the most popular hotels in Paris, or those that exactly match your price and quality expectations. Andrew and his team designed an ML-powered predictive recommendation model meant to diversify the search results page and show listing suggestions that check all the boxes and exceed expectations.

However, after some time in use, the new model started failing to deliver as expected and essentially deteriorated into a simple popularity-based algorithm.

Do the data. Go beyond algorithm fine-tuning

A predictive recommendation model is data plus algorithm. So you don't have to improve the algorithm every time: you can keep it unchanged, add more data, and the whole model should perform better.

The system typically knows a lot about its users. Booking.com’s personalization works continuously throughout your session, customizing the list shown based on the signals below (a sketch of how they might feed a model follows the list):

  • your previous searches;
  • your previous orders;
  • your network;
  • the content you’ve seen through the current session;
  • your launch point, i.e., where you came from;
  • and a number of other data points.
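
To make this concrete, here is a minimal sketch of how such session signals might be flattened into numeric model features. All names here are illustrative assumptions, not Booking.com’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SessionContext:
    """Hypothetical container for the session signals listed above."""
    previous_searches: list[str] = field(default_factory=list)
    previous_orders: list[str] = field(default_factory=list)
    viewed_this_session: list[str] = field(default_factory=list)
    launch_point: str = "direct"  # e.g. "ad", "email", "search_engine"

def to_features(ctx: SessionContext) -> dict[str, float]:
    """Flatten the raw session signals into numeric model features."""
    return {
        "n_previous_searches": float(len(ctx.previous_searches)),
        "n_previous_orders": float(len(ctx.previous_orders)),
        "n_viewed_this_session": float(len(ctx.viewed_this_session)),
        "came_from_ad": float(ctx.launch_point == "ad"),
    }
```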

Andrew tried to use these new “customer signals” to help users find the best options faster. He wrongly assumed that this data was neutral and objective. But it was not. As mentioned above, with time, the new shiny ML-powered recommendation model began to act much like the old popularity-driven algorithm.

The team soon learned that this happened because of the training dataset: it was generated by the old model and thus biased, with popular listings carrying more weight from the outset. As a result, the two A/B-tested models began generating nearly identical customer preference data.
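
A toy simulation makes the loop visible. This is an illustrative sketch with made-up numbers, not Booking.com’s pipeline: the old model’s exposure is proportional to existing popularity, so the click log it produces inherits that skew, and anything trained naively on the log learns popularity back.

```python
import random

# Toy feedback loop: the old model serves listings in proportion to their
# existing popularity, so its interaction log over-represents popular
# listings regardless of how well they actually fit each user.
exposure_share = {"popular_hotel": 0.9, "niche_gem": 0.1}

def old_model_serve() -> str:
    names = list(exposure_share)
    return random.choices(names, weights=[exposure_share[n] for n in names])[0]

log = [old_model_serve() for _ in range(10_000)]
for name in exposure_share:
    print(name, round(log.count(name) / len(log), 2))
# Prints roughly: popular_hotel 0.9, niche_gem 0.1.
# A new model trained on this log inherits the 9:1 skew as if it were
# ground-truth customer preference: popularity in, popularity out.
```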

How to mitigate data convergence in new-vs-old recommendation model A/B testing

Andrew and the team came up with three solutions:

  • True randomization;
  • Partial randomization;
  • Isolated feedback loops.

True randomization means you simply deploy the new ML-powered recommendation model and let it train in action, without feeding it any data before deployment. At the outset, it will show completely random items to visitors, hence “true randomization.” This means the recommendations will be irrelevant for a while and your sales will go down. While pure, this is an approach very few can afford.
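
As a sketch, true randomization amounts to serving uniformly at random and logging the outcomes, so the new model’s first training set carries no bias from any predecessor. The function name here is my own, not from the talk.

```python
import random

def serve_true_random(candidates: list[str]) -> str:
    """Serve a uniformly random listing. The resulting feedback log carries
    no popularity bias from any earlier model, at the cost of showing mostly
    irrelevant results until enough data accumulates."""
    return random.choice(candidates)
```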

Partial randomization means you pre-train the model a bit and still mix random items into what people see. This approach also takes some investment to support the business while sales dip because of the irrelevant recommendations.
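
The article doesn’t spell out the mixing scheme; an epsilon-greedy style blend is one common way to implement partial randomization, with the mixing rate as the knob that trades short-term sales against cleaner training data. A hedged sketch:

```python
import random
from typing import Callable

def serve_partial_random(
    candidates: list[str],
    model_top_pick: Callable[[list[str]], str],
    epsilon: float = 0.2,  # illustrative exploration rate, not a known Booking.com value
) -> str:
    """With probability `epsilon`, serve a random listing (unbiased
    exploration); otherwise serve the partially trained model's pick."""
    if random.random() < epsilon:
        return random.choice(candidates)
    return model_top_pick(candidates)
```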

The third option is to isolate the feedback loops of the A and B models being tested. A is the new one, trained on the biased datasets previously generated by B. But if you support its further training with only the data it generates itself, the results will be much better: simple popularity stops outweighing other factors, and the search results page finally starts looking different from the one delivered by the popularity-driven algorithm.
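
Mechanically, isolating the loops can be as simple as tagging every logged interaction with the model that served it and filtering on that tag at retraining time. A minimal sketch, assuming a hypothetical `served_by` field on each log record:

```python
def own_loop_data(model_id: str, interaction_log: list[dict]) -> list[dict]:
    """Keep only the interactions that `model_id` itself generated, so
    variant A never retrains on clicks produced by variant B's ranking."""
    return [event for event in interaction_log if event["served_by"] == model_id]

# During the A/B test, each variant retrains only on its own feedback loop:
#   retrain(model_a, own_loop_data("A", log))
#   retrain(model_b, own_loop_data("B", log))
```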

With this approach, you will need to control the data each model generates, along with the consequences and dependencies, so it is not cheap, but it is completely doable.

When training a new model, the main thing to remember is that all historical data are biased because they were generated by a previously deployed algorithm. There is no universal optimum when it comes to recommendation models: change the model, and customer behavior follows.

Follow this link to see Andrew describe the problem and the solutions in greater detail. The full version is available for Epic Growth Premium members. You can start with a 7-day FREE trial.


