If you haven’t seen it already, we have started rolling out a recommended reading section for our threads!

While the recommender system is still at a naive stage, it marks a new milestone for our (currently very much remote) office: VerticalScope is introducing applied research and machine learning into its workflow.

This blog is meant to be a friendly introduction to some of our research efforts concerning our first release, and future work on the recommended reading sections of our forums.

We’ll take you on a tour of some of the questions and solutions our MLDS (Machine Learning, Data Science) team is focused on at VS. This release is at the proof-of-concept stage, with the hope of eventually serving our forum enthusiasts quality content via better recommendations!

Some of the work from our new team will focus on things like machine learning, probabilistic methods, or various optimization problems. So to make sure no one gets lost along the way, here are a few interpretations that might help!
  • Machine Learning is a subset of Artificial Intelligence.
    It involves using samples of data to make decisions or predictions.

  • Natural Language Processing (NLP) is a subset of Artificial Intelligence & Linguistics focusing on the use and the interpretation of language. Oftentimes, NLP implies the use of Machine Learning.

  • A Recommender System is a piece of software whose goal is to provide a recommendation.

  • Information Retrieval is a field that focuses on extracting relevant information, often in an inexpensive and efficient way (in our case, the data is a collection of text called “threads”).

The new recommended reading section at VS provides readers with the top 5 threads most similar to the thread they are currently reading.

Traditional methods for comparing text only take the frequency of a term (or group of terms) into account when determining which pieces are most statistically meaningful; they do so without understanding the surrounding “context”.
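To make that frequency-based baseline concrete, here is a minimal sketch of TF-IDF weighting with cosine similarity; the toy threads are invented, and nothing here reflects our production code.

```python
import math
from collections import Counter

# A toy corpus of forum "threads" (hypothetical examples).
threads = [
    "brake pads squeaking after replacement",
    "replacement brake rotors for a w126",
    "best lures for spring bass fishing",
]

def tf_idf_vectors(docs):
    """Compute a simple TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many threads does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

vecs = tf_idf_vectors(threads)
# The two brake threads share terms, so they score higher together
# than either does with the fishing thread.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # -> True
```

Notice that the score is driven purely by shared surface terms: the method has no idea that “rotors” and “pads” are both brake parts.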

So how are NLP & Machine Learning being used here?

In NLP, one way to understand words is by learning their representations. Words having similar meaning should have similar representations. A word can be translated to its representation using embeddings, which are often trained using an artificial neural network.

In our case, the model in place is a sentence embedding. Just like word embeddings, our sentences can be represented in a fixed-size vector format (a vector here being a list of values which, taken together, point to a specific location in space).
While it is impossible to perfectly represent any piece or group of text using a simple vector, we can make relatively interesting approximations using NLP.
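As a rough illustration, here is what ranking by embedding similarity looks like once every thread has a fixed-size vector. The tiny 3-dimensional vectors below are toy stand-ins; a real sentence encoder outputs much larger vectors (USE_T, for instance, produces 512-dimensional embeddings).

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, thread_vecs, k=5):
    """Rank candidate threads by cosine similarity to the current thread."""
    scored = [(cosine(query_vec, v), tid) for tid, v in thread_vecs.items()]
    return [tid for _, tid in sorted(scored, reverse=True)[:k]]

# Toy 3-d "embeddings" (hypothetical thread IDs).
vectors = {
    "thread_a": [0.9, 0.1, 0.0],
    "thread_b": [0.8, 0.2, 0.1],
    "thread_c": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], vectors, k=2))  # -> ['thread_a', 'thread_b']
```

The recommended reading section is, at its core, this kind of top-k lookup, just over much richer representations.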

The special thing about the embedding we are utilizing is that it is built from contextually aware word representations. In other words, not only can we represent a piece of text and determine how similar it is to another, but we can also determine how contextually similar the two are. This is possible because the model uses the concept of attention (which weighs different signals by taking the neighbouring representations into account).
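To give a flavour of what attention does, here is a stripped-down sketch of self-attention over toy word vectors. A real Transformer adds learned query/key/value projections, scaling, and multiple heads; this only shows the core idea of mixing each representation with its neighbours.

```python
import math

def softmax(xs):
    """Normalise scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """For each word vector, output a weighted mix of all word vectors,
    with weights given by softmax-normalised dot-product similarity."""
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(q))]
        out.append(mixed)
    return out

# Toy 2-d "word vectors": the first two are similar, so each gets pulled
# toward the other far more than toward the third.
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(self_attention(words)[0])
```

The output for the first word is dominated by its own and its similar neighbour’s directions, which is exactly the “contextually aware” mixing described above.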


The use of attention in NLP has broken many records since 2016, even beating human performance on popular benchmarks in early 2018 [1]. In recent years, more and more complex architectures based on the concept of attention have come about, such as Transformer networks [2][3]. Transformers rely essentially only on attention; among them is our pillar/foundation model, “USE_T”, created by Google.

Meanwhile, in the field of information retrieval, probabilistic models (such as BM25) remain very hard to beat on retrieval tasks. However, if combined properly with transformers (such as USE_T), we can obtain interesting results [4].
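For reference, here is a small, self-contained sketch of one common formulation of Okapi BM25 scoring; the toy documents and query are made up for illustration.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query` (whitespace-tokenised).
    Rewards rare query terms while saturating raw term frequency and
    normalising for document length."""
    n = len(docs)
    avg_len = sum(len(d.split()) for d in docs) / n
    tf = Counter(doc.split())
    score = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d.split())
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc.split()) / avg_len)
        score += idf * freq * (k1 + 1) / denom
    return score

docs = [
    "shimano reel maintenance tips",
    "treadmill belt replacement guide",
    "shimano reel line capacity chart",
]
scores = [bm25_score("shimano reel", d, docs) for d in docs]
print(scores[1])  # -> 0.0 (no query terms present)
```

A hybrid system might use a cheap scorer like this to fetch candidates, then re-rank them with transformer embeddings.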

Now, this might be a lot of information to take in, but it doesn’t explain why you might still be receiving poor recommendations…

Let’s start with one reason why the language model USE_T, used by itself and by default, will not perform well here: it is not adapted to your language domains!

Every vertical, and even every forum, holds very complex and different uses of the (maybe) English language. On top of that, our forums have relatively unique vocabularies. Forums for vehicles, fishing, and wrestling refer to individuals, models, and brands that don’t tend to co-occur across verticals, or even across forums! Even when the vocabulary is the same, we can still suffer from words being taken out of context (polysemy).

The USE_T (large) model is used pre-trained. The practice of using pre-trained models is becoming quite standard: they are usually very expensive and time-consuming to train and require a very significant amount of data. The model, however, was trained on a different corpus with its own words/vocabulary, which implies having a lot of out-of-vocabulary words when it is applied to domains that use very (very) niche vocabulary.

Specifically, words that are not found in the original dictionary after passing through the tokenizer and other cleaning routines are replaced with <UNK>. Things like “w126 300SEL”, “shimano exsence 300mhg”, or “PROFORM PRO 2000” (referring to a Mercedes, a fishing reel, or a treadmill) will most likely have some parts replaced, which makes them unusable for any comparison.
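A minimal sketch of the problem, with a tiny hypothetical vocabulary (real model vocabularies hold tens of thousands of entries, but the niche terms are still missing):

```python
# Hypothetical vocabulary of a pre-trained model.
vocab = {"my", "mercedes", "for", "sale", "new", "treadmill"}

def tokenize(text, vocab, unk="<UNK>"):
    """Lower-case, split on whitespace, and replace any token that is
    not in the model's vocabulary with the <UNK> placeholder."""
    return [tok if tok in vocab else unk for tok in text.lower().split()]

print(tokenize("My w126 300SEL Mercedes for sale", vocab))
# -> ['my', '<UNK>', '<UNK>', 'mercedes', 'for', 'sale']
```

The model part numbers, the most discriminative tokens in the sentence, are exactly the ones that vanish.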

There are ways of mitigating this through transfer learning: we take a model trained on a specific corpus or task and adapt the weights of the artificial neural network to another corpus. Unfortunately, the model we use does not support extending its vocabulary/dictionary, making it impossible to use for domain adaptation; using it alone can therefore be a big problem.

Currently, USE is still a model that demonstrates good performance across various domains (especially given that our forums are very diversified). This is why we refer to it as our foundation model: we can build on top of it, extend it, and use it as a baseline against other models for thread representation. As a foundation, it must understand the generalized properties of the English language.

So are there better models to use as a foundation for encoding threads?

Maybe. Candidate models differ in everything from how tokenization is conducted to how they handle different languages (English, French, Polish, Mandarin, ...).
So far, from our findings, all have compromises and we still have a lot more to explore... Which is okay, for many reasons! There are many possible adjustments and improvements to the thread representation that we are aware of and already working on (S-BERT, WME). There is also work on aggregating different components that specialize at the vertical, forum-representation & vocabulary level (word / named-entity / entity-embedding linking). NLP is a large topic on its own; if you wish to learn more about it, let us know and we will write a blog post with further details if enough people are interested!

Okay enough about NLP...

You might have a keen eye and have noticed that a good recommendation is not just one that provides the most similar results; it is one that provides the most useful results. If you have very large and homogeneous data and rely solely on good thread representation to provide recommendations, the results would actually be devastating! You would be drowning in content whose text is too similar, and you would feel as if you were reading inside an echo chamber. This is why we are working on different extensions beyond just improving thread representation.

Recommender systems are also a pretty big field of their own (dating back to the early 1980s). Generally speaking, we talk about two types of solutions: content-based and collaborative-based (a combination of the two at different levels, or both together, is referred to as a hybrid solution, which is what we are working on). Collaborative-based solutions focus solely on activities that are intertwined: we care about how other users responded to similar items, not about the properties of the threads themselves. In the case of a user-to-item relation, we make strong assumptions, such as: if two users have had common actions in the past, they are more likely to develop similar patterns in the future. Rather than focusing on this type, for many reasons, our team will dedicate the majority of its efforts to the content-based half.

Content-based filtering solutions have to do with what defines items (in our case, threads) and how similar their features are (such as their text); in some cases, we also use explicit user preferences. In one setup, we make equal use of properties of both users and items, building what we call profiles (user profiles, item profiles) that we attempt to map to one another. In other circumstances, we mostly rely on the properties of the items, and perhaps some feedback (likes, posts, etc.) provided by the user.
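A toy sketch of content-based filtering with item profiles; the features and weights below are invented for illustration, and a real system would use richer features (including the thread embeddings discussed earlier).

```python
from collections import Counter

# Hypothetical item profiles: each thread described by weighted features.
item_profiles = {
    "t1": Counter({"brakes": 2, "w126": 1}),
    "t2": Counter({"brakes": 1, "rotors": 2}),
    "t3": Counter({"bass": 2, "lures": 1}),
}

def build_user_profile(viewed):
    """A user profile here is simply the sum of the profiles of the
    threads the user has viewed (implicit feedback)."""
    profile = Counter()
    for tid in viewed:
        profile.update(item_profiles[tid])
    return profile

def score(user_profile, item_profile):
    """Dot-product overlap between user and item feature weights."""
    return sum(user_profile[f] * w for f, w in item_profile.items())

user = build_user_profile(["t1"])
ranked = sorted(item_profiles,
                key=lambda t: score(user, item_profiles[t]), reverse=True)
print(ranked[0])  # -> 't1' (most similar to what the user already viewed)
```

In production, the already-viewed thread would of course be excluded from its own recommendations; the point is only how item properties, rather than other users’ behaviour, drive the ranking.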


CB-filtering is vital in our case because, on the majority of forums where we have recommendations, our users are unfortunately unregistered, dropping in and leaving after a short period (one thread view). This behaviour accounts for a whopping ~99% of the traffic on some sites. Because of this, most of the data we are able to use is text, which makes content-based filtering the obvious choice.

We also don’t ask our users for explicit feedback in the form of user ratings (except when you kindly give us survey responses). This makes things slightly more complicated for both approaches. We therefore mainly rely on implicit feedback for our decision-making.

Thread representation belongs to the content-based approach, as does the majority of our current ongoing efforts.

So what are other things surrounding the Recommended Reading section?
To give you a rough idea, here are some other important concepts (Popularity, Time Relevance, Diversity) that we are working on as we speak using Data Science (mainly ML & Statistics).


Time Relevance
Why do we care about the timing of a thread? Well, as you probably guessed, how old or new a thread is can, in one way or another, affect the relevance of how it is paired with another thread.
For some forums & sub-forums, when it comes to utilizing the dates of each thread & its posts for a recommendation, we care about four things.
  1. Can a thread be considered obsolete, given how old it is or how many interactions have occurred within a certain time frame?
    It sometimes happens that an older thread is suddenly revived, or can already be proven relevant based on the user’s activity.
    However, it is also possible that a thread which is too old will tend to be less exposed and will therefore suffer from ageism through no fault of its own. This is where fun concepts like serendipity (basically, being surprised by a result) come into play, although serendipity is also easily taken for granted (it likewise applies to any content which is relevant but received weak feedback).
  2. If two threads are discussing a topic around the same time in history, how important is this factor?
    Let’s imagine we have three threads titled: “2016 Olympics climbing competitors list”, “2020 Olympics climbing competitors list”, “2020 Olympics climbing walls”. Is it really safe to say the last two are more relevant? What if the user was just interested in individuals? What if the last thread was written in 2016?
  3. How do you take into account content that tends to be revived seasonally? What if there’s a topic which seems to belong to a certain time period, how would you factor it in?
  4. One of my favourite questions is one that is very helpful for this project, and other ongoing projects. How do you balance the recency & relevancy tradeoff when recommending threads?
All four of these questions borrow concepts from various literature; we are just applying them to the domain of forum thread recommendation. The beauty of all four is that they come down to a modelling & feature-engineering problem, where machine learning allows us to determine how each feature should be weighted in different scenarios.
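As a simple illustration of the recency & relevancy tradeoff, here is one possible way to blend a similarity score with an exponential freshness decay. The blending weight and half-life below are placeholders; in practice, such parameters would be learned per forum rather than hand-set.

```python
def recency_weight(age_days, half_life_days=90.0):
    """Exponential decay: a thread loses half its freshness every half-life."""
    return 0.5 ** (age_days / half_life_days)

def combined_score(similarity, age_days, alpha=0.7):
    """Blend text similarity with freshness; alpha sets the tradeoff
    (alpha=1 ignores age entirely)."""
    return alpha * similarity + (1 - alpha) * recency_weight(age_days)

# A slightly less similar but much fresher thread can outrank an old one.
old = combined_score(similarity=0.90, age_days=720)
new = combined_score(similarity=0.80, age_days=10)
print(new > old)  # -> True
```

Seasonal revival (question 3) could be handled by making the decay periodic rather than monotonic, which is one reason these end up as feature-engineering problems rather than a single fixed formula.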

Here we talked about the independent impact of time (with the exception of the recency & relevancy tradeoff), but of course, when addressing different solutions, we encounter new concepts. For instance, in personalization we can talk about how a user’s interests drift over time (also called concept drift) [4].

If you are curious, no, there was no climbing in the 2016 Olympics but it was announced in time for 2020. ;)

In recommender systems, there are a few concepts we care about when measuring the performance of our recommendations. There are certain metrics and guidelines we can use, such as accuracy, coverage, diversity, serendipity, novelty, and unexpectedness. One could argue that all the ones following diversity act as extensions of its definition. Diversity is a very broad topic, but it fulfils a simple purpose: avoiding overfitting on a subject while obtaining better recommendations. This can be done through the introduction of controlled noise.
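One classic technique for injecting diversity into a ranked list is maximal marginal relevance (MMR), which greedily picks items that are relevant but not redundant with what has already been selected. The sketch below is illustrative, not necessarily what we deploy.

```python
def mmr(relevance, pairwise_sim, k=3, lam=0.7):
    """Maximal Marginal Relevance: greedily select items balancing
    relevance against similarity to already-selected items
    (lam=1 ignores diversity entirely)."""
    candidates = set(relevance)
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda c: lam * relevance[c]
            - (1 - lam) * max((pairwise_sim[(c, s)] for s in selected),
                              default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = {"a": 0.95, "b": 0.94, "c": 0.80}
# "a" and "b" are near-duplicates; "c" covers something different.
pairwise_sim = {("a", "b"): 0.99, ("b", "a"): 0.99,
                ("a", "c"): 0.10, ("c", "a"): 0.10,
                ("b", "c"): 0.10, ("c", "b"): 0.10}
print(mmr(relevance, pairwise_sim, k=2))  # -> ['a', 'c']
```

Even though “b” is almost as relevant as “a”, the re-ranker skips it in favour of the less redundant “c”: that is the controlled noise mentioned above.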

While we could discuss item-, topic-, or user-level diversity in all shapes and forms, that might take another blog post to do it justice. However, I’d like to leave you with a question which I truly relish solving, and which is very forum-oriented, touching on community-level representation and its impact on diversity.

Inner & outer node diversity and relevance
If we have a thread within a specific sub-forum, and we want to provide recommendations also containing threads belonging to different sub-forums, which features or statistics can we extract in order to do this sensibly?
Example from Danny - “Threads within a car performance forum are mostly about modification guides to improve performance, so recommending threads about selling cars to this audience may not be such a good idea”.

There are many factors that come into play here. What’s fun about this is that we can exploit some of our own efforts. If a sub-forum community is a collection of threads and their representations, we can compute how much diversity (and decorrelation) exists within a sub-forum with respect to its subject, and how generalizable the sub-forum is compared to others (using the average representation of the threads from that community).
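A rough sketch of what measuring a sub-forum’s internal diversity could look like, using the average (centroid) representation mentioned above; the 2-d embeddings are toy values standing in for real thread vectors.

```python
import math

def centroid(vectors):
    """Average representation of a sub-forum's threads."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def dispersion(vectors):
    """Mean distance of the threads to their centroid: a rough measure
    of how topically spread out a sub-forum's content is."""
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)

# Toy thread embeddings for two hypothetical sub-forums.
focused = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]   # tight: one topic
general = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # spread: many topics
print(dispersion(general) > dispersion(focused))  # -> True
```

A tightly focused sub-forum (like the car-performance example above) would show low dispersion, signalling that cross-sub-forum recommendations into it should be made cautiously.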

We can then look at the various actions taken by our users to derive what makes a sub-forum more or less likely to accept recommendations from another. This will lead to recommendations that avoid being off-topic, since contextually misinterpreted content between very different (sometimes polarized) groups can make people very unhappy.

So there you have it — a little bit about what we’re working on right now, and a little bit about what we plan to do in the future!

Don’t worry about waiting for every feature to come at once. We are constantly executing and releasing new features. We want to get these features in the hands of our users, as this is what we find most rewarding.

If you feel ready for more technical versions of these subjects, let us know! As with everything, we want to make sure we match our content to the crowd.

There is a lot of work ahead for our team when it comes to bettering our infrastructure to support quicker iterations toward more robust and complex solutions. The MLDS team would like to thank everyone in Project California (our parent team) for their amazing support in making this possible.

We also really appreciate those who are using our forum surveys to help improve our sites (as well as those who don’t)!
If you just enjoyed our post, or felt like you learned something, feel free to like & share! Thanks for reading.

Phileas, MLDS Team - California, VerticalScope