When a friend asks you for a book recommendation, it’s pretty natural to ask what kinds of books they like. From there, you can think of a few titles that are similar to the things they’ve liked in the past. This process of recommending content based on its characteristics is at the heart of content-based filtering, the technology behind Netflix’s and Pandora’s recommendation engines.

Content-based filtering is used in a number of applications, including information retrieval (as in search engines) as well as recommender systems. In this article, we’ll take a look at how content-based recommendation systems work, what their upsides and challenges are, and what skills and technologies you might need to start developing one.

Why Content-Based Filtering?

Collaborative filtering may be the state of the art when it comes to machine learning and recommender systems, but content-based filtering still has a number of advantages, especially in certain circumstances.

  • Results tend to be highly relevant. Because content-based recommendations rely on characteristics of objects themselves, they are likely to be highly relevant to a user’s interests. This makes them especially valuable for organizations with massive libraries of a single type of content (think subscription and streaming media services).
  • Recommendations are transparent. The process by which any recommendation is generated can be made visible to users, which may increase their trust in the recommendations or allow them to tweak the results. With collaborative filtering, the process is more of a black box: neither the system’s operators nor its users may really understand why a given recommendation appears.
  • Users can get started more quickly. Content-based filtering largely avoids the cold-start problem that often bedevils collaborative filtering techniques. While the system still needs some initial inputs from users to start making recommendations, the quality of those early recommendations is likely to be much higher than with a system that only becomes robust after millions of data points have been added and correlated.
  • New items can be recommended immediately. Related to the cold-start problem, another issue with collaborative filtering is that new objects added to the library will have few (if any) interactions, which means they won’t be recommended very often. Unlike collaborative filtering systems, content-based recommenders don’t require other users to interact with an object before they start recommending it.
  • It’s technically easier to implement. Compared to the sophisticated math involved in building a collaborative filtering system, the data science behind a content-based system is relatively straightforward. The real work, as we’ll see, is in assigning the attributes in the first place.

Assigning Attributes

At the most basic level, content-based filtering is about assigning attributes to items, so that the algorithm knows something about the content of each item in the database. When you see a hyper-niche Netflix collection (like, say, “Sci-fi capers with strong female leads” or “Quirky indie rom-coms set in Oregon”), you’re seeing content-based filtering in action.

But where do these attributes come from? The answer depends largely on what you’re trying to recommend. If you’re working with text (as in news articles or blog posts), you may be able to programmatically extract keywords using natural language processing techniques, though these approaches have pitfalls of their own. Other types of content may come with varying amounts of metadata, though that metadata is often incomplete and covers only the most basic properties, which limits its value for recommendations.
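As a rough illustration of programmatic keyword extraction, here is a minimal sketch using TF-IDF, one common weighting scheme (the article does not say which technique any particular service uses). The toy corpus, the stopword list, and the function names are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real system would load article text from storage.
DOCS = {
    "a1": "The rover landed on Mars and the crew began the rescue mission",
    "a2": "The chef shared a recipe for pasta and the sauce",
    "a3": "The crew trained for the Mars mission and the launch",
}

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "a", "for", "on", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n_docs = len(docs)
    keywords = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[doc_id] = ranked[:top_n]
    return keywords


KEYWORDS = tfidf_keywords(DOCS)
```

On this corpus, terms shared between the two space articles ("mars", "crew", "mission") score lower than terms unique to one article, so a filler word like "began" can outrank them. That is one of the pitfalls of purely statistical extraction.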

To solve this problem, many companies have turned to teams of subject-matter experts to manually assign attributes to each piece of content. As you might imagine, that process can be a massive undertaking. For example, Pandora uses a proprietary system, the Music Genome Project, to produce its recommendations. These recommendations are based on an algorithm that uses more than 400 musical characteristics to recommend songs. Identifying these characteristics and developing the algorithm required dozens of music theory experts and more than five years of development, and that work continues to this day. Similarly, Netflix’s collection of more than 76,000 microgenres relied on human movie watchers assigning dozens of characteristics to the thousands of movies and TV shows in its library.

Building a User Profile

The other key part of a content-based recommendation system is the user profile. A profile consists of the objects a user has interacted with as well as the attributes of those objects. Attributes that show up across a number of objects are weighted more heavily than those that appear less often. (Advanced algorithms can take into account not just the items a user has watched, read, or listened to, but also those they’ve browsed past or even had suggested to them.) In weighing the importance of different items, user feedback is critical, which is why services that provide recommendations are constantly asking you to rate content.
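The weighting idea can be sketched in a few lines. This is a minimal illustration, not any particular service’s method: the item names, the attribute tags, and the rating scale (+1 for a like, -1 for a dislike) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog mapping each item to its set of attributes.
ITEM_ATTRIBUTES = {
    "book_a": {"mystery", "historical"},
    "book_b": {"mystery", "noir"},
    "book_c": {"romance", "historical"},
}


def build_profile(ratings, item_attributes):
    """Sum ratings per attribute, then scale so absolute weights sum to 1."""
    weights = defaultdict(float)
    for item, rating in ratings.items():
        for attr in item_attributes[item]:
            weights[attr] += rating
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        return {}
    return {attr: w / total for attr, w in weights.items()}


# Assumed feedback: the user liked book_a and book_b and disliked book_c.
ratings = {"book_a": 1.0, "book_b": 1.0, "book_c": -1.0}
profile = build_profile(ratings, ITEM_ATTRIBUTES)
```

Here "mystery" appears in both liked books and ends up with the largest weight, "romance" goes negative, and "historical" cancels out to zero because it appears in one liked and one disliked book.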

Based on these attribute weightings and histories, the system produces a unique model of each user’s preferences, often using machine learning techniques. Each model consists of the attributes that the user is found to like (or dislike), weighted by importance. The model is then compared against every object in the database, and each object is assigned a score based on its similarity to the user profile.

Here’s an example. Let’s say you’ve watched and liked Saving Private Ryan, The Martian, and E.T. A system might recognize that you like blockbusters (which describes all three films), Steven Spielberg films (which describes two), science fiction (also two), and movies where people have to rescue Matt Damon (also two). Based on your interest in movies about rescuing Matt Damon, as well as your interest in science fiction, Interstellar would probably receive a high recommendation score, while the film Weekend at Bernie’s, which has less in common with the other titles, would probably receive a lower score.
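The scoring step in this example can be sketched with cosine similarity, one common similarity measure (the article does not specify which measure real systems use). The attribute tags below are hypothetical and chosen only to mirror the example.

```python
import math
from collections import Counter

# Hypothetical attribute tags for the films the user liked.
LIKED = {
    "Saving Private Ryan": {"blockbuster", "spielberg", "rescue-matt-damon"},
    "The Martian": {"blockbuster", "sci-fi", "rescue-matt-damon"},
    "E.T.": {"blockbuster", "spielberg", "sci-fi"},
}

# Hypothetical attribute tags for the films to be scored.
CANDIDATES = {
    "Interstellar": {"blockbuster", "sci-fi", "rescue-matt-damon"},
    "Weekend at Bernie's": {"comedy"},
}


def build_profile(liked):
    """Count how many liked items carry each attribute."""
    return Counter(attr for attrs in liked.values() for attr in attrs)


def cosine_score(profile, attrs):
    """Cosine similarity between the profile and a binary attribute vector."""
    dot = sum(profile.get(a, 0) for a in attrs)
    norm_profile = math.sqrt(sum(w * w for w in profile.values()))
    norm_item = math.sqrt(len(attrs))
    if norm_profile == 0 or norm_item == 0:
        return 0.0
    return dot / (norm_profile * norm_item)


profile = build_profile(LIKED)
scores = {title: cosine_score(profile, attrs) for title, attrs in CANDIDATES.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With these tags, Interstellar scores about 0.88, while Weekend at Bernie’s shares no attributes with the profile and scores 0.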

Challenges of Content-Based Filtering

  • Lack of novelty and diversity. Relevance is important, but it’s not all there is. If you watched and liked Star Wars, the odds are pretty good that you’ll also like The Empire Strikes Back, but you probably don’t need a recommendation engine to tell you that. It’s also important for a recommendation engine to come up with results that are novel (that is, stuff the user wasn’t expecting) and diverse (that is, stuff that represents a broad selection of their interests).
  • Scalability is a challenge. As we’ve seen, the key requirement when it comes to content-based filtering is exceptional domain-specific knowledge. Hiring subject-matter experts to tag content is labor-intensive and expensive, which makes it impractical for many businesses that are just trying to build an MVP. Furthermore, manual tagging of attributes has to continue as new content is added.
  • Attributes may be incorrectly or inconsistently applied. Content-based recommendations are only as good as the subject-matter experts who are tagging items. When you have hundreds of thousands (or millions) of items, it can be a challenge to ensure attributes are applied consistently and accurately.
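One way to soften the first challenge, lack of diversity, is to re-rank results so that near-duplicates of items already chosen are penalized. The sketch below is a greedy re-ranker in that spirit. It is an illustrative assumption rather than a technique described in this article, and the item names, scores, and trade_off parameter are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two attribute sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_for_diversity(scores, item_attributes, k, trade_off=0.7):
    """Greedily pick k items, trading relevance against similarity to earlier picks."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:

        def adjusted(item):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (jaccard(item_attributes[item], item_attributes[s]) for s in selected),
                default=0.0,
            )
            return trade_off * scores[item] - (1 - trade_off) * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical relevance scores and attribute tags.
SCORES = {"space_opera_2": 0.9, "space_opera_3": 0.85, "heist_comedy": 0.6}
ATTRS = {
    "space_opera_2": {"sci-fi", "space", "sequel"},
    "space_opera_3": {"sci-fi", "space", "sequel"},
    "heist_comedy": {"comedy", "heist"},
}
picks = rerank_for_diversity(SCORES, ATTRS, k=2)
```

In this toy run the second sequel is skipped in favor of the heist comedy, even though the sequel has the higher raw score. Raising trade_off toward 1 moves the ranking back toward pure relevance.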

Relevant Skills and Tech

Building a recommendation engine is a classic machine learning problem. Not only should data scientists have experience with the tools of statistical analysis, they should also have some familiarity with data processing and storage frameworks like Hadoop and Spark. Java, Python, and Scala all support a number of libraries geared toward specific machine learning tasks. These libraries will do the heavy statistical work, though which ones are appropriate for your project will depend on what exactly you’re trying to do.