Item Recommendation - Collaborative Filtering
Assume you have a number of items to sell. You also have a number of users who already bought or rated some of these items. You need to recommend new items to your customers so that they keep buying! What do you do now?
You use a recommendation engine to determine which items to promote, and see whether it increases your sales. How can you implement such an engine?
Here is a summary of a straightforward method for beginners, collaborative filtering, along with tips to improve its performance.
For a user, there are two main steps:
- Find users with a similar purchase/rating history; they are likely to share the same taste.
- Use those similar users' histories to recommend new items to the current user.
A simple example
Assume there are three users who rated their purchased items on a 1-5 scale. Their ratings were:
- User 1:
{ A: 5, B: 3, C: 3, D: 4 }
- User 2:
{ A: 5, B: 3, C: 3, D: 2, E: 5 }
- User 3:
{ A: 5, B: 3, C: 2, D: 3, E: 1 }
User 1 hasn't purchased E yet, and we want to estimate the rating User 1 would give E after purchasing it. In the end, this estimated rating will determine whether E should be recommended to User 1 or not.
First, find the closest user to User 1 by comparing ratings for commonly purchased items. Items A, B, C and D are the ones common to Users 1, 2 and 3. We need a numerical value that represents the distance between users: the lower the value, the more similar those users are. This value is calculated with a formula of your choice, called a distance metric.
Here is my super simple metric: the number of items that received the same rating from both users. Strictly speaking this is a similarity score rather than a distance, so here a higher value means the users are closer.
Similarity between User 1 and User 2: 3 (A, B and C)
Similarity between User 1 and User 3: 2 (A and B)
User 2 is the closest user to User 1 according to my calculation. My simple algorithm argues that User 1 will rate E as 5, just like User 2.
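The whole toy example can be sketched in a few lines of Python. The dictionary layout and the `same_rating_count` helper are my own illustration of the steps above, not a standard API:

```python
# Ratings from the example above, on a 1-5 scale.
ratings = {
    "User 1": {"A": 5, "B": 3, "C": 3, "D": 4},
    "User 2": {"A": 5, "B": 3, "C": 3, "D": 2, "E": 5},
    "User 3": {"A": 5, "B": 3, "C": 2, "D": 3, "E": 1},
}

def same_rating_count(u, v):
    """Similarity score: how many commonly rated items got the same rating."""
    common = set(u) & set(v)
    return sum(1 for item in common if u[item] == v[item])

target = ratings["User 1"]
# Score every other user against User 1 and take the best match.
scores = {name: same_rating_count(target, r)
          for name, r in ratings.items() if name != "User 1"}
closest = max(scores, key=scores.get)

# Predict User 1's rating for E as the closest user's rating for E.
predicted = ratings[closest]["E"]
print(closest, scores[closest], predicted)  # User 2 3 5
```

Running it picks User 2 (3 matching ratings against 2 for User 3) and predicts a rating of 5 for item E, just as worked out by hand above.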
Improvements
This example is way too simplistic for the real problem. Here are a number of possible improvements:
- My distance metric was too simple for this problem.
- Simple improvements: Use other distance calculation methods such as Manhattan, Normalized Manhattan, Euclidean, Normalized Euclidean, Minkowski, Chebyshev, etc. Manhattan and Euclidean perform better when the data is dense.
- Complex improvements: Include behavioral similarities in the calculation. Think of two Netflix users whose ratings for the same movies always differ by the same fixed amount; correlation-based measures such as Pearson, cosine similarity and Spearman capture this. These options require extra processing in subsequent steps. For sparse data, cosine similarity is usually a good option.
- Each user has their own scale. One user generally rates items 3 or 4 out of 5; for another, this can be 1 or 2. So each user's own tendencies within the scale should be taken into consideration.
- As a start, calculate the mean of each user's ratings and let this mean affect the outcome.
- Or classify item types: User A may rate technology items higher than furniture items.
- You can combine methods for a better result: For example, try using the average of Manhattan and Minkowski Distance Metrics to determine the closest user.
- You need to be good at algorithms. More data means more accuracy, but it also demands more processing power. Performance becomes critical as the amount of data increases.
- Optimization: How can you store the same amount of information in less space? How can you do more calculations in the same time period?
- Pre-processing: How can you prepare your data so that you use your resources effectively? For example, you can shrink your data without losing any information just by eliminating users who contribute nothing to the calculation. You can also reshape your data so it takes less time to read.
- There can be many other improvements similar to these examples...
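To make the metric ideas above concrete, here is a rough sketch of a few of the mentioned measures, computed over commonly rated items only (the helper names and dictionary layout are my own; the values reuse the example users):

```python
import math

def common_vectors(u, v):
    """Both users' ratings restricted to the items they both rated."""
    common = sorted(set(u) & set(v))
    return [u[i] for i in common], [v[i] for i in common]

def manhattan(u, v):
    a, b = common_vectors(u, v)
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(u, v):
    a, b = common_vectors(u, v)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(u, v):
    a, b = common_vectors(u, v)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def pearson(u, v):
    # Pearson is cosine similarity after subtracting each user's mean,
    # which compensates for users who rate systematically high or low.
    a, b = common_vectors(u, v)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    dot = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return 0.0 if norm == 0 else dot / norm

u1 = {"A": 5, "B": 3, "C": 3, "D": 4}
u2 = {"A": 5, "B": 3, "C": 3, "D": 2, "E": 5}
print(manhattan(u1, u2))  # 2
print(euclidean(u1, u2))  # 2.0
```

With these, Users 1 and 2 differ only on item D, so both distances reduce to that single gap of 2, while cosine and Pearson stay close to 1.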
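The per-user scale adjustment suggested above can also be sketched: shift the neighbor's rating by the difference between the two users' average ratings, then clamp to the rating scale. This is a common mean-centered prediction; the function name and layout are my own illustration:

```python
def mean_centered_predict(target, neighbor, item, scale=(1, 5)):
    """Predict target's rating for item from a similar neighbor's rating,
    correcting for how generously each user rates on average."""
    target_mean = sum(target.values()) / len(target)
    neighbor_mean = sum(neighbor.values()) / len(neighbor)
    raw = target_mean + (neighbor[item] - neighbor_mean)
    lo, hi = scale
    return min(hi, max(lo, raw))  # clamp to the rating scale

user1 = {"A": 5, "B": 3, "C": 3, "D": 4}          # mean 3.75
user2 = {"A": 5, "B": 3, "C": 3, "D": 2, "E": 5}  # mean 3.6
print(mean_centered_predict(user1, user2, "E"))   # 5 (3.75 + 1.4 = 5.15, clamped)
```

User 2 rated E well above their own average, so the prediction for User 1 lands above User 1's average too, instead of copying the raw 5 blindly.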
Recommendation is an important and endless topic that requires a huge effort to excel at. See the Netflix Prize and the effort devoted there to collaborative filtering for an example. It is fun because one can always make the algorithm better!
June 2013