
Finding Similar Items (Item neighbors)

The first step in computing the similarity of two items $i_p$ and $i_q$ (column vectors in the data matrix $M$) is to identify all the users who have rated (or visited) both items. Many measures can be used to compute the similarity between items. The most common approach, when dealing with Web usage data, is to use the standard cosine similarity between two vectors:

\begin{displaymath}
sim(i_p,i_q) = \frac{\sum\limits_{k = 1}^m M_{k,p} \times M_{k,q}}{\sqrt{\sum\limits_{k = 1}^m (M_{k,p})^2 \times \sum\limits_{k = 1}^m (M_{k,q})^2}}
\end{displaymath}

where $M_{k,p}$ represents the weight associated with item $i_p$ in the session (or user) vector $k$. For ratings data, however, differences in users' rating styles must be taken into account. For example, in a movie rating scenario with a rating scale from 1 to 5, some users may give a rating of 5 to many movies they consider "good", while other, "stricter" raters may give a 5 only to movies they consider "perfect". To offset these differences in how users apply the rating scale, the data can be normalized to focus on rating deviations (each rating's deviation from that user's mean rating) on co-rated items. For our purposes, when dealing with ratings data, we adapt the Adjusted Cosine Similarity measure introduced by Sarwar et al. [23]:

\begin{displaymath}
sim(i_p,i_q) = \frac{\sum\limits_{k = 1}^m (M_{k,p} - \overline{M_k}) \times (M_{k,q} - \overline{M_k})}{\sqrt{\sum\limits_{k = 1}^m (M_{k,p} - \overline{M_k})^2 \times \sum\limits_{k = 1}^m (M_{k,q} - \overline{M_k})^2}}
\end{displaymath}

where $M_{k,p}$ represents the rating of user $k$ on item $i_p$, and $\overline{M_k}$ is the average rating value of user $k$ on all items.
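
The two measures can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the matrix $M$ is stored as a list of rows (one per session or user), a missing rating is marked with None, the adjusted measure sums only over users who rated both items, and a similarity of 0 is returned when a denominator vanishes. The function names cosine_similarity and adjusted_cosine_similarity are hypothetical.

```python
import math


def cosine_similarity(M, p, q):
    """Cosine similarity between item columns p and q of M.

    M is a list of rows (sessions or users); M[k][p] is the weight of
    item p in row k, with 0 meaning "not visited" (an assumption).
    """
    dot = sum(row[p] * row[q] for row in M)
    norm_p = math.sqrt(sum(row[p] ** 2 for row in M))
    norm_q = math.sqrt(sum(row[q] ** 2 for row in M))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: an unvisited item is similar to nothing
    return dot / (norm_p * norm_q)


def adjusted_cosine_similarity(M, p, q):
    """Adjusted cosine similarity between item columns p and q of M.

    M[k][p] is the rating of user k on item p, or None if unrated
    (an assumption). Each rating is offset by that user's mean rating.
    """
    num = den_p = den_q = 0.0
    for row in M:
        if row[p] is None or row[q] is None:
            continue  # keep only users who rated both items
        rated = [r for r in row if r is not None]
        mean = sum(rated) / len(rated)  # user's average over rated items
        dev_p = row[p] - mean
        dev_q = row[q] - mean
        num += dev_p * dev_q
        den_p += dev_p ** 2
        den_q += dev_q ** 2
    if den_p == 0 or den_q == 0:
        return 0.0  # assumption: no deviation from the mean, so no signal
    return num / math.sqrt(den_p * den_q)


# Toy data, invented purely for illustration.
usage = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
ratings = [[5, 4, 1], [4, 5, 2], [2, 1, 5], [3, None, 4]]

print(round(cosine_similarity(usage, 0, 1), 3))             # 0.667
print(round(adjusted_cosine_similarity(ratings, 0, 1), 3))  # 0.653
```

In the toy ratings matrix, the fourth user has not rated the second item, so that row is skipped when comparing the first two items, which mirrors the first step of identifying the users who have rated both items.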


Bamshad Mobasher 2004-03-09