The first step in computing the similarity of two items
and
(column vectors in the data matrix
) is to identify all the users who have rated (or visited) both items. Many measures can be used to compute the similarity between items. The most common approach, when dealing with Web usage data, is to use the standard cosine similarity between two vectors:
where
represents the weight associated with item
in the session (or user) vector
. For ratings data, however, variances in user ratings styles must be taken into account. For example, in a movie rating scenario, with a rating scale between 1 and 5, some users may give a rating of 5 to many movies they consider to be "good"; while other more "strict" raters may only give a rating of 5 to those movies they consider "perfect". To offset the difference in rating scales, the data can be normalized to focus on rating variances (deviations from the mean ratings) on co-rated items. For our purposes, when dealing with ratings data, we adapt the Adjusted Cosine Similarity measure introduced by Sarwar et al. [23]:
where
represents the rating of user
on item
, and
is the average rating value of user
on all items.