Next: Background on Item-Based Collaborative Up: Semantically Enhanced Collaborative Filtering Previous: Semantically Enhanced Collaborative Filtering

Introduction

The continued growth and increasing complexity of Web-based applications, from e-commerce to Web services to dynamic content providers, have led to a proliferation of personalization tools on a variety of sites. Personalized services, such as recommender systems, help engage visitors, turn casual browsers into customers, and help visitors locate pertinent information more effectively. Collaborative filtering (CF) [25,14,5,11] is one of the most successful and widely used technologies in personalization and recommender systems.

Traditionally, CF-based systems compare a representation of an active user's preferences (such as explicit ratings on items or implicit navigational patterns) with the historical records of past users to find the $k$ most similar neighbors of the active user. These historical records are then used to predict the preference value of the active user on a particular item that has not yet been rated or visited, or to recommend the top $N$ items in which the user may be interested. Since the focus of such systems is on comparing the correlations or similarities among users, they are often referred to as user-based collaborative filtering systems.
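The neighborhood-based prediction described above can be sketched as follows. This is a minimal illustration rather than the method of any particular system: the choice of cosine similarity, the similarity-weighted average, and the names `cosine` and `predict_user_based` are assumptions made for this example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_user_based(active, history, target_item, k=2):
    """Predict the active user's rating on target_item.

    Uses the k past users most similar to the active user among those
    who rated target_item, weighting each rating by its similarity.
    Returns None when no usable neighbor exists.
    """
    # Online step: compare the active user against every past user.
    candidates = [
        (cosine(active, past), past[target_item])
        for past in history
        if target_item in past
    ]
    neighbors = sorted(candidates, key=lambda p: p[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in neighbors) / total


# Hypothetical toy data: three past users and one active user.
history = [
    {"a": 5, "b": 3, "c": 4},
    {"a": 1, "b": 5, "c": 2},
    {"a": 5, "b": 3, "c": 5},
]
active = {"a": 5, "b": 3}
prediction = predict_user_based(active, history, "c", k=2)
```

Note that every call scans all historical records, which is the online neighborhood formation cost that becomes problematic for large data sets.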

Despite their success and popularity, traditional CF-based techniques suffer from some well-known limitations [24]. One of the critical limitations is the lack of scalability of the underlying memory-based $k$-nearest-neighbor approach, which requires that the neighborhood formation phase be performed as an online process. For very large data sets, this may lead to unacceptable latency in providing recommendations. The scalability problems are further accentuated when collaborative filtering is used in the context of Web usage data. In this case, users' browsing patterns are used to implicitly obtain measures of content preference. For frequent visitors, the size of user or session vectors tends to be much larger than in the case of e-commerce purchase patterns. Performing user-user similarity computations in this context further degrades system performance.

Another important limitation of CF-based systems stems from the sparse nature of the underlying data sets. As the number of items in the database increases, the density of each user record with respect to these items decreases. This, in turn, decreases the likelihood of a significant overlap of visited or rated items among pairs of users, resulting in less reliable computation of correlations, and thus less reliable predictions.

Finally, a significant shortcoming of such systems is their inability to provide recommendations or predictions for new or recently added items: a user's rating on a new item cannot be compared with the ratings of other users on the same item. Furthermore, the system can never generate predictions for new items which have not yet been visited or rated by (a sufficient number of) other users. This problem is often referred to as the "new item problem".

A number of optimization strategies have been proposed and employed to remedy the scalability and sparsity problems associated with collaborative filtering. These strategies include similarity indexing [1] to reduce real-time search costs, and dimensionality reduction methods based on Latent Semantic Indexing (LSI) to alleviate the data sparsity in the user-item mappings [24,22]. Other work has focused on model-based techniques that apply machine learning, such as unsupervised clustering of user records [19] or supervised classification models [5]. These approaches separate the offline task of creating user models from the real-time task of recommendation generation, thus improving scalability. However, this sometimes comes at the cost of lower recommendation accuracy.

In the context of click-stream and e-commerce data, Web usage mining [26] techniques that rely on offline pattern discovery from user transactions, such as clustering and association rule discovery, have been studied as an underlying mechanism for personalization and recommender systems [16,17,18]. Such techniques generally offer both a computational advantage over traditional CF-based techniques and better recommendation effectiveness, particularly in the context of click-stream data. For a recent survey of personalization based on Web usage mining, see [21].

There has also been a growing body of work on enhancing collaborative filtering by integrating data from other sources such as content and user demographics [6,20,2,15]. Content-oriented approaches, in particular, can be used to address the "new item problem" discussed above. Generally, in these approaches, keywords are extracted from the content of Web pages and are used to recommend other pages or items to a user, not only based on user ratings or visit patterns, but also (or alternatively) based on the content similarity of these pages to other pages already visited by the user. Keyword-based approaches, however, are incapable of capturing more complex properties of, or relationships among, objects at a deeper semantic level. Unstructured keyword-based representations often introduce a substantial amount of noise, which reduces recommendation accuracy.

Recently, a new class of item-based CF algorithms has been proposed to deal with the scalability problems in user-based CF algorithms [23,8]. Item-based CF algorithms avoid the bottleneck in user-user computations by first considering the relationships among items. Rather than finding user neighbors, the system tries to find $k$ similar items that are rated (or visited) by different users in some similar way. Then, for a target item, predictions can be generated, for example, by taking a weighted average of the target user's item ratings (or weights) on these neighbor items. Thus, these algorithms alleviate the scalability problem that exists in user-based CF algorithms, because the similarity computations are performed in the smaller space of the items, and because the item-item comparisons can often be performed offline. At the same time, item-based CF algorithms have been shown to achieve prediction accuracies that are comparable to or even better than those of user-based CF algorithms.
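As a rough sketch of the item-based scheme just described, the example below precomputes item-item similarities offline and then predicts a rating as a similarity-weighted average over the target user's $k$ most similar rated items. The use of cosine similarity, the toy ratings, and the function names are assumptions for illustration; the algorithms in [23,8] may differ in their similarity measures and normalization.

```python
from math import sqrt


def item_vectors(ratings):
    """Transpose user -> {item: rating} into item -> {user: rating}."""
    items = {}
    for user, row in ratings.items():
        for item, rating in row.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[x] * b[x] for x in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def item_similarities(ratings):
    """Offline step: similarity between every pair of distinct items."""
    vecs = item_vectors(ratings)
    return {i: {j: cosine(vecs[i], vecs[j]) for j in vecs if j != i} for i in vecs}


def predict_item_based(user_row, sims, target_item, k=2):
    """Online step: weighted average of the user's ratings on the
    k rated items most similar to target_item."""
    rated = [
        (sims[target_item].get(item, 0.0), rating)
        for item, rating in user_row.items()
        if item != target_item
    ]
    neighbors = sorted(rated, key=lambda p: p[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in neighbors) / total


# Hypothetical toy ratings: items "a" and "b" are rated identically.
ratings = {
    "u1": {"a": 5, "b": 5, "c": 1},
    "u2": {"a": 4, "b": 4, "c": 2},
    "u3": {"a": 1, "b": 1, "c": 5},
}
sims = item_similarities(ratings)
new_user = {"b": 5, "c": 1}
pred_k1 = predict_item_based(new_user, sims, "a", k=1)
pred_k2 = predict_item_based(new_user, sims, "a", k=2)
```

Only `predict_item_based` runs at recommendation time, and it touches just the items the target user has rated, which is where the scalability benefit comes from.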

Item-based CF algorithms still suffer from the problems associated with data sparsity, and they still lack the ability to provide recommendations or predictions for new or recently added items. However, the item-based CF framework provides the necessary ingredients to seamlessly incorporate other sources of evidence about items (in addition to item ratings or weights). This flexibility comes from the fact that the computation of item similarities is independent of the methods used for generating predictions or recommendations; thus, multiple knowledge sources, including structured semantic information about items, can be used for performing the similarity computations.

In this paper, we introduce an approach for semantically enhanced collaborative filtering in which structured semantic knowledge about items, extracted automatically from the Web based on domain-specific reference ontologies, is used in conjunction with user-item ratings (or weights) to create a combined similarity measure for item comparisons. In contrast to previous approaches to hybrid content-collaborative systems that enhance user-based CF [2,15], we integrate semantic knowledge into the item-based CF framework. The integration of semantic similarities for items with rating (or usage-based) similarities provides two primary advantages. First, the semantic attributes for items provide additional clues about the underlying reasons for which a user may or may not be interested in particular items (something that is hidden behind the rating values in the usual context). This, in turn, allows the system to make inferences based on this additional source of knowledge, resulting in improved recommendation accuracy and coverage. Second, in cases where little or no rating (or usage) information is available (such as in the case of newly added items, or in very sparse data sets), the system can still use the semantic similarities to provide reasonable recommendations for users. These claims are verified by our experimental results on two different data sets.
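To make the idea of a combined similarity measure concrete, the sketch below blends a semantic similarity with a rating-based similarity. This introduction does not state the form of the combination, so the convex linear blend, the weight `alpha`, the Jaccard overlap used as a stand-in semantic similarity, and the attribute strings are all assumptions for illustration only.

```python
def semantic_similarity(attrs_a, attrs_b):
    """Stand-in semantic similarity: Jaccard overlap of attribute sets.

    Assumption: each item is described by a set of attribute strings;
    a real system would compare structured, ontology-based attributes.
    """
    if not attrs_a and not attrs_b:
        return 0.0
    return len(attrs_a & attrs_b) / len(attrs_a | attrs_b)


def combined_similarity(semantic_sim, rating_sim, alpha=0.5):
    """Convex combination of the two similarities (assumed form).

    alpha = 0 uses ratings only; alpha = 1 uses semantics only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_sim + (1.0 - alpha) * rating_sim


# A newly added item has no ratings, so its rating similarity to any
# other item is 0, yet shared attributes still yield a usable score.
old_item = {"genre:drama", "director:x"}
new_item = {"genre:drama", "director:y"}
sem = semantic_similarity(old_item, new_item)
score = combined_similarity(sem, rating_sim=0.0, alpha=0.5)
```

Because the item-based framework only needs some item-item similarity as input, a blended score like this can replace the rating-only similarity without changing the prediction step.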

The rest of this paper is organized as follows. In Section 2, we provide the necessary background information on the item-based collaborative filtering framework. In Section 3, we discuss our semantically enhanced approach. In this section we first discuss the problem of ontology-based extraction of class instances in a particular domain and the structured representation of the extracted semantic attributes for items. We then present our approach for combining semantic and rating (or usage) similarity of items to generate predictions. In Section 4, we discuss the characteristics of our experimental data sets and present our experimental evaluation of the proposed approach. Finally, we conclude with a summary of our findings and some directions for future work.


Bamshad Mobasher 2004-03-09