In order to obtain semantic information about items used in the collaborative filtering process, we must extract domain-level structured objects as semantic entities contained within Web pages on one or more Web sites. This task involves the automatic extraction and classification of objects of different types into classes based on an underlying reference domain ontology.
An ontology provides a set of well-founded constructs that define significant concepts and their semantic relationships. An example of an ontology is a relational schema for a database involving multiple tables and foreign keys semantically connecting these relations. Such constructs can be leveraged to build meaningful higher-level knowledge in a particular domain. Domain ontologies for a Web site usually include concepts, subsumption relations between concepts (concept hierarchies), and other relations among concepts that exist in the domain represented by the Web site. In this paper, we do not directly deal with the problems of automatic ontology acquisition and learning. Rather, we assume the existence of a pre-defined reference ontology for a specific domain, based on which the semantic attributes of items can be extracted. Our goal is to use this semantic knowledge about items together with item ratings (or weights in the context of Web usage data) to create a combined similarity measure for item-based collaborative filtering.
The problem of extracting instances of the ontology classes from Web pages is an interesting problem in its own right and has been studied extensively. This process can be viewed as the classification of objects embedded in one or more Web pages into classes specified as part of a reference ontology. For example, in [10] a text classifier is learned for each "semantic feature" based on a small manually labeled data set. First, Web pages are extracted from different Web sites that belong to a similar domain, and then the semantic features are manually labeled. This small labeled data set is fed into a learning algorithm as training data to learn the mappings between Web objects and concept labels. Craven et al. [7] adopt a combined approach of statistical text classification and first-order text classification in recognizing concept instances. In that study, the learning process is based on both page content and linkage information. The problems and issues related to using ontologies in the context of Web mining have been discussed in [3].
In our approach, we have used domain-specific wrapper agents that use text mining and heuristic rules to extract class and attribute instances from Web sites based on a pre-specified reference ontology. At the present time, we do not use a general ontology representation language, such as DAML+OIL [12]. Rather, we represent the ontology classes as part of the schema for a relational database. Our simple representation scheme does not take into account complex relationships among classes (such as inheritance), but is adequate for specifying the attributes associated with classes (relations). Our wrapper agents use the relational schema for classes and simple heuristics based on textual cues to extract attribute values and populate instances of these classes (tuples). In the future, we intend to extend our work by incorporating more general ontology languages that can capture (and allow reasoning with) a richer set of structural relationships among classes and objects. The implementation details of the wrapper agents are beyond the scope of the present work and will be discussed elsewhere.
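To make the idea of extraction from textual cues concrete, the following is a minimal sketch and not the wrapper agents used in this work. It assumes a page has already been reduced to plain text with one labeled field per line; the cue strings ("Title:", "Genre:", "Directed by") and the function name `extract_instance` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical textual cues: attribute name -> pattern whose first group
# captures the attribute value. Real pages would need site-specific cues.
CUES = {
    "title": re.compile(r"^Title:[ \t]*(.+)$", re.MULTILINE),
    "genre": re.compile(r"^Genre:[ \t]*(.+)$", re.MULTILINE),
    "director": re.compile(r"^Directed by:?[ \t]*(.+)$", re.MULTILINE),
}


def extract_instance(page_text, cues=CUES):
    """Return one class instance (a tuple in dict form) for a page.

    Attributes whose cue does not occur in the text are set to None.
    """
    instance = {}
    for attr, pattern in cues.items():
        match = pattern.search(page_text)
        instance[attr] = match.group(1).strip() if match else None
    return instance
```

Each returned dictionary corresponds to one tuple of the relation for the class being populated; attributes that cannot be located are left empty rather than guessed.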
As an example, let us consider a movie Web site such as the Internet Movie Database (www.imdb.com). This Web site includes a collection of pages containing information about movies, actors, directors, etc. A collection of pages describing a specific movie might include attribute information such as the movie title, genre, actors, director, etc. These represent the attributes associated with a class that represents movies in our reference ontology. A domain ontology for this site may contain the classes Movie, Actor and Director along with their attributes. In our representation, some of the attributes represent properties of a given class and others represent reference slots corresponding to other classes. For instance, the "Actor" attribute of the Movie class represents a reference to the class (relation) Actor and, in the relational representation, is specified as a foreign key in the Movie relation. Figure 1 depicts the class Movie and its attributes. An actor's or director's attribute information may include name, filmography (a set of movies), gender, nationality, etc. The dotted arrows in attributes such as "Actor" and "Director" indicate that they represent references to other classes in the ontology. The collection of Web pages in the site represents a group of embedded objects that are the instances of these classes.
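The relational encoding of the Movie, Actor and Director classes can be sketched as follows. This is an illustrative schema only: the column names, the key columns, and the helper functions are assumptions, and a single "Actor" foreign key per movie is a simplification (a movie with several actors would typically need a separate linking relation).

```python
import sqlite3

# Illustrative schema: property attributes are plain columns, reference
# slots are foreign keys. All column names here are assumptions.
SCHEMA = """
CREATE TABLE Actor (
    actor_id    INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    gender      TEXT,
    nationality TEXT
);
CREATE TABLE Director (
    director_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    nationality TEXT
);
CREATE TABLE Movie (
    movie_id    INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    genre       TEXT,
    actor_id    INTEGER REFERENCES Actor(actor_id),
    director_id INTEGER REFERENCES Director(director_id)
);
"""


def build_ontology_db():
    """Create an in-memory database holding the three class relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the reference slots
    conn.executescript(SCHEMA)
    return conn


def movie_with_references(conn, movie_id):
    """Resolve a movie's reference slots to actor and director names."""
    return conn.execute(
        """SELECT m.title, m.genre, a.name, d.name
           FROM Movie m
           LEFT JOIN Actor a ON a.actor_id = m.actor_id
           LEFT JOIN Director d ON d.director_id = m.director_id
           WHERE m.movie_id = ?""",
        (movie_id,),
    ).fetchone()
```

Following a reference slot then amounts to a join on the corresponding foreign key, which is how the attributes of a referenced Actor or Director instance become available when describing a movie.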
To facilitate the computation of item similarities, the extracted class instances generally need to be converted into a vector representation. In our case, the values of semantic attributes associated with class instances are collected into a relational table whose rows represent the items and whose columns correspond to the extracted attributes. Additional preprocessing tasks, such as normalization and discretization (for continuous attributes), can be performed on the data in order to provide a uniform representation. This process generally results in the addition of attributes, for example, attributes representing different intervals in a continuous range, or representing each unique discrete value of a categorical attribute in the original data. The final result is an $n \times d$ matrix $S$, where $n$ is the number of items (one per row) and $d$ is the total number of unique semantic attributes. We call this matrix the semantic attribute matrix.
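As an illustration of how a semantic attribute matrix might be assembled, the sketch below expands each categorical attribute into one binary column per distinct value and each continuous attribute into one binary column per interval. The function names, the equal-width binning after min-max normalization, and the default of three intervals are assumptions made for the example, not details taken from this work.

```python
def _as_set(value):
    """Treat a missing value as empty and a scalar as a one-element set."""
    if value is None:
        return set()
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def build_semantic_attribute_matrix(items, categorical, continuous, n_bins=3):
    """Return (matrix, columns): one row per item, one column per attribute."""
    columns = []
    # One column for each distinct value of each categorical attribute.
    for attr in categorical:
        values = sorted({v for item in items for v in _as_set(item.get(attr))})
        columns.extend((attr, v) for v in values)
    # One column for each equal-width interval of each continuous attribute.
    ranges = {}
    for attr in continuous:
        observed = [item[attr] for item in items if item.get(attr) is not None]
        if not observed:
            continue
        ranges[attr] = (min(observed), max(observed))
        columns.extend((attr, f"bin{b}") for b in range(n_bins))

    index = {col: j for j, col in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in items]
    for i, item in enumerate(items):
        for attr in categorical:
            for v in _as_set(item.get(attr)):
                matrix[i][index[(attr, v)]] = 1
        for attr, (lo, hi) in ranges.items():
            x = item.get(attr)
            if x is None:
                continue
            # Min-max normalize to [0, 1], then map to an interval index.
            scaled = 0.0 if hi == lo else (x - lo) / (hi - lo)
            b = min(int(scaled * n_bins), n_bins - 1)
            matrix[i][index[(attr, f"bin{b}")]] = 1
    return matrix, columns
```

With this encoding, the number of columns equals the number of distinct categorical values plus the number of intervals across all continuous attributes, and each row is the attribute vector of one item.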