Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions or user sessions. A user session consists of all of the page references made by a user during a single visit to a site. Identifying user sessions is similar to the problem of identifying individual users, as discussed above. A transaction differs from a user session in that its size can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions. Unlike traditional domains for data mining, such as point-of-sale databases, there is no convenient method of clustering page references into transactions smaller than an entire user session. This problem has been addressed in [CMS97] and [CPY96].
[CMS97] assumes that each page reference serves one of two purposes: navigation, to get to another page, or information content. Two types of transactions are defined. The first type is navigation-content, where each transaction consists of a single content reference and all of the navigation references in the traversal path leading to that content reference. These transactions can be used to mine for path traversal patterns. The second type is content-only, where a transaction consists of all of the content references for a given user session. These transactions can be used to discover associations between the content pages of a site. A given page reference is classified as either navigational or content based on the time spent on the page. This kind of "page typing" is further delineated in [PPR96], where various page types such as index pages, personal home pages, etc. are used in the discovery of user patterns.
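The two transaction types can be illustrated with a minimal sketch. This is not the procedure of [CMS97]. It assumes a session is already available as a list of (page, seconds spent) pairs, that a single hypothetical cutoff separates navigation from content references, and that the traversal path leading to a content reference is simply the run of navigation references since the previous content reference. The function names and the cutoff value are illustrative assumptions.

```python
def classify(session, cutoff_seconds):
    """Label each (page, seconds) reference as 'content' or 'navigation'.

    Assumption: a reference viewed for at least cutoff_seconds is content.
    """
    return [
        (page, "content" if seconds >= cutoff_seconds else "navigation")
        for page, seconds in session
    ]


def navigation_content_transactions(typed_session):
    """One transaction per content reference, preceded by the navigation
    references seen since the previous content reference (a simplification)."""
    transactions, current = [], []
    for page, kind in typed_session:
        current.append(page)
        if kind == "content":
            transactions.append(current)
            current = []
    # Trailing navigation references that never reach content are dropped.
    return transactions


def content_only_transaction(typed_session):
    """All content references of the session, in order of access."""
    return [page for page, kind in typed_session if kind == "content"]


# Hypothetical session: (page, seconds spent on the page).
session = [("A", 5), ("B", 8), ("C", 120), ("D", 4), ("E", 90)]
typed = classify(session, cutoff_seconds=30)
nav_content = navigation_content_transactions(typed)
content_only = content_only_transaction(typed)
```

With the hypothetical cutoff of 30 seconds, pages C and E are treated as content, so the session yields the navigation-content transactions A B C and D E and the content-only transaction C E.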
[CPY96] defines the concept of maximal forward reference in order to identify transactions. Each transaction is defined to be the set of pages in the path from the first page in the log for a user up to the page before a backward reference is made. A new transaction is started when the next forward reference is made. A forward reference is defined to be a page not already in the set of pages for the current transaction. Similarly, a backward reference is defined to be a page that is already contained in the set of pages for the current transaction. For example, an access sequence of A B C D C B E F E G would be broken into three transactions, namely A B C D, A B E F, and A B E G. The transactions created with this algorithm are similar to the navigation-content transactions of [CMS97] and can be used to mine for path traversal patterns.
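The splitting rule can be sketched in a few lines. This is an illustrative reading of the description above, not the implementation of [CPY96]. It assumes the access sequence for a single user is given as a list of page identifiers, and that a backward reference truncates the current path back to the revisited page. The function name is hypothetical.

```python
def maximal_forward_references(pages):
    """Split one user's access sequence into maximal forward references."""
    transactions = []
    path = []                # current traversal path
    moving_forward = False   # True if the previous reference was a forward one

    for page in pages:
        if page in path:
            # Backward reference: the path built so far is maximal only if
            # the previous reference moved forward.
            if moving_forward:
                transactions.append(list(path))
            # Back up to the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            # Forward reference: extend the current path.
            path.append(page)
            moving_forward = True

    # The sequence may end while still moving forward.
    if moving_forward and path:
        transactions.append(list(path))
    return transactions


result = maximal_forward_references("A B C D C B E F E G".split())
# result == [['A', 'B', 'C', 'D'], ['A', 'B', 'E', 'F'], ['A', 'B', 'E', 'G']]
```

Consecutive backward references, such as C followed by B in the example, produce only one transaction, because a path is emitted only at the point where the user stops moving forward.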