With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to rely on automated tools to find, extract, filter, and evaluate the desired information and resources. In addition, with the transformation of the Web into the primary tool for electronic commerce, it is imperative for organizations and companies that have invested millions in Internet and Intranet technologies to track and analyze user access patterns. These factors give rise to the need for server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities.
Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This broad definition covers, on the one hand, the automatic search and retrieval of information and resources available from millions of sites and on-line databases, i.e., Web content mining, and, on the other hand, the discovery and analysis of user access patterns from one or more Web servers or on-line services, i.e., Web usage mining.
In this paper, we provide an overview of tools, techniques, and problems associated with both of the dimensions above. Our primary focus, however, is on the second dimension, or Web usage mining. We present a taxonomy of Web mining to clarify our usage of the term, and place various aspects and components of Web mining in their proper context.
There are several important issues, unique to the Web paradigm, that come into play if sophisticated types of analyses are to be done on server-side data collections. These include the necessity of integrating various data sources such as server access logs, referrer logs, and user registration or profile information; resolving difficulties in the identification of users due to missing unique key attributes in collected data; and the importance of identifying user sessions or transactions from usage data, site topologies, and models of user behavior. We devote the main part of this paper to the discussion of issues and problems that characterize Web usage mining. Furthermore, we survey some of the emerging tools and techniques, and identify several future research directions.
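To make the session identification problem concrete, the following is a minimal sketch in Python, not a description of any particular system. It assumes a simplified log of (host, timestamp, URL) records, treats the client host as a stand-in for the user because no unique user key is available, and starts a new session whenever the gap between two consecutive requests from the same host exceeds a timeout. The record layout, the sample data, the function name identify_sessions, and the 30-minute timeout are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, simplified access-log records: (client host, timestamp, URL).
# The host is only a proxy for a user; real logs lack a unique user key.
SAMPLE_LOG = [
    ("10.0.0.1", "1997-03-01 10:00:00", "/index.html"),
    ("10.0.0.1", "1997-03-01 10:05:00", "/products.html"),
    ("10.0.0.2", "1997-03-01 10:06:00", "/index.html"),
    ("10.0.0.1", "1997-03-01 11:30:00", "/index.html"),
]


def identify_sessions(records, timeout=timedelta(minutes=30)):
    """Group requests by host and split them into sessions on long gaps.

    The timeout value is an assumed heuristic, not a fixed standard.
    Returns a list of (host, [urls]) pairs, one pair per session.
    """
    by_host = defaultdict(list)
    for host, stamp, url in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_host[host].append((when, url))

    sessions = []
    for host, hits in by_host.items():
        hits.sort()  # order each host's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions


if __name__ == "__main__":
    for host, urls in identify_sessions(SAMPLE_LOG):
        print(host, urls)
```

A pure timeout heuristic ignores site topology and models of user behavior, and it conflates distinct users who share a host (for example, behind a proxy), which is one reason the identification problems above remain difficult.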
The rest of this paper is organized as follows: Section 2 presents a taxonomy of Web mining and a brief overview of research and development in each of its components. Section 3 identifies the major problems associated with Web usage mining and examines several techniques and approaches used for solving these problems. Section 4 describes the tools available for analyzing and interpreting discovered usage patterns. Section 5 presents a general architecture for Web usage mining and gives an overview of the WEBMINER, a system developed based on this architecture. Finally, Sections 6 and 7 present future research directions and conclusions.