
Data Pre-Processing for Mining

Web usage data is collected in various ways, with each mechanism collecting the attributes relevant to its purpose. The data needs to be pre-processed to make it easier to mine for knowledge. Specifically, we believe the following issues need to be addressed:

  1. Instrumentation & Data Collection: Clearly, improved data quality can improve the quality of any analysis performed on it. A problem in the Web domain is the inherent conflict between the needs of analysts (who want more detailed usage data collected) and the privacy needs of users (who want as little data collected as possible). This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard for collecting profile data may offer a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time.

  2. Data Integration: Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information which may not be evident from any one of them. Techniques from data integration [LHS95] should be examined for this purpose.

  3. Transaction Identification: Web usage data collected in various logs is at a very fine granularity. While this makes the data extremely general and fairly detailed, it also means the data cannot be analyzed directly, since the analysis may focus on micro trends rather than on macro trends. On the other hand, whether a trend is micro or macro depends on the purpose of a specific analysis. Hence, we believe there is a need to group individual data collection events into units, called Web transactions [CMS97], before feeding them to the mining system. While [MJHS96,CPY96,CMS97] have proposed techniques to do so, more attention needs to be given to this issue.
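
To make the transaction identification issue concrete, the sketch below groups fine-grained log events into coarser units using a simple time-gap heuristic. It is only one possible illustration and is not claimed to be the technique of [MJHS96], [CPY96], or [CMS97]. The function name `identify_transactions`, the `(host, timestamp, url)` event layout, and the 1800-second `max_gap` default are all assumptions made for this example.

```python
from collections import defaultdict


def identify_transactions(events, max_gap=1800):
    """Group log events into per-host transactions (hypothetical sketch).

    events  -- iterable of (host, timestamp_in_seconds, url) tuples (assumed layout)
    max_gap -- assumed threshold: a longer pause starts a new transaction
    """
    # Collect each host's hits so they can be ordered in time.
    by_host = defaultdict(list)
    for host, ts, url in events:
        by_host[host].append((ts, url))

    transactions = []
    for host, hits in by_host.items():
        hits.sort()  # order by timestamp
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > max_gap:
                # The pause is too long: close the current transaction.
                transactions.append((host, current))
                current = []
            current.append(url)
            last_ts = ts
        transactions.append((host, current))
    return transactions


events = [
    ("a", 0, "/"),
    ("a", 60, "/x"),
    ("b", 10, "/"),
    ("a", 4000, "/y"),
]
result = identify_transactions(events)
```

Changing `max_gap` changes the granularity of the resulting groups, which reflects the point that whether a trend is micro or macro depends on the purpose of the analysis.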

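As an illustration of the data integration issue (item 2), the following minimal sketch correlates records from three sources by shared keys. The record layouts (dictionaries with `host`, `url`, `referrer`, and `user_id` fields) and the function name `integrate` are hypothetical; real logs and registration files would first need parsing and more careful matching, and the techniques of [LHS95] are not represented here.

```python
def integrate(access_log, referral_log, registrations):
    """Attach referrer and user identity to each access-log hit.

    All three inputs are lists of dicts with assumed field names:
      access_log    -- {"host", "url"}
      referral_log  -- {"host", "url", "referrer"}
      registrations -- {"host", "user_id"}
    """
    # Index the secondary sources by the keys they share with the access log.
    referrers = {(r["host"], r["url"]): r["referrer"] for r in referral_log}
    users = {u["host"]: u["user_id"] for u in registrations}

    merged = []
    for hit in access_log:
        merged.append({
            **hit,
            # None marks information that a source does not provide.
            "referrer": referrers.get((hit["host"], hit["url"])),
            "user_id": users.get(hit["host"]),
        })
    return merged


access_log = [{"host": "h1", "url": "/a"}, {"host": "h2", "url": "/b"}]
referral_log = [{"host": "h1", "url": "/a", "referrer": "/index"}]
registrations = [{"host": "h2", "user_id": "u42"}]
merged = integrate(access_log, referral_log, registrations)
```

Each merged record carries information that no single source holds on its own, which is the kind of correlation the item describes.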





Bamshad Mobasher
Wed Jul 16 02:08:33 CDT 1997