
Assignment 2

Notes:
  • This assignment is due Thursday, November 16.
  • A list of data mining tools (including local versions) is available from the Tools and Software section. The page now includes links to some specific tools developed here for clustering and profile generation, which will be used for this assignment.
    1. For this problem you will use some preprocessed data from a real e-commerce site. Through clustering and other data mining techniques, you will attempt to find interesting user profiles and patterns among products in the data. Due to the sensitive nature of this data, the zip file containing the data and its description is password protected. You can download the data, using your class user-id and password, from the links provided below (see the bottom of this page).

      There are two primary types of products sold through the above site. Each category includes various subcategories and individual products from multiple vendors. For simplicity, the provided data combines visited pages from the log files, category and subcategory names, and product-related content pages/categories. There are a total of 253 such attributes in the data. Two data files capture visit information: one contains session level information (i.e., pages, products, and categories viewed by a user in one session) and the other contains visitor level information (i.e., pages, products, and categories viewed by a user across different visits/sessions). The session file contains 7551 session records, and the visitor file contains 5824 visitor records. In addition, an index file of attribute labels is provided, which is needed for profile generation.

      Both data files are in tab-delimited form, and you can manipulate them for further processing using standard spreadsheet or database programs (for example, you may wish to limit the data mining tasks to specific pages or product categories by removing some columns, etc.).
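
      As an alternative to a spreadsheet, the column removal described above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the helper name keep_columns, and the assumption that the files have no header row are all hypothetical and are not taken from the data description.

```python
import csv
import io

def keep_columns(tsv_text, keep):
    """Return the rows of a tab-delimited table restricted to the column indices in `keep`."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [[row[i] for i in keep] for row in reader if row]

# Hypothetical sample: 3 sessions x 4 attributes (NOT the real assignment data).
sample = "1\t0\t1\t0\n0\t0\t1\t1\n1\t1\t0\t0\n"

# Keep only attributes 0 and 2, e.g. to focus on one product category.
subset = keep_columns(sample, keep=[0, 2])
print(subset)
```

      If you remove columns this way, remember that the index file of attribute labels would need the same columns removed so that labels and data stay aligned.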

      There are two primary tasks to be performed in this problem:

      1. Using the provided data files, generate usage profiles (as discussed in class on October 19) at both the session level and the visitor level. For this you will first need to cluster the data sets and then use the clustering results for profile generation.

        The programs you will need to perform these tasks are included in the zip file available from the link below (also from the Tools section of the Web site). These include the command line programs "cluster.exe" (used to perform k-means clustering) and "gen-profiles.exe" (used to generate profiles). You should start by reading the readme files provided as part of the distribution and experimenting with the examples. Note that the data sets provided are in the format necessary for use by these programs.

        You should repeat the process several times, experimenting with parameters such as the number of clusters, the thresholds for generating profiles, etc. For example, you may want to start by using 5 clusters to generate profiles, then try 10, 15, 20, and 25 clusters to see if this impacts the identification of important patterns. For your final submission, you should summarize the results of these experiments, but you should only submit the actual results that worked best.

        Based on these results, discuss the important patterns that you have discovered in the data. This includes possible relationships among products, categories, and/or pages, as well as identifiable characteristics of the various user segments (each represented by a profile). You should also discuss similarities or differences between the results based on session level information and those based on visitor level information.
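
        To make the cluster-then-profile idea concrete, here is a minimal sketch in Python. It is not the algorithm implemented in "cluster.exe" or "gen-profiles.exe" (read their readme files for that). It assumes that a profile consists of the attributes whose mean weight within a cluster is at least some threshold; the attribute names, the sample sessions, and the choice to seed centroids with the first k rows are all made up for illustration.

```python
def kmeans(rows, k, iterations=20):
    """Plain k-means on lists of numbers; the first k rows seed the centroids."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its member rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def profiles(centroids, names, threshold):
    """For each cluster, keep the attributes whose mean weight is >= threshold."""
    return [
        {name: round(w, 2) for name, w in zip(names, centroid) if w >= threshold}
        for centroid in centroids
    ]

# Hypothetical attribute labels and binary session vectors (NOT the real data).
names = ["home", "cat_A", "prod_A1", "cat_B"]
sessions = [
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

labels, centroids = kmeans(sessions, k=2)
for i, profile in enumerate(profiles(centroids, names, threshold=0.5)):
    print(i, profile)
```

        Raising the threshold shrinks each profile to its most characteristic attributes, while lowering it admits weaker ones. This is the kind of trade-off you are asked to explore alongside the number of clusters.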

      2. Perform association rule mining on the session level data set. The goal here is to identify/discuss important and useful relationships discovered among products, categories, and pages visited by the users. In addition, compare the rules identified with the profiles generated in the previous part of the problem, and discuss any relationships you can find, whether reinforcing or contradictory.

        For your convenience, the ARFF file containing the session level transaction data is included in the data archive. This file can be used by WEKA to do association rule mining. However, you should again run your experiments several times, varying support and confidence as well as other parameters. Also, for the purpose of market basket analysis, you should focus only on selected attributes (those that you think would be important) rather than all 253 attributes. This can be done by applying the WEKA attribute filters (either from the command line, or using the "Experimenter" GUI). It is also possible to do this by manually removing some attributes and the associated data from the input data, but that would require changing the ARFF file.

        Note: You should not submit the full set of rules discovered. Instead, select those that are relevant to the objectives mentioned above, and provide a discussion of why you included them. You should also include your observations, if any, on the relationships between the rules discovered here and the user profiles.
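
        As a reminder of what the support and confidence parameters measure, here is a small Python sketch using the standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule X => Y is support(X and Y together) divided by support(X). The sessions and item names below are invented for illustration; WEKA computes these values for you as part of rule mining.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical sessions, each a set of visited pages/categories (NOT the real data).
sessions = [
    {"home", "cat_A", "prod_A1"},
    {"home", "cat_B"},
    {"cat_A", "prod_A1"},
    {"home", "cat_A"},
]

rule_support = support(sessions, {"cat_A", "prod_A1"})
rule_confidence = confidence(sessions, {"cat_A"}, {"prod_A1"})
print(rule_support, rule_confidence)
```

        Here the rule cat_A => prod_A1 holds in 2 of 4 sessions (support 0.5) and in 2 of the 3 sessions that contain cat_A (confidence 2/3). A higher minimum support yields fewer, more frequent rules, which is why varying both thresholds changes what you find.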


    2. In this problem, you will perform an analysis very similar to that of the previous problem, but on a very different data set. The data set includes movie ratings for 500 users and 1000 movies. It is an adaptation of the data collected by the MovieLens collaborative filtering project (www.movielens.org). As in the previous problem, this data set is provided in standard tabular format (one row for each user and one column for each movie). In addition, an index file of movie names is provided, which can be used for profile generation. The movie ratings are on a scale of 1 (worst rating) to 5 (best rating).

      One important difference between this and the previous data set is in the way zero entries should be treated. In this case, a zero entry in the data should be interpreted as a missing rating (rather than as a poor rating). This is an important factor to consider in interpreting the results. It also requires you to use different options for the profile generation program than those used in problem 1.
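
      The effect of this distinction is easy to see in a small example. The Python sketch below averages one hypothetical movie's ratings within a cluster in two ways: skipping zeros, and counting them as ratings. It only illustrates why the interpretation matters; it does not describe the actual options of the profile generation program, which are documented in its readme file.

```python
def mean_rating(ratings, zeros_are_missing=True):
    """Average a list of ratings, optionally skipping zeros as missing values."""
    if zeros_are_missing:
        ratings = [r for r in ratings if r != 0]
    # None signals that no user in the group rated the movie.
    return sum(ratings) / len(ratings) if ratings else None

# Hypothetical ratings of one movie by five users in a cluster (0 = not rated).
column = [5, 0, 4, 0, 0]

as_missing = mean_rating(column)                          # averages 5 and 4 only
as_ratings = mean_rating(column, zeros_are_missing=False)  # averages all five entries
print(as_missing, as_ratings)
```

      Treated as missing, the zeros leave a mean of 4.5, so the movie looks well liked by those who rated it. Counted as ratings, they pull the mean down to 1.8, below the lowest rating any of these users actually gave.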

      Your task in this problem is the same as part 1 of the previous problem. In other words, you are to experiment with clustering and profile generation using this data, and record and discuss your observations on potential relationships among items (movies) and potential patterns within or among user groups.

      In addition to the above data, the archive includes a data file (users.dat) containing some demographic attributes for each user (e.g., age, occupation, etc.). You will use this file to merge the demographic information with the clusters of users. This can be done using one of the provided programs, "get-clusters.exe" (please see the program's readme file for details on how to use it). After performing this task, try to characterize some of the user groups, through observation, based on this demographic data.
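
      Conceptually, the merge joins each user's cluster assignment with that user's demographic record and then summarizes each cluster. The Python sketch below shows the idea on invented data. The field names, the user ids, and the dictionary layout are assumptions for illustration only; they do not reflect the actual format of users.dat or the output of "get-clusters.exe".

```python
from collections import Counter, defaultdict

def summarize(cluster_of, demographics):
    """Group demographic records by cluster and compute simple summaries."""
    by_cluster = defaultdict(list)
    for user, cluster in cluster_of.items():
        if user in demographics:  # skip users with no demographic record
            by_cluster[cluster].append(demographics[user])
    summary = {}
    for cluster, people in sorted(by_cluster.items()):
        occupations = Counter(p["occupation"] for p in people)
        summary[cluster] = {
            "size": len(people),
            "mean_age": sum(p["age"] for p in people) / len(people),
            "top_occupation": occupations.most_common(1)[0][0],
        }
    return summary

# Hypothetical cluster assignments (user id -> cluster) and demographic records.
cluster_of = {1: 0, 2: 1, 3: 0, 4: 1}
demographics = {
    1: {"age": 24, "occupation": "student"},
    2: {"age": 41, "occupation": "engineer"},
    3: {"age": 20, "occupation": "student"},
    4: {"age": 39, "occupation": "engineer"},
}

summary = summarize(cluster_of, demographics)
for cluster, stats in summary.items():
    print(cluster, stats)
```

      Per-cluster summaries of this kind (typical age, most common occupation) are a reasonable starting point for characterizing a user group, which you can then relate to the movies in that group's profile.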






    Copyright © 2000, Bamshad Mobasher, School of CTI, DePaul University.