Assignment 2
- Notes:
- This assignment is due Thursday, November 16.
- A list of data mining tools (including local versions) is
available from the Tools and Software section. The page now includes
links to some specific tools developed here for clustering and
profile generation, which will be used for this assignment.
- Problem 1
For this problem you will use some preprocessed data from a real
e-commerce site, and through clustering and other data mining
techniques, you will attempt to find interesting user profiles and
patterns among products in the data. Due to the sensitive nature of
this data, the zip file containing the data and its description is
password protected. You can download the data using your class
user-id and password from the links provided at the bottom of this
page.
There are two primary types of products sold through the above site.
Each category includes various subcategories and individual products
from multiple vendors. For simplicity, the provided data combines
visited pages from the log files, category and subcategory names, and
product related content pages/categories. There are a total of 253
such attributes in the data. There are two data files which capture
visit information: one includes the session level information (i.e.,
pages, products, categories viewed by a user in one session) and the
other includes visitor level information (i.e., pages, products,
categories viewed by a user across different visits/sessions). The
session file contains 7551 session records, and the visitor file
contains 5824 visitor records. In addition, an index file of attribute
labels is provided; it is needed for profile generation.
Both data files are in tab-delimited form, and you can manipulate
them for further processing using standard spreadsheet or database
programs (for example, you may wish to limit the data mining tasks to
specific pages or product categories by removing some columns, etc.).
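As an alternative to a spreadsheet, a few lines of Python can drop columns from a tab-delimited file. The sketch below is only an illustration: the sample rows and the column positions are made up, and it assumes every column is an attribute (the actual file layout is described in the data description inside the archive).

```python
import csv
import io

def keep_columns(tsv_text, keep):
    """Return tab-delimited text containing only the column positions in `keep`."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Copy the selected columns in the order given by `keep`.
        writer.writerow([row[i] for i in keep])
    return out.getvalue()

# Hypothetical sample: three sessions, four attribute columns.
sample = "1\t0\t1\t0\n0\t0\t1\t1\n1\t1\t0\t0\n"
reduced = keep_columns(sample, keep=[0, 2])
print(reduced)
```

If columns are removed this way, the index file of attribute labels would presumably need the same entries removed so that labels still line up with columns; check the readme files before relying on this.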
There are two primary tasks to be performed in this problem:
- (a)
Using the provided data files, generate usage profiles (as discussed
in class on October 19) at both the session level and the visitor
level. To do this, first cluster each data set and then use the
clustering results for profile generation.
The programs you will need to perform these tasks are included in the
zip file available from the link below (also from the Tools section of the Web site). These include
the command line programs "cluster.exe" (used to perform k-means
clustering) and "gen-profiles.exe" (used to generate profiles). You
should start by reading the readme files provided as part of the
distribution and experimenting with the examples. Note that the data
sets provided are in the format necessary for use by these programs.
You should repeat the process several times, experimenting with
parameters such as the number of clusters, the thresholds for
generating profiles, etc. For example, you may want to start by using
5 clusters to generate profiles, then try 10, 15, 20, and 25 clusters
to see if this impacts the identification of important patterns. For
your final submission, summarize the results of these experiments,
but submit only the actual results that worked best.
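To make the idea of sweeping over the number of clusters concrete, here is a minimal pure-Python k-means sketch. It is a stand-in for illustration only, not the provided "cluster.exe". The toy rows, the random initialization, and the use of the within-cluster sum of squared errors (SSE) as a comparison measure are all assumptions of this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means; returns (assignments, centroids, sse)."""
    rng = random.Random(seed)
    # Start from k rows picked at random (requires k <= len(rows)).
    centroids = [list(r) for r in rng.sample(rows, k)]

    def nearest(r):
        return min(range(k), key=lambda c: sq_dist(r, centroids[c]))

    for _ in range(iters):
        assign = [nearest(r) for r in rows]
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centroids.
    assign = [nearest(r) for r in rows]
    sse = sum(sq_dist(r, centroids[a]) for r, a in zip(rows, assign))
    return assign, centroids, sse

# Hypothetical toy data: six sessions over four binary attributes.
rows = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
        [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]

sse_by_k = {k: kmeans(rows, k)[2] for k in (1, 2, 3)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

A lower SSE alone does not mean better profiles, since SSE tends to fall as the number of clusters grows. It is one number to record next to your qualitative observations for each run.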
Based on these results, discuss important patterns that you have
discovered in the data. This includes possible relationships among
products, categories, and/or pages, as well as identifiable
characteristics of various user segments (represented by each
profile). Also, you should discuss similarities or differences in the
results based on session level information versus visitor level
information.
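One common way to turn a cluster into a profile is to average each attribute over the cluster's members and keep only the attributes whose average meets a threshold. The sketch below illustrates that idea under stated assumptions: the attribute labels and rows are hypothetical, and whether "gen-profiles.exe" computes exactly this is not stated here (see its readme).

```python
def cluster_profile(rows, labels, threshold):
    """Return (label, weight) pairs for attributes whose mean weight in the
    cluster is at least `threshold`, sorted from highest to lowest weight."""
    n = len(rows)
    weights = [sum(col) / n for col in zip(*rows)]
    profile = [(labels[i], w) for i, w in enumerate(weights) if w >= threshold]
    return sorted(profile, key=lambda item: item[1], reverse=True)

# Hypothetical attribute labels and the members of one cluster.
labels = ["home", "cat_A", "cat_B", "checkout"]
cluster_rows = [[1, 1, 0, 0],
                [1, 1, 0, 1],
                [1, 0, 0, 0],
                [1, 1, 0, 0]]

profile = cluster_profile(cluster_rows, labels, threshold=0.5)
print(profile)
```

Lowering the threshold admits more attributes into each profile, which is why the threshold is worth varying alongside the number of clusters.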
- (b)
Perform association rule mining on the session level data set. The
goal here is to identify/discuss important and useful relationships
discovered among products, categories, and pages visited by the
users. In addition, compare the rules identified with the profiles
generated in the previous part of the problem, and discuss any
relationships you can find, whether reinforcing or contradictory.
For your convenience, the ARFF file containing the session level
transaction data is included in the data archive. This file can be
used by WEKA to do association rule mining. You should again run
your experiments several times, varying support and confidence as
well as other parameters. Also, for the purpose of market basket
analysis, you should focus only on selected attributes (those that
you think would be important) rather than all 253 attributes. This
can be done by applying the WEKA attribute filters (either from the
command line or using the "Experimenter" GUI). It is also possible
to do this by manually removing some attributes and the associated
data from the input data, but that would require changing the ARFF
file.
Note: Do not submit the full set of rules discovered. Instead,
select those that are relevant to the objectives above, and explain
why you included them. You should also include your observations, if
any, on how the rules discovered here relate to the user profiles.
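As a reminder of what the support and confidence parameters measure, the following sketch computes both for a single rule over a handful of made-up transactions. It only illustrates the definitions; it is not how WEKA is invoked, and the item names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    s_ante = support(transactions, antecedent)
    if s_ante == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / s_ante

# Hypothetical sessions, each a set of visited items.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]

s = support(transactions, {"A", "B"})        # 2 of 4 sessions
c = confidence(transactions, {"A"}, {"B"})   # 2 of the 3 sessions with A
print(s, round(c, 3))
```

Raising the minimum support prunes rules about rarely visited items, while raising the minimum confidence prunes rules whose consequent follows only some of the time.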
- Problem 2
In this problem, you will perform an analysis very similar to that of
the previous problem, but using a very different data set. The
data set includes movie ratings for 500 users and 1000 movies. It is
an adaptation of the data collected by the MovieLens collaborative
filtering project (www.movielens.org). As in the previous
problem, this data set is provided in standard tabular format (one
row for each user and one column for each movie). In addition, an
index file of movie names is provided, which can be used for profile
generation. The movie ratings are on a scale of 1 (worst rating) to 5
(best rating).
One important difference between this and the previous data set is in
the way zero entries should be treated. In this case, a zero entry in
the data should be interpreted as a missing rating (rather than as a
poor rating). This is an important factor to consider in interpreting
the results. It also means you will need to use different options for
the profile generation program than those used in problem 1.
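The effect of treating zeros as missing can be seen in a small calculation. The sketch below averages one movie's ratings both ways. The ratings are made up, and the function is illustrative only; it does not correspond to any option of the profile generation program.

```python
def mean_rating(column, zeros_missing=True):
    """Average a movie's ratings; optionally skip zeros as missing values."""
    values = [v for v in column if v != 0] if zeros_missing else list(column)
    return sum(values) / len(values) if values else None

# Hypothetical ratings of one movie by six users (0 = not rated).
ratings_for_movie = [5, 0, 4, 0, 0, 3]

as_missing = mean_rating(ratings_for_movie)                      # 12 / 3
as_zero = mean_rating(ratings_for_movie, zeros_missing=False)    # 12 / 6
print(as_missing, as_zero)
```

Counting the zeros makes a well-liked but rarely rated movie look poorly rated, which is the misreading the paragraph above warns against.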
Your task in this problem is the same as part (a) of the previous
problem. In other words, you are to experiment with clustering and
profile generation using this data, and record and discuss your
observations on potential relationships among items (movies) and
potential patterns within or among user groups.
In addition to the above data, the archive includes a data file
(users.dat) containing some demographic attributes about each user
(e.g., age, occupation, etc.). You will use this file to merge the
demographic information with the clusters of users. This can be done
using one of the included programs, "get-clusters.exe" (please see
the readme file for that program for details on how to use it). After
performing this task, try to characterize some of the user groups,
through observation, based on this demographic data.
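Once each user has a cluster assignment, characterizing the groups amounts to tallying demographic values per cluster. The sketch below shows one way to do that in Python. The user ids, field names, and values are hypothetical, and the actual layout of users.dat and the output of "get-clusters.exe" are described in their own documentation.

```python
from collections import Counter, defaultdict

def summarize(cluster_of, demographics, field):
    """Count the values of one demographic field within each cluster."""
    counts = defaultdict(Counter)
    for user_id, cluster in cluster_of.items():
        record = demographics.get(user_id)
        if record is not None:  # skip users with no demographic record
            counts[cluster][record[field]] += 1
    return {c: dict(cnt) for c, cnt in counts.items()}

# Hypothetical cluster assignments (user id -> cluster id).
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1}

# Hypothetical demographic records keyed by user id.
demographics = {
    1: {"age": 25, "occupation": "student"},
    2: {"age": 22, "occupation": "student"},
    3: {"age": 41, "occupation": "engineer"},
    4: {"age": 19, "occupation": "student"},
}

by_occupation = summarize(cluster_of, demographics, "occupation")
print(by_occupation)
```

A cluster dominated by one value, such as a single occupation or age range, is a candidate for a demographic characterization; evenly mixed clusters may not have one.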