Assignment 2

Due Date: Friday, May 2

  1. Consider the following transaction database. Each row represents a single transaction in which the specified items were purchased.
     
    Transaction ID   Items Purchased
    1                A,B,C,D
    2                B,C,D,E,G
    3                A,C,G,H,K
    4                B,C,D,E,K
    5                D,E,F,H,L
    6                A,B,C,D,E,L
    7                A,D,E,F,L
    8                B,I,K,L
    9                C,D,F,L
    10               A,B,D,E,K
    11               C,D,H,I,K
    12               B,C,E,K
    13               B,C,D,F
    14               A,B,C,D
    15               C,H,I,J
    16               A,E,F,H,L
    17               H,K,L
    18               A,B,D,H,K
    19               D,E,K
    20               B,C,D,E,H

    Applying the Apriori algorithm with a minimum support of 30% and a minimum confidence of 75%, find all the association rules in the data set. For each step in the algorithm, give the list of frequent itemsets that satisfy minimum support (i.e., for each iteration i, give the set L_i along with the support values of its itemsets). Also specify the confidence and the lift (improvement) values for each of the rules you discover. Note: You must do this problem manually by hand-tracing the Apriori algorithm.
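
    For checking a hand trace after it is done, the level-wise search and the rule measures can be sketched in a few lines of Python. This is only a minimal illustration, not the required solution method: the function names (support, frequent_itemsets, rule_metrics) are hypothetical, and lift is assumed to be confidence(X -> Y) divided by support(Y).

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= frozenset(t) for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that meet minimum support (this is L_k).
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rule_metrics(antecedent, consequent, transactions):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return confidence, lift


# Hypothetical toy data, NOT the assignment's transaction database.
toy = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = frequent_itemsets(toy, min_support=0.5)
conf, lift = rule_metrics({"butter"}, {"bread"}, toy)
```

    On the toy data, {milk, butter} falls below 50% support, so the only 3-item candidate is pruned before it is ever counted; that pruning is the step a hand trace should show explicitly at each iteration.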


  2. Suppose that we have the following site map associated with the hypothetical Web site:

    where pageview A represents the homepage, pageviews E and F represent insertion of an item in the shopping cart, and G represents the purchase of the item(s) in the cart.

    We also have some preprocessed usage data recorded in the server log for this hypothetical site and stored in a sessions data file. Each row in the data file represents the pageviews accessed by a user during one session, in the order in which these pageviews were accessed.

    Using the Markov Chain model discussed in class (see Lecture 3) and the training session data, compute the transition probabilities for the above graph. Show the computed transition probabilities as labels on the edges of the graph.

    After computing the transition probabilities, answer each of the following using only the Markov model and not the original session data (show your work):

    • What is the probability that a user who starts from the homepage will place something in his/her shopping cart?
    • What is the probability that a user who starts from the homepage will actually make a purchase?
    • What is the probability that a user who enters the site from D will make a purchase?
    • Assuming that the state D represents a promotional campaign for a specific product (including some banner advertising on external sites), discuss how successful the campaign is. You might consider the probability that someone visiting D will make a purchase. Also consider the ratio of sales resulting from the external promotions (resulting in people going to D directly) to sales when visitors start from the homepage.
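
    One way to organize the "show your work" calculations for the questions above: the probability of eventually reaching a target page from a start page is the sum, over each outgoing link, of the link's transition probability times the probability of reaching the target from the page it leads to. The Python sketch below assumes the site graph has no cycles; the function name reach_probability and the numbers in the example are hypothetical and are not the assignment's site map.

```python
from functools import lru_cache


def reach_probability(transitions, start, targets):
    """Probability of eventually visiting any page in `targets` from `start`.

    `transitions` maps (from_page, to_page) -> probability.
    Assumes an acyclic graph; a cyclic one would recurse without end.
    """
    targets = frozenset(targets)
    successors = {}
    for (src, dst), p in transitions.items():
        successors.setdefault(src, []).append((dst, p))

    @lru_cache(maxsize=None)
    def reach(page):
        if page in targets:
            return 1.0  # already at a target page
        # Pages with no outgoing links contribute 0 (the session ends there).
        return sum(p * reach(nxt) for nxt, p in successors.get(page, []))

    return reach(start)


# Hypothetical edge labels for illustration only.
example = {
    ("A", "B"): 0.5,
    ("A", "C"): 0.3,
    ("B", "G"): 0.4,
    ("C", "G"): 0.5,
}
p_purchase = reach_probability(example, "A", {"G"})  # 0.5*0.4 + 0.3*0.5
```

    Passing several pages as targets (for example, both shopping-cart pages) gives the probability of reaching at least one of them, because the recursion stops at the first target it meets on each path.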

    Note: In the Markov chain model, the label for a link from a page X to a page Y is the ratio of the number of times X is directly followed by Y in the sessions to the total number of occurrences of page X. For example, in the above site map, the link from page C to page F should be labeled 16/60 (i.e., approx. 0.27), since C appears in 60 sessions, and in 16 of these 60, F directly follows C.
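
    The counting rule in this note can be sketched as follows in Python. It is a minimal illustration under stated assumptions: sessions are given as lists of page labels, the denominator counts every occurrence of page X (including occurrences at the end of a session), and the function name and toy sessions are hypothetical.

```python
from collections import Counter


def transition_probabilities(sessions):
    """Estimate P(X -> Y) as count(X directly followed by Y) / count(X)."""
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        page_counts.update(session)
        # Consecutive pairs (X, Y) where Y directly follows X in the session.
        pair_counts.update(zip(session, session[1:]))
    return {
        (x, y): n / page_counts[x]
        for (x, y), n in pair_counts.items()
    }


# Hypothetical toy sessions, NOT the assignment's sessions data file.
toy_sessions = [
    ["A", "B", "G"],
    ["A", "C"],
    ["A", "B"],
]
probs = transition_probabilities(toy_sessions)
```

    Because sessions that end at X still count toward X's total, the outgoing probabilities of a page can sum to less than 1; the remainder is the share of visits that ended at that page.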


  3. Note: Please read the document Data Mining with WEKA (Sections on Data Preprocessing and Association Rule Mining). It provides a detailed example of the preprocessing steps and association rule mining with WEKA based on this problem.

    The marketing department of a financial firm keeps records on customers, including demographic information and the number and types of accounts. When launching a new product, such as a "Personal Equity Plan" (PEP), a direct mail piece advertising the product is sent to existing customers, and a record is kept as to whether each customer responded and bought the product. Based on this store of prior experience, the managers decide to use data mining techniques to build customer profile models. In this particular problem we are interested only in deriving (quantitative) association rules from the data (in a future assignment we will consider the use of classification).

    The data contains the following fields:

    id             a unique identification number
    age            age of customer in years (numeric)
    sex            MALE / FEMALE
    region         inner_city / rural / suburban / town
    income         income of customer (numeric)
    married        is the customer married (YES/NO)
    children       number of children (numeric)
    car            does the customer own a car (YES/NO)
    save_acct      does the customer have a saving account (YES/NO)
    current_acct   does the customer have a current account (YES/NO)
    mortgage       does the customer have a mortgage (YES/NO)
    pep            did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

    The data is contained in the file bank-data.csv. Each record is a customer description where the "pep" field indicates whether or not that customer bought a PEP after the last mailing.

    Your goal is to perform association rule discovery on the data set using the WEKA package.

    Note: Association rule discovery requires discretization of continuous variables. This task can be performed in the data transformation step or (in some cases) by the mining program. WEKA is a full data mining suite which includes various preprocessing modules. When using WEKA, you will first apply the relevant preprocessing filters to transform the data before you perform association rule discovery.

    First perform the necessary preprocessing steps required for association rule mining. Specifically, the "id" field will need to be removed and the numerical attributes must be discretized.
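
    To make the two preprocessing steps concrete outside of WEKA, here is a small Python sketch that drops the "id" field and replaces a numeric attribute with equal-width bin labels. It only illustrates the idea and is not a substitute for WEKA's filters: the helper names, the bin labels, the choice of three bins, and the sample rows are all assumptions made for this example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a label 'bin1'..'binN' of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single bin
        else:
            # The maximum value would land in bin n_bins; clamp it down.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx + 1}")
    return labels


def preprocess(rows, numeric_fields, n_bins=3, drop=("id",)):
    """Remove the `drop` fields and discretize the `numeric_fields`."""
    binned = {
        f: equal_width_bins([float(r[f]) for r in rows], n_bins)
        for f in numeric_fields
    }
    result = []
    for i, row in enumerate(rows):
        new_row = {k: v for k, v in row.items() if k not in drop}
        for f in numeric_fields:
            new_row[f] = binned[f][i]
        result.append(new_row)
    return result


# Hypothetical sample rows, NOT taken from bank-data.csv.
sample = [
    {"id": "ID1", "age": "18", "pep": "YES"},
    {"id": "ID2", "age": "40", "pep": "NO"},
    {"id": "ID3", "age": "67", "pep": "YES"},
]
cleaned = preprocess(sample, numeric_fields=["age"], n_bins=3)
```

    After this step every attribute is nominal, which is the form association rule mining expects; the number of bins is itself a parameter worth experimenting with, since it changes which rules reach minimum support.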

    Next perform association rule discovery on the transformed data. Experiment with different parameters so that you get at least 20-30 strong rules (e.g., rules with high lift and confidence which at the same time have relatively good support). Select the top 5 most "interesting" rules and for each specify the following:

    • an explanation of the pattern and why you believe it is interesting based on the business objectives of the company;
    • any recommendations based on the discovered rule that might help the company to better understand behavior of its customers or in its marketing campaign.

    Note: The top 5 most interesting rules are most likely not the top 5 in the result set of the Apriori algorithm. They are rules that, in addition to having high support and lift, also provide some non-trivial, actionable knowledge based on the underlying business objectives.



Copyright © 2007-2008, Bamshad Mobasher, School of CTI, DePaul University.