|
Assignment 2
Due Date: Friday, May 2
-
Consider the following transaction database. Each row represents a single
transaction in which the specified items have been purchased.
| Transaction ID | Items Purchased |
| 1 | A,B,C,D |
| 2 | B,C,D,E,G |
| 3 | A,C,G,H,K |
| 4 | B,C,D,E,K |
| 5 | D,E,F,H,L |
| 6 | A,B,C,D,E,L |
| 7 | A,D,E,F,L |
| 8 | B,I,K,L |
| 9 | C,D,F,L |
| 10 | A,B,D,E,K |
| 11 | C,D,H,I,K |
| 12 | B,C,E,K |
| 13 | B,C,D,F |
| 14 | A,B,C,D |
| 15 | C,H,I,J |
| 16 | A,E,F,H,L |
| 17 | H,K,L |
| 18 | A,B,D,H,K |
| 19 | D,E,K |
| 20 | B,C,D,E,H |
Applying the Apriori algorithm with a minimum support of 30% and a minimum confidence of 75%,
find all the association rules in the data set. For each step in the algorithm, give the
list of frequent itemsets that satisfy minimum support (i.e., for each
iteration i, give the set Li along with
the support values for the itemsets). Also specify the confidence and
the lift (improvement) values
for each of the rules you discover. Note: You must do this problem manually by
hand-tracing the Apriori algorithm.
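A hand trace can be cross-checked afterwards with a short script (not as a substitute for the manual work). The following is a minimal Python sketch of the level-wise Apriori procedure, assuming the usual definitions: support is the fraction of transactions containing an itemset, confidence of X -> Y is support(X ∪ Y) / support(X), and lift is confidence / support(Y). The function names `support`, `apriori`, and `rule_metrics` are illustrative choices, not part of the assignment.

```python
from itertools import combinations

# Transaction database copied from the table above (transactions 1..20, in order).
TRANSACTIONS = [set(t.split(",")) for t in [
    "A,B,C,D", "B,C,D,E,G", "A,C,G,H,K", "B,C,D,E,K", "D,E,F,H,L",
    "A,B,C,D,E,L", "A,D,E,F,L", "B,I,K,L", "C,D,F,L", "A,B,D,E,K",
    "C,D,H,I,K", "B,C,E,K", "B,C,D,F", "A,B,C,D", "C,H,I,J",
    "A,E,F,H,L", "H,K,L", "A,B,D,H,K", "D,E,K", "B,C,D,E,H",
]]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return [L1, L2, ...]; each Li maps a frozenset itemset to its support."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    levels = []
    k = 1
    while candidates:
        # Keep only the candidates that meet the minimum support.
        frequent = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                frequent[c] = s
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [c for c in joined
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
    return levels


def rule_metrics(antecedent, consequent, transactions=TRANSACTIONS):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    confidence = support(a | c, transactions) / support(a, transactions)
    lift = confidence / support(c, transactions)
    return confidence, lift


if __name__ == "__main__":
    for i, level in enumerate(apriori(TRANSACTIONS, 0.30), start=1):
        print(f"L{i}:")
        for itemset, s in sorted(level.items(), key=lambda kv: sorted(kv[0])):
            print("  ", ",".join(sorted(itemset)), f"support={s:.2f}")
```

Calling `rule_metrics` on each candidate rule derived from the frequent itemsets gives values to compare against the confidence and lift computed by hand.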
-
Suppose that we have the following site map for a hypothetical
Web site:
where pageview A represents the homepage, pageviews E and F represent insertion of
an item in the shopping cart, and G represents the purchase of the item(s) in the
cart.
We also have some preprocessed usage data recorded in the server log for this
hypothetical site and stored in a sessions data file. Each row in the data file represents the pageviews accessed by a
user during one session, in the order in which these pageviews were accessed.
Using the Markov Chain model discussed in class (see Lecture 3) and the
training session data, compute the
transition probabilities for the above graph. Show the computed transition
probabilities as labels on the edges of the graph.
After computing the transition probabilities, answer each of the following
using only the Markov model and not the original session data
(show your work):
- What is the probability that a user who starts from the homepage will place
something in his/her shopping cart?
- What is the probability that a user who starts from the homepage will
actually make a purchase?
- What is the probability that a user who enters the
site from D will make a purchase?
- Assuming that the
state D represents a promotional campaign for a specific product (including some
banner advertising on external sites), discuss how successful the campaign is.
You might consider the probability that someone visiting D will make a purchase.
Also consider the ratio of sales resulting from the
external promotions (resulting in people going to D directly) to sales when
visitors start from the homepage.
Note: In the Markov chain model, the label for a link from a page X to a
page Y is the ratio of the number of times X is directly followed by Y in the sessions to the
total number of occurrences of page X. For example, in the above site map, the link
from page C to page F should be labeled 16/60 (i.e., approx. 0.27), since C appears
in 60 sessions, and in 16 of these 60, F directly follows C.
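The labeling rule in the note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sessions below are hypothetical toy data (not the assignment's sessions file), each session is a list of page names in access order, and `reach_probability` assumes the site graph has no cycles. The function names are illustrative.

```python
from collections import Counter


def transition_probabilities(sessions):
    """Label each link X -> Y with count(Y directly follows X) / count(X)."""
    page_counts = Counter()   # total occurrences of each page X
    pair_counts = Counter()   # number of times Y directly follows X
    for session in sessions:
        page_counts.update(session)
        pair_counts.update(zip(session, session[1:]))
    return {(x, y): n / page_counts[x] for (x, y), n in pair_counts.items()}


def reach_probability(probs, start, targets):
    """Probability of eventually reaching any page in `targets` from `start`.

    Assumes the graph is acyclic; a cyclic site map would need a
    different method (e.g. solving a linear system).
    """
    if start in targets:
        return 1.0
    return sum(p * reach_probability(probs, y, targets)
               for (x, y), p in probs.items() if x == start)


# Hypothetical toy sessions, NOT the assignment's data.
toy_sessions = [
    ["A", "B", "E", "G"],
    ["A", "B"],
    ["A", "C", "F"],
    ["A", "C", "F", "G"],
]

probs = transition_probabilities(toy_sessions)
print(probs[("A", "B")])                        # label on the edge A -> B
print(reach_probability(probs, "A", {"E", "F"}))  # reaches a cart page
print(reach_probability(probs, "A", {"G"}))       # reaches the purchase page
```

Because a session that ends at X still counts toward the occurrences of X, the outgoing probabilities of a page need not sum to 1; the remainder is the probability of leaving the site from that page.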
-
Note: Please read the document Data Mining with WEKA (Sections
on Data Preprocessing and Association Rule Mining). It provides a detailed example of the
preprocessing steps and association rule mining with WEKA based on this problem.
The marketing department of a financial firm keeps records on customers, including
demographic information and the number and type of accounts. When launching a new product,
such as a "Personal Equity Plan" (PEP), a direct mail piece advertising the
product is sent to existing customers, and a record is kept as to whether each customer
responded and bought the product. Based on this store of prior experience, the managers
decide to use data mining techniques to build customer profile models. In this
particular problem we are interested only in deriving (quantitative) association rules
from the data (in a future assignment we will consider the use of classification).
The data contains the following fields:
| id | a unique identification number |
| age | age of customer in years (numeric) |
| sex | MALE / FEMALE |
| region | inner_city/rural/suburban/town |
| income | income of customer (numeric) |
| married | is the customer married (YES/NO) |
| children | number of children (numeric) |
| car | does the customer own a car (YES/NO) |
| save_acct | does the customer have a saving account (YES/NO) |
| current_acct | does the customer have a current account (YES/NO) |
| mortgage | does the customer have a mortgage (YES/NO) |
| pep | did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO) |
The data is contained in the file bank-data.csv.
Each record is a customer description where the "pep" field indicates whether or not
that customer bought a PEP after the last mailing.
Your goal is to perform Association
Rule discovery on the data set using the Weka package.
Note: Association rule discovery requires discretization of continuous
variables. This task can be performed in the data transformation step or (in
some cases) by the mining program. WEKA is a full data mining suite which includes
various preprocessing modules. When using WEKA, you will first apply the relevant
preprocessing filters to transform the data before you perform association rule discovery.
First perform the necessary preprocessing steps required for association rule
mining. Specifically, the "id" field will need to be removed and the numerical
attributes must be discretized.
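To make the two preprocessing steps concrete, here is a minimal plain-Python sketch of what they amount to: dropping the "id" field and replacing each numeric value with a bin label. It uses simple equal-width binning with three bins, which is an assumption for illustration; it is not WEKA's implementation, and the bin count and labels you choose in WEKA may differ. The function names and the sample rows are hypothetical.

```python
def equal_width_labels(values, n_bins=3):
    """Map each numeric value to a label 'bin0'..'bin{n_bins-1}' of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # Clamp the index so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        labels.append(f"bin{idx}")
    return labels


def preprocess(rows, drop=("id",), numeric=("age", "income", "children"), n_bins=3):
    """Remove the `drop` fields and discretize the `numeric` fields."""
    out = [{k: v for k, v in row.items() if k not in drop} for row in rows]
    for col in numeric:
        if out and col in out[0]:
            labels = equal_width_labels([float(r[col]) for r in out], n_bins)
            for r, label in zip(out, labels):
                r[col] = label
    return out


# Hypothetical sample rows, NOT taken from bank-data.csv.
sample = [
    {"id": "ID1", "age": "20", "sex": "MALE"},
    {"id": "ID2", "age": "40", "sex": "FEMALE"},
    {"id": "ID3", "age": "65", "sex": "MALE"},
]
print(preprocess(sample))
```

Assuming bank-data.csv has a header row with the field names listed above, the rows could be loaded with `csv.DictReader` from the standard library before calling `preprocess`.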
Next perform association rule discovery on the transformed data. Experiment with
different parameters so that you get at least 20-30 strong rules (e.g., rules
with high lift and confidence which at the same time have relatively good support). Select
the top 5 most "interesting" rules and for each specify the following:
- an explanation of the pattern and why you believe it is interesting based on
the business objectives of the company;
- any recommendations based on the discovered rule that might help the company
to better understand the behavior of its customers or to improve its marketing campaign.
Note: The top 5 most interesting rules are most likely not the top 5 in the result
set of the Apriori algorithm. They are rules that, in addition to having high support
and lift, also provide some non-trivial, actionable knowledge based on the underlying
business objectives.
|