Assignment 1
- Notes:
- This assignment is due Thursday, October 12.
- A list of data mining tools (including local versions) is available from the Tools and Software section.
-
Consider the following transaction database:
TID Items
-------------------------
01 A, B, C, D, F
02 A, B, C, D, E, G
03 A, C, G, H, K
04 B, C, D, E, H, K
05 D, E, F, H, L
06 A, B, C, K, L
07 A, D, F, L
08 B, I, E, K, L
09 C, D, F, L
10 A, B, D, E, K
11 C, D, H, I, K
12 C, E, F, K
13 B, C, D, F
14 A, B, C, D
15 C, H, I, J, K
16 A, D, E, F, H, L
17 H, K, L
18 A, D, H, K
19 D, E, F, K, L
20 B, C, D, E, H, L
Apply the Apriori algorithm with a minimum support of 30% and a minimum confidence of 75% to
find all the association rules in the data set. Give details of your computation at each step;
however, at each step give only the frequent itemsets that satisfy minimum support (i.e.,
itemsets that appear in at least 6 transactions). Also specify the confidence and improvement
for each of the rules you discover.
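The level-wise search described above can be sketched in Python as follows. This is an illustrative sketch, not the required hand computation: the function names are hypothetical, and it assumes that "improvement" means the ratio usually called lift, i.e. confidence(X -> Y) divided by support(Y).

```python
from itertools import combinations

# The 20 transactions from the table above (items are single letters).
TRANSACTIONS = [frozenset(t) for t in (
    "ABCDF", "ABCDEG", "ACGHK", "BCDEHK", "DEFHL",
    "ABCKL", "ADFL", "BIEKL", "CDFL", "ABDEK",
    "CDHIK", "CEFK", "BCDF", "ABCD", "CHIJK",
    "ADEFHL", "HKL", "ADHK", "DEFKL", "BCDEHL",
)]
MIN_COUNT = 6    # 30% of 20 transactions
MIN_CONF = 0.75  # minimum confidence


def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def frequent_itemsets():
    """Return a list of levels; levels[k-1] holds the frequent k-itemsets."""
    items = sorted({i for t in TRANSACTIONS for i in t})
    level = [frozenset(i) for i in items if count(frozenset(i)) >= MIN_COUNT]
    levels = []
    while level:
        levels.append(level)
        k = len(level[0]) + 1
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = sorted((c for c in candidates if count(c) >= MIN_COUNT),
                       key=sorted)
    return levels


def rules(levels):
    """Return (lhs, rhs, confidence, lift) for rules meeting MIN_CONF."""
    n = len(TRANSACTIONS)
    found = []
    for level in levels[1:]:          # rules need at least two items
        for itemset in level:
            for r in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(sorted(itemset), r)):
                    rhs = itemset - lhs
                    confidence = count(itemset) / count(lhs)
                    lift = confidence / (count(rhs) / n)  # assumed "improvement"
                    if confidence >= MIN_CONF:
                        found.append((sorted(lhs), sorted(rhs),
                                      confidence, lift))
    return found


LEVELS = frequent_itemsets()
RULES = rules(LEVELS)
```

Printing `LEVELS` and `RULES` gives a way to check a hand computation; the written answer should still show each step.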
-
Consider the "weather" data set discussed in class.
Applying the ID3 algorithm, complete the analysis of the example and construct the optimal
decision tree (recall that the features are "outlook", "wind", "humidity", and
"temperature", and that the target class is "play", indicating whether the given day is a good day
to play golf). Give the details of your calculations in each step. Also, give the decision
rules derived from your decision tree.
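The quantity ID3 evaluates at each node is the information gain of every remaining feature. A minimal sketch of that calculation is given below. The four rows are hypothetical and are not the class data set; the function names are likewise only illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy rows, NOT the weather data discussed in class.
TOY_ROWS = [
    {"wind": "weak", "humidity": "high", "play": "yes"},
    {"wind": "weak", "humidity": "normal", "play": "yes"},
    {"wind": "strong", "humidity": "high", "play": "no"},
    {"wind": "strong", "humidity": "normal", "play": "no"},
]

# ID3 splits on the attribute with the largest gain, then recurses on
# each resulting subset with the remaining attributes.
BEST = max(["wind", "humidity"],
           key=lambda a: information_gain(TOY_ROWS, a, "play"))
```

In the toy rows, "wind" separates the classes perfectly and "humidity" not at all, so `BEST` is "wind".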
-
The marketing department of a financial firm keeps records on customers, including
demographic information and the number and type of accounts. When launching a new product,
such as a "Personal Equity Plan" (PEP), a direct mail piece advertising the product is
sent to existing customers, and a record is kept as to whether each customer responded and
bought the product. Based on this store of prior experience, the managers decide to use
data mining techniques to build customer profile models.
The data contains the following fields:
| id | a unique identification number |
| age | age of customer in years |
| sex | MALE / FEMALE |
| region | inner_city/rural/suburban/town |
| income | income of customer |
| married | Is the customer married (YES/NO) |
| children | number of children |
| car | Does the customer own a car (YES/NO) |
| save_acct | Does the customer have a saving account (YES/NO) |
| current_acct | Does the customer have a current account (YES/NO) |
| mortgage | Does the customer have a mortgage (YES/NO) |
| pep | Did the customer buy a PEP after the last mailing (YES/NO) |
Each record is a customer description where the "pep" field indicates whether or not
that customer bought a PEP after the last mailing.
There are two data sets available (the data sets are comma delimited, and the first row contains the
field names):
- bank-data.txt - Preclassified Data Set for Building a Model
- bank-new.txt - A set of customers from which to find the "hot
prospects" for the next mailing, using the profiles built from the training set.
- First, use Excel or another spreadsheet program to examine and possibly transform the data. For example,
the "id" field can be removed, if necessary, or other fields can be added. Record your findings,
and specify any transformations you perform on the data.
- Perform Association Rule discovery on the data set using available tools such as the
Weka package or Magnum Opus. Report your findings and observations, such as interesting
rules, or any recommendations that might help the company to better understand the behavior
of its customers or improve its marketing campaign. Note that this may require discretizing
continuous variables into bins (it is easiest to do this in Magnum Opus, since bins can be
specified directly in the data specification file).
- Next, using available tools such as the Weka package or the See5 program,
create a "C4.5" classification model based on the preclassified training data.
You may wish to experiment with various options such as cross validation, boosting, etc.
Record the decision tree and model accuracy statistics obtained from your model.
- Finally, apply the resulting model to the new customers data set. How many "hot
prospects" are targeted by your model? Note that you will need to map the resulting
answers back to the original customer "id" field for the new customers (this could be
done again using a spreadsheet program such as Excel). Also, note that in preparing the data
for the "new" cases, you may need to add an instance value of "?" for the missing target
class value of each record. In the case of the See5 program, the program "sample.exe", which is
included in the distribution, can be used to classify new instances based on the model.
Examples of using Weka and See5 for classification will be given in class, including the
necessary data preparation steps.
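For those who prefer a script to a spreadsheet, the data preparation steps mentioned above (discretizing numeric fields, setting aside the "id" field, adding "?" as the missing class value, and mapping answers back to customers) might look roughly like the sketch below. The sample rows, the equal-width binning scheme, the bin labels, and all function names are assumptions made for illustration; the real input would be read from bank-new.txt.

```python
import csv
import io

# Hypothetical rows with a subset of the fields; not taken from bank-new.txt.
SAMPLE = """id,age,income,married
ID1,25,15000.0,NO
ID2,47,32000.5,YES
ID3,66,58000.0,YES
"""


def equal_width_bin(value, low, high, n_bins):
    """Index (0 .. n_bins-1) of the equal-width bin that contains value."""
    if high == low:
        return 0
    index = int((value - low) / ((high - low) / n_bins))
    return min(max(index, 0), n_bins - 1)


def prepare(text, numeric_fields, n_bins=3):
    """Discretize numeric fields, set ids aside, add '?' as the class."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for field in numeric_fields:
        values = [float(r[field]) for r in rows]
        low, high = min(values), max(values)
        for row, value in zip(rows, values):
            row[field] = f"bin{equal_width_bin(value, low, high, n_bins)}"
    ids = [row.pop("id") for row in rows]   # kept for mapping back later
    for row in rows:
        row["pep"] = "?"                    # unknown target class
    return ids, rows


def hot_prospects(ids, predictions):
    """Ids of customers whose predicted class is YES, in input order."""
    return [cid for cid, pred in zip(ids, predictions) if pred == "YES"]


IDS, ROWS = prepare(SAMPLE, ["age", "income"])
```

If bins are computed this way, the boundaries used for the new customers should be the ones derived from the training data, so that both files share the same categories.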