Assignment 3
Due Date: Saturday, May 17
-
In this problem we will use the PEP data from Assignment 2 for the purpose of
target marketing. In this case, we plan on using the historical data from past
customer responses (the training data from last assignment) in order to build a
classification model. The model will then be applied to a new set of prospects to
whom we may want to extend an offer for a PEP. Rather than doing a mass marketing campaign to
all new prospects, we would like to target those that are likely to respond positively
to our offer (according to our classification model).
Be sure to first read the document Classification via Decision Trees in WEKA, which contains a
detailed example of classification with WEKA quite similar to what you have
to do in this problem.
There are two data sets available (the data sets are comma delimited, and the first
row contains the field names):
- bank-data.csv - Preclassified training data set for building a model
(this is the data from Assignment 2)
- bank-new.csv - A set of new customers from which to find the "hot prospects"
for the next mailing, using the profiles built from the training set.
- Using the WEKA package, create a "C4.5" classification model based on the
preclassified training data. In WEKA, the C4.5 algorithm is implemented by
"weka.classifiers.trees.J48". Use 10-fold cross-validation to evaluate your model
accuracy. Record the final decision tree and model accuracy statistics obtained from your
model. Be sure to indicate the parameters you use in building your classification model.
You can save the statistics and results by right-clicking the last result set in the
"Result list" window and selecting "Save result buffer." You should also capture a
screenshot of your tree by selecting the "Visualize tree" command from the same menu.
You should provide the decision tree together with the accuracy results from the
cross-validation as part of your submission.
- Next, apply the classification model from the previous part to the new customers
data set. How many "hot prospects" are targeted by your model? Note that
you will need to map the resulting answers back to the original customer "id" field
for the new customers (this could be done again using a spreadsheet program such as
Excel). Provide your resulting predictions for the 200 new cases and other supporting
documentation as part of your submission.
- Lift Charts: Suppose that we would like to use our predictive model
from the previous part as a response model for a future targeted marketing
campaign. To do so, we want to use the 200 new cases in part (c) as the test
data. Suppose that we have the actual positive responses provided by these 200
prospects. These actual responses are given in the spreadsheet
pep-actual-resp.xls. Note that the total
number of positive responses is 50 (i.e., the response rate for the untargeted
marketing is 25%). Given this information, and the predicted PEP values from
part (c), compute the Lift Chart corresponding to the response model. Note that
the predicted PEP values are not actual responses, but only a prediction that
the prospect is likely to be interested in PEP. To create the chart, you will
need to compute and record the probability of PEP="YES" for each of the 200
prospects (this can be done by running the WEKA classifier from the command line, or by
selecting the "Output text predictions on test set" option in the test
options for the classifier). You can then sort the 200 prospects according to
this probability and compute the cumulative positive responses against the total
number of prospects contacted. This should then be compared against the
untargeted case which has a fixed 25% response rate. Your final lift chart
should look something like this. Finally,
based on your lift chart, compute the lift value if only the top 70 prospects
are contacted. What does this mean?
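The sort-and-accumulate procedure described for the lift chart can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cumulative_responses` and `lift_at` are hypothetical, the eight probabilities and responses are made-up toy values (not the assignment data), and lift at n is taken here as the response rate among the top n prospects divided by the overall response rate, which is one common definition.

```python
def cumulative_responses(probabilities, actual):
    """Sort prospects by descending Pr(PEP=YES) and return the running
    count of actual positive responses (1 = responded, 0 = did not)."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    running, curve = 0, []
    for i in order:
        running += actual[i]
        curve.append(running)
    return curve


def lift_at(curve, n):
    """Response rate among the top n prospects divided by the overall rate."""
    overall_rate = curve[-1] / len(curve)
    return (curve[n - 1] / n) / overall_rate


# Toy values, made up for illustration only.
probs = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.3, 0.6]
actual = [1, 0, 1, 0, 0, 0, 1, 0]

curve = cumulative_responses(probs, actual)
baseline = [(k + 1) * curve[-1] / len(curve) for k in range(len(curve))]

print(curve)              # cumulative positives for the targeted ordering
print(baseline)           # expected positives when contacting at random
print(lift_at(curve, 2))  # lift if only the top 2 prospects are contacted
```

Plotting `curve` and `baseline` against the number of prospects contacted gives the two lines of the chart. A decision tree often assigns the same probability to many prospects, and ties here are broken by input order, so the curve can shift slightly depending on how ties are handled.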
-
Suppose that we would like to build a simple Bayesian filter to filter out spam
email. In order to train our Naïve Bayesian model, we have identified 10
documents (emails) as our training data and manually classified them as spam or
not spam. Also suppose that the keywords whose occurrences in emails contribute
to the classification are terms t1 through t5. The document-term
matrix for the training data is given below. A one in the table cell [Dj,ti]
indicates the occurrence of term ti in email Dj.
Using the Naïve Bayesian Classification method and this training data,
show how two new emails E containing terms t1, t4,
and t5; and F containing terms t2, t3, and
t4 would be classified.
Begin by computing the probabilities Pr(ti|yes) and
Pr(ti|no) for i = 1, 2, ..., 5. Give these
probabilities in a table. Also compute the marginal probabilities Pr(yes)
and Pr(no). Next use the Naïve Bayes method to compute Pr(yes|E)
and Pr(no|E). Finally, do a similar computation for F.
Give the details of your computation.
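The order of computation described above (conditional probabilities, then priors, then posteriors) might look like the following Python sketch. Everything in it is an assumption for illustration: the six training rows use three made-up terms rather than the assignment's matrix, the names `train` and `posterior` are hypothetical, probabilities are plain relative frequencies with no smoothing, and each term is treated as a binary attribute, so an absent term contributes a factor of 1 - Pr(ti|class). If the convention used in class multiplies only over the terms that are present, that factor should be dropped.

```python
# Toy training set, made up for illustration: (term-occurrence vector, label).
TRAINING = [
    ([1, 1, 0], "yes"),
    ([1, 0, 1], "yes"),
    ([0, 1, 0], "yes"),
    ([0, 0, 1], "no"),
    ([1, 0, 1], "no"),
    ([0, 1, 0], "no"),
]


def train(rows):
    """Estimate Pr(label) and Pr(term present | label) by relative frequency."""
    labels = sorted({label for _, label in rows})
    priors, cond = {}, {}
    for label in labels:
        members = [vec for vec, lab in rows if lab == label]
        priors[label] = len(members) / len(rows)
        cond[label] = [sum(col) / len(members) for col in zip(*members)]
    return priors, cond


def posterior(vec, priors, cond):
    """Normalized Pr(label | vec) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for present, p in zip(vec, cond[label]):
            score *= p if present else (1 - p)
        scores[label] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class has zero likelihood for this input")
    return {label: s / total for label, s in scores.items()}


priors, cond = train(TRAINING)
post = posterior([1, 1, 0], priors, cond)  # email containing t1 and t2 only
print(priors)
print(cond)
print(post)
```

Dividing by the sum of the class scores is what turns the unnormalized products Pr(E|class) * Pr(class) into posteriors that add up to one.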
-
Consider the following document-term matrix, where each entry
represents the raw frequency of a term Ti in document
Dj. We would like to apply clustering to automatically
group these documents into 3 classes (clusters). Note: You must not
use WEKA or other clustering tools for this problem. However, you are
encouraged to use a spreadsheet program such as Microsoft Excel to
facilitate computation in intermediate steps.

Download table as a Microsoft Excel Worksheet
Suppose we initially assign D1 and D2 to Class1,
D4 and D6 to Class2, and D5 to Class3.
Using the K-means clustering method discussed in class, compute the final contents
of the 3 classes. Use the Cosine similarity of two vectors as your similarity measure.
Show the details of your computation, including intermediate steps in each iteration of the
algorithm.
Note: Recall that the Cosine similarity of two vectors is their dot product divided
by the product of their norms. For example, consider the two vectors X and Y:
X = <3, 0, 1, 2, 0, 3>
Y = <2, 0, 0, 3, 8, 4>
The dot product is given by the sum of the coordinate-wise products:
dot-product(X, Y) = 3*2 + 0*0 + 1*0 + 2*3 + 0*8 + 3*4
= 6 + 0 + 0 + 6 + 0 + 12
= 24.
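The norm of each vector is the square root of the sum of the squares of its dimension
values. So, the norms of X and Y are:
norm(X) = sqrt(3^2 + 0^2 + 1^2 + 2^2 + 0^2 + 3^2) = sqrt(23) ≈ 4.80
norm(Y) = sqrt(2^2 + 0^2 + 0^2 + 3^2 + 8^2 + 4^2) = sqrt(93) ≈ 9.64
and the Cosine similarity of X and Y is given by:
cosine(X, Y) = dot-product(X, Y) / (norm(X) * norm(Y)) = 24 / (4.80 * 9.64) ≈ 0.52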
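A short Python script can be used to check hand computations for both the X and Y example and the clustering iterations. The sketch below is illustrative only and is not a substitute for showing the work by hand: the function names and the six toy documents A through F are made up, centroids are taken as the coordinate-wise mean of the member vectors, each document goes to the centroid with the highest Cosine similarity, and the loop stops when assignments no longer change. The course's version of K-means may differ in details such as tie-breaking.

```python
import math


def cosine(x, y):
    """Dot product divided by the product of the norms (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans_cosine(docs, seed_groups, max_iter=100):
    """docs maps a name to its vector; seed_groups lists the documents that
    define each initial centroid. Returns a mapping name -> cluster index."""
    centroids = [centroid([docs[n] for n in group]) for group in seed_groups]
    assignment = None
    for _ in range(max_iter):
        new_assignment = {
            name: max(range(len(centroids)),
                      key=lambda k: cosine(vec, centroids[k]))
            for name, vec in docs.items()
        }
        if new_assignment == assignment:
            break  # no document changed cluster, so the algorithm has converged
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [docs[n] for n, c in assignment.items() if c == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = centroid(members)
    return assignment


# The example vectors from the note above.
X = [3, 0, 1, 2, 0, 3]
Y = [2, 0, 0, 3, 8, 4]
print(round(cosine(X, Y), 3))

# Toy term-frequency vectors, made up for illustration only.
DOCS = {
    "A": [5, 0, 0], "B": [4, 1, 0],
    "C": [0, 5, 1], "D": [0, 4, 0],
    "E": [0, 0, 3], "F": [1, 0, 4],
}
clusters = kmeans_cosine(DOCS, seed_groups=[["A"], ["C"], ["E"]])
print(clusters)
```

Because Cosine is a similarity rather than a distance, each document is assigned to the centroid with the largest value, not the smallest.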
-
Consider the following distance matrix giving the distance scores among items I1 through I7.
- Compute the (binary) distance matrix using a threshold of 8, and then give its graph
representation (items that are considered similar are those with a distance of 8 or less).
- Show the clusters using the clique technique (explain your answer).
- Show the clusters using the single-link technique (explain your answer).
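The three steps above can be illustrated with a small Python sketch. It rests on several assumptions: the five items and their distances are made up (the assignment's matrix for I1 through I7 is not reproduced), the function names are hypothetical, cliques are found with the basic Bron-Kerbosch procedure without pivoting, and single-link clusters are taken to be the connected components of the thresholded graph. Only the threshold of 8 comes from the problem statement.

```python
from itertools import combinations

ITEMS = ["A", "B", "C", "D", "E"]

# Toy symmetric distances, made up for illustration only.
DIST = {frozenset(pair): d for pair, d in {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 10, ("A", "E"): 15,
    ("B", "C"): 5, ("B", "D"): 9, ("B", "E"): 14,
    ("C", "D"): 6, ("C", "E"): 11, ("D", "E"): 12,
}.items()}


def threshold_graph(items, dist, threshold):
    """Adjacency sets: two items are linked when their distance is <= threshold."""
    adj = {i: set() for i in items}
    for a, b in combinations(items, 2):
        if dist[frozenset((a, b))] <= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_components(adj):
    """Single-link clusters: items joined by any chain of links."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def maximal_cliques(adj):
    """Clique clusters: maximal sets in which every pair is linked."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


adj = threshold_graph(ITEMS, DIST, threshold=8)
cliques = maximal_cliques(adj)
components = connected_components(adj)
print(sorted(sorted(c) for c in cliques))
print(sorted(sorted(c) for c in components))
```

In this toy case item C belongs to two cliques while single-link puts A, B, C, and D in one cluster, which shows how the clique technique can produce overlapping clusters and single-link cannot.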