
Assignment 3

Due Date: Saturday, May 17

  1. In this problem we will use the PEP data from Assignment 2 for the purpose of target marketing. We plan to use the historical data from past customer responses (the training data from the last assignment) to build a classification model. The model will then be applied to a new set of prospects to whom we may want to extend an offer for a PEP. Rather than running a mass marketing campaign aimed at all new prospects, we would like to target those who are likely to respond positively to our offer (according to our classification model).

    Be sure to first read the document Classification via Decision Trees in WEKA, which contains a detailed example of classification with WEKA quite similar to what you have to do in this problem.

    There are two data sets available (the data sets are comma delimited, and the first row contains the field names):

    • bank-data.csv - Preclassified training data set for building a model (this is the data from Assignment 2)
    • bank-new.csv - A set of new customers from which to find the "hot prospects" for the next mailing, using the profiles built from the training set.

    1. Using the WEKA package, create a "C4.5" classification model based on the preclassified training data. In WEKA, the C4.5 algorithm is implemented by "weka.classifiers.trees.J48". Use 10-fold cross-validation to evaluate your model's accuracy. Record the final decision tree and the model accuracy statistics obtained from your model. Be sure to indicate the parameters you use in building your classification model. You can save the statistics and results by right-clicking the last result set in the "Result list" window and selecting "Save result buffer." You should also generate a screenshot of your tree by selecting the "Visualize tree" command from the same menu. You should provide the decision tree together with the accuracy results from the cross-validation as part of your submission.

    2. Next, apply the classification model from the previous part to the new customers data set. How many "hot prospects" are targeted by your model? Note that you will need to map the resulting answers back to the original customer "id" field for the new customers (this could be done again using a spreadsheet program such as Excel). Provide your resulting predictions for the 200 new cases and other supporting documentation as part of your submission.
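The mapping step described above can also be scripted instead of done in a spreadsheet. The following is only a minimal sketch: the customer ids, the second column, and the prediction list are made-up stand-ins, and it assumes the predictions come back in the same row order as the new customer file.

```python
import csv
import io

# Hypothetical stand-in for bank-new.csv; only the "id" column matters here.
new_customers_csv = "id,age\nID12701,23\nID12702,30\nID12703,45\n"
# Hypothetical predictions, one per row, in the same order as the rows above.
predictions = ["YES", "NO", "YES"]

def attach_ids(csv_text, preds):
    """Pair each customer id with the prediction made for the same row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != len(preds):
        raise ValueError("row count and prediction count differ")
    return [(row["id"], pred) for row, pred in zip(rows, preds)]

labelled = attach_ids(new_customers_csv, predictions)
# "Hot prospects" are the customers predicted to respond with YES.
hot_prospects = [cid for cid, pred in labelled if pred == "YES"]
print(len(hot_prospects), hot_prospects)
```

The length check guards against the most likely failure in this step, which is a prediction list that no longer lines up with the customer rows.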
       

    3. Lift Charts: Suppose that we would like to use our predictive model from the previous part as a response model for a future targeted marketing campaign. To do so, we want to use the 200 new cases from part 2 as the test data. Suppose that we have the actual positive responses provided by these 200 prospects. These actual responses are given in the spreadsheet pep-actual-resp.xls. Note that the total number of positive responses is 50 (i.e., the response rate for untargeted marketing is 25%). Given this information, and the predicted PEP values from part 2, compute the Lift Chart corresponding to the response model. Note that the predicted PEP values are not actual responses, but only predictions that the prospect is likely to be interested in a PEP. To create the chart, you will need to compute and record the probability of PEP="YES" for each of the 200 prospects (this can be done by running WEKA classification from the command line, or by selecting "Output text predictions on test set" in the test options for the classifier). You can then sort the 200 prospects in descending order of this probability and plot the cumulative number of positive responses against the total number of prospects contacted. This should then be compared against the untargeted case, which has a fixed 25% response rate. Your final lift chart should look something like this. Finally, based on your lift chart, compute the lift value if only the top 70 prospects are contacted. What does this mean?
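The sorting and cumulative counting described above can be sketched in a few lines. This is an illustration only: the eight scored prospects are invented toy data (not the assignment's 200 cases), and the function names `cumulative_gains` and `lift_at` are hypothetical. Lift at k is taken here as the response rate among the top k prospects divided by the overall (untargeted) response rate.

```python
def cumulative_gains(scored):
    """scored: list of (probability_of_yes, actual_response) pairs, actual_response is 1 or 0.

    Returns a list whose entry k-1 is the number of positive responses
    among the k prospects with the highest predicted probability.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    gains, hits = [], 0
    for _, actual in ranked:
        hits += actual
        gains.append(hits)
    return gains

def lift_at(scored, k):
    """Response rate among the top k prospects divided by the overall response rate."""
    gains = cumulative_gains(scored)
    overall_rate = gains[-1] / len(scored)
    return (gains[k - 1] / k) / overall_rate

# Toy data: 8 prospects, 2 of whom actually responded (a 25% base rate).
toy = [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0),
       (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0)]

print(cumulative_gains(toy))  # [1, 1, 2, 2, 2, 2, 2, 2]
print(lift_at(toy, 4))        # 2.0
```

In the toy data, contacting the top 4 prospects reaches both responders, so the targeted response rate (50%) is twice the untargeted rate (25%).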
       


  2. Suppose that we would like to build a simple Bayesian filter to filter out spam email. In order to train our Naïve Bayesian model, we have identified 10 documents (emails) as our training data and manually classified them as spam or not spam. Also suppose that the keywords whose occurrences in emails contribute to the classification are terms t1 through t5. The document-term matrix for the training data is given below. A one in the table cell [Dj,ti] indicates the occurrence of term ti in email Dj.

    Using the Naïve Bayesian Classification method and this training data, show how two new emails would be classified: E, containing terms t1, t4, and t5; and F, containing terms t2, t3, and t4.

    Begin by computing the probabilities Pr(ti|yes) and Pr(ti|no) for i = 1, 2, ..., 5. Give these probabilities in a table. Also compute the marginal probabilities Pr(yes) and Pr(no). Next use the Naïve Bayes method to compute Pr(yes|E) and Pr(no|E). Finally, do a similar computation for F. Give the details of your computation.
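The order of computation above (conditional probabilities, priors, then posteriors) can be sketched as follows. The training set here is a made-up six-email example with three terms, not the assignment's table. The sketch also assumes that only the terms present in the new email contribute to the product, and that the two scores are normalized so they sum to 1; a Bernoulli-style variant would also multiply in (1 - Pr(ti|class)) for absent terms, so check which form was used in class.

```python
# Hypothetical training data: (set of terms present in the email, label).
training = [
    ({"t1", "t2"}, "yes"),
    ({"t1"}, "yes"),
    ({"t1", "t3"}, "yes"),
    ({"t2", "t3"}, "no"),
    ({"t3"}, "no"),
    ({"t1", "t3"}, "no"),
]
terms = ["t1", "t2", "t3"]

def train(docs):
    """Estimate Pr(class) and Pr(term | class) as simple relative frequencies."""
    priors, cond = {}, {}
    for c in sorted({label for _, label in docs}):
        in_class = [present for present, label in docs if label == c]
        priors[c] = len(in_class) / len(docs)
        cond[c] = {t: sum(t in present for present in in_class) / len(in_class)
                   for t in terms}
    return priors, cond

def posteriors(email_terms, priors, cond):
    """Score each class as prior * product of Pr(term | class), then normalize."""
    scores = {}
    for c in priors:
        score = priors[c]
        for t in email_terms:
            score *= cond[c][t]
        scores[c] = score
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

priors, cond = train(training)
post = posteriors({"t1", "t2"}, priors, cond)
print(post)  # "yes" is about 0.75 and "no" about 0.25 for this toy data
```

For the toy email containing t1 and t2, the unnormalized scores are 1/2 * 1 * 1/3 = 1/6 for "yes" and 1/2 * 1/3 * 1/3 = 1/18 for "no", which normalize to 0.75 and 0.25.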


  3. Consider the following document-term matrix, where each entry represents the raw frequency of a term Ti in document Dj. We would like to apply clustering to automatically group these documents into 3 classes (clusters). Note: You must not use WEKA or other clustering tools for this problem. However, you are encouraged to use a spreadsheet program such as Microsoft Excel to facilitate computation in intermediate steps.


    Download table as a Microsoft Excel Worksheet

    Suppose we initially assign D1 and D2 to Class1, D4 and D6 to Class2, and D5 to Class3. Using the K-means clustering method discussed in class, compute the final contents of the 3 classes. Use the Cosine similarity of two vectors as your similarity measure. Show the details of your computation, including intermediate steps in each iteration of the algorithm.
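The assignment asks for a hand computation, so the following is only a sketch of the iteration structure, useful for seeing what "intermediate steps" means. The five document vectors and the seed clusters are invented, the function names are hypothetical, and two choices are assumptions rather than the method given in class: every document (including the seeds) is reassigned in each pass, and ties go to the lowest-numbered cluster.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the norms (0.0 if either norm is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans_cosine(docs, initial_clusters, max_iter=20):
    """docs: name -> vector. initial_clusters: list of lists of seed document names."""
    clusters = [list(c) for c in initial_clusters]
    for _ in range(max_iter):
        # Step 1: recompute one centroid per cluster.
        centroids = [centroid([docs[n] for n in c]) for c in clusters]
        # Step 2: assign each document to its most similar centroid.
        new_clusters = [[] for _ in clusters]
        for name in sorted(docs):
            sims = [cosine(docs[name], c) for c in centroids]
            new_clusters[sims.index(max(sims))].append(name)
        # Step 3: stop when the assignment no longer changes.
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters

# Hypothetical toy documents (not the assignment's matrix).
docs = {"A": [2, 0, 1], "B": [3, 1, 0], "C": [0, 2, 3],
        "D": [1, 3, 4], "E": [4, 0, 0]}
result = kmeans_cosine(docs, [["A"], ["C"]])
print(result)  # [['A', 'B', 'E'], ['C', 'D']]
```

Because cosine is a similarity rather than a distance, each document goes to the centroid with the largest value, not the smallest.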

    Note: Recall that the Cosine similarity of two vectors is their dot product divided by the product of their norms. For example, consider the two vectors X and Y:

      X = <3, 0, 1, 2, 0, 3>
      Y = <2, 0, 0, 3, 8, 4>
    
    The dot product is given by the sum of the coordinate-wise products:
      dot-product(X, Y) = 3*2 + 0*0 + 1*0 + 2*3 + 0*8 + 3*4
                          = 6 + 0 + 0 + 6 + 0 + 12
                          = 24.
    
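    The norm of each vector is the square root of the sum of the squares of its dimension values. So, the norms of X and Y are:

      norm(X) = sqrt(3^2 + 0^2 + 1^2 + 2^2 + 0^2 + 3^2) = sqrt(23) ≈ 4.80
      norm(Y) = sqrt(2^2 + 0^2 + 0^2 + 3^2 + 8^2 + 4^2) = sqrt(93) ≈ 9.64

    and the Cosine similarity of X and Y is given by:

      cosine(X, Y) = dot-product(X, Y) / (norm(X) * norm(Y))
                   = 24 / (4.80 * 9.64)
                   ≈ 0.52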

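The worked example for X and Y can be checked numerically with a short script. This is just a verification sketch using the two vectors given in the note; the variable names are arbitrary.

```python
import math

X = [3, 0, 1, 2, 0, 3]
Y = [2, 0, 0, 3, 8, 4]

# Dot product: sum of coordinate-wise products.
dot = sum(x * y for x, y in zip(X, Y))
# Norm: square root of the sum of squared coordinates.
norm_x = math.sqrt(sum(x * x for x in X))
norm_y = math.sqrt(sum(y * y for y in Y))
# Cosine similarity: dot product divided by the product of the norms.
cos_xy = dot / (norm_x * norm_y)

print(dot, round(norm_x, 3), round(norm_y, 3), round(cos_xy, 3))
# prints: 24 4.796 9.644 0.519
```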

  4. Consider the following distance matrix giving the distance scores among items I1 through I7.

    1. Compute the (binary) distance matrix using a threshold of 8, and then give its graph representation (items that are considered similar are those with a distance of 8 or less).

    2. Show the clusters using the clique technique (explain your answer).

    3. Show the clusters using the single-link technique (explain your answer).
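The three parts above can be illustrated on a small example. The 5-item distance matrix below is invented (it is not the I1 through I7 matrix of this problem), and the function names are hypothetical. The sketch assumes the usual graph-based reading: the clique technique reports the maximal cliques of the threshold graph (so clusters may overlap), and the single-link technique reports its connected components.

```python
from itertools import combinations

def threshold_graph(dist, names, threshold):
    """Connect two items when their distance is at most the threshold."""
    adj = {n: set() for n in names}
    for i, j in combinations(range(len(names)), 2):
        if dist[i][j] <= threshold:
            adj[names[i]].add(names[j])
            adj[names[j]].add(names[i])
    return adj

def single_link_clusters(adj):
    """Connected components of the graph, found by depth-first search."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters

def maximal_cliques(adj):
    """Maximal cliques via the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return sorted(cliques)

names = ["A", "B", "C", "D", "E"]
dist = [  # hypothetical symmetric distances
    [0, 3, 6, 4, 12],
    [3, 0, 5, 10, 11],
    [6, 5, 0, 7, 15],
    [4, 10, 7, 0, 9],
    [12, 11, 15, 9, 0],
]
adj = threshold_graph(dist, names, 8)
print(maximal_cliques(adj))       # [['A', 'B', 'C'], ['A', 'C', 'D'], ['E']]
print(single_link_clusters(adj))  # [['A', 'B', 'C', 'D'], ['E']]
```

In the toy graph, B and D are not directly connected (distance 10), so they never share a clique, yet single-link places them in the same cluster because each is linked to A and C.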





Copyright © 2007-2008, Bamshad Mobasher, School of CTI, DePaul University.