
Assignment 1

Notes:
  • This assignment is due Thursday, October 12.
  • A list of data mining tools (including local versions) is available from the Tools and Software section.
    1. Consider the following transaction database:
         TID	Items
         -------------------------
         01	A, B, C, D, F
         02	A, B, C, D, E, G
         03	A, C, G, H, K
         04	B, C, D, E, H, K
         05	D, E, F, H, L
         06	A, B, C, K, L
         07	A, D, F, L
         08	B, I, E, K, L
         09	C, D, F, L
         10	A, B, D, E, K
         11	C, D, H, I, K
         12	C, E, F, K
         13	B, C, D, F
         14	A, B, C, D
         15	C, H, I, J, K
         16	A, D, E, F, H, L
         17	H, K, L
         18	A, D, H, K
         19	D, E, F, K, L
         20	B, C, D, E, H, L
      

      Applying the Apriori algorithm with a minimum support of 30% and a minimum confidence of 75%, find all the association rules in the data set. Give the details of your computation at each step; however, at each step list only the frequent itemsets that satisfy minimum support (i.e., itemsets that appear in at least 6 of the 20 transactions). Also specify the confidence and improvement for each rule you discover.
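
      The quantities named above can be sketched in Python as follows. This is a minimal illustration on a hypothetical toy database (not the assignment data), and it assumes that "improvement" means the ratio often called lift: the rule's confidence divided by the support of its consequent. The function names are made up for this sketch.

```python
from itertools import combinations

# Hypothetical toy transactions (NOT the assignment data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def frequent_itemsets(db, min_support):
    """Level-wise search in the style of Apriori: size-k candidates are
    built only from frequent itemsets of size k-1."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if support([i], db) >= min_support]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s, db)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in result for sub in combinations(c, k - 1))
        }
        level = [c for c in candidates if support(c, db) >= min_support]
    return result

def rule_metrics(antecedent, consequent, db):
    """Confidence and improvement (assumed here to mean lift) of a rule."""
    both = support(set(antecedent) | set(consequent), db)
    confidence = both / support(antecedent, db)
    improvement = confidence / support(consequent, db)
    return confidence, improvement

found = frequent_itemsets(transactions, min_support=0.5)
conf, imp = rule_metrics({"butter"}, {"bread"}, transactions)
```

      An improvement above 1 suggests that the antecedent makes the consequent more likely than its baseline frequency; a value near 1 suggests the rule adds little.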


    2. Consider the "weather" data set discussed in class. Applying the ID3 algorithm, complete the analysis of the example and construct the optimal decision tree. Recall that the features are "outlook", "wind", "humidity", and "temperature", and that the target class is "play", which indicates whether the given day is a good day to play golf. Give the details of your calculations at each step. Also give the decision rules derived from your decision tree.
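
      The attribute selection at each ID3 step rests on entropy and information gain, which can be sketched as below. The rows are a hypothetical four-record toy set that borrows two of the feature names; they are not the class data set, and the function names are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting
    the rows on the given attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy rows (NOT the weather data set from class).
rows = [
    {"wind": "weak", "humidity": "high", "play": "no"},
    {"wind": "strong", "humidity": "high", "play": "no"},
    {"wind": "weak", "humidity": "normal", "play": "yes"},
    {"wind": "strong", "humidity": "normal", "play": "yes"},
]

gain_humidity = information_gain(rows, "humidity", "play")
gain_wind = information_gain(rows, "wind", "play")
```

      In this toy set, splitting on "humidity" separates the classes completely while "wind" does not, so ID3 would pick "humidity" at the root; it then repeats the same calculation on each resulting subset.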


    3. The marketing department of a financial firm keeps records on customers, including demographic information and the number and type of accounts. When a new product such as a "Personal Equity Plan" (PEP) is launched, a direct mail piece advertising the product is sent to existing customers, and a record is kept of whether each customer responded and bought the product. Based on this store of prior experience, the managers decide to use data mining techniques to build customer profile models.

      The data contains the following fields:

      id            a unique identification number
      age           age of customer in years
      sex           MALE / FEMALE
      region        inner_city/rural/suburban/town
      income        income of customer
      married       Is the customer married (YES/NO)
      children      number of children
      car           Does the customer own a car (YES/NO)
      save_acct     Does the customer have a saving account (YES/NO)
      current_acct  Does the customer have a current account (YES/NO)
      mortgage      Does the customer have a mortgage (YES/NO)
      pep           Did the customer buy a PEP after the last mailing (YES/NO)

      Each record is a customer description where the "pep" field indicates whether or not that customer bought a PEP after the last mailing.

      There are two data sets available (the data sets are comma delimited, and the first row contains the field names):

      1. bank-data.txt - Preclassified Data Set for Building a Model
      2. bank-new.txt - A set of customers from which to find the "hot prospects" for the next mailing, using the profiles built from the training set.

      • First, use Excel or another spreadsheet program to examine and possibly transform the data. For example, the "id" field can be removed if necessary, or other fields can be added. Record your findings, and specify any transformations you perform on the data.

      • Perform association rule discovery on the data set using available tools such as the Weka package or Magnum Opus. Report your findings and observations, such as interesting rules or any recommendations that might help the company better understand the behavior of its customers or improve its marketing campaign. Note that this may require discretizing continuous variables into bins (it is easiest to do this in Magnum Opus, since bins can be specified directly in the data specification file).

      • Next, using available tools such as the Weka package or the See5 program, create a "C4.5" classification model based on the preclassified training data. You may wish to experiment with various options such as cross-validation, boosting, etc. Record the decision tree and the model accuracy statistics obtained from your model.

      • Finally, apply the resulting model to the new customers data set. How many "hot prospects" are targeted by your model? Note that you will need to map the resulting answers back to the original customer "id" field for the new customers (this can again be done using a spreadsheet program such as Excel). Also note that in preparing the data for the "new" cases, you may need to add an instance value of "?" for the missing target class value of each record. With See5, the program "sample.exe", which is included in the distribution, can be used to classify new instances based on the model.
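
      The discretization mentioned in the association rule step can also be done outside the mining tool. The sketch below maps a continuous value such as "age" to a named bin; the cut points and bin labels are hypothetical choices for illustration, not values prescribed by the assignment, and `make_binner` is a made-up helper name.

```python
from bisect import bisect_right

def make_binner(cut_points, labels):
    """Return a function that maps a number to a bin label.
    cut_points must be sorted; a value equal to a cut point
    falls into the bin to its right."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")

    def to_bin(value):
        return labels[bisect_right(cut_points, value)]

    return to_bin

# Hypothetical cut points and labels for the "age" field.
age_bin = make_binner([30, 50], ["young", "middle", "senior"])

binned = [age_bin(a) for a in (25, 30, 49, 67)]
```

      Whatever bins you choose, report them along with your other transformations, since the rules discovered can change noticeably with the bin boundaries.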

      Examples of using Weka and See5 for classification will be given in class, including the necessary data preparation steps.
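
      As an alternative to a spreadsheet, the preparation of the "new" cases and the mapping of answers back to customer ids can be sketched in Python. The sample rows, the shortened field list, and the prediction list below are hypothetical; in practice the predictions would come from Weka or See5, and the helper names are made up for this sketch.

```python
import csv
import io

# Hypothetical two-row sample in a comma-delimited layout with a header row
# (only a few of the fields are shown; these are NOT rows from bank-new.txt).
new_csv = "id,age,income\nID1,40,30000.0\nID2,23,15000.0\n"

def prepare_new_cases(text, target="pep", id_field="id"):
    """Split off the id column and set the unknown target value to "?".
    Returns the ids (in file order) and the cases to hand to the classifier."""
    rows = list(csv.DictReader(io.StringIO(text)))
    ids = [r[id_field] for r in rows]
    cases = []
    for r in rows:
        case = {k: v for k, v in r.items() if k != id_field}
        case[target] = "?"  # target class is unknown for new customers
        cases.append(case)
    return ids, cases

def hot_prospects(ids, predictions, positive="YES"):
    """Pair each prediction with its customer id (same order as the file)
    and keep the ids predicted to buy."""
    return [i for i, p in zip(ids, predictions) if p == positive]

ids, cases = prepare_new_cases(new_csv)
# Hypothetical classifier output, one answer per case, in file order.
prospects = hot_prospects(ids, ["NO", "YES"])
```

      The mapping relies on the predictions being in the same order as the rows of the input file, so avoid sorting or filtering the cases between the two steps.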





    Copyright © 2000, Bamshad Mobasher, School of CTI, DePaul University.