
Intelligent Information Retrieval
DS 575 / IS 575

Assignment 3
Due: Thursday, February 16, 2006

  1. Using bigrams for stemming, which of the following terms is more likely to be considered "equivalent" to the term "informational"? Justify your answer by using Dice's coefficient to measure the similarity of each term to "informational".
    • informal
    • formalization
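
As a starting point, here is a minimal Python sketch of bigram-based Dice similarity. It assumes the common convention of counting only unique bigrams, so that the coefficient is 2C / (A + B), where A and B are the numbers of unique bigrams in each term and C is the number they share. The function names are illustrative, and the example terms are deliberately not the ones asked about above.

```python
def bigrams(term):
    """Return the set of unique adjacent-letter pairs in a term."""
    term = term.lower()
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice_bigram_similarity(a, b):
    """Dice's coefficient 2C / (A + B) over unique bigrams (assumed convention)."""
    bigrams_a, bigrams_b = bigrams(a), bigrams(b)
    if not bigrams_a and not bigrams_b:
        return 0.0  # two terms shorter than 2 letters share nothing measurable
    common = len(bigrams_a & bigrams_b)
    return 2 * common / (len(bigrams_a) + len(bigrams_b))


# Example with terms that are not part of the assignment:
# 7 and 8 unique bigrams, 6 shared, so 2 * 6 / 15.
print(round(dice_bigram_similarity("statistics", "statistical"), 2))  # 0.8
```

If your course notes count repeated bigrams separately, replace the sets with multisets (for example `collections.Counter`) and adjust the counts accordingly.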


  2. Consider the following document-term table with 10 documents and 8 terms (A through H) containing raw term frequencies. We also have a specified query, Q, with the indicated raw term weights. Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Note: You should do this problem using a spreadsheet program such as Microsoft Excel. Alternatively, you can write a program to perform the computations. Include your worksheets or code in the assignment submission.
           A     B     C     D     E     F     G     H
         -----------------------------------------------
    DOC1   0     3     4     0     0     2     4     0
    DOC2   1     5     0     0     12    0     1     3
    DOC3   1     0     4     3     9     0     0     0
    DOC4   0     7     0     3     0     0     3     3
    DOC5   0     4     0     0     0     5     1     0
    DOC6   1     2     2     0     3     1     0     1
    DOC7   0     5     3     4     0     0     4     2
    DOC8   0     3     0     0     0     4     2     0
    DOC9   4     8     9     0     10    8     0     9
    DOC10  0     5     0     0     0     4     1     2
          ----------------------------------------------
    Q      2     3     1     0     2     0     1     0
    
    1. Compute the ranking score for each document based on each of the following query-document similarity measures:
      • simple matching (dot product)
      • cosine similarity
      • Dice's coefficient
      • Jaccard's coefficient
      • overlap coefficient
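
If you choose to write a program instead of using a spreadsheet, the five measures might be sketched as follows. This is only a sketch under stated assumptions: it uses the weighted-vector forms in which the Dice, Jaccard, and overlap denominators are built from sums of squared weights. Some texts use sums of the raw weights instead, so check the formulas against your course notes. The vectors `q` and `d` below are made-up toy data, not rows of the table above.

```python
import math


def dot(q, d):
    """Simple matching: sum of products of corresponding weights."""
    return sum(qi * di for qi, di in zip(q, d))


def similarity_scores(q, d):
    """Return the five scores for query vector q and document vector d."""
    qd, qq, dd = dot(q, d), dot(q, q), dot(d, d)
    if qq == 0 or dd == 0:
        # An all-zero vector has no defined normalized similarity; report 0.
        return {name: 0.0 for name in ("dot", "cosine", "dice", "jaccard", "overlap")}
    return {
        "dot": qd,
        "cosine": qd / math.sqrt(qq * dd),
        "dice": 2 * qd / (qq + dd),          # assumed squared-weight form
        "jaccard": qd / (qq + dd - qd),      # assumed squared-weight form
        "overlap": qd / min(qq, dd),         # assumed squared-weight form
    }


# Toy vectors (hypothetical, three terms only).
q = (1, 2, 0)
d = (2, 1, 1)
for name, value in similarity_scores(q, d).items():
    print(f"{name:8s} {value:.4f}")
```

To rank the documents, compute the scores for each row of the table against Q and sort by the measure of interest.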

    2. Construct a table similar to the one above, but instead of raw term frequencies compute the (non-normalized) tf.idf weights for the terms. Then compute the ranking scores using cosine similarity only. Explain any significant differences between the ranking you obtained here and the cosine ranking from the previous part.
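
The tf.idf step could be sketched as below. The sketch assumes the weight of a term in a document is tf × log2(N / df), where N is the number of documents and df is the number of documents containing the term. The base of the logarithm is an assumption here; use whichever base your course notes specify. The function name and the small matrix are hypothetical.

```python
import math


def tfidf_table(tf_rows):
    """Convert rows of raw term frequencies into non-normalized tf.idf weights."""
    n_docs = len(tf_rows)
    n_terms = len(tf_rows[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in tf_rows if row[j] > 0) for j in range(n_terms)]
    # Assumed idf form: log2(N / df); a term that occurs nowhere gets 0.
    idf = [math.log2(n_docs / df_j) if df_j else 0.0 for df_j in df]
    return [[tf * idf_j for tf, idf_j in zip(row, idf)] for row in tf_rows]


# Hypothetical 4-document, 3-term matrix (not the assignment data).
toy = [
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 0],
    [0, 2, 4],
]
for row in tfidf_table(toy):
    print(row)
```

Note that a term occurring in every document receives an idf of 0 under this formula, so it stops contributing to any similarity score. A document containing only such terms becomes an all-zero vector, which the cosine computation has to handle.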


  3. Consider again the same document-term matrix given in the previous problem. Using the Extended Boolean model (with weighted Boolean queries), which documents will be returned by each of the following queries? Show your work.
    • B0.5 OR F1.0
    • F1.0 AND H0.25
    (Note: use the dot product to compute similarities.)
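
One possible way to mechanize a weighted Boolean query is sketched below. It is an assumption, not necessarily the formulation used in class. Each weighted term is scored against a document with a one-term dot product (query weight times term frequency), the scores are combined fuzzy-set style with max for OR and min for AND, and a document counts as returned when its combined score is greater than zero. All names and the two toy documents are hypothetical.

```python
def term_score(weight, tf):
    """One-term dot product: query weight times document term frequency."""
    return weight * tf


def evaluate(docs, weighted_terms, combine):
    """Score every document for a flat query.

    docs:           mapping of document name -> {term: raw frequency}
    weighted_terms: list of (term, weight) pairs
    combine:        max for OR, min for AND (assumed fuzzy-set semantics)
    """
    results = {}
    for name, tfs in docs.items():
        scores = [term_score(w, tfs.get(t, 0)) for t, w in weighted_terms]
        results[name] = combine(scores)
    return results


# Toy collection with made-up terms p and q (not the assignment data).
docs = {
    "X": {"p": 2, "q": 0},
    "Y": {"p": 1, "q": 4},
}
query = [("p", 0.5), ("q", 1.0)]

or_scores = evaluate(docs, query, max)   # p0.5 OR q1.0
and_scores = evaluate(docs, query, min)  # p0.5 AND q1.0
print("OR :", or_scores)
print("AND:", and_scores)
print("returned by AND:", [name for name, s in and_scores.items() if s > 0])
```

If your lecture notes define the weighted operators differently (for example, via thresholds or p-norms), only the `combine` argument and the retrieval rule need to change.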


  4. Probabilistic Retrieval Model

    Consider the following document-term matrix, where a 1 entry indicates that the term occurs in a document, and 0 means it does not:

     

    Suppose the relevance judgments specified above represent some past user judgments on the relevance of these documents to queries. Using the basic probabilistic retrieval model, compute the discriminant for each of the two new documents

    • D11 = (0,1,1,0,0,1)
    • D12 = (1,0,1,1,0,1)

    with respect to the query Q = (1,1,0,1,0,1). Based on this discriminant, should these documents be retrieved? Explain your answer.
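
A rough sketch of the discriminant computation follows, under several assumptions that you should check against the course notes. Terms are treated as independent binary features, the probabilities are estimated from the judged documents with 0.5 smoothing to avoid zeros, the discriminant is the log of P(D | relevant) / P(D | non-relevant), and a document is retrieved when the discriminant is positive. Passing a query restricts the sum to query terms, which is one common variant. The judged documents below are invented toy data, since the values must come from the table in this problem.

```python
import math


def estimate(docs, n_terms):
    """Smoothed P(term present) per term: (count + 0.5) / (len(docs) + 1)."""
    return [
        (sum(doc[i] for doc in docs) + 0.5) / (len(docs) + 1)
        for i in range(n_terms)
    ]


def discriminant(doc, relevant, nonrelevant, query=None):
    """Log-likelihood ratio of a binary document vector (assumed formulation)."""
    n_terms = len(doc)
    p = estimate(relevant, n_terms)      # P(term present | relevant)
    q = estimate(nonrelevant, n_terms)   # P(term present | non-relevant)
    total = 0.0
    for i, present in enumerate(doc):
        if query is not None and not query[i]:
            continue  # optional variant: consider query terms only
        if present:
            total += math.log(p[i] / q[i])
        else:
            total += math.log((1 - p[i]) / (1 - q[i]))
    return total


# Invented relevance judgments over three terms (not the assignment data).
relevant = [(1, 1, 0), (1, 0, 0)]
nonrelevant = [(0, 1, 1), (0, 0, 1)]

for new_doc in [(1, 0, 0), (0, 1, 1)]:
    g = discriminant(new_doc, relevant, nonrelevant)
    print(new_doc, round(g, 3), "retrieve" if g > 0 else "do not retrieve")
```

A positive discriminant means the document looks more like the judged-relevant documents than the non-relevant ones. Unequal costs or priors would shift the threshold away from zero.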


  5. Read the paper "Understanding User Goals in Web Search" by Rose and Levinson of Yahoo!. Then write a short summary (about one single-spaced page) which includes the following:

    1. How can the underlying user goals in Web search be categorized and what are the primary differences between these search types?
    2. What are some of the behavioral clues from which the search engine can deduce a user's search goals?
    3. What were some of the main findings of this study and how might they be used to improve future Web search engines?



 

Copyright © 2005-2006, Bamshad Mobasher, School of CTI, DePaul University.