Intelligent Information Retrieval
DS 575 / IS 575
Assignment 3
Due: Thursday, February 16, 2006
- Using bigrams for stemming, which of the following terms is more likely to be considered
"equivalent" to the term "informational"? Justify your answer by using Dice's coefficient
to measure the similarity of each term to "informational".
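As a rough illustration of the bigram comparison, the sketch below splits each term into its set of unique adjacent letter pairs and applies Dice's coefficient, 2|A ∩ B| / (|A| + |B|). The helper names `bigrams` and `dice` are hypothetical, and the comparison word "information" is chosen only for illustration; it is not taken from the assignment's list of candidate terms. Counting each distinct bigram once is also an assumption, since some treatments count repeated bigrams.

```python
def bigrams(term):
    """Return the set of unique adjacent letter pairs in a term."""
    t = term.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def dice(term_a, term_b):
    """Dice's coefficient over unique bigrams: 2|A & B| / (|A| + |B|)."""
    a, b = bigrams(term_a), bigrams(term_b)
    if not a and not b:
        return 0.0  # terms too short to have any bigrams
    return 2 * len(a & b) / (len(a) + len(b))


# "information" is an illustrative word only, not one of the assignment's candidates.
score = dice("informational", "information")
print(round(score, 3))
```

The candidate with the higher coefficient would be the one more likely to be conflated with "informational" under this scheme.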
- Consider the following document-term table with 10 documents and
8 terms (A through H) containing raw term frequencies. We also have a
specified query, Q, with the indicated raw term weights. Answer the
following questions, and in each case give the formulas you used to perform
the necessary computations. (Note: You should do this problem using a
spreadsheet program such as Microsoft Excel. Alternatively, you can write a
program to perform the computations. You should include your worksheets or code
in the assignment submission.)
         A   B   C   D   E   F   G   H
-----------------------------------------
DOC1     0   3   4   0   0   2   4   0
DOC2     1   5   0   0  12   0   1   3
DOC3     1   0   4   3   9   0   0   0
DOC4     0   7   0   3   0   0   3   3
DOC5     0   4   0   0   0   5   1   0
DOC6     1   2   2   0   3   1   0   1
DOC7     0   5   3   4   0   0   4   2
DOC8     0   3   0   0   0   4   2   0
DOC9     4   8   9   0  10   8   0   9
DOC10    0   5   0   0   0   4   1   2
-----------------------------------------
Q        2   3   1   0   2   0   1   0
- Compute the ranking score for each document based on each of the
following query-document similarity measures:
- simple matching (dot product)
- cosine similarity
- Dice's coefficient
- Jaccard's Coefficient
- Overlap Coefficient
- Construct a table similar to the one above, but instead of raw term frequencies
compute the (non-normalized) tf.idf weights for the terms. Then compute
the ranking scores using the cosine similarity (only). Explain any
significant differences between the ranking you obtained here and the
cosine ranking from the previous part.
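For those who choose to write a program rather than use a spreadsheet, the following is a minimal sketch of how the computations might be organized. It assumes the common weighted-vector forms of the five measures (with sums of squared weights in the denominators) and an idf of log2(N / df); the course may define some of these differently, so check the formulas against the lecture notes before relying on them. The names `similarities`, `idf_weights`, `tfidf`, and `rank` are hypothetical helpers.

```python
import math

# Raw term frequencies for terms A..H, copied from the table above.
DOCS = {
    "DOC1":  [0, 3, 4, 0, 0, 2, 4, 0],
    "DOC2":  [1, 5, 0, 0, 12, 0, 1, 3],
    "DOC3":  [1, 0, 4, 3, 9, 0, 0, 0],
    "DOC4":  [0, 7, 0, 3, 0, 0, 3, 3],
    "DOC5":  [0, 4, 0, 0, 0, 5, 1, 0],
    "DOC6":  [1, 2, 2, 0, 3, 1, 0, 1],
    "DOC7":  [0, 5, 3, 4, 0, 0, 4, 2],
    "DOC8":  [0, 3, 0, 0, 0, 4, 2, 0],
    "DOC9":  [4, 8, 9, 0, 10, 8, 0, 9],
    "DOC10": [0, 5, 0, 0, 0, 4, 1, 2],
}
Q = [2, 3, 1, 0, 2, 0, 1, 0]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def similarities(x, y):
    """All five measures for two non-zero weight vectors (assumed weighted forms)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    return {
        "dot": xy,
        "cosine": xy / math.sqrt(xx * yy),
        "dice": 2 * xy / (xx + yy),
        "jaccard": xy / (xx + yy - xy),
        "overlap": xy / min(xx, yy),
    }


def idf_weights(docs):
    """idf_i = log2(N / df_i); the base-2 logarithm is an assumption."""
    n = len(docs)
    n_terms = len(next(iter(docs.values())))
    dfs = [sum(1 for row in docs.values() if row[i] > 0) for i in range(n_terms)]
    return [math.log2(n / df) for df in dfs]


def tfidf(row, idf):
    """Non-normalized tf.idf: raw term frequency times idf."""
    return [tf * w for tf, w in zip(row, idf)]


def rank(docs, query, measure):
    """Documents sorted by descending score under the named measure."""
    scores = {name: similarities(row, query)[measure] for name, row in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(DOCS, Q, "cosine"):
        print(f"{name:6s} {score:.4f}")
    idf = idf_weights(DOCS)
    weighted = {name: tfidf(row, idf) for name, row in DOCS.items()}
    # Assumption: the query is reweighted with the same idf values.
    for name, score in rank(weighted, tfidf(Q, idf), "cosine"):
        print(f"{name:6s} {score:.4f}")
```

Whether the query vector should also be multiplied by idf is a modeling choice; the sketch reweights it, but leaving the query in raw form is equally easy to try by passing `Q` directly.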
- Consider again the same document-term matrix given in the previous problem. Using the
Extended Boolean model (with weighted Boolean queries) which
documents will be returned by each of the following queries? Show your work.
- B_0.5 OR F_1.0
- F_1.0 AND H_0.25
(Note: use dot product for computing similarities).
- Probabilistic Retrieval Model
Consider the following document-term matrix, where a 1 entry indicates that
the term occurs in a document, and 0 means it does not:

Suppose the relevance judgments
specified above represent some past user judgments on the relevance of these documents to queries. Using the
basic probabilistic retrieval model, compute the discriminant for
each of the two new documents
- D11 = (0,1,1,0,0,1)
- D12 = (1,0,1,1,0,1)
with respect to the query Q = (1,1,0,1,0,1). Based on this discriminant, should these documents
be retrieved? Explain your answer.
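One way the discriminant computation could be organized is sketched below. It assumes the binary independence form of the model, in which each term shared by the query and the document contributes log[p(1 - q) / (q(1 - p))], where p and q are the estimated probabilities that the term appears in a relevant or a non-relevant document. The training matrix and relevance labels in the sketch are made up purely for illustration and are not the assignment's matrix; the 0.5 smoothing, the restriction to query terms, and the omission of the prior and constant terms are likewise assumptions. The helper names are hypothetical.

```python
import math


def term_probabilities(docs, relevant):
    """Estimate p_i = P(term i | relevant) and q_i = P(term i | non-relevant).

    Adds 0.5 to each count and 1.0 to each total (an assumed smoothing choice)
    so that no probability is exactly 0 or 1.
    """
    n_terms = len(docs[0])
    rel = [d for d, r in zip(docs, relevant) if r]
    non = [d for d, r in zip(docs, relevant) if not r]
    p = [(sum(d[i] for d in rel) + 0.5) / (len(rel) + 1.0) for i in range(n_terms)]
    q = [(sum(d[i] for d in non) + 0.5) / (len(non) + 1.0) for i in range(n_terms)]
    return p, q


def discriminant(doc, query, p, q):
    """Sum of log-odds weights over terms present in both query and document."""
    return sum(
        math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i])))
        for i in range(len(doc))
        if doc[i] and query[i]
    )


# Hypothetical past judgments: NOT the matrix from the assignment.
TRAIN_DOCS = [
    (1, 1, 0, 1, 0, 1),
    (1, 0, 0, 1, 0, 0),
    (0, 1, 1, 0, 1, 0),
    (0, 0, 1, 0, 1, 1),
]
TRAIN_RELEVANT = [True, True, False, False]

QUERY = (1, 1, 0, 1, 0, 1)
D11 = (0, 1, 1, 0, 0, 1)
D12 = (1, 0, 1, 1, 0, 1)

p, q = term_probabilities(TRAIN_DOCS, TRAIN_RELEVANT)
for name, doc in (("D11", D11), ("D12", D12)):
    print(name, round(discriminant(doc, QUERY, p, q), 4))
```

Under this formulation a clearly positive score is usually read as evidence in favor of retrieval, but the decision threshold depends on how the discriminant is defined in the course notes.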
- Read the paper "Understanding User Goals in Web
Search" by Rose and Levinson of Yahoo!. Then write a short summary (about one
single-spaced page) which includes the following:
- How can the underlying user goals in Web search be categorized and what
are the primary differences between these search types?
- What are some of the behavioral clues from which the search engine can
deduce a user's search goals?
- What were some of the main findings of this study and how might they be
used to improve future Web search engines?