Intelligent
Information
Retrieval
DS 575 / IS 575
Assignment 4
Due: Monday, March 6, 2006
- Suppose that after receiving the results of a query Q0 = "dog
race", a user has rated the following two document as non-relevant:
DOC1: "greyhound race track betting"
DOC2: "dog race betting"
and the following three documents as relevant:
DOC3: "iditarod dog sled race"
DOC4: "husky dog sled race malamute dog sled"
DOC5: "alaska dog sled race"
Assuming simple term frequency weights, use Rocchio’s relevance feedback method
to compute a new query Q1 (use a positive feedback factor of
1.0 and negative feedback factor of 0.5). Explain any significant increase or
decrease in term weights. Show your work.
- Consider the following document-term matrix containing raw term frequencies:
- Construct a term-term distance matrix using Manhattan distance
as your measure.
- Determine the binary term relationship matrix using a distance threshold of
8,
and then construct the graph corresponding to this matrix. Note: there is an edge
between each pair of nodes whose distance is less than or equal to 8.
- Using the above graph, perform clustering based on the Clique algorithm (show your work).
- Using the above graph, perform clustering based on the Single Link algorithm (show your work).
- Consider again the same document-term matrix for the previous problem. In
this problem we focus on clustering documents.
- Starting with documents 1 and 2 in Class (cluster) 1, and with documents 5,
6 and 7
in Class 2, perform K-means clustering of the documents using K = 2 (show the reallocation of the documents to classes
after each iteration). Use the cosine similarity measure in the clustering
algorithm.
- Perform clustering of documents using the one-pass (single-pass)
technique (show your work). Use dot product as your similarity measure,
and use a pre-specified similarity threshold of 10. Allow for overlapping
clusters if an item satisfies the similarity threshold for more than one
cluster. The algorithm for
this techniques was given in Lecture 7 notes. For an example of how this
technique is applied see single-pass.html.
-
Suppose that an online bookseller has collected ratings information from
20 past users (U1-U20) on a selection of recent books. The ratings range from 1 = worst to 5 = best.
Two new users (NU1 and NU2) who have recently visited the site and rated some of the books
as follows ("?" represents missing ratings):
Using the K-Nearest Neighbor algorithm predict the ratings of these new users for
each of the books they have not yet rated. Use the Pearson correlation coefficient as the similarity measure. For your convenience, this data is given in
the Excel spreadsheet "knn-w06.xls".
Note: In Microsoft Excel, you can use the CORREL function to
compute correlation.
- Give the results for each of the cases where K=1, K=2, and K=3 (in cases of K=2 and K=
3, use average ratings to combine the results). Note: this can be done by creating a table
with 3 rows for the values of K, and columns for predicted rating of NU1 and NU2 on the books
with missing ratings. The entries in this table will be the predicted rating for the item
given the value of K. Be sure to show the intermediate steps in your work (or provide the
spreadsheet that contains your computations). Based on the above results, which books, if any, would you recommend to users NU1 and
NU2?
- Can negative correlations, in this case, help us in recommending books to these users?
Explain your answer.
-
Read the paper Hyperlink Analysis for the Web by
Monika Henzinger, Research Director at Google, and answer the following questions
(in each case, briefly discuss your answers or provide short explanations).
- What are the basic assumption behind the hyperlink analysis algorithms?
- What is meant by connectivity-based ranking, and how does it relate to co-citation analysis?
- What are some of the primary applications of hyperlink analysis in Web information retrieval?
- Briefly describe the Pagerank algorithm used by the Google search engine.
- Briefly describe the HITS algorithms. What are some of the problems associated with
this algorithm, and how might one resolve these problems?
- How can one spam hyperlink-based search engines? Are they subject ot spamming
attacks as keyword based approaches? If so, can you think of some approaches to reduce
or counter such attacks?
Back to Assignments
|