Intelligent Information Retrieval
DS 575 / IS 575
Assignment 1
Due: Wednesday, January 25
- In each of the following cases, discuss the impact (if any) on precision and recall in an information retrieval system:
- The use of stop lists in indexing
- The use of stemming algorithms
- The use of phrase indexing
- The use of a controlled vocabulary in indexing
- Expansion of user queries by adding synonyms
- Consider the following table of retrieval results from a collection of 16
documents, D1 through D16, for two different retrieval algorithms in response to
some specific query. A 1 in an algorithm's row indicates that the algorithm
retrieved the corresponding document in response to the query; a 0 indicates
that it did not. In the last row, a y indicates that the document was judged
relevant to the query (n means not relevant).
      D1  D2  D3  D4  D5  D6  D7  D8  D9  D10 D11 D12 D13 D14 D15 D16
---------------------------------------------------------------------
Alg1  1   1   0   1   0   1   1   0   0   1   1   0   1   1   0   0
Alg2  1   0   1   1   1   1   1   1   1   1   1   1   0   1   0   1
---------------------------------------------------------------------
Relv  y   y   n   n   y   y   y   n   y   y   y   n   n   n   y   y
Compute the precision and recall for each algorithm (show details). Can you make a judgment as
to which algorithm is more effective? Explain.
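As a reminder of the definitions involved: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The following is a minimal Python sketch under those standard definitions. The function name `precision_recall` and the four-document toy data are hypothetical and are not taken from the table above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for parallel lists of 0/1 retrieval
    flags and 'y'/'n' relevance judgments (hypothetical helper)."""
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have the same length")
    n_retrieved = sum(retrieved)
    n_relevant = sum(1 for rel in relevant if rel == "y")
    # A hit is a document that was both retrieved and judged relevant.
    hits = sum(1 for got, rel in zip(retrieved, relevant)
               if got == 1 and rel == "y")
    precision = hits / n_retrieved if n_retrieved else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall


# Hypothetical toy data for four documents (not the assignment data).
toy_retrieved = [1, 1, 0, 1]
toy_relevant = ["y", "n", "y", "y"]
p, r = precision_recall(toy_retrieved, toy_relevant)
print(f"precision={p:.3f} recall={r:.3f}")
```

The same function can be applied to each algorithm's row paired with the relevance row.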
- One metric that should be minimized in an IR system, from a user perspective, is
user overhead. Describe the points at which user overhead is incurred, from the
time a user has an information need until that need is satisfied. What is the
relationship between user overhead and precision/recall?
- Apply Porter's stemming algorithm to each of the following words. In each case, show the steps in the derivation of the final stem, including the intermediate stem produced by each step.
- queries
- compelling
- irritability
- morphological
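To illustrate how a single step of the derivation is written down, here is a minimal Python sketch of Step 1a only (plural suffixes) of Porter's algorithm, following the rules SSES -> SS, IES -> I, SS -> SS, S -> (empty). The function name `porter_step_1a` is hypothetical, the test words are the standard examples for this step rather than the assignment words, and the remaining steps (1b through 5), which depend on the measure of the stem, are deliberately not covered.

```python
def porter_step_1a(word):
    """Apply only Step 1a of Porter's algorithm to a lowercase word.

    Rules are tried in order; the first matching suffix wins.
    This is a partial sketch, not a full stemmer.
    """
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S -> (empty)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

A full derivation would pass the output of this step on to Step 1b and continue through the later steps, recording the intermediate stem after each one.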
- The following table shows the raw occurrence frequencies of some words in a hypothetical collection. If x and y represent two distinct words, f(x) and f(y) denote the raw frequencies of x and y, respectively. The frequency of co-occurrence of x and y (given a window of some fixed size) is denoted by f(x,y).
[Note that the probability of occurrence is estimated as the raw frequency
divided by the total number of terms.]
x        y        f(x)  f(y)  f(x,y)
-------------------------------------------
color    blue     32    95    25
color    green    32    24    16
blue     green    95    24    4
united   states   40    18    12
united   airline  40    10    6
states   airline  18    10    2
Assuming that the total number of words, N, in the collection is
5,000, rank these pairs of words in decreasing order of expected mutual information. Show your detailed calculations in at least one case.
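The calculation can be sketched in Python as follows. This sketch assumes the pointwise form of mutual information, I(x,y) = log2( P(x,y) / (P(x) P(y)) ), with each probability estimated as a raw frequency divided by N, as in the note above. The course may intend a different definition of expected mutual information, so treat the formula as an assumption. The function name `pointwise_mi` and the input numbers are hypothetical and do not come from the table.

```python
import math


def pointwise_mi(f_x, f_y, f_xy, n):
    """Mutual information of a word pair, assuming the pointwise form
    log2(P(x,y) / (P(x) * P(y))) with P estimated as frequency / n."""
    p_x = f_x / n
    p_y = f_y / n
    p_xy = f_xy / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical frequencies (not from the assignment table):
# P(x) = 0.01, P(y) = 0.02, P(x,y) = 0.005, so the ratio is 25.
example = pointwise_mi(f_x=10, f_y=20, f_xy=5, n=1000)
print(round(example, 3))
```

Under this assumption, ranking the pairs amounts to computing the value for each row and sorting in decreasing order.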
- WordNet is an advanced lexical ontology and thesaurus for the English language with many applications in information retrieval and Web information agents. Write a one-page document about WordNet containing the following elements: (1) your own description of WordNet; (2) the unique features of WordNet that distinguish it from standard thesauri; and (3) some of the specific ways WordNet can be (or has been) used in the context of information retrieval.