Intelligent
Information
Retrieval
DS 575 / IS 575
Assignment 2
Due: February 3, 2006
- Finite State Automata and Lexical Analysis (Note: your FSAs should only accept all terms corresponding to the
specified regular expression or pattern, but no other terms).
- Construct a Finite State Automata corresponding to regular expression:
b ( a b | c )* c b* c*.
- Construct a Finite State Automata recognizing the set of all
binary strings with an even number of 0’s. Examples include: 00, 111, 101110,
00101011, and empty string.
-
Suppose that
during document indexing we are interested in identifying all dates in the
format mm/dd/yyyy, where mm is a 2
digit string representing the month (ranging from 00 through 12), dd
is a 2-digit string representing the day (01-31), and yyyy are 4
digits representing the year. Furthermore, we are only interested in the years
between 1990 and 2005. Example: 02/30/2001.
Construct a Finite State Automata that exactly recognizes date tokens as
described above. Notes: You do not need to match the month with the
correct number of days. You may assume that dig
is a special token representing all digits (0-9).
- Consider the following three short documents:
- Doc #1
- Glimpse is an indexing and query system that allows for search through a file system or document collection quickly. Glimpse is the default search engine in a larger information retrieval system.
It has also been used as part of some web based search engines.
- Doc #2
- The main processes in an retrieval system are document indexing, query processing, query evaluation and relevance feedback.
Among these, efficient updating of the index is critical in large scale systems.
- Doc #3
- Clusters are created from short snippets of documents retrieved by web search engines which are as good as clusters created from the full text of web documents.
|
- First remove stop words, and punctuation, and apply Porter's stemming algorithm to the 3 documents (Note: You can use our
online stemming application for this purpose).
- Create and inverted (sorted) list index of the three documents, including the dictionary
and the postings. The dictionary should also contain (for each term) statistics such as
total number of occurrences in the collection and the document frequency. The postings for each
term should contain the document ids and the term frequencies.
- What are the hit lists for the following Boolean queries (in each case
explain how you obtained them from the inverted index):
- index AND query
- (search AND query) OR (search AND
retrieve)
- (search AND engine AND web)
OR
feedback
- (index OR cluster) AND (web
OR
system)
- Consider the following document-term table containing raw term frequencies.
Answer the following questions, and in each case give the formulas you used to perform
the necessary computations. Note: you should not do the computations manually. You
may use a spreadsheet program such as Microsoft Excel, or you can considering writing your
own program do the computations., In either case, include your spreadsheet or program in
your assignment submissions.
Term1 Term2 Term3 Term4 Term5 Term6 Term7 Term8
-----------------------------------------------
DOC1 0 3 4 0 0 2 4 0
DOC2 5 5 0 0 4 0 4 3
DOC3 3 0 4 3 4 0 0 5
DOC4 0 7 0 3 2 0 4 3
DOC5 0 1 0 0 0 5 4 2
DOC6 2 0 2 0 0 4 0 1
DOC7 3 5 3 4 0 0 4 2
DOC8 0 3 0 0 0 4 4 2
DOC9 0 0 3 3 3 0 0 1
DOC10 0 5 0 0 0 4 4 2
----------------------------------------------
- Compute the new weights for all the terms in document DOC7 using the
tf x idf approach.
- Compute the new weights for all the terms in documents DOC7 using the
signal-to-noise ratio approach.
- Using the Keyword Discrimination approach, determine if Term1
is a good index term or not (by computing it's discriminant). To compute average similarities
use the vector dot product (simple matching) as your
similarity measure (you do not need to compute the full Cosine similarity) . Show your work.
Back to Assignments
|