DS 575 / IS 575
Winter 2006
Intelligent Information Retrieval
Class Project
IR Tools and Software - This page contains
a number of resources that may be helpful for the project. These resources include the
complete source code in Perl, Java, C, and C++ for several search engines and IR systems.
There are also a number of smaller scripts and some articles that you might find useful.
However, please note that you should do your own work for the project. Any use of all or part
of these resources (or other available tools) in your project must first be approved by
the instructor.
Notes:
- Final Project Checklist - Information about
what you need to submit for the final project.
- Each group or individual must submit a specific project proposal to
be approved no later than February 9. Written projects must be done individually.
Implementation projects can be done in groups of no more than 3, depending on the
size and complexity of the project. Note that the size of the group must also be approved
along with the project proposal.
- Due date for final project: Monday, March 13.
The following is a list of ideas for the class project. You may
choose any of these ideas or their variations. You may also choose
to combine parts of these projects, or come up with your own idea. In
all cases, however, you must consult the instructor to agree on a set of
project requirements and deliverables.
Written Projects
Written projects involve doing an in-depth study, survey, or evaluation of one or more topics
related to information retrieval and filtering. The project can take the form of a research paper
examining the use of a specific technique or model in various IR systems, or it can be a detailed
case study involving two or more existing IR systems. In either case, the paper should contain a
summary and a technical evaluation of the state of the art related to the particular topic studied.
If the paper involves a case study, then a thorough comparative evaluation with other similar
systems must be provided. A research paper should present a new idea or provide a detailed survey
of methods to solve a specific IR-related problem. The approach
presented should be, at least in part, a novel and original contribution, and should ideally be
evaluated experimentally. A research paper could be a good start for a Master's or Ph.D. research
project. The maximum length for the written projects is 20 single-spaced pages (11-point font),
including figures and references. The evaluation of the papers will be based on clarity,
thoroughness, and soundness of ideas and concepts presented, as well as the overall organization
of the paper.
Note: A written project should not simply be a summary of some of the material covered
directly in the lectures, but rather should go beyond this material in one or more specific
areas related to that material. The following is a non-exhaustive list of ideas for a written
project (very broadly stated):
- Personalized Search: A study of various techniques and approaches used to
create personalized search applications on the Web. The study should include a
survey of techniques for re-ranking or filtering search results based on user
profiles, as well as intelligent agents that take into account user
characteristics or profiles to assist users in search.
- Exploring various techniques to measure the "quality" of documents as part of Web IR systems.
The study should include examination of techniques based on linkage as a measure of
authority of the information source (e.g., HITS or PageRank algorithms), as well as other techniques
to use ratings or popularity as measures of quality.
- A comparative study of various "relevance feedback" and "query-by-example" methods currently
used by (and best suited for) the World Wide Web. The study should include a technical summary of
various techniques, and an evaluation of existing methods in various search engines.
- Study of the use of client-side (or server-side) agents on the Web that assist users in
filtering information and in browsing or searching tasks on the Web. The paper should include a
summary of techniques, as well as a comparison of several commercially or freely available
systems.
- Web Content Mining: a study of various techniques to mine information and patterns
from semi-structured data on the Web. Examples include the use of agents designed to extract
specific types of information (e.g., shopping agents), the use of XML to integrate available
"meta-data" into current search technologies, Web data warehousing, etc.
- Web Usage Mining: a study of the feasibility and effectiveness of techniques to incorporate
Web usage data (e.g., navigational patterns, demographic data, etc.) into search, retrieval, and
filtering systems. This may also include the use of such techniques in server-side "Web
personalization."
- Collaborative Filtering and Recommender Systems: A comparative study of various collaborative
filtering techniques and their applications in several recommender systems. The study should include
a technical summary of various techniques, and an evaluation of existing methods in use today on the Web.
- Use of Clustering in IR: a thorough study of the use of clustering and categorization techniques
to improve various aspects of IR (particularly on the Web). This includes the use of such techniques
to assist users in browsing, to do prediction or user profiling, or to improve the effectiveness
of various indexing or retrieval models.
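
For the link-based quality measures mentioned in the list above, the following is a minimal sketch of PageRank computed by power iteration. It is meant only as an illustration for a written study, not as a reference implementation. The `pagerank` function, the damping value of 0.85, the iteration count, and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a small link graph by power iteration.

    links maps each page to the list of pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical four-page graph used only for illustration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

In this toy graph, page "c" collects links from three other pages and ends up with the highest score, while "d", which nobody links to, keeps only the random-jump share.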
Implementation Projects
- Build your own search/retrieval system:
- Should work on a local document corpus (which you can obtain
from several sources online).
- Should parse and index documents using an inverted file format.
- Should make use of stemming and stop lists (however, you can
use existing tools for this part).
- You should implement matching based on one of the methods
discussed in class.
- Form of queries is up to you, but relevance ranking should be
used when results are displayed.
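
To give a rough idea of how the pieces of such a system fit together, here is a small sketch of an inverted file with stop-word removal, stemming, and ranked retrieval. It assumes tf-idf scoring as the matching method, which is only one possible choice. The stop list, the crude `stem` function (a stand-in for a real stemmer such as Porter's), and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to {doc_id: term frequency}, i.e. an inverted file."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "d1": "Indexing documents with an inverted file",
    "d2": "Ranking retrieved documents by relevance",
    "d3": "Stemming and stop lists in indexing",
}
index = build_index(docs)
results = search(index, len(docs), "indexing documents")
```

Here "d1" ranks first because it matches both query terms, while "d2" and "d3" each match one.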
- Build a Web search engine:
- Design your own Web "spider" to crawl several levels
deep starting from a specified page.
- Retrieved documents should be automatically indexed (inverted
file format) and stored on a server.
- Query engine will be CGI-based.
- On the client-side, a simple query interface should provide
the capability to perform simple Boolean queries.
- Results displayed should be ranked.
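
The depth-limited spider described above could be sketched roughly as follows. The crawl is breadth-first and stops expanding links once a page is `max_depth` links away from the start page. To keep the example runnable without network access, the page fetcher is passed in as a function, and a hypothetical in-memory `FAKE_WEB` dictionary stands in for real HTTP requests; all names and URLs here are assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of pages up to max_depth links from start_url.

    fetch is any callable mapping a URL to HTML text (or None on failure),
    so a real HTTP client can be swapped in without changing this function.
    """
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth == max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages


# Hypothetical in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="c.html">C</a> <a href="/">home</a>',
    "http://example.test/b.html": "no links here",
    "http://example.test/c.html": '<a href="d.html">D</a>',
    "http://example.test/d.html": "too deep",
}
pages = crawl("http://example.test/", FAKE_WEB.get, max_depth=2)
```

A spider run against real sites would also need to honor robots.txt, limit its request rate, and handle errors and non-HTML content, none of which is shown here.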
- Implement your own simple meta-search engine:
- Build a Java or HTML-based query interface which allows for
Boolean queries (and possibly category labels).
- Search several preselected search engines in parallel (this
would involve translating queries to an appropriate form for
each search engine).
- Allow user to specify filtering parameters such as number of
top documents returned from each engine, time-out value (in
seconds) for each engine, etc.
- Try to combine results, based on some measure of document rankings
from different engines, or some other ranking scheme that you
develop yourself (note that this may involve indexing documents
either using your own or an already available tool).
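
One simple way to combine results from several engines, shown here only as an example of "some other ranking scheme", is reciprocal rank fusion: each document earns 1 / (k + rank) from every engine that returns it, and the sums are sorted. The sketch below assumes the result lists have already been fetched; the engine names, URLs, and the constant `k = 60` are illustrative assumptions.

```python
from collections import defaultdict


def fuse_rankings(ranked_lists, top_n=None, k=60):
    """Merge ranked result lists using reciprocal rank fusion.

    ranked_lists maps an engine name to its results, best first.
    top_n optionally keeps only the first top_n results from each engine.
    """
    scores = defaultdict(float)
    for results in ranked_lists.values():
        limited = results if top_n is None else results[:top_n]
        for rank, url in enumerate(limited, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from three engines.
engine_results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u1", "u4"],
    "engine_c": ["u2", "u5"],
}
merged = fuse_rankings(engine_results, top_n=3)
```

Documents returned by several engines near the top of each list ("u2", then "u1") rise above documents returned by a single engine. This method uses only rank positions, so it does not require the engines to report comparable scores.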
- Implement a Web browsing assistant using clustering:
- The system should collect (or build an index of) documents
the user is most likely to be interested in (for example, by
observing frequency of access of documents over a period of
time and from the bookmark/history files).
- The documents should then be clustered (using a clustering
algorithm of your choice).
- The clustered set of documents should be presented to the user
via a browsable interface. Each cluster can be labeled by a
group of keywords best describing documents in that cluster.
- Many variations to this project are possible.
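
As one possible starting point for the clustering and labeling steps, the sketch below uses a single-pass ("leader") clustering algorithm with cosine similarity over raw term counts, and labels each cluster with its most frequent terms. The algorithm choice, the 0.3 threshold, and the sample page texts are assumptions; any clustering algorithm could be substituted.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster,
    or start a new cluster if none reaches the threshold."""
    clusters = []
    for doc_id, text in docs.items():
        vec = vectorize(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [doc_id]})
        else:
            # Summed counts act as the centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(doc_id)
    return clusters


def label(cluster, n=2):
    """Label a cluster with its n most frequent terms."""
    return [term for term, _count in cluster["centroid"].most_common(n)]


# Hypothetical pages taken from a user's history.
docs = {
    "p1": "jazz music concert tickets",
    "p2": "python programming tutorial",
    "p3": "live jazz music festival",
    "p4": "python programming exercises",
}
clusters = single_pass_cluster(docs)
labels = [label(c) for c in clusters]
```

With these four pages, the two jazz pages and the two programming pages end up in separate clusters. Single-pass clustering depends on the order in which documents arrive, and a real system would want stop-word removal before labeling.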
- Design your own domain-specific search agent:
- Your system should be able to search several domain-specific
sites (e.g., online newspapers, or shopping sites) for documents
relating to a user query.
- You may design your agent to work on 2 or more specific sites.
- An index of the extracted information can be available locally
and updated on a regular basis.
- The query interface (and language) can be restricted (based on
characteristics of the domain and the sites) to make the search
and matching process easier and more efficient.
- Implement a mail, news, or Web filtering system:
- Your system should provide the capability for selective
dissemination of information based on a user profile.
- As news or mail files are received from a specified source,
these documents can be matched against available profiles (sets
of keywords, for example), and filtered and/or classified.
- Same idea could be applied to web pages served either through
the use of "push" technologies (e.g., the PointCast system), or
obtained by the client system through "polling" of specified
Web sources.
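
Profile matching for selective dissemination could look roughly like the sketch below, where each profile is a set of keywords and an incoming document is delivered to every user whose profile shares enough keywords with it. The `route` function, the `min_matches` threshold, and the sample profiles and messages are hypothetical.

```python
import re


def terms(text):
    """Set of lowercase word tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(document, profiles, min_matches=2):
    """Return {user: matched keywords} for every profile that shares
    at least min_matches keywords with the document."""
    doc_terms = terms(document)
    deliveries = {}
    for user, keywords in profiles.items():
        matched = doc_terms & keywords
        if len(matched) >= min_matches:
            deliveries[user] = sorted(matched)
    return deliveries


# Hypothetical user profiles and incoming messages.
profiles = {
    "alice": {"retrieval", "indexing", "search"},
    "bob": {"football", "league", "scores"},
}
incoming = [
    "New indexing scheme speeds up search engines",
    "League scores from last night's football games",
    "Weather forecast for the weekend",
]
routed = [route(message, profiles) for message in incoming]
```

The first message goes to alice, the second to bob, and the third matches no profile and is filtered out. Weighted profiles or a trained classifier could replace the simple keyword overlap.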
- Design an enhanced user interface for a retrieval system:
- Your interface should help guide the user in formulating a
query. You can explore options such as the use of (part of) a
classification hierarchy (such as Yahoo's category labels), or
allowing for natural language queries (possibly through the use
of WordNet thesaurus), etc.
- Your system should also provide an enhanced interface for
the user to browse the retrieved documents and provide mechanisms
such as relevance feedback and query by example.
- For this project you don't have to implement your own indexing
and matching algorithms; however, you may need to modify an
existing system (with source code) to incorporate the additional
capabilities.
- Experiments with term weighting using HTML parsing:
- Many retrieval systems use term weighting and ranking
algorithms that are based on term frequencies. However, HTML
documents provide additional information in the form of structural
tags that can indicate the importance of a term or phrase within a
document. For this project, you should conduct a set of controlled
experiments to try to determine if and how parsing of the HTML
documents and the use of "tagged" terms and phrases may improve
recall and precision in retrieval.
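
To make the idea of "tagged" terms concrete, here is a small sketch that parses an HTML page and counts each term with a weight that depends on the enclosing tag. The weights in `TAG_WEIGHTS` are arbitrary placeholders; finding out whether such weights help, and which values work, is the purpose of the experiments. The sample page is also hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights; choosing and tuning them is the point of the experiment.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}


class WeightedTermCounter(HTMLParser):
    """Accumulate term weights, boosting terms inside selected tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate sloppy HTML.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Text outside any weighted tag counts with weight 1.0.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def weighted_term_frequencies(html):
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


page = (
    "<html><head><title>Jazz Guide</title></head><body>"
    "<h1>Jazz history</h1><p>A guide to <b>jazz</b> and blues.</p>"
    "</body></html>"
)
weights = weighted_term_frequencies(page)
```

In this page, "jazz" appears once each in the title, a heading, and bold text, so its weighted frequency is 4.0 + 3.0 + 1.5 = 8.5, compared with a plain count of 3. A controlled experiment would compare retrieval runs that use these weighted counts against runs that use plain term frequencies on the same queries.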
- Build a simple recommender system:
- Allow multiple users to access a server and rate items based on their
preferences (e.g., movies, music, Web pages, etc.).
- Use collaborative filtering technology (or other profiling techniques
such as clustering) to find similar groups of users.
- Based on ratings of other similar users create dynamic recommendations
for the current user of the system.
- Many different variations of this idea are possible.
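
A bare-bones version of user-based collaborative filtering might look like the sketch below. It measures user similarity with cosine similarity over rating vectors (unrated items count as zero) and predicts a rating for each unseen item as the similarity-weighted average of other users' ratings. The user names, items, and ratings are made up, and the similarity measure is only one of several reasonable choices.

```python
import math


def similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, n=5):
    """Predict ratings for items the user has not rated, best first."""
    own = ratings[user]
    weighted_sum = {}
    sim_total = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(own, other_ratings)
        if sim <= 0.0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in own:
                weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
                sim_total[item] = sim_total.get(item, 0.0) + sim
    predictions = {item: weighted_sum[item] / sim_total[item] for item in weighted_sum}
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"m1": 5, "m2": 4},
    "ben": {"m1": 5, "m2": 4, "m3": 5, "m4": 1},
    "cat": {"m1": 1, "m2": 1, "m4": 5},
}
recs = recommend("ann", ratings)
```

Because ann's ratings resemble ben's much more than cat's, item "m3" (which ben rated highly) is recommended ahead of "m4". Mean-centered measures such as Pearson correlation are a common refinement when users rate on different scales.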