DS 575 / IS 575
Winter 2006

Intelligent Information Retrieval

Class Project

IR Tools and Software - This page contains a number of resources that may be helpful for the project. These resources include the complete source code in Perl, Java, C, and C++ for several search engines and IR systems. There are also a number of smaller scripts and some articles that you might find useful. However, please note that you should do your own work for the project. Any use of all or part of these resources (or other available tools) in your project must first be approved by the instructor.

Notes:

  • Final Project Checklist - Information about what you need to submit for the final project.
  • Each group or individual must submit a specific project proposal to be approved no later than February 9. Written projects must be done individually. Implementation projects can be done in groups of no more than 3, depending on the size and complexity of the project. Note that the size of the group must also be approved along with the project proposal.
  • Due date for final project: Monday, March 13.

The following is a list of ideas for the class project. You may choose any of these ideas or their variations. You may also choose to combine parts of these projects, or come up with your own idea. In all cases, however, you must consult the instructor to agree on a set of project requirements and deliverables.

Written Projects

Written projects involve doing an in-depth study, survey, or evaluation of one or more topics related to information retrieval and filtering. The project can take the form of a research paper examining the use of a specific technique or model in various IR systems, or it can be a detailed case study involving two or more existing IR systems. In either case, the paper should contain a summary and a technical evaluation of the state of the art related to the particular topic studied. If the paper involves a case study, then a thorough comparative evaluation with other similar systems must be provided. A research paper should present a new idea or provide a detailed survey of methods to solve a specific IR-related problem. The approach presented should be, at least in part, a novel and original contribution, and should ideally be evaluated experimentally. A research paper could be a good start for a Masters or Ph.D. research project. The maximum length for written projects is 20 single-spaced pages (11 point font), including figures and references. The evaluation of the papers will be based on clarity, thoroughness, and soundness of the ideas and concepts presented, as well as the overall organization of the paper.

Note: A written project should not simply be a summary of some of the material covered directly in the lectures, but rather should go beyond this material in one or more specific areas related to that material. The following is a non-exhaustive list of ideas for a written project (very broadly stated):

  • Personalized Search: A study of various techniques and approaches used to create personalized search applications on the Web. The study should include a survey of techniques for re-ranking or filtering search results based on user profiles, as well as intelligent agents that take into account user characteristics or profiles to assist users in search.
  • Document Quality in Web IR: a study of various techniques to measure the "quality" of documents as part of Web IR systems. The study should include an examination of techniques based on linkage as a measure of authority of the information source (e.g., the HITS or PageRank algorithms), as well as other techniques that use ratings or popularity as measures of quality.

  • A comparative study of various "relevance feedback" and "query-by-example" methods currently used by (and best suited for) the World Wide Web. The study should include a technical summary of various techniques, and an evaluation of existing methods in various search engines.

  • A study of the use of client-side (or server-side) agents on the Web that assist users in filtering information and in browsing or searching tasks. The paper should include a summary of techniques, as well as a comparison of several commercially or freely available systems.

  • Web Content Mining: a study of various techniques to mine information and patterns from semi-structured data on the Web. Examples include the use of agents designed to extract specific types of information (e.g., shopping agents), the use of XML to integrate available "meta-data" into current search technologies, Web data warehousing, etc.

  • Web Usage Mining: a study of the feasibility and effectiveness of techniques to incorporate Web usage data (e.g., navigational patterns, demographic data, etc.) into search, retrieval, and filtering systems. This may also include the use of such techniques in server-side "Web personalization."

  • Collaborative Filtering and Recommender Systems: A comparative study of various collaborative filtering techniques and their applications in several recommender systems. The study should include a technical summary of various techniques, and an evaluation of existing methods in use today on the Web.

  • Use of Clustering in IR: a thorough study of the use of clustering and categorization techniques to improve various aspects of IR (particularly on the Web). This includes the use of such techniques to assist users in browsing, to do prediction or user profiling, or to improve the effectiveness of various indexing or retrieval models.

Implementation Projects

  1. Build your own search/retrieval system:
    • Should work on a local document corpus (which you can obtain from several sources online).
    • Should parse and index documents using inverted file format.
    • Should make use of stemming and stop lists (however, you can use existing tools for this part).
    • You should implement matching based on one of the methods discussed in class.
    • Form of queries is up to you, but relevance ranking should be used when results are displayed.
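As a rough illustration of the requirements for project 1, the following minimal sketch builds an inverted file over a tiny in-memory corpus and ranks results by a simple tf-idf sum. The corpus, the stop list, and the function names are assumptions made for this example; stemming is omitted, and tf-idf is only one of several possible matching methods.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to {doc_id: term frequency}, i.e. an inverted file."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document corpus.
docs = {
    "d1": "The cat sat on the mat",
    "d2": "Dogs and cats in the garden",
    "d3": "The cat chased the cat toy",
}
index = build_index(docs)
results = search("cat toy", index, len(docs))
```

Here `results` lists `d3` ahead of `d1`, because `d3` contains both query terms. Without stemming, "cats" in `d2` does not match "cat", which shows why the stemming requirement matters.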

  2. Build a Web search engine:
    • Design your own Web "spider" to go through several levels deep starting from a specified page.
    • Retrieved documents should be automatically indexed (inverted file format) and stored on a server.
    • Query engine will be CGI-based.
    • On the client side, a simple query interface should provide the capability to perform simple Boolean queries.
    • Results displayed should be ranked.
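The "several levels deep" spider in project 2 can be sketched as a breadth-first traversal with a depth limit. In the sketch below, the `fetch` callable and the in-memory `FAKE_WEB` dictionary are assumptions that let the example run without network access; a real spider would fetch pages over HTTP and should also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth):
    """Breadth-first crawl; returns {url: html} for pages within max_depth links."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        pages[url] = html
        if depth == max_depth:  # do not follow links past the depth limit
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": '<a href="/d">D</a>',
    "http://example.test/d": "leaf",
}

pages = crawl("http://example.test/", FAKE_WEB.get, max_depth=2)
```

With `max_depth=2`, the page `/d` sits three links from the start page and is never fetched. The retrieved pages would then be passed to the indexer.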

  3. Implement your own simple meta-search engine:
    • Build a Java or HTML-based query interface which allows for Boolean queries (and possibly category labels).
    • Search several preselected search engines in parallel (this would involve translating queries to an appropriate form for each search engine).
    • Allow the user to specify filtering parameters such as number of top documents returned from each engine, time-out value (in seconds) for each engine, etc.
    • Try to combine results, based on some measure of document rankings from different engines, or some other ranking scheme that you develop yourself (note that this may involve indexing documents either using your own or an already available tool).
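One possible way to combine results in project 3 is reciprocal rank fusion, which uses only each document's rank in every engine's list. This technique is swapped in here purely as an illustration and is not prescribed by the project description. The engine result lists, the function name, and the constant `k=60` are assumptions for the example.

```python
from collections import defaultdict


def fuse_rankings(ranked_lists, top_n=None, k=60):
    """Merge ranked URL lists with reciprocal rank fusion (RRF).

    Each list contributes 1 / (k + rank) to a URL's score, so URLs that
    several engines rank highly rise to the top of the merged list.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        if top_n is not None:
            results = results[:top_n]  # user-specified cut-off per engine
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists, best result first, from three engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u1", "u4"]
engine_c = ["u2", "u5", "u1"]
merged = fuse_rankings([engine_a, engine_b, engine_c], top_n=3)
```

The `top_n` argument corresponds to the "number of top documents returned from each engine" parameter. A score-based scheme would additionally require normalizing scores across engines, since engines rarely report comparable values.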

  4. Implement a Web browsing assistant using clustering:
    • The system should collect (or build an index of) documents the user is most likely to be interested in (for example, by observing frequency of access of documents over a period of time and from the bookmark/history files).
    • The documents should then be clustered (using a clustering algorithm of your choice).
    • The clustered set of documents should be presented to the user via a browsable interface. Each cluster can be labeled by a group of keywords best describing documents in that cluster.
    • Many variations to this project are possible.
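The clustering and labeling steps of project 4 might look like the sketch below, which runs a small k-means over raw term-frequency vectors using cosine similarity and labels each cluster with its highest-weighted centroid terms. The documents, the choice of seed documents, and all function names are assumptions; any other clustering algorithm could be substituted.

```python
import math
from collections import Counter


def vectorize(text):
    """Raw term-frequency vector for a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average of a non-empty list of term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, seeds, iterations=10):
    """k-means with cosine similarity; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids


def label(centroid_vec, n=2):
    """Describe a cluster by its n highest-weighted terms."""
    ranked = sorted(centroid_vec.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical page texts, e.g. drawn from a user's browsing history.
texts = [
    "python code python programming",
    "soccer match goal soccer",
    "programming code compiler",
    "goal league soccer",
]
vectors = [vectorize(t) for t in texts]
assignment, centroids = kmeans(vectors, seeds=[0, 1])
labels = [label(c) for c in centroids]
```

In this toy run the programming pages and the soccer pages end up in separate clusters. In practice tf-idf weighting and stop-word removal would usually give more descriptive labels than raw counts.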

  5. Design your own domain-specific search agent:
    • Your system should be able to search several domain-specific sites (e.g., online newspapers, or shopping sites) for documents relating to a user query.
    • You may design your agent to work on 2 or more specific sites.
    • An index of the extracted information can be available locally and updated on a regular basis.
    • The query interface (and language) can be restricted (based on characteristics of the domain and the sites) to make the search and matching process easier and more efficient.

  6. Implement a mail, news, or Web filtering system:
    • Your system should provide the capability for selective dissemination of information based on a user profile.
    • As news or mail files are received from a specified source, these documents can be matched against available profiles (sets of keywords, for example), and filtered and/or classified.
    • The same idea could be applied to Web pages served either through the use of "push" technologies (e.g., the PointCast system), or obtained by the client system through "polling" of specified Web sources.
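For project 6, matching incoming documents against keyword profiles can be sketched as follows. The profiles, the overlap score (fraction of profile keywords found in the document), and the threshold are all assumptions chosen for illustration; a real system would more likely use weighted term vectors.

```python
import re


def terms(text):
    """Set of lowercase word tokens in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(document, profiles, threshold=0.5):
    """Return {user: score} for profiles that match the document well enough.

    The score is the fraction of a profile's keywords found in the document.
    """
    doc_terms = terms(document)
    matches = {}
    for user, keywords in profiles.items():
        if not keywords:  # skip empty profiles
            continue
        score = len(keywords & doc_terms) / len(keywords)
        if score >= threshold:
            matches[user] = score
    return matches


# Hypothetical user profiles, each a set of keywords.
profiles = {
    "alice": {"retrieval", "indexing", "search"},
    "bob": {"football", "league"},
}
message = "New results on indexing and search for large collections"
delivered = route(message, profiles)
```

The example message is delivered only to `alice`, whose profile shares two of its three keywords with the text. The same `route` function could be applied to polled Web pages instead of mail or news items.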

  7. Design an enhanced user interface for a retrieval system:
    • Your interface should help guide the user in formulating a query. You can explore options such as the use of (part of) a classification hierarchy (such as Yahoo's category labels), or allowing for natural language queries (possibly through the use of WordNet thesaurus), etc.
    • Your system should also provide an enhanced interface for the user to browse the retrieved documents and provide mechanisms such as relevance feedback and query by example.
    • For this project you don't have to implement your own indexing and matching algorithms. However, you may need to modify an existing system (with source code) to incorporate the additional capabilities.
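The relevance feedback mechanism mentioned in project 7 is often illustrated with the Rocchio formula, which moves the query vector toward documents the user marked relevant and away from those marked non-relevant. The sketch below assumes documents and queries are term-weight dictionaries; the parameter values for `alpha`, `beta`, and `gamma` are commonly quoted defaults and are assumptions here, as are the example vectors.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new query vector: alpha*q + beta*mean(relevant) - gamma*mean(non_relevant)."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms with non-positive weight are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0}


# Hypothetical feedback from a user searching for the car, not the animal.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 2.0}, {"car": 1.0, "engine": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 2.0}]
new_query = rocchio(query, relevant, non_relevant)
```

After feedback the query gains the terms "car" and "engine", while "cat" is excluded because its weight is negative. The interface would then rerun the search with `new_query`.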

  8. Experiments with term weighting using HTML parsing:
    • Many retrieval systems use term weighting and ranking algorithms that are based on term frequencies. However, HTML documents provide additional information in the form of structural tags that can indicate the importance of a term or phrase within a document. For this project, you should conduct a set of controlled experiments to try to determine if and how parsing of the HTML documents and the use of "tagged" terms and phrases may improve recall and precision in retrieval.
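A starting point for the experiments in project 8 could be a parser that counts each term occurrence with a weight depending on the enclosing tag. The tag weights below are arbitrary assumptions (finding good values is precisely what the experiments would investigate), and the class and function names are made up for this sketch.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights per tag; text outside these tags counts 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 2.0}


class WeightedTermParser(HTMLParser):
    """Accumulate term weights, boosting terms inside selected tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags that carry a weight
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Use the largest weight among the enclosing weighted tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.weights[term] += weight


def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.weights


page = (
    "<html><head><title>Vector Space</title></head><body>"
    "<h1>Ranking</h1><p>The vector model uses <b>ranking</b> of documents.</p>"
    "</body></html>"
)
weights = weighted_terms(page)
```

For this page, "vector" scores 5.0 from the title plus 1.0 from the body text. A controlled experiment would compare recall and precision using these weighted counts against plain term frequencies on the same queries.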

  9. Build a simple recommender system:
    • Allow multiple users to access a server and rate items based on their preferences (e.g., movies, music, Web pages, etc.).
    • Use collaborative filtering technology (or other profiling techniques such as clustering) to find similar groups of users.
    • Based on the ratings of other similar users, create dynamic recommendations for the current user of the system.
    • Many different variations of this idea are possible.
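The core of project 9 can be sketched with user-based collaborative filtering: find users with similar ratings, then predict scores for items the current user has not rated as a similarity-weighted average. The ratings data, the use of cosine similarity, and the function names are assumptions for this example; Pearson correlation or clustering are equally valid choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    dot = sum(rating * b[item] for item, rating in a.items() if item in b)
    norm = math.sqrt(sum(r * r for r in a.values())) * math.sqrt(sum(r * r for r in b.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings, top_n=3):
    """Predict ratings for unrated items from similar users' ratings."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:  # ignore users with nothing in common
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predictions = {item: totals[item] / weights[item] for item in totals}
    # Highest predicted rating first; ties broken by item name.
    return sorted(predictions.items(), key=lambda p: (-p[1], p[0]))[:top_n]


# Hypothetical movie ratings on a 1-5 scale.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben": {"Alien": 5, "Brazil": 5, "Dune": 5, "Emma": 1},
    "cat": {"Casablanca": 5, "Emma": 5, "Dune": 1, "Alien": 1},
}
recs = recommend("ann", ratings)
```

Because `ann` rates movies much like `ben` does, `ben`'s favorite "Dune" is recommended ahead of "Emma". Recomputing the predictions on each request, as new ratings arrive, gives the dynamic behavior described above.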


Copyright © 2005-2006, Bamshad Mobasher, School of CTI, DePaul University.