Search Engine project for

Introductory programming,
Object Oriented Programming,
Database Systems,
Concurrency and
Introduction to Algorithms and Data Structures

December 2002

The danish version of this page can be found here(will not be updated!!!!)

The search engine project is a standard project,
see here to find other standard projects.

This homepage :
http://www.it-c.dk/research/algorithms/Kurser/SoegeProjekt/2002DECEMBER/

The application form for instructor must be handed in at latest 18. of November. Because the teachers of the course are away in the time 15.-18. of November, students are encouraged to assign for the project before 15. of November. If you are later than 15 November. and want to do the search engine project and does not have an application form you will have to contact Carsten Butz in room 1:17, phone 38 16 88 20. Casten has completed application forms for the search engine project for you to sign and hand in at the student administration.

Table of contents

This page: Other:

Introductory information

Welcome to the homepage of the Search Engine Project. The Search Engine Project is for students with different skills and goals. All groups are required to solve the assignments mentioned in the section The joint part below. Special arrangements regarding the Joint part can be done with the teacher. When a group has solved the joint part, they have the first primitive version of a search engine for IT-C.

After the joint part the group can move on in several directions. For instance they can do one or more of the suggested extra assignments. What each group choose to work on will depend on interest and skills. In case the group has no other skills than introductory programming it is not expected to do more than the joint part during the project weeks. Groups with more skills are expected to hand in more than the joint part.

Each group are required to do a project agreement. Each group will have to describe its work done in a report, and this report is defended at a project exam.

During the project project days are held, where general information regarding the project will be given and questions can be asked. Furthermore meetings with each of the individual groups and the instructors will be held. If you or your group has any problems during the project you are welcome to take contact to one of the teachers, assigned for the course. There will also be an instructor for helping with programming problems or other problems of a more specific character.

At the first project day the project is presented and it will be possible to create the final groups, if this is not done already. Furthermore the programming techniques, which are needed for solving the joint part are presented. Therefore in the describtion below of the assignment there will supposedly be words you don't know before the first project day has been held. Parts of the text will be especially incomprehensible if you haven't heard about linked lists. Linked lists are one of the programming techniques presented at the first project day.


The assignment

Data files

To use for the project different data files containing the whole or parts of the homepage of IT-C is available. The format of the files can be illustrated by the following example:

*PAGE:http://www.it-c.dk
Here
is
a
word
and
one
word
more
*PAGE:http://www.it-c.dk/anotherpage.html
Here
is
more
and
yet
more

The idea is, that a data file contains information about a series of homepages. The information for a homepage starts with a *PAGE line which denotes at which address the the page can be found. Such an address is called a URL. A list of the words contained in the page follows the URL. The words are just listed one word at a time on each line in the same order as they occur on the page. As seen in the example each word can have several occurences on the same page.

You can get the following datafiles where the first two contains part of the homepages of IT-C and the last one contains all of them:

These files can also be found in the folder:

  s:\sysadm\soegemaskine

and it is recommended to use them directly from there when on IT-C.

When working using Linux the files can be found in the folder /import/share/stud/sysadm/soegemaskine/

The program SearchCmd

As part of this assignment the program SearchCmd.java is handed out. It implements a little search engine. From the command promt the name of the data file to use for searching is provided as a parameter to the program and the program is run. Refer to about the use of Java below for information on how to pass parameters to Java programs.

The parameter passed to the program is put in args[0] where args is the parameter to the programs main method. The first thing the program does is to construct a linked list datastructures with all the lines from the data file which name is in args[0]. The objects in the linked list contains the fields str and next. The field str contains a line from the data file and next points to the next object in the list. After reading the data file a simple user interface is started, enabling the user to get an answer on whether a given word is in one of the homepages read from the data file.

The joint part

The joint part consists of solving the following 4 assignments, where assignment 3 is the largest. These assignments are presented at the first project day.
  1. Run the program SearchCmd.
  2. Modify SearchCmd in such a way that if a search for "Hans" for instance occurs, all of the URL's where "Hans" exists is written out to the screen. This has to be done by changing the search-method in SearchCmd. It is not done by changing the datastructure.
  3. Modify the construction of the datastructure in such a way that a linked list of objects containing three fields is created. The three fields are str, next and urls. The fields str and next are like in SearchCmd, with the exception that there are only words in the list and each word only occurs once. In case an object contains the word "Hans" in the field str the objects urls field will be a pointer to a list of the URL's, wihch contains the word "Hans". A list of URL's consists of objects with two fields; next and url. The field next is a pointer to the next object in the list and the field url is a pointer to a URL. After construction of this datastructure coresponding search routines has to be made. The datastructure created in this assignment is of less size than the structure from SearchCmd and is faster to search in.
  4. Modify the solution from assignment 3 in such a way that you in stead of having a linked list of the different words, store them in a hashtable datastructures. You have to create the datastructure using chained hashing. That mean that each entry contains a reference to a linked list. Use of such a datastructure will have a dramatic effect on the time it takes to build and search in the datastructure.

In the solution of the above assignments it is not allowed to use any other packages than java.io and java.lang. As a consequence it is not allowed to use any of the classes from the java.util package. The reason for this demand is, that one of the purposes with this project is to learn how to program different datastructures and not just to learn how to use datastructures programmed by others.

As you can see it is allowed to use different methods on String objects. It is also allowed to use the method hashCode. It is also allowed to allocate arrays. If you have any doubt whether some Java functionality is allowed for use, contact one of the teachers.

For the assignments after the joint part there are no restrictions on the java functionality you must use.

The report

There are the following general demants for the report:

Furthermore there is the following requirements to the report, which are all related to the three search engines implemented in the joint part:

What the report contains apart from the joint part, depends on the work done by the group and the chosen focus on this work.

At Introductory programming there are made a note on how a project report might look. This note is recommended reading as preparation for this course. In the note an estimate of 25 pages are made for a project report. For this concrete project our estimate is less than 20 pages using a standard font (20 being a large report). Appendices are not counted in the 20 pages, neither are figures. Our experience shows that the following structure comprises a good report:

The desribtion of Searchmd ? can be done in the following way: If you find it relevant you can have a chapter after the foreword, describing linked lists, general demands for the input etc. which applies to all of the search engines described.

Datastructures

In order to solve the assignments in the joint part you need linked lists and hashing.

Linked lists are explained at the first project day. If you don't know about linked lists you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli, second edition, chapter 16.1 - 16.2. Alternatively take a look at this homepage. You can also get ListProgram.java which contain a small example of the use of linked lists. A real good way to learn about linked lists is by experimenting and modifying this program.

As supplementary material on hashing we recommend the material that can be found at this homepage.

Extra assignments

The following is suggestions for assignments to do after the joint part. The ordering of these suggestions is random.

About the use of Java

In this part you can find some usefull information on the use of Java. For further and more basic information refer to Introductory programming. Furthermore there is a link to the Java API specification.

When running SearchCmd one needs to pas parameters to the main method. Furthermore it is necessary to use so-called jar-files for many of the extra assignments. In the following section it is described how to do these two things in different Java developing environments.

In general it can be mentioned that jar-files contains functionality apart from the one given by the standard implementation. A jar file is necessary both when compiling and executing a program that use the extra functionality. All the jar-files that are refered to on this homepage can also be found in the folder:

  s:\sysadm

BlueJ

In BlueJ a program is executed by running the main method. A dialogobx appears where you have to provide the program parameters. Here for example you can write (beware of the use of double backslash):

  {"s:\\sysadm\\soegemaskine\\itcwww-small.txt"}

meaning that you pass the above string to the program (i.e. it is placed in args[0]

In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries. From here click Add and find the correct jar-file that is needed.

JCreator

If you use JCreator to translate and execute Java programs and encounters prooblems compiling and running the programs, enter the File menu and choose Close workspace each time JCreator is started.

You can pass parameters to a programs main-method in the following way: Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value Run application and press <default>. Then press Edit and choose Parameters. Parameters can then be entered in the field Main (...).

You an use a jar-file by selecting Configure in the menu, then select Options and finally JDK Profiles. Choose JDK version 1.3 and Edit. Press Add and choose Add package providing the jar-file to use.

Commandline tools

The classical way of compiling and running java programs is by using the so-called commandline programs. You can translate the program in SearchCmd.java in the following way:

  javac SearchCmd.java

Then running the program, using the small file, in the following way:

  java SearchCmd s:\sysadm\soegemaskine\itcwww-small.txt

If you want to use a jar-file it is done by adding it to the CLASSPATH. This is already done for the jar-files refered from this homepage.


Administrative information

Important dates

Project agreement

In this section some of the administrative rules regarding 4 week projects are shortly described.

It is easiest if you show up at one of the presentations of the project, where we can the fill out the needed applications. See dates for the presentations at Important dates.

At the intranet there is a document called "projektaftale" and a guide for this document.
The "Projektaftale"-document must be handed in at the study administration three times, first time the document is used as an application form for an instructor, then as a project agreement and finally as a front page for the project when handing it in.

Dates for handing in papers are found in the section Important dates. The text in this section follows the terminology of the study administration.

Please be aware that "Projektaftale" covers two concepts, a "document" and three hand ins of the document. Each group wishing to do the project together must hand in the document together all three times. The first two hand ins are signed by the teachers of the project.

Below a guide for each hand in is provided.

Application for instructor

A lot of fields has to be filled in in the application for instructor. Below is a run through of how to fill in these fields for this project.

Project agreement

A long list of fields has to be filled in. All information in the application for instructor must also be present in this form. Below is a list (not comlete) of fields to fill in for the project agreement.

Front page

Some fields has to be filled in on the frontpage. All information present in the two previous hand ins, must also be present on this paper. On the front page it must be stated whether each person in the group is evaluated as an individual or as a group at examination. Ususally you are evaluated as a group unless something different is agrred upon with the instructor. If you ae evaluated as an individual a describtion of how to do it must be provided.

Project exam

The time for examination will be given later.

When the projects are handed in an exam plan will be made ASAP

Exam plan

Each group must defend their report at an oral examination. Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.

Project days

During the project period there will be days where it's possible for the groups to get help.

Dates for the project days are not decided yet, but as in the previuos terms there will be a long list of days with different subjects such as servlets, lists, hashing, crawlers, which are of relevance for the construction of search engines.

Project groups

If you are interested in the search engine project you must write to stephen ASAP, providing information on hte technical courses you have followed.

The project groups below are to be regarded as temporary. Until now is just listed the persons who has shown interest in the course this far.

Be aware that at least two participants are required in a group unless otherwise is agreed on with the teacher. Our recommandation is at least 3 persons in a group.

If you are in a group with few persons, try to contact other groups, using the mail-adresses below, to create larger groups. We intend to create the final groups at the first project day.

Group 1 to 7 have Stephen as supervisor
Group 8 to 12 have Theis as supervisor

Group 1 :Mikael Blomqvist(DBD, GP, DB), Lars C. Hansen (OOP, DBD), Bertel K. Brodersen (DBD, GP, ITPO)
Group 2 : Kristian Thy (OOP, IADS, WEB), Lars Bengtsson (WEB, NP,IADS,OOP), Laust Juul Christensen (VOOP,Algoritmik)
Group 3 (early exam) : Xu Li,Jianjun Chen (IPBR, Mark, DSP), Shaolin Zhou, Nguyen Viet Thang (English report, IM, DSP,IPBR)
Group 4 (exam probl) : Esben R. Hansen (OOP, IAD,DBD), Ayhan Binici (VOP,IAD,BA.Dat),
--------Peter Tiedemann (DB,OOP,BA. Mat/DAT), Rebekka Andersen (DiS,NOP,DS,Algoritmik,OOP,BA.Dat)
Group 5 : Peter Lysdal Jakobsen (OOP, DB), Havard Semundseth (OOP,DBD,FP), Eirik Laberg (OOP,DBD,FP)
Group 6 :Claudio Arias (GP, AMPI), Kim Hartung Jørgensen (GP), Michael Kauffmann (GP, WEB)
Group 7 : Lene Christansen (DBD, GP, ITPO), Dorthe Skovshoved Rasmussen (GP, ITPO, DBD), Bente B. Petersen (GP, DBD)
Group 8 : Sandugash Olesen (GP, DBD, ITMAT kurser), Olga Hansen (DBS, DBD, GP)
Group 9 : Morten Franck, Peter Gath Hansen (DS, WEB, IT-mat)
Group 10 : Anne S. F. B-Christensen(GP,DBD,ITOP), Anders Musiat, Lene Marie Ramløse, Flemming Kristensen (DBD,ITPO,GP)
Group 11 : Siri Hamre (GP, ITPO, DBD), Andreas Hansen (GP, DBD), Ragnar Lundø(GP, DBD, Web)
Group 12 : Lena E. Fredeiksen (GP,DKM), Jens Ulrik Nielsen(GP,DBD,ITPO), Jakob Schrøder Andersen(GP,DBD,ITPO)
Group 13 :Orri Palsson (GP, DBD, WEB)

Alle lærere og studerende

Teachers

Stephen Alstrup (Teacher), email:stephen@it-c.dk
Theis Rauhe (Teacher), email:theis@it-c.dk
Jacob Borella (Teaching assistant), email:jborella@it-c.dk

Teaching assistant plan

The teaching assistant is in room 2.59 or in the terminal rooms. He can be found at the following times (changes may occur, but will be given at least 24 hours in advance):
day time
26112002 12-16
27112002 12-16
28112002 12-16
29112002 9-13
02122002 9-13
03122002 14-18
04122002 9-13
05122002 9-14
06122002 9-13
09122002 9-13
10122002 9-13
11122002 9-13
12122002 9-13
13122002 9-13
17122002 9-12
19122002 9-12