Search Engine project for

Introductory programming,
Object Oriented Programming,
Database Systems,
Concurrency and
Introduction to Algorithms and Data Structures

December 2003

The danish version of this page can be found here(will not be updated!!!!)


Table of contents

This page: Other:

Introductory information

Welcome to the homepage of the Search Engine Project. The Search Engine Project is for students with different skills and goals. All groups are required to solve the assignments mentioned in the section The joint part below. Special arrangements regarding the Joint part can be done with the teacher. When a group has solved the joint part, they have the first primitive version of a search engine for IT-C.

After the joint part the group can move on in several directions. For instance they can do one or more of the suggested extra assignments. What each group choose to work on will depend on interest and skills. In case the group has no other skills than introductory programming it is not expected to do more than the joint part during the project weeks. Groups with more skills are expected to hand in more than the joint part.

Each group are required to do a project agreement. Each group will have to describe its work done in a report, and this report is defended at a project exam.

During the project project days are held, where general information regarding the project will be given and questions can be asked. Furthermore meetings with each of the individual groups and the instructors will be held. If you or your group has any problems during the project you are welcome to take contact to one of the teachers, assigned for the course. There will also be an instructor for helping with programming problems or other problems of a more specific character.

At the first project day the project is presented and it will be possible to create the final groups, if this is not done already. Furthermore the programming techniques, which are needed for solving the joint part are presented. Therefore in the describtion below of the assignment there will supposedly be words you don't know before the first project day has been held. Parts of the text will be especially incomprehensible if you haven't heard about linked lists. Linked lists are one of the programming techniques presented at the first project day.


The assignment

Data files

To use for the project different data files containing the whole or parts of the homepage of IT-C is available. The format of the files can be illustrated by the following example:

*PAGE:http://www.it-c.dk
Here
is
a
word
and
one
word
more
*PAGE:http://www.it-c.dk/anotherpage.html
Here
is
more
and
yet
more

The idea is, that a data file contains information about a series of homepages. The information for a homepage starts with a *PAGE line which denotes at which address the the page can be found. Such an address is called a URL. A list of the words contained in the page follows the URL. The words are just listed one word at a time on each line in the same order as they occur on the page. As seen in the example each word can have several occurences on the same page.

You can get the following datafiles where the first two contains part of the homepages of IT-C and the last one contains all of them:

These files can also be found in the folder:

  s:\sysadm\soegemaskine

and it is recommended to use them directly from there when on IT-C.

When working using Linux the files can be found in the folder /import/share/stud/sysadm/soegemaskine/

The program SearchCmd

As part of this assignment the program SearchCmd.java is handed out. It implements a little search engine. From the command promt the name of the data file to use for searching is provided as a parameter to the program and the program is run. Refer to about the use of Java below for information on how to pass parameters to Java programs.

The parameter passed to the program is put in args[0] where args is the parameter to the programs main method. The first thing the program does is to construct a linked list datastructures with all the lines from the data file which name is in args[0]. The objects in the linked list contains the fields str and next. The field str contains a line from the data file and next points to the next object in the list. After reading the data file a simple user interface is started, enabling the user to get an answer on whether a given word is in one of the homepages read from the data file.

The joint part

The joint part consists of solving the following 4 assignments, where assignment 3 is the largest. These assignments are presented at the first project day.
  1. Run the program SearchCmd.
  2. Modify SearchCmd in such a way that if a search for "Hans" for instance occurs, all of the URL's where "Hans" exists is written out to the screen. This has to be done by changing the search-method in SearchCmd. It is not done by changing the datastructure.
  3. Modify the construction of the datastructure in such a way that a linked list of objects containing three fields is created. The three fields are str, next and urls. The fields str and next are like in SearchCmd, with the exception that there are only words in the list and each word only occurs once. In case an object contains the word "Hans" in the field str the objects urls field will be a pointer to a list of the URL's, wihch contains the word "Hans". A list of URL's consists of objects with two fields; next and url. The field next is a pointer to the next object in the list and the field url is a pointer to a URL. After construction of this datastructure coresponding search routines has to be made. The datastructure created in this assignment is of less size than the structure from Searchcmdand is faster to search in.
  4. Modify the solution from assignment 3 in such a way that you in stead of having a linked list of the different words, store them in a hashtable datastructures. You have to create the datastructure using chained hashing. That mean that each entry contains a reference to a linked list. Use of such a datastructure will have a dramatic effect on the time it takes to build and search in the datastructure.

In the solution of the above assignments it is not allowed to use any other packages than java.io and java.lang. As a consequence it is not allowed to use any of the classes from the java.util package. The reason for this demand is, that one of the purposes with this project is to learn how to program different datastructures and not just to learn how to use datastructures programmed by others.

As you can see it is allowed to use different methods on String objects. It is also allowed to use the method hashCode. It is also allowed to allocate arrays. If you have any doubt whether some Java functionality is allowed for use, contact one of the teachers.

For the assignments after the joint part there are no restrictions on the java functionality you must use.

The report

There are the following general demants for the report:

Furthermore there is the following requirements to the report, which are all related to the three search engines implemented in the joint part:

What the report contains apart from the joint part, depends on the work done by the group and the chosen focus on this work.

At Introductory programming there are made a note on how a project report might look. This note is recommended reading as preparation for this course. In the note an estimate of 25 pages are made for a project report. For this concrete project our estimate is less than 20 pages using a standard font (20 being a large report). Appendices are not counted in the 20 pages, neither are figures. Our experience shows that the following structure comprises a good report:

The desribtion of Searchmd ? can be done in the following way: If you find it relevant you can have a chapter after the foreword, describing linked lists, general demands for the input etc. which applies to all of the search engines described.

Datastructures

In order to solve the assignments in the joint part you need linked lists and hashing.

Linked lists are explained at the first project day. If you don't know about linked lists you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli, second edition, chapter 16.1 - 16.2. Alternatively take a look at this homepage. You can also get ListProgram.java which contain a small example of the use of linked lists. A real good way to learn about linked lists is by experimenting and modifying this program.

As supplementary material on hashing we recommend the material that can be found at this homepage.

Extra assignments

The following is suggestions for assignments to do after the joint part. The ordering of these suggestions is random.

About the use of Java

In this part you can find some usefull information on the use of Java. For further and more basic information refer to Introductory programming. Furthermore there is a link to the Java API specification.

When running SearchCmd one needs to pas parameters to the main method. Furthermore it is necessary to use so-called jar-files for many of the extra assignments. In the following section it is described how to do these two things in different Java developing environments.

In general it can be mentioned that jar-files contains functionality apart from the one given by the standard implementation. A jar file is necessary both when compiling and executing a program that use the extra functionality. All the jar-files that are refered to on this homepage can also be found in the folder:

  s:\sysadm

BlueJ

In BlueJ a program is executed by running the main method. A dialogobx appears where you have to provide the program parameters. Here for example you can write (beware of the use of double backslash):

  {"s:\\sysadm\\soegemaskine\\itcwww-small.txt"}

meaning that you pass the above string to the program (i.e. it is placed in args[0]

In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries. From here click Add and find the correct jar-file that is needed.

JCreator

If you use JCreator to translate and execute Java programs and encounters prooblems compiling and running the programs, enter the File menu and choose Close workspace each time JCreator is started.

You can pass parameters to a programs main-method in the following way: Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value Run application and press <default>. Then press Edit and choose Parameters. Parameters can then be entered in the field Main (...).

You an use a jar-file by selecting Configure in the menu, then select Options and finally JDK Profiles. Choose JDK version 1.3 and Edit. Press Add and choose Add package providing the jar-file to use.

Commandline tools

The classical way of compiling and running java programs is by using the so-called commandline programs. You can translate the program in SearchCmd.java in the following way:

  javac SearchCmd.java

Then running the program, using the small file, in the following way:

  java SearchCmd s:\sysadm\soegemaskine\itcwww-small.txt

If you want to use a jar-file it is done by adding it to the CLASSPATH. This is already done for the jar-files refered from this homepage.


Administrative information

Important dates

Signing up for the project

The easiest way to sign up for the project is to go to the presentation of the project on 10. of November, where you can get help to find a group and fill out the project agreement. Be aware that at least two participants are required in a group unless otherwise is agreed on with the teacher. We recommend at least 3 persons in a group.

This section describes in detail how to sign up for the project. These instruction are given for the english version of the system. Do the following steps:

  1. Go to the electronic project base https://adm.it-c.dk/. If you do not have a password get one by clicking on the "get password" link. Follow the instructions and await the password in your e-mail.
  2. After logging on click on Project Base.
  3. Click on "new project".
  4. In the field "Project agreement- common part" click "fill in " and fill in the fields as follows:
  5. Click "Save project agreement (common part)" and subsequently click "approve" in the project agreement - common part field.
  6. In the field "Participants - individual information " fill in the fields as follows:
  7. Click "Save individual information" and subsequently click "approve" under your individual information.
  8. Add each menber of the group by entering the name of him/her, clicking "add participant" and filling out the form as described in 6.
  9. In the field "Supervisor" enter Philip Bille, click "add supervisor", select "choose" to the left of Philip Bille and click "add supervisor".
  10. Finally, click "invite" on Philip Bille.
  11. The project agreement will be sent back to you ASAP by me. As soon as you recieve it follow the instructions to sent it to the study board. Note that you have to sent it to the study board before 17. of November.

 

Project exam

The time for examination will be given later.

When the projects are handed in an exam plan will be made ASAP

Exam plan

Each group must defend their report at an oral examination. Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.

Project days

During the project period there will be a number of days with lectures and discussion of most of the topics the in projects. Note that the ending time of each lecture/discussion is may vary depending on the need for discussion and questions posed by the audience.

Project groups

Group 1 Maisa Ibrahim El Rafie (IPBR, SU, IADS, DBS, OOP), Susanne Hoerby Christensen (IPBR, ITPO, DBS, OOP, IADS), Safuriat Oloruntoyin Johnson (IADS), Thomas Stuart Henney (GP, IADS).

Group 2 John Jensen (GP, OOP).

Group 3 Rune Hessing Russel (GP, DBD, ITPL), Louise Lahn Soerensen (GP, DBD, ITPL), Henrik Bie (GP, DBD, ITPL) Annika Hollesen Eirikstoft (GP, DBD, ITPL).

Group 4 Brigitte Susan Kammersgaard-Andersen (GP), Zhiqiang Sun (IAD), Yuqin Cao (IAD).

Group 5 Anders Lund (GP, DBD), Simon Lei Fredslund (GP, DBD, ITPO).

Group 6 Yan Yi (IPBR, IADS), Filip Skov Skjerning (GP, W2), Mehrdad Mahrouiyan (GP).

Group 7 Karin Agerbaek Jensen (GP), Helene Noergaard Bech (GP).

Group 8 Nicolai Ritz (IADS, DS), Deike Dreiseberg (OOP, IADS).

Group 9 Tru Dong Tran (IADS), Suraj Ramalingam (IADS, IPBR, NP), Priyadarsini Seetharaman (IADS, IPBR, NP).

Group 10 Jens Lykke Brandt (IADS).

Group 11 Louise Kjaer Brun Pedersen (GP), Danny Mortensen (GP).

All teachers and students

Teachers

Philip Bille (Teacher), email:beetle@itu.dk
Sebastian Jensen (Teaching assistant), email:seje@itu.dk

Teaching assistant plan

day time
26/11 (Sebastian is in Linux Lab) 9-16
28/11 (Sebastian is in Linux Lab) 9-16
2/12 (Sebastian is in room 3.15 or Linux Lab) 9-16
4/12 (Sebastian is in room 4.05) 9-16
9/12 (Sebastian is in room 4.05) 9-16
11/12(Sebastian is in room 4.05) 9-16
16/12(Sebastian is in room 4.05) 9-16
18/12(Sebastian is in room 4.05) 9-16