Crawler extra assignment

Preface

This challenge is about writing your own crawler, a program that downloads webpages from the internet. Basic knowledge of HTML is required to do this exercise.

About downloading homepages in Java

The file DownloadHtml.java contains an example of a Java program that downloads a homepage at a given URL and writes its content to the screen. You can build a crawler on this example by going through each downloaded page and extracting its words and its references to other pages.
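
As a rough illustration of this approach, the sketch below downloads a page and pulls out references and words with regular expressions. It is not the content of DownloadHtml.java: the class name SimpleLinkExtractor and its three methods are made up for this example, the page is assumed to be UTF-8, and the regular expressions are heuristics that will miss or misread some real-world HTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the assignment files.
class SimpleLinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag.
    // This is a heuristic, not a real HTML parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Downloads the page at the given address and returns it as one string.
    // Assumes the page is encoded as UTF-8.
    static String download(String address) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Returns the href values of all <a> tags, in order of appearance.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Replaces every tag with a space and splits the remaining text on whitespace.
    static List<String> extractWords(String html) {
        String text = html.replaceAll("<[^>]*>", " ");
        List<String> words = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

A crawler could call SimpleLinkExtractor.download(url) for a start URL, store the result of extractWords, and repeat the process for every address returned by extractLinks that it has not seen before.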

Parsing of homepages using jtidy

Another possibility is to find the logical structure of the downloaded homepages first, and then extract references and words from it. The logical structure of a homepage can be represented by a tree. For instance, take a look at this homepage:

<html>
  <head>
    <title>
      Title of the homepage
    </title>
  </head>
  <body>
    Here is something in
    <i>
      italics
    </i>
    and here is a
    <a href="http://www.it-c.dk">
      reference
    </a>
  </body>
</html>
(the content can be seen here). This homepage corresponds to the following tree:
                      <html>
                       / \
          /-----------/   \-------------\
          |                             |
        <head>                        <body>
          |                           / | | \
          |           /--------------/  | |  \------------------\
          |           |                 | |                     |
          |           |            /----/ \--\                  |
          |           |            |         |                  |
          |       "Here is        <i>  "and here is a"  <a href="http://www.it-c.dk">
          |    something in"       |                            |
          |                    "italics"                    "reference"
          |
       <title>
          |
"Title of the homepage"
There is a Java package, jtidy, which can take a homepage and create the tree for it. This process is called parsing the homepage. Errors can occur during parsing if the homepage is malformed, for example if <i> is closed with </a>. In such cases jtidy makes some guesses and small corrections in order to create a tree conforming to the HTML standard. You can browse the online documentation for jtidy and get the file Tidy.jar, which is needed to compile and run programs using jtidy. Further information on the use of jar files can be found at the homepage of the Search Engine Project.

You can get the example TidyExample.java, which prints what is found directly under the body tag of a homepage at a given URL. If it is run on the homepage above, the following is printed:
Text:Here is something in
Node:i
Text: and here is a
Node:a
  Attr name:href
  Attr value:http://www.it-c.dk
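
To give an idea of how such a listing can be produced from a tree, here is a small sketch that walks the children of the body node. It is not TidyExample.java and does not use jtidy: it uses the XML parser that ships with the JDK as a stand-in, so it only works on well-formed pages such as the one above and makes none of the corrections jtidy would. The class name BodyChildren and the method describeBody are made up for this example. jtidy's Tidy.parseDOM is understood to return an org.w3c.dom.Document as well, so a similar loop should carry over, but check the jtidy documentation before relying on that.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; a stand-in for a jtidy-based program.
class BodyChildren {
    // Returns one line per text node, element and attribute found
    // directly under the first <body> element of a well-formed page.
    static List<String> describeBody(String page) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(page)));
        } catch (Exception e) {
            // The JDK parser rejects malformed pages instead of repairing them.
            throw new IllegalArgumentException("Could not parse page", e);
        }

        List<String> lines = new ArrayList<>();
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return lines; // no body tag, nothing to report
        }

        NodeList children = body.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Surrounding whitespace is trimmed; whitespace-only text is skipped.
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    lines.add("Text:" + text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                lines.add("Node:" + child.getNodeName());
                NamedNodeMap attributes = child.getAttributes();
                for (int j = 0; j < attributes.getLength(); j++) {
                    lines.add("  Attr name:" + attributes.item(j).getNodeName());
                    lines.add("  Attr value:" + attributes.item(j).getNodeValue());
                }
            }
        }
        return lines;
    }
}
```

Because this sketch trims each text node, its second text line comes out as "Text:and here is a", without the leading space shown in the listing above.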

About multithreading

The crawler can also be programmed using multiple threads, so that more than one page can be crawled at a time. This might be a good idea, since the time between requesting a page and receiving it can be long: the request is sent to a server that can be far away and can have long response times as well. If multithreading is used, this waiting time can be spent parsing homepages and downloading other pages.
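
The idea can be sketched with a thread pool from java.util.concurrent. Everything in the sketch is an assumption made for illustration: the class name ParallelCrawler, the fetchLinks function (which stands for "download this page and return the references found on it"), and the maxPages limit are not part of the assignment. The sketch handles two issues that typically arise, namely sharing the set of visited pages between threads and knowing when all work is finished.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

// Hypothetical class; a sketch, not a complete crawler.
class ParallelCrawler {
    // Thread-safe set, so two threads never crawl the same address.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool;
    // Counts unfinished tasks; the one initial party is the caller of crawl().
    private final Phaser pending = new Phaser(1);
    // Assumed to download a page and return the addresses it refers to.
    private final Function<String, List<String>> fetchLinks;
    private final int maxPages;

    ParallelCrawler(int threads, int maxPages,
                    Function<String, List<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.maxPages = maxPages;
        this.fetchLinks = fetchLinks;
    }

    // Crawls from startUrl and blocks until no tasks are left.
    Set<String> crawl(String startUrl) {
        submit(startUrl);
        pending.arriveAndAwaitAdvance(); // returns once every task has finished
        pool.shutdown();
        return visited;
    }

    private void submit(String url) {
        // The size check is approximate: several threads may pass it at once,
        // so the limit can be exceeded by a few pages.
        if (visited.size() >= maxPages || !visited.add(url)) {
            return;
        }
        pending.register(); // one more unfinished task
        pool.execute(() -> {
            try {
                for (String link : fetchLinks.apply(url)) {
                    submit(link);
                }
            } catch (RuntimeException e) {
                // A page that cannot be fetched is skipped in this sketch.
            } finally {
                pending.arriveAndDeregister(); // this task is done
            }
        });
    }
}
```

While one thread waits for a server to respond, the other threads in the pool keep fetching and processing pages. A real fetchLinks would download and parse the page; for experiments it can simply look addresses up in a map, which makes the crawler easy to try without network access.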

There are several things to be aware of when writing a multithreaded crawler.