
THE UNIVERSITY OF THE GAMBIA

SENIOR PROJECT

WEB CRAWLER DOCUMENTATION

Written by:

Seedy Ahmed Jallow 2121210

Salieu Sallah 2112465

Landing Jatta 2121750


Table of Contents

INTRODUCTION
DESCRIPTION
THEORETICAL BACKGROUND
DOM PARSER
    Using A DOM Parser
SOFTWARE ANALYSIS
    Problem Definition
    Functional Requirement
    Non-Functional Requirements
    Target User
    Requirement Specification
    Acceptance Criteria
    System Assumption
    Relationship Description
    Structure of the website
SOFTWARE DESIGN
    System Development Environment
    System Development Languages
    Classes
        Main Class
        Web Crawler Class
SOFTWARE TESTING
BIBLIOGRAPHY AND REFERENCES


INTRODUCTION

This is an implementation of a web crawler using the Java programming language. The project is implemented fully from scratch, using a DOM parser to parse the XML files. The project takes a fully built XML website and recursively visits all the pages present in the website, searching for links, saving them in a hash table and later printing them. In other words, the web crawler fetches data from the already built XML site. Starting with an initial URL, which is not limited to the index page of the website, it crawls through all the pages of the website recursively.

A powerful related technique is to traverse the hierarchy and generate DOM events instead of outputting an XML document directly. Different content handlers can then be plugged in to do different things or to generate different versions of the XML.

The Internet has become a basic necessity, and without it life would be very difficult. With the help of the Internet, a person can get a huge amount of information on any topic. A person uses a search engine to get information about a topic of interest: the user enters a keyword, or sometimes a longer string, in the text field of a search engine to get the related information. The links for different web pages appear as a ranked list generated by the necessary processing in the system. This ranking is largely due to the indexing done inside the system in order to show the user relevant results containing exact information. The user clicks on the relevant link in the ranked list of web pages and navigates through the respective pages. Similarly, there is sometimes a need to get the text of a web page using a parser, and for this purpose many HTML parsers are available to extract the data as text. Once the tags are removed from a web page, some processing of the text is needed in order to index the words and obtain relevant results about the words and the set of data present in that page.

DESCRIPTION


A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in a software context) a Web scutter.

Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indexes of other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search much more efficiently.

Crawlers can validate hyperlinks and HTML/XML code. They can also be used for web scraping.

Web crawlers are a key component of web search engines, where they are used to collect the pages that are to be indexed. Crawlers have many applications beyond general search, for example in web data mining (e.g. Attributor, a service that mines the web for copyright violations, or ShopWiki, a price comparison service).

THEORETICAL BACKGROUND

Web crawlers are almost as old as the web itself. In the spring of 1993, just months after the release of NCSA Mosaic, Matthew Gray wrote the first web crawler, the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics about the growth of the web. A year later, David Eichmann wrote the first research paper containing a short description of a web crawler, the RBSE spider. Burner provided the first detailed description of the architecture of a web crawler, namely the original Internet Archive crawler. Brin and Page's seminal paper on the (early) architecture of the Google search engine contained a brief description of the Google crawler, which used a central database for coordinating the crawling. Conceptually, the algorithm executed by a web crawler is extremely simple: select a URL from a set of candidates, download the associated web pages, extract the URLs (hyperlinks) contained therein, and add those URLs that have not been encountered before to the candidate set. Indeed, it is quite possible to implement a simple functioning web crawler in a few lines of a high-level scripting language such as Perl.


However, building a web-scale web crawler imposes major engineering challenges, all of which are ultimately related to scale. In order to maintain a search engine corpus of, say, ten billion web pages in a reasonable state of freshness, say with pages being refreshed every 4 weeks on average, the crawler must download over 4,000 pages/second. In order to achieve this, the crawler must be distributed over multiple computers, and each crawling machine must pursue multiple downloads in parallel. But if a distributed and highly parallel web crawler were to issue many concurrent requests to a single web server, it would in all likelihood overload and crash that web server. Therefore, web crawlers need to implement politeness policies that rate-limit the amount of traffic directed to any particular web server (possibly informed by that server's observed responsiveness). There are many possible politeness policies; one that is particularly easy to implement is to disallow concurrent requests to the same web server, and a slightly more sophisticated policy is to wait for a time proportional to the last download time before contacting a given web server again. In some web crawler designs (e.g. the original Google crawler and PolyBot) the page downloading processes are distributed, while the major data structures – the set of discovered URLs and the set of URLs that have to be downloaded – are maintained by a single machine. This design is conceptually simple, but it does not scale indefinitely; eventually the central data structures become a bottleneck. The alternative is to partition the major data structures over the crawling machines.

The program starts by creating hash tables of Strings to store the attributes and the hyperlinks.

static Hashtable<String, String> openList = new Hashtable<String, String>();

static Hashtable<String, String> extList = new Hashtable<String, String>();

static Hashtable<String, String> closeList = new Hashtable<String, String>();

A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. In the context of this web crawler, it is used to map our key (a) to our value (href).
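As a minimal sketch of how such a table can be used to avoid revisiting pages: the helper name markVisited and the use of the URL string as both key and value are illustrative assumptions, not the program's exact code.

    // Sketch: record a URL in the given table the first time it is seen.
    // Returns true only on the first visit, so the caller knows whether to crawl it.
    static boolean markVisited(Hashtable<String, String> table, URL url) {
        String key = url.toString();
        if (table.containsKey(key)) {
            return false;      // already crawled: skip to avoid circular references
        }
        table.put(key, key);   // first visit: remember it
        return true;
    }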

After importing all the necessary files, we then parse the XML files into the DOM. The Document Object Model (DOM) is a programming interface for HTML, XML and SVG documents. It provides a structured representation of the document (a tree) and defines a way that the structure can be accessed from programs so that they can change the document structure, style and content. The DOM provides a representation of the document as a structured group of nodes and objects that have properties and methods. Nodes can also have event handlers attached to them; once an event is triggered, its event handlers are executed. Essentially, the DOM connects web pages to scripts or programming languages.

import java.io.File;

import java.net.MalformedURLException;

import java.net.URL;

import java.util.Hashtable;

import javax.xml.parsers.DocumentBuilderFactory;

import javax.xml.parsers.DocumentBuilder;

import org.w3c.dom.Document;

import org.w3c.dom.NodeList;

import org.w3c.dom.Node;

import org.w3c.dom.Element;

public static void parsePage(URL url) {

    String xmlPath = url.getFile();
    File xmlFile = new File(xmlPath);

    // ... the method continues by creating a document builder, parsing
    // xmlFile into a DOM Document and traversing its nodes for links,
    // as described in the sections that follow.
}

The Document Object Model (DOM) is a set of language-independent interfaces for programmatic access to the logical XML document. We use the Java DOM interfaces, which correspond to the DOM Level 1 interface as specified by the W3C. IBM's XML4J parser tracks the latest version of this specification as soon as it is available.

As we have learned, the structure of a well-formed XML document can be expressed logically as a tree. The single interface that encapsulates the structural connections between the XML constructs is called Node. The Node interface contains member functions that express these structural connections, such as Node#getChildNodes(), Node#getNextSibling() and Node#getParentNode().


The DOM interfaces also contain separate interfaces for XML's high-level constructs, and each of these interfaces extends Node. For example, there are interfaces for Element, Attribute, Comment, Text, and so on. Each of these has getter and setter functions for its own specific data. For example, the Attribute interface has Attribute#getName() and related member functions, and the Element interface has the means to get and set attributes via functions like Element#getAttributeNode(java.lang.String) and Element#setAttributeNode(Attribute).

Always remember that the various high-level interfaces such as Element, Attribute, Text, Comment, and so on, all extend Node. This means the structural member functions of Node (such as getNodeName()) are available to Element, Attribute and all the other types. An illustration of this is that any node, such as an Element or Text, knows what it is by re-implementing getNodeType(). This allows the programmer to query the type using Node#getNodeType() instead of Java's more expensive run-time instanceof check.

So, in Java you can write a simple recursive function to traverse a DOM tree:
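A minimal sketch of such a function is shown below; the class name TreeWalk and the println are illustrative, and any per-node action could be substituted.

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TreeWalk {

    // Recursively visit a DOM node and every one of its descendants.
    static void traverse(Node node) {
        System.out.println(node.getNodeName());    // visit the current node
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            traverse(children.item(i));            // recurse into each child
        }
    }
}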

The root of the DOM tree is the Document interface. We have waited until now to introduce it because it serves multiple purposes. First, it represents the whole document and contains the methods by which you can get to the global document information and the root Element.

Second, it serves as a general constructor or factory for all XML types, providing methods to create the various constructs of an XML document. If an XML parser gives you a DOM Document reference, you may still invoke the create methods to build more DOM nodes and use appendChild() and other functions to add them to the document node or to other nodes. If the client programmer changes, adds or removes nodes from the DOM tree, there is no DOM requirement to check validity; this burden is left to the programmer (with possible help from the specific DOM or parser implementation).

The complicated part is processing the parsed content. Once you know the contents of the XML document, you might want to, for example, generate a Web page, create a purchase order, or build a pie chart. Considering the infinite range of data that could be contained in an XML document, the task of writing an application that correctly processes any potential input is intimidating. Fortunately, the common XML parsing tools discussed here make the task much, much simpler.


DOM PARSER

The XML Parser for Java provides a way for your applications to work with XML data on the Web. The XML Parser provides classes for parsing, generating, manipulating, and validating XML documents. You can include the XML Parser in Business-to-Business (B2B) and other applications that manage XML documents, work with metacontent, interface with databases, and exchange messages and data. The XML Parser is written entirely in Java, and conforms to the XML 1.0 Recommendation and associated standards, such as the Document Object Model (DOM) 1.0, the Simple API for XML (SAX) 1.0, and the XML Namespaces Recommendation.

DOM implementations

The Document Object Model is an application programmer's interface to XML data. XML parsers produce a DOM representation of the parsed XML, and your application uses the methods defined by the DOM to access and manipulate the parsed XML. The IBM XML Parser provides two DOM implementations:

– Standard DOM: provides the standard DOM Level 1 API, and is highly tuned for performance.
– TX Compatibility DOM: provides a large number of features not provided by the standard DOM API, and is not tuned for performance.

You choose the DOM implementation you need for your application when you write your code. You cannot, however, use both DOMs in the XML Parser at the same time. In the XML Parser, the DOM API is implemented using the SAX API.

Modular design

The XML Parser has a modular architecture. This means that you can customize the XML Parser in a variety of different ways, including the following:

– Construct different types of parsers using the classes provided, including validating and non-validating SAX parsers, validating and non-validating DOM parsers, and validating and non-validating TXDOM parsers. To see all the classes for the XML Parser, look in the XML Parser for Java project and the com.ibm.xml.parsers package.
– Specify two catalog file formats: the SGML Open catalog and the X-Catalog format.
– Replace the DTD-based validator with a validator based on some other method, such as the Document Content Description (DCD), Schema for Object-Oriented XML (SOX), or Document Definition Markup Language (DDML) proposals under consideration by the World Wide Web Consortium (W3C).

Constructing a parser with only the features your application needs reduces the number of class files or the size of the JAR file you need.

Constructing a parser

You construct a parser by instantiating one of the classes in the com.ibm.xml.parsers package. You can instantiate the classes in one of the following ways:

– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class

Samples

The following sample programs are provided in the IBM XML Parser for Java Examples project. They demonstrate the features of the XML Parser using the SAX and DOM APIs:

– SAXWriter and DOMWriter: parse a file, and print out the file in XML format.
– SAXCount and DOMCount: parse your input file, and output the total parse time along with counts of elements, attributes, text characters, and ignorable white space characters. SAXCount and DOMCount also display any errors or warnings that occurred during the parse.
– DOMFilter: searches for specific elements in your XML document.
– TreeViewer: displays the input XML file in a graphical tree-style interface. It also highlights lines that have validation errors or are not well-formed.

Creating a DOM parser

You can construct a parser in your application in one of the following ways:

– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class

To create a DOM parser, use one of the methods listed above, and specify com.ibm.xml.parsers.DOMParser to get a validating parser, or com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser. To access the DOM tree, your application can call the getDocument() method on the parser.


Using A DOM Parser

import com.ibm.xml.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

// Constructing the parser by instantiating a parser object,
// in this case from DOMParser
public class example2 {

    static public void main( String[] argv ) {

        String xmlFile = "file:///xml_document_to_parse";
        DOMParser parser = new DOMParser();

        try {
            parser.parse(xmlFile);
        } catch (SAXException se) {
            se.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }

        // The next lines are only for DOM parsers
        Document doc = parser.getDocument();
        if ( doc != null ) {
            try {
                // use the print method from dom.DOMWriter
                (new dom.DOMWriter( false )).print( doc );
            } catch ( UnsupportedEncodingException ex ) {
                ex.printStackTrace();
            }
        }
    }
}


SOFTWARE ANALYSIS

Problem Definition

For our senior project, we were asked to write a search engine program that lists all the pages present in a particular off-line website, as well as all the external links that are reachable from one of the internal pages.

Search engines consist of many features such as web crawling, word extraction, indexing, ranking, searching, search querying, etc. In this project I am concentrating only on crawling through the website, indexing the pages and outputting them, as well as the external links that are reachable through one of the internal pages.

Functional Requirement

Functional requirements describe the functional modules to be produced by the proposed system. The only functional module for this system is web crawling. The crawler takes the index page of the website as input. It then scans through all the elements on the page, extracting the hyperlink references to other pages and storing them in a list to be scanned later. The crawler scans through the pages recursively, storing all scanned pages in a hash table so that circular references are handled.

Non-Functional Requirements

This program is not meant to be an end-user program, so very little emphasis is placed on the user interface. As a result, no user interface was developed; input and output are through the terminal. It is also worth noting that this is not a professional product, so issues such as product security are not considered.


Target User

The target users of this program, aside from the project instructor and supervisor (obviously), are the general programming community who want to see a very basic implementation of a search engine. They are allowed to use, reuse and share my code as long as I am credited for it.

Requirement Specification


The logical model is a data flow diagram overview showing the processes required for the proposed system. Details of the processes are explained in the physical design below.

Process Description

Input – The index page of the website that is to be crawled is inputted by the user.

Create URL – Creates a URL from the path of a file.

Parse Page – Creates document builders to break down the structure of the page into a tree of nodes, and traverses the nodes to collect hyperlink references to other pages.

Save Links – Stores the hyperlink references in a list and provides links to the crawler.

Internal Links – Gets all the URLs whose references are internal pages of the website, in other words those with the "file" protocol.

External Links – Gets all the URLs that reference pages external to the website, in other words those with the "http" protocol.

Save in Table – Stores all the links in their respective hash tables.

Html Page – Checks whether the URL references a valid HTML page and not an image, a port, etc.

Print – Outputs the URLs.

Acceptance Criteria

On the day of completion of the project, all the features explained above will be provided, mainly web crawling.


System Assumption

The crawler performs URL search only; all of its processes are made to handle URL information. It does not handle other kinds of searches, such as image searching, and the results for other kinds of input are unexpected, though supplying anything other than URLs will not lead to system errors. It is also assumed that the user is well versed with command-line input and output and the other command-line operations necessary to run the program.

Relationship Description

Each page has many links, so the relationship between pages and links is one-to-many.

Structure of the website

Federn is the name of the website to be crawled. The website contains information on feathers: there are hundreds of feathers whose description and identification are given in this website. The website is available in three languages, German, English and French. Each page of the website has a link tab at the top of the page. That tab contains links to the home page, feathers, identification, news, bibliography, services and help.

The home page of Federn contains a description of the idea behind the website, the authors, the concept behind the project and an acknowledgement of the contributions of others to the development of the website. As you can see, it contains a lot of links referring to other pages; all of them, though, are internal links.

The feathers page lists all the feathers that are identified and described in this website. The list of feathers is arranged in two formats: according to Genus and Family names on one side of the page, and in alphabetical order on the other. Each feather name is a link to the page containing the description of the feather and scanned images.

The identification page contains an image of the feather together with a picture of a bird which has that type of feather. It also contains detailed descriptions of the different types, colors and shapes of the feather and of its main function in flight and temperature regulation.


There is a news tab that contains any new information found on the feathers or any discovery made about feathers.

The bibliography page contains links and resources from which the information on this website was gathered. The site also contains service and help pages.

As you can see, this is a huge website. Each page, aside from the main index page, has three copies of itself in three different languages.

SOFTWARE DESIGN

System Development Environment

System Development Languages

The only language used in the development of this program is Java. Java is a highly dynamic language, and it contains most of the functionality that was needed in the development of the program.


The Java IO package allowed me to utilize the File class, which I used to create file objects that are fed to the parser to parse the pages of the website. I created the file objects using the absolute file paths extracted from the URLs.

The Java NET package contains the classes used to create URLs from file paths. A URL can be created, as in my case, by passing the absolute file path of the parent page and the name of the page being processed. The use of URLs in my program is crucial, because I needed to check the locations referenced by the URLs being processed, to make sure whether a URL refers to a page that is local to the website being crawled, in other words stored in the file system. You can check the protocol of a URL by using the getProtocol() method. If it returns "file", the page referenced by the URL is local to the file system; if it returns "http", the URL refers to a page outside the website being crawled.

Getting the base URI:

Creating a URL:

Checking the protocol of a URL:
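A rough sketch of these three operations, assuming an a-element taken from the parsed document; the surrounding class and method are illustrative, while getBaseURI(), the two-argument URL constructor and getProtocol() are standard DOM and java.net calls:

import java.net.MalformedURLException;
import java.net.URL;
import org.w3c.dom.Element;

class UrlHelper {

    // Build the absolute URL of a link found in an a-element and report
    // whether it points inside or outside the crawled website.
    static void classify(Element anchor) throws MalformedURLException {
        // Getting the base URI of the document that contains the element.
        String baseUri = anchor.getBaseURI();

        // Creating a URL: the parent page's URI gives the context,
        // the href attribute gives the page name.
        URL link = new URL(new URL(baseUri), anchor.getAttribute("href"));

        // Checking the protocol of the URL. Note that getProtocol() returns
        // just the scheme name ("file" or "http"), without the "://".
        if (link.getProtocol().equals("file")) {
            System.out.println("internal page: " + link);
        } else if (link.getProtocol().equals("http")) {
            System.out.println("external page: " + link);
        }
    }
}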


The Java util package provides a structure called Hashtable. Hash tables are used to store data objects, in this case URLs. I created two instances of the Hashtable class: one to store URLs found on the website being crawled that refer to pages internal to the website, and one to store URLs that refer to pages outside it. Though there are other storage options that could be used, such as MySQL, array lists and arrays, the hash table was chosen because, unlike MySQL, it is very simple to implement and use, and unlike array lists, it is faster at storing, searching and retrieving data, which is very important considering that thousands of URLs can be stored and searched through over and over again.

Creating hash tables to store internal and external links
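For reference, a sketch matching the declarations shown earlier, with one table for each kind of link:

static Hashtable<String, String> openList = new Hashtable<String, String>(); // internal links
static Hashtable<String, String> extList  = new Hashtable<String, String>(); // external links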

The Java library also has a very important package, which is by far the most important tool used in my program: the XML parser package. This package contains the document builder factory, which is used to create document builders containing the parsers we use to break down the pages. The parser parses the content of the file it is fed as an XML document and returns a new DOM Document object. This package also contains methods which validate the XML documents and verify whether the documents are well formed.

Getting the document builder factory and document builder, and parsing an XML document into a DOM Document object:
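A minimal sketch of that step; the class and method names are illustrative:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParseSketch {

    // Obtain the factory, create a document builder from it, and parse
    // the XML file into a DOM Document object.
    static Document parseToDom(File xmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        doc.getDocumentElement().normalize();   // normalize the document tree
        return doc;
    }
}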


It is very important that the parser does not validate the XML pages, because validation would require Internet access and, as you already know, the program crawls off-line web pages. If the parser attempts validation, it leads to system errors.

Turning off the validating features of the parser:
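A sketch of how the factory can be configured so that no validation is attempted. setValidating() is standard JAXP; the load-external-dtd feature URI is specific to Xerces-based parsers and is an assumption here:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class NoValidation {

    static DocumentBuilderFactory nonValidatingFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);        // do not validate against a DTD or schema
        factory.setNamespaceAware(false);
        // Xerces-specific switch: do not fetch external DTDs over the network
        // (assumed feature URI; other parsers may use a different setting).
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory;
    }
}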

The external DOM package (org.w3c.dom) is used to create the documents that store the content of the parsed pages. A document contains elements in a tree-like structure, with each element corresponding to a node of the tree. Traversing the tree with any appropriate traversal method, all the nodes containing a-element tags are collected and stored in a list of nodes. Looping through that list of nodes, one is able to extract all the a-element tags containing hyperlink references.
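A minimal sketch of that extraction loop; the names are illustrative:

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LinkExtraction {

    // Collect and print the href value of every a-element in the parsed document.
    static void printHrefs(Document doc) {
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                System.out.println(a.getAttribute("href"));
            }
        }
    }
}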

Classes

In my implementation of the program, I used only two classes. The first class contains the main method, while the second class contains the main implementation of the web crawler.

Main Class

The main class contains the main method. The main method prompts the user to enter the absolute file path of the index page of the website to be crawled. When the user complies, the path is converted to a URL object and stored in the hash table containing internal links. The main method also contains the first call of the recursive method processPage(URL url).

Structure of the main method:
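A sketch of what that structure might look like. The class names Main and WebCrawler, the use of a Scanner for the prompt and the prompt text itself are assumptions that follow the snippets shown earlier:

import java.net.URL;
import java.util.Scanner;

public class Main {

    public static void main(String[] args) throws Exception {
        // Prompt for the absolute file path of the index page.
        Scanner in = new Scanner(System.in);
        System.out.print("Enter the absolute path of the index page: ");
        String path = in.nextLine();

        // Convert the path to a URL object and record it as the first internal link.
        URL index = new URL("file://" + path);
        WebCrawler.openList.put(index.toString(), index.toString());

        // First call of the recursive method.
        WebCrawler.processPage(index);
    }
}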


Web Crawler Class

This class contains 80% of the implementation. It has only one method definition, that of the recursive method processPage(). At the beginning of the class, the hash tables are declared, followed by the definition of the processPage() method.

The processPage() method takes a single parameter, the URL object that is passed in. Inside the method, the absolute path of the URL is extracted and a file object is created from it. A document builder object is created from the declaration and initialization of the document builder factory in the preceding lines of code. The method also contains the code snippet that makes sure the parser does not validate the XML pages. The parser is then called to parse the XML document, and the root element of the DOM document is normalized. Thereafter, the root element of the document is extracted and the traversal of the nodes of the document begins. All a-element tags are selected and stored in a list of nodes. The nodes are then looped through, and for every a-element tag containing an "href" attribute, the value of the attribute is extracted and a URL is created for the page that the href references. As explained before, the URL is created from the base URL of the parent file of the page being referenced and the name of that page.

The protocol of the URL is then checked. If it is "file", the program proceeds to check that the URL refers to an actual page and not to an image, a port, etc. It then makes sure that the URL is not already stored in the hash table containing links to internal pages of the website. If it is already stored, the link is discarded and the next link in the node list is processed. If it is not, the URL is stored and that page is processed for more URLs.

If the protocol check returns "http", the program checks whether the URL is already stored in the hash table containing links to external pages. If it is, the URL is discarded and the next link in the list is processed. If not, the URL is stored in the table and printed to the screen.
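Putting the pieces together, a condensed sketch of what processPage() might look like is given below. It follows the description in this section, but the names, the check for an actual HTML page and the printed output are assumptions rather than the program's exact code:

import java.io.File;
import java.net.URL;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class WebCrawler {

    // Hash tables declared at the beginning of the class.
    static Hashtable<String, String> openList = new Hashtable<String, String>(); // internal pages
    static Hashtable<String, String> extList  = new Hashtable<String, String>(); // external pages

    public static void processPage(URL url) throws Exception {
        // Create a file object from the absolute path of the URL and parse it.
        File xmlFile = new File(url.getFile());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);                     // off-line site: no validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        doc.getDocumentElement().normalize();

        // Collect every a-element and loop through the list of nodes.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (!a.hasAttribute("href")) {
                continue;
            }
            // Build an absolute URL from the parent page and the href value.
            URL link = new URL(url, a.getAttribute("href"));
            String key = link.toString();

            if (link.getProtocol().equals("file")) {
                // Internal link: crawl it only if it is a page we have not seen before.
                if (isHtmlPage(key) && !openList.containsKey(key)) {
                    openList.put(key, key);
                    System.out.println("internal: " + key);
                    processPage(link);                    // recurse into the new page
                }
            } else if (link.getProtocol().equals("http")) {
                // External link: record and print it once, but do not crawl it.
                if (!extList.containsKey(key)) {
                    extList.put(key, key);
                    System.out.println("external: " + key);
                }
            }
        }
    }

    // Assumed check that the target is an actual page and not an image, a port, etc.
    static boolean isHtmlPage(String s) {
        return s.endsWith(".html") || s.endsWith(".htm") || s.endsWith(".xml");
    }
}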


SOFTWARE TESTING

During the testing of the program, many problems were encountered. One of the first problems we had during the initial tests was with the validation performed by the parser. It is standard that all XML documents are checked to see whether they are well formed and valid.

The website we are crawling is, as you already know, off-line, and if the parser tries to validate it, errors occur because it needs to connect to the Internet to perform the checks.


To solve this, we set all the features of the document builder factory that could trigger validation of the XML documents to false, as shown earlier.

Another problem we encountered during the implementation was how to get absolute paths from the relative paths of the pages found on each page that had already been crawled. All that the crawler returned were the names of the files referenced from the page being crawled. What we did was get the base URI of the file being crawled, which returns its absolute file path, and attach to it the names of the pages found in that file. That way we were able to create a URL for every link and process it.

Aside from the problems mentioned above, the program passed the final tests without any major bugs, bringing us successfully to the end of the implementation. Although it was not an easy ride, it was worth every bit of effort we invested in it. Below are terminal images showing the compilation and running of the program and the results, i.e. the links on the website being crawled. No graphical interface was developed, so the terminal is used.

Command to compile the Program:

Running the Program:

Prompt and input of Index page:
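For example, assuming the two source files are named Main.java and WebCrawler.java (the file names are an assumption), compilation and running from the terminal would look roughly like this, after which the program prompts for the absolute path of the index page:

javac Main.java WebCrawler.java
java Main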

The program ran smoothly and proceeded to print out all the links that were found on the website, labelling them internal or external depending on where they reference and the protocol they contain.


BIBLIOGRAPHY AND REFERENCES

[HREF1] What is a "Web Crawler"? (http://research.compaq.com/SRC/mercator/faq.html)

[HREF2] Inverted index (http://burks.brighton.ac.uk/burks/foldoc/86/59.htm)

[MARC] Marckini, Fredrick. Secrets to Making Your Internet Web Pages Achieve Top Rankings (ResponseDirect.com, Inc., c1999)

http://en.wikipedia.org/wiki/Web_crawler

http://research.microsoft.com/pubs/102936/eds-webcrawlerarchitecture.pdf