
Laboratory exercises for EIT031 WEB Intelligence · 2009-03-11


Laboratory exercises for

EIT031 WEB Intelligence

Anders Ardö

Department of Information technology

Lund Institute of Technology

April 1, 2008

Contents

1 HTML, CSS, HTTP, PHP, CGI, Cookies, XML, XSLT

2 XSLT, Web services, searching, user interface

3 Web crawling, focused crawling, character encoding, feature extraction

4 Computational Intelligence: Feedforward Neural Networks

5 Computational Intelligence: Preparing Neural Networks for Data Mining

6 Indexing, searching, ranking, integration


1 HTML, CSS, HTTP, PHP, CGI, Cookies, XML, XSLT

1.1 Objectives

The purpose of this lab exercise is to improve your understanding of the basic architecture of the World Wide Web (WWW). After the exercise you should be able to create Web pages, apply HTTP to retrieve header lines and entire Web pages, write simple PHP scripts, and use cookies. You should also understand how server pages work. The lab should also improve your understanding of major standards of the WWW related to this course, including XML and transformations of XML documents using XSLT.

1.2 Literature

In order to be able to solve the exercises, you could use the following resources:

• A basic introduction into the WWW. http://en.wikipedia.org/wiki/World_Wide_Web

• HTML, XHTML, CSS

  – HTML: http://www.w3schools.com/html/default.asp and http://www.htmlcodetutorial.com/

  – CSS: http://www.w3schools.com/css/

  – XHTML: http://www.w3schools.com/xhtml/

• HTTP

  – HTTP made really easy: http://www.jmarshall.com/easy/http/

  – Uniform Resource Locator - URL: http://en.wikipedia.org/wiki/URL

  – HTTP session management, including cookies: http://www.it.lth.se/courses/will/monash_session.html

  – Lab 6 Internet inuti - HTTP-server: http://www.it.lth.se/Internet_inuti/laborationer/lab6/project6_descr.html

• PHP

  – PHP tutorial: http://www.w3schools.com/php/

  – PHP manual: http://www.php.net/manual/en/

• XML, XSLT

  – XML tutorial: http://www.w3schools.com/xml/

  – XSLT tutorial: http://www.w3schools.com/xsl/

  – XSL Transformations (XSLT): http://www.w3.org/TR/xslt

• Regular expressions: http://www.regular-expressions.info/


1.3 Home assignments

Questions in boldface are especially important. Answer the following questions:

• What does 'stateless' mean?

• What is a URL and how is it constructed?

• How is an HTML page sectioned?

• What are the major elements of the 'head' part of a Web page?

• What are metatags? How do you see metadata in a browser?

• Construct a few HTML Meta-tag lines (assignment 1.5.4) on paper.

• What are the benefits of using XHTML instead of HTML?

• In CSS, what is a 'selector', a 'property', and a 'value'?

• How do you add a special style to an element when you 'mouse' over it?

• Why is XML Namespace important?

• What is XSLT used for?

• Describe the following HTTP commands: GET, HEAD, POST.

• Which are major HTTP return codes and what do they mean?

• What are cookies and what can they be used for?

• Go through the basics of PHP and study the code example in 1.5.10. What is the effect of the parameter 'H:i:s' to the function 'date'?

• Study the following XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<collection>
  <field id="prodName">
    <value>Photograph of New York</value>
  </field>
  <field id="prodAuthor">
    <value>Bruce Fairfield</value>
  </field>
  <field id="price">
    <value>$300.00</value>
  </field>
</collection>

– Write an XSLT file to extract the name of the product and its author.

– Modify the XSLT file so that the extracted elements are formatted into a simple HTML file with each element on a new line.
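As a starting point (a skeleton only, not the full answer), an XSLT stylesheet that pulls out a single field of the XML document above could look like this; the match and select expressions are one possible choice:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- print the value of the field whose id is "prodName" -->
  <xsl:template match="/collection">
    <xsl:value-of select="field[@id='prodName']/value"/>
  </xsl:template>
</xsl:stylesheet>
```

Extending the template with more xsl:value-of elements, and switching xsl:output to method="html", leads toward the two sub-tasks.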

4

Page 5: Laboratory exercises for EIT031 WEB Intelligence · 2009. 3. 11. · Laboratory exercises for EIT031 WEB Intelligence Anders Ardö Department of Information technology Lund Institute

• Learn about the following PHP concepts (e.g. at http://www.php.net/manual/):

� import_request_variables

� �le_get_contents

� XSL, XSLTprocessor

• Look through section 1.4.1 on regular expressions below.

• Explain the regexp on extracting ISBNs: /isbn:?\s*([\d-x]+)/i

(ISBN stands for International Standard Book Number. You can look up the formats it can take in Wikipedia.)

1.4 Tools

All groups will get their own home account WebXX, where XX is your group number. You will get a password for this at the first lab. Change the password immediately! This account will be yours for the duration of the course and is used for the final project.

In the home account U:\ there are directories used by the Apache Web server (htdocs), directories used by the database system (db), and some examples and test data in CodeExample, Examples, and testdata.

The Apache Web server must be started manually from a Command Prompt window.

PHP and Java are installed and supported. If you would like to use another programming language you are free to do so, but you are on your own from installation to debugging!

Make sure that you save your work between labs - it will be reused later!

1.4.1 Regular Expressions

A regular expression, or regex for short, is a pattern describing a piece of text. It is used by many search programs such as grep. Many programming languages incorporate regular expressions, e.g. Perl and PHP. A regex is a very powerful way of extracting information (pieces of text) from a large document. In a regular expression each character matches itself, except for the special characters +?.*^$()[{|\ (and '/' if used as delimiter), which can be escaped with a '\'.

special pattern   meaning
\w                matches alphanumerics
\W                matches non-alphanumerics
\s                matches whitespace
\S                matches non-whitespace
\d                matches numerics
\D                matches non-numerics
\b                matches word boundaries
\B                matches non-boundaries


special character   meaning
+                   matches the preceding expression one or more times
*                   matches the preceding expression zero or more times
?                   matches the preceding expression zero or one time
.                   matches any character
^                   matches the beginning
$                   matches end of line
()                  groups patterns
[...]               matches any of the listed characters
[^...]              matches any character except the listed
{n,m}               denotes minimum and maximum match counts
... | ... | ...     matches alternatives
\                   escape character

Patterns are normally enclosed using '/' characters, which means that this character has to be escaped and written as '\/'.

• /Heja/ matches the string 'Heja'

• /Heja?/ matches the strings 'Hej' and 'Heja'

• /^http:/ matches all lines that begin with 'http:'

• /\bFred\b/ matches 'Fred' but not 'Fredrick'

• /(\d+):(\d+):(\d+)/ matches for example times like 12:30:01 and gets hours in group 1, minutes in group 2, and seconds in group 3.

• /http:\/\/([^\/]+)(\/[^\s]+)\s/ matches URLs (server in group 1 and path in group 2).

Example use in PHP of a pattern that matches URLs in hyperlinks from $text and places them in group 1 in $result.

$pattern = '/<a\s+href="([^"]+)">.+<\/a>/i';
preg_match_all($pattern, $text, $result);
foreach ($result[1] as $j => $hit) {
    # process result group 1, hit no $j
}

• Look up the documentation of preg_match_all and make sure you understand what the above code does - you will find it useful later!

• What does the 'i' modifier after the pattern signify?

• What is the difference between preg_match and preg_match_all?

1.4.2 XSLT processor

You can test your XSLT style sheets by running them using 'xsltproc' in a Command Prompt window. Running it without parameters will give some basic help.


1.4.3 XMLWrench

Another tool you can use is XMLWrench, which has a graphical interface. It contains tools for validating XML files and transforming them using XSLT.

1.5 Lab assignments

1.5.1 Login and Change your password

I repeat: Change your password!!! (by pressing Ctrl+Alt+Del). This account will be yours for the duration of the course and is used for the final project. Programs developed in one lab will be used in later labs as well as in your final project.

1.5.2 Start the Apache Web-server

Give the command apache in a Command Prompt window. Wait and check that no error messages are displayed, then minimize that window for the remainder of the session. This will start Apache listening on port 80 for HTTP requests. Apache uses U:\htdocs as the main document root for the Web server, which means that a URL like http://localhost/ will correspond to, and list the contents of, your directory U:\htdocs\.

1.5.3 Web site creation

Create a Web site in your directory U:\htdocs consisting of (at least) two different Web pages (linking to each other), using either XHTML or HTML. The Web site pages should contain at least the following elements: headings, paragraphs, bold text, lists, and both internal (bookmark inside a Web page) and external hyperlinks with anchor text.

• What are the URLs of your pages?

• Check that you can access the pages of another lab group nearby. Lab machines are named fox-17.it.lth.se to fox-32.it.lth.se.

1.5.4 Metadata

• What are metadata good for?

• Add the following metadata elements to a Web page you created: keywords, description,and author.

1.5.5 HTML forms

• Add a form to a Web page you created.

• What kinds of control types exist in forms?

• What is 'hidden' used for?


Figure 1: Suggested Web page design - three stacked boxes:

- Top box: should contain some title text and use a picture as the background.
- Navigation box: should use a light yellow background and have a few links.
- Content box: just text, with some links.

1.5.6 Cascading Style Sheets - CSS

• Design a presentation style for your Web pages using an external CSS file. Use a presentation style with three boxes as suggested in Figure 1. Use <div> tags with different classes to implement the boxes.

• Modify the Web pages to use this presentation style and verify that it works.

• Change the style so that links get enhanced (bold, color and/or background) when the mouse pointer hovers over them.

• Modify your style so that link enhancement only occurs in the navigation box.

1.5.7 HTML and CSS validation

• Validate your HTML pages. Use the HTML validator at http://validator.w3.org/. Here you can upload your page as a file, and automatically get comments and errors, if any. Make sure there are no errors or comments on your pages.

• Validate your CSS code. Use the CSS validator at http://jigsaw.w3.org/css-validator/.

1.5.8 HTTP

Using putty (enable 'protocol raw' and 'never close window'), test HTTP interaction with a WWW server (for example your own): first retrieve only the HTTP header lines, and then the entire Web page.

• Which port should you use?

• What does the �rst line of the returned text mean?

• Why is the connection terminated after one request?
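What you type into the raw connection can be sketched as follows; this snippet only builds the request lines a client would write to the socket (the host name is a placeholder):

```java
public class HttpRequestSketch {
    // Build a minimal HTTP/1.1 request; "HEAD" retrieves headers only, "GET" the full page.
    static String buildRequest(String method, String path, String host) {
        return method + " " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"   // ask the server to close after one response
             + "\r\n";                   // blank line ends the header section
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("HEAD", "/", "localhost"));
    }
}
```

Typing exactly these lines (including the final blank line) into the putty window produces the same exchange.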

1.5.9 Dynamic pages

There are a variety of techniques by which a Web server can generate content dynamically or "on-the-fly" by feeding the output of programs back to the remote browser.


On-the-fly content generation is usually, but not always, associated with a response to some data previously obtained from the browser via the Common Gateway Interface (CGI) mechanism.

On-the-fly content generation inevitably requires some sort of program to generate HTML. There are numerous possible ways of doing this. In this lab we will look at server pages with PHP.

1.5.10 PHP server pages

PHP uses special mark-up within otherwise normal HTML pages. The special mark-up is interpreted as a program by an Apache plug-in and the generated output is inserted into the normal HTML page at that place.

A very simple 'Hi there' page with the wall clock time looks like:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Hi there</title>
</head>
<body bgcolor="#ffffee" text="#000000">
<h1>Hi there</h1>
The time is: <b>
<?php
print date('H:i:s');
?>
</b>
</body>
</html>

Notice the part in the tags <?php ... ?>, which is the PHP program generating the dynamic content.

• Copy the example and try it out. Which extension did you use for the filename?

• Write 2 simple PHP scripts to add two numbers: one using GET, and the second one using POST.

For your inspiration, here is a simple example of a PHP script that uses an HTML form with one parameter.

<html>
<head><title>Simple form</title></head>
<body>
<?php
$yname = $_GET['yname'];
if ( $yname == '' ) {
?>
<form action="simpleForm.phtml" method="get">
Your name:
<input type="text" name="yname" size="25" value="">
<input type="submit" value="Hi">
</form>
<?php
} else { echo "Hi there $yname\n"; }
?>
</body>
</html>

• How are parameters transferred (encoded) in these two cases? What are the differences?

• Write two simple PHP scripts to add a number to a sum that is kept between the script invocations: one with cookies, and one without cookies.

Cookies can be initialized before the '<html>'-tag as shown below:

<?php
if (!isset($_COOKIE["kaka"])) {
    setcookie("kaka", 0, time()+3600);
}
?>
<html>
<head>
...

• What is the difference between using and not using cookies, and what are the advantages and disadvantages of each approach?

• Implement a counter that shows, on the page, how many times the page has been accessed. Have your neighbors test it.

– Why can't you do it in HTML with just a cookie or a 'hidden' control type?

1.5.11 Simple Web-mining

• Check that you can access pages from the other lab groups (as in assignment 1.5.3). If you use a URL like http://fox-21.it.lth.se/ you should see a directory listing from that server in your browser.

• Write a script that dynamically searches all the other lab groups' top directory listings on their Web servers (fox-17.it.lth.se through fox-32.it.lth.se; assume that they are all available) and produces a clickable list with links to the pages in the top directory of all active servers. You can extract the relevant URLs with a regexp.

Use, for example, the PHP function file_get_contents to get directory listings from the other Web servers.
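The link-extraction step can be sketched like this (the sample listing is made up; the pattern mirrors the preg_match_all example in section 1.4.1):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Same idea as the PHP preg_match_all example: capture the href value of each anchor tag.
    static final Pattern LINK = Pattern.compile("<a\\s+href=\"([^\"]+)\">.+?</a>",
                                                Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));   // group 1 is the URL inside href="..."
        }
        return urls;
    }

    public static void main(String[] args) {
        String listing = "<a href=\"/index.html\">index</a> <a href=\"/lab1/\">lab1</a>";
        System.out.println(extract(listing));
    }
}
```

The same pattern applied to each server's directory listing yields the URLs for the clickable list.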


1.5.12 XML style sheet transformations - XSLT

• Test your XSLT style sheets from the home assignment. The XML example is in the file U:\testdata\coll.xml

• Look at the record example at U:\testdata\rec1.xml. It consists of some XML describing a Web page. Write an XSLT style sheet to extract title, URL, author (dc:creator) and links from the record to plain text. Test it.

• Test the same transformations using Java. Look at the example U:\CodeExample\JavaXSLT\.

• Test your transformations on the files U:\testdata\rec3.xml and U:\testdata\rec50.xml as well.
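If you want to see the bare mechanics before looking at the JavaXSLT example, the JDK's built-in javax.xml.transform API is enough for a minimal sketch (the inline XML and stylesheet here are toy stand-ins for the rec*.xml files and your own stylesheet):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {
    // Apply an XSLT stylesheet (as a string) to an XML document (as a string).
    static String transform(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<record><title>Test page</title></record>";
        String xsl =
            "<xsl:stylesheet version='1.0' "
          + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/record'>"
          + "<xsl:value-of select='title'/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xml, xsl));  // prints the extracted title
    }
}
```

For the lab files, replace the inline strings with StreamSource objects pointing at the files on disk.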

PHP and XSLT

• Copy U:\testdata\rec3.xml to the htdocs directory. Write a small PHP page that reads the XML file, transforms it with the XSLT style sheet from above, and displays the result in the Web browser. Here is an example of how to do XML transformations in PHP:

$xsl_file = "mytransform.xsl";
$xsl = new XSLTProcessor();
$xsl->importStyleSheet(DOMDocument::load($xsl_file));
$xml_file = "test.xml";
$text = $xsl->transformToXML(DOMDocument::load($xml_file));

• Does it look nice? If not, fix it so that the display looks reasonable, i.e. add some HTML tags so that each documentRecord is described individually with title, URL, and links.

• Design a CSS presentation style for this example.

1.6 Conclusions

In order to pass the lab, talk to the lab instructor and answer the individual questions included in the lab exercises.


2 XSLT, Web services, searching, user interface

2.1 Objectives

The purpose of this lab is to improve your understanding of major standards of the World Wide Web (WWW) which are related to the Web Intelligence course. This includes transformations of XML documents using XSLT, as well as the use of SOAP, REST, and SRU/CQL.

2.2 Literature

In order to be able to solve the exercises, you could use the following resources:

• XML, XSLT

  – XML tutorial: http://www.w3schools.com/xml/

  – Simon St. Laurent, Michael Fitzgerald: "XML Pocket Reference", August 2005, O'Reilly

  – XSLT tutorial: http://www.w3schools.com/xsl/

  – Evan Lenz: "XSLT 1.0 Pocket Reference", August 2005, O'Reilly

  – XPath tutorial: http://www.w3schools.com/xpath/

• PHP

  – PHP tutorial: http://www.w3schools.com/php/

  – PHP manual: http://www.php.net/manual/en/

• Web services

  – Web service: http://en.wikipedia.org/wiki/Web_service

  – SOAP: http://en.wikipedia.org/wiki/Simple_Object_Access_Protocol and http://www.w3.org/TR/soap/

  – WSDL: http://www.w3.org/TR/wsdl

  – REST Web Services: http://www.xml.com/pub/a/2004/08/11/rest.html

• Search/Retrieve via URL - SRU and CQL:

  – http://www.loc.gov/standards/sru/

  – http://www.loc.gov/standards/sru/specs/cql.html

  (Beware - database servers may support only version 1.1 or a limited version 1.2)

2.3 Home assignments

Answer the following questions:

• What are Web services? How are they related to SOAP and REST?

• What is the difference between SOAP and REST?


• What is WSDL?

• What is RPC?

• Write an example of how to use SOAP in PHP.

• What do these acronyms stand for: SRU, CQL? Read about them.

• Write an example of an SRU request.

• Give an example of a record schema.

• Prepare a solution to assignment 2.4.1.

2.4 Lab assignments

2.4.1 More XML transformations using XSLT

• Using U:\Examples\example1.xml as input, write an XSLT script that transforms all <indexfield>-sections into <index>-sections as:

Original example1.xml:

<indexfield name="alltext">
  <type>text</type>
  <title>All text</title>
  <search>relevance</search>
</indexfield>

Transformed output:

<index search="true" scan="true" sort="false">
  <title>All text</title>
  <map>
    <name set="wi">alltext</name>
  </map>
  <map>
    <attr type="1" set="bib1">alltext</attr>
    <attr set="bib1" type="2">102</attr>
  </map>
</index>

Original example1.xml:

<indexfield name="url">
  <type>text</type>
  <title>URL</title>
  <search>urx</search>
</indexfield>

Transformed output:

<index search="true" scan="true" sort="false">
  <title>URL</title>
  <map>
    <name set="wi">url</name>
  </map>
  <map>
    <attr type="1" set="bib1">url</attr>
    <attr set="bib1" type="4">104</attr>
  </map>
</index>

Specifically, the <search>-tag should be transformed into an <attr>-tag with the following values:


search      type   value
relevance   2      102
equal       2      3
phrase      6      2
date        4      5
urx         4      104
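One way to express such a lookup in XSLT (a skeleton only, not the full assignment) is an xsl:choose on the text of the <search> element:

```xml
<!-- Skeleton: map the <search> value to the type/value pair from the table above. -->
<xsl:template match="search">
  <xsl:choose>
    <xsl:when test=". = 'relevance'">
      <attr set="bib1" type="2">102</attr>
    </xsl:when>
    <xsl:when test=". = 'urx'">
      <attr set="bib1" type="4">104</attr>
    </xsl:when>
    <!-- ... the remaining rows (equal, phrase, date) follow the same pattern ... -->
  </xsl:choose>
</xsl:template>
```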

• Using U:\Examples\example2.xml as input, write an XSLT script that produces exactly:

Obstacle detection and collision avoidance;Faster;morpho-syntactic variation (Coordination);US-MC;914.1;
Obstacle detection and collision avoidance;Faster;morpho-syntactic variation (Coordination);US-OC;431.5;

The file U:\Examples\example2.xml:

<?xml version="1.0" encoding="UTF8"?>
<topicDefinition>
  <source>Ei thesaurus</source>
  <entry>
    <term>
      <original>niobium mines @and mining</original>
    </term>
    <topicList weight="US-MC">504.3</topicList>
    <topicList weight="US-MC">549.3</topicList>
  </entry>
  <entry>
    <term>
      <original>obstacle avoidance</original>
      <expansion sw="Faster"
          type="morpho-syntactic variation (Coordination)">
        Obstacle detection and collision avoidance
      </expansion>
    </term>
    <topicList weight="US-MC">914.1</topicList>
    <topicList weight="US-OC">431.5</topicList>
  </entry>
</topicDefinition>

2.4.2 Simple searching

The files search.phtml and simple.xsl in U:\CodeExample\ are examples of how to search a database that returns XML records. The file 'search.phtml' shows how to process such a record into a formatted HTML list of hits.

Searching is done using SRU/CQL. An example working query is (all on one line): http://lup.lub.lu.se/luurSru/?version=1.1&operation=searchRetrieve&startRecord=1&maximumRecords=2&query=title%3Ddata

It searches for the word 'data' in the 'title' field, and retrieves at most 2 records starting from record number 1.


• What does the '%3D' in the URL translate to? Help on URL encodings can be found at http://www.w3schools.com/tags/ref_urlencode.asp.

The target database is http://lup.lub.lu.se/, which is the acronym for Lund University Publications. In LUP you will find research publications, refereed and un-refereed, and doctoral dissertations from 1996 onwards.

Most of SRU/CQL is supported. For details see http://lup.lub.lu.se/documents/luurSruInfo.html

• Read and understand how 'search.phtml' and 'simple.xsl' work together. Copy the files to the htdocs directory and test them.

• Where in the URL is the query? Which parameters decide which records to retrieve from the database?

• Extend search.phtml so that for each hit it shows a detailed display with:

– title

– department name

– author (if any)

– abstract (if any)

• Implement a possibility to sort records in the display, for example alphabetically by title.

• Remove the use of an unordered list (<ul> and <li>), extract the record number (taken from the hit record itself - 'recordPosition'), and place it in front of every hit.

• Implement a possibility to choose between brief/detailed display. Brief should be just a clickable title; detailed can be as above.

• Add Boolean AND searching (2 fields) with selectable index fields.

• Make it possible to search only for publications newer than 2005.

2.4.3 Web services, SOAP

SOAP is a protocol specification that defines a uniform way of passing XML-encoded data. It also defines a way to perform remote procedure calls (RPCs) using HTTP as the underlying communication protocol.

WSDL is used to describe what a web service can do, where it resides, and how to invoke it. So, in plain English, WSDL is a template for how services should be described and bound by clients.
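For orientation, a SOAP request is just an XML envelope posted over HTTP. A call to a currency-conversion operation could look roughly like the sketch below; the element names and namespace are illustrative, and the real ones come from the service's WSDL file:

```xml
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <!-- operation name and parameters are defined by the service's WSDL -->
    <ConversionRate xmlns="http://www.example.org/currency/">
      <FromCurrency>SEK</FromCurrency>
      <ToCurrency>EUR</ToCurrency>
    </ConversionRate>
  </soap:Body>
</soap:Envelope>
```

The response comes back as a similar envelope with the result inside the Body element.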

• Retrieve the WSDL file (service description) for the conversion service at http://www.webservicex.net/CurrencyConvertor.asmx?wsdl and study it together with the example application in U:\CodeExample\convert.phtml. Identify requests and responses and their respective parameters.

• Are WSDL files best suited for human reading or machine processing?

• Test the application from your browser.

• Modify the application so that you can enter your own number in a Web-form.


• Study the XML structure of SOAP request and SOAP response.

• Test the Amazon example in U:\CodeExample\amazon.phtml (it might help to just look at the SOAP response data structure by uncommenting those lines in the code).

• Use REST to access the Amazon web service called E-commerce. (See the article at http://programming.newsforge.com/article.pl?sid=06/03/03/175207)

• Modify the Amazon example to accept a keyword as parameter and display title, author and price for the results.

• Further modify it to show the price (using the conversion service above) with letters instead of numbers.

2.4.4 Integration search interface/Web services

Use the SRU/CQL-enabled database at http://lup.lub.lu.se/luurSru (as described and tested above).

• Write an application that allows searches for words in the abstract field. Process each hit and extract the title and any possible ISBN. For the ISBNs found, use the Amazon Web Service to look up the price. The final display should include the title of the publication and any found literature plus its price. One way of solving this would be:

LOOP for NoOfRecordsToDisplay:
    Get next record from the server
    Use XSL to extract info (ISBN) from the record
    Use the extracted info to query Amazon for details like price
    Format for display
    Send to screen
END LOOP

• Build a nice user interface which allows you to browse all hits in pages of 20 each. Hint: use a different startRecord value for each displayed page.

• Implement a possibility to refine searches.

2.4.5 Search interface customization

For ambitious students: Modify your application so that it displays different pages for requests coming from odd- and even-numbered fox machines. For example, odd-numbered machines could receive just a page with an advertisement for your favorite film.

Test it!

2.5 Conclusions

In order to pass the lab, talk to the lab instructor and show that you have done and understood the applications in the assignments.

• How can you customize your application to use the language of the country the requests are coming from?


• What are the advantages of using SOAP as compared to 'screen-scraping'?

• Try to find some useful Web services on the Internet.


3 Web crawling, focused crawling, character encoding, feature extraction

3.1 Objectives

The purpose of this lab exercise is to improve your understanding of how a Web crawler works. After the exercise you should be able to implement a simple crawler that can operate either as a general crawler or in focused mode.

3.2 Literature

In order to be able to solve the exercises, you could use the following resources:

• Robot exclusion protocol - robots.txt: http://www.robotstxt.org/wc/exclusion.html

• A Web Crawler in Perl: http://www.linuxjournal.com/article/2200

• Pant, Srinivasan and Menczer: "Crawling the Web". http://www.it.lth.se/courses/will/docs/crawling_the_web.pdf

• cpdetector - Document encoding detection: http://cpdetector.sourceforge.net/doc/javadoc/index.html

• Focused crawling, an example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/lec10.html

• Focused crawling, sections 2 and 3 from: Pant, Tsioutsiouliklis, Giles: "Panorama: Extending Digital Libraries with Topical Crawlers", JCDL04. http://dollar.biz.uiowa.edu/~pant/Papers/p102-pant.pdf

• "Writing XML with Java", from Elliotte Rusty Harold: "Processing XML with Java". http://www.cafeconleche.org/books/xmljava/chapters/ch03.html

• Rainbow: http://www.cs.cmu.edu/~mccallum/bow/rainbow/

3.3 Home assignments

• Read and try to understand the two papers "Crawling the Web" and "Focused crawling" (sections 2 and 3), linked above.

• Read and try to understand the code from the example Web crawler in U:\CodeExample\Crawler\.

• What is the "Robots Exclusion Protocol"?

• Write two examples of 'robots.txt' files, one that excludes all robots from everything and one that excludes all robots from the directory '/logs'.

• Write an example of a robots Meta-tag.

• Look at the documentation for htmlparser1 (see below under Tools).

• Why is document encoding important?

• How can you find out the encoding of an HTML page fetched by an HTTP transaction?


• Why is XML encoding important?

• What is an 'N-gram' (used in rainbow)?

3.4 Tools

3.4.1 Rainbow

Rainbow is a program that performs statistical text classification. It can also be used to extract various statistics about a collection of documents, as well as to extract features (words, N-grams) that can be used to classify documents into groups.

3.4.2 Web crawler

In U:\CodeExample\Crawler\ there is an extremely simple Web crawler written in Java. There is also a standard library for HTML parsing, with its documentation in 'htmlparser1_6/docs/' or at http://htmlparser.sourceforge.net/javadoc/index.html.

The implementation closely follows the model below:

Figure 2: Model for a simple Web crawler. Seed URLs feed a Frontier (a list of unvisited pages); the main loop gets a URL from the Frontier, fetches the Web page, analyzes it for links (which go back into the Frontier), and saves the page in a Database (a repository of visited pages).

• HowTo describes how to compile and run the example

• SimpleCrawlerI.java is the main application

• GetNetworkSource.java fetches a Web page given a URL

• DataBase.java saves results as �les

• Analyzer.java extracts URLs from links in HTML code


• htmlparser1_6 is a library

• Queue_String.java implements a queue of strings, in our case of URLs
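The loop those classes implement can be condensed to the following sketch. Fetching and link extraction are stubbed out with an in-memory "Web" so that the control flow is visible; the real code uses GetNetworkSource, Analyzer, and the HTML parser for those steps:

```java
import java.util.*;

public class CrawlerSketch {
    // Toy "Web": page URL -> list of outgoing links (stands in for fetch + link extraction).
    static Map<String, List<String>> web = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("a", "c"),
        "c", List.of());

    // Crawl from a seed: a frontier of unvisited URLs plus a set of already-seen URLs.
    static List<String> crawl(String seed) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> saved = new ArrayList<>();   // stands in for the Database/repository
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();         // Get URL from the Frontier
            saved.add(url);                       // Fetch + Save the page
            for (String link : web.getOrDefault(url, List.of())) {  // Analyze: extract links
                if (seen.add(link)) {             // enqueue each URL only once
                    frontier.add(link);
                }
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a"));  // prints [a, b, c]
    }
}
```

Note how the seen-set guarantees that each page is visited once even if it is linked from several pages - the same idea is asked for in assignment 3.5.4.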

3.5 Lab assignments

3.5.1 Simple Web crawler

Compile and run the crawler, as explained in HowTo.

• What happens?

• Where is the data stored?

• Which data is stored?

• Why aren't all pages fetched?

• Note the conversion from relative URLs to absolute URLs in Analyzer. Why is that needed?

3.5.2 Absolute vs. relative URLs

Modify U:\CodeExample\Crawler\Analyzer.java by replacing

URL temp =
    new URL(base_url, ((LinkTag)list.elementAt(i)).extractLink());
//System.out.println("URL="+temp);
Links.insert(temp.toString());

with

Links.insert(((LinkTag)list.elementAt(i)).extractLink ());

• Compile, run and see what happens.

3.5.3 Document encoding

Inspect the two pages http://www.it.lth.se/courses/will/Example/I8A.html and http://www.it.lth.se/courses/will/Example/I8B.html using a browser.

• Are they identical?

• Now crawl these two pages and inspect the files that your crawler writes. Are they identical? What is the difference?

• How can you find out the encoding of an HTML page in a browser?

• Fix your crawler so that the files you write for these two pages are identical.
Hint: The easiest way of doing this is to determine the encoding of the pages and convert them as they are read, i.e. in GetNetworkSource.java. The cpdetector package (files are in U:\CodeExample\cpdetector\) is a simple solution to this rather boring and convoluted problem. An example of how to use it is in U:\CodeExample\getCharset.java.


import java.net.*;
import java.io.*;
import java.util.*;
import cpdetector.io.*;

class GetCharset
{
    public static String getCharset(String urlString) {
        java.nio.charset.Charset charset = null;
        try {
            ParsingDetector detector = new ParsingDetector();
            charset = detector.detectCodepage(new URL(urlString));
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Compare strings with equals(), not ==, and fall back to a default
        // when detection failed or gave no answer.
        if (charset == null || "void".equals(charset.name())) {
            System.out.println(" DetectCodepage=void using default");
            return "ISO-8859-1";
        } else {
            System.out.println(" DetectCodepage=" + charset);
            return charset.name();
        }
    }
}

Use the returned character encoding to make your InputStreamReader do the decoding.
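A minimal sketch of that decoding step (the class name is ours; GetCharset above would supply the charset argument):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DecodedRead {
    // Wrap the raw byte stream of a page in an InputStreamReader constructed
    // with the detected encoding, so that bytes such as 0xE5 ('å' in
    // ISO-8859-1) are turned into the right characters.
    public static String readAll(InputStream raw, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(raw, Charset.forName(charset)));
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

In GetNetworkSource.java the stream would come from the URL connection instead of, say, a file or byte array.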

3.5.4 Improvements

• Make sure that a page is only fetched once even if the link is in several pages.

• Improve Analyzer so that you can save XML records instead of HTML. Use a format similar to the one in U:\testdata\rec1.xml, with at least title, headings and anchor texts in separate tags. Make sure that what you write is correct XML, specifically that the XML character encoding is correct. Use UTF-8.
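As a sketch of what such a record builder could look like — the tag names below are modeled on the description above, not on the exact format of rec1.xml, and escaping the predefined XML entities is what keeps the output well-formed:

```java
public class XmlRecordWriter {
    // Escape the characters that would break XML well-formedness.
    static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // One record with the page's URL, title, headings and anchor texts in
    // separate tags (tag names are our assumption, not rec1.xml's).
    public static String record(String url, String title,
                                String[] headings, String[] anchors) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<record url=\"").append(esc(url)).append("\">\n");
        sb.append("  <title>").append(esc(title)).append("</title>\n");
        for (String h : headings) sb.append("  <heading>").append(esc(h)).append("</heading>\n");
        for (String a : anchors) sb.append("  <anchortext>").append(esc(a)).append("</anchortext>\n");
        sb.append("</record>\n");
        return sb.toString();
    }
}
```

Write the returned string to disk through an OutputStreamWriter constructed with the "UTF-8" charset, so that the declared and actual encodings agree.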

3.5.5 Focused crawling

• Add a URL selection filter that allows you to select URLs and add them to the Frontier based on requirements on the URL. For example, limit crawling to pages in Sweden or at LTH.

• Add a content selection filter that allows you to select which pages to save based on the content of the page. Something simple, like finding pages that have a specific word in the title or anywhere on the page.

Keep the code that you have now - you will have to use it further on in the course.


3.5.6 Further improvements

• Robots exclusion protocol: add handling for the Robots exclusion protocol (robots.txt), i.e. make your robot obey the rules given there. Test this by adding a 'robots.txt' file to your own Web server and crawling it.

• Can you use the htmlparser built-in fetcher instead of GetNetworkSource.java?

• (Ambitious students) Add handling for robots Meta-tag.
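A much-simplified sketch of a robots.txt check (it handles only User-agent and Disallow prefix rules; real robots.txt handling has more rules, such as Allow, wildcards and Crawl-delay, that are ignored here, and the class name is ours):

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Collect the Disallow prefixes of every User-agent group that applies
    // to us (matching token or '*') and refuse any path starting with one.
    public static boolean allowed(List<String> robotsTxt, String agent, String path) {
        List<String> disallow = new ArrayList<String>();
        boolean applies = false;
        for (String raw : robotsTxt) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String ua = line.substring("user-agent:".length()).trim();
                applies = ua.equals("*")
                        || agent.toLowerCase().contains(ua.toLowerCase());
            } else if (applies && lower.startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty()) {
                    disallow.add(prefix); // an empty Disallow allows everything
                }
            }
        }
        for (String prefix : disallow) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

The check belongs just before a URL is fetched, so that disallowed pages never reach the Frontier processing.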

3.5.7 Feature extraction

For these assignments we need some collections of topic-specific documents (one collection per topic). You can either use the collections found in U:\Examples\Collections\ (links to corresponding Web pages at http://www.it.lth.se/courses/will/Example/TopicColl.html) or use your own crawler to collect some topic-specific documents of your own choice. Choose at least two topics. Use one directory for each topic.

Use the tool rainbow to gather and inspect statistics about your collections. The command rainbow --help will give you information about how to use it.

• Index your collections using the switch --index and give the full catalogue names of your selected topic collections as arguments (they are also used as class names). This will analyze all documents and save statistics in a catalogue (default U:\.rainbow\). All other invocations of rainbow will use that information.

• Inspect word probabilities for each topic (--print-word-probabilities=CLASSNAME).

• Do word counts for some of these words (--print-word-counts=WORD) differ between topics?

• The switch --print-word-infogain N gives the N most selective words (calculated as information gain) for differentiating between your topics. Inspect that list. Do you agree that they are good words for the purpose of classifying documents into topics?

• Repeat the above assignments but use the switch --gram-size=N in order to take word pairs into account as well. Are the features improved?
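For intuition, the quantity rainbow ranks by can be computed by hand for a single word treated as a binary feature (present/absent) over two topics. The sketch below uses plain document counts; rainbow's exact estimator may differ (e.g. through smoothing):

```java
public class InfoGain {
    // Entropy in bits; terms with probability 0 contribute nothing.
    static double entropy(double... p) {
        double h = 0;
        for (double x : p) {
            if (x > 0) h -= x * Math.log(x) / Math.log(2);
        }
        return h;
    }

    // Information gain of one word for a two-topic collection, from document
    // counts: withA/withB = documents of topic A/B containing the word,
    // withoutA/withoutB = documents of topic A/B not containing it.
    public static double gain(int withA, int withB, int withoutA, int withoutB) {
        double n = withA + withB + withoutA + withoutB;
        double classA = withA + withoutA, classB = withB + withoutB;
        double with = withA + withB, without = withoutA + withoutB;
        double prior = entropy(classA / n, classB / n);
        double cond = 0;
        if (with > 0)    cond += (with / n)    * entropy(withA / with, withB / with);
        if (without > 0) cond += (without / n) * entropy(withoutA / without, withoutB / without);
        return prior - cond;
    }
}
```

A word that appears in every document of one topic and never in the other gives the maximal gain of 1 bit (for balanced classes), while a word spread evenly over both topics gives 0.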

3.6 Conclusions

• How do you know that a rule in robots.txt applies to your crawler?

• Why is it important to resolve relative URLs to absolute URLs?

• Why would you build a focused crawler instead of a general-purpose crawler?

• Describe a way of selecting features for automatically classifying documents into topic classes.


4 Feedforward Neural Networks, Binary XOR, Continuous XOR, Parity Problem and Composed Neural Networks.

4.1 Objectives

The objective of the following exercises is to get acquainted with the inner workings of the feedforward neural network. This simple structure is probably the most popular version in use nowadays, notably in system control and classification applications. But it is not a black box that will simply learn from the presented examples: the learning environment has to be carefully controlled to make it work. And even then success is not guaranteed! It has been noted that large monolithic networks (i.e. large networks that are trained in one pass), as commonly occur in biology, can in electronics still suffer from what is called "catastrophic forgetting" or "unlearning". Therefore we will see how many small networks, each trained successfully on its own, can be assembled into a large network and subsequently post-trained without unlearning. This opens the road to the systematic development of intelligent systems.

4.2 Literature

In order to be able to solve the exercises, consult the following resources:

• Brief Introduction to Neural Networks.
• Complete Guide of Joone (Java Object Oriented Neural Engine).
• A general neural network written in Java, GNet.java.
• A zip-file containing Javadocs for all classes in Joone.

4.3 Home assignments

Read the literature named above so that you can answer the following questions:

• What is the difference between single-layer and multilayer feedforward neural networks?
• What is supervised learning?
• Explain the following terms: epoch, training data and pattern.
• How do you usually split the data set into training, validation and testing sets?
• What is the back-propagation learning algorithm? Explain it briefly.
• Write the generalized delta rule and explain the terms: learning rate, learning mode and momentum.

Acquaint yourself with the user manual of the neural network simulator Joone. To that end, a demonstration of the capabilities of Joone by means of an XOR circuit is given in Appendix A. Please take your time to go through this demonstration using the software as installed on your laboratory computer! Then do the demonstration of the Parity Problem appended to this text (Appendix B). This gives you some basic skills for doing the experiments on composed neural networks.

It is faster (and more accurate) to use the provided Java class GNet.java to accomplish the assignments in 4.4.1 and 4.4.2. The GUI may be used to solve all assignments throughout this lab, but for some strange reason it does not really work for composed neural nets! You may need to write your own Java code to train and test a composed neural network. The Joone complete guide provides you with good examples and hints.
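For reference, one common statement of the generalized delta rule with momentum (notation varies between textbooks; treat this as a reading aid, not as Joone's exact update):

```latex
\Delta w_{ij}(t) = \eta\,\delta_j\,o_i + \alpha\,\Delta w_{ij}(t-1)
```

Here η is the learning rate, α the momentum, o_i the output of node i, and δ_j the local error gradient of node j; the momentum term reuses the previous weight change to damp oscillations.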

4.4 Lab assignments

4.4.1 The set-theoretic OR

The OR circuit is a digital instantiation of the more general function F = I1 + I2 - I1·I2, where both the inputs and the output carry values in the range 0…1. In the following we will study the training of this function in more detail.

1. Change the input file used for the binary XOR to a set of input/output pairs that describe the OR on the value range between 0 and 1 with steps of 0.1. The transition between the true and false output can be placed somewhere in the middle. The value pairs should be randomly ordered before being fed to the network. Complete the table below with the training error and the network behaviour when tested. Remember to reset the weights of the network before each training run (see Hints at the end of Appendix A).

Epochs  Learning Rate  Momentum  RMSE  Behaviour?
1000    0.8            0.3
2000    0.8            0.3
3000    0.8            0.3
5000    0.8            0.3

Observation:
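The randomly ordered value pairs for step 1 can be generated with a small helper (a sketch; the class name is our choice — write the returned rows to a text file in Joone's semicolon-separated format):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MakeOrData {
    // Build the 11x11 grid of patterns for F = I1 + I2 - I1*I2 with steps
    // of 0.1, shuffled into random order before training.
    public static List<String> rows(Random rnd) {
        List<String> rows = new ArrayList<String>();
        for (int a = 0; a <= 10; a++) {
            for (int b = 0; b <= 10; b++) {
                double i1 = a / 10.0, i2 = b / 10.0;
                double f = i1 + i2 - i1 * i2;
                rows.add(i1 + ";" + i2 + ";" + f);
            }
        }
        Collections.shuffle(rows, rnd);
        return rows;
    }
}
```

The grid has 121 rows, so a "20 out of 121" training set is simply the first 20 rows of the shuffled list.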

2. Split this set into a training set and a test set. Describe this division and argue which considerations have led to your choice. Then train the OR again, verify the generalization capability and test the performance. You may need to try other divisions to achieve a learning result of sufficient quality. Is the learning time (i.e. epochs) higher, equal or lower? Explain!

Size of Training Set  Epochs  Learning Rate  Momentum  RMSE  Behaviour?
20 out of 121         1000    0.8            0.3
                      2000    0.8            0.3
                      5000    0.8            0.3
100 out of 121        1000    0.8            0.3
                      2000    0.8            0.3
                      5000    0.8            0.3

Observation:


3. Vary the learning rate between 0.1 and 0.9. Select what you judge to be a good compromise between learning speed and quality. Explain your reasoning! Show a plot of learning rate versus training error. Always use 5000 epochs for training!

Learning Rate  Momentum  RMSE
0.1            0.3
0.3            0.3
0.5            0.3
0.7            0.3
0.9            0.3

Observation:

4. Vary the momentum between 0.1 and 0.9. Keep the best learning rate obtained in the previous exercise. Select what you judge to be a good compromise between learning speed and quality. Explain your reasoning! Show a plot of momentum versus training error. Always use 5000 epochs for training!

Learning Rate  Momentum  RMSE
               0.1
               0.3
               0.5
               0.7
               0.9

Observation:

5. Set the range from which random values are taken to initialize the weights to 0.1, 0.3 and 0.5 respectively. Use the best combination of learning rate and momentum. How does this influence the learning?

Epochs Learning Rate Momentum RMSE

Observation:


4.4.2 Distance Function

So far, learning has seemed almost trivial. This is because the example function is a simple linear one, where a single line can separate the 'good' from the 'bad' examples. In the history of neural networks, Marvin Minsky of the MIT Artificial Intelligence Lab almost brought the concept to its death when he demonstrated in 1969 that the XOR function cannot be trained on a linear feedforward network. He was only partially right, but it took until the late eighties before confidence was restored. The XOR circuit is a digital instantiation of the more general distance function DF = (I1 - I2)². In the following we will see how right he was, before we prove him wrong.

1. Change the input file used for the binary XOR to a set of input/output pairs that describe DF on the value range between 0 and 1 with steps of 0.1. Split this set into a training set and a test set. Then train the XOR again, verify the generalization capability and test the performance. Take the learning rate at 0.8 and the momentum at 0.1. Now compare the learning behaviour with what you experienced for the OR, and give an explanation.

Train. Patterns  Epochs  RMSE  Behaviour?
20 / 121
60 / 121
121 / 121

Observation:

2. Vary the learning rate and the momentum. What are the best settings? Argue what the best remaining error in training the XOR function can be!

Epochs Learning Rate Momentum RMSE

Observation:

4.4.3 Composed Neural Networks

1. For starters we are going to create a network containing the OR function and one with the AND function with a similar continuous value range as above. Split the example sets into a training set and a test set. Describe this division and argue which considerations have led to your choice.

Epochs Learning Rate Momentum RMSE


Observation:

2. Then these networks are combined over a third network and the total is trained for a DF function, using the same training set as in 4.4.2. Compare the training time of this composed network to the one for the monolithic function.

Epochs Learning Rate Momentum RMSE

Observation:

3. The knowledge within the composed network may easily disappear upon subsequent learning. So you are kindly requested to re-do the experiment for low learning rates. Check whether this has made any difference.

Epochs  Learning Rate  Momentum  RMSE

Observation:

4. Now take your optimally trained composed DF network and continue training, but this time for a NOR function with continuous value range. What do you observe?

Epochs Learning Rate Momentum RMSE

Observation:

5. And, at the end of this little experiment, let's try to return to where we started by continuing the training with the DF example set. Is this faster or slower than before?

Epochs Learning Rate Momentum RMSE


Observation:


Appendix A.

Simple XOR

This appendix will guide you through different steps to construct a neural network that solves the classical (binary) XOR problem. A binary XOR has the following truth table:

Input 1  Input 2  Output
0        0        0
0        1        1
1        0        1
1        1        0

This table has to be saved in a plaintext file (call it 'binaryXOR_truth_table.txt'). The file contains 4 rows, each with 3 numbers separated by a semicolon ';', as shown below. The numbers may be integer or real.

0;0;0
0;1;1
1;0;1
1;1;0

Now run the Joone GUI editor and follow the steps as described below.

1. Add a Linear layer by selecting the encircled button in the figure and then clicking in the drawing area.

2. Change the name of the layer and the number of neural nodes by viewing the properties (right-mouse click).


3. Add a new Sigmoid layer by selecting the button marked with a circle in the figure and then clicking in the drawing area. Change the name to 'Hidden' and the number of nodes to 3 as shown below. Repeat the procedure and add an 'Output' layer with one node only.

4. Now the three layers are connected to construct a neural network. As each node in a layer has to be connected to all the nodes in the next layer, two 'Full Synapse' connections should be added. This is accomplished by dragging a line from the little circle on the right-hand side of a layer and releasing the mouse button when the pointer is on the next layer.


5. After doing all the previous steps, you should have something like:

6. In order to train the neural network, a training set is provided by means of a File Input layer. In our simple example, the first two columns of all rows are used. For that reason, set the parameter 'Advanced Column Selector' to "1,2" or "1-2". Selecting 'firstRow' as 1 and 'lastRow' as 0 will force the usage of all rows in the text file that is specified in the field 'inputFile' (use 'binaryXOR_truth_table.txt'). Connect the input file to the input layer.

7. As the neural network learning is supervised, we need a teacher. Connect the output layer to the Teacher layer (change the name to 'Supervisor').


8. The ‘Supervisor’ must have access to the desired output for each pair of inputs that are sent to the network. Create another File Input layer and call it ‘DesiredData’. Set the different properties as shown below. Connect the ‘Supervisor’ to the ‘DesiredData’ by dragging a line from the little red square on the top side of the Teacher layer and then releasing the mouse button when the yellow arrow is on the File Input layer.

9. At this stage, you should have something similar to:

10. Now we need to teach the network how to solve the XOR problem. In the menu line, click on 'Tools -> Control Panel'. Fill in the parameters as shown below. The parameter 'training patterns' is the number of rows in the training set. The entire set is sent to the network 10000 times (epochs). Click the 'Run' button to start the training procedure. The Control Panel shows the number of performed epochs and the current error. The final value should be less than 0.1. If this is not the case, click on 'Tools -> Randomize' and 'Tools -> Add noise' in the menu line. This will randomize and add noise to the weights of the synapses and thereby improve the learning procedure. Click 'Run' again!

Testing the trained network

11. In order to test the trained XOR network, add an Output File layer. In the 'Properties window', set the 'name' to 'ResultData' and the 'fileName' to 'binaryXOR_output.txt' (including the path). When it comes to the Teacher layer, two options are possible: either it is kept connected to the network (together with the corresponding File Input, i.e. 'DesiredData') or it is removed. In both cases the testing will give the same result!

12. Open the Control Panel, disable the 'learningRate' parameter and set the number of epochs to 1. By clicking on 'Run', a text file with the name 'binaryXOR_output.txt' is created in your working directory.


13. The output file contains four values corresponding to the outputs in the truth table. The content should be similar to:

0.007830879673053221
0.9904490706938025
0.9903916946908758
0.013067862923140524

Hints:

• Tools -> Randomize: resets the weights of a neural network, initializing it.
• Tools -> Add Noise: random noise is added to the weights in order to permit the net to exit from a local minimum.
• If the network seems to "memorize" the training patterns from a previous training set, though a new training set is used, reset the input stream (Tools -> Reset Input Stream).
• It is possible to manually initialize synapse weights to certain values:
  o In a text editor, write the weight values using ';' as column separator (similar to the input file).
  o Copy the inserted values.
  o Inspect the synapse connection that needs to be initialized and press the 'paste' button.
• If the network needs to be retrained, disable the File Output layer 'ResultData'. This will eliminate the OutOfMemory error that is raised due to the limited Java heap size. The heap fills rapidly because the output file 'binaryXOR_output.txt' is updated as many times as there are epochs!
• To test a trained network, you may need to save the network and re-open it!
• The input layer of the XOR (binary / continuous) should use a linear transfer function (not sigmoid). Otherwise, the parity neural network will not be trainable!


Appendix B.

The Parity Problem

The parity problem has a long history in the study of neural networks. The N-bit parity function is a mapping defined on the 2^N distinct binary vectors of length N that indicates whether the sum of the N components of a binary vector is odd or even. In other words, the result of the mapping is 0 if the number of ones is even, and 1 otherwise. The truth table of the 4-bit parity function, i.e. N=4, is given as follows:

I1  I2  I3  I4  f
0   0   0   0   0
0   0   0   1   1
0   0   1   0   1
0   0   1   1   0
0   1   0   0   1
0   1   0   1   0
0   1   1   0   0
0   1   1   1   1
1   0   0   0   1
1   0   0   1   0
1   0   1   0   0
1   0   1   1   1
1   1   0   0   0
1   1   0   1   1
1   1   1   0   1
1   1   1   1   0

Many solution proposals make use of a standard feedforward neural network. The most commonly used network architecture has one input layer, one output layer and one hidden layer in between. The transfer function in both hidden and output layers is the sigmoid function. Such architectures require N nodes in the hidden layer to solve the N-bit parity problem. In spite of the very long time the training procedure takes, the network may not learn to solve the problem! In this sense, modularity of neural networks provides a powerful solution. Actually, a better solution to the parity problem is obtained by a modular neural network composed of three instances of the XOR neural network presented before. Here, the output nodes of the first two XOR networks serve as an input layer to the third XOR network. Your task is to build a neural network (BinaryParityNN) that is trained to solve the 4-bit parity problem according to the truth table above.
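The modular decomposition can be sanity-checked in plain code before building it in Joone: the parity of four bits is the XOR of two pairwise XORs (the class name is ours):

```java
public class ParityCheck {
    // XOR of two bits, written arithmetically.
    static int xor(int a, int b) { return (a + b) % 2; }

    // f(i1,i2,i3,i4) = xor(xor(i1,i2), xor(i3,i4)):
    // the third XOR is fed by the outputs of the first two.
    public static int parity(int i1, int i2, int i3, int i4) {
        return xor(xor(i1, i2), xor(i3, i4));
    }
}
```

Checking this against a few rows of the truth table above confirms that the three-XOR network computes exactly the 4-bit parity function.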

In the following, a step-by-step manual will help you to build your BinaryParityNN.

1. Before you start building the BinaryParityNN, you must save the XOR network in a form that can be inserted as a NeuralNet object. Simply remove the teacher and all I/O components from your XOR network before you save it in serialized form. In the GUI Editor, choose File -> Export NeuralNet, and save it as "BinaryXOR.snet".
2. The truth table is to be saved in a text file called "BinaryParity_truth_table.txt".
3. In the GUI Editor, choose to build a new neural network.
4. Add two instances of the XOR neural network by clicking the button for "New Nested NN".


5. In the properties for both instances, set the learning parameter to False (default) and link the Nested ANN to the exported file "BinaryXOR.snet". Name the instances preferably "xor 1" and "xor 2".
6. As the input layer of the third XOR is composed of the outputs of the first two XORs, two different Linear layers serve as input layer to the third XOR. The hidden and the output layer use the sigmoid function as before. Add two Linear layers, call them "Intermediate 1" and "Intermediate 2", with one node each (corresponding to the outputs of "xor 1" and "xor 2"). Let the value of beta (in the properties) be 1.0 (default).

7. Now, add a hidden and an output layer, both of kind Sigmoid. Call them "xor3_hidden" and "xor3_output" respectively. The hidden layer consists of two nodes and the output layer of one node only.

8. Connect the three layers of XOR 3 by using the Full Synapse. To ease the understanding of the diagram we group all the layers of XOR 3 together by drawing a rectangle.

9. The architecture of the parity network "ParityNN" is completed by combining the three XORs. As the output from "xor 1" serves as input to XOR 3, a direct connection between the networks is needed. Use the Direct Synapse to connect the networks as shown below.


10. The modular network "ParityNN" is fed with input data through two File Input components, called "Parity data 1" and "Parity data 2".
11. Some of the properties of the File Input components are to be set according to the following table. All other default properties remain unchanged.

name                      Parity data 1                 Parity data 2
Advanced Column Selector  1-2                           3-4
fileName                  BinaryParity_truth_table.txt  BinaryParity_truth_table.txt
stepCounter               True                          False

12. In order to train the network to solve the parity problem, a teacher is needed. Add a Teacher component and provide it with the desired output through a new File Input, called “Desired output”, with the Advanced Column Selector set to 5. The property fileName is set to “BinaryParity_truth_table.txt”.


13. Now the ParityNN is ready to be trained. Open the Control Panel (Tools -> Control Panel) and set the parameters as shown below. By running the network, a gradually descending RMSE value is observed, which shows that the network is learning the solution of the parity problem.

14. To verify the correctness of the functionality, add a File Output component, call it "results", and connect it to the output layer of XOR 3. Run the Control Panel again for one epoch only. Don't forget to set the learning parameter to False. The values in the obtained output file must agree with the truth table of the parity function.


5 Preparing Neural Networks for Data Mining

5.1 Objectives

The objective of the following exercises is to reach the level where you are able to train a network and use it within a Data Mining application. First we take an example known from the previous lab and investigate the quality of the learning in terms of reproducibility. A typical Data Mining application will also have inputs with discrete instead of continuous values, for instance color of eyes, or being an IEEE member, and so on. These have to be preprocessed in a special way, as the neural network cannot handle them directly. Then we try a simple Data Mining example, where a network is trained, exported and finally integrated into a simple program. This brings us to the level where we can freely create intelligent data handling for a Web application.

5.2 Literature

In order to be able to solve the exercises, consult the following resources:

• Brief Introduction to Neural Networks.
• Complete Guide of Joone (Java Object Oriented Neural Engine).
• Two Java classes, MultilayerFFNet.java and MembershipNet.java (zipped).
• A zip-file containing Javadocs for all classes in Joone.

Knowing what is available will help you work through the lab quickly.

5.3 Home assignments

Read the literature named above so that you can answer the following questions:

• What is the difference between single-layer and multilayer feedforward neural networks?
• What is supervised learning?
• Explain the following terms: epoch, training data and pattern.
• What is the back-propagation learning algorithm? Explain it briefly.
• Write the generalized delta rule and explain the terms: learning rate, learning mode and momentum.

You may not have been able to answer these questions satisfactorily before the previous lab, but if you do not know the answers by now, you are in serious trouble! Read through the routine for determining term frequency shown in Appendix A and make sure you understand how it works.

5.4 Lab assignments

5.4.1 Getting back into the subject

The XOR circuit is a digital instantiation of the more general distance function DF = (I1 - I2)². We will again approximate this more general function, and do so several times, each time with a slightly different initialization of the weight settings, to get some feeling for the reproducibility of the trained function.

1. Take the set of best parameters found in Lab 4.4.2 to approximate the function DF. Retrain the neural network using different initial weights as indicated in the table below. Every time you train a neural network from scratch, the random initialization will be different and the results may differ slightly. If the results are drastically different, you probably have some conflicts in the training set. Is everything still as you would like it to be? Stop the training when the training RMSE is less than 0.01. Test the trained network (plot the obtained output values versus the ideal output values) and complete the table. Use the provided Java class (MultilayerFFNet.java).

Weights              [-0.1,0.1]  [-0.2,0.2]  [-0.3,0.3]
Validation RMSE
# epochs
# training patterns
Learning rate
Momentum
Behaviour

Observation:

2. Putting some noise on the inputs can influence the generalization properties of a neural network. Generate a new data set that describes the function DF on the value range between 0 and 1 with a step of 0.2 (the distance between input values is now doubled). Modify the provided Java class so that noise is added to the training patterns. Use the best range of weight randomization obtained in the previous assignment. Then train the neural network and compare the generalization with the earlier result.

Input Noise        0.15  0.10  0.05
Weights
Training RMSE
Validation RMSE
Epochs
Training Patterns
Learning Rate
Momentum
Behaviour

Observation:
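Adding input noise to the training patterns, as step 2 asks, can be sketched as follows (a hypothetical helper, not part of MultilayerFFNet.java):

```java
import java.util.Random;

public class AddNoise {
    // Jitter each input column of a training pattern with uniform noise in
    // [-level, +level]; the desired-output columns are left untouched.
    public static double[] noisy(double[] pattern, int nInputs, double level, Random rnd) {
        double[] out = pattern.clone();
        for (int i = 0; i < nInputs; i++) {
            out[i] += (rnd.nextDouble() * 2 - 1) * level;
        }
        return out;
    }
}
```

Applying this to each pattern (or to a fresh copy every epoch) yields the noise levels 0.15, 0.10 and 0.05 asked for in the table.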

5.4.2 Data Fuzzification

So far we have encountered situations where a number of inputs were clustered and classified into a single output value. Unfortunately, the real world uses not only continuous and discretized samples over a value range but also singletons. For instance, eye coloring is usually given as albino, green, brown or blue. This is hard for a neural network to handle, and therefore such singletons tend to be fuzzified. For this purpose the attribute values are evenly spread over the input value range (for instance between 0 and 1, or between -1 and +1). Each attribute is then modeled as a membership function (familiar from fuzzy logic), stating to which degree the property holds. For instance, in Figure 1 a value of 0 means full membership of the category 'albino eyes', while a value of 0.5 signifies a tiny relation to blue and a tiny relation to green eyes.

Page 43: Laboratory exercises for EIT031 WEB Intelligence · 2009. 3. 11. · Laboratory exercises for EIT031 WEB Intelligence Anders Ardö Department of Information technology Lund Institute

3

[Figure: overlapping membership curves for the categories white, blue, green and brown, centered at 0, 0.25, 0.5 and 0.75 on the input range 0…1, with the degree of membership from 0 to 1 on the vertical axis.]

Figure 1: Membership function for eye color.

1. There are various ways to achieve this. Let's assume that we have a network with 2 inputs (one for color and one for noise) and 4 outputs (one for each color). We need an example set where each color is given by its center value as the first input, random noise is given as the second input, and the output specifies the corresponding fuzzified value, similar to the value on the x-axis in Figure 1.

Observation:
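One possible encoding of such an example set (a sketch; the column layout, center values and class name are our assumptions, modeled on Figure 1 and the 2-input/4-output network described above):

```java
import java.util.HashMap;
import java.util.Map;

public class Fuzzify {
    // Centers as in Figure 1: each eye-color singleton gets an evenly spread
    // center on the [0,1] input range ('white' is the albino category).
    static final Map<String, Double> CENTER = new HashMap<String, Double>();
    static {
        CENTER.put("white", 0.0);
        CENTER.put("blue", 0.25);
        CENTER.put("green", 0.5);
        CENTER.put("brown", 0.75);
    }

    // One semicolon-separated training pattern: inputs are the (noisy) center
    // value and the noise itself, outputs are one column per color
    // (1 for the true category, 0 for the others).
    public static String pattern(String color, double noise) {
        double c = CENTER.get(color);
        StringBuilder sb = new StringBuilder();
        sb.append(c + noise).append(';').append(noise);
        for (String o : new String[] { "white", "blue", "green", "brown" }) {
            sb.append(';').append(o.equals(color) ? 1 : 0);
        }
        return sb.toString();
    }
}
```

Generating many such rows with random noise per color gives a training file in the same semicolon-separated format used by the earlier Joone exercises.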

2. The membership functions in Figure 1 show a small overlap. Do you think that this overlap is meaningful? It may be wise to demonstrate your thoughts in a Joone execution, by training with a noise level that is larger than or equal to the value range of a color with respect to the center value.

Observation:

5.4.3 Integrating Neural Networks

These experiments provide some basic skills in creating the neural networks that provide intelligence to your Web search functions. In preparation for the final assignment, a number of additional things come in handy. In analogy with assignment 3.5.7, use the tool rainbow to list the terms with the highest information gain among documents in two different document collections. The frequency of certain terms is much higher in the documents in the first set compared to the second set, and vice versa. Generate a list of the 50 words with the highest information gain.
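As a reminder of what rainbow computes, the information gain of a term over two document classes can be derived from four document counts. The sketch below is a minimal illustration; the class and method names are our own, not part of the lab tools.

```java
public class InfoGain {
    // One term of the entropy sum, with the convention 0 * log(0) = 0.
    static double h(double p) {
        return p <= 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2));
    }

    // Entropy of a two-class split given the document counts a and b.
    static double entropy(double a, double b) {
        double n = a + b;
        return n == 0.0 ? 0.0 : h(a / n) + h(b / n);
    }

    // n1, n2: number of documents in class 1 and class 2;
    // t1, t2: number of documents per class that contain the term.
    static double informationGain(int n1, int n2, int t1, int t2) {
        double n = n1 + n2, t = t1 + t2;
        return entropy(n1, n2)
             - (t / n) * entropy(t1, t2)
             - ((n - t) / n) * entropy(n1 - t1, n2 - t2);
    }

    public static void main(String[] args) {
        // A term occurring in every class-1 document and in no class-2 document
        // separates the classes perfectly: information gain of 1 bit.
        System.out.println(informationGain(10, 10, 10, 0));
        // A term occurring equally often in both classes tells us nothing.
        System.out.println(informationGain(10, 10, 5, 5));
    }
}
```

Terms whose gain approaches the class entropy are the strongest discriminators; these are the natural candidates for the 50-word list.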

1. The resulting list can be taken as a suggestion, to be verified and modified manually to produce a file with a final list of words, one per line. Use the class TermFreq, provided in Appendix A, to calculate the frequency of each term in your modified list. The result is a text file containing all terms with their corresponding frequencies.

Observation:

2. Copy this result file and use it to train and test a neural classifier.

Observation:


Appendix A. Finding Term Frequency from Documents

In addition, Rafael has written some Java code that takes such a file with a list of words and a document URL as parameters, and calculates word frequencies for these words in that document.

import java.io.*;
import java.net.*;
import java.util.*;

public class TermFreq {
    public static void main(String[] args) throws MalformedURLException, IOException {
        // initialize the hashtable
        Hashtable frequency_table = new Hashtable();

        // read the list of terms, one per line
        BufferedReader reader = new BufferedReader(new FileReader("list_of_terms.txt"));
        String s;
        System.out.println("LIST OF TERMS: ");
        while ((s = reader.readLine()) != null) {
            System.out.println(s);
            frequency_table.put(s, new Integer(0));
        }
        reader.close();

        String content = new String(args[0]);
        System.out.println();
        System.out.println("URL: " + content);
        URL url = new URL(content);
        URLConnection urlCon = url.openConnection();
        String data;
        StringTokenizer st;
        if (urlCon.getContentType().endsWith("html")) {
            // GetNetworkSource is a helper class provided in the lab environment
            // that fetches the page content as a string
            data = GetNetworkSource.getNetworkSource(url);
            //System.out.println("Content: " + data);
            Integer freq;
            st = new StringTokenizer(data, "<,>, ,\",=,/,!--,--,;,.,?,:,|,%,_,&,',@");
            while (st.hasMoreTokens()) {
                String current_token = st.nextToken();
                //System.out.println(current_token);
                // if the current token is within the list of terms,
                // its frequency is increased by one
                if (frequency_table.containsKey(current_token)) {
                    freq = (Integer) frequency_table.get(current_token);
                    int int_freq = freq.intValue();
                    int_freq++;
                    frequency_table.remove(current_token);
                    frequency_table.put(current_token, new Integer(int_freq));
                }
            }
            System.out.println();
            System.out.println("FREQUENCY TABLE: ");
            for (Enumeration e = frequency_table.keys(); e.hasMoreElements();) {
                String term = (String) e.nextElement();
                System.out.println(term + " = " + (Integer) frequency_table.get(term));
            }
        } else {
            System.out.println("The contentType of the specified URL is not HTML!!!");
        }
    }
}


6 Indexing, searching, ranking, integration

6.1 Objectives

Introduce XML indexing. Learn the importance of ordering hit sets by relevance ranking. Integrate neural network classifiers with Web crawling.

6.2 Literature

• YAZ User's Guide and Reference http://www.indexdata.dk/yaz/doc/ and specifically the user manual on yaz-client http://www.indexdata.dk/yaz/doc/yaz-client.tkl

• Zebra - User's Guide and Reference http://www.indexdata.dk/zebra/doc/

6.3 Home assignments

• Refresh your knowledge of databases and how they are implemented.

• What is an inverted index?

• Read about tf-idf ranking and PageRank (for example in Wikipedia). Familiarize yourself with the algorithms used for calculation.
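To make the home reading concrete, here is a toy inverted index with a tf-idf weight, sketched in Java. The data structure (term mapped to a posting list of document id and term frequency) and the weight tf * log(N/df) follow the standard textbook definitions; all names are our own.

```java
import java.util.*;

public class TfIdfDemo {
    // Inverted index: term -> (document id -> term frequency).
    static Map<String, Map<Integer, Integer>> index(String[] docs) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.length; id++)
            for (String term : docs[id].toLowerCase().split("\\W+"))
                if (!term.isEmpty())
                    postings.computeIfAbsent(term, t -> new HashMap<>())
                            .merge(id, 1, Integer::sum);
        return postings;
    }

    // tf-idf weight of a term in a document: tf * log(N / df),
    // where df is the number of documents containing the term.
    static double tfIdf(Map<String, Map<Integer, Integer>> postings,
                        String term, int docId, int nDocs) {
        Map<Integer, Integer> posting = postings.get(term);
        if (posting == null || !posting.containsKey(docId)) return 0.0;
        return posting.get(docId) * Math.log((double) nDocs / posting.size());
    }

    public static void main(String[] args) {
        String[] docs = {"web intelligence lab", "web crawling lab lab"};
        Map<String, Map<Integer, Integer>> postings = index(docs);
        // "web" occurs in every document, so its idf (and thus its weight) is 0;
        // "crawling" occurs only in document 1 and gets a positive weight.
        System.out.println(tfIdf(postings, "web", 0, docs.length));
        System.out.println(tfIdf(postings, "crawling", 1, docs.length));
    }
}
```

Zebra maintains exactly this kind of inverted structure on disk, at a much larger scale and with positional information.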

6.4 Tools

6.4.1 Database system

Zebra is a high-performance, general-purpose structured text indexing and retrieval engine. It reads and indexes documents structured into XML records and allows access to them through exact boolean search expressions and relevance-ranked free-text queries.

The system has 2 main components (see Figure 3): an indexer (zebraidx) and a server (zebrasrv). The indexer is responsible for indexing documents and creating the data structures for searching. The server handles incoming requests, does the searching and returns results. Both of them must be run from the Command Prompt.

We will use the DOM XML filter, which has a standard DOM XML structure as its internal data model and can therefore parse, index, and display any XML document type. It is well suited to work on standardized XML-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH and RSS, and performs equally well on any other non-standard XML format.

The DOM filter architecture (see Figure 4) consists of four different pipelines, each being a chain of arbitrarily many successive XSLT transformations of the internal DOM XML representations of documents.

The DOM XML filter pipelines use XSLT (and, if supported on your platform, even EXSLT), thus bringing full XPath support to the indexing, storage and display rules of XML documents.

The DOM XML filter pipelines are configured in the configuration file filter_dom_conf.xml (see section 6.4.3).

The root XML element <dom> and all other DOM XML filter elements reside in the namespace xmlns="http://indexdata.com/zebra-2.0".

All pipeline definition elements - i.e. the <input>, <extract>, <store>, and <retrieve> elements - are optional. Missing pipeline definitions are interpreted as do-nothing identity pipelines. We will use the <extract> and <retrieve> pipelines.


[Figure 3: Structure of the Zebra database system. The indexer zebraidx reads XML documents and builds the index and storage; the server zebrasrv accepts SRU/CQL queries, searches the index and returns XML documents as search results. Both share common configuration files (zebra.cfg), files for indexing (filter_dom_conf.xml, indexfilter.xsl) and files for searching (yazserver.xml, explain.xml, cql2pqf.txt and the XSL schema files).]

All pipeline definition elements may contain zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformation instructions, which are performed sequentially from top to bottom. The paths in the stylesheet attributes are relative to Zebra's working directory, or absolute to the file system root.

The <input> pipeline definition element may contain one XML Reader definition, <xmlreader level="1"/>, used to split an XML collection input stream into individual XML DOM documents at the prescribed element level.

Extract pipeline

The <extract> pipeline takes documents from any common DOM XML format to the Zebra-specific indexing DOM XML format. It may consist of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations, and the outcome is handed to the Zebra core to drive the process of building the inverted indexes.

Retrieve pipeline

Finally, there may be one or more <retrieve> pipeline definitions, each of them again consisting of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations. These are used for document presentation after search, and take the internal storage DOM XML to the requested output formats during record present requests.

The possible multiple <retrieve> pipeline definitions are distinguished by their unique name attributes; these are the literal schema or element set names used in SRW, SRU and Z39.50 protocol queries.
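Putting the pieces together, a minimal filter_dom_conf.xml could look roughly as follows. This is a sketch based on the elements described above; the stylesheet file names are examples only, and the generated file in your working directory is the authoritative version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <!-- split the input stream into one document per element at XML level 1 -->
  <input>
    <xmlreader level="1"/>
  </input>
  <!-- transform each document into the Zebra indexing format -->
  <extract>
    <xslt stylesheet="indexfilter.xsl"/>
  </extract>
  <!-- empty store pipeline: store the document as-is -->
  <store/>
  <!-- one retrieve pipeline per schema name used in queries -->
  <retrieve name="dc">
    <xslt stylesheet="dc.xsl"/>
  </retrieve>
</dom>
```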


[Figure 4: Zebra DOM filter architecture. An incoming XML document is split by the <input> pipeline into a common XML DOM; the <extract> pipeline transforms this into an indexing XML DOM that feeds the indexes, and the <store> pipeline into a storage XML DOM that is kept in storage; on presentation the <retrieve> pipeline transforms the stored DOM back into an outgoing XML document.]

6.4.2 Database client

A simple command-line client for searching Zebra databases is available as yaz-client. It is documented at http://www.indexdata.dk/yaz/doc/client.tkl

6.4.3 Database configuration generation

Configuration of Zebra is very detailed and quite complex. We provide a simplified method that can handle most of the database features. It is a command file, genzebraconf.bat (see Figure 5), that takes an XML file ZebraConf.xml and generates all needed configuration files for Zebra using a number of XSL transformations.

[Figure 5: Configuration generation for the Zebra database system. genzebraconf reads ZebraConf.xml and applies a set of XSL transformations, each generating one configuration file: genZebraYazServer.xsl (yazserver.xml), genZebraExplain.xsl (explain.xml), explain2cqlpqftxt.xsl (cql2pqf.txt), genZebraIndex.xsl (indexfilter.xsl), genZebraFilterConf.xsl (filter_dom_conf.xml) and genZebraCfg.xsl (zebra.cfg).]


ZebraConf.xml has 4 main sections:

• serverInfo contains information about how to run the server, like hostname and port number.

• zebra contains information for generating zebra.cfg.

• databaseInfo is textual information about the database.

• indexes defines record formats as well as how the records should be parsed and indexed:

– filters specifies filters (XSLT programs) that are used in the pipelines (see Figure 4 and the configuration file filter_dom_conf.xml).

– topLevel defines where to find the record identifier and at which XML level to split a document file into records.

– indexfield defines indexfields (searchable fields) and their properties: type of indexing (text for a traditional word index, phrase, numeric, date, urx for a URL index) and type of search (equal, relevance, phrase, date, and, or, urx).

– data defines XPath expressions describing how to extract text from the record and into which indexfield this text should be placed.

The Zebra working directory U:\db\ holds the configuration and other internal files. All zebra command-line commands have to be given in this directory!

6.5 Lab assignments

6.5.1 Generate configuration files

Start a Command Prompt window and change to the Zebra working directory:

cd db

Make sure all configuration files are there (re-generate them):

genzebraconf

6.5.2 Indexing

Commands to build an index for a data file:

Initialize database:

zebraidx init

Index records:

zebraidx update <DATA>

The argument <DATA> can be either a file or a directory; if it is a directory, all files it contains will be indexed. For testing use the files in U:\testdata.

The configuration file 'zebra.cfg' contains pointers to further configuration files and where results are to be stored, as well as general configuration options for the indexer. See the comments in the file.

Exactly how a file is indexed is determined by the index filter defined in the file 'filter_dom_conf.xml' pointed to by zebra.cfg. This file points to a number of filters (implemented as XSLT style-sheets) identified by schema names. The style-sheet identified by the attribute identifier="http://indexdata.dk/zebra/xslt/1" is used for indexing. The file 'filter_dom_conf.xml' also specifies the split-level to be 1, which means that it assumes that data files contain more than one document and are to be split into documents at hierarchical XML level 1. An example with two documents (where each document is contained in a tag <documentRecord>) and split-level 1 would be:

<?xml version="1.0" encoding="UTF-8"?>

<documentCollection>

<documentRecord id="17">

This is a document

</documentRecord>

<documentRecord id="22">

This is another document

</documentRecord>

</documentCollection>

The simple indexing style-sheet ('indexfilter.xsl') defines how to extract data from a document and how it is indexed. In the style-sheet each element like

<z:index name="title" type="w">

identifies an index. The rest of that element specifies how to extract data/text from the document for indexing. The 'type' attribute specifies how data is to be indexed on a low level. Default is "w" for text; other useful types are "d" for dates, "n" for numeric, and "u" for URLs (urx). Using the type '0' gives you raw indexing without any word splitting or lower-casing. indexfilter.xsl is generated from ZebraConf.xml, where 'indexfield/type' determines the index for each field.
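For orientation, an indexing style-sheet built around such z:index elements has roughly the following shape. This is a sketch: the z prefix bound to http://indexdata.com/zebra-2.0 matches the namespace given earlier, but the surrounding element names and the XPath expressions in your generated indexfilter.xsl may differ.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0">
  <xsl:template match="/">
    <z:record>
      <!-- index the content of the title element as words (type "w") -->
      <z:index name="title" type="w">
        <xsl:value-of select="/documentRecord/title"/>
      </z:index>
    </z:record>
  </xsl:template>
</xsl:stylesheet>
```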

• How many indexes are defined in 'indexfilter.xsl'?

• Compare indexfilter.xsl to ZebraConf.xml and identify how indexes are specified. Can you locate the XPath expressions for each index?

• What parts of the document will they contain?

• Try to index the documents in the directory U:\testdata.

• How many records are indexed? From which files?

The indexer creates inverted indexes on disk in the directory 'register'.

6.5.3 Database server

To be able to search the indexes created by zebraidx, you must start the database daemon, zebrasrv, like this:

zebrasrv -f yazserver.xml

The parameter file 'yazserver.xml' tells the server where to find the configuration file (zebra.cfg - it has to be the same as used when indexing data), and on which network port the server should listen for search requests (here 9999). This is automatically taken care of by the configuration generation.

The server daemon will accept and answer queries using any of several supported network protocols, including Z39.50, SRW and SRU. We will mostly use SRU, formatting queries using CQL (as in lab 2).

• Start the database daemon, zebrasrv, in a Command window of its own. Now the system should be ready to answer queries.


6.5.4 Database browsing

Zebra has a built-in feature that enables it to present an interface for testing the database, using the files explain.xml and docpath\*. By connecting with a browser directly to the Zebra server (at the designated port) you will be presented with a simple interface that enables testing of the indexes and searches that you configured in ZebraConf.xml. Note that indexes in CQL should be qualified with a set-name, in our case 'wi'.

Use this to test the database just built.

6.5.5 Searching

Another way of accessing the database server is with a dedicated client program like yaz-client.

• Use yaz-client (in another Command window) to test if the database generation was successful:

yaz-client tcp:localhost:9999

> format xml

> querytype cql

> find wi.dc:title=knowlib

> elements identity

> show 1

> elements dc

> show 1

> quit

(If this doesn't work, compare the document structure of rec1.xml in U:\testdata and check that the paths in ZebraConf.xml are correct. If not, fix them.)

• Why is the format of the document different the second time? (Hint: what do you think 'elements dc' does?)

• To further test your indexing and server, start your Web browser and enter the SRU query (on one line):

http://localhost:9999/default?version=1.1&operation=searchRetrieve&

query=wi.dc:title%3Dknowlib&startRecord=1&maximumRecords=1&recordSchema=dc

This will search the dc:title index for the word 'knowlib' and retrieve the first record using the schema 'dc'.

– Why doesn't it look like the document you indexed?

– Try with recordSchema=identity.

– What decides the formatting of the returned records?
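If you later want to issue such SRU queries from a program, assembling and percent-encoding the query URL is the error-prone part. A small sketch in Java, with our own class and method names:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SruUrl {
    // Build an SRU searchRetrieve URL; the CQL query is percent-encoded
    // so that characters like '=' and ':' survive transport.
    static String searchRetrieve(String host, int port, String cql,
                                 int start, int max, String schema)
            throws UnsupportedEncodingException {
        return "http://" + host + ":" + port + "/default"
             + "?version=1.1&operation=searchRetrieve"
             + "&query=" + URLEncoder.encode(cql, "UTF-8")
             + "&startRecord=" + start
             + "&maximumRecords=" + max
             + "&recordSchema=" + schema;
    }

    public static void main(String[] args) throws Exception {
        // The same query as above, built programmatically.
        System.out.println(searchRetrieve("localhost", 9999,
                "wi.dc:title=knowlib", 1, 1, "dc"));
    }
}
```

The resulting URL can then be fetched with URLConnection exactly as in the TermFreq example in Appendix A.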


6.5.6 Interaction indexing-searching

• Add a new index called 'uri' which indexes all URLs (like the index 'url') with type 'urx' and searches 'urx'.

Re-index the entire database (don't forget to initialize the database), and verify that you can search both the 'url' and the 'uri' indexes. What are the differences?

• Modify the search.phtml script to use the test database you just created. Does it work?

• Change the configuration (ZebraConf.xml) to

– add indexing of all metadata into the index 'meta'

– add indexing of all text (use index 'alltext') from the element 'documentText' in the records

• Create a database from the documents your crawler has collected. Depending on the format of the records, you may have to create appropriate indexing filters in ZebraConf.xml.

• Test it.

• Try searches both with truncation and with numeric relations like '>'.

• Search your database with and without ranking (CQL modifier /relevant). Is there any difference?

• Create a simple user interface for your own database with search field selection, relevance ranking and Boolean combinations of search terms.

• Find and test some useful databases on WWW that are searchable by SRU.

6.5.7 Integration

(For the ambitious student - BUT IMPORTANT)

Take the neural network classifier developed in lab 5 and integrate it into your crawler. Use the URLs in the positive class as seeds for the new focused crawler and crawl a number of pages. Verify that the result is reasonable.
