Web Intelligence By Otto Borchert April 28, 2003

Web Intelligence

By Otto Borchert

April 28, 2003

Background

• Application Layer / HTTP

• Agents

• Present - Google / PageRank

• Future - Semantic Web / OWL

Hypertext Transfer Protocol (HTTP)

• Application level protocol (World Wide Web)
• Runs over TCP, normally port 80
• Information retrieved using a URL (Uniform Resource Locator): protocol://host:port
• Typical HTTP packet format
  – START_LINE<CRLF>
  – MESSAGE_HEADER<CRLF>
  – <CRLF>
  – MESSAGE_BODY<CRLF>
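The packet layout above can be sketched in a few lines of Python. This is a minimal illustration, not a full HTTP implementation: the function name `build_http_message` is hypothetical, and the sketch leaves the body as-is rather than appending a trailing CRLF, since in practice the body length is signalled by headers such as Content-Length.

```python
CRLF = "\r\n"


def build_http_message(start_line, headers, body=""):
    """Assemble START_LINE, MESSAGE_HEADER lines, a blank line, and MESSAGE_BODY."""
    # One "Name: value" line per header, in insertion order.
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    head = CRLF.join([start_line, *header_lines])
    # The empty line (a bare CRLF) separates the headers from the body.
    return head + CRLF + CRLF + body


message = build_http_message(
    "GET /index.html HTTP/1.1",
    {"Host": "www.cs.ndsu.nodak.edu"},
)
print(repr(message))
```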

Request Messages

• Given by client on START_LINE
• Includes:
  – OPTIONS: request information about available options
  – GET: (one of the 2 most commonly used) retrieve document identified in URL
  – HEAD: (the other most commonly used) retrieve metainformation about document identified in URL (find out how old a page is)
  – POST: give information to server
  – PUT: store document under specified URL
  – DELETE: delete specified URL
  – TRACE: loopback request message
  – CONNECT: for use by proxies

Example request

• GET http://www.cs.ndsu.nodak.edu/index.html HTTP/1.1
  – Give entire descriptor in START_LINE
• GET /index.html HTTP/1.1
  Host: www.cs.ndsu.nodak.edu
  – Precise page given in START_LINE, host in MESSAGE_HEADER
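Both request forms identify the same page. As a rough sketch of how a server might recover the host and path from either one, the hypothetical helper `resolve_target` below parses the START_LINE and, when needed, the Host header. The path is written with a leading slash, as HTTP/1.1 expects.

```python
from urllib.parse import urlsplit


def resolve_target(request_text):
    """Return (host, path) for either request form."""
    lines = request_text.split("\r\n")
    method, target, version = lines[0].split(" ")

    # Collect MESSAGE_HEADER lines up to the blank line.
    headers = {}
    for line in lines[1:]:
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    if target.startswith("http://"):
        # Entire descriptor given in START_LINE.
        parts = urlsplit(target)
        return parts.netloc, parts.path or "/"
    # Page given in START_LINE, host given in the Host header.
    return headers["host"], target


full = "GET http://www.cs.ndsu.nodak.edu/index.html HTTP/1.1\r\n\r\n"
split = "GET /index.html HTTP/1.1\r\nHost: www.cs.ndsu.nodak.edu\r\n\r\n"
print(resolve_target(full))
print(resolve_target(split))
```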

Server reply

• Server replies with a Response Message

• Contains the version of HTTP being used, a 3-digit code indicating whether or not the request was successful, and the reason for giving that code

Codes

• 1xx – Informational (Request received, continuing process)

• 2xx – Success (Action successfully received, understood, and accepted)

• 3xx – Redirection (further action must be taken to complete the request)

• 4xx – Client Error (request contains bad syntax or cannot be fulfilled)

• 5xx – Server Error (server failed to fulfill an apparently valid request)
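Because the class is carried by the first digit, a client can classify any reply code with integer division. A small sketch (the names `STATUS_CLASSES` and `status_class` are illustrative only):

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a 3-digit status code to its class using the leading digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


for code in (200, 301, 404, 500):
    print(code, status_class(code))
```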

Example Replies

• HTTP/1.1 200 OK
  – Web page request succeeded, displays page
• HTTP/1.1 404 Not Found
  – The usual not found error
• HTTP/1.1 301 Moved Permanently
  – The page has moved; includes a MESSAGE_HEADER (Location), like the Host header in a request, to tell where the page has been moved to
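A client handles these replies by splitting the first line into version, code, and reason, then reading the headers. The sketch below assumes the new address arrives in a Location header, and the redirect URL `http://www.example.edu/new.html` is made up for illustration. The helper names are hypothetical.

```python
def parse_response_head(head):
    """Split a response head into (version, code, reason, headers)."""
    lines = head.split("\r\n")
    # The reason phrase may contain spaces, so split at most twice.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


def redirect_target(head):
    """Return the new URL for a 301 reply, or None for any other reply."""
    version, code, reason, headers = parse_response_head(head)
    if code == 301:
        return headers.get("location")
    return None


moved = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: http://www.example.edu/new.html\r\n"
    "\r\n"
)
print(redirect_target(moved))
```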

HTTP extras

• In version 1.0, one TCP connection was opened for each request; version 1.1 allows persistent connections

• HTTP was set up with web caching in mind. One can check the date a page was last updated and store the newest versions of frequently accessed pages on a local machine
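The caching idea can be sketched as a small local store that remembers each page's last-updated date and compares it with the date the server currently reports (for example, from a HEAD request). `PageCache` is a hypothetical class, and the dates use the standard HTTP date format as an assumption about what the server sends.

```python
from email.utils import parsedate_to_datetime


class PageCache:
    """Keep the newest known copy of each page, keyed by URL."""

    def __init__(self):
        self._pages = {}  # url -> (last_modified_text, body)

    def store(self, url, last_modified, body):
        self._pages[url] = (last_modified, body)

    def is_fresh(self, url, server_last_modified):
        """True if the cached copy is at least as new as the server's copy."""
        if url not in self._pages:
            return False
        cached_date, _ = self._pages[url]
        return parsedate_to_datetime(cached_date) >= parsedate_to_datetime(
            server_last_modified
        )


cache = PageCache()
url = "http://www.cs.ndsu.nodak.edu/index.html"
cache.store(url, "Mon, 28 Apr 2003 10:00:00 GMT", "<html>...</html>")
print(cache.is_fresh(url, "Mon, 28 Apr 2003 10:00:00 GMT"))
print(cache.is_fresh(url, "Tue, 29 Apr 2003 10:00:00 GMT"))
```

When `is_fresh` returns False, the client would fetch the page again with GET and store the new copy.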

Is the web intelligent?

• Intelligence is a poorly defined word anyway. For example, would you consider these intelligent?
  – Document analysis systems for cataloging and summarizing Web pages
  – Profiling systems for placing selective Web advertising
  – Data mining and analysis
  – Tools for searching databases supported by Web browsers
  – Translation tools that convert to and from human languages
  – Statistical software for network caching, routing, and tracking
  – Knowledge-based systems for automated e-mail reading
  – Smart agents for Internet-based product and service marketing
  – Video object recognition and searching

Is the web intelligent? (2)

• One of the most important advances in making the web intelligent is the use of agents.

• These agents take many forms including many listed on the previous slide

What is an agent?

• No standard definition
• Can be:
  – Web Crawler
  – Travel Agent
  – Secretary
  – Hard to distinguish between agent and program. Agent normally performs actions based on data it finds, without much human intervention
• Agents can be defined as intelligent as well
• Act as the glue for many of the following ideas

The Present of Web Intelligence - Google

• Presently the most used search engine the Internet has to offer.

• Provides a unique blend of computer hardware and software to complete millions of user searches each day

• Based on a system called PageRank

PageRank

• Developed by Larry Page and Sergey Brin at Stanford University (Google’s founders)
• Uses a system of link ranking
  – If there is a link from page A to page B, page B is correlated to page A
  – If page A is a strong page to begin with, page B becomes stronger as well
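The link-ranking idea can be illustrated with the simplified textbook form of PageRank, computed by repeated passes over the link graph. This is only a sketch under stated assumptions: the damping factor of 0.85, the fixed iteration count, and the three-page graph are illustrative choices, and Google's production ranking is not described in this detail here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page passes its rank along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A strong page makes the pages it links to stronger.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A and C both link to B, and B links back to A.
ranks = pagerank({"A": ["B"], "C": ["B"], "B": ["A"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this graph B collects links from two pages and ends up strongest, A benefits from being linked by the strong page B, and C, which nothing links to, stays at the baseline.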

Word Association

• On top of PageRank, there is also a system of word matching.
  – Word counts (Do the words exist on the page?)
  – Proximity checks (Are the words close together?)
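The two checks can be sketched as follows. The function `word_match` is a hypothetical illustration, not Google's actual matching logic: it counts occurrences of each query word and reports the smallest span (in word positions) that contains every query word at least once.

```python
from itertools import product


def word_match(text, query_words):
    """Return (counts, span): per-word counts and the tightest span covering all words."""
    tokens = text.lower().split()
    words = [word.lower() for word in query_words]
    positions = [[i for i, token in enumerate(tokens) if token == word] for word in words]
    counts = {word: len(found) for word, found in zip(words, positions)}

    # Word count check: every query word must exist on the page.
    if not words or not all(positions):
        return counts, None

    # Proximity check: try one position per word and keep the tightest span.
    span = min(max(combo) - min(combo) for combo in product(*positions))
    return counts, span


page = "the semantic web is a web of data"
print(word_match(page, ["semantic", "web"]))
print(word_match(page, ["semantic", "ontology"]))
```

The brute-force search over positions is fine for a short example; a real engine would need something far more efficient.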

Can’t you cheat PageRank?

• People try every day!
• Higher search ranking == More exposure
• Link Farms
  – Places where people merely have millions of links to a web page in hopes the target will move higher on the list.
  – Google’s answer: Page importance. Once link farms are discovered, they are given a negative rank, so if you have a page on a link farm, its rank will go down as well

Another way to cheat

• Put lots of words related to your page in your page (even if they are not visible)

• Google’s answer: PageRank is primary, cheaters are given lower priority

Moral Decisions

• Wired article
  – Computer screen shows location, query pairs for random searches on Google’s engines.
  – One search during the late hours on the West Coast was “How to stop a friend from committing suicide”
  – Can’t do much about it but make sure they get the right information the next time

The Future of Web Intelligence

• The Semantic Web

What is the Semantic Web?

• As the web presently stands, it is complete nonsense to most software applications.
  – To software, these look like two completely different statements:
    • The ball is round
    • The round ball
• The semantic web is a series of protocols meant to enrich the current web with meaning

Series of Protocols

• RDF – Resource Description Framework

• OWL – Web Ontology Language (extension of RDF)

Resource Description Framework

• From World Wide Web Consortium webpage
• RDF “defines a mechanism for describing resources that makes no assumptions about a particular application domain, nor defines (a priori) the semantics of any application domain. The definition of the mechanism should be domain neutral, yet the mechanism should be suitable for describing information about any domain”

RDF – Some examples

• Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila.
  – Abstract, conceptual framework
  – Concrete syntax using XML

Abstract example

• Subject (Resource)
  – http://www.w3.org/Home/Lassila
• Predicate (Property)
  – Creator
• Object (literal)
  – "Ora Lassila"

• Graphic
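The subject, predicate, and object form what is usually called a triple. One way to picture the abstract model in code is a plain three-field record. The `Triple` type and `objects_of` helper below are hypothetical names for illustration and are not part of RDF itself.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property of that resource
    obj: str        # the value of the property (here a literal)


statements = [
    Triple("http://www.w3.org/Home/Lassila", "Creator", "Ora Lassila"),
]


def objects_of(triples, subject, predicate):
    """Answer 'what is the <predicate> of <subject>?' from a list of triples."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


print(objects_of(statements, "http://www.w3.org/Home/Lassila", "Creator"))
```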

Concrete syntax

• Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila.

<rdf:RDF>
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>

Web Ontology Language

• What is an ontology?
  – “defines the terms used to describe and represent an area of knowledge”

• OWL defines ontologies for use on the web

• Actually an extension of RDF

Ontologies

• Date and Time

• Countries of the World

• Wines

• Space Shuttle Information

Some example OWL statements

<owl:Class rdf:ID="WineGrape">
  <rdfs:subClassOf rdf:resource="&food;Grape" />
</owl:Class>

<owl:Class rdf:ID="WhiteWine">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine" />
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasColor" />
      <owl:hasValue rdf:resource="#White" />
    </owl:Restriction>
  </owl:intersectionOf>
</owl:Class>
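Read informally, the first statement says every WineGrape is a Grape, and the second says a WhiteWine is anything that is both a Wine and has the color White. The Python sketch below mimics that reading with plain dictionaries. It is a simplification: the data layout and function names are hypothetical, and it treats missing facts as false, which OWL reasoners do not assume.

```python
# Subclass links, mirroring rdfs:subClassOf (child -> parent).
SUBCLASS_OF = {"WineGrape": "Grape"}


def is_subclass(cls, ancestor):
    """Follow subclass links upward until the ancestor is found or the chain ends."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def is_white_wine(individual):
    """Intersection of: member of class Wine, and hasColor has the value White."""
    is_wine = any(is_subclass(t, "Wine") for t in individual["types"])
    return is_wine and individual["properties"].get("hasColor") == "White"


# Hypothetical individuals for illustration.
chardonnay = {"types": ["Wine"], "properties": {"hasColor": "White"}}
merlot = {"types": ["Wine"], "properties": {"hasColor": "Red"}}

print(is_subclass("WineGrape", "Grape"))
print(is_white_wine(chardonnay), is_white_wine(merlot))
```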

Conclusion

• Web intelligence is a broad new field for exploration

• Present efforts like Google can be improved upon with more semantic information

• Any questions?