How the WWW operates - some history and terminology
Mark Levene
(Follow the links to learn more!)
Bush 1945 – As We May Think
The memex is a desktop machine, consisting of:
1) A user interface.
2) A repository of documents.
3) A search engine.
4) A linking mechanism.
5) Memex II can learn from its experience.
Quote from As We May Think
“The human mind … operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. … trails that are not frequently followed are prone to fade …. Yet the speed of action, the intricacy of trails, … is awe-inspiring beyond all else in nature.”
“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”
“There is a new profession of trail blazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record.”
Nelson’s Hypertext
• A universal hypertext.
• Xanadu is a distributed network of documents (1960’s).
• User interface - transpointing windows.
• Elaborate copyright mechanism.
• Superseded by the WWW.
Engelbart’s oN Line System (1968)
First working hypertext system, where documents were linked together.
Tim Berners-Lee’s WWW
• CERN 1990 - First browser
• Web protocols
– URL
– HTTP
– HTML
• World Wide Web Consortium (W3C) founded in 1994
Mosaic – The Web Browser that changed history
• Released late 1993 – developed by Marc Andreessen
• Netscape triggered the boom of the WWW throughout the 90’s
• Browser wars with Microsoft – IE won (2003 stats: 95.6% - IE, 3.7% NS)
Difference between the internet and the web
• Internet – physical computer network infrastructure on which the web is built.
• The World Wide Web (web) is a virtual network defined through the web protocols.
• The internet also carries other services, such as email, FTP and instant messaging.
Map of the Internet from 1998
Graph of Web pages related to www.dcs.bbk.ac.uk
IP Addresses
• Internet Protocol (IP) address – each machine connected to the Internet is identified by a unique 32-bit number.
• My IP address is: 193.61.29.152 (ipconfig.exe from command prompt)
• IP addresses may be dynamic.
• IP addresses have corresponding Domain Name System (DNS) names.
• My DNS name is: dhcp34.dcs.bbk.ac.uk
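To make the "32 bit number" concrete, here is a minimal sketch that converts the dotted address from the slide into a single integer and back. The helper names `dotted_to_int` and `int_to_dotted` are illustrative, not part of any standard; the standard-library `ipaddress` module is used only as a cross-check.

```python
import ipaddress


def dotted_to_int(dotted: str) -> int:
    """Pack the four decimal parts of a dotted address into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def int_to_dotted(number: int) -> str:
    """Unpack a 32-bit integer into dotted notation, one byte per part."""
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))


address = "193.61.29.152"          # the example address from the slide
as_number = dotted_to_int(address)
round_trip = int_to_dotted(as_number)

# Cross-check the hand-rolled conversion against the standard library.
library_number = int(ipaddress.IPv4Address(address))
```

Each of the four parts occupies one byte, which is why no part can exceed 255.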
URLs – Uniform Resource Locators
• Address of an internet resource
• E.g. http://www.dcs.bbk.ac.uk/~mark
– http is the protocol (others: ftp, mailto, file)
– www.dcs.bbk.ac.uk is the domain name
– ~mark is the path to the resource
• Query string follows a ? to run a script (dynamic URL), e.g.
– http://www.google.com/search?q=url
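The URL components listed above can be pulled apart with Python's standard `urllib.parse` module. This is a small sketch using the two example URLs from the slide; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Static URL: protocol, domain name and path.
home = urlsplit("http://www.dcs.bbk.ac.uk/~mark")
protocol = home.scheme        # the part before "://"
domain = home.netloc          # the domain name
path = home.path              # the path to the resource

# Dynamic URL: everything after "?" is the query string.
search = urlsplit("http://www.google.com/search?q=url")
query_string = search.query
parameters = parse_qs(query_string)   # maps each name to a list of values
```

Note that `parse_qs` returns a list for every parameter, because a name may legitimately occur more than once in a query string.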
HTTP – HyperText Transfer Protocol
• Protocol of messages exchanged by a user agent (client) and a web server.
• Most common request is GET:
– GET URL (agent’s request)
– HTTP/1.1 200 OK (server’s response)
– Response headers (including the content type, which tells the browser how to display the data)
– Blank line
– Response data follows
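The exchange above can be sketched without touching the network: one function builds the text of a GET request, another splits a response at the blank line. The function names and the canned response are assumptions made for illustration; a real HTTP/1.1 request sends the path on the request line and the domain name in a separate Host header.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into status code, reason, headers and data."""
    head, _, body = raw.partition(b"\r\n\r\n")      # the blank line ends the headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_get_request("www.dcs.bbk.ac.uk", "/~mark")

# A hand-written response standing in for what a server might send back.
canned_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>Hello</p>\n"
)
status, reason, headers, data = parse_response(canned_response)
```

In practice a library such as the standard `http.client` module handles these details; the sketch only shows where the status line, headers, blank line and data sit in the message.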
HTML – HyperText Markup Language
• I am assuming you all have some knowledge of HTML!
• The combination of the three components: URL, HTTP and HTML, defines the basic functionality of the web.
Server Log Files
• IP or DNS address of agent making request
• Timestamp, status, transfer volume
• Referrer URL (where the request was made from)
• Requested URL (from the HTTP request)
• User Agent (browser, OS)
• Other information such as authentication.
Cookies
• A cookie is a piece of text that a web site can store on the user's machine when the user is browsing the site.
• This information can be retrieved later by the web site, for example in order to identify a user returning to the site.
• Can be used for statistics, personalisation.
• Some security and privacy issues.
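As a rough illustration of the store-and-retrieve cycle described above, the standard `http.cookies` module can produce the header a site would send and parse the header a browser would return. The cookie name `visitor` and its value are made up for the example.

```python
from http.cookies import SimpleCookie

# Site side: create a cookie and render the header that stores it on the user's machine.
outgoing = SimpleCookie()
outgoing["visitor"] = "3649008"        # hypothetical identifier for this user
outgoing["visitor"]["path"] = "/"      # send it back for every page on the site
set_cookie_header = outgoing.output()  # a "Set-Cookie: ..." header line

# Later visit: the browser returns its cookies and the site parses them.
incoming = SimpleCookie()
incoming.load("visitor=3649008; interstitial=not")
returning_visitor = incoming["visitor"].value
```

Because the identifier comes back on every request, the site can recognise a returning user, which is what makes both the personalisation and the privacy concerns possible.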
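Tracking Users with Cookies Across Multiple Sites
Participants: Browser, Web site, Banner ad server
1) Browser to Web site: HTTP request for web page
2) Web site to Browser: send web page, which includes ad links
3) Browser to Banner ad server: HTTP request for ad, with cookie
4) Banner ad server to Browser: send ad and update cookie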
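The flow in this diagram can be imitated with a toy simulation. Everything here is hypothetical (the `AdServer` and `Browser` classes, the `ads.example` domain, the site names); it only illustrates why a single ad server embedded in many sites can link visits together.

```python
import itertools


class AdServer:
    """Toy banner ad server that keeps a browsing profile per cookie."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.profiles = {}                      # cookie value -> sites seen

    def serve_ad(self, cookie, referring_site):
        if cookie is None:                      # first contact: issue a new cookie
            cookie = f"uid{next(self._next_id)}"
        self.profiles.setdefault(cookie, []).append(referring_site)
        return cookie                           # sent back to the browser with the ad


class Browser:
    """Toy browser that stores one cookie per domain."""

    def __init__(self):
        self.cookie_jar = {}

    def visit(self, site, ad_server, ad_domain="ads.example"):
        # The page from `site` contains an ad link, so the browser also
        # contacts the ad server and includes any cookie it holds for it.
        cookie = self.cookie_jar.get(ad_domain)
        self.cookie_jar[ad_domain] = ad_server.serve_ad(cookie, site)


ads = AdServer()
browser = Browser()
for site in ("news.example", "shop.example"):
    browser.visit(site, ads)
```

After the two visits, the ad server holds one profile listing both sites, even though the two web sites never exchanged any information directly.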
W3C Extended Logging Definitions
Field              Identifier        Description
Date               date              The date on which the activity occurred
Time               time              The time at which the activity occurred
Client IP Address  c-ip              The IP address of the client that accessed your server
User Name          cs-username       The name of the authenticated user who accessed your server; anonymous users are represented by -
Service Name       s-sitename        The Internet service and instance number that was accessed by a client
Server Name        s-computername    The name of the server on which the log entry was generated
Server IP Address  s-ip              The IP address of the server on which the log entry was generated
Server Port        s-port            The port number the client is connected to
Method             cs-method         The action the client was trying to perform
URI Stem           cs-uri-stem       The resource accessed
URI Query          cs-uri-query      The query, if any, the client was trying to perform
Protocol Status    sc-status         The status of the action, in HTTP or FTP terms
Win32 Status       sc-win32-status   The status of the action, in terms used by Microsoft Windows
Bytes Sent         sc-bytes          The number of bytes sent by the server
Bytes Received     cs-bytes          The number of bytes received by the server
Time Taken         time-taken        The duration of time, in milliseconds, that the action consumed
Protocol Version   cs-version        The protocol (HTTP, FTP) version used by the client
Host               cs-host           The content of the host header
User Agent         cs(User-Agent)    The browser used on the client
Cookie             cs(Cookie)        The content of the cookie sent or received, if any
Referrer           cs(Referer)       The previous site visited by the user; this site provided a link to the current site
Prefixes:
s = server actions
c = client actions
cs = client-to-server actions
sc = server-to-client actions
Example of extended log entries format
Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status cs(User-Agent)
2003-01-07 08:58:12 193.133.103.63 DCSNT\gtuff01 193.61.29.180 80 GET /support/ - 302 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+NT+5.0)
2003-01-07 08:58:19 193.133.103.63 - 193.61.29.180 80 GET /intranet/cs/ - 401 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+NT+5.0)
Another example of extended log entries format
Fields: date time c-ip cs-username cs-method cs-uri-stem cs-uri-query sc-status cs-host cs(User-Agent) cs(Cookie) cs(Referer)
2003-02-01 00:01:44 80.192.25.125 - GET /library/HM.js - 200 www.i-resign.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) i%2Dresign%2Dlogin=UID=3649008;+interstitial=not;+ASPSESSIONIDGQQGQYAO=OAMCCDGBODIOFHLAFHFAGKHD -
2003-02-01 00:02:19 62.255.0.5 - GET /uk/discussion/new_topic.asp t=331 200 www.i-resign.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) - http://www.google.co.uk/search?q=i+hate+my+job&ie=UTF-8&oe=UTF-8&hl=en&meta=cr%3DcountryUK%7CcountryGB
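Since each log entry is a space-separated list of values in a fixed field order, it can be turned into a dictionary with a few lines of Python. This is a sketch under the assumption that the field order is the one shown in this example; `parse_entry` is an illustrative name, and a value of `-` is treated as "not recorded".

```python
from urllib.parse import urlsplit, parse_qs

# Field order assumed for this particular log; other servers may log other fields.
FIELDS = ("date time c-ip cs-username cs-method cs-uri-stem cs-uri-query "
          "sc-status cs-host cs(User-Agent) cs(Cookie) cs(Referer)").split()

# The second log entry from the slide.
LINE = ("2003-02-01 00:02:19 62.255.0.5 - GET /uk/discussion/new_topic.asp "
        "t=331 200 www.i-resign.com "
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) - "
        "http://www.google.co.uk/search?q=i+hate+my+job&ie=UTF-8&oe=UTF-8"
        "&hl=en&meta=cr%3DcountryUK%7CcountryGB")


def parse_entry(line, fields=FIELDS):
    """Pair each space-separated value with its field name; '-' becomes None."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return {name: (None if value == "-" else value)
            for name, value in zip(fields, values)}


entry = parse_entry(LINE)

# Spaces inside a value are logged as "+", so undo that for display.
user_agent = entry["cs(User-Agent)"].replace("+", " ")

# The referrer is itself a URL; its query string reveals the search terms used.
search_terms = parse_qs(urlsplit(entry["cs(Referer)"]).query)["q"][0]
```

The last step shows why referrer data is valuable to site owners: it records the search query that led the visitor to the page.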
Brief History of Search Engines
• Yahoo! (www.yahoo.com) – (1994-) directory service and search engine.
• Infoseek – (1994-2001) search engine.
• Inktomi – (1995-) search engine infrastructure, acquired by Yahoo! in 2003.
• AltaVista – (1995-) search engine, acquired by Overture in 2003.
• AlltheWeb – (1999-) search engine, acquired by Overture in 2003.
• Ask Jeeves (www.ask.com) – (1996-) Q&A and search engine, acquired by IAC/InterActiveCorp in 2005.
• Overture – (1997-) pay-per-click search engine, acquired by Yahoo! in 2003.
• Bing (www.bing.com) – (2009-) Microsoft's rebranded search engine; it was Live Search from 2006 and MSN Search before that.
• Google (www.google.com) – (1998-) search engine.