Outline
• What is the Web?
• What’s on the Web?
• What is the nature of the Web?
• Preserving the Web
Defining the Web
• HTTP, HTML, or URL?
• Static, dynamic or streaming?
• Public, protected, or internal?
Economics of the Web in 1995
• Affordable storage
– 300,000 words/$
• Adequate backbone capacity
– 25,000 simultaneous transfers
• Adequate “last mile” bandwidth
– 1 second/screen
• Display capability
– 10% of US population
• Effective search capabilities
– Lycos (now Google), Yahoo
Nature of the Web
• Over one billion pages by 1999
– Growing at 25% per month!
– Google indexed about 3 billion pages in 2003
• Unstable
– Changing at 1% per week
• Redundant
– 30-40% (near) duplicates
– e.g., Unix man page tree
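A quick worked calculation shows what these rates imply. This is a minimal sketch, not part of the original slides: it assumes the 25% monthly growth compounds at a constant rate, and that each page changes independently with a 1% chance per week. Both are simplifying assumptions, and the function names are illustrative.

```python
import math


def annual_factor(monthly_rate: float) -> float:
    """Growth factor after 12 months of compounding at monthly_rate."""
    return (1.0 + monthly_rate) ** 12


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed to double at a constant compound monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


def fraction_changed(weekly_rate: float, weeks: int) -> float:
    """Fraction of pages changed at least once after `weeks`.

    Assumes every page changes independently with probability
    weekly_rate in each week (an assumption, not a measured model).
    """
    return 1.0 - (1.0 - weekly_rate) ** weeks


if __name__ == "__main__":
    print(f"25%/month compounds to about x{annual_factor(0.25):.1f} per year")
    print(f"Doubling time: about {doubling_time_months(0.25):.1f} months")
    print(f"Changed within a year at 1%/week: {fraction_changed(0.01, 52):.0%}")
```

Under these assumptions the Web would double roughly every three months, which cannot be sustained for long: going from one billion pages in 1999 to about 3 billion indexed in 2003 implies a much slower average rate.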
What’s a Web “Site”?
• OCLC counts any server at port 80
– Misses many servers at other ports
• Some servers host unrelated content
– Geocities
• Some content requires specialized servers
– rtsp
World Trade in 2001
Rank  Exporters             Value   Share  Change   Rank  Importers             Value   Share  Change
1     United States         730.8   11.9   -6       1     United States         1180.2  18.3   -6
2     Germany               570.8   9.3    3        2     Germany               492.8   7.7    -1
3     Japan                 403.5   6.6    -16      3     Japan                 349.1   5.4    -8
4     France                321.8   5.2    -1       4     United Kingdom        331.8   5.2    -3
5     United Kingdom        273.1   4.4    -4       5     France                325.8   5.1    -2
6     China                 266.2   4.3    7        6     China                 243.6   3.8    8
7     Canada                259.9   4.2    -6       7     Italy                 232.9   3.6    -2
8     Italy                 241.1   3.9    0        8     Canada                227.2   3.5    -7
9     Netherlands           229.5   3.7    -2       9     Netherlands           207.3   3.2    -5
10    Hong Kong, China      191.1   3.1    -6       10    Hong Kong, China      202.0   3.1    -6
        domestic exports    20.3    0.3    -14              retained imports a  31.2    0.5    -11
        re-exports          170.8   2.8    -5
Source: World Trade Organization
Widely Spoken Languages
[Chart: speakers (millions), axis 0 to 800, by language: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]
Source: http://www.g11n.com/faq.html
Web Page Languages
[Chart legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]
Source: Jack Xu, Excite@Home, 1999
European Web Size: Exponential Growth
[Chart: billions of words (axis ticks 0, 1, 10, 100, 1,000, 10,000) by date, Oct-96 through Oct-05; series: English, Other European]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
Live Streams
• Almost 2000 Internet-accessible radio and television stations
[Chart: live streams by language; categories: English, Other Languages; values: 529, 1367]
Source: www.real.com, Feb 2000
Streaming Media
• SingingFish indexes 35 million streams
• 60% of queries are for music
– Then movies
– Then sports
– Then news
Web Crawl Challenges
• Temporary server interruptions
• Discovering “islands” and “peninsulas”
• Duplicate and near-duplicate content
• Dynamic content
• Link rot
• Server and network loads
• Have I seen this page before?
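One common way to approach the "have I seen this page before?" question is to reduce each URL to a canonical form and keep a set of the forms already visited. The sketch below is illustrative only: the normalization rules (lowercase host, drop default ports and fragments, drop any user-info) and the names `canonicalize` and `SeenUrls` are assumptions, not something the slides prescribe.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url for seen-before checks."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep non-default ports: servers at other ports are distinct sites.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenUrls:
    """In-memory record of URLs a crawler has already queued."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Record url; return True only the first time it is seen."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    seen = SeenUrls()
    print(seen.add_if_new("http://www.example.com/index.html"))
    print(seen.add_if_new("HTTP://WWW.Example.COM:80/index.html#top"))
```

URL normalization only catches aliases of the same address; mirrors and near-duplicates at different addresses need content-based checks.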
Duplicate Detection
• Structural
– Identical directory structure (e.g., mirrors, aliases)
• Syntactic
– Identical bytes
– Identical markup (HTML, XML, …)
• Semantic
– Identical content
– Similar content (e.g., with a different banner ad)
– Related content (e.g., translated)
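Two of these cases can be sketched in a few lines. Identical bytes can be detected by comparing hashes, and similar content can be scored by the overlap of word shingles (Jaccard resemblance). Shingling is a technique swapped in here for illustration; the slides do not say which method is intended, and the shingle length `k = 3` is an arbitrary choice.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Hash of the raw bytes; equal fingerprints mean identical bytes
    (barring hash collisions)."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of all runs of k consecutive lowercase words in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    page_a = "BUY NOW the quick brown fox jumps over the lazy dog"
    page_b = "SALE TODAY the quick brown fox jumps over the lazy dog"
    print(byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode()))
    print(round(resemblance(page_a, page_b), 2))
```

Two pages that differ only in a short banner share most of their shingles, so their resemblance stays high even though their byte fingerprints differ. A crawler would pick a threshold above which pages count as near-duplicates; any particular threshold is a tuning decision.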
Robots Exclusion Protocol
• Based on voluntary compliance by crawlers
• Exclusion by site
– Create a robots.txt file at the server’s top level
– Indicate which directories not to crawl
• Exclusion by document (in HTML head)
– Not implemented by all crawlers
<meta name="robots" content="noindex,nofollow">
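A compliant crawler might check both forms of exclusion roughly as follows. This is a sketch using Python's standard `urllib.robotparser` and `html.parser` modules; the robots.txt content, the crawler name `ExampleCrawler`, and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site might serve it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def meta_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    rules = load_rules(EXAMPLE_ROBOTS_TXT)
    print(rules.can_fetch("ExampleCrawler", "http://www.example.com/private/a.html"))
    print(rules.can_fetch("ExampleCrawler", "http://www.example.com/index.html"))
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(meta_directives(page))
```

Because compliance is voluntary, nothing in the protocol prevents a crawler from skipping these checks; the code only shows what a well-behaved crawler would do before fetching or indexing.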
The Deep Web
• Dynamic pages, generated from databases
• Not easily discovered using crawling
• Perhaps 400-500 times larger than surface Web
• Fastest growing source of new information
Deep Web
• 60 Deep Sites Exceed Surface Web by 40 Times
• National Climatic Data Center (NOAA)
– Type: Public
– URL: http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html
– Web Size (GBs): 366,000
• NASA EOSDIS
– Type: Public
– URL: http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html
– Web Size (GBs): 219,600
• National Oceanographic (combined with Geophysical) Data Center (NOAA)
– Type: Public/Fee
– URL: http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/
– Web Size (GBs): 32,940
• Alexa
– Type: Public (partial)
– URL: http://www.alexa.com/
– Web Size (GBs): 15,860
• Right-to-Know Network (RTK Net)
– Type: Public
– URL: http://www.rtk.net/
– Web Size (GBs): 14,640
• MP3.com
– Type: Public
– URL: http://www.mp3.com/
Hands on: The Wayback Machine
• Internet Archive
– Stored Alexa.com Web crawls since 1997
– http://archive.org
• Check out Maryland’s Web site in 1997
• Check out the history of your favorite site
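For the hands-on exercise, archived pages can also be reached directly by URL. The sketch below builds a Wayback Machine address of the form `http://web.archive.org/web/<timestamp>/<original URL>`, which the archive normally resolves to the capture closest to the requested date. The helper name `wayback_url` is illustrative, and http://www.umd.edu/ is only an assumed address for "Maryland's Web site"; the slides do not give one.

```python
from datetime import date

# Prefix used by the Internet Archive for archived page addresses.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, when: date) -> str:
    """Build a Wayback Machine address for original_url near `when`.

    The timestamp is written as YYYYMMDD; no request is made here.
    """
    timestamp = when.strftime("%Y%m%d")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Assumed URL for the University of Maryland home page.
    print(wayback_url("http://www.umd.edu/", date(1997, 6, 1)))
```

Pasting the printed address into a browser should show the nearest available capture, if the archive holds one for that site.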