EITF25 Internet - Web Search
Anders Ardö
EIT – Electrical and Information Technology, Lund University
November 20, 2012
A. Ardö, EIT EITF25 Internet - Web Search November 20, 2012 1 / 50
Agenda
1 Web search
2 Web search engines
3 Web robots, crawler
4 Focused Web crawling
5 Web search vs Browsing
6 Privacy, Filter bubble
Why Web search ...
Explosion of (digital) information within all types of information collections
Harder and harder to follow the information flow
Faster way to find relevant information when it's needed
Challenges
  Distributed, dynamic data
  Large volume
  Unstructured, heterogeneous data
Size of the Web
no one knows
estimates (text pages)
  2005 'more than 11.5 billion'
  2007 'more than 20 billion'
  2010 '20 - 55 billion'
Google claims to know of 10^12 unique URLs (text, images, ...)
Important questions
Digital Libraries
How do I find relevant information?
How do I navigate the digital information landscape?
How to structure and organize information to ease knowledge extraction?
How to create collections, properly organized, with relevant material?
How to keep collections updated?
Search Engine - Basic structure
[Figure: a Web robot fetches Web pages from the Web over HTTP into a database; a Web browser sends a query over HTTP to an interface (CGI-script) on the database and gets an answer back. Annotations: size, efficiency, response time.]
software crawling the web (much like a human clicking on links)
collect all found web-pages into a database (IR system)
offer a web-interface to that database
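The "database (IR system)" step can be illustrated with an inverted index, a common structure that maps each word to the pages containing it. This is only a minimal sketch: the slides do not say which index structure a given engine uses, and the page URLs and texts below are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (illustration only).
PAGES = {
    "http://example.org/a": "Web robots crawl the web by following links",
    "http://example.org/b": "A search engine offers a web interface to a database",
    "http://example.org/c": "Carnivorous plants catch insects",
}

def build_index(pages):
    """Map every lower-cased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain all words of the query (boolean AND)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(PAGES)
print(sorted(search(INDEX, "web")))
print(sorted(search(INDEX, "web database")))
```

The web interface then only has to turn a query string into a call to `search` and render the resulting URLs.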
Size of search engines
not published
guesses 1 - 20 - 50 billion pages

Search Engine   Total URLs   Unique URLs   Overlap (%)
Google          182          166           8.79
Altavista       181          167           7.74
Hotbot          200          170           15
Scirus          174          164           5.75
Bioweb          200          200           0.0
From: Rather, Lone, Shah: "Overlap in Web Search Results: A Study of Five Search Engines", Library Philosophy and Practice 2008, ISSN 1522-0222
http://www.webpages.uidaho.edu/~mbolin/rather-lone-shah.htm
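The slide does not say how the overlap column was computed. Assuming it is the share of returned URLs that are not unique, (total - unique) / total, the following sketch reproduces the listed percentages to within 0.01. The function name is chosen here for illustration.

```python
def overlap_percent(total_urls, unique_urls):
    """Assumed definition: percentage of the returned URLs that are not unique."""
    return 100.0 * (total_urls - unique_urls) / total_urls

# (total URLs, unique URLs, overlap as listed in the table)
TABLE = {
    "Google": (182, 166, 8.79),
    "Altavista": (181, 167, 7.74),
    "Hotbot": (200, 170, 15.0),
    "Scirus": (174, 164, 5.75),
    "Bioweb": (200, 200, 0.0),
}

for engine, (total, unique, listed) in TABLE.items():
    computed = overlap_percent(total, unique)
    print(f"{engine:10s} computed {computed:6.3f}  listed {listed}")
```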
Google
started in the late 1990s
estimated 450,000 low-cost commodity servers (2006)
1 trillion links to web pages (July 2008)
"over 8 billion web pages"
estimate 40 billion pages?
goal is to index all the world's data
Google Flu Trends
Search engine examples
Google, Bing, Yahoo, (DuckDuckGo)
Search Engine - Application
[Figure: search engine as an application. Labels: Web browser, HTTP, Web server, CGI-script, Database, Web pages. Access paths shown: CGI/HTML, SRU/XML over HTTP, (Z39.50 ...), (ASN, ...).]
Meta Search Engine - Application
MetaSearch Engine
software that simultaneously searches several individual search engines
collects, reviews and ranks their answers
and gives them back to the user in a merged/condensed form
the results are no better than the search engine databases they are obtained from
MetaSearch engines
Simultaneously search several individual search engines
Query translation
Result merging
  Simple merge
  Duplicate detection
  tf-idf/similarity ranking
  Position based
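Result merging with duplicate detection and position-based ranking could look roughly like the sketch below. The scoring rule (sum of 1/position over the engines that returned a URL) is just one possible position-based scheme picked for illustration, not the method of any particular metasearch engine, and the result lists are invented.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list.

    Duplicate detection: URLs are compared after lower-casing and removing a
    trailing slash (a deliberately crude rule for this sketch).
    Position-based score: each engine contributes 1/position for a URL.
    """
    scores = defaultdict(float)
    for results in result_lists:
        counted = set()  # count each URL at most once per engine
        for position, url in enumerate(results, start=1):
            key = url.lower().rstrip("/")
            if key in counted:
                continue
            counted.add(key)
            scores[key] += 1.0 / position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda key: (-scores[key], key))

# Hypothetical answers from two engines for the same query.
ENGINE_A = ["http://a.example/", "http://b.example/", "http://c.example/"]
ENGINE_B = ["http://b.example", "http://d.example/", "http://a.example/"]

MERGED = merge_results([ENGINE_A, ENGINE_B])
print(MERGED)
```

A page that several engines rank highly ends up first, while pages returned by only one engine keep their relative order further down.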
MetaSearch Engine examples
Dogpile, Yippy, DuckDuckGo
Special (Vertical) search engines
prices
  ex: prisjakt, PriceRunner, ...
  http://www.pricerunner.co.uk/
  http://www.prisjakt.nu/
jobs
  ex: freejobsearch, jobspider, ...
  http://freejobsearch.org/
  http://www.jobspider.com/
Housing
  ex: rightmove, hemnet, bovision, ...
  http://www.rightmove.co.uk/
  http://www.hemnet.se/
  http://bovision.se/
... and so on ...
Other Search Engines
Wolfram Alpha
Wolfram|Alpha introduces a fundamentally new way to get knowledge and answers – not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
From http://www.wolframalpha.com/about.html
Wolfram Alpha example
Web Robot - Basic architecture
Spider, Crawler, Robot, agent, ...
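[Figure: crawler loop. Seed URLs initialise the frontier (list of unvisited pages). The robot gets a URL from the frontier, fetches the Web page from the Web, analyzes it, and saves it to the database (repository of visited pages); the links found are fed back to the frontier as new URLs.]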
Web Robot - Ethics
Important - BE NICE
Do not overload network or server
Robot exclusion protocol
  check for http://www.foobar.com/robots.txt
  HTML meta-tag ROBOTS

robots.txt:
User-agent: *
Disallow: /cgi-bin/
Disallow: /DATA/
Disallow: /Images/

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
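A polite robot can check the exclusion rules before each fetch. The sketch below feeds the robots.txt from the slide to Python's standard `urllib.robotparser` module; the robot name `ExampleBot` and the tested paths are made up for illustration, and a real crawler would download the file from the server rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /DATA/
Disallow: /Images/
"""

def make_parser(robots_txt):
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

PARSER = make_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical user-agent name.
for path in ("/index.html", "/cgi-bin/search", "/DATA/raw.csv"):
    url = "http://www.foobar.com" + path
    print(path, "allowed" if PARSER.can_fetch("ExampleBot", url) else "disallowed")
```

The ROBOTS meta-tag is a separate, per-page mechanism that has to be checked after the page has been fetched and parsed.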
Web Robot - Problems
Network failures
Erroneous URLs
Unreachable servers
Password protection
Spider traps
Recursive URLs
Character set encodings
Same page - different URLs
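The "same page - different URLs" problem is usually reduced by normalizing URLs before they are compared with the set of seen pages. A minimal sketch using only `urllib.parse` follows; the chosen rules (lower-case scheme and host, drop the default port and the fragment, use "/" for an empty path) are a small illustrative subset, and the example URLs are invented.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Return a canonical form of url so that trivial variants compare equal.

    Sketch only: user info is dropped, and query parameter order and
    path case are left untouched.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

VARIANTS = [
    "HTTP://WWW.Example.COM:80/a/b.html#top",
    "http://www.example.com/a/b.html",
]
print({normalize(u) for u in VARIANTS})
```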
Web Robot - More Problems
Hidden Web
Databases
Dynamic scripts
... ?
Web Robot - Traversal algorithms
Depth first (Stack, LIFO queue)
Breadth first (FIFO queue)
Best first (How?)
Relevance order (How?)
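The traversal order is determined by how the frontier hands out the next URL. The sketch below runs the same crawl over a tiny in-memory link graph with a stack, a FIFO queue, and a priority queue. The graph and the scores are invented, and the score function stands in for whatever "best" or "relevance" measure a real robot would use.

```python
import heapq
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
# Hypothetical "goodness" of each page, used only by best-first.
SCORES = {"A": 0, "B": 5, "C": 1, "D": 0, "E": 9}

def crawl_order(links, seed, strategy, score=None):
    """Return the order in which pages are visited for a given frontier policy."""
    visited = []
    seen = {seed}
    if strategy == "best":
        frontier = [(-score(seed), seed)]  # min-heap on negated score
    else:
        frontier = deque([seed])
    while frontier:
        if strategy == "depth":
            url = frontier.pop()           # LIFO: newest link first
        elif strategy == "breadth":
            url = frontier.popleft()       # FIFO: oldest link first
        else:
            url = heapq.heappop(frontier)[1]  # highest score first
        visited.append(url)
        for link in links.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if strategy == "best":
                heapq.heappush(frontier, (-score(link), link))
            else:
                frontier.append(link)
    return visited

print("depth  ", crawl_order(LINKS, "A", "depth"))
print("breadth", crawl_order(LINKS, "A", "breadth"))
print("best   ", crawl_order(LINKS, "A", "best", score=SCORES.get))
```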
Focused Crawling
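[Figure: the basic crawler architecture (seed URLs, frontier with list of unvisited pages, get URL, fetch Web page from the Web, analyze, save to the database/repository of visited pages, links back to the frontier) extended with a URL focus filter and a focus filter: pages within the focus are saved, pages not in focus are not.]
Focus:
  Domain
  Project
  Country
  Region
  Topic
  Subject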
Topic-specific Web-crawling
Problem
  Construct a topic specific search-engine (ex. Carnivorous plants)
Solution
  Make a Web-crawler walk through the Internet and collect all pages with topic 'Carnivorous plants'
easier said than done!
Conditions
Page is about Carnivorous plants
  ⇒ automated subject classification
There are many pages on the Internet
  ⇒ where to start?
  ⇒ look only at interesting links
  ⇒ take the most important pages first
Automated Classification technologies
Machine learning methods
  Statistical models (Bayes, SVM, ...)
  ANN
Information Retrieval methods
  Clustering (no predefined categories)
Library Science methods
  String matching + Thesaurus
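The "string matching + thesaurus" approach can be sketched as follows: a page is scored by the weighted thesaurus terms found in its text, normalized by page length, and accepted if the score passes a threshold. The terms, weights and threshold below are invented for the carnivorous plants example and are not taken from any real thesaurus.

```python
import re

# Hypothetical mini-thesaurus: term -> weight.
THESAURUS = {
    "carnivorous plant": 3.0,
    "venus flytrap": 2.0,
    "pitcher plant": 2.0,
    "drosera": 2.0,
    "sundew": 1.0,
}
THRESHOLD = 0.05  # hypothetical cut-off, would need tuning on real pages

def topic_score(text, thesaurus=THESAURUS):
    """Weighted count of thesaurus terms in text, divided by the word count."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    padded = " " + " ".join(words) + " "
    # A term must start at a word boundary; the end is left open so that
    # simple plurals ("plants", "sundews") also match.
    total = sum(weight * padded.count(" " + term)
                for term, weight in thesaurus.items())
    return total / len(words)

def is_on_topic(text):
    return topic_score(text) >= THRESHOLD

ON_TOPIC = "The Venus flytrap and the sundew (Drosera) are carnivorous plants."
OFF_TOPIC = "Cheap flights and hotel deals for your next holiday."
print(topic_score(ON_TOPIC), is_on_topic(ON_TOPIC))
print(topic_score(OFF_TOPIC), is_on_topic(OFF_TOPIC))
```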
Topic Filter
Internet is Big
[Figure: walking the link graph. First page: OK, save. Choose among its links, fetch a new page, ask "Page OK?", save it if so, and continue with a new page.]
Basic Algorithm
Add good start pages (seeds) to frontier
LOOP:
    Choose a page among links
    Page OK?
        Save page
        Add all links to frontier
    Go to LOOP

Save (database(s)):
    All relevant pages (search engine database)
    All analyzed pages (seen pages)
    All new links (frontier)
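The loop above can be written out as a small runnable sketch. To keep it self-contained, the "Web" is an in-memory dictionary and the "Page OK?" test is a simple keyword check; both are stand-ins invented for illustration, as are the function and variable names.

```python
from collections import deque

# Hypothetical miniature Web: URL -> (page text, outgoing links).
WEB = {
    "seed": ("carnivorous plants overview", ["a", "b"]),
    "a": ("venus flytrap, a carnivorous plant", ["c"]),
    "b": ("used cars for sale", ["d"]),
    "c": ("carnivorous sundew photos", []),
    "d": ("carnivorous pitcher plants", []),
}

def focused_crawl(seeds, fetch, is_relevant, max_pages=100):
    """Return (relevant pages, seen URLs, remaining frontier)."""
    frontier = deque(seeds)   # all new links
    seen = set()              # all analyzed pages
    relevant = {}             # search engine database
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()          # choose a page among links
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        if page is None:                  # unreachable or erroneous URL
            continue
        text, links = page
        if is_relevant(text):             # Page OK?
            relevant[url] = text          # save page
            frontier.extend(link for link in links if link not in seen)
    return relevant, seen, list(frontier)

RELEVANT, SEEN, FRONTIER = focused_crawl(
    ["seed"], WEB.get, lambda text: "carnivorous" in text)
print(sorted(RELEVANT), sorted(SEEN), FRONTIER)
```

In this toy Web the relevant page `d` is never found, because it is only reachable through the non-relevant page `b`, whose links are discarded.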
Problems I
Which new page?
Problems II
Isolated pages
Problems III
Non relevant pages "blocking"
Compromises
Precision/recall
completeness/speed
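Precision and recall have standard definitions: precision is the fraction of collected pages that are relevant, and recall is the fraction of all relevant pages that were collected. A small sketch with invented page sets shows the computation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of page identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical crawl: four pages collected, three pages truly on topic.
RETRIEVED = {"a", "b", "c", "d"}
ON_TOPIC_PAGES = {"a", "c", "e"}

PRECISION, RECALL = precision_recall(RETRIEVED, ON_TOPIC_PAGES)
print(f"precision={PRECISION:.2f} recall={RECALL:.2f}")
```

A stricter topic filter tends to raise precision at the cost of recall, and a more permissive one does the opposite.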
Browsing vs search
Search
  LOTS of data
  Unstructured
  Unrelated items clutter results
Browsing
  Small amounts of data
  Hierarchically structured
  Quality assessed
Browsing examples
Dmoz (ODP), Yahoo! Directory
Filter bubble
What do search engines or social sites know about me?
  At least location, search history, click history, likes, and more ...
Personalize what's shown (search results, ...) using this info
Show us what we want/like to see - algorithmically
... and not what's relevant (who decides that?)
Problem?
Filter bubble example I
From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out
Filter bubble example II
From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out
ToS-DR
Terms-of-Service – Didn’t Read; http://tos-dr.info/
you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.
Facebook: you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (IP License).
Privacy
Search history, clicks, photos, documents, comments, ...
leads to a profile
that can be used by ads or sold, or even stolen
which might lead to it ending up in unwanted places
and used against you
Beware!
Questions!
QUESTIONS?