DESCRIPTION
Introduction to Web Robots, Crawlers & Spiders. Instructor: Joseph DiVerdi, Ph.D., MBA. Web Robot Defined. A Web Robot Is a Program That Automatically Traverses the Web Using Hypertext Links Retrieving a Particular Document Then Retrieving All Documents That Are Referenced Recursively - PowerPoint PPT Presentation
Web Robots, Crawlers, & Spiders (Webmaster - Fort Collins, CO)
Copyright © XTR Systems, LLC
Introduction to
Web Robots, Crawlers & Spiders
Instructor: Joseph DiVerdi, Ph.D., MBA
Web Robot Defined
• A Web Robot Is a Program
  – That Automatically Traverses the Web
    • Using Hypertext Links
  – Retrieving a Particular Document
    • Then Retrieving All Documents That Are Referenced
  – Recursively
• Recursive Doesn't Limit the Definition
  – To Any Specific Traversal Algorithm
  – Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period
• It Is Still a Robot
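The recursive traversal defined above can be sketched in a few lines. This is an illustrative sketch, not any particular robot's implementation: the `crawl` and `LinkParser` names are made up here, and the page fetcher is injected as a function so the logic can be shown offline. A real robot would fetch over HTTP, resolve relative links, honor /robots.txt, and space out its requests.

```python
# Minimal sketch of recursive Web traversal (illustration only).
# `fetch` maps a URL to an HTML string; it is injected so the
# traversal logic can be demonstrated without network access.
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, limit=100):
    """Retrieve a document, then recursively retrieve referenced ones."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        queue.extend(parser.links)
    return seen

# Tiny offline demo: three documents referencing one another.
pages = {"a": '<a href="b">x</a> <a href="c">y</a>',
         "b": '<a href="a">back</a>',
         "c": "no links"}
visited = crawl("a", lambda url: pages.get(url, ""))
```

The `seen` set is what makes "recursively" safe: without it, the mutual links between documents "a" and "b" would loop forever.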
Web Robot Defined
• Normal Web Browsers Are Not Robots
  – Because They Are Operated by a Human
  – Don't Automatically Retrieve Referenced Documents
    • Other Than Inline Images
Web Robot Defined
• Sometimes Referred to As
  – Web Wanderers
  – Web Crawlers
  – Spiders
• These Names Are a Bit Misleading
  – They Give the Impression the Software Itself Moves Between Sites
    • Like a Virus
  – This Is Not the Case
    • A Robot Visits Sites by Requesting Documents From Them
Agent Defined
• The Term Agent Is (Over) Used These Days
• Specific Agents Include:
  – Autonomous Agent
  – Intelligent Agent
  – User-Agent
Autonomous Agent Defined
• An Autonomous Agent Is a Program
  – That Automatically Travels Between Sites
  – Makes Its Own Decisions
    • When To Move, When To Stay
  – Is Limited to Travel Between Selected Sites
  – Currently Not Widespread on the Web
Intelligent Agent Defined
• An Intelligent Agent Is a Program
  – That Helps Users With Certain Activities
    • Choosing a Product
    • Filling Out a Form
    • Finding Particular Items
  – Generally Has Little to Do With Networking
  – Usually Created & Maintained by an Organization
    • To Assist Its Own Viewers
User-Agent Defined
• A User-Agent Is a Program
  – That Performs Networking Tasks for a User
• Web User-Agents
  – Navigator
  – Internet Explorer
  – Opera
• Email User-Agents
  – Eudora
• FTP User-Agents
  – HTML-Kit
  – Fetch
  – CuteFTP
Search Engine Defined
• A Search Engine Is a Program
  – That Examines a Database
    • Upon Request or Automatically
    • Delivers Results or Creates a Digest
  – In the Context of the Web, a Search Engine Is
    • A Program That Examines Databases of HTML Documents
      – Databases Gathered by a Robot
    • Upon Request
    • Delivers Results Via an HTML Document
Robot Purposes
• Robots Are Used for a Number of Tasks
  – Indexing
    • Just Like a Book Index
  – HTML Validation
  – Link Validation
    • Searching for Broken Links
  – What's New Monitoring
  – Mirroring
    • Making a Copy of a Primary Web Site
    • On a Separate Server
      – More Local to Some Users
      – Shares the Work Load With the Primary Server
Other Popular Names
• All Names for the Same Sort of Program
  – With Slightly Different Connotations
• Web Spiders
  – Sounds Cooler in the Media
• Web Crawlers
  – Webcrawler Is a Specific Robot
• Web Worms
  – A Worm Is a Replicating Program
• Web Ants
  – Distributed Cooperating Robots
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Certain Robot Programs Can
    • And Have in the Past
  – Overload Networks & Servers
    • With Numerous Requests
• This Happens Especially With Programmers
  – Just Starting to Write a Robot Program
• These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes
  – But Does Everyone Read It?
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Robots Are Operated by Humans
    • Can Make Mistakes in Configuration
    • Don't Consider the Implications of Actions
• This Means
  – Robot Operators Need to Be Careful
  – Robot Authors Need to Make It Difficult for Operators to Make Mistakes
    • With Bad Effects
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Indexing Robots Build a Central Database of Documents
    • Which Doesn't Always Scale Well
      – To Millions of Documents
      – On Millions of Sites
  – Many Different Problems Occur
    • Missing Sites & Links
    • High Server Loads
    • Broken Links
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – The Majority of Robots Are
    • Well Designed
    • Professionally Operated
    • Cause No Problems
    • Provide a Valuable Service
• Robots Aren't Inherently Bad
  – Nor Are They Inherently Brilliant
  – They Just Need Careful Attention
Robot Visitation Strategies
• Generally Start From a Historical URL List
  – Especially Documents With Many or Certain Links
    • Server Lists
    • What's New Pages
    • Most Popular Sites on the Web
• Other Sources for URLs Are Used
  – Scans Through USENET Postings
  – Published Mailing List Archives
• Robot Selects URLs to Visit, Index, & Parse
  – And Uses Them As a Source for New URLs
Robot Indexing Strategies
• If an Indexing Robot Is Aware of a Document
  – Robot May Decide to Parse the Document
  – Insert Document Content Into Robot's Database
• Decision Depends on the Robot
  – Some Robots Index
    • HTML Titles
    • The First Few Paragraphs
    • Parse the Entire HTML & Index All Words
      – With Weightings Depending on HTML Constructs
    • Parse the META Tag
      – Or Other Special Internal Tags
Robot Visitation Strategies
• Many Indexing Services Also Allow Web Developers to Submit URLs Manually
  – Which Are Queued
  – Visited by the Robot
• Exact Process Depends on the Robot Service
  – Many Services Have a Link to a URL Submission Form on Their Search Page
• Certain Aggregators Exist
  – Which Purport to Submit to Many Robots at Once
    • http://www.submit-it.com/
Determining Robot Activity
• Examine Server Logs
  – Examine User-Agent, If Available
  – Examine Host Name or IP Address
  – Check for Many Accesses in a Short Time Period
  – Check for Robot Exclusion Document Access
    • Found at: /robots.txt
Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
After Robot Visitation
• Some Webmasters Panic After Being Visited
  – Generally Not a Problem
  – Generally a Benefit
  – No Relation to Viruses
  – Little Relation to Hackers
  – Close Relation to Lots of Visits
Controlling Robot Access
• Excluding Robots Is Feasible Using Server Authentication Techniques
  – .htaccess File & Directives
    • Deny From 0.0.0.0 (IP Address)
    • SetEnvIf User-Agent Robot is_a_robot
  – Can Increase Server Load
  – Seldom Required
    • More Often (Mis)Desired
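Putting the two directives mentioned above together, a .htaccess sketch might look like the following. This is an illustration, not a recommended configuration: the "Scooter" pattern and the 192.0.2.1 address are example values, and the Order/Allow/Deny syntax is from the Apache 1.3/2.0 era contemporary with this material.

```apache
# Sketch: block one example IP address and any client whose
# User-Agent matches the example pattern "Scooter".
SetEnvIf User-Agent "Scooter" is_a_robot

Order Allow,Deny
Allow from all
Deny from 192.0.2.1
Deny from env=is_a_robot
```

Because every request must now be matched against these rules, this is where the "Can Increase Server Load" caveat comes from.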
Robot Exclusion Standard
• A Robot Exclusion Standard Exists
  – Consists of a Single Site-Wide File
    • /robots.txt
    • Contains Directives, Comment Lines, & Blank Lines
  – Not a Locked Door
  – More of a "No Entry" Sign
  – Represents a Declaration of the Owner's Wishes
  – May Be Ignored by Incoming Traffic
    • Much Like a Red Traffic Light
  – If Everyone Follows the Rules, the World's a Better Place
Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
User-agent: webcrawler
Disallow:
User-agent: lycra
Disallow: /
User-agent: *
Disallow: /tmp
Disallow: /logs
Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
• Lines Beginning With '#' Are Comments
• Comment Lines Are Ignored
  – Comments May Not Appear Mid-Line
Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
• Specifies That the Robot Named 'webcrawler'
• Has Nothing Disallowed
  – It May Go Anywhere on This Site
Exclusion Standard Syntax
User-agent: lycra
Disallow: /
• Specifies That the Robot Named 'lycra'
• Has All URLs Starting With '/' Disallowed
  – It May Go Nowhere on This Site
  – Because All URLs on This Server Begin With a Slash
Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
• Specifies That All Robots
• Have URLs Starting With '/tmp' & '/logs' Disallowed
  – They May Not Access Any URLs Beginning With Those Strings
• Note: The '*' Is a Special Token
  – Meaning "Any Other User-agent"
• Regular Expressions Cannot Be Used
Exclusion Standard Syntax
• Two Common Configuration Errors
  – Wildcards Are Not Supported
    • Do Not Use 'Disallow: /tmp/*'
    • Use 'Disallow: /tmp'
  – Put Only One Path on Each Disallow Line
    • This May Change in a Future Version of the Standard
robots.txt File Location
• The Robot Exclusion File Must Be Placed at the Server's Document Root
• For Example:

Site URL                   Corresponding robots.txt URL
http://www.w3.org/         -> http://www.w3.org/robots.txt
http://www.w3.org:80/      -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/    -> http://www.w3.org:1234/robots.txt
http://w3.org/             -> http://w3.org/robots.txt
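The mapping in the table above is purely mechanical: keep the scheme, host, and port, and replace the path with /robots.txt. A sketch (the function name is illustrative):

```python
# Derive a server's robots.txt URL from any URL on that server,
# as in the table above. The scheme, host, and port are kept;
# the path, query, and fragment are replaced by /robots.txt.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

This is why only one exclusion file per server (per port, in fact) is ever consulted, no matter how many users or directories the server hosts.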
Common Mistakes
• URLs Are Case Sensitive
  – "/robots.txt" Must Be All Lower-Case
• Pointless robots.txt URLs
  – http://www.w3.org/admin/robots.txt
  – http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users
  – Like linus.ulltra.com
  – robots.txt Cannot Be Placed in Individual Users' Directories
  – It Must Be Placed in the Server Root
    • By the Server Administrator
For Non-System Administrators
• Sometimes Users Have Insufficient Authority to Install a /robots.txt File
  – Because They Don't Administer the Entire Server
• Use the META Tag in Individual HTML Documents to Exclude Robots

<META NAME="ROBOTS" CONTENT="NOINDEX">
  – Prevents the Document From Being Indexed

<META NAME="ROBOTS" CONTENT="NOFOLLOW">
  – Prevents the Document's Links From Being Followed
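The two directives can also be combined in a single tag, as comma-separated values:

```html
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
```

As with /robots.txt, this is a request to cooperating robots, not an enforcement mechanism.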
Bottom Line
• Use Robots Exclusion to Prevent Time-Variant Content From Being Improperly Indexed
• Don't Use It to Exclude Visitors
• Don't Use It to Secure Sensitive Content
  – Use Authentication If It's Important
  – Use SSL If It's Really Important