DESCRIPTION
Introduction to Web Robots, Crawlers & Spiders. Instructor: Joseph DiVerdi, Ph.D., MBA. Web Robot Defined. A Web Robot Is a Program That Automatically Traverses the Web Using Hypertext Links Retrieving a Particular Document Then Retrieving All Documents That Are Referenced Recursively - PowerPoint PPT Presentation
Web Robots, Crawlers, & Spiders (Webmaster - Fort Collins, CO)
Copyright © XTR Systems, LLC
Introduction to
Web Robots, Crawlers & Spiders
Instructor: Joseph DiVerdi, Ph.D., MBA
Web Robot Defined
• A Web Robot Is a Program
  – That Automatically Traverses the Web
    • Using Hypertext Links
  – Retrieving a Particular Document
    • Then Retrieving All Documents That Are Referenced
  – Recursively
• Recursive Doesn't Limit the Definition
  – To Any Specific Traversal Algorithm
  – Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period
• It Is Still a Robot
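The recursive traversal defined above can be sketched in a few lines. This is an illustrative sketch, not any particular robot's implementation: the `crawl` and `LinkParser` names are made up here, and the page fetcher is injected as a function so the logic can be shown offline. A real robot would fetch over HTTP, resolve relative links, honor /robots.txt, and space out its requests.

```python
# Minimal sketch of recursive Web traversal (illustration only).
# `fetch` maps a URL to an HTML string; it is injected so the
# traversal logic can be demonstrated without network access.
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, limit=100):
    """Retrieve a document, then recursively retrieve referenced ones."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        queue.extend(parser.links)
    return seen

# Tiny offline demo: three documents referencing one another.
pages = {"a": '<a href="b">x</a> <a href="c">y</a>',
         "b": '<a href="a">back</a>',
         "c": "no links"}
visited = crawl("a", lambda url: pages.get(url, ""))
```

The `seen` set is what makes "recursively" safe: without it, the mutual links between documents "a" and "b" would loop forever.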
Web Robot Defined
• Normal Web Browsers Are Not Robots
  – Because They Are Operated by a Human
  – Don't Automatically Retrieve Referenced Documents
    • Other Than Inline Images
Web Robot Defined
• Sometimes Referred to As
  – Web Wanderers
  – Web Crawlers
  – Spiders
• These Names Are a Bit Misleading
  – They Give the Impression the Software Itself Moves Between Sites
    • Like a Virus
  – This Is Not the Case
    • A Robot Visits Sites by Requesting Documents From Them
Agent Defined
• The Term Agent Is (Over) Used These Days
• Specific Agents Include:
  – Autonomous Agent
  – Intelligent Agent
  – User-Agent
Autonomous Agent Defined
• An Autonomous Agent Is a Program
  – That Automatically Travels Between Sites
  – Makes Its Own Decisions
    • When To Move, When To Stay
  – Is Limited to Travel Between Selected Sites
  – Currently Not Widespread on the Web
Intelligent Agent Defined
• An Intelligent Agent Is a Program
  – That Helps Users With Certain Activities
    • Choosing a Product
    • Filling Out a Form
    • Finding Particular Items
  – Generally Has Little to Do With Networking
  – Usually Created & Maintained by an Organization
    • To Assist Its Own Viewers
User-Agent Defined
• A User-Agent Is a Program
  – That Performs Networking Tasks for a User
• Web User-Agents
  – Navigator
  – Internet Explorer
  – Opera
• Email User-Agents
  – Eudora
• FTP User-Agents
  – HTML-Kit
  – Fetch
  – CuteFTP
Search Engine Defined
• A Search Engine Is a Program
  – That Examines a Database
    • Upon Request or Automatically
    • Delivers Results or Creates a Digest
  – In the Context of the Web, a Search Engine Is
    • A Program That Examines Databases of HTML Documents
      – Databases Gathered by a Robot
    • Upon Request
    • Delivers Results Via an HTML Document
Robot Purposes
• Robots Are Used for a Number of Tasks
  – Indexing
    • Just Like a Book Index
  – HTML Validation
  – Link Validation
    • Searching for Broken Links
  – What's New Monitoring
  – Mirroring
    • Making a Copy of a Primary Web Site
    • On a Separate Server
      – More Local to Some Users
      – Shares the Work Load With the Primary Server
Other Popular Names
• All Names for the Same Sort of Program
  – With Slightly Different Connotations
• Web Spiders
  – Sounds Cooler in the Media
• Web Crawlers
  – Webcrawler Is a Specific Robot
• Web Worms
  – A Worm Is a Replicating Program
• Web Ants
  – Distributed Cooperating Robots
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Certain Robot Programs Can
    • And Have in the Past
  – Overload Networks & Servers
    • With Numerous Requests
• This Happens Especially With Programmers
  – Just Starting to Write a Robot Program
• These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes
  – But Does Everyone Read It?
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Robots Are Operated by Humans
    • Can Make Mistakes in Configuration
    • Don't Consider the Implications of Actions
• This Means
  – Robot Operators Need to Be Careful
  – Robot Authors Need to Make It Difficult for Operators to Make Mistakes
    • With Bad Effects
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – Indexing Robots Build a Central Database of Documents
    • Which Doesn't Always Scale Well
      – To Millions of Documents
      – On Millions of Sites
  – Many Different Problems Occur
    • Missing Sites & Links
    • High Server Loads
    • Broken Links
Robot Ethics
• Robots Have Enjoyed a Checkered History
  – The Majority of Robots Are
    • Well Designed
    • Professionally Operated
    • Cause No Problems
    • Provide a Valuable Service
• Robots Aren't Inherently Bad
  – Nor Are They Inherently Brilliant
  – They Just Need Careful Attention
Robot Visitation Strategies
• Generally Start From a Historical URL List
  – Especially Documents With Many or Certain Links
    • Server Lists
    • What's New Pages
    • Most Popular Sites on the Web
• Other Sources for URLs Are Used
  – Scans Through USENET Postings
  – Published Mailing List Archives
• Robot Selects URLs to Visit, Index, & Parse
  – And Uses Them As a Source for New URLs
Robot Indexing Strategies
• If an Indexing Robot Is Aware of a Document
  – Robot May Decide to Parse the Document
  – Insert Document Content Into Robot's Database
• Decision Depends on the Robot
  – Some Robots Index
    • HTML Titles
    • The First Few Paragraphs
    • Parse the Entire HTML & Index All Words
      – With Weightings Depending on HTML Constructs
    • Parse the META Tag
      – Or Other Special Internal Tags
Robot Visitation Strategies
• Many Indexing Services Also Allow Web Developers to Submit URLs Manually
  – Which Are Queued
  – Visited by the Robot
• Exact Process Depends on the Robot Service
  – Many Services Have a Link to a URL Submission Form on Their Search Page
• Certain Aggregators Exist
  – Which Purport to Submit to Many Robots at Once
    • http://www.submit-it.com/
Determining Robot Activity
• Examine Server Logs
  – Examine User-Agent, If Available
  – Examine Host Name or IP Address
  – Check for Many Accesses in a Short Time Period
  – Check for Robot Exclusion Document Access
    • Found at: /robots.txt
Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
After Robot Visitation
• Some Webmasters Panic After Being Visited
  – Generally Not a Problem
  – Generally a Benefit
  – No Relation to Viruses
  – Little Relation to Hackers
  – Close Relation to Lots of Visits
Controlling Robot Access
• Excluding Robots Is Feasible Using Server Authentication Techniques
  – .htaccess File & Directives
    • Deny From 0.0.0.0 (IP Address)
    • SetEnvIf User-Agent Robot is_a_robot
  – Can Increase Server Load
  – Seldom Required
    • More Often (Mis)Desired
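Putting the two directives mentioned above together, a .htaccess sketch might look like the following. This is an illustration, not a recommended configuration: the "Scooter" pattern and the 192.0.2.1 address are example values, and the Order/Allow/Deny syntax is from the Apache 1.3/2.0 era contemporary with this material.

```apache
# Sketch: block one example IP address and any client whose
# User-Agent matches the example pattern "Scooter".
SetEnvIf User-Agent "Scooter" is_a_robot

Order Allow,Deny
Allow from all
Deny from 192.0.2.1
Deny from env=is_a_robot
```

Because every request must now be matched against these rules, this is where the "Can Increase Server Load" caveat comes from.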
Robot Exclusion Standard
• A Robot Exclusion Standard Exists
  – Consists of a Single Site-Wide File
    • /robots.txt
    • Contains Directives, Comment Lines, & Blank Lines
  – Not a Locked Door
  – More of a "No Entry" Sign
  – Represents a Declaration of the Owner's Wishes
  – May Be Ignored by Incoming Traffic
    • Much Like a Red Traffic Light
  – If Everyone Follows the Rules, the World's a Better Place
Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
User-agent: webcrawler
Disallow:
User-agent: lycra
Disallow: /
User-agent: *
Disallow: /tmp
Disallow: /logs
Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
• Lines Beginning With '#' Are Comments
• Comment Lines Are Ignored
  – Comments May Not Appear Mid-Line
Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
• Specifies That the Robot Named 'webcrawler'
• Has Nothing Disallowed
  – It May Go Anywhere on This Site
Exclusion Standard Syntax
User-agent: lycra
Disallow: /
• Specifies That the Robot Named 'lycra'
• Has All URLs Starting With '/' Disallowed
  – It May Go Nowhere on This Site
  – Because All URLs on This Server Begin With a Slash
Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
• Specifies That All Robots
• Have URLs Starting With '/tmp' & '/logs' Disallowed
  – They May Not Access Any URLs Beginning With Those Strings
• Note: The '*' Is a Special Token
  – Meaning "Any Other User-agent"
• Regular Expressions Cannot Be Used
Exclusion Standard Syntax
• Two Common Configuration Errors
  – Wildcards Are Not Supported
    • Do Not Use 'Disallow: /tmp/*'
    • Use 'Disallow: /tmp'
  – Put Only One Path on Each Disallow Line
    • This May Change in a Future Version of the Standard
robots.txt File Location
• The Robot Exclusion File Must Be Placed at the Server's Document Root
• For Example:

Site URL                   Corresponding robots.txt URL
http://www.w3.org/         -> http://www.w3.org/robots.txt
http://www.w3.org:80/      -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/    -> http://www.w3.org:1234/robots.txt
http://w3.org/             -> http://w3.org/robots.txt
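The mapping in the table above is purely mechanical: keep the scheme, host, and port, and replace the path with /robots.txt. A sketch (the function name is illustrative):

```python
# Derive a server's robots.txt URL from any URL on that server,
# as in the table above. The scheme, host, and port are kept;
# the path, query, and fragment are replaced by /robots.txt.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

This is why only one exclusion file per server (per port, in fact) is ever consulted, no matter how many users or directories the server hosts.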
Common Mistakes
• URLs Are Case Sensitive
  – "/robots.txt" Must Be All Lower-Case
• Pointless robots.txt URLs
  – http://www.w3.org/admin/robots.txt
  – http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users
  – Like linus.ulltra.com
  – robots.txt Cannot Be Placed in Individual Users' Directories
  – It Must Be Placed in the Server Root
    • By the Server Administrator
For Non-System Administrators
• Sometimes Users Have Insufficient Authority to Install a /robots.txt File
  – Because They Don't Administer the Entire Server
• Use the META Tag in Individual HTML Documents to Exclude Robots

<META NAME="ROBOTS" CONTENT="NOINDEX">
  – Prevents the Document From Being Indexed

<META NAME="ROBOTS" CONTENT="NOFOLLOW">
  – Prevents the Document's Links From Being Followed
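The two directives can also be combined in a single tag, as comma-separated values:

```html
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
```

As with /robots.txt, this is a request to cooperating robots, not an enforcement mechanism.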
Bottom Line
• Use Robots Exclusion to Prevent Time-Variant Content From Being Improperly Indexed
• Don't Use It to Exclude Visitors
• Don't Use It to Secure Sensitive Content
  – Use Authentication If It's Important
  – Use SSL If It's Really Important