WebCrawler in Java

  • 7/27/2019 WebCrawler in Java

    http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ [17/5/2011 2:26:17 AM]

    Writing a Web Crawler in the Java Programming Language

    By Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC
    January 1998

    Everyone uses web crawlers, indirectly at least! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers (also known as spiders, robots, or wanderers) are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found.

    However, you might want to use a crawler directly. You might even want to write your own! Here are some possible reasons:

    • You want to maintain mirror sites for popular Web sites.
    • You need to test web pages and links for valid syntax and structure.
    • You want to monitor sites to see when their structure or contents change.
    • Your company needs to search for copyright infringements.
    • You'd like to build a special-purpose index, for example, one that has some understanding of the content stored in multimedia files on the Web.

    This article explains what web crawlers are. It includes a web-crawling demo program, written in the Java programming language, that you can run from your browser. The demo traverses the Web automatically, shows a running list of files it has found, and updates the list each time it finds a new one. You can specify what type of file you want to find. The Java language source code for this demo application is provided as a programming example.

    How Web Crawlers Work

    Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links.

    Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes, such as those mentioned in the previous section.
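
    To make the link-following step concrete, here is a minimal sketch of how links might be pulled out of a page's HTML and resolved against the page's own address. It is not the demo's code: the class name LinkExtractor, the method extractLinks, and the regular expression are illustrative assumptions, and a regular expression is only a rough stand-in for a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the demo's source code.
final class LinkExtractor {
    // Captures the href value of an <a> tag, stopping before any "#fragment".
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'#>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute HTTP links found in html, resolved against pageUrl. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Relative links such as "other.html" become absolute here.
                URI resolved = base.resolve(m.group(1).trim());
                // Keep only HTTP links; mailto:, ftp:, and so on are dropped.
                if ("http".equalsIgnoreCase(resolved.getScheme())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link text: skip it and keep scanning.
            }
        }
        return links;
    }
}
```

    A crawler would call extractLinks once per retrieved page and feed the results back into its list of pages to visit.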

    Muscle Fish uses a crawler to search the Web for audio files. This is a straightforward task, as shown by the demo in the next section. It turns out that searching for audio files is not very different from searching for any other kind of file. On the other hand, indexing audio is anything but straightforward. Most search engines, if they handle audio at all, index only textual information that's associated with the sound file. Muscle Fish's approach is to acoustically analyze the audio itself. This feature lets you search for sound files based on how they actually sound; you're not limited to searching for whatever words happen to be located nearby on the same web page. (A forthcoming article and demo program will show this feature.)

    A Web-Crawling Demo Program

    The simple application shown below crawls the Web, searching for a specified type of file.

    Note: This demo was written using JDK 1.1.3. Not all web browsers support such a recent version of JDK. You can run the demo on any platform by using the HotJava browser. On the Macintosh, the demo should work with any browser that uses MRJ (Macintosh Runtime for Java) 2.0.

    Application source code.

    To run the demo, follow these steps:

    1. Type a valid URL (web address), including the "http://" portion, in the text field at the top of the application window.
    2. Click the Search button.
    3. Look at the status area below the scrolling list. In this area, the application reports which page it is currently searching. As it encounters links on the page, it adds any new URLs to the scrolling list. The application remembers which pages it's already visited, so it won't search any web page twice. This prevents infinite loops. As you inspect the list of URLs, you can see that the application performs a breadth-first search. In other words, it accumulates a list of all the links that are on the current page before it follows any of the links to a new page.
    4. If you tire of witnessing this little tour of the Web, click the Stop button. The status area reports "stopped."

    If you let the tour run without stopping, it will eventually stop on its own once it's found 50 files. At this point, it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.) The application will also stop automatically if it encounters a dead end, meaning that it's traversed all the files that are directly or indirectly available from the starting position you specified. If this happens, the application reports "done."

    The next time you click Search, the list of files gets cleared, and the search process starts over again.

    Notice that there's a pull-down menu that lets you specify what type of file you want to find. The default is HTML text files. You can also choose "audio/basic," "audio/au," "audio/aiff," "audio/wav," "video/mpeg," or "video/x-avi."
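
    The menu entries are MIME types. This article does not spell out how the demo decides which type a file has, so the following is only a sketch of one plausible approach: compare the Content-Type value a server reports (for example, the string returned by java.net.URLConnection.getContentType()) with the type chosen in the menu. The class ContentTypeFilter and its matches method are hypothetical names used for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of the demo's source code.
final class ContentTypeFilter {
    /**
     * Returns true when a Content-Type header value names the wanted MIME type.
     * Parameters such as "; charset=UTF-8" and letter case are ignored.
     */
    static boolean matches(String contentTypeHeader, String wantedType) {
        if (contentTypeHeader == null || wantedType == null) {
            return false; // the server reported no type, or nothing was chosen
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String mime = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        return mime.trim().toLowerCase(Locale.ROOT)
                .equals(wantedType.trim().toLowerCase(Locale.ROOT));
    }
}
```

    With this helper, a header of "text/html; charset=UTF-8" matches the default "text/html" selection, while "video/mpeg" does not match "audio/wav".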

    A Look at the Code

    Take a look at the Java-language source code for this demo. The code occupies fewer than 400 lines, including comments. It is a testament to JDK's elegance that this application took only a few person-hours to write from scratch. (Muscle Fish had never written a crawler before, nor was any pre-existing web-crawler code borrowed or studied.)

    Here's a pseudocode summary of the algorithm:

    Get the user's input: the starting URL and the desired file type.
    Add the URL to the currently empty list of URLs to search.
    While the list of URLs to search is not empty,
    {
        Get the first URL in the list.
        Move the URL to the list of URLs already searched.
        Check the URL to make sure its protocol is HTTP
            (if not, break out of the loop, back to "While").
        See whether there's a robots.txt file at this site
            that includes a "Disallow" statement.
            (If so, break out of the loop, back to "While".)
        Try to "open" the URL (that is, retrieve
            that document from the Web).
        If it's not an HTML file, break out of the loop,
            back to "While."
        Step through the HTML file. While the HTML text
            contains another link,
        {
            Validate the link's URL and make sure robots are
                allowed (just as in the outer loop).
            If it's an HTML file,
                If the URL isn't present in either the to-search
                list or the already-searched list, add it to
                the to-search list.
            Else if it's the type of the file the user
                requested,
                Add it to the list of files found.
        }
    }
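
    One way to render this pseudocode in Java is sketched below. To stay self-contained it crawls a small in-memory map instead of the live Web, and it leaves out the HTTP-protocol and robots.txt checks. The names CrawlSketch, Page, and crawl are hypothetical and are not taken from the demo's source. The sketch also tests "is it the requested type?" separately from "is it HTML?", on the assumption that this is what lets the default HTML search report any results.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch, not the demo's source code.
final class CrawlSketch {
    static final String HTML = "text/html";

    /** Stand-in for a retrieved document: its MIME type and the links it contains. */
    static final class Page {
        final String type;
        final List<String> links;

        Page(String type, String... links) {
            this.type = type;
            this.links = Arrays.asList(links);
        }
    }

    /** Breadth-first crawl of web from startUrl, collecting URLs of wantedType. */
    static List<String> crawl(String startUrl, String wantedType,
                              Map<String, Page> web, int searchLimit) {
        Queue<String> toSearch = new ArrayDeque<>();  // FIFO order gives breadth-first search
        Set<String> seen = new HashSet<>();           // to-search and already-searched URLs
        Set<String> found = new LinkedHashSet<>();    // matches, in discovery order
        toSearch.add(startUrl);
        seen.add(startUrl);

        while (!toSearch.isEmpty() && found.size() < searchLimit) {
            String url = toSearch.remove();
            Page page = web.get(url);                 // "open" the URL
            if (page == null || !HTML.equals(page.type)) {
                continue;                             // back to "While"
            }
            for (String link : page.links) {
                Page target = web.get(link);
                if (target == null) {
                    continue;                         // link could not be retrieved
                }
                if (HTML.equals(target.type) && seen.add(link)) {
                    toSearch.add(link);               // new HTML page: search it later
                }
                if (target.type.equals(wantedType)) {
                    found.add(link);
                    if (found.size() >= searchLimit) {
                        break;                        // reached the search limit
                    }
                }
            }
        }
        return new ArrayList<>(found);
    }
}
```

    Because every URL is added to seen before it is queued, no page is searched twice, and the loop ends either when the limit is reached or when the to-search list runs dry (the "dead end" case).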

    This demo tries to respect the robots exclusion standard, meaning that it avoids sites where it's unwelcome. Any site can exclude web crawlers from all or part of its filesystem, by putting certain statements in a file called robots.txt. See the robotSafe function in the demo's source code. This function is conservative in that it avoids sites where any crawler is disallowed, even if this particular one is not. (There is a new HTML meta-tag called ROBOTS, which this demo does not yet support. If you revise the source code to support this meta-tag, send your code to the authors and the version posted here will be updated.)
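
    The sketch below illustrates what a conservative check of this kind can look like. It is not the demo's robotSafe function: the class RobotsCheck and the method isAllowed are hypothetical, and the sketch assumes the text of robots.txt has already been retrieved. It honors every Disallow line regardless of which User-agent it was written for, which is the conservative behavior described above, and it applies the rules per path prefix.

```java
// Hypothetical sketch, not the demo's robotSafe function.
final class RobotsCheck {
    private static final String DISALLOW = "Disallow:";

    /**
     * Returns false if any Disallow rule in robotsTxt is a prefix of path.
     * User-agent lines are ignored on purpose, so a rule aimed at any
     * crawler is treated as applying to this one too.
     */
    static boolean isAllowed(String robotsTxt, String path) {
        if (robotsTxt == null) {
            return true; // no robots.txt at the site: nothing is excluded
        }
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (!line.regionMatches(true, 0, DISALLOW, 0, DISALLOW.length())) {
                continue; // not a Disallow line
            }
            String rule = line.substring(DISALLOW.length()).trim();
            // An empty Disallow value excludes nothing.
            if (!rule.isEmpty() && path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}
```

    For instance, with a robots.txt containing "Disallow: /private/" under any User-agent, isAllowed returns false for "/private/a.html" and true for "/public/a.html".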

    Where to Go from Here

    This simple programming example might have given you some ideas about how to write a full-fledged web crawler. Muscle Fish can't provide technical support for running this demo program or for writing crawlers. However, there are various resources on the Web for people interested in crawlers. The Web Robots Pages is a good starting point, and it contains links to other important sites.

    Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold are members of Muscle Fish, LLC, a software consulting firm in Berkeley, California. Muscle Fish specializes in audio and music technology, and produces software that searches for sound based on its acoustical content.
