Intelligent Crawling and Indexing using Lucene
By Shiva Thatipelli
Mohammad Zubair (Advisor)


Page 1: Intelligent crawling and indexing using lucene

Intelligent Crawling and Indexing using Lucene

By Shiva Thatipelli

Mohammad Zubair (Advisor)

Page 2

Contents
- Searching
- Indexing
- Lucene
- Indexing with Lucene
- Indexing Static and Dynamic Pages
- Extracting and Indexing Dynamic Pages
- Implementation
- Screens

Page 3

Searching
- Looking up words in an index
- Factors affecting search:
  - Precision – how well the system can filter
  - Speed
  - Support for single- and multi-phrase queries, results ranking, sorting, wildcard queries, and range queries

Page 4

Indexing
- Sequential search is bad (not scalable)
- An index speeds up selection
- An index is a special data structure that allows rapid searching
- Different index implementations: B-trees, hash maps

Page 5

Search Process

[Diagram: Docs → Indexing API → Index; a Query against the Index returns Hits]

Page 6

Lucene
- High-performance, full-featured text search engine library
- Written 100% in pure Java
- Easy to use yet powerful API
- Jakarta Apache product
- Strong open source community support

Page 7

Why Lucene?
- Open source (not proprietary)
- Easy to use, with good documentation
- Interoperable – e.g., an index generated by Java can be used by VB, ASP, or Perl applications
- Powerful and highly scalable
- Index format:
  - Designed for interoperability
  - Well documented
  - Resides on the file system, in RAM, or in a custom store

Page 8

Continued
- Algorithms: efficient, fast, and optimized
- Incremental indexing
- Boolean, fuzzy, range, multi-phrase, and wildcard queries, etc.
- Content tagging – documents as collections of terms
- Heterogeneous documents – useful when different sets of metadata are present for different MIME types

Page 9

Indexing With Lucene

- What type of documents can be indexed? Any document from which text can be fetched and extracted over the net via a URL
- Uses an inverted index – the index stores statistics about terms in order to make term-based search more efficient
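The inverted-index idea can be sketched in a few lines of plain Java. This is only an illustration of the core term-to-documents mapping; Lucene's real index additionally stores term frequencies, positions, and other statistics, and the class name here is invented for the example.

```java
import java.util.*;

// A minimal sketch of an inverted index: each term maps to the set of
// document IDs containing it, so a term lookup is a single map access
// instead of a scan over every document.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a document on whitespace and record term -> docId.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Term-based lookup: returns the IDs of all documents containing the term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

This is why sequential search "is not scalable" while index lookup is: adding documents does slow insertion slightly, but each query no longer touches every document.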

Page 10

Indexing With Lucene Contd…

[Diagram: HTML, Word, XLS, and PDF documents each pass through a format-specific parser; the extracted text flows into the Analyzer, which builds the Index]

Page 11

Indexing Static and Dynamic Pages

Static pages are HTML, XLS, Word, and PDF documents on the web that can be easily crawled and indexed by search engines such as Google and Yahoo.

Static pages on the internet can be passed to Lucene and indexed and searched with direct URLs.

Dynamic pages, which are generated as the result of submitted parameters (for example, search results pages and database-hidden pages), cannot be indexed with direct URLs.

To index dynamic pages, we need the parameters submitted by users to generate those pages.

Page 12

Extracting and Indexing Dynamic Pages

Extracting dynamic web pages, also called database-hidden pages, requires some kind of input to generate the URLs.

To get the input parameters, we used Apache access logs, which record each user request as a URL.

A sample entry in an Apache access log follows:

127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
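Pulling the request URL and its parameters out of such a line can be done with a short regular expression, as sketched below. The class and method names are invented for this example, and the pattern only covers the fields this deck uses (method, URL, protocol), not every variant of the Apache log format.

```java
import java.util.*;
import java.util.regex.*;

// Extracts the quoted request from an Apache access-log line and splits
// the query string into name/value pairs.
public class AccessLogParser {
    private static final Pattern REQUEST =
        Pattern.compile("\"(GET|POST) (\\S+) HTTP/[0-9.]+\"");

    // Returns the request URL (path plus query string), or null if no match.
    public static String requestUrl(String logLine) {
        Matcher m = REQUEST.matcher(logLine);
        return m.find() ? m.group(2) : null;
    }

    // Splits "name=value&name=value" after the '?' into an ordered map.
    public static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) return params;
        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            params.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return params;
    }
}
```

Applied to the sample entry above, `requestUrl` yields the `/archon/servlet/search?...` request and `queryParams` yields the four submitted parameters.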

Page 13

Extracting and Indexing Dynamic Pages Contd...

The entry contains the IP address of the computer accessing the information, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP status code.

The request URL is the part that carries all the input parameters – in this case: formname=simple, fulltext=maly, group=subject, sort=title.

The results page is dynamic and depends on the parameters passed.

A full URL such as http://archon.cs.odu.edu:8066/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title can be generated from the request URL by prepending the website address.
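The prepending step is plain string concatenation, sketched below with an invented helper name; the base address shown is the one from this slide, but any site root would work the same way.

```java
// Joins the website base address with a request URL taken from the access
// log, yielding a full URL that can be fetched and indexed directly.
public class UrlBuilder {
    public static String fullUrl(String base, String requestUrl) {
        // Avoid a doubled slash when the base already ends with one.
        if (base.endsWith("/") && requestUrl.startsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return base + requestUrl;
    }
}
```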

Page 14

Indexing Dynamic Pages…

[Diagram: Apache logs → parse and generate URL → results page (could be any file type) → Analyzer → Index]

Page 15

Implementation

The flow chart above describes how the Apache logs are parsed and the URLs are generated.

It shows how the results pages are fetched and extracted from those URLs.

Each results page is sent for analysis; Lucene then generates the index, which is used for future searches.

Page 16

Demo

Page 17

Results: Hardware Environment
- Dedicated machine for indexing: no, but nominal usage at the time of indexing
- CPU: Intel x86 P4 2.8 GHz
- RAM: 512 MB DDR
- Drive configuration: IDE, 7200 rpm

Software Environment
- Lucene version: 1.4
- Java version: 1..2
- OS version: Windows 2000
- Apache web server: versions 1.3 to 2.0
- Location of index: local

Page 18

Create Index

The IndexByLog.java file reads the access logs on the local computer, generates the URLs, fetches and extracts the results pages from those URLs, indexes them, and stores the index in the LuceneIndex folder.
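One step of that pipeline – stripping markup from a fetched results page so the remaining text can be handed to the analyzer – can be sketched as below. The source does not show IndexByLog.java's internals, so the class name and approach here are illustrative only; the fetching and index-writing steps are omitted.

```java
import java.io.*;

// Removes HTML tags from a page, keeping only the text between them,
// and collapses runs of whitespace. Real HTML parsers handle scripts,
// entities, and malformed markup; this sketch covers the basic case.
public class HtmlTextExtractor {
    public static String extractText(Reader page) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        int c;
        while ((c = page.read()) != -1) {
            if (c == '<') inTag = true;        // start skipping a tag
            else if (c == '>') inTag = false;  // tag closed, resume text
            else if (!inTag) out.append((char) c);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Taking the extractor as a Reader rather than a URL keeps it testable offline; in the real program the Reader would wrap the HTTP response for a generated URL.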

Page 19

Files extraction and Index Creation

Page 20

Searching at the prompt

Page 21

Searching on the web

Page 22

Results on the web

Page 23

Conclusion

It is very easy to implement efficient and powerful search engines using Lucene.

Lucene can be used to index dynamic pages and database-hidden pages.

Web server access logs can be used to generate the URLs, and Java with the Lucene API can be used to fetch and index database-hidden pages.

There are some security risks involved, since this approach can reveal which users are performing which searches, along with other sensitive information.

Page 24

Questions?