Intelligent Crawling and Indexing using Lucene
By Shiva Thatipelli
Mohammad Zubair (Advisor)
Contents
- Searching
- Indexing
- Lucene
- Indexing with Lucene
- Indexing Static and Dynamic Pages
- Extracting and Indexing Dynamic Pages
- Implementation
- Screens
Searching
- Looking up words in an index
- Factors affecting search:
  - Precision: how well the system can filter
  - Speed
  - Support for single and multiple phrase queries, results ranking, sorting, wildcard queries, and range queries
Indexing
- Sequential search is bad (not scalable)
- An index speeds up selection
- An index is a special data structure which allows rapid searching
- Different index implementations:
  - B-trees
  - Hash maps
Search Process
[Diagram: Docs → Indexing API → Index; Query → Index → Hits]
Lucene
- High-performance, full-featured text search engine library
- Written 100% in pure Java
- Easy to use yet powerful API
- Jakarta Apache product
- Strong open source community support
Why Lucene?
- Open source (not proprietary)
- Easy to use, good documentation
- Interoperable: for example, an index generated by Java can be used by VB, ASP, or Perl applications
- Powerful and highly scalable
- Index format
  - Designed for interoperability
  - Well documented
  - Resides on the file system, in RAM, or in a custom store
Why Lucene? Contd…
- Algorithms: efficient, fast, and optimized
- Incremental indexing
- Boolean query, fuzzy query, range query, multi-phrase query, wildcard query, etc.
- Content tagging: documents as collections of terms
- Heterogeneous documents: useful when different sets of metadata are present for different MIME types
Indexing With Lucene
- What type of documents can be indexed?
  - Any document from which text can be fetched and extracted over the net with a URL
- Uses an inverted index
  - The index stores statistics about terms in order to make term-based search more efficient
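As a rough illustration of the inverted index idea, the sketch below maps each term to the set of documents that contain it, and exposes document frequency as one example of a per-term statistic. It uses only the JDK; the class name `ToyInvertedIndex` and its methods are hypothetical, and this is not Lucene's actual index format or API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index (hypothetical, for illustration only; not Lucene's format).
class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns its id (its position in insertion order).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Naive analysis: lowercase, then split on anything that is not a-z or 0-9.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents containing the term, in ascending order.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Document frequency: how many documents contain the term.
    int documentFrequency(String term) {
        return search(term).size();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a search library");
        index.add("An index speeds up search");
        System.out.println(index.search("search"));
    }
}
```

A lookup touches only the postings for the queried term, which is why it scales better than scanning every document sequentially.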
Indexing With Lucene Contd…
[Diagram: HTML, Word, XLS, and PDF documents each pass through a parser; the extracted text goes to the Analyzer and then into the Index]
Indexing Static and Dynamic Pages
Static pages are HTML, XLS, Word, and PDF documents on the web that can be easily crawled and indexed by search engines like Google and Yahoo.
Static pages on the internet can be passed to Lucene by their direct URLs, then indexed and searched.
Dynamic pages are generated in response to submitted parameters. Examples are search results pages and database-hidden pages. They cannot be indexed with direct URLs.
To index dynamic pages, we need the parameters that users submitted to generate those pages.
Extracting and Indexing Dynamic Pages
Extracting dynamic web pages, which can also be called database-hidden pages, needs some kind of input to generate the URLs.
To get the input parameters, we used Apache access logs, which contain each user request as a URL.
A sample entry in Apache access log is as follows:
127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
Extracting and Indexing Dynamic Pages Contd...
The log entry contains information such as the IP address of the computer accessing the information, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP status code.
The request URL is the part that carries all the input parameters, in this case:
- formname=simple
- fulltext=maly
- group=subject
- sort=title
The results page is dynamic and depends on the parameters passed.
A full URL like
http://archon.cs.odu.edu:8066/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title
can be generated from the request URL by prepending the website address.
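The log-to-URL step can be sketched in plain Java as follows. This is a minimal sketch, assuming the log uses the Common Log Format shown in the sample entry. The class name `AccessLogUrlBuilder` and its methods are hypothetical and are not taken from the original implementation. Parameter values are returned as they appear in the log, without URL decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns an Apache access log line into a full URL.
class AccessLogUrlBuilder {
    // Common Log Format: host ident user [time] "method url protocol" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    // Returns the request URL (path plus query string), or null if the line does not match.
    static String requestUrl(String logLine) {
        Matcher m = LOG_LINE.matcher(logLine.trim());
        return m.matches() ? m.group(6) : null;
    }

    // Prepends the website address to the request URL.
    static String fullUrl(String siteAddress, String requestUrl) {
        return siteAddress + requestUrl;
    }

    // Splits the query string into name/value pairs, preserving their order.
    static Map<String, String> queryParameters(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
            + "\"GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1\" 200 9560";
        String request = requestUrl(line);
        System.out.println(fullUrl("http://archon.cs.odu.edu:8066", request));
        System.out.println(queryParameters(request));
    }
}
```

Lines that do not match the pattern yield `null`, so a caller can skip malformed entries instead of failing the whole run.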
Indexing Dynamic Pages…
[Flow chart: Apache logs → parse and generate URL → results page (could be any file type) → Analyzer → Index]
Implementation
The flow chart above describes how Apache logs are parsed and URLs are generated.
It shows how the results pages are fetched and extracted from the URLs.
The results page is sent for analysis, and Lucene then generates the index, which is used for future searches.
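The fetch-and-extract step might look roughly like the sketch below. It assumes Java 11 or later for `java.net.http` and handles only HTML results pages; Word, XLS, and PDF pages would need their own parsers. The class `ResultsPageFetcher` and the regex-based text extraction are illustrative assumptions, not the original implementation, and HTML entities are left undecoded.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: fetches a results page and reduces HTML to plain text.
class ResultsPageFetcher {
    // Fetches the body of the given URL as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Naive HTML-to-text: drops script and style blocks, strips tags, collapses whitespace.
    static String extractText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Works on an inline string so the example runs without network access.
        String html = "<html><body><h1>Results</h1><p>maly papers</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The extracted text is what would be handed to the analyzer for indexing. A real crawler would also want timeouts, status-code checks, and a proper HTML parser.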
Demo
Results:
Hardware environment
- Dedicated machine for indexing: No, but nominal usage at time of indexing
- CPU: Intel x86 P4 2.8 GHz
- RAM: 512 DDR
- Drive configuration: IDE 7200 rpm
Software environment
- Lucene version: 1.4
- Java version: 1..2
- OS version: Windows 2000
- Apache web server version: 1.3 to 2.0
- Location of index: local
Create Index
The IndexByLog.java file reads the access logs on the local computer, generates the URLs, fetches and extracts the results pages from those URLs, indexes them, and stores the index in the LuceneIndex folder.
Files extraction and Index Creation
Searching at the prompt
Searching on the web
Results on the web
Conclusion
- It is very easy to implement efficient and powerful search engines using Lucene.
- Lucene can be used to index dynamic pages and database-hidden pages.
- Web server access logs can be used to generate URLs, and Java with the Lucene API can be used to fetch and index database-hidden pages.
- There are some security risks involved, as the access logs can reveal which users ran which searches, along with other sensitive information.
Questions?