
In the name of Allah, the Most Gracious, the Most Merciful

A Novel Web Search Engine Model Based on Index-

Query Bit-Level Compression

Prepared By

Saif Mahmood Saab

Supervised by

Dr. Hussein Al-Bahadili

Dissertation

Submitted In Partial Fulfillment

of the Requirements for the Degree of Doctor of Philosophy

in Computer Information Systems

Faculty of Information Systems and Technology

University of Banking and Financial Sciences

Amman - Jordan

(May - 2011)


Authorization

I, the undersigned, Saif Mahmood Saab, authorize the Arab Academy for Banking and Financial Sciences to provide copies of this Dissertation to Libraries, Institutions, Agencies, and any Parties upon their request.

Name: Saif Mahmood Saab

Signature:

Date: 30/05/2011


Dedications

To the pure soul of my father...

To my beloved mother...

To my dear wife...

To my dear children...

I dedicate this humble work.


Acknowledgments

First and foremost, I thank Allah (Subhana Wa Taala) for endowing me

with health, patience, and knowledge to complete this work.

I am thankful to everyone who supported me during my study. I would like to thank my honorable supervisor, Dr. Hussein Al-Bahadili, who accepted me as his Ph.D. student without any hesitation, offered me so much advice, patiently supervised me, and always guided me in the right direction.

Last but not least, I would like to thank my parents for their support over the years, my wife for her understanding and continued encouragement, and my friends, especially Mahmoud Alsiksek and Ali AlKhaledi.

Words are not enough to express my gratitude to all those who helped me; I would still like to give my many, many thanks to all of these people.


List of Figures

Figure    Title    Page
1.1    Architecture and main components of standard search engine model.    10
3.1    Architecture and main components of the CIQ Web search engine model.    41
3.2    Lists of IDs for each type of character sets assuming m=6.    48
3.3-a    Locations of data and parity bits in 7-bit codeword.    54
3.3-b    An uncompressed binary sequence of 21-bit length divided into 3 blocks of 7-bit length, where b1 and b3 are valid blocks, and b2 is a non-valid block.    54
3.3-c    The compressed binary sequence (18-bit length).    54
3.4    The main steps of the HCDC compressor.    55
3.5    The main steps of the HCDC decompressor.    56
3.6    Variation of Cmin and Cmax with p.    58
3.7    Variation of r1 with p.    59
3.8    Variations of C with respect to r for various values of p.    60
3.9    The compressed file header of the HCDC scheme.    65
4.1    The compression ratio (C) for different sizes index files.    75
4.2    The reduction factor (Rs) for different sizes index files.    76
4.3    Variation of C and average Sf for different sizes index files.    89
4.4    Variation of Rs and Rt for different sizes index files.    89
4.5    The CIQ performance triangle.    90
5.1    The CIQ performance triangle.    92


List of Tables

Table    Title    Page
1.1    Document ID and its contents.    8
1.2    A record and word level inverted indexes for documents in Table (1.1).    8
3.1    List of most popular stopwords (117 stop-words).    47
3.2    Type of character sets and equivalent maximum number of IDs.    47
3.4    Variation of Cmin, Cmax, and r1 with number of parity bits (p).    58
3.6    Variations of C with respect to r for various values of p.    59
3.7    Valid 7-bit codewords.    61
3.8    The HCDC algorithm compressed file header.    64
4.1    List of visited Websites.    71
4.2    The sizes of the generated indexes.    72
4.3    Type and number of characters in each generated inverted index file.    73
4.4    Type and frequency of characters in each generated inverted index file.    74
4.5    Values of C and Rs for different sizes index files.    75
4.6    Performance analysis and implementation validation.    77
4.7    List of keywords.    78
4.8    Values of No, Nc, To, Tc, Sf and Rt for 1000 index file.    79
4.9    Values of No, Nc, To, Tc, Sf and Rt for 10000 index file.    80
4.10    Values of No, Nc, To, Tc, Sf and Rt for 25000 index file.    81
4.11    Values of No, Nc, To, Tc, Sf and Rt for 50000 index file.    82
4.12    Values of No, Nc, To, Tc, Sf and Rt for 75000 index file.    83
4.13    Variation of Sf for different index sizes and keywords.    85
4.14    Variation of No and Nc for different index sizes and keywords.    86
4.15    Variation of To and Tc for different index sizes and keywords.    87
4.16    Values of C, Rs, average Sf, and average Rt for different sizes index files.    88


Abbreviations

ACW    Adaptive Character Wordlength
API    Application Programming Interface
ASCII    American Standard Code for Information Interchange
ASF    Apache Software Foundation
BWT    Burrows-Wheeler block sorting transform
CIQ    Compressed Index-Query
CPU    Central Processing Unit
DBA    Database Administrator
FLH    Fixed-Length Hamming
GFS    Google File System
GZIP    GNU zip
HCDC    Hamming Code Data Compression
HTML    Hypertext Mark-up Language
ID3    A metadata container used in conjunction with the MP3 audio file format
JSON    JavaScript Object Notation
LAN    Local Area Network
LANMAN    Microsoft LAN Manager
LDPC    Low-Density Parity Check
LZW    Lempel-Ziv-Welch
MP3    A patented digital audio encoding format
NTLM    Windows NT LAN Manager
PDF    Portable Document Format
RLE    Run Length Encoding
RSS    Really Simple Syndication
RTF    Rich Text Format
SAN    Storage Area Network
SASE    Shrink And Search Engine
SP4    Windows Service Pack 4
UNIX    UNiplexed Information and Computing Service
URL    Uniform Resource Locator
XML    Extensible Markup Language
ZIP    A data compression and archive format (the name zip meaning "speed")


Table of Contents

Authorization - ii -
Dedications - iii -
Acknowledgments - iv -
List of Figures - v -
List of Tables - vi -
Abbreviations - vii -
Table of Contents - viii -
Abstract - x -

Chapter One: Introduction - 1 -
1.1 Web Search Engine Model - 3 -
1.1.1 Web crawler - 3 -
1.1.2 Document analyzer and indexer - 4 -
1.1.3 Searching process - 9 -
1.2 Challenges to Web Search Engines - 10 -
1.3 Data Compression Techniques - 12 -
1.3.1 Definition of data compression - 12 -
1.3.2 Data compression models - 12 -
1.3.3 Classification of data compression algorithms - 14 -
1.3.4 Performance evaluation parameters - 17 -
1.4 Current Trends in Building High-Performance Web Search Engine - 20 -
1.5 Statement of the Problem - 20 -
1.6 Objectives of this Thesis - 21 -
1.7 Organization of this Thesis - 21 -

Chapter Two: Literature Review - 23 -
2.1 Trends Towards High-Performance Web Search Engine - 23 -
2.1.1 Succinct data structure - 23 -
2.1.2 Compressed full-text self-index - 24 -
2.1.3 Query optimization - 24 -
2.1.4 Efficient architectural design - 25 -
2.1.5 Scalability - 25 -
2.1.6 Semantic search engine - 26 -
2.1.7 Using Social Networks - 26 -
2.1.8 Caching - 27 -
2.2 Recent Research on Web Search Engine - 27 -
2.3 Recent Research on Bit-Level Data Compression Algorithms - 33 -

Chapter Three: The Novel CIQ Web Search Engine Model - 39 -
3.1 The CIQ Web Search Engine Model - 40 -
3.2 Implementation of the CIQ Model: CIQ-based Test Tool (CIQTT) - 42 -
3.2.1 COLCOR: Collects the testing corpus (documents) - 42 -
3.2.2 PROCOR: Processing and analyzing testing corpus (documents) - 46 -
3.2.3 INVINX: Building the inverted index and start indexing - 46 -
3.2.4 COMINX: Compressing the inverted index - 50 -
3.2.5 SRHINX: Searching index (inverted or inverted/compressed index) - 51 -
3.2.6 COMRES: Comparing the outcomes of different search processes performed by SRHINX procedure - 52 -
3.3 The Bit-Level Data Compression Algorithm - 52 -
3.3.1 The HCDC algorithm - 52 -
3.3.2 Derivation and analysis of HCDC algorithm compression ratio - 56 -
3.3.3 The Compressed File Header - 63 -
3.4 Implementation of the HCDC algorithm in CIQTT - 65 -
3.5 Performance Measures - 66 -

Chapter Four: Results and Discussions - 68 -
4.1 Test Procedures - 69 -
4.2 Determination of the Compression Ratio (C) & the Storage Reduction Factor (Rs) - 70 -
4.2.1 Step 1: Collect the testing corpus using COLCOR procedure - 70 -
4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using PROCOR and INVINX procedures - 72 -
4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure - 72 -
4.3 Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt) - 77 -
4.3.1 Choose a list of keywords - 77 -
4.3.2 Perform the search processes - 78 -
4.3.3 Determine Sf and Rt - 84 -
4.4 Validation of the Accuracy of the CIQ Web Search Model - 88 -
4.5 Summary of Results - 88 -

Chapter Five: Conclusions and Recommendations for Future Work - 91 -
5.1 Conclusions - 91 -
5.2 Recommendations for Future Work - 93 -

References - 94 -
Appendix I - 105 -
Appendix II - 108 -
Appendix III - 112 -
Appendix IV - 115 -


Abstract

A Web search engine is an information retrieval system designed to help find information stored on the Web. A standard Web search engine consists of three main components: Web crawler, document analyzer and indexer, and search processor. Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of storage capacity, data retrieval rate, query processing time, and communication overhead. Large search engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations, including succinct data structures, compressed text indexing, query optimization, high-speed processing and communication systems, and efficient search engine architectural design. However, it is believed that the performance of current Web search engine models still falls short of meeting user and application needs.

In this work, we develop a novel Web search engine model based on index-query compression; it is therefore referred to as the compressed index-query (CIQ) model. The model incorporates two compression layers, both implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a double-compressed index, and the second layer is located after the query parser, compressing the query to enable compressed index-query search. The data compression algorithm used is the novel Hamming code data compression (HCDC) algorithm.

The different components of the CIQ model are implemented in a number of procedures forming what is referred to as the CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the CIQ model. The results obtained demonstrate that the new CIQ model attains excellent performance compared to the current uncompressed model: the CIQ model achieves full accuracy, with 100% agreement with the current uncompressed model.

The new model demands less disk space, as the HCDC algorithm achieves a compression ratio of over 1.3 with a compression efficiency of more than 95%, which implies a reduction in storage requirements of over 24%. The new CIQ model also performs faster than the current model, as it achieves a speedup factor of over 1.3, providing a reduction in processing time of over 24%.


Chapter One

Introduction

A search engine is an information retrieval system designed to help in finding files stored on a computer, for example on a public server on the World Wide Web (or simply the Web), on a server on a private network of computers, or on a stand-alone computer [Bri 98]. The search engine allows us to search the storage media for certain content in the form of text meeting specific criteria (typically files containing a given word or phrase) and to retrieve a list of files that match those criteria. In this work, we are concerned with the type of search engine that is designed to help in finding files stored on the Web (the Web search engine).

Webmasters and content providers began optimizing sites for Web search engines in the mid-1990s, as the first search engines were cataloging the early Web. Initially, all a webmaster needed to do was to submit the address of a page, or the uniform resource locator (URL), to the various engines, which would send a spider to crawl that page, extract links to other pages from it, and return the information found on the page to be indexed [Bri 98]. The process involves a search engine crawler downloading a page and storing it on the search engine's own server, where a second program, known as an indexer, extracts various information about the page, such as the words it contains and where they are located, as well as any weight for specific words and all the links the page contains, which are then placed into a scheduler for crawling at a later date [Web 4].

A standard search engine consists of the following main components: Web crawler, document analyzer and indexer, and searching process [Bah 10d]. The main purpose of using a certain data structure for searching is to construct an index that allows focusing the search for a given keyword (query). The improvement in query performance is paid for by the additional space necessary to store the index. Therefore, most of the research in this field has been directed towards designing data structures which offer a good trade-off between query and update time versus space usage.

For this reason, compression always appears as an attractive choice, if not a mandatory one. However, space overhead is not the only resource to be optimized when managing large data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches that combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index is achieving guaranteed performance and lossless data recovery [Fer 01].

In the light of the significant increase in CPU speed, it has become more economical to store data in compressed form than uncompressed. Storing data in a compressed form may introduce a significant improvement in space occupancy and also in processing time, because space optimization is closely related to time optimization in disk memory [Fer 01].

There are a number of trends that have been identified in the literature for building high-performance search engines, such as succinct data structures, compressed full-text self-indexes, query optimization, and high-speed processing and communication systems. Starting from these promising trends, many researchers have tried to combine text compression with indexing techniques and searching algorithms. They have mainly investigated and analyzed the compressed matching problem under various compression schemes [Fer 01].

Due to the rapid growth in the size of the Web, Web search engines are facing enormous

performance challenges, in terms of: (i) storage capacity, (ii) data retrieval rate, (iii) query

processing time, and (iv) communication overhead. The large engines, in particular, have

to be able to process tens of thousands of queries per second on tens of billions of

documents, making query throughput a critical issue. To satisfy this heavy workload,

search engines use a variety of performance optimizations including index compression.

With the tremendous increase in user and application needs, we believe that the current search engine model needs higher retrieval performance, and that more compact and cost-effective systems are still required.

In this work, we develop a novel Web search engine model that is based on index-query bit-level compression. The model incorporates two bit-level data compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a double-compressed index, and the other after the query parser, for query compression to enable bit-level compressed index-query search. As a result, less disk space is required to store the compressed index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is achieved.

An important feature required of the bit-level technique used for performing the search process at the compressed index-query level is that it generates the same compressed binary sequence for the same character in both the search queries and the index files. The data compression technique that satisfies this important feature is the HCDC algorithm [Bah 07b, Bah 08a]; therefore, it is used in this work. Recent investigations on using this algorithm for text compression have demonstrated excellent performance in comparison with many widely used and well-known data compression algorithms and state-of-the-art tools [Bah 07b, Bah 08a].
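To make this property concrete, the following sketch (in Python, for illustration only) uses a toy fixed-length per-character code as a stand-in for the HCDC codec, whose actual details are given in Chapter 3; the codebook, the CODE_LEN value, and the function names are illustrative assumptions, not part of the CIQ implementation. It shows why the property matters: when every occurrence of a character compresses to the same bit pattern, a compressed query can be matched directly against a compressed index entry, so the search never needs to decompress the index.

    # Illustrative sketch only: a toy fixed-length per-character code stands in
    # for the HCDC codec described in Chapter 3. The point illustrated is the CIQ
    # principle: the same character always compresses to the same bit pattern, so
    # a compressed query can be matched directly against a compressed index entry.

    CODE_LEN = 5  # hypothetical fixed codeword length (bits per character)

    def build_codebook(alphabet):
        # Assign each character a fixed-length bit pattern (stand-in for HCDC).
        return {ch: format(i, '0%db' % CODE_LEN) for i, ch in enumerate(sorted(alphabet))}

    def compress(text, codebook):
        # Concatenate the per-character codewords into one bit string.
        return ''.join(codebook[ch] for ch in text)

    def search_compressed(compressed_entry, compressed_query):
        # Match the compressed query at codeword-aligned offsets only.
        for offset in range(0, len(compressed_entry) - len(compressed_query) + 1, CODE_LEN):
            if compressed_entry.startswith(compressed_query, offset):
                return True
        return False

    entry = "amman 2 4 6"                 # one posting line of a record-level index
    query = "amman"
    codebook = build_codebook(set(entry) | set(query))
    print(search_compressed(compress(entry, codebook), compress(query, codebook)))  # True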

1.1 Web Search Engine Model

A Web search engine is an information retrieval system designed to help find files stored on a public server on the Web [Bri 98, Mel 00]. A standard Web search engine consists of the following main components:

• Web crawler

• Document analyzer and indexer

• Searching process

In what follows we provide a brief description for each of the above components.

1.1.1 Web crawler

A Web crawler is a computer program that browses the Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, and Web robots. Each spider has its own agenda as it indexes a site: some search engines use META tags, others may use the META description of a page, and some use the first sentence or paragraph on the site. This means that a page that ranks highly on one Web search engine may not rank as well on another. Given a set of uniform resource locators (URLs), the crawler repeatedly removes one URL from the set, downloads the targeted page, extracts all the URLs contained in it, and adds all previously unknown URLs to the set [Bri 98, Jun 00].
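A minimal sketch of the crawling loop just described is shown below (Python, for illustration only; the regular-expression link extraction and the simple frontier policy are assumptions made for the example, not how any particular production crawler works).

    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=100):
        # Repeatedly remove a URL from the frontier, download the page,
        # extract its links, and add previously unknown URLs to the frontier.
        frontier = list(seed_urls)
        seen = set(seed_urls)
        pages = {}                     # url -> raw HTML, later handed to the indexer
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            try:
                html = urlopen(url, timeout=10).read().decode('utf-8', errors='ignore')
            except Exception:
                continue               # skip unreachable or non-text pages
            pages[url] = html
            # naive link extraction; a real crawler would use an HTML parser
            for href in re.findall(r'href=["\'](.*?)["\']', html):
                link = urljoin(url, href)
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages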

Web search engines work by storing information about many Web pages, which they retrieve from the Web itself. These pages are retrieved by a spider, a sophisticated automated browser which follows every link it extracts or that is stored in its database. The contents of each page are then analyzed to determine how it should be indexed; for example, words are extracted from the titles, headings, or special fields called meta tags.

1.1.2 Document analyzer and indexer

Indexing is the process of creating an index, a specialized file containing a compiled version of the documents retrieved by the spider [Bah 10d]. The indexing process collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, mathematics, informatics, physics, and computer science [Web 5].

The purpose of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would have to scan every (possible) document on the Internet, which would require considerable time and computing power (impossible at the current size of the Internet). For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in the documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval [Web 5].


Index design factors

Major factors should be carefully considered when designing a search engine; these include [Bri 98, Web 5]:

• Merge factors: How data enters the index, or how words or subject features are

added to the index during text corpus traversal, and whether multiple indexers can

work asynchronously. The indexer must first check whether it is updating old

content or adding new content. Traversal typically correlates to the data collection

policy. Search engine index merging is similar in concept to the SQL Merge

command and other merge algorithms.

• Storage techniques: How to store the index data, that is, whether information

should be data compressed or filtered.

• Index size: How much computer storage is required to support the index.

• Lookup speed: How quickly a word can be found in the index. The speed of

finding an entry in a data structure, compared with how quickly it can be updated

or removed, is a central focus of computer science.

• Maintenance: How the index is maintained over time.

• Fault tolerance: How important it is for the service to be robust. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index data structures

Search engine architectures vary in the way indexing is performed and in the methods of index storage used to meet the various design factors. There are many architectures for indexes, and the most widely used is the inverted index. An inverted index saves a list of occurrences of every keyword, typically in the form of a hash table or binary tree [Bah 10c].


During indexing, several processes take place; here, the processes related to our work are discussed. Whether these processes are used depends on the search engine configuration [Bah 10d].

• Extract URLs. A process of extracting all URLs from the document being indexed; they are used to guide crawling of the website, do link checking, build a site map, and build a table of internal and external links from the page.

• Code stripping. A process of removing hyper-text markup language (HTML) tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.

• Language recognition. A process by which a computer program attempts to automatically identify, or categorize, the language or languages in which a document is written.

• Document tokenization. A process of detecting the encoding used for the page; determining the language of the content (some pages use multiple languages); finding word, sentence, and paragraph boundaries; combining multiple adjacent words into one phrase; and changing the case of text.

• Document parsing or syntactic analysis. The process of analyzing a sequence of tokens (for example, words) to determine their grammatical structure with respect to a given (more or less) formal grammar.

• Lemmatization/stemming. The process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form; this stage can be done in the indexing and/or searching stage. The stem does not need to be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The process is useful in search engines for query expansion or indexing and for other natural language processing problems.


• Normalization. The process by which text is transformed to make it consistent in ways it might not have been before. Text normalization is often performed before text is processed in some way, such as generating synthesized speech, automated language translation, storage in a database, or comparison.
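The short sketch below (Python, illustrative only) strings together simplified versions of a few of the steps listed above, namely code stripping, tokenization with normalization, stop-word removal, and a deliberately naive stemmer; the stop-word list and suffix rules are assumptions made for the example, not those used in this work.

    import re

    STOPWORDS = {"is", "a", "the", "in", "of", "and"}   # tiny illustrative subset

    def strip_code(html):
        # Code stripping: drop scripts/styles and HTML tags, keep the visible text.
        html = re.sub(r'(?s)<(script|style).*?</\1>', ' ', html)
        return re.sub(r'<[^>]+>', ' ', html)

    def tokenize(text):
        # Tokenization and normalization: lower-case and split on non-letters.
        return re.findall(r'[a-z]+', text.lower())

    def stem(word):
        # A deliberately naive suffix-stripping stemmer, for illustration only.
        for suffix in ('ing', 'ed', 's'):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(html):
        return [stem(t) for t in tokenize(strip_code(html)) if t not in STOPWORDS]

    print(preprocess("<p>Amman is a modern city</p>"))   # ['amman', 'modern', 'city']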

Inverted Index

The inverted index structure is widely used in modern, super-fast Web search engines such as Google and Yahoo, as well as in search libraries such as Lucene. An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents. The main purpose of using the inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the index [Bri 98, Nag 02, Web 4]. The inverted index is one of the most widely used data structures in information retrieval systems [Web 4, Bri 98].

There are two main variants of inverted indexes [Bae 99]:

(1) A record level inverted index (or inverted file index or just inverted file) contains a

list of references to documents for each word; we use this simple type in our

search engine.

(2) A word level inverted index (or full inverted index or inverted list) additionally

contains the positions of each word within a document; these positions can be used

to rank the results according to document relevancy to the query.

The latter form offers more functionality (such as phrase searches), but needs more time and space to be created. In order to simplify the understanding of the above two inverted indexes, let us consider the following example.


Example

Let us consider a case in which six documents have the text shown in Table (1.1). The contents of the record and word level indexes are shown in Table (1.2).

Table (1.1). Document ID and its contents.

Document ID Text

1 Aqaba is a hot city

2 Amman is a cold city

3 Aqaba is a port

4 Amman is a modern city

5 Aqaba in the south

6 Amman in Jordan

Table (1.2). Record and word level inverted indexes for the documents in Table (1.1).

Record level inverted index          Word level inverted index
Text      Documents                  Text      Documents: Location
Aqaba     1, 3, 5                    Aqaba     1:1, 3:1, 5:1
is        1, 2, 3, 4                 is        1:2, 2:2, 3:2, 4:2
a         1, 2, 3, 4                 a         1:3, 2:3, 3:3, 4:3
hot       1                          hot       1:4
city      1, 2, 4                    city      1:5, 2:5, 4:5
Amman     2, 4, 6                    Amman     2:1, 4:1, 6:1
cold      2                          cold      2:4
the       5                          the       5:3
modern    4                          modern    4:4
south     5                          south     5:4
in        5, 6                       in        5:2, 6:2
Jordan    6                          Jordan    6:3


When we search for the word “Amman”, we get three results: documents 2, 4, and 6 if a record level inverted index is used, and 2:1, 4:1, and 6:1 if a word level inverted index is used. In this work, the record level inverted index is used for its simplicity and because we do not need to rank our results.
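The following sketch (Python, for illustration) builds both index variants for the six documents of Table (1.1) and reproduces the lookup for “Amman” discussed above.

    from collections import defaultdict

    documents = {
        1: "Aqaba is a hot city",
        2: "Amman is a cold city",
        3: "Aqaba is a port",
        4: "Amman is a modern city",
        5: "Aqaba in the south",
        6: "Amman in Jordan",
    }

    record_index = defaultdict(set)    # word -> set of document IDs
    word_index = defaultdict(list)     # word -> list of (document ID, position)

    for doc_id, text in documents.items():
        for position, word in enumerate(text.split(), start=1):
            record_index[word].add(doc_id)
            word_index[word].append((doc_id, position))

    print(sorted(record_index["Amman"]))   # [2, 4, 6]
    print(word_index["Amman"])             # [(2, 1), (4, 1), (6, 1)]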

1.1.3 Searching process

When the index is ready, searching can be performed through a query interface: a user enters a query into the search engine (typically by using keywords), and the engine examines its index and provides a listing of the best-matching Web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text [Bah 10d].

In this stage the results are ranked, where ranking is a relationship between a set of items such that, for any two items, the first is either “ranked higher than”, “ranked lower than”, or “ranked equal to” the second. In mathematics, this is known as a weak order or total pre-order of objects. It is not necessarily a total order of documents, because two different documents can have the same ranking. Ranking is done according to document relevancy to the query, freshness, and popularity [Bri 98]. Figure (1.1) outlines the architecture and main components of the standard search engine model.


Figure (1.1). Architecture and main components of standard search engine model.

1.2 Challenges to Web Search Engines

Building and operating a large-scale Web search engine used by hundreds of millions of people around the world provides a number of interesting challenges [Hen 03, Hui 09, Ois 10, Ami 05]. Designing such systems requires making complex design trade-offs in a number of dimensions, and the main challenges to designing an efficient, effective, and reliable Web search engine are:

• The Web is growing much faster than any present-technology search engine

can possibly index.

• The cost of index storage, which includes the data storage cost, electricity, and cooling of the data center.

• The real-time Web, which is updated in real time, requires a fast and reliable crawler, and its content must then be indexed to make it searchable.


• Many Web pages are updated frequently, which forces the search engine to revisit them periodically.

• Query time (latency): the need to keep up with the increase in index size and to perform the query and show the results in less time.

• Most search engines use keywords for searching, and this limits the results to text pages only.

• Dynamically generated sites, which may be slow or difficult to index, or may

result in excessive results from a single site.

• Many dynamically generated sites are not indexable by search engines; this

phenomenon is known as the invisible Web.

• Several content types, such as multimedia and Flash content, are not crawlable and indexable by search engines.

• Some sites use tricks to manipulate the search engine into displaying them as the first result returned for some keywords; this is known as spamming. It can lead to some search results being polluted, with more relevant links being pushed down in the result list.

• Duplicate hosts: Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection.

• Web graph modeling: the open problem is to come up with a random graph model that models the behavior of the Web graph at the page and host level.

• Scalability: search engine technology should scale in a dramatic way to keep up with the growth of the Web.

• Reliability: a search engine requires reliable technology to support its 24-hour operation and meet users' needs.


1.3 Data Compression Techniques

This section presents the definition, models, classification methodologies and classes, and

performance evaluation measures of data compression algorithms. Further details on data

compression can be found in [Say 00].

1.3.1 Definition of data compression

Data compression algorithms are designed to reduce the size of data so that it requires

less disk space for storage and less memory [Say 00]. Data compression is usually

obtained by substituting a shorter symbol for an original symbol in the source data,

containing the same information but with a smaller representation in length. The symbols

may be characters, words, phrases, or any other unit that may be stored in a dictionary of

symbols and processed by a computing system.

A data compression algorithm usually utilizes an efficient algorithmic transformation of the data representation to produce a more compact representation. Such an algorithm is also known as an encoding algorithm. It is important to be able to restore the original data, either in an exact or an approximate form; therefore, a data decompression algorithm, also known as a decoding algorithm, is also required.

1.3.2 Data compression models

There are a number of data compression algorithms that have been developed throughout

the years. These algorithms can be categorized into four major categories of data

compression models [Rab 08, Hay 08, Say 00]:

1. Substitution data compression model

2. Statistical data compression model

3. Dictionary based data compression model

4. Bit-level data compression model


In substitution compression techniques, a shorter representation is used to replace a sequence of repeating characters. Examples of substitution data compression techniques include: null suppression [Pan 00], run-length encoding [Smi 97], bit mapping, and half-byte packing [Pan 00].

In statistical techniques, the characters in the source file are converted to a binary code, where the most common characters in the file have the shortest binary codes and the least common have the longest; the binary codes are generated based on the estimated probability of each character within the file. Then, the binary coded file is compressed using an 8-bit character wordlength, or by applying the adaptive character wordlength (ACW) algorithm [Bah 08b, Bah 10a] or its variation, the ACW(n,s) scheme [Bah 10a]. Examples of statistical data compression techniques include: Shannon-Fano coding [Rue 06], static/adaptive/semi-adaptive Huffman coding [Huf 52, Knu 85, Vit 89], and arithmetic coding [How 94, Wit 87].

Dictionary-based data compression techniques involve the substitution of sub-strings of text by indices or pointer codes, relative to a dictionary of the sub-strings, as in Lempel-Ziv-Welch (LZW) [Ziv 78, Ziv 77, Nel 89]. Many compression algorithms use a combination of different data compression techniques to improve compression ratios.

Finally, since data files can be represented as binary digits, bit-level processing can be performed to reduce the size of the data. A data file can be represented in binary digits by concatenating the binary sequences of the characters within the file using a specific mapping or coding format, such as ASCII codes, Huffman codes, adaptive codes, etc. The coding format has a huge influence on the entropy of the generated binary sequence and consequently on the compression ratio (C) or the coding rate (Cr) that can be achieved.

The entropy is a measure of the information content of a message and the smallest

number of bits per character needed, on average, to represent a message. Therefore, the

entropy of a complete message would be the sum of the individual characters’ entropy.

The entropy of a character (symbol) is represented as the negative logarithm of its

probability and expressed using base two.


Where the probability of each symbol of the alphabet is constant, the entropy is calculated as [Bel 89, Bel 90]:

E = - Σi=1..n pi log2 pi        (1.1)

where E is the entropy in bits, pi is the estimated probability of occurrence of character (symbol) i, and n is the number of characters.

In bit-level processing, n is equal to 2, as we have only two characters (0 and 1).
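A direct translation of Eqn. (1.1) into code is shown below (Python, illustrative); the example bit string is made up for the demonstration.

    import math

    def entropy(sequence):
        # Zero-order entropy in bits per symbol, Eqn. (1.1).
        n = len(sequence)
        e = 0.0
        for symbol in set(sequence):
            p = sequence.count(symbol) / n   # estimated probability of the symbol
            e -= p * math.log2(p)
        return e

    # Bit-level case: only two symbols, 0 and 1 (n = 2 in Eqn. (1.1)).
    print(entropy("110100110010100101"))     # 1.0 for this balanced 18-bit sequence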

In bit-level data compression algorithms, the binary sequence is usually divided into groups of bits that are called minterms, blocks, subsequences, etc. In this work, we shall use the term blocks to refer to these groups of bits. These blocks might be considered as representing a Boolean function.

Then, algebraic simplifications are performed on these Boolean functions to reduce the

size or the number of blocks, and hence, the number of bits representing the output

(compressed) data is reduced as well. Examples of such algorithms include: the

Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a], the adaptive

HCDC(k) scheme [Bah 07a, Bah 10b, Rab 08], the adaptive character wordlength (ACW)

algorithm [Bah 08b, Bah 10a], the ACW(n,s) scheme [Bah 10a], the Boolean functions

algebraic simplifications algorithm [Nof 07], the fixed length Hamming (FLH) algorithm

[Sha 04], and the neural network based algorithm [Mah 00].
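To illustrate the block-based view used by the Hamming-code-based schemes listed above, the sketch below (Python) checks whether each 7-bit block of a made-up 21-bit sequence is a valid Hamming (7,4) codeword; it is a minimal illustration of the idea behind Figure (3.3-b), not the HCDC algorithm itself, whose details are given in Chapter 3.

    def is_valid_codeword(block):
        # A 7-bit block is a valid Hamming (7,4) codeword (parity bits at
        # positions 1, 2 and 4) when the XOR of the positions holding a '1' is zero.
        assert len(block) == 7
        syndrome = 0
        for position, bit in enumerate(block, start=1):
            if bit == '1':
                syndrome ^= position
        return syndrome == 0

    # Split a made-up 21-bit sequence into 3 blocks of 7 bits and classify them;
    # valid blocks can be stored more compactly than non-valid ones.
    sequence = "000000011101001010101"
    blocks = [sequence[i:i + 7] for i in range(0, len(sequence), 7)]
    print([(b, is_valid_codeword(b)) for b in blocks])
    # [('0000000', True), ('1110100', False), ('1010101', True)]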

1.3.3 Classification of data compression algorithms

Data compression algorithms are categorized by several characteristics, such as:

• Data compression fidelity

• Length of data compression symbols


• Data compression symbol table

• Data compression processing time

In what follows a brief definition is given for each of the above classification criteria.

Data compression fidelity

Basically, data compression can be classified into two fundamentally different styles, depending on the fidelity of the restored data; these are:

(1) Lossless data compression algorithms

In a lossless data compression, a transformation of the representation of the original

data set is performed such that it is possible to reproduce exactly the original data

set by performing a decompression transformation. This type of compression is

usually used in compressing text files, executable codes, word processing files,

database files, tabulation files, and whenever the original needs to be exactly

restored from the compressed data.

Many popular data compression applications have been developed utilizing lossless compression algorithms; for example, lossless compression algorithms are used in the popular ZIP file format and in the UNIX tool gzip. Lossless compression is mainly used for text and executable file compression, as in such files the data must be exactly retrieved, otherwise it is useless. It is also used as a component within lossy data compression technologies. It can usually achieve compression ratios in the range of 2:1 to 8:1.

(2) Lossy data compression algorithms

In a lossy data compression a transformation of the representation of the original

data set is performed such that an exact representation of the original data set can

not be reproduced, but an approximate representation is reproduced by performing

a decompression transformation.


A lossy data compression is used in applications wherein exact representation of

the original data is not necessary, such as in streaming multimedia on the Internet,

telephony and voice applications, and some image file formats. Lossy

compression can provide higher compression ratios of 100:1 to 200:1, depending

on the type of information being compressed. In addition, higher compression

ratio can be achieved if more errors are allowed to be introduced into the original

data [Lel 87].

Length of data compression symbols

Data compression algorithms can be classified, depending on the length of the symbols the algorithm can process, into fixed-length and variable-length algorithms, regardless of whether the algorithm uses variable-length symbols in the original data, in the compressed data, or both.

For example, the run-length encoding (RLE) uses fixed length symbols in both the

original and the compressed data. Huffman encoding uses variable length compressed

symbols to represent fixed-length original symbols. Other methods compress variable-

length original symbols into fixed-length or variable-length compressed data.
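As a concrete example of a fixed-length-symbol scheme, the sketch below implements a minimal run-length encoder and decoder (Python, illustrative only; real RLE formats differ in how they store the counts).

    def rle_encode(data):
        # Replace each run of a repeated symbol by the pair (symbol, run length).
        encoded = []
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i]:
                run += 1
            encoded.append((data[i], run))
            i += run
        return encoded

    def rle_decode(pairs):
        return ''.join(symbol * count for symbol, count in pairs)

    pairs = rle_encode("AAAABBBCCD")
    print(pairs)                 # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    print(rle_decode(pairs))     # 'AAAABBBCCD'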

Data compression symbol table

Data compression algorithms can be classified as either static, adaptive, or semi-adaptive

data compression algorithms [Rue 06, Pla 06, Smi 97]. In static compression algorithms,

the encoding process is fixed regardless of the data content; while in adaptive algorithms,

the encoding process is data dependent. In semi-adaptive algorithms, the data to be

compressed are first analyzed in their entirety, an appropriate model is then built, and

afterwards the data is encoded. The model is stored as part of the compressed data, as it is

required by the decompressor to reverse the compression.

Data compression/decompression processing time

Data compression algorithms can be classified according to the compression/

decompression processing time as symmetric or asymmetric algorithms. In symmetric algorithms, the compression and decompression processing times are almost the same, while for asymmetric algorithms, normally, the compression time is much longer than the decompression time [Pla 06].

1.3.4 Performance evaluation parameters

In order to be able to compare the efficiency of the different compression techniques reliably, and to avoid allowing extreme cases to cloud or bias the comparison unfairly, certain issues need to be considered.

The most important issues that need to be taken into account in evaluating the performance of the various algorithms include [Say 00]:

(1) Measuring the amount of compression

(2) Compression/decompression time (algorithm complexity)

These issues need to be carefully considered in the context for which the compression

algorithm is used. Practically, things like finite memory, error control, type of data, and

compression style (adaptive/dynamic, semi-adaptive or static) are also factors that should

be considered in comparing the different data compression algorithms.

(1) Measuring the amount of compression

Several parameters are used to measure the amount of compression that can be achieved

by a particular data compression algorithm, such as:

(i) Compression ratio (C)

(ii) Reduction ratio (Rs)

(iii) Coding rate (Cr)


Definitions of these parameters are given below.

(i) Compression ratio (C)

The compression ratio (C) is defined as the ratio between the size of the data before

compression and the size of the data after compression. It is expressed as:

C = So / Sc        (1.1)

where So is the size of the original (uncompressed) data and Sc is the size of the compressed data.

(ii) Reduction ratio (Rs)

The reduction ratio (Rs) represents the ratio of the difference between the size of the original data (So) and the size of the compressed data (Sc) to the size of the original data. It is usually given as a percentage and is mathematically expressed as:

Rs = (So - Sc) / So        (1.2)

or, expressed as a percentage in terms of the compression ratio,

Rs = (1 - 1/C) × 100        (1.3)

(iii) Coding rate (Cr)

The coding rate (Cr) expresses the same concept as the compression ratio, but relates it to a more tangible quantity. For example, for a text file, the coding rate may be expressed in “bits/character” (bpc), where an uncompressed text file uses a coding rate of 7 or 8 bpc. In addition, the coding rate of an audio stream may be expressed in “bits/analogue sample”, and for still image compression the coding rate is expressed in “bits/pixel”. In general, the coding rate can be expressed mathematically as:


Cr = q · Sc / So        (1.4)

where q is the number of bits representing each symbol in the uncompressed file. The relationship between the coding rate (Cr) and the compression ratio (C), for example for a text file originally using 7 bpc, is given by:

C = 7 / Cr        (1.5)

It can be clearly seen from Eqn. (1.5) that a lower coding rate indicates a higher compression ratio.
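The three measures are straightforward to compute from the file sizes; the sketch below (Python, illustrative) applies the definitions above to a hypothetical 1000-byte file compressed to 750 bytes.

    def compression_ratio(s_o, s_c):
        # C = So / Sc
        return s_o / s_c

    def reduction_ratio(s_o, s_c):
        # Rs = (So - Sc) / So, expressed as a percentage
        return (s_o - s_c) / s_o * 100.0

    def coding_rate(s_o, s_c, q=7):
        # Cr = q * Sc / So, in bits per character for a q-bpc source
        return q * s_c / s_o

    # Hypothetical example: a 1000-byte file compressed to 750 bytes.
    print(compression_ratio(1000, 750))   # 1.333...  (C > 1.3)
    print(reduction_ratio(1000, 750))     # 25.0      (% of storage saved)
    print(coding_rate(1000, 750))         # 5.25 bpc, and 7 / 5.25 = C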

(2) Compression/decompression time (algorithm complexity)

The compression/decompression time (which is an indication of the algorithm complexity) is defined as the processing time required to compress or decompress the data. The compression and decompression times have to be evaluated separately. As discussed in Section 1.3.3, data compression algorithms are classified according to the compression/decompression time into either symmetric or asymmetric algorithms.

In this context, data storage applications are mainly concerned with the amount of compression that can be achieved and with the decompression processing time required to retrieve the data (asymmetric algorithms), since in such applications the compression is performed only once or is repeated infrequently.

Data transmission applications focus predominantly on reducing the amount of data to be transmitted over communication channels, and the compression and decompression processing times are about the same at the respective junctions or nodes (symmetric algorithms) [Liu 05].

For a fair comparison between the different available algorithms, it is important to

consider both the amount of compression and the processing time. Therefore, it would be

useful to be able to parameterize the algorithm such that the compression ratio and

processing time could be optimized for a particular application.


There are extreme cases where data compression works very well and other conditions where it is inefficient; the type of data that the original data file contains and the upper limits on the processing time have an appreciable effect on the efficiency of the selected technique. Therefore, it is important to select the most appropriate technique for a particular data profile in terms of both data compression and processing time [Rue 06].

1.4 Current Trends in Building High-Performance Web Search Engine

There are several major trends that can be identified in the literature for building high-performance Web search engines. A list of these trends is given below; further discussion is given in Chapter 2. These trends include:

(1) Succinct data structure

(2) Compressed full-text self-index

(3) Query optimization

(4) Efficient architectural design

(5) Scalability

(6) Semantic Search Engine

(7) Using Social Network

(8) Caching

1.5 Statement of the Problem

Due to the rapid growth in the size of the Web, Web search engines are facing enormous

performance challenges, in terms of storage capacity, data retrieval rate, query processing

time, and communication overhead. Large search engines, in particular, have to be able to

process tens of thousands of queries per second on tens of billions of documents, making

query throughput a critical issue. To satisfy this heavy workload, search engines use a

variety of performance optimization techniques, including index compression; some obvious solutions to these issues are to develop more succinct data structures, compressed indexes, query optimization, and higher-speed processing and communication systems.

We believe that the current search engine model cannot meet user and application needs, and that higher retrieval performance and more compact, cost-effective systems are still required. The main contribution of this thesis is to develop a novel Web search engine model that is based on index-query compression; therefore, it is referred to as the CIQ Web search engine model, or simply the CIQ model. The model incorporates two bit-level compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a double-compressed index, and the other after the query parser, for query compression to enable bit-level compressed index-query search. As a result, less disk space is required to store the index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is achieved.

1.6 Objectives of this Thesis

The main objectives of this thesis can be summarized as follows:

• Develop a new Web search engine model that is as accurate as the current Web search engine model, requires less disk space for storing index files, performs the search process faster than current models, reduces disk I/O overheads, and consequently provides a higher retrieval rate or performance.

• Modify the HCDC algorithm to meet the requirements of the new CIQ model.

• Study and optimize the statistics of the inverted index files to achieve maximum

possible performance (compression ratio and minimum searching time).

• Validate the searching accuracy of the new CIQ Web search engine model.

• Evaluate and compare the performance of the new Web search engine model in

terms of disk space requirement and query processing time (searching time) for

different search scenarios.


1.7 Organization of this Thesis

This thesis is organized into five chapters. Chapter 1 provides an introduction to the

general domain of this thesis. The rest of this thesis is organized as follows: Chapter 2 presents a literature review and summarizes some of the previous work related to Web search engines, in particular work related to enhancing the performance of Web search engines through data compression at different levels.

Chapter 3 describes the concept, methodology, and implementation of the novel CIQ Web search engine model. It also includes a detailed description of the HCDC algorithm and the modifications implemented to meet the needs of the new application.

Chapter 4 presents a description of a number of scenarios simulated to evaluate the

performance of the new Web search engine model. The effect of index file size on the

performance of the new model is investigated and discussed. Finally, in Chapter 5, based

on the results obtained from the different simulations, conclusions are drawn and

recommendations for future work are pointed out.


Chapter Two

Literature Review

This work is concerned with the development of a novel high-performance Web search engine model that is based on compressing the index files and search queries using a bit-level data compression technique, namely the novel Hamming-codes-based data compression (HCDC) algorithm [Bah 07b, Bah 08a]. In this model, the search process is performed at the compressed index-query level. The model produces a double-compressed index file, which consequently requires less disk space to store the index files and reduces communication time; on the other hand, compressing the search query reduces I/O overheads and increases the retrieval rate.

This chapter presents a literature review, which is divided into three sections. Section 2.1

presents a brief definition of the current trends towards enhancing the performance of

Web search engines. Then, in Sections 2.2 and 2.3, we present a review of some of the most recent related work on Web search engines and on bit-level data compression algorithms, respectively.

2.1 Trends Towards High-Performance Web Search Engine

Chapter 1 lists several major trends that can be identified in the literature for building high-performance Web search engines. In what follows, we provide a brief definition for each of these trends.

2.1.1 Succinct data structure

Recent years have witnessed an increasing interest in succinct data structures. Their aim is to represent the data using as little space as possible, yet efficiently answer queries on the represented data. Several results exist on the representation of sequences [Fer 07, Ram 02], trees [Far 05], graphs [Mun 97], permutations and functions [Mun 03], and texts [Far 05, Nav 04].

One of the most basic structures, which lies at the heart of the representation of more complex ones, is the binary sequence with rank and select queries. Given a binary sequence S = s1 s2 ... sn, Rankc(S; q) denotes the number of times the bit c appears in S[1; q] = s1 s2 ... sq, and Selectc(S; q) denotes the position in S of the q-th occurrence of bit c. The best results answer these queries in constant time, retrieve any sq in constant time, and occupy nH0(S)+O(n) bits of storage, where H0(S) is the zero-order empirical entropy of S. This space bound includes that for representing S itself, so the binary sequence is represented in compressed form yet still allows those queries to be answered optimally [Ram 02].
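The following naive sketch (Python, illustrative only) fixes the semantics of the two queries; it runs in linear time, whereas the succinct structures cited above answer both queries in constant time within the stated space bound.

    def rank(S, c, q):
        # Rank_c(S; q): number of occurrences of bit c in S[1..q] (1-based).
        return S[:q].count(c)

    def select(S, c, q):
        # Select_c(S; q): 1-based position of the q-th occurrence of bit c in S,
        # or None if c occurs fewer than q times.
        occurrences = 0
        for position, bit in enumerate(S, start=1):
            if bit == c:
                occurrences += 1
                if occurrences == q:
                    return position
        return None

    S = "1011001"
    print(rank(S, '1', 4))     # 3: bits 1..4 are 1,0,1,1
    print(select(S, '1', 3))   # 4: the third '1' is at position 4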

For the general case of sequences over an arbitrary alphabet of size r, the only known result is the one in [Gro 03], which still achieves nH0(S)+O(n) space occupancy. The data structure in [Gro 03] is the elegant wavelet tree; it takes O(log r) time to answer Rankc(S; q) and Selectc(S; q) queries, and to retrieve any character sq.

2.1.2 Compressed full-text self-index

A compressed full-text self-index [Nav 07] represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the full-text indexing techniques of the previous decade, whose indexes required several times the size of the text.

Although it is relatively new, this algorithmic technology has matured to a point where theoretical research is giving way to practical developments. Nonetheless, this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications.

2.1.3 Query optimization

Query optimization is an important skill for search engine developers and database administrators (DBAs). In order to improve the performance of search queries, developers and DBAs need to understand the query optimizer and the techniques it uses to select an access path and prepare a query execution plan. Query tuning involves knowledge of techniques such as cost-based and heuristic-based optimizers, plus the tools a search platform provides for explaining a query execution plan [Che 01].

2.1.4 Efficient architectural design

Answering a large number of queries per second on a huge collection of data requires the equivalent of a small supercomputer, and all current major engines are based on large clusters of servers connected by high-speed local area networks (LANs) or storage area networks (SANs).

There are two basic ways to partition an inverted index structure over the nodes:

• A local index organization where each node builds a complete index on its own

subset of documents (used by AltaVista and Inktomi)

• A global index organization where each node contains complete inverted lists for

a subset of the words.

Each scheme has advantages and disadvantages that we do not have space to discuss here; further discussion can be found in [Bad 02, Mel 00].
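The sketch below (Python, illustrative only) contrasts the two organizations for a toy collection; the modulo and hash assignment policies are arbitrary assumptions made for the example.

    from collections import defaultdict

    def local_partition(documents, num_nodes):
        # Local index organization: each node builds a complete index
        # over its own subset of documents; a query is sent to every node.
        node_indexes = [defaultdict(set) for _ in range(num_nodes)]
        for doc_id, text in documents.items():
            node = doc_id % num_nodes                # arbitrary document-to-node assignment
            for word in text.split():
                node_indexes[node][word].add(doc_id)
        return node_indexes

    def global_partition(documents, num_nodes):
        # Global index organization: each node holds the complete inverted
        # lists for a subset of the words; a query term goes to a single node.
        full_index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.split():
                full_index[word].add(doc_id)
        node_indexes = [dict() for _ in range(num_nodes)]
        for word, postings in full_index.items():
            node_indexes[hash(word) % num_nodes][word] = postings
        return node_indexes

    docs = {1: "Aqaba is a port", 2: "Amman in Jordan"}
    print(local_partition(docs, 2)[1])    # node 1 indexes document 1 only
    print(global_partition(docs, 2))      # each word's full posting list lives on one node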

2.1.5 Scalability

Search engine technology should scale in a dramatic way to keep up with the growth of the Web [Bri 98]. In 1994, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 pages [Mcb 94]. At the end of 1997, the top search engines claimed to index from 2 million (WebCrawler) to 100 million Web documents [Bri 98]. In 2005, Google claimed to index 1.2 billion pages (as shown on the Google home page), and in July 2008 Google announced that it had hit a new milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the Web at once [Web 2].

At the same time, the number of queries search engines handle has grown rapidly too. In March and April 1994, the WWWW received an average of about 1,500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With the increasing number of users on the Web, and automated systems which query search engines, Google handled hundreds of millions of queries per day in 2000 and about 3 billion queries per day in 2009, and Twitter handled about 635 million queries per day [Web 1].

Creating a Web search engine which scales even to today's Web presents many challenges. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indexes and, optionally, the documents themselves as cached pages. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

2.1.6 Semantic search engine

The semantic Web is an extension of the current Web in which information is given well-

defined meaning, better enabling computers and people to work together in cooperation

[Guh 03]. It is the idea of having data on the Web defined and linked in a way that it can

be used for more effective discovery, automation, integration, and reuse across various

applications.

In particular, the semantic Web will contain resources corresponding not just to media objects (such as Web pages, images, audio clips, etc.) as the current Web does, but also to objects such as people, places, organizations and events. Further, the semantic Web will contain not just a single kind of relation (the hyperlink) between resources, but many different kinds of relations between the different types of resources mentioned above [Guh 03].

Semantic search attempts to augment and improve traditional search results (based on information retrieval technology) by using data from the semantic Web, in order to produce precise answers to user queries. This can be done by taking advantage of the availability of explicit semantics of information in the context of a semantic Web search engine [Lei 06].

2.1.7 Using Social Networks

There is an increasing interest in social networks. In general, recent studies suggest that a person's social network has a significant impact on his or her information acquisition [Kir 08]. It is an ongoing trend that people increasingly reveal very personal information on social network sites in particular and on the Web in general.

As this information becomes more and more publicly available from these various social network sites and the Web in general, the social relationships between people can be identified. This in turn enables the automatic extraction of social networks. This trend is further driven and reinforced by recent initiatives such as Facebook's Connect, MySpace's Data Availability, and Google's FriendConnect, which make their social network data available to anyone [Kir 08].

Combining social network data with search engine technology to improve the relevance of results to users and to make results more social is one of the trends currently pursued by search engines such as Google and Bing. Microsoft and Facebook have announced a new partnership that brings Facebook data and profile search to Bing. The deal marks a big leap forward in social search and also represents a new advantage for Bing [Web 3].

2.1.8 Caching

Popular Web search engines receive hundreds of millions of queries per day, and for each search query they return one or more result pages to the user who submitted the query. The user may request additional result pages for the same query, submit a new query, or quit the search process altogether. An efficient scheme for caching query result pages may enable search engines to lower their response time and reduce their hardware requirements [Lem 04].

Studies have shown that a small set of popular queries accounts for a significant fraction

of the query stream. These statistical properties of the query stream seem to call for the

caching of search results [Sar 01].
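As a minimal illustration of this idea (a sketch only; production engines use far more elaborate admission and eviction policies), a least-recently-used cache of query result pages could look as follows:

# A small LRU cache for query result pages, keyed by the normalized query string.
# When the cache is full, the least recently used entry is evicted.

from collections import OrderedDict

class ResultCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, query):
        if query not in self.cache:
            return None
        self.cache.move_to_end(query)        # mark as most recently used
        return self.cache[query]

    def put(self, query, result_page):
        self.cache[query] = result_page
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry

cache = ResultCache(capacity=2)
cache.put("web search", ["doc1", "doc2"])
print(cache.get("web search"))               # ['doc1', 'doc2'] (cache hit)
print(cache.get("bit-level compression"))    # None (cache miss)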

2.2 Recent Research on Web Search Engine

E. Moura et al. [Mou 97] presented a technique to build an index based on suffix arrays

for compressed texts. They developed a compression scheme for textual databases based

on words that generates a compression code that preserves the lexicographical ordering of


the text words. As a consequence, it permits sorting the compressed strings to generate the suffix array without decompressing. Their results demonstrated that, since the compressed text is under 30% of the size of the original text, they were able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index, and search times were reduced to approximately half. They presented analytical and experimental results for different variations of the word-oriented compression paradigm.

S. Varadarajan and T. Chiueh [Var 97] described a text search engine called shrink and

search engine (SASE), which operates in the compressed domain. It provides an exact

search mechanism using an inverted index and an approximate search mechanism using a

vantage point tree. SASE allows a flexible trade-off between search time and storage

space required to maintain the search indexes. The experimental results showed that the

compression efficiency is within 7-17% of GZIP, which is one of the best lossless

compression utilities. The sum of the compressed file size and the inverted indexes is

only between 55-76% of the original database, while the search performance is

comparable to a fully inverted index.

S. Brin and L. Page [Bri 98] presented the Google search engine, a prototype of a large-

scale search engine which makes heavy use of the structure present in hypertext. Google

is designed to crawl and index the Web efficiently and produce much more satisfying

search results than existing systems. They provided an in-depth description of the large-

scale web search engine. Apart from the problems of scaling traditional search techniques

to data of large magnitude, there are many other technical challenges, such as the use of

the additional information present in hypertext to produce better search results. In their

work they addressed the question of how to build a practical large-scale system that can

exploit the additional information present in hypertext.

E. Moura et al. [Mou 00] presented a fast compression and decompression technique for

natural language texts. The novelties are that (i) decompression of arbitrary portions of

the text can be done very efficiently, (ii) exact search for words and phrases can be done

on the compressed text directly by using any known sequential pattern matching


algorithm, and (iii) word-based approximate and extended search can be done efficiently

without any decoding. The compression scheme uses a semi-static word-based model and

a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.

N. Fuhr and N. Govert [Fuh 02] investigated two different approaches for reducing index

space of inverted files for XML documents. First, they considered methods for

compressing index entries. Second, they developed the new XS tree data structure which

contains the structural description of a document in a rather compact form, such that

these descriptions can be kept in main memory. Experimental results on two large XML

document collections show that very high compression rates for indexes can be achieved,

but any compression increases retrieval time.

A. Nagarajarao et al. [Nag 02] implemented an inverted index as a part of a mass

collaboration system. It provides the facility to search for documents that satisfy a given

query. It also supports incremental updates whereby documents can be added without re-

indexing. The index can be queried even when updates are being done to it. Further,

querying can be done in two modes. A normal mode that can be used when an immediate

response is required and a batched mode that can provide better throughput at the cost of

increased response time for some requests. The batched mode may be useful in an alert

system where some of the queries can be scheduled. They implemented generators to

generate large data sets that they used as benchmarks. They tested their inverted index with data sets on the order of gigabytes to ensure scalability.

R. Grossi et al. [Gro 03] presented a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet α, where each symbol is encoded by log|α| bits. They showed that compressed suffix arrays use just nH_h + O(n log log n / log_|α| n) bits, while retaining full-text indexing functionalities, such as searching any pattern sequence of length m in O(m log|α| + polylog(n)) time. The term H_h < log|α| denotes the hth-order empirical entropy of the text, which means that the index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = O(1)


and the alphabet size is small, they obtained a text index with O(m) search time that

requires only O(n) bits.

X. Long and T. Suel [Lon 03] studied pruning techniques that can significantly improve

query throughput and response times for query execution in large engines in the case

where there is a global ranking of pages, as provided by Page rank or any other method,

in addition to the standard term-based approach. They described pruning schemes for this

case, and evaluated their efficiency on an experimental cluster-based search engine with millions of Web pages. Their results showed that there is a significant potential benefit in such techniques.

V. N. Anh and A. Moffat [Anh 04] described a scheme for compressing lists of integers

as sequences of fixed binary codewords that had the twin benefits of being both effective

and efficient. Because Web search engines index large quantities of text, the static costs

associated with storing the index can be traded against dynamic costs associated with

using it during query evaluation. Typically, index representations that are effective and

obtain good compression tend not to be efficient, in that they require more operations

during query processing. The approach described by Anh and Moffat results in a

reduction in index storage costs compared to their previous word-aligned version, with no

cost in terms of query throughput.

Udayan Khurana and Anirudh Koul [Khu 05] presented a new compression scheme for text that achieves high compression ratios and enables very fast searching within the compressed text. Typical compression ratios of 70-80% and a reduction of the search time by 80-85% are the main features of this work. Until then, a trade-off between high compression ratios and searchability within compressed text had been assumed; they showed that the greater the compression, the faster the search.

Stefan Buttcher and Charles L. A. Clarke [But 06] examined index compression techniques for schema-independent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schema-independent indexes have different characteristics


than document-oriented indexes, and this difference can greatly affect the effectiveness of index compression algorithms. Their experimental results show that unaligned binary codes that take into account the special properties of schema-independent indexes achieve better compression rates than methods designed for compressing document indexes, and that they can reduce the size of the index by around 15% compared to byte-aligned index compression.

P. Ferragina et al. [Fer 07] proposed two new compressed representations for general sequences, which produce an index that improves over the one in [Gro 03] by removing from the query times the dependence on the alphabet size and the polylogarithmic terms.

R. Gonzalez and G. Navarro [Gon 07a] introduced a new compression scheme for suffix

arrays which permits locating the occurrences extremely fast, while still being much

smaller than classical indexes. In addition, their index permits a very efficient secondary

memory implementation, where compression permits reducing the amount of I/O needed

to answer queries. Compressed text self-indexes had matured up to a point where they

can replace a text by a data structure that requires less space and, in addition to giving

access to arbitrary text passages, support indexed text searches. At this point those

indexes are competitive with traditional text indexes (which are very large) for counting

the number of occurrences of a pattern in the text. Yet, they are still hundreds to

thousands of times slower when it comes to locating those occurrences in the text.

R. Gonzalez and G. Navarro [Gon 07b] introduced a disk-based compressed text index

that, when the text is compressible, takes little more than the plain text size (and replaces

it). It provides very good I/O times for searching, which in particular improve when the

text is compressible. In this aspect the index is unique, as compressed indexes have been

slower than their classical counterparts on secondary memory. They analyzed their index

and showed experimentally that it is extremely competitive on compressible texts.

A. Moffat and J. S. Culpepper [Mof 07] showed that a relatively simple combination of

techniques allows fast calculation of Boolean conjunctions within a surprisingly small

amount of data transferred. This approach exploits the observation that queries tend to

contain common words, and that representing common words via a bitvector allows


random access testing of candidates and, if necessary, fast intersection operations prior to the list of candidates being developed. Using bitvectors for a very small number of terms that occur frequently (in both documents and in queries), and byte-coded inverted lists for the balance, can reduce both querying time and query-time data-transfer volumes.

The techniques described in [Mof 07] are not applicable to other powerful forms of

querying. For example, index structures that support phrase and proximity queries have a

much more complex structure, and are not amenable to storage (in their full form) using

bitvectors. Nevertheless, there may be scope for evaluation regimes that make use of

preliminary conjunctive filtering before a more detailed index is consulted, in which case

the structures described in [Mof 07] would still be relevant.

Due to the rapid growth in the size of the web, web search engines are facing enormous

performance challenges. The larger engines in particular have to be able to process tens

of thousands of queries per second on tens of billions of documents, making query

throughput a critical issue. To satisfy this heavy workload, search engines use a variety of

performance optimizations including index compression, caching, and early termination.

J. Zhang et al. [Zha 08] focused on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. They performed a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that had not been studied before. They then evaluated different inverted list caching policies on large query traces, and finally studied the possible performance benefits of combining compression and caching. The overall goal of their paper was to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.

P. Ferragina et al. [Fer 09] presented an article to fill the gap between implementations and focused comparisons of compressed indexes. They presented the existing implementations of compressed indexes from a practitioner's point of view; introduced the Pizza&Chili site, which offers tuned implementations and a standardized API for the


most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and test; and, finally, they showed the results of extensive experiments on a number of codes with the aim of demonstrating the practical relevance of this novel algorithmic technology.


H. Yan et al. [Yan 09] studied index compression and query processing techniques for such reordered indexes. Previous work had focused on determining the best possible ordering of documents; in contrast, they assumed that such an ordering is already given, and focused on how to optimize compression methods and query processing for this case. They performed an extensive study of compression techniques for document IDs and presented new optimizations of existing techniques which can achieve significant improvements in both compression and decompression performance. They also proposed and evaluated techniques for compressing frequency values for this case. Finally, they studied the effect of this approach on query processing performance. Their experiments showed very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million Web pages.

2.3 Recent Research on Bit-Level Data Compression Algorithms

This section presents a review of some of the most recent research on developing efficient bit-level data compression algorithms, as the algorithm used in this thesis is a bit-level technique.

A. Jardat and M. Irshid [Jar 01] proposed a very simple and efficient binary run-length

compression technique. The technique is based on mapping the non-binary information


source into an equivalent binary source using a new fixed-length code instead of the ASCII code. The codes are chosen such that the probability of one of the two binary symbols, say zero, at the output of the mapper is made as small as possible. Moreover, the "all ones" code is excluded from the code assignment table to ensure the presence of at least one "zero" in each of the output codewords.

Compression is achieved by encoding the number of "ones" between two consecutive

"zeros" using either a fixed-length code or a variable-length code. When applying this

simple encoding technique to English text files, they achieve a compression of 5.44 bpc

(bit per character) and 4.6 bpc for the fixed-length code and the variable length

(Huffman) code, respectively.

Caire et al [Cai 04] presented a new approach to universal noiseless compression based

on error correcting codes. The scheme was based on the concatenation of the Burrows-

Wheeler block sorting transform (BWT) with the syndrome former of a low-density

parity-check (LDPC) code. Their scheme has linear encoding and decoding times and

uses a new closed-loop iterative doping algorithm that works in conjunction with belief-

propagation decoding. Unlike the leading data compression methods, their method is

resilient against errors, and lends itself to joint source-channel encoding/decoding;

furthermore their method offers very competitive data compression performance.

A. A. Sharieh [Sha 04] introduced a fixed-length Hamming (FLH) algorithm as

enhancement to Huffman coding (HU) to compress text and multimedia files. He

investigated and tested these algorithms on different text and multimedia files. His results

indicated that the HU-FLH and FLH-HU enhanced the compression ratio.

K. Barr and K. Asanović [Bar 06] presented a study of the energy savings made possible by losslessly compressing data prior to transmission. Because wireless transmission of a single bit can require over 1000 times more energy than a single 32-bit computation, it can be beneficial to perform additional computation to reduce the number of bits transmitted.

If the energy required to compress data is less than the energy required to send it, there is


a net energy savings and an increase in battery life for portable computers. This work

demonstrated that, with several typical compression algorithms, there was actually a net

energy increase when compression was applied before transmission. Reasons for this

increase were explained and suggestions were made to avoid it. One such energy-aware

suggestion was asymmetric compression, the use of one compression algorithm on the

transmit side and a different algorithm for the receive path. By choosing the lowest-

energy compressor and decompressor on the test platform, overall energy to send and

receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up

to 57% over the default symmetric scheme.

The value of this research is not merely to show that one can optimize a given algorithm

to achieve a certain reduction in energy, but to show that the choice of how and whether

to compress is not obvious. It is dependent on hardware factors such as relative energy of

the central processing unit (CPU), memory, and network, as well as software factors

including compression ratio and memory access patterns. These factors can change, so

techniques for lossless compression prior to transmission/reception of data must be re-

evaluated with each new generation of hardware and software.

A. Jaradat et al. [Jar 06] proposed a file splitting technique for the reduction of the nth-

order entropy of text files. The technique is based on mapping the original text file into a

non-ASCII binary file using a new codeword assignment method and then the resulting

binary file is split into several sub-files, each containing one or more bits from each codeword of the mapped binary file. The statistical properties of the sub-files were studied, and it was found that they reflect the statistical properties of the original text file, which was not the case when the ASCII code was used as a mapper.

The nth-order entropy of these sub-files was determined, and it was found that the sum of their entropies was less than that of the original text file for the same values of extensions. These interesting statistical properties of the resulting sub-files can be used to achieve better compression ratios when conventional compression techniques are applied to these sub-files individually and on a bit-wise rather than a character-wise basis.

H. Al-Bahadili [Bah 07b, Bah 08a] developed a lossless binary data compression scheme


that is based on the error correcting Hamming codes. It was referred to as the HCDC

algorithm. In this algorithm, the binary sequence to be compressed is divided into blocks of n-bit length. To utilize the Hamming codes, each block is considered as a Hamming codeword that consists of p parity bits and d data bits (n = d + p).

Then each block is tested to find if it is a valid or a non-valid Hamming codeword. For a

valid block, only the d data bits preceded by 1 are written to the compressed file, while

for a non-valid block all n bits preceded by 0 are written to the compressed file. These

additional 1 and 0 bits are used to distinguish the valid and the non-valid blocks during

the decompression process.

An analytical formula was derived for computing the compression ratio as a function of

block size, and fraction of valid data blocks in the sequence. The performance of the

HCDC algorithm was analyzed, and the results obtained were presented in tables and

graphs. The author concluded that the maximum compression ratio that can be achieved

by this algorithm is n/(d+1), if all blocks are valid Hamming codewords.
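As a quick worked example of this bound, consider the (7,4) Hamming code (n = 7, d = 4), which is also the configuration used later in this work: every valid block shrinks from 7 bits to d + 1 = 5 bits (the data bits plus the flag bit), so

Cmax = n / (d + 1) = 7 / 5 = 1.4

and this maximum is reached only when every block in the sequence happens to be a valid codeword.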

S. Nofal [Nof 07] proposed a bit-level files compression algorithm. In this algorithm, the

binary sequence is divided into a set of groups of bits, which are considered as minterms

representing Boolean functions. Applying algebraic simplifications to these functions reduces, in turn, the number of minterms, and hence the number of bits of the file is reduced as well. To make decompression possible, one should solve the problem of dropped Boolean variables in the simplified functions. He investigated one possible solution, and his evaluation shows that future work should find other solutions to render this technique useful, as the maximum compression ratio achieved was not more than 10%.

H. Al-Bahadili and S. Hussain [Bah 08b] proposed and investigated the performance of a

bit-level data compression algorithm, in which the binary sequence is divided into blocks, each of n-bit length. This gives each block a possible decimal value between 0 and 2^n - 1. If the number of different decimal values (d) is equal to or less than 256, then the binary sequence can be compressed using an n-bit character wordlength. Thus, a compression ratio of approximately n/8 can be achieved. They referred to this algorithm as the


adaptive character wordlength (ACW) algorithm; since the compression ratio of the algorithm is a function of n, it is referred to as the ACW(n) algorithm.

Implementation of the ACW(n) algorithm highlights a number of issues that may degrade

its performance, and need to be carefully resolved, such as: (i) If d is greater than 256,

then the binary sequence cannot be compressed using n-bit character wordlength, (ii) the

probability of being able to compress a binary sequence using n-bit character wordlength

is inversely proportional to n, and (iii) finding the optimum value of n that provides

maximum compression ratio is a time consuming process, especially for large binary

sequences. In addition, for text compression, converting text to binary using the

equivalent ASCII code of the characters gives a high entropy binary sequence, thus only a

small compression ratio or sometimes no compression can be achieved.
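The feasibility condition at the heart of this discussion can be sketched in a few lines of Python (an illustration of the condition described above, not the authors' implementation): the sequence is cut into n-bit blocks and the number of distinct block values d is counted; only if d does not exceed 256 can each block be mapped to a single byte, giving a ratio of about n/8.

# Sketch of the ACW(n) feasibility condition: split the binary sequence into
# n-bit blocks and count the distinct block values; if there are at most 256 of
# them, each block can be mapped to one byte, for a ratio of roughly n/8.

def acw_feasible(bits, n):
    blocks = {bits[i:i + n] for i in range(0, len(bits) - len(bits) % n, n)}
    d = len(blocks)
    return d <= 256, d

bits = "0110100101101001" * 8          # toy binary sequence (128 bits)
ok, d = acw_feasible(bits, 16)
if ok:
    print(f"{d} distinct 16-bit blocks -> compressible, ratio about {16 / 8:.1f}")
else:
    print(f"{d} distinct blocks -> a 16-bit character wordlength cannot be used")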

To overcome the drawbacks of the ACW(n) algorithm, Al-Bahadili and Hussain [Bah 10a] developed an efficient implementation scheme to enhance its performance. In this scheme the binary sequence is divided into a number of subsequences (s), each of which satisfies the condition that d is less than 256; therefore, it is referred to as the ACW(n,s) scheme. The scheme achieved compression ratios of more than 2 on most text files from the most widely used corpora.

H. Al-Bahadili and A. Rababa'a [Bah 07a, Rab 08, Bah 10b] developed a new scheme consisting of six steps, some of which are applied repetitively to enhance the compression ratio of the HCDC algorithm [Bah 07b, Bah 08a]; therefore, the new scheme is referred to as the HCDC(k) scheme, where k refers to the number of repetition loops. The repetition loops continue until inflation is detected. The overall (accumulated) compression ratio is the product of the compression ratios of the individual loops. The results obtained for the HCDC(k) scheme demonstrated that it has a higher compression ratio than most well-known text compression algorithms, and also exhibits competitive performance with respect to many widely-used state-of-the-art software tools. The HCDC algorithm and the HCDC(k) scheme will be discussed in detail in the next chapter.


S. Ogg and B. Al-Hashimi [Ogg 06] proposed a simple yet effective real-time compression technique that reduces the amount of bits sent over serial links. The proposed technique reduces the number of bits and the number of transitions when compared to the original uncompressed data. Results of compression on two MPEG1 coded picture data sets showed average bit reductions of approximately 17% to 47% and average transition reductions of approximately 15% to 24% over a serial link. The technique can be employed with network-on-chip (NoC) technology to improve the bandwidth bottleneck issue. Fixed and dynamic block sizing was considered, and general guidelines for determining a suitable fixed block length and an algorithm for dynamic block sizing were given. The technique exploits the fact that unused significant bits do not need to be transmitted. Also, the authors outlined a possible implementation of the proposed compression technique, and the area overhead costs and potential power and bandwidth savings within a NoC environment were presented.

J. Zhang and X. Ni [Zha 10] presented a new implementation of bit-level arithmetic coding using integer additions and shifts. The algorithm has less computational complexity and more flexibility, and thus is very suitable for hardware design. They showed that their implementation has the lowest complexity and the highest speed compared with Zhao's algorithm [Zha 98], the Rissanen-Mohiuddin (RM) algorithm [Ris 89], the Langdon-Rissanen (LR) algorithm [Lan 82], and the basic arithmetic coding algorithm. Sometimes it achieves a higher compression rate than the basic arithmetic coding algorithm. Therefore, it provides an excellent compromise between good performance and low complexity.


Chapter Three

The Novel CIQ Web Search Engine Model

This chapter presents a description of the proposed Web search engine model. The model

incorporates two bit-level data compression layers, both installed at the back-end

processor, one for index compression (index compressor), and one for query compression

(query or keyword compressor), so that the search process can be performed at the

compressed index-query level and avoid any decompression activities during the

searching process. Therefore, it is referred to as the compressed index-query (CIQ)

model. In order to be able to perform the search process at the compressed index-query

level, it is important to have a data compression technique that is capable of producing

the same pattern for the same character from both the query and the index.

The algorithm that meets the above main requirements is the novel Hamming code data

compression (HCDC) algorithm [Bah 07b, Bah 08a]. The HCDC algorithm creates a

compressed file header (compression header) to store some parameters that are relevant

to compression process, which mainly include the character-binary coding pattern. This

header should be stored separately to be accessed by the query compressor and the index

decompressor. Introducing the new compression layers should reduce disk space for

storing index files; increase query throughput and consequently retrieval rate. On the

other hand, compressing the search query reduces I/O overheads and query processing

time as well as the system response time.

This section outlines the main theme of this chapter. The rest of this chapter is organized as follows: a detailed description of the new CIQ Web search engine model is given in Section 3.1. Section 3.2 presents the implementation of the new model and its main procedures. The data compression algorithm, namely the HCDC algorithm, is described in Section 3.3, where the derivation and analysis of the HCDC compression ratio are also given. The performance measures that are used to evaluate and compare the performance of the new model are introduced in Section 3.4.


3.1 The CIQ Web Search Engine Model

In this section, a description of the proposed Web search engine model is presented. The

new model incorporates two bit-level data compression layers, both installed at the back-

end processor, one for index compression (index compressor) and one for query

compression (query compressor or keyword compressor), so that the search process can

be performed at the compressed index-query level and avoid any decompression

activities, therefore, we refer to it as the compressed index-query (CIQ) Web search

engine model or simply the CIQ model.

In order to be able to perform the search process at the CIQ level, it is important to have a

data compression technique that is capable of producing the same pattern for the same

character from both the index and the query. The HCDC algorithm [Bah 07b, Bah 08a]

which will be described in the next section, satisfies this important feature, and it will be

used at the compression layers in the new model. Figure (3.1) outlines the main

components of the new CIQ model and where the compression layers are located.

It is believed that introducing the new compression layers reduces the disk space needed for storing index files and increases query throughput and, consequently, the retrieval rate. On the other hand, compressing the search query reduces I/O overheads and query processing time, as well as the system response time.

The CIQ model works as follows: at the back-end processor, after the indexer generates the index and before sending it to the index storage device, it keeps the index in temporary memory, applies lossless bit-level compression using the HCDC algorithm, and then sends the compressed index file to the storage device. As a result, the index requires less disk space, enabling more documents to be indexed and accessed in comparatively less CPU time.

The HCDC algorithm creates a compressed-file header (compression header) to store

some parameters that are relevant to compression process, which mainly include the

character-to-binary coding pattern. This header should be stored separately to be accessed

by the query compression layer (query compressor).


On the other hand, the query parser, instead of passing the query directly to the index file, passes it to the query compressor before accessing the index file. In order to produce the same binary pattern for the same compressed characters from the index and the query, the character-to-binary codes used in converting the index file are passed to the query compressor. If a match is found, the retrieved data is decompressed using the index decompressor and passed through the ranker and the search engine interface to the end-user.

Figure (3.1). Architecture and main components of the CIQ Web search engine model.


3.2 Implementation of the CIQ Model: The CIQ-based Test Tool (CIQTT)

This section describes the implementation of a CIQ-based test tool (CIQTT), which is

developed to:

(1) Validate the accuracy and integrity of the retrieved data, ensuring that the same data sets can be retrieved using the new CIQ model.
(2) Evaluate the performance of the CIQ model, estimating the reduction in the index file storage requirement and the processing or search time.
The CIQTT consists of six main procedures; these are:

(1) COLCOR: Collecting the testing corpus (documents).

(2) PROCOR: Processing and analyzing the testing corpus (documents).

(3) INVINX: Building the inverted index and start indexing.

(4) COMINX: Compressing the inverted index.

(5) SRHINX: Searching the index file (inverted or inverted/compressed index).

(6) COMRES: Comparing the outcomes of different search processes performed by

SRHINX procedure.

In what follows, we shall provide a brief description for each of the above procedures.

3.2.1 COLCOR: Collects the testing corpus (documents)

In this procedure, the Nutch crawler [Web 6] is used to collect the targeted corpus (documents). Nutch is an open-source search technology initially developed by Douglas Reed Cutting, an advocate and creator of open-source search technology. He originated Lucene and, with Mike Cafarella, Nutch; both are open-source search projects now managed through the Apache Software Foundation (ASF). Nutch builds on Lucene [Web 7] and Solr [Web 8].


Lucene information retrieval software library

Lucene, also known as Apache Lucene, is a free/open-source information retrieval software library, originally created in Java, and released under the Apache Software License. It has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP. While it is suitable for any application which requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching [Web 7].

At the core of Lucene's logical architecture is the idea of a document containing fields of

text. This flexibility allows Lucene's Application Programming Interface (API) to be

independent of the file format. Text from PDFs, HTML, Microsoft Word, and

OpenDocument documents, as well as many others, can all be indexed as long as their

textual information can be extracted.

Solr information retrieval software library

Solr is an open-source enterprise search platform from the Apache Lucene project. Its

major features include powerful full-text search, hit highlighting, faceted search, dynamic

clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing

distributed search and index replication, Solr is highly scalable.

Solr is written in Java and runs as a standalone full-text search server within a servlet

container such as Apache Tomcat. Solr uses the Lucene Java search library at its core for

full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make

it easy to use from virtually any programming language. Solr's powerful external

configuration allows it to be tailored to almost any type of application without Java

coding, and it has extensive plugin architecture when more advanced customization is

required.

Apache Lucene and Apache Solr are both produced by the same ASF development team

since the project merge in 2010. It is common to refer to the technology or products as

Lucene/Solr or Solr/Lucene.


The main features of the Nutch crawler can be summarized as follows:

• Fetching, parsing and indexation in parallel and/or distributed clusters.

• Supports many formats, such as plaintext, HTML, XML, ZIP, OpenDocument

(OpenOffice.org), Microsoft Office (Word, Excel, Access, PowerPoint), PDF,

JavaScript, RSS, RTF, MP3 (ID3 tags).

• It has a highly modular architecture, allowing developers to create plug-ins for

media-type parsing, data retrieval, querying and clustering.

• It supports ontology or Web archiving, which is the process of collecting portions

of the Web and ensuring the collection is preserved in an archive, such as an

archive site, for future researchers, historians, and the public. Due to the massive

size of the Web, Web archivists typically employ Web crawlers for automated

collection. The largest Web archiving organization based on a crawling approach

is the Internet Archive which strives to maintain an archive of the entire Web.

• It is based on MapReduce, which is a patented software framework introduced by

Google to support distributed computing on large data sets on clusters of

computers. The framework is inspired by the “map” and “reduce” functions

commonly used in functional programming, although their purpose in the

MapReduce framework is not the same as their original forms. MapReduce

libraries have been written in C++, C#, Erlang, Java, Ocaml, Perl, Python, Ruby,

F#, R and other programming languages.

• It supports a distributed file system (via Hadoop) [Web 10], which is a software framework that supports data-intensive distributed applications under a free

license. It enables applications to work with thousands of nodes and petabytes of

data. Hadoop was inspired by Google's MapReduce and Google File System

(GFS) papers. It uses the Java programming language. Yahoo! uses Hadoop

extensively across its businesses. Hadoop was created by Doug Cutting to support

distribution for the Nutch search engine project.


• It utilizes Windows NT LAN Manager (NTLM) authentication, [Web 11] which is

a suite of Microsoft security protocols that provides authentication, integrity, and

confidentiality to users. NTLM is the successor to the authentication protocol in

Microsoft LAN Manager (LANMAN), an older Microsoft product, and attempts

to provide backwards compatibility with LANMAN. NTLM version two

(NTLMv2), which was introduced in Windows NT 4.0 SP4 (and natively

supported in Windows 2000), enhances NTLM security by hardening the protocol

against many spoofing attacks, and adding the ability for a server to authenticate

to the client.

NTLM is a challenge-response authentication protocol which uses three messages

to authenticate a client in a connection oriented environment (connectionless is

similar), and a fourth additional message if integrity is desired.

i. First, the client establishes a network path to the server and sends a

NEGOTIATE_MESSAGE advertising its capabilities.

ii. Next, the server responds with CHALLENGE_MESSAGE which is

used to establish the identity of the client.

iii. Finally, the client responds to the challenge with an AUTHENTICATE_MESSAGE. There are three versions of NTLM: NTLMv1, NTLMv2, and NTLMv2-Session [Web 11].

Furthermore, Nutch crawler has some advantages over a simple fetcher, such as:

1. It is highly scalable and relatively feature rich crawler.

2. It obeys robots.txt rules

3. It is robust and scalable. It can run on a cluster of 100 machines.

4. It is high quality as it can bias the crawling to fetch “important” pages first.

5. It supports clustering of the retrieved data.

6. It can access a link-graph database.


3.2.2 PROCOR: Processing and analyzing the testing corpus (documents)

This procedure implements the processing and analysis of the corpus (documents), which goes through several stages; these are:

• Language filtering. In this stage all non-English documents are filtered out to obtain an English-only index.
• Content extraction and code stripping. In this stage the content is extracted and isolated from the site menus, navigators, copyright notes and any other non-related text. The HTML tags, styles and any scripting code are also removed and stripped.

• Special-character removal. In this stage all the special characters are removed. A list of 137 special characters is given in Appendix I.
• Stop-word removal. Removing stop-words is a well-known practice in Web search engine technology, especially when inverted indexes are used. A list of 117 stop-words is given in Table (3.1) [Web 12, Web 13], which represents the most frequent words in English-language Internet content.

• Converting characters to lower-case format. In this stage all characters are converted to a lower-case format. A minimal sketch of these stages is given below.
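The sketch below strings these stages together in Python; the short special-character and stop-word lists stand in for the full 137-character list of Appendix I and the 117 stop-words of Table (3.1), and the language filter is reduced to a crude ASCII-ratio placeholder.

# A minimal sketch of the PROCOR stages: strip markup, remove special characters
# and stop-words, and lower-case the text. The short lists below stand in for the
# full lists of Appendix I and Table (3.1).

import re

SPECIAL_CHARS = set("!@#$%^&*()_+{}[]|\\:;\"'<>,.?/~`-=")
STOP_WORDS = {"a", "about", "an", "and", "are", "as", "at", "be", "by",
              "for", "from", "in", "is", "it", "of", "on", "the", "to", "with"}

def is_english(text):
    # placeholder for the language filter: keep documents that are mostly ASCII
    return sum(ch.isascii() for ch in text) / max(len(text), 1) > 0.9

def preprocess(html):
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.S | re.I)                  # strip scripting and style code
    text = re.sub(r"<[^>]+>", " ", text)              # strip the remaining HTML tags
    text = "".join(ch for ch in text if ch not in SPECIAL_CHARS)
    words = [w.lower() for w in text.split()]         # lower-case conversion
    return [w for w in words if w not in STOP_WORDS]  # stop-word removal

doc = "<html><body><p>The Compression of THE Inverted Index!</p></body></html>"
if is_english(doc):
    print(preprocess(doc))   # ['compression', 'inverted', 'index']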

3.2.3 INVINX: Building the inverted index and start indexing.

We chose a simple inverted index architecture that contains each keyword and the IDs of the documents in which it occurs. The source code for this step, developed using the Python programming language, is given in Appendix II. Before the indexing process takes place, all the crawled documents must be renamed (numbered) with a new and unique ID (sequence number). There are many numbering methods that can be adopted to derive and generate these IDs; all of them are based on assigning a unique m-digit ID to each crawled document.

These sequences are either complete m-digit numeric or m-letter IDs, or a combination of numeric digits and letters. Table (3.2) shows the possible character sets that can be used and the maximum number of IDs that can be generated, depending on the character set and the length of the ID (m).


Table (3.1). List of the most popular stop-words (117 stop-words).

a able about across after all almost also

am among an and any are as at

be because been but by can cannot could

dear did do does either else ever every

for from get got had has have he

her hers him his how however i if

in into is it its just least let

like likely may me might most must my

neither no nor not of off often on

only or other our own rather said say

says she should since so some than that

the their them then there these they this

to too us wants was we were what

when where which while who whom why will

with would yet you your

Table (3.2). Type of character sets and equivalent maximum number of IDs.

#  Character set                                                   Max. No. of IDs
1  Numeric (0-9)                                                   10^m
2  Small-letter only (a-z)                                         26^m
3  Capital-letter only (A-Z)                                       26^m
4  Numeric and Small-letter (0-9 and a-z)                          36^m
5  Numeric and Capital-letter (0-9 and A-Z)                        36^m
6  Small-letter and Capital-letter (a-z and A-Z)                   52^m
7  Numeric, Small-letter, and Capital-letter (0-9, a-z, and A-Z)   62^m


These sequences are either generated sequentially in ascending or descending order or

generated randomly subject to a condition that each document should have a unique ID.

As an example, Figure (3.2) lists the range of IDs that are generated sequentially using

different numbering methods for each type of character sets. Figure (3.2) also shows the

maximum number of documents that can be renamed for each character set assuming

m=6 (6-digit IDs).

Numeric (Max. IDs = 10^6 = 1,000,000): 000000, 000001, …, 999999
Small-letter only or Capital-letter only (Max. IDs = 26^6 = 308,915,776): aaaaaa, …, zzzzzz or AAAAAA, …, ZZZZZZ
Numeric and Small-letter or Numeric and Capital-letter (Max. IDs = 36^6 = 2,176,782,336): 000000, …, zzzzzz or 000000, …, ZZZZZZ
Small-letter and Capital-letter (Max. IDs = 52^6 = 19,770,609,664): aaaaaa, …, ZZZZZZ
Numeric, Small-letter, and Capital-letter (Max. IDs = 62^6 = 56,800,235,584): 000000, …, ZZZZZZ

Figure (3.2). Lists of IDs for each type of character sets assuming m=6.


It can be easily seen from Figure (3.2) that the type of character set and length of the ID

determine the maximum number of IDs that can be generated. At the same time, the

character frequencies of the index file may change significantly, and the actual change

depends on the size of the index file generated. This last issue should be carefully

considered, as it could significantly affect the data compression ratio of the

compression algorithm. However, we must emphasize that at this stage we have given

little attention to this issue, and we use the indexing method described below.

The procedure INVINX is implemented in a number of stages to provide high flexibility

and to enable straightforward evolution. These stages can be summarized as follows:

• Select the indexing (document ID) character set, e.g., one of the character sets in Table (3.2).
• Select the length of the document IDs (m).
• Select the ID generation method (e.g., ascending, descending, random).
In this work, a numeric character set of six digits is used, which enables us to index up to 10^6 documents. A list of unique 6-digit random numbers is generated, and every document is renamed with a number from this list. This number is known as the document ID.

• Scan and list all the keywords that occur in the crawled documents.
To generate the inverted index, we scan and list all the keywords that occur in all the crawled documents, and then sort these keywords together with the IDs of the documents containing them. Two special characters are added before and after every keyword to differentiate it from the document IDs; these are:
• The character '^' is added before the keyword.
• The character '|' is added after the keyword.
• Construct and store the index file.


Illustrative Example 2

Suppose crawled documents are scanned, and the keywords phd, compute, and science

are found in some of the crawled documents as shown next to each of them below:

^phd|317056002649740002861661625740406306

^compute|312045910913739159637103

^science|513222003112101555

• The keyword phd occurred 6 times in documents: 317056, 002649, 740002,

861661, 625740, and 406306

• The keyword compute occurred 4 times in documents: 312045, 910913,

739159, and 637103.

• The keyword science occurred 3 times in documents: 513222, 003112, and

101555.

It can be seen from the above discussion that the maximum number of character types that may appear in the index file is 38; these are: 10 numeric digits (0-9), 26 letters (a-z), and 2 special characters (^ and |).
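The following Python sketch reproduces this layout (a simplified illustration, not the code of Appendix II): each document in a toy corpus is assigned a unique random 6-digit numeric ID, and one index line is emitted per keyword in the ^keyword|ID1ID2... format of the example above.

# Sketch of the INVINX idea: assign unique random 6-digit IDs to the documents
# and emit one inverted-index line per keyword in the form ^keyword|ID1ID2ID3...
# where every ID is exactly m = 6 numeric digits.

import random

def assign_ids(documents, m=6):
    ids = random.sample(range(10 ** m), len(documents))      # unique m-digit numbers
    return {str(doc_id).zfill(m): words for doc_id, words in zip(ids, documents)}

def build_inverted_index(docs_by_id):
    index = {}
    for doc_id, words in docs_by_id.items():
        for word in set(words):
            index.setdefault(word, []).append(doc_id)
    # '^' before the keyword, '|' after it, then the concatenated document IDs
    return "\n".join("^" + word + "|" + "".join(sorted(doc_ids))
                     for word, doc_ids in sorted(index.items()))

corpus = [["phd", "science"], ["compute", "science"], ["phd", "compute"]]
print(build_inverted_index(assign_ids(corpus)))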

3.2.4 COMINX: Compressing the inverted index

In general, the data compression algorithm is implemented in this procedure. It includes the following sub-procedures:

1. Read the index file.

2. Compress the index file using the index file compressor.

3. Store the compressed index file.

4. Store the compression header to be accessed by the query (keyword) compressor

and the extracted data decompressor.
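A minimal sketch of this wrapper is shown below; hcdc_compress_with_header is a hypothetical helper standing in for the HCDC compressor of Section 3.3, assumed to return the compressed stream together with the compression header (character-to-binary coding table) that COMINX stores separately.

# Sketch of the COMINX flow: read the index file, compress it, and store the
# compressed index and the compression header in separate files.
# hcdc_compress_with_header is a placeholder for the HCDC compressor.

def cominx(index_path, compressed_path, header_path, hcdc_compress_with_header):
    with open(index_path, "r", encoding="ascii") as f:
        index_text = f.read()                                 # 1. read the index file
    compressed, header = hcdc_compress_with_header(index_text)  # 2. compress it
    with open(compressed_path, "wb") as f:
        f.write(compressed)                                   # 3. store the compressed index
    with open(header_path, "wb") as f:
        f.write(header)                                       # 4. store the compression header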


The main requirements for a data compression algorithm to be used by COMINX are:

• It should be lossless data compression algorithm.

• It should allow compressed index-query search, i.e., searching compressed query

on compressed index file.

• It should provide adequate performance, such as a high data compression ratio, fast processing (compression/decompression) time, a small memory requirement, and a small compressed-file header. Nevertheless, it is not necessary for the algorithm to have a symmetric processing time.

In this thesis, the data compression algorithm that meets the above requirements and is used is the HCDC algorithm [Bah 07b, Bah 08a]. The source code for this step, developed using the Python programming language, is given in Appendix III. A detailed description of this algorithm is given next.

3.2.5 SRHINX: Searching the index file (inverted or inverted/compressed index)

This procedure performs the searching process as follows:

• Read the query and identify the list of keywords to be searched for.

• Compress each keyword separately using the query compressor and the

compression header created by the index file compressor.

• Perform the searching process to extract the documents IDs that matched all or

some keywords.

• Decompress the extracted data using the index decompressor and the compression

header.

• Store the decompressed data to be accessed by COMRES for performance

comparison.

• Measure and record the overall and individual processing times to be accessed by

COMRES for performance comparison.


The source code for this step, developed using the Python programming language, is given in Appendix IV.
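A high-level sketch of this flow is given below. The three helper functions are hypothetical placeholders for the query compressor, the compressed-index scan and the index decompressor (all assumed to share the same compression header), and the per-keyword timings are recorded for use by COMRES.

# High-level sketch of the SRHINX procedure. The three helpers are placeholders
# for the query compressor, the compressed-index search and the index
# decompressor; all of them are assumed to share the same compression header.

import time

def srhinx(query, compressed_index, header,
           compress_keyword, search_compressed_index, decompress_entries):
    keywords = query.split()                                   # identify the keywords
    timings, results = {}, {}
    for kw in keywords:
        start = time.perf_counter()
        ckw = compress_keyword(kw, header)                     # compress each keyword
        hits = search_compressed_index(compressed_index, ckw)  # search at the CIQ level
        results[kw] = decompress_entries(hits, header)         # decompress extracted IDs
        timings[kw] = time.perf_counter() - start              # record time for COMRES
    timings["total"] = sum(timings.values())
    return results, timings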

3.2.6 COMRES: Comparing the outcomes of different search processes performed by SRHINX procedure.

In this procedure, the outcomes of different search processes for the same keywords on the current (non-compressed index-query) and the new (compressed index-query) models are compared. In particular, the number and IDs of the documents that contain the same keyword are compared for the different models. To validate the accuracy of the new model, both the number of extracted documents and their IDs must be identical to those extracted by the current model. The source code for this step, developed using the Python programming language, is given in Appendix IV.

3.3 The Bit-Level Data Compression Algorithm

This section provides a detailed description of the data compression scheme that is used at the back-end for both index file and query compression, namely, the novel Hamming-code-based data compression (HCDC) algorithm [Bah 07b, Bah 08a]. It is an adaptive, lossless, asymmetric, bit-level data compression algorithm.

3.3.1 The HCDC algorithm

The error-correcting Hamming code has been widely used in computer networks and digital data communication systems as a single-bit error-correcting code or a two-bit error-detecting code. It can also be adapted to correct burst errors. The key to the Hamming code is the use of extra parity bits (p) that allow the correction of a single-bit error and the detection of two-bit errors [Kim 05, Tan 03].

Thus, for a message having d data bits that is to be coded using the Hamming code, the coded message (also called a codeword) will have a length of n bits, which is given by:

n = d + p (3.1)

Where n is the block (codeword) length


d is the length of the data

p is the number of parity bits

This would be called an (n,d) code. The optimum length n depends on p, and it can be calculated as:

n = 2^p - 1 (3.2)

The data and the parity bits are located at particular positions in the codeword. The parity bits are located at positions 2^0, 2^1, 2^2, …, 2^(p-1) in the coded message, which has at most n positions. The remaining positions are reserved for the data bits, as shown in Figure (3.3). Each parity bit is computed on a different subset of the data bits, so that it forces the parity of some collection of data bits, including itself, to be even or odd.

In the HCDC algorithm, the characters of a source file are converted to a binary sequence by concatenating the individual binary codes of the data symbols. The binary sequence is then divided into a number of blocks, each of n-bit length, as shown in Figure (3.3b). The last block is padded with 0s if its length is less than n. For a binary sequence of length So bits, the number of blocks B (where B is a positive integer) is given by:

B = ⌈So / n⌉ (3.3)

The number of padding bits (g), which may be added to the last block, is calculated by:

g = B * n - So (3.4)

The number of parity bits (p) within each block is given by:

p = ln(n + 1) / ln(2) (3.5)

For a block of n-bit length, there are 2^n possible binary combinations (codewords), with decimal values ranging from 0 to 2^n - 1; only 2^d of them are valid codewords, and 2^n - 2^d are non-valid codewords.


Each block is then tested to find if it is a valid block (valid codeword) or a non-valid

block (non-valid codeword). During the compression process, for each valid block the

parity bits are omitted, in other words, the data bits are extracted and written into a

temporary compressed file. However, these parity bits can be easily retrieved back during

the decompression process using Hamming codes. The non-valid blocks are stored in the

temporary compressed file without change.

In order to be able to distinguish between the valid and the non-valid blocks during the

decompression process, each valid block is preceded by 0, and each non-valid block is

preceded by 1, as shown in Figure (3.3c). For 7-bit blocks (n=7), the compressor works as follows. It reads 7 bits (b0, b1, b2, b3, b4, b5, b6) and checks whether they constitute a valid Hamming codeword; if so, it extracts 4 bits (d=4) out of these 7 bits, namely b2, b4, b5, and b6, and stores them preceded by 0. If the 7 bits represent a non-valid codeword, the same 7 bits (b0, b1, b2, b3, b4, b5, b6) are stored preceded by 1.

The decompressor works as follows. It starts by reading a single bit from the compressed binary sequence; if this bit is 0, it reads the next 4 bits as data bits (d0, d1, d2, d3) and applies the Hamming codes to calculate the 3 parity bits (p0, p1, p2). The 7 bits are then arranged as (p0, p1, d0, p2, d1, d2, d3) to represent the uncompressed data (b0, b1, b2, b3, b4, b5, b6). If the single bit is 1, it reads the next 7 bits to represent the uncompressed data (b0, b1, b2, b3, b4, b5, b6). Figures (3.4) and (3.5) summarize the flow of the compressor and the decompressor of the HCDC algorithm.
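To make the block-level logic concrete, the following Python sketch implements a (7,4) HCDC-style compressor and decompressor operating on a '0'/'1' bit string. It is an illustrative reconstruction of the steps described above and summarized in Figures (3.4) and (3.5), assuming even parity and the standard Hamming coverage groups; it is not the CIQTT source code of Appendix IV, and the function names are hypothetical.

```python
def hamming_parity(d0, d1, d2, d3):
    # Even-parity bits for the (7,4) code with codeword layout (p0, p1, d0, p2, d1, d2, d3).
    p0 = d0 ^ d1 ^ d3   # covers codeword positions 1, 3, 5, 7
    p1 = d0 ^ d2 ^ d3   # covers codeword positions 2, 3, 6, 7
    p2 = d1 ^ d2 ^ d3   # covers codeword positions 4, 5, 6, 7
    return p0, p1, p2

def is_valid_block(block):
    # A 7-bit block is a valid codeword when its stored parity bits match the recomputed ones.
    b = [int(x) for x in block]
    return (b[0], b[1], b[3]) == hamming_parity(b[2], b[4], b[5], b[6])

def hcdc_compress(bits):
    # bits: '0'/'1' string whose length is a multiple of 7 (padding already applied).
    out = []
    for i in range(0, len(bits), 7):
        block = bits[i:i + 7]
        if is_valid_block(block):
            # Valid block: keep only the 4 data bits (b2, b4, b5, b6), flagged with a leading '0'.
            out.append('0' + block[2] + block[4] + block[5] + block[6])
        else:
            # Non-valid block: keep all 7 bits unchanged, flagged with a leading '1'.
            out.append('1' + block)
    return ''.join(out)

def hcdc_decompress(bits):
    out, i = [], 0
    while i < len(bits):
        if bits[i] == '0':                              # valid block: read 4 data bits
            d0, d1, d2, d3 = (int(x) for x in bits[i + 1:i + 5])
            p0, p1, p2 = hamming_parity(d0, d1, d2, d3)
            out.append(''.join(map(str, (p0, p1, d0, p2, d1, d2, d3))))
            i += 5
        else:                                           # non-valid block: copy 7 bits unchanged
            out.append(bits[i + 1:i + 8])
            i += 8
    return ''.join(out)
```

For any bit string s whose length is a multiple of 7, hcdc_decompress(hcdc_compress(s)) == s, since a valid block loses only parity bits that the decompressor can regenerate from the data bits.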

Figure (3.3). (a) Locations of data and parity bits in a 7-bit codeword; (b) an uncompressed binary sequence of 21-bit length divided into 3 blocks of 7-bit length, where b1 and b3 are valid blocks and b2 is a non-valid block; (c) the compressed binary sequence (18-bit length).

Figure (3.4). The main steps of the HCDC compressor:

(i) Initialization: select p; calculate n = 2^p - 1, d = n - p, B = ⌈So/n⌉, and g = B·n - So; initialize b = 0.

(ii) Reading binary data: read a block of n-bit length [add 1 to b].

(iii) Check for block validity: if the block is a valid codeword, add 1 to v, extract the d data bits, and write 0 followed by the extracted d bits to the temporary compressed file; else (non-valid codeword), add 1 to w and write 1 followed by all n bits to the temporary compressed file.

(iv) If (b < B) then go to Step (ii).

Figure (3.5). The main steps of the HCDC decompressor:

(1) Initialization: select p; calculate n = 2^p - 1 and d = n - p; initialize b = 0.

(2) Reading binary data: read one bit (h) [add 1 to b].

(3) Check for block validity: if h = 0, add 1 to v, read the following d data bits, compute the Hamming code parity bits for these d data bits, and write the coded block to the temporary decompressed binary sequence; else (h = 1), add 1 to w, read a block of n-bit length, and write the n-bit block to the temporary decompressed binary sequence.

(4) If not end of data, go to Step (2).

3.3.2 Derivation and analysis of HCDC algorithm compression ratio

This section presents the analytical derivation of a formula that can be used to compute

the compression ratio achievable using the HCDC algorithm. The derived formula can be

used to compute C as a function of two parameters:

I. The block size (n).

II. The fraction of valid blocks (r).

In the HCDC algorithm, the original binary sequence is divided into B blocks of n-bit

length. These B blocks are either valid or non-valid blocks; therefore, the total number of

blocks is given by:

B = v + w (3.6)

where v and w are the number of valid and non-valid blocks, respectively.


As it has been discussed in the previous section, in the HCDC algorithm, the binary

sequence is divided into a number of blocks of n-bit length, each block is then subdivided

into d data bits and p parity bits. For a valid block only the d data bits preceded by 0 are

appended to the compressed binary sequence (i.e., d + 1 bits for each valid block). So that

the length of the compressed valid blocks (Sv) is given by:

Sv = v (d + 1) (3.7)

For a non-valid block all bits are appended to compressed binary sequence (i.e., n + 1 bits

for each non-valid block). The number of bits appended to the compressed binary

sequence is given by:

Sw = w (n + 1) (3.8)

Thus, the length of the compressed binary sequence (Sc) can be calculated by:

Sc = Sv + Sw = v (d + 1) + w (n + 1) (3.9)

Using Eqns. (3.1) and (3.6), Eqn. (3.9) can be simplified to

Sc = Bn + B – v · p (3.10)

Substituting So=nB and Sc as it is given by Eqn. (3.10) into Eqn. (1.1) for the compression

ratio (C) yields:

C = So / Sc = n / (n + 1 - r·p) (3.11)

where r=v/B, and it represents the fraction of valid blocks. Substitute Eqn. (3.2) into Eqn.

(3.11) gives:

C = (2^p - 1) / (2^p - r·p) (3.12)

It is clear from Eqn. (3.12) that, for a certain value of p, C increases with r, and C varies between a maximum value (Cmax) when r=1 and a minimum value (Cmin)


when r=0. It can also be seen from Eqn. (3.12) that for each value of p, there is a value of

r at which C=1. This value of r is referred to as r1, and it can be found that r1=1/p.
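The behaviour of Eqn. (3.12) can be checked numerically with a few lines of Python; this is only a sketch that evaluates the formula and reproduces the kind of values reported in Table (3.4).

```python
def compression_ratio(p, r):
    # Eqn. (3.12): C = (2^p - 1) / (2^p - r*p), with block size n = 2^p - 1.
    return (2 ** p - 1) / (2 ** p - r * p)

for p in range(2, 9):
    c_min = compression_ratio(p, 0.0)   # all blocks non-valid (r = 0)
    c_max = compression_ratio(p, 1.0)   # all blocks valid (r = 1)
    r1 = 1.0 / p                        # fraction of valid blocks at which C = 1
    print(f"p={p}  Cmin={c_min:.3f}  Cmax={c_max:.3f}  r1={r1:.3f}")
```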

Table (3.4) lists the values of Cmin, Cmax, and r1 for various values of p. These results are

also shown in Figures (3.6) and (3.7), where Figure (3.6) shows the variation of Cmin and

Cmax with p, and Figure (3.7) shows the variation of r1 with p.

Table (3.4). Variation of Cmin, Cmax, and r1 with number of parity bits (p).

p Cmin Cmax r1

2 0.750 1.500 0.500

3 0.875 1.400 0.333

4 0.938 1.250 0.250

5 0.969 1.148 0.200

6 0.984 1.086 0.167

7 0.992 1.050 0.143

8 0.996 1.028 0.125

Figure (3.6) - Variation of Cmin and Cmax with the number of parity bits (p).


Figure (3.7) - Variation of the critical fraction of valid blocks (r1) with the number of parity bits (p).

The values of C for various values of r (0 to 1) and also various values of p (2 to 8) are

calculated using Eqn. (3.12) and listed in Table (3.6) and plotted in Figure (3.8). It can be

deduced from Table (3.6) and Figure (3.8) that satisfactory values of C can be achieved

when p≤4 and r>r1.

Table (3.6). Variations of C with respect to r for various values of p.

r      p=2     p=3     p=4     p=5     p=6     p=7     p=8
0.0    0.750   0.875   0.938   0.969   0.984   0.992   0.996
0.1    0.789   0.909   0.962   0.984   0.994   0.998   0.999
0.2    0.833   0.946   0.987   1.000   1.003   1.003   1.002
0.3    0.882   0.986   1.014   1.016   1.013   1.009   1.006
0.4    0.938   1.029   1.042   1.033   1.023   1.014   1.009
0.5    1.000   1.077   1.071   1.051   1.033   1.020   1.012
0.6    1.071   1.129   1.103   1.069   1.043   1.026   1.015
0.7    1.154   1.186   1.136   1.088   1.054   1.032   1.018
0.8    1.250   1.250   1.172   1.107   1.064   1.038   1.022
0.9    1.364   1.321   1.210   1.127   1.075   1.044   1.025
1.0    1.500   1.400   1.250   1.148   1.086   1.050   1.028


Figure (3.8) - Variations of C with respect to r for various values of p.

English text characters are usually converted to binary using their equivalent 7-bit ASCII codes, which means that each character can be considered as a (7,4) codeword. It has been mentioned earlier that not all 7-bit codewords are valid codewords; in fact, only 16 codewords (2^d) are valid, and the remaining 112 codewords (2^n - 2^d) are non-valid. The

ASCII codes of these 16 valid codewords and the characters they represent are listed in

Table (3.7).

According to the statistics of standard English text, these valid codewords can be categorized into three groups:

I. Widely-used (their decimal values are equivalent to the ASCII codes of the characters a and f).

II. Rarely-used (their decimal values are equivalent to the ASCII codes of the characters K, L, R, U, x, *, -, 3, and 4).

III. Not-used (their decimal values are equivalent to the ASCII codes of the unprintable control characters 0, 7, 25, 30, and 127).


Table (3.7). Valid 7-bit codewords.

ASCII Code   Character            Category
0            Control Character    Not-used
7            Control Character    Not-used
25           Control Character    Not-used
30           Control Character    Not-used
42           *                    Rare-use
45           -                    Rare-use
51           3                    Rare-use
52           4                    Rare-use
75           K                    Rare-use
76           L                    Rare-use
82           R                    Rare-use
85           U                    Rare-use
97           a                    Frequent-use
102          f                    Frequent-use
120          x                    Rare-use
127          Control Character    Not-used

Thus, encoding characters to binary using their equivalent ASCII codes and testing the validity of these characters either yields a very low compression ratio or, most probably, yields inflation. This is because only a small proportion of the characters within the text may have valid codewords.

In order to enhance its compression power, the HCDC algorithm is implemented in five main steps, these are:

Step 1: Calculate characters frequencies (fi, i=1, 2, 3, …, Nc, where Nc is the number of

symbols within the data file). At this stage, if the standard HCDC algorithm is

performed, the resulting compression ratio for a single loop can be derived as:

C1 = [ n · Σ_{i=1}^{Nc} fi ] / [ (d + 1) · Σ_{i: Ai ∈ Vc} fi + (n + 1) · Σ_{i: Ai ∉ Vc} fi ] (3.13)

where Ai is the ASCII code of the ith character and Vc denotes the set of valid 7-bit codewords. In the above equation, Σ_{i=1}^{Nc} fi represents the total number of characters (i.e., the number of blocks B within the data file), while Σ_{Ai ∈ Vc} fi and Σ_{Ai ∉ Vc} fi represent the number of valid characters (v) and the number of non-valid characters (w), respectively.

For a (7, 4) Hamming code, it can be easily shown that for a compression to be achieved

(C>1) the fraction of valid characters (r=v/B) should be greater than 1/3. The

compression ratio is directly proportional to r, and the maximum compression ratio that

can be achieved is 1.4 (7/5) when all characters have valid codewords.

Step 2: Sort characters in descending order according to their frequencies. Start with 0

for the most common character and Nc-1 for the least common character. A list

of these sorted characters is stored in the compressed file header, to be used

during the decompression process.

Step 3: Encode each character within the text file to an equivalent binary digits. Each

character in the data file is converted to 7-bit binary digits according to its

ASCII code.

Step 4: Replace the codeword of any of the 16 most common characters with a valid codeword if it does not already represent a valid codeword. A record of each original codeword and its replacement must also be kept and stored in the compressed file header. In this case, the compression ratio can be expressed as:

C2 = [ n · Σ_{i=1}^{Nc} fi ] / [ (d + 1) · Σ_{i=1}^{2^d} fi + (n + 1) · Σ_{i=2^d+1}^{Nc} fi ] (3.14)

where the characters are indexed in descending order of frequency, so that the first 2^d = 16 characters are those carrying valid codewords after the replacement of Step 4.

Step 5: Test each 7-bit block to find if it is a valid or a non-valid codeword. If the

codeword is valid, then only the 4 bits (bits at positions 3, 5, 6, and 7) (b2, b4,

b5, b6) preceded by 0 are written to the compressed binary file, otherwise all 7

bits (b0, b1, b2, b3, b4, b5, b6) preceded by 1 are written to the compressed file.
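The following Python sketch illustrates Steps 1-4 (Step 5 is the block-level HCDC pass sketched earlier in this section). The list of valid codewords is the one given in Table (3.7); the swap-based remapping, the helper names, and the assumption of 7-bit ASCII input are illustrative choices, not the exact logic of the CIQTT sources.

```python
from collections import Counter

# The 16 valid 7-bit codewords of Table (3.7), as decimal ASCII values.
VALID_CODEWORDS = [0, 7, 25, 30, 42, 45, 51, 52, 75, 76, 82, 85, 97, 102, 120, 127]

def build_replacement_map(text):
    # Steps 1-2: character frequencies, sorted in descending order of occurrence.
    ranked = [ch for ch, _ in Counter(text).most_common()]
    top = ranked[:16]
    # Step 4: give each of the 16 most common characters a valid codeword if its own
    # ASCII code is not already valid; swapping keeps the mapping reversible.
    free = [c for c in VALID_CODEWORDS if chr(c) not in top]
    mapping = {}
    for ch in top:
        if ord(ch) not in VALID_CODEWORDS:
            v = free.pop(0)
            mapping[ch] = v              # frequent character -> valid codeword
            mapping[chr(v)] = ord(ch)    # displaced character -> old codeword
    return ranked, mapping               # both are stored in the compressed file header

def encode_to_bits(text, mapping):
    # Step 3 (with the Step 4 replacements applied): 7-bit binary digits per character.
    return ''.join(format(mapping.get(ch, ord(ch)), '07b') for ch in text)
```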


3.3.3 The Compressed File Header

The HCDC algorithm compressed file header contains all the information that is required by the decompression algorithm. The header is designed to be simple and flexible. Furthermore, it is designed so that it can be easily extended to meet future developments. It consists of five fields:

I. HCDC field

II. Coding field

III. Valid codewords field

IV. Replacement field

V. Padding field

In what follows a description is given for each of the above fields.

I. HCDC field

The HCDC field is an 8-byte field that encloses information related to the HCDC algorithm, such as: the algorithm name (HCDC), the algorithm version (V), the coding format (F), and the number of compression loops (k).

The original HCDC algorithm was designated as Version-0 (i.e., V is set to 0). The coding

format (F) is set to 0 for ASCII coding, 1 for Huffman coding, 2 for adaptive coding, etc.

Table (3.8) lists the constituents of this field and their description.

II. Coding field

The coding field encloses information related to the coding format, so its content

depends on the coding format indicated in the HCDC field. This field is not required for

the ASCII coding format. For the adaptive coding, it contains the symbols within the

source file sorted descendingly according to their frequency of occurrence. The length of

this field is Nc bytes.


III. Valid codewords field

This field encloses the 16 valid codewords that are to be assigned to the most frequently used characters. The length of this field is a constant 16 bytes.

IV. Replacement field

The replacement field contains information on the original codewords and their replacements. The length of this field is 2·Ri bytes for loop i. The maximum value of Ri is 16, which occurs if all of the 16 most common codewords are non-valid and need to be replaced with valid codewords.

V. Padding field

This field contains information on the number of padding bits (g) used during each compression loop, from loop 1 to loop k. Thus, the length of this field is k bytes.

Taking into account all fields mentioned above, the length of the HCDC algorithm

compressed file header (Hl) can be expressed as:

Hl = 8 + Nc + 16 + 2·Σ_{i=1}^{k} Ri + k (3.15)

The fields of the compressed file header and their individual lengths are shown in Figure (3.9).

Figure (3.9) - The compressed file header of the HCDC scheme: HCDC field (8 Byte), Codes field (Nc Byte), Valid codewords field (16 Byte), Replacement field (2·ΣRi Byte), and Padding field (k Byte).
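As a worked example of Eqn. (3.15), the following sketch computes Hl for a hypothetical file with Nc = 38 distinct symbols compressed in a single loop (k = 1) in which R1 = 14 of the 16 most common codewords had to be replaced; all numbers here are illustrative only.

```python
def header_length(Nc, R):
    # Eqn. (3.15): Hl = 8 + Nc + 16 + 2*sum(Ri) + k  (bytes), with k = len(R) loops.
    return 8 + Nc + 16 + 2 * sum(R) + len(R)

print(header_length(Nc=38, R=[14]))   # -> 91 bytes for this illustrative case
```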


Table (3.8). The HCDC algorithm compressed file header.

Field Length (Byte) Description

HCDC 4 Name of the compression algorithm.

V 1 Version of the HCDC algorithm.

Nc 1 Number of symbols within the original text file.

F 1 Coding format: 0 for ASCII coding, 1 for Huffman coding, 2 for adaptive coding, etc.

k 1 The number of compression loops.


The main features of the HCDC algorithm can be summarized as:

I. Bit-level. It processes the data file at a binary level.

II. Lossless. An exact form of the source file can be retrieved.

III. Symmetric. The compression/decompression CPU times are almost the same.

3.4 Implementation of the HCDC algorithm in CIQTT

The HCDC algorithm described above is implemented in three different sub-procedures.

Two of them are for compression, in particular for index and query compression, and one is for index decompression; they are used at the different compression/decompression layers shown in Figure (3.2). These sub-procedures can be explained as follows:

(1) INXCOM for index compression. It reads the index file, finds the character

frequencies, converts characters to binary sequence using the characters ASCII

codes, creates and stores the compression header, compresses the binary

sequence, and converts the compressed binary sequence to characters and stores

them in a compressed index file.

(2) QRYCOM for query compression. It reads the keywords, converts characters to

binary sequence using the characters ASCII codes, reads the compression header

created by INXCOM, compresses the binary sequence, and converts the

compressed binary sequence to characters so that searching the compressed index

can be performed at compressed level.


(3) INXDEC for index decompression. It reads the part of the compressed index file that matches the search keyword, converts the characters in that particular part to a binary sequence using the characters' ASCII codes, reads the compression header created by INXCOM, decompresses the binary sequence, and converts the decompressed binary sequence to characters so that it can be processed by the query ranker and other procedures at the front-end processor.

3.5 Performance Measures

In order to evaluate and compare the performance of the proposed CIQ model, a number

of performance measures can be considered. The accuracy of the new model is validated

by comparing the searching outcomes of the current (uncompressed) and the new CIQ

models. Furthermore, the performance of the CIQ model is evaluated and compared in

terms of the following:

1. Measuring the disk space requirement. The sizes of the index file before and after

compression are recorded and the compression ratio (C) is calculated using Eqn.

(1.1). The size of the index file before compression (uncompressed index file)

represents the disk space requirement by the current model and it is denoted as So,

while the size of the index file after compression (compressed index file)

represents the disk space requirement by the new model and it is denoted as Sc.

2. Estimating the query processing (searching) time. The CPU times required for searching the compressed and uncompressed index files for a certain word or phrase are recorded, and the speedup factor (Sf) is calculated as:

Sf = To / Tc (3.16)

where To and Tc are the CPU times required for searching the uncompressed and compressed index files for a certain word or phrase, respectively.

In this thesis, we calculate and use two other indicative parameters to evaluate the

performance of the new model. These are:


• The storage reduction factor (Rs), which represents the percentage reduction in storage requirement. It is calculated by dividing the difference between the sizes of the uncompressed index file (So) and the compressed index file (Sc) by So. It can be expressed mathematically as:

Rs = ((So - Sc) / So) × 100% (3.17)

• The time reduction factor (Rt), which represents the percentage reduction in processing (searching) time. It is calculated by dividing the difference between the time for searching the uncompressed index file (To) and the time for searching the compressed index file (Tc) by To. It can be expressed mathematically as:

Rt = ((To - Tc) / To) × 100% (3.18)
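A minimal sketch of how the four measures defined above are computed from the recorded sizes and times is given below; the numeric arguments in the example call are purely illustrative.

```python
def performance_measures(So, Sc, To, Tc):
    C  = So / Sc                     # compression ratio, Eqn. (1.1)
    Sf = To / Tc                     # speedup factor, Eqn. (3.16)
    Rs = (So - Sc) / So * 100.0      # storage reduction factor (%), Eqn. (3.17)
    Rt = (To - Tc) / To * 100.0      # time reduction factor (%), Eqn. (3.18)
    return C, Sf, Rs, Rt

# Illustrative values only.
print(performance_measures(So=1_937_277, Sc=1_464_222, To=1.19, Tc=0.90))
```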


Chapter Four

Results and Discussions

In this chapter, a number of search processes (test procedures) are performed to evaluate

the performance and validate the accuracy of the new compressed index-query (CIQ)

Web search engine model by using the CIQ test tool (CIQTT). The CIQ model and the

CIQTT were described in Chapter 3. For each search process, the performance of the CIQ

model is evaluated by estimating the following parameters:

(1) The compression ratio of the inverted index file (C), Eqn. (1.1).

(2) The storage reduction factor (Rs), Eqn. (3.17).

(3) The processing times (searching time) of the uncompressed model (To).

(4) The processing times (searching time) of the compressed model (Tc).

(5) The speedup factor (Sf=To/Tc), Eqn. (3.16).

(6) The time reduction factor (Rt) , Eqn. (3.18).

The accuracy of the new CIQ model is validated by comparing the following parameters:

(1) The number of documents found inside the inverted index file that matches a

certain keyword search while performing an uncompressed index-query search

(i.e., uncompressed search results), and it is denoted as No.

(2) The number of documents found inside the inverted index file that matches a

certain keyword search while performing a compressed index-query search (i.e.,

compressed search results), and it is denoted as Nc.

(3) The IDs of the matched documents.


Further details on the above parameters were given in Chapters 1 and 3. The search

processes for a list of 29 randomly selected keywords are performed on inverted index

files of different sizes (1000, 10000, 25000, 50000, and 75000 documents/index). The index files are generated from 30 different well-known Websites, including sub-domains.

The results of the different tests are presented in tables and graphs, and discussed.

Section 4.1 outlines the test procedures adopted in this chapter. Section 4.2 explains the

steps for determining C and Rs, namely, collect the testing corpus (documents), process

and analyze the corpus to build the inverted index files, compress the inverted index files,

and determine C and Rs. While Section 4.3 discusses the steps for determining Sf and Rt,

which are: choose a list of keywords, perform the search processes in both compressed

and uncompressed index files, measure the processing times, and determine Sf and Rt.

Section 4.4 discusses the validation of the CIQ model, and finally the main outcomes of

these test procedures are summarized in Section 4.5.

4.1 Test Procedures

In order to evaluate the performance and validate the accuracy of the new CIQ Web

search engine model, we performed a comprehensive test using the CIQTT. The test

performs all processes carried out by a Web search engine: it includes collecting the test corpus, processing and analyzing the collected corpus, generating the index file, compressing the index file, selecting a group of keywords to search for, retrieving the document IDs, determining the performance evaluation measures, and finally comparing

the results.

The main objectives of the test procedures can be summarized as follows:

(i) Determine C and Rs, which is performed in three steps:

(1) Step 1: Collect the testing corpus (documents) using COLCOR

procedure.

(2) Step 2: Process and analyze the corpus to build the inverted index file

using PROCOR and INVINX procedures.


(3) Step 3: Compress the inverted index file using the index compressor

procedure, namely, INXCOM.

(ii) Determine Sf and Rt, which is performed in three steps:

(1) Step 1: Choose a list of keywords to be searched for.

(2) Step 2: Perform the search processes in both compressed and uncompressed

inverted index files, and measure To and Tc for each search process.

(3) Step 3: Determine Sf and Rt.

(iii) Validate the accuracy of the search process, which is performed in two steps:

(1) Step 1: Compare the uncompressed (No) and compressed (Nc) search results

for each search process.

(2) Step 2: Compare the IDs of the matched documents.

4.2 Determination of the Compression Ratio (C) and the Storage Reduction Factor (Rs)

This section describes the first part of the test procedure, which concerns the determination of C and Rs for index files of different sizes. This part of the test procedure

consists of three steps, these are:

i. Step 1: Collect the testing corpus using COLCOR procedure.

ii. Step 2: Process and analyze the corpus to build the inverted index file using

PROCOR and INVINX procedures.

iii. Step 3: Compress the inverted index file using the index compressor procedure

(INXCOM).

In what follows, a description is given for each of the above steps.


4.2.1 Step 1: Collect the testing corpus using COLCOR procedure

In this step, a corpus is collected from 30 well-known Websites (including sub-domains)

using the procedure COLCOR. A list of these Websites is given in Table (4.1). The total

number of documents collected is 104000 documents; all of them are in English and in

HTML format. Other formats like PDF, PS, DOC, etc. are excluded.

Table (4.1)

List of visited Websites

# Website # Website

1 http://www.aljazeera.net/ 16 http://www.azzaman.com/

2 http://www.bbc.co.uk/ 17 http://www.en.wna-news.com/

3 http://www.tradearabia.com/ 18 http://www.visitabudhabi.ae/

4 http://www.krg.org/ 19 http://interactive.aljazeera.net/

5 http://www.iranpressnews.com/ 20 http://bbcworld.mh.bbc.co.uk/

6 http://blogs.bbc.co.uk/ 21 http://bahrainnews.net/

7 http://labs.aljazeera.net/ 22 http://www.khilafah.com/

8 http://www.terrorism-info.org.il/ 23 http://conference.khilafah.com/

9 http://www.live.bbc.co.uk/ 24 http://electroniciraq.net/

10 http://evideo.alarabiya.net/ 25 http://www.alarabiya.net/

11 http://www.electroniciraq.net/ 26 http://blogs.aljazeera.net/

12 http://www.alsumaria.tv/ 27 http://bbcworld.mh.bbc.co.uk/

13 http://www.pukmedia.com/ 28 http://terrorism-info.org.il/

14 http://alhurriatv.com/ 29 http://www.ameinfo.com/

15 http://www.tacme.com/ 30 http://grievance.khilafah.com/


4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using PROCOR and INVINX procedures

The second step in the first test is to process and analyze the collected documents to build

the inverted index file. This is done through calling the PROCOR and INVINX

procedures. A simple inverted index architecture, which contains each keyword together with the document IDs, is chosen. All crawled documents are renamed using unique, randomly generated 6-digit numeric names. The inverted index is generated by scanning and listing all the keywords that occur in the crawled documents; the list is then sorted, with each keyword followed by the IDs of the documents that contain it in their text. Further details on these two procedures were provided in Section 3.3.
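A minimal Python sketch of this kind of keyword-to-document-ID inverted index is shown below; the tokenization rule and the '^'/'|' field separators are assumptions made for illustration and are not necessarily the exact format produced by INVINX.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    # documents: mapping of 6-digit numeric document ID -> document text.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in re.findall(r"[a-z0-9]+", text.lower()):
            index[keyword].add(doc_id)
    # One entry per keyword: the keyword followed by the sorted IDs of the
    # documents that contain it.
    return [kw + "^" + "|".join(sorted(ids)) for kw, ids in sorted(index.items())]

docs = {"000123": "Business leaders met in Amman.", "000456": "Young business people."}
print(build_inverted_index(docs))
# e.g. ['amman^000123', 'business^000123|000456', 'in^000123', ...]
```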

In this work, five inverted indexes of different sizes are generated. These indexes contain

1000 documents, 10000 documents, 25000 documents, 50000 documents and 75000

documents. The sizes of these indexes are given in Table (4.2).

Table (4.2). The sizes of the generated indexes.

# No. of document/index Size of index file (Byte)

1 1000 1 937 277

2 10000 16 439 380

3 25000 24 540 706

4 50000 52 437 194

5 75000 65 823 201

4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure

In this step, the index file generated in the previous step is compressed using the HCDC

algorithm. In particular in this step, we call the INXCOM procedure, which is especially

developed for index compression. It reads the index file, finds the character frequencies,

converts characters to binary sequence using the characters ASCII codes, creates and

stores the compression header, compresses the binary sequence, and converts the

compressed binary sequence to characters and stores them in a compressed index file.


Since only numeric IDs are chosen, the character set of each index file consists of 38 characters (0-9, a-z, ^, and |). The number of occurrences of each character in each of the generated indexes is given in Table (4.3), while the character frequencies are listed in Table (4.4).

Table (4.3). Type and number of characters in each generated inverted index file.

Rank | 1000 index: Char., No. of Char. | 10000 index: Char., No. of Char. | 25000 index: Char., No. of Char. | 50000 index: Char., No. of Char. | 75000 index: Char., No. of Char.

1 0 228444 0 2126846 0 3317395 0 7130005 0 89277352 3 176720 3 1661520 6 2474761 3 5374692 4 67452343 6 173205 2 1659217 1 2443803 7 5353971 2 67244944 7 172576 6 1648063 4 2443710 4 5312508 7 67221815 2 166917 5 1620894 7 2427304 2 5309417 6 67030196 1 160759 1 1596396 3 2418697 1 5304553 5 66684527 5 160087 7 1594639 2 2401394 6 5265015 1 66445158 4 154281 4 1592327 5 2398808 5 5255336 3 66171769 9 87371 8 872045 8 1385016 9 2956711 9 374258510 8 82128 9 855241 9 1372914 8 2884093 8 371650011 ^ 41764 ^ 136993 ^ 166394 ^ 261979 ^ 30098812 | 41764 | 136993 | 166394 | 261979 | 30098813 e 32213 e 98666 e 117899 e 183172 e 20728314 a 25073 a 88525 a 106423 a 170896 a 19442315 i 24502 i 75044 i 89490 i 139352 i 15764716 s 21750 s 66247 s 79411 n 122048 n 13831617 n 21138 n 65665 n 78320 s 121825 s 13762618 r 20506 r 64833 r 77493 r 120055 r 13662219 t 18789 o 57452 o 68795 o 109298 o 12452820 o 17587 t 56875 t 67851 t 105123 t 11842821 l 15026 l 49129 l 58765 l 93176 l 10635422 d 11758 d 36580 d 43132 c 66491 d 7544423 c 11721 c 36000 c 43104 d 66193 c 7508724 u 9320 u 30193 u 36434 m 57925 m 6566525 m 8887 m 29970 m 36189 u 57414 u 6557826 g 8334 h 28790 h 33999 h 55971 h 6448727 p 7986 g 25861 g 30486 g 47126 g 5366428 h 7703 p 23473 p 28038 b 43672 b 5008129 b 6182 b 22386 b 27639 p 42982 p 4863230 y 4921 y 17104 y 20394 y 33213 y 3844331 f 4433 f 15110 f 18183 f 28304 f 3206532 v 3367 k 13138 k 16166 k 26525 k 3064633 k 3248 w 10898 w 12870 w 20748 w 2402534 w 3011 v 10179 v 12151 v 19235 v 2175835 x 1254 x 5426 x 7574 x 13960 x 1680436 j 1012 z 4415 z 5653 z 9247 z 1052937 z 943 j 4192 j 5106 j 8770 j 1043038 q 597 q 2055 q 2551 q 4214 q 4769

Total 1 937 277 16 439 380 24 540 706 52 437 194 65 823 201


Table (4.4). Type and frequency of characters in each generated inverted index file.

Rank | 1000 Doc./Index: Char., Freq. (%) | 10000 Doc./Index: Char., Freq. (%) | 25000 Doc./Index: Char., Freq. (%) | 50000 Doc./Index: Char., Freq. (%) | 75000 Doc./Index: Char., Freq. (%)

1 0 11.79 0 12.94 0 13.52 0 13.60 0 13.562 3 9.12 3 10.11 6 10.08 3 10.25 4 10.253 6 8.94 2 10.09 1 9.96 7 10.21 2 10.224 7 8.91 6 10.03 4 9.96 4 10.13 7 10.215 2 8.62 5 9.86 7 9.89 2 10.13 6 10.186 1 8.30 1 9.71 3 9.86 1 10.12 5 10.137 5 8.26 7 9.70 2 9.79 6 10.04 1 10.098 4 7.96 4 9.69 5 9.77 5 10.02 3 10.059 9 4.51 8 5.30 8 5.64 9 5.64 9 5.6910 8 4.24 9 5.20 9 5.59 8 5.50 8 5.6511 ^ 2.16 ^ 0.83 ^ 0.68 ^ 0.50 ^ 0.4612 | 2.16 | 0.83 | 0.68 | 0.50 | 0.4613 e 1.66 e 0.60 e 0.48 e 0.35 e 0.3114 a 1.29 a 0.54 a 0.43 a 0.33 a 0.3015 i 1.26 i 0.46 i 0.36 i 0.27 i 0.2416 s 1.12 s 0.40 s 0.32 n 0.23 n 0.2117 n 1.09 n 0.40 n 0.32 s 0.23 s 0.2118 r 1.06 r 0.39 r 0.32 r 0.23 r 0.2119 t 0.97 o 0.35 o 0.28 o 0.21 o 0.1920 o 0.91 t 0.35 t 0.28 t 0.20 t 0.1821 l 0.78 l 0.30 l 0.24 l 0.18 l 0.1622 d 0.61 d 0.22 d 0.18 c 0.13 d 0.1123 c 0.61 c 0.22 c 0.18 d 0.13 c 0.1124 u 0.48 u 0.18 u 0.15 m 0.11 m 0.1025 m 0.46 m 0.18 m 0.15 u 0.11 u 0.1026 g 0.43 h 0.18 h 0.14 h 0.11 h 0.1027 p 0.41 g 0.16 g 0.12 g 0.09 g 0.0828 h 0.40 p 0.14 p 0.11 b 0.08 b 0.0829 b 0.32 b 0.14 b 0.11 p 0.08 p 0.0730 y 0.25 y 0.10 y 0.08 y 0.06 y 0.0631 f 0.23 f 0.09 f 0.07 f 0.05 f 0.0532 v 0.17 k 0.08 k 0.07 k 0.05 k 0.0533 k 0.17 w 0.07 w 0.05 w 0.04 w 0.0434 w 0.16 v 0.06 v 0.05 v 0.04 v 0.0335 x 0.06 x 0.03 x 0.03 x 0.03 x 0.0336 j 0.05 z 0.03 z 0.02 z 0.02 z 0.0237 z 0.05 j 0.03 j 0.02 j 0.02 j 0.0238 q 0.03 q 0.01 q 0.01 q 0.01 q 0.01

Total 100.00 100.00 100.00 100.00 100.00


As explained above, the outcome of INXCOM is a compressed index file. In

order to reveal the amount of compression and the actual reduction in storage

requirement that can be achieved by the new model, C and Rs for each index file are

calculated using Eqns. (1.1) and (3.17), respectively. For these two equations, So is the

length of the uncompressed binary sequence, while Sc is the length of the compressed

binary sequence. To achieve a fair comparison, the uncompressed index file is treated as a text file and converted to binary using a 7-bit character wordlength, so that So (bit) = 7 × So (Byte).
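For instance, for the 1000-document index this conversion gives the following one-line check (the byte size is the one reported in Table (4.2)):

```python
So_byte = 1_937_277          # size of the 1000-document index file, from Table (4.2)
So_bit  = 7 * So_byte        # 7-bit character wordlength
print(So_bit)                # -> 13560939 bits, as listed in Table (4.5)
```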

Table (4.5) lists the values of C, Rs, So, and Sc for each index file. The results in Table

(4.5) for C and Rs are also plotted in Figures (4.1) and (4.2), respectively.

Table (4.5). Values of C and Rs for different sizes index files.

Index | So (Byte) | So (Bit) | Sc (Byte) | Sc (Bit) | C | Rs (%)

1000 1 937 277 13 560 939 1 464 222 10 249 554 1.323 24.42

10000 16 439 380 115 075 660 12 003 725 84 026 072 1.370 26.98

25000 24 540 706 171 784 942 17 842 316 124 896 209 1.375 27.30

50000 52 437 195 367 060 365 37 948 720 265 641 040 1.382 27.63

75000 65 823 202 460 762 414 47 579 010 333 053 070 1.384 27.72

Figure (4.1). The compression ratio (C) for different sizes index files.


Figure (4.2). The reduction factor (Rs) for different sizes index files.

It can be seen from Table (4.5) and Figures (4.1) and (4.2) that the compression ratio is

over 1.3 and the size of the index files can be reduced by nearly 25% for a range of index

sizes, which consequently reduces storage requirements or enables larger index files to be

stored.

With reference to our discussion in Chapter 3, setting the block size (n) to 7 and the number of

parity bits (p) to 3, the maximum compression ratio (Cmax) that can be achieved by the

HCDC algorithm is 1.4 (Eqns. (3.11) and (3.12)). This demonstrates that the compression

efficiency (E=C/Cmax) of the HCDC algorithm is around 95%.

In order to validate the accuracy of our compression algorithm, we use the results

presented in Table (4.3) to analytically calculate C, and then compare it with the

practically calculated C in Table (4.5). In particular, Table (4.3) can be used to calculate

the total number of characters (B), the number of valid characters (v) (the sum of the first

16 characters), the number of non-valid characters (w), fraction of valid characters (r),

and fraction of non-valid characters (1-r). Thus, since we have n=7, p=3, and r for each

index file, Eqns. (3.11) or (3.12) can be used to analytically calculate C. The results of the

above calculations are summarized in Table (4.6), and they demonstrated a 100%

agreement with the practical results. This can be considered as an excellent validation to

our implementation.
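The analytical cross-check can be reproduced in a few lines of Python; the v and B values below are those reported for the 1000-document index in Table (4.6), and Eqn. (3.12) is evaluated with p = 3 (n = 7).

```python
def analytic_C(v, B, p=3):
    # Eqn. (3.12) with r = v / B and block size n = 2^p - 1 = 7.
    r = v / B
    return (2 ** p - 1) / (2 ** p - r * p)

# 1000-document index: B = 1,937,277 blocks, of which v = 1,749,554 are valid characters.
print(round(analytic_C(v=1_749_554, B=1_937_277), 3))   # -> 1.323, matching Table (4.5)
```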


Table (4.6). Performance analysis and implementation validation.

Doc./Index   No. of blocks (B)   Valid characters: v        r=v/B (%)   Non-valid characters: w   1-r (%)   C       E=C/Cmax
1000         1937277             1749554                    90.31       187723                    9.69      1.323   0.945
10000        16439380            15829656                   96.29       609724                    3.71      1.370   0.978
25000        24540706            23809813                   97.02       730893                    2.98      1.375   0.982
50000        52437195            51285727                   97.80       1151468                   2.20      1.382   0.987
75000        65823202            64511536                   98.01       1311666                   1.99      1.383   0.988

In the real world, a major cost of a search engine comes from the data centers that store the huge indexes. Thus, reducing the index size by nearly 25% reduces the cost of the data centers by a similar percentage.

4.3 Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt)

This section presents the second part of the test procedure, which concerns the determination of the Sf and Rt achieved by the new model. This part of the test

procedure consists of three steps, these are:

(1) Step 1: Choose a list of keywords to be searched for.

(2) Step 2: Perform the search processes in both compressed and uncompressed

inverted index files, and measure To and Tc for each search process.

(3) Step 3: Determine Sf and Rt.

In what follows, a description is given for each of the above steps.

4.3.1 Choose a list of keywords

In this step we chose a list of 29 keywords of diverse interest and character combination

to search for in the different index files created by the previous test. The list of keywords

is given in Table (4.7).


Table (4.7). List of keywords.

1  24am            11  mail        21  time
2  appeared        12  met         22  unions
3  business        13  microwave   23  victim
4  charms          14  modeling    24  white
5  children        15  numbers     25  worth
6  environmental   16  people      26  young
7  estimated       17  problem     27  years
8  government      18  reached     28  zoo
9  jail            19  science     29  zero
10 leader          20  success

4.3.2 Perform the search processes

In this step, we perform the search processes. We search all index files in Table (4.2) for

each of the keywords listed in Table (4.7) in both compressed and uncompressed index-

query models. For each search process, we take a record of the CPU time required to

search for each keyword, and the number of documents that contain this particular

keyword (search results). Specifically, the following parameters are recorded:

To The CPU time required to search a certain uncompressed inverted index file for

a particular uncompressed keyword.

Tc The CPU time required to search a certain compressed inverted index file for a

particular compressed keyword.

No The number of search results found by an uncompressed index-query search.

Nc The number of search results found by a compressed index-query search.

Then, Eqns. (3.16) and (3.18) are used to calculate Sf and Rt, respectively. The recorded

parameters (To, Tc, No, and Nc) and the calculated parameters (Sf and Rt) are listed in

Tables (4.8) to (4.12) for 1000, 10000, 25000, 50000, and 75000 index files, respectively.


Table (4.8). Values of No, Nc, To, Tc, Sf and Rt for the 1000-document index file.

Keyword No=Nc To Tc Sf Rt (%)

24am 8 1.193484 0.901777 1.323 24.44

appeared 47 0.955938 0.716052 1.335 25.09

business 263 0.527066 0.427525 1.233 18.89

charms 1 0.910248 0.705448 1.290 22.50

children 146 0.457664 0.362062 1.264 20.89

environmental 29 0.897533 0.693779 1.294 22.70

estimated 45 0.882637 0.670518 1.316 24.03

government 300 0.810061 0.625981 1.294 22.72

jail 39 0.937251 0.71951 1.303 23.23

leader 108 0.386310 0.302036 1.279 21.82

mail 148 0.468379 0.363157 1.290 22.47

met 59 1.174884 0.897493 1.309 23.61

microwave 3 1.175545 0.896260 1.312 23.76

modeling 1 1.046845 0.721283 1.451 31.10

numbers 59 1.443292 0.865641 1.667 40.02

people 505 1.144459 0.822244 1.392 28.15

problem 145 1.267543 0.938501 1.351 25.96

reached 59 1.150269 0.899623 1.279 21.79

science 107 1.032384 0.785476 1.314 23.92

success 77 1.209520 0.910441 1.328 24.73

time 423 0.605407 0.459194 1.318 24.15

unions 14 0.505437 0.380412 1.329 24.74

victim 19 0.525634 0.401155 1.310 23.68

white 78 0.414360 0.318089 1.303 23.23

worth 77 1.251581 0.942386 1.328 24.70

young 108 0.804264 0.605161 1.329 24.76

years 346 1.290238 0.979362 1.317 24.09

zoo 5 1.267422 0.962337 1.317 24.07

zero 1 1.148785 0.873075 1.316 24.00


Table (4.9). Values of No, Nc, To, Tc, Sf and Rt for the 10000-document index file.

Keyword No=Nc To Tc Sf Rt (%)

24am 72 6.392787 4.669027 1.369 26.96

appeared 395 9.249416 6.835049 1.353 26.10

business 2651 7.526987 5.586352 1.347 25.78

charms 18 9.101312 6.709699 1.356 26.28

children 1582 7.236499 7.213435 1.003 0.32

environmental 216 9.068859 6.684283 1.357 26.29

estimated 353 9.098026 6.676871 1.363 26.61

government 3219 4.760730 3.664147 1.299 23.03

jail 405 5.222144 4.350456 1.200 16.69

leader 1111 3.382493 2.289197 1.478 32.32

mail 1437 7.550765 5.434512 1.389 28.03

met 653 10.403865 7.615744 1.366 26.80

microwave 22 10.251403 9.996164 1.026 2.49

modeling 9 9.225992 6.804600 1.356 26.25

numbers 621 7.276339 4.595410 1.583 36.84

people 5254 10.466367 7.986437 1.311 23.69

problem 1385 10.724092 7.944967 1.350 25.91

reached 416 10.090774 7.474561 1.350 25.93

science 1050 5.757277 4.249268 1.355 26.19

success 712 10.158388 7.832787 1.297 22.89

time 4424 8.660649 6.369665 1.360 26.45

unions 157 3.027073 2.302291 1.315 23.94

victim 230 3.570685 2.703483 1.321 24.29

white 790 3.165214 2.369342 1.336 25.14

worth 790 10.681316 7.898203 1.352 26.06

young 1206 4.760085 3.552300 1.340 25.37

years 3752 6.954238 5.254474 1.323 24.44

zoo 30 10.798089 7.965333 1.356 26.23

zero 2 10.380481 8.059916 1.288 22.36


Table (4.10). Values of To, Tc, No, Nc, Sf and Rt for the 25000-document index file.

Keyword No=Nc To Tc Sf Rt (%)

24am 107 10.811565 6.845804 1.579 36.68

appeared 574 13.644716 10.065472 1.356 26.23

business 3854 11.088775 8.330853 1.331 24.87

charms 12 13.450452 10.054532 1.338 25.25

children 2435 10.721436 8.077861 1.327 24.66

environmental 319 13.795873 9.910745 1.392 28.16

estimated 538 14.106488 9.079174 1.554 35.64

government 4926 7.148870 5.391486 1.326 24.58

jail 568 7.806072 5.804012 1.345 25.65

leader 1775 4.434092 3.451145 1.285 22.17

mail 2323 10.736373 8.077723 1.329 24.76

met 884 15.242846 11.850393 1.286 22.26

microwave 43 15.216985 11.268990 1.350 25.94

modeling 7 13.526139 10.044590 1.347 25.74

numbers 969 10.433503 7.667062 1.361 26.51

people 8008 14.908253 11.036023 1.351 25.97

problem 2183 15.820238 15.009937 1.054 5.12

reached 686 19.897663 11.035200 1.803 44.54

science 1666 8.556217 6.332746 1.351 25.99

success 1150 10.882603 6.919852 1.573 36.41

time 6834 11.689327 8.780494 1.331 24.88

unions 252 5.139586 3.961825 1.297 22.92

victim 302 5.309767 4.069763 1.305 23.35

white 1196 4.720609 3.550169 1.330 24.79

worth 1202 13.825045 10.207422 1.354 26.17

young 1850 7.066934 5.317212 1.329 24.76

years 5764 10.337936 7.728312 1.338 25.24

zoo 52 16.647460 11.726578 1.420 29.56

zero 3 15.303136 11.467189 1.335 25.07


Table (4.11). Values of No, Nc, To, Tc, Sf and Rt for the 50000-document index file.

Keyword No=Nc To Tc Sf Rt (%)

24am 245 19.545303 10.908365 1.792 44.19

appeared 1281 19.950790 14.492266 1.377 27.36

business 8354 16.661334 12.380967 1.346 25.69

charms 20 19.123015 14.243997 1.343 25.51

children 5134 18.905578 11.991160 1.577 36.57

environmental 783 19.146637 17.658853 1.084 7.77

estimated 1075 19.074374 14.039539 1.359 26.40

government 10555 12.662540 9.578430 1.322 24.36

jail 1260 12.828572 9.549236 1.343 25.56

leader 3756 9.546542 7.192564 1.327 24.66

mail 4897 16.106254 12.596043 1.279 21.79

met 2058 21.312695 20.644879 1.032 3.13

microwave 89 25.971385 15.477604 1.678 40.41

modeling 22 26.806913 14.598512 1.836 45.54

numbers 2034 27.642737 17.907367 1.544 35.22

people 16967 48.673569 25.974665 1.874 46.63

problem 4811 49.641833 28.058616 1.769 43.48

reached 1461 20.840356 15.325443 1.360 26.46

science 3540 13.562762 10.061024 1.348 25.82

success 2462 20.982485 15.331279 1.369 26.93

time 14553 17.604409 13.258523 1.328 24.69

unions 537 9.996396 7.510691 1.331 24.87

victim 690 15.874951 11.372757 1.396 28.36

white 2648 9.537411 7.319981 1.303 23.25

worth 2615 53.874095 25.597647 2.105 52.49

young 3906 12.491169 9.233943 1.353 26.08

years 12259 28.589397 16.357110 1.748 42.79

zoo 105 27.211613 15.915723 1.710 41.51

zero 5 20.772571 15.379120 1.351 25.96


Table (4.12). Values of No, Nc, To, Tc, Sf and Rt for the 75000-document index file.

Keyword No=Nc To Tc Sf Rt (%)

24am 292 18.663617 13.482161 1.384 27.76

appeared 1631 31.181618 17.821380 1.750 42.85

business 10769 21.213269 15.858084 1.338 25.24

charms 45 23.927043 17.633614 1.357 26.30

children 6397 20.288563 15.079594 1.345 25.67

environmental 957 23.680958 22.073657 1.073 6.79

estimated 1376 23.856449 17.361409 1.374 27.23

government 13374 27.446325 20.889583 1.314 23.89

jail 1553 16.211384 12.138495 1.336 25.12

leader 4777 11.792067 9.339108 1.263 20.80

mail 6243 91.331064 35.316792 2.586 61.33

met 2521 35.071219 20.009483 1.753 42.95

microwave 95 25.859182 19.418625 1.332 24.91

modeling 25 24.244812 17.969845 1.349 25.88

numbers 2666 19.768184 18.701697 1.057 5.39

people 21599 142.675981 44.706663 3.191 68.67

problem 6096 205.862538 51.322048 4.011 75.07

reached 1782 26.096704 20.539451 1.271 21.29

science 4514 13.963725 10.486536 1.332 24.90

success 3156 35.208754 19.210081 1.833 45.44

time 18398 22.305036 16.641086 1.340 25.39

unions 651 27.682520 20.250572 1.367 26.85

victim 894 28.649071 21.252775 1.348 25.82

white 3290 12.146824 9.275699 1.310 23.64

worth 3333 137.271984 57.517019 2.387 58.10

young 4996 15.535387 11.657550 1.333 24.96

years 15583 19.981251 15.027456 1.330 24.79

zoo 136 30.389237 20.020076 1.518 34.12

zero 12 26.899483 19.243481 1.398 28.46


4.3.3 Determine Sf and Rt

The results presented in Tables (4.8) to (4.12) are summarized in Tables (4.13) to (4.15).

The values of Sf and Rt for all search processes are listed in Table (4.13). The table also

shows the average values of Sf and Rt achieved by the new model in searching the

different index files. The values of No and Nc and the values of To and Tc for all search

processes on all the different index files are given in Tables (4.14) and (4.15),

respectively. The main outcomes of the results can be summarized as follows:

• The value of Sf is always greater than 1 and Rt is always greater than 0, i.e., Tc < To for all search processes, regardless of the size of the index file, the number of matched documents, or the characteristics of the keyword. This means that the new CIQ model requires less processing time than the uncompressed search model.

• The results demonstrate that most Sf values lie between 1.3 and 1.4, which corresponds to a reduction in processing time of roughly 23% to 29%. Very few cases show a speedup factor outside this range. Thus, as shown in Table (4.13), the average values achieved by the new model vary for Sf from 1.327 to 1.606 and for Rt from 24.13% to 31.71%, depending on the size of the index file or, in other words, on the total number of matches for all keywords. These are very encouraging results, as they mean that Sf increases as the size of the index file increases.

• As it can be seen in Table (4.14), No for the uncompressed model and Nc for the

compressed model increase with increasing index file size for all search

keywords. However, no definite relation can be derived, because this depends on the contents of the crawled documents.

• As a consequence of the above conclusion, for each keyword as summarized in

Table (4.15), the associated search time To for the uncompressed model and Tc for

the compressed model increase with increasing No/Nc. Once again, no definite relation can be derived, because this depends on the contents of the crawled documents and the location of the keyword in the index file.


Table (4.13). Variation of Sf and Rt for different index sizes and keywords.

Keyword | Sf: 1000, 10000, 25000, 50000, 75000 | Rt (%): 1000, 10000, 25000, 50000, 75000

24am 1.323 1.369 1.579 1.792 1.384 24.44 26.96 36.68 44.19 27.76

appeared 1.335 1.353 1.356 1.377 1.750 25.09 26.10 26.23 27.36 42.85

business 1.233 1.347 1.331 1.346 1.338 18.89 25.78 24.87 25.69 25.24

charms 1.290 1.356 1.338 1.343 1.357 22.50 26.28 25.25 25.51 26.30

children 1.264 1.003 1.327 1.577 1.345 20.89 0.32 24.66 36.57 25.67

environmental 1.294 1.357 1.392 1.084 1.073 22.70 26.29 28.16 7.77 6.79

estimated 1.316 1.363 1.554 1.359 1.374 24.03 26.61 35.64 26.40 27.23

government 1.294 1.299 1.326 1.322 1.314 22.72 23.03 24.58 24.36 23.89

jail 1.303 1.200 1.345 1.343 1.336 23.23 16.69 25.65 25.56 25.12

leader 1.279 1.478 1.285 1.327 1.263 21.82 32.32 22.17 24.66 20.80

mail 1.290 1.389 1.329 1.279 2.586 22.47 28.03 24.76 21.79 61.33

met 1.309 1.366 1.286 1.032 1.753 23.61 26.80 22.26 3.13 42.95

microwave 1.312 1.026 1.350 1.678 1.332 23.76 2.49 25.94 40.41 24.91

modeling 1.451 1.356 1.347 1.836 1.349 31.10 26.25 25.74 45.54 25.88

numbers 1.667 1.583 1.361 1.544 1.057 40.02 36.84 26.51 35.22 5.39

people 1.392 1.311 1.351 1.874 3.191 28.15 23.69 25.97 46.63 68.67

problem 1.351 1.350 1.054 1.769 4.011 25.96 25.91 5.12 43.48 75.07

reached 1.279 1.350 1.803 1.360 1.271 21.79 25.93 44.54 26.46 21.29

science 1.314 1.355 1.351 1.348 1.332 23.92 26.19 25.99 25.82 24.90

success 1.328 1.297 1.573 1.369 1.833 24.73 22.89 36.41 26.93 45.44

time 1.318 1.360 1.331 1.328 1.340 24.15 26.45 24.88 24.69 25.39

unions 1.329 1.315 1.297 1.331 1.367 24.74 23.94 22.92 24.87 26.85

victim 1.310 1.321 1.305 1.396 1.348 23.68 24.29 23.35 28.36 25.82

white 1.303 1.336 1.330 1.303 1.310 23.23 25.14 24.79 23.25 23.64

worth 1.328 1.352 1.354 2.105 2.387 24.70 26.06 26.17 52.49 58.10

young 1.329 1.340 1.329 1.353 1.333 24.76 25.37 24.76 26.08 24.96

years 1.317 1.323 1.338 1.748 1.330 24.09 24.44 25.24 42.79 24.79

zoo 1.317 1.356 1.420 1.710 1.518 24.07 26.23 29.56 41.51 34.12

zero 1.316 1.288 1.335 1.351 1.398 24.00 22.36 25.07 25.96 28.46

Average 1.327 1.328 1.352 1.468 1.606 24.46 24.13 26.34 30.12 31.71


Table (4.14). Variation of No and Nc for different index sizes and keywords.

Keyword | No = Nc: 1000, 10000, 25000, 50000, 75000

24am 8 72 107 245 292

appeared 47 395 574 1281 1631

business 263 2651 3854 8354 10769

charms 1 18 12 20 45

children 146 1582 2435 5134 6397

environmental 29 216 319 783 957

estimated 45 353 538 1075 1376

government 300 3219 4926 10555 13374

jail 39 405 568 1260 1553

leader 108 1111 1775 3756 4777

mail 148 1437 2323 4897 6243

met 59 653 884 2058 2521

microwave 3 22 43 89 95

modeling 1 9 7 22 25

numbers 59 621 969 2034 2666

people 505 5254 8008 16967 21599

problem 145 1385 2183 4811 6096

reached 59 416 686 1461 1782

science 107 1050 1666 3540 4514

success 77 712 1150 2462 3156

time 423 4424 6834 14553 18398

unions 14 157 252 537 651

victim 19 230 302 690 894

white 78 790 1196 2648 3290

worth 77 790 1202 2615 3333

young 108 1206 1850 3906 4996

years 346 3752 5764 12259 15583

zoo 5 30 52 105 136

zero 1 2 3 5 12

Total Match 3220 32962 50482 108122 137161


Table (4.15). Variation of To and Tc for different index sizes and keywords.

Keyword | To: 1000, 10000, 25000, 50000, 75000 | Tc: 1000, 10000, 25000, 50000, 75000

24am 1.193 6.393 10.812 19.545 18.664 0.902 4.669 6.846 10.908 13.482

appeared 0.956 9.249 13.645 19.951 31.182 0.716 6.835 10.065 14.492 17.821

business 0.527 7.527 11.089 16.661 21.213 0.428 5.586 8.331 12.381 15.858

charms 0.910 9.101 13.450 19.123 23.927 0.705 6.710 10.055 14.244 17.634

children 0.458 7.236 10.721 18.906 20.289 0.362 7.213 8.078 11.991 15.080

environmental 0.898 9.069 13.796 19.147 23.681 0.694 6.684 9.911 17.659 22.074

estimated 0.883 9.098 14.106 19.074 23.856 0.671 6.677 9.079 14.040 17.361

government 0.810 4.761 7.149 12.663 27.446 0.626 3.664 5.391 9.578 20.890

jail 0.937 5.222 7.806 12.829 16.211 0.720 4.350 5.804 9.549 12.138

leader 0.386 3.382 4.434 9.547 11.792 0.302 2.289 3.451 7.193 9.339

mail 0.468 7.551 10.736 16.106 91.331 0.363 5.435 8.078 12.596 35.317

met 1.175 10.404 15.243 21.313 35.071 0.897 7.616 11.850 20.645 20.009

microwave 1.176 10.251 15.217 25.971 25.859 0.896 9.996 11.269 15.478 19.419

modeling 1.047 9.226 13.526 26.807 24.245 0.721 6.805 10.045 14.599 17.970

numbers 1.443 7.276 10.434 27.643 19.768 0.866 4.595 7.667 17.907 18.702

people 1.144 10.466 14.908 48.674 142.676 0.822 7.986 11.036 25.975 44.707

problem 1.268 10.724 15.820 49.642 205.863 0.939 7.945 15.010 28.059 51.322

reached 1.150 10.091 19.898 20.840 26.097 0.900 7.475 11.035 15.325 20.539

science 1.032 5.757 8.556 13.563 13.964 0.785 4.249 6.333 10.061 10.487

success 1.210 10.158 10.883 20.982 35.209 0.910 7.833 6.920 15.331 19.210

time 0.605 8.661 11.689 17.604 22.305 0.459 6.370 8.780 13.259 16.641

unions 0.505 3.027 5.140 9.996 27.683 0.380 2.302 3.962 7.511 20.251

victim 0.526 3.571 5.310 15.875 28.649 0.401 2.703 4.070 11.373 21.253

white 0.414 3.165 4.721 9.537 12.147 0.318 2.369 3.550 7.320 9.276

worth 1.252 10.681 13.825 53.874 137.272 0.942 7.898 10.207 25.598 57.517

young 0.804 4.760 7.067 12.491 15.535 0.605 3.552 5.317 9.234 11.658

years 1.290 6.954 10.338 28.589 19.981 0.979 5.254 7.728 16.357 15.027

zoo 1.267 10.798 16.647 27.212 30.389 0.962 7.965 11.727 15.916 20.020

zero 1.149 10.380 15.303 20.773 26.899 0.873 8.060 11.467 15.379 19.243


4.4 Validation of the Accuracy of the CIQ Web Search Model

This section describes the last part of the test procedures, which concerns the

validation of the accuracy of the new CIQ model. In order to validate the accuracy of the

CIQ model, we determine and compare the values of No and Nc. In addition, we compare

the actual IDs of the retrieved documents. Thus, this part of the test procedure consists of

two steps, these are:

(1) Step 1: Compare the uncompressed (No) and compressed (Nc) search results for

each search process.

(2) Step 2: Compare the IDs of the matched documents.

As shown in Table (4.14), No and Nc are equal for all searched keywords in all index files. Furthermore, we carried out an internal comparison procedure to ensure that the same document IDs are retrieved by both the compressed and the uncompressed models. The test demonstrates a 100% agreement between the two models, which means that, despite searching at the compressed level, the new model achieves 100% accuracy.
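The comparison itself amounts to checking that both models return the same number of documents and exactly the same document IDs, as in the sketch below; the file-based COMRES procedure of Appendix IV performs the equivalent check, and this function is only an illustration.

```python
def results_agree(uncompressed_ids, compressed_ids):
    # Accuracy criterion of the CIQ model: No = Nc and identical document ID sets.
    return (len(uncompressed_ids) == len(compressed_ids)
            and set(uncompressed_ids) == set(compressed_ids))
```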

4.5 Summary of Results

The main results and outcomes of the above test procedures are summarized in Table

(4.16) and plotted in Figures (4.3) and (4.4).

Table (4.16). Values of C, Rs, average Sf, and average Rt for different sizes index files.

Index C Rs (%) Sf Rt (%)

1000 1.323 24.42 1.327 24.46

10000 1.370 26.98 1.328 24.13

25000 1.375 27.30 1.352 26.34

50000 1.382 27.63 1.468 30.12

75000 1.384 27.72 1.606 31.71


Figure (4.3). Variation of C and average Sf for different sizes index files.

Figure (4.4). Variation of Rs and Rt for different sizes index files.


Finally, the new model attained the following performance:

(1) A compression ratio of over 1.3.

(2) A reduction in storage requirement of over 24%.

(3) A speedup factor of over 1.3.

(4) A reduction in processing (searching) time of over 24%.

(5) The total number and IDs of the retrieved documents attained 100% agreement.

These final results are outlined in the CIQ performance triangle shown in Figure (4.5).

Figure (4.5). The CIQ performance triangle: storage requirement (C > 1.3, Rs > 24%), processing time (Sf > 1.3, Rt > 24%), and accuracy (100% agreement) of the novel CIQ Web search engine model as compared to the current uncompressed model.

Chapter Five

Conclusions and Recommendations for Future Work

5.1 Conclusions

The main conclusions of this work can be summarized as follows:

(1) A novel Web search engine model based on the concept of index-query compression was developed; therefore, it is referred to as the compressed index-query (CIQ) model. The model incorporates two compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a double-compressed index, and the other after the query parser, for query compression to enable compressed index-query search. As a result, less disk space is required to store the index file, disk I/O overheads are reduced, and consequently a higher retrieval rate is achieved. The data compression algorithm used is the novel Hamming code data compression (HCDC) algorithm.

(2) The different components of the CIQ model are implemented in a number of procedures forming what is referred to as the CIQ test tool (CIQTT). CIQTT consists of six main procedures: collecting the testing corpus or documents (COLCOR), processing and analyzing the testing corpus (PROCOR), building the inverted index and starting indexing (INVINX), compressing the inverted index (COMINX), searching the index file (SRHINX), and comparing the outcomes of the different search processes performed by the SRHINX procedure (COMRES). A schematic sketch of this pipeline is given after this list.

(3) In order to validate the accuracy and integrity of the retrieved data, and to evaluate the performance of the CIQ model, a number of tests were performed. Based on these tests, the new CIQ model attained excellent performance compared to the current uncompressed model:

i. It demonstrated excellent accuracy, with 100% agreement between the results retrieved by the compressed CIQ model and the uncompressed model.


ii. The new model demands less disk space, as the HCDC algorithm achieves a compression ratio over 1.3 with a compression efficiency of more than 95%. This implies a reduction in storage requirement of over 24%.

iii. The new CIQ model performs faster than the current model. It achieved a speed-up factor over 1.3, providing a reduction in processing time of over 24%.

The above performance outcomes are outlined in the performance triangle shown in Figure (4.5); the same figure is repeated here as Figure (5.1) for convenience and ease of referencing.

Figure (5.1). The CIQ performance triangle.

(4) The test tool CIQTT can be easily modified to accommodate variations in any of its procedures, such as:

i. Implementing the enhanced HCDC (E-HCDC) algorithm; results of the enhanced version will be published in the literature soon.

ii. Implementing different character sets in naming or numbering the crawled documents.
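As indicated in conclusion (2), the six CIQTT procedures form a single pipeline running from corpus collection to result comparison. The Python sketch below is only schematic: the function names mirror the CIQTT procedure names, but their signatures and placeholder bodies are assumptions made for illustration and do not reproduce the real tool (parts of whose source code are listed in the appendices).

# Schematic CIQTT pipeline; the bodies are placeholders, not the real procedures.
def colcor():
    # COLCOR: collect the testing corpus (placeholder documents)
    return {"000001": "compressed index query search", "000002": "web search engine"}

def procor(docs):
    # PROCOR: process and analyse the corpus (no-op in this sketch)
    return docs

def invinx(docs):
    # INVINX: build the inverted index (word -> set of document IDs)
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def cominx(index):
    # COMINX: compress the index (identity here; bit-level HCDC coding in the real tool)
    return index

def srhinx(index, keyword):
    # SRHINX: search the (compressed) index file
    return sorted(index.get(keyword, set()))

def comres(result_a, result_b):
    # COMRES: compare the outcomes of two search processes
    return result_a == result_b

docs = procor(colcor())
index = invinx(docs)
print(comres(srhinx(index, "search"), srhinx(cominx(index), "search")))   # -> True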


5.2 Recommendations for Future Work

It is believed that the new CIQ model has opened an interesting area of research aimed at developing more efficient Web search engine models that utilize the concept of compressed index-query search. Thus, there are a number of recommendations for future work; here we emphasize what we believe are the main issues that need to be considered soon. These are:

(1) Perform further studies to optimize the statistics of the inverted index file to achieve the maximum possible performance in terms of compression ratio and minimum processing time.

(2) Cluster documents according to their character frequencies to ensure a higher compression ratio (an illustrative sketch is given after this list).

(3) Evaluate and compare the performance of the CIQ model against the current uncompressed model by considering the following test scenarios:

i. Larger index files.

ii. Index files of different structure; for example, adding more meta-data such as keyword position, nearby keywords, and other information to the index structure.

iii. Index files that support incremental updates, whereby documents can be added without re-indexing.

iv. Mixed-language index files.

(4) Perform further investigation using different lists of Websites and keywords.
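Recommendation (2) can be prototyped with little code: each document is represented by a normalized character-frequency profile, and documents with similar profiles are grouped so that each group can be compressed separately. The sketch below is an illustrative, greedy grouping based on cosine similarity; the threshold value and the greedy strategy itself are assumptions chosen for illustration, not techniques prescribed by this work.

# Illustrative grouping of documents by character-frequency similarity (stdlib only).
from collections import Counter
from math import sqrt

def char_profile(text):
    counts = Counter(c for c in text.lower() if c.isalnum())
    total = float(sum(counts.values())) or 1.0
    return {c: n / total for c, n in counts.items()}

def cosine(p, q):
    dot = sum(p[c] * q.get(c, 0.0) for c in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.9):
    """Greedy clustering: a document joins the first cluster whose representative
    profile is sufficiently similar, otherwise it starts a new cluster."""
    clusters = []   # list of (representative_profile, [doc_ids])
    for doc_id, text in docs.items():
        prof = char_profile(text)
        for rep, members in clusters:
            if cosine(prof, rep) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((prof, [doc_id]))
    return [members for _, members in clusters]

print(cluster({"d1": "alpha beta", "d2": "alphabet beat", "d3": "12345 67890"}))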


Appendix I

Special Characters

Browser display Numeric Symbolic Description‘ &lsquo; left single quote’ &rsquo; right single quote‚ &sbquo; single low-9 quote“ &ldquo; left double quote” &rdquo; right double quote„ &bdquo; double low-9 quote† &dagger; dagger‡ &Dagger; double dagger‰ &permil; per mill sign‹ &lsaquo; single left-pointing angle quote› &rsaquo; single right-pointing angle quote♠ &spades; black spade suit♣ &clubs; black club suit♥ &hearts; black heart suit♦ &diams; black diamond suit★ &#9733; black star☆ &#9734; white star‾ &oline; overline, = spacing overscore← &larr; leftward arrow↑ &uarr; upward arrow→ &rarr; rightward arrow↓ &darr; downward arrow© &copy; &#169; copyright sign™ &trade; trademark sign! &#33; exclamation mark" &quot; &#34; double quotation mark# &#35; number sign$ &#36; dollar sign% &#37; percent sign& &amp; &#38; ampersand' &#39; apostrophe( &#40; left parenthesis) &#41; right parenthesis* &#42; asterisk+ &#43; plus sign, &#44; comma- &#45; hyphen. &#46; period· &middot; &#183; middle dot• &#149; bullet/ &frasl; &#47; slash: &#58; colon; &#59; semicolon


< &lt; &#60; less-than sign= &#61; equals sign> &gt; &#62; greater-than sign? &#63; question mark@ &#64; at sign[ &#91; left square bracket\ &#92; backslash] &#93; right square bracket^ &#94; caret_ &#95; horizontal bar (underscore)` &#96; grave accent{ &#123; left curly brace| &#124; vertical bar/ &frasl; &#47; slash° &deg; &#176; degree sign± &plusmn; &#177; plus or minus² &sup2; &#178; superscript two³ &sup3; &#179; superscript three´ &acute; &#180; acute accentµ &micro; &#181; micro sign¶ &para; &#182; paragraph sign¸ &cedil; &#184; cedilla¹ &sup1; &#185; superscript oneº &ordm; &#186; masculine ordinal» &raquo; &#187; right angle quote¼ &frac14; &#188; one-fourth½ &frac12; &#189; one-half¾ &frac34; &#190; three-fourths¿ &iquest; &#191; inverted question markÀ &Agrave; &#192; uppercase A, grave accentÁ &Aacute; &#193; uppercase A, acute accent &Acirc; &#194; uppercase A, circumflex accentà &Atilde; &#195; uppercase A, tildeÄ &Auml; &#196; uppercase A, umlautÅ &Aring; &#197; uppercase A, ringÆ &AElig; &#198; uppercase AEÇ &Ccedil; &#199; uppercase C, cedillaÈ &Egrave; &#200; uppercase E, grave accentÉ &Eacute; &#201; uppercase E, acute accentÊ &Ecirc; &#202; uppercase E, circumflex accentË &Euml; &#203; uppercase E, umlautÌ &Igrave; &#204; uppercase I, grave accentÍ &Iacute; &#205; uppercase I, acute accentÎ &Icirc; &#206; uppercase I, circumflex accentÏ &Iuml; &#207; uppercase I, umlautÐ &ETH; &#208; uppercase Eth, IcelandicÑ &Ntilde; &#209; uppercase N, tildeÒ &Ograve; &#210; uppercase O, grave accentÓ &Oacute; &#211; uppercase O, acute accentÔ &Ocirc; &#212; uppercase O, circumflex accent


Õ &Otilde; &#213; uppercase O, tildeÖ &Ouml; &#214; uppercase O, umlaut× &times; &#215; multiplication signØ &Oslash; &#216; uppercase O, slashÙ &Ugrave; &#217; uppercase U, grave accentÚ &Uacute; &#218; uppercase U, acute accentÛ &Ucirc; &#219; uppercase U, circumflex accentÜ &Uuml; &#220; uppercase U, umlautÝ &Yacute; &#221; uppercase Y, acute accentÞ &THORN; &#222; uppercase THORN, Icelandicß &szlig; &#223; lowercase sharps, Germanà &agrave; &#224; lowercase a, grave accentá &aacute; &#225; lowercase a, acute accentâ &acirc; &#226; lowercase a, circumflex accentã &atilde; &#227; lowercase a, tildeä &auml; &#228; lowercase a, umlautå &aring; &#229; lowercase a, ringæ &aelig; &#230; lowercase aeç &ccedil; &#231; lowercase c, cedillaè &egrave; &#232; lowercase e, grave accenté &eacute; &#233; lowercase e, acute accentê &ecirc; &#234; lowercase e, circumflex accentë &euml; &#235; lowercase e, umlautì &igrave; &#236; lowercase i, grave accentí &iacute; &#237; lowercase i, acute accentî &icirc; &#238; lowercase i, circumflex accentï &iuml; &#239; lowercase i, umlautð &eth; &#240; lowercase eth, Icelandicñ &ntilde; &#241; lowercase n, tildeò &ograve; &#242; lowercase o, grave accentó &oacute; &#243; lowercase o, acute accentô &ocirc; &#244; lowercase o, circumflex accentõ &otilde; &#245; lowercase o, tildeö &ouml; &#246; lowercase o, umlaut÷ &divide; &#247; division signø &oslash; &#248; lowercase o, slashù &ugrave; &#249; lowercase u, grave accentú &uacute; &#250; lowercase u, acute accentû &ucirc; &#251; lowercase u, circumflex accentü &uuml; &#252; lowercase u, umlautý &yacute; &#253; lowercase y, acute accentþ &thorn; &#254; lowercase thorn, Icelandicÿ &yuml; &#255; lowercase y, umlaut


Appendix II
INVINX: Building the inverted index and start indexing step source code

# This code developed by Saif Mahmood Saab in partial fulfillment for the requirement for the #degree of Doctor of Philosophy (PhD) in Computer Information Systems.#!/usr/bin/python# -*- coding: UTF-8 -*-# Inverted Index Creatorimport osimport sysimport reimport os.pathimport codecsclass IndexCreator(object):

wordDist = {}charactermap = {}stopWords = []speicalChars = []def __init__(self):

self.wordDist = {}self.charactermap = {}self.NumOfLoops = int ()#self.speicalChars = ['¼', '\]', '\[', '¹', '\^', '`', 'œ', '\Ø', 'º', 'œ',

'Å’', '³', '¯', '®', '\±', '£', 'Â¥', '§', '©', 'Å¡', 'Ã', 'µ', 'ï', '\*', '¢', '£', '¤', '\Â¥', '€', '¦', '§', '¨', '©', '\ª', 'ª', '«', '\¬', '\®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '–', '—', '‘', '’', '‚', '“', 'â€�', '\'', '‹', '›', '"', '&', '<', '>', '@', '†', '‡', '•', '…', '‰', '′', '″', '‾', 'â�„', '℘', 'â„‘', 'â„œ', 'â„¢', 'ℵ', 'â†�', '↑', '→', '↓', '↔', '↵', 'â‡�', '⇑', '⇒', '⇓', '⇔', '∀', '∂', '∃', '∅', '∇', '∈', '∉', '∋', 'âˆ�', '∑', '−', '\∗', '√', 'âˆ�', '∞', '∠', '∧', '∨', '∩', '∪', '∫', '∴', '∼', '≅', '≈', '≠', '≡', '≤', '≥', '⊂', '⊃', '⊄', '⊆', '⊇', '⊕', '⊗', '⊥', 'â‹…', '⌈', '⌉', '⌊', '⌋', '〈', '〉', 'â—Š', 'â™ ', '♣', '♥', '♦', ',', '\.', '\!', '\?', '\s+', '\n', ';', ':', '\|', '\{', '\}', '\(', '\)', '\+', '\%', '\$', '\#', '\t', '\=', '/', '\-', '\_', '\~', 'Ø', 'Ù', 'Ú', 'Û', 'Ãœ', 'Ã�', 'Þ', 'ß', 'à ', 'á', 'â', 'ã', 'ä', 'Ã¥', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'Ã', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ', 'Ö', '\¿', '\\\\', '¬', '½', 'Â', 'ï', 'Ù']

self.speicalChars = ['\'', '\xc2\xbc', '\]', '\[', '\xc2\xb9', '\^', '`', '\xc5\x93', '\\\xc3\x98', '\xc2\xba', '\xc5\x93', '\xc5\x92', '\xc2\xb3', '\xc2\xaf', '\xc2\xae', '\\\xc2\xb1', '\xc2\xa3', '\xc2\xa5', '\xc2\xa7', '\xc2\xa9', '\xc5\xa1', '\xc3\x83', '\xc2\xb5', '\xc3\xaf', '\\*', '\xc2\xa2', '\xc2\xa3', '\xc2\xa4', '\\\xc2\xa5', '\xe2\x82\xac', '\xc2\xa6', '\xc2\xa7', '\xc2\xa8', '\xc2\xa9', '\\\xc2\xaa', '\xc2\xaa', '\xc2\xab', '\\\xc2\xac', '\\\xc2\xae', '\xc2\xaf', '\xc2\xb0', '\xc2\xb1', '\xc2\xb2', '\xc2\xb3', '\xc2\xb4', '\xc2\xb5', '\xc2\xb6', '\xc2\xb7', '\xc2\xb8', '\xc2\xb9', '\xc2\xba', '\xc2\xbb', '\xe2\x80\x93', '\xe2\x80\x94', '\xe2\x80\x98', '\xe2\x80\x99', '\xe2\x80\x9a', '\xe2\x80\x9c', '\xe2\x80\x9d', '\xe2\x80\xb9', '\xe2\x80\xba', '"', '&', '<', '>', '@', '\xe2\x80\xa0', '\xe2\x80\xa1', '\xe2\x80\xa2', '\xe2\x80\xa6', '\xe2\x80\xb0', '\xe2\x80\xb2', '\xe2\x80\xb3', '\xe2\x80\xbe', '\xe2\x81\x84', '\xe2\x84\x98', '\xe2\x84\x91', '\xe2\x84\x9c', '\xe2\x84\xa2', '\xe2\x84\xb5', '\xe2\x86\x90', '\xe2\x86\x91', '\xe2\x86\x92', '\xe2\x86\x93', '\xe2\x86\x94', '\xe2\x86\xb5', '\xe2\x87\x90', '\xe2\x87\x91', '\xe2\x87\x92', '\xe2\x87\x93', '\xe2\x87\x94', '\xe2\x88\x80', '\xe2\x88\x82', '\xe2\x88\x83', '\xe2\x88\x85', '\xe2\x88\x87', '\xe2\x88\x88', '\xe2\x88\x89', '\xe2\x88\x8b', '\xe2\x88\x8f', '\xe2\x88\x91', '\xe2\x88\x92', '\\\xe2\x88\x97', '\xe2\x88\x9a', '\xe2\x88\x9d',


'\xe2\x88\x9e', '\xe2\x88\xa0', '\xe2\x88\xa7', '\xe2\x88\xa8', '\xe2\x88\xa9', '\xe2\x88\xaa', '\xe2\x88\xab', '\xe2\x88\xb4', '\xe2\x88\xbc', '\xe2\x89\x85', '\xe2\x89\x88', '\xe2\x89\xa0', '\xe2\x89\xa1', '\xe2\x89\xa4', '\xe2\x89\xa5', '\xe2\x8a\x82', '\xe2\x8a\x83', '\xe2\x8a\x84', '\xe2\x8a\x86', '\xe2\x8a\x87', '\xe2\x8a\x95', '\xe2\x8a\x97', '\xe2\x8a\xa5', '\xe2\x8b\x85', '\xe2\x8c\x88', '\xe2\x8c\x89', '\xe2\x8c\x8a', '\xe2\x8c\x8b', '\xe2\x8c\xa9', '\xe2\x8c\xaa', '\xe2\x97\x8a', '\xe2\x99\xa0', '\xe2\x99\xa3', '\xe2\x99\xa5', '\xe2\x99\xa6', ',', '\.', '\!', '\?', '\s+', '\n', ';', ':', '\|', '\{', '\}', '\(', '\)', '\+', '\%', '\$', '\#', '\t', '\=', '/', '\-', '\_', '\~', '\xc3\x98', '\xc3\x99', '\xc3\x9a', '\xc3\x9b', '\xc3\x9c', '\xc3\x9d', '\xc3\x9e', '\xc3\x9f', '\xc3\xa0', '\xc3\xa1', '\xc3\xa2', '\xc3\xa3', '\xc3\xa4', '\xc3\xa5', '\xc3\xa6', '\xc3\xa7', '\xc3\xa8', '\xc3\xa9', '\xc3\xaa', '\xc3\xab', '\xc3\xac', '\xc3\xad', '\xc3\xae', '\xc3\xaf', '\xc3\xb0', '\xc3\xb1', '\xc3\xb2', '\xc3\xb3', '\xc3\xb4', '\xc3\xb5', '\xc3\xb6', '\xc3\xb8', '\xc3\xb9', '\xc3\xba', '\xc3\xbb', '\xc3\xbc', '\xc3\xbd', '\xc3\xbe', '\xc3\xbf', '\xc3\x96', '\\\xc2\xbf', '\\\\', '\xc2\xac', '\xc2\xbd', '\xc3\x82', '\xc3\xaf', '\xc3\x99']

#read stopwords list from a fileself.stopWords = codecs.open("stopwords.lst", "r", "utf8").read().split("\n")

#get list of files in the directories to do the processing on themdef getListOfFiles(self, directoryPath):

docsFilesList = [x for x in os.listdir(directoryPath)]print "We have a total of " + str(len(docsFilesList)) + " files to work"return docsFilesList

#clean the stopwords and the special charactersdef cleaner(self, text):

for i in self.speicalChars:text = re.sub(i, " ", text)

for j in self.stopWords:reg = r"(?i) " + j.strip() + " "text = re.sub(reg, " ", text)

return text#just read the text file and return the contained textdef readFile(self, sourceFilesPath, filename):

self.NumOfLoops=self.NumOfLoops+1print "Now reading file " + filenameprint self.NumOfLoopsfile = codecs.open(sourceFilesPath+filename, "r", "utf8")text = file.read()file.close()return text

#not useddef checkText(self, text):

text = re.sub("•", "", text)text = re.sub("›", "", text)return text

def indexMapUpdater(self, currentFile, text):splitedText = self.cleaner(text).split(" ")for word in splitedText:

listOfLocations = []listOfLocations = self.wordDist.get(word)if listOfLocations == None or listOfLocations == []:

self.wordDist[str(word)] = [currentFile]#print "first time for {" + str(word) + "}"

else:if not currentFile in listOfLocations:

listOfLocations.append(currentFile)


self.wordDist[str(word)] = listOfLocations#print "already exist {" + str(word) + "} in " + str(listOfLo-

cations)"""#split the text after cleaning it into words, these words will be connected to where it

appears in the docsdef indexMapUpdater(self, currentFile, text):

splitedText = self.cleaner(text).split()for word in splitedText:

if re.search("[A-z]|[0-9]", word.strip()):listOfLocations = []listOfLocations = self.wordDist.get(word.strip())if listOfLocations == None or listOfLocations == []:

try:self.wordDist[str(word.strip())] = [currentFile]

except:pass#print word

#print "first time for {" + str(word) + "}"else:

if not currentFile in listOfLocations:listOfLocations.append(currentFile)try:

self.wordDist[str(word.strip())] = listOfLocationsexcept:

pass#print word

#print "already exist {" + str(word) + "} in " + str(li-stOfLocations)

#here we will write the index profile, each word with it's occurences in the filesdef indexFileUpdater(self, mapOfWords, sourceFilesPath, dirName):

os.popen("rm " + sourceFilesPath + dirName + "index.index")indexFile = codecs.open(sourceFilesPath + dirName + "index.index", "a", "utf8")i = 0for word in mapOfWords:

listOfLocations = mapOfWords.get(word)if word == None or listOfLocations == None or listOfLocations == [] or

str(word) == "":print "continue"continue

#out = "⇒" + str(i) + "›" + str(word) + "•"#out = "›" + str(word) + "•"out = "^" + str(word) + "|""""for loc in listOfLocations:

out += loc + "•"out.replace('•$', '')"""for loc in listOfLocations:

out += locindexFile.write(str(out).decode("utf8"))indexFile.flush()i = i + 1

indexFile.close()#get back a frequency of words in the filesdef wordsFrequency(self, mapOfWords, sourceFilesPath, dirName):

os.popen("rm " + sourceFilesPath + dirName + "wordsfrequency.freq")


wordFreqFile = open(sourceFilesPath + dirName + "wordsfrequency.freq", "a")for word in mapOfWords:

listOfLocations = mapOfWords.get(word)try:

wordFreqFile.write(str(word)+ " ⇒ " +str(len(listOfLocations)))wordFreqFile.write('\n')wordFreqFile.flush()

except:pass#print word

wordFreqFile.close()#get back a frequency of characters in the filesdef characterFrequency(self, text):

listOfChars = list(text)for char in listOfChars:

if self.charactermap.get(char):self.charactermap[char] = int(self.charactermap.get(char)) + 1

else:self.charactermap[char] = 1

#get the map of character frequncy and write it down into the filedef writeCharacterFrequency(self, charmap, sourceFilesPath, dirName):

os.popen("rm " + sourceFilesPath + dirName + "charsfrequency.freq")charFreqFile = open(sourceFilesPath + dirName + "charsfrequency.freq", "a")for i in charmap:

try:charFreqFile.write(str(i) + " ⇒ " + str(charmap.get(i)) + "\n")

except:pass#print i

charFreqFile.flush()charFreqFile.close()

if __name__ == '__main__':base = "/home/saif/Documents/py/readydocsa9/"sourcesFilesPathList = [base+"now25k/"]#sourcesFilesPathList = [base+"1k/", base+"10k/", base+"25k/", base+"50k/", base

+"75k/", base+"104k/"]#dirCollections = ["mixedcharsdocs", "smallcharsdocs"]dirCollections = ["a9"]indexCreator = IndexCreator()i = 0for sourceFilesPath in sourcesFilesPathList:

for dir in dirCollections:for file in indexCreator.getListOfFiles(sourceFilesPath + dir + "/"):

#if i > 1:# breaktext = indexCreator.readFile(sourceFilesPath + dir + "/", file)indexCreator.indexMapUpdater(file, text)indexCreator.characterFrequency(text)i = i + 1

indexCreator.writeCharacterFrequency(indexCreator.charactermap, sourceFile-sPath, dir)

indexCreator.indexFileUpdater(indexCreator.wordDist, sourceFilesPath, dir)indexCreator.wordsFrequency(indexCreator.wordDist, sourceFilesPath, dir)indexCreator.wordDist = {}indexCreator.charactermap = {}
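For readability, the core of the indexing logic listed above (remove special characters and stopwords, then record for every remaining word the documents in which it occurs) can be summarized in a few lines. The sketch below is a simplified paraphrase with a hypothetical stopword list and placeholder documents; it is not a replacement for the full IndexCreator class.

# Simplified paraphrase of the inverted-index construction above (illustrative only).
import re

STOPWORDS = {"the", "a", "of"}                       # stand-in for stopwords.lst

def tokenize(text):
    words = re.findall(r"[a-z0-9]+", text.lower())   # drop special characters
    return [w for w in words if w not in STOPWORDS]

def build_inverted_index(documents):
    """documents: dict of doc_id -> text; returns word -> sorted list of doc IDs."""
    index = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {"000001": "the zoo of the city", "000002": "a city zoo"}
print(build_inverted_index(docs))   # e.g. {'zoo': ['000001', '000002'], 'city': [...]}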


Appendix III
COMINX: Compressing the inverted index step source code

# This code developed by Saif Mahmood Saab in partial fulfillment for the requirement for the #degree of Doctor of Philosophy (PhD) in Computer Information Systems.#!/usr/bin/python# -*- coding: UTF-8 -*-# Inverted Index Compressor import osimport sysimport reimport os.pathfrom datetime import datetimeimport SpecialBinaryConvertor

class SpecialIndexToBinary(object):

#read file and return textdef readFile(self, sourceFilesPath, filename):

print "Now reading file " + filenamefile = open(sourceFilesPath+filename, "r")text = file.read()file.close()return text

#write down the frequencies of the charactersdef writeBinaryIndex(self, bText, filename, path):

os.popen("rm " + path+filename+".spe.bin")out = open(path+filename+".spe.bin", "w")out.write(str(bText))#out.write(str(i).decode('unicode_escape').encode("utf8") + " ⇒ " +

str(charmap.get(i)).decode('unicode_escape').encode("utf8") + "\n")out.flush()out.close()

if __name__ == '__main__':bConv = SpecialBinaryConvertor.SpecialBinaryConvertor()iBinary = SpecialIndexToBinary()base = "/home/saif/Documents/IndexFileOperations/"sourcesFilesPathList = [base+"Indexes/"]#sourcesFilesPathList = [base+"1k/", base+"10k/", base+"25k/", base+"50k/", base

+"75k/", base+"104k/"]#dirCollections = ["mixedcharsdocsindex.index", "smallcharsdocsindex.index"]dirCollections = ["num09index50.index"]for sourceFilesPath in sourcesFilesPathList:

for dir in dirCollections:text = iBinary.readFile(sourceFilesPath, dir)binText = bConv.getBin(text)if binText != None and binText != "":

iBinary.writeBinaryIndex(binText, dir, sourceFilesPath)

++++++++++++++++


# This code developed by Saif Mahmood Saab in partial fulfillment for the requirement for the #degree of Doctor of Philosophy (PhD) in Computer Information Systems.#!/usr/bin/python# -*- coding: UTF-8 -*-# Inverted Index Compressor Mapping (New Encoding Class)import sys

class SpecialBinaryConvertor(object):

def getBin(self, text):charmap = eval(open("charmapping.map", "r").read())#modify the file mapping

nametextList = list(text)binText = ""for i in textList:

if not i == None and not i == "":try:

binText += charmap.get(i)except:

#print i#ai = ord(i)#print ''.join('01'[(ai >> x) & 1] for x in xrange(7, -1,

-1))pass

return binText

def getAscii(self, binary):occurencesTmpList = list(binary)position = ""char = ""r = eval(open("charmappingASCII.map","r").read())#modify the file mapping

name

while len(occurencesTmpList) != 0:char = ""if occurencesTmpList[0] == "0":

for i in range(0,9):char += occurencesTmpList[i]

position += r.get(char)del occurencesTmpList[0:9]

elif occurencesTmpList[0] == "1":for i in range(0,5):

char += occurencesTmpList[i]position += r.get(char)del occurencesTmpList[0:5]

return position

if __name__ == '__main__':text = "1"bConv = SpecialBinaryConvertor()bin =

open("/Extra/FreeLancingWork/Saif/NEW/readydocs/1k/smallcharsdocsindex.index.bin", "r").read()

t = open("out", "w")


t.write(bConv.getAscii(str(bin).strip()))t.close()

++++++++++++++++++++

# The new Encoding form (Mapping)
{"0":"10000","3":"10001","6":"10010","7":"10011","2":"10100","1":"10101","5":"10110","4":"10111","9":"11000","8":"11001","^":"11010","|":"11011","e":"11100","a":"11101","i":"11110","s":"11111","n":"001101110","r":"001110010","t":"001110100","o":"001101111","l":"001101100","d":"001100100","c":"001100011","u":"001110101","m":"001101101","g":"001100111","p":"001110000","h":"001101000","b":"001100010","y":"001111001","f":"001100110","v":"001110110","k":"001101011","w":"001110111","x":"001111000","j":"001101010","z":"001111010","q":"001110001"}
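The mapping above assigns 5-bit codewords (all beginning with '1') to the most frequent symbols, including the record delimiters '^' and '|', and 9-bit codewords (all beginning with '00') to the remaining letters, so a decoder can tell from the first bit of each codeword how many bits to consume. The following condensed sketch illustrates this encode/decode logic using only a small subset of the mapping; it is a simplified illustration, not the CIQTT implementation itself.

# Condensed illustration of the 5-bit/9-bit prefix code used above
# (toy subset of the mapping; not the full CIQTT code).
CODE = {"^": "11010", "|": "11011", "e": "11100", "a": "11101",
        "i": "11110", "s": "11111", "0": "10000", "1": "10101",
        "n": "001101110", "r": "001110010", "t": "001110100"}
DECODE = {v: k for k, v in CODE.items()}

def encode(text):
    return "".join(CODE[c] for c in text)

def decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        width = 5 if bits[pos] == "1" else 9   # first bit selects the codeword length
        out.append(DECODE[bits[pos:pos + width]])
        pos += width
    return "".join(out)

entry = encode("^" + "raise" + "|" + "001101")   # keyword delimited by ^ and |, then a doc ID
print(decode(entry))                             # -> ^raise|001101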


Appendix IV
SRHINX: Searching the index file (inverted or inverted/compressed index) step source code
and
COMRES: Comparing the outcomes of different search processes performed by the SRHINX procedure step source code

# This code developed by Saif Mahmood Saab in partial fulfillment for the requirement for the #degree of Doctor of Philosophy (PhD) in Computer Information Systems.#!/usr/bin/python# -*- coding: UTF-8 -*-# Inverted Index File Search

import osimport sysimport reimport os.pathfrom datetime import datetimeimport BinaryConvertor

class IndexFileBinarySearch(object):

indexMap = {}def __init__(self):

print "loading"

def loadBinaryIndexFile(self, indexFile):indexFileObject = open(indexFile, "r")text = indexFileObject.read()indexFileObject.close()return text

# | in binary 01111100# ^ in binary 01011110def searchMap(self, query):

#print queryoccurencesList = []occurencesStr = ""match = re.findall("01011110"+query+"01111100", text)print "01011110"+query+"01111100"print matchtry:

newStr = ""if match != None and len(match) != 0:

#print matchoccurencesStr = re.search("01011110"+query+"01111100(.+)",

text).group(1)#cut into list of charsoccurencesTmpList = list(occurencesStr)#occurencesTmpList = occurencesTmpList[0:3000]numberOfCharsToCompleteAPosition = 0position = ""currentposition = 0


while len(occurencesTmpList) != 0:char = ""if numberOfCharsToCompleteAPosition == 6:

#print "we have a complete position"occurencesList.append(position)position = ""numberOfCharsToCompleteAPosition = 0

for i in range(currentposition,currentposition+8):char += occurencesTmpList[i]

if char == '01011110':#print "we are done here"break

#newStr += char#print "before " + str(currentposition)currentposition = currentposition + 8#print "after " + str(currentposition)if numberOfCharsToCompleteAPosition == 6:

#print "we have a complete position"occurencesList.append(position)position = ""numberOfCharsToCompleteAPosition = 0

position += charnumberOfCharsToCompleteAPosition = numberOfCharsToCom-

pleteAPosition + 1except Exception, err:

print str(err)return occurencesList

if __name__ == '__main__':bConv = BinaryConvertor.BinaryConvertor()indexFileBinarySearch = IndexFileBinarySearch()startTimer = datetime.now()text = indexFileBinarySearch.loadBinaryIndexFile("/home/saab/Documents/IndexFileOp-

erations/Indexes/num09index50.index.bin") #the bin file namestopTimer = datetime.now()total = (stopTimer - startTimer)print "index file loaded in " + str(total) + " seconds"running = "true"while(running):

query = str(raw_input("please input your query: "))if query == None or query == '' or query == 'exit':

print "Terminating..."sys.exit(0)

print "Searching results for query {" + query +"}"startTimer = datetime.now()occurences = indexFileBinarySearch.searchMap(bConv.getBin(query.strip()))stopTimer = datetime.now()total = (stopTimer - startTimer)if occurences == None or occurences == []:

print "No results"else:

for o in occurences:print bConv.getAscii(o)

print "Found " + str(len(occurences)) + " results in " + str(total) + " seconds"

+++++++++++++++++++


# This code developed by Saif Mahmood Saab in partial fulfillment for the requirement for the #degree of Doctor of Philosophy (PhD) in Computer Information Systems.#!/usr/bin/python# -*- coding: UTF-8 -*-# Compressed Inverted Index File SearchNew2

import osimport sysimport reimport os.pathfrom datetime import datetimeimport SpecialBinaryConvertor

class SpecialIndexFileBinarySearch(object):

indexMap = {}def __init__(self):

print "loading"

def loadBinaryIndexFile(self, indexFile):indexFileObject = open(indexFile, "r")text = indexFileObject.read()indexFileObject.close()return text

# | in binary 11011# ^ in binary 11010def searchMap(self, query):

#print queryoccurencesList = []occurencesStr = ""print len(text)match = re.findall("11010"+query+"11011", text)print "11010"+query+"11011"try:

newStr = ""if match != None and len(match) != 0:

#print matchoccurencesStr = re.search("11010"+query+"11011(.+)",

text).group(1)#cut into list of charsoccurencesTmpList = list(occurencesStr)#occurencesTmpList = occurencesTmpList[0:3000]numberOfCharsToCompleteAPosition = 0position = ""currentposition = 0while len(occurencesTmpList) != 0:

char = ""if numberOfCharsToCompleteAPosition == 6:

#print "we have a complete position"occurencesList.append(position)position = ""numberOfCharsToCompleteAPosition = 0

if occurencesTmpList[currentposition] == "0":for i in range(currentposition,currentposition+9):

char += occurencesTmpList[i]


#newStr += char#print "before " + str(currentposition)currentposition = currentposition + 9#print "after " + str(currentposition)

elif occurencesTmpList[currentposition] == "1":for i in range(currentposition,currentposition+5):

char += occurencesTmpList[i]if char == '11010':

#print "we are done here"break

else:#newStr += char#print "before " + str(currentposition)currentposition = currentposition + 5#print "after " + str(currentposition)

if numberOfCharsToCompleteAPosition == 6:#print "we have a complete position"occurencesList.append(position)position = ""numberOfCharsToCompleteAPosition = 0

position += charnumberOfCharsToCompleteAPosition = numberOfCharsToCom-

pleteAPosition + 1except Exception, err:

print str(err)return occurencesList

if __name__ == '__main__':bConv = SpecialBinaryConvertor.SpecialBinaryConvertor()indexFileBinarySearch = SpecialIndexFileBinarySearch()startTimer = datetime.now()text = indexFileBinarySearch.loadBinaryIndexFile("/home/saab/Documents/IndexFileOp-

erations/Indexes/num09index50.index.spe.bin")stopTimer = datetime.now()total = (stopTimer - startTimer)print "index file loaded in " + str(total) + " seconds"running = "true"while(running):

query = str(raw_input("please input your query: "))if query == None or query == '' or query == 'exit':

print "Terminating..."sys.exit(0)

print "Searching results for query {" + query +"}"startTimer = datetime.now()occurences = indexFileBinarySearch.searchMap(bConv.getBin(query.strip()))stopTimer = datetime.now()total = (stopTimer - startTimer)if occurences == None or occurences == []:

print "No results"else:

for o in occurences:print bConv.getAscii(o)

print "Found " + str(len(occurences)) + " results in " + str(total) + " seconds"#containe file name with 6-digets
