Apache Hadoop India Summit 2011 Searching Information Inside Hadoop Platform Abinasha Karana Director-Technology Bizosys Technologies Pvt Ltd. [email protected] www.bizosys.com


Apache Hadoop India Summit 2011

Searching Information

Inside Hadoop Platform

Abinasha Karana, Director-Technology

Bizosys Technologies Pvt Ltd.

[email protected]

www.bizosys.com


To search large datasets inside HDFS and HBase, we at Bizosys started with MapReduce and Lucene/Solr.


What didn’t work for us: MapReduce

Results were not available at the click of a mouse.


What didn’t work for us: Lucene/Solr

It required vertical scaling with manual sharding, and subsequent resharding as the data grew.


What we did

We built a new search engine for the Hadoop platform (HDFS and HBase).


In the next few slides you will hear about my learnings from designing, developing and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase.


Key Learnings

1. Using SSD is a design decision.

2. Methods to reduce HBase table storage size

3. Serving a request without accessing ALL region servers

4. Methods to move processing near the data

5. Byte block caching to lower network and I/O trips to HBase

6. Configuration to balance network vs. CPU vs. I/O vs. memory


1. Using SSD is a design decision

SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.

In HBase Table Schema Design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.


.. Our SSD-friendly schema design

Access pattern:

• Keyword: reads all for a query.
• Document: reads 10 docs / query.

Keyword + Document in 1 table: SSD deployment is all or none.

Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.


2. Methods to reduce HBase table storage size

Storing a 4-byte cell requires more than 27 bytes in HBase.

[Figure: layout of a stored cell]

• Key length: 4 bytes
• Value length: 4 bytes
• Row length: 2 bytes
• Row bytes: variable
• Family length: 1 byte
• Family bytes: variable
• Qualifier bytes: variable
• Timestamp: 8 bytes
• Key type: 1 byte
• Value bytes: 4 bytes
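The arithmetic behind the 27-byte figure can be sketched as follows. This is a minimal illustration, assuming the fixed-width fields of the cell layout are 4, 4, 2, 1, 8 and 1 bytes (key length, value length, row length, family length, timestamp, key type); the class and method names are hypothetical and not part of HBase or HSearch.

```java
/** Estimates the serialized size of one stored cell from its field widths. */
final class KeyValueSize {
    // Fixed-width fields, in bytes (assumed from the cell layout above).
    static final int KEY_LENGTH = 4;
    static final int VALUE_LENGTH = 4;
    static final int ROW_LENGTH = 2;
    static final int FAMILY_LENGTH = 1;
    static final int TIMESTAMP = 8;
    static final int KEY_TYPE = 1;

    /** Bytes spent on every cell before any row, family, qualifier or value bytes. */
    static int fixedOverhead() {
        return KEY_LENGTH + VALUE_LENGTH + ROW_LENGTH + FAMILY_LENGTH + TIMESTAMP + KEY_TYPE;
    }

    /** Total bytes for one cell with the given variable-width field sizes. */
    static int serializedSize(int rowBytes, int familyBytes, int qualifierBytes, int valueBytes) {
        return fixedOverhead() + rowBytes + familyBytes + qualifierBytes + valueBytes;
    }

    public static void main(String[] args) {
        // Smallest case: 1-byte row, family and qualifier around a 4-byte value.
        System.out.println(serializedSize(1, 1, 1, 4));
        // Longer (hypothetical) names make the overhead grow further.
        System.out.println(serializedSize(16, 8, 12, 4));
    }
}
```

Under these assumptions the fixed overhead alone is 20 bytes, so even with 1-byte row, family and qualifier names a 4-byte value costs 27 bytes.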


.. to 1/3rd

• Stored large cell values by merging cells
• Reduced the Family name to 1 character
• Reduced the Qualifier name to 1 character
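A rough sketch of why these three steps shrink storage, assuming a fixed per-cell overhead of 20 bytes. The name lengths, the cell count and the `MergedCells` class are hypothetical, chosen only for illustration; they are not HSearch's actual schema, and the ratio printed is not the 1/3rd reported above.

```java
import java.nio.ByteBuffer;

/** Compares many small cells with one merged cell that uses 1-character names. */
final class MergedCells {
    static final int FIXED_OVERHEAD = 20; // assumed fixed bytes per stored cell

    /** Approximate bytes for one stored cell. */
    static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    /** Packs many 4-byte values into a single cell value. */
    static byte[] merge(int[] values) {
        ByteBuffer block = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            block.putInt(v);
        }
        return block.array();
    }

    /** Unpacks a merged cell value back into its 4-byte values. */
    static int[] split(byte[] block) {
        ByteBuffer in = ByteBuffer.wrap(block);
        int[] values = new int[block.length / Integer.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.getInt();
        }
        return values;
    }

    public static void main(String[] args) {
        int cells = 100;
        // 100 separate cells: 8-byte row, 6-byte family, 8-byte qualifier, 4-byte value.
        long separate = cells * cellBytes(8, 6, 8, 4);
        // One merged cell: same row, 1-byte family and qualifier, all values in one block.
        long merged = cellBytes(8, 1, 1, merge(new int[cells]).length);
        System.out.println(separate + " bytes as separate cells, " + merged + " bytes merged");
    }
}
```

The trade-off is that reading a single value now means fetching and unpacking the whole merged block.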


3. Serving a request without accessing ALL region servers

Consider a 100-node HBase cluster in which a single search request needs to access all of the nodes.

Bad Design.. Clogged Network.. No scaling


3. And our solution…

Index Table was divided on Column-Family as separate tables.

[Figure: Before, a single table holding Family A and Family B has its row ranges (0-1 M, 1-2 M, 2-3 M, 3-4 M, 4-5 M) spread over Machines 1 to 5, so a scan of “Family A” hits 5 machines. After, Table A sits on Machines 1 to 3 and Table B on Machines 3 to 5, so a scan of Table A hits 3 machines.]
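One way to reason about why the split reduces the number of machines hit is the simplified model below. It assumes, and the slides do not state this, that a region splits once its largest column-family store passes a size limit and that each region sits on its own machine. The class name, sizes and limit are hypothetical.

```java
/** Simplified model: how many regions a scan touches before and after splitting a table. */
final class RegionSpread {
    /**
     * Regions needed when a region splits once its largest family store
     * exceeds maxStoreBytes. All families of a table share the same regions.
     */
    static int regions(long maxStoreBytes, long... familyBytes) {
        long largest = 0;
        for (long bytes : familyBytes) {
            largest = Math.max(largest, bytes);
        }
        // Ceiling division, with at least one region per table.
        return (int) Math.max(1, (largest + maxStoreBytes - 1) / maxStoreBytes);
    }

    public static void main(String[] args) {
        long limit = 1L << 30;        // hypothetical 1 GB split threshold
        long familyA = 3 * limit;     // hypothetical size of Family A
        long familyB = 5 * limit;     // hypothetical size of Family B

        // One table: Family A is spread over as many regions as the larger family needs.
        System.out.println("Scan Family A in one table: " + regions(limit, familyA, familyB) + " regions");
        // Separate tables: Family A only occupies the regions it needs itself.
        System.out.println("Scan Table A alone: " + regions(limit, familyA) + " regions");
    }
}
```

In this model the smaller family is dragged across every region of the larger one while they share a table, which is the effect the separate tables avoid.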


4. Methods to move processing near the data

• Sent filtered Rows over network.
  E.g. Matched rows for a keyword

public class TermFilter implements Filter {
    public ReturnCode filterKeyValue(KeyValue kv) {
        boolean isMatched = isFound(kv);
        if (isMatched) return ReturnCode.INCLUDE;
        return ReturnCode.NEXT_ROW;
    }
    …

• Sent relevant section of a Field over network.
  E.g. Computing a best match section from within a document for a given query

public class DocFilter implements Filter {
    public void filterRow(List<KeyValue> kvL) {
        byte[] val = extractNeededPiece(kvL);
        kvL.clear();
        kvL.add(new KeyValue(row,fam,,val));
    }
    ….

• Sent relevant Fields of a Row over network.
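The effect of filtering on the server can be illustrated without HBase. The sketch below is a plain-Java stand-in, not the HBase Filter API: a map plays the role of rows held by one region server, and a predicate plays the role of the filter. All names and sample rows are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Counts the bytes that cross the network with and without a server-side filter. */
final class NearDataFiltering {
    /** Bytes shipped when every row is sent to the client for filtering there. */
    static long bytesShippedUnfiltered(Map<String, String> rows) {
        return bytesShipped(rows, value -> true);
    }

    /** Bytes shipped when the server applies the filter and sends only matching rows. */
    static long bytesShipped(Map<String, String> rows, Predicate<String> matches) {
        long total = 0;
        for (Map.Entry<String, String> row : rows.entrySet()) {
            if (matches.test(row.getValue())) {
                total += row.getKey().getBytes(StandardCharsets.UTF_8).length
                       + row.getValue().getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("doc1", "hadoop stores blocks in hdfs");
        rows.put("doc2", "hbase serves the index");
        rows.put("doc3", "lucene builds inverted indexes");

        System.out.println("Unfiltered: " + bytesShippedUnfiltered(rows) + " bytes");
        System.out.println("Filtered:   " + bytesShipped(rows, v -> v.contains("hbase")) + " bytes");
    }
}
```

The same idea extends to sending only selected fields, or only a section of a field, at the cost of doing more work on the server that holds the data.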


5. Byte block caching to lower network and I/O trips to HBase

• Object caching – As the number of objects grew, we encountered ‘Out of Memory’ exceptions.

• HBase commit – Frequent flushing to HBase introduced network and I/O latencies.

Converting objects to intermediate byte blocks increased the number of records processed in one batch by 20x.
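A minimal sketch of the byte-block idea: instead of keeping one Java object per record, fixed-width records are appended into a single growing byte array that can later be written out in one piece. The record shape (a 4-byte document id plus a 2-byte weight) and the `ByteBlock` class are assumptions for illustration, not HSearch's actual format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Packs fixed-width records into one byte array instead of one object per record. */
final class ByteBlock {
    private static final int RECORD_BYTES = Integer.BYTES + Short.BYTES; // docId + weight

    private byte[] data = new byte[RECORD_BYTES * 16];
    private int used = 0;

    /** Appends one record, doubling the backing array when it is full. */
    void append(int docId, short weight) {
        if (used + RECORD_BYTES > data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        ByteBuffer.wrap(data, used, RECORD_BYTES).putInt(docId).putShort(weight);
        used += RECORD_BYTES;
    }

    /** Number of records stored so far. */
    int size() {
        return used / RECORD_BYTES;
    }

    int docId(int index) {
        return ByteBuffer.wrap(data).getInt(index * RECORD_BYTES);
    }

    short weight(int index) {
        return ByteBuffer.wrap(data).getShort(index * RECORD_BYTES + Integer.BYTES);
    }

    /** The packed bytes, ready to be stored as a single cell value. */
    byte[] toBytes() {
        return Arrays.copyOf(data, used);
    }

    public static void main(String[] args) {
        ByteBlock block = new ByteBlock();
        for (int i = 0; i < 1000; i++) {
            block.append(i, (short) (i % 100));
        }
        System.out.println(block.size() + " records in " + block.toBytes().length + " bytes");
    }
}
```

Each record here costs 6 bytes with no per-object header, and a whole block can be committed in one write rather than one write per record.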


6. Configuration to balance Network vs. CPU vs. I/O vs. Memory

In a single machine

[Figure: Disk I/O, Network, Memory and CPU, with the tuning levers Block Caching, IPC Caching, Compression (shown twice) and Aggressive GC placed between them]


… and its settings

• Network – Increased IPC cache limits (hbase.client.scanner.caching)

• CPU – JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap")

• I/O – LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”)

• Memory – HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.
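To see why the scanner caching limit matters for the network, consider a rough estimate of round trips per scan. The sketch assumes each scanner call to the region server returns up to `hbase.client.scanner.caching` rows and ignores the extra calls for opening and closing a scanner; the row counts and caching values are hypothetical.

```java
/** Rough estimate of scanner round trips for a given caching setting. */
final class ScannerRoundTrips {
    /** Ceiling of rows / caching: one round trip fetches up to `caching` rows. */
    static long roundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be at least 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 10_000;
        System.out.println("caching=1:   " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=500: " + roundTrips(rows, 500) + " round trips");
    }
}
```

Larger values cut round trips but hold more rows in memory on both client and server, which is the balance this slide refers to.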


.. and parallelized to multi-machines

• HTable.batch (Get, Put, Deletes)
• ParallelHTable (Scans)
• FUTURE: coprocessors (HBase 0.92 release)

Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
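The parallel-scan idea can be sketched in plain Java. This is not the ParallelHTable implementation, which the slides do not show; it only illustrates splitting a row range into sub-ranges, scanning them on a thread pool and merging the results in order. A sorted map stands in for the table, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Scans sub-ranges of a sorted table in parallel and merges the results in key order. */
final class ParallelRangeScan {
    /** Sequential scan of [from, to). */
    static List<String> scan(NavigableMap<Integer, String> table, int from, int to) {
        return new ArrayList<>(table.subMap(from, true, to, false).values());
    }

    /** Splits [from, to) into up to `parts` sub-ranges and scans them concurrently. */
    static List<String> parallelScan(NavigableMap<Integer, String> table, int from, int to, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            int step = Math.max(1, (to - from + parts - 1) / parts);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = from; start < to; start += step) {
                final int lo = start;
                final int hi = Math.min(to, start + step);
                futures.add(pool.submit(() -> scan(table, lo, hi)));
            }
            // Futures are collected in range order, so the merged list stays sorted by key.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                try {
                    merged.addAll(future.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("sub-range scan failed", e);
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            table.put(i, "row-" + i);
        }
        System.out.println(parallelScan(table, 0, 100, 4).size() + " rows scanned");
    }
}
```

In a real cluster the sub-ranges would ideally follow region boundaries so that each worker talks to a different region server.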


HSearch Benchmarks on AWS

• 11 machines, each an Amazon Large instance with 7.5 GB memory and a single 7.5K SATA drive

• 100 million Wikipedia pages, 270 GB in total, completely indexed (including common stopwords)

• 10 million pages repeated 10 times (total indexing time was 5 hours)

• Search query response speed using a
– regular word is 1.5 sec
– common word such as “hill”: 1.6 million matches found and sorted in 7 seconds


www.sourceforge.net/bizosyshsearch

• Apache-licensed (2 versions released)

• Distributed, real-time search

• Supports XML documents with rich search syntax and various filtration criteria such as document type and field type.


References

• Initial performance reports (Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories, July 2010)

• HSearch is currently in use at http://www.10screens.com

• More on HSearch
– http://www.bizosys.com/blog/
– http://bizosyshsearch.sourceforge.net/

• More on SSD product technical specifications: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf