DESCRIPTION
http://www.meetup.com/PhillyDB/events/104465902 This presentation will provide an overview of HBase as well as MapR's M7 NoSQL database. We will begin with a discussion of the basic HBase architecture and the problems it solves. We will then discuss how MapR's M7, like M5, adds innovative features that provide tangible advantages to HBase users while maintaining API compatibility.
1 ©MapR Technologies
HBase and M7 Technical Overview
Keys Botzum, Senior Principal Technologist, MapR Technologies
March 2013
2 ©MapR Technologies
Agenda
HBase, MapR, M7, Containers
3 ©MapR Technologies
HBase
A sparse, distributed, persistent, indexed, and sorted map OR
A NoSQL database OR
A columnar data store
4 ©MapR Technologies
Key-Value Store
§ Row key – binary sortable value
§ Row content key (analogous to a column) – column family (string) – column qualifier (binary) – version/timestamp (number)
§ A row key, column family, column qualifier, and version uniquely identify a particular cell – a cell contains a single binary value (see the sketch below)
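To make the cell model concrete, here is a minimal sketch using the 2013-era (HBase 0.94-style) Java client. The table name, row key, column family, and values are hypothetical placeholders; the point is only how a cell is addressed by (row key, column family, qualifier, version).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinatesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // "mytable" is a placeholder

        // A cell is uniquely identified by (row key, column family, qualifier, version).
        byte[] rowKey    = Bytes.toBytes("user#42");       // binary, sortable
        byte[] family    = Bytes.toBytes("cf1");           // column family (string)
        byte[] qualifier = Bytes.toBytes("lastLogin");     // column qualifier (binary)
        long   version   = System.currentTimeMillis();     // version defaults to a timestamp

        Put put = new Put(rowKey);
        put.add(family, qualifier, version, Bytes.toBytes("2013-03-01"));  // one binary value per cell
        table.put(put);
        table.close();
    }
}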
5 ©MapR Technologies
A Row (diagram): a single row key maps to cells C0 … CN holding Value1 … ValueN; each cell is stored with its full coordinates (row key, column family, column qualifier, version) and a single value.
6 ©MapR Technologies
Not A Traditional RDBMS
§ Weakly typed and schema-less (unstructured or perhaps semi-structured) – almost everything is binary
§ No constraints – you can put any binary value in any cell – you can even put incompatible types in two different instances of the same column family:column qualifier
§ Columns (qualifiers) are created implicitly
§ Different rows can have different columns
§ No transactions/no ACID – the only unit of atomic operation is a single row
7 ©MapR Technologies
API
§ APIs for querying (get), scanning, and updating (put) – operate on row key, column family, qualifier, version, and values – can partially specify and will retrieve the union of results
• if you specify just the row key, you get all values for it (with column family, qualifier) – by default only the largest version (most recent, if a timestamp) is returned
• specifying row key and column family on a get will retrieve all values for that row and column family
– scanning is just a get over a range of row keys
§ Version – while it defaults to a timestamp, any integer is acceptable
(a get/scan sketch follows below)
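As a companion to the Put sketch above, here is a hedged example of the get/scan side with the same era's Java client; again the table, row keys, and column names are hypothetical, and the snippet only illustrates partial specification, versions, and scanning over a key range.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name

        // Get with only a row key: returns the newest version of every cell in the row.
        Get wholeRow = new Get(Bytes.toBytes("user#42"));
        Result r1 = table.get(wholeRow);

        // Get narrowed to one column family, asking for up to 3 versions instead of just the latest.
        Get oneFamily = new Get(Bytes.toBytes("user#42"));
        oneFamily.addFamily(Bytes.toBytes("cf1"));
        oneFamily.setMaxVersions(3);
        Result r2 = table.get(oneFamily);

        // Scan is just a get over a range of row keys (start inclusive, stop exclusive).
        Scan scan = new Scan(Bytes.toBytes("user#"), Bytes.toBytes("user$"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
    }
}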
8 ©MapR Technologies
Columnar
§ Rather than storing table rows linearly on disk, with each row as a single byte range of fixed-size fields, store the columns of a row separately – very efficient storage for sparse data sets (NULL is free) – compression works better on similar data – fetches of only subsets of a row are very efficient (less disk IO) – no fixed size on column values – no requirement to even define columns
§ Columns are grouped together into column families – basically a file on disk – a unit of optimization – in HBase, adding a column is implicit, adding a column family is explicit (see the sketch below)
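A short illustration of that last point, using the 0.94-era client; table and column names are hypothetical. In stock Apache HBase, adding a column family is an explicit schema change on the (disabled) table, while a new column qualifier appears simply by writing to it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnFamilyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Adding a *column family* is explicit: an admin/schema operation on the table.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable("mytable");                        // stock HBase needs the table offline
        admin.addColumn("mytable", new HColumnDescriptor("cf2"));
        admin.enableTable("mytable");
        admin.close();

        // Adding a *column* (qualifier) is implicit: just write to it.
        HTable table = new HTable(conf, "mytable");
        Put put = new Put(Bytes.toBytes("user#42"));
        put.add(Bytes.toBytes("cf2"), Bytes.toBytes("brandNewColumn"), Bytes.toBytes("value"));
        table.put(put);
        table.close();
    }
}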
9 ©MapR Technologies
HBase Table Architecture
§ Tables are divided into key ranges (regions)
§ Regions are served by nodes (RegionServers)
§ Columns are divided into access groups (column families)
(Diagram: a table laid out as regions R1–R4 across column families CF1–CF5.)
10 ©MapR Technologies
Storage Model Highlights
§ Data is stored in sorted order – a table contains rows – a sequence of rows is grouped together into a region
• A region consists of various files related to those rows and is loaded into a region server
• Regions are stored in HDFS for high availability – a single region server manages multiple regions
• Region assignment can change – load balancing, failures, etc.
§ Clients connect to tables – the HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server
§ At any given time exactly one region server provides access to a region – the Master and region servers (with ZooKeeper) manage that
11 ©MapR Technologies
What's Great About This?
§ Very scalable
§ Easy to add region servers
§ Easy to move regions around
§ Scans are efficient – unlike hashing-based models
§ Access via row key is very efficient – note: there are no secondary indexes
§ No schema, can store whatever you want when you want
§ Strong consistency
§ Integrated with Hadoop – MapReduce on HBase is straightforward – HDFS/MapR-FS provides data replication
12 ©MapR Technologies
Data Storage Architecture
§ Data from a region column family is stored in an HFile – an HFile contains row key:column qualifier:version:value entries – index at the end into the data – 64KB "blocks" by default
§ Update – the new value is written persistently to the Write Ahead Log (WAL) – cached in memory – when memory fills, write out a new HFile
§ Read – checks in memory, then all of the HFiles – read data is cached in memory
§ Delete – create a tombstone record (purged at major compaction)
(a toy sketch of this write/read path follows below)
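The slide above describes a log-structured write path. The following toy Java sketch is my own illustration, not MapR or HBase code: in-memory structures stand in for the WAL, the memstore, and flushed HFiles, which makes it easy to see why a read may have to consult every file.

import java.util.*;

// Toy model of the HBase write/read path: WAL append, in-memory buffer, flush to
// immutable sorted "files", and reads that check memory first and then every file.
public class ToyRegionStore {
    private final List<String> wal = new ArrayList<>();                      // stands in for the WAL
    private final TreeMap<String, String> memstore = new TreeMap<>();        // in-memory sorted buffer
    private final Deque<TreeMap<String, String>> hfiles = new ArrayDeque<>();// flushed "HFiles", newest first
    private final int flushThreshold;

    ToyRegionStore(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String key, String value) {
        wal.add(key + "=" + value);                       // 1. persist to the write-ahead log
        memstore.put(key, value);                         // 2. cache in memory
        if (memstore.size() >= flushThreshold) flush();   // 3. memory full -> write a new file
    }

    void flush() {
        hfiles.addFirst(new TreeMap<>(memstore));         // immutable sorted file, newest on top
        memstore.clear();
    }

    // Read amplification: with no hit in memory, every file may need to be examined.
    String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (TreeMap<String, String> file : hfiles) {     // newest to oldest
            if (file.containsKey(key)) return file.get(key);
        }
        return null;
    }
}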
13 ©MapR Technologies
Apache HBase HFile Structure
64KB blocks are compressed
An index into the compressed blocks is created as a B-tree
Key-value pairs are laid out in increasing order
Each cell is an individual key + value; a row repeats the key for each column
14 ©MapR Technologies
HBase Region Operation
§ Typical region size is a few GB, sometimes even 10G or 20G
§ The RS holds data in memory until full, then writes a new HFile – the logical view of the database is constructed by layering these files, with the latest on top
(Diagram: the key range represented by this region as a stack of HFiles, newest to oldest.)
15 ©MapR Technologies
HBase Read Amplification
§ When a get/scan comes in, all the files have to be examined – schema-less, so where is the column? – done in-memory and does not change what's on disk
• Bloom filters do not help in scans
(Diagram: stack of HFiles, newest to oldest.)
With 7 files, a 1K-record get() potentially takes about 30 seeks, 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
16 ©MapR Technologies
HBase Write Amplification
§ To reduce the read amplification, HBase merges the HFiles periodically – a process called compaction – runs automatically when there are too many files – usually turned off due to I/O storms which interfere with client access – and kicked off manually on weekends
A major compaction reads all files and merges them into a single HFile (a toy merge is sketched below)
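Continuing the toy model from the Data Storage Architecture slide, a major compaction is just a merge of every flushed file into one, with newer values winning. This sketch is illustrative only and glosses over tombstones and multiple versions.

import java.util.*;

public class ToyCompaction {
    // Merge all immutable files into a single file. Iterating oldest-to-newest means
    // newer values overwrite older ones; the whole stack is then replaced by one file.
    static Deque<TreeMap<String, String>> majorCompact(Deque<TreeMap<String, String>> hfiles) {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> oldestFirst = hfiles.descendingIterator();
        while (oldestFirst.hasNext()) {
            merged.putAll(oldestFirst.next());   // later (newer) files win on key collisions
        }
        Deque<TreeMap<String, String>> result = new ArrayDeque<>();
        result.addFirst(merged);
        return result;
    }
}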
17 ©MapR Technologies
HBase Server Architecture (diagram): the client coordinates lookups through ZooKeeper and the HBase Master, and reads/writes data through an HBase RegionServer; the RegionServer stores its HFiles and WAL via an HDFS server on top of the Linux filesystem.
18 ©MapR Technologies
WAL File
§ A persistent record of every update/insert in sequence order – shared by all regions on one region server – WAL files are periodically rolled to limit size but older WALs are still needed – a WAL file is no longer needed once every region with updates in that WAL file has flushed those from memory to an HFile
• Remember that more HFiles slow the read path!
§ Must be replayed as part of the recovery process since in-memory updates are "lost" – this is very expensive and delays bringing a region back online
19 ©MapR Technologies
What's Not So Good
Reliability
• Complex coordination between ZK, HDFS, HBase Master, and Region Server during region movement
• Compactions disrupt operations
• Very slow crash recovery because of
• coordination complexity
• WAL log reading (one log/server)
Business continuity
• Many administrative actions require downtime
• Not well integrated into MapR-FS mirroring and snapshot functionality
20 ©MapR Technologies
What's Not So Good
Performance
• Very long read/write path
• Significant read and write amplification
• Multiple JVMs in the read/write path – GC delays!
Manageability
• Compactions, splits and merges must be done manually (in reality)
• Lots of "well known" problems maintaining a reliable cluster – splitting, compactions, region assignment, etc.
• Practical limits on the number of regions per region server and the size of regions – can make it hard to fully utilize hardware
21 ©MapR Technologies
Region Assignment in Apache HBase
22 ©MapR Technologies
Apache HBase on MapR
Limited data management, data protection and disaster recovery for tables.
23 ©MapR Technologies
Agenda
HBase, MapR, M7, Containers
24 ©MapR Technologies
MapR
A provider of enterprise grade Hadoop with uniquely differentiated features
25 ©MapR Technologies
MapR: The Enterprise Grade Distribution
26 ©MapR Technologies
One Platform for Big Data (diagram): a broad range of applications (recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting) running batch, interactive, real-time, and stream-processing workloads (MapReduce, file-based applications, SQL, database, search) on one platform with 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, and multi-tenancy.
27 ©MapR Technologies
Dependable: Lights Out Data Center Ready
Reliable Compute
§ Automated stateful failover
§ Automated re-replication
§ Self-healing from HW and SW failures
§ Load balancing
§ No lost jobs or data
§ 99999's of uptime
Dependable Storage
§ Business continuity with snapshots and mirrors
§ Recover to a point in time
§ End-to-end checksumming
§ Strong consistency
§ Data safe
§ Mirror across sites to meet Recovery Time Objectives
28 ©MapR Technologies
Fast: World Record Performance
Benchmark                                        MapR 2.1.1       CDH 4.1.1        MapR Speed Increase
Terasort (1x replication, compression disabled)
  Total                                          13m 35s          26m 6s           2X
  Map                                            7m 58s           21m 8s           3X
  Reduce                                         13m 32s          23m 37s          1.8X
DFSIO throughput/node
  Read                                           1003 MB/s        656 MB/s         1.5X
  Write                                          924 MB/s         654 MB/s         1.4X
YCSB (50% read, 50% update)
  Throughput                                     36,584.4 op/s    12,500.5 op/s    2.9X
  Runtime                                        3.80 hr          11.11 hr         2.9X
YCSB (95% read, 5% update)
  Throughput                                     24,704.3 op/s    10,776.4 op/s    2.3X
  Runtime                                        0.56 hr          1.29 hr          2.3X
Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE
MinuteSort Record: 1.5 TB in 60 seconds on 2103 nodes
29 ©MapR Technologies
The Cloud Leaders Pick MapR
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and # of clusters
30 ©MapR Technologies
MapR Supports Broad Set of Customers
§ Log analysis § HBase
§ Customer targeting § Social media analysis
§ Customer Revenue Analytics
§ ETL Offload
§ Advertising exchange analysis and optimization
§ Clickstream Analysis § Quality profiling/field failure analysis
§ Enterprise Grade Platform
§ COOP features
§ Monitoring and measuring online behavior
§ Fraud Detection § Channel analytics
§ Recommendation Engine § Fraud detection and Prevention
§ Customer Behavior Analysis § Brand Monitoring
§ Customer targeting § Viewer Behavioral analytics
§ Recommendation Engine § Family tree connections
§ Intrusion detection & prevention § Forensic analysis
§ Global threat analytics
§ Virus analysis
§ Patient care monitoring
Leading Retailer, Global Credit Card Issuer
31 ©MapR Technologies
MapR Editions
§ Control System § NFS Access § Performance § High Availability § Snapshots & Mirroring § 24 x 7 Support § Annual Subscription
§ Control System § NFS Access § Performance § Unlimited Nodes § Free
Also available through: Compute Engine
§ All the Features of M5 § Simplified Administration for HBase § Increased Performance § Consistent Low Latency § Unified Snapshots, Mirroring
32 ©MapR Technologies
Agenda
HBase, MapR, M7, Containers
33 ©MapR Technologies
M7
An integrated system for unstructured and structured data
34 ©MapR Technologies
Introducing MapR M7
§ An integrated system – unified namespace for files and tables – built-in data management & protection – no extra administration
§ Architected for reliability and performance – fewer layers – single hop to data – no compactions, low I/O amplification – seamless splits, automatic merges – instant recovery
35 ©MapR Technologies
Binary Compatible with HBase APIs
§ HBase applications work "as is" with M7 – no need to recompile (binary compatible)
§ Can run M7 and HBase side-by-side on the same cluster – e.g., during a migration – can access both an M7 table and an HBase table in the same program
§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice-versa:
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
36 ©MapR Technologies
M7: Remove Layers, Simplify
MapR M7
Take note! No JVM!
37 ©MapR Technologies
M7: No Master and No RegionServers
No extra daemons to manage
One hop to data. Unified cache.
No JVM problems
38 ©MapR Technologies
Region Assignment in Apache HBase: none of this complexity is present in MapR M7
39 ©MapR Technologies
Unified Namespace for Files and Tables
$ pwd
/mapr/default/user/dave
$ ls
file1 file2 table1 table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1 file2 table1 table2 table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:38 /user/dave/table3
40 ©MapR Technologies
Tables for End Users
§ Users can create and manage their own tables – unlimited # of tables
§ Tables can be created in any directory – tables count towards volume and user quotas
§ No admin intervention needed – I can create a file or a directory without opening a ticket with the admin team, why not a table? – do stuff on the fly, no stop/restart of servers
§ Automatic data protection and disaster recovery – users can recover from snapshots/mirrors on their own
41 ©MapR Technologies
M7 – An Integrated System
42 ©MapR Technologies
M7 Comparative Analysis with Apache HBase, LevelDB and a BTree
43 ©MapR Technologies
HBase Write Amplification Analysis
§ Assume 10G per region, write 10% per day, grow 10% per week – 1G of writes per day – after 7 days, 7 files of 1G and 1 file of 10G (only 1G is growth)
§ IO cost – wrote 7G to the WAL + 7G to HFiles – compaction adds still more
• read: 17G (= 7 x 1G + 1 x 10G) • write: 11G written to the new HFile
– WAF – wrote 7G "for real" but actual disk IO after compaction is read 17G + write 25G, and that's assuming no application reads!
§ IO cost of 1000 regions is similar to the above – read 17T, write 25T → major impact on the node
§ Best practice is to limit the # of regions per node → can't fully utilize storage
(the arithmetic is spelled out below)
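To spell out the slide's arithmetic, here is a small illustrative calculation; it simply restates the numbers above under the slide's assumptions (a 10G region, 1G of incoming writes per day, flushed daily for 7 days, then one major compaction).

public class WriteAmplificationEstimate {
    public static void main(String[] args) {
        // Assumptions taken from the slide.
        double walWrites       = 7 * 1.0;         // 7G appended to the WAL
        double hfileWrites     = 7 * 1.0;         // 7G written out as seven 1G HFiles
        double compactionRead  = 7 * 1.0 + 10.0;  // compaction reads the 7 new HFiles + the old 10G HFile = 17G
        double compactionWrite = 11.0;            // and writes one new ~11G HFile

        double totalWrite = walWrites + hfileWrites + compactionWrite;  // 25G written
        double totalRead  = compactionRead;                             // 17G read
        double realData   = 7.0;                                        // only 7G of application writes

        System.out.printf("disk writes: %.0fG, disk reads: %.0fG for %.0fG of real writes%n",
                totalWrite, totalRead, realData);
        System.out.printf("write amplification factor: ~%.1fx%n", totalWrite / realData);  // ~3.6x
    }
}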
44 ©MapR Technologies
Alternative: LevelDB
§ Tiered, logarithmic increase – L1: 2 x 1M files – L2: 10 x 1M – L3: 100 x 1M – L4: 1,000 x 1M, etc.
§ Compaction overhead – avoids IO storms (I/O done in smaller increments of ~10M) – but significantly more bandwidth compared to HBase
§ Read overhead is still high – 10-15 seeks, perhaps more if the lowest level is very large – 40K-60K read from disk to retrieve a 1K record
45 ©MapR Technologies
BTree analysis
§ Read finds data directly, proven to be fastest – interior nodes only hold keys – very large branching factor – values only at leaves – thus index caches work – R = logN seeks, if no caching – a 1K record read will transfer about logN blocks from disk
§ Writes are slow on inserts – inserted into the correct place right away – otherwise a read will not find it – requires the btree to be continuously rebalanced – causes extreme random I/O in the insert path – W = 2.5x + logN seeks if no caching
46 ©MapR Technologies
Log-Structured Merge Trees
§ LSM trees reduce insert cost by deferring and batching index changes – if you don't compact often, read perf is impacted – if you compact too often, write perf is impacted
§ B-trees are great for reads – but expensive to update in real-time
(Diagram: an index and log in memory and an index on disk, with the write and read paths.)
Can we combine both ideas? Writes cannot be done better than W = 2.5x: write to log + write data to somewhere + update meta-data
47 ©MapR Technologies
M7 from MapR
§ Twisting BTrees – leaves are variable size (8K-8M or larger) – can stay unbalanced for long periods of time
• more inserts will balance it eventually • automatically throttles updates to interior btree nodes
– M7 inserts "close to" where the data is supposed to go
§ Reads – uses the BTree structure to get "close" very fast
• very high branching with key-prefix-compression – utilizes a separate lower-level index to find it exactly
• updated "in-place" bloom-filters for gets, range-maps for scans
§ Overhead – a 1K record read will transfer about 32K from disk in logN seeks
48 ©MapR Technologies
M7 provides Instant Recovery
§ Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region
§ 0-40 microWALs per region – idle WALs are "compacted", so most are empty – the region is up before all microWALs are recovered – recovers the region in the background in parallel – when a key is accessed, that microWAL is recovered inline – 1000-10000x faster recovery
§ Never perform the equivalent of an HBase major or minor compaction
§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS – no limit to the # of files on disk – no limit to the # of open files – the I/O path translates random writes to sequential writes on disk
49 ©MapR Technologies
Summary
                        1K record read amplification   Compaction                        Recovery
HBase with 7 HFiles     30 seeks, 130K xfer             IO storms, good bandwidth         Huge WAL to recover
HBase with 3 HFiles     15 seeks, 70K xfer              IO storms, high bandwidth         Huge WAL to recover
LevelDB with 5 levels   13 seeks, 48K xfer              No I/O storms, very high b/w      WAL is tiny
BTree                   logN seeks, logN xfer           No I/O storms but 100% random     WAL proportional to concurrency + cache
MapR M7                 logN seeks, 32K xfer            No I/O storms, low bandwidth      microWALs allow recovery < 100ms
50 ©MapR Technologies
M7: Fileservers Serve Regions
§ A region lives entirely inside a container – does not coordinate through ZooKeeper
§ Containers support distributed transactions – with replication built in
§ The only coordination in the system is for splits – between the region-map and the data-container – we already solved this problem for files and their chunks
51 ©MapR Technologies
Agenda
HBase, MapR, M7, Containers
52 ©MapR Technologies
What's a MapR container?
53 ©MapR Technologies
MapR's Containers
§ Files/directories are sharded into blocks, and placed in containers on disks
§ Containers are ~32 GB segments of disk, placed on nodes
§ Each container contains: directories & files, data blocks, BTrees
§ 100% random writes
Patent Pending
54 ©MapR Technologies
M7 Containers
§ A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable
§ A container is replicated to servers – the unit of resynchronization
§ A region lives entirely inside 1 container – all files + WALs + btrees + bloom-filters + range-maps
55 ©MapR Technologies
Read-write Replication
§ Writes are synchronous – all copies have the same data
§ Data is replicated in a "chain" fashion – better bandwidth, utilizes full-duplex network links well
§ Meta-data is replicated in a "star" manner – response time is better, bandwidth not of concern – data can also be done this way
(Diagram: clients client1 … clientN writing through a replication chain.)
56 ©MapR Technologies
Random Writing in MapR (diagram): a client writing data asks the CLDB for a 64M block; the CLDB creates a container and picks a master and 2 replica slaves among servers S1-S5; the client then writes the next chunk to the selected server (e.g., S2) and the chunk is attached to the container's replicas (e.g., S2, S3, S5).
57 ©MapR Technologies
Container Balancing
• Servers keep a bunch of containers "ready to go".
• Writes get distributed around the cluster.
§ As data size increases, writes spread more, like dropping a pebble in a pond
§ Larger pebbles spread the ripples farther
§ Space is balanced by moving idle containers
58 ©MapR Technologies
Failure Handling
§ HB loss + upstream entity reports failure => server dead
§ Increment epoch at the CLDB
§ Rearrange the replication path
§ Exact same code for files and M7 tables
§ No ZK
Containers are managed at the CLDB (heartbeats, container-reports). Container Location DataBase (CLDB).
59 ©MapR Technologies
Architectural Params
§ Unit of I/O – 4K/8K (8K in MapR)
§ Unit of chunking (a map-reduce split) – 10-100's of megabytes
§ Unit of resync (a replica) – 10-100's of gigabytes – a container in MapR
§ Unit of administration (snap, repl, mirror, quota, backup) – 1 gigabyte to 1000's of terabytes – a volume in MapR – what data is affected by my missing blocks?
(Diagram: a scale of sizes roughly 10^3 apart: i/o → map-reduce split → resync → admin, with the HDFS 'block' marked on the scale.)
60 ©MapR Technologies
Other M7 Features
§ Smaller disk footprint – M7 never repeats the key or column name
§ Columnar layout – M7 supports 64 column families – in-memory column families
§ Online admin – M7 schema changes on the fly – delete/rename/redistribute tables
61 ©MapR Technologies
Thank you!
Questions?
62 ©MapR Technologies
Examples: Reliability Issues
§ Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)
§ Very slow crash recovery: a RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)
§ Unreliable splitting: region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
§ No client throttling: the HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
63 ©MapR Technologies
Examples: Business Continuity Issues
§ No Snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)
§ Complex Backup Process: complex, unreliable and inefficient. (http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html)
§ Administration Requires Downtime: the entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)
64 ©MapR Technologies
Examples: Performance Issues
§ Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
§ Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from the local drives. (HBASE-4755, HBASE-4491)
§ Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)
§ Limited # of tables: a single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
65 ©MapR Technologies
Examples: Manageability Issues
§ Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)
§ Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
§ Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
§ Basic administration is complex: renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)