Zero-downtime Hadoop/HBase Cross-datacenter Migration
Scott Miao & Dumbo team members, SPN, Trend Micro
Sep. 19, 2015
Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Worked on the Hadoop ecosystem since 2011
• Expertise in HDFS/MR/HBase
• Contributor to HBase/HDFS
• Speaker at HBaseCon 2014
• @takeshi.miao
Our blog 'Dumbo in TW': http://dumbointaiwan.blogspot.tw/
HBaseCon 2014 sharing:
http://www.slideshare.net/HBaseCon/case-studies-session-6
https://vimeo.com/99679688
Agenda
• What problems we suffered
• IDC migration
• Zero downtime migration
• Wrap up
What problems did we suffer?
#1: Insufficient network bandwidth
Old IDC Layout
● ● ●
POD Core Switch
TOR Switch
41U rack 41U rackPOD
1Gb
1Gb
20 Gb
POD
● ● ●
Up stream devices
HD NNcpu: 8coresmem: 72GBDisk: 4TB
HD DNcpu: 12coresmem: 128GBdisk: 6TB
Other services
12 Gb usageHadoop + services
network traffic
No physical space
Core Switch
Since 2008
x n
x 2 x n
Devices view Servers view
#2: Insufficient data storage capacity
Est. Data Growth
• ~2x data growth
GAME OVER
http://www.space.com/19786-cosmic-rays-origins-star-explosion.html
http://www.305startup.net/creative-new-business-ideas-2015/
What are our options?
• Enhance the old IDC
  – Replace the 1Gb network topology with 10Gb
  – Rearrange server locations
  – Any chance of getting more physical space?
• Migrate to a new IDC
  – 10Gb network topology
  – Server locations well defined
  – More physical space
What are our options?
• Migrate to a public cloud
  – Provision on demand
    • Instance types (NIC/CPU/Mem/Disk) and amounts
  – Pay as you go
  – Need to optimize our existing services
Migrate to the new IDC!
http://gdimitriou.eu/?m=200912
IDC Migration
Recap…
Insufficient network bandwidth and data storage capacity
New IDC Layout
[Diagram: devices view and servers view of the new IDC; up-stream devices, core switches (x2) with 160Gb core links, 40Gb POD links, TOR switches with 10Gb links down to 41U racks in the SPN Hadoop POD]
• HD NN: 16 cores, 128GB mem, 10TB disk
• HD DN: 24 cores, 196GB mem, 72TB disk
• Network traffic becomes far less of a problem
• 2~3x total data storage capacity in terms of our data growth
• Room to grow up to 14 racks
Now what? Don't forget our beloved elephant~
YARN
https://gigaom.com/2013/10/25/cloudera-ceo-were-taking-the-high-profit-road-in-hadoop/
http://www.pragsis.com/blog/how_install_hadoop_3_commands
YARN abstracts the computing frameworks from Hadoop
http://hortonworks.com/hadoop/yarn/
So we are not only doing a migration, but also an upgrade
TMH6 vs. TMH7
Project   | TMH6         | TMH7   | Highlights
Hadoop    | 2.0.0 (MRv1) | 2.6.0  | YARN + MRv2; YARN + ???
HBase     | 0.94.2       | 0.98.5 | MTTR impr.; Stripe Comp.
Zookeeper | 3.4.5        | 3.4.6  |
Pig       | 0.10.0       | 0.14.0 | Pig on Tez
Sqoop1    | 1.4.2        | 1.4.5  |
Oozie     | 4.0.1        | 4.0.1  |
JVM       | Java6        | Java7  | G1GC support
How do we test our TMH7? How do our services port to and test with TMH7?
Apache Bigtop PMC member Evans Ye comes to the rescue in the next session
Something about HW
• CPU
  – More cores
• Memory
  – More memory
• Disk
  – Storage capacity
• Network
  – 10Gb
  – Topology
• # of nodes per rack
  – Do a PoC
http://www.desktopwallpapers4.me/computers/hardware-28528/
Migration + Upgrade
• Option A: span two IDCs -> upgrade -> phase out the old one
[Diagram: old IDC and new IDC bridged by a 20Gb link, running as one spanned cluster]
Migration + Upgrade
• Option B: build the new one -> migrate -> phase out the old one
[Diagram: old IDC and new IDC joined by a 20Gb link; 1. build the new one, 2. migrate, 3. phase out the old one]
Are we done? We're not even in the game!
SLA for PROD Services
Various data access patterns
Zero downtime migration
Zero downtime?
http://www.whatdegreewhichuniversity.com/Student-Housing/Moving-out-of-home-in-2013.aspx
Data Access Pattern Analysis: Hadoop/HDFS/MR
[Diagram: one IDC; the Internet feeds data sourcing services and log collectors, which push through message queues and file compactors into the Hadoop cluster, with application services on the data-out side]
1. New files put to HDFS (every few minutes)
2. Process files with Pig/MR (hourly/daily) back to HDFS
3. Get result files from HDFS and do further processing
4. Serve user requests
Data access patterns for Hadoop/HDFS/MR
• Data in
  – New files put in every couple of minutes
• Computation
  – Process data hourly or daily
• Data out
  – Result files fetched by services for further processing
Categorize Data
• Hot data
  – Files ingested within minutes
    • New data files put into Hadoop continuously
    • Digested by Pig/MR for services hourly or daily
  – Needed history data files
    • Usually within a couple of months
  – Sync data by
    • Replicated streaming data ingestion (message queues + file compactors)
    • distcp, run every few minutes (see the sketch below)
• Cold data
  – All data except hot
    • Time span of a couple of years
    • For monthly/quarterly/yearly report purposes
    • Ad-hoc queries
  – Copy data by
    • distcp: run it once and leave it alone
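Below is a minimal sketch of that "distcp every few minutes" hot-data job, assuming hypothetical NameNode hostnames and paths; across different Hadoop versions the source is typically read over webhdfs://.

#!/bin/bash
# hot-data-sync.sh -- sketch of the periodic hot-data distcp job
# (hostnames and paths are hypothetical placeholders)
SRC="webhdfs://tmh6-nn.example.com:50070/user/SPN/hot-data"
DST="hdfs://tmh7-nn.example.com:8020/user/SPN/hot-data"
# -update copies only new/changed files; -skipcrccheck avoids checksum
# mismatches between different HDFS versions
hadoop distcp -update -skipcrccheck "$SRC" "$DST"

# crontab entry: run the sync every 5 minutes
# */5 * * * * /opt/spn/hot-data-sync.sh >> /var/log/hot-data-sync.log 2>&1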
Kerberos federation among our clusters
• Please wait for our next session
  – Multi-Cluster Live Synchronization with Kerberos Federated Hadoop, by Mammi Chang, Dumbo team
[Diagram: TMH6 stg and TMH6 prod in the old IDC, federated with TMH7 stg and TMH7 prod in the new IDC]
Zero downtime migration for Hadoop/HDFS/MR
[Diagram: old IDC (Hadoop TMH6, Old Service 1, Old Service 2, log collectors, message queues, file compactors) and new IDC (Hadoop TMH7, ported Old Service 1', New Service 1, log collectors, message queues, file compactors) joined by a 20Gb link; hot data is synced through both the replicated ingestion path and distcp, cold data is copied once]
Need services' cooperation
• From the services' point of view there is no downtime
• Latency for hot data sync
  – May introduce latency of a few minutes
  – Because the distcp cron job runs every couple of minutes
• Services need to
  – Adjust their jobs to delay a couple of minutes before running
Seems pretty! So are we done?
Don’t forget our HBase XD
Data Access Pattern Analysis: HBase
[Diagram: the same one-IDC data flow as before, with HBase on the serving path]
1. New files put to HDFS (every few minutes)
2. Process files with Pig/MR (hourly/daily) into HBase
3. Random reads from HBase
4. Serve user requests
5. Random writes to HBase
Data access patterns for HBase
• Data in
  – Random writes to HBase
  – Process/write data hourly or daily
• Data out
  – Random reads from HBase
Considerations for HBase data sync
• What do we want?
  – All HBase data synced between the old and new clusters
  – Clean up undersized regions (region merge)
    • Rowkey: '<key>-<timestamp>'
    • hbase.hregion.max.filesize raised from 1GB to 4GB (see the config sketch below)
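As a sketch, the region-size change is a single property in hbase-site.xml on the new cluster; the value is in bytes.

<!-- hbase-site.xml on TMH7: raise the max region size to 4GB so merged regions stay merged -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4294967296</value>
</property>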
Considerations for HBase data sync
• Incompatible changes between the old & new HBase versions
  – API binary incompatible
  – HDFS-level folder structure changed
  – HDFS-level metadata file format changed
    • Does not include HFileV2
Tools for HBase data sync

Tool                | Impl. tech.                  | API compatible | Service impact                      | Data chunk boundary
CopyTable           | API client call              |                |                                     |
Cluster Replication | API client call              |                |                                     |
Completebulkload    | HFile                        |                | Need to pend writes and flush table | Based on when writes are pended
Export/Import       | SequenceFile + KeyValue + MR |                | Set start/end timestamps            | Based on the previous end timestamp

(an Export/Import sketch follows the reference below)
http://hbase.apache.org/book.html#tools
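For example, the Export/Import row of the table maps onto the stock MapReduce tools; the table name, output path, and timestamps below are placeholders.

# on TMH6: export one time-bounded chunk (versions=1, then start/end timestamps)
hbase org.apache.hadoop.hbase.mapreduce.Export '<table-name>' /tmp/export/<table-name> 1 <start-timestamp> <end-timestamp>

# on TMH7, after copying the exported files over: import them
hbase org.apache.hadoop.hbase.mapreduce.Import '<table-name>' /tmp/export/<table-name>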
Support tools for HBase sync
• Pre-splits generator
  – Run on TMH6
  – Deals with the region merge issue
  – Generates a pre-splits rowkey file
  – Create the new HTable on TMH7 with this file

gen-htable-presplits.sh /user/SPN-hbase/<table-name>/ <region-size-bytes> <threshold> > /tmp/<table-name>-splits.txt

hbase shell
create '<table-name>', '<column-family-1>', SPLITS_FILE => '/tmp/<table-name>-splits.txt'
Support tools for HBase sync
• RowCount with time range
  – Supported on both TMH6 & TMH7
  – Used to check imported data
  – Not officially supported; we enhanced the stock one to make our own

rowCounter.sh <table-name> --time-range=<start-timestamp>,<end-timestamp>
# ...
com.trendmicro.spn.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
  ROWS=10892133
File Input Format Counters
  Bytes Read=0
File Output Format Counters
  Bytes Written=0
Support tools for HBase sync
• Snapshot
  – On TMH7
  – Taken each time an imported-data check passes
  – Roll back to the previous snapshot if a data check fails (see the sketch below)

hbase shell
snapshot '<table-name>', '<table-name>-<start-timestamp>-<end-timestamp>'
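A sketch of that rollback path; note that a table must be disabled before a snapshot can be restored.

hbase shell
disable '<table-name>'
restore_snapshot '<table-name>-<start-timestamp>-<end-timestamp>'
enable '<table-name>'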
Support tools for HBase sync
• DateTime <-> Timestamp

# get the current Java timestamp (long)
date +%s%N | cut -b1-13
# get the current hour's Java timestamp (long)
date --date="$(date +'%Y%m%d %H:00:00')" +%s%N | cut -b1-13
# get the previous hour's Java timestamp (long)
date --date="$(date --date='1 hour ago' +'%Y%m%d %H:00:00')" +%s%N | cut -b1-13
# timestamp to date (must be 10 digits, from left to right)
date -d '@1436336202'
Zero downtime migration for HBase
[Diagram: old IDC (Hadoop/HBase TMH6, staging and prod, ServiceA, ServiceB) and new IDC (Hadoop/HBase TMH7, staging and prod, ServiceB); the numbered steps below run across the two, as sketched after this list]
1. Confirm KV timestamp with ServiceB
2. Export data to HDFS with timestamp
3. Generate splits file
4. distcp data to TMH7
5. Create HTable with splits
6. Import data into HTable
7. Verify data by rowcount with timestamp
8. Create snapshot
9, 11. Sync data through steps #2~8 (skipping 3 and 5)
10. ServiceB staging test starts
12. Grant 'RW' on the HTable to ServiceB
13. Install ServiceB in the new IDC
14. Start ServiceB in the new IDC
15. Done
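A rough sketch of one sync pass (steps 2~8) for a single table; the hostnames are hypothetical, rowCounter.sh is the support tool shown above, and each command must run on the cluster noted in its comment.

#!/bin/bash
# one-sync-pass.sh <table> <start-ts> <end-ts> -- sketch of steps 2~8
TABLE=$1; START=$2; END=$3
DIR="/tmp/export/$TABLE-$START-$END"
# 2. on TMH6: export the agreed timestamp range
hbase org.apache.hadoop.hbase.mapreduce.Export "$TABLE" "$DIR" 1 "$START" "$END"
# 4. copy the exported chunk over to TMH7
hadoop distcp -update "webhdfs://tmh6-nn:50070$DIR" "hdfs://tmh7-nn:8020$DIR"
# 6. on TMH7: import into the pre-split HTable
hbase org.apache.hadoop.hbase.mapreduce.Import "$TABLE" "$DIR"
# 7. verify with the time-ranged row counter, then 8. take a snapshot
rowCounter.sh "$TABLE" --time-range="$START,$END"
echo "snapshot '$TABLE', '$TABLE-$START-$END'" | hbase shell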
Need services' cooperation
• There will still be a small data gap
  – It may be minutes
• Is it sensitive to services?
  – If not: wait for our final data sync
  – If it is: services need to direct their writes to both clusters
• Data gap: sync data to the HTable -> service starts up and runs -> final data sync to the HTable
Wrap up
Wrap up
• Analyze access patterns
  – Batch? Real time? Streaming?
  – Cold data? Hot data?
• Keep it simple!
  – Use native utilities as far as you can
• Rehearse! Rehearse! Rehearse!
• Communicate with your users closely
One day…
"How is your migration going?"
"I'm done migrating!" (我migrate完了!)
"I migrated… and now I'm done for." (我migrate,完了)
Heed this talk and be blessed! (有聽有保庇!)
Q & A
Thank You
Backups
What items need to be taken care of
• CPU
  – Use more cores
    • One MR task process uses 1 CPU core
    • Single-core clock rates do not increase much anymore
  – Do the math to compare CPU cores between old and new (see the sketch below):

(cores-per-old-machine * amount-of-machines * increase-percent) / cores-per-new-machine = amount-of-new-machines

e.g. going from 8-core machines to 24-core machines, with 1.5x higher capacity:
(8 * 10 * 150%) / 24 = 120 / 24 =~ 5

P.S. Consider enabling hyper-threading[1]: the # of cores doubles, but 1/3 of the doubled cores needs to be kept for the OS.

1. Hortonworks, Corp., Apache Hadoop Cluster Configuration Guide, 2013 Apr., p. 15.
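The same math as a quick shell sketch; the numbers come from the example above, and the integer arithmetic rounds up.

#!/bin/bash
# cpu-sizing.sh -- sketch of the core-count sizing formula above
OLD_CORES=8; MACHINES=10; INCREASE_PCT=150; NEW_CORES=24
TOTAL=$(( OLD_CORES * MACHINES * INCREASE_PCT / 100 ))   # 120 cores needed
# round up: (total + new_cores - 1) / new_cores
echo "new machines needed: $(( (TOTAL + NEW_CORES - 1) / NEW_CORES ))"   # => 5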
What items need to be taken care of
• Memory
  – Total memory much higher than in our old cluster
  – Consider next-gen computing frameworks

((per-slot-gigabytes * total-slots + hbase-heap-gigabytes) * 120%-os-mem * increase-percent) / mem-per-new-machine = amount-of-new-machines

e.g. 8 slots with 2GB each per old machine, 80 slots total:
(((2GB * 80 + 8GB) * 120%) * 300%) / 192GB = (168GB * 120% * 300%) / 192GB =~ 4
What items need to be taken care of
• Disk
  – 2~3x storage capacity to hold our BIG data
  – Hot-swap support
  – One disk/partition serves 2~3 processes (MR tasks):

total-cores / (disks-per-new-machine * amount-of-new-machines) = amount-of-processes-per-disk
e.g. with 120 total cores: 120 / (12 * 5) =~ 2

• Network
  – Network topology changed (as shown previously)
  – 10Gb NICs for Hadoop nodes
What items need to be taken care of
• Rack
  – Power consumption & cooling
  – One rack can support 15 of our Hadoop nodes, instead of 20
  – Ask your HW vendor for a PoC !!
    • Transactional workload (heavy IO load)
    • Computation workload (100% CPU workload)
    • Memory-intensive workload (full memory usage)
• New Hadoop TMH7
  – Build the new one first -> migrate -> phase out the old one
Need services' cooperation
• Services need to port their code to TMH7
• We released a dev env (all-in-one Hadoop) for services to test in advance
  – VMware image (OVF)
  – Vagrant box
  – Docker image (a hypothetical example follows below)
• A Jira project for users to submit issues, if any
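For instance, the Docker flavor could be pulled up like this; the image name and port mappings are purely illustrative placeholders, not the actual published image.

# hypothetical image name and ports; substitute the released image
docker run -d --name tmh7-dev \
  -p 8020:8020 -p 8088:8088 -p 2181:2181 \
  spn/tmh7-allinone:latest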