HADOOP, FROM LAB TO 24/7 PRODUCTION
http://criteolabs.com/jobs
Jean-Baptiste NOTE
Ana DIN
From the Criteo HPC Team (+ Loïc / Serge / Maxime / Samuel / Yann / Stuart)
ABOUT US
CRITEO ?
6 DATA CENTERS, 4 CONTINENTS. 120 BILLION REQUESTS/DAY*.
* EVERY DAY CRITEO IS CALLED MORE THAN 100 BILLION TIMES BY ADVERTISERS AND PUBLISHERS
54 OPEN POSITIONS IN PARIS’ R&D
http://criteolabs.com/jobs
« Anything that can go wrong, will go wrong » -- Murphy’s Law
TALES OF A TECHNOLOGY ADOPTION
Usage of Hadoop is growing exponentially
• Learning curve is real
• Analysts discover interesting things with raw data
– Which causes them to ask more questions
• Increased insight leads to a better product
– Which leads to more data
• Data gains in value and more is kept (and studied!)
• YOU (the admin) are the bottleneck!
USAGE GROWTH
• Administration automation
• Hadoop configuration tuning
• Network
• Multitenancy
TOPICS
ADMINISTRATION AUTOMATION
Rack and load!
• Machine is racked, cabled and provisioned for a role
• Chef is our one-stop shop for automation
• Diskless system install
AUTOMATING DEPLOYMENTS
INSTA-CLUSTER!
• Learn from the past
• Previous cluster: 1.5 years of operation
• 78% failure rate on /dev/sda at restart
• Disk usage symmetry
• Guaranteed statelessness
OS DISKLESS : WHY
• PXE boot on a custom CentOS image
• Automated Chef bootstrap
• Everything done by Chef
– Inventory
– Firmware updates
– OS / Service deployment
OS DISKLESS : HOW
• Evolutive maintenance (version bumps)
• Not much to do in normal operations
• Most frequent issue is a flaky / slow-performing host
• Use Preprod / Prod for infra changes
• Progressive rollout vs. blackout
MAINTENANCE
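The "progressive" option can be sketched as a rolling plan that touches only a few hosts at a time. This is a minimal sketch, not the team's actual tooling: the function name, the batch size, and the choice to never mix racks within a batch are all assumptions made for illustration.

```python
def rolling_batches(hosts_by_rack, batch_size):
    """Yield (rack, hosts) batches for a progressive rollout.

    Assumption (not from the talk): a batch never spans two racks, so a
    bad change only ever affects hosts of a single rack at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for rack in sorted(hosts_by_rack):
        hosts = sorted(hosts_by_rack[rack])
        for start in range(0, len(hosts), batch_size):
            yield rack, hosts[start:start + batch_size]


# Hypothetical inventory, for illustration only.
inventory = {
    "/dc1/rack2": ["node-c", "node-d", "node-e"],
    "/dc1/rack1": ["node-a", "node-b"],
}
plan = list(rolling_batches(inventory, batch_size=2))
```

A "blackout" is the degenerate case where every host lands in one batch; the progressive plan trades a longer maintenance window for the ability to stop after the first bad batch.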
• User-facing interfaces
• Jobtracker
• Fsimage checkpointing
• HDFS usage and local disk usage
MONITORING
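Two of the checks above, fsimage checkpoint freshness and HDFS usage, reduce to simple threshold tests once the raw numbers are collected. The sketch below assumes the caller already has the fsimage modification time and the HDFS used/capacity figures; the function name and both thresholds (two hours, 80%) are illustrative assumptions, not values from the talk.

```python
import time


def check_cluster(fsimage_mtime, hdfs_used, hdfs_capacity, now=None,
                  max_checkpoint_age=2 * 3600, max_usage=0.80):
    """Return a list of alert strings (empty when everything looks fine).

    fsimage_mtime and now are epoch seconds; hdfs_used and hdfs_capacity
    are byte counts. Thresholds are assumptions chosen for illustration.
    """
    if hdfs_capacity <= 0:
        raise ValueError("hdfs_capacity must be positive")
    now = time.time() if now is None else now
    alerts = []

    # A stale fsimage means a slow NameNode restart and a growing edit log.
    age = now - fsimage_mtime
    if age > max_checkpoint_age:
        alerts.append("fsimage checkpoint is %d s old" % age)

    usage = hdfs_used / hdfs_capacity
    if usage > max_usage:
        alerts.append("HDFS is %.0f%% full" % (usage * 100))
    return alerts
```

The same function can be fed from any collector (a cron script, a monitoring agent); keeping it free of I/O makes the thresholds easy to test.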
HADOOP CONFIG TUNING
• Hadoop is a DDoS on your infrastructure
– Increase ARP retention (L2-specific)
– Use NSCD
• Increase read-ahead
• Disable THP compaction
• MTU: jumbo frames
SYSTEM CONFIGS
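Settings like these tend to drift, so it can help to audit them per host. The sketch below takes values already read from sysfs (for example `/sys/kernel/mm/transparent_hugepage/defrag`, `/sys/block/<dev>/queue/read_ahead_kb` and `/sys/class/net/<iface>/mtu`; the THP path differs on some distributions) and reports what deviates. The function names and the two thresholds are assumptions for illustration, not recommendations from the talk.

```python
import re


def active_choice(sysfs_text):
    """Return the bracketed choice from text like 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError("no bracketed choice in %r" % sysfs_text)
    return match.group(1)


def audit_host(thp_defrag_text, read_ahead_kb, mtu,
               min_read_ahead_kb=1024, jumbo_mtu=9000):
    """Return a list of findings; thresholds are illustrative assumptions."""
    findings = []
    if active_choice(thp_defrag_text) != "never":
        findings.append("THP compaction (defrag) is enabled")
    if read_ahead_kb < min_read_ahead_kb:
        findings.append("read-ahead is %d KB, below %d KB"
                        % (read_ahead_kb, min_read_ahead_kb))
    if mtu < jumbo_mtu:
        findings.append("MTU is %d, jumbo frames not enabled" % mtu)
    return findings
```

An empty list means the host matches the expected tuning; anything else can be turned into a monitoring alert or a configuration-management run.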
CLUSTER CONFIGS
• Adjust log settings (default is INFO,console)
• Increase handler counts (JT, NN, DN)
• Use namenode.service.handler.count
• Watch out for checkpointing loops
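The handler-count settings above live in the Hadoop site files. As a rough sketch, the helper below renders a set of overrides into the `<configuration>` XML layout those files use. The property names are the standard HDFS keys (the service handler count is `dfs.namenode.service.handler.count`, which only takes effect when `dfs.namenode.servicerpc-address` is also set); the numeric values are placeholders chosen for illustration, not the values used at Criteo.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a {name: value} dict as Hadoop *-site.xml content."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values: size them to your own cluster.
HDFS_HANDLER_OVERRIDES = {
    "dfs.namenode.handler.count": 64,          # client RPC threads (NN)
    "dfs.namenode.service.handler.count": 32,  # DataNode/service RPC threads (NN)
    "dfs.datanode.handler.count": 16,          # DataNode server threads
}

hdfs_site = render_site_xml(HDFS_HANDLER_OVERRIDES)
```

The JobTracker equivalent in MRv1 is `mapred.job.tracker.handler.count`, which belongs in `mapred-site.xml` and can be rendered with the same helper.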
NETWORK
• One datacenter topology will not fit all
• Web traffic vs. Hadoop traffic
• Historical fat-tree hierarchy with Layer 2 routing
• Switched to meshed design (soon Layer 3)
NETWORK TOPOLOGY
• Rack awareness (of course!)
– Performance
– Reliability
– Maintenance (e.g. relocation)
HADOOP TOPOLOGY
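Rack awareness is usually wired in through a topology script: Hadoop calls it with one or more host addresses as arguments and expects one rack path per argument on standard output (the script is referenced by `net.topology.script.file.name`, or `topology.script.file.name` on Hadoop 1). Below is a minimal sketch of such a script; the subnet-to-rack table is entirely hypothetical.

```python
import sys

# Hypothetical mapping from address prefix to rack path. The trailing dot
# keeps "10.0.1." from also matching hosts in 10.0.10.x.
RACK_TABLE = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host, table=RACK_TABLE, default=DEFAULT_RACK):
    """Return the rack path for a host, or the default for unknown hosts."""
    for prefix, rack in table.items():
        if host.startswith(prefix):
            return rack
    return default


def main(argv):
    # One rack per argument, whitespace-separated, in the same order.
    print(" ".join(resolve_rack(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Returning a default rack for unknown hosts keeps a newly racked machine from failing resolution, at the cost of placing it in a catch-all rack until the table is updated.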
• HDFS quotas
• Scheduling (user-facing)
• Map / Reduce ratio
• Use YARN!
MULTITENANCY
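One detail that surprises tenants is that an HDFS space quota is charged per replica, so the quota to set is the logical allowance multiplied by the replication factor. The sketch below builds the corresponding `hdfs dfsadmin -setSpaceQuota` command line; the helper names, the example path and the sizes are illustrative assumptions.

```python
def raw_space_quota(logical_bytes, replication=3):
    """Bytes to grant so that logical_bytes of data fit at this replication."""
    if logical_bytes < 0 or replication < 1:
        raise ValueError("invalid quota or replication factor")
    return logical_bytes * replication


def set_space_quota_cmd(path, logical_bytes, replication=3):
    """Build the argv for setting a space quota on a directory."""
    quota = raw_space_quota(logical_bytes, replication)
    return ["hdfs", "dfsadmin", "-setSpaceQuota", str(quota), path]


# Example: allow a hypothetical team 10 TiB of logical data.
TIB = 1024 ** 4
cmd = set_space_quota_cmd("/user/team-a", 10 * TIB)
```

The returned list can be passed to `subprocess.run` on an admin host; on Hadoop 1 installations the command starts with `hadoop dfsadmin` instead.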
SECURITY
• Dedicated KDC / realm
• Dedicated service principals
• Cross-realm trusts
• Delegate user management to your IT
KERBEROS SETUP
• Use multiple proxies
• Easy way to interconnect to the outside world
• Data injection / read with a simple curl
• High bandwidth transfers
HTTPFS PROXIES
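HttpFS exposes the WebHDFS REST API, so "a simple curl" amounts to building the right URL and picking one of the proxies. A minimal sketch follows; the proxy host names are made up, 14000 is the usual HttpFS port, and `user.name` is the simple (non-Kerberos) authentication parameter. On a Kerberized cluster the client would authenticate with SPNEGO instead.

```python
import itertools
from urllib.parse import quote, urlencode


def webhdfs_url(proxy, path, op, user, **params):
    """Build a WebHDFS/HttpFS URL for an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    query = urlencode({"op": op, "user.name": user, **params})
    return "http://%s/webhdfs/v1%s?%s" % (proxy, quote(path), query)


def round_robin(proxies):
    """Cycle through the proxies endlessly, spreading load across them."""
    return itertools.cycle(proxies)


# Hypothetical proxy pool.
pool = round_robin(["httpfs1.example.com:14000", "httpfs2.example.com:14000"])
read_url = webhdfs_url(next(pool), "/data/in.json", "OPEN", "alice")
```

The resulting URL can be fetched with `curl -L "<url>"`. Writes use `op=CREATE` with an HTTP PUT, which the same helper covers by changing `op` and passing extra parameters such as `overwrite="true"`.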
• Multiple use cases (ML, BI analytics)
• Baseline JSON (+gzip) is OK
• Don’t optimize too early
• We still use it(*) at petabyte scale
(*) Some teams also use Parquet and contributed to its Hive integration
FILE FORMATS
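The baseline format is simple enough to sketch with the standard library alone: one JSON document per line, gzip-compressed. The function names below are illustrative and the records are made up; the layout (newline-delimited JSON) is an assumption about what "JSON (+gzip)" means here.

```python
import gzip
import io
import json


def write_json_lines(records, fileobj):
    """Write records as gzip-compressed, newline-delimited JSON."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for record in records:
            line = json.dumps(record, sort_keys=True) + "\n"
            gz.write(line.encode("utf-8"))


def read_json_lines(fileobj):
    """Yield records back from a gzip-compressed JSON-lines stream."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for raw in gz:
            if raw.strip():
                yield json.loads(raw.decode("utf-8"))


# Round trip through an in-memory buffer with made-up records.
events = [{"user": 1, "action": "click"}, {"user": 2, "action": "view"}]
buffer = io.BytesIO()
write_json_lines(events, buffer)
buffer.seek(0)
decoded = list(read_json_lines(buffer))
```

One trade-off to keep in mind: a gzip file is not splittable, so a MapReduce job reads each file with a single mapper. Keeping files to a moderate size is usually enough before reaching for a columnar format.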
QUESTIONS ?
Did I say we’re hiring?
We’re hiring lots of engineers in 2014. Come join us!
http://criteolabs.com/jobs
MY FELLOW CRITEOS WOULD KILL ME…