Getting 100B Metrics to Disk

G E T T I N G 1 0 0 B M E T R I C S T O D I S K

Jonathan Thurman - Site Reliability Engineer @jthurman42

1 9 4 B

http://www.flickr.com/photos/meteopassione/9157134653/

N E W R E L I C

• Performance Monitoring

• Web Apps

• Mobile Apps

• Servers

• Databases, Caches & More…

• Software Analytics

O K A Y , Y O U C O L L E C T D A T A

• 194 Billion Metrics

• 100,000 req/sec

• 2 Gbps Inbound

• 216 Terabytes

• All backed by MySQL

http://www.flickr.com/photos/bobsfever/6658919861/

H O W W E G O T H E R E

http://www.flickr.com/photos/auvet/853157494/

B U I L D I N G B L O C K S

• Hosted Environment

• Xen Virtual Machines

• Data storage

• ATA over Ethernet

• SATA drives

• MySQL 5.0

• Single Ruby on Rails Application

http://www.flickr.com/photos/riekhavoc/4648423297/

S H A R D I N G F R O M I N C E P T I O N

• Account Information

• Read heavy

• Single HA Instance

• Agent Data

• Write heavy

• 8 shards based on AccountId

http://www.flickr.com/photos/erikb/48221952/

T A L E O F T W O M O D E L S

• Ruby on Rails

• class ShardData < ActiveRecord::Base

• Look up shard for Account

• Override ConnectionHandler

http://www.flickr.com/photos/jungle_boy/140279885/

T R I B B L E S T A B L E S

• Metric table name contains

• AccountID

• Year and Julian Day

• Resolution

• ts_72_13221_1h

• Currently ~200k tables per DB
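A minimal sketch of how a name like ts_72_13221_1h could be assembled. The layout ts_<account>_<yy><day-of-year>_<resolution> is inferred from the single example on the slide and is an assumption; the function names are hypothetical.

```python
from datetime import date, datetime


def timeslice_table_name(account_id, day, resolution):
    """Build a per-account, per-day, per-resolution table name.

    Assumed layout: ts_<account>_<yy><ddd>_<resolution>.
    """
    # %y is the two-digit year, %j the zero-padded day of the year (001-366).
    return f"ts_{account_id}_{day.strftime('%y%j')}_{resolution}"


def parse_timeslice_table_name(name):
    """Split a table name back into (account_id, day, resolution)."""
    _prefix, account_id, yyddd, resolution = name.split("_")
    day = datetime.strptime(yyddd, "%y%j").date()
    return int(account_id), day, resolution
```

Under that assumption, account 72 on 9 August 2013 (day 221 of the year) at one-hour resolution maps to the slide's example name.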

http://www.flickr.com/photos/15942690@N00/4571141076/

B I N G E A N D P U R G E

• Purging data

• DELETE FROM …

• DROP TABLE …

• innodb_file_per_table

• innodb_lazy_drop_table (pre 5.5.30-30.2)
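With one table per account per day, purging can be a matter of dropping whole tables rather than deleting rows. The sketch below shows one way to pick tables past a retention window. The name layout (ts_<account>_<yyddd>_<resolution>) and the keep_days parameter are assumptions, not details from the talk.

```python
from datetime import date, datetime, timedelta


def table_day(name):
    """Return the day encoded in a table name such as ts_72_13221_1h.

    Assumes the third underscore-separated field is <yy><day-of-year>.
    """
    return datetime.strptime(name.split("_")[2], "%y%j").date()


def drop_statements(table_names, today, keep_days):
    """Build DROP TABLE statements for tables older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return [
        f"DROP TABLE IF EXISTS `{name}`"
        for name in sorted(table_names)
        if table_day(name) < cutoff
    ]
```

The function only returns SQL text; running it against a server, and in what order, is left to the caller.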

http://www.flickr.com/photos/exalthim/2261294871/

http://www.flickr.com/photos/davidmonro/8331755849/

http://www.flickr.com/photos/heliocentric/1571127347/

http://www.flickr.com/photos/aigle_dore/6225535459/

G R O W I N G P A I N S

M U L T I P L E P O I N T S O F F A I L U R E

• Single shard slows down

• App servers wait for response

• DB connection pool becomes full

• Site goes down

http://www.flickr.com/photos/boston_public_library/8204384670/

S H A R D G U A R D

• Monitor all databases

• Identify shard status:

• Bad? Mark as “wedged”

• Good? Clear “wedged” flag

• ShardData checks status!
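The idea can be sketched as a status flag that a monitor sets and the data-access path checks before connecting, so that a slow shard fails fast instead of filling the connection pool. All names below (ShardStatus, guard_pass, connect_to_shard, the probe callables, the one-second threshold) are hypothetical; the talk does not describe the actual implementation.

```python
class ShardWedged(Exception):
    """Raised instead of waiting on a shard that is marked as wedged."""


class ShardStatus:
    """Holds the set of shards currently flagged as wedged."""

    def __init__(self):
        self._wedged = set()

    def mark_wedged(self, shard_id):
        self._wedged.add(shard_id)

    def clear_wedged(self, shard_id):
        self._wedged.discard(shard_id)

    def is_wedged(self, shard_id):
        return shard_id in self._wedged


def guard_pass(status, probes, max_seconds=1.0):
    """Run one monitoring pass.

    probes maps shard_id to a callable that returns a response time in
    seconds, or raises if the shard cannot be reached.
    """
    for shard_id, probe in probes.items():
        try:
            healthy = probe() <= max_seconds
        except Exception:
            healthy = False
        if healthy:
            status.clear_wedged(shard_id)
        else:
            status.mark_wedged(shard_id)


def connect_to_shard(status, shard_id, connect):
    """Fail fast on a wedged shard; otherwise open a connection."""
    if status.is_wedged(shard_id):
        raise ShardWedged(f"shard {shard_id} is wedged")
    return connect(shard_id)
```

Failing fast keeps requests for healthy shards flowing while requests for the wedged shard get an immediate error.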

http://www.flickr.com/photos/mac_filko/5486980804/

S T A B I L I T Y A N D P E R F O R M A N C E

• Degraded performance

• New Accounts => Shard 9!

• Old accounts remain as-is

http://www.flickr.com/photos/ejpphoto/7823027272/

D A T A C O L L E C T I O N

• Rails isn’t great for data collection

• Ruby isn’t great either…

• Rewritten in Java using Jetty

http://www.flickr.com/photos/autograt/224540606/

C A C H E I S K I N G

• Buffered, not queued

• RAM is cheaper than I/O

• Get creative with batch processing

http://www.flickr.com/photos/epsos/8474532085/

I N S E R T I N T O ( S E L E C T …

• Select rows and re-process

• Cache last hour in Java’s Heap

• Write a journal and post-process it
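As an illustration of the second option, the sketch below keeps hourly rollups in memory while one-minute data arrives, then hands back finished hours for writing. It is written in Python rather than Java, and the fields (count, total) and the class name are assumptions; the talk does not say which aggregates are kept.

```python
from collections import defaultdict


class HourBuffer:
    """Accumulates one-minute samples into per-hour rollups in memory."""

    def __init__(self):
        self._hours = defaultdict(lambda: {"count": 0, "total": 0.0})

    def add(self, metric, epoch_seconds, count, total):
        """Fold one sample into the rollup for the hour it falls in."""
        hour_start = epoch_seconds - epoch_seconds % 3600
        rollup = self._hours[(metric, hour_start)]
        rollup["count"] += count
        rollup["total"] += total

    def flush_before(self, epoch_seconds):
        """Remove and return rollups for hours that ended by epoch_seconds."""
        done = {
            key: value
            for key, value in self._hours.items()
            if key[1] + 3600 <= epoch_seconds
        }
        for key in done:
            del self._hours[key]
        return done
```

The trade-off matches the "RAM is cheaper than I/O" point: the hourly rows are written once, with no need to read the minute rows back from the database.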

http://www.flickr.com/photos/esoteric_13/4741001804/

R E A D / W R I T E P R O B L E M

• Sequential Inserts

• Batched in 5k chunks

• Optimize for Throughput

• Must complete < 1 minute
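Batching in 5k chunks can be sketched as below. The execute_many callable stands in for whatever multi-row insert the database driver provides and is a hypothetical parameter.

```python
from itertools import islice


def chunks(rows, size=5000):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_batched(rows, execute_many, size=5000):
    """Send rows to execute_many one batch at a time; return the row count."""
    written = 0
    for batch in chunks(rows, size):
        execute_many(batch)
        written += len(batch)
    return written
```

Larger batches mean fewer round trips per minute of data, which is what matters when the goal is throughput rather than per-row latency.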

R E A D / W R I T E P R O B L E M

• Scattered Reads

• Optimized for Latency

• Unique Covering Indexes

M O V E T O H A R D W A R E

• Instant performance!

• Just add…

• Datacenter - Chicago, US

• Servers - Dell

• Storage - Direct Attached

• Time - About 6 months

http://www.flickr.com/photos/zebble/9621007/

S P I N N I N G R U S T

• Dell MD1200 shelves

• 8 Disks per shelf

• RAID 5 virtual disk

• Dedicated Hot-spare

http://www.flickr.com/photos/walkn/5472536812/

T H E G R E A T E X P A N S E

• MD1200s support 12 disks

• Add four more!

• Online RAID expansion

# F A I L

• “On-line” expansion, not so much

• Added second 4 disk RAID 5

• LVM Concatenation for space

http://www.flickr.com/photos/fireflythegreat/2845637227/

N E E D M O R E C A P A C I T Y

• Tight on disk space

• Performance not an issue

• New Accounts => Shard 10!

• Old Accounts as-is

http://www.flickr.com/photos/seandreilinger/6289721616/

S H A R D P I T F A L L S

http://www.flickr.com/photos/21206761@N00/469110140/

M I G R A T I O N P R O B L E M

• Accounts cannot move

• Not all tables have the shard key

• Rails defaults to auto-increment IDs

• Massive primary key collisions

• Punt and move the metrics

http://www.flickr.com/photos/tzafrir/125380911/

B R E A K I N G U P I S H A R D T O D O

• Agent Databases

• Metadata / Notes / Errors

• Timeslice Databases

• Time-series metric data

• 1 Minute and 1 Hour resolution

http://www.flickr.com/photos/rsepulveda/4275236049/

R E S O U R C E P O O L S

• Distributed by Shard Key

• Distribution can CHANGE

• Lookup table, not hash

• Data can be MOVED
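A minimal sketch of the lookup-table approach: each account is pinned to a pool when first seen, new accounts follow the current default, and an existing account changes pool only when explicitly moved. The class and method names are hypothetical.

```python
class ShardDirectory:
    """Maps each account to a pool through an explicit lookup table."""

    def __init__(self, new_account_shard):
        # Where accounts seen for the first time are placed.
        self.new_account_shard = new_account_shard
        self._shard_for_account = {}

    def shard_for(self, account_id):
        """Return the account's pool, assigning the default on first use."""
        if account_id not in self._shard_for_account:
            self._shard_for_account[account_id] = self.new_account_shard
        return self._shard_for_account[account_id]

    def move(self, account_id, shard):
        """Repoint an account after its data has been copied elsewhere."""
        self._shard_for_account[account_id] = shard
```

With a hash such as account_id modulo the shard count, adding a shard would remap many accounts at once; a lookup table changes only the rows that are deliberately updated.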

http://www.flickr.com/photos/dclark3996/4971906528/

B A C K U P S

• Custom mysqldump wrapper

• Based on business need

• Backup per table

• Ignore tables to be purged

http://www.flickr.com/photos/usdagov/6896218334/

E V O L U T I O N

http://www.flickr.com/photos/pfsullivan_1056/3485953405/

S S D R E V O L U T I O N

• 600GB Intel 320 SSDs

• Dell MD1220 Direct Attached shelf

• Disks are no longer the bottleneck

• Inserts in Read-optimized order are “fast enough”

Y O U C A N U S E S S D W I T H D A T A B A S E S

• 6 of 420 drives RMA’d

• March 2012 to Aug 2013

• Average 180TB lifetime writes

• 91% wear remaining

http://www.flickr.com/photos/joeshlabotnik/3584172834/

R E D U N D A N T A R R A Y O F E X P E N S I V E D I S K S

• Rebuilds under load > 4 hours

• Migrated to RAID 60

• 2 x 12 disk span

• Ditch the Hot-spares

http://www.flickr.com/photos/mbk/27640225/

X F S T U N I N G

• mkfs.xfs -s size=4096

• options

• noatime

• nobarrier

• inode64

• logbsize=256k

http://www.flickr.com/photos/rocketlass/5169004165/

S H A R D G U A R D P A R T D E U X

• Protect all the things!

• Kill UI queries over 75 seconds

• Kill background queries over 1 hour

• Yes, all of them

• No really, kill them, now
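The limits above could be enforced by a periodic pass over the output of SHOW PROCESSLIST, as sketched here. Telling UI and background queries apart by MySQL user name ("ui", "background") is an assumption; the talk does not say how the two classes are identified.

```python
# Hypothetical MySQL user names mapped to their time limits in seconds.
LIMITS = {"ui": 75, "background": 3600}


def kill_statements(processlist, limits=LIMITS):
    """Return KILL statements for queries that have run past their limit.

    processlist is a list of dicts using SHOW PROCESSLIST column names
    (Id, User, Command, Time).
    """
    statements = []
    for row in processlist:
        limit = limits.get(row["User"])
        if limit is None or row["Command"] != "Query":
            continue  # Unknown user, or an idle or non-query thread.
        if row["Time"] > limit:
            statements.append(f"KILL {row['Id']}")
    return statements
```

Only threads whose Command is Query are considered, so sleeping connections are left alone.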

http://www.flickr.com/photos/chiky/7194089194/

I F Y O U D O N ’ T B E L I E V E M E …

• Delayed Job

• Long running background query

• InnoDB History List Traversal

T O I N F I N I T Y A N D B E Y O N D

http://www.flickr.com/photos/temma2/1149223191/

H A R D W A R E V 2

• Dell R620

• 2 x Intel E5-2690 @ 2.90GHz

• 96GB RAM

• MD1220 Storage Shelf

• 800GB Intel SSD S3500

http://www.flickr.com/photos/tnarik/2590037637/

C O N T I N U O U S I M P R O V E M E N T

• EXT4 / ZFS / XFS

• RAID Card vs HBA

• Percona Server 5.6

• Multiple MySQL Instances

• Databases per Service

http://www.flickr.com/photos/shawnclover/8555834230/

JOIN THE TEAM NewRelic.com/jobs
