Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems
John Joo, Program Director
David Drummond, Program Director
Insight Data Engineering
Goals
• Understand the different components of the tech stack at a high level.
• Understand the hardware bottlenecks that dictate the tech stack.
• Understand the tech stacks that are generally used for different types of companies, and why.
• Various ports (I/O): up to ~10 GB/s
• CPU (processor): ~1 GHz
• Hard Drive (storage): ~250 GB
• RAM (memory): ~8 GB
Network, Processing, Storage
Data @ Point of Sale
• 1 transaction → 2 kB
• What did Customer buy?
• How much did Customer spend?
• When did Customer make this transaction?
Daily Data @ Individual Store
• ~50,000 transactions / store / day → 100 MB
• Servers at back of store
• What items were sold today?
• What was our revenue for today?
• How much was refunded today?
• What do we need to do to restock for tomorrow?
Yearly Data @ Individual Store
• 20 million transactions → 40 GB / year
• What are some seasonal trends in purchased items?
• How should we target our coupons or advertisements to local customers?
• Who were the most efficient employees?
• Should the store’s hours change depending on the time of year?
Yearly Data @ All Stores
• 7 billion transactions → 10 TB / year
• Requires data centers
• What national sales campaigns should we run? Ads, coupons, commercials, web.
• What should the CEO's compensation be?
• Where should we open Supercenters, Discount Stores, Neighborhood Stores, Walmart Expresses?
• What music should we play in the stores?
Complete Historic Data @ All Stores
• 16 years (1992 - 2008)
• 1 trillion transactions → 2.5 PB
• Data centers
  • “Area 71” in Caverna, Missouri: 125,000 square feet, 460 TB
  • Colorado Springs: 210,000 square feet, $100 million
Area 71
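The volumes on the data slides can be roughly checked by multiplying each transaction count by the 2 kB transaction size. The sketch below assumes decimal units (1 kB = 1000 bytes) and a constant 2 kB per transaction; the function names are illustrative only. Under those assumptions the store-level figures come out at 100 MB and 40 GB, while the chain-level estimates are 14 TB and 2 PB where the slides quote 10 TB and 2.5 PB, so the slide figures are best read as order-of-magnitude values.

```python
# Rough check of the data volumes quoted on the slides.
# Assumption: every transaction is 2 kB, with decimal units (1 kB = 1000 bytes).
BYTES_PER_TRANSACTION = 2_000

# Transaction counts as quoted on the slides.
SCALES = {
    "one store, one day": 50_000,
    "one store, one year": 20_000_000,
    "all stores, one year": 7_000_000_000,
    "all stores, 1992-2008": 1_000_000_000_000,
}

UNITS = ["B", "kB", "MB", "GB", "TB", "PB"]


def human_readable(num_bytes: float) -> str:
    """Format a byte count using the largest decimal unit that keeps it >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


def estimate_volumes() -> dict:
    """Return the estimated size in bytes for each scale."""
    return {name: count * BYTES_PER_TRANSACTION for name, count in SCALES.items()}


if __name__ == "__main__":
    for name, size in estimate_volumes().items():
        print(f"{name:>24}: {human_readable(size)}")
```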
Bottlenecks in Data Systems
Proper data system design should consider these limiting bottlenecks:
• Loading data into the CPU and memory
• Finding data on the disk
• Moving data across the network
Bottlenecks: Loading Data
• All data that is processed must be loaded into the CPU
• Solution: Distributed computing with ample memory
Diagram: Disk Storage, Memory, CPU, compared by Price and Speed
Bottlenecks: Finding Data
• Finding a new file on disk (known as a random seek)
• Solution: SSD and structuring data in the order it is accessed
Diagram: Actuator arm with head that reads from disk, moving from End of Desired File to Beginning of Desired File
Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors
Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially on disk, randomly on disk, or across the network
• Ratios shown on the slide: 100:1, 200:1, 50:1
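To make the comparison concrete, here is a back-of-envelope sketch of how long it might take to fetch one 2 kB transaction through each path. Every hardware figure in it (bandwidths, seek time, round-trip time) is an assumed round number chosen for illustration, not a value from the slides, and it is not meant to reproduce the ratios shown on the slide.

```python
# Back-of-envelope time to fetch one 2 kB transaction from different places.
# ALL hardware figures below are assumed round numbers for illustration;
# they are not taken from the slides and vary widely between machines.
TRANSACTION_BYTES = 2_000

MEMORY_BYTES_PER_S = 10e9      # assumed memory bandwidth
DISK_BYTES_PER_S = 100e6       # assumed sequential disk throughput
DISK_SEEK_S = 10e-3            # assumed time to move the actuator arm
NETWORK_BYTES_PER_S = 100e6    # assumed network throughput
NETWORK_ROUND_TRIP_S = 0.5e-3  # assumed round trip inside a data center


def access_times(num_bytes: int = TRANSACTION_BYTES) -> dict:
    """Estimated seconds to read num_bytes via each access path."""
    return {
        "memory": num_bytes / MEMORY_BYTES_PER_S,
        "disk, sequential": num_bytes / DISK_BYTES_PER_S,
        "network": NETWORK_ROUND_TRIP_S + num_bytes / NETWORK_BYTES_PER_S,
        "disk, random seek": DISK_SEEK_S + num_bytes / DISK_BYTES_PER_S,
    }


if __name__ == "__main__":
    times = access_times()
    baseline = times["memory"]
    for path, seconds in times.items():
        print(f"{path:>18}: {seconds * 1e6:10.1f} us  ({seconds / baseline:,.0f}x memory)")
```

With these assumed figures the fixed costs (the seek and the network round trip) dominate for a record this small, which is why the slides recommend keeping data in memory, in access order, and close to the processors.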
Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility
Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of data
• 7 PB / day
• 1 kW / TB
• ~$20 / TB / month
Start-Ups with Exponential Growth
• Example: Airbnb, which rents processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependent on the provider
• 50 GB / day
• $20-50 / TB / month
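The per-terabyte prices on the Large Firms and Start-Ups slides can be turned into rough monthly bills. This is a minimal sketch under loud assumptions: decimal units (1 PB = 1000 TB), a flat price applied to all stored data, and a constant 50 GB/day ingest rate, which understates exponential growth. The function names are illustrative only.

```python
# Monthly storage cost from the per-terabyte prices quoted on the slides.
# Assumptions: decimal units (1 PB = 1000 TB), a flat price per TB per month
# applied to everything stored, and a constant daily ingest rate.


def monthly_cost_usd(stored_tb: float, usd_per_tb_month: float) -> float:
    """Cost of keeping stored_tb terabytes for one month."""
    return stored_tb * usd_per_tb_month


def stored_after_days_tb(gb_per_day: float, days: int) -> float:
    """Total data accumulated at a constant daily rate, in TB."""
    return gb_per_day * days / 1000


if __name__ == "__main__":
    # Own data centers: 100 PB at ~$20 / TB / month.
    print(f"100 PB owned: ${monthly_cost_usd(100_000, 20):,.0f} per month")

    # Rented storage: one year of 50 GB / day at $20-50 / TB / month.
    year_tb = stored_after_days_tb(50, 365)
    low = monthly_cost_usd(year_tb, 20)
    high = monthly_cost_usd(year_tb, 50)
    print(f"{year_tb:.2f} TB rented: ${low:,.2f} to ${high:,.2f} per month")
```

At this scale the rented bill is small, which fits the "no upfront capital" point; the "expensive in the long run" point only shows up once the stored volume grows by several orders of magnitude.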
Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure
Data Pipeline
• Ingestion: Gathering data in a reliable way
• File System: Storing the unstructured data redundantly
• Batch Processing: Processing the data in large batches at the data center
• Realtime Processing: Processing live streaming data reliably
• Database: Organizing data for quick access
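As a way to see how the five pipeline stages relate, here is a toy sketch that uses in-memory Python stand-ins for each one. Every function name and the event format are hypothetical, chosen only for illustration; real pipelines use dedicated tools for each stage.

```python
# Toy, in-memory stand-ins for the five pipeline stages.
# All function names and the event format are hypothetical.
from collections import defaultdict


def ingest(raw_events):
    """Ingestion: accept events, dropping ones that are malformed."""
    return [e for e in raw_events if "store" in e and "amount" in e]


def store_redundantly(events, replicas=3):
    """File system: keep several copies of the raw data."""
    return [list(events) for _ in range(replicas)]


def batch_process(events):
    """Batch processing: total revenue per store over the whole data set."""
    totals = defaultdict(float)
    for e in events:
        totals[e["store"]] += e["amount"]
    return dict(totals)


def stream_process(events):
    """Realtime processing: yield a running total as each event arrives."""
    running = 0.0
    for e in events:
        running += e["amount"]
        yield running


def build_database(totals):
    """Database: organize results by key for quick lookup."""
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    raw = [
        {"store": "A", "amount": 10.0},
        {"store": "B", "amount": 5.0},
        {"malformed": True},
        {"store": "A", "amount": 2.5},
    ]
    clean = ingest(raw)
    copies = store_redundantly(clean)
    database = build_database(batch_process(copies[0]))
    print(database)
    print(list(stream_process(clean)))
```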