Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems

John Joo, Program Director
David Drummond, Program Director
Insight Data Engineering


Program mentors are data engineers from top technology companies including:

Goals

• Understand the different components of the tech stack at a high level.

• Understand the hardware bottlenecks that dictate the tech stack.

• Understand the tech stacks that are generally used for different types of companies, and why.

Computing basics

• Various ports (I/O): up to ~10 GB/s

• CPU (processor): ~1 GHz

• Hard drive (storage): ~250 GB

• RAM (memory): ~8 GB


These components map to three areas: network (I/O ports), processing (CPU and RAM), and storage (hard drive).

What does this look like for a business?

Data @ Point of Sale

• 1 transaction → 2 kB

• What did Customer buy?

• How much did Customer spend?

• When did Customer make this transaction?

Daily Data @ Individual Store

• ~50,000 transactions / store / day → 100 MB

• Servers at back of store

• What items were sold today?

• What was our revenue for today?

• How much was refunded today?

• What do we need to do to restock for tomorrow?

Yearly Data @ Individual Store

• 20 million transactions → 40 GB / year

• What are some seasonal trends in purchased items?

• How should we target our coupons or advertisements to local customers?

• Who were the most efficient employees?

• Should the store’s hours change depending on the time of year?


Yearly Data @ All Stores

• 7 billion transactions → 10 TB / year

• Requires servers in data centers

• What national sales campaigns should we run? Ads, coupons, commercials, web.

• What should the CEO's compensation be?

• Where should we open Supercenters, Discount Stores, Neighborhood Stores, Walmart Expresses?

• What music should we play in the stores?

Complete Historic Data @ All Stores

• 16 years (1992 - 2008)

• 1 trillion transactions → 2.5 PB

• Data centers:

• “Area 71” in Caverna, Missouri: 125,000 square feet, 460 TB

• Colorado Springs: 210,000 square feet, $100 million
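The volume estimates on the preceding slides follow from multiplying a transaction count by the 2 kB per-transaction size. The sketch below redoes that arithmetic, assuming decimal (SI) units; the function and variable names are illustrative and not from the slides. The slides quote rounded figures (for example 10 TB and 2.5 PB), so the computed values agree only to within the same order of magnitude.

```python
# Decimal (SI) units; the 2 kB per-transaction size is taken from the slides.
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15
BYTES_PER_TRANSACTION = 2 * KB


def data_volume_bytes(transactions: int) -> int:
    """Raw bytes needed to store the given number of transactions."""
    return transactions * BYTES_PER_TRANSACTION


# Transaction counts as quoted on the slides.
scales = {
    "one store, one day": 50_000,
    "one store, one year": 20_000_000,
    "all stores, one year": 7_000_000_000,
    "all stores, 16 years": 1_000_000_000_000,
}

for label, count in scales.items():
    print(f"{label}: {data_volume_bytes(count) / TB:.4f} TB")
```

Each step up in scope multiplies the volume by two or three orders of magnitude, which is why the hardware that serves a single store does not serve the whole company.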

Area 71


Bottlenecks in Data Systems

Proper data system design should consider these limiting bottlenecks:

• Loading data into the CPU and memory

• Finding data on the disk

• Moving data across the network

Bottlenecks: Loading Data

• All data that is processed must be loaded into the CPU

(Figure: storage hierarchy of disk storage, memory, and CPU, compared by price and speed.)

• Solution: Distributed computing with ample memory
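As a rough illustration of what "ample memory" implies, the sketch below estimates how many machines are needed to hold a dataset entirely in RAM. It uses the ~8 GB RAM figure and the 10 TB / year and 40 GB / year volumes from earlier slides; the function name is hypothetical, and the calculation ignores memory taken by the operating system, replication, and working space, so it is a lower bound.

```python
GB, TB = 10**9, 10**12


def machines_needed(dataset_bytes: int, ram_per_machine_bytes: int) -> int:
    """Minimum number of machines whose combined RAM holds the dataset.

    Assumption: every byte of RAM is usable for data (optimistic).
    """
    # Integer ceiling division avoids floating-point rounding.
    return (dataset_bytes + ram_per_machine_bytes - 1) // ram_per_machine_bytes


one_store_year = machines_needed(40 * GB, 8 * GB)
all_stores_year = machines_needed(10 * TB, 8 * GB)
print(one_store_year, all_stores_year)
```

With these figures, one store's yearly data fits in the memory of a handful of machines, while a year of company-wide data needs on the order of a thousand.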

Bottlenecks: Finding Data

• Finding a new file on disk (known as random seeks)

(Figure: hard disk with an actuator arm whose head reads from the disk; the beginning and end of the desired file are marked.)

• Solution: SSDs and structuring data in the order it is accessed
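To make "structuring data in the order it is accessed" concrete, here is a toy model that counts how many seeks a read pattern needs for a given on-disk layout. It assumes a simplified disk where a read is free of seek cost only if the record sits directly after the previously read one; `count_seeks` and the record names are hypothetical.

```python
def count_seeks(disk_layout, access_order):
    """Count head jumps for reading records in access_order.

    A read needs a seek unless the record is stored directly after
    the previously read record (simplified model of a spinning disk).
    """
    position = {record: i for i, record in enumerate(disk_layout)}
    seeks = 0
    previous = None
    for record in access_order:
        current = position[record]
        if previous is None or current != previous + 1:
            seeks += 1
        previous = current
    return seeks


access_order = ["mon", "tue", "wed", "thu", "fri"]
scattered_layout = ["wed", "mon", "fri", "tue", "thu"]
ordered_layout = list(access_order)  # stored in the order it is read

print(count_seeks(scattered_layout, access_order))
print(count_seeks(ordered_layout, access_order))
```

In this model the scattered layout pays one seek per record, while the layout that matches the access order pays a single initial seek.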

Bottlenecks: Moving Data

• Moving data from machine to machine over a network

• Solution: Keeping data close to the processors
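One way to read "keeping data close to the processors" is to compute where the data already lives and send only the result. The sketch below compares the bytes that cross the network when every store ships raw transactions to a central machine with the bytes moved when each store sends only a locally computed total. The 2 kB transaction size is from the slides; the 8-byte summary size and the function names are assumptions for illustration.

```python
BYTES_PER_TRANSACTION = 2_000  # 2 kB, from the slides
BYTES_PER_SUMMARY = 8          # assumed: one 64-bit total per store


def bytes_moved_centralized(transactions_per_store):
    """Every raw transaction is shipped to a central machine."""
    return sum(n * BYTES_PER_TRANSACTION for n in transactions_per_store)


def bytes_moved_local_aggregation(transactions_per_store):
    """Each store computes its own total and ships only that number."""
    return len(transactions_per_store) * BYTES_PER_SUMMARY


stores = [50_000, 50_000, 50_000]  # daily transactions at three stores
print(bytes_moved_centralized(stores))
print(bytes_moved_local_aggregation(stores))
```

This only works for questions that can be answered from per-store partial results, such as daily revenue; analyses that need raw records from many stores together still have to move data.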

Bottlenecks: Example

• Processing a 2 kB transaction in memory, sequentially and randomly on disk, or across the network

(Figure: ratios shown on the slide: 100:1, 200:1, 50:1)
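The comparison can be approximated with a simple latency-plus-bandwidth model. Every constant below is a rough, assumed order-of-magnitude figure for commodity hardware, not a number from the slides, and the sketch is not meant to reproduce the slide's ratios; it only shows how such ratios arise.

```python
# All constants are assumed, order-of-magnitude values for illustration.
MEMORY_LATENCY_S = 100e-9         # one main-memory reference
MEMORY_BANDWIDTH_BPS = 4e9        # bytes per second read from RAM
DISK_SEEK_S = 10e-3               # one random seek on a spinning disk
DISK_BANDWIDTH_BPS = 100e6        # bytes per second, sequential disk read
NETWORK_ROUND_TRIP_S = 500e-6     # round trip inside a data center
NETWORK_BANDWIDTH_BPS = 125e6     # bytes per second (about 1 Gbit/s)

TRANSACTION_BYTES = 2_000         # 2 kB, from the slides


def time_memory(n_bytes):
    return MEMORY_LATENCY_S + n_bytes / MEMORY_BANDWIDTH_BPS


def time_disk_sequential(n_bytes):
    # Head is already in position, so only transfer time counts.
    return n_bytes / DISK_BANDWIDTH_BPS


def time_disk_random(n_bytes):
    # One seek to reach the record, then the transfer.
    return DISK_SEEK_S + n_bytes / DISK_BANDWIDTH_BPS


def time_network(n_bytes):
    return NETWORK_ROUND_TRIP_S + n_bytes / NETWORK_BANDWIDTH_BPS


baseline = time_memory(TRANSACTION_BYTES)
for name, fn in [
    ("memory", time_memory),
    ("disk, sequential", time_disk_sequential),
    ("disk, random", time_disk_random),
    ("network", time_network),
]:
    t = fn(TRANSACTION_BYTES)
    print(f"{name}: {t * 1e6:.1f} microseconds ({t / baseline:.0f}x memory)")
```

For a record this small, fixed costs (the seek and the network round trip) dominate the transfer time, which is why random disk access and network hops are so much slower than memory.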

Tech Stacks for Companies

Depending on your growth plans:

• Single system with small data

• Distributed data center with large data

• Renting computers for flexibility

Small Firms with Small Data

• Example: Small medical firm with slow growth

• Pros: Easy to maintain, data locality, inexpensive

• Cons: Difficult to grow quickly, risky, not ideal for analysis

Large Firms with Stable Growth

• Example: Facebook with steadily growing data centers

• Pros: Economies of scale, redundancy, innovative design

• Cons: Upfront capital, dedicated maintenance

• >100 PB of data

• 7 PB / day

• 1 kW / TB

• ~$20 / TB / month

Start-Ups with Exponential Growth

• Example: Airbnb, which rents processing and storage from AWS

• Pros: Scales easily, no maintenance, no upfront capital

• Cons: Expensive in the long run, dependence on the data provider

• 50 GB / day

• $20-50 / TB / month
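To see how the rented-storage bill grows, the sketch below combines the two figures above: 50 GB of new data per day and $20-50 per TB per month. It assumes a constant daily volume, that nothing is ever deleted, and decimal units; the function name and defaults are illustrative only.

```python
GB_PER_TB = 1000


def monthly_storage_cost(days_elapsed, gb_per_day=50, usd_per_tb_month=(20, 50)):
    """Return the (low, high) monthly bill in USD after days_elapsed days.

    Assumptions: constant daily volume and no data is ever deleted.
    """
    stored_tb = days_elapsed * gb_per_day / GB_PER_TB
    low_rate, high_rate = usd_per_tb_month
    return stored_tb * low_rate, stored_tb * high_rate


low, high = monthly_storage_cost(365)
print(f"After one year: ${low:.2f} to ${high:.2f} per month")
```

Because the bill is charged every month on everything stored so far, the recurring cost keeps rising with the data, which is the "expensive in the long run" trade-off listed above.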

• Example: Netflix, when AWS fails on Christmas Eve

• Con: You can rent the computers, but you own the failure

Data Pipeline

• Ingestion: Gathering data in a reliable way

• File System: Storing the unstructured data redundantly

• Batch Processing: Processing the data in large batches at the data center

• Realtime Processing: Processing live streaming data reliably

• Database: Organizing data for quick access
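The pipeline stages can be sketched as a chain of small functions. This is a toy, in-memory illustration with hypothetical names and made-up sample records, not a description of any particular tool; real systems use dedicated software for each stage, and the realtime stage is left out for brevity.

```python
import json
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion: parse incoming records, skipping malformed ones."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real system would log or quarantine bad input


def store(records, replicas=2):
    """File system: keep several copies of every raw record."""
    records = list(records)
    return [list(records) for _ in range(replicas)]


def batch_revenue_by_store(records):
    """Batch processing: aggregate all stored records in one pass."""
    totals = defaultdict(float)
    for record in records:
        totals[record["store"]] += record["amount"]
    return dict(totals)


# Made-up sample input, including one malformed line.
raw = [
    '{"store": "A", "amount": 5.0}',
    "not json",
    '{"store": "B", "amount": 2.5}',
    '{"store": "A", "amount": 1.5}',
]

copies = store(ingest(raw))
# Database: results organized by key for quick lookup.
database = batch_revenue_by_store(copies[0])
print(database["A"])
```

Keeping the raw records separate from the aggregated results means the batch step can be rerun with a different question without re-ingesting the data.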

Conclusion

• Understand the different components of the tech stack at a high level

• Understand the hardware bottlenecks that dictate the tech stack

• Understand the tech stacks that are generally used for different types of companies, and why