ted-dunning
DESCRIPTION
Talk about what scalability really means in terms of interacting processes and statistics of growth
1©MapR Technologies - Confidential
Scalability in Hadoop and Similar Systems
Big is the next big thing
Big data and Hadoop are exploding
Companies are being funded
Books are being written
Applications are sprouting up everywhere
Slow Motion Explosion
Hadoop Explosion
Why Now?
But Moore’s law has applied for a long time
Why is Hadoop exploding now?
Why not 10 years ago?
Why not 20?
9/18/12
Size Matters, but …
If it were just availability of data then existing big companies would adopt big data technology first
They didn’t
Or Maybe Cost
If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
They didn’t
Backwards adoption
Under almost any threshold argument startups would not adopt big data technology first
They did
Everywhere at Once?
Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease
BUSTED!
Analytics Scaling Laws
Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
The key to net value is how costs scale
– Old school: exponential scaling
– Big data: linear scaling, low constant
Cost/performance has changed radically
– IF you can use many commodity boxes
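The net value argument can be sketched numerically. The following is a toy model only, not the curves from the talk: the saturating value function, the two cost functions, and every constant (`max_value=100`, `slope=3`) are assumptions chosen for illustration. It shows value with diminishing returns minus cost, and finds the effort level where net value peaks under each cost law.

```python
import math

# All functions and constants here are illustrative assumptions,
# not figures taken from the slides.

def value(effort, max_value=100.0):
    """Diminishing returns: big gains early, saturating later."""
    return max_value * (1.0 - math.exp(-effort))

def exponential_cost(effort):
    """'Old school' cost: cheap at first, then explodes."""
    return math.exp(effort) - 1.0

def linear_cost(effort, slope=3.0):
    """'Big data' cost: grows in proportion to effort."""
    return slope * effort

def best_effort(cost, max_effort=10.0, step=0.01):
    """Grid-search the effort level that maximizes value minus cost."""
    n_steps = int(round(max_effort / step))
    efforts = [i * step for i in range(n_steps + 1)]
    best = max(efforts, key=lambda e: value(e) - cost(e))
    return best, value(best) - cost(best)

if __name__ == "__main__":
    for name, cost in [("exponential", exponential_cost),
                       ("linear", linear_cost)]:
        effort, net = best_effort(cost)
        print(f"{name:12s} optimum effort={effort:.2f} net value={net:.1f}")
```

In this toy model the linear cost is actually higher than the exponential one at low effort, yet its net value peaks later and higher, because the cost curve no longer cuts off the tail of the value curve.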
We knew that
We should have known that
We didn’t know that!
You’re kidding, people do that?
Anybody with eyes
Intern with a spreadsheet
In-house analytics
Industry-wide data consortium
NSA, non-proliferation
Net value optimum has a sharp peak well before maximum effort
But scaling laws are changing both slope and shape
More than just a little
They are changing a LOT!
Initially, linear cost scaling actually makes things worse
A tipping point is reached and things change radically …
Pre-requisites for Tipping
To reach the tipping point,
Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
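As a rough illustration of an algorithm that scales out across workers that can and will fail, here is a minimal single-process sketch. It is not Hadoop code: `WorkerFailure`, `make_flaky_counter`, `run_with_retries`, and `scale_out_word_count` are hypothetical names, and the "crash" is simulated. The input is split into shards, each shard is counted independently, failed shards are retried, and the partial counts are merged.

```python
from collections import Counter

class WorkerFailure(Exception):
    """Stands in for a commodity box dying mid-task (simulated)."""

def make_flaky_counter(fail_first_attempts=1):
    """Build a word-count task that fails the first few tries per shard."""
    attempts = {}

    def count_words(shard_id, lines):
        attempts[shard_id] = attempts.get(shard_id, 0) + 1
        if attempts[shard_id] <= fail_first_attempts:
            raise WorkerFailure(f"simulated crash on shard {shard_id}")
        return Counter(word for line in lines for word in line.split())

    return count_words

def run_with_retries(task, shard_id, lines, max_attempts=3):
    """Re-run a shard until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(shard_id, lines)
        except WorkerFailure:
            if attempt == max_attempts:
                raise

def scale_out_word_count(lines, n_shards, task):
    """Split input into independent shards, count each, merge results."""
    shards = [lines[i::n_shards] for i in range(n_shards)]
    total = Counter()
    for shard_id, shard in enumerate(shards):
        total += run_with_retries(task, shard_id, shard)
    return total

if __name__ == "__main__":
    lines = ["big data big", "data is big", "hadoop"]
    print(scale_out_word_count(lines, 2, make_flaky_counter()))
```

Because each shard is processed independently and the merge is a simple sum, adding shards or retrying one of them does not change the answer, which is the property that lets this kind of job spread over many unreliable machines.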
Yeah… but wait
The Standard Sort of Model
People talk about the law of large numbers as if it were …
Well, as if it were a law
It’s not …
It is a context- and assumption-dependent theorem
What if …
These assumptions are:
Changes have a stationary, independent, finite-variance distribution
What happens if these assumptions are wrong?
And which of them is really wrong?
For Example
End point has nice tractable distribution
What if the Assumptions are Wrong?
Take the finite variance assumption as a simple example
Dropping it leads to Lévy stable distributions
Like the Cauchy distribution
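A small simulation can make the contrast concrete. This is a sketch under stated assumptions (standard Gaussian versus standard Cauchy draws, and helper names `cauchy_sample`, `gauss_sample`, and `typical_mean_error` invented here): it measures how far a sample mean typically lands from the center as the sample size grows. For finite variance the error should shrink roughly like one over the square root of the sample size; for the Cauchy case the mean of any number of draws is itself Cauchy distributed, so averaging should not help.

```python
import math
import random
import statistics

def cauchy_sample(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def gauss_sample(rng):
    """Standard normal draw (finite variance)."""
    return rng.gauss(0.0, 1.0)

def typical_mean_error(draw, n_samples, n_trials, rng):
    """Median over trials of |sample mean|, for samples of size n_samples."""
    errors = []
    for _ in range(n_trials):
        mean = sum(draw(rng) for _ in range(n_samples)) / n_samples
        errors.append(abs(mean))
    return statistics.median(errors)

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so runs are repeatable
    for n in (10, 100, 1000):
        g = typical_mean_error(gauss_sample, n, 300, rng)
        c = typical_mean_error(cauchy_sample, n, 300, rng)
        print(f"n={n:5d}  gaussian error={g:.3f}  cauchy error={c:.3f}")
```

The median is used to summarize the trials because the Cauchy sample means themselves have no finite mean or variance to report.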
Is it Really Different?
What About Real Life?
But is it Really Infinite Variance?
Or are there other kinds of phenomena that show this?
What about the independence assumption?
What if the supposedly independent components of the system communicate?
Like we do. Every day. All the time.
Why the Difference?
Law of large numbers
Infinite variance
Interacting agents
Apologies and credit to Simon DeDeo, SFI
The space of all things that change
The space of interacting things
What Happens with Interactions
Social phenomena defeat the law of large numbers
Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process
Limiting distributions are heavy tailed, power law
We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of Github projects
– equity pricing and volumes
– sizes of cities
– popularity of web-sites
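A "rich get richer" process is easy to simulate. Below is a minimal sketch of the Pitman-Yor process in its restaurant-seating form; the function name `pitman_yor_table_sizes`, the parameter values (`discount=0.5`, `concentration=1.0`), and the seed are assumptions for illustration. Each new customer joins an existing table with probability proportional to its size minus the discount, or opens a new table with probability proportional to the concentration plus the discount times the number of tables.

```python
import random

def pitman_yor_table_sizes(n_customers, discount=0.5, concentration=1.0,
                           seed=0):
    """Seat customers one at a time; popular tables attract more customers."""
    rng = random.Random(seed)
    sizes = []
    for n in range(n_customers):
        # n customers are already seated; total weight is n + concentration.
        r = rng.random() * (n + concentration)
        new_table_weight = concentration + discount * len(sizes)
        if r < new_table_weight:
            sizes.append(1)
            continue
        r -= new_table_weight
        for k, size in enumerate(sizes):
            r -= size - discount
            if r < 0:
                sizes[k] += 1
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            sizes[-1] += 1
    return sizes

if __name__ == "__main__":
    sizes = sorted(pitman_yor_table_sizes(5000), reverse=True)
    print("tables:", len(sizes))
    print("five largest:", sizes[:5])
    print("median size:", sizes[len(sizes) // 2])
```

A few tables typically end up holding a large share of all customers while most tables stay tiny, which is the heavy-tailed pattern the list above describes.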
What are the Implications?
In a Nutshell
Scalability is much more important than we thought
Mashups are more important than we thought
Network effects are more important than we thought
Exploration is more important than we thought
Hadoop-style linear scaling must be mixed with ad hoc analysis
Thank You
whoami?
Ted Dunning
– @ted_dunning
– [email protected] (MapR distribution for Hadoop)
– [email protected] (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– [email protected] (me)
More info:
http://www.mapr.com/company/events/hadoop-in-finance-2012