The Challenges of Managing Big Data
Amr El Abbadi, Computer Science, UC Santa Barbara
Coined by Jeannette Wing (CMU, NSF, Microsoft). Essential for the productive citizenry of the 21st century.
Computer Science is the critical component in Science, Commerce, Finance, Engineering, Social Science, and even the Humanities.
Our role as educators is challenging but very rewarding. We need to understand and teach the foundations and essence of computation.
AlAkhawayn 2015 2
Computational Thinking
Accurate and effective public health emergency response demands deep understanding of the context and the details of chaotic situations.
How Big Data Analysis Guides Hurricane Sandy Response
http://www.directrelief.org/2012/11/how-big-data-analysis-guides-our-hurricane-sandy-response/
Evaluate more candidates, amass more data and peer more deeply into applicants' personal lives and interests.
Allows employers to predict specific outcomes, such as whether a prospective hire will quit too soon, file disability claims, or steal.
Meet the New Boss: Big Data
• For example, after a half-year trial that cut attrition by a fifth, Xerox now leaves all hiring for its 48,700 call-center jobs to software.
• The model for the ideal call-center worker is a person who lives near the job, has reliable transportation and uses one or more social networks, but not more than four.
• Note that practices that even unintentionally filter out older or minority applicants can be illegal.
• Wall Street Journal, Sept 20, 2012.
The Big Data Eco-System in the Cloud
[Figure: ecosystem layers labeled Infrastructure, Inter-data center, and Analysis & Quality]
The 3 V’s of Big Data
Big Data in Numbers
Facebook:
◦ 1.5 billion users
◦ 140.3 billion friendships
Twitter in a day:
◦ 500 million tweets sent
YouTube in a day:
◦ 3 billion videos viewed
Stats from facebook.com, twitter.com and youtube.com
In 60 seconds on the Internet:
• 104+ hours of video uploaded on YouTube
• 42,408+ app downloads
• 153,804+ new photos uploaded on Facebook
• $263,947+ spent on web shopping
• 298,013+ new tweets
• 1,881,737+ YouTube video views
• 2,521,244+ search queries on Google
• 2,692,323+ new Facebook likes
• 20,234,009+ Flickr photo views
• 204,709,030+ emails sent over the Internet
Source: http://whathappensontheinternetin60seconds.com/
Reality Check: Inside a Data Center
Scaling in the Cloud
[Figure: Client Sites → Load Balancer (Proxy) → App Servers → DATABASE]
Database becomes the scalability bottleneck
Cannot leverage elasticity
Scaling for Big Data
[Figure: Client Sites → Load Balancer (Proxy) → App Servers → Key Value Stores]
Scalable and elastic, but limited consistency and operational flexibility
Key Value Stores
Every read or write of a single row is atomic.
Objective: make all operations single-sited.
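A minimal sketch of the single-sited idea, assuming rows are hash-partitioned across servers. The class and method names (`Partition`, `KeyValueStore`, `_site`) are hypothetical and do not correspond to any particular store's API. Each row lives on exactly one partition, so a read or write of that row touches one server and can be made atomic with a local lock.

```python
import hashlib
import threading


class Partition:
    """One server: holds a subset of rows and guards them with a local lock."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, columns):
        # The whole row is updated under one lock, so readers never
        # observe a half-written row.
        with self._lock:
            row = self._rows.setdefault(key, {})
            row.update(columns)

    def get(self, key):
        with self._lock:
            return dict(self._rows.get(key, {}))


class KeyValueStore:
    """Routes every operation to exactly one partition (single-sited)."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _site(self, key):
        # Stable hash of the row key decides which partition owns the row.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def put(self, key, columns):
        self.partitions[self._site(key)].put(key, columns)

    def get(self, key):
        return self.partitions[self._site(key)].get(key)


store = KeyValueStore(num_partitions=4)
store.put("user:42", {"name": "Amal", "city": "Ifrane"})
store.put("user:42", {"city": "Rabat"})
row = store.get("user:42")
```

Nothing in this sketch coordinates across partitions, which is why an update spanning two rows on different partitions gets no atomicity guarantee.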
Two approaches to scalability
Scale-up
◦ Classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions in a single node
Scale-out
◦ Cloud friendly (Key value stores)
◦ Execution at a single server
◦ Limited functionality & guarantees
◦ No multi-row or multi-step transactions
Why Consistency Matters
Online social media needs to be consistent!
◦ New unanticipated applications
The host’s dilemma
◦ Remove “unpopular” friend X as friend
◦ Post “Party Next Friday at YYY”
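The host's dilemma can be illustrated with a toy model, sketched here under the assumption that the two writes propagate independently to replicas. The data layout and function names are hypothetical. A replica that has received the post but not yet the unfriend operation shows the party invitation to X.

```python
def visible_to(state, viewer):
    """Posts a viewer can see: only posts whose author still lists them as a friend."""
    return [text for author, text in state["posts"] if viewer in state["friends"][author]]


def apply(state, op):
    """Apply one write operation to a replica's local state."""
    kind = op[0]
    if kind == "unfriend":
        _, user, friend = op
        state["friends"][user].discard(friend)
    elif kind == "post":
        _, user, text = op
        state["posts"].append((user, text))


def new_replica():
    return {"friends": {"host": {"X", "Y"}}, "posts": []}


ops = [("unfriend", "host", "X"), ("post", "host", "Party Next Friday at YYY")]

# Replica A has received both writes, in the order the host issued them.
in_order = new_replica()
for op in ops:
    apply(in_order, op)

# Replica B has received only the second write so far (the first is delayed).
lagging = new_replica()
apply(lagging, ops[1])

x_sees_in_order = visible_to(in_order, "X")
x_sees_lagging = visible_to(lagging, "X")
```

Reading from the lagging replica exposes the post to X, which is exactly the outcome the host tried to prevent by ordering the two steps.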
What about the Application Programmer?
Key-value Stores
Transactions and SQL
SQL and Scale-out
[Figure: Key Value Stores and RDBMSs positioned along two axes, Scale Out and SQL Transactions]
Cloud Elasticity
Challenge: Elasticity in Database tier
[Figure: Load Balancer, Application/Web/Caching tier, Database tier]
What the user wants…
What the service provider wants…
The need for Live Migration: If the database platform was a phone booth…
Catastrophic Failures: Geo-Replication
“As a result they had no access to email, calendars, or - most importantly - their documents and Office Online applications”
“most of digital communication - email, Lync, Sharepoint - was out”
“Most of the other high-profile companies, including the likes of Amazon, have had substantial outages … cloud services are still in their infancy, and glitches like this are going to happen”
Geo Replication Promises:
◦ Low latency reads
◦ Tolerate failures & data center outages
Geo Replication
Communication Overhead
[Figure: communication overhead diagram; link labels as extracted: 21101 99, 169, 341, 173, 260]
Latency lower-bound [SIGMOD 2015]
[Figure: Replica 1 and Replica 2 in Datacenter A and Datacenter B, connected by a wide-area link; Transaction T1 and Transaction T2; round-trip latency delay]
Commit latency of T1 + commit latency of T2 must be greater than or equal to the round-trip time between them.
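The bound can be read as a trade-off: whatever latency one datacenter saves, the other must pay. A small sketch of that arithmetic follows. The function names and the 100 ms round-trip time are hypothetical values chosen for illustration, not measurements from the talk.

```python
def min_commit_latency_t2(rtt_ms, commit_latency_t1_ms):
    """Smallest T2 commit latency allowed by L1 + L2 >= RTT."""
    return max(0.0, rtt_ms - commit_latency_t1_ms)


def satisfies_bound(rtt_ms, l1_ms, l2_ms):
    """True when a pair of commit latencies respects the lower bound."""
    return l1_ms + l2_ms >= rtt_ms


RTT_MS = 100.0  # hypothetical round-trip time between Datacenter A and Datacenter B

# If T1 commits locally with no wide-area wait, T2 must absorb the full round trip.
t2_when_t1_is_instant = min_commit_latency_t2(RTT_MS, 0.0)

# Splitting the cost evenly gives each side half a round trip.
t2_when_split_evenly = min_commit_latency_t2(RTT_MS, RTT_MS / 2)
```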
Then we have the challenge of VARIETY!
Diverse access methods
◦ OLTP
◦ OLAP
◦ Graph
Fall of ‘One Size Fits all’
Replication Driven Solution
Leverage replication by storing replicas in different representations
[Figure: OLTP Client, OLAP Client, and Graph Client connect to an Execution Engine over Row, Column, and Graph replicas]
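A minimal sketch of the idea, assuming each replica holds the same data in a layout suited to one access method. The records, the `execute` function, and the request kinds are all hypothetical; a real engine would also have to keep the replicas in sync.

```python
from collections import defaultdict

# Hypothetical records: (user_id, follows_user_id, purchase_amount)
records = [
    (1, 2, 30.0),
    (2, 3, 12.5),
    (3, 1, 7.5),
]

# Row replica: whole records keyed by id, suited to OLTP point lookups.
row_replica = {r[0]: r for r in records}

# Column replica: one list per attribute, suited to OLAP scans and aggregates.
column_replica = {
    "user_id": [r[0] for r in records],
    "follows": [r[1] for r in records],
    "amount": [r[2] for r in records],
}

# Graph replica: adjacency lists, suited to traversals.
graph_replica = defaultdict(list)
for user_id, follows, _ in records:
    graph_replica[user_id].append(follows)


def execute(kind, arg):
    """Toy execution engine: route each request to the best-suited replica."""
    if kind == "oltp":      # fetch one record by key
        return row_replica[arg]
    if kind == "olap":      # aggregate over one column
        return sum(column_replica[arg])
    if kind == "graph":     # users reachable in exactly two hops
        return sorted({n2 for n1 in graph_replica[arg] for n2 in graph_replica[n1]})
    raise ValueError(f"unknown request kind: {kind}")


record = execute("oltp", 2)
total = execute("olap", "amount")
two_hops = execute("graph", 1)
```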
BIG Data Analytics Needs
Large scale data processing is difficult!
◦ Managing hundreds or thousands of processors
◦ Managing parallelization and distribution
◦ I/O scheduling
◦ Status and monitoring
◦ Fault/crash tolerance
MapReduce to the Rescue?
Overview:
◦ Data-parallel programming model
◦ An associated parallel and distributed implementation for commodity clusters
Pioneered by Google
◦ Processes 20 PB of data per day
Popularized by open-source Hadoop project
◦ Used by Yahoo!, Facebook, Amazon, and the list is growing …
and now there is Spark…
MapReduce (Hadoop) Model
Raw Input: <key, value>
MAP
<K1,V1> <K2,V2> <K3,V3>
REDUCE
Execution Model:
◦ Data splits
◦ Map phase
◦ Intermediate data sort, partition and shuffling
◦ Reduce phase
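The execution model above can be sketched in a single process with the classic word-count example. This is an illustration of the phases only, not Hadoop's API; the function names and the choice of two splits and two reducers are assumptions.

```python
from collections import defaultdict


def map_phase(split):
    """MAP: emit an intermediate <word, 1> pair for every word in a split."""
    for _, line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs, num_reducers):
    """Partition intermediate pairs by key and group the values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    # Sort keys inside each partition before handing them to a reducer.
    return [sorted(p.items()) for p in partitions]


def reduce_phase(partition):
    """REDUCE: combine all values that share a key."""
    return {key: sum(values) for key, values in partition}


def word_count(lines, num_splits=2, num_reducers=2):
    # Raw input as <key, value> = <line number, text>.
    raw_input = list(enumerate(lines))
    # Data splits: each split would go to a different mapper.
    splits = [raw_input[i::num_splits] for i in range(num_splits)]
    intermediate = [pair for split in splits for pair in map_phase(split)]
    result = {}
    for partition in shuffle(intermediate, num_reducers):
        result.update(reduce_phase(partition))
    return result


counts = word_count(["big data is big", "data in the cloud"])
```

In a real cluster the mappers and reducers run on different machines, and the framework handles the scheduling, monitoring, and fault tolerance listed earlier.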
How Target Figured Out A Teen Girl Was Pregnant Before Her Dad Did
Stores everything customers bought + demographic information.
Crawl through the data, assign each shopper a “pregnancy prediction” score + estimate due date
Send coupons timed to specific stages of pregnancy.
Jenny is 23, lives in Atlanta, in March bought cocoa-butter lotion, a large purse, zinc and magnesium supplements and a bright blue rug
87% chance that Jenny’s pregnant and her delivery date is sometime in late August.
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Privacy in the Cloud
Data confidentiality
◦ Attacks: unauthorized accesses, side channel attacks
◦ Solutions: encryption, querying encrypted data, trusted computing
Access privacy
◦ Attacks: inferences on access patterns or query results
◦ Solutions: private information retrieval, query obfuscation
[Figure: User sends Data/Query to Cloud Servers and receives Answer]
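To make private information retrieval concrete, here is a minimal sketch of the textbook two-server XOR scheme. It assumes two servers that hold identical copies of the database and do not collude; the function names and the sample records are hypothetical. Each server sees only a uniformly random subset of indices, so neither learns which record the user wants.

```python
import secrets


def server_answer(database, subset):
    """Each server XORs together the records whose indices it was asked for."""
    answer = 0
    for index in subset:
        answer ^= database[index]
    return answer


def make_queries(n, wanted):
    """Client: a random subset for server 1, the same subset with `wanted` toggled for server 2."""
    subset1 = {i for i in range(n) if secrets.randbits(1)}
    subset2 = subset1 ^ {wanted}   # symmetric difference flips membership of `wanted`
    return subset1, subset2


def retrieve(database_copy1, database_copy2, wanted):
    q1, q2 = make_queries(len(database_copy1), wanted)
    # Every record except `wanted` appears in both answers and cancels out.
    return server_answer(database_copy1, q1) ^ server_answer(database_copy2, q2)


db = [17, 42, 7, 99, 3]          # hypothetical records, stored as small integers
value = retrieve(db, list(db), wanted=3)
```

The cost is communication linear in the database size per query, which is why practical schemes use more elaborate constructions.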
The Internet and how we use it
• “Nearly two-thirds of Internet users worldwide use some type of social media” (McCafferty, CACM 2012)
• “The internet's largest impact comes in connecting people to other people for advice or sharing valuable experiences. For about one-third (34%) of those who used the … social networks was part of the decision-making dynamic.” (Horrigan et al. 2006)
Diffusion of Information
Diffusion (a.k.a. cascade, spread) in social networks: Does it happen?
• Mass Convergence and Emergency Events (Hughes et al. 2009, Sakaki et al. 2010)
• Education through Social Networks (Cheong et al. 2010)
• Collective Action
Time Critical Social Mobilization
The 10 red balloons DARPA Challenge. The winning MIT team found the 10 balloons in 8 hours and 52 minutes using incentivized diffusion in a social network (Twitter). Science Vol 334, 28 Oct 2011.
Are we missing something?
[Figure: two networks in which users post “Morocco is the Best!”: dispersed interest in the topic versus interest from a structural subgroup]
Traditional trend detection fails to capture the difference between the two scenarios
Coordinated vs Uncoordinated Trends
Consider two traditionally similar hashtags #pawpawty and #mafiawars (using Prefuse)
#pawpawty: Traditional rank: 289. Significant as a coordinated trend: rank 24
#mafiawars: Traditional rank: 212. Insignificant as a coordinated trend: rank 25812
[Figure label: Disconnected nodes]
#pawpawty is a hashtag used by animal rights defenders while #mafiawars is used by gamers. One might entail more of a community formation…
Next question: Are topics of certain categorical nature more (or less) important as structural trends? Yes, political hashtags tend to be more significant as structural trends, while hashtags relating to gaming, music etc. are more significant as uncoordinated trends.
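The contrast between the two rankings can be sketched with a toy scoring pair. The slides do not define the structural measure, so the sketch assumes a simple stand-in: count the social links among the users who mentioned a hashtag. The graph, user names, and function names are hypothetical.

```python
def traditional_score(users):
    """Plain popularity: how many distinct users mentioned the hashtag."""
    return len(set(users))


def coordinated_score(users, edges):
    """Structural signal: how many social links connect the users who mentioned it."""
    members = set(users)
    return sum(1 for a, b in edges if a in members and b in members)


# Hypothetical social graph (undirected links) and hashtag users.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("x", "y")]
community_tag_users = ["a", "b", "c", "d"]      # tightly connected group
dispersed_tag_users = ["a", "x", "p", "q"]      # same volume, no links among them

same_volume = traditional_score(community_tag_users) == traditional_score(dispersed_tag_users)
community_links = coordinated_score(community_tag_users, edges)
dispersed_links = coordinated_score(dispersed_tag_users, edges)
```

Both hashtags look identical to a volume-based ranking, while the link count separates the community-driven one from the dispersed one.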
GeoScope
Reporting diffusion: geo-correlated trend detection
◦ Provides high level information about topics and locations
◦ Detects important location-topic pairs in a sliding window
What is the popularity of #ff in Morocco?
Which topics are of interest particularly to Ifrane today?
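The two example questions can be answered with exact counting over a sliding window, sketched below. This is for illustration only and is not GeoScope's actual algorithm; the class name, method names, and sample events are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingPairCounter:
    """Counts (location, topic) pairs seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()       # (timestamp, location, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, location, topic):
        self.events.append((timestamp, location, topic))
        self.counts[(location, topic)] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that fell out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window:
            _, location, topic = self.events.popleft()
            self.counts[(location, topic)] -= 1
            if self.counts[(location, topic)] == 0:
                del self.counts[(location, topic)]

    def popularity(self, location, topic):
        """How often `topic` was seen at `location` within the window."""
        return self.counts.get((location, topic), 0)

    def top_topics(self, location, k=3):
        """The k most frequent topics at `location` within the window."""
        local = Counter({t: c for (loc, t), c in self.counts.items() if loc == location})
        return [topic for topic, _ in local.most_common(k)]


counter = SlidingPairCounter(window=60)
counter.add(0, "Morocco", "#ff")
counter.add(10, "Morocco", "#ff")
counter.add(20, "Ifrane", "#snow")
counter.add(65, "Morocco", "#ff")   # the event at time 0 has now expired

ff_in_morocco = counter.popularity("Morocco", "#ff")
ifrane_topics = counter.top_topics("Ifrane")
```

Exact counting keeps every event in memory, so a system handling a full social media stream would need a more compact summary.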
Cities
Users of certain cities like Jakarta have diverse interests, while other cities like Cairo are interested in more local topics.
Conclusions
◦ Computational Thinking
◦ Data
◦ Cross-disciplinary
◦ Globalization
◦ Privacy and societal issues
◦ Energy efficiency