D4M-1
Jeremy Kepner
MIT Lincoln Laboratory
3 October 2012
Transforming Big Data with D4M
This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
D4M-2
• Nicholas Arcolano
• Michelle Beard
• Bob Bond
• Josh Haines
• Matthew Schmidt
• Ben Miller
• Benjamin O’Gwynn
• Tamara Yu
• Bill Arcand
• Bill Bergeron
Acknowledgements
• David Bestor• Chansup Byun• Matt Hubbell• Pete Michaleas• Julie Mullen• Andy Prout• Albert Reuther• Tony Rosa• Charles Yee• Dylan Hutchinson
D4M-3
• Introduction
• Theory
• Results
• Summary
Outline
D4M-4
• Cross-Mission Challenge: Detection of subtle patterns in massivemulti-source noisy datasets
Example Applications of Graph Analytics
Cyber
• Graphs represent communication patterns of
computers on a network
• 1,000,000s – 1,000,000,000s network events
• GOAL: Detect cyber attacks or malicious software
Social
• Graphs represent relationships between
individuals or documents
• 10,000s – 10,000,000s individual and interactions
• GOAL: Identify hidden social networks
• Graphs represent entities and
relationships detected through multi-
INT sources• 1,000s – 1,000,000s tracks
and locations• GOAL: Identify anomalous
patterns of life
ISR
D4M-5
Four Ecosystems Dominate Cloud Computing
Enterprise
Big Data DBMS
Big Compute
• Each ecosystem is at the center of a multi-$B market• Pros/cons of each are numerous; diverging hardware/software• Some missions can exist wholly in one ecosystem; some can’t
- Interactive- On-demand- Elastic
- High performance- Parallel Languages- Scientific computing
- Java- Map/Reduce- Easy admin
- Indexing- Search- Security
D4M-6
• LLGrid MapReduce provides map/reduce interface in a big compute environment
• D4M provides an interactive parallel scientific computing environment to databases
LLGridEnterprise
Big Data DBMS
- Interactive- On-demand- Elastic
- High performance- Parallel Languages- Scientific computing
- Java- Map/Reduce- Easy admin
- Indexing- Search- Security
Big Compute
MapReduce
Four Ecosystems Dominate Cloud Computing
D4M-7
Big Data + Big Compute ChallengeDatabase Worldview
“It’s the data!”Delivering data is the end
Supercomputing Worldview“It’s the computer!”
Delivering data is the start
Shared Compute
Separate DataSeparate Compute
Shared Data
• Database and supercomputing views are fundamentally different• Have never coexisted; do not know how to coexist• Big Data “Analytics” are forcing them together• Current standard practice duplicates hardware and data
D4M-8
Big Data + Big Compute Stack
High Level Composable API: D4M (“Databases for Matlab”)
Weak Signatures,Noisy Data,Dynamics
Novel Analytics for:Text, Cyber, Bio
Interactive Super-computing
High Performance Computing: LLGrid + Hadoop
Distributed Database/ Distributed File System
Distributed Database: Accumulo (triple store)
A
C
E
B
Array Algebra
• Combining Big Compute and Big Data enables entirely new domains
D4M-9
High Level Language: D4Mhttp://www.mit.edu/~kepner/D4M
Distributed Database
Query:AliceBobCathyDavidEarl
Associative ArraysNumerical Computing Environment
D4MDynamic Distributed Dimensional Data Model A
C
DE
B
A D4M query returns a sparse matrix or a graph…
…for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
D4M-10
• Introduction
• Theory– Associate Arrays– Incidence Matrix
• Results
• Summary
Outline
D4M-11
What are Spreadsheets and Big Tables?
Spreadsheets
Big Tables
• Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
• Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)
• Simultaneous diverse data: strings, dates, integers, reals, …• Simultaneous diverse uses: matrices, functions, hash tables, databases, …• No formal mathematical basis; Zero papers in AMA or SIAM
D4M-12
D4M Key Concept:Associative Arrays Unify Four Abstractions
• Extends associative arrays to 2D and mixed data typesA('alice ','bob ') = 'cited '
or A('alice ','bob ') = 47.0
• Key innovation: 2D is 1-to-1 with triple store('alice ','bob ','cited ')
or ('alice ','bob ',47.0)
x ATxAT
alice
bob
alice
carl
bob
carlcited
cited
D4M-13
• Key innovation: mathematical closure– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B A - B A & B A|B A*B
• Enables composable query operations via array indexingA('alice bob ',:) A('alice ',:) A('al* ',:)
A('alice : bob ',:) A(1:2,:) A == 47.0
• Simple to implement in a library (~2000 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
Composable Associative Arrays
• Complex queries with ~50x less effort than Java/SQL• Naturally leads to high performance parallel implementation
D4M-14
Associative Array Definitions
• Keys and values are from the infinite strict totally ordered set S
• Associative array A(k) : Sd S, k=(k1,…,kd), is a partial function from d keys (typically 2) to 1 value, where
A(ki) = vi and otherwise
• Binary operations on associative arrays A3 = A1 A2, where = f() or f(), have the properties
– If A1(ki) = v1 and A2(ki) = v2, then A3(ki) is
v1 f() v2 = f(v1,v2) or v1 f() v2 = f(v1,v2)
– If A1(ki) = v or and A2(ki) = or v, then A3(ki) is
v f() = v or v f() = • High level usage dictated by these definitions• Deeper algebraic properties set by the collision function f()• Frequent switching between “algebras” (how spreadsheets are used)
D4M-15
• Associative arrays can be constructed from a few definitions
• Similar to linear algebra, but applicable to a wider range of data
• Key questions– Which linear algebra properties do apply to associative arrays (intuitive)– Which linear algebra properties do not apply to associative arrays
(watch out)– Which associative array properties do not apply to linear algebra (new)
Theory Questions
LinearAlgebra
watch out
AssociativeArrays
intuitivenew
D4M-16
• Book: “Graph Algorithms in the Language of Linear Algebra”• Editors: Kepner (MIT-LL) and Gilbert (UCSB)• Contributors:
– Bader (Ga Tech)– Bliss (MIT-LL)– Bond (MIT-LL)– Dunlavy (Sandia)– Faloutsos (CMU)– Fineman (CMU)– Gilbert (USCB)– Heitsch (Ga Tech)– Hendrickson (Sandia)– Kegelmeyer (Sandia)– Kepner (MIT-LL)– Kolda (Sandia)– Leskovec (CMU)– Madduri (Ga Tech)– Mohindra (MIT-LL)– Nguyen (MIT)– Radar (MIT-LL)– Reinhardt (Microsoft)– Robinson (MIT-LL)– Shah (USCB)
References
D4M-17
• Introduction
• Theory– Associate Arrays– Incidence Matrix
• Results
• Summary
Outline
D4M-18
Digraphs are Black & White
D4M-19
The World is Color
Artist: Ann Pibal; Painting: “XCRS”
D4M-20
5 Edge Colors
Artist: Ann Pibal; Painting: “XCRS”
BlueSilverGreenOrangePink
D4M-21
20 Vertices
Artist: Ann Pibal; Painting: “XCRS”
V2
V1
V3
V4
V5
V6
V7
V8
V9
V10
V11
V13
V14V12
V15
V16
V17
V18
V19
V20
D4M-22
1 Isolated Standard Edge
Artist: Ann Pibal; Painting: “XCRS”
P4
D4M-23
12 Multi Edges
Artist: Ann Pibal; Painting: “XCRS”
B1,S
1,G
1,O
1,O
2,P1
B2,S
2,G
2,O
3,O
4,P2
D4M-24
18 Hyper Edges
Artist: Ann Pibal; Painting: “XCRS”
B1,S
1,G
1,O
1,O
2,P1
B2,S
2,G
2,O
3,O
4,P2
O5
P3B1
,S1,
G1,
O1,
O2,
P1
B2,S
2,G
2,O
3,O
4,P2
P5
P6
P7
P8
D4M-25
27 Edge Orderings
Artist: Ann Pibal; Painting: “XCRS”
O5
P3B1
,S1,
G1,
O1,
O2,
P1
B2,S
2,G
2,O
3,O
4,P2
P5
P6
P7
P8
O5 < P3,P6,P7,P8O5 < B1,S1,G1,O1,O2,P1O5 < B2,S2,G2,O3,O4,P2 < P7,P8
D4M-26
52 Standard Multi Edges
Artist: Ann Pibal; Painting: “XCRS”
O5x5
P3x3(B
1,S1
,G1,
O1,
O2,
P1)x
2
(B2,
S2,G
2,O
3,O
4,P2
)x4P5x2
P6x2
P7x2
P8x2
D4M-27
Summary Observations
Artist: Ann Pibal; Painting: “XCRS”
• Standard edge representation fragments hyper edges– Information is lost
• Digraph representation compresses multi-edges– Information is lost
• Matrix representation drops edge labels– Information is lost
• Standard graph representation drops edge order– Information is lost
• Need edge representation that preserves information
D4M-28
Solution: Incidence Matrix
Artist: Ann Pibal; Painting: “XCRS”
Edge Color Order V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
B1 Blue 2 1 1 1
S1 Silver 2 1 1 1
G1 Green 2 1 1 1
O1 Orange 2 1 1 1
O2 Orange 2 1 1 1
P1 Pink 2 1 1 1
B2 Blue 2 1 1 1 1 1
S2 Silver 2 1 1 1 1 1
G2 Green 2 1 1 1 1 1
O3 Orange 2 1 1 1 1 1
O4 Orange 2 1 1 1 1 1
P2 Pink 2 1 1 1 1 1
O5 Orange 1 1 1 1 1 1 1
P3 Pink 2 1 1 1 1
P4 Pink 2 1 1
P5 Pink 2 1 1 1
P6 Pink 2 1 1 1
P7 Pink 3 1 1 1
P8 Pink 3 1 1 1
D4M-29
• Introduction
• Theory
• Results– Network monitoring example– Bioinformatics example
• Summary
Outline
D4M-30
Graph Construction Using D4M:Explode Schema
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
log_id src_ip server_ip001 128.0.0.1 208.29.69.138
002 192.168.1.2 157.166.255.18
003 128.0.0.1 74.125.224.72
Dense Table
src_ip|128.0.0.1 src_ip|192.168.1.2 server_ip|157.166.255.18 server_ip|208.29.69.138 server_ip|74.125.224.72
log_id|001 1 0 0 1 0
log_id|002 0 1 1 0 0
log_id|003 1 0 0 0 1
Exploded Table
Use as row indices
Create columns for each unique
type/value pair
D4M-31
Graph Construction Using D4M:Storing Exploded Data as Triples
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
src_ip|128.0.0.1 src_ip|192.168.1.2 server_ip|157.166.255.18 server_ip|208.29.69.138 server_ip|74.125.224.72
log_id|001 1 0 0 1 0
log_id|002 0 1 1 0 0
log_id|003 1 0 0 0 1
Exploded Table
D4M stores the triple data representing both the exploded table and its transpose
Table Triples Table Transpose Triples
Row Column Valuelog_id|001 src_ip|128.0.0.1 1log_id|001 server_ip|208.29.69.138 1log_id|002 src_ip|192.168.1.2 1log_id|002 server_ip|157.166.255.18 1log_id|003 src_ip|128.0.0.1 1log_id|003 server_ip|74.125.224.72 1
Row Column Valueserver_ip|157.166.255.18 log_id|002 1server_ip|208.29.69.138 log_id|001 1server_ip|74.125.224.72 log_id|003 1src_ip|128.0.0.1 log_id|001 1src_ip|128.0.0.1 log_id|003 1src_ip|192.168.1.2 log_id|002 1
D4M-32
Graph Construction Using D4M:Construct Associative Arrays
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
D4M Query #1keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);
(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1) ...
D4M-33
Graph Construction Using D4M:Construct Associative Arrays
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
D4M Query #1keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);
D4M Query #2data = T(Row(keys), :);
(‘log_id|001’,‘server_ip|208.29.69.138’,1)(‘log_id|001’,‘src_ip|128.0.0.1’,1)(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1) ...(‘log_id|002’,‘server_ip|157.166.255.18’,1)(‘log_id|002’,‘src_ip|192.168.1.2’,1)(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1) ...(‘log_id|003’,‘server_ip|74.125.224.72’,1)(‘log_id|003’,‘src_ip|128.0.0.1’,1)(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1) ...
D4M-34
Graph Construction Using D4M:Construct Associative Arrays
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
D4M Query #1keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);
D4M Query #2data = T(Row(keys), :);
Associative Array AlgebraG = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);
(‘src_ip|128.0.0.1’,‘server_ip|208.29.69.138’,1)(‘src_ip|128.0.0.1’,‘server_ip|74.125.224.72’,1)(‘src_ip|192.168.1.2’,‘server_ip|157.166.255.18’,1) ...
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
D4M-35
• Graphs can be constructed with minimal effort using D4M queries and associative array algebra
Graph Construction Using D4M:Construct Associative Arrays
Distributed Database
Raw Data
CSV Files
Assoc.Arrays
D4M Query #1keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);
D4M Query #2data = T(Row(keys), :);
Associative Array AlgebraG = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);
Adj(G);
D4M-36
Accumulo Ingestion Scalability StudyLLGrid MapReduce With A Python Application
Data #1: 5 GB of 200 files Data #2:
30 GB of 1000 files
4 Mil e/s
Accumulo Database: 1 Master + 7 Tablet servers
D4M-37
• Introduction
• Theory
• Results– Network monitoring example– Bioinformatics example
• Summary
Outline
D4M-38
Relative Cost per DNA Sequence
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012
High Volume Sequencer
Big Data
Energy Efficient
PortableSequencer
D4M-39
Example Disease OutbreakMay-July 2011 - Virulent E. Coli Outbreak Germany
Sequencing and crowd source analysis showed promising potential -> Still too slow
Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses Publishing sequence data allowed for broad community to fully characterize pathogen
DNASequencereleased
Outbreakidentified
SpanishCucumbersimplicated
SproutsIdentified
Deaths
www.rki.de EHEC final report
diarrheakidney
D4M-40
RNA Reference Set Collected Sample
sequence word (10mer)
refe
renc
ese
quen
ce ID
unkn
own
sequ
ence
IDA1 A2
A1 A2'
refe
renc
eba
cter
ia
unkn
own
bact
eria
sequence word (10mer)
refe
renc
ese
quen
ce ID
unknown sequence ID
Sequence Matching Graph Sparse Matrix Multiply in D4M
• Associative arrays provide a natural framework for sequence matching
D4M-41
Database Automatically ComputesReference 10mer Distribution
• Using 10mer distribution can quickly select reference 10mers that maximally differentiate sample sequences and eliminate most 10mers
50% 5% 0.5%
D4M-42
Leveraging “Big Data” Technologies for High Speed Sequence Matching
• High performance triple store database trades computations for lookups• Used Apache Accumulo database to accelerate comparison by 100x• Used Lincoln D4M software to reduce code size by 100x
100 10000 100000010
100
1000
10000
code volume (lines)
run
time
(sec
onds
)D4M
D4M +Triple Store
BLAST
100x
fas
ter
100x smaller
D4M-43
• Big data is found across a wide range of areas– Document analysis– Computer network analysis– DNA Sequencing
• Currently there is a gap in big data analysis tools for algorithm developers
• D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation
Summary