Multi-Dimensional Hashed Indexed Metadata/Middleware
(MDHIM) Project
James Nunez
Ultrascale Systems Research Center
High Performance Computing Systems Integration
May 10, 2012
LA-UR 12-10467 and LA-UR-11-11964
Agenda
• Project Personnel
• Motivation - PLFS Problem
• MDHIM - Description
• MDHIM - How It Works
• Communication
• Data Stores
• Future Work
MDHIM Project Personnel
• Team Members
  – Gary Grider - Project PI
  – James Nunez - Developer
  – Hugh Greenberg - Developer
• Interested Parties
  – CMU: Garth Gibson, Chuck Cranor, and students(?)
MOTIVATION – PLFS: REVIEW AND PROBLEMS
DOE Parallel Apps “should” do Parallel IO
• Periodic checkpoint writes, because the thousands of compute nodes are not reliable
• Visualization writes
• Writes are synchronized
• Hundreds of thousands to millions of synchronized writes can be difficult for the file system
• Two most common write patterns
  – N-1, where N procs write to 1 shared file
  – N-N, where N procs write to N non-shared files
Figure: N-N file IO vs. N-1 file IO.
N to 1 is Really Better for the Long Run If Issues Can Be Addressed
• N to N at a billion way will just be a non-starter.
• Further, even N to M seems silly for the user to have to manage; we have to solve the same problems for N to M as we do for N to 1.
• N to 1 really looks like the best long-run answer, but we have to fix
  – Concurrency management
  – Metadata management
  – Etc.
• Let's take the best of both N to N and N to 1.
PLFS Decouples Logical from Physical
Figure: the PLFS virtual layer presents the single logical file /foo while mapping it onto per-host subdirectories (host1/, host2/, host3/) in the physical underlying parallel file system; each subdirectory holds log-structured data files (data.131, data.132, data.279, data.281, data.148) plus an index (indx) file.
LBNL PatternIO Benchmark
Figure: PatternIO results with PLFS and without PLFS for stripe-aligned, block-aligned, and unaligned writes. PLFS makes alignment and blocksize irrelevant!
PLFS Checkpoint BW Summary
Figure: checkpoint bandwidth speedups with PLFS of 2X, 5X, 7X, 23X, 28X, 83X, and 150X across applications. These were small problems; the bigger the problem, the bigger the win.
Recall: PLFS Independent Writing
Each proc (P0 through P9) has a block of data:
{0,47},{47,10},{57,100},{157,11},{168,60},{228,50},{278,90},{368,11},{379,30},{409,60}
PLFS writes each block to a unique physical file (one log-structured file per proc) and remembers where the data is with an index.
Logical Offset   Physical Offset   Length   Physical Block ID
0                0                 47       0
47               0                 10       1
57               0                 100      2
157              0                 11       3
168              0                 60       4
228              0                 50       5
278              0                 90       6
368              0                 11       7
379              0                 30       8
409              0                 60       9
Notice that if every write were the same size (i.e. it was structured), then PLFS could save just one index entry identifying the pattern instead of one entry per write.
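If PLFS could recognize such a structured pattern, a single index entry could describe all of the writes. The two struct layouts below illustrate the idea; the names and fields are hypothetical for this example and are not taken from the PLFS source.

#include <sys/types.h>
#include <stddef.h>

/* One entry per write, as in the table above (names are illustrative). */
struct plfs_index_entry {
    off_t  logical_offset;    /* offset in the logical file           */
    off_t  physical_offset;   /* offset within the per-proc data file */
    size_t length;            /* bytes written                        */
    int    block_id;          /* which per-proc data file             */
};

/* One entry per pattern: count equally sized, logically contiguous writes
 * (one per proc) can be reconstructed from these four fields alone. */
struct plfs_pattern_entry {
    off_t  logical_start;     /* logical offset of the first write    */
    size_t write_size;        /* common size of every write           */
    int    first_block_id;    /* data file of the first write         */
    int    count;             /* number of writes in the pattern      */
};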
Naïve Approach: Every Process Reads All Indexes to Have a Complete Index in Each Proc's Memory to Honor Any Unknown Read Request (N Squared!)
Effects of Everyone Reading All PLFS Indexes
MDHIM – WHAT IS IT?
The MDHIM Project
• The Multi-Dimensional Hashed Indexed Metadata/Middleware (MDHIM) project aims to develop a research prototype infrastructure capable of managing massive amounts of index information, representing even larger amounts of scientific data, to enable data exploration at enormous scale.
• Another way to think about the utility of what we are proposing is as a highly scalable/parallel multi-dimensional Virtual Storage Access Method (VSAM) or key/value store (Bigtable/HBase).
Key/Value Stores Already Exist, Why Another?
• What about new indexing technology like Bigtable, HBase, ...?
  – Written in languages not suited for extreme scientific computing environments (Java, Erlang, etc.)
  – Built on top of immature infrastructures like HDFS, etc.
  – Even Hypertable, which runs on standard file systems, does not leverage the real power of parallel file system technology
  – They don't really take advantage of supercomputer interconnects
• We can make data operations scale on commercial parallel file systems today by utilizing shared-nothing/append-only concepts, just as Google has done with lessened semantics
• Our premise is that GFS/HDFS recreated many of the wheels that parallel file systems have had for years; it just went unnoticed.
• Shouldn't there be a way to make needed metadata/indexing operations scale on top of existing commercial file systems?
• Can we leverage supercomputer-style interconnects to go beyond shared-nothing indexing techniques?
How? Leverage
– GIGA+ and its algorithms for achieving very high concurrency to very large numbers of entries within a single file system directory
– Commercial parallel file system files, directories, and tree indexing
– Bigtable/Hypertable have reasonable key/value APIs
– Public ISAM and other index methods exist, but most are not parallel (PBL ISAM, Tokutek, FastBit, etc.)
– These leveraged capabilities are the required ingredients for building a scalable Parallel ISAM:
  • append-only methods for data and metadata
  • shared nothing to everything
  • tree indexing, ranging, etc.
  • lightweight embeddable single-machine ISAM
– Supercomputing-class interconnects, enabling "shared some"
MDHIM – HOW IT WORKS
What Does MDHIM Look Like?
Figure: MDHIM architecture. Each process of the distributed application (Rank-0 through Rank-N) runs its application code alongside the MDHIM client library. A subset of the processes also run an MDHIM range server; each range server manages an independent data store covering one key range (Range 0, Range 1, ..., Range N), and together the range servers present a single logical database.
MDHIM API
• MDHIM init
  – Initializes MDHIM structures and creates range server threads.
• MDHIM open
  – Creates directories for data and key files.
• MDHIM insert
  – A list function that inserts new records with key and record data.
• MDHIM flush
  – Makes key distribution information available to the clients.
• MDHIM find
  – Finds a record using the primary key (match, best higher, or best lower) and sets the absolute record number.
• MDHIM finalize
  – Shuts down range server threads.
MDHIM API
• MDHIM close
  – Closes an MDHIM file.
• MDHIM get
  – Gets the previous/next/first/last record.
• MDHIM read
  – A list function that reads records (key and data), using absolute record numbers.
• MDHIM readkey
  – A list function that reads the keys, using absolute record numbers.
• MDHIM delete
  – Deletes the current (last read) record.
• MDHIM update
  – A list function that updates data (not the key) in records, using absolute record numbers.
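To make the call sequence concrete, here is a minimal sketch in C under MPI. The mdhim_* names and signatures are hypothetical placeholders for the operations listed above (the slides do not show the prototype's actual C API), so read this as pseudocode for the intended flow rather than working MDHIM code.

#include <mpi.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical prototypes standing in for the operations listed above. */
typedef struct mdhim mdhim_t;
mdhim_t *mdhim_init(MPI_Comm comm);                       /* collective: start range servers      */
int mdhim_open(mdhim_t *md, const char *key_path);        /* collective: create data/key dirs     */
int mdhim_insert(mdhim_t *md, const void *key, size_t klen,
                 const void *data, size_t dlen);          /* routed to the owning range server    */
int mdhim_flush(mdhim_t *md);                             /* collective: publish key distribution */
int mdhim_find(mdhim_t *md, const void *key, size_t klen,
               void *data, size_t dlen);                  /* lookup by primary key                */
int mdhim_close(mdhim_t *md);
int mdhim_finalize(mdhim_t *md);                          /* collective: stop range servers       */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    mdhim_t *md = mdhim_init(MPI_COMM_WORLD);
    mdhim_open(md, "/tmp/keyfile");

    int key = 57;
    const char *payload = "record data for key 57";
    mdhim_insert(md, &key, sizeof(key), payload, strlen(payload) + 1);

    mdhim_flush(md);                  /* now every client knows the key ranges */

    char buf[256];
    if (mdhim_find(md, &key, sizeof(key), buf, sizeof(buf)) == 0)
        printf("found key %d: %s\n", key, buf);

    mdhim_close(md);
    mdhim_finalize(md);
    MPI_Finalize();
    return 0;
}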
MDHIM Operation: Initialization
Initialization is a collective call: it fires up the index range servers on the specified nodes, sets up the range computation, and ensures every process knows how to compute a key's range.
Figure: all application processes (Rank-0 through Rank-4) call the collective initialization; three of them additionally start MDHIM range server threads.
MDHIM Operation: Open
Open/create is a collective call that ensures the specified files can be written to and creates subdirectories.
Figure: all processes (Rank-0 through Rank-4) open /tmp/keyfile with the range server list 1,2,4 and key ranges every 50 keys; Ranks 1, 2, and 4 run the MDHIM range server threads.
MDHIM Operation: Insert
Processes use the range computation to send records to the appropriate range server for indexing; notice that ranges are created on the fly by the range servers.
Figure: first, nodes 0 and 2 insert records with keys 57 and 1, creating independent data stores for ranges 0-49 and 50-99. Later, nodes 0, 2, 3, and 4 insert records with keys 351, 43, 72, and 127, creating data stores for ranges 100-149 and 350-399. Each range server maintains a range list recording the ranges it serves and the number of records in each.
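Given the "key ranges every 50" example from the open call, the client-side range computation could be as simple as the sketch below. This assumes integer keys, a fixed range size, and round-robin assignment of ranges to the range server list; the actual MDHIM prototype may partition keys differently, so treat the mapping as illustrative.

#include <stdio.h>

#define RANGE_SIZE 50   /* "Key Ranges Every 50" from the open example */

/* Start of the key range that owns this key (e.g. key 127 -> range 100-149). */
static int range_start(int key) { return (key / RANGE_SIZE) * RANGE_SIZE; }

/* Index into the range server list, assigning ranges round-robin. */
static int range_server(int key, int num_servers)
{
    return (key / RANGE_SIZE) % num_servers;
}

int main(void)
{
    int keys[] = { 57, 1, 351, 43, 72, 127 };   /* keys from the insert example */
    int num_servers = 3;                        /* e.g. the server list 1,2,4   */

    for (int i = 0; i < 6; i++)
        printf("key %3d -> range %d-%d, server index %d\n",
               keys[i], range_start(keys[i]),
               range_start(keys[i]) + RANGE_SIZE - 1,
               range_server(keys[i], num_servers));
    return 0;
}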
Index Types
– Primary Global – a global index with key order the same as record order; Primary Global keys point at data records.
  • There must be exactly one of these per PISAM file.
  • If there is a Secondary Global index, the Primary Global index must be unique.
  • If there is no Secondary Global index, the Primary Global index doesn't have to be unique.
  • If you don't want the Primary Global index functionality, setting the Primary Global keys to random (possibly random but unique) values is a way to accomplish this (which implies you only want to use Secondary Local indexes).
– Secondary Local
  • There can be multiples of these and they do not have to be unique.
  • There will be one instance of each Secondary Local index for each Global index (Primary and Secondaries).
  • There can be more than one type of Secondary Local index, to enable multiple statistical representations of the Secondary Local key values.
– Secondary Global – a global index where key order is not the same as record order. A Secondary Global key does not point at a data record; it points at the Primary Global key for the record, which does point at a data record.
  • There can be multiples of these and they do not have to be unique.
  • If there is a Secondary Global index, the Primary Global keys must be unique.
  • The Primary Global index is used to gain access to the data records.
MDHIM Operation: Flush
Flush is collective: each range server reports all of its ranges and how many records are in each, and the combined range list is broadcast to every process.
Figure: after the flush, every process (Rank-0 through Rank-4) holds the complete range list (ranges 0, 50, 100, and 350 with 2, 2, 2, and 1 records respectively) covering the independent data stores for ranges 0-49, 50-99, 100-149, and 350-399.
Key Distribution Methods
– For all indexes, one of the functions of MDHIM on flush is to share key distribution. For every range, for every index (Primary Global, Secondary Global, and Secondary Local), key distribution information can be kept as records are inserted, and this information can then be provided to the query client for use in query optimization.
– The amount of key distribution information kept by the range servers for each key can be specified. The minimum would be the number of records per range for the Primary Global key, which gives minimal key distribution information and global order for the records across ranges.
– Key distribution information can be ignored; this would imply basically map/reduce- or Hive-like functionality. It minimizes the inter-machine communication required and behaves as if we were not really on a supercomputer with a high-performance interconnect.
– Key distribution can be extremely rich and could provide statistics on each index for each range, for example quartiles, distribution type, mean, standard deviation, etc. The exact number of these statistical key distribution representations is not known at this time and will need to be flexible, allowing the user to specify key distribution information for each index for each range.
– This key distribution capability is one of the things that shows how the high-performance interconnect of a parallel machine can benefit us: more extensive statistical models per range or sub-range per index, versus the cost to store the records.
– This is part of the power of the MDHIM approach: to give the user the flexibility to have as many dimensions of indices as desired, with as much or as little key distribution information as desired. We are trying to provide a tool that allows people to represent enormous volumes of record-oriented data with very small amounts of index/statistical distribution information.
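As a concrete illustration of the "as much or as little as desired" point, the sketch below shows what a per-range, per-index key distribution record might hold. The struct and its field names are assumptions made for this example rather than MDHIM's actual data layout; the minimal configuration would populate only the record count.

#include <stdint.h>

/* Illustrative per-range, per-index key distribution statistics that a range
 * server could accumulate during inserts and publish to clients at flush time.
 * Field names are placeholders for this sketch, not MDHIM's real structures. */
struct key_dist_stats {
    int64_t  range_start;     /* first key value covered by this range   */
    int64_t  range_end;       /* last key value covered by this range    */
    uint64_t record_count;    /* minimum information: records per range  */

    /* Optional richer statistics for query optimization. */
    double   min_key;
    double   max_key;
    double   mean;
    double   stddev;
    double   quartiles[3];    /* 25th, 50th, and 75th percentiles        */
};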
COMMUNICATION
MDHIM Communication
Figure: MDHIM communication. The processes of the distributed application (Rank-0 through Rank-N), each running application code plus the MDHIM client library, talk to the MDHIM range servers over a communications facility (MPI, GASNet, etc.).
MDHIM Communication
• Objective
  – Make use of the supercomputer's high-speed interconnect for communication with range services and key statistics distribution
• Currently using MPI
• Future
  – Abstract layer for communication
  – MPI, shmem, UPC (GASNet), ...
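One plausible shape for such an abstraction layer is a small table of operations that MDHIM calls instead of invoking MPI directly; an MPI backend would fill the table with MPI-based implementations, while shmem or GASNet backends would supply their own. The interface below is a sketch of that idea with assumed names, not the project's actual design.

#include <stddef.h>

/* Sketch of a pluggable communication layer; all names are illustrative. */
typedef struct mdhim_comm_ops {
    /* send a request to the range server running on dest_rank */
    int (*send)(void *ctx, int dest_rank, const void *buf, size_t len);
    /* receive a reply, reporting which rank it came from */
    int (*recv)(void *ctx, int *src_rank, void *buf, size_t len);
    /* broadcast flush results (range lists, key statistics) from root_rank */
    int (*bcast)(void *ctx, void *buf, size_t len, int root_rank);
    /* collective barrier used by init/open/flush */
    int (*barrier)(void *ctx);
    void *ctx;   /* backend-specific state, e.g. an MPI communicator */
} mdhim_comm_ops;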
DATA STORES
Data Store
• Requirements
  – Embeddable
  – Language
  – No client-server model
  – Support for range queries
• Currently using the Program Base Library Indexed Sequential Access Method (PBL ISAM)
• Future
  – Abstract layer for data stores
  – Berkeley DB, LevelDB, Tokutek, B-trees, ...
Candidate Data Stores
Data Store Investigation
• Performance
• DB file size
• Features
TESTING
Testing
• Testing
  – Data store routines vs. data store routines with wrapper code vs. MDHIM routines
  – Insert scaling study
  – Against map-reduce style and other key-value data stores
• Test Harness
  – Simulates an application inserting records in parallel (a sketch follows this list)
• Benchmarks
  – TPC-C, ...?
  – Sort benchmarks
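A minimal sketch of such a harness is shown below: each MPI rank inserts records into a disjoint slice of the key space and rank 0 reports the aggregate insert rate. The MDHIM calls themselves are left as comments because the prototype's exact API is not shown here; only the timing and partitioning structure is illustrated.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const long inserts_per_rank = 100000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* MDHIM init/open would be called collectively here. */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (long i = 0; i < inserts_per_rank; i++) {
        long key = (long)rank * inserts_per_rank + i;  /* disjoint keys per rank */
        (void)key;
        /* MDHIM insert of (key, payload) would go here. */
    }

    /* MDHIM flush would publish key distribution information here. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%ld inserts across %d ranks in %.3f s (%.0f inserts/s)\n",
               inserts_per_rank * nprocs, nprocs, t1 - t0,
               inserts_per_rank * nprocs / (t1 - t0));

    /* MDHIM close/finalize here. */
    MPI_Finalize();
    return 0;
}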
RESEARCH QUESTIONS AND FUTURE WORK
What are the Research Questions?
• Can massively parallel indexing be done on existing commercial file systems, without requiring new data-intensive-style file systems?
• Can we build a multi-dimensional indexing system that effectively represents petabytes with only megabytes to gigabytes of representation data?
• Can the application of "shared some", as opposed to shared nothing, be done using supercomputing-class interconnect technology, and how would this "shared some" environment help over the current "shared nothing" fad (is there something in between map-reduce and transactional SQL)?
Current and Future Work
• Current
  – Full functionality/tested
  – Work through scaling issues
  – Enhanced flush statistics
  – Test against competing distributed parallel data stores
• Future
  – Rebalancing of range servers
  – Restart capability
  – User-defined key comparison
  – User-defined statistical functions
Summary
• Multiple dimensions of global indices
• Multiple local indices
• Efficient queries
• "Stab" or range queries into global indices
• Range queries into local indices
• Range queries over combinations of global and local indices
• Can be treated as a global-index-only system
• Can be treated as a local-index-only system (like Hive, etc.)
• Utilizes the interconnect present in HPC systems (not really present in classical data-intensive systems) to provide:
  – Rich key distribution information to make queries optimized and parallel
  – Representation of petabytes of record information with megabytes of key distribution information
Acknowledgements
This work was supported by the United States Department of Defense & used resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory.