SDM-Center talk
Optimizing Shared Access to Tertiary Storage
Arie Shoshani
Alex Sim
July 10, 2001
Scientific Data Management Group, Computing Science Directorate
Lawrence Berkeley National Laboratory
Outline
• Short High Energy Physics overview (of the data handling problem)
• Description of the Storage Access Coordination System (STACS)
  — file tracking
  — file sharing between file access requests
  — coordination of "file bundles"
• The Hierarchical Resource Manager (HRM)
  — tertiary storage queuing and tape coordination
  — transfer time for query estimation
Optimizing Storage Management for High Energy Physics Applications
Data Volumes for planned HENP experiments:

Collaboration  #members/institutions  Date of first data  #events/year  Total data volume/year (TB)
STAR           350/35                 2000                10^7-10^8     300
PHENIX         350/35                 2000                10^9          600
BABAR          300/30                 1999                10^9          80
CLAS           200/40                 1997                10^10         300
ATLAS          1200/140               2004                10^9          2000

STAR: Solenoidal Tracker At RHIC
RHIC: Relativistic Heavy Ion Collider
Particle Detection Systems
STAR detector at RHIC / PHENIX detector at RHIC
Result of Particle Collision (event)
Typical Scientific Exploration Process
• Generate large amounts of raw data
  — large simulations
  — collect from experiments
• Post-processing of data
  — analyze data (find particles produced, tracks)
  — generate summary data
    • e.g. momentum, no. of pions, transverse energy
    • number of properties is large (50-100)
• Analyze data
  — use summary data as a guide
  — extract subsets from the large dataset
    • need to access events based on partial property specification (range queries)
    • e.g. ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
  — apply analysis code
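The example range query above can be sketched as a simple filter over per-event summary records. This is a hypothetical illustration: the property names AVpT, Np, and N come from the slide, but the event values are made up.

```python
# Hypothetical per-event summary records (values are illustrative).
events = [
    {"id": 1, "AVpT": 0.15, "Np": 12, "N": 5000},
    {"id": 2, "AVpT": 0.30, "Np": 12, "N": 5000},
    {"id": 3, "AVpT": 0.30, "Np": 25, "N": 7000},
]

def qualifies(e):
    # ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
    return (0.1 < e["AVpT"] < 0.2 and 10 < e["Np"] < 20) or e["N"] > 6000

print([e["id"] for e in events if qualifies(e)])  # [1, 3]
```

In the real system this predicate is not evaluated by scanning records; the bit-sliced index described later answers it without reading the events.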
Size of Data and Access Patterns

• STAR experiment
  — 10^8 events over 3 years
  — 1-10 MB per event of reconstructed data
  — events organized into 0.1-1 GB files
  — 10^15 bytes total size
  — 10^6 files, ~30,000 tapes (30 GB tapes)
• Access patterns
  — subsets of events are selected for analysis by region in a high-dimensional property space
  — 10,000-50,000 events out of a total of 10^8
  — the data is randomly scattered all over the tapes
• Goal: optimize access from tape systems
EXAMPLE OF EVENT PROPERTY VALUES
[Sample listing of property values for event 1: integer (I) multiplicities N, Npip, Npim, Nkp, Nkm, Np, Npbar, secondary counts NSEC*, and high-pT counts NHIGHpT*, plus real (R) average transverse momenta AVpT* — e.g. N(1) = 9965, AVpT(1) = 0.325951]

54 properties, as many as 10^8 events
Opportunities for Optimization

• Prevent / eliminate unwanted queries
  => query estimation (fast estimation index)
• Read only the events qualified for a query from a file (avoid reading irrelevant events)
  => exact index over all properties
• Share files brought into cache by multiple queries
  => look-ahead for files needed, and cache management
• Read files from the same tape when possible
  => coordination of file access from tape
The Storage Access Coordination System (STACS)
[Architecture diagram. Components: Query Estimator (QE) with a bit-sliced index; Query Monitor (QM); Caching Policy Module; File Catalog (FC); Hierarchical Resource Manager (HRM); Disk Cache. The user's application issues query estimation / execution requests; the QM issues file caching requests to the HRM; files are cached into, and purged from, the disk cache, which the application accesses via open, read, close.]
A typical SQL-like Query
SELECT *
FROM star_dataset
WHERE 500 < total_tracks < 1000 & energy < 3
-- The index will generate a set of files {F6 : E4,E17,E44, F13 : E6,E8,E32, …, F1036 : E503,E3112} that the query needs
-- The files can be returned to the application in any order
File Tracking (1)
File Tracking (2)
File Tracking

[Chart: file caching timeline; annotations mark where query 1 starts, where query 2 starts, where query 3 starts, and where all 3 queries are active.]
Sharing of Files by Multiple Queries
File Bundles: Multiple Event Components

[Diagram: events e1-e9 each have two components stored in separate files. Component A is stored in File 1 and File 3; Component B is stored in File 2 and File 4. For example, Component A of event e1 is in File 1 and Component B of event e1 is in File 2.]

File Bundles: (F1,F2: e1,e2,e3,e5), (F3,F2: e4,e7), (F3,F4: e6,e8,e9)
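Bundle formation can be sketched as grouping events by the combination of files they need in cache at the same time. This is a hypothetical illustration; the event-to-file assignments reproduce the diagram above.

```python
from collections import defaultdict

# Each event needs one file per component (A and B) in cache
# simultaneously; these assignments match the example above.
event_files = {
    "e1": ("F1", "F2"), "e2": ("F1", "F2"), "e3": ("F1", "F2"),
    "e4": ("F3", "F2"), "e5": ("F1", "F2"), "e6": ("F3", "F4"),
    "e7": ("F3", "F2"), "e8": ("F3", "F4"), "e9": ("F3", "F4"),
}

def build_bundles(event_files):
    """Group events by the set of files they need: each distinct
    file combination forms one bundle."""
    bundles = defaultdict(list)
    for event, files in sorted(event_files.items()):
        bundles[files].append(event)
    return dict(bundles)

bundles = build_bundles(event_files)
# -> (F1,F2: e1,e2,e3,e5), (F3,F2: e4,e7), (F3,F4: e6,e8,e9)
```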
A Typical SQL-like Query for Multiple Components

SELECT Vertices, Raw
FROM star_dataset
WHERE 500 < total_tracks < 1000 & energy < 3
-- The index will generate a set of bundles {[F7, F16: E4,E17,E44], [F13, F16: E6,E8,E32], …}
that the query needs
-- The bundles can be returned to the application in any order
-- Bundle: the set of files that need to be in cache at the same time
File Weight Policy for Managing File Bundles

• FW_ij(F_k) = 1 if file F_k appears in bundle j of query i, 0 otherwise
• Initial file weight: the sum over all bundles of all queries:

  IFW(F_k) = Σ_{i ∈ all queries} Σ_{j ∈ bundles of query i} FW_ij(F_k)

• Example:
  — query 1: file FK appears in 5 bundles
  — query 2: file FK appears in 3 bundles
  Then IFW(FK) = 8
File Weight Policy for Managing File Bundles (cont'd)

• Dynamic file weight: the file weight of a file is decremented by 1 for each of its bundles that has been processed:

  DFW(F_k) = IFW(F_k) − Σ_{i ∈ all queries} Σ_{j ∈ processed bundles of query i} FW_ij(F_k)

• Dynamic bundle weight: the sum of the dynamic weights of the bundle's files:

  DBW(B_i) = Σ_{F_k ∈ files of bundle B_i} DFW(F_k)
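The initial file weight IFW can be sketched as a nested count over queries and bundles. This is a minimal illustration; the query and bundle contents are made up to reproduce the FK example above.

```python
def initial_file_weights(queries):
    """IFW(F_k): count, over all bundles of all queries, the
    bundles that contain F_k."""
    weights = {}
    for bundles in queries:        # one list of bundles per query
        for bundle in bundles:     # each bundle is a set of files
            for f in bundle:
                weights[f] = weights.get(f, 0) + 1
    return weights

# FK appears in 5 bundles of query 1 and 3 bundles of query 2,
# so IFW(FK) = 8, as in the example on the previous slide.
q1 = [{"FK", "A%d" % i} for i in range(5)]
q2 = [{"FK", "B%d" % i} for i in range(3)]
print(initial_file_weights([q1, q2])["FK"])  # 8
```

The dynamic file weight DFW is then just this count restricted to unprocessed bundles, and DBW sums DFW over a bundle's files.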
How File Weights Are Used for Caching and Purging

• Bundle caching policy
  — for each query, in turn, cache the bundle with the most files already in cache
  — in case of a tie, select the bundle with the highest weight
  — ensures that bundles that include files needed by other bundles/queries have priority
• File purging policy
  — no file purging occurs until space is needed
  — purge the file not in use with the smallest weight
  — ensures that files needed by other bundles stay in cache
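The purging policy can be sketched as picking, among cached files not currently in use, the one with the smallest dynamic file weight. This is a hypothetical illustration; the function and variable names are not the system's actual API.

```python
def purge_victim(cached, in_use, dfw):
    """Return the cached file to purge: the one not in use with the
    smallest dynamic file weight, or None if every file is in use."""
    candidates = [f for f in cached if f not in in_use]
    return min(candidates, key=lambda f: dfw[f]) if candidates else None

# F1 has the smallest weight but is in use, so F2 is purged.
print(purge_victim(["F1", "F2", "F3"],
                   in_use={"F1"},
                   dfw={"F1": 0, "F2": 3, "F3": 5}))  # F2
```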
Other policies
• Pre-fetching policy
  — queries can request pre-fetching of bundles, subject to a limit
  — currently, the limit is set to two bundles
  — multiple pre-fetching is useful for parallel processing
• Query service policy
  — queries are serviced in round-robin fashion
  — queries that have all their bundles cached and are still processing are skipped
Managing the queues
[Diagram: Query Queue → Bundle Set → File Set → Files Being Processed / Files in Cache.]
File Tracking of Bundles
[Chart annotations: Query 1 starts here; Query 2 starts here; a bundle was found in cache; a bundle shared by two queries; a bundle (3 files) formed, then passed to the query.]
Summary
• The key to managing bundle caching and purging policies is weight assignment
  — caching: based on bundle weight
  — purging: based on file weight
• Other file weight policies are possible
  — e.g. based on bundle size
  — e.g. based on tape sharing
• Proving which policy is best is a hard problem
  — testing in the real system is expensive and requires a stand-alone setup
  — simulation: too many parameters in the query profile can vary (processing time, inter-arrival time, number of drives, size of cache, etc.)
  — modeling with a system of queues: the policies are hard to model
  — we are working on the last two methods
Read files from same tape when possible
Queuing File Transfers
• The number of PFTPs to HPSS is limited
  — the limit is set by a parameter, NoPFTP
  — the parameter can be changed dynamically
• HRM is multi-threaded
  — it issues and monitors multiple PFTPs in parallel
• All requests beyond the PFTP limit are queued
• The File Catalog provides, for each file:
  — HPSS path/file_name
  — disk cache path/file_name
  — file size
  — tape ID
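Limiting concurrent transfers can be sketched with a counting semaphore: up to NoPFTP transfers run at once, and requests beyond the limit block until a slot frees. This is a hypothetical illustration, not HRM's actual code; do_pftp is an illustrative name.

```python
import threading

NoPFTP = 2                                  # illustrative transfer limit
slots = threading.BoundedSemaphore(NoPFTP)
lock = threading.Lock()
active = peak = 0

def do_pftp(f):
    """Transfer one file; requests beyond the limit wait on the semaphore."""
    global active, peak
    with slots:
        with lock:
            active += 1
            peak = max(peak, active)
        # ... transfer file f from HPSS into the disk cache ...
        with lock:
            active -= 1

threads = [threading.Thread(target=do_pftp, args=(f,)) for f in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= NoPFTP)  # True: never more than NoPFTP concurrent transfers
```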
File Queue Management
• Goals
  — minimize tape mounts
  — still respect the order of requests
  — do not postpone unpopular tapes forever
• File clustering parameter (FCP)
  — if the file at the top of the queue is on Tape_i and FCP > 1 (e.g. 4), then up to 4 files from Tape_i are selected to be transferred next
  — then go back to the file at the top of the queue
  — the parameter can be set dynamically
[Diagram: a request queue holding F1(Ti), F3(Ti), F2(Ti), F4(Ti) among other files; the order of file service (1-5) groups the files on Tape_i together.]
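The FCP selection step above can be sketched as follows. This is a hypothetical illustration; the queue contents and tape assignments are made up.

```python
def next_batch(queue, tape_of, fcp):
    """Pick the next files to transfer: the file at the head of the
    queue plus up to fcp-1 more queued files on the same tape, in
    queue order (a sketch of the file clustering policy)."""
    if not queue:
        return []
    head_tape = tape_of[queue[0]]
    batch = [f for f in queue if tape_of[f] == head_tape][:fcp]
    for f in batch:
        queue.remove(f)
    return batch

queue = ["F1", "F5", "F3", "F2", "F6", "F4"]
tape_of = {"F1": "T1", "F3": "T1", "F2": "T1", "F4": "T1",
           "F5": "T2", "F6": "T2"}
print(next_batch(queue, tape_of, 4))  # ['F1', 'F3', 'F2', 'F4']
print(next_batch(queue, tape_of, 4))  # ['F5', 'F6']
```

With FCP = 1 this degenerates to strict queue order (one mount per file); larger values trade request order for fewer tape mounts.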
File Caching Order for Different File Clustering Parameters
File Clustering Parameter = 1 File Clustering Parameter = 10
Transfer Rate (Tr) Estimates
• Tr is needed to estimate the total time of a query
• Tr is averaged over recent file transfers, from the time the PFTP request is made to the time the transfer completes. This includes:
  — mount time, seek time, read to the HPSS RAID, and transfer to the local cache over the network
• For a dynamic network speed estimate:
  — check the total bytes for all files being transferred over small intervals (e.g. 15 sec)
  — calculate a moving average over n intervals (e.g. 10 intervals)
• Using this, the actual time spent in HPSS can be estimated
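The moving-average network rate estimate can be sketched as below. This is a hypothetical illustration; the class and parameter names are not HRM's API, and the byte counts are made up.

```python
from collections import deque

class ThroughputEstimator:
    """Sample the total bytes moved every `interval` seconds and
    average the rates of the last `n` samples."""
    def __init__(self, n=10, interval=15.0):
        self.samples = deque(maxlen=n)   # keeps only the last n rates
        self.interval = interval

    def record(self, bytes_in_interval):
        self.samples.append(bytes_in_interval / self.interval)

    def rate(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

est = ThroughputEstimator(n=10, interval=15.0)
for b in (150e6, 300e6, 225e6):   # bytes moved in three 15 s intervals
    est.record(b)
print(round(est.rate() / 1e6, 1))  # 15.0 (MB/s)
```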
Dynamic Display of Various Measurements
Query Estimate
• Given: transfer rate Tr
• Given a query for which:
  — X files are in cache
  — Y files are in the queue
  — Z files are not scheduled yet
• Let s(file_set) be the total byte size of all files in file_set
• If Z = 0, then
  QuEst = s(Y')/Tr
• If Z ≠ 0, then
  QuEst = (s(T) + q.s(Z))/Tr
  where q is the number of active queries
[Diagram: the transfer queue holding files F1(Y), F3(Y), F2(Y), F4(Y); the bracket T spans the queued files.]
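The estimate can be sketched as a direct transcription of the two cases. This is a hypothetical illustration; the byte sizes passed in are made up, and s_Y, s_T, s_Z stand for the sizes s(Y'), s(T), s(Z) above.

```python
def query_estimate(s_Y, s_T, s_Z, tr, q):
    """QuEst: if no files are left unscheduled (s_Z == 0), only the
    queued bytes s_Y matter; otherwise the unscheduled bytes s_Z
    compete with q active queries, on top of the whole queue s_T."""
    if s_Z == 0:
        return s_Y / tr
    return (s_T + q * s_Z) / tr

print(query_estimate(s_Y=600e6, s_T=0, s_Z=0, tr=10e6, q=4))      # 60.0 s
print(query_estimate(s_Y=0, s_T=600e6, s_Z=300e6, tr=10e6, q=4))  # 180.0 s
```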
Reason for q.s(Z)
[Two charts: 20 queries of length ~20 minutes. Launched 20 minutes apart, the estimate is pretty close; launched 5 minutes apart, the estimate is bad because requests accumulate in the queue.]
Error Handling
• 5 generic errors
  — file not found
    • return error to caller
  — PFTP limit reached
    • can't login
    • re-queue request, try later (1-2 min)
  — HPSS error (I/O, device busy)
    • remove the partial file from cache, re-queue
    • try n times (e.g. 3), then return error "transfer_failed"
  — HPSS down
    • re-queue request, retry repeatedly until successful
    • respond to File_status requests with "HPSS_down"
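The HPSS-error case (retry n times, then report "transfer_failed") can be sketched as below. This is a hypothetical illustration; transfer() stands in for the actual PFTP call, and the names are not HRM's API.

```python
import time

def transfer_with_retry(transfer, attempts=3, delay=1.0):
    """Retry a transient transfer error up to `attempts` times,
    waiting `delay` seconds between tries; on final failure,
    report "transfer_failed" as in the HPSS-error case above."""
    for i in range(attempts):
        try:
            return transfer()
        except IOError:
            if i == attempts - 1:
                return "transfer_failed"
            time.sleep(delay)

# A stand-in transfer that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("device busy")
    return "ok"

print(transfer_with_retry(flaky, attempts=3, delay=0))  # ok
```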
Summary
• Hierarchical Resource Manager
  — insulates applications from transient HPSS and network errors
  — limits concurrent PFTPs to HPSS
  — manages the queue to minimize tape mounts
  — provides file/query time estimates
  — handles errors in a generic way
• The same API can be used for any MSS, such as Unitree, Enstore, Castor, etc.
Future Plans
• Package the HRM (CORBA, APIs)
• Dynamic file catalog capability
  — interface HRM to HSI, eliminating the need for a file catalog
• Expand use of HRM to include "write capability"
• Build upon results of the Probe project
  — large block size, tape striping
• Provide hints to the MPI-IO module
• Apply to HENP experiments, initially STAR
• Work with the PPDG collaboration pilot (SciDAC)
• Perform research on "object granularity caching"