Storage Management Technical Evolution Group
Wahid Bhimji, Daniele Bonacorsi

This meeting
- Welcome and objectives for the group
- Procedure and logistics
- Starting to set scope and priorities

WLCG TEG objectives
To reassess the implementation of the grid infrastructures that we use in the light of the experience with LHC data, and technology evolution, but never forgetting the important successes and lessons, and ensuring that any evolution does not disrupt our successful operation.

Some snippets from the rationale
- Analysis of experience in Storage Management from experiment and site perspectives
- Focus on operational needs: the actual requirements of our community
- Identify key topics and trigger discussions
- Commonalities between experiments and grids
- Ensure evolution does not disrupt operations

Deliverables
- Assessment of the current situation with middleware, operations and support structure: where we are (D1)
- Strategy document setting out a plan and needs for the next 2-5 years: where we will be (D2) and how we are going to get there (D3)
(Diagram: us now and the current operating environment feed an analysis of data and trend projections, which lead to the plan for us in the future and the future operating environment.)

Digression: a typical strategy
(Diagram: where we are, where we will be, how we are going to get there; the world, the current operating environment and us now feed an analysis of data; possible futures include a preferred future, less preferred futures and a future to be avoided, each with an adaptation or avoidance/mitigation plan.)

A more resilient strategy
- D1 = D1 + the wider world
- D2 = D2 + possible alternatives
- D3 = D3 + risk mitigation

Logistics: timeline (Nov to Feb)
- Draft D1, then a draft document (D1+D2), then the final document (D1+D2+D3)
- Early Dec (~7th): possible f2f at CERN (for those who can make it, the rest on EVO); overlap with the Data TEG
- 14th Dec: GDB
- 11th Jan: GDB
- Mid Jan (~23rd): possible f2f for everyone, co-located with the Data TEG, possibly in Amsterdam
- 7th Feb: TEG reports at the pre-GDB

Logistics
- EVO meeting every other week, at this time? Next one: Thu 24 Nov (but it is Thanksgiving); finalize topics and start towards D1
- Chair alternates between Wahid and Daniele, with the other taking minutes
- F2f meetings: two, one in early Dec (7th) and one in January (co-located with the Data TEG), or just one?
- Sub-groups? Let's see what topics we have

Other meetings going on
- Federated Storage meeting in Lyon
- HEPiX Storage Working Group (regular)
- ROOT I/O workshop in Jan or Feb
- Experiment software weeks
(We should plug into at least these; any more?)

Membership
See https://twiki.cern.ch/twiki/bin/view/LCG/WLCGTEGStorage#People
- We may match people to topics once we have topics
- We have also asked many of you for your areas of expertise, which we will make available in short form
- Before discussing topics: a quick round table to introduce ourselves and express any concerns with the mandate or logistics

Topics: homework
- Question 1: In your view, what are the 3 main current issues in Storage Management (SM)?
- Question 2: What is the greatest future challenge which would greatly impact the SM sector?
- Question 3: What is your site/experiment/middleware currently working on in SM?
- Question 4: What are the big developments that you would like to see from your site/experiment/storage system in the next 5 years?
- Question 5: In your experience and area of competence, what are the (up to) 3 main successes in SM so far?
- Question 6: In your experience and area of competence, what are the (up to) 3 main failures, or things you would like to see changed, in SM so far?

Feedback received: Q3/Q4 were not clear, but I think we are getting decent answers.
The aim was to see how joined up the sites, experiments and developers are in what they are doing and planning. Responses map onto the topics, the current situation and plans, and onto strengths, weaknesses, opportunities and threats.

THANKS to those who have done this: lots of really nice content that I have not done justice to in my snippets. Please could EVERYONE fill it in; we need this to establish where we are and to capture all perspectives. A lot of material came in today, so we will discuss it in detail next time with more responses. I summarize some responses at the end (including some from me); this is a limited attempt to cluster and remove duplications, and it misses some of the detail in the twiki. I strongly encourage you to read the originals and start discussion on the list between meetings. For now, the focus is on topics.

Topics (the starting point as given to us, with emerging priority stars and new items in blue)
- Experiment I/O and other usage patterns (and so performance requirements for storage) *
- I/O and scalability limits **
- Our requirements on future storage systems, and how storage will evolve independently of us *
- Separation of archives and disk pools/caches
- Storage system interfaces to the grid (the future of SRM?) **
- Interoperation **
- Filesystems/protocols (standards?) **
- Security/access controls
- End-user experience
- Site-run services: storage management interfaces, performance measurements, monitoring, manageability **
- Roadmaps/communication

Summary of twiki responses so far (very speedily done, so just a start), cut in a slightly different way from the questions: current situation (experiments/sites/middleware), strengths, weaknesses, future desires, opportunities, threats.

Strengths
- It works and people are using it. Life sciences and climate science complain that they cannot cope with their data volumes, yet HEP has "solved" the problem. We should aim to quantify this: how much we are storing and accessing, and the performance.
- SRM could be considered a success, but it is far from perfect.
- Standardization: with the same client (gfal, lcg_util) you can deal with files hosted on storage systems of different types (a minimal sketch follows this list).
- Tape backend reliability in CASTOR, dCache and StoRM.
- Under stable conditions (stable access patterns etc.) the SEs run stably and perform well.
- EMI middleware: it may not be perfect, but it is sufficiently modular to allow mixing and matching components; it has served us fairly well over the past years and is being used by other projects.
- Building a community of people who deploy and support the infrastructure with experience and skill.
- Fancy things are emerging: redirectors, reading over the WAN.
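To illustrate the point about standardized clients, here is a minimal, hedged sketch using the gfal2 Python bindings (the successor to gfal/lcg_util). The storage endpoints and paths are invented for illustration, a valid grid proxy is assumed, and the exact binding names should be checked against the gfal2 release installed at a site.

    # Hedged sketch: the same client calls work against different storage types.
    # Hostnames and paths below are placeholders, not real services.
    import gfal2

    ctx = gfal2.creat_context()   # one context, protocol chosen from the URL scheme

    replicas = [
        "srm://se.example-t1.org/castor/example.org/vo/data/file.root",
        "root://xrootd.example-t2.org//store/vo/data/file.root",
        "davs://dav.example-t2.org/dpm/example.org/home/vo/data/file.root",
    ]

    for url in replicas:
        try:
            st = ctx.stat(url)                 # namespace operation, protocol-independent
            print(url, st.st_size, "bytes")
        except gfal2.GError as exc:            # gfal2 signals failures via GError
            print(url, "unavailable:", exc)

    # A copy uses the same API regardless of the storage type behind the URL:
    ctx.filecopy("srm://se.example-t1.org/castor/example.org/vo/data/file.root",
                 "file:///tmp/file.root")

The point is simply that stat and copy calls look identical whether the replica sits behind SRM, xrootd or HTTP/WebDAV; the protocol plugin is selected from the URL scheme.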
Weaknesses
- SRM: specifically because, for most storage systems, it was implemented as an add-on to an existing system without considering changing the way that system works.
- SRM and FTS latencies.
- Mismatch between what SRM provides and what the experiments need.
- Inconsistencies between SRM and the underlying data.
- Protocol performance: lack of scalability of data access by protocol, in particular for tape disk caches.
- Protocol scalability: the best performance often comes from copying to the WN.
- Different usage: the different computing and data management models of the experiments require SEs with different setups (layouts) of servers, data pools etc.
- The halt of storage-management activities within WLCG, and the hiatus on cross-experiment, cross-technology storage-management development, led experiments to move into new areas on their own initiative, investigating and creating ad hoc solutions.
- Which actor is responsible for which functionality: micromanaging storage.

Weaknesses: maintainability and complexity
- Complexity: this can lead to longer than desired downtimes while problems are being investigated.
- Complexity of the storage model and systems: the current systems look over-complicated for what they do, and the clients also look over-complicated, with ugly APIs.
- Wide range of local access protocols.
- Excessive complexity of the architectures; sparse systems that expose too much of their inner complexity.
- Handling upgrades: draining older disk servers onto newer generations is quite a lengthy process, and similarly for migration between tape media.
- Site storage is a major cause of job failures; stability of the SEs; lack of robustness.
- The deletion procedure at the Tier-2 level.
- The abundance of workarounds implemented in the applications makes evolution difficult (one also has to keep supporting the workarounds).

Future plans and desires
- A solution for the typical incident in which a disk server or a tape is temporarily (or permanently) off-line.
- Standardization of the data access protocols.
- A performant and stable system that allows the creation of loosely coupled federations, beyond the LFC.
- A more robust SRM service with easy possibilities for fail-over and load balancing.
- Flexibility to move space tokens around.
- Establishing a group that takes a long-term interest in storage management.
- Switching to WebDAV for basic namespace operations (a hedged sketch of what this could look like follows the Threats and Challenges list below).
- Being able to deploy a much more fault-tolerant, fully POSIX file system, at least at the LAN level; it would also be of great interest to have it among federated sites at the regional level.
- Reducing the complexity of the systems and making them more operable within the limited manpower budget.
- More flexible and dynamic data placement and deletion, driven by the "request of performance".

Opportunities
- Adopt standards to reduce the workload, so delivering what people need faster.
- More sophisticated networking models within storage implementations: better interaction between storage and networking, so that a storage system can deliver a file with a minimal number of hops and can allocate and configure network paths (c.f. lambda projects).
- New storage standards in industry (NFS 4.1, WebDAV etc.).
- Big data, Hadoop etc.
- Cheaper SSDs.
- Being able to transparently use storage centres acquired from third parties, including cloud storage.

Threats and challenges
- Spinning magnetic disks will become extinct within some years (5-10); SSDs or something similar will take over. This will probably solve some performance problems, but how is the price per TB evolving? And the reliability?
- Many-core and multi-core I/O challenges.
- Long-term archival, and migration of the steadily increasing amount of data from old to new hardware and newer media, is challenging. Will technology evolve fast enough to keep the necessary effort constant?
- The variety of issues caused by increasing disk sizes; scalability.
- Sites supporting multiple VOs have to support different technologies (SRM and non-SRM). Will it be possible to support idiosyncratic solutions in the future as well?
- The current development of federated data stores is bypassing SRM. If that works well, is there a need for SRM at all?
- SM will have to adapt to the evolving storage technologies, i.e. SSD/HDD/tape; the current issues surrounding disk versus tape will likely resurface as SSD versus magnetic hard disk.
- Ad hoc solutions: the challenge of delivering solutions that are applicable to multiple experiments.
- Being tied into particular software.
- Sustainability of development and operations.
- Changing future workflows and requirements, and increased data access demands.
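As a concrete illustration of the "switch to WebDAV for basic namespace operations" desire above, the following hedged sketch shows how listing, directory creation, rename and deletion map onto standard WebDAV verbs. The endpoint, paths and proxy location are placeholders, not a real service; a real deployment would point at a site's DPM or dCache WebDAV door with X.509 proxy authentication.

    # Hedged sketch: namespace operations over WebDAV using plain HTTP verbs.
    import requests

    BASE = "https://dav.example-site.org/dpm/example.org/home/vo"
    proxy_cert = "/tmp/x509up_u1000"     # X.509 proxy file, used as both cert and key

    session = requests.Session()
    session.cert = proxy_cert
    session.verify = "/etc/grid-security/certificates"   # CA directory

    # Directory listing: WebDAV PROPFIND with Depth: 1
    resp = session.request("PROPFIND", BASE + "/data", headers={"Depth": "1"})
    print(resp.status_code)              # 207 Multi-Status on success
    print(resp.text[:500])               # XML body describing the directory entries

    # Create a directory: MKCOL
    session.request("MKCOL", BASE + "/data/new_dir")

    # Rename (namespace move): MOVE with a Destination header
    session.request("MOVE", BASE + "/data/old_name.root",
                    headers={"Destination": BASE + "/data/new_name.root"})

    # Delete
    session.delete(BASE + "/data/new_name.root")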
Current situation: experiments and sites
- LHCb: using CASTOR, dCache and StoRM; DPM is in the near future. Currently relies on SRM.
- KIT is running dCache, and xrootd (for ALICE only).
- The RAL Tier-1 is currently using CASTOR, and will soon upgrade.
- INFN-Bari is using Lustre + StoRM, also supporting xrootd for remote and interactive access.
- We should aim to quantify how much we are storing and accessing, and the performance.

Current situation: developers
- DPM/LFC maintenance and support.
- NFS 4.1 and HTTP/WebDAV for HEP data access.
- Performance improvements, starting from the current status of the lcgdm (DPM + LFC) components.
- Scalable SRM: providing multiple SRM front-ends that clients can load-balance over, removing a single point of failure (a hedged client-side sketch follows this list).
- Adopting standards and driving their adoption elsewhere: NFS v4.1/pNFS and WebDAV.
- Investigating cloud storage management APIs: Amazon S3 vs CDMI vs other proprietary standards. What benefits do they bring?
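To make the "multiple SRM front-ends that clients can load-balance over" idea more concrete, here is one hedged, purely client-side sketch: resolve a DNS alias to the set of front-end hosts and try them in shuffled order until one accepts a connection. The alias name, port and the simple TCP-connect health check are illustrative assumptions, not how any particular SRM implementation does it; a real deployment might rely on a DNS round-robin alias alone or on a dedicated load balancer.

    # Hedged sketch of client-side fail-over across several SRM front-ends.
    # "srm-alias.example-site.org" and port 8446 are placeholders.
    import random
    import socket

    ALIAS = "srm-alias.example-site.org"
    PORT = 8446

    def frontend_hosts(alias):
        """Return the distinct IP addresses behind a DNS alias."""
        infos = socket.getaddrinfo(alias, PORT, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def pick_working_frontend(alias, timeout=3.0):
        """Try front-ends in random order and return the first that accepts a
        TCP connection (a stand-in for a real SRM ping); raise if none do."""
        hosts = frontend_hosts(alias)
        random.shuffle(hosts)                  # cheap client-side load spreading
        for host in hosts:
            try:
                with socket.create_connection((host, PORT), timeout=timeout):
                    return host
            except OSError:
                continue                       # dead front-end: try the next one
        raise RuntimeError("no SRM front-end reachable behind %s" % alias)

    if __name__ == "__main__":
        print("using front-end:", pick_working_frontend(ALIAS))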