Data storage in Cloud computing

  1. Overview of Cloud Computing and Workflow Research in NGSP Group. Dr. Dong YUAN, Research Fellow, Swinburne University of Technology, Melbourne, Australia
  2. Outline > SUCCESS Centre and NGSP Group > Background: Big Data, Cloud Computing and Workflow > Research Topics: Data Management in Cloud Computing; Performance Management in Scientific Workflows; Security and Privacy Protection in the Cloud; SwinDeW-C Cloud Workflow System
  3. The Centre of SUCCESS > SUCCESS: Swinburne University Centre for Computing and Engineering Software Systems. SUCCESS is the No. 1 Software Engineering Centre in Australia. SUCCESS is one of the 7 Tier 1 Centres at Swinburne University of Technology (Times World Ranking: 351-400; Academic Ranking of World Universities: 301-400). > The ambition of the Centre is to become the top centre for software research in the Southern Hemisphere within the next five years.
  4. SUCCESS > Research Focus Areas: Knowledge and Data Intensive Systems; Nature of Software; Next Generation Software Platforms; SE Education and IBL/RBL; Software Analysis and Testing; Software R&D Group > http://www.swinburne.edu.au/ict/success/research-expertise/
  5. NGSP (Small) Group Overview > We conduct research into cloud computing and workflow technologies for complex software systems and services. > Members: Leader: Prof Yun Yang (PC Member for ICSE 07/08, FSE09, ICSE 10/11/12). Researchers: Dr Xiao Liu (Postdoc, China), Dr Dong Yuan (Postdoc), Gaofeng Zhang, Wenhao Li, Dahai Cao, Jofry Hadi SUTANTO, Antonio Giardina. Others: Prof John Grundy, Prof Chengfei Liu. > Visitors: Prof Lee Osterweil, Prof Lori Clarke, Prof Ivan Stojmenovic, Prof Paola Inverardi, Prof Amit Sheth, Prof Wil van der Aalst, Prof Hai Jin, Prof Hai Zhuge
  6. R&D Projects: Grants > Primary projects: (Cloud) workflow technology: Scheduling and temporal analysis in cloud workflows, ARC LP0990393 (Y Yang, R Kotagiri, J Chen, C Liu). Cloud computing: Intermediate data management in cloud computing, ARC DP110101340 (Y Yang, J Chen, J Grundy). > Secondary project: Management control systems for effective information sharing and security in government organisations, ARC LP110100228 (S Cugenasen, Y Yang).
  7. R&D Projects: Overview > SwinDeW workflow family including SwinDeW-C: Architectures / Models (D Cao); Scheduling / Data and service management (D Yuan, X Liu); Verification / Exception handling (X Liu) > Cloud computing: Data management (D Yuan, X Liu, W Li); Privacy and Security (G Zhang, X Zhang, C Liu)
  8. Some Recent ERA A* Ranked Publications > J. Chen and Y. Yang, Temporal Dependency based Checkpoint Selection for Dynamic Verification of Temporal Constraints in Scientific Workflow Systems. ACM Transactions on Software Engineering and Methodology, 20(3), 2011. > X. Liu, Y. Yang, Y. Jiang and J. Chen, Preventing Temporal Violations in Scientific Workflows: Where and How. IEEE Transactions on Software Engineering, 37(6):805-825, Nov./Dec. 2011. > D. Yuan, Y. Yang, X. Liu and J. Chen, On demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 71(2):316-332, 2011. > J. Chen and Y. Yang, Localising Temporal Constraints in Scientific Workflows. Journal of Computer and System Sciences, Elsevier, 76(6):464-474, Sept. 2010. > G. Zhang, Y. Yang and J. Chen, A Historical Probability based Noise Generation Strategy for Privacy Protection in Cloud Computing. Journal of Computer and System Sciences, Elsevier, published online, Dec. 2011. > Another 8 A* papers are currently under review.
  9. Part 1: Outline > SUCCESS Centre and NGSP Group > Background: Big Data, Cloud Computing and Workflow > Research Topics: Data Management in Cloud Computing; Performance Management in Scientific Workflows; Security and Privacy Protection in the Cloud; SwinDeW-C Cloud Workflow System
  10. Big Data > Data explosion: TB (10^12), PB (10^15), exabyte (EB, 10^18), zettabyte (ZB, 10^21), yottabyte (YB, 10^24). The total amount of global data in 2010: 1.2 ZB. Data processed by Google every day in 2009: 24 PB. Every day: Facebook 10T, Twitter 7T, YouTube 4.5T. > Moore's law vs. data explosion speed: Application data double every year over the next decade and further [Szalay et al., Nature, 2006]. > Buzzwords: data storage, data processing, parallel, distributed, virtualisation, commodity machines, energy consumption, data centres, utility computing, software (everything) as a service
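The gap between yearly data doubling and hardware growth can be made concrete with a little arithmetic. The sketch below is only an illustration: the one-year doubling of application data comes from the slide, while the two-year doubling period used for Moore's law is an assumption (a commonly quoted figure, not stated in the slides), and `growth_factor` is a hypothetical helper name.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


DECADE = 10

# From the slide: application data double every year.
data_growth = growth_factor(DECADE, 1.0)

# ASSUMPTION: hardware capacity doubles every two years (Moore's law,
# as commonly quoted); the slides do not give a period.
hardware_growth = growth_factor(DECADE, 2.0)

# How far data outgrows hardware over the decade under these assumptions.
gap = data_growth / hardware_growth

print(f"data x{data_growth:.0f}, hardware x{hardware_growth:.0f}, gap x{gap:.0f}")
```

Under these assumptions data grows about a thousandfold in ten years while hardware grows about thirtyfold, which is the mismatch the slide points at.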
  11. Example: Pulsar Searching > Astrophysics: pulsar searching > Pulsars: the collapsed cores of stars that were once more massive than 6-10 times the mass of the Sun > http://astronomy.swin.edu.au/cosmos/P/Pulsar > Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/) > Swinburne Astrophysics group (http://astronomy.swinburne.edu.au/) has been conducting pulsar searching surveys (http://astronomy.swin.edu.au/pulsar/) based on the observation data from Parkes Radio Telescope. > A typical scientific workflow, which involves a large number of data- and computation-intensive activities. For a single searching process, the average data volume (not including the raw stream data from the telescope) is over 4 terabytes and the average execution time is about 23 hours on the Swinburne high performance supercomputing facility (http://astronomy.swinburne.edu.au/supercomputing/). > Figure, left: Image of the Crab Nebula taken with the Palomar telescope. Right: A close-up of the Crab Pulsar from the Hubble Space Telescope. Credit: Jeff Hester and Paul Scowen (Arizona State University) and NASA
  12. Pulsar Searching Workflow. Dr. Willem van Straten
  13. Benefits of Clouds > No upfront infrastructure investment: no procuring hardware, setup, hosting, power, etc. > On-demand access: lease what you need, when you need it > Efficient resource allocation: globally shared infrastructure > Nice pricing: based on usage, QoS, supply and demand, loyalty > Application acceleration: parallelism for large-scale data analysis > Highly available, scalable, and energy efficient > Supports creation of 3rd-party services and seamless offerings: builds on infrastructure and follows a similar business model to the Cloud
  14. SwinDeW Workflow Series > SwinDeW: Swinburne Decentralised Workflow, foundation prototype based on p2p > SwinDeW (past) > SwinDeW-S (for Services, past) > SwinDeW-B (for BPEL4WS, past) > SwinDeW-G (for Grid, past) > SwinDeW-A (for Agents, past) > SwinDeW-V (for Verification, current) > SwinDeW-C (for Cloud, current)
  15. Part 1: Outline > SUCCESS Centre and NGSP Group > Background: Big Data, Cloud Computing and Workflow > Research Topics: Data Management in Cloud Computing; Performance Management in Scientific Workflows; Security and Privacy Protection in the Cloud; SwinDeW-C Cloud Workflow System
  16. Research Topics: Data Management in Cloud Computing. Dr. Dong Yuan, http://www.ict.swin.edu.au/personal/dyuan/
  17. Data Management in Cloud Computing > Scientific applications in cloud computing: computation and data intensive applications; excessive computation and storage resources; pay-as-you-go model > Three aspects of data management in the cloud: data storage; data placement; data replication
  18. Data Storage > Developing smart data storage strategies for reducing the cost of storing big data in the cloud: data regeneration (computation and storage trade-off); data de-duplication; data compression > Researcher: Dong Yuan
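The computation and storage trade-off behind data regeneration can be sketched as a per-dataset cost comparison: keep an intermediate dataset only if storing it is cheaper than regenerating it each time it is needed. This is a deliberately simplified illustration, not the strategy from the publications: it ignores dependencies between datasets, and the `Dataset` fields, the function names, and both prices are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float          # size of the intermediate dataset
    regen_cpu_hours: float  # compute time needed to regenerate it once
    uses_per_month: float   # expected number of times it is needed per month


# ASSUMPTION: illustrative prices, not taken from the slides or any provider.
STORAGE_PRICE_PER_GB_MONTH = 0.10
CPU_PRICE_PER_HOUR = 0.10


def monthly_cost_if_stored(d: Dataset) -> float:
    """Cost of keeping the dataset in cloud storage for one month."""
    return d.size_gb * STORAGE_PRICE_PER_GB_MONTH


def monthly_cost_if_deleted(d: Dataset) -> float:
    """Cost of regenerating the dataset every time it is used in a month."""
    return d.regen_cpu_hours * CPU_PRICE_PER_HOUR * d.uses_per_month


def should_store(d: Dataset) -> bool:
    """Store only when storage is no more expensive than regeneration."""
    return monthly_cost_if_stored(d) <= monthly_cost_if_deleted(d)


large_rarely_used = Dataset("large_rarely_used", size_gb=500,
                            regen_cpu_hours=2, uses_per_month=1)
small_often_used = Dataset("small_often_used", size_gb=10,
                           regen_cpu_hours=100, uses_per_month=4)

for d in (large_rarely_used, small_often_used):
    decision = "store" if should_store(d) else "delete and regenerate on demand"
    print(f"{d.name}: {decision}")
```

With these made-up numbers the large, cheap-to-recompute dataset is deleted and the small, expensive-to-recompute one is kept. A realistic strategy would also have to account for deleted predecessors that must be regenerated first.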
  19. Publications > D. Yuan, Y. Yang, X. Liu, J. Chen, On demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems, Journal of Parallel and Distributed Computing, Elsevier, vol. 71(2), pp. 316-332, 2011. > D. Yuan, Y. Yang, X. Liu, G. Zhang, J. Chen, A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems, Concurrency and Computation: Practice and Experience, Wiley, 24(9), pp. 956-976, Jun. 2012. > D. Yuan, Y. Yang, X. Liu, J. Chen, A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems, Proc. of 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS10), Atlanta, USA, Apr. 2010. > D. Yuan, Y. Yang, X. Liu and J. Chen, A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud, Proc. of 4th IEEE International Conference on Cloud Computing (Cloud2011), Washington DC, USA, July 4-9, 2011.
  20. Data Placement > Smart data placement strategies to reduce application cost: data correlation based strategy to reduce bandwidth cost; data usage based strategy to reduce storage cost > Researchers: Dong Yuan, Jofry Hadi SUTANTO, Antonio Giardina
  21. Publications > D. Yuan, Y. Yang, X. Liu, J. Chen, A Data Placement Strategy in Scientific Cloud Workflows, Future Generation Computer Systems, Elsevier, vol. 26(8), pp. 1200-1214, 2010.
  22. Data Replication > To cost-effectively assure data reliability in the cloud: dynamic replication strategy; replication strategy based on proactive replica checking > Researchers: Wenhao Li, Dong Yuan
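The cost side of replication comes from the number of copies kept, so a natural question is how few replicas meet a given reliability target. The sketch below is a textbook simplification and not the strategy from the publications: it assumes replicas fail independently, that each survives the period of interest with the same probability, and that data is lost only when every replica fails. The function names and the example probabilities are hypothetical.

```python
def reliability(p_survive: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies
    survives, when each survives with probability `p_survive`."""
    return 1.0 - (1.0 - p_survive) ** replicas


def min_replicas(p_survive: float, target: float, max_replicas: int = 10) -> int:
    """Smallest replica count whose reliability reaches `target`.

    ASSUMPTION: independent, identically reliable replicas.
    """
    if not (0.0 < p_survive < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p_survive and target must lie strictly between 0 and 1")
    for n in range(1, max_replicas + 1):
        if reliability(p_survive, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


# Illustrative numbers only.
print(min_replicas(p_survive=0.9, target=0.995))
print(min_replicas(p_survive=0.95, target=0.9999))
```

Keeping only the minimum number of replicas that satisfies the target, and revisiting that number as the data ages, is one way to read "cost-effective" here; the published strategies may differ in how they model failures.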
  23. Publications > W. Li, Y. Yang and D. Yuan, A Novel Cost-effective Dynamic Data Replication Strategy for Reliability in Cloud Data Centres. Proc. of International Conference on Cloud and Green Computing (CGC2011), pages 496-502, Sydney, Australia, Dec. 2011. > W. Li, Y. Yang, J. Chen and D. Yuan, A Cost-Effective Mechanism for Cloud Data Reliability Management based on Proactive Replica Checking. Proc. of 12th IEEE/ACM International Symposium on Cluster, Cloud and Gr