RDSI CREATING AUSTRALIA’S RESEARCH DATA CLOUD

RDSI Project History

Table of Contents

Chapter I: In the beginning
Chapter II: Consulting the community
Chapter III: The data collections
Chapter IV: An integrated infrastructure
Chapter V: Moving data
Chapter VI: Accessing data
Chapter VII: Protecting data
Chapter VIII: Working with data
Chapter IX: Selecting solutions
Chapter X: Case studies
Chapter XI: Looking back and looking ahead
Chapter XII: The team

In the beginning…

Chapter I

…there was nothing

At the very beginning of this project there was nothing in the sector for collaborative storage. Really nothing. So it started from scratch, to persuade people that it was a good idea to store data, except not just in your university or in your research group.

– Dr Nick Tate, RDSI Project Director

No discoveries without data

RDSI is, in my mind, entirely about the delivery of data and the amassing of the data. It is my belief that without bringing collections together, without making them accessible, you can’t make new discoveries.

‒ Prof Nathan Bindoff, Director, TPAC and Professor of Physical Oceanography

How it started

There is growing recognition that new ways to conduct research have emerged and are being validated across most research disciplines. Adding to traditional forms of research that rely on experiment, theory and testing hypotheses using data, it is now evident that researchers also:

» collect increasingly larger sets of data as a primary form of research; and

» use modelling tools to assist them in deriving patterns, perceptions and trends that can form the basis for establishing and confirming hypotheses.

Information and communications technology (ICT) is the cornerstone to such new approaches, providing the means not only for increasingly powerful computer-enabled simulation and modelling, but also the very avenue to manage and integrate the increasing volume and complexity of datasets and collections. Hence, ICT is not only a resource to administer and manage research but also to drive and innovate the ways in which research is conducted.

– Strategic Roadmap for Australian Research Infrastructure, 2008, p. 19

In 2008, the Government’s Strategic Roadmap for Australian Research Infrastructure highlighted the need to manage and integrate the increasing volume and complexity of research datasets and collections.

Research storage before RDSI

Part of my role here at QCIF is to help research groups with their storage needs. In the days before RDSI it caught me by a huge surprise that even at a major university it was quite hard for a research group to get the right kind of storage, to have some better way of collaborating than sending a spreadsheet across the globe.

I remember the first meeting I had with one professor at UQ who had come from Oxford. He said, ‘Look, I’m a centre director, I’m a professor, and I have spent six months just trying to get a little bit of storage. I’m worried I should be doing something else.’

‒ Graham Chen, eResearch Manager, QCIF

Research storage before RDSI

A retiring academic in Marine Science came to us at the University of Sydney library. He had spent his whole life collecting data from the beaches of Australia. He asked us, ‘What do I do with all of this? It’s in various formats, it’s valuable, but I don’t know what to do with it.’ We approached the university, but at that stage the view was that when research results were published, the research was finished. Why would they want to store the data and why would they want to share it?

‒ John Shipp, RDSI Project Board

Research storage before RDSI

We asked another researcher what she was doing to curate her data. She said, ‘Every week I download it onto CD. I make three copies: one for home, one for the office, and one I send to my mother in Perth.’

Even today I have a lingering fantasy that somewhere in Perth there is a little old lady with these CDs stacked up around her and one day they will cave in on top of her and she will be crushed by her daughter’s research. This is just imagination, but it’s an indication of what people were doing because they didn’t have facilities to curate their data properly.

‒ John Shipp, RDSI Project Board

2010: RDSI begins

Dates: 2010-2014

Funding: $50m

Program: Education Investment Fund (EIF) under the Super Science Initiative

Lead agent: The University of Queensland

Chair: Prof Max Lu, Deputy Vice-Chancellor (Research)

Project Director: Dr Nick Tate

The aim of the RDSI Project is that researchers will be able to use and manipulate significant collections of data that were previously either unavailable or difficult to access, and that there will be a consistent means of accessing this data.

The Project will be realised through the creation and development of data storage infrastructure accessed through a common infrastructure layer and provided by agencies within the sector, or commercial providers, or both.

Where should we put the data?

The question was, where should we put the data? This was stopping progress across research. People couldn’t keep storing it on their desktops, and there were no onshore cloud services at the time. That left us with two options: either one big data centre or a small number of sites around Australia. The problem with one site is then you have to put everything else around it, and you lose innovation. Several is better.

‒ Dr Rhys Francis, RDSI Board

Consulting the community

Chapter II

Opening a Dialogue

Timeline, June 2011 to May 2012:

ReDS Strawman Consultation Series: 3 June – 1 July
DaSh Strawman Consultation Series: 17 June
Vendor Briefing: 1 Aug
ReDS Tinman Consultation: 28 Sep
DaSh Tinman Consultation: 11 Nov
ReDS Consultation Series: 9 Feb – 2 Mar

Consulting the community

Because the Education Investment Fund has a restriction that you can’t use the funding for operation, we needed operational partners who could provide the operational working funds. We didn’t know if anybody would be interested in putting their hand up to do that. We were therefore consulting to see who would be interested, under what circumstances they might be interested, and what would be possible for them to achieve that would meet the project objectives and the needs of researchers. By doing that we were able to put together a plan.

‒ Dr Nick Tate, RDSI Project Director

From Straw Man to Tin Man

It was a large community engagement exercise. For each program, we would find out who might be interested, get them together for a workshop, bounce around ideas, and then consolidate the thinking into a document. We had a lot of good feedback—in some cases probably 100 individual contributions to the Straw Man documents. We would then develop a Tin Man document for each of the programs and work through those, to ultimately lead into the business plan for the project. It matured the thinking of the project quite quickly, and it didn’t feel like an invention being planted on the community from outside. This was something the community built itself.

‒ Dr Markus Buchhorn, Research Data Manager, RDSI Project

Node Development

The NoDe Development programme was designed to identify, strengthen and develop research data centres so that they were able to hold and process high data volumes. These data centres became Nodes of the RDSI project, and their operators were provided with establishment funding from this programme.

The Node Development programme funded the development of eight high capacity Nodes: six Primary Nodes located in Brisbane, Sydney, Canberra, Melbourne, Adelaide and Perth, with two additional Nodes in Townsville and Hobart.

Identifying the Nodes

RDSI went through a process of calling for proposals. We were all looking at what our individual contribution to the national infrastructure would be, and I remember in Queensland we focused on life sciences and eco sciences as being the main areas where we thought that there was a really significant body of research expertise.

‒ Rob Cook, CEO, QCIF

Responding to the call for Nodes

We put a proposal to our members that Intersect propose to become a Node of RDSI. Intersect has twelve university members, quite a lot, and the thing that is interesting is that all twelve quickly said, ‘Yes, this is a really good idea.’ So we responded to the call for proposals and we were one of the Nodes selected from the country.

‒ Dr Ian Gibson, CEO, Intersect Australia

RDSI funded Nodes

[Figure: timeline for January to December 2012, marking eight dates: 22/03/2012, 5/04/2012, 21/05/2012, 25/06/2012, 5/07/2012, 10/07/2012, 4/09/2012 and 2/11/2012.]

The data collections

Chapter III

Research Data Services

The ReDS programme was designed to identify research data holdings of lasting value and importance and contribute funding to their development at the most appropriate Node. ReDS delivers storage services in support of significant data collections, research data sets and associated access tools which are aggregated into related holdings that add value to each other through co-location.

Selecting collections

The value of data is only realised when it’s used, so having data that is going to be reused was a critical part of whether it could be stored through the ReDS program. Every collection that has been given an allocation through the ReDS program meets certain criteria that indicate or demonstrate its national significance.

‒ Peter Hicks, ReDS Program Manager, RDSI Project

Merit criteria

Determining criteria by which research data might be assessed for merit was difficult across disciplines. You could assess a collection based on how many people it would be shared with. But what might be an appropriate audience size for sharing climate data would not be the same for sharing medical data or humanities data. It was very challenging to find merit criteria that would carry the same weight across all disciplines.

‒ Dr Frankie Stevens, Research Data Manager, RDSI Project

Data storage is a commitment

Many of the Nodes had experience with merit allocation processes used for high performance computing. But with HPC, the resources are used, and at the end of the cycle the whole process is repeated. When you try to think about that for data, for long-term storage, you’re not thinking about a resource that goes away in six months. You’re making a long-term commitment. And so the assessment of merit was a much bigger challenge.

‒ Dr Markus Buchhorn, Research Data Manager, RDSI Project

Identifying collections

The Intersect model is we have staff located on campus with our member universities, a team of roughly 15 people already talking to research groups, telling them about options and services available to them. So that team were now looking for collections that might benefit from being on the RDSI infrastructure.

‒ Dr Ian Gibson, CEO, Intersect Australia

Exposing science agency collections

At NCI we’ve worked very closely with the science agencies we are associated with—CSIRO, Geoscience Australia, and the Bureau of Meteorology—to expose the national and international collections that have been locked up inside those agencies. And the win-win is that the exposure to the national community is of value because these collections otherwise would not have been available, and the confluence of them enables transdisciplinary research. The advantage to the agencies is that the availability of the copy here puts that data in a rich computational environment, much richer than they have internally. That’s the nature of the win-win that’s been possible.

‒ Prof Lindsay Botten, Director, National Computational Infrastructure

The long tail of research data

Coordinated research groups like astronomers are able to collect their data and curate it, and eventually they will find the storage. But the humanities, medicine, the environmental sciences, they were less united in their purpose, and I always saw RDSI as a vehicle for bringing those people together and providing them the nascent infrastructure to store their data and make it available.

‒ John Shipp, RDSI Board

Data at RDSI Nodes

Last updated 14/12/2014

55 Petabytes of data available in over 70 Petabytes of storage

That these are huge numbers is beyond question; perhaps more astonishing is that this is an order of magnitude increase in just 4 years.

Even more encouraging is that this data is spread across every one of the 22 research disciplines. From Humanities to Radio Astronomy, no Field of Research has been left untouched.

Collections by Field of Research

Last updated 14/12/2014

About large allocations

The RDSI project allocated up to $9.4M to support large collections at Nodes. Nodes were given the opportunity to submit funding proposals for collections that were too large to be funded under the initial ReDS agreement. The RDSI Project Board evaluated the large collection proposals and decided to fund a total of five.

Large allocations

I’m really pleased about the large allocations. Having that strategic view at the research community level of how storage is used to aggregate data from particular domains in a way that enables advanced research, is a critical outcome of the ReDS programme.

‒ Peter Hicks, ReDS Programme Manager, RDSI Project

A national medical research data storage facility

Recognising that there was a potential need to support health and medical research data collections, we put in a proposal to establish a national medical data storage facility using funds from the ReDS special allocation process. We put that proposal to the medical research community and we were overwhelmed with responses. Forty-seven institutions around Australia nominated major collections they would like to store. So what we have now is an opportunity to build a national medical research data storage facility. This is something that is quite uncommon globally. It will allow researchers to get the serendipitous second use outcomes and impacts from that data.

‒ Dr Ian Gibson, CEO, Intersect Australia

The large allocations

Murchison Widefield Array Data Archive: The Murchison Widefield Array (MWA) project is funded for operations over a two-year period, which commenced in July 2013. These data sets, of international importance, will assist a global Astronomy and Astrophysics community to do research into the main science goals of the MWA, which include: exploration of the early Universe and the search for signals from the first stars and galaxies; exploration of the transient and dynamic Universe; studies of the Earth's ionosphere; and the study of astrophysics related to objects in our galaxy and in the distant galaxies.

Participating RDSI Node is iVEC.

National Environmental Research Data Centre: The National Environmental Research Data Collection (NERDC) comprises international and national reference collections spanning five fields. This multidisciplinary confluence of collections: (a) spans the lithosphere, crust, biosphere, hydrosphere, troposphere, and the stratosphere; (b) encapsulates the complex interactions within, and amongst, these layers, and (c) will enable new, transdisciplinary approaches to research.

Participating RDSI Node is NCI.

National Genomics Data Storage Facility: Genomics is a critical and complex science for the understanding of living forms and the way they are impacted by the environment, for improving medicine, and understanding food amongst many other uses. The facility will store and make available large volumes of genome data generated at the leading national centres, as well as essential national and international genome libraries.

Participating RDSI Nodes are Intersect, VicNode and QCIF.

Australian Coordinated Characterisation Data Space: The ACCDS underpins national-scale research programs, in particular two recently established characterisation-intensive ARC Centres of Excellence: the ARC Centre of Excellence in Advanced Molecular Imaging (CAMI), which will develop innovative imaging technologies to explore the immune system; and the ARC Centre of Excellence for Integrative Brain Function (CIBF), which is tackling the challenging problems involved in understanding how the human brain works.

Participating RDSI Nodes are VicNode, Intersect and QCIF.

Australian National Medical Research Data Storage Facility: The foundation data sets for this collection represent major national assets supporting research into Australia’s most significant diseases including heart disease, mental illness, the major cancers, as well as the increasing problems of lifestyle diseases such as diabetes and obesity. Importantly, children’s health and the health of our aging population are both well supported in the foundation data sets of the ANMRDSF.

Participating RDSI Nodes are Intersect, VicNode and QCIF.

An integrated infrastructure

Chapter IV

Data Sharing Programme

The DaSh programme was designed to develop the DaSh Collaboration Network (DaShNet) and the DaSh Technical Architecture. The integration of these two major parts of the DaSh programme became the DaSh Technical Framework, which describes the network, data movement capabilities, security and identity matters, data access, cloud gateway access, test platform for the programme’s components and workflow automation capabilities for the RDSI-funded Nodes.

Why is it hard?

Why is it hard, relative to other programs? It’s hard because data has a narrative that’s different for every data stream. Every data-oriented organisation thinks their data is special and different from everybody else’s. Really it should be about revealing data, gathering it together, and developing tools to express and analyse it.

‒ Prof Nathan Bindoff, TPAC Director and Professor of Physical Oceanography

Building technical skills across the sector

One of the things the project has been able to do is to fund technical and data specialists at the Nodes. Many of the techniques and skills for storing, moving, and accessing large sets of data were new to the project and to the Nodes. Being able to invest in building up a community of people in the sector with these skills and capabilities has been one of the important contributions of RDSI.

‒ Viviani Paz, RDSI Project Manager

Moving data

Chapter V

The need for a fast network

One of the significant technical challenges to consider was moving data. We experienced this early in the project as the ARCS Data Fabric drew to a close and people were trying to move the Integrated Marine Observing System data from Perth to Hobart and Brisbane. The data was going at the speed of congealed porridge running down a hill. It was clear that the network capabilities—not just fast networks, but the techniques, the tools, the software for using them—weren’t in place. So the Project has solved a very serious challenge. When AARNet put in the data transfer network, they are now certifying that they’re getting 95 percent use of the capacity of a 10-gigabit link. That is enormous compared with what was there. It’s not just the bandwidth that’s changed.

‒ Dr Nick Tate, RDSI Project Director
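
To put the figures in this quote in perspective, the back-of-the-envelope sketch below estimates how long a bulk transfer would take on a 10-gigabit link running at 95 percent of capacity. The 10-terabyte dataset size, the decimal units and the function name are illustrative assumptions, not project measurements, and the model ignores protocol and storage overheads.

```python
def transfer_time_seconds(size_bytes: float, link_bits_per_s: float, utilisation: float) -> float:
    """Ideal time to move size_bytes over a link used at the given fraction of its capacity."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    effective_bits_per_s = link_bits_per_s * utilisation
    return size_bytes * 8 / effective_bits_per_s


TEN_GIGABIT_LINK = 10e9   # bits per second (decimal gigabits)
DATASET_BYTES = 10e12     # hypothetical 10 TB collection (decimal terabytes)

hours = transfer_time_seconds(DATASET_BYTES, TEN_GIGABIT_LINK, 0.95) / 3600
print(f"{hours:.1f} hours")  # prints "2.3 hours"
```

Under these assumptions a collection of that size moves in a few hours, which is the scale of improvement the quote describes.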

The challenge in moving data

Without a doubt, the biggest challenge you have when moving data is that a lot of these datasets are built upon many files that are inherently small in nature, and they’re quite difficult to move over large distances. Even though you might have a lot of bandwidth available to you, you can find that 10 terabytes of data can take weeks if not months to transfer.

‒ Brendan Davey, Deputy Director, TPAC
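
One way to see why many small files are slow to move is a simple model in which every file pays a fixed cost (round trips, metadata operations) on top of the time needed to stream its bytes. The sketch below is illustrative only: it assumes files are sent one at a time, and the file counts, the 0.2-second per-file overhead and the 1-gigabit link speed are hypothetical values chosen for the example, not measurements from TPAC or the RDSI Project.

```python
SECONDS_PER_DAY = 86_400


def estimated_transfer_days(total_bytes: float, n_files: int,
                            per_file_overhead_s: float, link_bits_per_s: float) -> float:
    """Serial transfer time in days: streaming time plus a fixed overhead per file."""
    streaming_s = total_bytes * 8 / link_bits_per_s
    overhead_s = n_files * per_file_overhead_s
    return (streaming_s + overhead_s) / SECONDS_PER_DAY


TOTAL_BYTES = 10e12   # 10 TB in both scenarios
LINK = 1e9            # assumed 1 Gbit/s effective throughput
OVERHEAD = 0.2        # assumed seconds of fixed cost per file

# Same volume of data, packaged as ten million 1 MB files or as one hundred 100 GB files.
small_files_days = estimated_transfer_days(TOTAL_BYTES, 10_000_000, OVERHEAD, LINK)
large_files_days = estimated_transfer_days(TOTAL_BYTES, 100, OVERHEAD, LINK)

print(f"small files: {small_files_days:.1f} days")  # prints "small files: 24.1 days"
print(f"large files: {large_files_days:.1f} days")  # prints "large files: 0.9 days"
```

In this model the per-file cost, not the bandwidth, dominates the small-file case, which is consistent with transfers stretching into weeks even on a fast link.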

Network enhancements, AARNet and NRN

Our goal was to ensure we could get multiple 10 gigabits of capacity into the RDSI sites. We were looking to get every Node connected to the AARNet backbone with capacity significantly over and above what a big university would have, so they could provide services to a community. And to do that we needed to ensure there were redundant fibre paths, and the appropriate network infrastructure on those fibres to deliver that capacity.

‒ Peter Elford, Network Program Manager (2013), RDSI Project

Connecting Nodes to Nodes, and Nodes to Researchers

The Data Sharing Network (DaShNet) is a reliable high-speed network service built over the new AARNet4 backbone network. It connects RDSI-funded Nodes to each other and to researchers around Australia. It can ultimately support up to 100 gigabits per second, significantly increasing data transfer rates across the country.

Accessing data

Chapter VI

Making it easy to find and access data

Mediaflux will enable the research community to have useful data management tools that are consistent across Australia, so that people will be interacting with research data in the same way, irrespective of where they’re located.

‒ Dr Frankie Stevens, Research Data Manager, RDSI Project

Mediaflux and the Nodes

Identity and access management

In Australia we already had the Australian Access Federation, a trust fabric for identity management. But one of the things that became very plain early on is that although it handles web access for everyone, it can’t yet do access via other methodologies. This is where the project got into territory where we were cutting new ground.

‒ Richard Northam, Node Development Manager, RDSI Project

Protecting data

Chapter VII

Securing the data

One of the toughest parts of looking at the security models for this project was to understand how the Nodes would collaborate and share responsibility for the data and data transfers. How they would manage the relationships with researchers to give them a level of comfort that the integrity of their data is being maintained. When your research data has been under your direct control and then it goes outside your own perimeter, there’s a concern that you don’t know what’s happening with it. So for me, it’s been a management job of perception more than anything else.

‒ Mark McPherson, RDSI Security Policy Manager

Will my data be safe?

One of our biggest challenges in getting people on board is to assure them that their data is going to be safe, it’s going to be secure, that they’ll have access to it, and their partners who use the data will have access to it.

‒ Brendan Davey, Deputy Director, TPAC

Removing the obstacles

We were initially concerned about whether researchers would adopt a central data storage facility at all. We drew up a list of 10 obstacles, and you could be pretty sure that if you started talking to a researcher about putting their data on RDSI, they’d start going through these 10 obstacles one by one without you prompting them. ‘I can’t touch it anymore, it’s insecure, you’re only here until the end of 2014, I might have to pay for it,’ and so on. One by one we’ve been able to make these obstacles insignificant.

‒ Rob Cook, CEO, QCIF

Working with data

Chapter VIII

Connecting with compute facilities

The Raijin supercomputer at NCI

“Data is a vital enabler of research. Big data can only be handled in a rich computational environment. It’s not the data alone. It’s not the compute alone. It’s the confluence of the two. People advance their research by being able to have well-managed, integrated collections of data where they can explore new ideas by having a confluence of different datasets available to them.”

‒ Prof Lindsay Botten, Director, National Computational Infrastructure

Selecting solutions

Chapter IX

Vendor Panel

The Vendor Panel programme, implemented in partnership with the Council of Australian University Directors of Information Technology (CAUDIT), was created to facilitate the procurement of storage-related infrastructure, software and services. The programme had two purposes: first, to allow Nodes, universities and other authorised users to avoid lengthy tendering processes by using an appropriately constructed panel; and second, to support volume pricing across Nodes and the wider Higher Education and Research Sector.

An open mind towards solutions

In many of the research infrastructure projects we’ve seen before, there has been a focus on adopting only open source solutions or solutions developed by other researchers. We took the view that we should go with a completely open mind to the process. You can use commercial or open source or other not-for-profit infrastructure software. It doesn’t matter. The important thing is to pick the best that’s available at the time and to make sure it’s affordable, and that’s why we’ve been negotiating collectively. For example, the software we’re using in our data transfer network is a commercial product and by negotiating effectively, we’ve been able to acquire it at a price that allows the sector to do things which have not been possible in the past. With the Vendor Panel transferring to CAUDIT over the coming months, this is a legacy for the sector as a whole.

‒ Dr Nick Tate, RDSI Project Director

Vendors on the Panel

Twenty-two and counting

Last updated 14/12/2014

Saving Money for the Sector

CIOs have challenges in navigating procurement for IT services, testing the market, keeping up with suppliers. The Vendor Panel simplified storage options for the whole sector and was a catalyst for CIOs to open a dialogue with the research community. It took data storage down from being too complicated, too hard, to being a commodity product you can buy and use as you need. Probably we won’t realise the benefits for a few years, but it saved the sector an enormous amount of money.

‒ Peter Nikoletatos, RDSI Project Board

Towards public cloud

This is potentially the last generation of serious storage the Nodes will own. The project has looked extensively at how public cloud could complement in-house storage and compute, and we’ve established agreements with Amazon Web Services to eliminate costs in moving research data in and out of the public cloud, to enable the Nodes to make some informed decisions about that in the future.

‒ Paul Campbell, Vendor Engagement Specialist, RDSI Project

A data storage ecosystem

When it comes to large-scale infrastructure, organisations for the foreseeable future will use a hybrid solution. They will have some capability internally, some from private cloud providers such as Intersect, and some from public cloud infrastructure. We are part of an ecosystem that allows the data to flow around these different parties which collectively provide an ongoing solution to data storage.

‒ Dr Ian Gibson, CEO, Intersect Australia

Case studies

Chapter X

From July 2013, the RDSI Project began collecting use cases on how research groups across Australia are using collections stored at RDSI Nodes, and why RDSI-funded storage is important to their research. From high energy physics to the humanities, from climate to cancer research, researchers are discovering common needs around research data. They all need to preserve and store their data, access and share it with collaborators, bring disparate collections together to be analysed by common tools, and in many cases, reuse data that was collected by someone else or for a different purpose.

A major new human genome collection

The challenge:

The Garvan Institute of Medical Research is sequencing over 4000 healthy human genomes to create a Medical Genomics Reference Bank for researchers around the world.

How RDSI is helping:

The sequencing will generate 4.5 petabytes of data over the next 3 years. Storage through RDSI helps reduce costs to researchers and allows the data to be moved easily among Nodes for analysis.

The outcome:

Australian researchers are positioned to take a leading role in emerging genomics research through access to cost-effective genome sequencing and genomic collections of international importance.

Image courtesy of P. Morris, Garvan Institute

“We see this as providing Australia with a seat at the table and an opportunity to be amongst the world leaders in an area that’s emerging so rapidly.”

– A/Prof Marcel Dinger, Head of Clinical Genomics and Genome Informatics, Garvan Institute

Access to what was once inaccessible

The challenge:

Award-winning natural history cinematographer and marine scientist Richard Fitzpatrick has 20 years’ worth of film footage of the complex behaviours of ocean and terrestrial creatures.

How RDSI is helping:

Through RDSI storage, Richard is now able for the first time to make this collection of over 1 petabyte of data accessible and searchable by researchers everywhere.

The outcome:

The footage is being used by the Queensland Government to track turtle hatchling success rates at Raine Island and by JCU to study the ecology and biology of venomous box jellyfish.

“The fact that it’s now searchable is just huge. There are 5000 hours of footage, and now you can go in and chase stuff yourself. That in itself is monumental.”

– A/Prof Jamie Seymour, Director, Tropical Australian Venom Research Unit

Opening the door to use large datasets in a HPC environment

The challenge:

Geoscience Australia is bringing together 30 years of earth observation satellite images into a Data Cube that creates a geographic time machine, allowing scientists to apply the data to big problems such as managing flood and fire risk.

How RDSI is helping:

RDSI storage makes available data that was previously locked within agencies. RDSI storage within the NCI computational environment opens the door for using HPC to work with these large datasets.

The outcome:

The National Flood Risk Information Project (NFRIP) is using the Data Cube to create a portal showing areas of land where surface water has been observed from satellites in the past, to raise community awareness of flood risks.

“We now have hundreds of terabytes of satellite data covering all of Australia going back 30 years, and we wanted to begin applying it to big problems.”

‒ Dr Adam Lewis

National Earth & Marine Observations Group, Geoscience Australia

Preserving collections

The challenge:

Collections containing real examples of the use of speech, language, and music were stored in locations disparate from one another. Accessibility was difficult, and some collections were at risk of being lost.

How RDSI is helping:

A growing number of these collections have been brought together by the human communication science community, stored through RDSI, and are now accessible through the Alveo virtual laboratory, funded by NeCTAR.

The outcome:

These collections are being preserved and used in new ways by linguists, psychologists, musicologists, and computational scientists.

“A lot of the excitement is about being able to bring these collections out of the dusty closets and make them useful.”

‒ A/Prof Steve Cassidy

Computer Scientist, Macquarie University

End-to-end research data management

The challenge:

The Australian Synchrotron needed to protect, store, provide access to, and allow researchers to share, reuse, and validate data from Synchrotron beamline experiments.

How RDSI is helping:

Technical staff from the Monash eResearch Centre and VicNode worked with the Synchrotron to develop a solution, which uses RDSI storage to store and provide access to the data. It also uses the NeCTAR Research Cloud and DOIs from ANDS.

The outcome:

Store.Synchrotron is the first persistent, open data store in the world for a synchrotron. Thousands of datasets have been stored in the permanent, accessible archive.

“Ours is going to be the only system in the world where all of the primary data from the beamlines, every frame, will go into the store. And it will be there. This is an absolute world first.”

‒ Dr Tom Caradoc-Davies

Principal Scientist for Macromolecular Crystallography, Australian Synchrotron

Data reuse and new collaborations

The challenge:

Dr Jeremy VanDerWal at JCU creates models to study how climate change will affect bird and animal populations. The models were previously behind university firewalls. Providing access to others was difficult.

How RDSI is helping:

RDSI storage integrated with the NeCTAR Research Cloud allows these models to be available and easily run by other researchers. As a result, use of the models by other groups is growing rapidly.

The outcome:

A group of Natural Resource Management (NRM) regions has found the models so beneficial that they are funding Dr VanDerWal’s group to include freshwater species information.

“Previously my thinking was limited by the small amount of storage and computing that was available to me. I always had to summarise down and minimise the data. I don’t have to do that now. I don’t have to worry about the live disk limitation or the compute resources. Now I can keep doing the research as I’d like to see it done.”

‒ Dr Jeremy VanDerWal Centre for Tropical Biodiversity and Climate Change, James Cook University

Looking back and looking ahead

Chapter XI

Project Success

The RDSI Project set out to transform the way in which research data in Australia is stored and made available to its potential users. By any measure, the project has been successful in achieving this.

When the contract for the project was signed between the University of Queensland and the Commonwealth Government on Christmas Eve 2010, an estimated total of about 5 petabytes of research data was stored throughout the sector, much of it inaccessible to most researchers. By the end of the project in 2015, it is expected that over 55 petabytes of data will be available in over 70 petabytes of storage. Even more importantly, this data will be stored in facilities that are able to make it collaboratively available to researchers.

‒ Dr Nick Tate RDSI Project Director

Are researchers finding it valuable?

Researchers are voting with their data. They’re bringing it to the Nodes, they’re putting it on. It’s a great leap forward.

‒ Peter Elford Director, Government Relations, AARNet

Cultural change

In addition to putting the tin on the ground to store data, a key success of the RDSI Project has been to facilitate cultural change around collaboration and sharing.

‒ Brian Anker Chair, RDSI Project Board

Supporting research activities

Coming from an IT background, I learned so much from working on this project about supporting research activities in the next decade. It’s not just about compute and store, it’s about collaboration. It’s about connecting, about access, about identity. It’s about protecting the work. It’s about curation of data. It’s about making it available for groups, not just for now but in the future. It’s just so big, it takes a while to get your head around it.

‒ Peter Nikoletatos

RDSI Project Board

Fingers into the future

With a lot of projects there is a start, a finish, and you move onto the next thing. This thing really has fingers into the future in being able to act as a building block for other initiatives to build on.

‒ Brian Anker

Chair, RDSI Project Board

Data access

Our views of data have changed as RDSI has evolved. When the project started, everyone was talking about data storage. But as you start storing the data, you realise the real problem is access. In the early days, the access mechanisms available were extremely primitive. Now we have the data stored, we have the Mediaflux tool to make data easy to find and access, we have Aspera for moving large quantities of data. These weren’t even really imagined when we started the RDSI project. As we go forward, people will begin to realise that the real value that’s been delivered by the system is the organisation of the data into a way that people can find it, access it, and manage it. That’s a big change that RDSI has made.

‒ Rob Cook

CEO, QCIF

Thinking outside the box

A lot of people say to me, ‘This is great. I can now go to one location, I can have access to this dataset and to this other dataset right alongside, whereas before I’d have to go to multiple locations to get all the data I needed.’ And what I’m hearing in the wider research community is that people are starting to think outside the box. They can now suddenly combine two datasets from different disciplines together and potentially do new science. So it’s quite exciting.

‒ Brendan Davey Deputy Director, TPAC

Focus on research

Once you’ve solved the problem of knowing how to move data around and knowing where to put it, you can start to focus on other things. And that’s really where RDSI will continue to change research in Victoria. You won’t need to focus on where to put the data and whether or not it’s a good idea to put it there. You can move on. As an operator, the best thing is when you do get people to use these services and they never, ever call you again. Because it means the service is humming away.

‒ Dr Steven Manos

Manager Research Services in ITS, The University of Melbourne

The value of data

Researchers are beginning to realise the value of their data. A professor in pathology from The University of Melbourne was one of our first major consumers of data storage. We had gone in to speak to her. Their archiving solution was a set of hard drives on a shelf, and she wanted advice on a better solution. She said, ‘Well, I pay $70,000 a year in liquid nitrogen to preserve my physical tissue samples. Why wouldn’t I pay the equivalent to look after my digital assets?’ For her, the value proposition was obvious.

‒ Dr Steven Manos

Manager Research Services in ITS, The University of Melbourne

A million-fold increase in scale

In 1996 I became the Chairman of the World Ocean Circulation Experiment Data Products Committee, which was all about assembling the data from this billion dollar international project. I had 12 organisations working for me and in those 12, probably 35 effective full-time staff delivering up data on a regular basis. To give you a sense of the scale, the final product in 2002 from this billion dollar experiment fit onto a single DVD. It was just 4.7 gigabytes, but we won accolades because that dataset, huge at the time, was delivered across the Internet. Now in 2014, we have over 30 petabytes of data approved for ingestion into the RDSI Nodes across Australia. So that’s more than a million-fold increase in data, with fewer people involved, and with the additional challenge that the datasets come from a much more diverse research community.

‒ Prof Nathan Bindoff Director of TPAC and Professor of Physical Oceanography

Putting in a chair lift

Sharing research data for everyone’s benefit involves taking a risk. You have to climb a hill. With the RDSI investment, the government has put in a chair lift to help the research sector get up the hill to see the benefit on the other side.

‒ Prof Liz Sonenberg

RDSI Project Board

Moving out of the data ice age

An interesting lesson that wasn’t clear to me when we started is that we’re really at a very early stage of maturity in dealing with data. We didn’t realise that we were in the ice age; it was all frozen. You can get a real sense of excitement from recognising what people will be able to do with data when it becomes so easy to use. And it will, you know.

‒ Rob Cook

CEO, QCIF

Evolution

I think the RDSI Project has taken us through a significant learning curve. Data is a tricky thing. It’s so multi-dimensional, and it has so many owners. Working with the multiplicity of interests is quite challenging. And so the place we’ve ended up is by evolution. You would not have been able to write it down on a piece of paper on Day One.

‒ Prof Lindsay Botten Director, National Computational Infrastructure

The pathway is real

It’s important to understand that it’s not just about where we have ended up. It’s about the pathway. The pathway is real.

‒ Dr Rhys Francis RDSI Project Board

Passing the Baton

It has been quite a journey over the past 4 years as together we have created this extraordinary infrastructure for the research sector. The project has tackled a rich tapestry of issues and challenges, but with the help of all our stakeholders we now have a result to be proud of.

Researchers have made significant gains in the way they interact with their data by being able to concentrate on their research rather than worrying about the volume of data they are producing or the mechanisms to store that data.

We now pass the baton for continued development and support of this national infrastructure to the RDSI Node Operators, who will lead the next step in its evolution. We wish them luck and long lasting sustainability.

‒ Dr Nick Tate RDSI Project Director

The team

Chapter XII

Project Office Communications

Project Director

Dr Nick Tate

[email protected]

+61 7 3365 2019 | +61 412 674 010

Communications Officer

Asher Vennell 

[email protected]

+61 408 517 376

Project Manager

Viviani Paz 

[email protected]

+61 7 3365 2033 | +61 402 280 257

Storyteller

Patricia McMillan 

[email protected]

+61 434 602 050

Office Manager

Toni Walkinshaw [email protected]

+61 7 3365 2030 | +61 419 477 490

Solutions Specialist

Loretta Davis 

[email protected]

+61 407 370 474

Data Sharing (DaSh)

Vendor Engagement Specialist

Paul Campbell

[email protected]

+61 7 3878 2666 | +61 402 002 266

Security Policy Manager

Mark McPherson 

[email protected]

+61 418 425 872

Node Development (NoDe)

NoDe Manager

Richard Northam 

[email protected]

+61 417 044 625

Research Data Services (ReDS)

ReDS Programme Manager

Peter Hicks 

[email protected]

+61 401 103 640

Research Data Manager

Dr Markus Buchhorn 

[email protected]

+61 417 281 429

Research Data Manager

Dr Frankie Stevens 

 [email protected]

+61 435 657 730

Interacting with the community

I’ve really enjoyed the breadth of interaction we’ve been able to have as a project team across the stakeholders: the research communities, the universities and science agencies, the Nodes. It’s been wonderful to be able to talk to all of those stakeholder groups. One of the things that’s been most interesting for me has been to see the different approaches those groups bring to data: how research data management might be viewed at the institutional level, versus the state level, versus a person working in a laboratory.

‒ Dr Frankie Stevens

Research Data Manager, RDSI Project

Nodes

Board Members

Brian Anker

Independent Chair

Dr Rhys Francis

Director - eResearch Futures P/L

Professor Doug McEachern

Former Pro Vice Chancellor Research and Innovation - The University of Western Australia

Professor Anton Middelberg

Deputy Vice Chancellor (Research) - The University of Queensland

Former Board Members

Peter Nikoletatos

Executive Director and Chief Information Officer - La Trobe University

John Shipp

Vice-President - Australian Library and Information Association

Professor Liz Sonenberg

Pro Vice-Chancellor (Research Collaboration) - The University of Melbourne

Professor Max Lu

Provost and Senior Vice-President - The University of Queensland

Professor Jill Trewhella

Deputy Vice-Chancellor (Research) - The University of Sydney

The end