TruXy: Trusted Storage Cloud for Scientific Workflows

Surya Nepal, Richard O. Sinnott, Carsten Friedrich, Catherine Wise,

Shiping Chen, Senior Member, IEEE, Sehrish Kanwal, Jinhui Yao, and Andrew Lonie

Abstract—A wide array of clouds have been built and adopted by business and research communities. Similarly, many research communities use workflow environments and tools such as Taverna and Galaxy to model the execution of tasks to support and expedite the re-enactment of complex processes, and ultimately support the repeatability of science. As a result, a number of systems that integrate clouds with these workflow systems have emerged, including CloudMap, CloudMan, Galaxy Cloud and BioBlend. Though these systems have proven successful in service and data integration, they have not dealt with the data security issues inherent in cloud-based systems with outsourced models of infrastructure provisioning. For many domains, e.g., health, this poses serious challenges to the adoption of cloud infrastructures. Yet such domains also have much to gain from clouds, especially given the explosion of genomics data and the opportunities for personalized medicine in the big data era. This paper addresses this problem by presenting a trusted storage cloud for scientific workflows, called TruXy. The paper describes the TruXy architecture and the corresponding protocols, and illustrates the adoption of TruXy to support collaborative bioinformatics research in endocrine genomics. A range of experiments have been performed to measure the performance of TruXy for processing exome data sets from individuals with a rare genetic disorder: disorders of sex development (DSD). Our results show that the performance of TruXy is comparable to that of a standalone workflow tool and that it can handle the big data security challenges required.

Index Terms—Distributed/Internet-based software engineering tools and techniques, data encryption, security, integrity and protection, health care

1 INTRODUCTION

1.1 Cloud-Based Workflows

Although cloud computing was originally conceived predominantly with business computing needs in mind, it has increasingly been adopted by scientific communities [1]. Whilst there have been concerns about the ability of the cloud to meet the performance needs of scientific computation, recent results have shown that the performance of public clouds is comparable to high-performance computing (HPC) systems for particular applications [2], although it is noted that this is not true for all types of application. A number of science clouds have been built with a focus on providing computing infrastructure for science applications [3], [4].

An increasing number of scientific applications and research communities have adopted workflows [5]. Given this, there is a need to integrate workflows with clouds. Systems such as SciCumulus [6] and SwinDeW-C [7] have shown the feasibility of integrating workflows and clouds. In addition to these systems, many domain-specific cloud-based workflow systems have been developed for science applications in areas such as the life sciences [8], astronomy [9] and urban research [10]. One of the key issues identified in these works is the management of data in terms of its movement, sharing and security.

In the bioinformatics domain, two of the most popular workflow systems are Taverna [11] and Galaxy [8], [12]. In recent times, Galaxy has become increasingly popular and widely used. A number of extensions/plug-ins have been developed to integrate Galaxy with cloud environments, such as BioBlend [13], CloudMap [14] and Galaxy Cloud [15]. However, none of these platforms deals with the issues of data movement, data sharing and, above all, data security. In this paper, we address this problem through a system called TruXy: a trusted virtual private storage cloud for cloud-based scientific workflows.

1.2 Motivation

The motivation for this work comes from the post-genomic life sciences, and specifically from supporting the endocrine genomics virtual research laboratory (endoVL¹) [16], [17]. This project was funded through the Australian National eResearch Collaboration Tools and Resources (NeCTAR²) initiative. We derive the following needs and requirements for cloud-based scientific workflows from the project.


• S. Nepal, C. Friedrich, C. Wise, and S. Chen are with the CSIRO Computational Informatics group, Canberra, ACT 2601, Australia. E-mail: {surya.nepal, Carsten.Friedrich, Catherine.Wise, shiping.chen}@csiro.au.

• R.O. Sinnott, S. Kanwal, and A. Lonie are with the Department of Computing and Information Systems, University of Melbourne, Melbourne, Vic. 3010, Australia. E-mail: {rsinnott, alonie}@unimelb.edu.au, [email protected].

• J. Yao is with the School of Electrical and Information Engineering, University of Sydney, Sydney, New South Wales, Australia. E-mail: [email protected].

Manuscript received 28 Jan. 2015; revised 16 Aug. 2015; accepted 21 Sept. 2015. Date of publication 12 Oct. 2015; date of current version 6 Sept. 2017. Recommended for acceptance by K.-K.R. Choo, O. Rana, and M. Rajarajan. Digital Object Identifier no. 10.1109/TCC.2015.2489638

1. www.endovl.org.au
2. www.nectar.org.au


1. Data Privacy: The data provisioned, stored and analysed through scientific workflows includes highly sensitive data (e.g., exome/whole genome data on patients with extremely rare conditions). The detailed analysis of these data sets requires complex workflows to be defined and developed. These workflows must address the many privacy and security concerns regarding the data itself—which by its very nature is personally identifying. The data needs to be protected at all times during its life cycle.

2. Cloud Computations: Scientific workflows contain computationally expensive tasks, typically lasting days to weeks, that require large-scale computational infrastructure. Given this, it is unrealistic to assume that health providers (e.g., hospitals) have their own dedicated private cloud. Science clouds such as the NeCTAR Research Cloud offer an ideal environment in which to run these workflows, since they offer the significant compute power required for these compute-intensive tasks.

3. Secure Data Storage: Direct provisioning of (unencrypted) genomic data onto a public research cloud violates the terms of acceptable use of the data according to the ethics and information governance demands of the clinical and biomedical stakeholders. While the actual real-time processing of the data is ephemeral enough to pose only minor data leakage concerns, (semi-)permanent unencrypted data storage is a critical issue.

4. Secure Data Sharing: The data, as well as the results from the execution of workflows, needs to be shared among researchers. Indeed, within endoVL multiple researchers work on the same set of data, hence support for collaboration is a must.

5. Data Movement: Scientific workflows often deal with large data sets. For example, within the endoVL project individual patient data sets of 20-35 gigabytes in compressed form were used, each comprising a number of files of 1.5-3.5 gigabytes. In total, over 20 terabytes of data storage was provisioned for endoVL. Furthermore, the execution of the workflows generates output data of varying sizes that is often larger than the original data. These data volumes increasingly challenge private cloud resources. Ideally, the data should be stored close to the computational infrastructure that supports execution of the workflows; hence the data often needs to be stored in larger-scale public clouds like NeCTAR.

6. Ease of Use: One of the key requirements for scientific applications is that the system should be easy to use by scientists who are not experts in working in cloud environments, but who are often familiar with workflow systems. Ideally, the security-enhanced system should appear to the end-user researcher just like a standalone workflow system.

1.3 Contributions and Organisation

To address the need for security-enhanced scientific workflows in the endoVL project, we have developed a novel virtual private storage cloud for scientific workflows: TruXy. It offers a secure data service that extends a scientific workflow system by integrating it with a secure cloud storage service. In the research community, the focus so far has been on either integrating workflows with clouds or providing secure storage in the cloud.

This paper presents a novel way of providing secure cloud storage for workflows in the cloud (i.e., workflows that are running on multiple virtual machines in the cloud). This is achieved by (a) developing a generic architecture for providing secure storage services to scientific workflows; (b) enabling collaboration amongst collaborators—an essential feature of any collaborative application dealing with sensitive data—through a secure data sharing protocol; (c) controlling data movement, so that only the data required by the workflows is moved between cloud storage and the workflows running on virtual machines in the science cloud, by instantiating the secure data storage service in the same cloud environment; and (d) enabling users to use the workflow system transparently, i.e., as if they were using a standalone workflow system.

In this paper, we present the architecture and protocols of the system, and demonstrate how we have instantiated, developed and deployed the architecture using Galaxy and TrustStore [18] in the context of the endoVL project dealing with sensitive (genomic) data sets. In addition to this integration, we have extended TrustStore to support data integrity, availability and secure data sharing services in TruXy, and to allow Galaxy to import/export data from/to TrustStore. We undertake a range of experiments to measure the performance of the system. We observe that overall TruXy scales with the data sets and that data size has minimal effect on its performance. Most importantly, the work meets the many requirements regarding privacy and information governance demanded by the data stakeholders in the endoVL project.

The rest of the article is organized as follows. Section 2 provides an overview of related work, focusing on cloud data security and the integration of cloud and scientific workflow systems. Section 3 describes the architecture of TruXy and the corresponding services and protocols. Section 4 gives a brief description of endoVL in the context of TruXy and the case studies that have driven and shaped this work. This is followed by the implementation of TruXy using TrustStore and Galaxy in Section 5. Section 6 provides the experimental results. The benefits and shortcomings of TruXy are presented in Section 7 for the benefit of practitioners and developers. The final section provides conclusions and plans for future work.

2 RELATED WORK

2.1 Workflows in the Cloud

Juve et al. have experimented with the suitability of commercially available cloud infrastructures for scientific workflows [2]. One of the important observations from their experiments was that the cost of running workflows on commercial clouds could be reduced by storing data in the cloud, rather than transferring it to the cloud from external sources at runtime. This reinforces the need to integrate available cloud storage with cloud computing resources (through scientific workflows).

In the past decade, the integration of cloud computing and workflows has gained great momentum [19]. However, the integration of the two is not straightforward, as most traditional workflow systems are designed to be used within a trusted, confined environment, whereas cloud computing encourages outsourcing. In this regard, a number of solutions have been proposed in the literature, such as SciCumulus [6], SwinDeW-C [7], BioBlend [13], CloudMap [14] and Galaxy Cloud [15]. However, none of these solutions adequately deals with data security, and hence many domains have been precluded from the benefits of cloud infrastructures. Security is one of the major concerns in public cloud workflow deployments [20], [21].

In this paper, we provide a unique data security solution for cloud workflow systems, together with a use case for a well-known scientific workflow environment, Galaxy, with an application focus on a major genomics project (as opposed to a hypothetical or representative test case). Whilst we have chosen Galaxy as the workflow system and TrustStore as the storage solution, we note that the TruXy architecture itself is extensible to other workflow systems. We next describe what makes a TrustStore-based solution suitable for scientific workflows in comparison to other secure cloud data storage solutions.

2.2 Cloud Data Security

2.2.1 Confidentiality

The most common approach to supporting data confidentiality is to encrypt data stored in the cloud [22]. Another approach to cloud security is the use of hybrid clouds, where sensitive data is stored/processed in a private cloud and non-sensitive data is stored/processed in a public cloud. Examples of this include Twin Clouds [23] and SecCSIE [24]. However, the focus of Twin Clouds and SecCSIE is at the enterprise level, and as such they are not suitable for projects like endoVL whose users belong to many enterprises.

The confidentiality issue has also been dealt with in data privacy research using techniques such as k-anonymization [25] and differential privacy [26]. The aim of these techniques is to share data with other researchers in such a way that the confidentiality of the data is maintained. However, these techniques are not applicable to collaborative scientific applications like endoVL, where the sensitive data itself needs to be shared among multiple researchers to support the repeatability of science, and all researchers participating in the virtual laboratory need access to the original data.

TrustStore works at the individual user level and does not assume that an individual has his/her own private cloud. Hence, it is well suited to projects such as endoVL. Furthermore, TrustStore provides easy management of keys, one of the most challenging issues in existing data security solutions.

2.2.2 Integrity

Integrity is a well-researched property within data security. Two popular approaches are public auditing [27], [28], [29] and verification [30], [31]. SecCloud [32] provides a solution based on auditing, where the auditor ensures that the cloud providers behave as expected. Wang et al. also propose a flexible distributed storage integrity auditing mechanism utilizing homomorphic tokens and distributed erasure-coded data [27], [33], [34]. Panda [29] proposes an efficient technique for public auditing when data is shared among a group of users. However, these solutions are inefficient for endoVL for two reasons: homomorphic-encryption-based solutions are not scalable for big data, and the solution is tightly coupled with the cryptographic protocol.

A number of algorithms and protocols have been proposed to address the problem of secure verification, i.e., verifying that an un-trusted outsourced storage service is faithfully storing the client's data [35], [36], [37], [38]. They are generally studied under two categories: Provable Data Possession (PDP) and Proof of Retrievability (PoR). Ateniese et al. [35] define the PDP model, which enables clients to verify that un-trusted servers possess the original data without retrieving it. Juels and Kaliski [37] describe a PoR model that enables a user to determine that a server possesses a file. However, there are two key shortcomings with these approaches: efficiency and reliance on the storage service. In our project, we have used a service-oriented solution for integrity management in TrustStore [39] to overcome these shortcomings.

2.2.3 Availability

Availability of data can be increased by using multiple cloud service providers (CSPs) and replicating data amongst those providers. Bessani et al. developed such a system, called DepSky, which uses a cloud-of-clouds to improve the availability of data in addition to data confidentiality and consistency [40]. HAIL is another cloud storage system that supports high availability independent of the individual storage services [41]. In our project, we have implemented a solution for availability in TrustStore similar to the approaches proposed in HAIL and DepSky, where a replication service deals with data availability. One of the key features of the replication service is that it works independently of the underlying cloud service providers. Hence, it is possible to add any number of storage services or clouds provided by different vendors to meet the needs of a variety of users and applications.

2.2.4 Data Sharing

One of the data security issues specific to scientific workflows is the secure sharing of data stored in the cloud with other processes and/or researchers. Lin and Tseng have proposed a threshold proxy re-encryption scheme and integrated it with a decentralized erasure code to establish a secure distributed cloud storage system [42]. Liu et al. propose a scheme for sharing files in hierarchical role structures [43]. In this scheme, a sender can specify several users at a lower level as the recipients of a file by taking the number and public keys of the recipients as inputs to a hierarchical identity-based encryption algorithm, which enables only users at the upper level, as well as the intended recipients, to decrypt the file using their own private keys. Kamara et al. proposed a searchable cryptographic cloud storage system, called CS2 [44]. The CS2 system uses highly efficient and provably secure cryptographic primitives and protocols based on a searchable symmetric encryption scheme.

Key problems with these approaches include: the data file may need to be re-encrypted in order to revoke user access; they place full reliance on the PKI infrastructure while offering no protection or management for the private keys; or they rely entirely on the access control mechanism. In our solution, we provide a unique mechanism for data sharing in which a user can use the PKI infrastructure for data sharing, but the private keys are protected in the user profile through passwords, and the public key itself is used to provide role-based access control. These functionalities are supported by a key management service (KeyS), in addition to the data confidentiality capabilities described earlier in this section.

2.2.5 Security and Privacy Issues in MapReduce

Emerging parallel and distributed data programming frameworks (e.g., Hadoop [45]) enable the development of distributed applications. The MapReduce framework of Hadoop has adopted the Kerberos authentication mechanism [46]. However, access control alone fails to preserve privacy, as the data remains accessible in unencrypted form. Roy et al. [47] investigated the data privacy problem caused by the MapReduce framework and presented a system, called Airavat, which incorporates mandatory access control with differential privacy. In this model, mandatory access control is triggered when the privacy leakage exceeds a threshold, so that both privacy preservation and high data utility are ensured. However, the results produced by this system are mixed with a degree of noise, resulting in lower data utility; this is not applicable to scientific workflows, which aim to support highly accurate data analytics and the associated repeatability of science.

Several systems have been proposed to handle the privacy concerns of computation and storage on the cloud from a data security perspective. Ko et al. [48] proposed the HybrEx MapReduce model to provide a way for sensitive and private data to be processed within a private cloud while other data can safely be extended to the public cloud. Zhang et al. [49] proposed the Sedic system, which partitions MapReduce computing jobs in terms of the security labels of the data they work on and then assigns the computation without sensitive data to a public cloud. Sharif et al. [50] proposed a similar approach for workflows, called MPHC. One of the problems with these approaches is that they rely on having a private cloud for private data and a public cloud for non-sensitive data. These approaches are not only complex, but they do not scale well when all data are sensitive and private, i.e., when they cannot simply be separated (as was the case for endoVL).

This paper thus provides a unique secure storage cloud solution for scientific workflows. The solution is unique in that none of the existing cloud-based workflow systems provides a secure data solution that ensures confidentiality, integrity, availability and data sharing on public clouds.

3 TRUXY

This section presents the overall architecture of the TruXy system.

3.1 System Architecture

Fig. 1 illustrates the architecture of TruXy, which uses TrustStore as its underlying platform. It consists of four main layers: the Cloud Service Provider (CSP) layer, the TruXy Middleware Layer (TML), the TruXy Client-end (TCe) and the TruXy Application Layer (AL). The functionalities of each of the main architectural layers are summarized as follows:

The Cloud Service Provider layer hides the complexity and heterogeneous nature of the underlying storage infrastructures from the higher layers by providing a location-independent cloud storage infrastructure. The CSP layer abstracts the underlying storage infrastructure, allowing clients, e.g., TruXy apps, to put/get data to/from the storage infrastructure without needing to be concerned with specific storage technologies, topologies, locations, etc. For example, if TruXy creates an instance of secure storage using Amazon S3 as a CSP, the simple put/get operations of the interface are translated into the equivalent protocol messages used by Amazon S3, namely XML messages wrapped in HTTP (or, optionally, objects distributed via the BitTorrent P2P protocol).
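To make the layering concrete, the following is a minimal sketch of what such a location-independent put/get abstraction could look like in Python; the class and method names are illustrative and are not taken from the TruXy codebase.

from abc import ABC, abstractmethod

class CloudStorageProvider(ABC):
    """Uniform put/get interface over heterogeneous storage back-ends."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryCSP(CloudStorageProvider):
    """Stand-in back-end for testing; a real adapter would translate
    put/get into, e.g., S3 REST calls or OpenStack Swift requests."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]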

The TruXy Middleware Layer sits between the applications (in the AL) and the cloud storage provided by the CSPs. It provides services that transparently enhance the trustworthiness of the cloud storage. It has three main components:

• Key Management Service (KeyS), which manages the encryption keys used in the system. KeyS generates, distributes and stores the keys for the user applications in the AL, applies access control and facilitates key sharing among users. In addition, KeyS maintains the profile of the virtual store.

• Data Integrity Management Service (IntS). Before data is sent to the cloud, its hash value is stored in the integrity management service, which can subsequently use it to verify the data's integrity upon retrieval. IntS supports both passive and active verification of the integrity of the data stored in the cloud.

• Data Replication Service (RepS), which assists the user in creating multiple replicas of the data and storing them in different CSPs. It enhances the availability of the user data at the cost of additional cloud usage.

The three fundamental aspects of security are thus provided by three independent services: the Key Service (data confidentiality), the Replication Service (data availability) and the Data Integrity Service (data integrity). These services can be provided by trusted third parties. Given that each entity is individually subject to possible fraud and deception, the chance that multiple entities collude with each other and tamper with a given user's data without being detected is significantly smaller. The TruXy Middleware Layer also enables scientific workflows to meet their storage requirements by creating storage instances in the same infrastructure where the workflows are running, in order to reduce data movement.

Fig. 1. Architectural design of TruXy.


The TruXy Client-end serves two purposes: it provides an interface for user applications in the AL to interact with the TML, and it acts as the client-end of TruXy applications, providing the security functionalities that need to be deployed at the application side, e.g., local data partitioning and encryption/decryption.

The Application Layer allows workflow and user applications to access the secure virtual storage provided by the TML. The workflow application is used by the workflow systems; note that the user application can also be used by researchers to access data outside of the workflow environment, i.e., users can upload new data sets as well as view the results of analysis outside the workflow systems.

3.2 Data Confidentiality Protocol

The following describes the approach taken to utilize cryptographic techniques to protect the confidentiality of data stored in the public cloud. We have adopted the algorithm proposed in TrustStore [18], where the raw data is partitioned into different parts that are distributed to different service entities: the CSP and KeyS. Only when all parts are collected by a single entity can the restoration of the raw data be successfully conducted. The trustworthiness offered by TruXy is built on the premise of this separation of trust and mutual constraints.

TruXy provides security based on the following assumptions. We assume that the client computer is entirely trusted and secure for sensitive data operations and computation. Both the cloud Service Providers and the TruXy Middleware Layer are semi-trusted, and there is no collusion between them. They manage the services they claim to provide; however, they are assumed to be potentially subject to both internal and external attacks. Consequently, the encryption and decryption processes need to be performed at the user end, so that sensitive content is protected from the moment it is generated by the user applications. In TruXy, KeyS issues and archives high-strength encryption keys for the user, while the TruXy applications deployed at the client side handle all the encryption and decryption operations associated with the issued keys. The entire TruXy process is shown in Fig. 2.

We assume that sensitive data on the application side is stored in the form of files. Let a file f be modelled as

f = (M, X),

where M represents meta-data, including the filename, size, type and modified time, and X represents the file content. To encrypt a file, we partition it into a set of fragments, recorded in the file fragment map shown in Fig. 2. This process is defined as

fr = \{fid, m, x\} \quad \text{and} \quad m = (kid, filename, order, \{location\}),

where fid is the fragment identity of file f, a randomly generated id used as the filename for that fragment; m represents the fragment meta-data, including the id of the key from the fragment key map used to encrypt this fragment, the filename of the original file, the fragmentation order and the locations where the fragment is stored; and x is the content of the fragment. We can then redefine f as a set of fragments:

f = \bigcup_{i=1}^{p} fr_i = \left\{ \bigcup_{i=1}^{p} fid_i, \bigcup_{i=1}^{p} m_i, \bigcup_{i=1}^{p} x_i \right\} = \left\{ v, \bigcup_{i=1}^{p} x_i \right\},

where p is the number of fragments of the file and v represents all meta-information pertaining to these fragments, including the fragment key map and the fragment location map. The meta-data and directory structure of a file may themselves reveal private and confidential information about the data; therefore, they need to be protected to the same level as the contents of the file.

As such, we need a model to represent such structures so that file-related information can be stripped off. To achieve this, we define a File Object F:

F = \left\{ ID, v, \bigcup_{i=1}^{n} F_i \right\}.

Each file has an associated File Object with a unique ID that is randomly assigned to the file and, in the case of a file, the meta-information of the file's fragments. In the case of a directory, the File Object contains the File Objects of all its underlying child files. With this model, the meta-information and file structure of a whole collection of files can be encapsulated into one single File Object, which we call the Root File Object R. Therefore, the encryption of a collection of files (a file system FS) can be expressed as

Enc(FS) = \left\{ Enc_{k_1}(x_1), \ldots, Enc_{k_q}(x_q), Enc_{k_{q+1}}(R) \right\},

where Enc represents the encryption operation; q + 1 keys are involved, of which one is used for the Root File Object and the rest are used for the fragments generated from the n files in the collection. The file meta-data and structure are transformed into cipher text in the same way as the content. The encryption operation is performed at the client end via the TCe, while the keys used are issued by KeyS. The key map for the whole file system is represented as

KeyMap = \left\{ \bigcup_{i=1}^{q} (k_i, kid_i), (R_{ID}, R^{k}_{ID}) \right\}.

The Fragment Key Map part of the data includes a collection of q + 1 encryption keys with their key IDs. It also includes the pair of the Root File Object ID (R_ID) and its associated key ID. Since the fragment IDs and the associated fragment key IDs are all contained in the Root File Object, this is the first file that needs to be fetched and decrypted so that the other fragments can be processed.

Fig. 2. TruXy's overall processes for confidentiality, availability and integrity.

The file uploading process when using a TruXy application works as follows. Once the target file is chosen by the user, the TCe in the TruXy application first partitions the data file into fragments; encryption keys are then obtained from KeyS and used to encrypt both the contents and the meta-data of the files; finally, the cipher text is uploaded to the chosen CSPs. After these procedures, the keys are stored at KeyS, while the original content has been transformed into cipher text and uploaded to the CSP. Hence, the separation of the encryption keys from the cipher text is achieved, i.e., neither KeyS nor the CSP alone is able to view the sensitive stored data.
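The following sketch illustrates this partition-encrypt-separate flow, assuming AES-GCM for fragment encryption (via the Python cryptography package); the fragment size, the CSP object (with the put interface sketched in Section 3.1) and the KeyS stub with its store method are illustrative, not the actual TrustStore client API.

import os
import uuid
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

FRAGMENT_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB fragments

def upload_file(path, csp, keys):
    """Partition a file, encrypt every fragment under its own key, then
    send cipher text to the CSP and keys to KeyS -- never both to one entity."""
    key_map = []
    with open(path, "rb") as f:
        order = 0
        while True:
            fragment = f.read(FRAGMENT_SIZE)
            if not fragment:
                break
            fid = uuid.uuid4().hex          # random fragment id, used as its filename
            key = AESGCM.generate_key(bit_length=256)
            nonce = os.urandom(12)
            cipher = AESGCM(key).encrypt(nonce, fragment, None)
            csp.put(fid, nonce + cipher)    # the CSP only ever sees cipher text
            key_map.append({"fid": fid, "order": order, "key": key})
            order += 1
    keys.store(path, key_map)               # KeyS only ever sees keys and meta-data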

3.3 Data Integrity Protocol

Another important aspect of security is preserving data integrity. Although the cloud storage service cannot successfully decrypt the cipher text that is stored, it may tamper with the data while at rest. Similarly, a third party could tamper with the data whilst in transit. Furthermore, silent corruption may occur to the data due to hardware failures. Thus, the public cloud storage service is inherently not trusted. In order to address this problem, TruXy adopts the concept of DIaaS [39] from TrustStore, which defines a dedicated service—the Integrity Management Service—to deal with the problem of integrity violations of data stored on the cloud.

We assume the cloud storage service supports functionality for data integrity checks, such as returning a hash/checksum value of an object stored in the cloud. This is needed for the user and the IntS to verify the integrity of the data without actually downloading it. The Data Integrity Service is semi-trusted; as such, the information disclosed to IntS should be minimized while still allowing it to verify the integrity of the data. Following the model of the previous section, data files are partitioned into fragments and stored in the cloud as cipher text, while the keys are stored in KeyS. During this process, the TCe can use a collision-resistant hash function [51] to calculate the hash of every encrypted fragment and put the hashes into the fragment meta-data. This can be defined as

m = (kid, filename, order, \{location\}, hash(Enc_k(x))).

In this way, the new File Object of a file contains the hash values of all of its fragments (the integrity map in Fig. 2). During uploading, the integrity map is sent to the IntS to be stored; during downloading, it is fetched to verify the integrity of the corresponding data file.

The detailed data uploading protocol is shown in Table 1. After data partitioning and encryption, the TCe applies a hash function, creating a unique hash value of the object (step 1), and sends (a) the cipher text to the CSP; (b) the keys to KeyS (steps 2a and 2b); and (c) the hash value (encapsulated in the Integrity Map) to IntS (step 2c). Upon receiving the hash value, IntS requests the hash value of the object from the CSP once the object has been uploaded and stored there (steps 3 and 4), and compares the two hash values of the same data object (step 5). Steps 3, 4 and 5 are optional, as not all CSPs can provide the hash value of a stored object. The IntS signs the Integrity Map with a timestamp using its private key and sends the certificate back to the TCe (steps 6 and 7). The TCe verifies the authenticity and correctness of this certificate; if valid, it signs it with the private key of the user, which is generated and stored at KeyS, and sends it back to IntS (steps 8, 9 and 10). Finally, IntS stores the signatures generated along the way, as well as the hash value (step 11).

Fig. 3 shows the interactions between the components when the integrity of downloaded data is validated against the hash values stored at IntS. After the encrypted file objects and keys are obtained by the TCe, the hash values of the objects are fetched from IntS. Only if the signatures supplied by IntS are valid and the hash values match is the cipher text decrypted to restore the original data files. Note that in phase 1b the keys needed to verify the signatures are also fetched, such as the public key of the IntS and the private/public key pair of the user. The use of a dedicated service to maintain the integrity of the data is a further application of the N-degrees-of-separation principle.
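Below is a minimal sketch of the client side of steps 1-5 of the upload handshake summarized in Table 1; the service stubs and method names are hypothetical, and the signature exchange of steps 6-11 is omitted for brevity.

import hashlib

def upload_with_integrity(fid, cipher, csp, ints):
    """Steps 1-5 of Table 1: hash the encrypted fragment, upload it,
    and cross-check the hash the CSP reports for the stored object."""
    h = hashlib.sha256(cipher).hexdigest()   # step 1: TCe computes the hash
    csp.put(fid, cipher)                     # step 2a: cipher text to the CSP
    ints.store_integrity_map(fid, h)         # step 2c: integrity map to IntS

    # Steps 3-5 (optional; performed by IntS in the real protocol):
    # obtain the CSP's hash of the stored object and verify it.
    h_csp = hashlib.sha256(csp.get(fid)).hexdigest()
    if h_csp != h:
        raise ValueError(f"integrity mismatch for fragment {fid}")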

TABLE 1
Uploading Protocol for Preserving Data Integrity

1. TCe : hash(Enc_k(x)) = h
2a. TCe → CSP : Encrypted Fragments
2b. TCe → KeyS : Fragment Key Map
2c. TCe → IntS : Integrity Map
3. IntS → CSP : Request hash of Enc_k(x)
4. CSP → IntS : hash value = h'
5. IntS : verify(h', h)
6. IntS : Sign_IntS(Integrity Map, time) = S_IntS
7. IntS → TCe : S_IntS
8. TCe : verify(S_IntS)
9. TCe : Sign_User(S_IntS) = S_User
10. TCe → IntS : S_User
11. IntS : Store(S_User, S_IntS, Integrity Map)

Fig. 3. Protocol for checking data integrity for downloading.

3.4 Availability Protocol

TruXy adopts the principles of Redundant Array of Independent Disks (RAID) [52] and Reliable Array of Independent Nodes (RAIN) [53] to address the availability of the data. Using the fragment location map shown in Fig. 2, the process works as follows: for the store operation on a file f of size d, a maximum distance separable (MDS) code (n, k) can be used to encode the file into n symbols, each of size d/k. We store one symbol per CSP. One symbol may contain multiple fragments of a file, and the CSP details are registered for each fragment in the fragment location map. The fragment location map is then stored in RepS. For the retrieve operation, we fetch the data for a file from any k CSPs and decode it with the help of the fragment location map to obtain the original file. For a small number of CSPs, we use RAID-1 with full redundancy.

This availability protocol has a number of advantages: (a) the original data can be recovered despite up to (n-k) CSP failures; (b) depending on cost and utility, we can shut down CSPs without copying data from them, and also add new CSPs; and (c) the data can be made available by accessing only k CSPs.
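As a toy illustration of the MDS principle, the sketch below uses a single XOR parity symbol, i.e., an (n, k) = (k+1, k) MDS code: any one of the n CSPs can fail and the file remains recoverable. A production RepS would use a general (n, k) code (or, per the text above, RAID-1 replication for small numbers of CSPs).

def encode_xor(symbols):
    """Given k equal-length data symbols, return n = k+1 symbols,
    the last being the XOR parity of the others."""
    parity = bytes(len(symbols[0]))
    for s in symbols:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return symbols + [parity]

def decode_xor(received):
    """Recover the k data symbols from n symbols with at most one
    missing entry (None marks the failed CSP)."""
    missing = [i for i, s in enumerate(received) if s is None]
    if not missing:
        return received[:-1]
    length = len(next(s for s in received if s is not None))
    rebuilt = bytes(length)
    for s in received:
        if s is not None:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, s))
    received[missing[0]] = rebuilt
    return received[:-1]

# Example: k = 3 data symbols spread over 4 CSPs; CSP 1 fails.
data = [b"AAAA", b"BBBB", b"CCCC"]
stored = encode_xor(data)
stored[1] = None
assert decode_xor(stored) == data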

With regard to the availability of the TruXy middleware services themselves (i.e., KeyS, IntS and RepS), we use the cost and utility model proposed in [54] to replicate the services.

3.5 Data Sharing Protocol

One of the requirements of scientific workflows is support for collaborations in which data is widely accessed and shared. However, confidential data sharing needs strong access control to ensure that the sharing is authorized. As a result, in addition to data confidentiality, integrity and availability, TruXy also provides data access control mechanisms.

Access control in TruXy is provided at the store level. Some cloud services support access control at the individual storage (bucket) level, while others only support it at the account level. The TruXy application first creates a virtual store by sending a request to KeyS containing the metadata for the store. KeyS records this information and returns the same store with the addition of a unique identifier. The metadata contains the following items (an illustrative record is sketched after the list):

• Store Name: The store name is supplied by the user. It also helps in searching and browsing stores when the user has a large number of them.

• Access Control List (ACL): A list of usernames and their associated access levels. When the store is first created, none of these need to be set, except that the creating user must be the Owner.

• Codename: A unique identifier of the file representing the base of the directory tree in the cloud storage.

• Initialization Vector: The initialization vector used when encrypting the base of the directory tree.

• Profile File Name: The name of the file on KeyS that holds the IntS and CSP credentials.
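For illustration only, a store-metadata record of this shape might be encoded as the following Python dictionary; the field names and placeholder values are ours, as the paper does not prescribe a concrete wire format.

store_metadata = {
    "store_id":   "6f2c...",             # unique identifier assigned by KeyS
    "store_name": "dsd-exome-cohort",
    "acl":        {"alice": "Owner", "bob": "Author", "carol": "Reader"},
    "codename":   "9c41...",             # base of the encrypted directory tree
    "iv":         "base64:...",          # initialization vector for the tree base
    "profile":    "profile-6f2c.enc",    # file on KeyS holding IntS/CSP credentials
}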

The next step is to set up the store's profile file and the directory tree in the cloud. For the profile, the client must generate a username, password and public-private key pair to use with IntS. The profile also contains the credentials for all cloud providers to be used, and the bucket (also known as workspace) name. Different clients support different cloud providers, depending on the availability of suitable libraries.

The access control list is defined as a triplet as follows:

• User: refers to a user of TruXy, who can upload, share and consume the data stored in TruXy;

• Profile: is a record of the credentials required to access the data stored in a particular store;

• Role: refers to a set of types of TruXy user, each with different permissions to access and use the data.

Four roles are defined in TruXy: Administrator, Owner, Author and Reader. A Reader can only read the data contents within a given store. An Author can read all data contents, and can also add/update/delete data. An Administrator can read and write data, and is also able to grant and remove roles from other users (including other Administrators). In addition, an Administrator is able to delete the entire store. There is only one Owner per store, with the same privileges as an Administrator; the only difference is that the Owner's role is permanent: no user (including themselves) may revoke their access rights. This is because the Owner is assumed to own the cloud credentials, and is therefore responsible for any costs of using the cloud service.

Once a user has created a store, they may share it with other users. To do this, the user first views the store in the store list interface and chooses 'edit'. The list shows the stores that the user has permission to access; the user can only edit those stores to which they have Administrator or Owner access. The user is then able to add new users with the access levels described earlier.

When the user has modified the access of others to a store, the TruXy application requests from KeyS the public keys of all users who have been granted access. The client then encrypts the profile under each key using S/MIME encryption, and stores the resulting files on KeyS for the other users to retrieve. The client can then request that KeyS update the store metadata.
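A sketch of this grant flow follows, with RSA-OAEP standing in for the S/MIME layering that TruXy actually uses, and with hypothetical KeyS methods; note that S/MIME in practice wraps a symmetric content key, since RSA alone can only encrypt short payloads, so the sketch assumes a small profile.

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def share_store(store_id, profile_bytes, grantees, keys):
    """Encrypt the store profile under each grantee's public key and
    deposit the cipher texts on KeyS for later retrieval."""
    for username in grantees:
        pem = keys.get_public_key(username)        # public key fetched from KeyS
        pub = serialization.load_pem_public_key(pem)
        cipher = pub.encrypt(
            profile_bytes,
            padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None),
        )
        keys.put_encrypted_profile(store_id, username, cipher)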

When a user chooses to open a store from the store list, the client requests the store credentials from KeyS; these will have been encrypted with the user's public key. The client then decrypts the profile using the user's private key. Once in possession of the decrypted profile, the client can access the file system representation on the cloud storage providers.

Since the process of decrypting and assembling the fragments using the encryption keys is encapsulated in the TruXy applications and executed in memory at runtime, it is difficult to capture and store the encryption keys locally for unauthorized use later on. However, dishonest users can still retain collaborative data after their access has been revoked, e.g., by copy/paste or by filming the screen; this is beyond the scope of this paper and can only be addressed through traditional means, such as contracts and/or laws.

4 ENDOCRINE GENOMICS VIRTUAL LABORATORY

TruXy has been applied in the context of a national biomedical virtual laboratory supporting research into a variety of endocrine disorders. The endocrine genomics virtual laboratory (endoVL) was funded through the Australian National eResearch Collaboration Tools and Resources (NeCTAR) initiative. The focus of endoVL has been to develop a range of deep phenotypic disease registries covering a variety of endocrine disorders. Together with advanced bioinformatics data processing capabilities focused around whole genome/exome data sets, these registries allow insights and breakthroughs into disease manifestation, shed light on the underlying biological processes that take place with given disorders, and ultimately aid patient management and support personalised medicine.

EndoVL developed a range of targeted registries including: type-1 diabetes (supporting the work and efforts of the Australasian Paediatric Endocrine Group (APEG)); atypical femur fractures (supporting the Endocrine Society of Australia (ESA) and the Australian and New Zealand Bone and Mineral Society (ANZBMS)); neuroendocrine and adrenal tumours (supporting the work of the Clinical Oncology Society of Australia (COSA)); Niemann-Pick disease types A, B and C (supporting the International Niemann-Pick Disease Alliance (INPDA)); polycystic ovarian syndrome (PCOS); and disorders of sex development (DSD) (supporting the Australasian Disorders of Sex Development network (DSDnetwork)). These disorders were primarily chosen due to the extensive body of work already undertaken by clinical communities in standardizing their data models, and its implementation through a range of completed and ongoing UK and European projects, e.g., ENSAT-CANCER (www.ensat-cancer.eu), the International-DSD project (www.i-dsd.org), EuroWABB (www.euro-wabb.org) and the Australian Diabetes Data Network (ADDN) (www.addn.org.au).

It is important to note that each of these disorders has its own specific phenotypic data that is of interest and has been agreed as the data to be collected by that community. The detailed (deep) phenotypic data is primarily focused on clinical information, e.g., demographic, treatment and surgical information. However, many researchers supported through endoVL require access to physical biospecimens (blood, urine, plasma, DNA, . . .) that can be used to better understand the genetics and underlying biological aetiology of the particular diseases.

The evolution of the much-vaunted e-Health agenda and the rhetoric around personalized medicine depend greatly upon the ability to process and interpret the ever-increasing volumes of genomic data that can now be generated. A major activity of endoVL was the definition and delivery of repeatable Galaxy data processing pipelines on the NeCTAR Research Cloud. The various stakeholders involved in the project (including ethics boards and information governance personnel) demanded that all necessary steps were taken to ensure that the data was protected at all times.

The case study that shaped the application of TruXy was based around a cohort of patients with disorders of sex development. This disorder is extremely rare and has a range of phenotypic presentations, which can include ambiguous genitalia, short stature and a range of other physical (and societal) complications. As such, the sensitivities around these data sets are extreme. The original focus of the work was a family of 15 from Indonesia, of whom seven had different manifestations of DSD. Initially, the three distributed bioinformatics groups involved across Australia were tasked with identifying the different mutations (variants) that gave rise to these conditions within the family. Each group used its own in-house bioinformatics data processing pipelines with local data storage, with mixed and ultimately inconsistent results.

The second phase of the work focused entirely on the use of standardized Galaxy-based data processing pipelines through the Genomics Virtual Laboratory (GVL³) that utilized TruXy. The focus of this activity was a cohort of six DSD patients (from Belgium and the Netherlands). Whole exome sequencing of the samples from these patients was undertaken by the Australian Genome Research Facility (AGRF⁴) on the Illumina HiSeq 2000 platform to generate 100 bp paired-end reads. Files of FASTQ sequence data were provisioned through TruXy on the NeCTAR Research Cloud for bioinformatics data analytics. The total resources provisioned through the NeCTAR Research Cloud for endoVL for this activity comprised 20 TB of data storage and 200 VMs offering 440 GB of memory.

It is important to note that whilst all groups were advised to use Galaxy (and TruXy) on the NeCTAR Research Cloud, it was made clear that the actual Galaxy data processing pipelines themselves should be developed independently and without collusion. A key aspect of this work was to assess the accuracy of the resultant analysis and how readily it could be translated into a clinical context.

An example of a Galaxy workflow developed to process the exome data of the endoVL DSD patients is shown in Fig. 4. The figure illustrates the overall workflow and the sequence of steps and data processing tools involved in the bioinformatics analysis. Fig. 5 shows the import of data into TruXy. The pipeline involves numerous steps covering, amongst other things, sequence alignment, read coverage and biological annotation. Each of these steps can be computationally expensive and involves the generation of significant intermediate data sets. The final result of this Galaxy data-processing pipeline is a variant call format (.vcf) file that contains all of the gene sequence variants found for that particular individual, i.e., the differences found for that individual relative to the reference genome. This data and its association with the phenotypic information on the individual in the DSD registry are key to endocrine genomics: do all individuals have the same variants/genetic mutations? Is this what is causing this particular presentation of the disease?

It should also be noted that the linkage between the clinical information in the registry and the processed bioinformatics data utilizes unique registry-specific information, i.e., all patients are assigned a unique identifier in the registry, and this identifier is used as the basis for the samples and experiments associated with the data from that patient. This information is included in the .vcf file (and visualized).

The endoVL systems collectively include data on over 15,000 patients across the different disease areas. Over 500 researchers from around the world, including over 100 separate hospitals/clinical research groups from around Europe, North and South America and Australasia, currently use the endoVL systems. The workflows developed to analyse the DSD patient data comprise over 20 computationally expensive and separate data processing steps (as indicated in Fig. 4)—noting that this was just one of the workflows created by the different bioinformatics groups. Each of these data processing steps depends upon the results of the previous steps. The original exome (FASTQ) data from the DSD patients comprised over 300 GB of raw sequence data. The subsequent data processing pipeline (workflow) resulted in over 1 TB of data; however, the actual end product (the .vcf file) is typically of the order of a few hundred kilobytes. This data is stored in a targeted variant database and associated with the patient information hosted in the clinical/phenotypic databases.

Fig. 4. A workflow for DSD patients.

3. www.genome.edu.au
4. www.agrf.org.au

5 TRUXY IMPLEMENTATION

EndoVL Galaxy workflows are expected to process extremely sensitive patient genomic data; ensuring the security and confidentiality of this data is paramount. The CSIRO TrustStore platform was selected to protect endoVL data while stored in the NeCTAR cloud, for the reasons described in Section 2. In order to secure endoVL data with TrustStore, we developed the TruXy system described in Section 3. We have implemented an instance of the system as shown in Fig. 6. The implementation of this instance required the following extensions to TrustStore:

• Adding support for the NeCTAR Research Cloud APIs to the TruXy platform.

• Developing the TruXy Galaxy workflow application to support data import and export between Galaxy and NeCTAR storage systems.

� Deploying and operating the TruXy middlewareincluding Key Management, Integrity Management,and Replication services.

� Developing and deploying the TruXy user applica-tion for uploading the data and for user registration.

� Developing and deploying the secure data sharingprotocol.

5.1 TruXy NeCTAR Support

The NeCTAR Research cloud is based on OpenStack (www.openstack.org) and thus supports the OpenStack APIs as well as a subset of the AWS APIs. These APIs are sufficient to support the TruXy requirements and were thus chosen as the implementation target. An illustration of the kind of object-store interaction they support is sketched below.
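The following is a minimal sketch using the python-swiftclient library against an OpenStack Swift endpoint such as the NeCTAR Object Store. The authentication URL, credentials, container and object names are placeholders; the snippet is not taken from the TruXy implementation.

```python
# Minimal sketch: basic object-store operations against an OpenStack Swift
# endpoint such as the NeCTAR Object Store. The auth URL, user, and key
# below are placeholders, not real NeCTAR credentials.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://cloud.example.org/auth/v1.0",  # placeholder endpoint
    user="project:user",
    key="secret",
)

conn.put_container("truxy-store")                     # create a container
conn.put_object("truxy-store", "exome.fastq.enc",     # upload an object
                contents=b"...encrypted fragment...")
headers, body = conn.get_object("truxy-store", "exome.fastq.enc")
```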

Unfortunately, the supported subset of the Amazon APIs on the NeCTAR Research Cloud realization of OpenStack does not include the Amazon Identity and Access Management (IAM) API. Furthermore, at present OpenStack does not offer a similar native service. As a consequence, some of the security features could not be fully supported. It is thus theoretically possible for another user B, who has been given access to a store owned by user A, to reverse engineer user A's access credentials to the NeCTAR storage for the selected NeCTAR project. Whilst user B is still not able to access the plaintext of any TruXy-secured data of user A, in principle they would be able to delete or modify arbitrary files, access data that user A stored outside TruXy, or exploit access to the account in other ways.

As a workaround for this problem, a dedicated NeCTAR user (or project) for use with TruXy was adopted. Even in this case, the problem is only mitigated, not completely eliminated, as user B would still be able to exploit this account or delete and/or modify arbitrary encrypted data stored under it. While TruXy would detect such deletions and/or modifications, without further redundancy provisions the manipulated data would still be lost or rendered useless.

Fig. 6. An implementation instance of TruXy.

Fig. 5. Application of TruXy in endoVL for bioinformatics data processing analysis.




It is therefore essential that a sufficiently strong trust relationship exists between collaborators using a common TruXy store in the NeCTAR context. Should future versions of OpenStack add support for the IAM API or provide a native API with similar functionality, this risk could be eliminated completely.

5.2 TruXy Applications

For EndoVL, two TruXy applications are supported. The first is the TruXy user application, which enables TruXy users to create new stores on NeCTAR and upload data to them. This application was developed using the TrustStore Java client (TCe); hence, we use the terms TrustStore Java client and TruXy user application (in Fig. 6) interchangeably hereafter.

The second application makes data that has been uploaded to TruXy on the NeCTAR cloud available to Galaxy workflows. For this, a Galaxy data import module and a Galaxy data export module for TruXy were developed. For development efficiency and reusability, a generic command line workflow application for TruXy was developed in Python; a Galaxy module for data import and export was then implemented as a simple wrapper around this command line application. This is called the TruXy workflow application (in Fig. 6).

Python was selected as the programming language for implementing the command line client as well as the Galaxy modules as it is natively supported by Galaxy. This also simplified the integration, and especially installation, via the Genomics Virtual Laboratory Galaxy Toolshed (where the bioinformatics applications used in the workflows and other tools are provisioned for workflow definition).

The data import and export modules are generic in nature and work with any Galaxy workflow. The user simply adds the modules to their workflow as they would with any other data import or export. Once added to the workflow, the user can use the module configuration page within Galaxy to specify the data source. In using this, users simply need to enter their TruXy credentials, the name of the store containing the data, as well as the name and location of the data file within the store. When executing the workflow, the TruXy module is invoked with the supplied parameters and subsequently downloads, decrypts, and verifies the specified data. This data is then made available to subsequent data processing (workflow) steps through Galaxy. A sketch of this wrapper pattern is given below.
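The wrapper pattern described above might look roughly as follows. This is a minimal sketch: the truxy executable, its get subcommand, and the flags below are hypothetical placeholders, since the actual TruXy command line interface is not documented in this paper.

```python
# Minimal sketch of a Galaxy import tool implemented as a thin wrapper
# around a command line client. The "truxy" executable, its "get"
# subcommand, and the flags below are hypothetical placeholders.
import argparse
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser(description="TruXy data import (sketch)")
    parser.add_argument("--store", required=True, help="name of the TruXy store")
    parser.add_argument("--path", required=True, help="file location within the store")
    parser.add_argument("--out", required=True, help="where Galaxy expects the dataset")
    args = parser.parse_args()

    # Credentials are assumed to come from the module configuration page;
    # download, decryption, and integrity verification are delegated to
    # the command line client, so Galaxy only sees the plaintext result.
    result = subprocess.run(
        ["truxy", "get", "--store", args.store,
         "--path", args.path, "--out", args.out],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        sys.exit("TruXy import failed: " + result.stderr)

if __name__ == "__main__":
    main()
```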

5.3 TruXy Middleware

The TruXy middleware services for endoVL were deployed to CSIRO servers and are currently hosted and operated by CSIRO, as shown in Fig. 1. These services are envisaged to eventually be migrated to endoVL ownership and associated infrastructure.

6 PERFORMANCE ANALYSIS

Since genomics data, as used by EndoVL, tends to be in the order of hundreds of gigabytes, as explained earlier, it is essential that TruXy, with its security features in operation, does not impose significant performance overheads on EndoVL workflows. We therefore conducted extensive benchmarks to measure and quantify any potential performance issues.

6.1 Experimental Setup

6.1.1 Dataset

The test dataset contains a selection of data types and sizes. Data types included ASCII text files, Base64-encoded data files, uncompressed random binary files, and compressed binary files. The variety of file types was deliberately chosen to measure any performance impact that various network transport layers may cause by optimizing for special cases, such as through automatic stream compression. Genomic data can range in size from a few megabytes to hundreds of gigabytes per file. The typical steps in a bioinformatics workflow involve splitting these data sets into more manageable chunks of one to five gigabytes before processing by the workflow tools; a sketch of this chunking step is given below.
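As a concrete illustration of this chunking step, the following is a minimal sketch that splits a large file into fixed-size fragments. The 1 GiB default and the naming scheme are assumptions for illustration, not the parameters used in the experiments.

```python
# Minimal sketch: split a large genomic data file into fixed-size chunks
# before workflow processing. The 1 GiB default and the naming scheme are
# illustrative assumptions, not the experimental parameters.

def split_file(path, chunk_bytes=1 << 30):
    """Write path out as numbered chunks; return the chunk file names."""
    chunks = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_bytes)
            if not data:
                break
            chunk_name = f"{path}.part{index:04d}"
            with open(chunk_name, "wb") as dst:
                dst.write(data)
            chunks.append(chunk_name)
            index += 1
    return chunks

# Example: split_file("sample.fastq") -> ["sample.fastq.part0000", ...]
```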

6.1.2 Environment

The experiments were conducted using different environments. The performance of TruXy was measured using two types of applications: the TruXy workflow application and the TruXy user application. The workflow application uses a multi-process model, while the user application is multi-threaded. The machine used for the clients in the experiments had a 16-core Xeon E5540 2.53 GHz CPU with 72489 MiB of memory and a 1 Gbit Ethernet connection to the CSIRO network, through to a fiber connection to AARNET. The NeCTAR Research cloud's Object Store was used as the cloud storage service. The experiments on the performance of TruXy were conducted using GVL/Galaxy on a NeCTAR Research cloud "m1.xlarge" virtual machine with 32 GB RAM, 8 VCPUs and 10 GB of local disk space.

6.2 Experimental Results

The first experiment tested data upload to the NeCTAR Research cloud storage. Specifically, the experiment focused on uploading data from a TruXy workflow application to the cloud and measured the impact of the TruXy security layer (TrustStore) on the performance of the application, comparing:

• Uploading of data from an application outside NeCTAR to NeCTAR storage directly.
• Uploading of data from an application outside NeCTAR to NeCTAR storage using the TruXy security layer.

The results of the data upload with different file sizes, with and without the TruXy security layer, are shown in Fig. 7a. The results clearly show that the overall penalty of the security layer is minimal and constant, i.e., the performance does not degrade as the size of the data increases. A sketch of how such a comparison can be timed is given below.
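To illustrate how such a comparison can be structured, the following is a minimal sketch that times an upload function across a range of payload sizes. The functions upload_direct and upload_via_truxy are hypothetical stand-ins for the two paths being compared, not functions from the TruXy codebase.

```python
# Minimal sketch of the upload comparison: time the same payloads through
# two paths and compare. upload_direct and upload_via_truxy are
# hypothetical placeholders for the real clients, not TruXy code.
import time

def benchmark(upload, sizes_mb):
    """Return a {size_mb: seconds} map for uploading dummy payloads."""
    timings = {}
    for size in sizes_mb:
        payload = b"\x00" * (size * 1024 * 1024)   # dummy data of given size
        start = time.perf_counter()
        upload(payload)                            # direct or secured path
        timings[size] = time.perf_counter() - start
    return timings

# Example (placeholders):
#   direct  = benchmark(upload_direct,    [1, 10, 100, 1000])
#   secured = benchmark(upload_via_truxy, [1, 10, 100, 1000])
#   overhead = {s: secured[s] - direct[s] for s in direct}
```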

The second experiment tested the performance of the TruXy security layer when downloading data from NeCTAR Research cloud storage to a workflow application. The results of the data download with different file sizes, with and without the security layer, are shown in Fig. 7b. The result is similar to uploading. This means the TruXy security layer has minimal impact on performance, whether uploading data to the NeCTAR cloud for workflow processing or downloading results to the client machine.

The next two experiments focused on the impact of the fragment size on uploading and downloading using the same data set.



The impact of fragment size on uploading and downloading time is shown in Figs. 7c and 7d. The results clearly show that the fragment size does not have any discernible impact on the uploading and downloading time. It is important to note that the variation in performance in the figures is the result of other factors, such as network and CPU availability.

Our next objective was to measure the performance of the different implementations of the TruXy applications and how they perform at big data sizes. The results for uploading and downloading different sizes of data with the two applications are shown in Figs. 8a and 8b. The TruXy user application performed slightly worse during uploads in our experiments compared to the TruXy workflow application, in particular for smaller file sizes, but the overall results are similar. Both implementations impose only a small overhead over transferring the data without encryption. It should be noted that we do not have measurements for unencrypted uploads beyond 1 GB file sizes, as the next size in our sample dataset is 10 GB, which is no longer supported by NeCTAR storage. TruXy scales linearly beyond 1 GB sizes and we measured results for files up to 100 GB.

We next ran a series of experiments to measure the actual performance of TruXy on the NeCTAR cloud. We ran three different workflows using data imported via TruXy and compared it to a standard Galaxy data import tool, GenomeSpace, which also uses NeCTAR storage. In particular, the experiment looked at the overheads of running the workflows using TruXy.

Each experiment used real data, and so had different file sizes and hence different computational demands in running the workflows. The results of executing these workflows are shown in Fig. 9.

Fig. 7. TruXy security layer performance results: (a) Upload times for different file sizes; (b) Download times for different file sizes; (c) Upload times for different fragment sizes; (d) Download times for different fragment sizes.

In our experiments, multiple instances (processes) of each import tool were used to import one file each. The files were approximately 6 GB in size. Three test cases were considered, importing a different number of files concurrently: 2, 4 or 8 files; a sketch of this concurrent set-up is given below. The results clearly show that the performance of the standard Galaxy import tool is better than that of TruXy, most notably when there is a higher number of input files. It is not clear what caused the performance penalty, as the earlier experiments excluded potential sources such as failure to use parallel resources (multi-cores and processors) effectively, Key Server and cloud service performance, and the multi-threaded versus multi-process design. It is also not clear how the standard Galaxy import tools are able to import multiple files in constant time. Further investigation is needed into why TruXy's performance degraded so heavily when dealing with multiple concurrent input files in this experiment. It is possible that the inter-process contention for resources was not well managed, or that either the Key Service or the storage was limiting connections. However, the time TruXy spends importing files is insignificant in comparison with the total execution time of the workflows. For example, in the endoVL experiments, workflows with two and eight input files took between 19 and 28 hours, respectively, for full execution. The time taken by the TruXy component to import the eight input files for the workflow is about 20 seconds; hence the overall performance overhead due to TruXy is minimal, but the security benefits are substantial.
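The concurrent import set-up described above can be sketched as follows. This is a minimal illustration using Python's multiprocessing module, with import_one_file standing in as a hypothetical placeholder for whichever import tool (TruXy or the standard Galaxy tool) is under test; it is not code from the experiments.

```python
# Minimal sketch of the concurrency experiment: launch one process per
# input file and time the whole batch. import_one_file is a placeholder
# for whichever import tool (TruXy or standard Galaxy) is under test.
import time
from multiprocessing import Pool

def import_one_file(path):
    """Placeholder: invoke the import tool on a single ~6 GB file."""
    ...

def run_case(paths):
    start = time.perf_counter()
    with Pool(processes=len(paths)) as pool:
        pool.map(import_one_file, paths)   # 2, 4, or 8 concurrent imports
    return time.perf_counter() - start

# Example: run_case(["f1.dat", "f2.dat"])  # the two-file test case
```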

7 DISCUSSION

TruXy provides a generic secure cloud data storage solution for scientific workflows. This paper illustrated the proposed solution through an implementation instance using Galaxy and TrustStore. This required extensions to both Galaxy and TrustStore. The Galaxy workflow environment was extended with a data import and export module used to interact with TrustStore. TrustStore was extended not only by adding a secure data sharing capability to the key management service, but also by extending the original TrustStore system with data integrity and data replication services. In the following, we summarize the underlying assumptions and shortcomings of TruXy for the benefit of developers and practitioners who may consider adopting it for other application domains.

• System portability: We demonstrated an implementation instance of TrustStore for Galaxy in this paper. If one wants to use TruXy for a different workflow system such as Taverna, a data import and export module needs to be developed for that workflow system, i.e., a new TruXy workflow application needs to be developed for each workflow system. The rest of the TruXy services are portable to any workflow system, including Taverna, without further modification.

• Data security: TruXy provides security based on the following assumptions. Firstly, we assume that the virtual machine used by the workflow system is secured. In our case, the virtual machines are not secured, but the actual real-time processing of the data is ephemeral enough to pose only minor data leakage concerns. We also assume that a client computer running the TruXy user applications is entirely trusted and secured for sensitive data operations and computation. That is, both the TruXy workflow and user applications are assumed to run in trusted and secured environments. Both the cloud Service Provider and the TruXy Middleware Layer are semi-trusted, and there is no collusion between them. They manage the services they claim to provide. However, they are assumed to be potentially subject to both internal and external attacks. (The client-side protection this trust model implies is sketched after this list.)

• Data availability: Our implementation of data availability is based on RAID-1: data is fully replicated across multiple clouds. In practice, this approach is constrained by the cost of having multiple public cloud storage systems; furthermore, many science clouds run on quotas, and only limited storage is available for each collaborative project. (The replication pattern is sketched after this list.)

• Data movement: TruXy can instantiate secure cloud storage in the cloud where the virtual machines running the scientific workflows are located. This significantly decreases the movement of the data.

• Data sharing: TruXy in its current implementation supports access control at the store level. This works well for the endoVL project. However, access control at the file level is needed in many applications; in such cases, a new secure data sharing protocol needs to be developed.

• Usability: TruXy authentication is not fully integrated with the workflow system; that is, a user needs to authenticate with both the workflow system and TruXy. This has the advantage of keeping TruXy portable to different workflow systems, at the cost of usability. However, we note that we have recently integrated TruXy with the Australian Access Federation (AAF) to support single sign-on directly and overcome this issue (endoVL already used the AAF for authentication). A detailed description of this work is out of the scope of this paper.

Fig. 8. Performance results for upload and download using different TruXy clients.

Fig. 9. Performance of TruXy and standard Galaxy while running different types of workflows.
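Two of the points above lend themselves to short illustrations. First, under the data security trust model, sensitive data must be encrypted on the trusted client before it reaches the semi-trusted cloud and middleware. The following is a minimal sketch of that client-side pattern using the Fernet recipe from the Python cryptography library; Fernet is an illustrative stand-in, not the cipher suite or key management protocol actually used by TrustStore.

```python
# Minimal sketch: encrypt on the trusted client so the semi-trusted cloud
# and middleware never see plaintext. Fernet is an illustrative stand-in
# for TrustStore's actual cipher suite and key management protocol.
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # in TruXy, keys are held by the Key
fernet = Fernet(key)           # Management service, never by the store

plaintext = b"patient exome fragment ..."
ciphertext = fernet.encrypt(plaintext)   # this is all the cloud receives

# Only a client holding the key can recover the data:
assert fernet.decrypt(ciphertext) == plaintext
```

Second, the RAID-1 data availability model can be sketched as follows; the store objects are placeholders for independent cloud storage backends, and the put/get interface is a hypothetical abstraction rather than the actual TruXy Replication service API.

```python
# Minimal sketch of RAID-1 style replication: every object is written to
# all stores, and reads fall back to another replica on failure. The
# store objects are hypothetical backends exposing put/get.
class ReplicatedStore:
    def __init__(self, stores):
        self.stores = stores               # e.g., two independent clouds

    def put(self, name, data):
        for store in self.stores:          # full replication (RAID-1)
            store.put(name, data)

    def get(self, name):
        for store in self.stores:          # first healthy replica wins
            try:
                return store.get(name)
            except IOError:
                continue
        raise IOError(f"all replicas unavailable for {name}")
```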

8 CONCLUSIONS AND FUTURE WORK

This paper presented a trusted storage cloud for scientific workflows called TruXy. We demonstrated its use by integrating the CSIRO TrustStore security solution within a Galaxy-based bioinformatics data processing workflow in the context of a national biomedical virtual laboratory supporting research into a variety of endocrine disorders (endoVL). We conducted a number of experiments on the performance of the system. The experimental results show that TruXy has little impact on the overall performance of the NeCTAR Research cloud storage, but provides much needed security to address the issues of data privacy and security on the cloud.

The performance of TruXy, when TrustStore is integrated with the Galaxy workflow system, degrades more than expected when dealing with multiple input files; however, the overall impact of this degradation for the large-scale workflows demanded by endoVL is minimal. Nevertheless, we plan to address this performance issue in our future work. TruXy provides data security services for archiving, storing and sharing of information. It is thus natural to extend this work to other aspects of data security, including creation, use and secure data deletion (erasure) from the cloud, in order to provide a more robust and encompassing security solution for the NeCTAR Research cloud.


Surya Nepal received the BE degree from the National Institute of Technology (NIT) Surat, India, the ME degree from the Asian Institute of Technology (AIT), Thailand, and the PhD degree from RMIT University, Australia. He is a principal research scientist at the CSIRO Digital Productivity Flagship. His main research interest is in the development and implementation of technologies in the area of distributed systems and social networks, with a specific focus on security, privacy, and trust. At CSIRO, he undertook research in the area of multimedia databases, web services and service-oriented architectures, social networks, security, privacy and trust in collaborative environments, cloud systems, and big data. He has more than 150 peer-reviewed publications to his credit.

Richard O. Sinnott received the bachelor of science degree in theoretical physics (Hons), the master of science degree in software engineering, and the PhD degree in distributed systems. He is the director of eResearch at the University of Melbourne and holds a professorial role in applied computer systems. He was formerly technical director of the National e-Science Centre, United Kingdom, and the director of e-Science at the University of Glasgow. He has published more than 200 peer-reviewed papers in conferences/journals across a wide range of computing science areas, with a specific focus over the past 10 years on supporting communities demanding finer-grained access control (security).

Carsten Friedrich received the undergraduate degree in computer science from the University of Passau, Germany, and the PhD degree in computer science from the University of Sydney. His PhD thesis received the "Most outstanding PhD thesis in the field of computer science in Australasia" award in 2003. He is a research team leader in the CSIRO Digital Productivity Flagship. In recent years, he has focused his work on building and commercializing research-based software for the Capital Markets CRC, various start-up companies, and CSIRO. His current work focuses on cloud computing and cyber security.

Catherine Wise received the bachelor of mathematics (Hons) degree from the University of Wollongong. She is a CSIRO software engineer working on projects researching cloud computing, geographic data aggregation, and web user experience optimisation.

Shiping Chen is a principal research scientist at CSIRO, Australia. He also holds an adjunct associate professor title with the University of Sydney, teaching and supervising PhD/Masters students. He has been working on distributed systems for over 20 years with a focus on performance and security. He has published over 100 research papers in these research areas. He is actively involved in the computing research community through publications, journal editorships, and conference PC services, including WWW, EDOC, ICSOC, and IEEE ICWS/SCC/CLOUD. His current research interests include secure data storage and sharing and secure multiparty collaboration. He is a senior member of the IEEE.

Sehrish Kanwal received the bachelor of science (Honors) degree in bioinformatics with distinction at the COMSATS Institute of Information Technology, Islamabad, Pakistan. She is currently working toward the PhD degree in the Department of Computing and Information Systems, University of Melbourne. She has worked as a research associate in the Bioinformatics Research Group, Pakistan, where she was actively involved in several research projects and received a "Research Productivity Award".



Jinhui Yao received both the BE degree and the PhD degree from the University of Sydney, Australia. He is a research scientist at the Palo Alto Research Center (PARC) of Xerox, US. Since joining PARC, he has been engaged in a range of research projects aiming to facilitate agile business processes and dynamic service compositions, and to leverage analytic workflows for big data. His research interests include cloud computing, service-oriented architecture (SOA), business process management (BPM), and big data.

Andrew Lonie is a faculty member in the Department of Computing and Information Systems at the University of Melbourne and foundation head of the Life Sciences Computation Centre. He has research and research service achievements in a variety of topics, including biological modelling and bioinformatics, data mining, internet technologies and distributed systems, and significant expertise in designing and implementing high performance computational informatics approaches to data-intensive life sciences research problems.

