Privacy issues in integrating R environment in scientific workflows

Dr. Zhiming Zhao

University of Amsterdam
Virtual Laboratory for e-Science

Privacy issues in integrating Legacy Experiment Environment to Scientific Workflows
Zhiming Zhao, Dmitry A. Vasunin, Adianto Wibisono, Adam Belloum, Cees de Laat, Pieter Adriaans, Bob Hertzberger

Outline

• Scientific experiments and R
• Problem description
• Optional solutions
• Experimental results
• Summarizing discussion
• Future work

Scientific experiments and support systems

Experiment: on full data scale.

[Figure: experiment lifecycle. Stages: Define goal; Data analysis; Prototype the algorithm; Computing (test with small data); Vis./Int. (validation); Apply to full-size data; Finding & dissemination. Two feedback arrows are labelled "Refine".]

Prototype: on small data scale.

In such scenarios:
• Existing experiment environments, such as R, are widely used by domain scientists
• Human-in-the-loop computing is important for testing and validating prototypes
• Scientific workflows are used to manage different processes and the experiment lifecycle

R and workflow support in VL-e

• R provides rich functionality for statistics and visualisation, and is an important experimental environment in the bio-sciences.
  – R needs scientific workflow support
    • Accessing different e-Science resources
    • Being coordinated with the other components in a large-scale experiment
  – E-Science workflows in certain domains also need R
    • Reuse the advanced results from legacy systems
    • Support experiments developed on legacy systems
• Workflow support in VL-e
  – Four systems are recommended
    • Taverna, Kepler and VLAM support R
  – A generic solution is under construction

R in scientific workflows: current solutions

Three types of solutions

• Local: local installation of R, through the command line interface of R
  – Simple configuration
  – Performance bottleneck
• Web Service: SOAP to pass R script and objects
  – Standard interface, distributed computing
  – High latency
• TCP Socket: socket interface (RServe)
  – Distributed computing
  – Maintains state
  – Poor security

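[Figure: a workflow (Wf) system on the user desktop, connected to a local R environment (L) and, through a Web Service (WS) or socket (S) interface, to a remote R environment on a remote node.]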
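The "Local" option above can be sketched as a workflow step that shells out to R's command-line front end. This is a minimal sketch, not the integration used by any of the workflow systems named in the slides: it assumes `Rscript` is installed and on the PATH, and the function names `build_rscript_command` and `run_r_locally` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_rscript_command(expression: str) -> List[str]:
    """Return the argv that evaluates one R expression via Rscript -e."""
    return ["Rscript", "-e", expression]


def run_r_locally(expression: str, timeout_s: float = 60.0) -> Optional[str]:
    """Evaluate the expression with the local R installation.

    Returns R's standard output, or None when no Rscript is on the PATH.
    """
    if shutil.which("Rscript") is None:
        return None  # no local R installation: this step cannot run here
    completed = subprocess.run(
        build_rscript_command(expression),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if R exits with a non-zero status
    )
    return completed.stdout


if __name__ == "__main__":
    print(build_rscript_command("cat(mean(c(1, 2, 3)))"))
```

Every call starts a fresh R process on the user's desktop, which is consistent with the "simple configuration" and "performance bottleneck" trade-off listed for this option.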

Typical scenario of RServe and requirements on privacy

Different levels of privacy issues

• Data level
  – Intermediate results must not be visible to other users
• Communication level: graphical display
  – Remote X display and interaction between multiple users

[Figure: workflows WF1 and WF2, R, and a display.]
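The data-level requirement can be illustrated with a per-user scratch directory for intermediate results. This is a sketch under assumptions: it targets a POSIX system, the helper names `make_private_workdir` and `is_private` are hypothetical, and the slides do not say how the R engine's working directory is actually managed.

```python
import os
import stat
import tempfile


def make_private_workdir(prefix: str = "wf-r-") -> str:
    """Create a scratch directory that only the creating user can access."""
    path = tempfile.mkdtemp(prefix=prefix)
    os.chmod(path, stat.S_IRWXU)  # make the 0700 mode explicit
    return path


def is_private(path: str) -> bool:
    """Return True when neither group nor others have any permission bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    workdir = make_private_workdir()
    print(workdir, is_private(workdir))
    os.rmdir(workdir)
```

File permissions only separate users who run under different accounts; two workflows that share one R engine under the same account can still see each other's results.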

Problem description and desired solution

• Problem description
  – Most legacy experiment environments lack strong security management
  – Workflow systems provide integration without considering security issues
  – The deployment of the remote environment is required to be secure
• Desired solution
  – Use existing technologies
  – Provide solutions to privacy issues at workflow level, preferably in a transparent way
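One existing technology that could satisfy these wishes is SSH local port forwarding: the workflow connects to a local port and the traffic to the remote R engine is encrypted without changing the workflow itself. The slides do not name the tunnel technology, so SSH is an assumption here, as are the host name, user name and the helper `build_tunnel_command`. Port 6311 is the default port of RServe.

```python
from typing import List

RSERVE_DEFAULT_PORT = 6311  # default TCP port of the RServe daemon


def build_tunnel_command(
    remote_host: str,
    user: str,
    local_port: int,
    remote_port: int = RSERVE_DEFAULT_PORT,
) -> List[str]:
    """Return the argv for an SSH tunnel from local_port to the remote R engine.

    -N: do not run a remote command, only forward ports.
    -L: forward local_port to localhost:remote_port as seen from remote_host.
    """
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{remote_host}"]


if __name__ == "__main__":
    # Hypothetical endpoint; the workflow would then connect to localhost:16311.
    print(" ".join(build_tunnel_command("r-node.example.org", "alice", 16311)))
```

Because the workflow only sees a local endpoint, the protection stays transparent at workflow level; the price is the tunnel set-up time and the per-byte encryption cost.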

Experiments

• Review optional solutions
• Investigate the overhead of security enhancement on the workflow execution

Different configurations and their level of security

Data management (R engine)
• Static, shared engine. Secure: No. Easy to set up.
• Dynamic, different user account. Secure: Yes. The endpoint is unknown at workflow design stage.

Display management (X server)
• Static, local X. Secure: Yes. Individual X server, bound to the user's desktop.
• Static, remote X. Secure: No. X is not protected.
• Dynamic, {Job + VNC} (remote X + VNC). Secure: Yes. Management overhead of VNC.

An experiment: Taverna, RServe and security tunnel

[Figure: Data transfer between workflow and R. Time (milliseconds, 1 to 100000) plotted against the size of data between workflow and R (1000 to 1000000) on logarithmic axes, with two series: Non-Secure and Secure.]

Experiment
• Adding security enhancement in Taverna
• Protect the data channels between Taverna and RServe
• Overhead
  – Setting up security tunnels
  – Runtime data transfer
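The runtime part of this overhead can be measured with a small timing harness like the one below. It is a sketch and not the set-up used for the slide's measurements: a local sink server stands in for the R endpoint, sizes are taken as bytes (the slide does not state the unit), and all function names are hypothetical. Pointing `measure` at a tunnelled port and at a direct port would give the secure and non-secure series.

```python
import socket
import threading
import time
from typing import Dict, Iterable, Tuple


def start_sink_server() -> Tuple[str, int]:
    """Start a local TCP server that discards input and acknowledges it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()

    def serve() -> None:
        while True:
            conn, _ = server.accept()
            with conn:
                while conn.recv(65536):  # read until the client stops sending
                    pass
                conn.sendall(b"k")  # one-byte acknowledgement

    threading.Thread(target=serve, daemon=True).start()
    host, port = server.getsockname()
    return host, port


def time_transfer_ms(host: str, port: int, size_bytes: int) -> float:
    """Send size_bytes to host:port and return the elapsed milliseconds."""
    payload = b"x" * size_bytes
    start = time.perf_counter()
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # signal end of payload
        conn.recv(1)  # wait until the receiver has consumed everything
    return (time.perf_counter() - start) * 1000.0


def measure(
    host: str,
    port: int,
    sizes: Iterable[int] = (1_000, 10_000, 100_000, 1_000_000),
) -> Dict[int, float]:
    """Time one transfer per payload size, matching the plot's x-axis range."""
    return {size: time_transfer_ms(host, port, size) for size in sizes}


if __name__ == "__main__":
    sink_host, sink_port = start_sink_server()
    for size, elapsed in measure(sink_host, sink_port).items():
        print(f"{size:>9} bytes: {elapsed:.3f} ms")
```

Each size is timed once per connection, so connection set-up is included in every sample; averaging several runs per size would give steadier numbers.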

Summarizing discussion

• Integrating existing experiment environments with workflow systems is important for rapid prototyping

• Privacy is demanded by both users and the e-Science infrastructure, and can be viewed as a generic issue when integrating a legacy component that supports user interaction into a workflow

• Privacy protection can be achieved to a certain level by customizing the workflow execution

• Adding security to workflow execution does not necessarily impose a high performance penalty

Future work

• In the VL-e project, we are developing a bus-style generic solution for different workflow systems

• Taking data privacy into account when realizing interoperability between different workflow systems

Activities

• Int’l workshop on “Workflow systems in e-Science”, organized by Zhiming Zhao and Adam Belloum, in the context of ICCS: 2006 at Reading University, 2007 in Beijing, China.
  – Proceedings are published in LNCS, Springer Verlag.
  – A special issue will be published in Scientific Programming Journal.
  – http://staff.science.uva.nl/~zhiming/iccs-wses
• Workshop on “Scientific workflows and industrial workflow standards in e-Science”, organized by Adam Belloum and Zhiming Zhao, in the context of the IEEE e-Science and Grid computing conference in Amsterdam, December 2006.
  – Pegasus, Dr. Ewa Deelman (Department of Computer Science, University of Southern California)
  – BPEL, Dr. Dieter König (IBM Research Germany Development Laboratory)
  – Kepler, Dr. Bertram Ludäscher (Department of Computer Science, University of California, Davis)
  – Taverna, Prof. Peter Rice (European Bioinformatics Institute)
  – WS and Semantic issues, Dr. Steve Ross-Talbot (CEO and co-founder of Pi4 Technologies)
  – Triana, Dr. Ian J. Taylor (Department of Computer Science, Cardiff University)
  – http://staff.science.uva.nl/~adam/workshop/VL-e-workshop.htm