SPAS e-SciBioenergy: Program and Presentation Abstracts



Program

Days: October 22 (Monday), October 23 (Tuesday), October 24 (Wednesday), October 25 (Thursday), October 26 (Friday)

8:30   Registration
9:00   Opening
9:30   Presentation FAPESP | Talk C. Ambroise | Talk T. Dunning | Talk J. E. Ferreira | Talk Y. Xu
10:30  Break | Break | Break | Break | Break
11:00  Talk M. Mattoso | Talk C. Ambroise | Talk T. Dunning | Talk S. Sansone | Talk Y. Xu
12:00  Talk M. Mattoso | Talk T. Dunning | Talk S. Sansone
13:00  Lunch | Lunch | Lunch | Lunch | Lunch
14:30  Talk C. B. Medeiros | Talk M. Mattoso | Talk B. S. Manjunath | Talk C. B. Medeiros
15:30  Talk C. Ambroise | Talk B. S. Manjunath | Posters Students | Talk Graduate Progs
16:30  Break | Break | Break | Break
17:00  Talk C. Ambroise | Talk B. S. Manjunath | Posters Students | - | -



B. S. Manjunath, Centre for Bio-image Informatics, University of

California, Santa Barbara (UCSB), USA

Introduction to Bio-Image Informatics. Introduction to the topic;

fundamental issues in image and video segmentation and tracking, examples

drawn from recent research. (Lecture time: 2 hours)

Introduction to Bisque Cyber Infrastructure for Bio-image Informatics. A

high level introduction to the open source Bisque image database platform for

managing, processing, indexing and searching bio-images. (Lecture time: 1

hour)



Christophe Ambroise, Laboratoire Statistique et Génome, Centre

National de la Recherche Scientifique (CNRS), France

Statistical Models for Biological Network Inference. Gaussian Graphical

Models provide a convenient framework for representing dependencies

between variables. In this framework, a set of variables is represented by an

undirected graph, where vertices correspond to variables, and an edge

connects two vertices if the corresponding pair of variables are dependent,

conditional on the remaining ones. Recently, this tool has attracted considerable

interest for the discovery of biological networks through l1-penalization of the model

likelihood. In this lecture, we introduce various ways of inferring sparse co-

expression networks from either steady-state or time-course transcriptomic

data. We will focus on inference from samples collected in different

experimental conditions and therefore not identically distributed. (Lecture

time: 2 x 2 hours)
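
As a point of reference for the l1-penalized likelihood mentioned above, one common formulation (often called the graphical lasso) can be written as follows. The symbols are assumptions introduced here for illustration, not notation taken from the lecture: S is the empirical covariance matrix of the variables, Theta is the precision (inverse covariance) matrix, and lambda >= 0 is a tuning parameter that controls sparsity.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}
  \Big\{ \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta)
         \;-\; \lambda \sum_{i \neq j} \lvert \Theta_{ij} \rvert \Big\}
```

Under a Gaussian model, an off-diagonal entry Theta_ij is zero exactly when variables i and j are conditionally independent given the remaining ones, so the nonzero off-diagonal entries of the estimate define the edges of the inferred graph. The lecture may rely on other variants of this idea.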



Cláudia Bauzer Medeiros, Institute of Computing, University of

Campinas (UNICAMP), SP, Brazil

The Era of eScience: building the ark during the data deluge. Scientists

from all domains (from the mathematical to the social sciences) are collecting

enormous amounts of data. These data are captured from a variety of devices

(from those aboard satellites to microsensors in embedded systems), but also

provided by experiments, or even social networks. This has given rise to the so-

called "data deluge", sometimes referred to as "data tsunami", in recognition

that a large amount of these data will never be seen or directly managed by

humans. eScience has emerged as a branch of science characterized by joint

research between computer scientists and scientists from other domains to

leverage and accelerate research in those domains, helping scientists to

analyze, filter, manipulate, visualize and interpret their data, while at the same

time supporting cooperative work. This talk is geared towards discussing a few

major trends in eScience research, from a data-centric perspective, with

examples from several scientific domains. (Lecture time: 1 hour)

Coping with Digital Preservation: preserving the present to help the

future. We daily generate an enormous amount of data - for instance, during

bank transactions, phone calls, credit card operations and others. Moreover,

there are countless kinds of data linked to us - X-ray images, security videos in

stores and banks, radar-triggered photos in streets, and so on. All this

information is stored, frequently for several years, and maintained by third

parties, given its economic and/or social value. What are we doing, however,



with other very valuable kinds of data sets - the data generated by our

research? Our work involves complex models and computational simulations

whose intermediate and final results need to be stored. We may archive the

most relevant files, but there are many more data sets that are lost, sometimes

for lack of adequate procedures, or time, or even appropriate hardware to

record the data. This phenomenon is repeated in any context that involves

experimental activities, e.g., in biology, chemistry, physics, sociology,

anthropology, and so on. Even when all data and models involved in an

experiment are recorded, there are other challenges to meet. For instance, how

to ensure that we will be able to retrieve the desired information, in the future?

And how to share and disseminate the results of our work? These and other

issues are at the origin of digital preservation concerns. They are geared

towards investigating new methods, models, algorithms and mechanisms to

support data organization, archival and retrieval, for long term accessibility,

while at the same time considering the issues of quality, reliability and

durability. Preservation research can also be applied to corporate or business

data, but the problems involved (and their solution) are not the same. This talk

will discuss some of the challenges faced by the research in the preservation of

experimental research data. (Lecture time: 1 hour)



João Eduardo Ferreira, Computer Science Department, Institute of

Mathematics and Statistics (IME), University of São Paulo, Brazil

Transaction Processing for e-Science Applications. The management of molecular

and clinical data in e-Science applications has introduced new requirements for

database storage and transaction processing systems. Two well-known statements

summarize the e-Science scenario. The first is "Science is becoming data-intensive

and collaborative", and the second is "Researchers from numerous disciplines

need to work together to attack complex problems; openly sharing data will pave the

way for researchers to communicate and collaborate more effectively". These statements

were written by Ed Seidel, acting assistant director for the NSF Mathematical and Physical

Sciences directorate. This e-Science scenario shows that we are in the age of the data

deluge, in which transaction processing from a collaborative research perspective is an

important computer science challenge. More concretely, in typical e-Science laboratory

routines, transaction processing is used in many tests that are performed concurrently

and supervised by researchers. New tests are defined frequently, so researchers have to

be guided to execute the right task at the appropriate time. Incompatibilities among

previous processes and new data requirements make the integration and analysis of

available knowledge very difficult. This problem is compounded by the process of

scientific knowledge discovery, which requires frequent process updates, collaborative

interactions among researchers, and refinement of scientific hypotheses. This e-Science

scenario requires appropriate transaction processing in order to avoid manual data

handling, which quickly becomes very expensive and is often infeasible. In this talk,

we provide a historical perspective on transaction processing for e-Science

applications, together with the main recent challenges and solutions. (Lecture time: 1 hour)



Marta L. Queirós Mattoso (jointly with Jonas Dias and Kary Ocana),

Alberto Luiz Coimbra Institute for Graduate Studies and Research in

Engineering (COPPE), Federal University of Rio de Janeiro (UFRJ), Brazil

Exploring Provenance Data in High Performance Scientific Computing. Large-scale

scientific computations are often organized as a composition of many computational

tasks linked through data flow. After the completion of a computational scientific

experiment, a scientist has to analyze its outcome, for instance, by checking inputs and

outputs along computational tasks that are part of the experiment. This analysis can be

automated using provenance management systems that describe, for instance, the

production and consumption relationships between data artifacts, such as files, and

the computational tasks that compose the scientific application. Due to their exploratory

nature, large-scale experiments often involve iterations that evaluate a large space of

parameter combinations. In this case, scientists need to analyze partial results during

execution and dynamically intervene in the next steps of the simulation. Features such

as user steering of workflows, to track, evaluate and adapt the execution, need to be

designed to support iterative methods. In this course we define basic concepts of

scientific workflows and provenance data. We will show examples of scientific

workflows in the bioinformatics domain. We briefly describe how provenance of many-

task scientific computations is specified and coordinated by current workflow

systems on large clusters and clouds. We discuss challenges in gathering, storing and

querying provenance in high performance computing environments. We also show

how provenance can enable runtime and useful queries to correlate computational

resource usage, scientific parameters, and data set derivation. (Lecture time: 2 x 2

hours)
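
The production and consumption relationships described above can be illustrated with a minimal sketch. This is a toy in-memory model, not the design of any particular provenance or workflow system; the class `ProvenanceStore`, its methods, and the task and file names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy record of which task consumed and which task produced each artifact."""

    def __init__(self):
        self.used = defaultdict(set)   # task name -> artifacts it consumed
        self.generated_by = {}         # artifact -> task name that produced it

    def record(self, task, inputs, outputs):
        """Register one task execution with its input and output artifacts."""
        self.used[task].update(inputs)
        for artifact in outputs:
            self.generated_by[artifact] = task

    def lineage(self, artifact):
        """Return every upstream artifact the given artifact was derived from."""
        seen = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            task = self.generated_by.get(current)
            if task is None:
                continue  # raw input: no recorded producer
            for parent in self.used[task]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical two-step bioinformatics workflow.
store = ProvenanceStore()
store.record("align", inputs=["reads.fastq", "genome.fa"], outputs=["aligned.bam"])
store.record("call_variants", inputs=["aligned.bam", "genome.fa"], outputs=["variants.vcf"])
upstream = store.lineage("variants.vcf")
```

Here `upstream` collects the files that `variants.vcf` depends on, directly or indirectly. A production system would also need to persist these records and capture parameters, timing and resource usage, which this sketch leaves out.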



Susanna-Assunta Sansone, PhD, Principal Investigator, Team Leader,

University of Oxford, Oxford e-Research Center, Oxford, UK

The Buzz Around Reproducible Bioscience Data: the policies, the communities

and the standards. Increased availability of the bioscience data generated is

fuelling increased consumption, and a cascade of derived datasets that

accelerate the cycle of discovery. But the successful integration of

heterogeneous data from multiple providers and scientific domains is already a

major challenge within academia and industry. Even when datasets are

publicly available, published results are often not reusable due to incomplete

description of the experimental details. In the last decade, several data

preservation, management, sharing policies, and plans have emerged in

response to increased funding for high-throughput approaches in genomics

and functional genomics bioscience [1]. A growing number of community-

based initiatives have developed minimum reporting guidelines, terminologies

and formats (referred to generally as community standards) [2] to structure and

curate datasets, enabling data annotation to varying degrees; other efforts

work to maximize the interoperability among these standards [e.g. 3, 4].

Researchers and bioinformaticians in both academic and commercial

bioscience, along with funding agencies and publishers, embrace the concept

that standards are pivotal to enriching the annotation of the entities of interest

(e.g., genes, metabolites) and the experimental steps (e.g., provenance of study

materials, technology and measurement types), to ensure that shared

investigations are comprehensible and (in principle) reproducible. But despite



all these efforts, in practice data sharing is challenging [5]. Vast swathes of

bioscience data still remain locked in esoteric formats, are described using ad

hoc or proprietary terminology [e.g. 6], or lack sufficient contextual information;

many tools do not implement standards even where these exist; the current

wealth of domain-specific reporting standards, and their incompleteness or

absence in other areas, are further major challenges. My presentation will provide

a snapshot of the current situation. I will highlight a number of stories, the

social engineering side and also key challenges, enriched by my experience

over the last decade by working with a variety of stakeholders, including

bioscience researchers, bioinformaticians, developers in public and private

sectors, standards developing communities, as well as funders and publishers.

(Lecture time: 1 hour)

References

1. Field D*, Sansone SA*, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K,

Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka AM, Perrin N, Remacle JE,

Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J:

Megascience. 'Omics data sharing. Science 326(5950):234-236 (2009)

2. List of standards at BioSharing: www.biosharing.org

3. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ,

Eilbeck K, Ireland A, Mungall CJ; OBI Consortium, Leontis N, Rocca-Serra P,

Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The

OBO Foundry: coordinated evolution of ontologies to support biomedical data

integration. Nat Biotechnol 25(11):1251-1255 (2007)

4. Taylor CF,* Field D*, Sansone SA*, Aerts J, Apweiler R, Ashburner M, Ball CA,

Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch

EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy

NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper

M, Le Novère N, et al.: Promoting coherent minimum reporting guidelines for

biological and biomedical investigations: the MIBBI project. Nat Biotechnol

26(8):889-896 (2008)

5. Sansone SA and Rocca-Serra P: On the evolving portfolio of community-

standards and data sharing policies: turning challenges into new opportunities.

GigaScience 1:10 (2012)



6. Harland L, Larminie C, Sansone SA, Popa S, Marshall MS, Braxenthaler M,

Cantor M, Filsell W, Forster MJ, Huang E, Matern A, Musen M, Saric J, Slater T,

Wilson J, Lynch N, Wise J, Dix I: Empowering industrial research with shared

biomedical vocabularies. Drug Discov Today 16(21-22):940-947 (2011)

The Reality From the Buzz: how to deliver reproducible bioscience data. In

this unsettled status quo - presented in my first talk - how can we enable

bioscience researchers to make use of existing community standards and

maximize data sharing and the subsequent reuse of richly annotated

experimental information?

A successful example is provided by the Investigation/Study/Assay (ISA) [1]

open source, metadata-tracking framework developed and supported by the

growing ISA Commons community [2]. The ISA framework includes both a

general-purpose file format and a software suite to tackle the harmonization of

the structure of bioscience experimental metadata (e.g., provenance of study

materials, technology and measurement types, sample-to-data relationships)

by enabling compliance with the community standards. This example

illustrates how the synergy between research and service groups in academia

(e.g. at Harvard [3] and at The European Bioinformatics Institute [4]) and in

industry (e.g. at The Novartis Institutes for BioMedical Research and at Janssen

Pharmaceuticals, a company of Johnson & Johnson) across a variety of life

science domains, is pivotal to building a network of data collection, curation, and

sharing solutions that progressively enable the invisible use of standards. I will

present the rationale behind the collaborative development and the evolution

of this exemplar ecosystem of data curation and sharing solutions - built on the

common ISA framework. I will also provide high-level examples on how this is

used to collect, curate and manage heterogeneous experimental metadata in

an increasingly diverse set of domains including environmental health,

environmental genomics, metabolomics, (meta)genomics, proteomics, stem

cell discovery, systems biology, transcriptomics, toxicogenomics, etc. I will also

discuss the lessons learned by my team, our collaborators and the growing

user community about the usability of the community standards and provide an



update on the next steps to develop user-friendly visualization functionalities

and use semantic web approaches to make existing knowledge available for

linking, querying, and reasoning. (Lecture time: 1 hour)

References

1. Rocca-Serra P, Brandizi M, Maguire E, Sklyar N, Taylor C, Begley K, Field D,

Harris S, Hide W, Hofmann O, Neumann S, Sterk P, Tong W, Sansone SA: ISA

software suite: supporting standards-compliant experimental annotation and

enabling curation at the community level. Bioinformatics 26(18):2354-2356 (2010); isa-tools.org

2. Sansone SA*, Rocca-Serra P*, Field D, Maguire E, Taylor C, Hofmann O, Fang

H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L,

Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de

Matos P, Dix I, Edmunds S, Evelo CT, Forster MJ, Gaudet P, Gilbert J, Goble C,

Griffin JL, Jacob D, et al.: Toward interoperable bioscience data. Nat Genet

44(2):121-126 (2012); isacommons.org

3. Ho Sui SJ, Begley K, Reilly D, Chapman B, McGovern R, Rocca-Serra P, Maguire E,

Altschuler GM, Hansen TA, Sompallae R, Krivtsov A, Shivdasani RA, Armstrong

SA, Culhane AC, Correll M, Sansone SA, Hofmann O, Hide W: The Stem Cell

Discovery Engine: an integrated repository and analysis system for cancer stem cell

comparisons. Nucleic Acids Res 40(Database issue):D984-D991 (2012);

discovery.hsci.harvard.edu

4. Haug K, Salek R, Conesa P, Hastings J, de Matos P, Rijnbeek M, Mahendraker T,

Williams M, Neumann S, Rocca-Serra P, Maguire E, Gonzalez Beltran A, Sansone

SA, Griffin J, Steinbeck C: MetaboLights - an open-access general-purpose

repository for metabolomics studies and associated meta-data. Nucleic Acids Res

(in review); www.ebi.ac.uk/metabolights



Thom H. Dunning, Jr., National Center for Supercomputing

Applications, Institute for Advanced Computing Applications and

Technologies, and Department of Chemistry, University of Illinois at

Urbana-Champaign

Scientific Computing in Science and Engineering. Computational modeling and

simulation is among the most significant developments in the practice of scientific

inquiry in the 20th Century. Modeling and simulation are now contributors to

essentially all scientific and engineering research programs and are finding increasing

use in a broad range of industrial applications. The use of computing technology is

now spreading to the observational sciences, which are being revolutionized by the

advent of powerful new sensors that can detect and measure a wide range of physical,

chemical and biological phenomena. Massive digital detectors in a new generation of

telescopes have turned astronomy into a digital science. Sensor arrays for

characterizing ecologies and new sequencing instruments for genomics research are

revolutionizing the biological sciences. This lecture will discuss the elements of

computational modeling and simulation as well as the emerging area of data-driven

science and discuss the impact of these new approaches in a few fields, while also

drawing on the lecturer's experiences in chemistry. (Lecture time: 1 hour)

Technology Trends and Future of High Performance Computing. Computing

technologies are undergoing a dramatic transition. Because of physical limitations, the

computational power of a single microprocessor core, the basis of all computing

systems from laptops to supercomputers, has stopped increasing. Dual-core systems

were introduced in 2005, quad-core chips in 2007, and eight-core chips are now



available from many vendors. This trend will continue into the future, with the number

of cores on a chip continuing to increase. In fact, the use of innovative computing

technologies based on many-core chips, e.g., NVIDIA GPUs, is now being seriously

explored in many areas of scientific computing. This technology shift presents a

challenge for computational science and engineering - the only significant

performance increases in the future will be through the increased exploitation of

parallelism. Although these technologies promise to bring petascale computers into

researchers' institutions, and even their laboratories, computers built on these

technologies have significant implications for the design of the next generation of

science and engineering applications. This lecture will provide an overview of the

directions in computing technologies as well as describe the challenges associated

with exploiting these new technologies in computational science and engineering.

(Lecture time: 1 hour)

Blue Waters: overview of a sustained petascale computing system. A new

generation of supercomputers - petascale computers - is providing scientists and

engineers with the ability to simulate a broad range of natural and engineered systems

with unprecedented fidelity. Just as important in this increasingly data-rich world,

these new computers allow researchers to manage and analyze unprecedented

quantities of data, seeking connections, patterns and knowledge. The impact of this

new computing capability will be profound, affecting science, engineering and society.

The National Center for Supercomputing Applications at the University of Illinois at

Urbana-Champaign is deploying a computing system that can sustain one quadrillion

calculations per second on a broad range of science and engineering applications as

well as manage and analyze petabytes of data. This computer, Blue Waters, has been

configured to enable it to solve the most compute-, memory- and data-intensive

problems in science and engineering. It will have tens of thousands of chips (CPUs &

GPUs), petabytes of memory, tens of petabytes of disk storage, and hundreds of

petabytes of archival storage. The presentation will describe Blue Waters and illustrate

the role that Blue Waters will play in a few illustrative areas of research. (Lecture time: 1

hour)



Yan Xu, Microsoft Research, USA

Open Data for Open Science. Part 1. Tools for data scientists. An introduction

to some cutting-edge Microsoft technologies that help scientists

discover, access, consume, and share scientific data. Part 2. Demos

of data tools from Microsoft. Demos of how to create solutions using the tools

presented in Part 1, with real-world scenarios and data. Attendees may bring

their Windows PC to follow the demos to create data visualization samples with

their own environmental research data in WorldWide Telescope

(http://www.worldwidetelescope.org) and share the results on Layerscape

(http://www.layerscape.org).

