
High Performance System Framework for Parallel in-Silico Biological Simulations

Plamenka Borovska, Ognian Nakov, Veska Gancheva, Ivailo Georgiev Computer Systems Department Technical University of Sofia

Sofia, Bulgaria {pborovska, nakov, vgan, ivailo_georgiev}@tu-sofia.bg

Abstract—The parallel implementation of methods and algorithms for the analysis of biological data using high-performance computing is essential for accelerating research and reducing its cost. This paper presents a high-performance framework for carrying out scientific experiments in the area of bioinformatics, based on parallel computer simulations on a heterogeneous compact computer cluster. Several of the most popular and widely used methods and algorithms have been implemented for high performance platforms in order to increase the efficiency of the computations. An important role is played by a database of reference genetic biological data, advanced software tools for in-silico simulations for the purposes of molecular biology, and a web portal enabling secure access to the services. The web portal provides, as services, access to and extraction of biological data and the execution of various parallel program implementations based on algorithms for comparative analysis of biological data. The proposed framework is verified experimentally in a case study investigating influenza virus variability.

Keywords-bioinformatics; biological data; high performance computing; in silico experiments; web portal

I. INTRODUCTION

Biological sequence processing is a key information technology for molecular biology. One of the greatest problems is that many scientific investigations generate huge amounts of data, with data volumes doubling every year, much faster than Moore's law. In recent years an increase in the number of completely sequenced genomes has been observed. Whole genome sequencing technology has made it possible to reveal the nucleotide sequences of more than 1500 viral, bacterial, plant and animal genomes since the year 2000. The world DNA databases are accessible for common use and usually contain information for more than one (up to several thousands) individual genomes for each species. By June 1, 2011, 6895 human and avian isolates of influenza virus had been completely sequenced and made available through GenBank [1]. The National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health (NIH), USA runs a project called the "Influenza Genome Sequencing Project" [2]. It is a collaborative project designed to increase the genome knowledge base of influenza and to help researchers understand how flu viruses evolve, spread, and cause disease.

Scientists are now dependent on databases and on access to the information they contain. This scientific area requires powerful computing resources for exploring large sets of biological data.

The goal of this paper is to present a high-performance framework for carrying out scientific experiments in the area of bioinformatics, based on parallel computer simulations and a local mirror database of genetic biological data on a heterogeneous compact computer cluster.

II. HIGH PERFORMANCE SYSTEM FRAMEWORK

The proposed high-performance system framework for carrying out scientific experiments in the area of bioinformatics consists of a heterogeneous compact computer cluster, a reference genetic biological database, software tools for sequence alignment, such as parallel implementations of the widely used methods BLAST [3, 4], ClustalW [5], the Needleman-Wunsch algorithm [6] and the Smith-Waterman algorithm [7], and a web portal, as shown in Fig. 1. The components are explained in the following sections.

Figure 1. High performance framework (components: portal, storage, external data, HPC resources, and middleware with job manager, execution service and database access service)

A. Experimental hardware

The experimental framework is based on a compact heterogeneous computer cluster of 10 nodes: 8 servers with an AMD Opteron 64 dual-core processor at 1.8 GHz, 2 GB of 800 MHz RAM and 2x160 GB Hitachi SATA disks in RAID 0, and 2 servers with two Intel Xeon E5405 quad-core processors at 2 GHz, 4 GB of 800 MHz RAM and 2x146 GB Hitachi 10000 RPM disks in RAID 0. All nodes are interconnected via a gigabit Ethernet switch. The operating system is 64-bit Scientific Linux 5.3. Message passing is based on the MPICH2 1.1.1p2 distribution of the MPI standard. Parallelism profiling is performed using Jumpshot v4.0.

The work proposed in this paper is supported by the National Science Fund, Bulgarian Ministry of Education and Science, Grant DCVP 02/1.
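As a quick illustration of the message-passing setup (not part of the paper), the sketch below assumes mpi4py is installed on top of the cluster's MPICH2 stack and gathers the host name of every MPI rank, confirming that a job actually spans the Opteron and Xeon nodes.

```python
# Minimal MPI sanity check (illustrative only, not from the paper).
# Assumes mpi4py is installed on top of the cluster's MPICH2 stack.
# Launch with, e.g.:  mpiexec -n 20 -f hostfile python mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
host = MPI.Get_processor_name()

# Gather (rank, hostname) pairs on rank 0 to confirm the job
# is actually spread across all cluster nodes.
pairs = comm.gather((rank, host), root=0)
if rank == 0:
    for r, h in sorted(pairs):
        print(f"rank {r:2d} runs on {h}")
```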

B. Local mirror database design and implementation

Scientists are now dependent on databases and on access to the information they contain. Data providers make great efforts to provide information resources that are comprehensive, easy to use and linked to other databases. However, different data providers use different methods, which means that a researcher must look in ten or more different databases to find all information relating to a particular set of genes. If such searches are done on a regular basis, local copies of all these databases are required. Maintaining current, fully functioning versions of the databases and the tools to search them is a huge and complex task. It is therefore necessary to offer biologists access to current molecular biology data related to their research. Databases used in high-performance processing grow progressively, which requires greater integration of the various databases containing information associated with different levels of expression. The huge volumes of biological sequence data accumulated in databases require the development of efficient tools for comparative analysis of genome sequences and the use of powerful clusters. In summary, the main challenge for data analysis in the life sciences is to offer molecular biologists integrated and modern access to progressively increasing amounts of data in multiple formats. This section explains the design and implementation of a local mirror database of all existing influenza virus isolates in a working format, with support for online updating, on a heterogeneous compact computer cluster. This keeps the local database continuously up to date.

The influenza A genome comprises 10 genes carried by 8 single-stranded negative-sense RNA segments with a total length of approximately 13600 bases. The NCBI Influenza Virus Sequence Database contains the nucleotide sequences, protein sequences and their encoding regions of all influenza viruses in GenBank, including complete genomes. The data are continuously updated and stored on an ftp server [8].
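The online updating of the mirror can be scripted against this ftp server. The sketch below is purely illustrative (not the authors' tool); it uses Python's standard ftplib with the data file names given in the list that follows, and the server may in practice store compressed variants of these files.

```python
# Minimal sketch of mirroring the NCBI influenza data files (illustrative,
# not the authors' implementation). Uses only the Python standard library.
from ftplib import FTP
from pathlib import Path

FILES = ["genomeset.dat", "influenza_na.dat", "influenza_aa.dat",
         "influenza.dat", "influenza.fna", "influenza.cds", "influenza.faa"]

def mirror(dest="mirror"):
    Path(dest).mkdir(parents=True, exist_ok=True)
    ftp = FTP("ftp.ncbi.nih.gov")          # server from reference [8]
    ftp.login()                            # anonymous login
    ftp.cwd("genomes/INFLUENZA")
    for name in FILES:
        with open(Path(dest) / name, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)
    ftp.quit()

if __name__ == "__main__":
    mirror()
```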

The data are stored in the following files:
• genomeset.dat - supplementary genomeset data table
• influenza_na.dat - supplementary nucleotide data table
• influenza_aa.dat - supplementary protein data table
• influenza.dat - nucleotide, protein and coding regions IDs table
• influenza.fna - FASTA nucleotide
• influenza.cds - FASTA coding regions
• influenza.faa - FASTA protein

The genomeset.dat file contains information for sequences of viruses with a complete set of segments in full length (or nearly full length).

The genomeset.dat, influenza_na.dat and influenza_aa.dat files are tab-delimited tables with the following fields: GenBank accession number; host; genome segment number or protein name; subtype; country; year/month/date; sequence length; virus name; age; gender.

The influenza_na.dat and influenza_aa.dat files have an additional field in the last column that indicates whether a sequence is full-length.
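As an illustration (not part of the paper), these tables can be loaded with a few lines of Python. The field order follows the description above; the "Human"/"H1N1" values in the usage line are assumptions about how the columns are spelled in the data.

```python
import csv

# Field order as described above for genomeset.dat / influenza_na.dat /
# influenza_aa.dat; the latter two carry an extra trailing column that
# flags full-length sequences.
FIELDS = ["accession", "host", "segment_or_protein", "subtype", "country",
          "date", "length", "virus_name", "age", "gender"]

def read_table(path, extra_full_length_flag=False):
    names = FIELDS + (["full_length"] if extra_full_length_flag else [])
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            yield dict(zip(names, row))

# Usage sketch: count human H1N1 entries in the nucleotide table
# (column spellings are assumptions).
if __name__ == "__main__":
    records = read_table("mirror/influenza_na.dat", extra_full_length_flag=True)
    n = sum(1 for r in records if r.get("host") == "Human" and r.get("subtype") == "H1N1")
    print(n, "human H1N1 nucleotide records")
```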

The influenza.dat file is a tab-delimited table with the following fields:
• GenBank accession number for the nucleotide
• GenBank accession number for the protein
• Identifier for the protein coding region

Software tools, such as a parser of biological sequences that converts input data from one format to another, have been designed and implemented. The overall idea is to transform the input data, DNA sequences of the influenza virus recorded in a single file, into classified data in a predetermined file structure.

The input data consist of a large number of segments of influenza virus sequences, ordered consecutively one after another and separated by a special separator (containing information about the identification number, type, host and segment), recorded in a .fasta file. The program uses additional (auxiliary) input data stored in a schema file with the extension .dat. The output data are stored in a hierarchical structure. The processing is as follows:

1) Reading the different segments of the sequences sequentially from the fasta file; taking the identification code of the current segment; and searching for a match in the schema file (which contains the segment ID, type of sequence, type of virus, the host from which the sample was extracted, the number of the segment, etc.).

2) In case of a match, the following data are taken: virus type, subtype, the subject (host) from which the virus was isolated, and the segment number (ranging from 1 to 8).

3) Creating the current branch of the hierarchical structure (if it has not yet been established): a root directory is created (which holds all the classified data); in the root directory, subdirectories are created named after the type of virus (e.g. Influenza A, Influenza B, etc.); in each of these, subdirectories are created for the subtype of the virus (H1N1, H1N2, etc.); in each of these, subdirectories are created for the host (Human, Avian, etc.); finally, in these subdirectories, files in fasta format are created, named 1 to 8 according to the current segment of a given sequence.

4) The result is a tree structure (Fig. 2) whose end branches are files containing the various segments. Each file contains all segments with the same number for a given type, subtype and host of the influenza virus. For example, the file /Influenza/InfluenzaA/Human/H1N1/1.fna contains all first segments of influenza A viruses of subtype H1N1 isolated from humans. Each entry in this file consists of the genetic code (DNA code) plus additional information such as the ID, name of the virus, virus type and the host from which it was extracted. When the program starts, a log file is created in the directory specified as an input parameter (the destination directory) to record details of the processing: creation of a directory, saving of a sequence segment, serial number of the entry, etc.
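The authors' tool is a Java console application built on BioJava (its parameters and actions are described below). Purely as an illustration of the classification step, the following Python sketch substitutes Biopython for the FASTA reading and assumes a simplified schema file keyed by accession, with the directory nesting of the example path above.

```python
# Illustrative sketch of the classification step (the actual tool is a Java
# console application using BioJava). Biopython stands in for the FASTA
# reader; the schema lookup keyed by accession and the FASTA header layout
# are assumptions, as the real schema file carries more columns.
from pathlib import Path
from Bio import SeqIO

def load_schema(schema_path):
    """accession -> (virus_type, host, subtype, segment_number)"""
    schema = {}
    with open(schema_path) as handle:
        for line in handle:
            acc, vtype, host, subtype, segment = line.rstrip("\n").split("\t")[:5]
            schema[acc] = (vtype, host, subtype, segment)
    return schema

def classify(fasta_path, schema_path, dest_root):
    root = Path(dest_root)
    root.mkdir(parents=True, exist_ok=True)
    schema = load_schema(schema_path)
    with open(root / "parser.log", "a") as log:
        for record in SeqIO.parse(fasta_path, "fasta"):
            acc = record.id.split("|")[0]           # accession from the header (assumed format)
            if acc not in schema:                   # no match in the schema file: skip
                continue
            vtype, host, subtype, segment = schema[acc]
            branch = root / vtype / host / subtype  # e.g. InfluenzaA/Human/H1N1
            branch.mkdir(parents=True, exist_ok=True)
            with open(branch / f"{segment}.fna", "a") as out:
                SeqIO.write(record, out, "fasta")   # append the segment to its leaf file
            log.write(f"{acc}\t{branch / (segment + '.fna')}\n")
```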


The software tool has been implemented as a console application that accepts the following input parameters:
• Source file - the fasta file containing the biological sequences.
• Schema file - a file containing summary information mapping each segment ID to the type of virus, the name of the host from which the sample was retrieved, the number of the segment, and more.
• Destination directory - the directory in which to save the output (the root of the tree).
• A help parameter displaying directions on which input parameters are acceptable and their meanings.

The sequence of actions performed by the program is as follows:
• Take the path to the file containing the DNA sequences, open the file and begin reading it segment by segment using the BioJava library.
• Read the identification code of the segment, then search for a match in the loaded schema file.
• If a match is found, the corresponding file is created (existing files are appended to) and the current segment is saved in it. The file is saved in the current directory (if the directory does not exist, it is created before the file is saved).
• Continue reading the next segment from the DNA sequences file.

All existing influenza virus sequences obtained from various isolates have been segmented in this way. The local database comprises real datasets of the 8 segments of influenza virus A for various hosts and subtypes, as shown in Fig. 3.
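Reading data back out of the segmented database then amounts to opening the leaf file for a chosen type, host, subtype and segment; a brief sketch, again with Biopython and the illustrative layout used above:

```python
# Illustrative only: load one leaf of the hierarchical database
# (hypothetical root path and nesting, matching the sketch above).
from pathlib import Path
from Bio import SeqIO

def load_segment(root, vtype="InfluenzaA", host="Human", subtype="H1N1", segment=1):
    leaf = Path(root) / vtype / host / subtype / f"{segment}.fna"
    return list(SeqIO.parse(str(leaf), "fasta"))

# Usage sketch: how many segment-1 (PB2) human H1N1 sequences are in the mirror?
records = load_segment("/Influenza")
print(len(records), "records")
```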

Figure 2. Database hierarchical structure

C. Parallel program implementations

Several parallel software tools based on widely used methods relevant to molecular biology investigations have been implemented on the computer cluster: an MPI-based implementation of the local pairwise alignment and search method BLAST, mpiBLAST [9]; an MPI-based implementation of the multiple sequence alignment method ClustalW [10]; an OpenMP-based implementation of the Needleman-Wunsch algorithm; a POSIX-based implementation of the Smith-Waterman algorithm [11]; etc.
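As an illustration of how such a tool can be driven on the cluster (this is not the authors' workflow), the sketch below launches mpiBLAST under MPICH2 from Python; the blastall-style options and the database and query names are assumptions and may differ between mpiBLAST versions.

```python
# Illustrative launcher for an mpiBLAST search over the local mirror
# (not the authors' workflow). Option names follow mpiBLAST's
# blastall-style interface and may differ between versions.
import subprocess

def run_mpiblast(query, database, out, ranks=16, hostfile="hosts.txt"):
    cmd = [
        "mpiexec", "-n", str(ranks), "-f", hostfile,   # MPICH2 launcher
        "mpiblast",
        "-p", "blastn",        # nucleotide-nucleotide search
        "-d", database,        # pre-formatted database name (assumed)
        "-i", query,           # query sequences in FASTA format
        "-o", out,             # report file
    ]
    subprocess.run(cmd, check=True)

# Usage sketch with hypothetical file names:
# run_mpiblast("ha_queries.fna", "influenza_nt", "ha_hits.txt")
```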

Figure 3. Database segmentation schema (influenza virus by type: A, B, C; host: all hosts, human, avian, swine, horse; subtype: all subtypes, H1N1, H3N8, H7N7; segment: PB2, PB1, PA, HA, NP, NA, MP, NS)

Similarity searching among the RNA segments of various isolates of influenza A virus strains has been carried out using the parallel MPI-based implementation of the ClustalW algorithm for multiple sequence alignment and the local mirror database on the computer cluster [9]. The scaling of the parallel system with respect to data size and computer cluster size is shown in Fig. 4. The performance evaluation and scalability analysis show that the parallel implementation scales well with respect to both the workload and the cluster size.

Figure 4. Efficiency of the parallel system
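For reference, the efficiency plotted in Fig. 4 is conventionally computed as the speedup over the serial run time divided by the number of processes; the sketch below uses hypothetical timings, not the paper's measurements.

```python
# Parallel speedup and efficiency from measured wall-clock times
# (hypothetical numbers; the paper's measurements are in Fig. 4).
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processes):
    return speedup(t_serial, t_parallel) / processes

t1 = 1200.0                       # serial run time in seconds (hypothetical)
for p, tp in [(2, 640.0), (4, 340.0), (8, 190.0), (16, 110.0)]:
    print(f"p={p:2d}  S={speedup(t1, tp):5.2f}  E={efficiency(t1, tp, p):4.2f}")
```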

D. Web portal design and implementation

With the growth of high performance computer clusters used for scientific experiments, quick and easy access through an interface becomes necessary. For this purpose, a web portal managing access to the parallel software tools and the database has been designed and implemented on a remote web server. The aim is to build a portal that provides, as services, secure user registration, execution of parallel programs, access to data, and quick and easy administration of the tasks of each user. Each of these steps is of great importance. The communication between client and server and between server and cluster uses encrypted protocols (HTTPS and SSH) and is aimed at providing secure access to the cluster and secure execution of the services (Fig. 5). The server side architecture of the portal is shown in Fig. 6.

Figure 5. Communication between the portal server and the computer cluster
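A minimal sketch of the server-to-cluster leg over SSH is shown below; the paramiko library, the host name, the key-based authentication and the remote submission command are all assumptions for illustration, not the paper's implementation.

```python
# Sketch of submitting a job from the portal server to the cluster head node
# over SSH (illustrative; paramiko is one possible client library, and the
# host name and remote command are assumptions).
import paramiko

def submit_job(remote_command, host="cluster.example.org", user="portal"):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user)        # key-based authentication assumed
    stdin, stdout, stderr = client.exec_command(remote_command)
    output = stdout.read().decode()
    client.close()
    return output

# Usage sketch: launch a parallel alignment on behalf of a portal user
# (binary name and arguments are hypothetical).
# print(submit_job("mpiexec -n 8 -f hosts.txt ./clustalw_mpi input.fna"))
```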

The main objectives in building the portal are security and simplicity of use. In order to achieve these objectives, the following are needed:
• Registration and logging into the portal should be secured.
• Navigation should be easy and intuitive to use.

There are a number of problems associated with starting and correctly executing the parallel implementation of a program that may lead to false calculation results. Ensuring the security and consistent performance of the software tools are therefore issues of great importance.

Figure 6. Server side architecture of the web based portal

Building the portal enables users to manage their tasks through the interface and administrators to easily manage users. The basic administrative requirements are: managing users' accounts and editing their profiles; terminating sessions that users may have abandoned; monitoring users' activities (tasks) and terminating these tasks where possible; and generating regular reports on users' activities or anything useful for administrative, research, or funding purposes.

The web portal also provides: user profile management, e.g. different views for different users; personalized access to information, software tools and processes; retrieval of information from local or remote data sources, e.g. from databases, transaction systems, or remote web sites; and aggregation of the information into composite pages that present it to users in a compact and easily consumable form.

In addition to pure information, the portal also includes applications, such as the execution of software tools for sequence alignment.

The portal implementation is based on a component model that allows components, referred to as portlets, to be plugged into the portal infrastructure. The portlets are user-facing, interactive web application components that render markup fragments to be aggregated and displayed by the portal. The portlets provide an abstraction to hide differences and are intended to provide the following services: authentication; file upload; job submission; software tool execution; resource viewing; and job history and output.


III. CONCLUSION

The paper presents a high-performance framework for carrying out in silico scientific experiments for the purposes of molecular biology, based on a heterogeneous compact computer cluster, a local mirror reference genetic biological database and parallel implementations of several algorithms for the analysis of biological data. Access to the high performance infrastructure is provided through a web portal offering, as services, secure access to and extraction of biological data and the execution of various parallel software tools. The system is designed so that new program implementations and the related portal services can be added dynamically. Users are also given the possibility to upload their own parallel program implementations to the high performance computational system.

REFERENCES
[1] GenBank, http://www.ncbi.nlm.nih.gov/Genbank/
[2] National Institute of Allergy and Infectious Diseases, http://www3.niaid.nih.gov/
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic local alignment search tool", Journal of Molecular Biology, 1990, 215(3).
[4] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Research, 1997, 25:3389-3402.
[5] J. Thompson, D. Higgins, and T. Gibson, "ClustalW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice", Nucleic Acids Research, Vol. 22, No. 22, 1994, pp. 4673-4680.
[6] S. Needleman and C. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins", Journal of Molecular Biology, 1970, vol. 48, pp. 443-453.
[7] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences", Journal of Molecular Biology, 1981, 147, pp. 195-197.
[8] GenBank, ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/
[9] A. Darling, L. Carey, and W. Feng, "The design, implementation, and evaluation of mpiBLAST", in Proceedings of the Cluster World Conference and Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, 2003.
[10] P. Borovska, V. Gancheva, G. Dimitrov, K. Chintov, and S. Gurov, "Parallel Performance Evaluation of Multithreaded Local Sequence Alignment", Proceedings of the International Conference on Computer Systems and Technologies, CompSysTech'11, Vienna, Austria, 2011 (in press).
[11] P. Borovska, O. Nakov, V. Gancheva, and I. Georgiev, "Parallel Multiple Alignment of the Influenza Virus A/H1N1 Genome Sequences on a Heterogeneous Compact Computer Cluster", Proceedings of the 9th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems (SEPADS'10), Cambridge, UK, 2010, pp. 50-55.
