Bioinformatics and supercomputing. M. Gonzalo Claros Díaz, Dept. of Molecular Biology and Biochemistry, Plataforma Andaluza de Bioinformática. http://about.me/mgclaros/ @MGClaros

Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA


DESCRIPTION

What is bioinformatics? How can I specialise in it? Where? Supercomputing capacity at the UMA. Recent bioinformatics achievements related to medicine and to science in general, many of them obtained by UMA teams.


Page 1: Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA

Bioinformatics and supercomputing

M. Gonzalo Claros Díaz, Dept. of Molecular Biology and Biochemistry

Plataforma Andaluza de Bioinformática

Centro de Bioinnovación

http://about.me/mgclaros/

@MGClaros

Page 2

http://www.scbi.uma.es

Let us begin with some words that are not mine

http://everydaylife.globalpost.com/medical-schools-bioinformatics-37686.html

Bioinformatics is a new and very attractive scientific field at the interface between computer science, biology and mathematics, aimed at discovering new information about diseases and the human body.

Bioinformatics uses biology and computer science to discover how living beings and their diseases work.


Page 4

Bioinformatics is not applied only to humans

http://mscbioinformatics.uab.cat/base/base3.asp?sitio=msbioinformatics

But I understand that, for a Health Engineer, interest in humans comes before everything else.

Page 5

Bioinformatics is ESSENTIAL nowadays

http://bioinformatics.biol.ntnu.edu.tw/sher/Teaching.html

Page 6

How did bioinformatics arise?

Margaret Oakley Dayhoff. Order had to be brought to...

65 proteins!

Page 7

After one database comes another

1975

12 structures!

Page 8

Calling them databases is almost an insult to a computer scientist

HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
...
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
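Because each record in the entry above is a line of plain text, reading it takes only a few lines of code. The following is a minimal sketch, assuming the whitespace-separated layout of the ATOM records as they appear on the slide; real PDB files are defined by fixed column positions, so a robust parser would slice by column instead. The names `Atom` and `parse_atom_line` are hypothetical and not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int    # atom serial number
    name: str      # atom name, e.g. "CA"
    residue: str   # residue name, e.g. "PRO" or nucleotide "C"
    chain: str     # chain identifier
    res_seq: int   # residue sequence number
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM record, ASSUMING whitespace-separated fields."""
    fields = line.split()
    if fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        chain=fields[4],
        res_seq=int(fields[5]),
        x=float(fields[6]),
        y=float(fields[7]),
        z=float(fields[8]),
    )


# One of the ATOM records shown in the entry above.
atom = parse_atom_line(
    "ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76"
)
print(atom.name, atom.residue, atom.chain, atom.res_seq, atom.x)
```

Splitting on whitespace is enough for this excerpt, but it fails when adjacent fields touch, which is one reason such flat files are awkward to treat as databases.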

FORTRAN was king

Page 9

1977: the turning point

Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, pp. 5463-5467, December 1977. Biochemistry

DNA sequencing with chain-terminating inhibitors (DNA polymerase/nucleotide sequences/bacteriophage φX174)

F. SANGER, S. NICKLEN, AND A. R. COULSON
Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 2QH, England

Contributed by F. Sanger, October 3, 1977

ABSTRACT A new method for determining nucleotide sequences in DNA is described. It is similar to the "plus and minus" method [Sanger, F. & Coulson, A. R. (1975) J. Mol. Biol. 94, 441-448] but makes use of the 2',3'-dideoxy and arabinonucleoside analogues of the normal deoxynucleoside triphosphates, which act as specific chain-terminating inhibitors of DNA polymerase. The technique has been applied to the DNA of bacteriophage φX174 and is more rapid and more accurate than either the plus or the minus method.

The "plus and minus" method (1) is a relatively rapid and simple technique that has made possible the determination of the sequence of the genome of bacteriophage φX174 (2). It depends on the use of DNA polymerase to transcribe specific regions of the DNA under controlled conditions. Although the method is considerably more rapid and simple than other available techniques, neither the "plus" nor the "minus" method is completely accurate, and in order to establish a sequence both must be used together, and sometimes confirmatory data are necessary. W. M. Barnes (J. Mol. Biol., in press) has recently developed a third method, involving ribo-substitution, which has certain advantages over the plus and minus method, but this has not yet been extensively exploited.

Another rapid and simple method that depends on specific chemical degradation of the DNA has recently been described by Maxam and Gilbert (3), and this has also been used extensively for DNA sequencing. It has the advantage over the plus and minus method that it can be applied to double-stranded DNA, but it requires a strand separation or equivalent fractionation of each restriction enzyme fragment studied, which makes it somewhat more laborious.

This paper describes a further method using DNA polymerase, which makes use of inhibitors that terminate the newly synthesized chains at specific residues.

Principle of the Method. Atkinson et al. (4) showed that the inhibitory activity of 2',3'-dideoxythymidine triphosphate (ddTTP) on DNA polymerase I depends on its being incorporated into the growing oligonucleotide chain in the place of thymidylic acid (dT). Because the ddT contains no 3'-hydroxyl group, the chain cannot be extended further, so that termination occurs specifically at positions where dT should be incorporated. If a primer and template are incubated with DNA polymerase in the presence of a mixture of ddTTP and dTTP, as well as the other three deoxyribonucleoside triphosphates (one of which is labeled with 32P), a mixture of fragments all having the same 5' and with ddT residues at the 3' ends is obtained. When this mixture is fractionated by electrophoresis on denaturing acrylamide gels the pattern of bands shows the distribution of dTs in the newly synthesized DNA. By using analogous terminators for the other nucleotides in separate incubations and running the samples in parallel on the gel, a pattern of bands is obtained from which the sequence can be read off as in the other rapid techniques mentioned above.

Two types of terminating triphosphates have been used: the dideoxy derivatives and the arabinonucleosides. Arabinose is a stereoisomer of ribose in which the 3'-hydroxyl group is oriented in trans position with respect to the 2'-hydroxyl group. The arabinosyl (ara) nucleotides act as chain terminating inhibitors of Escherichia coli DNA polymerase I in a manner comparable to ddT (4), although synthesized chains ending in 3' araC can be further extended by some mammalian DNA polymerases (5). In order to obtain a suitable pattern of bands from which an extensive sequence can be read it is necessary to have a ratio of terminating triphosphate to normal triphosphate such that only partial incorporation of the terminator occurs. For the dideoxy derivatives this ratio is about 100, and for the arabinosyl derivatives about 5000.
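The chain-termination principle described above can be illustrated with a small, idealised simulation. This is only a sketch under strong assumptions: every possible termination position is represented in each lane, the primer length is ignored, and band positions are reduced to fragment lengths. The function names `terminated_lengths` and `read_gel` are hypothetical.

```python
def terminated_lengths(new_strand: str, base: str) -> list[int]:
    """Lengths of the fragments that end in the dideoxy analogue of `base`.

    In the idealised reaction, a fragment can stop at every position where
    `base` is incorporated into the newly synthesized strand.
    """
    return [i + 1 for i, b in enumerate(new_strand) if b == base]


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence off four parallel lanes, shortest fragment first."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical newly synthesized strand, written 5' to 3'.
new_strand = "GATTACA"

# One separate incubation (lane) per terminator.
lanes = {base: terminated_lengths(new_strand, base) for base in "ACGT"}
print(lanes)
print(read_gel(lanes))
```

Sorting the bands by length mimics reading the gel from the fastest-migrating fragment upwards, which is why four lanes of fragment lengths are enough to recover the sequence.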

METHODS

Preparation of the Triphosphate Analogues. The preparation of ddTTP has been described (6, 7), and the material is now commercially available. ddA has been prepared by McCarthy et al. (8). We essentially followed their procedure and used the methods of Tener (9) and of Hoard and Ott (10) to convert it to the triphosphate, which was then purified on DEAE-Sephadex, using a 0.1-1.0 M gradient of triethylamine carbonate at pH 8.4. The preparation of ddGTP and ddCTP has not been described previously; however we applied the same method as that used for ddATP and obtained solutions having the requisite terminating activities. The yields were very low and this can hardly be regarded as adequate chemical characterization. However, there can be little doubt that the activity was due to the dideoxy derivatives.

The starting material for the ddGTP was N-isobutyryl-5'-O-monomethoxytrityldeoxyguanosine prepared by F. E. Baralle (11). After tosylation of the 3'-OH group (12) the compound was converted to the 2',3'-didehydro derivative with sodium methoxide (8). The isobutyryl group was partly removed during this treatment and removal was completed by incubation in NH3 (specific gravity 0.88) overnight at 45°. The didehydro derivative was reduced to the dideoxy derivative (8) and converted to the triphosphate as for the ddATP. The monophosphate was purified by fractionation on a DEAE-Sephadex column using a triethylamine carbonate gradient (0.025-0.3 M) but the triphosphate was not purified.

ddCTP was prepared from N-anisoyl-5'-O-monomethoxytrityldeoxycytidine (Collaborative Research Inc., Waltham, MA) by the above method but the final purification on DEAE-Sephadex was omitted because the yield was very low and the solution contained the required activity. The solution was used directly in the experiments described in this paper.

An attempt was made to prepare the triphosphate of the intermediate didehydrodideoxycytidine because Atkinson et

Abbreviations: The symbols C, T, A, and G are used for the deoxyribonucleotides in DNA sequences; the prefix dd is used for the 2',3'-dideoxy derivatives (e.g., ddATP is 2',3'-dideoxyadenosine 5'-triphosphate); the prefix ara is used for the arabinose analogues.

Page 10

And one month "earlier", the first bioinformatics suite

Volume 4 Number 11 November 1977 Nucleic Acids Research

Sequence data handling by computer

R.Staden

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK

Received 10 October 1977

ABSTRACT

The speed of the new DNA sequencing techniques has created a need for computer programs to handle the data produced. This paper describes simple programs designed specifically for use by people with little or no computer experience. The programs are for use on small computers and provide facilities for storage, editing and analysis of both DNA and amino acid sequences. A magnetic tape containing these programs is available on request.

INTRODUCTION

The development of rapid DNA sequencing techniques [1,2] now enables large amounts of sequence data to be accumulated in a short period of time. The complete sequence of bacteriophage φX174 has recently been published [3] and the sequences of other, similarly sized molecules are near to completion. During the sequencing of φX174 DNA it became necessary to develop computer programs to process the large amounts of data produced. Some of the programs are specific to DNA sequences but many are equally applicable to amino acid sequences. These programs are designed for small computers in common use, such as the PDP 11/45, and are simplified so that they can be used by people with little or no experience of computers. This paper describes some of the programs currently being used in this laboratory. They provide facilities for (1) storage and editing of a sequence, (2) producing copies of the sequence in various forms, e.g. in single or double stranded form, (3) translation into the amino acid sequence coded by the DNA sequence, (4) searching the sequence for any particular shorter sequences, e.g. restriction enzyme sites, (5) analysis of codon usage and base composition, (6) comparison of two sequences for homology, (7) locating regions of sequences which are complementary, and (8) translation of two sequences with the printout showing amino acid similarities. All printouts are as descriptive as possible and, where appropriate, in a form suitable to be reproduced for publication.
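Several of the facilities listed above are short to express in a modern language. The sketch below is not Staden's code; it is an illustrative Python version of facilities (2) to (5), assuming the standard genetic code, uppercase DNA strings over A, C, G and T, and hypothetical function names.

```python
from collections import Counter

# Standard genetic code, with codon positions enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Facility (2): the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Facility (3): translate in frame 1; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def find_sites(seq: str, site: str) -> list[int]:
    """Facility (4): 0-based start positions of every occurrence of `site`."""
    return [i for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)]


def base_composition(seq: str) -> dict[str, int]:
    """Facility (5): how many times each base occurs."""
    return dict(Counter(seq))


print(reverse_complement("ATGC"))
print(translate("ATGGCCTAA"))
print(find_sites("GGAATTCC", "GAATTC"))  # GAATTC is the EcoRI recognition site
print(base_composition("GATTACA"))
```

Building the codon table from a 64-letter string keeps the sketch short; a dictionary written out codon by codon would be easier to audit.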

© Information Retrieval Limited, 1 Falconberg Court, London W1V 5FG, England. 4037

Page 11

The Staden Package is now in the public domain

http://staden.sourceforge.net

Page 12

And sequence databases emerge

1983

1980: 563 sequences

1988

Page 13

They too were "text" databases

Page 14

We begin to need comparison algorithms

J. Mol. Biol. (1981) 147, 195-197

Identification of Common Molecular Subsequences

The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 1974; Waterman et al., 1976) developed to measure the minimum number of “events” required to convert one sequence into another.

These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathematically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979) but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric measure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of “mutational events” required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970).

In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.

Algorithm

The two molecular sequences will be $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$. A similarity $s(a,b)$ is given between sequence elements $a$ and $b$. Deletions of length $k$ are given weight $W_k$. To find pairs of segments with high degrees of similarity, we set up a matrix $H$. First set

$$H_{k0} = H_{0l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m.$$

Preliminary values of $H$ have the interpretation that $H_{ij}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$, respectively. These values are obtained from the relationship

$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}, \tag{1}$$

$$1 \le i \le n \text{ and } 1 \le j \le m.$$
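The recurrence for the matrix H translates almost directly into code. The sketch below is a simplified illustration rather than the authors' implementation: it assumes a linear gap weight (a deletion of length k costs k times `gap`), in which case the two inner maxima reduce to looking one cell up and one cell to the left. The scoring values `match`, `mismatch` and `gap` are arbitrary illustrative choices, and only the best local score is returned, with no traceback.

```python
def smith_waterman_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 2
) -> int:
    """Best local-alignment score between a and b.

    ASSUMPTION: linear gap weights W_k = k * gap, so the maxima over k and l
    in the recurrence collapse to a single step up or left.
    """
    n, m = len(a), len(b)
    # H[i][0] = H[0][j] = 0, as in the initialisation of the matrix.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + s,  # extend the segment pair diagonally
                H[i - 1][j] - gap,    # deletion in one sequence
                H[i][j - 1] - gap,    # deletion in the other sequence
                0,                    # start a new segment pair here
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared segment "ACG"
```

The zero inside the maximum is what makes the alignment local: a poor prefix never penalises a good segment that starts later. Filling the matrix takes time proportional to n times m, which is why these methods are accurate but slow on large databases.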

0922-2836/80/09019&03 $02.00/0 © 1980 Academic Press Inc. (London) Ltd.

Proc. Natl. Acad. Sci. USA, Vol. 80, pp. 726-730, February 1983. Biochemistry

Rapid similarity searches of nucleic acid and protein data banks (global homology/optimal alignment)

W. J. WILBUR AND DAVID J. LIPMAN
Mathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Building 31 Room 4B-54, Bethesda, Maryland 20205

Communicated by Maxine Singer, November 8, 1982

ABSTRACT With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.

As the number of protein molecules and nucleic acid fragments for which the sequences have been determined has grown into the thousands (the total number of nucleotides so analyzed is now more than one million), it has become clear that a rapid method for carrying out similarity searches would be useful. Such a method should allow economical study of large data banks in search of related sequences that would then be subjected to more definitive analysis.

Currently, there are several different methods in use for analyzing the similarity of two sequences. For the purpose of global comparison (considering both complete sequences), there are the methods of Fitch (1) as implemented by Dayhoff (2), of Needleman and Wunsch (3) and Sellers (4) [see Smith et al. (5) for proof of the equivalence of these two algorithms], and of Sankoff (6). Given a set of scoring rules, such as +1 for a base match and -3 for a gap, a Needleman-Wunsch type algorithm considers all possible alignments, including gaps, and will find an optimal alignment under the scoring rules. All of these methods require computer time on the order of N × M, where N and M are the lengths of the sequences compared. Local search methods (a search for similar fragments of two sequences) have been proposed by Korn et al. (7), Sellers (8), Smith and Waterman (9), and Goad and Kanehisa (10). These methods are under the same time constraints as the global methods already noted. Dayhoff (2) has implemented an algorithm that compares a 25-residue test subsequence from one peptide chain with all possible 25-residue subsequences from another, not allowing gaps. If all test subsequences are used, the time is again of the order of N × M but, in many instances, reasonable choices for test subsequences can improve the time without significant sacrifice in the accuracy of results.

All of the search techniques mentioned above become computationally intensive and quite expensive when applied to large banks of sequences. We shall describe here a global algorithm for comparing two nucleic acid or two amino acid sequences. This algorithm involves the construction of an optimal alignment that is useful in its own right. The algorithm also requires a computation time on the order of N × M, where N and M are the lengths of the sequences being compared, but, for given sequences, the computation is many times faster than the above-mentioned methods. Results obtained by the method and its limitations and advantages are discussed.

METHODS

Computational Methods and Data Sources. All computing was done on the DEC KL-10 computer facility at the National Institutes of Health. The programs are written in DEC-10 Pascal.* The graphs shown were generated by using the MLAB program facility at the National Institutes of Health. All sequences were taken from the Los Alamos Sequence Data Base and the National Biomedical Research Foundation Data Bank.

The Algorithm. We shall here describe how two sequences, S1 and S2, of lengths N1 and N2, respectively, are to be compared. As motivation, it is useful to think in terms of the dot matrix comparison of S1 and S2 (11) in which the beginnings of both sequences are placed to the upper left of the matrix and one sequence is positioned horizontally and the other is positioned vertically. The diagonals running downward from left to right in the dot matrix illustrate the degree of similarity that would be found by a simple sliding comparison with the different possible choices of alignment register. Frequently, one can look at the dot matrix comparison and immediately see certain diagonals that appear to have a number of points above background and, therefore, indicate a level of similarity for the two sequences in certain regions. It is generally true that these significant diagonals are still clearly visible when the dot matrix is filtered to only show matches of length k or greater, where k is a small positive integer. For this reason, our attention will be directed to such k-tuples.

The first step in the algorithm is the location of all the k-tuple matches between the sequences S1 and S2. In precise terms, a k-tuple match consists of two k-tuples, S1(i), S1(i + 1), ..., S1(i + k - 1) from S1 and S2(j), S2(j + 1), ..., S2(j + k - 1) from S2, that are identical. If there are p elements in the alphabet from which the sequences are made, then there are p^k possible different k-tuples. To locate all k-tuple matches, we follow a method described by Dumas and Ninio (12). We have chosen a simple method (there are many possible) of converting any k-tuple into an integer between 1 and p^k. Then, a one-dimensional array, C, of length p^k and consisting of pointers set initially to nil is used. In a single pass through S1, each integer position i is added to a list constructed at C(i_c), where i_c is the coded form of the k-tuple beginning at i in S1.

* The programs described in this paper are available from the authors.
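The k-tuple lookup described above can be sketched in a few lines. This is an illustrative approximation, not the authors' Pascal program: instead of coding each k-tuple as an integer that indexes an array of length p^k, it uses a Python dictionary keyed by the k-tuple itself, which plays the same role. Positions are 0-based and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_index(seq: str, k: int) -> dict[str, list[int]]:
    """Single pass through seq: map each k-tuple to the positions where it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def ktuple_matches(s1: str, s2: str, k: int) -> list[tuple[int, int]]:
    """All pairs (i, j) such that the k-tuples at s1[i:] and s2[j:] are identical."""
    index = ktuple_index(s1, k)
    return [
        (i, j)
        for j in range(len(s2) - k + 1)
        for i in index.get(s2[j:j + k], [])
    ]


def diagonal_counts(matches: list[tuple[int, int]]) -> Counter:
    """Count matches per diagonal (i - j) of the dot matrix."""
    return Counter(i - j for i, j in matches)


matches = ktuple_matches("ACGTACGT", "CGTA", k=2)
print(matches)
print(diagonal_counts(matches).most_common(1))
```

Diagonals that collect many k-tuple matches correspond to the visible diagonals of the dot matrix, so they mark the alignment registers worth examining in more detail.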

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

They are good, but slow

FASTA appears

Page 15: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

http://www.scbi.uma.es

Empezamos a necesitar algoritmos de comparación

13

J. Mol. Bid. (1981) 147, 195-197

Identification of Common Molecular Subsequences

The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 1974; Waterman et al., 1976) developed to measure the minimum number of “events” required to convert one sequence into another.

These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathemat- ically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979) but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric measure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of “mutational events” required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970).

In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.

Algorithm The two molecular sequences will be h=alaz . . . an and IZj= blb, b,. A

similarity a(a,b) is given between sequence elements a and b. Deletions of length k are given weight Wt. To find pairs of segments with high degrees of similarity, we set up a matrix H. First set

Hto = Ho, = 0 for 0 I k I n and 0 I 1 I m.

Preliminary values of H have the interpretation that H, is the maximum similarity of two segments ending in ai and bj, respectively. These values are obtained from the relationship

Hij=max{Hi-,,j-1+S(ai,bj), ~F,X {Hi-k,j- W,}, ~2" {Hi,j-,- W,}, 0}, (1)

1 li<n and 1 <j<m.

195

0922-2836/80/09019&03 $02.00/O 0 1980 Academic Press Inc. (London) Ltd.

Proc. Natt Acad. Sci. USAVol. 80, pp. 726-730, February 1983Biochemistry

Rapid similarity searches of nucleic acid and protein data banks(global homology/optimal alignment)

W. J. WILBUR AND DAVID J. LIPMANMathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Building 31 Room 4B-54,Bethesda, Maryland 20205

Communicated by Maxine Singer, November 8, 1982

ABSTRACT With the development oflarge data banks of pro-tein and nucleic acid sequences, the need for efficient methods ofsearching such banks for sequences similar to a given sequence hasbecome evident. We present an algorithm for the global compar-ison ofsequences based on matching k-tuples ofsequence elementsfor a fixed k. The method results in substantial reduction in thetime required to search a data bank when compared with priortechniques of similarity analysis, with minimal loss in sensitivity.The algorithm has also been adapted, in a separateimplementa-tion, to produce rigorous sequence alignments. Currently, usingthe DEC KL-10 system, we can compare all sequences in the en-tire Protein Data Bank ofthe National Biomedical Research Foun-dation with a 350-residue query sequence in less than 3 min andcarry out a similar analysis with a 500-base query sequence againstall eukaryotic sequences in the Los Alamos Nucleic Acid Data Basein less than 2 min.

As the number of protein molecules and nucleic acid fragmentsfor which the sequences have been determined has grown intothe thousands (the total number of nucleotides so analyzed isnow more than one million), it has become clear that a rapidmethod for carrying out similarity searches would be useful.Such a method should allow economical study of large databanks in search of related sequences that would then be sub-jected to more definitive analysis.

Currently, there are several different methods in use for ana-.lyzing the similarity oftwo sequences. For the purpose ofglobalcomparison (considering both complete sequences), there arethe methods of Fitch (1) as implemented by Dayhoff (2), ofNeedleman and Wunsch (3) and Sellers (4) [see Smith et al (5)for proof of the equivalence of these two algorithms], and ofSankoff (6). Given a set of scoring rules, such as + 1 for a basematch and -3 for a gap, a Needleman-Wunsch type algorithmconsiders all possible alignments, including gaps, and will findan optimal alignment under the scoring rules. All ofthese meth-ods require computer time on the order ofN x M, where N andM are the lengths of the sequences compared. Local searchmethods (a search for similar fragments of two sequences) havebeen proposed by Korn et aL (7), Sellers (8), Smith and Water-man (9), and Goad and Kanehisa (10). These methods are underthe same time constraints as the global methods already noted.Dayhoff (2) has implemented an algorithm that compares a 25-residue test subsequence from one peptide chain with all pos-sible 25-residue subsequences from another, not allowing gaps.If all test subsequences are used, the time is again of the orderof N X M but, in many instances, reasonable choices for testsubsequences can improve the time without significant sacrificein the accuracy of results.

All of the search techniques mentioned above become com-putationally intensive and quite expensive when applied to

large banks of sequences. We shall describe here a global al-gorithm for comparing two nucleic acid or two amino acid se-quences. This algorithm involves the construction ofan optimalalignment that is useful in its own right. The algorithm also re-quires a computation time on the order ofN X M, where N andM are the lengths of-the sequences being compared, but, forgiven sequences, the computation is many times faster than theabove-mentioned methods. Results obtained by the methodand its limitations and advantages are discussed.

METHODS

Computational Methods and Data Sources. All computing was done on the DEC KL-10 computer facility at the National Institutes of Health. The programs are written in DEC-10 Pascal.* The graphs shown were generated by using the MLAB program facility at the National Institutes of Health. All sequences were taken from the Los Alamos Sequence Data Base and the National Biomedical Research Foundation Data Bank.

The Algorithm. We shall here describe how two sequences, S1 and S2, of lengths N1 and N2, respectively, are to be compared. As motivation, it is useful to think in terms of the dot matrix comparison of S1 and S2 (11) in which the beginnings of both sequences are placed to the upper left of the matrix and one sequence is positioned horizontally and the other is positioned vertically. The diagonals running downward from left to right in the dot matrix illustrate the degree of similarity that would be found by a simple sliding comparison with the different possible choices of alignment register. Frequently, one can look at the dot matrix comparison and immediately see certain diagonals that appear to have a number of points above background and, therefore, indicate a level of similarity for the two sequences in certain regions. It is generally true that these significant diagonals are still clearly visible when the dot matrix is filtered to only show matches of length k or greater, where k is a small positive integer. For this reason, our attention will be directed to such k-tuples.

The first step in the algorithm is the location of all the k-tuple matches between the sequences S1 and S2. In precise terms, a k-tuple match consists of two k-tuples, S1(i), S1(i + 1), ..., S1(i + k - 1) from S1 and S2(j), S2(j + 1), ..., S2(j + k - 1) from S2, that are identical. If there are p elements in the alphabet from which the sequences are made, then there are p^k possible different k-tuples. To locate all k-tuple matches, we follow a method described by Dumas and Ninio (12). We have chosen a simple method (there are many possible) of converting any k-tuple into an integer between 1 and p^k. Then, a one-dimensional array, C, of length p^k and consisting of pointers set initially to nil is used. In a single pass through S1, each integer position i is added to a list constructed at C(ic), where ic is the coded form of the k-tuple beginning at i in S1.

* The programs described in this paper are available from the authors.
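The k-tuple lookup described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Pascal program: the function names `encode_ktuple` and `ktuple_matches` are hypothetical, positions and codes are 0-based (the text uses integers from 1 to p^k), a dictionary of lists stands in for the array C of nil-initialized pointers, and the second pass over S2 is an assumption about how the table is then consulted, since the excerpt stops after the pass through S1.

```python
from collections import defaultdict


def encode_ktuple(ktuple, alphabet="ACGT"):
    """Map a k-tuple to an integer in 0 .. p**k - 1 (base-p positional code)."""
    p = len(alphabet)
    code = 0
    for ch in ktuple:
        # alphabet.index raises ValueError for symbols outside the alphabet.
        code = code * p + alphabet.index(ch)
    return code


def ktuple_matches(s1, s2, k, alphabet="ACGT"):
    """Return all (i, j) such that s1[i:i+k] == s2[j:j+k] (0-based)."""
    # Table playing the role of C: k-tuple code -> start positions in s1.
    table = defaultdict(list)
    for i in range(len(s1) - k + 1):
        table[encode_ktuple(s1[i:i + k], alphabet)].append(i)
    # Assumed second pass: look up each k-tuple of s2 in the table.
    matches = []
    for j in range(len(s2) - k + 1):
        for i in table.get(encode_ktuple(s2[j:j + k], alphabet), []):
            matches.append((i, j))
    return matches
```

Each sequence is scanned once, so building and querying the table costs time proportional to the sequence lengths plus the number of matches found, instead of filling a complete N1 × N2 matrix.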


They are good, but slow

FASTA appears

Bioinformatics is a science that poses problems and seeks solutions to them

Bioinformatics is a science because it seeks to discover information

Page 16: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

More sequences accumulate, so more efficient comparisons are needed

The algorithm is improved, not the computer: BLAST arrives

Page 17: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The cost of sequencing falls, thanks to the engineers

Page 18: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Lower cost: more sequencing, more data and more databases

Page 19: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Databases are ESSENTIAL for bioinformaticians today

Page 20: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

But Moore's law is unforgiving

Information accumulates faster than processor speed increases

Number of transistors in Intel processors

Data growth in databases

Computer engineers: HELP!

Page 21: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The «info» cannot keep pace with the «bio»

Page 22: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

If resources do not increase, more people will have to be devoted to analyzing the data

Page 23: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Bioinformaticians are needed despite (thanks to?) the crisis

http://www.indeed.com/jobtrends?q=molecular+biology,+bioinformatics,+biomedical+engineering&l=&relative=1

Page 24: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

In short, there is work for bioinformaticians



Page 29: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Every day there are new requests for bioinformaticians

30 Dec 2013

Page 30: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

And also in Spain and Europe

http://www.eurosciencejobs.com/jobs/bioinformatics

Page 31: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

If what you want is to make money, that too

[Figure: classified-ads page ("Anuncios breves") from the magazine La Marea, April 2014, no. 15, www.lamarea.com. The ads include Genoma4u: "Knowing your genome and that of your children is the key to personalized medicine." www.genoma4u.com]

Page 32: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

They are well paid, at least abroad

Linux and OS X skills pay better than Windows

http://www.r-bloggers.com/r-skills-attract-the-highest-salaries/

R is taught in the bioinformatics track of the Health Engineering degree (Ingeniería de la Salud)

Page 33: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Did you know that, after databases, R is the most widely used tool in bioinformatics?

Databases are used the most

And then R

Page 34: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

And that there are job offers for bioinformaticians with R skills?

http://www.r-bloggers.com/r-jobs-march-24th/

Page 35: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

This world is within your reach at UMA

http://www.uma.es/grado-en-ingenieria-de-la-salud

Page 36: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

We will always have the specialization courses

Page 37: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

And the books! Which, like Teruel, also exist

Page 38: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

A bioinformatician can work in many ways

• As an engineer
  • Making difficult or tedious tasks easier
  • Workflows and automation
• As a computer scientist
  • Improving existing algorithms
  • Creating new algorithms
    • For example, sequence assembly
• As a scientist
  • Discovering biological information with the computer
    • For example, relating apparently unconnected diseases

Page 39: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The competencies of the bioinformatician are being defined

Message from ISCB

Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies

Lonnie Welch1*, Fran Lewitter2, Russell Schwartz3, Cath Brooksbank4, Predrag Radivojac5, Bruno Gaeta6, Maria Victoria Schneider7

1 School of Electrical Engineering and Computer Science, Ohio University, Athens, Ohio, United States of America, 2 Bioinformatics and Research Computing, Whitehead Institute, Cambridge, Massachusetts, United States of America, 3 Department of Biological Sciences and School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America, 4 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom, 5 School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America, 6 School of Computer Science and Engineering, The University of New South Wales, Sydney, New South Wales, Australia, 7 The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom

Introduction

Rapid advances in the life sciences and in related information technologies necessitate the ongoing refinement of bioinformatics educational programs in order to maintain their relevance. As the discipline of bioinformatics and computational biology expands and matures, it is important to characterize the elements that contribute to the success of professionals in this field. These individuals work in a wide variety of settings, including bioinformatics core facilities, biological and medical research laboratories, software development organizations, pharmaceutical and instrument development companies, and institutions that provide education, service, and training. In response to this need, the Curriculum Task Force of the International Society for Computational Biology (ISCB) Education Committee seeks to define curricular guidelines for those who train and educate bioinformaticians. The previous report of the task force summarized a survey that was conducted to gather input regarding the skill set needed by bioinformaticians [1]. The current article details a subsequent effort, wherein the task force broadened its perspectives by examining bioinformatics career opportunities, surveying directors of bioinformatics core facilities, and reviewing bioinformatics education programs.

The bioinformatics literature provides valuable perspectives on bioinformatics education by defining skill sets needed by bioinformaticians, presenting approaches for providing informatics training to biologists, and discussing the roles of bioinformatics core facilities in training and education.

The skill sets required for success in the field of bioinformatics are considered by several authors: Altman [2] defines five broad areas of competency and lists key technologies; Ranganathan [3] presents highlights from the Workshops on Education in Bioinformatics, discussing challenges and possible solutions; Yale's interdepartmental PhD program in computational biology and bioinformatics is described in [4], which lists the general areas of knowledge of bioinformatics; in a related article, a graduate of Yale's PhD program reflects on the skills needed by a bioinformatician [5]; Altman and Klein [6] describe the Stanford Biomedical Informatics (BMI) Training Program, presenting observed trends among BMI students; the American Medical Informatics Association defines competencies in the related field of biomedical informatics in [7]; and the approaches used in several German universities to implement bioinformatics education are described in [8].

Several approaches to providing bioinformatics training for biologists are described in the literature. Tan et al. [9] report on workshops conducted to identify a minimum skill set for biologists to be able to address the informatics challenges of the "-omics" era. They define a requisite skill set by analyzing responses to questions about the knowledge, skills, and abilities that biologists should possess. The authors in [10] present examples of strategies and methods for incorporating bioinformatics content into undergraduate life sciences curricula. Pevzner and Shamir [11] propose that undergraduate biology curricula should contain an additional course, "Algorithmic, Mathematical, and Statistical Concepts in Biology." Wingren and Botstein [12] present a graduate course in quantitative biology that is based on original, pathbreaking papers in diverse areas of biology. Johnson and Friedman [13] evaluate the effectiveness of incorporating biological informatics into a clinical informatics program. The results reported are based on interviews of four students and informal assessments of bioinformatics faculty.

The challenges and opportunities relevant to training and education in the context of bioinformatics core facilities are discussed by Lewitter et al. [14]. Relatedly, Lewitter and Rebhan [15] provide guidance regarding the role of a bioinformatics core facility in hiring biologists and in furthering their education in bioinformatics. Richter and Sexton [16] describe a need for highly trained bioinformaticians in core facilities and provide a list of requisite skills. Similarly, Kallioniemi et al. [17] highlight the roles of bioinformatics core units in education and training.

This manuscript expands the body of knowledge pertaining to bioinformatics curriculum guidelines by presenting the results from a broad set of surveys (of core facility directors, of career opportunities, and of existing curricula). Although there is some overlap in the findings of the

Citation: Welch L, Lewitter F, Schwartz R, Brooksbank C, Radivojac P, et al. (2014) Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies. PLoS Comput Biol 10(3): e1003496. doi:10.1371/journal.pcbi.1003496

Published March 6, 2014

Copyright: © 2014 Welch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: No specific funding was received for writing this article.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

PLOS Computational Biology | www.ploscompbiol.org 1 March 2014 | Volume 10 | Issue 3 | e1003496


database management languages (e.g., Oracle, PostgreSQL, and MySQL), and scientific and statistical analysis software (such as R, S-plus, MATLAB, and Mathematica). Additionally, a bioinformatician should be able to incorporate components from open source software repositories into a software system. The ability to effectively utilize distributed and high-performance computing to analyze large data sets is essential, as is knowledge of networking technology and internet protocols. A bioinformatician should be able to utilize web authoring tools, web-based user interface implementation technologies, and version control and build tools (e.g., subversion, Ant, and Netbeans).

While it is important for a bioinformatician to have a suite of computational, mathematical, and statistical skills, this alone is insufficient. Throughout their careers, bioinformaticians usually contribute to a variety of scientific projects, such as variant detection in human exome resequencing; human genetic diversity; genomic and epigenomic mechanisms of gene regulation; viral diversity; neurodegeneration and psychiatric disorders; drug discovery; the role of transcription factors and chromatin structure in global gene expression, development, and differentiation; and cancer/tumor biology. To be a fully integrated member of a research team, a bioinformatician must possess detailed knowledge of molecular biology, genomics, genetics, cell biology, biochemistry, and evolutionary theory. Furthermore, it is necessary to understand related technologies, including next generation sequencing and proteomics/mass spectrometry. It is also desirable for a bioinformatician to have modeling experience or background in one or more specialized domains, such as systems biology, inflammation, immunology, cell signaling, or physiology.

Additionally, a bioinformatician must have a high level of motivation, be independent and dedicated, possess strong interpersonal and managerial skills, and have outstanding analytical ability. A bioinformatician must have excellent teamwork skills and have strong scientific communication skills.

As a bioinformatician progresses through his or her career, it is helpful to develop managerial and programmatic skills, such as staff management and business development; understanding of or experience with grant funding and/or access to finance; awareness of research and development (R&D) and innovation policy and government drivers; the use of modeling and simulation approaches; ability to evaluate the major factors associated with efficacy and safety; and ability to answer regulatory questions related to product approval and risk management. It is also important to have familiarity with presenting biological results in both oral and written forms.

In summary, a senior bioinformatician will benefit from strong analytical reasoning capabilities, as evidenced by a track record of innovation; scientific creativity, collaborative ability, mentoring skills, and independent thought; and a record of outstanding research. Table 1 summarizes the skill sets identified by (1) surveying bioinformatics core facility directors and (2) examining bioinformatics career opportunities.

Preliminary Survey of Existing Curricula

An important step in developing guidelines for bioinformatics education is to gain a comprehensive understanding of current practices in bioinformatics and computational biology education. To this end, the task force surveyed and catalogued existing curricula used in bioinformatics educational programs.

As a first step, the task force began a manual search for educational programs. Due to the large number of education programs, the decision was made to initially restrict the search to programs awarding a degree or certificate and explicitly including "computational biology," "bioinformatics," or some close variant in the name of the degree or certificate awarded. The search thus excluded non-degree tracks or options within more traditional programs, non-degree programs of study, or programs in related fields that might have high overlap with bioinformatics (e.g., biostatistics or biomedical informatics). Although this was a controversial decision even within the task force, this narrow scope and definition of programs was intended to keep the search from becoming too unfocused or being sidetracked over questions of which programs should be included as belonging to the field.

A search by committee members produced a preliminary collection of two programs awarding degrees of associate of arts or sciences; 72 awarding bachelor of science, arts, or technology; 38 awarding master of science, research, or biotechnology; 39 awarding doctor of philosophy; and

Table 1. Summary of the skill sets of a bioinformatician, identified by surveying bioinformatics core facility directors and examining bioinformatics career opportunities.

Skill Category: Specific Skills

General: time management, project management, management of multiple projects, independence, curiosity, self-motivation, ability to synthesize information, ability to complete projects, leadership, critical thinking, dedication, ability to communicate scientific concepts, analytical reasoning, scientific creativity, collaborative ability

Computational: programming, software engineering, system administration, algorithm design and analysis, machine learning, data mining, database design and management, scripting languages, ability to use scientific and statistical analysis software packages, open source software repositories, distributed and high-performance computing, networking, web authoring tools, web-based user interface implementation technologies, version control tools

Biology: molecular biology, genomics, genetics, cell biology, biochemistry, evolutionary theory, regulatory genomics, systems biology, next generation sequencing, proteomics/mass spectrometry, specialized knowledge in one or more domains

Statistics and Mathematics: application of statistics in the contexts of molecular biology and genomics, mastery of relevant statistical and mathematical modeling methods (including experimental design, descriptive and inferential statistics, probability theory, differential equations and parameter estimation, graph theory, epidemiological data analysis, analysis of next generation sequencing data using R and Bioconductor)

Bioinformatics: analysis of biological data; working in a production environment managing scientific data; modeling and warehousing of biological data; using and building ontologies; retrieving and manipulating data from public repositories; ability to manage, interpret, and analyze large data sets; broad knowledge of bioinformatics analysis methodologies; familiarity with functional genetic and genomic data; expertise in common bioinformatics software packages, tools, and algorithms

doi:10.1371/journal.pcbi.1003496.t001


http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002

Page 40: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The engineer, the scientist and the user

http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002

Page 41: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The profile of an Australian bioinformatician

http://www.ebi.edu.au/news/braembl-community-survey-report-2013

Where do they work? Who is the bioinformatician?

This is the bioinformatician

This is a bio-user

Another bio-user

And this one too

Page 42: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

The bioinformatician has no mobility problems

Page 43: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

When do bioinformaticians rest?

NCBI is the most heavily used site in biomedicine. Why?

[Figure: NCBI Web Traffic, 1997-2006. Vertical axis ticks from 100,000 to 700,000; horizontal axis from January 1998 to January 2006.]

722,000 Unique IPs a Day

91 Million Web Hits a Day

3200 Peak Web Hits a Second

1.5 Terabytes FTP a Day

1.8 Million Unique Users a Day

Global Entrez

Search

Page 44: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

There are always things that a computer scientist will do better

10-04-13

Page 45: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

We now know what is expected of a bioinformatician

Now let us look at some examples as real as life itself

Page 46: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Workflows that automate repetitive tasks

Data mining

Microarray

«Wet» side

«Dry» side

Assembling

Page 47: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Two examples «made in Málaga»

SeqTrim

FullLengtherNEXT

Raw sequences

Annotation with Maker

SeqTrimNEXT (pre-processing)

Assembly

Mining with FullLengtherNEXT

GENOMICS

TRANSCRIPTOMICS

Centro de Bioinnovación

Page 48: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Why were these tools needed?

[Figure, FullLengtherNEXT: bar chart comparing the assembly strategies OLC, De Bruijn and OLC + De Bruijn + CAP3 (axis from 0 to 60000). Series: unigenes, number of orthologs for unigenes, complete unigenes with orthologs, unique complete unigenes with orthologs.]

The more complete genes there are, the better the assembly

[Figure, SeqTrimNEXT: number of contigs (axis from 0 to 30) and N50 (axis from 0 to 50000) for BAC1, BAC2 and BAC3, comparing Newbler with SeqTrimNext + Newbler.]

Fewer contigs, higher N50
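The N50 compared on this slide is a standard assembly statistic: the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembled bases. A minimal sketch in Python follows; the function name `n50` and the contig lengths in the usage note are illustrative assumptions, not values from the slide.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total reaches half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_of_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_of_total:
            return length
    return 0


# Hypothetical lengths: the same 290 bases split into six or into three contigs.
fragmented = [80, 70, 50, 40, 30, 20]
merged = [150, 90, 50]
```

With these made-up lengths, `n50(fragmented)` is 70 and `n50(merged)` is 150, which is the pattern the slide summarises: fewer, longer contigs give a higher N50.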

Page 49: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

There is bioinformatics for transcriptomics at UMA

DATABASE Open Access

EuroPineDB: a high-coverage web database for maritime pine transcriptome

Noé Fernández-Pozo1, Javier Canales1, Darío Guerrero-Fernández2, David P Villalobos1, Sara M Díaz-Moreno1, Rocío Bautista2, Arantxa Flores-Monterroso1, M Ángeles Guevara3, Pedro Perdiguero4, Carmen Collada3,4, M Teresa Cervera3,4, Álvaro Soto3,4, Ricardo Ordás5, Francisco R Cantón1, Concepción Avila1, Francisco M Cánovas1 and M Gonzalo Claros1,2*

Abstract

Background: Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases.

Description: EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided.

Conclusions: The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome.

1 Background

Conifers (Coniferales), the most important group of gymnosperms, represent 650 species, some of which are the largest, tallest, and oldest non-clonal terrestrial organisms on Earth. They are of immense ecological importance, dominating many terrestrial landscapes and representing the largest terrestrial carbon sink. Currently present in a large number of ecosystems, they have evolved very efficient physiological adaptation systems.

Given that trees are the great majority of conifers, they provide a different perspective on plant genome biology and evolution taking into account that conifers are separated from angiosperms by more than 300 million years of independent evolution. Studies on the conifer genome are revealing unique information which cannot be inferred from currently sequenced angiosperm genomes (such as poplar, Eucalyptus, Arabidopsis or rice): around 30% of conifer genes have little or no sequence similarity to plant genes of known function [1,2]. Unfortunately, conifer genomics is hindered by the very large genome (e.g. the pine genome is approximately 160 times larger than Arabidopsis and seven times larger

* Correspondence: [email protected]

1 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Campus de Teatinos s/n, Universidad de Málaga, 29071 Málaga, Spain

Full list of author information is available at the end of the article

Fernández-Pozo et al. BMC Genomics 2011, 12:366

http://www.biomedcentral.com/1471-2164/12/366

© 2011 Fernández-Pozo et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Research Article

De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology

Javier Canales1†, Rocio Bautista2†, Philippe Label3†, Josefa Gómez-Maldonado1, Isabelle Lesur4,5,6, Noe Fernández-Pozo2, Marina Rueda-López1, Dario Guerrero-Fernández2, Vanessa Castro-Rodríguez1, Hicham Benzekri2, Rafael A. Cañas1, María-Angeles Guevara7, Andreia Rodrigues8, Pedro Seoane2, Caroline Teyssier9, Alexandre Morel9, François Ehrenmann4,5, Grégoire Le Provost4,5, Céline Lalanne4,5, Céline Noirot10, Christophe Klopp10, Isabelle Reymond11, Angel García-Gutiérrez1, Jean-François Trontin11, Marie-Anne Lelu-Walter9, Celia Miguel8, María Teresa Cervera7, Francisco R. Cantón1, Christophe Plomion4,5, Luc Harvengt11, Concepción Avila1,2, M. Gonzalo Claros1,2 and Francisco M. Cánovas1,2*

1 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain
2 Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain
3 INRA, Université Blaise Pascal, Aubière Cedex, France
4 INRA, Cestas, France
5 Université de Bordeaux, Talence, France
6 HelixVenture, Mérignac, France
7 Departamento de Ecología y Genética Forestal, INIA-CIFOR, Madrid, Spain
8 Forest Biotech Lab, IBET/ITQB, Oeiras, Portugal
9 INRA, Unité Amélioration, Génétique et Physiologie Forestières, Orléans Cedex 2, France
10 INRA de Toulouse Midi-Pyrénées, Auzeville, Castanet Tolosan cedex, France
11 FCBA, Pole Biotechnologie et Sylviculture, Cestas, France

Received 20 July 2013; revised 24 September 2013; accepted 26 September 2013.

*Correspondence (Tel: +34 952131942; fax: +34 952132376; email: [email protected])

†These authors contributed equally to work.

Keywords: conifers, transcriptome sequencing, next-generation sequencing, full-length cDNA, transcription factors, single nucleotide polymorphism.

Summary

Maritime pine (Pinus pinaster Ait.) is a widely distributed conifer species in Southwestern Europe and one of the most advanced models for conifer research. In the current work, comprehensive characterization of the maritime pine transcriptome was performed using a combination of two different next-generation sequencing platforms, 454 and Illumina. De novo assembly of the transcriptome provided a catalogue of 26 020 unique transcripts in maritime pine trees and a collection of 9641 full-length cDNAs. Quality of the transcriptome assembly was validated by RT-PCR amplification of selected transcripts for structural and regulatory genes. Transcription factors and enzyme-encoding transcripts were annotated. Furthermore, the available sequencing data permitted the identification of polymorphisms and the establishment of robust single nucleotide polymorphism (SNP) and simple-sequence repeat (SSR) databases for genotyping applications and integration of translational genomics in maritime pine breeding programmes. All our data are freely available at SustainpineDB, the P. pinaster expressional database. Results reported here on the maritime pine transcriptome represent a valuable resource for future basic and applied studies on this ecological and economically important pine species.

Introduction

Forests are essential components of the ecosystems covering approximately one-third of the Earth’s land area and playing a fundamental role in the regulation of terrestrial carbon sinks. Trees represent nearly 80% of the plant biomass (Olson et al., 1983) and 50%–60% of annual net primary production in terrestrial ecosystems (Field et al., 1998).

Conifers are the most important group of gymnosperms. Having diverged from a common ancestor more than 300 million years ago (Bowe et al., 2000), gymnosperms and angiosperms have evolved very efficient and distinct physiological adaptations (Leitch and Leitch, 2012). Coniferous forests dominate large ecosystems in the Northern Hemisphere and include a broad variety of woody plant species, some of which are the largest, tallest and longest living organisms on Earth (Farjon, 2010).

Please cite this article as: Canales J., Bautista R., Label P., Gómez-Maldonado J., Lesur I., Fernández-Pozo N., Rueda-López M., Guerrero-Fernández D., Castro-Rodríguez V., Benzekri H., Cañas R. A., Guevara M.-A., Rodrigues A., Seoane P., Teyssier C., Morel A., Ehrenmann F., Le Provost G., Lalanne C., Noirot C., Klopp C., Reymond I., García-Gutiérrez A., Trontin J.-F., Lelu-Walter M.-A., Miguel C., Cervera M.T., Cantón F.R., Plomion C., Harvengt L., Avila C., Claros M.G. and Cánovas F.M. (2013) De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology. Plant Biotechnol. J., doi: 10.1111/pbi.12136

© 2013 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd

Plant Biotechnology Journal (2013), pp. 1–14 doi: 10.1111/pbi.12136

Microarrays

Databases

Tools and algorithms…

Genomics, proteomics, metabolomics

Biotechnology

Page 50: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

First, the data are collected

homology was found, respectively, confirming that most assembled unigenes were pine transcripts.

Annotation of unigenes

Unigene annotation was achieved by combining the results of several annotation processes. Each annotation is associated with an E-value to enable the empirical assessment of annotation quality. A preliminary analysis of the collection of unigenes using Full-LengtherNext (FLN) revealed that, from the 181 100 unigenes annotated (46.6%), 26 020 were nonredundant transcripts based on orthologue ID. It is also remarkable that 18 667 full-length (FL) unigenes were reconstructed with a mean length of 1495 nucleotides, representing 19% of the total annotated unigenes (Table 2). Of these, 9641 FL unigenes were different, unique genes (9.8% of the total annotated unigenes, and 37.0% of unique unigenes). The frequency distribution of FL unigenes (Figure 3) indicated a high proportion of unigenes ranging from 500 to 1500 nucleotides with the longest transcript being 7876 nucleotides.

Preliminary analysis using FLN also revealed that 111 577 unigenes (53.0%) did not possess significant homology to any other plant gene. This number includes new conifer genes as well as artefactual assemblies. To distinguish between both possibilities, FLN includes a TestCode analysis (Fickett, 1982) and a comparison with the noncoding RNA database (http://www.mirbase.org). As a result, at least 9799 nonredundant coding unigenes are worth considering as putative new conifer genes.

In fact, 4608 unigenes had a homologue EST in the Pine Gene Index 9.0 database (http://compbio.dfci.harvard.edu/cgi-bin/tgi/gimain.pl?gudb=pine). However, only 176 unigenes from this study were determined to be candidate noncoding RNAs. Therefore, the minimal P. pinaster transcriptome can be calculated as 26 020 unigenes having unique ID. In addition, the 9799 unigenes without homology but having coding characteristics, plus the 176 noncoding RNAs, could also be considered, that is, 35 995 unigenes.
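The counts quoted in these paragraphs can be checked with a few lines of arithmetic. The sketch below only re-adds the figures given in the text; the constant and function names are illustrative assumptions.

```python
# Counts taken from the paragraphs above.
UNIGENES_WITH_UNIQUE_ID = 26_020       # minimal transcriptome
CODING_WITHOUT_HOMOLOGY = 9_799        # putative new conifer genes
CANDIDATE_NONCODING_RNAS = 176
UNIQUE_FULL_LENGTH_UNIGENES = 9_641


def extended_transcriptome_size():
    """Minimal transcriptome plus coding unigenes without homology plus ncRNAs."""
    return (UNIGENES_WITH_UNIQUE_ID
            + CODING_WITHOUT_HOMOLOGY
            + CANDIDATE_NONCODING_RNAS)


def percentage(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100 * part / whole
```

`extended_transcriptome_size()` gives 35 995, matching the total in the text, and `percentage(UNIQUE_FULL_LENGTH_UNIGENES, UNIGENES_WITH_UNIQUE_ID)` is about 37, in line with the 37.0% of unique unigenes reported as full-length.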

Because this transcriptome was deemed satisfactory, genes annotated as described in Experimental procedures (GO term, a definition or a KEGG code) were subjected to statistical analyses. Of total unigenes, 62.2% (130 845) were annotated, which indicated that the level of annotation was similar to previously published results (Fernández-Pozo et al., 2011). Furthermore, the distribution of GO terms at level 2 of biological process and at level 3 of molecular function (Figure S1) shows that the putative transcriptome covers most important cell functions. A total of 58 296 unigenes possessed unknown sequences that could not be found in existing databases. The annotated transcriptome can be browsed, downloaded and queried at http://www.scbi.uma.es/sustainpinedb/.

Validation of full-length cDNA sequences in the transcriptome database

Full-length cDNA (FLcDNAs) are essential for gene annotation, unambiguous determination of intron–exon boundaries and gene

Table 1 Description of samples used for DNA sequencing

Gene library | Sequencing platform | Sampled plant material | Experimental conditions | SRA code
EuroPineDB | Sanger/454 | Bud, xylem, phloem, stem, needles, roots, stem, embryos, callus, cone, male and female strobili | ESTs and SSH libraries from different tissues and conditions as described by Fernández-Pozo et al., 2011 | SRS479769
Biogeco1 | 454 | Xylem, bud and needle | ESTs from differentiating xylem, swelling bud and young needles | SRX032960, SRX032961, SRX032962, SRX032963
Biogeco2 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (low growing family) in well-watered or drought-stress conditions | SRX031546
Biogeco3 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (fast growing family) in well-watered or drought-stress conditions | SRX031589
UAGPF1 | 454 | Embryome | ESTs from developing, immature embryos (1-week maturation) | SRX022618
INIA_PPIN | 454 | Bud | ESTs from buds | PRJNA221139
U_root | 454 | Root | ESTs from roots (1-month-old seedlings) | SRS480239
U_tip | 454 | Root tips | ESTs from root tips (1-month-old seedlings) | SRS480265
U_H | 454 | Hypocotyl | ESTs from hypocotyl (1-month-old seedlings) | SRS480236
U_N | 454 | Needle | ESTs from needles (1-month-old seedlings) | SRS480237
U_Cot_Os | 454 | Cotyledon | ESTs from cotyledons grown under dark conditions | SRS479771
U_H_Os | 454 | Hypocotyl | ESTs from hypocotyl grown under dark conditions | SRS480236
U_R_6 | 454 | Roots | ESTs from roots (6-month-old seedlings) | SRS480238
U_S_8 | 454 | Stem | ESTs from stem (8-month-old seedlings) | SRS480261
UAGPF2 | Illumina | Somatic embryo | Paired-end ESTs from developing, immature embryos (1 week maturation) | SRR609713
BIOGECO4 | Illumina | Bud | ESTs from young and aged buds | SRX031587
BIOGECO5 | Illumina | Root | ESTs from drought-stressed and control roots in hydropony | SRX031592, SRX031590
BIOGECO6 | Illumina | Bud | ESTs from young and aged buds | SRX031594
IBET | Illumina | Zygotic embryo | Paired-end ESTs from embryos | SRS481044


Page 51: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Then the workflow is designed

[Figure: two versions (A and B) of the assembly workflow producing the S. senegalensis unigenes v3 and v4. Labelled steps: S. senegalensis long reads and short reads; SeqTrimNext (pre-processing); MIRA, EULER-SR and Oases (pre-assembling; Oases with kmer 23 & 47, paired-end + single); CD-HIT 99%; mis-assembly rejection; CAP3 (reconciliation); Full-LengtherNext; BOWTIE 2 (mapping test). Labelled outputs: rejected, debris, contigs, coding contigs, non-coding, mapped contigs, unmapped contigs, coding unmapped contigs, misassemblies.]
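The workflow on this slide chains independent steps, each taking the output of the previous one and discarding part of it. A minimal sketch of that idea in Python follows. The three toy steps are hypothetical stand-ins written for illustration; they do not call or reproduce SeqTrimNext, CD-HIT or any other tool named on the slide.

```python
from typing import Callable, List, Tuple

Reads = List[str]
Step = Tuple[str, Callable[[Reads], Reads]]


def trim_trailing_n(reads: Reads) -> Reads:
    """Toy pre-processing: strip undetermined bases (N) from the read ends."""
    return [read.rstrip("N") for read in reads]


def drop_short(reads: Reads, min_len: int = 5) -> Reads:
    """Toy rejection step: discard reads shorter than min_len."""
    return [read for read in reads if len(read) >= min_len]


def remove_duplicates(reads: Reads) -> Reads:
    """Toy redundancy removal: keep the first copy of each identical read."""
    return list(dict.fromkeys(reads))


def run_workflow(reads: Reads, steps: List[Step]):
    """Apply the steps in order and log how many reads survive each one."""
    log = []
    for name, step in steps:
        reads = step(reads)
        log.append((name, len(reads)))
    return reads, log


STEPS: List[Step] = [
    ("pre-processing", trim_trailing_n),
    ("rejection", drop_short),
    ("deduplication", remove_duplicates),
]

# Made-up reads, only to exercise the workflow.
demo_reads = ["ACGTACGTNN", "ACGTACGT", "ACNNN", "GGGTTTAAA"]
final_reads, survival_log = run_workflow(demo_reads, STEPS)
```

Keeping the steps in a list makes it cheap to insert, reorder or replace one of them, which is how the two versions of the workflow on the slide differ from each other.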

Page 52: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Workflows are increasingly important

Genes 2012, 3, 545-575; doi:10.3390/genes3030545

genes ISSN 2073-4425

www.mdpi.com/journal/genes Article

Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows

Federica Torri 1,2, Ivo D. Dinov 2,3, Alen Zamanyan 3, Sam Hobel 3, Alex Genco 3, Petros Petrosyan 3, Andrew P. Clark 4, Zhizhong Liu 3, Paul Eggert 3,5, Jonathan Pierce 3, James A. Knowles 4, Joseph Ames 2, Carl Kesselman 2, Arthur W. Toga 2,3, Steven G. Potkin 1,2, Marquis P. Vawter 6 and Fabio Macciardi 1,2,*

1 Department of Psychiatry and Human Behavior, University of California, Irvine, CA 92617, USA; E-Mails: [email protected] (F.T.); [email protected] (S.G.P.)

2 Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA; E-Mails: [email protected] (I.D.D.); [email protected] (J.A.); [email protected] (C.K.); [email protected] (A.W.T.)

3 Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA; E-Mails: [email protected] (A.Z.); [email protected] (S.H.); [email protected] (A.G.); [email protected] (P.P.); [email protected] (Z.L.); [email protected] (P.E.); [email protected] (J.P.)

4 Zilkha Neurogenetic Institute, USC Keck School of Medicine, Los Angeles, CA 90033, USA; E-Mails: [email protected] (A.P.C.); [email protected] (J.A.K.)

5 Department of Computer Science, University of California, Los Angeles, CA 90095, USA

6 Functional Genomics Laboratory, Department of Psychiatry And Human Behavior, School of Medicine, University of California, Irvine, CA 92697, USA; E-Mail: [email protected]

* Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel.: +1-949-824-4559; Fax: +1-949-824-2072.

Received: 6 July 2012; in revised form: 15 August 2012 / Accepted: 15 August 2012 / Published: 30 August 2012

Abstract: Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. These methods can be applied to complex disorders as well, and have been adopted as one of the current mainstream approaches in population genetics. These achievements have been made possible by next generation sequencing (NGS) technologies, which require substantial bioinformatics resources to analyze the dense and complex sequence data. The

OPEN ACCESS

Genes 2012, 3 547

Table 1. Review of the most used software in next-generation sequencing (NGS) data analysis, which includes two major computational macro-processes: (1) a primary step related to mapping and assembling, with alignment quality control, quality score re- regions of the genome; and (2) secondary, advanced steps focused on variant (single nucleotide polymorphisms (SNPs), insertions-deletions (Indels) and copy number variations (CNVs)) calling and annotation. These macro-processes are briefly reviewed to provide a background for the software algorithms embedded in DNA-Seq analysis.

Process | Software & Algorithms | Website
Preprocessing step | homemade script | (N/A)
(1.1) Alignment | MAQ | http://maq.sourceforge.net
(1.1) Alignment | BWA | http://bio-bwa.sourceforge.net/bwa.shtml
(1.1) Alignment | BWA-SW (SE only) | http://bio-bwa.sourceforge.net/bwa.shtml
(1.1) Alignment | PERM | http://code.google.com/p/perm/
(1.1) Alignment | BOWTIE | http://bowtie-bio.sourceforge.net
(1.1) Alignment | SOAPv2 | http://soap.genomics.org.cn
(1.1) Alignment | MOSAIK | http://bioinformatics.bc.edu/marthlab/Mosaik
(1.1) Alignment | NOVOALIGN | http://www.novocraft.com/
(1.2) De novo Assembly | VELVET | http://www.ebi.ac.uk/%7Ezerbino/velvet
(1.2) De novo Assembly | SOAPdenovo | http://soap.genomics.org.cn
(1.2) De novo Assembly | ABYSS | http://www.bcgsc.ca/platform/bioinfo/software/abyss
(1.3) Basic QC | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(1.3) Basic QC | PICARD | http://picard.sourceforge.net/command-line-overview.shtml
(1.4) Advanced QC | GATK | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
(1.4) Advanced QC | PICARD | http://picard.sourceforge.net/
(1.4) Advanced QC | SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(1.4) Advanced QC | IGVtools | http://www.broadinstitute.org/igv/igvtools
(2.1a) Variant Calling and annotation | Sequence Variant Analyzer v1.0, for hg18 annotations: SVA | http://www.svaproject.org/
(2.1a) Variant Calling and annotation | Sequence Variant Analyzer v1.0, for hg18 annotations: SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1a) Variant Calling and annotation | Sequence Variant Analyzer v1.0, for hg18 annotations: ERDS | http://www.duke.edu/~mz34/erds.htm
(2.1a) Variant Calling and annotation | SAMTOOLS and ANNOVAR for annotation: SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1a) Variant Calling and annotation | SAMTOOLS and ANNOVAR for annotation: ANNOVAR | http://www.openbioinformatics.org/annovar/
(2.1a) Variant Calling and annotation | UnifiedGenotyper and ANNOVAR for annotation: GATK | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
(2.1a) Variant Calling and annotation | UnifiedGenotyper and ANNOVAR for annotation: ANNOVAR | http://www.openbioinformatics.org/annovar/
(2.1b) CNVs | CNVseq: CNVseq | http://tiger.dbs.nus.edu.sg/cnv-seq/
(2.1b) CNVs | CNVseq: R | http://www.r-project.org/
(2.1b) CNVs | SAMTOOLS/ERDS/Sequence variant analyzer v1.0: SVA | http://www.svaproject.org/
(2.1b) CNVs | SAMTOOLS/ERDS/Sequence variant analyzer v1.0: SAMTOOLS | http://sourceforge.net/projects/SAMtools/files/
(2.1b) CNVs | SAMTOOLS/ERDS/Sequence variant analyzer v1.0: ERDS | http://www.duke.edu/~mz34/erds.htm
(2.1b) CNVs | CNVer: CNVer | http://compbio.cs.toronto.edu/CNVer/
(2.1b) CNVs | CNVer: BOWTIE | http://bowtie-bio.sourceforge.net
(2.1b) CNVs | CNVer: SAVANT | http://compbio.cs.toronto.edu/savant/
Simulated data generation tool | dwgsim | http://sourceforge.net/projects/dnaa/

Page 53: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Then it is run, and parallelized as much as possible

Fewer transcripts for genes encoding enzymes of ammonium assimilation were found in conifer species. Only 2–3 transcripts for glutamine synthetase (GS) were identified in gymnosperms (P. pinaster and P. abies), in accordance with previous results (Cánovas et al., 2007). In contrast, genomes of angiosperm species are endowed with GS families with a higher number of members, eight in Populus and six in Arabidopsis. A single expressed gene was found for ferredoxin-glutamate synthase (Fd-GOGAT) and NADH-GOGAT in maritime pine and Norway spruce. In contrast, two genes encode Fd-GOGAT and NADH-GOGAT in poplar.

A second group of genes were those encoding enzymes involved in synthesis of methionine and S-adenosylmethionine (SAM), the activated form of methionine, which participate in a number of essential metabolic pathways in plants. In particular, we focused on three genes involved in the synthesis and recycling of SAM, a methyl donor in multiple cellular transmethylation reactions (Figure 5). The number of genes encoding cobalamin-independent

Fig. 2 Flow chart showing preprocessing into useful reads, assembly into contigs and overlap-based reconciliation into final unigenes of sequenced data from 5 (591 174 069 short reads, Illumina) or 14 (6 381 011 long reads, 454) cDNA libraries in maritime pine.


Here «many» sequences were assembled

And here 10X more
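Running a step "as parallel as possible" usually means splitting the reads into independent chunks, processing each chunk in its own worker and combining the partial results. A minimal sketch with Python's standard `concurrent.futures` follows. The GC-count task, the chunk size and the function names are assumptions chosen for illustration. A thread pool is used to keep the sketch self-contained; CPU-bound work on real data would normally go to a process pool or to a cluster queue instead.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def gc_count(reads):
    """Count G and C bases in one chunk of reads (the per-worker task)."""
    return sum(read.count("G") + read.count("C") for read in reads)


def parallel_gc_fraction(reads, chunk_size=2, max_workers=4):
    """GC fraction of all reads, with one task per chunk."""
    chunks = chunk(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(gc_count, chunks))
    total_bases = sum(len(read) for read in reads)
    return sum(partial_counts) / total_bases if total_bases else 0.0
```

For example, `parallel_gc_fraction(["GGCC", "ATAT", "GCAT"])` returns 0.5, the same value a single sequential pass would give, because the chunks do not depend on each other.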

Page 54: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Now we design a database

With tables for the annotations and any meta-information we find
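A database of this kind can be as small as one table for the unigenes and one for their annotations. The sketch below uses Python's built-in `sqlite3` module. The table and column names and the two demo unigenes are assumptions made up for illustration; they are not the actual EuroPineDB or SustainpineDB schema.

```python
import sqlite3

# Hypothetical two-table layout: sequences and their annotations.
SCHEMA = """
CREATE TABLE unigene (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    sequence TEXT NOT NULL
);
CREATE TABLE annotation (
    id          INTEGER PRIMARY KEY,
    unigene_id  INTEGER NOT NULL REFERENCES unigene(id),
    source      TEXT NOT NULL,   -- annotation vocabulary, e.g. GO or KEGG
    term        TEXT NOT NULL,   -- identifier inside that vocabulary
    description TEXT
);
CREATE INDEX annotation_by_term ON annotation (source, term);
"""


def build_demo_db():
    """Create an in-memory database holding two invented unigenes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO unigene (id, name, sequence) VALUES (?, ?, ?)",
        [(1, "demo_unigene_1", "ATGGCCTAA"), (2, "demo_unigene_2", "ATGAAATAG")],
    )
    conn.executemany(
        "INSERT INTO annotation (unigene_id, source, term, description) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "GO", "GO:0015979", "photosynthesis"),
            (2, "GO", "GO:0015979", "photosynthesis"),
            (2, "KEGG", "00910", "nitrogen metabolism"),
        ],
    )
    conn.commit()
    return conn


def unigenes_with_term(conn, source, term):
    """Names of the unigenes annotated with a given term, sorted by name."""
    rows = conn.execute(
        "SELECT u.name FROM unigene u "
        "JOIN annotation a ON a.unigene_id = u.id "
        "WHERE a.source = ? AND a.term = ? ORDER BY u.name",
        (source, term),
    )
    return [row[0] for row in rows]
```

Keeping annotations in their own table lets one unigene carry any number of GO, KEGG or other terms, and a query such as `unigenes_with_term(conn, "GO", "GO:0015979")` is the kind of annotation filter a web interface can expose.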

Page 55: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

… and we give it a web interface for the scientific community

UniGene dataset consists of a consensus sequence of each contig and the singletons (see above).

Gene library clones are stored in 96-well and/or 384-well plates in the laboratory. Navigation using the ‘96-Well plates’ and ‘384-Well plates’ tabs displays the plate organisation of the libraries. Users can download the sequences of all clones in a plate or browse the plate in which red clones are useless sequences, green clones are those that have successfully passed SeqTrim pre-processing, and black ones are printed controls.

Currently, only one microarray (Pinarray1) has been designed with EuroPineDB sequences [22]. The ‘Microarray’ tab displays general and statistical information about Pinarray1 (Figure 2), whose printed sequences and annotations can be downloaded. Each microarray block organisation is displayed in the lower part of the page. Coordinates refer to a single sequence. The colours green, red and black have the same meaning as in the plates (see above). The graphic representation offers the possibility of retrieving information from specific clones after analysis of any experimental result using this microarray.

Sequences in EuroPineDB have been assembled by gene library and pine species, and can be accessed using the ‘Assemblies’ tab. Each assembly can be inspected in detail, showing a paged list of UniGenes and a summary description. The detailed view of every UniGene includes the aligned sequences, their orientation, the contig alignment (as a simple-text), a description for the consensus sequence, and the putative description of each included Sanger sequence.

Clicking on the name of a clone provides access to all the information about it (e.g. EMBL accession number, sequence length, the plate(s) in which it can be found, annotations, original and pre-processed sequences, gene library source, etc). From the sequence entry, users can return to any previously described browsing page (Figure 3).

At the home page, a menu on the left enables filtered browsing by microarray, pine species, or annotation. Filtered browsing only displays entries sharing the same selected annotation. Each item in the list opens a new page with the EuroPineDB entries that share this specific annotation. For example, based on nitrogen metabolism (KEGG 00910), it is possible to know how many sequences are present in the database, since by clicking on 00910 every enzyme from this pathway can be seen, as well as the entries that are annotated as being one of these enzymes. As an additional example, all UniGenes involved in photosynthesis (GO:0015979) that belong to a particular library or pine species can be identified by means of GO term filtering.

3.1.2 Database retrieval

In addition to a guided browsing, EuroPineDB contents can be retrieved by means of text search or sequence

[Figure 3 diagram. Views: Home; Gene libraries; 96-Well plates; 384-Well plates; Microarrays; Assemblies; BLAST; Search; Each library; All sequences; Each 96w_plate; Each 384w_plate; Each microarray block; Each clone/sequence; List of assemblies; Each UniGene; External links; Annotations (Descriptions, GO, EC, KEGG, InterPro, SNP, SSR, ORF).]

Figure 3 Navigating through EuroPineDB. Arrowheads indicate the direction of navigation. Green boxes correspond to available views from all pages (thus, no incoming arrowhead is specified). Violet text indicates the option of downloading sequences in FASTA format.


Page 56: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Now we can discover biological information

GigaBayes detection rates to the 55 607 putative SNPs identified by CLCBio, the relaxed and stringent GigaBayes detection procedures led to a success rate of 0.94% (48 true SNPs detected) and 3.75% (193 SNPs), respectively. In conclusion, CLCBio was found to perform nearly 10 times better than the most stringent GigaBayes SNP detection procedure with our data set.

A total of 5974 putative simple-sequence repeats (SSRs) were found, with trinucleotide repeats (3309) being the most common, and dinucleotide repeats (479) the least abundant. This is in agreement with previously published P. pinaster SSR abundance (Fernández-Pozo et al., 2011).
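A simple-sequence repeat is a short unit (two bases for dinucleotide, three for trinucleotide repeats) repeated in tandem. The sketch below shows one way to locate such repeats with a regular expression. It is an illustration only and not the procedure used in the paper; the function name, the minimum number of repeats and the demo sequence are assumptions.

```python
import re


def find_ssrs(sequence, unit_len, min_repeats):
    """Find tandem repeats of a unit of `unit_len` bases.

    Returns (start, unit, number_of_repeats) tuples. Units made of a single
    repeated base (homopolymers such as AAA) are skipped.
    """
    # A unit of unit_len bases followed by at least min_repeats - 1 more copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue
        hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


# Made-up sequence: five AG repeats and four ATG repeats between flanks.
demo_sequence = "TT" + "AG" * 5 + "CC" + "ATG" * 4 + "TT"
```

On the demo sequence, `find_ssrs(demo_sequence, 2, 4)` reports the AG repeat starting at position 2 with five copies, and `find_ssrs(demo_sequence, 3, 4)` reports the ATG repeat starting at position 14 with four copies.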

Discussion

Maritime pine transcriptome assembly

Long-read sequence data sets are required for transcriptome assembly in nonmodel species for which a reference genome is not available. In conifers, 454 sequencing has been recently used to generate well-defined transcriptomes in several species of ecological and economic interest, that is, Pinus contorta (Parchman et al., 2010), P. glauca (Rigault et al., 2011), P. pinaster (Fernández-Pozo et al., 2011), Pinus taeda and 11 other conifers (Lorenz et al., 2012). In the present work, we used a combination of 454 and Illumina sequencing to define a minimal reference transcriptome for maritime pine (P. pinaster). A similar approach was recently used to characterize, for example, the globe artichoke transcriptome (Scaglione et al., 2012). The nonredundant transcriptome resulting from the assembly contains 26 020 unique transcripts with orthologue ID in public databases, a number very close to the 27 720 unique cDNA clusters reported for the P. glauca transcriptome (Rigault et al., 2011) and higher than the 17 000 unique coding genes obtained in the assembly of P. contorta transcriptome (Parchman et al., 2010). The number of unique transcripts in maritime pine is also close to the number of genes (28 354) resulting from the draft assembly of the 20-gigabase genome of P. abies (Nystedt et al., 2013). Considering all the available data, an elevated coverage of the maritime pine transcriptome is estimated.

FLcDNAs catalogues as genomic resources

The availability of large collections of FLcDNAs in several conifers, such as Sitka (Ralph et al., 2008) and white spruces (Rigault et al., 2011), as well as Cryptomeria (Futamura et al., 2008), has greatly facilitated the assembly and annotation of FLcDNAs in maritime pine. FLcDNAs are crucial for accurate

Table 3 Continued

Gene name | Theoretical size of ORF from assembly (bp) | Experimental size of ORF (bp) | Accession number*
Cytosolic serine hydroxymethyltransferase | 1413 | 1413 | sp_v3.0_unigene17057, HE574564
D-3-Phosphoglycerate dehydrogenase | 1947 | 1947 | sp_v3.0_unigene543, HE574561
3-Phosphoserine aminotransferase | 1302 | 1302 | sp_v3.0_unigene37851, HE574562
Pinoresinol-lariciresinol reductase | 939 | 939 | sp_v3.0_unigene17681, HE574558
Phenylcoumaran benzylic ether reductase | 927 | 927 | sp_v3.0_unigene31659, HE574559
Phenylpropenal double-bond reductase | 1056 | 1056 | sp_v3.0_unigene22698, HE575885

*Accession number of unigene in Sustainpine and GenBank.
†MYB family of TF.
‡Dof family of TF.
§NAC family of TF.

Fig. 4 Distribution of unique transcripts corresponding to TF gene families in Pinus pinaster and comparison to other plant transcriptomes. The number of different encoded transcripts with the conserved DNA-binding domain of each family is represented. The distribution of TF gene families in P. pinaster, Picea glauca, Picea abies, Populus trichocarpa and Arabidopsis thaliana is compared.

© 2013 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd, Plant Biotechnology Journal, 1–14

Javier Canales et al.

annotation, comparative analysis with other conifer species and also for functional analysis of relevant genes associated to maritime pine growth, development and response to environmental changes. Furthermore, this genomic resource will greatly facilitate protein identification as well as protein–protein interaction studies through proteomics approaches (Cánovas et al., 2004). For all these reasons, it was of paramount importance to validate the assembly of the FLcDNA collection (9641 different transcripts).

Over the last few years, refined protocols have been developed for Agrobacterium-mediated genetic transformation of maritime pine embryogenic tissue, cryopreservation of transgenic lines and efficient transgenic plant regeneration through somatic embryogenesis (reviewed in Trontin et al., 2013). These new developments and the availability of a large collection of FLcDNAs have paved the way for the application of reverse genetics towards functional dissection of traits of economic and ecological interest in maritime pine trees. Thus, the availability of both standardized transformation methods and FLcDNA catalogues is expected to significantly increase throughput in candidate gene analysis together with facilitating comparison across laboratories interested in maritime pine genomics. The functional analysis of key genes is crucial for future applications in tree improvement, new variety design and sustainable forest management (e.g. development of marker-assisted selection).

Maritime pine gene families and genome size

It has been suggested that the increased size and complexity of conifer genomes relative to angiosperms may be explained by the existence of large gene families (Kinlaw and Neale, 1997). However, this assumption is not fully supported by available data as most TF (Figure 4 and Table S1) or other gene families (Figure 5) present in maritime pine (this work) or spruce genomes (Birol et al., 2013; Nystedt et al., 2013; Rigault et al., 2011) were of similar or even lower size compared with angiosperm species (P. trichocarpa, A. thaliana and V. vinifera). Although the existence of large gene families in conifers coding for enzymes of secondary metabolism has been reported (Martin et al., 2004), there are other families in primary metabolism that contain a similar, or even smaller, number of functional members in conifers than in angiosperms (Cánovas et al., 2007). High genome size and complexity in conifers may be more readily explained by divergence and accumulation of retrotransposons and pseudogenes (Morse et al., 2009; Nystedt et al., 2013). Retrotransposons and pseudogenes can be expressed and contribute to some extent to the collection of unigenes without orthologue in the maritime pine transcriptome. Accumulation of pseudogenes may have functional advantages in the regulation of gene expression if they are expressed. Recently, Poliseno et al. (2010) reported that expressed pseudogenes compete with authentic target transcripts for miRNA binding and, as such, modulate expression levels of their cognate genes.

Transcription factors

The identification of transcription factors (TF) and subsequent analysis of the composition and organization of TF families are necessary steps to understand the regulatory networks associated with key processes in conifer trees. The number of TF in P. pinaster appears to be similar to that of P. glauca (Rigault et al., 2011), but considerably lower compared with those found in the genomes of several angiosperms. This fact is confirmed by studies carried out in specific families. For example, the Dof family has only ten members in maritime and loblolly pines (Rueda-López et al., 2013); this is twofold to eightfold lower than gene

Fig. 5 Comparison of gene families for relevant enzymes in Pinus pinaster, Picea abies, Populus trichocarpa and Arabidopsis thaliana. The following databases were used in addition to SustainpineDB: P. abies v1.0, P. trichocarpa v3.0, A. thaliana TAIR 10.

The maritime pine transcriptome

The pine genome is 10× the size of the human genome, but its gene families are smaller than in other plants

Page 57: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Didn't I just mention «parallelisation»?

Hindawi Publishing Corporation
Computational Biology Journal
Volume 2013, Article ID 707540, 12 pages
http://dx.doi.org/10.1155/2013/707540

Research Article
SCBI_MapReduce, a New Ruby Task-Farm Skeleton for Automated Parallelisation and Distribution in Chunks of Sequences: The Implementation of a Boosted Blast+

Darío Guerrero-Fernández,1 Juan Falgueras,2 and M. Gonzalo Claros1,3

1 Supercomputación y Bioinformática-Plataforma Andaluza de Bioinformática (SCBI-PAB), Universidad de Málaga, 29071 Málaga, Spain
2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
3 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain

Correspondence should be addressed to M. Gonzalo Claros; [email protected]

Received 21 June 2013; Revised 18 September 2013; Accepted 19 September 2013

Academic Editor: Ivan Merelli

Copyright © 2013 Darío Guerrero-Fernández et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Current genomic analyses often require the managing and comparison of big data using desktop bioinformatic software that was not developed with multicore distribution in mind. The task-farm SCBI_MapReduce is intended to simplify the trivial parallelisation and distribution of new and legacy software and scripts for biologists who are interested in using computers but are not skilled programmers. In the case of legacy applications, there is no need to modify or rewrite the source code. It can be used from multicore workstations to heterogeneous grids. Tests have demonstrated that speed-up scales almost linearly and that distribution in small chunks increases it. It is also shown that SCBI_MapReduce takes advantage of shared storage when necessary, is fault-tolerant, allows for resuming aborted jobs, does not need special hardware or virtual machine support, and provides the same results as a parallelised, legacy software. The same is true for interrupted and relaunched jobs. As proof-of-concept, distribution of a compiled version of Blast+ in the SCBI_Distributed_Blast gem is given, indicating that other blast binaries can be used while maintaining the same SCBI_Distributed_Blast code. Therefore, SCBI_MapReduce suits most parallelisation and distribution needs in, for example, gene and genome studies.
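As a rough illustration of the task-farm idea described in this abstract (split the sequences into chunks, hand each chunk to a worker, gather the results in order), here is a minimal Python sketch. It is not the SCBI_MapReduce API, which is a Ruby gem: the function names, the GC-content toy task and the chunk size are all assumptions chosen for the example. Threads are used only to keep the sketch self-contained; a CPU-bound job would more likely use processes or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Yield consecutive chunks holding at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def gc_content(seq):
    """Toy per-sequence task (hypothetical): fraction of G and C bases."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def process_chunk(chunk):
    """Worker: apply the task to every sequence of one chunk."""
    return [gc_content(seq) for seq in chunk]


def task_farm(sequences, chunk_size=2, workers=4):
    """Distribute chunks among workers and gather results in input order."""
    chunks = list(split_into_chunks(sequences, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        chunk_results = list(pool.map(process_chunk, chunks))
    return [value for result in chunk_results for value in result]


if __name__ == "__main__":
    print(task_farm(["GGCC", "ATAT", "GCAT"]))
```

Smaller chunks give the scheduler more freedom to balance the load among workers, which is consistent with the abstract's remark that distribution in small chunks increases speed-up.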

1. Introduction

The study of genomes is undergoing a revolution: the production of an ever-growing amount of sequences increases year by year at a rate that outpaces computing performance [1]. This huge amount of sequences needs to be processed with the well-proven algorithms that will not run faster in new computer chips since around 2003 chipmakers discovered that they were no longer able to sustain faster sequential execution except for generating the multicore chips [2, 3]. Therefore, the only current way to obtain results in a timely manner is developing software dealing with multicore CPUs or clusters of multiprocessors. In such a context, “cloud computing” is becoming a cost-effective and powerful resource of multicore clusters for task distribution in bioinformatics [1, 2].

Sequence alignment and comparison are the most important topics in bioinformatic studies of genes and genomes. It is a complex process that tries to optimise sequence homology by means of sequence similarity using the algorithm of Needleman-Wunsch for global alignment, or the one of Smith-Waterman for local alignments. Blast and Fasta [4] are the most widespread tools that have implemented them. Paired sequence comparison is inherently a parallel process in which many sequence pairs can be analysed at the same time by means of functions or algorithms that are iteratively performed over sequences. This is impelling the parallelisation of sequence comparison algorithms [5–9] as well as other bioinformatic algorithms [10, 11].
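To make the local-alignment step concrete, the following Python sketch computes the Smith-Waterman score of one pair of sequences. It assumes a simple linear gap penalty and illustrative scoring values (match +2, mismatch -1, gap -2) that are not taken from the article, and it returns only the best score, not the alignment itself. Each call depends only on its own pair of sequences, which is why many pairs can be scored at the same time.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows of the matrix are kept in memory, which is enough when the score alone is needed.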

In most cases, the parallelised versions need to be rewritten from scratch, including explicit parallel programming

Picasso

Programming fundamentals

Page 58: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

SCBI_MapReduce: for parallelising and distributing

Efficient

Robust

Improves Blast performance

Page 59: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

So bioinformatics is not at odds with supercomputing

Red Española de Supercomputación

Picasso

Picasso: 2310 cores, 700 TB disk

• 7 FAT nodes of shared memory: 80 cores, 2 TB RAM, >25 GB/core

• Computing nodes: 984 cores, 4 TB RAM, 4 GB/core

• «Thin» nodes: 768 cores, 3 TB RAM, 8 GB/core

• GPU nodes: 32 GPU, 1 TB RAM, 8 GB/core

Page 60: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Picasso: a data centre (CPD) for supercomputing and bioinformatics

Hard disks

FAT nodes

Computing nodes

THIN nodes

More disks

GPU nodes

Page 61: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Why data-centre (CPD) infrastructures are a good thing

• Providing solid infrastructure for software and hardware

• More cost-efficient for large-scale projects

• Cost-effective (licenses, computers...)

• Including expensive software and multi-user licenses

• Specialization

• Collaboration with other research groups outside UMA

Editorial

The Need for Centralization of Computational Biology Resources
Fran Lewitter1*, Michael Rebhan2*, Brent Richter3*, David Sexton4*

1 Bioinformatics and Research Computing, Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, United States of America, 2 Novartis Institutes for BioMedical Research, Basel, Switzerland, 3 Enterprise Research IS and Informatics, Brigham and Women’s Hospital, Massachusetts General Hospital, and Partners Healthcare, Boston, Massachusetts, United States of America, 4 Center for Human Genetics Research, Computational Genomics Core, Vanderbilt University, Nashville, Tennessee, United States of America

Biomedical research is benefiting from the wealth of new data generated in the laboratory through new instrumentation, greater computational resources, and massive repositories of public domain data. Using these data to make scientific discoveries is sometimes straightforward, but can be complicated by the number and breadth of public sources available to the researcher as well as by the plethora of tools from which to choose. Complex searches, analyses, or even storage needs require more computational expertise than that available within an individual laboratory. As biomedical researchers develop more computational skills, this may change over time. Having a centralized group of experts in computational biology can be of great value to the experimental biologist, and, recognizing this, many organizations have invested in building a team of computational biologists, bioinformaticists, and research IT services to address the needs of the investigators. This Editorial presents our views on the benefits and challenges of centralizing these activities.

In order to benefit from expertise among existing teams of experts around the world, the “Bioinfo-Core” group was formed during the ISMB 2002 meeting in Edmonton, Canada, with approximately 25 initial members. Since then, the group has expanded in both organization and interest. Our worldwide membership now includes more than 150 people who administer centralized bioinformatics and research computing facilities within diverse organizations, including academia, independent research institutes, academic medical centers, and industry. Additionally, the group holds quarterly meetings via teleconference, continues an annual face-to-face meeting at ISMB (averaging 40–60 people), and hosts a mailing list and Wiki (http://www.bioinfo-core.org) to further communication.

Why Centralize?

Different institutions will have different names for these centralized resources (“core facility”, “platform”, etc.) and different responsibilities for the group based on size and organization. For the purposes of this Editorial and the accompanying Perspectives (doi:10.1371/journal.pcbi.1000368 and doi:10.1371/journal.pcbi.1000369), we use the term “Bioinformatics Core Facility” to refer to these centralized resources. No matter what name is used, the primary focus of the centralized resource will be to support the investigators with their computational needs. Below, we highlight some of the most important reasons we see for centralizing these resources.

Providing Infrastructure

It is important for an institution to have a solid infrastructure for both hardware and software. This is especially true with respect to funding opportunities. Specifically, having a solid computational and bioinformatics infrastructure may increase the probability of a grant award whose main scientific exploration is heavily data-driven. Furthermore, funding agencies are offering larger, more integrated, complex, and cross-institutional projects. These grants do not fund de novo technical infrastructure, but most times provide incremental improvements to existing infrastructure. In addition, granting agencies find that centralizing resources is far more cost-efficient for large-scale projects. This is especially true for NIH Program Projects and Center grants, Clinical and Translational Science Awards, and for institutional or departmental research initiatives.

On the software side, it can be economical to purchase multi-user, concurrent, or site licenses rather than individual licenses. This also helps with support of the software as purchasers of the larger licenses will likely be better prepared to field questions and offer training opportunities about installation and use of the software. In addition, the Bioinformatics Core Facility may be in a position to purchase expensive software that is used only occasionally by researchers, thus being able to provide more options for individuals to address important research needs.

Many researchers in an institution may have the same needs for custom software. A person working in a centralized facility can identify such shared needs and build a robust tool for use by many researchers within the institution. These specialized tools or software functions can be reused, and this increases their value to the organization. It also prevents the multiple re-invention of solutions within institutions.

Furthermore, solutions developed and implemented within a centralized facility can be leveraged by institutional enterprise projects. Development, evaluation, and live testing of infrastructure or applications for a specific project need not be ad hoc in some cases. Frameworks can be developed that can translate to enterprise-wide applications providing competitive advantages in translational science activities. If effective, these technologies can be translated into the larger enterprise as-is, or, with adjustment, to fit within existing implementations, additional requirements, or vendor solutions.

Citation: Lewitter F, Rebhan M, Richter B, Sexton D (2009) The Need for Centralization of Computational Biology Resources. PLoS Comput Biol 5(6): e1000372. doi:10.1371/journal.pcbi.1000372

Published June 26, 2009

Copyright: © 2009 Lewitter et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

* E-mail: [email protected] (FL); [email protected] (MR); [email protected] (BR); [email protected] (DS)

The order of authors is alphabetic; each author has contributed equally to the development and writing of this Editorial.

PLoS Computational Biology | www.ploscompbiol.org 1 June 2009 | Volume 5 | Issue 6 | e1000372

Page 62: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

How do you get access?

Web tools

Command line

Web interface

Web server

Virtual machines

Database

Home

Files

Virtual machine

File transfer

Page 63: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Bioinformatics is not limited to sequences and databases

Page 64: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Applications of bioinformatics and supercomputing

Page 65: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Discovering new drugs «used to be» extremely expensive

Classical method: every compound has to be synthesised and tested in animals

Bioinformatic method: only the candidates are synthesised, saving on synthesis, time and animals

Ligand database
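The contrast between the two methods can be sketched in a few lines of Python: instead of synthesising every entry of the ligand database, the compounds are ranked by a computationally predicted score and only the best few go on to synthesis. Everything here is hypothetical: the ligand names, the score values, and the assumption that a lower score means better predicted binding (as with docking energies) are made up for the example and do not come from the slide.

```python
def select_candidates(predicted_scores, top_n):
    """Return the top_n ligand names with the lowest predicted score."""
    ranked = sorted(predicted_scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:top_n]]


# Hypothetical ligand database with made-up predicted scores (lower is better).
ligand_scores = {
    "ligand_A": -9.1,
    "ligand_B": -4.3,
    "ligand_C": -7.8,
    "ligand_D": -5.0,
}

# Only these candidates would be synthesised and tested in the laboratory.
to_synthesise = select_candidates(ligand_scores, top_n=2)
print(to_synthesise)
```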

Page 67: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

It earned the Nobel Prize in Chemistry in 2013

For the development of computational models to understand and predict chemical processes

Theoretical chemist, biophysicist, biochemist

http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/

This Nobel Prize is the first given to work in computational biology, indicating that the field has matured and is on a par with experimental biology

The blog of PLOS Computational Biology

Page 68: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Systems biology reveals the keys

Cellular regulation becomes more complicated as the complexity of the organism increases

Page 69: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations.

Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition.

The number of TFs varies among organisms, although it appears to be linked to the organism’s complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the growth of certain transcription factor families [25–28]. Here, a domain can be defined as a protein substructure that can fold independently into a compact structure. Different domains of a protein are often associated with different functions [29,30].

When dealing with TF networks, several relevant questions arise. How are these factors distributed and related through the network structure? How important has the protein domain universe been in shaping the network? Analysis of global patterns of network organization is required to answer these questions.

To this end, we explored, for the first time, the human transcription factor network (HTFN) obtained from the protein–protein interaction information available in the TRANSFAC database [31], using novel tools of network analysis. We show that this approximation allows us to propose evolutionary considerations concerning the mechanisms shaping network architecture.

Results and Discussion

Topological analysis

Data compilation from the TRANSFAC transcription factor database provided 1370 human entries. After filtering according to criteria given in Experimental Procedures, a graph of N = 230 interacting human TFs was obtained (Fig. 1). This can be understood as the architecture of the regulatory backbone. It provides a topological view of the interaction patterns among the elements responsible for gene expression. This corresponds to the protein hardware that carries out genomic instructions. The remaining TFs contained in the database did not form subgraphs and appeared isolated. The relatively small size of the connected graph compared with all the entries in the database might be due, at least in part, to the current degree of knowledge of this transcriptional regulatory network, with only sparse data for many of its components. Although a number of possible sources of bias are present, it is worth noting that the topological pattern of organization reported from different sources of protein–protein interactions seems consistent [32].

Topological analysis of HTFN is summarized in Table 1 showing that HTFN is a sparse, small-world graph. The degree distribution (Fig. 2A) and clustering (Fig. 2B) show a heterogeneous, skewed shape reminding us of a power–law behaviour, indicating that most TFs are linked to only a few others, whereas a handful of them have many connections. The average betweenness centrality (b) shows well-defined power–law

Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NF-κB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.
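The two topological measures mentioned in the text, the degree of a node and its clustering coefficient, can be sketched in plain Python as follows. This is a minimal illustration on a made-up five-node graph, not the authors' analysis: the node names and edges are hypothetical and do not represent TRANSFAC interactions. The degree is the number of interaction partners; the clustering coefficient is the fraction of pairs of partners that also interact with each other.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    graph = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            graph[u].add(v)
            graph[v].add(u)
    return graph


def degree(graph, node):
    """Number of distinct interaction partners of a node."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of neighbour pairs of a node that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: one highly connected node and four partners.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
toy_graph = build_graph(toy_edges)

if __name__ == "__main__":
    for name in sorted(toy_graph):
        print(name, degree(toy_graph, name), clustering_coefficient(toy_graph, name))
```

In this toy graph the hub has the highest degree but a low clustering coefficient, the kind of pattern that a skewed degree distribution with a handful of highly connected nodes produces.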

Human transcription factor network topology C. Rodriguez-Caso et al.

FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors. Journal compilation © 2005 FEBS

It tells us which proteins are best left untouched

Topology, tinkering and evolution of the human transcription factor network
Carlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are defined by complex networks of interacting components. Such networks show high heterogeneity and are typically modular and hierarchical [2,3]. Genome-wide gene expression and protein analyses provide new, powerful tools for the study of such complex biological phenomena [4–6] and new, more integrative views are required to properly interpret them [7]. Such an integrative approach is obtained by mapping molecular interactions into a network, as is the case for metabolic and signalling pathways. In this context, biological databases provide a unique opportunity to characterize biological networks under a systems perspective.

Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae [9] and, more recently, in Drosophila melanogaster [13] and Caenorhabditis elegans [10]. The networks have a nontrivial organization that departs strongly from simple, random homogeneous metaphors [2]. The network structure involves a nested hierarchy of levels, from large-scale features to modules and motifs [1,14]. This is particularly true for protein interaction maps and gene regulatory nets, which different evolutionary forces from convergent evolution [15] to dynamical constraints [16,17] have helped shape. In this context, protein–protein interactions play an essential role in regulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein interaction; tinkering; transcription factor network

Correspondence

Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.

Abbreviations

ER, Erdős–Rényi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.

FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors Journal compilation © 2005 FEBS

or via control of TF expression, less connected factors may also be relevant to cell survival.

Functional and structural patterns from topology

In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the ‘beta-scaffold factor with minor groove contact’ (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.

Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains able to bind one another, as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Attending to this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].

From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
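The link between shared self-interacting domains and high clustering can be illustrated with a minimal sketch. It assumes the standard local clustering coefficient (the fraction of a node's neighbour pairs that are themselves linked); the two toy networks and their node names are hypothetical and are not taken from the HTFN data.

```python
from itertools import combinations


def build(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    neighbours = adj[node] - {node}  # self-interactions are ignored here
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Hypothetical module: four factors sharing one dimerization domain,
# so every pair can interact (a fully connected subgraph).
module = build(combinations(["A", "B", "C", "D"], 2))

# Hypothetical star: one hub binds four factors that do not bind each other.
star = build([("hub", x) for x in ["A", "B", "C", "D"]])

print(average_clustering(module))  # 1.0
print(average_clustering(star))    # 0.0
```

In the fully connected toy module every neighbour pair is linked, whereas the star has the same number of nodes around a hub but no triangles, which is why only the first contributes to a high clustering value.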

We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].
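The exact form of the topological overlap used to define the modules is not given in this passage. As a rough sketch, one commonly used definition scores a pair of nodes by the number of neighbours they share (plus one if they are directly linked), divided by the smaller of their two degrees. That definition, the toy network and the node names below are assumptions for illustration only.

```python
def build(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def topological_overlap(adj, i, j):
    """Assumed definition: (common neighbours + direct link) / min(degree_i, degree_j)."""
    common = len((adj[i] & adj[j]) - {i, j})
    direct = 1 if j in adj[i] else 0
    k_min = min(len(adj[i] - {i}), len(adj[j] - {j}))
    return (common + direct) / k_min if k_min else 0.0


# Hypothetical network: a triangle A-B-C with a tail C-D-E.
toy = build([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(topological_overlap(toy, "A", "B"))  # 1.0: linked and sharing neighbour C
print(topological_overlap(toy, "A", "D"))  # 0.5: not linked, sharing neighbour C
print(topological_overlap(toy, "A", "E"))  # 0.0: nothing in common
```

Pairs with high overlap would be grouped first by a hierarchical clustering of the resulting matrix, which is how such a score can yield nested modules.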

An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity

Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).

TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
P300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2


At least 9 transcription factors cause cancer when mutated, without exception

Systems biology

Page 70: Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA

Breast cancer biomarker genes deduced through bioinformatic analyses

Page 71: Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA

We do this at the UMA with breast cancer miRNAs

A microRNA Signature Associated with Early Recurrence in Breast Cancer

Luis G. Perez-Rivas1,†, Jose M. Jerez2,†, Rosario Carmona3, Vanessa de Luque1, Luis Vicioso4, M. Gonzalo Claros3,5, Enrique Viguera6, Bella Pajares1, Alfonso Sanchez1, Nuria Ribelles1, Emilio Alba1, Jose Lozano1,5,*

1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain

2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain

3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, Spain

4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain

5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain

6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de Málaga, Málaga, Spain

Abstract

Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value < 0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.

Citation: Perez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoS ONE 9(3): e91884. doi:10.1371/journal.pone.0091884

Editor: Sonia Rocha, University of Dundee, United Kingdom

Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014

Copyright: © 2014 Perez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a grant from the Spanish Society of Medical Oncology (SEOM, to NR) and by grants from the Spanish Ministerio de Economía (SAF2010-20203 to J.L and TIN2010-16556 to J.J) and from the Junta de Andalucía (TIN-4026, to JJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

† These authors contributed equally to this work.

Introduction

Breast cancer comprises a group of heterogeneous diseases that can be classified based on both clinical and molecular features [1–5]. Improvements in the early detection of primary tumors and the development of novel targeted therapies, together with the systematic use of adjuvant chemotherapy, have drastically reduced mortality rates and increased disease-free survival (DFS) in breast cancer. Still, about one third of patients undergoing breast tumor excision will develop metastases, the major life-threatening event which is strongly associated with poor outcome [6,7].

The risk of relapse after tumor resection is not constant over time. A detailed examination of large series of long-term follow-up studies over the last two decades reveals a bimodal hazard function with two peaks of early and late recurrence occurring at 1.5 and 5 years, respectively, followed by a nearly flat plateau in which the risk of relapse tends to zero [8–10]. A causal link between tumor surgery and the bimodal pattern of recurrence has been proposed by some investigators (i.e. an iatrogenic effect) [11]. According to that model, surgical removal of the primary breast tumor would accelerate the growth of dormant metastatic foci by altering the balance between circulating pro- and anti-angiogenic factors [9,11–14]. This hypothesis is supported by the fact that the two peaks of relapse are observed regardless of factors other than surgery, such as the axillary nodal status, the type of surgery or the administration of adjuvant therapy. Although estrogen receptor (ER)-negative tumors are commonly associated with a higher risk of early relapse [15], the bimodal distribution pattern is observed independently of the hormone receptor status [16]. Other studies also suggest that the dynamics of tumor relapse may be a

PLOS ONE | www.plosone.org 1 March 2014 | Volume 9 | Issue 3 | e91884

Molecular biology

Microarrays Databases Tools and algorithms…

Microarrays Data mining

with tumors from relapse-free patients (group A, Table 2). MiR-625 was excluded from any further studies since RT-qPCR data showed minimal variation between groups (FC < 2). Next, we re-clustered the 71 tumors based on the 5-miRNA signature. As shown in Figure 2, tumors from groups A and B were clearly segregated in two distinct clusters, which included most of the expected samples in each category: 78.8% group A in cluster 1b (low risk) and 70.4% group B in cluster 2b (high risk). Of note, the supervised analysis included most tumors from group C (72.8%) in cluster 1b, indicating that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence.

Figure 2. A 5-miRNA signature is associated with early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples based on expression of the 5-miRNA signature. Note that lower expression levels of the 5-miRNA signature define a distinct cluster 2b which mainly includes tumors from "high risk" patients (group B). On the contrary, most patients with good prognosis (group A) had tumors with normal or higher-than-normal levels of the 5-miRNA signature, defining a different cluster 1b ("low risk"). doi:10.1371/journal.pone.0091884.g002

Figure 3. The 5-miRNA signature discriminates patients with different RFS. A) Kaplan-Meier graph for the whole patient cohort included in this study. B) Those patients whose tumors showed an overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) were classified as "high risk" and their cumulative RFS was calculated (red line). RFS was also calculated for the remaining patients in the cohort ("low risk", black line). The Kaplan-Meier plot shows that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence. doi:10.1371/journal.pone.0091884.g003

The 5-miRNA signature

MiR-149 was the most significant miRNA downregulated in group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR that regulates the expression of genes associated with cell cycle, invasion or migration and its downregulation has been observed in several tumor diseases, including gastric cancer and breast cancer [70,77–81]. Down-regulation of miR-149 can occur epigenetically, by hypermethylation of the neighbouring CpG island [80] or by impaired processing of the pri-miR-149 precursor, in a polymorphic variant [79]. In a recent work, downregulation of miR-149 has been associated with elevated levels of the transcription factor SP1, increased invasiveness and lower 5-year survival in colorectal cancer [80]. The p53 repressor ZBTB2 is also a target of miR-149 [81], which could explain, at least partially, its function as a TS-miR.

MiR-30a-3p is a member of the miR-30 family, which is associated with mesenchymal and stemness features [82,83] and is downregulated in several types of cancer [84–86]. Recently, Rodriguez-Gonzalez et al. have linked low levels of this miRNA to tamoxifen resistance in ER+ breast tumors. They have also proposed several targets of miR-30a-3p involved in proliferation and apoptosis, such as BCL2, NFκB, MAP2K4, PDGFA, CDK5R1 and CHN1 [87].

Regarding miR-20b, this miRNA is part of the miR-106b-363 cluster, which is frequently deregulated in cancer [88–91]. The levels of miR-20b associate with histological grade in breast cancer [92,93]. This miRNA has been involved in regulating several key proteins such as ESR1, HIF-1α, VEGF or STAT3 [92,94,95]. In particular, because it targets both HIF-1α and VEGF and HIF-1α negatively controls miR-20b levels, it has been defined as an anti-angiogenic miRNA [95].

Both oncogenic and tumor suppressor features have been reported for miR-10a [96]. Thus, reduced expression of miR-10a has been associated with MAP3K7- and βTRC-mediated activation of the proinflammatory NFκB pathway [97]. Also, miR-10a downregulation represses differentiation in part by deregulation of the histone deacetylase HDAC4 [98] and positively affects invasiveness by de-repressing several members of the homeobox family of transcription factors [99].

Regarding miR-342-5p, it appears significantly deregulated only when we compare B vs AC (Table 2). Together with its counterpart (miR-342-3p), it is deregulated in inflammatory breast cancer [74] and its low expression has been associated with lower post-recurrence survival [100], likely because it targets AKT1 mRNA [101].

In sum, the available bibliographic data suggest that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR-342-5p in primary breast tumors could confer on them enhanced proliferative, angiogenic and invasive potentials.

Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors according to their 5-miRNA signature status into two groups. One group included those tumors with all five miRNAs simultaneously downregulated (FC > 2 and p < 0.05) and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a "high risk" group of patients with a shorter RFS (Peto-Peto test with p-value = 0.02, when comparing the low vs high risk groups).
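To make the Kaplan-Meier step concrete, the following is a minimal sketch of the product-limit estimator in plain Python. It is not the authors' code: the follow-up times and relapse flags are invented toy values, the function name `kaplan_meier` is hypothetical, and the Peto-Peto comparison between groups is not implemented here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of relapse-free survival.

    times  -- follow-up time of each patient (e.g. months)
    events -- 1 if relapse was observed at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    relapses = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # patients leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = relapses.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who did not relapse.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # relapsed and censored patients both leave
    return curve


# Toy cohort (assumed values, NOT data from the study).
toy_times = [6, 12, 12, 20, 24, 30, 55, 60]
toy_events = [1, 1, 0, 1, 0, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for month, s in toy_curve:
    print(f"month {month}: RFS = {s:.3f}")
```

Running the estimator separately on the "all five miRNAs downregulated" group and on the remaining tumors would give the two curves that a test such as Peto-Peto then compares.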

Using a Cox proportional hazard regression model, we also tested all possible combinations of different covariates (tumor subtype, patient age, tumor size, number of lymph nodes affected and the 5-miRNA signature) with early relapse (≤24 months) to identify the best prognostic factors. The best model according to the AIC criterion included the tumor size and expression of the 5-miRNA signature (data not shown). Only the 5-miRNA signature (all five miRNAs downregulated) was statistically significant in the Cox model for the high risk group (p-value = 0.02 with HR = 2.73, 95% CI: 1.17–6.36). The 5-miRNA expression data were also used to develop a predictor model through bootstrapping over a Naive Bayes classifier (B = 200 with N = 71, see methods). The prognostic accuracy of the models was assessed by a receiver operating characteristic (ROC) test (Figure 4). Considered individually, miR-30a-3p and miR-10a showed a strikingly high Area Under the Curve (AUC) score (0.890 and 0.875, respectively). This result suggests that mRNA targets regulated by miR-30a-3p and miR-10a could potentially make a greater contribution to the final outcome of the disease. However, the 5-miRNA signature had the strongest predictive value to discriminate tumors from patients that will develop early relapse (group B) from those that will remain free of disease (group A), with an AUC = 0.993 (Figure 4). In summary, the 5-miRNA signature performs well as a risk predictor for early breast cancer recurrence.
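The AUC values quoted above can be read as the probability that a randomly chosen early-relapse tumor receives a higher risk score than a randomly chosen disease-free one. The sketch below computes that quantity directly from pairwise comparisons. It is an illustration only: the scores and labels are invented, `roc_auc` is a hypothetical helper, and it does not reproduce the bootstrapped Naive Bayes model described in the text.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- predicted risk score per sample (higher = more likely to relapse)
    labels -- 1 for early relapse, 0 for disease-free
    Ties between a positive and a negative sample count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy scores (assumed values, NOT results from the study).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two groups, which is why a value such as 0.993 indicates almost complete discrimination within the cohort studied.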

Figure 4. Receiver operating characteristic curve (ROC) for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between patients who will develop early recurrence and those who will remain free of disease. Note that, although miR-30a-3p and miR-10a individually have a high area under the curve (AUC) score, the 5-miRNA signature has the strongest predictive value (AUC = 0.993) to discriminate those patients likely to recur early (group B in our cohort).
doi:10.1371/journal.pone.0091884.g004

A miRNA Signature Predictive of Early Recurrence

PLOS ONE | www.plosone.org 8 March 2014 | Volume 9 | Issue 3 | e91884

Candidate targets for the 5-miRNA signature. To extend our set of five miRNAs with regulatory information, we next took advantage of the existing public databases curating predicted and validated miRNA-target interactions (MTIs). In particular, validated targets were obtained from the miRTarBase and miRecords repositories (see methods). First, we created a biological network in Cytoscape [66] containing all the individual miRNAs included in the 5-miRNA signature (miR-149, miR-20b, miR-10a, miR-30a-3p and miR-342-5p). Next, we extended the network by adding H. sapiens MTI data retrieved from the indicated repositories and, finally, extended regulatory interaction networks (RIN) were generated and visualized in Cytoscape. Each regulatory interaction in the network consists of two nodes, a regulatory component (miRNA) and a target biomolecule (mRNA), connected through one directed edge. Figure 5 shows the extended network when the RIN threshold was set to 1 (i.e. each predicted target appears in at least one RIN). Thus, at RIN = 1 the network included 14 validated targets assigned to miR-20b (VEGFA, BAMBI, EFNB2, MYLIP, CRIM1, ARID4B, HIF1A, HIPK3, CDKN1A, PPARG, STAT3, MUC17, EPHB4, and ESR1), 7 validated targets assigned to miR-10a (HOXA1, NCOR2, SRSF1, SRSF10/TRA2B, MAP3K7, USF2 and BTRC) and 9 validated targets assigned to miR-30a-3p (THBS1, VEZT, TUBA1A, CDK6, WDR82, TMEM2, KRT7, CYR61 and SLC7A6) (Figure 5). Taking these results into account and considering that i) the extended network was constructed with the 5-miRNA signature as the network nodes and ii) all MTIs depicted in Figure 5 have been experimentally verified, we suggest that at least some of the 30 mRNAs (Figure 5) could be regulated in vivo by the 5-miRNA signature in early-relapsing tumors.
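The network described here is a set of directed miRNA-to-mRNA edges, filtered by how many databases support each edge. A minimal sketch of that data structure in Python follows. The target lists are the ones quoted in the text; the `filter_by_rin` helper and the `toy_evidence` mapping are hypothetical illustrations, since the excerpt does not say which database supports which interaction, and the actual analysis was done in Cytoscape.

```python
# Validated targets as listed in the text (SRSF10/TRA2B is kept as one entry,
# matching the count of 7 targets given for miR-10a).
validated_targets = {
    "miR-20b": ["VEGFA", "BAMBI", "EFNB2", "MYLIP", "CRIM1", "ARID4B", "HIF1A",
                "HIPK3", "CDKN1A", "PPARG", "STAT3", "MUC17", "EPHB4", "ESR1"],
    "miR-10a": ["HOXA1", "NCOR2", "SRSF1", "SRSF10/TRA2B", "MAP3K7", "USF2",
                "BTRC"],
    "miR-30a-3p": ["THBS1", "VEZT", "TUBA1A", "CDK6", "WDR82", "TMEM2", "KRT7",
                   "CYR61", "SLC7A6"],
    "miR-149": [],      # no validated targets in the database versions used
    "miR-342-5p": [],   # idem
}

# One directed edge per interaction: (regulator miRNA, target mRNA).
edges = [(mirna, target)
         for mirna, targets in validated_targets.items()
         for target in targets]
out_degree = {mirna: len(targets) for mirna, targets in validated_targets.items()}


def filter_by_rin(evidence, threshold):
    """Keep the edges supported by at least `threshold` databases.

    evidence -- dict mapping (miRNA, mRNA) edges to the set of databases
                reporting them (hypothetical structure for this sketch).
    """
    return [edge for edge, databases in evidence.items()
            if len(databases) >= threshold]


# Placeholder evidence with made-up names, only to show the threshold logic.
toy_evidence = {
    ("miR-X", "GENE1"): {"miRTarBase"},
    ("miR-X", "GENE2"): {"miRTarBase", "miRecords"},
}

print(len(edges), "edges;", out_degree)
print(filter_by_rin(toy_evidence, 1))
print(filter_by_rin(toy_evidence, 2))
```

With a threshold of 1 every edge reported by at least one database is kept, which is the setting used for the 30-target network; raising the threshold keeps only interactions confirmed by several sources.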

To gain further insight into the molecular basis of the 5-miRNA signature prognostic value, we investigated the biological pathways associated with the 30 experimentally verified targets from Figure 5. To that end, we searched for Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the 30 targets as a whole set. It should be noted, however, that our restrictive approach (including only experimentally validated miRNA targets) left miR-149 and miR-342-5p out of the GO analysis and therefore, additional biological pathways could be affected by downregulation of the 5-miRNA

Figure 5. Prediction of mRNA targets likely to be regulated by the 5-miRNA signature. Biological networks were created using the Cytoscape software. Each network includes two types of nodes: the five individual miRNAs included in the 5-miRNA signature and their predicted mRNA targets (yellow circles), obtained from two different public databases (miRTarBase and miRecords). The number of databases included in the analysis defines the regulatory interaction network (RIN) threshold. Thus, at RIN = 1 the network includes all mRNA targets that appear in at least one database. The databases included in the RIN are identified by the color of the connecting arrows: miRTarBase (blue) and miRecords (red). Although many mRNAs are potential targets for miR-149 and miR-342-5p, the miRTarBase and miRecords versions included in this study did not reveal any experimentally validated targets for these two miRNAs.
doi:10.1371/journal.pone.0091884.g005


Page 72: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

Bioinformatics explains some observations

Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses

Kristina Ibanez (1, .), Cesar Boullosa (1, .), Rafael Tabares-Seisdedos (2), Anaïs Baudot (3, *), Alfonso Valencia (1, *)

1 Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain, 2 Department of Medicine, University of Valencia, CIBERSAM, INCLIVA, Valencia, Spain, 3 Aix-Marseille Université, CNRS, I2M, UMR 7373, Marseille, France

Abstract

There is epidemiological evidence that patients with certain Central Nervous System (CNS) disorders have a lower than expected probability of developing some types of Cancer. We tested here the hypothesis that this inverse comorbidity is driven by molecular processes common to CNS disorders and Cancers, and that are deregulated in opposite directions. We conducted transcriptomic meta-analyses of three CNS disorders (Alzheimer's disease, Parkinson's disease and Schizophrenia) and three Cancer types (Lung, Prostate, Colorectal) previously described with inverse comorbidities. A significant overlap was observed between the genes upregulated in CNS disorders and downregulated in Cancers, as well as between the genes downregulated in CNS disorders and upregulated in Cancers. We also observed expression deregulations in opposite directions at the level of pathways. Our analysis points to specific genes and pathways, the upregulation of which could increase the incidence of CNS disorders and simultaneously lower the risk of developing Cancer, while the downregulation of another set of genes and pathways could contribute to a decrease in the incidence of CNS disorders while increasing the Cancer risk. These results reinforce the previously proposed involvement of the PIN1 gene, Wnt and P53 pathways, and reveal potential new candidates, in particular related with protein degradation processes.

Citation: Ibanez K, Boullosa C, Tabares-Seisdedos R, Baudot A, Valencia A (2014) Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses. PLoS Genet 10(2): e1004173. doi:10.1371/journal.pgen.1004173

Editor: Marshall S. Horwitz, University of Washington, United States of America

Received September 16, 2013; Accepted December 30, 2013; Published February 20, 2014

Copyright: © 2014 Ibanez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a Fellowship from Obra Social la Caixa grant to KI (http://obrasocial.lacaixa.es/laCaixaFoundation/home_en.html), FPI grant BES-2008-006332 to CB and grant BIO2012 to AV Group. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (AB); [email protected] (AV)

. These authors contributed equally to this work.

Introduction

Epidemiological evidence points to a lower-than-expected probability of developing some types of Cancer in certain CNS disorders, including Alzheimer's disease (AD), Parkinson's disease (PD) and Schizophrenia (SCZ) [1–6]. Our current understanding of such inverse comorbidities suggests that this phenomenon is influenced by environmental factors, drug treatments and other aspects related with disease diagnosis. Genetics can additionally contribute to the inverse comorbidity between complex diseases, together with these external factors (for review, see [3–7]). In particular, we propose the deregulation in opposite directions of a common set of genes and pathways as an underlying cause of inverse comorbidities.

To investigate the biological plausibility of this hypothesis, a basic initial step is to establish the existence of inverse gene expression deregulations (i.e., down- versus up-regulations) in CNS disorders and Cancers. Towards this objective, we have performed integrative meta-analyses of collections of gene expression data, publicly available for AD, PD and SCZ, and Lung (LC), Colorectal (CRC) and Prostate (PC) Cancers. Clinical and epidemiological data previously reported inverse comorbidities for these complex disorders, according to population studies assessing the Cancer risks among patients with CNS disorders [8–17].

Results and Discussion

For each CNS disorder and Cancer type independently, we undertook meta-analyses from a large collection of microarray gene expression datasets to identify the genes that are significantly up- and down-regulated in disease when compared with their corresponding healthy control samples (Differentially Expressed Genes, DEGs; FDR corrected p-value (q-value) < 0.05, see Methods and Table S1). Then, the DEGs of the CNS disorders and Cancer types were compared to each other. There were significant overlaps (Fisher's exact test, corrected p-value (q-value) < 0.05, see Methods) between the DEGs upregulated in CNS disorders and those downregulated in Cancers. Similarly, DEGs downregulated in CNS disorders overlapped significantly with DEGs upregulated in Cancers (Figure 1A). Significant overlaps between DEGs deregulated in opposite directions in CNS disorders and Cancers are still observed when setting more stringent cutoffs for the detection of DEGs (q-values lower than 0.005, 0.0005, 0.00005 and 0.000005, Figure S1). A significant overlap between DEGs deregulated in the same direction was only identified in the case of CRC and PD upregulated genes (Figure 1A).
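The overlap test described above asks whether two gene lists share more genes than expected by chance. A minimal sketch of a one-sided Fisher's exact test for that question, followed by a Bonferroni correction, is shown below in plain Python (3.8 or later, for `math.comb`). The function names, the gene-universe size and the counts are assumptions chosen for illustration; they are not values from the paper.

```python
from math import comb


def overlap_pvalue(n_universe, n_a, n_b, n_overlap):
    """One-sided Fisher's exact test for the overlap of two gene lists.

    Returns P(overlap >= n_overlap) when list B (n_b genes) is drawn at random
    from a universe of n_universe genes, n_a of which belong to list A.
    """
    total = comb(n_universe, n_b)
    largest = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_universe - n_a, n_b - k)
               for k in range(n_overlap, largest + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers (assumed): 20 measured genes, 5 upregulated in a CNS disorder,
# 5 downregulated in a cancer type, 4 genes in common.
toy_p = overlap_pvalue(n_universe=20, n_a=5, n_b=5, n_overlap=4)
toy_q = bonferroni(toy_p, n_tests=9)   # hypothetical number of comparisons
print(f"p = {toy_p:.4g}, Bonferroni-corrected = {toy_q:.4g}")
```

In practice the universe would be the set of genes measured in both meta-analyses, and the correction factor would be the number of disease-pair comparisons actually performed.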

A molecular interpretation of the inverse comorbidity between CNS disorders and Cancers could be that the downregulation of certain

PLOS Genetics | www.plosgenetics.org 1 February 2014 | Volume 10 | Issue 2 | e1004173


Figure 1. Comparisons of Differentially Expressed Genes (DEGs). (A) Comparisons of DEGs associated with Central Nervous System (CNS) disorders and Cancers. The DEGs identified as significantly up- and down-regulated (q-value < 0.05) after gene expression meta-analysis in each CNS disorder (Alzheimer's Disease, AD; Parkinson's Disease, PD; and Schizophrenia, SCZ) and Cancer type (Lung Cancer, LC; Colorectal Cancer, CRC; and Prostate Cancer, PC) are compared to each other. (B) Comparisons of DEGs between CNS disorders, Cancers and Asthma, HIV, Malaria, Dystrophy, Sarcoidosis. The DEGs identified as significantly up- and down-regulated (q-value < 0.05) after gene expression meta-analysis in each CNS disorder (Alzheimer's Disease, AD; Parkinson's Disease, PD; and Schizophrenia, SCZ), Cancer type (Lung Cancer, LC; Colorectal Cancer, CRC; and Prostate Cancer, PC), and in Asthma, HIV, Malaria, Dystrophy and Sarcoidosis, are compared to each other. Cells are coloured according to the significance of the overlaps (Fisher's exact test, Bonferroni correction for multiple testing, see Methods). Grey cells correspond to non-significant overlaps (q-value > 0.05).
doi:10.1371/journal.pgen.1004173.g001

Table 1. DEGs significantly downregulated in the three CNS disorders and upregulated in the three Cancer types (q-value < 0.05).

PPIAP11, IARS, GGCT, NME2, GAPDHP1, CDC123, PSMD8, MRPS33, FIBP, OAZ2, IARS2, SLC35B1, APOO, TMEM189-UBE2V1, VDAC1, TMED3, SMS, DNM1L, PRPS1, SRSF2, TMEM14D, TOMM70A, ATP6V1C1, NUP93, MRPL15, UBA5, PPIH, SMYD3, NIT2, SRD5A1, NUDT21, MRPL12, EEF1E1, MRPS7, TTPAL, BZW1P2, RP11-552M11.4, TSN, MECR, ZWINT, RPRD1A, UCHL5, NHP2P2, TFB2M, FEN1, CGREF1, IMPAD1, ARL1, ACLY, MRPL42, LSM4, KPNA1, TIMM23B, RP11-164O23.5, RP11-762H8.2, FARSA, MRPL4, API5, RP3-425P12.4, RFC3, RANBP9, TFCP2, GMDS, CCNB1, TMEM177, GUF1, HSPA13, NMD3, GCFC2, TUBGCP5, TBCE, YKT6, PHF14, BRCC3

doi:10.1371/journal.pgen.1004173.t001

Inverse Comorbidity among Cancer and CNS Disorders

PLOS Genetics | www.plosgenetics.org 3 February 2014 | Volume 10 | Issue 2 | e1004173

Comparison of differentially expressed genes

Workflow

It was already known that Alzheimer's patients develop cancer less often than the rest of the population

The workflow

Page 73: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The result is clear to see

AD and PD, and upregulated in CRC (Reactome database; Figure S2).

Aside from the Wnt and p53 pathways, our analysis reveals other pathways related to protein folding and protein degradation displaying patterns of downregulation in CNS disorders and upregulation in Cancers, and that may be relevant for inverse comorbidity. For instance, the Ubiquitin/Proteasome system is consistently downregulated in CNS disorders and upregulated in Cancers according to the three pathway databases analyzed (Figure 2, Figure S2, Table S3). The inverse relationship between the levels of expression deregulations of these pathways possibly suggests opposite roles in CNS disorders and Cancers.

A detailed examination of the KEGG pathways deregulated in opposite directions in CNS disorders and Cancers finally revealed that 89% of the KEGG pathways that were upregulated in Cancers and downregulated in CNS disorders are related to Metabolism and Genetic Information Processing (Figure 2, Figure 3). By contrast, the pathways downregulated in Cancers and upregulated in CNS disorders are related to the cell's communication with its environment (Environmental Information Processing and Organismal Systems; Figure 2, Figure 3). Hence, global regulations of cellular activity may account for a protective effect between inversely comorbid diseases.

Table 2. DEGs significantly upregulated in the three CNS disorders and downregulated in the three Cancer types (q-value < 0.05).

MT2A, MT1X, NFKBIA, AC009469.1, DHRS3, CDKN1A, TNFRSF1A, CRYBG3, IL4R, MT1M, FAM107A, ITPKC, MID1, IL11RA, AHNAK, KAT2B, BCL2, PTH1R, NFASC

doi:10.1371/journal.pgen.1004173.t002

Figure 2. KEGG pathways significantly deregulated in Central Nervous System (CNS) disorders and Cancer types. KEGG pathways [24] significantly up- and downregulated in each disease were identified using the GSEA method [34] (q-value < 0.05). The significant pathways were compared between the 6 diseases and combined in a network representation. Node pie charts are coloured according to the pathway status as Cancer upregulated (yellow), Cancer downregulated (blue), CNS disorder upregulated (green) and CNS disorder downregulated (red). The green/blue and yellow/red associations thus correspond to pathways deregulated in opposite directions in CNS disorders and Cancers. Pathway labels are coloured according to their classifications provided by KEGG [24], as: Metabolism (green), Genetic Information Processing (yellow), Cellular Process (pink), Environmental Information Processing (red) and Organismal Systems (dark red). All networks are available at bioinfo.cnio.es/people/cboullosa/validation/cytoscape/Ibanezetal.zip, in cytoscape format (http://www.cytoscape.org/).
doi:10.1371/journal.pgen.1004173.g002


Cancer (prostate, colorectal, lung) shares 93 genes with diseases of the central nervous system (Parkinson's, Alzheimer's, schizophrenia)

• 74 genes: ↑↑ in cancer, ↓↓ in the diseased CNS
• 19 genes: ↓↓ in cancer, ↑↑ in the diseased CNS

[Venn diagram labels: genes exclusive to cancer; genes exclusive to the diseased CNS]

Page 74: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The most striking short-term application

• Some antidepressant drugs could be used as anticancer medicines
• Some antineoplastic drugs can be used against CNS diseases
• Bexarotene (a skin cancer drug) is effective for treating Alzheimer's disease in mice

http://esmateria.com/2014/02/20/iluminado-el-blindaje-contra-el-cancer-de-personas-con-otras-enfermedades-en-el-cerebro/

Page 75: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

And it did not take long: 31 March 2014

Page 76: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

The genome does not let us predict the organism

?

Page 77: Bioinformatics and supercomputing. Reasons to become a bioinformatician at UMA

We are starting to infer appearance from the genome

Modeling 3D Facial Shape from DNA

Peter Claes (1), Denise K. Liberton (2), Katleen Daniels (1), Kerri Matthes Rosana (2), Ellen E. Quillen (2), Laurel N. Pearson (2), Brian McEvoy (3), Marc Bauchet (2), Arslan A. Zaidi (2), Wei Yao (2), Hua Tang (4), Gregory S. Barsh (4,5), Devin M. Absher (5), David A. Puts (2), Jorge Rocha (6,7), Sandra Beleza (4,8), Rinaldo W. Pereira (9), Gareth Baynam (10,11,12), Paul Suetens (1), Dirk Vandermeulen (1), Jennifer K. Wagner (13), James S. Boster (14), Mark D. Shriver (2, *)

1 Medical Image Computing, ESAT/PSI, Department of Electrical Engineering, KU Leuven, Medical Imaging Research Center, KU Leuven & UZ Leuven, iMinds-KU Leuven Future Health Department, Leuven, Belgium, 2 Department of Anthropology, Penn State University, University Park, Pennsylvania, United States of America, 3 Smurfit Institute of Genetics, Dublin, Ireland, 4 Department of Genetics, Stanford University, Palo Alto, California, United States of America, 5 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America, 6 CIBIO: Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Porto, Portugal, 7 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal, 8 IPATIMUP: Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal, 9 Programa de Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brasil, 10 School of Paediatrics and Child Health, University of Western Australia, Perth, Australia, 11 Institute for Immunology and Infectious Diseases, Murdoch University, Perth, Australia, 12 Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, Australia, 13 Center for the Integration of Genetic Healthcare Technologies, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America, 14 Department of Anthropology, University of Connecticut, Storrs, Connecticut, United States of America

Abstract

Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers.

Citation: Claes P, Liberton DK, Daniels K, Rosana KM, Quillen EE, et al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genet 10(3): e1004224. doi:10.1371/journal.pgen.1004224

Editor: Daniela Luquetti, Seattle Children’s Research Institute, United States of America

Received September 12, 2013; Accepted January 22, 2014; Published March 20, 2014

Copyright: © 2014 Claes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This investigation was supported by grants to MDS from Science Foundation of Ireland Walton Fellowship (04.W4/B643); to MDS and DAP from the National Institute of Justice (2008-DN-BX-K125); to JKW from the NIH/National Human Genome Research Institute (K99HG006446); to DKL from the National Science Foundation (BCS-0851815) and from the Wenner Gren Foundation (Fieldwork Grant 7967). PC is partly supported by the Flemish Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT Vlaanderen), the Research Program of the Fund for Scientific Research - Flanders (Belgium) (FWO), the Research Fund KU Leuven and SB was supported by the Portuguese Institution "Fundação para a Ciência e a Tecnologia" [FCT; PTDC/BIABDE/64044/2006 (project) and SFRH/BPD/21887/2005 (post-doc grant)] and by a Dean's Postdoctoral Fellowship at Stanford University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

The craniofacial complex is initially modulated by precisely-timed embryonic gene expression and molecular interactions mediated through complex pathways [1]. As humans grow, hormones and biomechanical factors also affect many parts of the face [2,3]. The inability to systematically summarize facial variation has impeded the discovery of the determinants and correlates of face shape. In contrast to genomic technologies, systematic and comprehensive phenotyping has lagged. This is especially so in the context of multipartite traits such as the human face. In typical genome-wide association studies (GWAS) today, phenotypes are summarized as univariate variables, which is inherently limiting for multivariate traits, which, by definition, cannot be expressed with single variables. Current state-of-the-art genetic association studies for facial traits are limited in their description of facial morphology [4–7]. These analyses start from a sparse set of anatomical landmarks (these being defined as "a point of correspondence on an object that matches between and within populations"), which overlooks salient features of facial shape. Subsequently, either a set of conventional morphometric measurements such as distances and angles are extracted, which drastically oversimplify facial shape, or a set of principal components (PCs) are extracted using principal components analysis (PCA) on the shape-space obtained with superimposition techniques, where each PC is assumed to represent a distinct morphological trait. Here we describe a novel method that facilitates the compounding of all PCs into a single scalar variable customized to relevant independent variables including sex, genomic ancestry, and genes. Our approach combines placing

PLOS Genetics | www.plosgenetics.org 1 March 2014 | Volume 10 | Issue 3 | e1004224

Modeling 3D Facial Shape from DNA

Peter Claes1, Denise K. Liberton2, Katleen Daniels1, Kerri Matthes Rosana2, Ellen E. Quillen2, Laurel N. Pearson2, Brian McEvoy3, Marc Bauchet2, Arslan A. Zaidi2, Wei Yao2, Hua Tang4, Gregory S. Barsh4,5, Devin M. Absher5, David A. Puts2, Jorge Rocha6,7, Sandra Beleza4,8, Rinaldo W. Pereira9, Gareth Baynam10,11,12, Paul Suetens1, Dirk Vandermeulen1, Jennifer K. Wagner13, James S. Boster14, Mark D. Shriver2*

1 Medical Image Computing, ESAT/PSI, Department of Electrical Engineering, KU Leuven, Medical Imaging Research Center, KU Leuven & UZ Leuven, iMinds-KU Leuven Future Health Department, Leuven, Belgium, 2 Department of Anthropology, Penn State University, University Park, Pennsylvania, United States of America, 3 Smurfit Institute of Genetics, Dublin, Ireland, 4 Department of Genetics, Stanford University, Palo Alto, California, United States of America, 5 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America, 6 CIBIO: Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Porto, Portugal, 7 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal, 8 IPATIMUP: Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal, 9 Programa de Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brasil, 10 School of Paediatrics and Child Health, University of Western Australia, Perth, Australia, 11 Institute for Immunology and Infectious Diseases, Murdoch University, Perth, Australia, 12 Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, Australia, 13 Center for the Integration of Genetic Healthcare Technologies, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America, 14 Department of Anthropology, University of Connecticut, Storrs, Connecticut, United States of America

Abstract

Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers.

Citation: Claes P, Liberton DK, Daniels K, Rosana KM, Quillen EE, et al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genet 10(3): e1004224. doi:10.1371/journal.pgen.1004224

Editor: Daniela Luquetti, Seattle Children’s Research Institute, United States of America

Received September 12, 2013; Accepted January 22, 2014; Published March 20, 2014

Copyright: © 2014 Claes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Figure 4. Relationships between the ancestry and sex RIP variables and their initial predictor variables. (A) RIP-A with genomic ancestry; genomic ancestry is calculated using the core panel of 68 AIMs and RIP-A is calculated using this ancestry estimate on the set of three populations combined (N = 592). Populations are indicated as shown in the legend with United States participants shown with black circles, Brazilians with red circles, and Cape Verdeans with blue circles. (B) Histograms of RIP-S by self-reported sex. doi:10.1371/journal.pgen.1004224.g004


Page 78: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Bioinformatics, COPD, and publications

Chen and Wang, Journal of Clinical Bioinformatics 2011, 1:35. doi:10.1186/2043-9113-1-35

Bioinformatics is needed to discover the candidates

Pure, hard-core bioinformatics

Bioinformatics is used to discover:

Neither the clinical informatician nor the biomedical engineer will publish here

Page 79: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Bioinformaticians and publications

Microarrays, databases

Microarrays, data mining

Machine learning

Collaboration gets you further

Page 81: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

An example of collaboration and integration: olive pollen allergy

The allergenic proteins are in the pollen

Page 82: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Construction of a pollen gene library

Juan de Dios Alché's research group

Estación Experimental «El Zaidín» (Granada)

Page 83: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

1. Sequencing in the laboratory

Picasso

Bioinnovation Building

Picasso's virtual machines are used

Page 84: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

2. Assembly: from sequence to transcriptome

Picasso's FAT NODES (shared-memory machines) are used

Page 85: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

3. Annotation and biological enrichment

Already-known allergens appear (Ole1-10)

New, previously unknown allergens are being identified

The supercomputer's COMPUTING NODES are used

Page 86: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

There is still much to discover

Page 88: Bioinformática y supercomputación. Razones para hacerse bioinformático en la UMA

Our small interdisciplinary group

Think & design, Coding, Testing

Almudena (C), Darío (C), Juan (C), Noé (B), Rocío (B), Gonzalo (B), Isabel (B), Hicham (B), Rosario (B), Pedro (B)

B: biologists and the like. C: computer engineers. IS: bioinformaticians.

I need bioinformaticians!

Rocío, Gonzalo, Noé, Rafa, Hicham, Almudena, Antonio Banderas