Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological Systems, BCAM June 2014 with Erel Levine, Harvard


Page 1

Inference of evolutionary correlations and the miRNA regulatory network

Benedikt Obermayer

Nonlinear Dynamics of Biological Systems, BCAM June 2014

with Erel Levine, Harvard

Page 2

evolutionary correlations

changes at one locus are accompanied by compensatory changes at another locus

[Figure: example alignment of seven RNA sequences and the stem-loop motif implied by the covarying columns. Species labels, column markings and the drawn structure are not recoverable from the extracted text.]

A C G G C A A G U C A C G T
A C G G C A A G U C A C G T
A C G G C A A G U C A C G T
A C G G C A A G U C A C G T
A C C G C A A G U C A G G T
G C C G C A A G U C A G G C
G C C G C A A G U C A G G C

Figure 2 | Predicting structural motifs for RNA-binding-protein targets in mRNAs from different organisms. a | A stem-loop structural motif that was found in the eight known targets of the human RNA-binding protein (RBP) SAM68 (also known as KHDRBS1; not to be confused with S-adenosylmethionine) using an RNA motif finder (RNApromo) [30]. b | The same RNA motif finder was used to analyse data collected from a large study of mRNA localization during fly embryonic development to predict significant motifs in six sets of colocalized maternal transcripts [30]. Shown is the structural motif enriched in each set of mRNAs. c | Conservation of base pairs in homologous sequences directs structure prediction. Sequence covariation is found at aligned positions. Shown is an example alignment of seven RNA sequences. In the example, sequence covariation in between the two sets of marked columns suggests that these bases might interact, which could give the motif shown on the right. Parts a and b are modified, with permission, from REF. 30 © (2008) National Academy of Sciences, USA.

Pseudoknots: RNA topologies that contain non-nested nucleotide pairings.

algorithm can be modified such that large, positive free-energy terms are added to nucleotides that are known to be unpaired. This thereby restricts the algorithm from marking them as paired [57]. More recently, methods that use quantitative, nucleotide-resolution experimental data (discussed below) to direct the prediction of a folding algorithm have been introduced [59]. By integrating an additional per-nucleotide pseudo-free-energy term into the dynamic programming algorithm, this method was shown to significantly increase the accuracy of structure prediction.

Ongoing challenges. Despite their many successes, current prediction algorithms have several limitations. First, RNA molecules in solution may adopt secondary structures that are only partially determined by thermodynamics, as RNA molecules can undergo conformational changes upon interaction with other RNAs and RBPs. These environment-dependent interactions are extremely complex to model and are thus excluded from prediction algorithms. Second, although our knowledge of thermodynamic rules and parameters has greatly improved, it is far from being complete [29,57,60,61]. Finally, most folding algorithms use approximations in order to efficiently scan the vast landscape of possible secondary structures.

Important limitations are the difficulty encountered in predicting pseudoknots or in taking into account long-range and tertiary-structure interactions. Although those more complicated aspects of RNA structure are computationally hard to predict, some algorithms have been developed that allow for pseudoknots or are able to predict tertiary structure [62–64]. Pseudoknots have been observed in a number of functional RNA sequences, such as rRNAs, tRNAs or the genomes of viral RNAs [65], in which they have been shown to be involved in unique mechanisms of viral translation initiation and elongation [66]. Thus, ignoring pseudoknots results in inaccurate structure predictions [65,67]. In contrast to the prediction of

REVIEWS

454 | SEPTEMBER 2011 | VOLUME 12 | www.nature.com/reviews/genetics

© 2011 Macmillan Publishers Limited. All rights reserved

Wan et al. Nat. Rev. Genet. 2011
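The compensatory changes in the example alignment can be made concrete with a covariation score. The sketch below uses mutual information between alignment columns, which is one common choice and is an assumption here: the slides do not say which score the talk uses. The rows are copied from the figure as extracted (including the trailing T/C column); the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Alignment rows as they appear in the figure; species labels are unreadable
# in the extracted text, so the rows are left unlabeled.
ALIGNMENT = [
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACCGCAAGUCAGGT",
    "GCCGCAAGUCAGGC",
    "GCCGCAAGUCAGGC",
]


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def mutual_information(alignment, i, j):
    """Mutual information between alignment columns i and j (0-based)."""
    col_i = [row[i] for row in alignment]
    col_j = [row[j] for row in alignment]
    return entropy(col_i) + entropy(col_j) - entropy(list(zip(col_i, col_j)))


def ranked_pairs(alignment):
    """All column pairs, sorted from highest to lowest mutual information."""
    length = len(alignment[0])
    scored = [
        (mutual_information(alignment, i, j), i, j)
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for mi, i, j in ranked_pairs(ALIGNMENT)[:2]:
        print(f"columns {i} and {j}: {mi:.3f} bits")
```

On this alignment the two highest-scoring pairs are the outer columns (0 and 13) and the inner columns (2 and 11), which is the nested pairing a stem would produce. Every row is counted as an independent observation here, which is exactly the assumption the following slides call into question.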

Page 3

identifying correlations

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Wan et al. Nat. Rev. Genet. 2011

Page 4

identifying correlations

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Wan et al. Nat. Rev. Genet. 2011

Page 5

identifying correlations

• sequences are evolutionarily related (alignable)

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Wan et al. Nat. Rev. Genet. 2011

Page 6

identifying correlations

• sequences are evolutionarily related (alignable)

• too little variability in the data: phylogenetic bias

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Wan et al. Nat. Rev. Genet. 2011

Page 7

identifying correlations

• sequences are evolutionarily related (alignable)

• too little variability in the data: phylogenetic bias

• how to estimate conservation and correlations in an unbiased way?
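A small illustration of why the rows of an alignment are not independent samples, using the seven rows of the example alignment from the figure. Collapsing identical rows is only a crude stand-in chosen for this sketch; it is an assumption, not the unbiased estimator the talk is asking for, and the function names are hypothetical.

```python
from collections import Counter

# Rows copied from the example alignment in the figure (labels unreadable).
ALIGNMENT = [
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACGGCAAGUCACGT",
    "ACCGCAAGUCAGGT",
    "GCCGCAAGUCAGGC",
    "GCCGCAAGUCAGGC",
]


def row_multiplicities(alignment):
    """How many times each distinct sequence occurs in the alignment."""
    return Counter(alignment)


def column_frequencies(rows, i):
    """Empirical base frequencies in column i (0-based) of the given rows."""
    counts = Counter(row[i] for row in rows)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


if __name__ == "__main__":
    print(row_multiplicities(ALIGNMENT))
    # Every row counted as an independent observation.
    print(column_frequencies(ALIGNMENT, 2))
    # Each distinct sequence counted once (crude correction, for illustration).
    print(column_frequencies(sorted(set(ALIGNMENT)), 2))
```

The seven rows contain only three distinct sequences, and the estimated frequency of G in column 2 changes from 4/7 to 1/3 depending on whether closely related copies are counted separately. Neither number is "right": the point is that the estimate depends on how shared ancestry is handled.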

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Wan et al. Nat. Rev. Genet. 2011

Page 8

a phylogenetic Ising model

idea: treat observed values in different species as spin variables on a tree

$K_{\alpha\beta}$

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]


Page 9

a phylogenetic Ising model

idea: treat observed values in different species as spin variables on a tree

$K_{\alpha\beta} + h$

$h$ measures conservation
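A minimal sketch of spins on a tree, under several assumptions that are not in the slides: a single binary trait per species, a hypothetical seven-node tree, one uniform edge coupling (called `coupling` here, since the slide's $K_{\alpha\beta}$ is not defined in the extracted text), a field `field` acting on every node, and an energy $E = -J\sum_{\text{edges}} s_a s_b - h\sum_a s_a$ with temperature absorbed into the parameters. Internal nodes are summed out by brute-force enumeration, which is only feasible for toy trees.

```python
from itertools import product
from math import exp

# Hypothetical toy tree: node 0 is the root, nodes 1-2 are internal,
# nodes 3-6 are the observed leaves (species).
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
LEAVES = [3, 4, 5, 6]
N_NODES = 7


def energy(spins, coupling, field):
    """E = -coupling * sum over edges of s_a * s_b - field * sum over nodes of s_a."""
    pair_term = sum(spins[a] * spins[b] for a, b in EDGES)
    return -coupling * pair_term - field * sum(spins)


def leaf_distribution(coupling, field):
    """Probability of each leaf configuration, with internal nodes summed out."""
    weights = {}
    for spins in product((-1, 1), repeat=N_NODES):
        leaf_state = tuple(spins[i] for i in LEAVES)
        boltzmann = exp(-energy(spins, coupling, field))
        weights[leaf_state] = weights.get(leaf_state, 0.0) + boltzmann
    z = sum(weights.values())
    return {state: w / z for state, w in weights.items()}


def mean_leaf_spin(coupling, field):
    """Average spin over the leaves; grows with the field."""
    dist = leaf_distribution(coupling, field)
    return sum(p * sum(state) / len(state) for state, p in dist.items())


if __name__ == "__main__":
    for h in (0.0, 0.1, 0.5):
        print(f"field={h}: mean leaf spin = {mean_leaf_spin(0.5, h):.3f}")
```

In this toy version a larger field pushes all species toward the same state, which is one way to read "$h$ measures conservation", while a positive edge coupling makes related species agree even at zero field, which is the phylogenetic bias from the earlier slides expressed as a model.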

[Figure: example alignment of seven RNA sequences with covarying columns, from Wan et al., Nat. Rev. Genet. 2011]

PseudoknotsRNA topologies that contain non-nested nucleotide pairings.

algorithm can be modified such that large, positive free-energy terms are added to nucleotides that are known to be unpaired. This thereby restricts the algo-rithm from marking them as paired57. More recently, methods that use quantitative, nucleotide-resolution experimental data (discussed below) to direct the pre-diction of a folding algorithm have been introduced59. By integrating an additional per-nucleotide pseudo-free-energy term into the dynamic programming algo-rithm, this method was shown to significantly increase the accuracy of structure prediction.

Ongoing challenges. Despite their many successes, cur-rent prediction algorithms have several limitations. First, RNA molecules in solution may adopt second-ary structures that are only partially determined by thermodynamics, as RNA molecules can undergo conformational changes upon interaction with other RNAs and RBPs. These environment-dependent interactions are extremely complex to model and are

thus excluded from prediction algorithms. Second, although our knowledge of thermodynamic rules and parameters has greatly improved, it is far from being complete29,57,60,61. Finally, most folding algorithms use approximations in order to efficiently scan the vast landscape of possible secondary structures.

Important limitations are the difficulty encoun-tered in predicting pseudoknots or in taking into account long-range and tertiary-structure interac-tions. Although those more complicated aspects of RNA structure are computationally hard to predict, some algorithms have been developed that allow for pseudoknots or are able to predict tertiary struc-ture62–64. Pseudoknots have been observed in a number of functional RNA sequences, such as rRNAs, tRNAs or the genomes of viral RNAs65, in which they have been shown to be involved in unique mechanisms of viral translation initiation and elongation66. Thus, ignoring pseudoknots results in inaccurate struc-ture predictions65,67. In contrast to the prediction of

!"#$"%&

454(| SEPTEMBER 2011 | VOLUME 12 (999:;0+74.:851<4.=>.93<?.;.+>83

© 2011 Macmillan Publishers Limited. All rights reserved

Page 10: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

a phylogenetic Ising model

idea: treat observed values in different species as spin variables on a tree

$K_{\alpha\beta} + h$

$h$ measures conservation

[Figure: example alignment of seven RNA sequences with covarying columns (panel c of Figure 2, Nature Reviews Genetics, September 2011, Volume 12), shown three times on this slide.]

$K_{\alpha\beta} + h_i + h_j + J_{ij}$

$J_{ij}$ measures correlations
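The idea of spins on a tree can be made concrete with a small sketch. Everything specific here is an assumption and not taken from the slides: the toy tree and its species names, the coupling values, the encoding (+1 = feature present, -1 = absent), and the choice to apply the field $h$ at every node. The code sums out the unobserved ancestral spins recursively, so that the probability of an observed pattern at the leaves can be compared for different values of $h$.

```python
import math
from itertools import product

# Hypothetical toy tree: parent -> list of (child, branch coupling K).
TREE = {
    "root": [("ancestor", 0.8), ("mouse", 0.5)],
    "ancestor": [("human", 1.2), ("dog", 1.0)],
}
LEAVES = ("human", "dog", "mouse")
SPINS = (-1, 1)


def subtree_weight(node, spin, h, observed):
    """Boltzmann weight of the subtree under `node`, given the node's spin.

    `observed` maps leaf names to spins; pass None to sum over all leaf values.
    """
    if node not in TREE:  # leaf node
        if observed is not None and observed[node] != spin:
            return 0.0
        return math.exp(h * spin)
    weight = math.exp(h * spin)  # field term at this internal node
    for child, coupling in TREE[node]:
        # Sum over the child's spin, weighted by the branch coupling.
        weight *= sum(
            math.exp(coupling * spin * s) * subtree_weight(child, s, h, observed)
            for s in SPINS
        )
    return weight


def likelihood(observed, h):
    """Probability of the observed leaf pattern, internal nodes summed out."""
    numerator = sum(subtree_weight("root", s, h, observed) for s in SPINS)
    partition = sum(subtree_weight("root", s, h, None) for s in SPINS)
    return numerator / partition


conserved = {"human": 1, "dog": 1, "mouse": 1}
for h in (0.0, 0.5, 1.0):
    print(h, likelihood(conserved, h))

# Sanity check: the likelihoods of all possible leaf patterns sum to one.
total = sum(
    likelihood(dict(zip(LEAVES, pattern)), 0.5)
    for pattern in product(SPINS, repeat=len(LEAVES))
)
print(total)
```

In this sketch a larger positive $h$ makes the fully conserved pattern more likely, which is one way a fitted field could serve as a conservation measure; fitting $h$ to data would amount to maximizing `likelihood` over `h`.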

Page 11: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

two applications

1. evolutionary correlations in the miRNA regulatory network

2. direct vs. indirect correlations

Page 12: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

layers of gene regulation

Page 13: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

regulatory function

• mRNA degradation, translational repression

• repression usually modest

• sometimes: developmental switches / define tissue identity

• mostly: tuning of gene expression / redundancy and robustness

Selbach et al. Nature 2008

Page 14: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA target recognition

Huntzinger & Izaurralde Nat Rev Genet 2011

• act on target sites in mRNA 3'UTR

• seed match most important determinant

• other variables: sequence context, site accessibility, position in UTR, ...

• many (conserved) target sites
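As an illustration of seed matching, the sketch below scans a 3'UTR for perfect matches to the reverse complement of miRNA nucleotides 2 to 8. This is one common seed definition chosen here for illustration; the slides do not specify which definition was used, and the function names and the toy UTR are hypothetical. The other determinants listed above (sequence context, accessibility, position) are ignored.

```python
# Watson-Crick complement for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=8):
    """Return the mRNA sequence that pairs with miRNA positions start..end (1-based)."""
    seed = mirna.upper()[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]  # reverse complement


def find_seed_matches(utr, mirna):
    """Return 0-based start positions of perfect seed matches in the UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("T", "U")  # accept DNA-style input
    return [
        i for i in range(len(utr) - len(site) + 1)
        if utr[i:i + len(site)] == site
    ]


let7a = "UGAGGUAGUAGGUUGUAUAGUU"          # mature let-7a sequence
toy_utr = "AAGCUACCUCAUUUGGCUACCUCAA"     # hypothetical UTR with two sites

print(seed_site(let7a))                   # CUACCUC
print(find_seed_matches(toy_utr, let7a))  # [3, 16]
```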

Page 15: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA regulatory network

[Diagram: network of miRNAs (miR) and their target genes]

Page 16: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA regulatory network

• collective regulatory function of miRNAs

[Diagram: network of miRNAs (miR) and their target genes]

Page 17: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA regulatory network

• collective regulatory function of miRNAs

• combinatorial regulation

[Diagram: network of miRNAs (miR) and their target genes]

Page 18: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA regulatory network

• collective regulatory function of miRNAs

• combinatorial regulation

• competitive regulation

[Diagram: network of miRNAs (miR) and their target genes]

Page 19: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA regulatory network

• collective regulatory function of miRNAs

• combinatorial regulation

• competitive regulation

• coordinated regulation

[Diagram: network of miRNAs (miR) and their target genes]

Page 20: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

rewiring the network

[Diagram: network of miRNAs (miR) and their target genes]

Page 21: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

rewiring the network

• target sites less conserved than miRNAs

[Diagram: network of miRNAs (miR) and their target genes]

Page 22: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

rewiring the network

• target sites less conserved than miRNAs

• if sites share regulatory function, rewiring is constrained: evolutionary correlations

[Diagram: network of miRNAs (miR) and their target genes]

Page 23: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

target sites as binary spins

$K_{\alpha\beta}\, g_\alpha + h$

$h$ measures conservation

Page 24: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

target sites as binary spins

$h$ measures conservation

$K_{\alpha\beta}\, g_\alpha + h_i + h_j + J_{ij}$

$J_{ij}$ measures correlations
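One way to read the labels on this slide, written out as a sketch and not as the talk's actual definition: let $\sigma^i_\alpha = \pm 1$ mark the presence or absence of target site $i$ in species (tree node) $\alpha$. The functional form below, the sum over tree branches $\langle\alpha\beta\rangle$, and the omission of the $g_\alpha$ label that appears on the slide (its role is not spelled out there) are all assumptions.

```latex
P(\sigma^i, \sigma^j) \;\propto\; \exp\Big[
    \sum_{\langle\alpha\beta\rangle} K_{\alpha\beta}
        \big(\sigma^i_\alpha \sigma^i_\beta + \sigma^j_\alpha \sigma^j_\beta\big)
  + \sum_{\alpha} \big( h_i\,\sigma^i_\alpha + h_j\,\sigma^j_\alpha
        + J_{ij}\,\sigma^i_\alpha \sigma^j_\alpha \big)
\Big]
```

Under this reading, setting $J_{ij} = 0$ makes the expression factorize into two independent single-site models, so a fitted $J_{ij}$ that differs from zero would indicate that the two sites are gained and lost together (positive) or alternately (negative) beyond what shared phylogeny and individual conservation explain.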

Page 25: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

conservation of target sites

• control: compare to seed-like k-mers

• preferential conservation score relative to background selection in 3'UTR

• sites are not strongly conserved: constraints on evolution of less conserved sites?

Page 26: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

• many genes are under control of multiple miRNAs

• close sites act cooperatively

• is this feature under selection?

• look at sites in the same UTR

Page 27: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

Page 28: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

• enrichment of correlations

Page 29: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

• enrichment of correlations

• mostly positive

Page 30: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

• enrichment of correlations

• mostly positive

• mostly between distant site pairs

Page 31: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

combinatorial regulation

• enrichment of correlations

• mostly positive

• mostly between distant site pairs

• site pairs for the same miRNA more strongly correlated

Page 32: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

miRNA correlation network

network of “interacting” miRNAs with many (significantly, positively) correlated site pairs in the same UTRs

interacting miRNAs are often co-expressed

Tsang et al. Mol. Cell 2010

Page 33: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

Summary: miRNA network

• collective regulatory functions lead to evolutionary correlations in site pair conservation

• use phylogenetic Ising model to quantify conservation and measure correlations

• combinatorial regulation: sites are conserved as an ensemble

• distinguish direct from indirect correlations?

Page 34: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

protein structure prediction

of these possible causes of co-variation effects and is thus faced with the complicated inverse problem of using observed correlations to infer contacts between residues (Figure 1). Given alternative causes of true evolutionary co-variation, even if confounding correlations caused by technical reasons can be identified, there is no guarantee that the remaining correlated residue pairs will be dominated by residues in three dimensional proximity.

The initial challenge is thus to solve the inverse sequence-to-structure problem by reducing the influence of confounding factors. Only then is it possible to judge whether the evolutionary process reveals enough residue contacts, which are sufficiently evenly distributed (spread) throughout the protein sequence and structure, to predict the protein fold. The ultimate criterion of performance is the accuracy of 3D structure prediction using the inferred contacts. Previous work combined a small number of evolutionarily inferred residue contacts with other, structural, sources of information to successfully predict the structure of some smaller proteins, [16,17,18,19]. However, three crucial open questions remain with respect to using evolutionarily inferred residue-residue couplings for protein fold prediction. The first is whether one can develop a sufficiently robust method to identify causative correlations that reflect evolutionary constraints. The second is whether the inferred, plausibly evolutionary, correlations primarily reflect residue-residue proximity. The third is whether these inferred residue-residue proximities provide sufficient information to predict a protein fold, without the use of known three-dimensional structures.

The de novo protein structure prediction problem in the era of genome sequencing

Solving this inverse problem would enable novel insight into the evolutionary dynamics of sequence variation, and the role of evolutionarily constrained interactions in protein folding. Determination of protein structure, by experiment or theory, provides one essential window into protein function, evolution and design. However, our knowledge of protein structure remains incomplete and is far from saturation. In spite of significant progress in the field of structural genomics over the last decade [20], only about half of all well-characterized protein families (PFAM-A, 12,000 families), have a 3D structure for any of their members [1]. At the same time, the current upper limit on the total number of protein families (~200,000; PFAM-B) is an order of magnitude larger, and continues to grow with no clear limit in sight. Therefore, as massive genomic sequencing projects rapidly increase the number and size of protein families, in particular those without structural homologs [21], accurate de novo prediction of 3D structure from sequence would rapidly expand our overall knowledge of protein structures in a way difficult to achieve by experiment.

Limited ability of current de novo 3D structure prediction methods

Although the challenge of the computational sequence-to-structure problem remains unsolved, methods that use fragment libraries [22,23] or other strategies to search conformational space [24,25], followed by sophisticated energy optimization or molecular dynamics refinement, have been successful at predicting the 3D structures of smaller proteins (<80 residues) [22,24,25,26] [25,27,28]. In addition, custom-designed supercomputers have allowed insight not only into molecular dynamics of protein function, but also into the folding pathways of smaller proteins such as BPTI and WW domains [29,30]. However, none of these computational approaches have yet achieved de novo folding from a disordered or extended polypeptide to the native folded state for larger proteins and it is generally appreciated that the primary obstacle to 3D protein structure prediction is conformational sampling, i.e., successful search of the vast space of protein conformations for the correct fold [26,31]. Using current methods, it is computationally infeasible to adequately sample the enormous set of all 3D configurations a protein might explore in the process of folding to the native state. In this paper we explore the idea that information gleaned from statistical analysis of multiple sequence alignments can be used to solve this problem [2,5,6,32,33]. The goal is to use residue-residue contacts inferred from the evolutionary record (EICs) to identify the tiny region in the space of all possible 3D configurations of a given protein that contains the correctly folded or ‘native’ structure.

Figure 1. Correlated mutations carry information about distance relationships in protein structure. The sequence of the protein for which the 3D structure is to be predicted (each circle is an amino acid residue, typical sequence length is 50–250 residues) is part of an evolutionarily related family of sequences (amino acid residue types in standard one-letter code) that are presumed to have essentially the same fold (iso-structural family). Evolutionary variation in the sequences is constrained by a number of requirements, including the maintenance of favorable interactions in direct residue-residue contacts (red line, right). The inverse problem of protein fold prediction from sequence addressed here exploits pair correlations in the multiple sequence alignment (left) to deduce which residue pairs are likely to be close to each other in the three-dimensional structure (right). A subset of the predicted residue contact pairs is subsequently used to fold up any protein in the family into an approximate predicted 3D shape (‘fold’) which is then refined using standard molecular physics techniques, yielding a predicted all-atom 3D structure of the protein of interest. doi:10.1371/journal.pone.0028766.g001

3D Structure Computed from Sequence Alone

PLoS ONE | www.plosone.org 2 December 2011 | Volume 6 | Issue 12 | e28766
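The pair correlations in a multiple sequence alignment that the excerpt refers to can be illustrated with mutual information between two alignment columns. This is a generic covariation statistic chosen here for illustration, not the method of the excerpted paper or of the talk; the toy alignment and the function name are hypothetical. A raw pair statistic like this one does not by itself separate direct from indirect correlations.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in nats) between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return sum(
        (c / n) * math.log((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )


# Hypothetical toy alignment: columns 1 and 3 change together,
# columns 0 and 2 never change.
toy_alignment = ["AKLE", "AKLE", "ARLD", "ARLD"]

print(mutual_information(toy_alignment, 1, 3))  # equals log(2): perfect covariation
print(mutual_information(toy_alignment, 0, 1))  # zero: column 0 is invariant
```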

Page 35: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

protein structure prediction

of these possible causes of co-variation effects and is thus faced with the complicated inverse problem of using observed correlations to infer contacts between residues (Figure 1). Given alternative causes of true evolutionary co-variation, even if confounding correlations caused by technical reasons can be identified, there is no guarantee that the remaining correlated residue pairs will be dominated by residues in three-dimensional proximity.

The initial challenge is thus to solve the inverse sequence-to-structure problem by reducing the influence of confounding factors. Only then is it possible to judge whether the evolutionary process reveals enough residue contacts, which are sufficiently evenly distributed (spread) throughout the protein sequence and structure, to predict the protein fold. The ultimate criterion of performance is the accuracy of 3D structure prediction using the inferred contacts. Previous work combined a small number of evolutionarily inferred residue contacts with other, structural, sources of information to successfully predict the structure of some smaller proteins [16,17,18,19]. However, three crucial open questions remain with respect to using evolutionarily inferred residue-residue couplings for protein fold prediction. The first is whether one can develop a sufficiently robust method to identify causative correlations that reflect evolutionary constraints. The second is whether the inferred, plausibly evolutionary, correlations primarily reflect residue-residue proximity. The third is whether these inferred residue-residue proximities provide sufficient information to predict a protein fold, without the use of known three-dimensional structures.

The de novo protein structure prediction problem in the era of genome sequencing

Solving this inverse problem would enable novel insight into the evolutionary dynamics of sequence variation, and the role of evolutionarily constrained interactions in protein folding. Determination of protein structure, by experiment or theory, provides one essential window into protein function, evolution and design. However, our knowledge of protein structure remains incomplete and is far from saturation. In spite of significant progress in the field of structural genomics over the last decade [20], only about half of all well-characterized protein families (PFAM-A, 12,000 families) have a 3D structure for any of their members [1]. At the same time, the current upper limit on the total number of protein families (~200,000; PFAM-B) is an order of magnitude larger, and continues to grow with no clear limit in sight. Therefore, as massive genomic sequencing projects rapidly increase the number and size of protein families, in particular those without structural homologs [21], accurate de novo prediction of 3D structure from sequence would rapidly expand our overall knowledge of protein structures in a way difficult to achieve by experiment.

Limited ability of current de novo 3D structure prediction methods

Although the challenge of the computational sequence-to-structure problem remains unsolved, methods that use fragment libraries [22,23] or other strategies to search conformational space [24,25], followed by sophisticated energy optimization or molecular dynamics refinement, have been successful at predicting the 3D structures of smaller proteins (<80 residues) [22,24,25,26][25,27,28]. In addition, custom-designed supercomputers have allowed insight not only into molecular dynamics of protein function, but also into the folding pathways of smaller proteins such as BPTI and WW domains [29,30]. However, none of these computational approaches have yet achieved de novo folding from a disordered or extended polypeptide to the native folded state for larger proteins, and it is generally appreciated that the primary obstacle to 3D protein structure prediction is conformational sampling, i.e., successful search of the vast space of protein conformations for the correct fold [26,31]. Using current methods, it is computationally infeasible to adequately sample the enormous set of all 3D configurations a protein might explore in the process of folding to the native state. In this paper we explore the idea that information gleaned from statistical analysis of multiple sequence alignments can be used to solve this problem [2,5,6,32,33]. The goal is to use residue-residue contacts inferred from the evolutionary record (EICs) to identify the tiny region in the space of all possible 3D configurations of a given protein that contains the correctly folded or ‘native’ structure.

Figure 1. Correlated mutations carry information about distance relationships in protein structure. The sequence of the protein for which the 3D structure is to be predicted (each circle is an amino acid residue, typical sequence length is 50–250 residues) is part of an evolutionarily related family of sequences (amino acid residue types in standard one-letter code) that are presumed to have essentially the same fold (iso-structural family). Evolutionary variation in the sequences is constrained by a number of requirements, including the maintenance of favorable interactions in direct residue-residue contacts (red line, right). The inverse problem of protein fold prediction from sequence addressed here exploits pair correlations in the multiple sequence alignment (left) to deduce which residue pairs are likely to be close to each other in the three-dimensional structure (right). A subset of the predicted residue contact pairs is subsequently used to fold up any protein in the family into an approximate predicted 3D shape (‘fold’) which is then refined using standard molecular physics techniques, yielding a predicted all-atom 3D structure of the protein of interest.
doi:10.1371/journal.pone.0028766.g001

3D Structure Computed from Sequence Alone

PLoS ONE | www.plosone.org 2 December 2011 | Volume 6 | Issue 12 | e28766

Marks et al. PLoS One 2011

increasingly difficult cases. We start with small single-domain proteins and move on to larger, more difficult targets, eventually covering a set of well-studied protein domains of wide-ranging biological interest, from different fold classes. We report detailed results for four example families, and summary results for 11 further test families, and provide detailed 3D views of all 15 test protein families in Figure S3 and detailed 3D coordinates and Pymol session files for interactive inspection in Appendices A3 and A4, http://cbio.mskcc.org/foldingproteins.

Small: an RNA binding domain (RRM). The blind prediction of the 71-residue RRM domain of the human Elav4 protein (Uniprot ID: Elav4_human) is a typical example of a smaller protein. The distance constraints are derived from a rich corpus of 25K example proteins in the PFAM family. The highest ranking predicted structure has an (excellent) low 2.9 Å Cα-RMSD deviation from the crystal structure over 67 out of 71 residues, a TM score of 0.57 and GDT_TS 54.6, indicating overall good structural similarity to the observed crystal structure [53,54] (Figure 2 top, Table 1). It has correct topography of the five β-strands and two α-helices, marred only by a missing H-bond pattern between strands 1 and 3, at least partly due to the truncation of strand 1, a consequence of the short length of the sequence in the PFAM alignment. Strands 2 and 3 align with only 1.6 Å Cα-RMSD deviation over the length of the predicted strands and are positioned well enough for hydrogen bonding, with some correct registration. Interestingly, the 4th β-strand (penultimate) missed by the secondary structure prediction method is placed in the correct region in 3D: this is one of several examples in which residue coupling information overrides incorrect local prediction. The predicted top-ranked domain of Elav4 very likely lies within the refinement basin of the native structure.
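The Cα-RMSD values quoted above compare predicted and crystal coordinates after superposition. The excerpt does not say which superposition procedure was used, so the following is only a minimal sketch that assumes the common Kabsch (SVD-based) optimal rotation; the function name `kabsch_rmsd` and the toy coordinates are illustrative, not taken from the paper.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Assumption: rows of P and Q are matched C-alpha atoms in the same order.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translation by centering both point sets.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Kabsch: SVD of the 3x3 covariance matrix gives the optimal rotation.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P0 @ R.T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P0)))

# Toy check: a rotated and shifted copy should superimpose almost exactly.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])
print(kabsch_rmsd(P, Q))  # close to zero, up to floating-point error
```

With real structures, the input would be the Cα coordinates of the aligned residues only (for example the 67 of 71 residues mentioned above), in ångströms.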

Medium size: Ras oncogene (G-domain), an α/β domain with a GTPase active site. The G-domain family in PFAM, with Human Ras proto-oncogene protein (Uniprot-ID: hras_human) chosen as the protein of interest, has a core multiple sequence alignment (MSA) of 161 residues. The structure has an α/β fold with a 6-stranded β-sheet, surrounded by 5 α-helices, one of which (α-2) is involved in the GTPase switch transition after GTP hydrolysis. The highest ranked, blindly predicted structure is 3.6 Å Cα-RMSD from the crystal structure, over 161 residues (Figure 2 middle) and has a high TM score of 0.7 (range 0.0–1.0, with 1.0 implying 100% of residues are within a set distance from the correct position [53]). The six β-strands and five α-helices are placed in the correct spatial positions and are correctly threaded (Appendices A3 and A4). The 6 β-strands, which make 5 β-strand pairs, are not within hydrogen bonding distance for all backbone bonding, but the correct register can be easily predicted for 26/30 of the residue pairs, Text S1. The accuracy of overall topography of the highest-ranked structures is remarkable (Table 1) and, as far as we know, currently not

Figure 3. Progress in contact prediction using the maximum entropy method. Extraction of evolutionary information about residue coupling and predicted contacts from multiple sequence alignments works much better using the global statistical model (right, Direct Information, DI, Equation 3) than the local statistical model (left, Mutual Information, MI, Equation 1). Predicted contacts for DI (red lines connecting the residues predicted to be coupled from sequence information) are better positioned in the experimentally observed structure (grey ribbon diagram) than those for MI (left, blue lines), shown here for the RAS protein (upper) and ELAV4 protein (lower). The DI residue pairs are also more evenly distributed along the chain and overlap more accurately with the contacts in the observed structure (red stars [predicted], grey circles [observed] in contact map; center, upper right triangle) than those using MI (blue [predicted], grey circles [observed]; center, lower left triangle). Details of contact maps for all proteins comparing predicted and observed contacts are in Figures S1 and S2, Text S1.
doi:10.1371/journal.pone.0028766.g003
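The local statistical model in the caption scores each pair of alignment columns independently by mutual information. The paper's Equation 1 is not reproduced in this excerpt, so the sketch below uses the textbook definition of MI on raw column frequencies, with no sequence reweighting or pseudocounts (both are assumptions on our side); the function name and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    msa is a list of equal-length strings; frequencies are plain counts.
    """
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)             # single-column counts
    fj = Counter(seq[j] for seq in msa)
    fij = Counter((seq[i], seq[j]) for seq in msa)  # pair counts
    return sum(
        (c / n) * log((c / n) / ((fi[a] / n) * (fj[b] / n)))
        for (a, b), c in fij.items()
    )

# Toy alignment: columns 0 and 1 co-vary perfectly, column 2 is independent.
toy_msa = ["AKD", "AKE", "GRD", "GRE"]
print(mutual_information(toy_msa, 0, 1))  # log(2): perfectly coupled columns
print(mutual_information(toy_msa, 0, 2))  # 0: no co-variation
```

Because each pair is scored in isolation, a chain of contacts i-j and j-k also raises the score of the indirect pair i-k, which is the kind of transitive effect a global model is meant to separate out.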

Page 36: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

protein structure prediction

Page 37: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

protein structure prediction

inverse Ising problem: determine couplings from correlations
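As a rough illustration of what "determine couplings from correlations" can mean in practice, here is a minimal sketch using the naive mean-field approximation, in which the couplings are estimated as the negative off-diagonal entries of the inverse connected-correlation matrix. The slide does not state which inversion scheme is used, so the choice of mean-field inversion, the ±1 spin convention, and all names below are assumptions made for the example.

```python
import itertools
import numpy as np

def exact_correlations(J):
    """Connected correlations of a small zero-field Ising model.

    Assumes P(s) is proportional to exp(sum_{i<j} J_ij s_i s_j) with s_i = +/-1,
    J symmetric with zero diagonal, and N small enough to enumerate 2^N states.
    """
    n = J.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    # 0.5 * s^T J s equals the sum over pairs i<j for symmetric J.
    log_w = 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    mean = p @ states
    second = states.T @ (states * p[:, None])
    return second - np.outer(mean, mean)

def mean_field_couplings(C):
    """Naive mean-field estimate: J_ij = -(C^{-1})_ij for i != j."""
    J_hat = -np.linalg.inv(C)
    np.fill_diagonal(J_hat, 0.0)
    return J_hat

# Toy model: open chain of 5 spins with nearest-neighbour coupling 0.2.
n, j0 = 5, 0.2
J_true = np.zeros((n, n))
for k in range(n - 1):
    J_true[k, k + 1] = J_true[k + 1, k] = j0

C = exact_correlations(J_true)
J_hat = mean_field_couplings(C)
print(np.round(C[0], 3))      # correlations decay along the chain but stay nonzero
print(np.round(J_hat[0], 3))  # inferred couplings are nonzero only for the neighbour
```

In this toy chain the correlation between spins 0 and 2 is nonzero even though they are not coupled, while the inferred coupling between them vanishes; the analogous step for sequence data would separate direct from indirect co-variation.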

Page 38: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

protein structure prediction

of these possible causes of co-variation effects and is thus facedwith the complicated inverse problem of using observedcorrelations to infer contacts between residues (Figure 1). Givenalternative causes of true evolutionary co-variation, even ifconfounding correlations caused by technical reasons can beidentified, there is no guarantee that the remaining correlatedresidue pairs will be dominated by residues in three dimensionalproximity.The initial challenge is thus to solve the inverse sequence-to-

structure problem by reducing the influence of confoundingfactors. Only then is it possible to judge whether the evolutionaryprocess reveals enough residue contacts, which are sufficientlyevenly distributed (spread) throughout the protein sequence andstructure, to predict the protein fold. The ultimate criterion ofperformance is the accuracy of 3D structure prediction using theinferred contacts. Previous work combined a small number ofevolutionarily inferred residue contacts with other, structural,sources of information to successfully predict the structure of somesmaller proteins, [16,17,18,19]. However, three crucial openquestions remain with respect to using evolutionarily inferredresidue-residue couplings for protein fold prediction. The first iswhether one can develop a sufficiently robust method to identifycausative correlations that reflect evolutionary constraints. Thesecond is whether the inferred, plausibly evolutionary, correlationsprimarily reflect residue-residue proximity. The third is whetherthese inferred residue-residue proximities provide sufficientinformation to predict a protein fold, without the use of knownthree-dimensional structures.

The de novo protein structure prediction problem in theera of genome sequencingSolving this inverse problem would enable novel insight into the

evolutionary dynamics of sequence variation, and the role ofevolutionarily constrained interactions in protein folding. Deter-mination of protein structure, by experiment or theory, providesone essential window into protein function, evolution and design.However, our knowledge of protein structure remains incompleteand is far from saturation. In spite of significant progress in thefield of structural genomics over the last decade [20], only about

half of all well-characterized protein families (PFAM-A, 12,000families), have a 3D structure for any of their members [1]. At thesame time, the current upper limit on the total number of proteinfamilies (,200,000; PFAM-B) is an order of magnitude larger, andcontinues to grow with no clear limit in sight. Therefore, asmassive genomic sequencing projects rapidly increase the numberand size of protein families, in particular those without structuralhomologs [21], accurate de novo prediction of 3D structure fromsequence would rapidly expand our overall knowledge of proteinstructures in a way difficult to achieve by experiment.

Limited ability of current de novo 3D structure prediction methods

Although the challenge of the computational sequence-to-structure problem remains unsolved, methods that use fragment libraries [22,23] or other strategies to search conformational space [24,25], followed by sophisticated energy optimization or molecular dynamics refinement, have been successful at predicting the 3D structures of smaller proteins (<80 residues) [22,24,25,26][25,27,28]. In addition, custom-designed supercomputers have allowed insight not only into molecular dynamics of protein function, but also into the folding pathways of smaller proteins such as BPTI and WW domains [29,30]. However, none of these computational approaches have yet achieved de novo folding from a disordered or extended polypeptide to the native folded state for larger proteins and it is generally appreciated that the primary obstacle to 3D protein structure prediction is conformational sampling, i.e., successful search of the vast space of protein conformations for the correct fold [26,31]. Using current methods, it is computationally infeasible to adequately sample the enormous set of all 3D configurations a protein might explore in the process of folding to the native state. In this paper we explore the idea that information gleaned from statistical analysis of multiple sequence alignments can be used to solve this problem [2,5,6,32,33]. The goal is to use residue-residue contacts inferred from the evolutionary record (EICs) to identify the tiny region in the space of all possible 3D configurations of a given protein that contains the correctly folded or ‘native’ structure.

Figure 1. Correlated mutations carry information about distance relationships in protein structure. The sequence of the protein for which the 3D structure is to be predicted (each circle is an amino acid residue, typical sequence length is 50–250 residues) is part of an evolutionarily related family of sequences (amino acid residue types in standard one-letter code) that are presumed to have essentially the same fold (iso-structural family). Evolutionary variation in the sequences is constrained by a number of requirements, including the maintenance of favorable interactions in direct residue-residue contacts (red line, right). The inverse problem of protein fold prediction from sequence addressed here exploits pair correlations in the multiple sequence alignment (left) to deduce which residue pairs are likely to be close to each other in the three-dimensional structure (right). A subset of the predicted residue contact pairs is subsequently used to fold up any protein in the family into an approximate predicted 3D shape (‘fold’) which is then refined using standard molecular physics techniques, yielding a predicted all-atom 3D structure of the protein of interest.
doi:10.1371/journal.pone.0028766.g001

3D Structure Computed from Sequence Alone

PLoS ONE | www.plosone.org 2 December 2011 | Volume 6 | Issue 12 | e28766

Marks et al. PLoS One 2011

increasingly difficult cases. We start with small single-domain proteins and move on to larger, more difficult targets, eventually covering a set of well-studied protein domains of wide-ranging biological interest, from different fold classes. We report detailed results for four example families, and summary results for 11 further test families, and provide detailed 3D views of all 15 test protein families in Figure S3 and detailed 3D coordinates and Pymol session files for interactive inspection in Appendices A3 and A4, http://cbio.mskcc.org/foldingproteins.

Small: an RNA binding domain (RRM). The blind prediction of the 71-residue RRM domain of the human Elav4 protein (Uniprot ID: Elav4_human) is a typical example of a smaller protein. The distance constraints are derived from a rich corpus of 25K example proteins in the PFAM family. The highest ranking predicted structure has a (excellent) low 2.9 Å Cα-RMSD deviation from the crystal structure over 67 out of 71 residues, a TM score of 0.57 and GDT_TS 54.6, indicating overall good structural similarity to the observed crystal structure [53,54] (Figure 2 top, Table 1). It has correct topography of the five β-strands and two α-helices, marred only by a missing H-bond pattern between strands 1 and 3, at least partly due to the truncation of the strand 1, a consequence of the short length of the sequence in the PFAM alignment. Strands 2 and 3 align with only 1.6 Å Cα-RMSD deviation over the length of the predicted strands and are positioned well enough for hydrogen bonding, with some correct registration. Interestingly, the 4th β-strand (penultimate) missed by the secondary structure prediction method is placed in the correct region in 3D: this is one of several examples in which residue coupling information overrides incorrect local prediction. The predicted top-ranked domain of Elav4 very likely lies within the refinement basin of the native structure.

Medium size: Ras oncogene (G-domain), an α/β domain with a GTPase active site. The G-domain family in PFAM, with Human Ras proto-oncogene protein (Uniprot-ID: hras_human) chosen as the protein of interest, has a core multiple sequence alignment (MSA) of 161 residues. The structure has an α/β fold with a 6-stranded β-sheet, surrounded by 5 α-helices, one of which (α-2) is involved in the GTPase switch transition after GTP hydrolysis. The highest ranked, blindly predicted structure is 3.6 Å Cα-RMSD from the crystal structure, over 161 residues (Figure 2 middle) and has a high TM score of 0.7 (range 0.0–1.0, with 1.0 implying 100% of residues are within a set distance from the correct position [53]). The six β-strands and five α-helices are placed in the correct spatial positions and are correctly threaded (Appendices A3 and A4). The 6 β-strands, which make 5 β-strand pairs, are not within hydrogen bonding distance for all backbone bonding, but the correct register can be easily predicted for 26/30 of the residue pairs, Text S1. The accuracy of overall topography of the highest-ranked structures is remarkable (Table 1) and, as far as we know, currently not

Figure 3. Progress in contact prediction using the maximum entropy method. Extraction of evolutionary information about residue coupling and predicted contacts from multiple sequence alignments works much better using the global statistical model (right, Direct Information, DI, Equation 3) than the local statistical model (left, Mutual Information, MI, Equation 1). Predicted contacts for DI (red lines connecting the residues predicted to be coupled from sequence information) are better positioned in the experimentally observed structure (grey ribbon diagram), than those for MI (left, blue lines), shown here for the RAS protein (upper) and ELAV4 protein (lower). The DI residue pairs are also more evenly distributed along the chain and overlap more accurately with the contacts in the observed structure (red stars [predicted], grey circles [observed] in contact map; center, upper right triangle) than those using MI (blue [predicted], grey circles [observed]; center, lower left triangle). Details of contact maps for all proteins comparing predicted and observed contacts are in Figures S1 and S2, Text S1.
doi:10.1371/journal.pone.0028766.g003
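To make the "local statistical model" in the caption concrete, here is a minimal sketch of mutual information between two alignment columns. The toy alignment and the function name `mutual_information` are made up for illustration; the sketch uses raw counts only (no pseudocounts, no sequence reweighting), so it is not the pipeline of the cited paper, and the global DI model is not shown.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns given as equal-length strings."""
    n = len(col_i)
    f_i = Counter(col_i)                 # single-column counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # pair counts
    mi = 0.0
    for (a, b), n_ab in f_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts
        mi += p_ab * math.log(p_ab * n * n / (f_i[a] * f_j[b]))
    return mi

# Toy alignment, one sequence per row (hypothetical data).
alignment = ["GCAC", "GCAG", "CGAC", "CGAG"]
columns = ["".join(seq[k] for seq in alignment) for k in range(len(alignment[0]))]
mi_01 = mutual_information(columns[0], columns[1])  # columns 0 and 1 covary
mi_03 = mutual_information(columns[0], columns[3])  # columns 0 and 3 vary independently
```

A high MI flags a covarying pair, but it cannot tell a direct coupling from one mediated by a third position, which is the motivation for the global model.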

inverse Ising problem: determine couplings from correlations

how can this be done if measurements are biased?

Page 42: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

inverse Ising inference with correlated samples

treat the entire dataset as one sample and infer parameters by maximum likelihood:

$$P(X|h,J) = \frac{1}{Z}\,\mathrm{Tr}_0\, e^{-H(X)}$$

$$H(X) = \underbrace{-\sum_{i,\alpha} g_\alpha x_i^\alpha - \sum_{i,\alpha<\beta} K_{\alpha\beta}\, x_i^\alpha x_i^\beta}_{\text{between samples (known)}} \;\underbrace{-\sum_{\alpha,i} h_i x_i^\alpha - \sum_{\alpha,i<j} J_{ij}\, x_i^\alpha x_j^\alpha}_{\text{between loci (unknown)}}$$
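The four sums in $H(X)$ can be spelled out in a few lines. This is only a sketch under stated assumptions: spins are taken to be $\pm 1$, all samples are treated as observed (the partial trace $\mathrm{Tr}_0$ is not performed), and the function name `energy` and the list-of-lists layout are illustrative choices, not taken from the talk.

```python
def energy(X, g, K, h, J):
    """H(X) for M samples (index alpha) and N loci (index i); X[alpha][i] is +1 or -1."""
    M, N = len(X), len(X[0])
    H = 0.0
    for a in range(M):
        for i in range(N):
            H -= (g[a] + h[i]) * X[a][i]          # field terms g_alpha and h_i
            for b in range(a + 1, M):
                H -= K[a][b] * X[a][i] * X[b][i]  # between samples, same locus
            for j in range(i + 1, N):
                H -= J[i][j] * X[a][i] * X[a][j]  # between loci, same sample
    return H

# Two samples, two loci, all spins up (hypothetical parameter values).
H_example = energy(
    X=[[1, 1], [1, 1]],
    g=[0.0, 0.0],
    K=[[0.0, 0.5], [0.0, 0.0]],
    h=[0.1, 0.2],
    J=[[0.0, 1.0], [0.0, 0.0]],
)
```

Only the upper triangles of `K` and `J` are read, matching the restrictions $\alpha<\beta$ and $i<j$ in the sums.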

Page 43: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

numerical approach for a tree

$$P(X|h,J) = \frac{1}{Z}\,\mathrm{Tr}_0\, e^{-H(X)}$$

• for small systems: condense values from each sample into $2^N$-dimensional Potts spin

• interaction graph is tree: use belief propagation

• larger systems: cluster expansion

Cocco & Monasson Phys Rev Lett 2011
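The first bullet, condensing the $N$ binary values of one sample into a single $2^N$-state Potts variable, can be sketched as a bit encoding. The helper names and the $\pm 1$ convention are assumptions for illustration; the belief-propagation step on the tree is not shown.

```python
def to_potts_state(spins):
    """Map N binary values (+1/-1) of one sample to an integer in range(2**N)."""
    state = 0
    for i, s in enumerate(spins):
        if s == 1:
            state |= 1 << i   # locus i contributes bit i
    return state

def from_potts_state(state, n_loci):
    """Inverse mapping: recover the +1/-1 values of the N loci."""
    return [1 if (state >> i) & 1 else -1 for i in range(n_loci)]

example_state = to_potts_state([1, -1, 1])
```

With one such variable per sample, the only remaining interactions are those along the tree, which is the setting where message passing is exact. The state space grows as $2^N$, consistent with the slide restricting this route to small systems.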

Page 44: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

“branch length”

results on the tree

• create phylogenies by sampling from perfect tree

• average sequence identity measures degree of bias
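As an illustration of the second bullet, average pairwise sequence identity can be computed as below. This is a sketch assuming equal-length, already aligned sequences without gaps; the function name and the toy input are hypothetical.

```python
from itertools import combinations

def average_sequence_identity(sequences):
    """Mean fraction of identical positions over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    total = 0.0
    for s, t in pairs:
        total += sum(a == b for a, b in zip(s, t)) / len(s)
    return total / len(pairs)

identity = average_sequence_identity(["AAAA", "AAAT", "TTTT"])
```

Values near 1 indicate closely related samples (strong phylogenetic bias); lower values indicate more independent samples.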

Page 48: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

results on the tree

• exact method is much more accurate

• reweighting only marginally beneficial

• rescaling heuristic as good as exact solution

[Figure panels: naive inference, exact method, reweighting, rescaling]

Page 50: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

the mean-field solution

• simple mean-field approach very successful

• reweighting only marginally beneficial

• rescaling heuristic performs much better

[Figure legend: naive, reweighting, rescaling]

$$J_{ij} = -\left(C^{-1}\right)_{ij}, \qquad C_{ij} = f_{ij} - f_i f_j$$
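The mean-field formula can be tried on a toy data set with the standard library alone. This is a sketch, not the talk's implementation: it assumes $\pm 1$ spins, takes $f_i$ and $f_{ij}$ as plain sample averages (no reweighting or rescaling), and uses a small Gauss-Jordan routine (`invert`, written here for illustration) in place of a linear-algebra library.

```python
def invert(A):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(A)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mean_field_couplings(X):
    """J_ij = -(C^-1)_ij for i != j, with C_ij = f_ij - f_i f_j from samples X."""
    M, N = len(X), len(X[0])
    f = [sum(x[i] for x in X) / M for i in range(N)]
    C = [[sum(x[i] * x[j] for x in X) / M - f[i] * f[j] for j in range(N)]
         for i in range(N)]
    C_inv = invert(C)
    return [[-C_inv[i][j] if i != j else 0.0 for j in range(N)] for i in range(N)]

# Four samples of two loci (hypothetical data).
J_est = mean_field_couplings([[1, 1], [1, 1], [-1, -1], [1, -1]])
```

The inversion fails if a locus never varies in the sample, so real data would need filtering or regularization before this step.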

Page 51: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

Summary: correlations

• distinguish direct and indirect evolutionary correlations to infer protein structure

• data come from alignment: phylogenetic bias

• unified treatment with Ising model gives exact numerical algorithm and simple rescaling heuristic

• applicable to mean-field approach

• useful beyond comparative genomics

Page 52: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

Acknowledgements

Erel Levine, Harvard

members of the Levine lab

members of the Rajewsky lab

Page 53: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

how quantitative?

• assume constant selection

• UTR length expansion in mammals, but alignability dominates

• consistency check: compare to results for 60-way alignment to mouse: $r^2 = 0.56$

Page 54: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

measuring correlation

• restrict analysis to sites in the same context

• estimate level of background correlations from pairs of matched control sites

• in the same 3'UTR: positive correlations due to alignment block structure
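One plausible way to read the first two bullets is sketched below. Everything here is an assumption for illustration: sites are encoded as 0/1 presence profiles across species, correlation is measured with the Pearson (phi) coefficient, and the background is the mean over matched control pairs. The talk does not specify its statistic, so this is not the authors' procedure.

```python
import math

def phi(x, y):
    """Pearson correlation of two binary (0/1) presence profiles across species."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0   # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def excess_correlation(site_pair, control_pairs):
    """Correlation of a site pair minus the mean correlation of matched controls."""
    background = sum(phi(x, y) for x, y in control_pairs) / len(control_pairs)
    return phi(*site_pair) - background

# Hypothetical profiles over four species.
excess = excess_correlation(
    site_pair=([1, 1, 0, 0], [1, 1, 0, 0]),
    control_pairs=[([1, 0, 1, 0], [1, 1, 0, 0])],
)
```

Subtracting a control-based background matters for the third bullet: pairs in the same 3'UTR share alignment blocks, so their raw correlation is positive even without selection.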

Page 55: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

matched controls

Page 56: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

loss of miRNA genes

Page 60: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

use binding evidence

• miRNAs associate with Argonaute (AGO) proteins

• experimental profiling of AGO binding sites via PAR-CLIP

• site pairs in AGO footprints are more strongly correlated

Hafner et al. Cell 2010

Page 64: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

coordinated regulation

miRNAs tend to coordinately target members of the same protein complex or signaling pathway

Tsang et al. Mol. Cell 2010

are there signatures of evolutionary constraint indicative of different regulatory strategies?

• many or few sites in the same transcript

• sites in many or few different transcripts

Page 65: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

coordinated regulation

Page 68: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

coordinated regulation

• enrichment for correlated pairs

• sites on the same gene are enriched for positive correlations

• protein complexes: sites on different genes tend to avoid each other

• signaling pathways: sites on different genes have positive correlations

• hypothesis: downregulation of single members is compensated in different ways

signaling pathways

protein complexes

Page 76: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

analytical results for linear chain

$$K_{\alpha\beta} = K_0\,\delta_{\beta,\alpha+1}$$

[Figure: linear chain of samples; axis: "time" $\alpha$]

$$\hat h_i^2 = h_i^2\left(e^{-2(K-K_0)} - 1\right)^2 + \frac{e^{-2(2K-K_0)}}{M}$$

$$\hat J_{ij}^2 = J_{ij}^2\left(\frac{\cosh 2K_0}{\cosh 2\hat K} - 1\right)^2 + \frac{\cosh 2K_0}{M\cosh^2 2\hat K}$$

assumed coupling: $\hat K$; true coupling: $K_0$

first term: average inference error; second term: finite-size error

globally remove phylogenetic bias by re-scaling:

$$f_i = e^{-2K}\,\frac{1}{M}\sum_\alpha X_i^\alpha, \qquad f_{ij} = \frac{1}{\cosh 2\hat K}\,\frac{1}{M}\sum_\alpha X_i^\alpha X_j^\alpha$$
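The re-scaling of the one- and two-site averages can be written out directly. This is a minimal sketch assuming $\pm 1$ spins and assuming that the same assumed coupling (called `K_hat` here) enters both factors; the function name and data layout are illustrative.

```python
import math

def rescaled_frequencies(X, K_hat):
    """Rescaled one- and two-site averages for M samples x N loci of +1/-1 values."""
    M, N = len(X), len(X[0])
    # f_i = exp(-2 K) * (1/M) * sum over samples of X_i
    f = [math.exp(-2 * K_hat) * sum(x[i] for x in X) / M for i in range(N)]
    # f_ij = (1 / cosh 2K) * (1/M) * sum over samples of X_i X_j
    f2 = [[sum(x[i] * x[j] for x in X) / (M * math.cosh(2 * K_hat))
           for j in range(N)] for i in range(N)]
    return f, f2

# Two samples, two loci (hypothetical data), assumed coupling 0.5.
f, f2 = rescaled_frequencies([[1, 1], [1, -1]], K_hat=0.5)
```

For `K_hat = 0` both factors equal 1 and the plain sample averages are recovered, which corresponds to treating the samples as independent.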

Page 84: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

adaptive cluster expansion

• start from pairs of loci, test for correlations

• recursively test larger clusters and adjust fit parameters

• include only if fit significantly improves

Cocco & Monasson Phys Rev Lett 2011

Page 85: Inference of evolutionary correlations and the miRNA ......Inference of evolutionary correlations and the miRNA regulatory network Benedikt Obermayer Nonlinear Dynamics of Biological

compare inference methods