Bioinformatics approaches for studying of gene regulation. By Ilya Ioshikhes, Ph.D. Department of Biomedical Informatics


Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell Chapter 3. Macromolecules: Structure, Shape, and Information Nucleic Acids

Figure 3-19. Information flow in protein synthesis. (A) The nucleotides in an mRNA molecule are joined together to form a complementary copy of a segment of one strand of DNA. (B) They are then matched three at a time to complementary sets of three nucleotides in the anticodon regions of tRNA molecules. At the other end of each type of tRNA molecule, a specific amino acid is held in a high-energy linkage, and when matching occurs, this amino acid is added to the end of the growing polypeptide chain. Thus translation of the mRNA nucleotide sequence into an amino acid sequence depends on complementary base-pairing between codons in the mRNA and corresponding tRNA anticodons. The molecular basis of information transfer in translation is therefore very similar to that in DNA replication and transcription. Note that the mRNA is both synthesized and translated starting from its 5' end.

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=cell.biblist.d1e7498#d1e7777

Figure 9-30. Structure of the nucleosome. (a) Ribbon diagram of the nucleosome shown face-on (left) and from the side (right). One DNA strand is shown in green and the other in brown. H2A is yellow; H2B, red; H3, blue; H4, green. (b) Space-filling model shown from the side. DNA is shown in white; histones are colored as in (a). H2A, H2A′, H2B, H2B′, H3, and H4 indicate the positions of the respective histone N-terminal tails visible in this view. The H2A′ N-terminal tail interacts with the upper loop of DNA, while the H2A N-terminal tail (only partially seen in this view) interacts with the bottom loop of DNA. The N-terminal tail of one H4 extends from the bottom of the nucleosome and interacts with the neighboring histone octamer in the crystal lattice (not shown). The N-terminal tails of histones H2B, H2B′, H3, and H3′ pass between the two loops of DNA. The N-terminal tails of H2A, H4, H3, and H2B include an additional 3, 15, 19, and 23 residues, respectively, that are not visualized in the crystal structure because they are not highly structured. They extend further from the surface of the nucleosome where they may participate in nucleosome-nucleosome interactions in the 30 nm fiber (see Figure 9-31) or interact with other chromatin-associated proteins. [From K. Luger et al., 1997, Nature 389:251; courtesy of T. J. Richmond.]

Molecular Cell Biology

9. Molecular Structure of Genes and Chromosomes 9.5. Organizing Cellular DNA into Chromosomes

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=mcb.figgrp.d1e34531

Figure 8-30. Model of chromatin packing. This schematic drawing shows some of the many orders of chromatin packing postulated to give rise to the highly condensed mitotic chromosome.

Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics. Chapter 8. The Cell Nucleus

The Global Structure of Chromosomes

Agalioti T, Lomvardas S, Parekh B, Yie J, Maniatis T, Thanos D. “Ordered recruitment of chromatin modifying and general transcription factors to the IFN-beta promoter.” Cell. 2000 Nov 10;103(4):667-78.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11106736&dopt=Abstract












Characteristic features of gene regulation mechanisms:

• Large number and variety of participating regulatory elements: thousands of transcription factors (TFs), chromatin, DNA methylation etc.

• None of those elements is either absolutely necessary or sufficient for the regulatory processes.

• There are a lot of DNA sequence motifs (signals) related to these agents: TF binding sites, nucleosome sequence pattern, CpG islands etc.

• The majority of those signals are very weak.

• Gene expression is regulated by a large number of weak signals interacting with each other in some sophisticated ways.

Possible approaches in that study:

• Exhaustive analysis of signals caused by 1-2 elements, with gradual generalization of results.

• From intuitive model to sequence analysis.

• From known sequence features to their quantitative analysis.

• From sequences to revealing common sequence motifs.

• In-depth analysis of known features.

> SEQ_1 Frog Xenopus borealis ACCURACY 1 bp
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTTGGCAGGACAAGGGCAGCTCTGCAAACTGTAAAACCGGACAAAGGCTTTCCCCTGGCTTACACGCAAAAGGGAAGGGCCTTTCCTGAGGAGGTGAGCGGCAACCTGGACTCGGGGATGGCGCTGGAAGTGATCTGCTTGGATTTTGCTCAAGACTTGGATGCAAGGGCTATCCCGATGAGCTGACAAGGGCCTTGGGAGGGGGGCGGGGGCTGTGCAGATAACAAGCTGTCCACTTCCAGGCACTGCCCTTCCGTGGCTCCCGTAGC

> SEQ_2 Frog Xenopus borealis ACCURACY 1 bp
GGGCTCCGCCCXTTCGGAAGGATGCTAGGGAGCCGGAGAGAGCGCAGAGAGGCGGGGTGAAAGGGATGGGGGGAGCTGAGGCAGGAGGGCAGGCTGTCAAGGCCGGGCTTGTTTTCCTGCCTGGGGGAAAAGACCCTGGCATGGGGAGGAGCTGGGCCCCCCCCAGAAGGCAGCACAAGGGGAGGAAAAGTCAGCCTTGTGCTCGCCTACGGCCATACCACCCTGAAAGTGCCCGATATCGTCTGATCTCGGAAGCCAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGGAATACCAGGTGTCGTAGGCTTTTGCACTTTTGCCATTCTGAGTAACAGCAGGGGGCAGTCTCCTCCATGCATTTTTCTTTCCCCGAACAGCTGCCTG

> SEQ_3 African Green Monkey ACCURACY 1 bp
ACTGCTCTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTCAAGAAGCCTTTCGCTAAGGCTGTTCTTGTGGAATTGGCAAAGGGATATTTGGAAGCCCATAGAGGGCTATGGTGAAAAAGGAAATATCTTCCGTTCAAAACTGGAAAGAAGCTTTCTGAGAAACTGCTCTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTCAAGAAGCCTTTCGCTAAGGCTGTTCTTGTGGAATTGGCAAAGGGATATTTGGAAGCCCATAGAGGGCTATGGTGAAAAAGGAAATATCTTCCGTTCAAAACTGGAAAGAAGCTTTCTGAGAAACTGCTCTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTCAAGAAG

> SEQ_4 Mouse ACCURACY 1 bp
AAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTGTAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGA

> SEQ_5 Psammechinus miliaris (sea urchin) ACCURACY 1 bp
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTTATAATCATCCTTATACACGCGCAGTCGATGAGATGAAAAGTTCATTAACGCTACATTTACAGTGTTTTGGGCAATTCTCCCTCCCCCCCCCCCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCCCTTCCTCTAAATATGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

> SEQ_6 Yeast Saccharomyces cerevisiae ACCURACY 1 bp
AGTACAGAGGTCAATGGCAGTAATGGCACTTGGTGCGGCTTCTGTGCCAGTAATGTGGCTTTCTCTAACAAGTTGGATGCACATCGGCGAGAGACAAGGCTTTAGAATACGGTCACAGATATTGGAGGCATATTTGGAGGAAAAGCCAATGGAATGGTACGACAATAATGAAAAATTGTTAGGAGATTTTACTCAAATCAACAGATGTGTGGAAGAGCTAAGATCAAGCTCCGCAGAGGCATCAGCCATAACTTTCCAGAATTTAGTTGCAATATGTGCGCTTCTGGGGACGTCATTCTACTATTCTTGGTCATTAACTTTAATTATTCTTTGCAGCTCTCCAATAATCACATTTTTTGCAGTGGTGTTTTCCAGAATGATTCATGTATATTCAGAGAAG

> SEQ_7 Yeast Saccharomyces cerevisiae ACCURACY 1 bp
TTCTCTATTCTGCCACTATACAATTTATTGTTTTCCACAAAGGGTAAAGGTACTTTAAGAAAATAGTTTCTTATTTTTTTTGCCATGTAATTACCTAATAGGGAAATTTACACGCTGCTTCGCACATATACAATTGTTTCAGATATGAAAACTGTTGCATTATTGCCGTTCATCATTTAAATACCAGAGCTTATAAACCTGGATATGGCTGAACTATCTCCCGTTGTTACGTTCACACAGAGAGCTTTCAAGTGCCGCTGAAAATTCCACTAGGAAACAAAGAACAAGCTACGTCATGAACTTTTTAAGTTTTAAGACTACAAAACACTATCACATTTTCAGGTACGTGAACATACGGAATGACTACAGGCTGTTAATGATAATGATAATAGGTACCGTG

> SEQ_8 Saccharomyces cerevisiae (yeast) ACCURACY 1 bp
TAGTATCCGCTAAGAATTTAAGCAGGCCAACGTCCATACTGCTTAGGACCTGTGCCTGGCAAGTCGCAGATTGAAGTTTTTTCAACCATGTAAATTTCCTAATTGGGTAAGTACATGATGAAACACATATGAAGAAAAAAGCTTTCCTACATATTCAAGATTTTTTTCTGTGGGTGGAATACTATTTAAGGAGTGCTATTAGTATCTTATTTGACTTCAAAGCAATACGATACCTTTTCTTTTCACCTGCTCTGGCTATAATTATAATTGGTTACTTAAAAATGCACCGTTAAGAACCATATCCAAGAATCAAAAATGTCTGATGCGGCTCCTTCATTGAGCAATCTATTTTATGATCCAACGTATAATCCTGGTCAAAGCACCATTAACTACACTTCCA

> SEQ_9 Yeast Saccharomyces cerevisiae ACCURACY 1 bp
TATAATGGCGAAGAAGTTAAGCCTTCAATTGATTGCAGGTCTATGAGTACTTATAATGAGCATAGATCTTCCACCTACCAATATCTGGAAAATGGTAGGTTTTACATCACATATGCTGACGGAACATTTGCTGACGGTAGTTGGGGGACGGAAACTGTATCAATTAATGGAATTGACATCCCCAATATCCAGTTCGGAGTTGCCAAGTATGCTACGACACCCGTTAGTGGTGTTCTTGGAATTGGGTTTCCTAGAAGAGAGTCCGTTAAGGGCTATGAAGGTGCTCCTAATGAATATTATCCTAATTTTCCTCAGATTTTAAAAAGTGAAAAAATAATCGATGTGGTCGCGTATTCGCTGTTCTTAAACTCACCTGATTCAGGTACTGGTTCGATTGTTT

Sequences absolutely dissimilar.

No conserved regions.

Conventional evolution-based approaches to sequence alignment (like BLAST) are hardly applicable.

Dinucleotides (AA/TT first) are the primary subject of alignment.

Possible number of configurations: $\prod_i (2\,Ac_i + 1)$, where $Ac_i$ is the mapping accuracy (in bp) of sequence $i$.

204 sequences, Ac. from 1 to 55

Roughly $51^{204}$ configurations
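To give a feel for the size of this search space, the short sketch below evaluates the product of (2·Ac + 1) over all sequences on a log scale. It assumes, purely for illustration, that every one of the 204 sequences has an accuracy of 25 bp, which reproduces the factor of 51 quoted on the slide; the function name is hypothetical.

```python
import math


def log10_configurations(accuracies):
    """log10 of prod_i (2*Ac_i + 1).

    Each sequence mapped with accuracy Ac_i can be shifted by up to
    Ac_i bp in either direction, giving 2*Ac_i + 1 possible offsets.
    """
    return sum(math.log10(2 * ac + 1) for ac in accuracies)


# Illustrative assumption: all 204 sequences share Ac = 25 bp,
# so each contributes a factor of 51.
accuracies = [25] * 204
exponent = log10_configurations(accuracies)
print(f"about 10^{exponent:.0f} configurations")
```

Even under this simplified assumption the count is far beyond exhaustive enumeration, which motivates the heuristic alignment strategies.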

Algorithms of multiple sequence alignment.

• Alignment of the most accurately mapped nucleosome sequences.

• Multicycle consecutive alignment – AA/TT matrices Mi of Ac.-sorted sequences are aligned one by one to the pattern derived at the previous step. Results of 10,000 cycles are averaged.

• Quasi-exhaustive consecutive alignment – keeps track of several “suboptimal” alignments; the alignment with the highest $\mathrm{SIM} = \sum_{ij} (M_i * M_j)$ is final.

• Alignment with simulated annealing strategy: a new alignment is accepted if $\mathrm{SIM}_{k+1} > \mathrm{SIM}_k$, or otherwise with probability $P(\Delta E) = e^{-\Delta E/T}$, where $-\Delta E = \mathrm{SIM}_{k+1} - \mathrm{SIM}_k$. $T$ is a decreasing “temperature” factor.

• Multiple alignment by positional entropy criterion using Gibbs sampling strategy.
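The acceptance rule of the simulated annealing strategy can be sketched as follows. This is a minimal illustration only, not the original implementation: the function name and the optional random-number source are hypothetical, and how SIM is computed or how T is decreased is left outside the sketch.

```python
import math
import random


def accept_alignment(sim_new, sim_old, temperature, rng=random):
    """Metropolis-style acceptance test for a proposed alignment.

    An improvement (sim_new > sim_old) is always accepted. Otherwise the
    move is accepted with probability exp(-dE / T), where
    dE = sim_old - sim_new >= 0 and T > 0 is the current temperature.
    """
    if sim_new > sim_old:
        return True
    delta_e = sim_old - sim_new
    return rng.random() < math.exp(-delta_e / temperature)


# A worse alignment is accepted often at high T and rarely at low T.
random.seed(1)
for t in (10.0, 1.0, 0.1):
    accepted = sum(accept_alignment(0.5, 1.0, t) for _ in range(1000))
    print(f"T={t}: accepted {accepted} of 1000 worsening moves")
```

Lowering T over the cycles gradually turns the search from exploratory into strictly greedy.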














Approach 2

Chromatin structure of promoter sequences and regularity in positioning of TF sites – an example of an intuitive conceptual model.


TF – nucleosome correlation.

• Putative TF binding sites mapped on promoter sequences.

• Distribution of each TF site over all sequences calculated.

• Scanning with a “nucleosomal” 145 bp window through distributions of all TF sites.

• Calculation of spectral distribution for each TF inside the window at every scanning point.

• Evaluation of the number N of TFs with main “nucleosomal” period 10.1-10.5 bp in their spectra.

• Evaluation of the difference between N and the statistically expected number R of such TFs: dS(StD) = (N-R)/SQRT(R).
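The last three steps of this procedure can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the original program: the spectrum is taken to be a plain Fourier periodogram evaluated on a grid of trial periods, and the function names and the period grid are hypothetical.

```python
import math


def power_at_period(counts, period):
    """Squared Fourier amplitude of a positional count profile at one period (bp)."""
    mean = sum(counts) / len(counts)
    re = sum((c - mean) * math.cos(2 * math.pi * x / period) for x, c in enumerate(counts))
    im = sum((c - mean) * math.sin(2 * math.pi * x / period) for x, c in enumerate(counts))
    return re * re + im * im


def main_period(counts, periods):
    """Trial period with the highest spectral power inside the window."""
    return max(periods, key=lambda p: power_at_period(counts, p))


def is_nucleosomal(period):
    """True if the period falls in the 10.1-10.5 bp range quoted above."""
    return 10.1 <= period <= 10.5


def d_s(n_observed, n_expected):
    """dS in StD units: (N - R) / sqrt(R)."""
    return (n_observed - n_expected) / math.sqrt(n_expected)


# Synthetic 145 bp profile with a 10.3 bp periodicity (illustrative data only).
window = [1 + math.cos(2 * math.pi * x / 10.3) for x in range(145)]
trial_periods = [p / 10 for p in range(80, 131)]  # 8.0 ... 13.0 bp
print(main_period(window, trial_periods))
print(d_s(20, 10))  # hypothetical N = 20 observed vs R = 10 expected
```

In a real scan, `main_period` would be evaluated for every TF at every window position, and N would count the TFs for which `is_nucleosomal` holds.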

Left: Order of events leading to transcription initiation from the IFN-β promoter.

I and II represent nucleosomes positioned in the promoter area. Derived from Agalioti, T., Lomvardas, S., Parekh, B., Yie, J., Maniatis, T., and Thanos, D. 2000. Ordered recruitment of chromatin modifying and general transcription factors to the IFN-β promoter. Cell 103:667-678.

Right: Nucleosome positioning at the pS2 promoter. Derived from Sewack, G.F. and Hansen, U. 1997. Nucleosome positioning and transcription-associated chromatin alterations on the human estrogen-responsive pS2 promoter. J. Biol. Chem. 272:31118-31129.


To further optimize the findings and increase the statistical significance of the results, we varied the length of the windows. The calculations indicate the most statistically significant effect, 6.68 StD, for the windows (-46…+121) and (-46…+124), which cover the TSS. The size of this window (167–170 bp) is similar to that of the chromatosome.

Nucleosome-TF correlation.

• Very consistent effect of high statistical significance.

• Obtained on two large, representative and essentially independent data sets.

• Obtained by two independent approaches.

• Has many correlations with known experimental data.

Large-scale human promoter mapping using CpG islands.

(Program CpG_promoter, using Quadratic Discriminant Analysis, QDA)

Approach 3

Quantitative analysis of known sequence feature

Definition of CpG island

• Length > 200 bp

• C + G content > 50%

• CpG ratio Obs/Exp > 0.6

(Gardiner-Garden and Frommer, J. Mol. Biol. 196, 261-282 (1987))
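The three criteria can be checked directly on a candidate sequence. The sketch below is a simplified illustration: it tests a whole sequence at once, whereas practical CpG island finders slide a window along the genome, and the function names are hypothetical. The observed/expected ratio is computed as (number of CpG × length) / (number of C × number of G).

```python
def cpg_stats(seq):
    """Return (length, C+G fraction, CpG observed/expected ratio)."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_fraction, obs_exp


def is_cpg_island(seq):
    """Apply the thresholds listed above: length > 200 bp, C+G > 50%, Obs/Exp > 0.6."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n > 200 and gc_fraction > 0.5 and obs_exp > 0.6


print(is_cpg_island("CG" * 101))  # artificial 202 bp CpG-rich sequence
print(is_cpg_island("AT" * 150))  # artificial AT-only sequence
```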

Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics Chapter 9. Control of Gene Expression The Molecular Genetic Mechanisms That Create Specialized Cell Types

Figure 9-70. The CG islands surrounding the promoter in three mammalian housekeeping genes. The yellow boxes show the extent of each island. Note also that, as for most genes in mammals, the exons (dark red) are very short relative to the introns (light red). (Adapted from A.P. Bird, Trends Genet. 3:342-347, 1987.)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=1638112

SN and SP

Sensitivity SN is proportion of True Positive (TP) predictions out of all de-facto positives:

SN = TP / (TP + FN)

Specificity SP is proportion of True Positive (TP) predictions out of all positive predictions:

SP = TP / (TP + FP)
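Both measures are simple ratios of prediction counts. The sketch below uses hypothetical counts chosen only to illustrate the definitions; note that SP as defined here is the fraction of predictions that are correct (often called positive predictive value elsewhere).

```python
def sensitivity(tp, fn):
    """SN = TP / (TP + FN): fraction of real positives that were predicted."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """SP = TP / (TP + FP): fraction of positive predictions that are real."""
    return tp / (tp + fp)


# Hypothetical counts: 8 promoters found, 2 missed, 8 false predictions.
print(sensitivity(tp=8, fn=2))  # 0.8
print(specificity(tp=8, fp=8))  # 0.5
```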

http://www.amazon.com/exec/obidos/tg/detail/-/0879696087/ref=lib_rd_ss_TFCV/002-1925580-3267243?v=glance&vi=reader&img=1

Results of promoter mapping (Test Set 2)

• 135 genes

• 68 have CpG island around promoter

• 63 recognized

• SN = 0.47 (0.93)

• SP = 0.34 (1 positive prediction per 26 kb; the actual density is 1 per 36 kb)

• Promoter Scan gives SN = 0.44 and SP = 0.06 (1 positive prediction per 4.7 kb)
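The two sensitivity values quoted for CpG_promoter appear to follow from the counts on this slide, assuming the first is taken relative to all 135 genes and the value in parentheses relative to the 68 genes that have a CpG island around the promoter:

```latex
\mathrm{SN}_{\text{all genes}} = \frac{63}{135} \approx 0.47,
\qquad
\mathrm{SN}_{\text{CpG-island genes}} = \frac{63}{68} \approx 0.93
```

Under this reading, the method misses few promoters that lie within CpG islands, and most of the lost sensitivity comes from promoters without an island.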

Approach 4

Revealing regulatory mechanisms in promoter sequences.

From sequence to model.

(Work in progress)

Alternative Architecture Types of Human Pol II Promoters


4. Nucleic Acids, the Genetic Code, and the Synthesis of Macromolecules 4.3. Nucleic Acid Synthesis

Figure 4-15. Transcription of DNA into RNA is catalyzed by RNA polymerase, which can initiate the synthesis of strands de novo on DNA templates. The nucleotide at the 5′ end of an RNA strand retains all three of its phosphate groups; all subsequent nucleotides release pyrophosphate (PPi) when added to the chain and retain only their α phosphate (red). The released PPi is subsequently hydrolyzed by pyrophosphatase to Pi, driving the equilibrium of the overall reaction toward chain elongation. In most cases, only one DNA strand is transcribed into RNA.

The Cell II. The Flow of Genetic Information 6. RNA Synthesis and Processing Eukaryotic RNA Polymerases and General Transcription Factors

Figure 6.14. RNA polymerase II holoenzyme. The holoenzyme consists of a preformed complex of RNA polymerase II, the general transcription factors TFIIB, TFIIE, TFIIF, and TFIIH, and several other proteins that activate transcription. This complex can be recruited directly to a promoter via interaction with TFIID (TBP + TAFs).

http://www.sinauer.com/detail.php?id=2143

An Introduction to Genetic Analysis

11. Regulation of Gene Transcription. Transcription: an overview of gene regulation in eukaryotes.

Figure 11-29. (a) Assembly of the RNA Polymerase II initiation complex begins with the binding of transcription factor TFIID to the TATA box. TFIID is composed of one TATA box-binding subunit called TBP (dark blue) and more than eight other subunits (TAFs), represented by one large symbol (light blue). Inhibitors can bind to the TFIID-promoter complex, blocking the binding of other general transcription factors. Binding of TFIIA to the TFIID-promoter complex (to form the D-A complex) prevents inhibitor binding. TFIIB then binds to the D-A complex, followed by binding of a preformed complex between TFIIF and RNA polymerase II. Finally, TFIIE, TFIIH, and TFIIJ must add to the complex, in that order, for transcription to be initiated. (From H. Lodish, D. Baltimore, A. Berk, S. L. Zipursky, P. Matsudaira, and J. Darnell, Molecular Cell Biology, 3d ed. Copyright © 1995 by Scientific American Books)

Molecular Cell Biology, Fourth Edition

Harvey Lodish (Massachusetts Institute of Technology)

Arnold Berk (U. of California, Los Angeles)

Lawrence Zipursky (U. of California, Los Angeles)

Paul Matsudaira (Massachusetts Institute of Technology)

David Baltimore (California Institute of Technology)

James Darnell (Rockefeller U.)

Figure 10-52. Structure of the complex formed between TBP, promoter DNA, and TFIIB. In in vitro transcription systems, TFIIB binds to the assembled TBP – promoter DNA complex. Shown here are the C-terminal domain of Arabidopsis TBP and the C-terminal domain of human TFIIB. Transcription initiation in vivo also requires TFIIA, which binds to the TBP – promoter DNA complex on the side opposite to where TFIIB binds. TFIIA is thought to bind before TFIIB does. [Adapted from D. B. Nikolov et al., 1995, Nature 377:119.]

Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics Chapter 9. Control of Gene Expression How Genetic Switches Work

Figure 9-34. The gene control region of a typical eucaryotic gene. The promoter is the DNA sequence where the general transcription factors and the polymerase assemble. The most important feature of the promoter is the TATA box, a short sequence of T-A and A-T base pairs that is recognized by the general transcription factor TFIID. The start point of transcription is typically located about 25 nucleotide pairs downstream from the TATA box. The regulatory sequences serve as binding sites for gene regulatory proteins, whose presence on the DNA affects the rate of transcription initiation. These sequences can be located adjacent to the promoter, far upstream of it, or even downstream of the gene. DNA looping is thought to allow gene regulatory proteins bound at any of these positions to interact with the proteins that assemble at the promoter. Whereas the general transcription factors that assemble at the promoter are similar for all polymerase II transcribed genes, the gene regulatory proteins and the locations of their binding sites relative to the promoter are different for each gene.

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=cell.biblist.2167#2240

A total of 1871 non-redundant human promoter sequences from the Eukaryotic Promoter Database (EPD) release 75 (http://www.epd.isb-sib.ch) and 8793 human promoters from the Database of Transcriptional Start Sites (DBTSS) (http://www.dbtss.hgc.jp/index.html) were used for statistical analyses as two separate datasets. We also constructed a small test set of 27 human promoters with MSS. This set was utilized to analyze the statistics of core-promoter elements in MSS promoters. Each promoter was considered several times, one time for each known TSS, so the total number of sequences in this set is 107.


10. Regulation of Transcription Initiation 10.4. Regulatory Sequences in Eukaryotic Protein-Coding Genes

Figure 10-30. Comparison of nucleotide sequences upstream of the start site in 60 different vertebrate protein-coding genes. Each sequence was aligned to maximize homology in the region from −35 to −20. The tabulated numbers are the percentage frequency of each base at each position. Maximum homology occurs over a six-base region, referred to as the TATA box, whose consensus sequence is shown at the bottom. The initial base in mRNAs encoded by genes containing a TATA box most frequently is an A. [See R. Breathnach and P. Chambon, 1981, Ann. Rev. Biochem. 50:349; P. Bucher, 1990, J. Mol. Biol. 212:563.]

To extract a subset of promoter sequences containing the TATA box or Inr element at their functional positions, the positional weight matrices (PWM) with optimal cut-off values were applied (Bucher, 1990). We define the TATA or Inr element as being present at a certain position if the PWM score at this position exceeds the cut-off value, and define the element to be absent at this position otherwise. Since there are no matrices for DPE and BRE, we matched 5 out of 5 letters and 6 out of 7 for the DPE and BRE consensuses (Smale and Kadonaga, 2003), respectively.
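The two detection rules described here, a PWM score above a cut-off and a k-out-of-n consensus match, can be sketched as follows. Everything concrete in the sketch is an assumption for illustration: the toy 4-column matrix is not the Bucher (1990) TATA or Inr matrix, and the IUPAC consensus strings RGWYV (DPE) and SSRCGCC (BRE) should be checked against Smale and Kadonaga (2003) before use.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pwm_hits(seq, pwm, cutoff):
    """Start positions where the summed PWM score exceeds the cut-off.

    pwm is a list of per-position dicts mapping A/C/G/T to a score;
    seq is assumed to contain only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if score > cutoff:
            hits.append(i)
    return hits


def consensus_hits(seq, consensus, min_matches):
    """Start positions where at least min_matches letters fit the IUPAC consensus."""
    hits = []
    for i in range(len(seq) - len(consensus) + 1):
        matches = sum(seq[i + j] in IUPAC[c] for j, c in enumerate(consensus))
        if matches >= min_matches:
            hits.append(i)
    return hits


# Toy matrix preferring T-A-T-A (hypothetical scores).
toy_pwm = [
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
]
print(pwm_hits("GGTATAGG", toy_pwm, cutoff=3.5))
print(consensus_hits("AGACG", "RGWYV", min_matches=5))        # DPE rule: 5 of 5
print(consensus_hits("AAGCGCGCCAA", "SSRCGCC", min_matches=6))  # BRE rule: 6 of 7
```

A hit position would then be compared with the element's functional window relative to the TSS to decide whether the element counts as present.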

We used the same parameters to extract subsets containing known synergetic combinations, yet the respective elements had to be placed at their experimentally defined synergetic distance from one another. The distances between the elements in the remaining combinations were chosen based on the positions of the respective elements in the known combinations.

To estimate the statistical significance of the occurrence frequency of an element or synergetic combination in the respective functional window, we calculated a statistical significance parameter, dS, measured in units of standard deviation (StD = $\sqrt{N_{out}}$): $dS = (N_{in} - N_{out})/\sqrt{N_{out}}$, where $N_{in}$ is the number of occurrences of an element or combination inside its functional window and $N_{out}$ is the number of occurrences of that element or combination in the average interval of the same length outside the functional window.
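A minimal sketch of this calculation is given below. It assumes that hit positions (relative to the TSS) have already been pooled over all promoters, and that the background is the remainder of a fixed scanned region; the function name, the region bounds and the example data are hypothetical.

```python
import math


def window_significance(positions, window, region):
    """dS = (N_in - N_out) / sqrt(N_out) for one functional window.

    positions: hit positions relative to the TSS, pooled over promoters.
    window:    (first, last) inclusive bounds of the functional window.
    region:    (first, last) inclusive bounds of the whole scanned region.
    N_out is the hit count outside the window, rescaled to an interval
    of the same length as the window.
    """
    lo, hi = window
    r_lo, r_hi = region
    width = hi - lo + 1
    outside_length = (r_hi - r_lo + 1) - width
    n_in = sum(lo <= p <= hi for p in positions)
    n_outside = sum(r_lo <= p <= r_hi and not lo <= p <= hi for p in positions)
    n_out = n_outside * width / outside_length
    if n_out == 0:
        raise ValueError("no background hits: dS is undefined")
    return (n_in - n_out) / math.sqrt(n_out)


# Illustrative data: 30 hits at -30 inside a (-35, -26) window,
# 19 scattered hits elsewhere in a (-100, 99) region.
hits = [-30] * 30 + list(range(0, 19))
print(window_significance(hits, window=(-35, -26), region=(-100, 99)))
```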

Figure 1. The occurrence frequency (the percentage of sequences having a considered motif centered at a particular position) distribution of the TATA box motifs based on scanning of EPD (blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed at position +1. The straight horizontal gray line depicts the average amount of TATA motifs found in the randomly generated sequence with the same percentage of each of four nucleotides as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5%. The shaded rectangles indicate the standard deviation calculated based on 1871 random sequences (short rectangle) and on 8973 random sequences (long rectangle), respectively.

Figure 2. The occurrence frequency distribution of the Inr motifs based on scanning of EPD (blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed at position +1. The straight horizontal gray line depicts the average amount of Inr motifs found in the randomly generated sequence with the same percentage of each of four nucleotides as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5%. The shaded rectangles indicate the standard deviation calculated based on 1871 random sequences (short rectangle) and on 8973 random sequences (long rectangle), respectively.

According to these data, half of the promoters, 49.0% (48.4%), have the Inr element at a functional position, only 21.8% (10.4%) have a TATA box, 24.6% (24.6%) contain DPE, and 24.5% (25.5%) have BRE. The majority of the promoters, 77.3% (74.3%), have at least one of the four core-promoter elements at its functional position, and 41.8% (44.1%) have only one element: TATA – 5.5% (2.9%), Inr – 20.1% (23.0%), DPE – 6.6% (8.4%), and BRE – 9.6% (9.8%).

Figure 1. Occurrence frequency distribution of combination TATA_Inr for EPD (blue) and DBTSS (magenta). TSS is placed at position +1.

Figure 2. Occurrence frequency distribution of combination Inr_DPE for EPD (blue) and DBTSS (magenta).

Figure 3. Occurrence frequency distribution of combination TATA_BRE for EPD (blue) and DBTSS (magenta).

Figure 4. Occurrence frequency distribution of combination Inr_BRE for EPD (blue) and DBTSS (magenta).

Figure 5. Occurrence frequency distribution of combination DPE_BRE for EPD (blue) and DBTSS (magenta). The value at each position is an 11-point sliding average.

Figure 6. Occurrence frequency distribution of combination TATA_DPE for EPD (blue) and DBTSS (magenta).

Note the common features of the aforementioned combinations: (1) all of them involve TFIID, and TBP binds to DNA regardless of the presence/absence of TATA box; (2) TFIID covers the TSS area; (3) the distance from the TSS to the edge of the complex is approximately the same (~30–40 bp). Combinations BRE_Inr, BRE_DPE and TATA_DPE also satisfy these requirements. These combinations are present in a number of promoters comparable with the three previous combinations, with comparable statistical significance (Table 4). They may therefore also be considered as possible synergetic combinations of core-promoter elements (Fig. 1D–F).

We found that 83 (76.9%) of the MSS promoters contain at least one core-promoter element in the functional position relative to the TSS. This percentage is practically the same as for all promoters from both datasets. The statistical significance of the presence of any one of the four elements in the functional position is comparatively high for a relatively small dataset: dS = 3.5 StD, P-value = 0.0005. Remarkably, the portion of MSS promoters containing BRE (29.6%) is larger than on average in the EPD/DBTSS datasets. Thus the presence of the BRE element in the CpG+ and MSS promoters is comparable with the presence of the TATA box in the CpG-less promoters.

An example of MSS promoter.

Figure 1. An example of MSS promoter sequence (36, GenBank Accession #X52601, TSS positions marked by shadow) containing all four core promoter elements at functional position relative to a TSS (marked by the bold letters of a color same as the respective core element).

Nature Structural & Molecular Biology 11, 1031 - 1033 (2004) doi:10.1038/nsmb1104-1031. Another piece in the transcription initiation puzzle. Francisco J Asturias. The author is at the Department of Cell Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA.

A new report provides evidence that the TFIIB-RNAPII interaction depends on the presence of additional factors and highlights the importance of structural characterization of the entire preinitiation complex.





Beyond core-promoter



The occurrence frequency (the percentage of sequences having a considered motif centered at a particular position) distribution of the GC-box sites. The distribution is obtained by scanning 8973 human promoters from DBTSS (magenta – positive strand, red – negative strand, dark blue – both strands) and 1871 human promoters from EPD (green – both strands). The value at each position is an eleven-point sliding average. The TSS is placed at position +1. The straight horizontal line depicts the average amount of GC-box sites found in both strands of the randomly generated sequence with the same percentage of each of four nucleotides as in the training set of promoter sequences.

The flowchart of optimization process.

The input parameters are promoter database, an initial PWM (or motif consensus), a set of experimentally defined sites, and a “functional window”.

The first step is the extraction of the dataset of putative sites. There are two levels of optimization at the beginning: cutoff value and motif length. The Correlation Coefficient (CC) is used as optimization parameter.

Each cycle brings in a portion of new sites typical for this particular window and excludes some atypical sites, increasing the influence of sites from that window. This influence is strongly limited by the requirement to stay as close as possible to the previous matrix, as expressed by the definition of CC. All of the aforementioned steps should be repeated for each window within the functional window. As a result, we obtain a set of optimal matrices, one for each considered window. Each matrix has its own sensitivity and specificity.

$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)\,(TN + FP)\,(TP + FP)\,(TN + FN)}}$$
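The correlation coefficient used as the optimization parameter combines all four prediction counts into a single value between -1 and 1. A minimal sketch follows; the function name is hypothetical, and returning 0 when the denominator vanishes is a convention chosen here, not something stated in the source.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """CC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention for degenerate tables (assumption)
    return (tp * tn - fn * fp) / denominator


print(correlation_coefficient(tp=10, tn=10, fp=0, fn=0))  # perfect prediction
print(correlation_coefficient(tp=5, tn=5, fp=5, fn=5))    # no better than chance
```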

Sensitivity (Sn) - percentage of experimentally confirmed sites recognized by the respective matrix.

Specificity (Sp). To compare the specificity of two matrices we will suppose that the majority of sites found by these matrices in the randomly generated DNA sequences are false positives. If this is true, the ratio of the occurrence frequencies found by the new and original matrices is inversely proportional to the ratio of their specificities. Therefore, we will consider the averaged occurrence frequency of sites in the randomly generated sequences as a parameter describing the specificity of the PWM.
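The specificity proxy described here, the averaged hit frequency in random sequences, can be estimated by simulation. The sketch below is an illustration under assumptions: it draws independent bases with the EPD composition quoted in the figure captions (A = 20.7%, C = 29.3%, G = 29.5%, T = 20.5%), and the hit counter passed in is a stand-in for a real PWM scan; all names are hypothetical.

```python
import random

# Nucleotide composition of EPD promoter sequences, as quoted in the figure captions.
COMPOSITION = {"A": 0.207, "C": 0.293, "G": 0.295, "T": 0.205}


def random_sequence(length, rng, composition=COMPOSITION):
    """Random sequence of independent bases drawn with the given composition."""
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def background_frequency(count_hits, n_sequences, length, seed=0):
    """Average number of hits per position over random sequences.

    count_hits: function taking a sequence and returning its number of hits
                (a real application would scan with the PWM under test).
    """
    rng = random.Random(seed)
    total = sum(count_hits(random_sequence(length, rng)) for _ in range(n_sequences))
    return total / (n_sequences * length)


# Stand-in "matrix": count occurrences of the dinucleotide GC.
frequency = background_frequency(lambda s: s.count("GC"), n_sequences=200, length=500)
print(f"background hit frequency per position: {frequency:.4f}")
```

Comparing this frequency for the original and the optimized matrix gives the ratio that the text treats as inversely proportional to the ratio of their specificities.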

4-row mononucleotide versus 16-row dinucleotide matrices

The majority of practically used PWMs are the 4-row mononucleotide matrices based on the ‘additivity hypothesis’, which considers the contributions from each position of the binding site as independent and additive (Berg and von Hippel, 1987).

Some experimental evidence (Man and Stormo, 2001; Bulyk, M., Johnson, P., and Church, G., 2002) and theoretical considerations (Zhang and Marr, 1993) show that a dinucleotide approach (accounting for dependence between adjacent nucleotides of a TFBS) could in some cases be the more appropriate approximation. Using the same methodology, we built the 16-row dinucleotide matrices.

The limitations of small experimental datasets have convinced researchers to use less accurate, but fairly reliable, 4-row matrices (Benos, P., Bulyk, M., and Stormo, G., 2002). There is no such limitation in our case since we use a large set of putative sites.
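To make the difference between the two matrix types concrete, the sketch below builds a 16-row dinucleotide frequency matrix from a set of aligned sites. It is an illustration only: the toy sites are invented, the function name is hypothetical, and converting frequencies to log-odds weights is left out.

```python
from itertools import product

# The 16 rows: AA, AC, AG, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_matrix(sites):
    """Frequency of each dinucleotide at each pair of adjacent positions.

    sites: aligned sites of equal length containing only A, C, G, T.
    Returns a dict with 16 rows, each of length (site length - 1).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {d: [0] * (width - 1) for d in DINUCLEOTIDES}
    for site in sites:
        for i in range(width - 1):
            counts[site[i:i + 2]][i] += 1
    return {d: [c / len(sites) for c in row] for d, row in counts.items()}


# Invented 4 bp sites, for illustration only.
matrix = dinucleotide_matrix(["GGCG", "GGCG", "GACG", "TGCG"])
print(matrix["GG"])  # frequency of GG at each of the 3 adjacent-position pairs
print(matrix["CG"])
```

A 4-row matrix built from the same sites would have 4 × 4 entries, while the dinucleotide matrix has 16 × 3, which is why it needs a larger set of sites to be estimated reliably.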

The sensitivity/specificity ratios for the original and new matrices for GC-box.

Specificity - the averaged occurrence frequency of GC-box sites found by the original matrix (circle at the upper left corner) and two sets of new 4-row (squares) and 16-row (diamonds) matrices. The x-axis is sensitivity - the percentage of recognized sites from a control set of experimentally defined sites.

Figure 3. The occurrence frequency distribution of the HMG1 sites. The rest as for Sp1.

Figure 4. The occurrence frequency distribution of the PAX2 sites. The rest as for Sp1.

Figure 5. The occurrence frequency distribution of the NRF2 sites. The rest as at Figure 2.

A pair of two closely positioned TF binding sites that acquire new regulatory properties due to direct or indirect interactions between corresponding transcription factors is called a composite element (CE).

We performed clustering of putative binding sites predicted by the MATCH program in the vicinity of putative binding sites for TF STAT-1, as a case study. Clear over-representation of putative binding sites was obtained for transcription factors AML-1a, AP-2, CDX-a, c-Ets-1, c-Myb, c-REL, ELK-1, EN-1, GKLF, HSF-1, HSF-2, IK-1, IK-2, IK-3, LYF-1, MSX-1, Myo-D, NF-AT, NF-κB, NRF-2, Oct-1, P300, Pax-4, Pax-6, RFX-1, SRY, TST-1. In contrast, putative binding sites for GATA-1, MZF-1, and Sp1 were clearly under-represented in that area. Although some of the results might be a mere consequence of shared motifs for the respective binding sites, others warrant a different interpretation and may point to potential CEs.

Influence of variant histone H2A.Z on local chromatin dynamics (In-depth chromatin analysis by structural modeling)

Gaussian Network Model (Bahar et al., 1997)

• The dynamics of the interactions is controlled by the connectivity (or Kirchhoff) matrix Γ, by analogy with the statistical mechanical theory of elasticity originally developed by Flory and coworkers for polymer networks.

• The elements of Γ are defined as

$$\Gamma_{ij} = \begin{cases} -1 & \text{if } i \neq j \text{ and } R_{ij} \le r_c \\ 0 & \text{if } i \neq j \text{ and } R_{ij} > r_c \\ -\sum_{k \neq i} \Gamma_{ik} & \text{if } i = j \end{cases}$$

• Here $r_c$ is the cutoff distance defining the range of interaction of residues, each residue being represented by its α-carbon, and $R_{ij}$ is the distance between the ith and jth residues.

• The value of $r_c$ = 7 Å includes the neighboring residues located in the first coordination shell near a central residue.

• Note that the columns (or rows) of Γ are interdependent (all sum up to zero), and thus Γ cannot be inverted; instead its inverse is reconstructed after removal of its zero eigenvalue and the corresponding eigenvector.
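A minimal sketch of how the connectivity (Kirchhoff) matrix of the Gaussian Network Model can be assembled from α-carbon coordinates is shown below. It follows the standard GNM convention (off-diagonal element -1 for residue pairs within the cutoff, 0 otherwise, diagonal equal to the number of contacts); the toy coordinates and the function name are hypothetical, and the eigen-decomposition needed for the pseudo-inverse is not included.

```python
import math


def kirchhoff_matrix(coords, cutoff=7.0):
    """Connectivity matrix from C-alpha coordinates (in angstroms).

    Off-diagonal element (i, j) is -1 if residues i and j lie within the
    cutoff distance and 0 otherwise; each diagonal element is the number
    of contacts of that residue, so every row sums to zero.
    """
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
    for i in range(n):
        gamma[i][i] = -sum(gamma[i][j] for j in range(n) if j != i)
    return gamma


# Toy chain: three residues 3.8 angstroms apart plus one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in kirchhoff_matrix(toy_coords):
    print(row)
```

Because every row sums to zero, the matrix always has a zero eigenvalue, which is the reason the text works with a reconstructed inverse rather than a direct one.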

Inhibitor binding alters the directions of motions in HIV-1 reverse transcriptase.

"Anisotropy of fluctuation dynamics of proteins with an elastic network model." Atilgan, AR, Durrell, SR, Jernigan, RL, Demirel, MC, Keskin, O. & Bahar, I. Biophys. J. 80, 505-515, 2001.

Anisotropic Network Model (Atilgan et al., 2001)

• The anisotropic network model (ANM) is an extension of the GNM to the 3N-dimensional space of collective modes.

• The inter-residue distances are controlled by harmonic potentials in the GNM; the ANM adopts the further assumption that the three (x, y and z) components of the inter-residue separation vectors obey Gaussian dynamics.

• Γ is replaced by its 3N x 3N counterpart (1/γ)H, where H is the Hessian matrix of the second derivatives of the intermolecular potential $V = (\gamma/2)\,\Delta R^{T}\,\Gamma\,\Delta R$.

http://www.ccbb.pitt.edu/PDFFiles/DynPro97.pdf



http://www.ccbb.pitt.edu/CCBBResearchDomMotImages/PDFFiles/143.pdf





















Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell Chapter 2. Small Molecules, Energy, and Biosynthesis The Chemical Components of a Cell

Panel 2-5: The 20 amino acids involved in the synthesis of proteins

Molecular Biology of the Cell, 3rd edn. © 1994 by Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson.

Part I. Introduction to the Cell Chapter 2. Small Molecules, Energy, and Biosynthesis

The Chemical Components of a Cell Panel 2-6: A survey of the major types of nucleotides and their derivatives encountered in cells

http://www.ncbi.nlm.nih.gov/books/data/cell/html/copy.html



Going beyond:

• To other species (promoter-chromatin architecture in Drosophila and Yeast).

• TF regulatory modules.

• Post-transcriptional regulation (RNAi).

• From sequence analysis to molecular modeling and vice versa.

• Still beyond…

Acknowledgements

• Prof. Ed Trifonov (Weizmann Institute / University of Haifa)

• Prof. Michael Q. Zhang (Cold Spring Harbor Lab NY)

• Prof. Ivet Bahar (University of Pittsburgh)

• Prof. Gary Stormo (Washington University, St. Louis)

• Prof. Alex Bolshoy (Weizmann Ins. / Haifa U.)

• Prof. Mark Borodovsky (Georgia Institute of Technology, Atlanta)

• K. Derenshteyn (GIT)

Ioshikhes’ group:

• Dr. Naum Gershenzon

• Dr. Li Wang

• Dr. Amutha Ramaswamy

(Dept. Biomedical Informatics, Ohio State University)

Summary

“Do you see anything there?” …

“Just a suggestion, perhaps. But wait an instant!” He stood upon a chair, and holding up the light in his left hand, he curved his right arm over the broad hat and round the long ringlets.

“Good heavens!” I cried in amazement.

The face of Stapleton had sprung out of the canvas.

“The fellow is a Baskerville – that is evident.”

Arthur Conan Doyle, “The Hound of the Baskervilles”
