

Life Science Open Day | 2006 | Weizmann Institute of Science

We develop exploratory methods and use them for the analysis of high-throughput biological data, to study compound biological processes, especially the onset and progression of human cancer.

We introduced Sorting Points Into Neighborhoods (SPIN), a novel unsupervised approach for the organization and visualization of multidimensional data. At the heart of the method is a presentation of the full pairwise distance matrix of the data points, viewed in pseudocolor. An iterative process leads to an optimal ordering of the points, one that reveals embedded structures whose shapes are of interest (e.g. elongated, circular, compact). Elongated objects are particularly important, since they are associated with continuous variation induced by temporal evolution, characteristic of many biological processes, such as gradual differentiation of various cell lineages, or change from normal tissue to premalignant, malignant and metastatic form. SPIN excels at tracking progression, even when it traces a complicated trajectory in multi-dimensional space; its development was motivated by the need to detect and visualize continuous trajectories in expression data.
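The reordering idea can be illustrated with a minimal sketch. It assumes a simple "side-to-side" style iteration: each point is scored by its distances to all others, weighted by their current positions, and the points are re-sorted by that score until the order stops changing. The function names, the linear weights and the stopping rule are illustrative assumptions, not the published SPIN implementation.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def side_to_side_order(dist, max_iter=100):
    """Iteratively reorder points so that one end of the ordering is
    far from the other end (illustrative sketch, not the authors' code)."""
    n = len(dist)
    # Linear, zero-centered weight attached to each position in the ordering.
    weights = [k - (n - 1) / 2.0 for k in range(n)]
    order = list(range(n))
    for _ in range(max_iter):
        # Score of point i: its distances to all points, weighted by the
        # current position of each of those points.
        scores = {
            i: sum(dist[i][j] * weights[pos] for pos, j in enumerate(order))
            for i in order
        }
        new_order = sorted(order, key=lambda i: -scores[i])
        if new_order == order:  # ordering is stable, stop
            break
        order = new_order
    return order


# Hypothetical one-dimensional "trajectory" with two points swapped.
points = [(0.0,), (2.0,), (1.0,), (3.0,), (4.0,), (5.0,)]
order = side_to_side_order(pairwise_distances(points))
sorted_points = [points[i] for i in order]
```

Displaying the distance matrix with rows and columns permuted by `order` is what would reveal an elongated structure as a band around the diagonal. The iteration cap is there because this simple scheme is not guaranteed to converge from an arbitrary starting order.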

Chromosomal instabilities play key roles in several types of cancer. As part of an NIH-funded PPG on colon cancer, we examined the relationship between chromosomal aberrations and large-scale alterations in transcription. Using three types of array-generated data (SNP, CGH and expression) collected from many patients, we explored the genome-wide relationships between DNA copy number and gene expression, measured at different stages of the disease. We showed that the percentage of patients with these aberrations increases as the disease advances. Alterations in DNA copy number are correlated with transcription levels, which are often coordinated over large chromosomal segments (Fig. 1), implying that while specific chromosomal abnormalities may arise stochastically, the associated changes in gene expression cause the selection of cells bearing these abnormalities for clonal expansion. Amplified (deleted) chromosomal regions in carcinoma are currently being searched for oncogenes (tumor suppressors). A similar study was done on glioblastoma, in collaboration with a group at Lausanne. Finally, in collaboration with groups at Sheba, Zurich and St. Jude (Tennessee), we explored the

Fig. 1 Instability patterns in chromosome 20. A. Correlation matrix calculated from CGH data for the 75 probes annotated to chr 20, ordered according to their chromosomal locations. Each element in the matrix represents the Pearson correlation between the DNA fold change profiles of two probes, as measured across 37 primary colon tumors (Douglas et al., 2004). B. The correspondingly ordered DNA fold change matrix. Probes (rows) are ordered as in A; samples (columns) are ordered by SPIN. Colors stand for relative fold change. C-D. Similar figures for expression data, for the 302 genes annotated to chromosome 20 on the Affymetrix U133 chip. Genes are ordered according to chromosomal locations, samples according to tissue identity, indicated by the colored bar below the matrix. The reddish vertical stripes seen in some samples are indicative of chromosomal amplification. They do not appear in normal samples; very few are observed among the adenomas, many more among the stage II to IV carcinomas, and metastasis samples are predominantly amplified.
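As a small illustration of the kind of matrix shown in panel A, the sketch below computes Pearson correlations between probe profiles, one profile per probe measured across samples. The function names and the toy numbers are hypothetical and are not taken from the data set in the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(profiles):
    """Matrix whose (i, j) element is the correlation of probes i and j.

    `profiles` holds one row per probe, already ordered by chromosomal
    location, and one column per sample.
    """
    return [[pearson(p, q) for q in profiles] for p in profiles]


# Hypothetical fold-change profiles of three probes across three samples.
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],  # varies together with the first probe
    [3.0, 2.0, 1.0],  # varies opposite to the first probe
]
matrix = correlation_matrix(profiles)
```

With probes kept in chromosomal order, blocks of high correlation along the diagonal of such a matrix would correspond to segments that are gained or lost together.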

972 8 934 3964

Department of Physics of Complex Systems

972 8 934 4109


Exploratory Analysis of Biological Data

Eytan Domany

Jacob Bock Axelsen

Roman Brinzanik

Liat Ein-Dor

Anat Reiner

Noam Shental

Hilah Gal

Libi Hertzberg

Itai Kela

Tal Shay

Michal Sheffer

Yuval Tabach

Dafna Tsafrir

Or Zuk

Garold Fuks

Mark Koudritsky

Dvir Netanely

Paz Polak

Shlomo Urbach

Assif Yitzhaky

www.weizmann.ac.il/physics/complex/compphys/


connection between aneuploidy and expression in childhood leukemia. A method that predicts which of the chromosomes is duplicated in a given sample (on the basis of expression data) was developed using expression data combined with cytogenetic information.

Surprisingly, two independent studies in which we took part converged on the same group of genes, which plays a central role in malignancy. The first (a collaboration with V. Rotter and T. Pilpel) was aimed at deciphering regulatory events that drive malignant transformation. Analysis of genome-wide transcription data taken at different times from an in-vitro cancerous transformation process identified a “proliferation cluster” of predominantly cell-cycle associated genes, whose expression levels increased as a function of the inactivation of the p53 and p16INK4A tumor suppressors. By linking expression profiles of the genes in the cluster with the dynamic behavior of p53 and p16INK4A, we identified a promoter architecture that integrates signals from the two tumor suppressive channels and maps their activity onto distinct levels of expression of the cell cycle genes, which, in turn, correspond to different cellular proliferation rates. Several of our predictions were tested and verified experimentally. Our study demonstrates how a well-controlled transformation process links gene expression, promoter architecture and activity of upstream signaling molecules (Fig. 2).

The second study (a collaboration with a group from Inst. Curie, Paris) aimed at finding an expression signature that is predictive of outcome for cervical cancer patients. Using exploratory analysis algorithms such as Coupled Two-Way Clustering (CTWC) and SPIN, we analyzed gene expression data from 40 patients and cell lines and identified a ‘cervical cancer proliferation cluster’ composed of 163 highly correlated transcripts. These genes had low expression in a subset of tumors with a favorable course; most corresponded to the E2F pathway and were also members of the proliferation cluster of the in-vitro experiment. In cervical cancer, however, control of proliferation is broken down by the oncogenic activities of E6 and E7, two viral proteins whose DNA is inserted (by HPV) into the host cervical cells, where they inactivate p53 and RB (Fig. 2). We found that E6/E7 mRNA expression levels were correlated both with the expression level of the proliferation cluster and with viral DNA copy number.

These findings suggest a causal relationship, from the HPV E6/E7 DNA copy number to protein level, to expression level of the proliferation cluster, to cell proliferation rate and to disease outcome. Hence E6/E7 DNA and expression levels may provide a biological marker for prognosis.

Predicting the metastatic potential of a cancer at the time of discovery is a major challenge in clinical research. Many studies have searched for gene expression signatures that outperform traditionally used clinical parameters in outcome prediction (especially in breast cancer). Finding such a signature would spare patients who do not need adjuvant chemotherapy, but receive it under current protocols, the toxicity associated with that treatment. Several groups published lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained for the same clinical types of patients by different groups differed widely and had only very few genes in common, raising doubts about the reliability and robustness of the reported gene lists. We showed that the main source of the problem is the small number of samples that were used to generate the gene lists. We introduced a novel mathematical method, PAC sorting, to evaluate the robustness of such predictive gene lists and calculated for several datasets the number of samples needed to achieve a desired level of reproducibility. For example, to get a typical overlap of 50% between two lists of 70 predictive genes, expression profiles of more than 2300 early-discovery breast cancer patients are needed. There are many other collaborative projects at WIS.
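The overlap quantity used above can be made concrete with a short sketch. It assumes each study ranks genes by the absolute value of some per-gene score (for instance a correlation with outcome) and keeps the top of the ranking. The gene names, scores and helper names are hypothetical, and this is only the overlap measure, not the PAC sorting method itself.

```python
def top_genes(scores, size):
    """Return the `size` genes with the largest absolute score."""
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:size]


def list_overlap(list_a, list_b):
    """Fraction of genes shared by two predictive gene lists of equal size."""
    if len(list_a) != len(list_b):
        raise ValueError("gene lists must have the same length")
    return len(set(list_a) & set(list_b)) / len(list_a)


# Hypothetical per-gene scores obtained from two independent patient cohorts.
scores_cohort_a = {"g1": 0.9, "g2": -0.8, "g3": 0.1, "g4": 0.05}
scores_cohort_b = {"g1": 0.7, "g2": 0.1, "g3": -0.6, "g4": 0.0}

overlap = list_overlap(top_genes(scores_cohort_a, 2),
                       top_genes(scores_cohort_b, 2))
# The two top-2 lists share one gene (g1), so the overlap is 0.5.
```

In the setting described in the text, the lists would hold 70 genes each and the question is how many samples per cohort are needed before the typical overlap reaches a chosen level such as 50%.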

Selected publications

Tsafrir, D., Tsafrir, I., Ein-Dor, L., Zuk, O., Notterman, D.A., and Domany, E. (2005). Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics 21, 2301-2308.

Tabach, Y., Milyavsky, M., Shats, I., Brosh, R., Zuk, O., Yitzhaky, A., Mantovani, R., Domany, E., Rotter, V., and Pilpel, Y. (2005). Genome-wide transcription regulatory circuits controlling cellular malignant transformation. Mol. Sys. Biology.

Rosty, C., Sheffer, M., Tsafrir, D., Stransky, N., Tsafrir, I., Peter, M., de Cremoux, P., de La Rochefordiere, A., Salmon, R., Dorval, T., Thiery, J.P., Couturier, J., Radvanyi, F., Domany, E., and Sastre-Garau, X. (2005). Identification of a proliferation gene cluster associated with HPV E6/E7 expression level and viral DNA load in invasive cervical carcinoma. Oncogene 24, 7094-7104.

Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. (2005). Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21, 171-178.

Tsafrir, D., Bacolod, M., Selvanayagam, Z., Tsafrir, I., Shia, J., Zeng, Z., Liu, H., Krier, C., Stengel, R.F., Barany, F., Gerald, W.L., Paty, P.B., Domany, E., and Notterman, D.A. (2006). Relationship of gene expression and chromosomal abnormalities in colorectal cancer. Cancer Res 66, 2129-2137.

Ein-Dor, L., Zuk, O., and Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for prediction of outcome in cancer. PNAS 103, 5923-5928.

Acknowledgements

Eytan Domany is the incumbent of the Henry J Leir Professorial Chair. We acknowledge support from the NIH, grant #5 P01 CA 65930-06; The Ridgefield Foundation; The Israel Academy of Sciences, Bikura Program; The Minerva Foundation; EC FP6 funding; EC RTN (STIPCO); IMOS Tashtiyot; The Wolfson Foundation; Yeda CEO Fund; Mrs Ruth Ziegler Fund for Stem Cell Research.

Fig. 2 Three-way linkage between expression profiles, promoter architecture and tumor suppressor pathways. A. Expression matrix of the proliferation cluster, sorted with SPIN. B. Correlation coefficient between the expression profile of each gene in the cluster and the expression of p21, p16 and their sum. C. The appearance of motifs of the five TFs in the network for each gene in the expression matrix. D. Bars indicate high densities of various regulatory motifs along the sorted genes. E. The proliferation cluster regulatory network. The manner in which the viral proteins E6 and E7 affect the network in cervical cancer, and the events that cause loss of control over proliferation in the in-vitro transformation process, are marked in red.