
Liquid association for large scale gene expression and network studies

Ker-Chau Li

Institute of Statistical Science

Academia Sinica

(presentation at the Isaac Newton Institute for Mathematical Sciences, Workshop under the Program of Statistical Theory and Methods for Complex, High-Dimensional Data, June 23-27, 2008)

Abstract: The fast-growing public repertoire of microarray gene expression databases provides individual investigators with unprecedented opportunities to study transcriptional activities for genes of their research interest at no additional cost. Methods such as hierarchical clustering, principal component analysis, and gene network analysis have been widely used. They offer biologists valuable genome-wide portraits of how genes are co-regulated in groups. Such approaches have a limitation: the majority of genes often do not fall into the detected gene clusters. If one has a gene of primary interest in mind and cannot find any nearby clusters, what additional analysis can be conducted? In this talk, I will show how to address this issue via the statistical notion of liquid association. An online biodata mining system has been developed in my lab to help biologists distil information from a web of aggregated genomic knowledge bases and data sources at multiple levels, including gene ontology, protein complexes, genetic markers, and drug sensitivity. The computational issue of liquid association and the challenges faced in the context of high p, low n problems will be addressed.

Change, change, and change

Calculus is a subject about ‘change’

Life Science is about ‘change’

My entire talk is about ‘change’

Intuition of SIR

Input data variables

X1 crime rate

X2 room size

X3 family income

…

…

Xp air quality

p : the total number of input variables

Regression Models:

• Parametric

Multiple linear reg

Nonlinear reg

Wavelet

•Nonparametric

Spline smoothing

kernel smoothing

•Semiparametric

Cox regression for survival analysis

Output data variable

Y house price

=f(x1,x2 …, xp,error)

Principal Component Analysis (PCA)

Critical issue: danger of information loss

Information relevant to Y may not be contained in the reduced variables because Y is not used in the dimension reduction process

Dogma of regression teaches how output Y CHANGES in response to CHANGES in input X

Fails when dimension is high

Dimension reduction on X

A reversal of the regression paradigm

input data variables

X1 X2 X3 … Xp

p : the total number of input variables

output data variable

Y

= f(b1’X,b2’X …, bk’X ,error)

conduct dimension reduction on

E(X1|Y), …., E(Xp|Y) to

k projection variables, k <<p

A fundamental theory for resolving information loss!

A theorem in Li (1991, JASA) shows that

(i) dimension reduction after inverse regression recovers the effective projection variables b1’X, b2’X, …, bk’X;

(ii) there is no need to specify the nonlinear function f.

sliced means E(X1|Y), E(X2|Y), …, E(Xp|Y)

Instead of asking how Y changes in response to changes in X,

Ask how X CHANGES as Y CHANGES

Regression on k projection variables
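The reversed paradigm above (slice on Y, average X within slices, then reduce dimension on the sliced means) can be sketched as follows. This is a minimal illustration on simulated data, not the implementation used in the talk; the function name `sir_directions`, the number of slices, and the exponential link are assumptions chosen for the example.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, k=1):
    """Sketch of sliced inverse regression: estimate k projection directions."""
    n, p = X.shape
    # Standardize X so that its sample covariance becomes the identity.
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    # Slice on y and average Z within each slice: estimates of E(Z | Y).
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Principal component analysis of the sliced means.
    _, vecs = np.linalg.eigh(M)
    top = vecs[:, ::-1][:, :k]          # leading eigenvectors
    B = inv_sqrt @ top                  # map back to the original X scale
    return B / np.linalg.norm(B, axis=0)

# Simulated check: Y depends on X only through b'X, via a link unknown to SIR.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.exp(X @ b) + 0.1 * rng.standard_normal(2000)
b_hat = sir_directions(X, y, n_slices=10, k=1)[:, 0]
alignment = abs(b_hat @ b)              # close to 1 when the direction is recovered
```

Note that the function f never enters the estimation step, which is point (ii) of the theorem.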

Regression is about change

Sir Francis Galton (1822 - 1911), half-cousin of Charles Darwin, was an English Victorian polymath, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician.

He was knighted in 1909. Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.


Bivariate normal

Regression slope equals correlation after variable standardization
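A quick numerical illustration of this identity, on simulated data that are not from the talk (the coefficients 0.6 and 0.8 are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = 0.6 * x + 0.8 * rng.standard_normal(500)

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()

xs, ys = standardize(x), standardize(y)
slope = np.polyfit(xs, ys, 1)[0]   # least-squares slope of ys on xs
r = np.corrcoef(x, y)[0, 1]        # Pearson correlation of the raw variables
```

After standardization the least-squares slope and the correlation coefficient agree up to floating-point error.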

Correlation Coefficient has been used by Gauss, Bravais, Edgeworth …

Its sweeping impact in data analysis is due to Galton (1822-1911), “Typical laws of heredity in man”.

Karl Pearson modified and popularized its use.

A building block in multivariate analysis, of which clustering, classification, dimension reduction are recurrent themes

Correlation and Changes Inside Correlation

Liquid association is about the change of correlation pattern inside the scatter diagram of a pair of variables

Liquid association (LA): a new bioinformatics tool for exploring gene expression data and much beyond

[Fig2-Top: scatter plot of gene X (low - to high +) against gene Y (low - to high +); points are marked as state 1, transit, and state 2, with linear fits for state 1 and state 2. Fig2-Bottom: the condition axis, gene Z from low (-) to high (+).]

How does LA work? Instead of two genes, three genes are considered at a time. LA measures the CHANGE in correlation between two genes X, Y as mediated by a third gene Z.

Basis for clustering genes in microarray: two genes X, Y are likely to be functionally associated if sharing similar expression profiles; measured by correlation coefficient

[Fig1-bottom: scatter plot of gene X (low - to high +) against gene Y (low - to high +).]

The converse statement is not true: many functionally associated genes are often uncorrelated in expression owing to complexity of gene regulation such as multiple functional roles for most genes, role-changing as hidden cellular state variables vary, etc.

Pearson correlation(X,Y)

LA

Li (2002, PNAS) introduced a novel statistical notion termed ‘liquid association’. ‘Liquid’, as opposed to ‘solid’, is a metaphor for ‘change’.

[Pathway chart: urea cycle and arginine biosynthesis. Metabolites shown: glutamate, glutamine, N-acetyl-glutamate, carbamoyl phosphate, ornithine, citrulline, L-arginino-succinate, arginine, urea, fumarate, aspartate, proline, L-glutamate-5-semialdehyde. Genes shown: ARG1, ARG2, ARG3, ARG4, CAR1, CAR2, CPA1, CPA2.]

Figure 2. The four enzymes of the urea cycle are encoded by ARG3, ARG1, ARG4, and CAR1 in S. cerevisiae. ARG2 encodes acetyl-glutamate synthase, which catalyzes the first step of ornithine biosynthesis. CPA1 and CPA2 encode the small and large subunits of carbamoylphosphate synthetase. CAR2 encodes ornithine aminotransferase. This chart is adapted from KEGG.

LA helps elucidate gene regulation in metabolic pathways (Li 2002)

How to alleviate the computational burden of a genome-wide scan over all triplets of genes (N choose 3, on the order of N^3)?

N = 6,000 (yeast): about 36 billion triplets

N = 50,000 (human): about 20 trillion triplets
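The triplet counts can be checked directly as binomial coefficients. The sketch below uses only the genome sizes quoted in the talk (6,000, 50,000, and the 5,878 yeast ORFs mentioned in the challenge slide); the variable names are illustrative.

```python
from math import comb

triplets_yeast = comb(6000, 3)    # 35,982,002,000, about 36 billion
triplets_human = comb(50000, 3)   # 20,832,083,350,000, about 20.8 trillion
triplets_orfs = comb(5878, 3)     # 33,831,075,876, about 33.8 billion
```

At this scale, any per-triplet computation has to be very cheap, which is what motivates the closed-form LA score derived later in the talk.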

An enabling algorithm is derived from an elegant theorem that offers a simple formula for measuring LA under the setting of continuous cellular state changes. “Genome-wide co-expression dynamics: theory and application”, K.C. Li (2002, PNAS )

LA helps find disease candidate genes (Li et al. 2007)

On-line LA system developed for aiding integrative cancer biology study

• Biomarkers and disease candidate genes finding

• Gene/drug correlation

• eQTL

• Gene signature for clinical survival prediction

• MicroRNA expression

• Array CGH DNA copy number

http://kiefer.stat2.sinica.edu.tw/LAP3

Examples of LA application

gene-expression data

          cond1   cond2   ……   condp

gene 1    x11     x12     ……   x1p

gene 2    x21     x22     ……   x2p

…         …       …

gene n    …

Why does clustering make sense biologically?

The rationale is:

Genes with a high degree of expression similarity are likely to be functionally related. They

• may form a structural complex,

• may participate in common pathways,

• may be co-regulated by common upstream regulatory elements.

Simply put, profile similarity implies functional association.

Mitochondrial ATP synthase and E. coli ATP (adenosine triphosphate) synthase

These images depicting models of ATP synthase subunit structure were provided by John Walker. Some equivalent subunits from different organisms have different names.

Protein rarely works as a single unit: homo-dimer, hetero-dimer, protein complex

[Image label: mitochondrial ATP synthase]

(scatterplot-matrix (select (transpose normalized-data) (list 4180 122 1439 5692 3833 1977 370)) :variable-labels (list "mcm1" "mcm2" "mcm3" "mcm4" "mcm5" "mcm6" "mcm7"))

Across the entire yeast genome, the three genes with the largest correlation with mcm2 are mcm6, mcm3, and mcm4, with correlation coefficients of .69, .64, and .63 respectively. Genes mcm7 and mcm5 are ranked 6th and 41st, with correlation coefficients of .61 and .49 respectively.
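A ranking of this kind can be sketched as one pass of correlations between a target gene and every other gene. The example runs on synthetic data, not the yeast compendium; the function name `rank_by_correlation` and the planted group of five co-expressed genes are assumptions made for illustration.

```python
import numpy as np

def rank_by_correlation(data, target):
    """data: genes x conditions. Returns (gene indices sorted by decreasing
    Pearson correlation with gene `target`, all correlations)."""
    centered = data - data.mean(axis=1, keepdims=True)
    scaled = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    corr = scaled @ scaled[target]      # correlation of every gene with the target
    order = np.argsort(-corr)
    return order[order != target], corr

# Synthetic example: genes 0..4 share a common expression profile.
rng = np.random.default_rng(2)
shared = rng.standard_normal(60)
data = rng.standard_normal((200, 60))
data[:5] += 2.0 * shared
order, corr = rank_by_correlation(data, target=0)
top4 = set(order[:4].tolist())          # expected to be the other planted genes
```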

Figure 1. Genes with similar expression profiles may share roles in common cellular processes. MCM1 encodes a transcription factor, while the rest of the MCM genes encode a hexameric protein complex which participates in DNA replication. This difference in cellular role is reflected well in the expression patterns. The last row (symmetrically, the first column) shows the distribution of MCM1 against the other six MCM genes. The correlation is nearly zero in each plot. In contrast, the association among MCM2, ..., MCM7 appears much tighter.

Example: scatterplot matrix of MCM1, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7

The tighter association among the six genes MCM2, ..., MCM7 is in sharp contrast to the association between each of them and MCM1.

It turns out that the gene products of MCM2, ..., MCM7 form a hexameric complex that binds chromatin. It is a part of the pre-replicative complex, an assembly of proteins that forms at origins of DNA replication between late M phase and the G1/S transition and includes other proteins believed to act in DNA replication initiation.

[Image label: chromosome replication]

MCM1 is a transcription factor that helps activate gene expression

However, the converse is not true

The expression profiles of the majority of functionally associated genes are indeed uncorrelated

• Microarray is too noisy

• Biology is complex

Why no correlation?

• Protein rarely works alone

• Protein has multiple functions

• Different biological processes or pathways have to be synchronized

• Competing use of finite resources: metabolites, hormones

• Protein modification: phosphorylation, proteolysis, shuttle, …

• Transcription factors serving both as activators and repressors

Transcription factors: proteins that bind to DNA

Activator: X = TF, Y = target gene; correlation is positive

Repressor: X = TF, Y = target gene; correlation is negative

Some transcription factors can act as both activators and repressors

Thyroid hormone receptors can be changed from repressors to activators, dependent on the absence/presence of thyroid hormone

X = THR, Y = target gene

Corr may cancel out if hormone level fluctuates
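The cancellation can be seen in a small simulation. This is a toy model, not real expression data: the variable names and the assumption that the transcription factor flips sign exactly at hormone level 0 are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
hormone = rng.standard_normal(n)                 # hidden mediator
tf = rng.standard_normal(n)                      # X: transcription factor level
sign = np.where(hormone > 0, 1.0, -1.0)          # activator if hormone present, else repressor
target = sign * tf + 0.5 * rng.standard_normal(n)  # Y: target gene level

overall = np.corrcoef(tf, target)[0, 1]
with_hormone = np.corrcoef(tf[hormone > 0], target[hormone > 0])[0, 1]
without_hormone = np.corrcoef(tf[hormone <= 0], target[hormone <= 0])[0, 1]
```

The overall correlation is near zero, while the correlation within each hormone state is strongly positive or strongly negative. This is the kind of pattern that a pairwise correlation scan misses.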

Going subtle: protein modification

Histone inhibits transcription

To activate transcription, the lysine side chain must be acetylated.

Weaver(2001)

Transcription factor can switch between activator and repressor, dependent on the abundance level of thyroid hormone.

Math. Modeling: a nightmare

[Diagram of cellular state transition. Labels: Current, Next; mRNA; protein kinase; nutrients (carbon, nitrogen sources), temperature, water; ATP, GTP, cAMP, etc.; localization (cytoplasm, nucleus, mitochondria, vacuolar); DNA methylation, chromatin structure; FITNESS; FUNCTION; Observed; hidden.]

Statistical methods become useful

What is LA? PLA?

Concept of “mediator”

[Fig2-Top and Fig2-Bottom, repeated: scatter plot of gene X against gene Y with state 1, transit, and state 2 points and linear fits for the two states; condition axis for gene Z from low (-) to high (+).]

Schematic illustration of LA

A Challenge

What genes behave like that? Can we identify all of them?

N = 5878 ORFs; N choose 3 = 33.8 billion triplets to inspect

Statistical theory for LA

X, Y, Z: random variables with mean 0 and variance 1

Corr(X,Y) = E(XY) = E(E(XY|Z)) = E g(Z), where g(z) = E(XY|Z=z)

g(z): an ideal summary of the association pattern between X and Y when Z = z

g’(z) = derivative of g(z)

Definition. The LA of X and Y with respect to Z is LA(X,Y|Z) = E g’(Z)

One way to go about estimating LA is to apply nonparametric regression for g’(z)

But this would probably take too much computing time and also raise regularization issues, such as whether to apply a common smoothing parameter to all curves, …

An idea popped up because of my early work on SURE and cross validation.

Applications of Stein’s Lemma

• Nonparametric regression with Stein estimates

• Connection of Stein’s unbiased risk estimate with generalized cross validation (Li 1984, Ann. Stat)

• Decision theory

Lemma 1: E h’(X) = h(1) - h(0), where X ~ uniform[0,1] and h is differentiable

Fundamental theorem of calculus

Sir Isaac Newton (1643-1727), Gottfried Leibniz (1646-1716)

[from Wikipedia]





Lemma (Stein’s Lemma, Charles Stein): E h’(X) = E X h(X), where X ~ Normal(0,1)

Proof (integration by parts):

• Start from the right side

• Write down the density of X

• Integration by parts
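The three proof steps can be written out as follows. This is a sketch that assumes h is differentiable and that h(x)φ(x) vanishes as |x| grows, a regularity condition not stated on the slide; φ denotes the standard normal density.

```latex
\begin{align*}
E\,[X h(X)]
  &= \int_{-\infty}^{\infty} x\, h(x)\, \varphi(x)\, dx,
     \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2},
     \quad \varphi'(x) = -x\, \varphi(x) \\
  &= -\int_{-\infty}^{\infty} h(x)\, \varphi'(x)\, dx \\
  &= -\Big[\, h(x)\, \varphi(x) \Big]_{-\infty}^{\infty}
     + \int_{-\infty}^{\infty} h'(x)\, \varphi(x)\, dx \\
  &= E\,[h'(X)].
\end{align*}
```

The boundary term drops out under the stated assumption, leaving the identity E h’(X) = E X h(X).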




Statistical theory-LA

Theorem. If Z is standard normal, then LA(X,Y|Z) = E(XYZ)

Proof. By Stein’s Lemma: E g’(Z) = E g(Z)Z = E(E(XY|Z)Z) = E(XYZ)

Additional math. properties:

• bounded by third moment

• = 0, if jointly normal

• transformation

Normality ?

Convert each gene expression profile by taking normal score transformation

LA(X,Y|Z) = average of triplet product of three gene profiles:

(x1y1z1 + x2y2z2 + …. ) / n
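The two steps above (normal score transformation, then the average triple product) can be sketched as follows. This is an illustrative sketch on simulated profiles, not the code behind the online LA system; the rank-to-quantile convention rank/(n+1) and the final rescaling to unit variance are assumptions of this example.

```python
import numpy as np
from statistics import NormalDist

def normal_score(v):
    """Map values to standard normal quantiles of their ranks (ties broken by order)."""
    v = np.asarray(v, dtype=float)
    ranks = np.argsort(np.argsort(v)) + 1        # ranks 1..n
    probs = ranks / (len(v) + 1)                 # strictly inside (0, 1)
    scores = np.array([NormalDist().inv_cdf(p) for p in probs])
    return scores / scores.std()                 # rescale to unit variance (assumption)

def liquid_association(x, y, z):
    """Estimate LA(X,Y|Z) as the average triple product of normal scores."""
    xs, ys, zs = normal_score(x), normal_score(y), normal_score(z)
    return float(np.mean(xs * ys * zs))

# Simulated profiles: the X-Y association increases with Z.
rng = np.random.default_rng(4)
n = 5000
z = rng.standard_normal(n)
x = rng.standard_normal(n)
y = 0.6 * z * x + 0.8 * rng.standard_normal(n)

la_xy_z = liquid_association(x, y, z)                       # clearly positive
la_null = liquid_association(x, rng.standard_normal(n), z)  # near zero
```

Each score costs one pass over the conditions, which is what makes a genome-wide scan feasible.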

How does LA work in yeast?

Urea cycle/arginine biosynthesis

Yeast Cell Cycle(adapted from Molecular Cell Biology, Darnell et al)

Most visible event

[Figure 2, repeated: the urea cycle and arginine biosynthesis pathway chart, adapted from KEGG, with two genes marked X and Y.]

Compute LA(X,Y|Z) for all Z

Rank and find leading genes
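For a fixed pair (X, Y), the scan over all candidate mediators Z reduces to a single matrix-vector product. The sketch below assumes each row of `data` has already been transformed and standardized; the function name `la_scan`, the synthetic data, and the planted mediator at row 7 are hypothetical.

```python
import numpy as np

def la_scan(data, i, j):
    """data: genes x conditions, rows standardized. Returns the LA score of
    genes i and j with respect to every gene Z (NaN for Z equal to i or j)."""
    n_cond = data.shape[1]
    scores = data @ (data[i] * data[j]) / n_cond   # mean of x*y*z for every Z at once
    scores[[i, j]] = np.nan                        # Z must differ from X and Y
    return scores

# Synthetic example: gene 7 mediates a negative LA between genes 0 and 1.
rng = np.random.default_rng(5)
data = rng.standard_normal((200, 1000))
data[1] = -0.6 * data[7] * data[0] + 0.8 * rng.standard_normal(1000)
data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

scores = la_scan(data, i=0, j=1)
negative_leader = int(np.nanargmin(scores))        # most negative LA score
positive_order = np.argsort(-np.nan_to_num(scores, nan=-np.inf))  # positive end first
```

Sorting the scores gives the leading genes at the positive and negative ends, as in the rankings quoted in the talk.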

8th place negative

Why negative LA?

High CPA2: signal for arginine demand. Up-regulation of ARG2 concomitant with down-regulation of CAR2 prevents ornithine from leaving the urea cycle.

When the demand is relieved, CPA2 is lowered and CAR2 is up-regulated, opening up the channel for ornithine to leave the urea cycle.

[Scatter plot: ARG2 (low to high) against CAR2 (low to high), both axes ranging from -2 to 2; points are grouped into low, median, and high CPA2, with linear fits for the low CPA2 and high CPA2 groups.]

Other examples (see Li 2002)

• X=GLN3 (transcription factor), Y=CAR1, Z=ARG4 (8th place, negative end)

• Electron transport: X=CYT1 (cytochrome c1), gives ATP1 (11 times), ATP5 (subunits of ATPase)

• Calmodulin: CMD1, NUF1 (binding target of CMD1), CMK1 (calmodulin-regulated kinase), YGL149W

• Glycolysis genes PFK1, PFK2 (6-phospho-fructokinase)

• CYR1 (adenylate cyclase), GSY1 (glycogen synthase), GLC2 (glucan branching), SCH9 (serine/threonine protein kinase; longevity)

Liquid association: a method for exploiting lack of correlation between variables

LA related References

Li, K.C. (2002). Genome-wide co-expression dynamics: theory and application. Proceedings of the National Academy of Sciences 99, 16875-16880.

Li, K.C., and Yuan, S. (2004). A functional genomic study on NCI's anticancer drug screen. The Pharmacogenomics Journal 4, 127-135.

Li, K.C., Liu, C.-T., Sun, W., Yuan, S., and Yu, T. (2004). A system for enhancing genome-wide co-expression dynamics study. Proceedings of the National Academy of Sciences 101, 15561-15566.

Yu, T., Sun, W., Yuan, S., and Li, K.C. (2005). Study of coordinative gene expression at the biological process level. Bioinformatics 21, 3651-3657.

Yu, T., and Li, K.C. (2005). Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics 21, 4033-4038.

Sun, W., Yu, T., and Li, K.C. (2007). Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics; doi: 10.1093/bioinformatics/btm327 (corresponding author: Li).

Yuan, S., and Li, K.C. (2007). Context-dependent clustering for dynamic cellular state modeling of microarray gene expression. Bioinformatics; doi: 10.1093/bioinformatics/btm457 (corresponding author: Li).

Li, K.C., Palotie, A., Yuan, S., Bronnikov, D., Chen, D., Wei, X., Choi, O., Saarela, J., and Peltonen, L. (2007). Finding candidate disease genes by liquid association. Genome Biology (in press).

The human examples

Gene expression profile for NCI’s 60 cell lines

For each cell line, the relative mRNA concentrations are measured by cDNA glass array.

Cell lines used in microarray experiment are without drug administration.

Ross D.T. et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24, 227-235 (2000)

NCI 60 Cell lines

OVARIAN (6): IGROV1, OVCAR-3, OVCAR-4, OVCAR-5, OVCAR-8, SK-OV-3

PROSTATE (2): DU-145, PC-3

LEUKEMIA (6): CCRF-CEM, HL-60, K-562, MOLT-4, RPMI-8226, SR

MELANOMA (8): LOXIMVI, M14, MALME-3M, SK-MEL-2, SK-MEL-28, SK-MEL-5, UACC-257, UACC-62

BREAST (8): BT-549, HS578T, MCF7, MCF7/ADR-RES, MDA-MB-231/ATCC, MDA-MB-435, MDA-N, T-47D

LUNG (9): A549/ATCC, EKVX, HOP-62, HOP-92, NCI-H226, NCI-H23, NCI-H322M, NCI-H460, NCI-H522

CNS (6): SF-268, SF-295, SF-539, SNB-19, SNB-75, U251

COLON (7): COLO205, HCC-2998, HCT-116, HCT-15, HT29, KM12, SW-620

RENAL (8): 786-0, A498, ACHN, CAKI-1, RXF-393, SN12C, TK-10, UO-31

How does LA work in cell-lines?

Alzheimer’s Disease hallmark gene

Amyloid-beta precursor protein (APP)

The brain tissue shows "neurofibrillary tangles" (twisted fragments of protein within nerve cells that clog up the cell), "neuritic plaques" (abnormal clusters of dead and dying nerve cells, other brain cells, and protein), and "senile plaques" (areas where products of dying nerve cells have accumulated around protein). Although these changes occur to some extent in all brains with age, there are many more of them in the brains of people with AD. The destruction of nerve cells (neurons) leads to a decrease in neurotransmitters (substances secreted by a neuron to send a message to another neuron). The correct balance of neurotransmitters is critical to the brain.

Amyloid beta peptide is the predominant component of senile plaques in brains of AD patients.

It is derived from amyloid-beta precursor protein (APP) by consecutive proteolytic cleavage by beta-secretase and gamma-secretase.

What is the physiological role of APP?

Cao X, Sudhof TC. A transcriptionally active complex of APP with Fe65 and histone acetyltransferase Tip60. Science. 2001 Jul 6;293(5527):115-20.

Abstract of Cao and Sudhof: Amyloid-beta precursor protein (APP), a widely expressed cell-surface protein, is cleaved in the transmembrane region by gamma-secretase. gamma-Cleavage of APP produces the extracellular amyloid beta-peptide of Alzheimer's disease and releases an intracellular tail fragment of unknown physiological function. We now demonstrate that the cytoplasmic tail of APP forms a multimeric complex with the nuclear adaptor protein Fe65 and the histone acetyltransferase Tip60. This complex potently stimulates transcription via heterologous Gal4- or LexA-DNA binding domains, suggesting that release of the cytoplasmic tail of APP by gamma-cleavage may function in gene expression.

Take X=APP, Y=APBB1

• APBB1 encodes FE65

• Find BACE2 from our short list of LA score leaders. BACE2 encodes a beta-site APP-cleaving enzyme

Take X=APP, Y=HTATIP

• HTATIP encodes Tip60

• Finds PSEN1 (second place positive LA score leader), which encodes presenilin 1, a major component of gamma-secretase

Finding disease candidate genes by liquid association

Ker-Chau Li, Aarno Palotie, Shinsheng Yuan, Denis Bronnikov, Daniel Chen, Xuelian Wei, Oi-Wa Choi, Janna Saarela and Leena Peltonen

Multiple sclerosis

MS is a chronic neurological disease characterized by multicentric inflammation, demyelination and axonal damage, resulting in heterogeneous clinical features, including pareses, sensory symptoms and ataxia. The classical clinical features include disturbances in sensation and mobility. The typical age of onset is between 20 and 40 years, making MS one of the most common neurological diseases of young adults. Four genome-wide scans (US, UK, Canada, and Finland) have revealed several putative susceptibility loci, of which the loci on chromosomes 6p, 5p, 17q and 19q have been replicated in multiple study samples. More recently, the teams of Professors Aarno Palotie and Leena Peltonen have generated a fine map on 17q22-q24 (Saarela et al 2002). They are now interested in the functional aspect of the genes in this region using microarray technology.

Application: finding candidate genes for Multiple sclerosis

glutamate-induced excitotoxicity

SLC1A3 is highly expressed in various brain regions including cerebellum, frontal cortex, basal ganglia and hippocampus. It encodes a sodium-dependent glutamate/aspartate transporter 1 (GLAST). Glutamate and aspartate are excitatory neurotransmitters that have been implicated in a number of pathologic states of the nervous system. Glutamate concentration in cerebrospinal fluid rises in acute MS patients, whilst the glutamate antagonist amantadine reduces the MS relapse rate. In EAE, the levels of GLAST and GLT-1 (SLC1A2) are found to be down-regulated in spinal cord at the peak of disease symptoms, and no recovery was observed after remission. We consider it highly encouraging that several lines of evidence, including both genetic association and gene expression association, are consistent with the glutamate-induced excitotoxicity hypothesis of the mechanisms resulting in demyelination and axonal damage in MS.

Validation for the genetic relevance of SLC1A3 to MS. We set out to test whether SLC1A3 has any genetic relevance to MS. Before SLC1A3 was brought up by the LA method, our fine mapping effort focused on a more telomeric region (between 10.3 and 17.3 Mb) of 5p, which had provided the highest two-point lod scores in Finnish MS families (22). Guided by the LA findings, we further included five SNPs flanking the SLC1A3 gene (Table 2) to be genotyped in our primary study set consisting of 61 MS families from the high risk region of Finland. The most 5’ SNP, rs2562582, located within 2kb from the initiation of the SLC1A3 transcript, showed initial evidence for association to MS (p=0.005) in the TDT analysis, suggesting a possible functional role of this variant in the transcriptional regulation of this gene. Moreover, as shown in Table 2, stratification of the Finnish MS families according to the strongest associating SNP on the HLA region (23), rs2239802, strengthened the association between the SLC1A3 SNP and MS (p=0.0002, TDT). Thus, based on LA, and supported by association analyses in an MS study sample, the presence of SLC1A3 serves to connect all four major MS loci identified in Finnish families, elucidating a potential functional relationship between genetically identified genes and loci.

ARTICLES

Interleukin 7 receptor α chain (IL7R) shows allelic and functional association with multiple sclerosis

Simon G Gregory (1,9), Silke Schmidt (1,9), Puneet Seth (2), Jorge R Oksenberg (3), John Hart (1), Angela Prokop (1), Stacy J Caillier (3), Maria Ban (4), An Goris (5), Lisa F Barcellos (6), Robin Lincoln (3), Jacob L McCauley (7), Stephen J Sawcer (4), D A S Compston (4), Benedicte Dubois (5), Stephen L Hauser (3), Mariano A Garcia-Blanco (2), Margaret A Pericak-Vance (8) & Jonathan L Haines (7), for the Multiple Sclerosis Genetics Group

Multiple sclerosis is a demyelinating neurodegenerative disease with a strong genetic component. Previous genetic risk studies have failed to identify consistently linked regions or genes outside of the major histocompatibility complex on chromosome 6p. We describe allelic association of a polymorphism in the gene encoding the interleukin 7 receptor α chain (IL7R) as a significant risk factor for multiple sclerosis in four independent family-based or case-control data sets (overall P = 2.9 × 10^-7). Further, the likely causal SNP, rs6897932, located within the alternatively spliced exon 6 of IL7R, has a functional effect on gene expression. The SNP influences the amount of soluble and membrane-bound isoforms of the protein by putatively disrupting an exonic splicing silencer.

Multiple sclerosis is the prototypical human demyelinating disease, and numerous epidemiological, adoption and twin studies have provided evidence for a strong underlying genetic liability (ref. 1). The disease is most common in young adults, with more than 90% of affected individuals diagnosed before the age of 55, and fewer than 5% diagnosed before the age of 14. Females are two to three times more frequently affected than males (ref. 2), and the disease course can vary substantially, with some affected individuals suffering only minor disability several decades after their initial diagnosis, and others reaching wheelchair dependency shortly after disease onset. The complex etiology of the disease and the currently undefined molecular mechanisms of multiple sclerosis suggest that moderate contributions from multiple risk loci underlie the development and progression of the disease.

Many different approaches, including genetic linkage, candidate gene association and gene expression studies, have been used (summarized in ref. 2) to identify the genetic basis of multiple sclerosis. However, genetic linkage screens have failed to identify consistent regions of linkage outside of the major histocompatibility complex (MHC). Candidate gene studies have suggested over 100 different associated genes, but there has not been consensus acceptance of any such candidates. Similarly, gene expression studies have identified hundreds of differentially expressed transcripts, with little consistency across studies. The alternative approach of genomic convergence (ref. 3), which requires multiple sources and types of positive evidence to implicate a candidate gene, can be used to integrate both statistical and functional data.

Using genomic convergence (ref. 3), we identified 28 genes that were differentially expressed in at least two of nine previous expression studies (Supplementary Table 1 online). We focused on three genes (interleukin 7 receptor alpha chain (IL7R) [MIM: 146661], matrix metalloproteinase 19 (MMP19) [MIM: 601807] and chemokine (C-C motif) ligand 2 (CCL2) [MIM: 158105]) that had a previously published or inferred functional role in multiple sclerosis and that were not located within the MHC. Two of the three genes localize to previous regions of genetic linkage on 17q12 (CCL2) (ref. 4) and 5p13.2 (IL7R) (ref. 5). We analyzed a large data set of 760 US families of European descent, including 1,055 individuals with multiple sclerosis, and identified a significant association with multiple sclerosis susceptibility only for a nonsynonymous coding SNP (rs6897932) within a key transmembrane domain of IL7R. We subsequently replicated this initial significant association in three independent European populations or populations of European descent: 438 individuals with multiple sclerosis and 479 unrelated controls ascertained in the United States, 1,338 individuals with multiple sclerosis and their parents ascertained in northern Europe (the UK and Belgium) and 1,077 individuals with multiple sclerosis and 2,725 unrelated controls also ascertained in northern Europe. We show that rs6897932 affects …

© 2007 Nature Publishing Group http://www.nature.com/naturegenetics

(1) Center for Human Genetics, and (2) Center for RNA Biology and Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA.

(3) Department of Neurology, University of California, San Francisco, California 94143, USA.

(4) Department of Clinical Neurosciences, University of Cambridge, Addenbrooke’s Hospital, Cambridge CB2 2QQ, UK.

(5) Section for Experimental Neurology, Katholieke Universiteit Leuven, 3000 Leuven, Belgium.

(6) School of Public Health, University of California, Berkeley, California 94720, USA.

(7) Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA.

(8) Miami Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida 33136, USA.

(9) These two authors contributed equally to this work.

Correspondence should be addressed to J.L.H. or M.A.P.-V.

Received 22 February; accepted 18 June; published online 29 July 2007; doi:10.1038/ng2103

LETTERS

Variation in interleukin 7 receptor a chain (IL7R) influences risk of multiple sclerosis Frida Lundmark

1

, Kristina Duvefelt2

, Ellen Iacobaeus3

, Ingrid Kockum1,3

, Erik Wallstro¨m3

, Mohsen Khademi

3

, An nette Oturai4

, Lars P Ryder5

, Janna Saarela6

, Hanne F Harbo7,8

, Elisabeth G Celius8

, Hugh Salter9

, Tomas Olsson

3

& Jan Hillert1

Multipl e sclerosis is a chronic, often di sabli ng , di sease of the central nervous system affecting more than 1 in 1,000 peop le in most west ern coun tries. The inflammatory les ions typical of multip le sclerosi s show autoimmune features and depend part ly on genetic factors. Of these genetic factors, only the HLA gene complex has been repeatedly confirmed to be associa ted with multip le sclerosi s, despi te cons iderable efforts. Polymorphisms in a number of non-HLA genes have been reported to be assoc iated with multiple sclerosis, but so far confirmation has been difficult . Here, we report compelli ng evidence that pol ymorphisms in IL7R, whic h encodes the int erleukin 7 receptor a chain (IL7Ra), indeed contribute to the non-HLA genetic risk in multip le sclerosis, demonstrating a role for this path way in the patho physiolog y of this di sease. In additio n, we report altered express ion of the genes encodi ng IL7Ra and its ligand, IL7, in the cerebrospinal fluid compartment of individ uals with multip le sclerosis.

IL7Ra (also known as CD12 7), encoded by IL7R,is a member of the type I cytoki ne receptor family and forms a receptor complex with the common cytokine receptor gamma chain (CD13 2) in whic h IL7 is the ligand

1

. The IL-7–IL7Ra ligand -receptor pair is crucial for proliferation and survival of T and B lymphocytes in a nonredund ant fashion, as shown in human and animal models, in whic h genetic aberrations lead to immune deficiency syndromes

2

. IL7R is located on chromosome 5p13, a regio n occasionally suggest ed to be linked with multip le sclerosis

3

.

We have cons idered IL7R a promising candidat e gene in multipl e sclerosis, and we have recent ly reported genetic associat ion s with three IL7R SNP markers in up to 67 2 Swedish individ uals with multipl e sclerosis and 672 control s, as well as two assoc iated haplotypes spann ing these markers

4

. To confirm th ese associations, we assessed an independent case-control group consisting of 1,820

individuals with multiple scle rosis and 2,634 heal thy controls from the Nordic count ries

(Denmark, Finl and, Norway and Sweden), independen t from the data set analyzed in ref. 4 (Supp lementary Table 1 online). Of these, 91% of the affected individ uals had experienced an init ially relapsing-remitt ing course of multip le sclerosis (RRMS), whereas 9% had a primary progress ive course (PPMS). In additio n, we analyzed the express ion of IL7R and IL7 in the peripheral bloo d as well as in cells from the cerebrospinal fluid (CSF).

We genotyped the three previously associated SNPs, located in intron 6 (rs987106 and rs987107) and exon 8 (rs3194051). rs987106 and rs3194051 were in high linkage disequilibrium (LD) (r² = 0.99, |D′| = 1.00), and rs987107 was in partial LD (r² = 0.29, |D′| = 0.99) with rs987106 and rs3194051, as they were located in the same haplotype block. The size of the study allowed full power (100%) to detect an odds ratio (OR) of 1.3. All SNPs were genotyped using the Sequenom hME assays. The observed control genotypes conformed to Hardy-Weinberg equilibrium. All three SNPs confirmed significant association with multiple sclerosis in this nonoverlapping case-control group, with very similar ORs as in the previous study (rs987107, P = 0.002; rs987106, P = 0.001; rs3194051, P = 0.002). A test for heterogeneity between the data sets from Norway, Denmark, Finland and Sweden showed no evidence of stratification, thus permitting a combined analysis (Mantel-Haenszel–corrected and crude ORs are shown in Table 1). We estimated the three-marker haplotype frequencies using the EM algorithm in Haploview (ref. 5). The estimated distribution of haplotypes differed significantly between affected individuals and controls (P = 0.001), with two haplotypes associating with multiple sclerosis, one conferring an increased risk of disease (P = 0.0004) and the other conferring a decreased risk of disease (P = 0.003) (Supplementary Table 2 online), in accordance with previous results (ref. 4). According to data from the HapMap CEU population, IL7R is located within a tight LD block containing no additional genes. To

1 Division of Neurology, Department of Clinical Neuroscience, Karolinska Institutet at Karolinska University Hospital–Huddinge, SE-141 86 Stockholm, Sweden.

2 Clinical Research Centre, Mutation Analysis Facility, Karolinska University Hospital, SE-141 86 Huddinge, Sweden.

3 Neuroimmunology Unit, Department of Clinical Neuroscience, Karolinska Institutet at Karolinska University Hospital–Solna, SE-171 76 Stockholm, Sweden.

4 Danish Multiple Sclerosis Research Centre and 5 Department of Clinical Immunology, Rigshospitalet, Copenhagen University Hospital, DK-2100 Copenhagen, Denmark.

6 Department of Molecular Medicine, National Public Health Institute, FI-00290 Helsinki, Finland.

7 Institute of Immunology, University of Oslo, N-0027 Oslo, Norway.

8 Department of Neurology, Ullevål University Hospital, N-0407 Oslo, Norway.

9 Pharmacogenomics Section, Department of Disease Biology, AstraZeneca R&D, SE-151 85 Södertälje, Sweden.

Correspondence should be addressed to J.H. ([email protected]).

Received 22 February; accepted 20 June; published online 29 July 2007; doi:10.1038/ng2106

Alleles of IL2RA and IL7RA and those in the HLA locus are identified as heritable risk factors for multiple sclerosis.

Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study

The International Multiple Sclerosis Genetics Consortium*

Abstract

Background

Multiple sclerosis has a clinically significant heritable component. We conducted a genomewide association study to identify alleles associated with the risk of multiple sclerosis.

Methods

We used DNA microarray technology to identify common DNA sequence variants in 931 family trios (consisting of an affected child and both parents) and tested them for association. For replication, we genotyped another 609 family trios, 2322 case subjects, and 789 control subjects and used genotyping data from two external control data sets. A joint analysis of data from 12,360 subjects was performed to estimate the overall significance and effect size of associations between alleles and the risk of multiple sclerosis.

Results

A transmission disequilibrium test of 334,923 single-nucleotide polymorphisms (SNPs) in 931 family trios revealed 49 SNPs having an association with multiple sclerosis (P<1×10⁻⁴); of these SNPs, 38 were selected for the second-stage analysis. A comparison between the 931 case subjects from the family trios and 2431 control subjects identified an additional nonoverlapping 32 SNPs (P<0.001). An additional 40 SNPs with less stringent P values (<0.01) were also selected, for a total of 110 SNPs for the second-stage analysis. Of these SNPs, two within the interleukin-2 receptor α gene (IL2RA) were strongly associated with multiple sclerosis (P=2.96×10⁻⁸), as were a nonsynonymous SNP in the interleukin-7 receptor α gene (IL7RA) (P=2.94×10⁻⁷) and multiple SNPs in the HLA-DRA locus (P=8.94×10⁻⁸¹).

Conclusions

The writing group (David A. Hafler, M.D., Alastair Compston, F.Med.Sci., Ph.D., Stephen Sawcer, M.B., Ch.B., Ph.D., Eric S. Lander, Ph.D., Mark J. Daly, Ph.D., Philip L. De Jager, M.D., Ph.D., Paul I.W. de Bakker, Ph.D., Stacey B. Gabriel, Ph.D., Daniel B. Mirel, Ph.D., Adrian J. Ivinson, Ph.D., Margaret A. Pericak-Vance, Ph.D., Simon G. Gregory, Ph.D., John D. Rioux, Ph.D., Jacob L. McCauley, Ph.D., Jonathan L. Haines, Ph.D., Lisa F. Barcellos, Ph.D., Bruce Cree, M.D., Ph.D., Jorge R. Oksenberg, Ph.D., and Stephen L. Hauser, M.D.) assume responsibility for the overall content and integrity of the article. *The affiliations of the writing group and other members of the International Multiple Sclerosis Genetics Consortium are listed in the Appendix.

This article (10.1056/NEJMoa073493) was published at www.nejm.org on July 29, 2007.

N Engl J Med 2007;357. Copyright © 2007 Massachusetts Medical Society.


International MS whole genome association study (2007).

The Affymetrix 500K chip was used to screen common genetic variants in 931 family trios. Using the on-line supplementary information provided, we found that two SNPs, rs4869676 (chr5:36641766) and rs4869675 (chr5:36636676), with TDT p-values 0.0221 and 0.00399 respectively, are in the upstream regulatory region of the SLC1A3 gene.

In fact, within the 1 Mb region around rs4869675, there are a total of 206 SNPs on the Affymetrix 500K chip. No other SNP has a p-value smaller than that of rs4869675.

The next most significant SNPs in this region are rs1343692 (chr5:35860930) and rs6897932 (chr5:35910332; the identified MS susceptibility SNP in the IL7R exon).

The MS marker we identified, rs2562582 (chr5:36641117), is less than 5 kb away from rs4869675, but was not on the Affymetrix chip.

A little bit late: IL7R was found long ago by LA! See the attached e-mail, sent more than two years ago, in 2005!

Begin forwarded message:

From: Ker Chau Li (local) <[email protected]>

Date: March 28, 2005 10:17:51 AM PST

To: Robert Yuan <[email protected]>, Aarno Palotie <[email protected]>, Daniel Chen <[email protected]>, Denis Bronnikov <[email protected]>, Palotie Leena <[email protected]>

Cc: Ker Chau Li (local) <[email protected]>

Subject: IL7R

(I thought this e-mail should have been sent out already; but it has not)

I take X=SLC1A3, Y=MBP, Z= any gene, using 2002 Atlas data. Two genes are from the short list of genes with highest LA scores.

IL7R interleukin 7 receptor and HLA-A

IL7R is at 5p13. Interesting coincidence??

other interesting findings include

GFAP glial fibrillary acidic protein on 17q21 (Alexander disease)

GRM3 (glutamate receptor, metabotropic)

CDR1 (cerebellar degeneration-related protein 1)

Ighg3 (immunoglobulin heavy constant gamma 3)

Iglj3 (immunoglobulin lambda joining 3)

A2M

The output of a short list of 25 gene pairs with the best LA scores each from the positive and the negative ends is given in Additional data file 1 (Table S1). The statistical significance of the results of this gene search procedure is discussed in Additional data file 2 (Supplementary Text 3). We find that the gene A2M (encoding α2-macroglobulin, a cytokine transporter and protease inhibitor) appears many times. We further find an interesting biologic functional association between A2M and MBP from some literature about the pathogenesis of MS. Following demyelination in human MS and rodent EAE, immunogenic MBP peptides are released into cerebrospinal fluid and serum (see Oksenberg and coworkers [2] for references) and A2M represents the major MBP-binding protein in human plasma [17]. A significant increase in α2-macroglobulin is found in plasma of MS patients [18]. Analogously, in rodent EAE, infusion of α2-macroglobulin significantly reduces disease symptoms [19].

Genome-wide LA scores, X = MBP

P-values by randomization

Each dot represents a case of simulated X: highest correlation vs. 20th highest LA score
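The randomization idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the plot: it assumes the usual LA estimator (every profile is normal-score transformed, and the LA score is the average of the triple product X·Y·Z), and it assumes that a "simulated X" is a random permutation of the observed X across conditions, which the slide does not specify. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(v):
    """Map a profile to standard-normal quantiles of its ranks, rescaled to unit variance."""
    v = np.asarray(v, dtype=float)
    ranks = np.argsort(np.argsort(v)) + 1          # ranks 1..n (ties broken by position)
    q = np.array([NormalDist().inv_cdf(r / (v.size + 1)) for r in ranks])
    return (q - q.mean()) / q.std()

def la_scores(xn, yn, Zn):
    """LA(X, Y | Z_j) for every row Z_j of Zn; all inputs are already normal-scored."""
    return Zn @ (xn * yn) / xn.size

def kth_highest_null(xn, yn, Zn, k=20, n_sim=100, seed=0):
    """k-th highest LA score for each simulated X (assumption: X permuted across conditions)."""
    rng = np.random.default_rng(seed)
    return np.array([np.sort(la_scores(rng.permutation(xn), yn, Zn))[-k]
                     for _ in range(n_sim)])

def randomization_pvalue(observed, null):
    """Share of simulated statistics at least as large as the observed one (add-one corrected)."""
    return (1 + np.sum(null >= observed)) / (1 + null.size)
```

In use, one would normal-score X, Y and every candidate Z once, take the observed 20th highest LA score from `la_scores`, and compare it with the values returned by `kth_highest_null`. The choice k = 20 mirrors the "20th highest LA" on the slide; k must not exceed the number of candidate genes.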

LAP website

Basic workflow is simple

User interface for browsing computation output

Acknowledgement

Mathematics in Biology (MIB), Institute of Statistical Science, Academia Sinica

Web-based Liquid association development team

Team leader: Dr. Shin-sheng Yuan

IT specialists: Guan-I Wu, Hung-Wei Tseng, Shi-Hsien Yang, Yi-Wei Chen, Chang-Dao Chen, Ying-Fu Ho

Arabidopsis Gene Expression Analysis: Dr. Ai-Ling Hour

Acknowledgment

Biodata refining group, UCLA Statistics

http://kiefer.stat.ucla.edu/lap

Shinsheng Yuan (chief architect for website development, gene-drug)

Wei Sun (yeast segregation)

Ching-Ti Liu (yeast protein complex)

Tianwei Yu (Stress, gene ontology)

Xuelian Wei (graphics, cancer, disease page)

Tun-Hsian Yang (disease page)

Yijing Shen (clustering)

Tongtong Wu (Stress)

Jack Li (graph)

Lung Cancer project

National Taiwan University Hospital

Pan-Chyr Yang

Sung-Liang Yu

Hsuan-Yu Chen

Causal analysis

X, Y, Z

X->Y, X->Z

Y = aX + b + error

Z = a’X + b’ + error’

Partial correlation = corr(error, error’)

X causes Y and Z if partial correlation = 0

(X = Coke sale, Y = eye disease incidence rate, Z = season)

Start with a pair of correlated genes Y, Z; find X to minimize partial correlation

This is very different from LA.

A limited goal: remove the trend

Universal trend (affects all genes) could be artificial (due to chip technology imperfection)

Localized trend (affects a limited number of highly expressed genes): likely to be biologically real

Partial correlation can be used to detrend

X = one gene, Y = one gene, Z = trend

X’ = residual after regressing X on Z

Y’ = residual after regressing Y on Z

Find correlation between X’ and Y’
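The detrending recipe on this slide can be written as a minimal sketch. It assumes an ordinary least-squares regression with an intercept and a single trend profile Z; the function names are hypothetical and not taken from the talk.

```python
import numpy as np

def residual(v, z):
    """Residual of v after least-squares regression on z (intercept included)."""
    v = np.asarray(v, dtype=float)
    z = np.asarray(z, dtype=float)
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef

def partial_corr(x, y, z):
    """Correlation between X' and Y', the residuals of X and Y after removing the trend Z."""
    x_res, y_res = residual(x, z), residual(y, z)
    return float(np.corrcoef(x_res, y_res)[0, 1])
```

If two genes are correlated only because both follow the trend Z, their raw correlation is clearly positive while `partial_corr` comes out close to zero; a correlation that survives detrending is not explained by Z alone.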

Maximizing the absolute value of partial corr, given a pair of variables.

Partial corr= cosine (angle between two planes, X,Z plane and Y,Z plane)

(1) Consider the prediction of Z from X, Y.

(2) Fixing the error variance, the optimal Z should have the highest possible correlation with (X+Y) (if X, Y are positively correlated) or with (X-Y) (otherwise).
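The geometric description of partial correlation can be written out explicitly. As a sketch, assume X, Y and Z are centered profiles, and write e_X and e_Y for the residuals of X and Y after regression on Z (this notation is introduced here and does not appear on the slides):

```latex
\[
r_{XY\cdot Z}
  = \frac{\langle e_X, e_Y\rangle}{\lVert e_X\rVert\,\lVert e_Y\rVert}
  = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1-r_{XZ}^{2})\,(1-r_{YZ}^{2})}},
\qquad
e_X = X - \frac{\langle X, Z\rangle}{\langle Z, Z\rangle}\,Z,
\quad
e_Y = Y - \frac{\langle Y, Z\rangle}{\langle Z, Z\rangle}\,Z .
\]
```

Since e_X and e_Y are the parts of X and Y perpendicular to Z, the angle between them is the angle between the (X, Z) plane and the (Y, Z) plane, measured around their common direction Z, and its cosine is the partial correlation.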
