The R genetics package: Tools for statistical genetics

Gregory R. Warnes
Associate Director
NonClinical Statistics
Pfizer Global R&D
Groton CT

Page 2 CT ASA Mini Conference: 2005-03-05

Outline

Project Goals
• Simplify Population Genetic Analysis

Design Details
• Extend R ‘Factor’ objects

Functions Included
• Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
• Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools

Simple Examples
• Creating Genotype Objects

Example Session

Future Development
• Emulate BioConductor Project
• Large scale SNP analysis
• Formal Object Class
• Multi-team collaboration

Problem

At each genetic position within a gene, diploid cells have two alleles.

This suggests storing each allele as a separate variable.

However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B). Consequently, the observed alleles are confounded.

This suggests the use of a single genotype variable.

This duality is not directly handled by standard statistical packages.

As a consequence, the need to handle both views creates complexity when manipulating or including genotype data in statistical analysis.
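The two views can be illustrated with a minimal base R sketch. It does not use the genetics package; the data and the variable names (`allele1`, `allele2`, `genotype_factor`) are hypothetical and chosen only for illustration.

```r
# Hypothetical allele calls for five individuals (illustrative data only).
allele1 <- c("A", "A", "B", "B", "A")
allele2 <- c("A", "B", "A", "B", "B")

# View 1: two separate allele variables, one row per individual.
two_column <- data.frame(allele1, allele2)

# View 2: a single genotype variable. Ordering each pair alphabetically
# makes B/A and A/B the same observed genotype.
first  <- ifelse(allele1 <= allele2, allele1, allele2)
second <- ifelse(allele1 <= allele2, allele2, allele1)
genotype_factor <- factor(paste(first, second, sep = "/"))

print(levels(genotype_factor))  # the three observed genotypes
print(table(genotype_factor))   # how often each one occurs
```

Keeping both objects in step by hand is the kind of bookkeeping that becomes tedious once many markers are involved.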

Initial Project Goals

Simplify Statistical Analysis using Genetic Data by providing:

• A genotype object class that appropriately captures the single variable / separate allele duality
• Methods to import and manipulate genotype objects without string manipulation
• Simple tools for including different ‘views’ of genotype variables in standard statistical models
  • Dominant (at least one copy of X)
  • Recessive (both alleles are X)
  • Additive (number of copies of X)
  • Heterozygote Effect (differing alleles)
  • Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
• Functions for computing and visualizing common genetic summaries and statistical tests
  • Allele Frequencies
  • Hardy-Weinberg Equilibrium
  • Linkage Disequilibrium
  • Other statistical methods
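As a rough illustration of what the different ‘views’ mean, the following base R sketch derives each coding from a pair of allele vectors. It is not the package's implementation; the data and all variable names are assumptions made for this example.

```r
# Illustrative allele pairs for eight individuals (NA = missing call).
a1 <- c("A", "A", "C", "A", NA, "A", "A", "A")
a2 <- c("A", "C", "C", "C", NA, "A", "C", "C")
target <- "A"  # the allele "X" the views refer to

additive  <- (a1 == target) + (a2 == target)  # number of copies of X
dominant  <- additive >= 1                    # at least one copy of X
recessive <- additive == 2                    # both alleles are X
hetero    <- a1 != a2                         # differing alleles

# Independent view: one factor level per allele combination.
independent <- factor(ifelse(is.na(a1) | is.na(a2), NA,
                             paste(a1, a2, sep = "/")))

views <- data.frame(additive, dominant, recessive, hetero, independent)
print(views)
```

Each column of `views` could then be used as a predictor in a standard model formula, which is the sense in which one genotype gives several model terms.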

Design Details

Design:
• Genotypes are stored in ‘Factor’ objects, with factor levels formatted as ‘A/C’.
• A translation table is constructed to quickly extract individual allele information:

Genotype   Allele 1   Allele 2
A/A        A          A
A/B        A          B
B/B        B          B

Consequences:
• Can be stored in standard data frames
• Can be efficiently manipulated (space & time)
• Permits both biallelic (C/T) and multi-allelic genetic markers (SSLP’s)
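The idea of a factor plus a translation table can be sketched in base R as follows. This is only an approximation of the design described above, not the package's actual internals; `allele_map` and the other names are hypothetical.

```r
# Genotypes stored as an ordinary factor with levels such as "A/C".
g <- factor(c("A/A", "A/C", "C/C", "A/C", NA, "A/A"))

# Translation table: one row per factor level, one column per allele.
allele_map <- do.call(rbind, strsplit(levels(g), "/", fixed = TRUE))
dimnames(allele_map) <- list(levels(g), c("allele1", "allele2"))
print(allele_map)

# Individual alleles are recovered by indexing the table with the
# factor's integer codes, so no per-individual string parsing is needed.
alleles <- allele_map[as.integer(g), , drop = FALSE]
print(unname(alleles))

# Because g is a factor, it fits into a standard data frame.
d <- data.frame(id = seq_along(g), genotype = g)
```

The string split happens once per level rather than once per individual, which is where the space and time savings come from.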

Genotype Manipulation

Importing & Creation:
  genotype(), as.genotype(), makeGenotypes(), …
  haplotype(), as.haplotype(), makeHaplotypes(), …

Manipulation:
  [] (subsetting), []<- (subset assignment), == (equality)

Information:
  summary() (allele and genotype counts and frequencies), allele.names(), allele() (extract individual alleles), nallele() (number of distinct allele values)

Annotation:
  locus(), gene(), marker(), …

Transformation:
  carrier(), homozygote(), heterozygote(), allele.count()

Export:
  write.marker.file(), write.pedigree.file(), write.pop.file()

Installation

Windows GUI: (figure not recovered)

Command Line:
> install.packages("genetics", dependencies=TRUE)

Statistical Functions

Hardy-Weinberg (Dis-)Equilibrium: D, D’, r, r², X²
  diseq(), diseq.ci() (Confidence Intervals!)
  HWE.test(), HWE.chisq(), HWE.exact()

Linkage Disequilibrium: D, D’, r, r²
  LD(), LDplot(), LDtable()

Haplotype Imputation:
  hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()

Sample-size tools:
  gregorius() (Probability of observing a marker of given frequency with specified sample size)
  power.casectrl()

Utilities:
  Bootstrap.ci
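For readers unfamiliar with the Hardy-Weinberg quantities listed above, here is a minimal base R sketch of the textbook calculation for a biallelic marker. The genotype counts are hypothetical, and the sketch is not meant to reproduce what diseq() or HWE.chisq() compute internally.

```r
# Hypothetical genotype counts for a biallelic marker (illustration only).
n_AA <- 30; n_AB <- 50; n_BB <- 20
n <- n_AA + n_AB + n_BB

# Allele frequencies estimated from the genotype counts.
p <- (2 * n_AA + n_AB) / (2 * n)  # frequency of allele A
q <- 1 - p                        # frequency of allele B

# Genotype counts expected under Hardy-Weinberg equilibrium.
expected <- n * c(AA = p^2, AB = 2 * p * q, BB = q^2)
observed <- c(AA = n_AA, AB = n_AB, BB = n_BB)

# Pearson chi-square statistic, 1 degree of freedom for two alleles.
chisq   <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Disequilibrium coefficient: observed minus expected A/A frequency.
D <- n_AA / n - p^2

print(round(c(p = p, chisq = chisq, p_value = p_value, D = D), 4))
```

With these counts the observed genotypes sit very close to their expected values, so the statistic is small and D is near zero.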

Simple Examples : Creating Genotype Objects

A single vector with a character separator:

> g1 <- genotype( c('A/A','A/C','C/C','C/A',
+                   NA,'A/A','A/C','A/C') )

> g3 <- genotype( c('A A','A C','C C','C A',
+                   '','A A','A C','A C'),
+                 sep=' ', remove.spaces=FALSE )

Simple Examples : Creating Genotype Objects

A single vector with a positional separator:

> g2 <- genotype( c('AA','AC','CC','CA','',
+                   'AA','AC','AC'), sep=1 )

Two separate vectors:

> g4 <- genotype(
+   c('A','A','C','C','','A','A','A'),
+   c('A','C','C','A','','A','C','C')
+ )

Simple Examples : Creating Genotype Objects

A data frame or matrix with two columns:

> gm <- cbind(
+   c('A','A','C','C','','A','A','A'),
+   c('A','C','C','A','','A','C','C') )
> gm
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "C"  "A"
…
> g5 <- genotype( gm )
> g5
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C

Simple Examples : Creating Genotype Objects

Convert 1-column genotype variables read from a file:

> gm1 <- makeGenotypes(
+   read.csv("gm1.csv"))
> gm1
  Age Sex   G1  G2
1  31   M  A/A G/T
2  27   F  A/C G/G
3  35   M  C/C G/T
4  19   M  A/C G/T
5  55   M <NA> G/G
6  34   F  A/A G/G
7  45   F  A/C T/T
8  32   M  A/C G/T
> gm1$G1
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C

gm1.csv:

Age,Sex,G1,G2
31,M,A/A,G/T
27,F,A/C,G/G
35,M,C/C,G/T
19,M,A/C,G/T
55,M,,G/G
34,F,A/A,G/G
45,F,A/C,T/T
32,M,A/C,G/T

Simple Examples : Creating Genotype Objects

Convert 2-column genotype variables read from a file:

> gm2 <- makeGenotypes(
+   read.csv("gm2.csv"),
+   convert=list(3:4,5:6))
> gm2
  Age Sex G1.1/G1.2 G2.1/G2.2
1  31   M       A/A       G/T
2  27   F       A/C       G/G
3  35   M       C/C       G/T
4  19   M       A/C       G/T
5  55   M      <NA>       G/G
6  34   F       A/A       G/G
7  45   F       A/C       T/T
8  32   M       A/C       G/T

gm2.csv:

Age,Sex,G1.1,G1.2,G2.1,G2.2
31,M,A,A,G,T
27,F,A,C,G,G
35,M,C,C,T,G
19,M,C,A,G,T
55,M,,,G,G
34,F,A,A,G,G
45,F,A,C,T,T
32,M,A,C,T,G

Simple Examples : Displaying Genotype Information

“Raw”:

> g5
[1] "A/A" "A/C" "C/C"
[4] "A/C" NA    "A/A"
[7] "A/C" "A/C"
Alleles: A C

“Summary”:

> summary(g5)

Allele Frequency:
   Count Proportion
A      8       0.57
C      6       0.43
NA     2         NA

Genotype Frequency:
    Count Proportion
A/A     2       0.29
A/C     4       0.57
C/C     1       0.14
NA      1         NA
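The proportions in the summary can be checked by hand: with 14 observed alleles, A has 8/14 ≈ 0.57 and C has 6/14 ≈ 0.43. The base R sketch below reproduces that arithmetic from two allele vectors. It is an independent check with hypothetical variable names, not the code behind summary().

```r
# Allele pairs corresponding to the eight genotypes shown above.
a1 <- c("A", "A", "C", "A", NA, "A", "A", "A")
a2 <- c("A", "C", "C", "C", NA, "A", "C", "C")

# Allele counts, with missing calls listed separately.
allele_counts <- table(c(a1, a2), useNA = "ifany")
print(allele_counts)

# Proportions are taken over the observed (non-missing) alleles only.
observed_alleles <- table(c(a1, a2))
allele_prop <- round(observed_alleles / sum(observed_alleles), 2)
print(allele_prop)

# The same calculation at the genotype level.
geno <- paste(a1, a2, sep = "/")[!is.na(a1) & !is.na(a2)]
genotype_counts <- table(geno)
genotype_prop <- round(genotype_counts / sum(genotype_counts), 2)
print(genotype_prop)
```

Missing values are counted but excluded from the denominators, which is why their proportion is reported as NA.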

Simple Examples: Extracting allele information

Genotypes (Independent factor levels):

> g5
[1] "A/A" "A/C" "C/C" "A/C"
[5] NA    "A/A" "A/C" "A/C"
Alleles: A C

Allele Counts (Additive Effect):

> allele.count(g5, "A")
[1]  2  1  0  1 NA  2  1  1
attr(,"allele")
[1] "A"

Allele presence (Dominant Effect):

> carrier(g5,'A')
[1]  TRUE  TRUE FALSE  TRUE
[5]    NA  TRUE  TRUE  TRUE

Allele Homozygote (Recessive Effect):

> homozygote(g5,'A')
[1]  TRUE FALSE FALSE FALSE
[5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):

> heterozygote(g5,'A')
[1] FALSE  TRUE FALSE  TRUE
[5]    NA FALSE  TRUE  TRUE

Simple Examples: Extracting allele information

First allele:

> allele(g5, 1)
[1] "A" "A" "C" "A" NA  "A"
[7] "A" "A"
attr(,"which")
[1] 1
attr(,"allele.names")
[1] "A" "C"

Both alleles:

> allele(g5)
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "A"  "C"
[5,] NA   NA
[6,] "A"  "A"
[7,] "A"  "C"
[8,] "A"  "C"
attr(,"which")
[1] 1 2
attr(,"allele.names")
[1] "A" "C"

Example Session

Future Development

R GeneticsNG

Mission:

GeneticsNG is a collaborative project to develop a core set of data structures and analytic tools for the management, visualization, and analysis of genetic data. This core will provide sufficient ease of use, stability, features, documentation, and community support to inspire users and developers to use, contribute to, and extend the system.

Goals:
• Scalable to Whole-Genome genetic analysis (>1e5 SNPs)
• Read/Write common genetics data storage formats
• Port existing open-source genetics codes
  • Current R genetics packages (genetics, haplo.score, gap, …)
  • Other open-source packages…
• Provide good documentation, including tutorials and training
• Engage the entire R genetics user/developer community

Future Development

R GeneticsNG

Current Team:
• Pfizer: Gregory Warnes, Nitin Jain
• Channing Laboratory (Harvard): Ross Lazarus
• BMS: Scott D Chasalow, Giovanni Montana
• Insightful: Michael O'Connell
• Univ. Chicago: Junsheng Cheng
• Join us!

Project Page: http://r-genetics.sf.net/

References

R Project: http://www.r-project.org

R genetics package: http://cran.r-project.org/contrib/main/Descriptions/genetics.html

R-News article: Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.

R GeneticsNG project: http://r-genetics.sf.net/

Me: http://www.warnes.net [email protected]