Tutorial of STRUCTURE Software


  • 7/31/2019 Tutorial of STRUCTURE Software

    1/29

    Tutorial of the STRUCTURE software

    Dr. Sung-Chur Sim

    Tomato Genetics and Breeding Program, The Ohio State Univ., OARDC


    2/29

    STRUCTURE software

    A model-based clustering method (Pritchard et al. 2000)

    Free software (http://pritch.bsd.uchicago.edu/software/structure2_1.html)

    Bayesian approach (MCMC: Markov Chain Monte Carlo)

    Detects the underlying genetic populations among a set of individuals genotyped at multiple markers

    Computes the proportion of each individual's genome originating from each inferred population (quantitative clustering method)


    3/29

    Input data

    A matrix where the data for individuals are in rows and the loci are in columns

    n consecutive rows hold the data for each individual of an n-ploid species

    Integers should be used for coding genotypes

    Missing data should be indicated by a number which doesn't occur elsewhere in the data (e.g. -1)

    The data file should be a text file (.txt), not an Excel file (.xls), for running STRUCTURE
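
    The layout described above can be sketched in Python for a diploid species (2 consecutive rows per individual). This is only an illustration: the function name, the label and population columns, and the sample genotypes are assumptions, and the column options must match what is declared when the data are imported into a project.

```python
def format_structure_rows(label, pop, genotypes, missing=-1):
    """Return the 2 text rows for one diploid individual.

    genotypes: one (allele1, allele2) tuple of integers per locus,
               or None where the genotype is missing.
    missing:   integer code for missing data (must not occur as an allele).
    """
    rows = []
    for copy in range(2):  # one row per allele copy
        alleles = [missing if g is None else g[copy] for g in genotypes]
        fields = [label, str(pop)] + [str(a) for a in alleles]
        rows.append(" ".join(fields))
    return rows


# Hypothetical individual: 3 loci, the second one missing.
rows = format_structure_rows("ind1", 1, [(1, 2), None, (3, 3)])
print("\n".join(rows))
```

    Writing the joined rows to a plain .txt file gives a file in the row-per-allele-copy layout described above.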


    4/29

    [Figure: example input file. Callouts: information on user-defined populations (market class); missing data; 2 consecutive rows for alleles]


    5/29

    Running STRUCTURE from a graphical interface, the Front End

    The Front End organizes data analysis into projects


    6/29

    Importing input data into a project


    7/29

    Importing input data into a project (cont.)


    8/29

    Importing input data into a project (cont.)


    9/29

    Importing input data into a project (cont.)


    10/29

    Importing input data into a project (cont.)


    11/29

    Importing input data into a project (cont.)


    12/29

    Configuring a parameter set


    13/29

    Configuring a parameter set (cont.)

    Length of Burnin Period: how long to run the simulation before collecting data, to minimize the effect of the starting configuration

    Number of MCMC Reps after Burnin: how long to run the simulation after burnin to get accurate parameter estimates
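
    The role of the burnin period can be illustrated with a toy sampler. The sketch below is not STRUCTURE's model: it is a minimal Metropolis chain for a standard normal target, deliberately started far from the mode, and the names BURNIN and MCMC_REPS and their values are hypothetical and far smaller than those used in real runs.

```python
import math
import random


def metropolis_chain(start, n_steps, rng):
    """Toy Metropolis sampler whose target is a standard normal distribution."""
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log of the acceptance ratio for a N(0, 1) target.
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


BURNIN = 500       # steps discarded before collecting data (toy value)
MCMC_REPS = 5000   # steps kept after burnin (toy value)

rng = random.Random(0)
chain = metropolis_chain(start=50.0, n_steps=BURNIN + MCMC_REPS, rng=rng)

mean_all = sum(chain) / len(chain)   # still influenced by the start at 50
kept = chain[BURNIN:]                # samples collected after burnin
mean_kept = sum(kept) / len(kept)    # should sit near the true mean of 0

print("mean including burnin steps:", round(mean_all, 3))
print("mean after discarding burnin:", round(mean_kept, 3))
```

    The estimate that includes the early steps is pulled toward the arbitrary starting value, which is the effect the burnin period is meant to remove.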


    14/29

    Configuring a parameter set (cont.)


    15/29

    Configuring a parameter set (cont.)


    16/29

    Configuring a parameter set (cont.)


    17/29

    Configuring a parameter set (cont.)


    18/29

    Running STRUCTURE: a single run


    19/29

    Running STRUCTURE: a single run (cont.)


    20/29

    Running STRUCTURE: a batch run


    21/29

    Running STRUCTURE: a batch run (cont.)


    22/29

    Ln P(D): Estimated log probability of the data for each K


    23/29

    Inference of the true K (number of populations)

    The log likelihood for each K, Ln P(D) = L(K)

    Two approaches to determine the best K

    1. Use of L(K): When K is approaching a true value, L(K) plateaus (or continues increasing slightly) and has high variance between runs (Rosenberg et al. 2001).

    Nonparametric test (Wilcoxon test)

    2. Use of an ad hoc quantity (ΔK): Calculated based on the second-order rate of change of the likelihood function with respect to K (Evanno et al. 2005). ΔK shows a clear peak at the true value of K.

    $\Delta K = m(|L''(K)|) / s[L(K)]$

    Evanno et al. 2005. Molecular Ecology 14: 2611-2620


    24/29

    SAS code for the nonparametric method
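
    As a rough Python stand-in (not the SAS code referred to above), the Ln P(D) values from replicate runs at two successive K can be compared with a Wilcoxon rank-sum test. The sketch assumes a two-sided test with midranks for ties and a normal approximation without continuity correction; the function name and the sample values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        t = j - i                            # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value
    return u, p


# Hypothetical Ln P(D) values from replicate runs at K = 4 and K = 5.
lnpd_k4 = [-812.1, -810.4, -815.0, -811.7, -813.3]
lnpd_k5 = [-801.9, -799.5, -803.2, -800.8, -802.6]
u_stat, p_value = rank_sum_test(lnpd_k4, lnpd_k5)
print("U =", u_stat, "p =", round(p_value, 4))
```

    A small p-value suggests L(K) is still changing between the two K values; no significant difference is consistent with a plateau.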


    25/29

    Inference of best K using the delta K method

    The best K = 8

    L(K) = an average of 20 values of Ln P(D)

    $L'(K)_n = L(K)_n - L(K)_{n-1}$

    $|L''(K)_n| = |L'(K)_{n+1} - L'(K)_n|$

    $\Delta K = |L''(K)| / \mathrm{Stdev}[L(K)]$
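
    The ΔK calculation can be sketched in Python as follows. It assumes the Ln P(D) values from the replicate runs have already been collected by hand into a dictionary keyed by K; the function name and the numbers are hypothetical, and the second difference follows Evanno et al. (2005), so ΔK is undefined at the smallest and largest K tested.

```python
from statistics import mean, stdev


def delta_k(lnpd_by_k):
    """Compute delta K for each interior K.

    lnpd_by_k: dict mapping K -> list of Ln P(D) values from replicate runs.
    """
    l_mean = {k: mean(v) for k, v in lnpd_by_k.items()}   # L(K)
    l_sd = {k: stdev(v) for k, v in lnpd_by_k.items()}    # Stdev[L(K)]
    result = {}
    for k in sorted(lnpd_by_k):
        if (k - 1) not in l_mean or (k + 1) not in l_mean:
            continue  # needs both neighbours
        if l_sd[k] == 0:
            continue  # delta K is undefined when the runs do not vary
        # |L''(K)| = |L(K+1) - 2 L(K) + L(K-1)|
        second_diff = abs(l_mean[k + 1] - 2 * l_mean[k] + l_mean[k - 1])
        result[k] = second_diff / l_sd[k]
    return result


# Hypothetical Ln P(D) values, 3 replicate runs per K.
example = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-800, -802, -798],
    4: [-790, -795, -785],
    5: [-785, -795, -775],
}
dk = delta_k(example)
best_k = max(dk, key=dk.get)
print(dk, "best K =", best_k)
```

    In this made-up example the increase in L(K) slows sharply after K = 3, so ΔK peaks there.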


    26/29

    Q-matrix
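
    Each row of the Q-matrix gives, for one individual, the proportion of its genome originating from each inferred population, so the row sums to 1. A minimal Python sketch of reading such a matrix is given below; the individuals, the values, and the rule of reporting the cluster with the largest proportion are assumptions for illustration only.

```python
def dominant_cluster(q_row):
    """Return the 1-based index of the cluster with the largest proportion."""
    return max(range(len(q_row)), key=lambda i: q_row[i]) + 1


# Hypothetical Q-matrix for K = 3 (one row per individual).
q_matrix = {
    "ind1": [0.90, 0.05, 0.05],
    "ind2": [0.20, 0.30, 0.50],
    "ind3": [0.45, 0.55, 0.00],
}

assignments = {}
for name, row in q_matrix.items():
    # Membership proportions of one individual should sum to 1.
    assert abs(sum(row) - 1.0) < 1e-9
    assignments[name] = dominant_cluster(row)

print(assignments)
```

    An individual such as ind3 has substantial membership in two clusters, which is why the proportions themselves, not only the largest one, are usually reported.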


    27/29

    An example of steps to identify the best K

    Format the marker data

    Run STRUCTURE with 10K burnin and 50K MCMC reps, 20 times at each of K = 1 to 10

    Infer the true K (5~7)

    Run STRUCTURE with 500K burnin and 750K MCMC reps, 20 times at each of K = 3 to 8

    Identify the best K based on L(K) and ΔK
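
    The two rounds of runs in this example can be laid out as a simple run plan. The sketch below only enumerates the (K, replicate) combinations and counts iterations; it does not launch STRUCTURE, and the function and field names are hypothetical.

```python
from itertools import product


def run_plan(k_values, replicates, burnin, mcmc_reps):
    """List one entry per STRUCTURE run: each K repeated `replicates` times."""
    return [
        {"K": k, "rep": r, "burnin": burnin, "mcmc_reps": mcmc_reps}
        for k, r in product(k_values, range(1, replicates + 1))
    ]


# Round 1: short runs over a wide range of K.
preliminary = run_plan(range(1, 11), 20, burnin=10_000, mcmc_reps=50_000)
# Round 2: long runs over a narrower range of K.
final = run_plan(range(3, 9), 20, burnin=500_000, mcmc_reps=750_000)

for name, plan in (("preliminary", preliminary), ("final", final)):
    total = sum(run["burnin"] + run["mcmc_reps"] for run in plan)
    print(name, "runs:", len(plan), "total iterations:", total)
```

    The second round has fewer runs but far more iterations in total, which is why the range of K is narrowed first.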


    28/29

    We may not always be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data

    Pritchard et al. (2000)


    29/29

    Enjoy running STRUCTURE