Tutorial of STRUCTURE Software

7/31/2019 Tutorial of STRUCTURE Software


Tutorial of the STRUCTURE software

Dr. Sung-Chur Sim

Tomato Genetics and Breeding Program, The Ohio State Univ., OARDC



STRUCTURE software

A model-based clustering method (Pritchard et al. 2000)

Free software (http://pritch.bsd.uchicago.edu/software/structure2_1.html)

Bayesian approach (MCMC: Markov Chain Monte Carlo)

Detects the underlying genetic population among a set of individuals genotyped at multiple markers

Computes the proportion of the genome of an individual originating from each inferred population (quantitative clustering method)
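As a rough illustration of what "proportion of the genome originating from each inferred population" means in practice, the sketch below treats each individual's membership proportions as a row that sums to 1 and assigns the individual to a cluster only when one proportion is large enough. The function name, the 0.5 cutoff, and the example values are assumptions made for illustration; they are not taken from STRUCTURE or from this tutorial's data.

```python
def assign_clusters(q_matrix, threshold=0.5):
    """Assign each individual to its largest cluster, or None if admixed.

    q_matrix maps an individual's label to its membership proportions,
    one value per inferred population. The threshold is a hypothetical cutoff.
    """
    assignments = {}
    for label, proportions in q_matrix.items():
        # Proportions should sum to 1 (allowing for rounding in reported values).
        if abs(sum(proportions) - 1.0) > 0.01:
            raise ValueError(f"proportions for {label} do not sum to 1")
        best = max(range(len(proportions)), key=lambda k: proportions[k])
        # Clusters are numbered from 1; None marks an admixed individual.
        assignments[label] = best + 1 if proportions[best] >= threshold else None
    return assignments


# Made-up membership proportions for two individuals and K = 3.
q_matrix = {
    "line01": [0.90, 0.10, 0.00],
    "line02": [0.40, 0.35, 0.25],
}
assignments = assign_clusters(q_matrix)
print(assignments)
```

Because the method is quantitative, a hard assignment like this discards information; the cutoff only serves to summarize the proportions.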



Input data

A matrix where the data for individuals are in rows and the loci are in columns

n consecutive rows hold the data for each individual of an n-ploid species

Integers should be used for coding genotypes

Missing data should be indicated by a number which doesn't occur elsewhere in the data (e.g. -1)

The data file should be a text file (.txt), not an Excel file (.xls), for running STRUCTURE
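The layout described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function name, the sample labels, and the column order (label, user-defined population, then one column per locus) are chosen for illustration, and the exact columns STRUCTURE expects depend on how the project is configured.

```python
def to_structure_rows(samples, ploidy=2, missing=-1):
    """Build input rows: `ploidy` consecutive rows per individual.

    samples is a list of (label, population, genotypes), where genotypes
    holds one tuple of integer alleles per locus and None marks a missing allele.
    """
    rows = []
    for label, population, genotypes in samples:
        for copy in range(ploidy):
            alleles = []
            for locus in genotypes:
                allele = locus[copy]
                # Missing alleles are written as a number unused elsewhere.
                alleles.append(str(missing if allele is None else allele))
            rows.append(" ".join([label, str(population)] + alleles))
    return rows


# Two hypothetical diploid individuals genotyped at three loci.
samples = [
    ("line01", 1, [(1, 1), (2, 3), (None, None)]),
    ("line02", 2, [(1, 2), (3, 3), (4, 4)]),
]
rows = to_structure_rows(samples)
print("\n".join(rows))
```

Writing `rows` to disk with a plain `open("data.txt", "w")` keeps the result a text file rather than a spreadsheet.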



Example input file (screenshot), annotated: information of user-defined populations (market class); missing data; 2 consecutive rows for alleles



Running STRUCTURE from a graphical interface, Front End

The Front End organizes data analysis into projects



Importing input data into a project



Importing input data into a project (cont.)



Configuring a parameter set



Configuring a parameter set (cont.)

Length of Burnin Period: how long to run the simulation before collecting data to minimize the effect of the starting configuration

Number of MCMC Reps after Burnin: how long to run the simulation after burnin to get accurate parameter estimates
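To illustrate why the draws made during the burnin period are discarded, here is a toy sketch. It is not STRUCTURE's own sampler: it is a random-walk Metropolis chain targeting a standard normal distribution (true mean 0), deliberately started far from the target. The function name, starting value, and chain lengths are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean


def metropolis_chain(n_steps, start, seed=1):
    """Random-walk Metropolis chain whose target is a standard normal."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, 1.0)
        # Log of target(proposal) / target(x) for a standard normal target.
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


burnin, reps = 1000, 4000  # toy values, far smaller than a real run
chain = metropolis_chain(burnin + reps, start=50.0)
mean_all = mean(chain)            # pulled away from 0 by the starting value
mean_kept = mean(chain[burnin:])  # estimate after discarding the burnin
print(f"all draws: {mean_all:.2f}, after burnin: {mean_kept:.2f}")
```

The estimate from the retained draws should sit much closer to 0 than the estimate from all draws, which is the effect of the starting configuration that the burnin is meant to remove.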



Running STRUCTURE: a single run



Running STRUCTURE: a single run (cont.)



Running STRUCTURE: a batch run



Running STRUCTURE: a batch run (cont.)



Ln P(D): Estimated log probability of the data for each K



Inference of true K (number of populations)

The log likelihood for each K, Ln P(D) = L(K)

Two approaches to determine the best K

1. Use of L(K): When K is approaching a true value, L(K) plateaus (or continues increasing slightly) and has high variance between runs (Rosenberg et al. 2001).

Nonparametric test (Wilcoxon test)

2. Use of an ad hoc quantity (ΔK): Calculated based on the second order rate of change of the likelihood (ΔK) (Evanno et al. 2005). The ΔK shows a clear peak at the true value of K.

$\Delta K = m(|L''(K)|) / s[L(K)]$, where m is the mean and s the standard deviation across runs

Evanno et al. 2005. Molecular Ecology 14: 2611-2620



SAS code for the nonparametric method



Inference of best K using the delta K method

The best K = 8

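L(K) = an average of 20 values of Ln P(D)

$L'(K_n) = L(K_n) - L(K_{n-1})$

$|L''(K_n)| = |L'(K_{n+1}) - L'(K_n)|$

$\Delta K = |L''(K)| / \mathrm{Stdev}$, where Stdev is the standard deviation of the 20 Ln P(D) values at that K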
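The delta K calculation can be sketched as follows. This is a minimal example under stated assumptions: it averages Ln P(D) per K first and then takes differences, as on this slide, using the second difference centered on K from Evanno et al. (2005). The function name and the Ln P(D) values are made up for illustration (three runs per K instead of 20), and delta K is undefined for the smallest and largest K tested.

```python
from statistics import mean, stdev


def delta_k(lnpd_by_k):
    """Return {K: delta K} from {K: [Ln P(D) of each run]}."""
    mean_l = {k: mean(values) for k, values in lnpd_by_k.items()}
    sd_l = {k: stdev(values) for k, values in lnpd_by_k.items()}
    result = {}
    for k in sorted(lnpd_by_k):
        # The second difference needs both neighbours, K - 1 and K + 1.
        if k - 1 in mean_l and k + 1 in mean_l:
            second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
            result[k] = second_diff / sd_l[k]  # fails if all runs are identical
    return result


# Hypothetical Ln P(D) values, three runs at each K.
lnpd_by_k = {
    1: [-5000, -5002, -4998],
    2: [-4500, -4503, -4497],
    3: [-4000, -4001, -3999],
    4: [-3980, -3990, -3970],
    5: [-3970, -3990, -3950],
}
dk = delta_k(lnpd_by_k)
best_k = max(dk, key=dk.get)
print(dk, "best K =", best_k)
```

In this made-up data L(K) rises steeply up to K = 3 and then flattens, so delta K peaks at K = 3.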



Q-matrix



An example of steps to identify the best K

Format the marker data

Run STRUCTURE w/10K for burnin and 50K for MCMC reps, 20 times at each of K=1 to 10

Infer true K (5~7)

Run STRUCTURE w/500K for burnin and 750K for MCMC reps, 20 times at each of K=3 to 8

Identify the best K based on L(K) and ΔK
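The two rounds of runs in this example can be written down as a simple run plan, which makes the total number of runs explicit before anything is launched. This is only a bookkeeping sketch: the function and field names are hypothetical, and it does not call STRUCTURE itself.

```python
def run_plan(k_values, replicates, burnin, mcmc_reps):
    """List one entry per STRUCTURE run: every K, repeated `replicates` times."""
    return [
        {"K": k, "replicate": r, "burnin": burnin, "mcmc_reps": mcmc_reps}
        for k in k_values
        for r in range(1, replicates + 1)
    ]


# Round 1: short runs over a wide range of K (K = 1 to 10).
screening = run_plan(range(1, 11), replicates=20, burnin=10_000, mcmc_reps=50_000)
# Round 2: long runs over the narrowed range (K = 3 to 8).
refined = run_plan(range(3, 9), replicates=20, burnin=500_000, mcmc_reps=750_000)
print(len(screening), "screening runs;", len(refined), "refined runs")
```

The first round amounts to 200 short runs and the second to 120 long ones, so most of the computing time goes into the second round.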



We may not always be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data.

Pritchard et al. (2000)



Enjoy running STRUCTURE
