Hannam University Bioinformatics Lecture 9


Hannam University Bioinformatics Lecture 9: Personal Genome Sequencing #1


Bioinformatics

2014, 2nd Semester

Department of Life System Science

Hannam University

Lecture 9, 2014.10.28

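Syllabus

Week: Topic

Week 1: Overview and basic theory of bioinformatics
Week 2: Chuseok holiday (no class)
Week 3: Principles of sequence analysis I
Week 4: Principles of sequence analysis II
Week 5: Prediction of protein structure and function
Week 6: Genome sequencing and sequence assembly
Week 7: Midterm exam
Week 8: Next Generation Sequencing
Week 9: Personal genomics I
Week 10: Personal genomics II
Week 11: Expression profiling (transcriptomics)
Week 12: Metagenomics
Week 13: Recent research trends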

Personal Genome

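- Genetic variation between individual humans

- Phenotypes that differ because of genetic variation

• Skin color • Hair color • Eye color • Appearance • Height ...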

Snyder M et al. Genes Dev. 2010;24:423-431

What kinds of genetic variation exist between individual humans?

SNP/Indel

Phased SNP

Deletion

Insertion

Inversion

ACGTTTGGATAC

TGCAAACCTATG

ACGTTTGTATAC

TGCAAACATATG

SNP (Single Nucleotide Polymorphisms)

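• A change of a single base in the DNA sequence

• Compared with the standard reference sequence, each individual carries about 3 to 4 million SNPs

• Some variants occur at very low frequency, while others are common

- Common Variant (20-40% frequency)

- Rare Variant (frequency of 1% or less)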
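The definition of a SNP as a single-base change can be illustrated with a minimal sketch that compares two equal-length sequences position by position. The function name `find_snps` is hypothetical, and the two sequences are the 12-base example strands shown on the variation slide; real SNP detection works on aligned reads, not on two clean strings.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and contain no indels.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length (no indels)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


reference = "ACGTTTGGATAC"
sample = "ACGTTTGTATAC"
snps = find_snps(reference, sample)
print(snps)  # [(8, 'G', 'T')]
```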

SNPs vs. SNVs

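Both are variants found at a single base, but..

• SNP
– A variant already found in a given species (well characterized)
– Known to exist at a certain frequency in the population
– Validated in the population
– Recorded in dbSNP (http://www.ncbi.nlm.nih.gov/snp)

• SNV
– A variant found in 'only one person' (not well characterized)
– Occurs only at very low frequency
– Not validated as present in other members of the population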

Really a matter of frequency of occurrence

http://ccsb.stanford.edu/education/Nair_NGS.pptx
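Because the distinction is a matter of frequency, it may help to see how an allele frequency is computed from genotype counts. The sketch below is illustrative only: the function names and the genotype counts are made up, and the 1% cutoff is the "rare variant" threshold given earlier in these slides.

```python
def alt_allele_frequency(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """Frequency of the alternate allele in a diploid population sample."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes given")
    # Each heterozygote carries one alternate allele, each alt homozygote two.
    return (n_het + 2 * n_alt_hom) / total_alleles


def is_rare(frequency: float, rare_cutoff: float = 0.01) -> bool:
    """True when the allele frequency is at or below the rare-variant cutoff."""
    return frequency <= rare_cutoff


# Hypothetical counts: 990 reference homozygotes, 10 heterozygotes, 0 alt homozygotes.
rare_freq = alt_allele_frequency(990, 10, 0)
# Hypothetical counts for a much more frequent variant.
common_freq = alt_allele_frequency(400, 400, 200)
print(rare_freq, is_rare(rare_freq))
print(common_freq, is_rare(common_freq))
```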

TGCAAACCTATG

Indel (Insertion/Deletion)

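• Small (1 kb or less) additions or deletions of bases
• An estimated 300,000 to 600,000 insertions/deletions per individual
• Large Scale Structural Variation (additions or deletions of 2 kb or more)

- About 1,000 or more sites per individual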

TGCAAAC-TATG

TGCAAACC-TATG

TGCAAACCCTATG
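The gapped sequences above can be read as pairwise alignments against the reference TGCAAACCTATG. As a rough sketch, the following walks such an alignment column by column and reports each run of gaps as an insertion or a deletion. The function name `call_indels` is hypothetical, and the code assumes no column has a gap in both sequences.

```python
def call_indels(aligned_ref: str, aligned_sample: str) -> list[tuple[str, int, str]]:
    """Report indels from a gapped pairwise alignment.

    Each event is (kind, ref_position, bases), where ref_position is the
    1-based position of the last reference base before the event.
    """
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("aligned sequences must have the same length")
    events = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(aligned_ref)
    while i < n:
        r, s = aligned_ref[i], aligned_sample[i]
        if r == "-" and s != "-":
            # Gap in the reference: the sample has extra bases (insertion).
            j = i
            while j < n and aligned_ref[j] == "-":
                j += 1
            events.append(("insertion", ref_pos, aligned_sample[i:j]))
            i = j
        elif s == "-" and r != "-":
            # Gap in the sample: reference bases are missing (deletion).
            j = i
            while j < n and aligned_sample[j] == "-":
                j += 1
            events.append(("deletion", ref_pos, aligned_ref[i:j]))
            ref_pos += j - i
            i = j
        else:
            if r != "-":
                ref_pos += 1
            i += 1
    return events


deletion = call_indels("TGCAAACCTATG", "TGCAAAC-TATG")
insertion = call_indels("TGCAAACC-TATG", "TGCAAACCCTATG")
print(deletion)   # [('deletion', 7, 'C')]
print(insertion)  # [('insertion', 8, 'C')]
```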

Structural Variation

• Large Scale Change of DNA (1kb ~ 3Mb)

• Types

- Microscopic structural variation

- Copy Number variation

- Chromosomal Inversion

Microscopic structural variation

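• Genetic variation large enough to be observed under a microscope

• Aneuploidy: an abnormal chromosome number, such as an extra chromosome in addition to the normal 23 pairs

• Chromosome Translocation

Down syndrome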

Copy Number Variation

Amplification or loss of a specific region or gene within the genome

Copy Number Variations in the human genome

Surveys of human genetic variation

• The 1000 Genomes Project

– http://www.1000genomes.org/

– SNPs and structural variants

– genomes of about 2500 unidentified people from about 25 populations around the world will be sequenced using NGS technologies

• HapMap

– http://hapmap.ncbi.nlm.nih.gov/

– identify and catalog genetic similarities and differences

• dbSNP

– http://www.ncbi.nlm.nih.gov/snp/

– Database of SNPs and multiple small-scale variations that include indels, microsatellites, and non-polymorphic variants

• COSMIC

– http://www.sanger.ac.uk/genetics/CGP/cosmic/

– Catalog of Somatic Mutations in Cancer

• TCGA

– http://cancergenome.nih.gov/

– The Cancer Genome Atlas researchers are mapping the genetic changes in 20 selected cancers

Detecting variation between individuals

• SNP, Indel, Insertion/Deletion, Inversion…

• How can these changes be detected?

Microarray : High throughput

PCR-Sanger Sequencing : Low throughput

Next Generation Sequencing : Method of choice nowadays

Cost of DNA sequencing and cumulative number of genomes sequenced as a function of time.

Snyder M et al. Genes Dev. 2010;24:423-431

Determining personal genome variation by NGS

Snyder M et al. Genes Dev. 2010;24:423-431

Methods for detecting variation in a human genome sequence using DNA sequencing technologies.

Snyder M et al. Genes Dev. 2010;24:423-431

NGS Read Mapping

Map the sequencing data (reads) obtained from NGS onto the reference genome sequence

ATGAGATAGAGATAGAAAGGGAGAGAGAATAGA

Genome Sequence

Sequence Reads

We have already learned that BLAST or BLAT can do this.

However, sequencing data sets are enormous, so a much faster method than BLAST or BLAT is needed.
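To give a feel for why an index makes mapping fast, here is a toy seed-and-verify mapper: it indexes every k-mer of the genome once, then looks up each read's first k-mer instead of scanning the whole genome. This is only a sketch with hypothetical function names; it is not how Bowtie or BWA are implemented (those tools use indexes based on the Burrows-Wheeler transform), and a read with a mismatch inside its seed would be missed here. The genome string is the example sequence from this slide.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the genome to the 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


def map_read(read: str, genome: str, index: dict[str, list[int]], k: int,
             max_mismatches: int = 1) -> list[tuple[int, int]]:
    """Seed with the read's first k-mer, then verify the whole read.

    Returns (0-based genome position, mismatch count) for each accepted hit.
    """
    hits = []
    for start in index.get(read[:k], []):
        window = genome[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "ATGAGATAGAGATAGAAAGGGAGAGAGAATAGA"
k = 4
index = build_kmer_index(genome, k)
exact_hits = map_read("AAAGGGAGAG", genome, index, k)
snp_hits = map_read("AAAGGCAGAG", genome, index, k)
print(exact_hits)  # [(15, 0)]
print(snp_hits)    # [(15, 1)]
```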

NGS Read Mapping Software

Earlier: Eland, SOAP, MAQ

Newer: Bowtie, BWA, SOAP2

Newer tools are faster and use less memory

NGS Read Mapping

Requirements

- Reference Genome Sequences (FASTA format)

- Sequence Data (FASTQ format)

Software

- BWA

• Most of the software is Unix-based
• Genome data are very large, so running the analysis on an ordinary PC is difficult

Flow

Sequence data (FASTQ) + genome sequence -> alignment file (SAM format)

Galaxy

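Not the mobile phone -.-

http://usegalaxy.org

Most NGS analyses can be run through its web interface

First Thing to do..

Obtain the data (sequencing data, reference genome sequence) and upload it

Specify a file or an Internet location

Uploaded data and analysis results are stored in the History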

ftp://ftp.gmod.org/pub/gmod/Courses/2012/SummerSchool/Galaxy/phiX174_genome.fa

ftp://ftp.gmod.org/pub/gmod/Courses/2012/SummerSchool/Galaxy/phiX174_reads.fastqsanger

http://gmod.org/wiki/Galaxy_Tutorial_2012_Extras

https://usegalaxy.org/u/luce/h/workshopdatasets

Using example data uploaded by others

FASTQ Format
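A FASTQ record has four lines: a header starting with `@`, the base sequence, a separator starting with `+`, and a quality string with one character per base. The sketch below parses such records and decodes qualities, assuming the Sanger (Phred+33) encoding suggested by the `.fastqsanger` file name above. The function name and the example record are hypothetical.

```python
def parse_fastq(text: str) -> list[tuple[str, str, list[int]]]:
    """Parse FASTQ text into (read name, sequence, Phred quality scores).

    Assumes four lines per record and Phred+33 (Sanger) quality encoding.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - 33 for char in quality]
        records.append((header[1:], sequence, scores))
    return records


example = "@read1\nACGT\n+\nII5!\n"
records = parse_fastq(example)
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```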

Data QC

Data Trimming

Trim off low-quality data

(Figure: read quality before and after trimming)
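One simple way to picture trimming is to cut bases from the 3' end of a read until a base of acceptable quality is reached. The sketch below does only that; the function name, the cutoff of 20, and the example read are assumptions for illustration, and real trimming tools typically use more elaborate rules such as sliding windows.

```python
def trim_3prime(sequence: str, qualities: list[int],
                min_quality: int = 20) -> tuple[str, list[int]]:
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


# Hypothetical read whose quality drops toward the 3' end.
read_seq = "ACGTACGT"
read_qual = [38, 37, 35, 30, 28, 12, 8, 2]
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)  # ACGTA [38, 37, 35, 30, 28]
```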

https://usegalaxy.org/u/galaxyproject/p/galaxy-101-ngs-variant

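Sample data: mitochondrial DNA sequencing data from a mother and her child
• Mitochondria are maternally inherited.
• Map the reads to the mitochondrial reference DNA and find variants.

The sequencing data are paired-end

Uploaded

Quality check (FastQC) on the four datasets

Quality is acceptable, so proceed to mapping without trimming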

Map with BWA

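The reference is human mtDNA

These data are paired-end

Child, first read file

Child, second read file

Start mapping

Now the mother's data

Mapping complete

Mapped positions

Paired End Reads

Genome

Keep only reads that mapped properly, as shown here

Select the SAM file to filter

Yes

Keep only reads that are mapped and properly paired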
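In a SAM file, whether a read is mapped and properly paired is recorded in the FLAG column (the second tab-separated field) as a set of bits. The following sketch shows the idea behind this filtering step outside Galaxy; it is not the Galaxy tool itself, the function names are hypothetical, and the example SAM lines are made up.

```python
# FLAG bits defined by the SAM format.
PAIRED = 0x1         # read is part of a pair
PROPER_PAIR = 0x2    # both mates aligned as a proper pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def is_mapped_proper_pair(flag: int) -> bool:
    """True for reads that are mapped and properly paired."""
    return (bool(flag & PAIRED) and bool(flag & PROPER_PAIR)
            and not flag & UNMAPPED and not flag & MATE_UNMAPPED)


def filter_sam_lines(lines: list[str]) -> list[str]:
    """Keep header lines and alignment lines that pass the flag test."""
    kept = []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are always retained
            continue
        flag = int(line.split("\t")[1])
        if is_mapped_proper_pair(flag):
            kept.append(line)
    return kept


# Made-up example: one proper pair (flags 99/147) and one unmapped pair (77/141).
sam_lines = [
    "@SQ\tSN:chrM\tLN:16569",
    "r1\t99\tchrM\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r1\t147\tchrM\t300\t60\t4M\t=\t100\t-204\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
filtered = filter_sam_lines(sam_lines)
print(len(filtered))  # 3
```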

Convert the SAM file to BAM format (supported by many programs)

Merge the mother and child data

Variant Calling

• mpileup (SAMtools)

• Genome Analysis Toolkit (GATK)

The task of finding SNPs and indels from a BAM/SAM file (alignment file)

SAM/BAM (sequence alignments) -> Variant caller -> VCF (variant information)
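The core idea of a pileup-based caller can be sketched in a few lines: stack the aligned reads over the reference, count the bases seen at each position, and report positions where a non-reference base dominates. This toy version is not mpileup or GATK; it handles ungapped reads only, and the function names, thresholds, and reads are assumptions chosen for illustration.

```python
from collections import Counter


def pileup(reads: list[tuple[int, str]], genome_length: int) -> list[Counter]:
    """Count the bases observed at each reference position.

    Each read is (0-based start position, sequence) with no gaps.
    """
    columns = [Counter() for _ in range(genome_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


def call_snvs(reference: str, reads: list[tuple[int, str]],
              min_depth: int = 3, min_alt_fraction: float = 0.8):
    """Return (1-based position, ref base, alt base, depth) for each call."""
    calls = []
    for pos, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_alt_fraction:
            calls.append((pos + 1, reference[pos], base, depth))
    return calls


reference = "ACGTTTGGATAC"
# Hypothetical reads that all carry a T at reference position 8.
reads = [(0, "ACGTTTGT"), (2, "GTTTGTAT"), (4, "TTGTATAC"), (5, "TGTATAC")]
calls = call_snvs(reference, reads)
print(calls)  # [(8, 'G', 'T', 4)]
```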

Visualize mapping

Convert the mapped data to BAM format

Downloads

Several software tools exist for viewing alignments in SAM/BAM format

SNV Filtering

Pre-processing in the mapping phase and SNV filtering help minimize false positives

• Absent in dbSNP

• Exclude LOH events

• Retain non-synonymous

• Sufficient depth of read coverage

• SNV present in given number of reads

• High mapping and SNV quality

• SNV density in a given bp window

• SNV greater than a given bp from a predicted indel

• Strand balance/bias

• Concordance across various SNV callers

http://ccsb.stanford.edu/education/Nair_NGS.pptx
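Several of the criteria above can be combined into a single pass/fail check per candidate SNV. The sketch below covers only some of them (absence from dbSNP, read depth, number of supporting reads, mapping and SNV quality, strand balance). The record fields, threshold values, and candidate records are all hypothetical and would need tuning for real data.

```python
def passes_filters(snv: dict, min_depth: int = 10, min_alt_reads: int = 3,
                   min_mapping_quality: int = 30, min_snv_quality: int = 30,
                   min_strand_fraction: float = 0.1) -> bool:
    """Apply a subset of the SNV filters to one candidate record."""
    if snv["in_dbsnp"]:
        return False  # keep only variants absent from dbSNP
    if snv["depth"] < min_depth:
        return False  # insufficient read coverage
    alt_reads = snv["alt_forward"] + snv["alt_reverse"]
    if alt_reads == 0 or alt_reads < min_alt_reads:
        return False  # too few reads support the variant
    if snv["mapping_quality"] < min_mapping_quality:
        return False
    if snv["snv_quality"] < min_snv_quality:
        return False
    # Strand balance: the variant should be seen on both strands.
    minor_strand = min(snv["alt_forward"], snv["alt_reverse"])
    return minor_strand / alt_reads >= min_strand_fraction


# Hypothetical candidates.
candidates = [
    {"id": "a", "in_dbsnp": False, "depth": 40, "alt_forward": 9,
     "alt_reverse": 8, "mapping_quality": 60, "snv_quality": 50},
    {"id": "b", "in_dbsnp": True, "depth": 40, "alt_forward": 9,
     "alt_reverse": 8, "mapping_quality": 60, "snv_quality": 50},
    {"id": "c", "in_dbsnp": False, "depth": 6, "alt_forward": 2,
     "alt_reverse": 2, "mapping_quality": 60, "snv_quality": 50},
    {"id": "d", "in_dbsnp": False, "depth": 50, "alt_forward": 12,
     "alt_reverse": 0, "mapping_quality": 60, "snv_quality": 50},
]
kept_ids = [snv["id"] for snv in candidates if passes_filters(snv)]
print(kept_ids)  # ['a']
```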

Variant Annotation

• Interpretation of the variants actually found

• SeattleSeq

– annotation of known and novel SNPs

– includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g., missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association

• Annovar

– Gene-based annotation

– Region-based annotations

– Filter-based annotation

http://snp.gs.washington.edu/SeattleSeqAnnotation/

http://www.openbioinformatics.org/annovar/

http://ccsb.stanford.edu/education/Nair_NGS.pptx

Galaxy Demo

Discovery of CNV

More difficult than interpreting SNPs/SNVs

Methods for discovering CNVs: Comparative Genome Hybridization, WGS (whole-genome sequencing)

Comparative Genome Hybridization

Discovery of CNV by WGS

1. Generation of ‘Mate-Pair’ Library

Long Insert :

2. Mapping of Reads To Genome

3. Detection of Deletion

4. Detection of Insertion

5. Detection of Duplication
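A common way to reason about steps 3 and 4 is to compare how far apart the two mates map on the reference with the insert size expected from the library. If the sample lacks a segment that the reference has, the mates map farther apart than expected; if the sample carries extra sequence, they map closer together. The sketch below encodes only this heuristic. The function name, the insert size of 3,000 bp, and the tolerance are assumptions, and duplications (step 5) need additional evidence that is not modeled here.

```python
def classify_pair(mapped_span: int, expected_insert: int, tolerance: int) -> str:
    """Classify a mate pair by its mapped span on the reference genome."""
    if mapped_span > expected_insert + tolerance:
        return "deletion"    # sample is missing sequence present in the reference
    if mapped_span < expected_insert - tolerance:
        return "insertion"   # sample has extra sequence absent from the reference
    return "concordant"


# Hypothetical library: 3,000 bp inserts, +/- 300 bp tolerance.
labels = [classify_pair(span, expected_insert=3000, tolerance=300)
          for span in (3050, 8000, 1200)]
print(labels)  # ['concordant', 'deletion', 'insertion']
```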

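Interpreting variation

- Interpret the biological meaning of SNPs/SNVs, indels, insertions, deletions, inversions, and other variants found in a personal genome

- Compare with previously known variation

- For a novel variation, infer its meaning

dbSNP: database of short DNA variations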

Example of SNP

R577X in the ACTN3 gene: loss of gene function

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1815739

Detailed information on this SNP

SNPedia: a wiki for SNP information

http://www.snpedia.com/index.php/SNPedia

Search here for data on Rs1815739

http://www.snpedia.com/index.php/Rs1815739

ACTN3: introduces a stop codon into alpha-actinin-3, a muscle protein

C:C, the protein is made properly: RR
T:T, the protein is not made properly: XX

The T:T genotype is very rare among athletes

It makes no difference in everyday life, but athletes need this gene
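The mapping from rs1815739 genotype to protein status can be written as a small lookup. The C:C and T:T cases follow the slide; the heterozygous C:T case, shown here as RX, is inferred from them and is an assumption of this sketch, as is the function name.

```python
# Allele at rs1815739 -> residue at position 577 of alpha-actinin-3.
# C encodes arginine (R); T creates a stop codon (X).
ALLELE_TO_RESIDUE = {"C": "R", "T": "X"}


def actn3_protein_genotype(genotype: str) -> str:
    """Convert a genotype such as 'C:T' into a protein genotype such as 'RX'."""
    alleles = genotype.upper().split(":")
    if len(alleles) != 2 or any(a not in ALLELE_TO_RESIDUE for a in alleles):
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return "".join(sorted(ALLELE_TO_RESIDUE[a] for a in alleles))


results = {g: actn3_protein_genotype(g) for g in ("C:C", "C:T", "T:T")}
print(results)  # {'C:C': 'RR', 'C:T': 'RX', 'T:T': 'XX'}
```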

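23andMe: a service that reports an individual's genomic information (SNPs)

http://23andme.com

Spit into a tube and send it to the company..

DNA is extracted, then SNP genotyping is performed

Results can be viewed on the website..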
