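Hyungyong Kim (김형용), Kyuyeol Lee (이규열), Sungchan Lee (이성찬) _ 2013. 02. 05 ~ 2013. 02. 06, R&D Center, Insilicogen, Inc. NGS Analysis using Galaxy: 2013 Korea Genome Organization (KOGO) Winter Symposium, Bioinformatics Analysis Training Workshop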


Page 1: Kogo 2013-ngs galaxy


Page 2: Kogo 2013-ngs galaxy

01 Galaxy introduction
02 Galaxy examples 1,2
03 Galaxy installation
04 Galaxy function details
05 Galaxy examples 3,4
06 Galaxy tools
07 Galaxy on Grid
08 Galaxy on Cloud

Index NGS Analysis using Galaxy

Page 3: Kogo 2013-ngs galaxy

Agenda

3 Copyrightⓒ Insilicogen,Inc. 2011. All rights reserved.

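Part 1: Introduction and Application

15:00 ~ 15:20 Introduction to Galaxy (presenter: Hyungyong Kim, 김형용)
15:20 ~ 15:50 Galaxy analysis example demonstration
  1. Finding the human exons with the most SNPs
  2. NGS QC and assembly example
16:00 ~ 16:20 Galaxy installation (presenter: Sungchan Lee, 이성찬)
16:20 ~ 17:10 Hands-on: Galaxy installation and analysis examples
  1. Galaxy installation
  2. Finding the human exons with the most SNPs
  3. NGS QC and assembly example
17:20 ~ 17:50 Galaxy functions in detail (presenter: Hyungyong Kim)

Part 2: Custom operation

09:00 ~ 09:20 Galaxy analysis example demonstration (presenter: Hyungyong Kim)
  1. RNA-seq analysis example
  2. NGS analysis example 2
09:20 ~ 09:50 Hands-on: Galaxy analysis examples
  1. RNA-seq analysis example
  2. NGS analysis example 2
10:00 ~ 10:20 Understanding Galaxy tools (presenter: Hyungyong Kim)
10:20 ~ 11:00 Hands-on: writing a Galaxy tool
  1. Primer design
11:10 ~ 11:30 Galaxy on Grid (presenter: Kyuyeol Lee, 이규열)
  1. Understanding the grid
  2. Distributed job demonstration
11:30 ~ 11:50 Galaxy on Cloud (presenter: Hyungyong Kim)
  1. Understanding the cloud
  2. Galaxy on Amazon EC2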

Page 4: Kogo 2013-ngs galaxy

NGS Technologies

Page 5: Kogo 2013-ngs galaxy

Sequencer Comparison


Illumina: HiSeq 2000, HiSeq 1000, HiScan SQ, GAIIx
• Read length: 2X100 bp, 2X150 bp
• Gb/day (in instrument order): 55, 35, 17.5, 6.5
• Yield (in instrument order): 600 Gb, 300 Gb, 150 Gb, 95 Gb
• Required input: 50 ng with Nextera, 100 ng – 1 μg with TruSeq
• Accuracy: 85% (2X50 bp, >Q30), 80% (2X100 bp, >Q30)

454: GS FLX
• Read length: 400 bp
• Gb/day: 10h
• Yield: 35 Mb
• Accuracy: 99% (>Q20)

SOLiD: 5500 (microbeads), 5500xl (microbeads), 5500xl (nanobeads)
• Read length: mate pair 60 bp X 60 bp, paired-end 75 bp X 35 bp, fragment 75 bp
• Gb/day (in instrument order): 10-15, 20-30, 30-45
• Yield (in instrument order): 90 Gb, 180 Gb, 300 Gb
• Accuracy: 99.99%

Illumina Gb/day figures are from 2X100 bp runs.

Illumina read length: 1X35, 2X50, 2X100. GA: 1X35, 2X50, 2X100, 2X150.

Page 6: Kogo 2013-ngs galaxy

Applications


Interaction of DNA and Protein

Mutation Detection

Structure Variation

Transcriptional Control

Personal Genomics

Personal Genomics

Environmentology

Microbiology Toxicology

Chemical Biology

Application of NGS Technique

Page 7: Kogo 2013-ngs galaxy

Issue of New Genomic Era.

Many researchers, having invested in next generation sequencing instruments, now face a computational bottleneck in their research workflow.

BGI

Page 8: Kogo 2013-ngs galaxy

Most Significant Improvement to Your Next Generation Sequencing Workflow

(Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC)

Page 9: Kogo 2013-ngs galaxy

Issue of New Genomic Era.

Library construction
• DNA shearing
• Insert into high and/or low copy number vectors

Template purification
• PCR amplicons
• BACs
• Cosmids/Fosmids

Sequence delineation
• Big Dye
• ABI 3730
• Data compilation

Finishing & Assembly
• Primer walking
• Transposon insertion methods
• Proprietary & commercial assembly

Sequence annotation
• Gene prediction
• BLAST search

Secondary annotation
• SNP
• Comparative genomics
• Expression analysis

Data delivery
• FTP
• Web browser
• Commercial software

Cost, Process, Bioinformatics

Page 10: Kogo 2013-ngs galaxy

Application of Next Genomic Data


Page 11: Kogo 2013-ngs galaxy

Practical Software Platforms for NGS data analysis

Page 12: Kogo 2013-ngs galaxy

What kind of?

• Biological Features

• Framework (Enterprise/Informatics) Features

• Service

• Price

Page 13: Kogo 2013-ngs galaxy

List of NGS Frameworks


Page 14: Kogo 2013-ngs galaxy

HugeSeq: a pipeline specialized for extracting genetic variants

Page 15: Kogo 2013-ngs galaxy

CLC Genomics Server: a user-friendly GUI environment

1. CLC Genomics Server
- Enterprise solution with a three-tier system architecture for data analysis, sharing, and management

2. CLC Bioinformatics Database
- Database for centralized storage, sharing, and management of data

3. CLC Assembly Cell
- Ultra-fast assembly analysis solution for NGS data (command-line based)

4. CLC Genomics Workbench
- Solution for a wide range of bioinformatics analyses of NGS data (GUI based)

5. CLC Developer Kit
- Solution for customizing the bioinformatics tools and workflows that users need

Page 16: Kogo 2013-ngs galaxy


Page 17: Kogo 2013-ngs galaxy

30x human genome, 1 sample (150G): 5 million KRW (1 year of storage)

Page 18: Kogo 2013-ngs galaxy

Funded by Google and integrated with the NCBI SRA service, so analysis can start online right away without running experiments

Page 19: Kogo 2013-ngs galaxy

GALAXY

Page 20: Kogo 2013-ngs galaxy


Page 21: Kogo 2013-ngs galaxy


Page 22: Kogo 2013-ngs galaxy

What is Galaxy

Galaxy, a web-based genome analysis platform
http://usegalaxy.org
• An open-source framework for integrating various computational tools and databases into a cohesive workspace
• A web-based service we provide, integrating many popular tools and resources for comparative genomics
• A completely self-contained application for building your own Galaxy-style sites

Page 23: Kogo 2013-ngs galaxy

Galaxy Usage

• One of the fastest-growing open source bioinformatics projects: a highly successful high-throughput data analysis platform for the life sciences, with over 15,000 users worldwide
• Annual Galaxy Community Conference

Page 24: Kogo 2013-ngs galaxy

Galaxy visualization

External Genome Browser

UCSC

Ensembl

GBrowse


Trackster

Track/data viewer in web browser

HTML5 Canvas, jQuery

Renders in browser, not on server

Page 25: Kogo 2013-ngs galaxy

Galaxy visualization


Page 26: Kogo 2013-ngs galaxy

Trackster


Page 27: Kogo 2013-ngs galaxy

Trackster


Page 28: Kogo 2013-ngs galaxy

Trackster


Page 29: Kogo 2013-ngs galaxy

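Galaxy components

Main components of Galaxy

Datasources: specify the input data. Data from a separate local system or from an external website can be registered.

Tool: the smallest unit of analysis. With a local installation, you can write and add the tools you need.

History: the list of intermediate results produced as the input data passes through a combination of tools.

Workflow: a History yields new results when only the input data and parameters are changed. A Workflow registers that sequence as a separate, reusable process.

Visualization: connects analysis results to visualization tools.

Page: a report-writing feature that brings the elements above together.

Example: an eprimer3 tool written separately and registered in Galaxy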

Page 30: Kogo 2013-ngs galaxy

A Galaxy tool

Tool: input format -> output format

Takes input data (in the expected format), processes it, and produces output data (in the expected format).

Combining tools gives a Workflow.
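As a rough illustration of this idea (not Galaxy's actual API), a tool can be modelled as a function from input data to output data in a known format, and a workflow as an ordered chain of such functions. The names `to_upper`, `reverse`, and `run_workflow` are made up for this sketch.

```python
from functools import reduce
from typing import Callable, List

# In this toy model a "tool" is a function: FASTA text in, FASTA text out.
Tool = Callable[[str], str]


def to_upper(fasta: str) -> str:
    """Upper-case sequence lines, leaving '>' description lines untouched."""
    return "\n".join(
        line if line.startswith(">") else line.upper()
        for line in fasta.splitlines()
    )


def reverse(fasta: str) -> str:
    """Reverse each sequence line (a plain reversal, not a reverse complement)."""
    return "\n".join(
        line if line.startswith(">") else line[::-1]
        for line in fasta.splitlines()
    )


def run_workflow(tools: List[Tool], data: str) -> str:
    """Feed the output of each tool into the next one, in order."""
    return reduce(lambda current, tool: tool(current), tools, data)


result = run_workflow([to_upper, reverse], ">seq1\nacgt")
print(result)
```

Because both toy tools read and write the same format, they can be chained in any order, which is the property that lets tools be combined into a workflow.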

Page 31: Kogo 2013-ngs galaxy

Galaxy formats

Auto-detect: automatically recognizes the format of the data.

Ab1: A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.

Axt: blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields.

Bam: A binary file compressed in the BGZF format with a '.bam' file extension.

Bed: Tab delimited format (tabular). Does not require a header line.

Fasta: A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters.

FastqSolexa: Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file.

Gff: GFF lines have nine required fields that must be tab-separated.

Gff3: The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.

Interval (Genomic Intervals): Tab delimited format (tabular).

Lav: Lav is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.

MAF: TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value" pairs. There should be no white space surrounding the "=".

Scf: A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file.

Sff: A binary file in 'Standard Flowgram Format' with a '.sff' file extension.

Tabular (tab delimited): Any data in tab delimited format (tabular).

Wig: The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which adds a number of options for controlling the default display of this track.

Other text type: Any text file.
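To make the auto-detect idea concrete, here is a minimal sketch that guesses a format from the first non-empty line, using only the rules quoted in the table above (">" for FASTA, "##maf" for MAF, "#:lav" for LAV, tabs for tabular). The function name `sniff_format` is hypothetical, and Galaxy's own detection is more thorough than this.

```python
def sniff_format(text: str) -> str:
    """Guess a format name from the first non-empty line of a text dataset."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("##maf"):
            return "maf"
        if line.startswith("#:lav"):
            return "lav"
        if "\t" in line:
            return "tabular"
        return "txt"  # any other text
    return "txt"  # empty input


print(sniff_format(">seq1 description\nACGT"))
print(sniff_format("chr1\t100\t200"))
```

Binary formats such as Ab1 and Scf cannot be recognised this way, which is consistent with the table's note that they must be selected manually on upload.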

Page 32: Kogo 2013-ngs galaxy

Galaxy features, once more

• Uses Amazon Cloud
• Built-in NGS analysis functions; papers provide Galaxy URLs
• Transparent analysis

Recent trends in Galaxy usage

Biologist, Bioinformatician

• Written in Python, but extensions do not have to be written in Python.
• "Transparent" analysis flows can be built, shared, and extended.
• Almost any bioinformatics analysis can be done with Galaxy.
• "Being good at Galaxy alone is enough to get hired" (NCBI)

Page 33: Kogo 2013-ngs galaxy

GALAXY Examples 1

Page 34: Kogo 2013-ngs galaxy

Example 1.

Finding Human Exons with the highest number of SNPs

1. Download all human exons from NCBI, Ensembl BioMart, or the UCSC Table Browser
2. Download all human SNPs from …
3. Scripting
  • Join 1 and 2 according to position
  • Group by exon ID
  • Sort by SNP count
  • Filter exons that have more than 10 SNPs

Have to do programming! (Python, Perl, …)
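For comparison with the Galaxy version, the scripting step might look roughly like the following sketch. The data are invented toy values, the tuple layouts and function names are assumptions, and the threshold is lowered from 10 to 1 so the toy data produce a result.

```python
from collections import Counter
from typing import List, Tuple

Exon = Tuple[str, int, int, str]  # (chrom, start, end, exon_id), end exclusive
Snp = Tuple[str, int]             # (chrom, position)


def count_snps_per_exon(exons: List[Exon], snps: List[Snp]) -> Counter:
    """Join exons and SNPs by position, then group by exon ID and count."""
    counts: Counter = Counter()
    for chrom, start, end, exon_id in exons:
        for snp_chrom, pos in snps:
            if snp_chrom == chrom and start <= pos < end:
                counts[exon_id] += 1
    return counts


def top_exons(counts: Counter, min_snps: int) -> List[Tuple[str, int]]:
    """Sort by SNP count (descending) and keep exons with more than min_snps."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(exon_id, n) for exon_id, n in ranked if n > min_snps]


# Toy data, invented for illustration only.
exons = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 400, "exonB"),
    ("chr2", 100, 200, "exonC"),
]
snps = [("chr1", 110), ("chr1", 150), ("chr1", 199), ("chr1", 350), ("chr2", 50)]

counts = count_snps_per_exon(exons, snps)
print(top_exons(counts, min_snps=1))
```

The nested loop compares every exon with every SNP, which is fine for a toy example but far too slow for genome-wide data; a real script would sort both lists or use an interval index.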

Page 35: Kogo 2013-ngs galaxy

On Galaxy


http://usegalaxy.org

Page 36: Kogo 2013-ngs galaxy

On Galaxy


1. Get data > UCSC main: fetch the exon data
2. Get data > UCSC main: fetch the SNP data
3. Operate on Genomic Intervals > Join: extract the exons whose regions overlap SNPs
4. Join, Subtract and Group > Group: group by exon name and count the SNPs
5. Filter and Sort > Sort: sort the exons by SNP count
6. Text Manipulation > Select first: extract the top 5 exons with the most SNPs
7. Join, Subtract and Group > Compare two Datasets: recover the exon information that was lost along the way

Page 37: Kogo 2013-ngs galaxy

GALAXY Examples 2

Page 38: Kogo 2013-ngs galaxy

Example 2.

Human NGS data QC and assembly

1. NGS Quality Control
2. NGS Single End Mapping
3. SNP Calling
4. Compare with dbSNP

Have to do in Unix and need programming! (Python, Perl, …)

Page 39: Kogo 2013-ngs galaxy

On Galaxy


http://usegalaxy.org

Page 40: Kogo 2013-ngs galaxy

On Galaxy

Additional programs must be installed for NGS analysis

( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )

Program | Used for | Installation
Fastx-toolkit | NGS QC | Ubuntu apt-get
Gnuplot | NGS QC boxplot | Ubuntu apt-get
Bowtie2 | Reference assembly | Copy, then set PATH
SAMTools | SNP calling | Ubuntu apt-get

Page 41: Kogo 2013-ngs galaxy

On Galaxy

1. Get data > Upload File: upload the human Illumina fastq file
2. NGS: QC and manipulation > FASTQ Groomer: convert to the fastqsanger format
3. NGS: QC and manipulation > Compute quality statistics: view fastq quality statistics
4. NGS: QC and manipulation > Draw quality score boxplot: draw a boxplot from the fastq quality statistics
5. NGS: QC and manipulation > FASTQ Trimmer, Quality Trimmer, Masker: trim or mask the uninformative parts
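To show what per-position quality statistics mean, the sketch below computes the minimum, median, and maximum Phred score at each read position from FASTQ text. It assumes four-line records and the Sanger encoding (ASCII offset 33, the fastqsanger format produced by the groomer step); the function name and toy reads are invented, and this is not the code behind the Galaxy tool.

```python
from statistics import median
from typing import Dict, List


def quality_by_position(fastq_text: str, offset: int = 33) -> List[Dict[str, float]]:
    """Return min, median and max Phred quality for each read position."""
    lines = [line for line in fastq_text.splitlines() if line]
    columns: List[List[int]] = []
    # Each FASTQ record is four lines; the fourth holds the quality string.
    for i in range(3, len(lines), 4):
        for pos, char in enumerate(lines[i]):
            if pos == len(columns):
                columns.append([])  # first read long enough to reach this position
            columns[pos].append(ord(char) - offset)
    return [
        {"min": min(col), "median": median(col), "max": max(col)}
        for col in columns
    ]


# Two invented reads: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
fastq = "@r1\nACG\n+\nII5\n@r2\nACG\n+\nI5+\n"
stats = quality_by_position(fastq)
for position, row in enumerate(stats, start=1):
    print(position, row)
```

A boxplot of these per-position values is what makes a drop in quality toward the 3' end visible, which in turn motivates the trimming and masking step.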

Page 42: Kogo 2013-ngs galaxy

On Galaxy

1. Get data > Upload File: provide the reference sequence for reference assembly
2. NGS: Mapping > Bowtie2: assemble (map) the reads with Bowtie2
3. NGS: SAM Tools > MPileup: extract SNP and indel information from the BAM file
4. NGS: SAM Tools > Filter pileup: keep the high-scoring SNPs and indels
5. NGS: SAM Tools > Pileup-to-interval: convert to the genomic interval format
6. Get data > UCSC Main: fetch the dbSNP data
7. Operate on Genomic Intervals > Join: extract the SNPs whose regions overlap

Page 43: Kogo 2013-ngs galaxy

Galaxy Installation

Page 44: Kogo 2013-ngs galaxy

Install Virtualbox - Ubuntu

1. Copy the Virtualbox and Galaxy folders from the USB drive.

2. Install Virtualbox.

3. Start Virtualbox and import the Galaxy image.

4. In Settings, change the network to Bridge.

5. After booting Ubuntu, delete the network configuration file:
   sudo rm /etc/udev/rules.d/70-persistent-net.rules

6. Shut down Linux (Ubuntu) and start it again:
   sudo shutdown -h now

Page 45: Kogo 2013-ngs galaxy

Creating your own Galaxy


Page 46: Kogo 2013-ngs galaxy

Running Galaxy in a production environment

By default, Galaxy uses
• SQLite database
• Built-in HTTP server for all tasks
• Local job runner
• Single process
• Simplest error-proof configuration

Change configuration for service
• Disable the developer settings: use_interactive = False, debug = False
• Get a real database: PostgreSQL
• Offload the menial tasks to a proxy: nginx, Apache
• Let your tools free with a cluster: move intensive processing to other hosts (TORQUE, GRID, DRMAA)
• Other advanced settings

Page 47: Kogo 2013-ngs galaxy

Galaxy on Cluster


Intensive processes to other hosts

TORQUE

GRID

DRMAA

Working with Galaxy on the Cloud

Page 48: Kogo 2013-ngs galaxy

Virtualization

Page 49: Kogo 2013-ngs galaxy

Virtualization

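• A term for the abstraction of computer resources
• Creates virtual versions of physical resources
• A technology that logically divides one physical hardware resource into several, or
• logically combines several hardware resources into one
• Recently in the spotlight as a way to solve problems such as hardware management and system recovery after a disaster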

Page 50: Kogo 2013-ngs galaxy

Virtualization

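Advantages of virtualization

• Cost reduction
  One server can be partitioned into several servers
  Lower server purchase, electricity, floor space, and server administration costs

• Efficient use of resources
  Idle server resources can be turned into virtual machines, so resources are used efficiently

• Stable operation
  Servers can be backed up as images and moved easily, which allows a fast response to failures

• Continued operation of software
  When server hardware reaches the end of its life cycle, the OS vendor stops supporting its device drivers
  -> this causes migration problems
  Because the existing system runs on a virtual machine, device driver problems do not occur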

Page 51: Kogo 2013-ngs galaxy

Used as a foundation for cloud services

Page 52: Kogo 2013-ngs galaxy

Public Galaxy environment


Page 53: Kogo 2013-ngs galaxy

Example of Cloud

Source: iSC 2012 Amazon HPC session

Page 54: Kogo 2013-ngs galaxy

Running Galaxy Web server

1. Check the IP address of your machine:
   ifconfig

2. Move to the Galaxy folder:
   cd galaxy-dist

3. Start the Galaxy web server:
   sh run.sh

4. In a web browser on your host OS (Windows), enter the following in the address bar:
   IP Address:8080 (e.g., 172.20.8.162:8080)

Page 55: Kogo 2013-ngs galaxy

Galaxy Detail functions

Page 56: Kogo 2013-ngs galaxy

Get Data


Page 57: Kogo 2013-ngs galaxy

Get Data / Send Data


Page 58: Kogo 2013-ngs galaxy

Text Manipulation


Page 59: Kogo 2013-ngs galaxy

Convert Format


Page 60: Kogo 2013-ngs galaxy

FASTA manipulation


Page 61: Kogo 2013-ngs galaxy

Filter and Sort


Page 62: Kogo 2013-ngs galaxy

Join, Subtract and Group


Page 63: Kogo 2013-ngs galaxy

Operate on Genomic Intervals


Page 64: Kogo 2013-ngs galaxy

NGS Toolbox


Page 65: Kogo 2013-ngs galaxy

Galaxy Examples 3

Page 66: Kogo 2013-ngs galaxy

Example 3.

Human RNA-seq

1. RNA-seq results: adrenal_1,2.fastq, brain_1,2.fastq
2. Reference: iGenome UCSC hg19, chr19 gene annotation (GTF format)

Have to do in Unix and need programming! (Python, Perl, …)

Page 67: Kogo 2013-ngs galaxy

On Galaxy


http://usegalaxy.org

Page 68: Kogo 2013-ngs galaxy

On Galaxy

Additional programs must be installed for RNA-seq analysis

( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )

Program | Used for | Installation
java | FastQC | Ubuntu apt-get install openjdk-7-jre
FastQC | NGS QC | Copy to tool-data/shared/jars/
Tophat | RNA-seq mapping | (see the next page)
Cufflinks | RNA-seq assembly | Ubuntu apt-get install cufflinks

Page 69: Kogo 2013-ngs galaxy

Tophat install in Ubuntu

$ cp samtools-0.1.18.tar.bz2 ~/work
$ bzip2 -d samtools-0.1.18.tar.bz2
$ tar xvf samtools-0.1.18.tar
$ cd samtools-0.1.18
$ make
$ cd ..
$ cp tophat-1.4.1.tar.gz ~/work
$ tar zxvf tophat-1.4.1.tar.gz
$ cd tophat-1.4.1
$ sudo apt-get install libboost libbam libboost-thread-dev
$ sudo cp ../samtools-0.1.18/libbam.a /usr/local/lib
$ sudo mkdir /usr/local/include/bam
$ sudo cp ../samtools-0.1.18/*.h /usr/local/include/bam
$ ./configure
$ make
$ sudo make install

Page 70: Kogo 2013-ngs galaxy

On Galaxy

1. Get data > Upload File: upload the fastq, chr19.fa, and gtf files
2. NGS: QC and manipulation > FASTQ Groomer: convert to the fastqsanger format
3. NGS: QC and manipulation > FastQC:Read QC: view fastq quality statistics
4. NGS: RNA Analysis > Tophat for Illumina: find splice junctions in the RNA-seq fastq data, using chr19.fa as the reference
5. NGS: RNA Analysis > Cufflinks: transcript assembly and FPKM estimation
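For readers unfamiliar with the FPKM unit mentioned in the Cufflinks step, the sketch below shows only its definition: fragments per kilobase of transcript per million mapped fragments. Cufflinks itself estimates abundances with a statistical model, so this is not its algorithm; the function name and the numbers are illustrative assumptions.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Toy numbers: 500 fragments on a 2,000 bp transcript, 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
print(value)
```

Normalising by both transcript length and sequencing depth is what makes values comparable between samples such as the brain and adrenal data sets.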

Page 71: Kogo 2013-ngs galaxy

On Galaxy

1. NGS: RNA Analysis > Cuffmerge: merge the brain and adrenal data against the reference
2. NGS: RNA Analysis > Cuffdiff: find significant changes in expression

Page 72: Kogo 2013-ngs galaxy

Galaxy Tools

Page 73: Kogo 2013-ngs galaxy

A Galaxy tool

Tool: input format -> output format

Takes input data (in the expected format), processes it, and produces output data (in the expected format).

Combining tools gives a Workflow.

Page 74: Kogo 2013-ngs galaxy

Galaxy formats


Page 75: Kogo 2013-ngs galaxy

Creating your own Galaxy


Page 76: Kogo 2013-ngs galaxy

Primer design tool


Page 77: Kogo 2013-ngs galaxy

Primer3

Primer3
• Primer design program
• http://primer3.sourceforge.net/releases.php
• Download from http://sourceforge.net/projects/primer3/files/primer3/1.1.4/primer3-1.1.4.tar.gz
• make & copy to PATH

eprimer3
• Wrapper for Primer3, used in the EMBOSS package
• Easy command line interface
• http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/eprimer3.html
• apt-get install emboss

Page 78: Kogo 2013-ngs galaxy

eprimer3

$ eprimer3 -sequence INPUT_FASTA_FILE -outfile PRIMER_DESIGN_RESULT -osize OSIZE -gcclamp GCCLAMP …

# EPRIMER3 RESULTS FOR GL020027.1
#                 Start   Len  Tm     GC%    Sequence
1 PRODUCT SIZE: 199
FORWARD PRIMER  571071  20   60.06  45.00  CTTGCCAATAGCGAATGGAT
REVERSE PRIMER  571250  20   59.99  55.00  GACGGCGTAGATCTTCAAGC
2 PRODUCT SIZE: 199
FORWARD PRIMER  55074   20   60.05  55.00  TAACACCACTGCTCCTGCTG
REVERSE PRIMER  55253   20   59.97  50.00  CATTGCATGGTCAGAACCAC
3 PRODUCT SIZE: 200
FORWARD PRIMER  71990   20   60.03  45.00  GGGGTTGATTTTCATTGTGG
REVERSE PRIMER  72170   20   59.88  45.00  GTTTGCACCAACCTGGTTTT
4 PRODUCT SIZE: 200
FORWARD PRIMER  427182  20   59.83  50.00  CTGATGTGCTCTGTGGGAAA
REVERSE PRIMER  427362  20   60.01  55.00  CCGTGTATGTAGCCCGAGTT
5 PRODUCT SIZE: 197
FORWARD PRIMER  427185  20   59.97  50.00  ATGTGCTCTGTGGGAAAACC
REVERSE PRIMER  427362  20   60.01  55.00  CCGTGTATGTAGCCCGAGTT

We want to change this result format so that it can be used as input to other Galaxy tools.

So we build our own primer design Galaxy tool.
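One way such a reformatting step could look is sketched below: it parses a report laid out like the one on this slide and emits one tab-separated row per primer, which Galaxy could treat as a tabular dataset. The column order, the function name `eprimer3_to_tabular`, and the regular expressions are assumptions made for this sketch, not the workshop's eprimer3.py.

```python
import re
from typing import List

PRODUCT_RE = re.compile(r"^\s*(\d+)\s+PRODUCT SIZE:\s*(\d+)")
PRIMER_RE = re.compile(
    r"^\s*(FORWARD|REVERSE) PRIMER\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([A-Za-z]+)\s*$"
)


def eprimer3_to_tabular(report: str) -> List[str]:
    """Convert an eprimer3 text report into tab-separated rows.

    Columns: sequence id, pair number, product size, direction,
    start, length, Tm, GC%, primer sequence.
    """
    rows: List[str] = []
    seq_id, pair_no, size = "", "", ""
    for line in report.splitlines():
        if line.startswith("# EPRIMER3 RESULTS FOR"):
            seq_id = line.split()[-1]
            continue
        product = PRODUCT_RE.match(line)
        if product:
            pair_no, size = product.groups()
            continue
        primer = PRIMER_RE.match(line)
        if primer:
            rows.append("\t".join((seq_id, pair_no, size) + primer.groups()))
    return rows


# First primer pair from the slide's example output.
report = """# EPRIMER3 RESULTS FOR GL020027.1
#                      Start  Len   Tm     GC%   Sequence
   1 PRODUCT SIZE: 199
     FORWARD PRIMER  571071   20  60.06  45.00  CTTGCCAATAGCGAATGGAT
     REVERSE PRIMER  571250   20  59.99  55.00  GACGGCGTAGATCTTCAAGC
"""
rows = eprimer3_to_tabular(report)
print("\n".join(rows))
```

Keeping the sequence ID and pair number on every row means downstream tools such as Sort, Filter, or Join can work on the output without any knowledge of the original report layout.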

Page 79: Kogo 2013-ngs galaxy

eprimer3.xml

Page 80: Kogo 2013-ngs galaxy

eprimer3.py

Page 81: Kogo 2013-ngs galaxy

tool_conf.xml

…
<section name="VCF Tools" id="vcf_tools">
  <tool file="vcf_tools/intersect.xml" />
  <tool file="vcf_tools/annotate.xml" />
  <tool file="vcf_tools/filter.xml" />
  <tool file="vcf_tools/extract.xml" />
</section>
<section name="MyTools" id="mytools">
  <tool file="mytools/eprimer3.xml" />
</section>
</toolbox>
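The same edit can also be scripted. The sketch below uses Python's standard `xml.etree.ElementTree` to add a section and tool entry to toolbox XML held in a string; the helper name `add_tool_section` is hypothetical, and the input is a shortened stand-in for a real tool_conf.xml.

```python
import xml.etree.ElementTree as ET


def add_tool_section(toolbox_xml: str, section_id: str, section_name: str, tool_file: str) -> str:
    """Return toolbox XML with tool_file registered under the given section."""
    root = ET.fromstring(toolbox_xml)
    section = root.find(f"./section[@id='{section_id}']")
    if section is None:
        # Create the section at the end of the toolbox if it does not exist yet.
        section = ET.SubElement(root, "section", {"name": section_name, "id": section_id})
    if section.find(f"./tool[@file='{tool_file}']") is None:
        ET.SubElement(section, "tool", {"file": tool_file})
    return ET.tostring(root, encoding="unicode")


original = (
    '<toolbox><section name="VCF Tools" id="vcf_tools">'
    '<tool file="vcf_tools/filter.xml" /></section></toolbox>'
)
updated = add_tool_section(original, "mytools", "MyTools", "mytools/eprimer3.xml")
print(updated)
```

The function checks for an existing entry before adding one, so running it twice does not register the tool twice.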

Page 82: Kogo 2013-ngs galaxy

EMBOSS eprimer3 tool added


Page 83: Kogo 2013-ngs galaxy

Hands-on

1. Install Primer3: compile with make, then put primer3_core on the PATH

2. Install EMBOSS: sudo apt-get install emboss

3. Install Biopython: sudo apt-get install python-biopython

4. Copy eprimer3.py, eprimer3.xml to galaxy-dist/tools/mytools/ (create the mytools directory yourself)

5. Edit tool_conf.xml: register mytools/eprimer3.xml

Page 84: Kogo 2013-ngs galaxy

Galaxy on Grid

Page 85: Kogo 2013-ngs galaxy

Grid vs Cluster

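In common: computation on large data is split into small computations, which are distributed across many small computers.

Difference: a grid connects machines of different types over a WAN, links a variety of platforms together, and has no limit on the number of connected machines.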
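The shared idea, splitting one large computation into small pieces and combining the partial results, can be sketched on a single machine. Here a thread pool stands in for the worker nodes and counting G/C bases stands in for the real analysis; all names are invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(seq: str, n_chunks: int) -> List[str]:
    """Cut a sequence into at most n_chunks pieces of near-equal size."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def count_gc(chunk: str) -> int:
    """The small computation each worker performs on its own chunk."""
    return sum(1 for base in chunk if base in "GCgc")


def distributed_gc_count(seq: str, n_workers: int = 4) -> int:
    """Distribute the chunks over workers and add up the partial counts."""
    chunks = split(seq, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_gc, chunks))


total = distributed_gc_count("ACGTGGCC" * 100)
print(total)
```

On a cluster or grid the workers would be separate machines and a scheduler would hand out the chunks, but the split, compute, and combine pattern stays the same.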

Page 86: Kogo 2013-ngs galaxy

Grid


Page 87: Kogo 2013-ngs galaxy

Globus Toolkit

A representative computational grid middleware

Open source toolkit for building computing grids, developed and provided by the Globus Alliance

Standards implementation
• Open Grid Services Architecture (OGSA)
• Open Grid Services Infrastructure (OGSI)
• Web Services Resource Framework (WSRF)
• Job Submission Description Language (JSDL)
• Distributed Resource Management Application API (DRMAA)
• SOAP
• WSDL
• Grid Security Infrastructure

Page 88: Kogo 2013-ngs galaxy


High level Open Grid Forum API specification for submission and control of jobs to a Distributed Resource Management (DRM, Job scheduler) system, such as a Cluster or Grid computing infrastructure

Page 89: Kogo 2013-ngs galaxy

PBS (Portable Batch System)


Computer software that performs job scheduling in Unix cluster environment

A component of the Globus Toolkit

Originally developed by NASA

Following versions

• OpenPBS

• TORQUE – a fork of OpenPBS

• PBS Professional (PBS pro) - commercial

Page 90: Kogo 2013-ngs galaxy

TORQUE

Distributed resource manager providing control over batch jobs and distributed compute nodes

It stands for Terascale Open-source Resource and QUEue Manager

It holds configuration information for each slave node, such as the number of CPUs, number of cores, RAM size, and temporary storage, and allocates cluster resources when the scheduler requests them.

Master

Slave 1

Slave 2

Slave 3

NFS

> qsub a.sh

The a.sh job is handed to a slave node according to the scheduler.
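A job script such as a.sh usually declares the resources it needs in `#PBS` comment lines, which the resource manager reads at submission time. The sketch below only generates such a script as text; the function name, job name, and default values are assumptions, and the resulting file would still be submitted with `qsub a.sh` as shown above.

```python
def make_pbs_script(job_name: str, command: str, nodes: int = 1, ppn: int = 1, mem: str = "1gb") -> str:
    """Build the text of a minimal PBS/TORQUE job script."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",              # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        f"#PBS -l mem={mem}",                # requested memory
        'cd "$PBS_O_WORKDIR"',               # start in the directory qsub was run from
        command,
    ]
    return "\n".join(lines) + "\n"


script = make_pbs_script("bowtie2_map", "bowtie2 --version", ppn=4)
print(script)
```

Declaring cores and memory in the script lets the scheduler match each job against the per-node configuration described above.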

Page 91: Kogo 2013-ngs galaxy

Virtualized Galaxy (Test-bed)


Page 92: Kogo 2013-ngs galaxy

Galaxy on Cloud

Page 93: Kogo 2013-ngs galaxy

Cloud computing


Delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients.

Page 94: Kogo 2013-ngs galaxy


Page 95: Kogo 2013-ngs galaxy

VPS (Virtual Private Server)

A term used by Internet hosting services to refer to a virtual machine in a cloud

Page 96: Kogo 2013-ngs galaxy


Amazon EC2 (Amazon Elastic Compute Cloud)

Virtualization + Grid(Cluster) computing in a Cloud

Page 97: Kogo 2013-ngs galaxy


Amazon EC2 (Amazon Elastic Compute Cloud)

Page 98: Kogo 2013-ngs galaxy


Amazon EC2 (Amazon Elastic Compute Cloud)

Page 99: Kogo 2013-ngs galaxy


Amazon EC2 (Amazon Elastic Compute Cloud)

Page 100: Kogo 2013-ngs galaxy


Amazon S3 (Amazon Simple Storage Service)

Page 101: Kogo 2013-ngs galaxy

Galaxy on Cloud


Using Amazon EC2 + S3

Select AMIs in Community AMIs

Page 102: Kogo 2013-ngs galaxy

Galaxy on Cloud


Page 103: Kogo 2013-ngs galaxy

Galaxy on Cloud


Page 104: Kogo 2013-ngs galaxy

Galaxy on Cloud


Page 105: Kogo 2013-ngs galaxy

Galaxy on Cloud


Page 106: Kogo 2013-ngs galaxy

Galaxy on Cloud


Page 107: Kogo 2013-ngs galaxy

Galaxy on Insilicogen


Galaxy localization on cluster

Tool development

Workflow development

Page 108: Kogo 2013-ngs galaxy

www.insilicogen.com Tel 031-278-0061 Fax 031-278-0062

Page 109: Kogo 2013-ngs galaxy

www.insilicogen.com Tel 031-548-1008,1009 Fax 031-278-0062