Linux Cluster and Distributed Resource Manager Center for Genome Science, NIH, KCDC Hong Chang Bum [email protected]

Page 1: Cluster Drm

Linux Cluster and Distributed Resource Manager

Center for Genome Science, NIH, KCDC

Hong Chang Bum

[email protected]

Page 2: Cluster Drm
Page 3: Cluster Drm

Most programs do not come with parallelized code.

Writing parallel code is not easy either.

Page 4: Cluster Drm

We have no advanced programmers.

Hadoop, MapReduce, Erlang...

Page 5: Cluster Drm

Compute Intensive

Data/Memory Intensive

High Throughput

Large Simulation

64 bit address space

shared memory parallelism

all good but....think basic ^^

Page 6: Cluster Drm

THE RETURN OF THE SERIAL PROGRAM

Page 7: Cluster Drm
Page 8: Cluster Drm

Linux Cluster

• 3 Linux Cluster Machine

• KHAN (192.168.100.245, PPC, 94 Node)

• KGENE (192.168.100.205, X86, 28 Node)

• LOGIN (192.168.100.208, 8GB RAM, IA64)

• LOGINDB (192.168.100.207, 12GB RAM, IA64)

• DEV (192.168.100.226, 6GB RAM, X86)

Page 9: Cluster Drm

KHAN Cluster

• Total 94 Nodes (1 Master + 94 work Node)

• 64Bit PowerPC 720 4-way, 16GB RAM

• 64Bit PowerPC 770 2-way, 2GB RAM

• User Space: /home1 1.6TB

• Scratch Space: /home2/scratch 935GB

• Software: /home1/biosoftware

• EIGENSTRAT, merlin, phase, plink, R ...

Page 10: Cluster Drm

DRM

• Distributed Resource Manager (System)

• Job Scheduler, batch system

• IBM LoadLeveler, PBS, OpenPBS, TORQUE, Grid Engine, Sun Grid Engine (SGE), LSF (Load Sharing Facility), Maui, Xgrid, Globus Toolkit, Grid MP

• Job distribution, job status, queue status

Page 11: Cluster Drm

Grid Engine Inside...

$1000 Genome

$10 Million Dollar X-Prize

Illumina Genome Analyzer

Page 12: Cluster Drm

1,000 Serial Job

• 1,000 Data Sets

• input.1, input.2... input.1000

• program -i input.(1~1000) -o output.(1~1000)

• 1,000 Job Submit Scripts

• 1,000 times Queue Submit Commands
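The pattern on this slide can be sketched in a few lines of Python. This is only an illustration: `program`, `-i`, `-o` and the `input.N` / `output.N` names come from the slide, while the function name `build_commands` is made up for this example.

```python
def build_commands(n_jobs=1000, program="program"):
    """Return one command line per data set, numbered 1..n_jobs."""
    return [f"{program} -i input.{i} -o output.{i}" for i in range(1, n_jobs + 1)]


if __name__ == "__main__":
    commands = build_commands()
    print(len(commands))   # number of serial jobs
    print(commands[0])     # first command line
    print(commands[-1])    # last command line
```

Each element of the list is one independent serial job, which is exactly what a batch system can spread over the nodes.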

Page 13: Cluster Drm

Job Array - LoadLeveler

• Large Data sets: 1,000(input.1 ~ input.1000)

1,000 times??

Page 14: Cluster Drm

Using Script

Page 15: Cluster Drm

1,000 Command Script

1,000 Job Submit Command
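A minimal sketch of the "using script" idea: generate one job command file per data set, then one submit command per file. It assumes the usual LoadLeveler job command file keywords (`# @ job_name`, `# @ output`, `# @ error`, `# @ queue`) and the `llsubmit` command; the file names, the template and the function names are hypothetical, and the keywords should be checked against the local LoadLeveler installation. The sketch only builds the submit commands and does not run them.

```python
import tempfile
from pathlib import Path

# Hypothetical job command file; the directives are assumed LoadLeveler keywords.
JOB_TEMPLATE = """#!/bin/bash
# @ job_name = job_{i}
# @ output = job_{i}.out
# @ error = job_{i}.err
# @ queue
program -i input.{i} -o output.{i}
"""


def write_job_scripts(out_dir, n_jobs=1000):
    """Write job_1.cmd .. job_<n_jobs>.cmd into out_dir and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(1, n_jobs + 1):
        path = out_dir / f"job_{i}.cmd"
        path.write_text(JOB_TEMPLATE.format(i=i))
        paths.append(path)
    return paths


def submit_commands(paths):
    """Build (but do not execute) one llsubmit command per job command file."""
    return [["llsubmit", str(p)] for p in paths]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scripts = write_job_scripts(tmp, n_jobs=3)
        for argv in submit_commands(scripts):
            print(" ".join(argv))
```

On a real cluster each command list could be handed to `subprocess.run`; keeping generation and submission separate makes it easy to inspect the files first.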

Page 16: Cluster Drm

1,000 Command Script

Page 17: Cluster Drm

1,000 Job Submit Command

Job Status

Page 18: Cluster Drm
Page 19: Cluster Drm

R

Page 20: Cluster Drm
Page 21: Cluster Drm

R Contributed Packages

Page 22: Cluster Drm
Page 23: Cluster Drm

Using R Scripts

• Interactive Program

• Need to use R to analyze your data

Bash Shell Script??

Page 24: Cluster Drm

--quiet, -q    Don’t print the startup message

--no-save    Don’t save the workspace at the end of the session

Using I/O redirection Form
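As a rough sketch, the I/O redirection form can be built like this. The flags `--quiet` and `--no-save` are the ones named on the slide; the script and output file names and the helper functions are hypothetical. `run_r_script` needs R on the `PATH` and is not called in the demo.

```python
import shlex
import subprocess


def r_batch_argv():
    """Arguments for a non-interactive R run, using the flags from the slide."""
    return ["R", "--quiet", "--no-save"]


def r_shell_line(script_path, output_path):
    """The same run written as a shell line with I/O redirection."""
    args = " ".join(r_batch_argv())
    return f"{args} < {shlex.quote(script_path)} > {shlex.quote(output_path)}"


def run_r_script(script_path, output_path):
    """Run R with the script on stdin and all output sent to output_path."""
    with open(script_path) as src, open(output_path, "w") as dst:
        done = subprocess.run(r_batch_argv(), stdin=src, stdout=dst,
                              stderr=subprocess.STDOUT, check=False)
    return done.returncode


if __name__ == "__main__":
    print(r_shell_line("analysis.R", "analysis.out"))
```

The shell line is what would go into a job script, so an interactive R analysis becomes one more serial job for the batch system.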

Page 25: Cluster Drm

R and LoadLeveler

Page 26: Cluster Drm
Page 27: Cluster Drm
Page 28: Cluster Drm

Parallelism

Page 29: Cluster Drm

Hadoop

• History

• 2005: Grew out of the distributed-scaling problem of the Nutch open-source search engine. Inspired by Google’s GFS, BigTable, MapReduce

• Goal: Web Scale

• 2006: Full backing from Yahoo!

• 2008: Promoted to an Apache Top-level Project

• Current release: 0.17.1

• Features

• Java-based

• Apache License

• Many components (HDFS, HBase, MapReduce)

Page 30: Cluster Drm

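Pros and Cons

• Not suitable for every task.

It mainly suits work that divides cleanly; when the distributed tasks need to communicate or share data, it is hard to apply.

• Does not guarantee an optimal solution.

• However, most tasks can be implemented, implementation is easy, and reusability is high.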

Page 31: Cluster Drm

Hadoop Usage

• Nutch: Open Source Web Search Software

• Yahoo!

• ~10,000 machines running Hadoop

• Porting ~100 webmap applications

• The New York Times: Times Machine

• EC2/S3/Hadoop

• Large TIFF images(405,000), articles(3,300,000), meta data(405,000) -> 810,000 PNG data

Page 32: Cluster Drm

Yahoo! webmap

Blogger Ranking

Page 33: Cluster Drm

Hadoop Architecture

Page 34: Cluster Drm

Word Count Example
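The word count idea can be sketched in plain Python without Hadoop. This is an in-memory illustration of the map, shuffle and reduce steps, not the Hadoop API; all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello world", "hello Hadoop"]))
```

In a real cluster the map calls run on different nodes and the framework does the shuffle; the user only writes the map and reduce functions.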

Page 35: Cluster Drm

Clustering User Viewing Patterns

Cluster of pages viewed together in a session

Page 36: Cluster Drm

ApacheLogParser

ApacheLogFilter

KmeansCluster

HttpSessionDector

Map, Map, Map, Map

Reduce(Identity), Reduce(Identity), Reduce, Reduce

K: line #, V: line string

K: IP, V: Log entry

K: IP, V: Log entry

K: IP, V: Log entry

K: IP, V: URL vector

K: IP, V: Log entry

K: IP, V: ReqTime+URL

K: IP, V: URL vector

K: ClusterID, V: URL vector

K: ClusterID, V: Cluster centroid
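A rough in-memory sketch of the first part of such a pipeline, following the key/value types on the slide: (line #, line string) to (IP, log entry) to (IP, ReqTime+URL) to (IP, URL vector). It assumes the Apache common log format and keeps only status 200 requests as the filter rule; both are assumptions, the function names are hypothetical, and session splitting and k-means clustering are left out.

```python
import re
from collections import defaultdict

# Assumed layout: Apache common log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_map(line_no, line):
    """(line #, line string) -> (IP, log entry); unparsable lines are dropped."""
    match = LOG_PATTERN.match(line)
    return [(match.group("ip"), match.groupdict())] if match else []


def filter_map(ip, entry):
    """(IP, log entry) -> same pair if it passes the assumed filter (status 200)."""
    return [(ip, entry)] if entry["status"] == "200" else []


def request_map(ip, entry):
    """(IP, log entry) -> (IP, (request time, URL))."""
    return [(ip, (entry["time"], entry["url"]))]


def url_vector_reduce(pairs):
    """Group by IP and collect the URLs in arrival order: (IP, URL vector)."""
    by_ip = defaultdict(list)
    for ip, (_req_time, url) in pairs:
        by_ip[ip].append(url)
    return dict(by_ip)


def run_pipeline(lines):
    """Chain the stages in memory, one list of key/value pairs per stage."""
    parsed = [kv for n, line in enumerate(lines) for kv in parse_map(n, line)]
    kept = [kv for ip, entry in parsed for kv in filter_map(ip, entry)]
    keyed = [kv for ip, entry in kept for kv in request_map(ip, entry)]
    return url_vector_reduce(keyed)


if __name__ == "__main__":
    sample = ['10.0.0.1 - - [10/Oct/2008:13:55:36 +0900] "GET /a.html HTTP/1.0" 200 2326']
    print(run_pipeline(sample))
```

The resulting per-IP URL lists are the kind of vectors that a later clustering stage would take as input.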

Page 37: Cluster Drm

MapReduce

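• A useful model for distributed processing of large-scale data

• Applies the functional-language paradigm to large-scale applications

• Users can focus on the problem they want to solve without worrying much about distribution.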
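The functional-language roots of the model show up in the `map` and `reduce` primitives that many languages already have. A small sketch, with a made-up function name and toy data:

```python
from functools import reduce


def total_length(records):
    """Map each record to its length, then reduce the lengths to one sum."""
    lengths = map(len, records)                      # independent per-record work
    return reduce(lambda a, b: a + b, lengths, 0)    # combine the partial results


if __name__ == "__main__":
    print(total_length(["ACGT", "AC", ""]))
```

Because each `map` call touches only its own record, the calls can run on different machines; MapReduce adds grouping by key and the distribution machinery on top of this idea.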

Page 38: Cluster Drm

MapReduce with KHAN

• Hadoop 0.17.1

• Test Node(14)

• Apache LogAnalyzer

• Apply to our research??

Page 39: Cluster Drm

Thank u ^^

QnA