Using Network Processors in Genomics
Herbert Bos*†
Kaiming Huang*
{herbertb,khuang}@liacs.nl
* Leiden Universiteit, Netherlands
† Vrije Universiteit, Netherlands
http://www.liacs.nl/~herbertb/projects/biocomp/
H. Bos – Leiden University 13/02/2004 1
Case study: BLAST
● search nucleotide/protein database for query
● BLAST discovers similarity rather than exact match
● two main phases:
  1. scoring (registering where query and DNA DB match)
  2. alignment (dynamic programming)
● only the first phase on NPUs
Window matching
● naïve approach: roughly W*N*M comparisons
● does not scale
● string search algorithms: Aho-Corasick
  – all windows matched at the same time
  – shifting genome one nucleotide at a time
  – matching algorithm transformed into a DFA
● DFA may be quite large
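To make the cost of the naïve approach concrete, here is a minimal Python sketch. The slide does not define W, N and M; the sketch assumes W is the window size, N the number of query windows and M the genome length. The function names are illustrative, and the tiny query and genome strings are the ones used on the Aho-Corasick slides.

```python
def windows(query, w):
    """Return every length-w substring (window) of the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def naive_match(genome, query, w):
    """Compare every window against every genome position, character by character.

    Returns (matches, comparisons): matches is a list of
    (genome_position, window) pairs, comparisons counts character tests.
    """
    matches = []
    comparisons = 0
    wins = windows(query, w)
    for pos in range(len(genome) - w + 1):   # about M positions
        for win in wins:                     # N windows
            hit = True
            for k in range(w):               # up to W characters
                comparisons += 1
                if genome[pos + k] != win[k]:
                    hit = False
                    break
            if hit:
                matches.append((pos, win))
    return matches, comparisons


matches, comparisons = naive_match("tacgcga", "acgccga", 3)
print(matches)       # [(1, 'acg'), (2, 'cgc'), (4, 'cga')]
print(comparisons)   # never more than W * N * M
```

The three nested loops are what makes this approach scale poorly: every additional window adds another pass over each genome position.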
Aho-Corasick
● Alphabet: acgt
● Window size: 3
● Query: acgccga
● Windows: {acg,cgc,gcc,ccg,cga}
[Figure: Aho-Corasick goto graph for the five windows, states 0 to 12, edges labelled with nucleotides]
Failure function:
s      1   2   3   4   5   6   7   8   9  10  11  12
f(s)   0   4   5   0   7   8   0   4  10   4   5   1
Output function:
state     3    6    9   11   12
output  acg  cgc  gcc  ccg  cga
Example input: tacgcga
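The tables on these slides can be reproduced with a short textbook Aho-Corasick construction. The following is a minimal Python sketch, not the IXPBlast code; the function names are illustrative. Inserting the windows in the order listed on the slide happens to give the same state numbers as the figure, so the computed failure function can be compared directly with the f(s) row.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]      # goto[s][ch] -> next state; state 0 is the root
    output = [[]]    # output[s] -> patterns recognised on entering s
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(pat)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:                      # breadth-first over the trie
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            output[s] = output[s] + output[fail[s]]
    return goto, fail, output


def search(text, goto, fail, output):
    """Yield (start_position, pattern) for every occurrence in text."""
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # follow failure links on a mismatch
        s = goto[s].get(ch, 0)
        for pat in output[s]:
            yield i - len(pat) + 1, pat


WINDOWS = ["acg", "cgc", "gcc", "ccg", "cga"]
goto, fail, output = build_automaton(WINDOWS)
print(fail[1:])   # [0, 4, 5, 0, 7, 8, 0, 4, 10, 4, 5, 1]
hits = list(search("tacgcga", goto, fail, output))
print(hits)       # [(1, 'acg'), (2, 'cgc'), (4, 'cga')]
```

The text is read once, one nucleotide at a time, and all windows are tracked simultaneously in the single current state.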
IXPBlast Architecture
[Figure: block diagram. Labels: control processor, Pentium, PCI bus, NPU (IXP1200), StrongARM, six microengines (ME), DRAM, SRAM, scratch, Gbps ports]
IXPBlast: packet handling
● packets read and processed in batches of 100,000
● “spilling” must be taken into account
● currently no feedback
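The slide does not spell out what “spilling” means. One plausible reading, assumed here, is that a window can start in one packet and end in the next. The sketch below illustrates one simple way to handle that case: carry the last W-1 nucleotides of the data seen so far into the next packet. It uses a plain set lookup rather than the automaton, purely to keep the example short, and all names are hypothetical. With an automaton-based matcher, the equivalent step would be to keep the current state between packets.

```python
def scan_packets(packets, windows, w):
    """Report (absolute_position, window) hits across a sequence of payloads."""
    wanted = set(windows)
    tail = ""       # last w-1 nucleotides of the data seen so far
    consumed = 0    # number of nucleotides in all previous packets
    hits = []
    for payload in packets:
        data = tail + payload
        base = consumed - len(tail)   # absolute position of data[0]
        for i in range(len(data) - w + 1):
            if data[i:i + w] in wanted:
                hits.append((base + i, data[i:i + w]))
        consumed += len(payload)
        tail = data[-(w - 1):] if w > 1 else ""
    return hits


WINDOWS = ["acg", "cgc", "gcc", "ccg", "cga"]
# The genome fragment "tacgcga" split across two packets: "cgc" spans the boundary.
print(scan_packets(["tacg", "cga"], WINDOWS, 3))
# [(1, 'acg'), (2, 'cgc'), (4, 'cga')]
```

Because the carried tail is shorter than a window, every reported window includes at least one nucleotide of the new payload, so no hit is reported twice.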
Results
● 232 MHz IXP1200 ~ 1.8 GHz Pentium-4
● 1611-nucleotide query (MyD88)
● 1.4 GB genome (Zebrafish)
  – IXP1200: 90 sec with DFA
  – IXP1200: 129 sec with “trie”
  – P4: 132 sec with “trie”
● number of matches: 524856
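The timings above can be turned into rough scan rates. The arithmetic below assumes that 1.4 GB means 1.4 × 10^9 bytes and that the whole genome is scanned once per run; neither is stated on the slide.

```python
GENOME_BYTES = 1.4e9   # assumption: 1.4 GB taken as 1.4 * 10**9 bytes

timings_sec = {
    "IXP1200, DFA": 90,
    "IXP1200, trie": 129,
    "P4, trie": 132,
}

# Scan rate in MB/s and speed relative to the Pentium-4 run.
rates_mb_s = {name: GENOME_BYTES / sec / 1e6 for name, sec in timings_sec.items()}
speedups = {name: timings_sec["P4, trie"] / sec for name, sec in timings_sec.items()}

for name in timings_sec:
    print(f"{name}: {rates_mb_s[name]:.1f} MB/s, {speedups[name]:.2f}x the P4")
```

Under these assumptions the DFA variant scans at roughly 15.6 MB/s, about 1.47 times the Pentium-4 rate, while the “trie” variant on the IXP1200 is only marginally faster than the Pentium-4.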
Query size   DNA DB size   Impl.          Performance
1611         1.4 GB        P4             132 sec
1611         1.4 GB        IXP1200        129 sec
1611         1.4 GB        IXP1200 DFA    90 sec
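The results distinguish a “trie” and a “DFA” variant, but the slides show neither implementation. The sketch below assumes that “trie” means the Aho-Corasick goto graph with failure links, and that “DFA” means the same automaton with every failure transition precomputed, so that each nucleotide costs exactly one table lookup. It builds the full transition table for the example windows from the Aho-Corasick slides; the names are illustrative.

```python
from collections import deque

ALPHABET = "acgt"


def build_dfa(patterns, alphabet=ALPHABET):
    """Return (delta, output); delta[s][ch] is defined for every state and symbol."""
    goto = [{}]
    output = [[]]
    for pat in patterns:                 # build the trie first
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(pat)

    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    for ch in alphabet:                  # root: missing symbols loop to the root
        delta[0][ch] = goto[0].get(ch, 0)
    queue = deque(goto[0].values())
    while queue:                         # breadth-first, shallow states first
        r = queue.popleft()
        for ch in alphabet:
            if ch in goto[r]:
                s = goto[r][ch]
                fail[s] = delta[fail[r]][ch]
                output[s] = output[s] + output[fail[s]]
                delta[r][ch] = s
                queue.append(s)
            else:                        # precompute the failure transition
                delta[r][ch] = delta[fail[r]][ch]
    return delta, output


def run(text, delta, output):
    """One table lookup per nucleotide; returns (start_position, pattern) hits."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        s = delta[s][ch]
        hits.extend((i - len(p) + 1, p) for p in output[s])
    return hits


delta, output = build_dfa(["acg", "cgc", "gcc", "ccg", "cga"])
print(len(delta) * len(ALPHABET))    # 52 table entries for 13 states
print(run("tacgcga", delta, output)) # [(1, 'acg'), (2, 'cgc'), (4, 'cga')]
```

The table needs one entry per state and symbol, which is the price of the single lookup and one reason a DFA for a long query can become large.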
Conclusions
● NPUs are useful in other application domains
● Newer hardware is expected to perform much better
● “Throughput processors”
● Adapting our current approach to use BLAST tricks/heuristics
Network processors
● geared for high throughput
● used exclusively in network systems
● example: intrusion detection
● similar to looking for genes in genomes
● differences
[Figure: Radisys IXP1200 board]
Application domain: “Genomics”
● example: search genome for occurrence of “patterns”
● similar problems as IDS, poor performance on general-purpose processors (GPPs)
  – cannot exploit parallelism
  – throughput-driven
  – how about FPGAs?
  – how about clusters?
● NPU
  – easier to program than FPGAs
  – cheaper than cluster computing
  – “on the desktop”: IP never leaves the room