Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance Abdullah Gharaibeh Matei Ripeanu NetSysLab The University of British Columbia


Page 1

Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance

Abdullah Gharaibeh Matei Ripeanu

NetSysLab, The University of British Columbia

Page 2

GPUs offer different characteristics

High peak compute power

High communication overhead

High peak memory bandwidth

Limited memory space

Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms

Page 3

Motivating Question: How should we design applications to efficiently exploit GPU characteristics?

Context: A bioinformatics problem: Sequence Alignment

A string matching problem; data intensive (~10^2 GB)

Page 4

Past work: sequence alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]; ~4x speedup (end-to-end) compared to the CPU version

Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

> 50% overhead


Page 5

Use a space-efficient data structure (though from a higher computational complexity class): the suffix array

4x speedup compared to the suffix tree-based GPU version

Idea: trade-off time for space

Consequences:

• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage
• Significant overhead reduction

Page 6

Outline

Sequence alignment: background and offloading to GPU

Space/Time trade-off analysis

Evaluation

Page 7

[Figure: short query reads (e.g., CCAT, GGCT, TAGGC) scattered above the reference sequence ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG.. they align to]

Background: sequence alignment problem

Find where each query most likely originated from

Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
Reference: 10^6 to 10^11 symbols

Page 8

GPU Offloading: opportunity and challenges

Opportunity:

• Sequence alignment: easy to partition, memory intensive
• GPU: massively parallel, high memory bandwidth

Challenges:

• Data intensive, large output size
• Limited memory space
• No direct access to other I/O devices (e.g., disk)

Page 9

GPU Offloading: addressing the challenges

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}

• Data-intensive problem and limited memory space → divide and compute in rounds
• Large output size → compressed output representation (decompress on the CPU)

High-level algorithm (executed on the host)
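The round structure of the host algorithm can be sketched in plain Python, with the GPU kernel replaced by a CPU substring search so it runs anywhere. Helper names (`divide`, `match_round`, `align_all`) are illustrative, not the paper's code; a real implementation would also overlap sub-references so matches spanning a chunk boundary are not lost.

```python
# CPU-only sketch of the host-side "divide and compute in rounds" loop.
# The GPU kernel is stood in for by str.find; chunk sizes are arbitrary.

def divide(seq, chunk):
    """Split a sequence (or query list) into fixed-size chunks."""
    return [seq[i:i + chunk] for i in range(0, len(seq), chunk)]

def match_round(queries, subref, offset):
    """Stand-in for MatchKernel: report (query, absolute position) pairs."""
    results = []
    for q in queries:
        pos = subref.find(q)
        if pos != -1:
            results.append((q, offset + pos))
    return results

def align_all(queries, ref, ref_chunk=8, qry_chunk=2):
    subrefs = divide(ref, ref_chunk)
    results = []
    for subqrys in divide(queries, qry_chunk):   # outer loop: query batches
        offset = 0
        for subref in subrefs:                   # inner loop: reference chunks
            results += match_round(subqrys, subref, offset)
            offset += len(subref)
    return results
```

Note that chunking the reference this naively misses matches that straddle a chunk boundary; overlapping adjacent sub-references by the maximum query length fixes that.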

Page 10

Space/Time Trade-off Analysis

Page 11

The core data structure: a massive number of queries and a long reference => pre-process the reference into an index

[Figure: suffix tree of the example reference TACACA$]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])

Search: O(qry_len) per query
Space: O(ref_len), but the constant is high: ~20x ref_len
Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

Page 12

The core data structure: a massive number of queries and a long reference => pre-process the reference into an index

Past work: build a suffix tree (MUMmerGPU [Schatz 07])

Search: O(qry_len) per query
Space: O(ref_len), but the constant is high: ~20x ref_len
Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)                 // Expensive
        MatchKernel(subqryset, subref)    // Efficient
        CopyFromGPU(results)
    }
    Decompress(results)                   // Expensive
}

Page 13

A better matching data structure

[Figure: suffix tree of the example reference TACACA$]

Suffix Tree

0 A$

1 ACA$

2 ACACA$

3 CA$

4 CACA$

5 TACACA$

Suffix Array

              Suffix Tree                       Suffix Array
Space         O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
Search        O(qry_len)                        O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: reduced communication

Less data to transfer
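The suffix array column above can be reproduced in a few lines. This is a toy build-by-sorting version (fine for illustration; production builders run in O(n log n) or linear time), with a binary-search lookup matching the O(qry_len x log ref_len) search bound; function names are mine, not from the paper's code.

```python
# Toy suffix array for the slide's example reference TACACA$.
# Build: sort all suffix start positions lexicographically.
# Search: binary search over sorted suffixes, comparing qry_len symbols per probe.

def build_suffix_array(ref):
    """Return suffix start positions in lexicographic order of the suffixes."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def find(ref, sa, qry):
    """Binary-search sa for qry; return one match position in ref, or -1."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + len(qry)] < qry:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and ref[sa[lo]:sa[lo] + len(qry)] == qry:
        return sa[lo]
    return -1
```

For TACACA$ the sorted suffixes come out as $, A$, ACA$, ACACA$, CA$, CACA$, TACACA$, matching the listing on the slide.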

Page 14

A better matching data structure


Impact 2: better data locality is achieved at the cost of additional per-thread processing time

Space for longer sub-references => fewer processing rounds
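The effect on the number of rounds can be quantified with simple arithmetic. Only the 20x vs 4x index-size constants come from the slides; the memory budget and query-batch count below are hypothetical, chosen to resemble the low-end testbed.

```python
# Back-of-envelope round count: rounds = query batches x reference chunks
# that fit in GPU memory. Assumes the whole budget goes to the index.
import math

def rounds(ref_len, index_bytes_per_symbol, gpu_mem, n_query_batches):
    subref_len = gpu_mem // index_bytes_per_symbol  # longest sub-reference whose index fits
    n_subrefs = math.ceil(ref_len / subref_len)
    return n_query_batches * n_subrefs

# Hypothetical setup: 238M-symbol reference, 512MB of GPU memory, 10 query batches.
tree_rounds  = rounds(238_000_000, 20, 512 * 2**20, 10)  # suffix tree: ~20x ref_len
array_rounds = rounds(238_000_000, 4,  512 * 2**20, 10)  # suffix array: ~4x ref_len
```

Under these assumptions the array-based index cuts the round count from 90 to 20, which is the "fewer processing rounds" effect in a concrete form.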

Page 15

A better matching data structure


Impact 3: lower post-processing overhead

Page 16

Evaluation

Page 17

Evaluation setup

Workload / Species            Reference sequence length   # of queries   Average read length
HS1  - Human (chromosome 2)   ~238M                       ~78M           ~200
HS2  - Human (chromosome 3)   ~100M                       ~2M            ~700
MONO - L. monocytogenes       ~3M                         ~6M            ~120
SUIS - S. suis                ~2M                         ~26M           ~36

Testbed:
• Low-end: GeForce 9800 GX2 GPU (512MB)
• High-end: Tesla C1060 (4GB)

Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

Page 18

Speedup: array-based over tree-based

Page 19

Dissecting the overheads

Significant reduction in data transfers and post-processing

Workload: HS1, ~78M queries, ~238M ref. length on Geforce

Page 20

Summary

GPUs have drastically different performance characteristics

Reconsidering the choice of the data structure used is necessary when porting applications to the GPU

A good matching data structure ensures:
• Low communication overhead
• Data locality (might be achieved at the cost of additional per-thread processing time)
• Low post-processing overhead

Page 21

Code available at: netsyslab.ece.ubc.ca