
Multi-Core SIMD ASIP for DNA Sequence Alignment

Nuno Filipe Simões Santos Moraes Neves

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Examination Committee

Chairperson: Prof. Nuno Cavaco Gomes Horta
Supervisor: Prof. Nuno Filipe Valentim Roma

Co-supervisor: Prof. Paulo Ferreira Godinho Flores
Member of the Committee: Prof. Horácio Cláudio de Campos Neto

April 2013


"Today is victory over yourself of yesterday; tomorrow is your victory over lesser men."

- Miyamoto Musashi


Acknowledgements

First and foremost I would like to thank my parents, my whole family and my girlfriend Joana

Silva, for the continued support and motivation.

I offer my sincerest gratitude to my supervisors, Professors Nuno Roma and Paulo Flores for

their continued support, guidance and motivation. A very special thanks goes to Professor Pedro

Tomás and to Nuno Sebastião, from the INESC-ID's Signal Processing Systems group, for their

support and guidance throughout the past year. I also have to thank Professor David Matos and

my colleague André Patrício from IST, for without their work mine would not have been possible.

I would like to personally thank all my friends from IST for their support and the good memories

over the years; without them this journey would have been a lot less pleasant.

Last, but definitely not least, a special thanks to my life-long friends, each and every one of them

contributed in some way to make me who I am today.

Thank you all!

This thesis was performed in the scope of project "HELIX: Heterogeneous Multi-Core Architecture for Biological Sequence Analysis", funded by the Portuguese Foundation for Science and Technology (FCT) with reference PTDC/EEA-ELC/113999/2009.


Abstract

A novel Application-Specific Instruction-set Processor (ASIP) architecture for biological sequence alignment is proposed in this manuscript. The presented processor achieves high processing throughputs by exploiting both fine and coarse-grained parallelism. The former is achieved by extending the Instruction Set Architecture (ISA) of a synthesizable processor to include multiple specialized SIMD instructions that implement vector-vector and vector-scalar arithmetic, logic, load/store and control operations. Coarse-grained parallelism is achieved by using multiple cores that cooperatively align multiple sequences in a shared memory environment, comprising proper hardware-specific synchronization mechanisms. To ease the programming of the sequence alignment algorithms, a compilation framework based on a suitable adaptation of the GCC back-end was also implemented. The proposed system was prototyped and evaluated on a Xilinx Virtex-7 FPGA VC707 Kit, achieving a 190 MHz working frequency. A vanilla implementation and a state-of-the-art SIMD implementation of the Smith-Waterman algorithm were programmed both on the proposed ASIP and on an Intel Core i7 processor. When comparing the achieved speedups, it was observed that the proposed ISA allows achieving a 33x speedup, which contrasts with the 11x speedup provided by SSE2 in the Intel Core i7 processor. The scalability of the multi-core system was also evaluated: it proved to scale almost linearly with the number of cores, and a 900-fold speedup was achieved with a 64-core processing framework.

Keywords

Biological sequence alignment, RISC, SIMD, ASIP, symmetric multiprocessing.


Resumo

A novel processor architecture, with an instruction set particularly adapted to a class of biological sequence alignment applications, is proposed in this work. The developed processor attains a high performance by exploiting both fine and intermediate-grained parallelism. Fine-grained parallelism is exploited by extending the Instruction Set Architecture (ISA) of an existing processor to include new vector (SIMD) instructions that implement arithmetic, logic, control and memory-access operations. The coarser-grained parallelism is exploited through the cooperation of multiple processors in the simultaneous alignment of several sequences, using a shared-memory paradigm together with specific synchronization mechanisms. To ease the programming of sequence alignment algorithms, a compilation tool was adapted from the GCC back-end. The proposed system was prototyped and evaluated on a Xilinx evaluation kit (Virtex-7 FPGA VC707), attaining a maximum working frequency of 190 MHz. An original version and a vectorized (SIMD) version of the Smith-Waterman algorithm were programmed on the proposed processor and on an Intel Core i7 processor. By comparing the obtained values, it was verified that the proposed ISA achieves 33x speedups, which contrasts with the 11x speedup provided by the SSE2 extensions of the Intel Core i7. The multi-processor system proved to scale almost linearly with the number of processors. A speedup in the order of 900 was estimated for a system with 64 processors.

Palavras Chave

Biological sequence alignment, RISC, SIMD, ASIP, symmetric multiprocessing.


Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 State of the art 9

2.1 High performance computing architectures . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Smith-Waterman algorithm implementation . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Related Technology 21

3.1 MB-LITE processor core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2 Shared memory mechanisms in multi-core systems . . . . . . . . . . . . . . . . . 30

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 SIMD instruction set for DNA sequence alignment 37

4.1 SIMD optimization of the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.2 SIMD register architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 Proposed SIMD instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.4 Proposed SIMD ISA implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5 Proposed architecture for the defined SIMD extension 47

5.1 Register and memory support for SIMD vectors . . . . . . . . . . . . . . . . . . . . 48

5.2 Modification of the execution unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3 Adaptation of the decode unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.4 GCC back-end extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


6 Multi-core platform architecture 59

6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6.2 Memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.3 AMBA 3 AHB-Lite protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.4 Shared bus arbitration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.5 DMA controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.6 Hardware mutex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.7 System implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7 Experimental Evaluation 71

7.1 Prototyping framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.2 SIMD ASIP evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.3 Multi-core processing structure evaluation . . . . . . . . . . . . . . . . . . . . . . . 77

7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

8 Conclusions 85

8.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

A SIMD Instruction Set 93

A.1 Scalar maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

A.2 Vector addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

A.3 Vector reverse subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

A.4 Vector compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

A.5 Vector maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

A.6 Vector element shift left . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

A.7 Move-to-vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

A.8 Load vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

A.9 Store vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.10 Vector Disjunctive Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.11 Vector Disjunctive Branch w/ Immediate . . . . . . . . . . . . . . . . . . . . . . . . 99

A.12 Vector Conjunctive Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

A.13 Vector Conjunctive Branch w/ Immediate . . . . . . . . . . . . . . . . . . . . . . . . 100

B Code 101

B.1 Striped SW algorithm implementation pseudocode (new SIMD ISA) . . . . . . . . 102

B.2 Affine gap local alignment function . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

B.3 Striped SW algorithm implementation in C (new SIMD ISA) . . . . . . . . . . . . . 105

B.4 Striped SW algorithm implementation (Intel SSE2) . . . . . . . . . . . . . . . . . . 107


List of Figures

2.1 Functional units organization in a generic vector processor . . . . . . . . . . . 11

2.2 MMX registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Systolic array structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 SIMD implementations of the SW algorithm . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Data dependencies in Farrar’s implementation . . . . . . . . . . . . . . . . . . . . 17

3.1 MB-LITE configuration example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 MB-LITE bit numbering and instruction format . . . . . . . . . . . . . . . . . . . . . 24

3.3 MB-LITE architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4 MB-LITE address decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Multiple sequence alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.6 Mutex conceptual operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.7 Wishbone bus topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.8 AHB Multilayer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1 Excerpt of Farrar’s SIMD implementation of the SW algorithm . . . . . . . . . . . . 40

4.2 SIMD register definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Vector move concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4 Branch disjunctive/conjunctive condition assertion . . . . . . . . . . . . . . . . . . 44

4.5 4-bit byte select signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.1 Maximum implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.2 Maximum decision logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.3 SIMD ALU arithmetic module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4 Vector element shift left . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.5 Move-to-vector operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.6 Saturation detection logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.7 Compare and subtraction instructions . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.8 Instruction encoding field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.9 ASIP architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


5.10 Compiler structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.11 Abstract Syntactic Tree example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6.1 Overview of the multi-core architecture . . . . . . . . . . . . . . . . . . . . . . . . . 60

6.2 Processing cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6.3 Memory mapped multi-core system . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.4 AHB-Lite signal interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.5 AHB-Lite bus architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.6 Arbiter block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.7 DMA commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.8 DMA controller architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.9 DMA FSM state diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.10 Mutex circuit architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.11 Multi-core platform architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

7.1 Speedup with different SIMD vector sizes . . . . . . . . . . . . . . . . . . . . . . . 73

7.2 ASIC frequency scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7.3 Multi-core platform clock cycle speedup . . . . . . . . . . . . . . . . . . . . . . . . 79

7.4 Multi-core platform processing time speedup . . . . . . . . . . . . . . . . . . . . . 79

7.5 Multi-core system raw throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7.6 Multi-core system Cell Updates per Joule . . . . . . . . . . . . . . . . . . . . . . . 81

7.7 Multi-core system Performance-Energy efficiency . . . . . . . . . . . . . . . . . . . 82


List of Tables

1.1 Example substitution score matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Example Smith-Waterman score matrix . . . . . . . . . . . . . . . . . . . . . . . . 4

3.1 Register reservation policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.1 Proposed SIMD instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.1 Saturation truth table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.2 SIMD type encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.1 Priority truth table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.1 Lazy-loop counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

7.2 Clock cycles and obtained speedup values . . . . . . . . . . . . . . . . . . . . . . 74

7.3 Hardware specifications for the ASIP . . . . . . . . . . . . . . . . . . . . . . . . . . 75

7.4 ASIC synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7.5 Multi-core system hardware specifications (Virtex-7 FPGA) . . . . . . . . . . . . . 77

7.6 Multi-core system hardware specifications (Zynq FPGA) . . . . . . . . . . . . . . . 78

7.7 Multi-core system hardware specifications (Arria II GX FPGA) . . . . . . . . 78

7.8 GPP characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7.9 Multi-core system power consumption . . . . . . . . . . . . . . . . . . . . . . . . . 81


List of Acronyms

AHB AMBA High-performance Bus

ALU Arithmetic and Logic Unit

ASIC Application Specific Integrated Circuit

ASIP Application Specific Instruction-set Processor

ASM Assembly Language

AST Abstract Syntax Tree

AVX Advanced Vector Extensions

CISC Complex Instruction Set Computer

CPU Central Processing Unit

CUPJ Cell Updates per Joule

CUPJS Cell Updates per Joule-Second

CUPS Cell Updates per Second

DMA Direct Memory Access

DNA Deoxyribonucleic Acid

DP Dynamic Programming

DSM Distributed Shared Memory

EDP Energy-Delay Product

EX Execute

FIFO First-In First-Out

FPGA Field-Programmable Gate Array

FPU Floating Point Unit


FSM Finite-State Machine

GCC GNU Compiler Collection

GPP General Purpose Processor

GPU Graphics Processing Unit

HMM Hidden Markov Model

IC Integrated Circuit

ID Instruction Decode

IF Instruction Fetch

ILP Instruction-level Parallelism

IPC Instruction per Cycle

ISA Instruction Set Architecture

LSB Least-Significant Bit

MEM Memory

MIMD Multiple-Instruction Multiple-Data

MSA Multiple Sequence Alignment

MSB Most-Significant Bit

OPB On-chip Peripheral Bus

PE Processing Element

PC Program Counter

RISC Reduced Instruction Set Computer

SIMD Single-Instruction Multiple-Data

SW Smith-Waterman

SSE Streaming SIMD Extensions

SMP Symmetric Multiprocessor

SoC System on Chip

TDP Thermal Dissipation Power

TLP Thread-level Parallelism

WB Write-Back


1 Introduction

Contents

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


Over the past decade, several application domains have been demanding increasing computa-

tional resources, while energy and power consumption constraints remain a major concern in pro-

cessor and Integrated Circuit (IC) development. Traditional General Purpose Processors (GPPs)

are in some cases unable to meet the application requirements, while most Application Specific

Integrated Circuit (ASIC) solutions are not flexible enough to support algorithmic changes and

are much more expensive. Application Specific Instruction-set Processors (ASIPs) arise as an

intermediate solution, by representing a trade-off between the flexibility of the generally less expen-

sive GPPs and the higher performance of ASICs, filling the architectural spectrum between

the two [1]. On the other hand, the latest advances in programmable devices, such as Field-

Programmable Gate Arrays (FPGAs), have proven their ability to serve as excellent platforms for

system prototyping, greatly facilitating the development of such processors. To achieve the re-

quired ASIP performance for specific application domains, the Instruction Set Architecture (ISA)

shall be carefully designed in order to provide specialized instructions, and inherent hardware

structures, with algorithmic acceleration in view, while still keeping the offer of more general in-

structions to allow the support for new algorithms. Hence, through the inherent flexibility of the

ASIP and the ease of prototyping and instantiation offered by FPGAs, there is room for significant

application improvements and further specialization of the ISA, if required.

Given the ever increasing application performance requirements, coupled with the known is-

sues of component size reduction and power consumption walls in IC production, parallelism

has been gradually exploited over the years. Seeking to improve the processor performance,

Instruction-level Parallelism (ILP) was introduced early, with pipelined and superscalar architec-

tures, such as the MIPS (’81) and part of Intel’s i960 (’88) family, respectively. The goal of this

strategy is to increase the Instruction per Cycle (IPC) metric (close or even greater than 1, with

pipelined or superscalar structures, respectively) and maintain high enough clock frequencies, so

that increased performances are obtained. Until the end of the 1990s, research in GPP archi-

tectures was mainly focused on increasing the ILP, not only by developing the processor archi-

tectures (such as the mentioned pipelined and superscalar structures), but also by using caches,

multiple issue, dynamic scheduling, out-of-order execution and even deeper pipelined architec-

tures [2].

Inherent to the technology and to the increased performance level, energy and power con-

sumption became a major concern. A point was reached where most high-end GPPs required

significant amounts of energy, which was mostly dissipated as heat. This problem, coupled with

difficulties in exploiting larger degrees of ILP, led to the adoption of other strategies that were de-

veloped in the early years, such as Thread-level Parallelism (TLP), in which greater performances

are achieved by executing multiple threads and/or programs at the same time, in a single or in

multiple cooperating processors. The latter configuration usually requires less energy and power

per processing unit than a single high-end GPP.


Multi-processor architectures are usually classified as Multiple-Instruction Multiple-Data

(MIMD), according to Flynn's taxonomy [3]. However, another viable strategy, which exploits paral-

lelism at the data level, was developed. This strategy was first used in 1972 by the Control Data

Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC) vector

processors, but only fully exploited by the Cray-1 supercomputer in 1976 [2]. Later on, it began

to be used in desktop GPPs. These Single-Instruction Multiple-Data (SIMD) [3] architectures

provided an extra increase of performance by loading and processing multiple data elements in

parallel, with a single instruction. The first commercial desktop GPP with SIMD architecture was

delivered by Intel in 1997, with the MMX extension for the Pentium with MMX Technology and the

P6 processor families [4], opening the way to other technologies, such as PowerPC’s AltiVec [5]

and Intel’s Streaming SIMD Extensions (SSE) for the Pentium III processor [4].

1.1 Motivation

Bioinformatics applications are among the most demanding in terms of performance and com-

puting power. One particular example is the family of biological sequence processing algorithms,

such as protein and Deoxyribonucleic Acid (DNA) alignment, gene sequencing and gene dis-

covery. Moreover, the available biological databases have had an exponential growth. As an

example, the latest release of GenBank (February 2013) contains over 150×10⁹ base pairs from

over 260000 formally described species [6].

Most of these algorithms’ optimal solutions are obtained by applying Dynamic Programming

(DP) methods, such as the Smith-Waterman (SW) algorithm [7] or the Needleman-Wunsch algo-

rithm [8], for local and global sequence alignment, respectively. Given the implied computational

demands and the huge data sets, these types of algorithms usually require large runtimes in

GPPs. Although not optimal, some heuristic-based algorithms have also been developed to re-

duce the runtime, e.g., BLAST [9] and FASTA [10]. These alternative, but less accurate, heuristic

algorithms work by estimating the statistical significance of matches (BLAST [9]), or by running

large heuristic methods to identify the possible location of potential matches (FASTA [10]), before

performing a more time-consuming and optimized search, using an optimal DP method. Hence,

the use of DP algorithms is always preferred when high accuracy levels are required and time

restrictions are not an issue.

The SW algorithm [7], characterized by an O(nm) time complexity, is a widely established DP

algorithm to obtain the local alignment between a query sequence (q) and a reference sequence

(d), of sizes m and n respectively. It operates in two distinct phases: it starts by filling a score

matrix H, followed by a traceback phase over this matrix. The score matrix H is calculated with

the recursive relations given by Eq. 1.1, where Sbc(q[i], d[j]) denotes the substitution score value


obtained by aligning character q[i] against character d[j].

H(i,j) = \max\{\, H(i{-}1,j{-}1) + S_{bc}(q[i],d[j]),\; H(i{-}1,j) - \alpha,\; H(i,j{-}1) - \alpha,\; 0 \,\}, \qquad H(i,0) = H(0,j) = 0    (1.1)

The α value in Eq. 1.1 represents the gap penalty cost, and is always positive and greater than

the maximum substitution score. An example substitution score matrix is given in Table 1.1.

Table 1.1: Example substitution score matrix. The scores are typically positive in case the symbols match and negative otherwise.

Sbc |  A   C   T   G
----+----------------
 A  |  3  -1  -1  -1
 C  | -1   3  -1  -1
 T  | -1  -1   3  -1
 G  | -1  -1  -1   3

The traceback phase, for local alignment, is performed by first locating the cell with the highest

score. Then, all the score matrix cells that lead to the maximum score are traced and the aligned

subsequences are obtained. The path is chosen based on which of the adjacent cells was used

to calculate the current cell’s score (with Eq. 1.1). An example of a filled score matrix, with the

highlighted optimal alignment between two sequences, is given in Table 1.2.

Table 1.2: Example score matrix after the Smith-Waterman algorithm execution; the highlighted cells represent the optimal alignment between both sequences.

    ø  A  A  T  G  C  C  A  T  T  G  A  C
 ø  0  0  0  0  0  0  0  0  0  0  0  0  0
 C  0  0  0  0  0  3  3  0  0  0  0  0  3
 A  0  3  3  0  0  0  2  6  2  0  0  3  0
 G  0  0  2  2  3  0  0  2  5  1  3  0  2
 C  0  0  0  1  1  6  3  0  1  4  0  2  3
 C  0  0  0  0  0  4  9  5  1  0  3  0  5
 T  0  0  0  3  0  0  5  8  8  4  0  2  1
 C  0  0  0  0  2  3  3  4  7  7  3  0  5
 G  0  0  0  0  3  1  2  2  3  6 10  6  2
 C  0  0  0  0  0  6  4  1  1  2  6  9  9
 T  0  0  0  3  0  2  5  3  4  4  2  3  8
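To make the recursion of Eq. 1.1 concrete, the following minimal C sketch (illustrative code, not taken from this thesis) fills the score matrix H for two short sequences and records the position of the highest-scoring cell, from which the traceback would start. The match/mismatch scores follow Table 1.1 and a gap penalty of α = 4 is assumed, so that the sequences of Table 1.2 reproduce the scores shown there.

#include <stdio.h>
#include <string.h>

#define MATCH     3   /* substitution scores of Table 1.1 */
#define MISMATCH -1
#define ALPHA     4   /* assumed gap penalty (> maximum substitution score) */
#define MAX_LEN  64   /* sequences longer than this are not handled by this sketch */

static int max4(int a, int b, int c, int d) {
    int m = a > b ? a : b;
    m = m > c ? m : c;
    return m > d ? m : d;
}

/* Fills H according to Eq. 1.1 and returns the maximum score; (*bi, *bj)
 * receive the cell where the traceback of the local alignment would begin. */
static int sw_fill(const char *q, const char *d, int *bi, int *bj) {
    int m = (int)strlen(q), n = (int)strlen(d);
    int H[MAX_LEN + 1][MAX_LEN + 1] = {{0}};   /* H(i,0) = H(0,j) = 0 */
    int best = 0;

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int sbc = (q[i - 1] == d[j - 1]) ? MATCH : MISMATCH;
            H[i][j] = max4(H[i - 1][j - 1] + sbc,   /* diagonal   */
                           H[i - 1][j] - ALPHA,     /* vertical   */
                           H[i][j - 1] - ALPHA,     /* horizontal */
                           0);
            if (H[i][j] > best) { best = H[i][j]; *bi = i; *bj = j; }
        }
    }
    return best;
}

int main(void) {
    int bi = 0, bj = 0;
    int best = sw_fill("CAGCCTCGCT", "AATGCCATTGAC", &bi, &bj);
    printf("best local score %d at cell (%d,%d)\n", best, bi, bj);
    return 0;
}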

1.2 Objectives

The presented work was performed in the scope of the HELIX (Heterogeneous Multi-Core Ar-

chitecture for Biological Sequence Analysis) project, which aims at the development of integrated

and parallel hardware and software platforms targeting the acceleration of a wide range of bioin-

formatics algorithms. The project exploits new parallel processing strategies and architectures to

accelerate several algorithms related to DNA re-sequencing, Multiple Sequence Alignment (MSA)

and gene finding. The HELIX project is the result of a joint research effort of SiPS, ALGOS and


KDBIO research groups of INESC-ID, and is financially supported by the Fundação para a Ciência

e a Tecnologia (FCT) (ref. PTDC/EEA-ELC/113999/2009). Accordingly, the main objective of the

research work that was conducted in the scope of this Master Thesis dissertation was to design

an efficient ASIP architecture targeting the acceleration of important bioinformatics applications

through a flexible (i.e., programmable) solution. Specifically, the work targeted the acceleration of

sequence alignment algorithms, such as those based on the Smith-Waterman algorithm.

To tackle this problem, a thorough study was conducted of some of the state-of-the-art soft-

ware implementations of these algorithms, and from these a new SIMD-based ISA, particularly

optimized for this class of algorithms, should be envisaged. Furthermore, the objective was to

implement the proposed ISA extension in an existing processor soft-core, by taking advantage of

some already available software development tools (e.g., compiler, assembler, etc.). The adapta-

tion of these tools is being conducted in the scope of another Master Thesis dissertation.

Finally, another primary objective was the prototyping and evaluation of the con-

cerned ASIP in several different implementation technologies, such as FPGA and ASIC standard

cell libraries.

1.3 Main contributions

Based on the final objectives, and after a careful analysis of state-of-the-art software imple-

mentations of the SW algorithm, with a special attention to the Intel SSE2 implementation, as

proposed by M. Farrar [11], a new SIMD ASIP architecture was developed, providing a versatile

platform for the implementation of bioinformatics applications. The newly defined ISA was imple-

mented based on the MB-LITE [12] soft-core processor, by including support for the set of new

and specialized SIMD instructions. The implementation of the ISA was carefully conducted in

order to have a very small impact on the processor's original performance, as well as on the required

hardware resources. This allowed achieving clock frequencies close to 190 MHz in an FPGA

implementation, and a maximum of 770 MHz in a 90 nm CMOS technology process. Further-

more, an extensive multi-core computational structure, composed of multiple instantiations of the

designed ASIP, was also developed. The framework comprises i) a memory hierarchy, with a

private (scratchpad) and shared memory organization; ii) a shared bus topology compatible with

the ARM’s AMBA 3 Multilayer AHB-Lite protocol; iii) core synchronization mechanisms, such as

bus arbitration and a hardware mutex circuit; and iv) a Direct Memory Access (DMA) controller,

associated with each core, to limit the bus contention in the access to the shared memory.

A sequential version and an adapted version of M. Farrar's [11] implementation of the SW algorithm were im-

plemented in the new ASIP, with the aid of an adapted extension of the GNU Compiler Collec-

tion (GCC) compiler’s back-end to include the new instruction set. The achieved speedups were

compared to those of the SSE2 implementation in an Intel i7 processor. It was observed that


the new ISA achieves speedups up to 33x, which contrasts with the 11x speedup obtained in the

Intel processor. Also, the multi-core ASIP platform’s performance proved to scale almost linearly

with the number of cores, projecting a 900-fold clock cycle speedup with 64 cores, over the

sequential version in a single processor.

Hence, both the developed multi-core and the specialized architecture provide a fully-

functional specialized bioinformatics platform, capable of efficiently executing computationally in-

tensive algorithms, with the aid of both fine and coarse-grained parallelism.

As a result of the developed research, a manuscript has already been accepted for publication and

will be presented at the 24th IEEE International Conference on Application-specific Systems,

Architectures and Processors (ASAP 2013):

• N. Neves and N. Sebastião and A. Patrício and D. Matos and P. Tomás and P. Flores and N.

Roma, "BioBlaze: Multi-Core SIMD ASIP for DNA Sequence Alignment", 24th IEEE Inter-

national Conference on Application-specific Systems, Architectures and Processors (ASAP

2013), June 2013

Meanwhile, an extended version of this paper is under preparation and will be submitted to the

IEEE Transactions On Very Large Scale Integration (VLSI) Systems journal:

• N. Neves and N. Sebastião and A. Patrício and D. Matos and P. Tomás and P. Flores and N.

Roma, "Multi-Core SIMD ASIP for Biological Sequences Alignment", IEEE Transactions On

Very Large Scale Integration (VLSI) Systems

1.4 Dissertation outline

After this preliminary introduction, the state-of-the-art on high performance processing struc-

tures is presented in Chapter 2, focusing on GPP SIMD extensions and technologies, ASIC and

FPGA solutions. Also, the most recent SIMD implementations of the SW algorithm are presented.

Chapter 3 presents the main reasons behind the choice of the MB-LITE [12] soft-core as the

base processor, together with a detailed description of its architecture, as well as the most rel-

evant multi-processor architecture features and synchronization mechanisms usually included in

multi-core structures. Chapter 4 includes a requirement analysis of the M. Farrar [11] implemen-

tation of the SW algorithm, followed by a proposed set of specialized SIMD instructions. This

chapter also presents a performance prediction of the new SIMD ISA against the Intel SSE2 ISA.

The changes performed to the MB-LITE [12] processor in order to include the new specialized

instructions are presented in Chapter 5. This presentation is followed by a brief descrip-

tion of a GCC [13] back-end adaptation to support the new ISA. Chapter 6 describes the

developed multi-core structure, presenting the implemented bus topology and synchronization

mechanisms. Chapter 7 presents the hardware utilization and performance evaluation for both


the ASIP and the multi-core structure. Finally, Chapter 8 concludes the dissertation, presenting

the main achievements and addressing possible future work directions.


2 State of the art

Contents

2.1 High performance computing architectures . . . . . . . . . . . . . . . . . . . . 10

2.2 Smith-Waterman algorithm implementation . . . . . . . . . . . . . . . . . . . . 14

2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.1 High performance computing architectures

As increasingly larger degrees of ILP are exploited, several issues arise in terms of hardware

complexity. By increasing both the number of issued instructions and the pipeline depth, the

number of partially executed instructions at once also increases. In particular, due to the complex

hardware structures, considerable resources are required in dynamically scheduled machines to

keep up with the necessary capacity to hold those instructions and logic to track the dependencies

between them [2]. This rapid growth in complexity limits the issue width and pipeline depths to

more feasible degrees of ILP, placing a practical wall in the exploration of this strategy.

2.1.1 Vectorial architectures

Developed before ILP began to be explored, vector processors allow greater levels of

parallelism by performing multiple parallel operations in a single instruction. One particular class

of vector architectures includes the SIMD processing structures, where the same operation is si-

multaneously applied to several data elements. SIMD instructions operate over one-dimensional

arrays of independent data elements, called vectors, instead of single scalar values. As an ex-

ample, on earlier vector machines a typical SIMD operation would add two 64-element vectors

and obtain a 64-element result. Hence, it allows replacing an entire loop of iterations, contain-

ing control operations, with a single instruction. This leads to significant performance gains over

scalar processors. In more recent supercomputers, also coupled with ILP strategies, some vector

sizes can be as great as 4096 64-bit elements per vector (Fujitsu VPP5000) [2].

A vector processor is usually characterized by having a regular pipelined scalar unit, together

with an independent vector unit, as shown in Fig. 2.1. Both units can explore state-of-the-art

ILP technologies. However, as opposed to the dependencies between consecutive scalar opera-

tions, vector data elements are independent of each other. Hence, data hazard detection

is only required between the vectors and not the data elements, allowing their computation to be

completely parallel. Moreover, vector operations can substitute entire loops of data calculations,

reducing the number of control hazards and, therefore, pipeline flushes. The hazard mitigation

leads to a reduction of the number of pipeline stalls per vector element when compared to se-

quential scalar operations, improving the overall performance of the processor.

Vectors are operated either by a number of parallel functional units or by pipelined functional

units with several clock cycle latencies, depending on the hardware resource constraints. Some

vector processors also have multiple and parallel vectorial functional units, similar to the paral-

lel scalar functional units (e.g., addition/subtraction units and multiplication units with separate

pipelines). The first vector computers would perform memory-memory operations, i.e. vectors are

fetched and stored from the memory banks instead of the register files, implying an initial over-

head in the computation. Later on, most vector computers adopted a vector-register operation,


Figure 2.1: Functional units organization in a generic vector processor. A dedicated Load-Store unit takes data from memory and from the scalar register file to fill the vector register file. The represented functional units are highly parallel and pipelined.

similar to scalar operations, by first fetching the vector from memory and storing it in a register

and then performing register operations [2]. This led to the addition of dedicated vector register files

in the processor architectures.

Fetching data elements from memory and gathering them in a vector register introduces some

performance issues in vector processors. If the data elements are stored in adjacent positions,

then the memory load works well. However, if the data elements are not contiguous, the pro-

cessor must access nonsequential memory positions, resulting in higher memory latencies. Most

vector processors have the ability to simultaneously access nonsequential memory locations and

gather the data in a vector register [2]. Also, although inefficient in some scientific applications,

some more recent supercomputers also have a vector caching system to allow higher memory

bandwidths.

In the late 90's, aside from the deep pipeline structures and the state-of-the-art ILP technologies,

GPPs started to support vector processing techniques by adding extensions to their ISAs. Al-

though with fewer vector elements than the sixty-four 64-bit elements in supercomputers, most of

those extensions allow notable performance increases, initially mostly focused on multimedia ap-

plications. Coupled with the fact that GPPs are significantly cheaper than a supercomputer, some

application areas began to take advantage of these SIMD extensions, in order to speed up their

performance.

2.1.2 SIMD instruction set extensions

Among the wide variety of GPPs supporting SIMD instruction set extensions, Intel processors

emerged as one of the main competitors by introducing the MMX technology, followed by five gen-

erations of SSE extensions (i.e. SSE, SSE2, SSE3, SSSE3 and SSE4), and by the most recent


Advanced Vector Extensions (AVX) extensions. The MMX technology was first introduced in the

Pentium II and Pentium with MMX technology processor families and provides operations over

packed integer operands in 64-bit vectors. The MMX registers were directly mapped to the lower

64 bits of the existing 80-bit Floating Point Unit (FPU) registers [4] (see Fig. 2.2). Although requir-

ing fewer hardware resources than adding a completely new register file, the usage of both floating

point and SIMD data in the same application became very inefficient from the performance point

of view, due to the relatively slow switching between the two modes. The required state update

represented the main overhead when an FPU instruction is executed after an MMX instruction [14].

Initially, the MMX SIMD integer instructions were only relevant for 2D and 3D graphics processing

and have become somewhat redundant with the evolution of graphics cards, or even obsolete with

more recent multimedia applications that are floating-point intensive.

Figure 2.2: Conceptual scheme of the MMX register mapping to the FPU registers, in Intel processors [4].

With the release of the Pentium III processor family, Intel provided the SSE extensions to the

instruction set. In addition to the MMX technology packed integer operations, the SSE exten-

sions added eight new and completely independent 128-bit registers (named XMM0 to XMM7), further

extended to 16 registers in the Intel 64 architectures. This solved the issue of using MMX and

FPU instructions in the same application. Another important feature of the SSE extension is the

offer of prefetch instructions. These instructions allow the programmer to prefetch data into a

specified level of the cache hierarchy beforehand, allowing greater pipeline throughputs than us-

ing conventional load instructions that force pipeline stalls [15]. Moreover, new SSE instructions

for cacheability control, state management and instruction and memory ordering were also made

available [4]. The SSE set of instructions is especially useful for 3D geometry, rendering, video

encoding and decoding applications. However, for some scientific applications the 80-bit FPU in-

structions are still preferable to the lower-precision SIMD 32-bit floating-point instructions, which

cause measurable numerical deviation in the results of some mathematical sets of operations.

Although SSE3 and SSE4 brought great architectural improvements, such as Hyper-Threading

Technology support and data processing or algorithm acceleration instructions, the most relevant

improvements of the SSE family became available with SSE2 [4], introduced in Pentium 4 and


Intel Xeon Processors. SSE2 extended the earlier MMX instructions to operate in the 128-bit

XMM registers and added new 128-bit SIMD integer operations, therefore operating over up to

16 packed integers. Improvements to the existing packed floating-point instructions were also

made, by adding support for double-precision instructions [4]. Although not as precise as the

80-bit FPU instructions, the packed double-precision operations provide a major contribution for

scientific applications.
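As a brief illustration of these 128-bit packed-integer operations (the kind later exploited by the SIMD alignment implementations discussed in Section 2.2.2), the C fragment below updates sixteen unsigned 8-bit cells at once with SSE2 intrinsics. It is only a sketch of the saturating add/subtract and lane-wise maximum operations involved; the function and variable names are illustrative and it is not code from this thesis.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Applies one score-update step to sixteen 8-bit lanes in parallel. */
void simd_step(const uint8_t *h_in, const uint8_t *profile,
               uint8_t *h_out, uint8_t gap)
{
    __m128i vH   = _mm_loadu_si128((const __m128i *)h_in);     /* 16 running scores      */
    __m128i vP   = _mm_loadu_si128((const __m128i *)profile);  /* 16 substitution scores */
    __m128i vGap = _mm_set1_epi8((char)gap);                   /* gap penalty broadcast  */

    vH = _mm_adds_epu8(vH, vP);             /* saturating unsigned add           */
    __m128i vE = _mm_subs_epu8(vH, vGap);   /* saturating subtract (gap-derived) */
    vH = _mm_max_epu8(vH, vE);              /* lane-wise unsigned maximum        */

    _mm_storeu_si128((__m128i *)h_out, vH); /* write the 16 updated cells back   */
}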

The latest SIMD extensions, and possibly the most important so far, are the AVX, and the

upcoming AVX2, extensions. Aside from the extension of the XMM registers to sixteen 256-bit regis-

ters (aliased as YMM0 to YMM15) and the 256-bit floating-point operation enhancements, a new

SIMD-specific instruction encoding (i.e., VEX) is provided. This new syntax allows support for

three and four-operand instructions, eliminating the inefficiency inherent to the two-operand de-

structive syntax, used by MMX/SSE. The VEX-encoding is only supported in 128-bit and 256-bit

SIMD instructions, operating over the XMM and YMM registers, respectively. Further support of

the VEX-encoding for general-purpose operations will be provided with AVX2. AVX instructions

promote almost all the previous 128-bit operations from the SSE generations and provide addi-

tional operations over 256-bit vectors with packed double-precision floating-point values. AVX2

will provide 256-bit vector integer instruction extensions, promoting most of the existing integer

SSE 128-bit vector operations [4]. AVX and AVX2 are major breakthroughs, given the contin-

ued need for vector floating point performance in scientific and engineering applications, gaming

physics, cryptography and other application areas.

2.1.3 Dedicated architectures

ASICs are among the solutions offering the best algorithmic performance, by providing customiz-

able and highly optimized ICs. Coupled with state-of-the-art manufacturing processes, it is possi-

ble to achieve low-power dedicated architectures, operating at high clock frequencies. However,

the fabrication and design processes are usually expensive. Moreover, most ASIC solutions do

not have the flexibility for algorithmic improvement and adaptability for the execution of wider

ranges of algorithms.

To overcome some of these issues, some FPGA-based reconfigurable designs have been pro-

posed. These designs consist of highly pipelined Processing Elements (PEs), integrated in array struc-

tures and controlled by a master GPP, where First-In First-Out (FIFO) queues are used as input/output

buffers for commands and data (see Fig. 2.3). This type of implementation allows a wide range of

hardware accelerators to be implemented depending on the application. Despite the possibility of

being continuously updated, as opposed to the ASIC solutions, the PEs themselves still lack the

adaptability for implementations of different algorithms.

As a consequence, it is frequently not easy to achieve the optimal design compromise that si-

multaneously satisfies all system requirements in terms of computational performance and adapt-


Figure 2.3: Systolic array structure. The structure is composed of an array of PEs with an auxiliary controller. Data and control buffers are provided to feed the array, store intermediate results and output the obtained results. A master GPP provides the array with data and commands and receives the final results.

ability. GPPs are often regarded as a convenient alternative solution, although they hardly

provide the same performance.

2.2 Smith-Waterman algorithm implementation

Since the SW algorithm was proposed, a wide variety of implementations were developed,

using different approaches to address the high performance and the extensive runtime issues.

Dedicated hardware structures or even implementations based on SIMD instruction set exten-

sions were proposed to accelerate the algorithm’s execution. These range from ASIC solutions

and FPGA implementations, to implementations that take advantage of Graphics Processing

Unit (GPU) capabilities, or even the usage of multimedia instructions in desktop GPPs.

One of the first optimized implementations was proposed by O. Gotoh [16], where the score

matrix is filled by using an affine gap penalty model, instead of the linear gap penalty model of the

original algorithm (see Eq. 1.1). The optimization is given by Eq. 2.1, where α and β represent

the cost of gap opening/insertion and extension, respectively.

H(i,j) = \max\{\, H(i{-}1,j{-}1) + S_{bc}(q[i],d[j]),\; E(i,j),\; F(i,j),\; 0 \,\}
F(i,j) = \max\{\, H(i{-}1,j) - \alpha,\; F(i{-}1,j) - \beta \,\}
E(i,j) = \max\{\, H(i,j{-}1) - \alpha,\; E(i,j{-}1) - \beta \,\}    (2.1)

E(i, j) and F (i, j) represent the maximum local-alignment score involving the first j characters of

sequence d and the first i characters of sequence q ending with a gap in q or d, respectively. H(i, j)


represents the overall maximum local-alignment score involving the first i characters of q and the

first j characters of d. The initial conditions are given by H(i, 0) = H(0, j) = E(i, 0) = F (0, j) = 0.
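A direct scalar rendering of this affine-gap recursion is sketched below (illustrative C with hypothetical names, not the thesis implementation); it computes one cell, given the already available neighbouring values of H, E and F.

/* One affine-gap cell update, following Eq. 2.1.
 * h_diag = H(i-1,j-1), h_up = H(i-1,j), h_left = H(i,j-1),
 * f_up = F(i-1,j), e_left = E(i,j-1); sbc = Sbc(q[i],d[j]);
 * alpha = gap opening/insertion cost, beta = gap extension cost. */
typedef struct { int h, e, f; } cell_t;

static int max2(int a, int b) { return a > b ? a : b; }

static cell_t affine_cell(int h_diag, int h_up, int h_left,
                          int f_up, int e_left,
                          int sbc, int alpha, int beta)
{
    cell_t c;
    c.f = max2(h_up   - alpha, f_up   - beta);  /* alignment ending with a gap in d */
    c.e = max2(h_left - alpha, e_left - beta);  /* alignment ending with a gap in q */
    c.h = max2(max2(h_diag + sbc, c.e), max2(c.f, 0));
    return c;
}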

As can be ascertained from Eq. 2.1, the score matrix result of the (i, j)th iteration depends

on the (i − 1, j), (i, j − 1) and (i − 1, j − 1) values, representing the vertical, horizontal and di-

agonal dependencies, respectively. The implementations presented in the following subsections,

coupled with data or thread-level parallelism, move towards the mitigation or elimination of these

dependencies in order to speed up the overall execution of the algorithm.

2.2.1 Dedicated hardware accelerators

The highest performance for the SW algorithm is usually obtained with specialized hardware

accelerators. These implementations are often composed of optimized PEs that perform the

required computations for each matrix cell very efficiently. The most common implementations,

achieving high throughputs, are structures with high PE densities, such as systolic arrays, im-

plemented either as ASICs [17–19] or in FPGAs [20–23]. An example of one of those high per-

formance ASIC solutions is the BioScan machine [18]. This system is composed of 16 chips,

each one containing 812 1-bit processors, leading to a total of 12 992 processors and achieving

1000-fold speedups over other existing solutions.

As mentioned before, given the reduced flexibility of these solutions, the trade-off GPPs pro-

vide between adaptability and performance, coupled with the SIMD extensions currently available

in most off-the-shelf processors, is often the most convenient solution to implement bioinformatics

algorithms.

2.2.2 SIMD implementations

To accelerate the alignment procedure as much as possible, while still assuring the best align-

ment, several parallelization approaches of the SW algorithm have been presented. Three pro-

posed SIMD approaches, based on the exploitation of MMX/SSE instruction set extensions, are

particularly relevant and were presented by Wozniak [24], Rognes and Seeberg [25, 26] and by M.

Farrar [11]. Their main differences result from the adopted data processing pattern (see Fig. 2.4).

To ensure the simultaneous commitment of as many dependencies as possible, the parallel

scheme proposed by Wozniak [24] concurrently processes a set of cells along the minor diagonal

of the alignment matrix (see Fig. 2.4(a)). However, this straightforward approach comes at the cost

of non-trivial memory access patterns, leading to considerable overheads in the manipulation of

the data.

Some years later, Rognes and Seeberg [25] presented an alternative approach to simplify and

accelerate the loading of the substitution scores from memory, by pre-computing a query profile

for the entire database. This newly introduced query profile is composed of all the substitution scores for matching the desired query sequence with each of the different reference sequence


(a) Wozniak [24] (b) Rognes [25] (c) Farrar [11] (d) Rognes [26]

Figure 2.4: SIMD implementations of the SW algorithm [26]. The first five SIMD iterations werenumbered and represented with different gray levels. For simplicity, only 4 data elements areshown in each SIMD register.

symbols. With such a technique, a vector of cells parallel to the query sequence can be simultane-

ously processed by each instruction (see Fig. 2.4(b)). The commitment of the data dependencies

is guaranteed by defining threshold conditions relating each computed score value and the in-

sertion/extension gap penalties, allowing most of the comparisons related to the vertical dependencies of the algorithm to be disregarded. Nevertheless, such an approach implied the introduction of condi-

tional branches in the inner loop of the algorithm (where each matrix cell is calculated according

to its dependencies), making the execution time dependent on the scoring matrix and the gap

penalties.

Later, M. Farrar [11] improved this processing scheme by adopting a striped access pattern

(see Fig. 2.4(c)). With such approach, it was possible to move the conditional procedures related

to the commitment of the vertical dependencies to a lazy loop, executed outside the inner loop.

Hence, the conditional statements only have to be taken into account once, when processing

the next database symbol. On the whole, such a cumulative set of contributions and improvements

led to significant speedup values of the alignment procedure, which justified its integration in

many current high performance alignment frameworks, such as in the most recent versions of

SSEARCH [10].

Meanwhile, T. Rognes [26] presented another parallelization of the SW algorithm that exploits

other capabilities made available by modern processors. However, the considered application do-

main is somewhat different from the previous approaches. Instead of solving the single-reference

single-query alignment problem, the algorithm simultaneously compares several different refer-


ence sequences with one single query sequence in each SIMD operation (see Fig. 2.4(d)). Com-

bined with a multi-threaded processing scheme to concurrently exploit the several cores available

in the latest generations of Symmetric Multiprocessor (SMP) architectures, additional speedups

are achieved when each query is aligned to a database of reference sequences.

Farrar’s striped technique

Among the presented approaches, Farrar’s algorithm is still considered one of the fastest SIMD

implementations of the SW algorithm. This is mainly due to its striped processing pattern. With

such layout, the precomputed query profile is accessed by following a pattern parallel to the query

sequence and the computations are carried out in several separate stripes that cover different

parts of the query sequence. As mentioned before, this approach significantly reduces the impact

of some of the data dependencies, moving them out of the inner loop and placing them in an outer

lazy loop, where they have to be considered only once, before starting the processing of the next

database symbol (see Fig. 2.5).

Figure 2.5: Data dependencies between the last segment and the first segment of the next columnof the score matrix H. For simplicity, only 4 data elements are shown in each SIMD register(represented in different gray levels), while sixteen 8-bit elements would normally be used.

In this approach, the query is divided into p equal-length segments, where p is given by the number of data elements that can be simultaneously accommodated in a SIMD register. As an example, when 128-bit SSE2 registers are considered to process 8-bit data elements, p equals 16. However, when higher resolution scores using 16-bit integers are required, p must be reduced to 8. The length of each segment is given by t = ⌊(m + p − 1)/p⌋, where m is the query sequence size; padding zeros are inserted whenever the query is not long enough to completely fill all the segments.

The computation of the score matrix is fulfilled by assigning each data element of the SIMD reg-


ister to one distinct segment, leading to the striped access pattern illustrated in Fig. 2.5. Accord-

ingly, each matrix column, corresponding to a database symbol d[j], is processed in t iterations,

where each iteration simultaneously processes p query symbols, separated by t − 1 lines in the

score matrix. As an example, when p = 16 the second iteration of the algorithm simultaneously

processes, in a SIMD register, the following query symbols:

q2 = [ q[2]   q[t+2]   q[2t+2]   q[3t+2]   q[4t+2]   · · ·   q[15t+2] ]                                   (2.2)

The processing of such query symbols considers a precomputed query profile that adopts a sim-

ilar pattern:

W2,j = [ W[q[2], d[j]]   W[q[t+2], d[j]]   W[q[2t+2], d[j]]   W[q[3t+2], d[j]]   · · ·   W[q[15t+2], d[j]] ]        (2.3)

where W[q[i], d[j]] is the cost of matching symbol d[j] with symbol q[i], and W2,j represents the

query profile vector for the second iteration of the algorithm.
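A minimal C sketch of how such a striped query profile can be precomputed is given below; the values of p and t and the indexing follow Eq. 2.2 and 2.3, while the substitution matrix layout and all identifiers are merely illustrative.

#include <stdlib.h>

/* Build Farrar's striped query profile for 8-bit scores.  The query q uses
 * symbol codes in [0, alphabet_size) and subst is a row-major
 * alphabet_size x alphabet_size substitution matrix.  Lane k of iteration s
 * holds the score for query position k*t + s (0-indexed), as in Eq. 2.2/2.3. */
signed char *build_striped_profile(const unsigned char *q, int m,
                                   const signed char *subst, int alphabet_size)
{
    const int p = 16;                    /* 8-bit elements per 128-bit register */
    const int t = (m + p - 1) / p;       /* segment length */
    signed char *profile = calloc((size_t)alphabet_size * t * p, 1);

    for (int sym = 0; sym < alphabet_size; sym++)        /* database symbol d[j] */
        for (int s = 0; s < t; s++)                      /* iteration within a column */
            for (int k = 0; k < p; k++) {                /* SIMD lane (segment index) */
                int i = k * t + s;                       /* striped query position */
                profile[(sym * t + s) * p + k] =
                    (i < m) ? subst[sym * alphabet_size + q[i]] : 0;   /* zero padding */
            }
    return profile;
}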

2.3 Discussion

From the previous sections, it can be ascertained that the highest performance for the SW algorithm is obtained with ASIC solutions. However, the lack of flexibility of such dedicated solutions, even with the latest advances in FPGA technologies, still proves to be a major drawback when algorithmic improvements or implementations of other DP algorithms are required. GPPs trade the higher performance of ASIC solutions for flexibility, which, coupled with the latest SIMD instruction set extensions, has proven to be a convenient approach for the acceleration of DP algorithms. However, they still struggle to meet the performance of dedicated hardware implementations.

In order to fill the gap between performance and flexibility in bioinformatics applications, ASIPs can prove to be a major breakthrough, by combining the dedicated paradigm of ASIC solutions with the adaptability of GPPs. Hence, by designing a simple processor architecture with a specialized instruction set, high performance can be achieved with a low-cost and low-power architecture. Moreover, by including SIMD capabilities in the already specialized ISA, most algorithms that take advantage of such capabilities in GPPs can be implemented, possibly achieving even higher performance.

Also, with the increase in FPGA resources, a number of instantiations of ASIP cores can be

combined in a multi-core structure or integrated in larger processing systems, exploiting greater degrees of parallelism and obtaining even higher performance gains.

2.4 Summary

In this chapter, the first section presented an introduction to vector processors, highlighting the most important characteristics of the earliest supercomputer solutions, as well as the SIMD extensions available in Intel processors. In the second section, the most relevant implementations and optimizations of the SW algorithm were presented, with a special focus on the Intel SSE2 implementation by M. Farrar [11]. Furthermore, a brief overview of the latest dedicated hardware implementations of the SW algorithm was presented, including both ASIC and FPGA solutions.


3 Related Technology

Contents

3.1 MB-LITE processor core

3.2 Shared memory mechanisms in multi-core systems

3.3 Summary


Three Reduced Instruction Set Computer (RISC) processors were considered to implement the SIMD instruction set that will be proposed in Chapter 4: i) LEON3 [27]; ii) MicroBlaze [28]; and iii) MB-LITE [12]. The LEON3 soft-core is an open-source implementation of the SPARCv8 32-bit architecture, with a 7-stage RISC pipeline. Hence, compilers and kernels for SPARCv8 can be used with LEON3. Operating frequencies up to 125 MHz in FPGA and 400 MHz on 0.13 µm ASIC technologies are reported in [29]. Although highly configurable, the architecture itself is quite complex. Therefore, the modifications required for extending the instruction set may prove too complex. Moreover, the FPGA resources required for its instantiation are too high,

namely for the multi-core platform presented in Chapter 6.

The MicroBlaze is a 32-bit RISC architecture with full GCC compiler support, designed by Xilinx for use in their FPGAs. Although it is an improvement over the LEON3 regarding FPGA area usage and offers a higher operating frequency of 200 MHz, it is still a hard IP core. Hence, it is not possible to perform modifications to extend the instruction set, since the description files are not available. This leads to the MB-LITE [12] soft-core, developed at the Delft University of Technology by T. Kranenburg. The MB-LITE soft-core is a simple open-source implementation of the MicroBlaze ISA, which means that the MicroBlaze tool-chain can still be used. Moreover, modifications are very easy to perform, due to the simplicity and modularity of the design. As described in [12], MB-LITE operates at the same frequency as the MicroBlaze, while requiring far fewer resources in an FPGA implementation.

3.1 MB-LITE processor core

The significant predominance of loop patterns in DP algorithms (generally implemented with

conditional branch instructions) and the severe penalties that the control instructions generally im-

pose on deep pipeline architectures, lead to inherent losses in the attainable throughput (imposed

by unavoidable pipeline flushes introduced by branch instructions). Such restrictions determined

the adoption of a shallower pipeline structure, contrasting with the most recent Intel GPP archi-

tectures. As a consequence, the MB-LITE [12] soft-core was selected as the base architecture for

this project, not only due to its simple and portable processing structure, but also due to its compliant implementation of the well known MicroBlaze ISA [28]. Therefore, it also has the advantage of an available and highly matured compiler, i.e., GCC, which can be extended to support new instructions. Furthermore, the adopted MB-LITE design is highly configurable and relatively easy to adapt to support the extended ISA that will be proposed in Chapter 4. The reduced hardware resources required by this processor were also taken into account, providing the basis for a scalable multi-core processing platform to exploit coarse-grained parallelism.


3.1.1 MB-LITE architecture overview

The MB-LITE [12] processor implements a reduced MicroBlaze ISA [28] in a 32-bit Harvard

RISC architecture, based on the MIPS five-stage pipeline structure. Accordingly, all instructions

have a single cycle latency, except the branches whose latency is two or three clock cycles (with

or without delay slots, respectively). The adopted architecture also includes an address decoder

to allow communication with the different peripherals in a memory mapped I/O organization inte-

grating a Wishbone Bus [30] adapter. It also provides a character device, so that STDOUT can be

easily used in the software development phase. In the following subsections, the architecture will

be described based on the observation of the VHDL files and on the documentation provided in

the MB-LITE project, accessible in [31].

The MB-LITE microprocessor is implemented with a five-stage pipeline, the stages being those usually featured in this type of architecture, i.e., Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM) and Write-Back (WB). The WB stage does not require a separate module, as it is only composed of wiring, i.e., it only contains connections to the ID stage for result storage in the register file and for the data forwarding lines.

The design of this processor was made completely modular, which provides an easy way to add, remove and connect other modules and peripherals. The core itself features the five components that implement the pipeline, the instruction and data memories, as well as the address decoder and the Wishbone adapter, as completely independent modules, allowing the microprocessor to be highly configurable. There are also other parameters that can be configured depending on the targeted application, such as the enabling/disabling of the multiplier, barrel shifter and interrupts, the size of the memory buses, the byte order policy or the usage of forwarding in the different pipeline stages. Fig. 3.1 illustrates a possible configuration, including all the modules described above.

Figure 3.1: MB-LITE configuration example, featuring the connection of I/O mapped devicesthrough the address decoder. The configuration includes a local memory and a character de-vice directly connected to the core, as well as a global memory access through a Wishbone bus.

As in the MicroBlaze [28] architecture, MB-LITE has two basic instruction types, Type A (or


Register Type) instructions and Type B (or Immediate Type) instructions. Both instruction types start with a 6-bit opcode field that identifies the instruction. Type A instructions

provide up to two source registers and one destination register, while Type B instructions contain

one source register, one destination register and a 16-bit immediate operand. For operations that

require an immediate operand with more than 16 bits, the instruction can be preceded with an IMM

instruction to preload the upper 16-bit half of the value. The instruction format and bit numbering

are shown in Fig. 3.2. Thirty two general purpose 32-bit registers are available, adopting the same

reservation policy as in the MicroBlaze processor, shown in Table 3.1.

Figure 3.2: MB-LITE bit numbering and instruction format for both types of instructions (Type A -Register Type, Type B - Immediate Type).

Table 3.1: MicroBlaze [28] and MB-LITE [12] general purpose register reservation policy.

Name      Description
R0        Constant value of 0. Writes are ignored
R1-R13    General purpose (not reserved)
R14       Reserved for interrupt return address storage
R15       General purpose, also used to store return addresses for user interrupt vectors
R16       Reserved for break return address storage
R17       General purpose, or used to store the address before a hardware exception
R18-R31   General purpose (not reserved)
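As a side note on the IMM mechanism described above, the following C fragment sketches how a 32-bit constant is split into the two 16-bit halves carried by the IMM instruction and by the subsequent Type B instruction; the function name is merely illustrative.

#include <stdint.h>

/* Split a 32-bit immediate for the IMM mechanism: the upper half is preloaded
 * by the IMM instruction, while the lower half travels in the 16-bit immediate
 * field of the following Type B instruction. */
void split_immediate(uint32_t value, uint16_t *imm_upper, uint16_t *imm_lower)
{
    *imm_upper = (uint16_t)(value >> 16);      /* carried by the IMM instruction */
    *imm_lower = (uint16_t)(value & 0xFFFFu);  /* carried by the Type B instruction */
}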

The complete MB-LITE architecture is shown in Fig. 3.3. The module hierarchy of this project

is as follows (module - file name):

• Instruction memory – sram.vhd

• Data memory – dsram.vhd

• Core – core.vhd

· Instruction fetch – fetch.vhd

· Instruction decode – decode.vhd

· General purpose register file – gprf.vhd

⋄ Port a – dsram.vhd

⋄ Port b – dsram.vhd

⋄ Port d – dsram.vhd

· Execute – execute.vhd

· Memory access – mem.vhd

24

Page 43: Multi-Core SIMD ASIP for DNA Sequence Alignmentpremio-vidigal.inesc.pt/pdf/NunoNevesMSc.pdf · Multi-Core SIMD ASIP for DNA Sequence Alignment Nuno Filipe Simoes Santos Moraes Neves˜

3.1 MB-LITE processor core

• Package files: core Pkg.vhd, std Pkg.vhd

• Configuration file: config Pkg.vhd

The Instruction memory and Data memory modules implement the instruction and data mem-

ories. These modules connect to the core module and can be replaced depending on the appli-

cation, as long as the memory interfaces match the processor's input/output interfaces. The core

module is itself an interface that connects the modules that implement the pipeline stages (fetch,

decode, execute and mem). It includes the control signals between one pipeline stage and the

next, the forward, branch and hazard values and respective control signals, and provides external

interfaces to the Instruction memory and Data memory modules. The pipeline stage modules

are described in detail in the following sections. Also included in the design are the package

files (core Pkg.vhd and std Pkg.vhd). The first provides the declaration of record types that rep-

resent the bus connections between the modules, the instantiation of the pipeline modules and

the functions that provide memory alignment, the register data selection and forwarding condition

analysis. The other package file provides the instantiation of the instruction and data memories,

the functions used in the Arithmetic and Logic Unit (ALU) and some minor subroutines used to

help in the design process.

Figure 3.3: MB-LITE architecture description. The different pipeline stages are separated with aRegister bank (in dark grey), except for the WB which is only composed by wiring. Only the maincontrol signals are represented, all other control is implicit. Also represented are the connectionsfrom the processor to the external instruction and data memories, as well as to the internal registerfile.

To allow a better design organization and maintenance, a two-process design methodology,

according to [32], was implemented. In this methodology, the combinatorial and sequential elements are explicitly separated into two processes, with the combinatorial process containing all the necessary computation. The sequential process waits for the results of the

combinatorial process and outputs them at the rising edge of the clock. All timing constraints and


major delays are, in this way, in the combinatorial process, thus allowing a better design flow that

takes full advantage of the synthesizer capabilities for optimization.

3.1.2 MB-LITE processor structure

In this subsection, the MB-LITE pipeline stages are separately described in detail. This is done

by describing the operations and assertions each stage performs during a clock cycle. It is also

described, at the end of this subsection, the address decoder and the Wishbone bus interface.

Instruction Fetch

The Instruction fetch module implements the IF stage of the pipeline. This module generates

the control signals to fetch an instruction from memory and feed it to the pipeline. It also stores

the Program Counter (PC) which represents the address of the current instruction (also called

instruction number).

The combinatorial process begins by testing the hazard and branch control signals. The haz-

ard control indicates the eventual detection of a hazard at the ID stage; in this case, the PC

maintains the current value, which leads to a pipeline stall. If the branch control indicates a suc-

cessful branch at the EX stage, the PC is set to the branch target address. When none of the

previous conditions occurs, the PC is incremented by 4, since the instruction word is 32 bits long and the instruction memory is byte-addressed. At this point, the value of the PC is passed to the sequential process to be output.
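The next-PC selection performed by this combinatorial process can be summarized by the following C-level sketch of the behaviour described above (all names are illustrative):

#include <stdint.h>

/* Next-PC selection in the IF stage: stall on a hazard, redirect on a taken
 * branch, otherwise advance by one 32-bit instruction (4 bytes). */
uint32_t next_pc(uint32_t pc, int hazard, int branch_taken, uint32_t branch_target)
{
    if (hazard)
        return pc;              /* keep the current PC, i.e., a pipeline stall */
    if (branch_taken)
        return branch_target;   /* successful branch resolved at the EX stage */
    return pc + 4;              /* byte-addressed memory, 32-bit instruction words */
}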

The fetch module is connected to an external instruction memory module through the Core

module interface. The Core module uses the PC value to address the corresponding instruction

in the instruction memory and forwards it to the Instruction decode module.

Instruction Decode

The ID pipeline stage is implemented in the Instruction decode module, which also contains

the general purpose register file. The ID stage decodes the instruction provided by the IF stage

and detects and solves any eventual data hazard condition. It is also responsible for storing in the

corresponding registers the values provided by the WB stage.

The ID stage decodes the instruction through its opcode. Given the opcode, the control signals

for operands A, B and D, used in the EX stage, are generated accordingly. Taking the ADD, RSUB

and CMP instructions as examples, note that the operations required to perform these instructions are

very similar, i.e., an addition and a final processing of the result. For this reason, the opcodes of

these instructions are very similar, which facilitates the instruction decoding and further execution

control in the EX stage. This allows all the above instructions to be treated as an addition but

with different operands. In order to correctly execute the instruction, the ID generates the control

signals to indicate if, for instance, the operand A will be negated or if the operand B comes from


a register or it is an immediate value. Control signals for carry input and carry keep are also

generated at this stage, as well as one control signal that indicates if the CMP uses signed or

unsigned results. At this point, all the control signals are generated and can be delivered to the

EX stage.

All other instructions are grouped in a way identical to that described in the above example, and the process to generate the control signals is also similar.

If a flush was not requested from the EX stage and a hazard resulting from a memory load

is detected, by the hazard detection logic, in the operand registers (A or B), a Read-After-Write

hazard is identified. In this situation, the current instruction is latched and a stall is inserted

and passed on to the EX stage. Also, if memory forwarding is disabled, the ID detects if there

is a hazard in the destination register (D) with the current instruction and the destination of the

value read from memory, leading to a Write-After-Write hazard. If such a conflict exists, a stall is

inserted.

If a hazard occurred in the previous instruction cycle, the latched instruction is recovered and

the execution proceeds. If an interrupt was triggered after the hazard detection, it is registered

for further analysis. The execution control signals are then initialized to default values and the

16-bit immediate value, if present, is concatenated with the previous 16-bit value from the IMM instruction or, if no IMM instruction was issued beforehand, the value is sign-extended to 32 bits.

If no hazards are detected, the module determines whether an interrupt was triggered and whether it can be handled, latching it if it cannot be processed immediately. Thus, interrupts are handled by evaluating whether the normal execution flow can be interrupted without causing problems. An interrupt can only be handled if: i) the current instruction is not a branch; ii) it is not in a delay slot; iii) the instruction is not preceded by an IMM instruction; iv) no flush signal was issued; and v) no hazards were detected. If all the previous conditions are met, the control signals are overloaded with a branch to the interrupt routine, located at address 10h in the instruction memory.

If a flush or a hazard control signal was issued before, nothing is done and default values are

passed on to the EX stage (emulating a NOP instruction).

Current execution variables (instruction word, PC, etc.) are stored in registers and passed on

to the following pipeline stages at the end of the cycle. Incoming data from memory, through the

WB, is aligned and sent to the register file with the corresponding control signals. Otherwise, if

an ALU result is available, also through the WB, it is stored in the register file. As for the register

source operands, the number of the register is sent to the register file and the output sent to the

EX stage, while all the above operations are performed.

The register file is implemented in the General purpose register file module. Its structure is imposed by the store instructions, which require three register values, i.e., one value to store in memory (operand D) and two values for address calculation (operands A and B). Hence, three reads are performed at the same time from the register file, leading to its implementation with three dual-port


synchronous 32x32 bit memories, for simplification.

Execute

The EX stage is implemented in the Execute module. This stage is responsible for selecting

the correct operands for A, B and D, which can be either a value read from the register file or a

forwarded value either from the MEM stage or from the ALU result from the previous clock cycle.

It then performs all the necessary arithmetic operations, as well as the decision of the branch

conditions. The execution in the EX stage is divided into two parts: the first is the operand selection and the second is the ALU operation on those operands.

For the operand selection, if there is no flush control signal being issued, the register data

for operands A, B and D is obtained, as described above. After the register data and forwarding

conditions are analyzed, all possible values for each operand are available and the final value can then be selected. For operand A, the possible values are the PC, zero, the register data or the

negated register data. For operand B, the possible values are the immediate value passed from

the ID stage, the negated immediate value, the register value and the negated register value. The

D operand is obtained either from the register file or from the forwarding lines, and is only used

as an auxiliary value for the EX stage in case the instruction is a store. For this type of instruction

the D operand contains the value to be stored in the memory location that results from adding the

operands A and B. At the same time, the affected flags of the state register are updated.

The second part of the execution occurs inside the ALU. The ALU architecture is simplified by

the above selection of the involved operands, since it allows the abstraction of the operations that

require the use of the ALU. It uses the A and B operands (and eventually the carry state bit) to perform the ADD and RSUB operations and most of the CMP and branch operations, which require a result analysis at the end. The CMP instruction uses the adder result to determine whether A is larger than B and changes the Most-Significant Bit (MSB) of the result accordingly. For a branch, the adder result represents the instruction number to where the branch is taken. All the other types of operations in the ALU, such as logical operations, multiplication, shifts (with or without the barrel shifter) and sign extensions, are performed in parallel with the adder. The branch conditions are also analyzed in parallel, along with the generation of the corresponding control signals.

Besides the carry flag, the state registers store the flush control signal.

Memory Access and Write-back

The MEM stage is implemented in the Memory access module. This module contains the

interface to the external data memory that supports memory access of byte, halfword and word

size data. It also implements the connections to the ID and EX stages, to apply data forwarding.

In this stage the ALU result is used either as the branch target, by sending it to the IF stage

along with the branch control signal, or as the memory load/store address. Memory data, to be


stored, is aligned according to the transfer size. Control and data connections are made to the data memory through the dmem_i.* and dmem_o.* signals.

If a Load-after-Store hazard occurs, it can be solved with forwarding. However, it requires

additional logic for both the forwarding and the alignment. If the resource constraints do not allow

it, a stall can be inserted instead of the additional logic, which by itself does not degrade the performance, since this type of hazard is very rare.

The WB stage, itself, is simply implemented by wiring in the core module and is responsible for

making the connections to the ID and EX stages to complete the pipeline and provide the control

signals and data for forwarding. The WB stage ends when, at the rising edge of the clock, the

ALU results or memory loaded data are stored in the register file, thus completing the five pipeline

stages.

Address decoder and Wishbone bus interface

The address decoder is a highly configurable module that allows the connection of multiple

devices to the core. It is directly connected to the data memory interface, selecting the data bus

from several slave interfaces. To access a peripheral in a read or write operation, the address

is firstly decoded using a generic memory map. After that, one of the slaves is activated by

computing the corresponding enable and write enable signals. A block diagram showing the

address decoder architecture is presented in Fig. 3.4.

Figure 3.4: MB-LITE address decoder internal architecture. Representation with two connectionsto slave peripherals.

MB-LITE also comes with a Wishbone Bus interface. This interface gives the ability to design a

memory interface with a single cycle latency, taking at least two cycles to perform the handshake

protocol and the data transfer [30]. The adapter disables the core until the slave acknowledges

the transaction in progress. This makes it possible to add multiple Wishbone compatible devices


with multiple cycle latencies.

3.2 Shared memory mechanisms in multi-core systems

Many data processing applications require independent processing of multiple data sets. This

requirement adds a new level of parallelism to the computation. Two possible approaches can be

considered: i) a fine-grained data level parallelism, where a single core simultaneously processes

different data chunks, using a SIMD processing model (one chunk in each vector element); and

ii) a thread level parallelism, where multiple cores perform the parallel processing of multiple

data sets. The second of these approaches is considered the most interesting solution in many

application domains, such as bioinformatics. This requires the use of a shared memory model,

similar to the one in [33], where the data chunks (e.g., reference and query sequences) are stored

in a shared main memory (see Fig. 3.5). The computation is performed by: i) a work controller

that is used to manage the work queue; ii) multiple processing elements that actually perform the

alignment; and iii) a mechanism to gather the results from all processing elements.


Figure 3.5: Processing architecture for multiple sequences alignment.
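For illustration, the processing pattern of Fig. 3.5 can be sketched in C as follows: a work controller fills a queue with query indices, each processing element pops one entry, aligns it against the shared reference sequence and appends the score to a results queue. All identifiers are merely illustrative and the synchronization required to protect the shared queues is deliberately omitted here, since it is discussed in Section 3.2.2.

#define QUEUE_MAX 64                                   /* illustrative queue capacity */

typedef struct { int head, tail; int items[QUEUE_MAX]; } queue_t;

extern int align(const char *reference, const char *query);  /* hypothetical SW kernel (Chapter 2) */

void processing_element(queue_t *work, queue_t *results,
                        const char *reference, const char *const *queries)
{
    while (work->head != work->tail) {                 /* work queue not empty */
        int q = work->items[work->head++ % QUEUE_MAX]; /* take the next query index */
        results->items[results->tail++ % QUEUE_MAX] = align(reference, queries[q]);
    }
}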

When going from a single-core system to a multi-core system, several issues related to resource sharing and concurrency must be considered. This is mainly due to the fact that multi-core architectures require data sharing across all the cores. The most popular solution is the introduction of a shared memory architecture, which leads to synchronization requirements, in order to maintain memory consistency and data coherence and to control which core may access and change common resources. To correctly address the problem of concurrency in shared memories, the

complete memory architecture must be analyzed. This includes shared bus topologies, shared

memory architectures and synchronization mechanisms.

3.2.1 Shared memory organization

The main property of a shared memory system is that any processor can directly reference

any memory location. Therefore, in order to assure synchronization of the shared data, atomic

operations must be made possible. The system must allow only one processor to change data at

a time, which leads to atomic reads and writes. In some cases, however, the need for such atomicity is implicitly eliminated.


As an example, a rather simplified SMP system [34] with a dual-core configuration may be composed of two cores directly connected to a dual-port data memory and to a hardware mutex core, among other peripherals, through a shared bus. The dual-port data memory allows direct and simultaneous access from both cores, eliminating the need for an arbiter to manage memory access and the need for one core to wait while the other accesses the memory.

By extending this trivial dual-port data memory configuration, some systems utilize multi-port

shared memories [35]. These multi-port memories provide a number of ports equal to the number of cores, with the restriction that there cannot be a simultaneous read and write operation to the same address. In [35], a quad-port memory is implemented on an FPGA of the Spartan3E family, using the existing dual-port RAM blocks by adding a doubled clock and extra logic. To implement a multi-port shared memory topology, a considerable amount of logic is required. As the number of instantiated cores increases, the application of this technique requires great care, as there is no synchronization whatsoever and no atomic capabilities.

Another widely used shared memory topology in MIMD systems [3] is a Distributed Shared

Memory (DSM) system, where a common memory space is shared by all cores, as described

above. However, in this case the physical memory is distributed using a complex hierarchical

organization. For example, in [36] the proposed system features a local private memory space

for each core, and a global shared memory addressable by all cores. The system features a

host processor that manages a number of PE cores. Private memory access is managed by a

dedicated arbiter that distinguishes the cores as owner (local core), guest (remote core) or host.

Priority is always granted first to the host, then to the owner and finally to the guest. Global memory access is managed by an access FIFO queue for the PE

cores, followed by another arbiter that gives access to the first PE in the access queue or to the

host (giving higher priority to the host).

This system provides L1 caches for the private memories and a shared L2 cache for the global

memory, to speed up the data requests. However, the complexity of the system implies considerable access contention and management overhead.

In order to implement a data transfer between private and global memories and/or peripherals,

a processor core is required to load the data from the source component, save it in the local

register file and then store it to the destination component (load/store architecture). This process

is extremely inefficient, especially due to the fact that the data is not altered by the processor.

DMA controllers are usually used to overcome this issue, by directly transferring the data between

memories and peripherals, while the processor keeps its normal execution (whenever possible).

The DMA operation is started when the processor sends the initial address of the transfer. Then

it sends the number of bytes to be transferred, followed by other commands, depending on the

controller’s architecture. The actual transfers begin by sending a start signal from the processor


to the controller. All the described values are stored in control registers of the DMA controller.

The controller operates with an auxiliary counter, to increment the initial address and transfer the

required amount of bytes [37].
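The programming sequence described above can be sketched as follows; the register layout, the base address and all names are purely hypothetical and only mirror the steps mentioned in the text (initial address, byte count, start command and completion wait).

#include <stdint.h>

/* Hypothetical memory-mapped register block of a generic DMA controller. */
struct dma_regs {
    volatile uint32_t src_addr;   /* initial address of the transfer      */
    volatile uint32_t dst_addr;   /* destination address                  */
    volatile uint32_t length;     /* number of bytes to be transferred    */
    volatile uint32_t control;    /* mode bits and start flag             */
    volatile uint32_t status;     /* busy/done flags                      */
};

#define DMA        ((struct dma_regs *)0x80001000u)    /* hypothetical base address */
#define DMA_START  0x1u
#define DMA_BUSY   0x1u

static void dma_copy(uint32_t src, uint32_t dst, uint32_t nbytes)
{
    DMA->src_addr = src;                  /* 1) initial address of the transfer   */
    DMA->dst_addr = dst;
    DMA->length   = nbytes;               /* 2) number of bytes to be transferred */
    DMA->control  = DMA_START;            /* 3) start signal from the processor   */
    while (DMA->status & DMA_BUSY)        /* 4) wait until the transfer completes */
        ;
}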

Most DMA controllers support three main operation modes: burst, cycle stealing and transpar-

ent modes. Burst mode is characterized by stalling the processor while the transfer is performed.

This mode is useful for loading programs and files into memory. Cycle stealing mode interleaves

the bus access with the processor, taking more time to perform the transfer but making the pro-

cessor idle for less time than in burst mode. Finally, in transparent mode the DMA controller

only performs the transfer when the processor is not accessing the system bus, granting a higher

priority to the processor execution [37].

Most DMA controller cores have a number of channels to allow simultaneous transfers to and

from different system components, usually called scatter-gather operations. An example DMA

controller that uses such transfers is described in [38].

3.2.2 Synchronization mechanisms

Due to the inherent concurrency in the resource access, multi-core and multi-threaded ar-

chitectures require synchronization mechanisms to provide atomicity and data coherency. The

synchronization is usually performed at the software level, where a wide variety of mechanisms

with some hardware support, such as mutexes and semaphores, provide mutually exclusive ac-

cess to shared resources. However, it is possible to obtain a similar synchronization with pure

hardware solutions, coupled with multiple master shared bus protocols.

One possible solution, proposed in [39], is the use of a hardware mutex core, connected

through a shared bus and shared by all processors. This block is register based (with a config-

urable number of registers), where each internal register acts as a traditional software mutex (two

states, locked and unlocked). When one core tries to read from an unlocked mutex (register), a

‘1’ is immediately returned and the value changes to ‘0’, i.e., the mutex becomes locked. After

that, all other cores that read the mutex receive a ‘0’, until the first core writes ‘1’ to the mutex, i.e., unlocks it (see Fig. 3.6). All the write and read operations are atomic, assured by bus arbitration logic, which allows only one core to access a single mutex at a time. This provides

the user with the possibility of creating atomic micro-routines, by simply analyzing the contents

of the mutex registers, without any need of special atomic instructions, which would otherwise

require additional logic inside the core itself, therefore degrading the performance.

3.2.3 Shared bus topologies

In order to interconnect the cores and the shared memory, several shared bus topologies

that use different Master/Slave configurations have been proposed. In [34], the system is inter-

connected through an On-chip Peripheral Bus (OPB) [40], which is shared by all the cores and


Mutex register fields: CORE ID | STATE

Read from the mutex register:
    if ‘1’ is returned: the mutex was unlocked and the lock was acquired (‘0’ was written to the register)
    if ‘0’ is returned: the mutex was locked

Write to the mutex register:
    if the core id matches the owner's id: the lock is released (‘1’ is written to the register)
    otherwise: nothing happens

Figure 3.6: Mutex register structure and pseudocode for lock and unlock operations. A coreidentification is concatenated with the state bit in the mutex register, for lock owner identification.

connects to the hardware mutex core and the shared memory, as well as to other peripherals.

The OPB is a fully synchronous shared bus that does not connect directly to the processor cores.

Instead, the connection is made to a local bus that interconnects all cores, through a bridge unit

in a separate core [40]. The bus provides a single cycle transfer of data between a Master and a Slave, as long as there is no more than one Master connecting to the same Slave. Also, a considerable number of different protocols for arbitration and data transfer are

available. OPB supports 8-bit to 64-bit Slaves and 32 or 64-bit Masters, as well as dynamic bus

sizing (byte, half-word, full-word and double-word transfers), compatible with MB-LITE transfer

sizes.

Although OPB is highly configurable, the Wishbone Bus [30] is a more convenient solution

when a simpler bus implementation is desired. Also using a Master/Slave-type architecture, the

Master and Slave components can be directly connected through one of four types of intercon-

nection interfaces, i.e., Point-to-point or Data Flow, for one Master and one Slave configuration,

or Shared Bus or Crossbar Switch, for a configuration with multiple Masters and Slaves. In both the Shared Bus and the Crossbar Switch interconnection types, there is always one Master that initiates

a transaction to a target Slave. The Slave then connects to the Master using one or more bus

cycles.

The Shared Bus topology (Fig. 3.7(a)) provides a connection for two or more Masters and one

or more Slaves. The access management to the shared bus by the Master is performed by an

arbiter that implements either a priority-based or a round robin-based protocol. The Shared Bus

requires less logic than other interconnection types. However, only one Master can access the bus at a time, leading to a performance degradation of the system.

The Crossbar Switch topology (Fig. 3.7(b)) is used when connecting two or more Masters to

two or more Slaves is required. Unlike the Shared Bus, more than one Master can access the

bus at a time, as long as multiple Masters do not try to connect to a single Slave. In this case,

the arbiter manages the access as described above. When a Master accesses the crossbar, a

communication channel is defined and, once established, the Master and the requested Slave


transfer data over a private communication link. The main advantage of the Crossbar Switch

interconnection topology is the speed attainable for transferring data, significantly higher than the

Shared Bus.

(a) Shared Bus (b) Crossbar Switch

Figure 3.7: Wishbone bus topologies. (a) Shared Bus topology, where only one Master canaccess the bus at the time. (b) Crossbar Switch interconnection topology, where the Masters cansimultaneously establish connections to the Slaves (dotted lines).

Finally, when high performance is a major requirement, ARM’s AMBA High-performance Bus

(AHB) is one of the most used bus topologies in embedded systems. Within this family, the AHB-Lite [41], a subset of the AHB protocol, provides a simpler yet robust bus interconnection. AHB supports the interconnection of components with different clock frequencies and bandwidths.

Wide data bus widths are supported, providing pipelined data transfers with sizes ranging from

8 to 1024 bits. For multiple data transactions, a number of different burst modes are provided,

mitigating the bus access latencies.

The protocol itself is subdivided into two phases: i) an initial addressing phase, lasting one clock cycle, where the address and control signals are passed by the Master; and ii) a data transfer phase, which can be extended with an arbitrary number of wait states inserted by the Slave until the data transfer is complete. Bus access is managed by an arbiter that grants access to a single Master at a time.

The simpler AHB-Lite protocol provides interconnection for a single Master and multiple Slaves (denominated a layer), hence eliminating the need for a bus arbiter. However, it is possible to implement a Multi-layer [42] matrix interconnection by controlling the access to each Slave with a dedicated arbiter, as shown in the conceptual diagram in Fig. 3.8. This way, all layers can access the bus at the same time and connect to different Slaves. A bottleneck in this bus topology occurs when simultaneous accesses to a single Slave are requested.

More detailed descriptions of the AHB-Lite and the Multi-layer AHB protocols, along with the

implementation procedures, will be provided in Chapter 6.


Figure 3.8: AHB Multi-layer matrix interconnection representation. All Masters can simultaneously access the bus and arbitration is performed at each Slave. Slave addresses are arranged in a memory mapped I/O organization.

3.3 Summary

This chapter was divided into two sections. The first provided a detailed description of the MB-LITE [12] processor architecture, which will serve as the base architecture in the following chapters. The second section presented an introduction to multi-processor topologies, including synchronization components and bus arbitration architectures, by presenting some example implementations available in the literature.


4 SIMD instruction set for DNA sequence alignment

Contents

4.1 SIMD optimization of the algorithm

4.2 SIMD register architecture

4.3 Proposed SIMD instructions

4.4 Proposed SIMD ISA implementation

4.5 Summary


This chapter presents the proposed ISA extension targeting the acceleration of a broad range

of DP algorithms, including not only the classic local and global sequence alignment procedures

(such as the several SIMD implementations of the Needleman-Wunsch and Smith-Waterman

algorithms [11, 24, 25]), but also other widely used DP algorithms adopted in several application

domains (e.g.: Hidden Markov Models (HMMs), Viterbi chains, etc.). The ISA definition was

performed by analyzing all the SIMD implementations of the SW algorithm that were described in

Chapter 2, taking into account all the required SIMD operations for the execution of the algorithms.

Other specific instructions were also defined to optimize the implementations, such as control and

memory-access instructions for SIMD operands.

As mentioned above, although the proposed ISA was specified for a wider range of DP algorithms, Farrar's SIMD implementation [11] will be adopted herein as the elected case study to demonstrate the advantages of the proposed ISA extension, due to its higher performance and its prevalence in most widely established bioinformatics applications.

4.1 SIMD optimization of the algorithm

By analyzing the SIMD implementations described in Chapter 2, it is clear that the adoption of

vectorial arithmetic instructions is a major requirement to accelerate the algorithm implementation,

comprising addition, subtraction and compare operations. These instructions should not only speed up the operations between vectors, but they might also facilitate the several operations be-

tween vectors and scalars, which are particularly useful when subtracting the gap penalties, in the

SW algorithm. Furthermore, from a closer inspection and analysis of other alternative approaches

provided in [11, 24, 25], it can be verified that saturated arithmetic is used, instead of modular (or

wrap-around) arithmetic, given that the maximum operations of the used instruction sets do not

support unsigned operands. Moreover, an offset is often used to allow the utilization of the vector

elements’ full value range. For instance, for 8-bit signed elements, the values are usually offset by

128, providing the score values from -128 to 127 during the execution. This value is then added

to 128, giving a final score in the interval 0 to 255. Otherwise, the final scores would be limited

to 127. By coupling the saturated arithmetic in SIMD operations with the offset corresponding to

the element’s lowest negative value, the equivalent operation to the maximum with zero (see al-

gorithm by O. Gotoh [16] in Eq. 2.1) becomes implicit, since the values never go below the biased zero. This eliminates at least three maximum operations per matrix cell.
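This observation can be illustrated, for a single 8-bit lane, by the following scalar C sketch; the bias value, the biased-score convention and the function names are merely illustrative of the biased, saturated update of one matrix cell.

#include <stdint.h>

/* 8-bit unsigned saturating add/subtract and maximum (one SIMD lane). */
static uint8_t adds_u8(uint8_t a, uint8_t b) { unsigned s = (unsigned)a + b; return s > 255u ? 255u : (uint8_t)s; }
static uint8_t subs_u8(uint8_t a, uint8_t b) { return a > b ? (uint8_t)(a - b) : 0u; }
static uint8_t max_u8 (uint8_t a, uint8_t b) { return a > b ? a : b; }

/* One H-cell update with a biased substitution score: whenever the true value
 * H(i-1,j-1) + Sbc(q[i],d[j]) is negative, the saturating subtraction of the
 * bias clamps the result at zero, so the explicit max(..., 0) of Eq. 2.1
 * becomes implicit. */
uint8_t cell_update(uint8_t h_diag, uint8_t score_plus_bias, uint8_t e, uint8_t f, uint8_t bias)
{
    uint8_t h = subs_u8(adds_u8(h_diag, score_plus_bias), bias);
    return max_u8(max_u8(h, e), f);       /* remaining maxima with E(i,j) and F(i,j) */
}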

The mentioned maximum operation, mandatory in any implementation of the SW algorithm,

can be provided in a single SIMD instruction, eliminating the need for a micro-routine implement-

ing this operation. Hence, it will easily provide better performance levels in implementations such

as Rognes and Seeberg [25] and Farrar [11]. Since the MB-LITE ISA [12] does not provide a scalar maximum instruction that can be taken as a base for a SIMD version, the scalar version


should be implemented first and then extended for SIMD operations.

In addition to the existing memory access operations, specialized instructions for vector Load-Store operations are also required in the MB-LITE [12] core. Hence, to simplify the control logic

and by taking advantage of the existing memory organization in parallel 8-bit blocks, the memory

words can be extended to the vector register size (e.g., 128 bits) with minor impact in the memory

access time. This allows the memory access corresponding to vector operands to take the same

number of clock cycles to complete as the scalar operands, provided that they are properly aligned

and sequentially stored in memory. When the elements are not sequential, multiple scalar-type

loads are required to obtain the elements separately and then gather them in a single vector

register. The latter operation can be implemented by a specialized move operation, which stores vector element-sized scalars to a specific vector position. Hence, this ISA extension will provide wider flexibility for the application to construct vectors from scalar or vector-element operands stored in memory or registers, and to rearrange them within a single vector's elements.

By focusing on the original implementation by M. Farrar [11] (represented in Appendix B.4

and briefly reviewed in Fig. 4.1(a)), the shifting of the F and H vectors can also be efficiently

implemented with a vector element-sized shift left instruction. The reason behind this decision is

the fact that, as it will be explained in the next chapter, the Barrel Shifter will be deactivated in the

considered extension of the MB-LITE [12]. Therefore, the only available shift instructions will be the one-bit shift right operations, discarding all the remaining left shift operations.

On the other hand, from a detailed inspection of the Intel SSE2 assembly implementation [11]

(see Fig. 4.1(b)), it can be observed that the lazy loop condition assertion requires at least 5

instructions. Therefore, a specialized SIMD branch instruction, able to simultaneously assert a

branch condition based on all vector elements, without any additional processing, would signifi-

cantly increase the achieved performance. Furthermore, from a flexibility point of view, both logical conjunction and disjunction in the condition assertion can be provided, i.e., allowing a condition to be taken as valid if it is valid across all elements in a vector or if it is valid

in at least one single element. Hence, the available branch conditions in MB-LITE [12] can be

extended for disjunctive condition assertion and a new branch condition, specialized for SIMD,

will be provided to perform the conjunctive assertion of a vector mask (i.e. take all the MSBs from

the vector’s elements and check if they are all zero).
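As an illustration of the conjunctive assertion just described, the following scalar C sketch emulates the intended SIMD branch condition for a 128-bit vector of 8-bit elements (the function name is illustrative):

#include <stdint.h>

/* Gather the MSB of every 8-bit element into a 16-bit mask and assert the
 * conjunctive branch condition: the branch is taken only if all MSBs are zero. */
static int branch_if_all_msb_zero(const uint8_t v[16])
{
    uint16_t mask = 0;
    for (int k = 0; k < 16; k++)
        mask |= (uint16_t)(((unsigned)v[k] >> 7) & 1u) << k;   /* collect the MSBs */
    return mask == 0;                                          /* '1' means the branch is taken */
}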

From the above observations, it is clear that a dedicated and optimized ISA not only should

include a suitable set of arithmetic SIMD operations (with a particular emphasis on the addition,

subtraction, compare and maximum operations), but should also incorporate other classes of

instructions, including logic, memory-access and control operations. Furthermore, an appropriate

register structure, targeting specific application domains, will have to be defined to accommodate

both the vectorial and scalar operands. These two aspects will be covered in the following two

sections.



Figure 4.1: Farrar’s SIMD implementation of the SW algorithm [11]: (a) Pseudo-code definition;(b) Intel SSE2 assembly; (c) Proposed ISA assembly code. Instructions outlined in bold facebelong to the specialized ISA. Shaded areas outline the blocks of identical operations that requirea different number of instructions to complete in the two implementations. Only the inner and lazyloops are illustrated in this figure.

4.2 SIMD register architecture

By taking a similar approach as Intel’s MMX technology [4], it is possible to define a certain

level of abstraction in the register file, by mapping the new vector registers to the already existing

integer registers. This is performed by extending the width of the MB-LITE scalar register file to

the required vector size and by allowing both vector and scalar operations to access it indepen-

dently. As opposed to the issues arisen from Intel’s mapping of the MMX registers to the FPU

registers, in this case non-SIMD instructions will only operate over the least-significant part of the

register, corresponding to a scalar processor word, being the upper part of the register padded

with zeros. However, all the newly proposed SIMD instructions will operate over the entire register

(see Fig. 4.2). With this design option, the critical path of the processor is confined within the

data-path corresponding to the non-SIMD operations, thus making it independent of the extended

SIMD register size. Moreover, state switching is not required as in the MMX technology.
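To make the adopted register mapping more concrete, the following minimal C sketch (purely behavioural, assuming a little-endian host for the union overlay) models how a single 128-bit register is viewed by SIMD and non-SIMD operations; the names used here are illustrative and do not belong to the hardware description.

#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16   /* default configuration: 128-bit registers, 16 8-bit elements */

/* Behavioural view of one extended register: SIMD instructions operate on
 * all elements, while non-SIMD instructions only see the least-significant
 * 32-bit word, with the upper part padded with zeros on scalar writes. */
typedef union {
    uint8_t  element[VEC_BYTES];  /* SIMD view: element[0] is the least-significant element */
    uint32_t scalar;              /* non-SIMD view: least-significant processor word */
} ext_reg;

static void scalar_write(ext_reg *r, uint32_t value)
{
    memset(r->element, 0, VEC_BYTES);  /* upper part padded with zeros */
    r->scalar = value;                 /* scalar word occupies the low part only */
}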

As already mentioned in the previous section, the data memory organization will also be modified. To accommodate entire SIMD registers in a single address position, the memory words are

also extended to the vector size. However, specially provided logic will be included to assure the


memory alignment in MB-LITE. With such an approach, it will be possible to maintain the current

alignment procedures for scalar operands and extend them to support vector size memory ac-

cess. Hence, scalar operands will keep the 8, 16 or 32-bit alignment, while the SIMD operands

are aligned at their size (e.g., 128-bit alignment). The vector elements are numbered starting from

0 at the least-significant element of the vector.

Although not limited in this respect, the proposed instruction set and the corresponding data-path will provide, by default, support for the same register and vector-element sizes as Intel SSE2 (used by Farrar [11]), i.e., 128-bit registers with 16 8-bit elements. Hence, the vector elements of each register can take any size, from 8 bits up to the limit imposed by the register size. The vectors themselves can also take any size, from 32 bits to at least 128 bits; in theory, the only limit is imposed by the hardware and application restrictions. To obtain a fair comparison with Farrar's [11] SSE2 implementation,

only 128-bit registers with 8-bit elements will be considered henceforth. The exception will arise

in some of the figures that will be presented in this document, where 32-bit registers with 8-bit

elements will be considered, for simplification.


Figure 4.2: Division of each processor register into multiple SIMD vector elements (v.e.). Non-SIMD operands coexist in the same register, occupying the least-significant word.

4.3 Proposed SIMD instructions

The proposed ISA defines 58 specialized SIMD instructions for arithmetic, logic,

memory access and control operations, summarized in Table 4.1, and fully documented in Ap-

pendix A. These instructions are subdivided into 3 classes: vector-vector, operating over the

corresponding pairs of elements in each SIMD register; vector-scalar, operating between one

SIMD register and a non-SIMD operand of another register; and inner-vector, operating on adjacent pairs of vector elements within a single SIMD register. Instructions that only take one vector

operand are included in the vector-vector category.

In order to facilitate the definition of each instruction, architectural references to MB-LITE are included, together with some of the procedures conducted to integrate those instructions.

4.3.1 Modular and saturated arithmetic

To define the desired set of SIMD arithmetic instructions, the existing corresponding scalar instructions in the original MB-LITE ISA can be used as a base. However, some simplifications


Table 4.1: Proposed SIMD specialized instruction set for biological sequence alignment. In the SIMD branch instructions, the delay and immediate options are implicit.

ARITHMETIC                                       LOGIC           MEMORY   CONTROL
Vector-Vector   Vector-Scalar   Inner-Vector     Vector-Vector
ADDVV           ADDVS           ADDV             SLLV            LV       BEQV
SADDVV          SADDVS                           MTV             LVI      BNEV
RSUBVV          RSUBVS          RSUBV                            SV       BGEV
SRSUBVV         SRSUBVS                                          SVI      BGTV
CMPVV           CMPVS           CMPV                                      BLEV
CMPUVV          CMPUVS          CMPUV                                     BLTV
MAXVV           MAXVS           MAXV                                      BMEV
MAXUVV          MAXUVS          MAXUV

were assumed in the implementation of the three defined SIMD categories of arithmetic opera-

tions. As a consequence, the carry logic is left out, i.e., there is no carry-keep or carry-in function-

ality for SIMD arithmetic. This decision was mainly due to the fact that the logic required to track the

carries across the different pipeline stages would be extremely complex. The carry-in bits are only

forced to ’1’ to differentiate the addition and subtraction operations, as it was initially implemented

for the scalar operations. Also left out were the immediate operands, since there is not enough

space in the instruction word to provide constant vectors with sizes greater than 32 bits.

As mentioned, the existent scalar arithmetic logic can be used. In particular, the same reversed

subtraction used in the original MicroBlaze ISA [28] can be adopted, i.e., the subtractions take

the format rD = rB − rA. Hence, in vector-scalar operations, the scalar operand is always in

the source register A. Also, in inner-vector operations, the additions/subtractions follow the rule

rD[i] = rA[2 × i + 1] ± rA[2 × i], for the lower half of SIMD register D, where rD[i] represents

register D’s vector element i.

Arithmetic instructions are required to operate in modular or saturated modes. However, this

feature is only required for the addition and subtraction on SIMD instructions. Moreover, since the

main purpose of the corresponding inner-vector operations is to sum or subtract all the elements

in a vector, in order to simplify the saturation detection logic, only the vector-vector and the vector-

scalar types feature this option.

Since MB-LITE [12] does not feature a maximum instruction, to further facilitate the implemen-

tation of maximum SIMD versions, a scalar maximum instruction is first implemented, to serve as

base. In order to implement the scalar maximum instruction, the existing compare operation logic

can be used as base, as will be described in the next chapter. As a consequence, when extending

the maximum logic to allow SIMD operations, the compare logic will also be extended, hence providing the corresponding SIMD versions of the compare instruction.

4.3.2 Logic operations

Bitwise SIMD operations, such as OR, AND or XOR, can use the existing scalar instructions

without any change. Such an extension is simply achieved by making these instructions operate

over the entire vector register. Such option is possible since there is no bit interdependency in


such operations, as opposed to what occurs in an addition, for instance. Moreover, since the upper part of the register is padded with zeros for scalar operands, the scalar operations do not affect the rest of the register.

However, in order to implement the vector element-wide shift, it is necessary to provide extra

logic to perform the vector elements’ left shift and fill the lower element of the vector with zeros.

The same control logic for the existing one-bit right shifts can be used, with minor modifications.

The data transfer (move) operation, which allows the construction of a SIMD vector from scalar

values, may be implemented with the aid of multiplexers. With an index provided through a scalar

parameter in the instruction encoding, the lower element-sized part can be multiplexed into the in-

dexed position of the vector (illustrated in Fig. 4.3). Any other vector elements remain unchanged

throughout the entire process. Therefore, in this data transfer instruction the destination and base

vector registers are both given by register D, forcing the 3 registers to be obtained from the reg-

ister file (as in a store instruction, where register D contains the value to be stored in memory),

taking advantage of the three-port register file. Nonetheless, it still allows a simple instruction encoding. Furthermore, the control logic of the store operations can be reused. Note that, although the implementation of this operation would also be possible with the previously defined element-sized shift, coupled with an OR operation with a scalar value, the move operation provides a faster and more flexible option.

Figure 4.3: Illustration of the move operation. The multiplexer is selected with a value provided in the instruction encoding.
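The following behavioural C sketch (names and types are illustrative only) summarises the intended semantics of the move operation: only the indexed element of the destination vector is overwritten.

#include <stdint.h>

#define N 16  /* 8-bit elements per 128-bit vector (default configuration) */

/* Behavioural sketch of the move-to-vector operation: the element-sized
 * low part of the scalar in rB is multiplexed into the position indexed by
 * the LSBs of rA; every other element of the destination keeps its value. */
static void move_to_vector(uint8_t rD[N], uint32_t rA, uint32_t rB)
{
    rD[rA & (N - 1)] = (uint8_t)rB;  /* index from the LSBs of rA, value from the low byte of rB */
}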

4.3.3 SIMD branch

The SIMD branch condition assertion requires the replication of the existing logic for scalar

branches. In particular, the logic corresponding to the conditions already featured by MB-LITE [12]

(i.e., if-equal (BEQ), if-not-equal (BNE), if-greater-than (BGT), if-greater-or-equal (BGE), if-less-than

(BLT) and if-less-or-equal (BLE)), should be replicated through all the vector elements and the

resulting assertion bits are OR’ed to obtain disjunctive conditions. In order to implement a con-

junctive condition assertion, a new specialized condition must be provided. Therefore, the new


branch condition (named mask branch (BMEV)) is implemented by taking the MSB of each vector element (hence constructing a sign mask of the vector), negating the resulting bits and then ANDing them.

Fig. 4.4 illustrates the disjunctive and conjunctive assertions at a conceptual level.

Figure 4.4: Conceptual diagram of the branch disjunctive (left) and conjunctive (right) condition assertions. On the left, blocks Z and N represent the logic to assert if the input value is zero or negative, respectively. On the right, a mask is constructed with the MSB from each element. Note that the mask's bits are negated before the AND gate.

The branch targets are provided through scalar operands, either in registers or as immediate values, with no additional logic required for the SIMD versions. Also, as for the existing scalar branches, the SIMD branches support delay slots.

The control logic for the SIMD branches requires minor modifications to include the new mask condition and the identification of SIMD operations. Note that the delayed versions of the branches are independent of the condition assertion; therefore, no modifications are required, even for the new condition.

Since the mask condition is specific to the assertion of multiple elements at a time, it is meaningless for scalar branches. Therefore, it is only available for SIMD operations.

4.3.4 Memory access

According to the defined register structure, non-SIMD load and store instructions will only op-

erate over the processor word-size. As a consequence, the extended SIMD instruction set also

provides the corresponding SIMD versions (LV and SV) for vectorized memory accesses. The

memory address is computed with scalar operands (register or immediate) and only the destina-

tion (LV) or the origin (SV) are SIMD operands. Since this extension implies some modifications in

the memory organization, the alignment procedures must provide support for vector-sized word

storage. Hence, vector-sized memory words are aligned to the vector size. For example, with 128-bit registers, the vectors in memory are aligned to 10h (16-byte) boundaries and the address's 4 Least-Significant Bits (LSBs) are used to choose the corresponding vector, 32, 16 or 8-bit word within the memory position (see Fig. 4.5).


Figure 4.5: Byte select scheme with the memory address’s lower 4 bits, for 128-bit memory words.

Regarding the implemented control logic, it is only necessary to distinguish a new transfer

type for vector-sized memory transfers, named vector. Coupled with the existing transfer types

(i.e. byte, halfword and word), the control logic is provided for the memory alignment procedures.
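As a behavioural illustration of this alignment scheme, the C sketch below derives a per-byte select mask from the address's 4 LSBs for the default 128-bit memory words; the orientation of the select bits and the assumption of addresses aligned to the transfer size are simplifications of this sketch.

#include <stdint.h>

enum transfer_size { BYTE = 1, HALFWORD = 2, WORD = 4, VECTOR = 16 };

/* Sketch of the byte-select generation for 128-bit memory words: the low
 * 4 address bits locate the requested byte, halfword or word inside the
 * wide word, while a vector transfer selects all 16 bytes. */
static uint16_t byte_select(uint32_t address, enum transfer_size size)
{
    uint32_t lanes = (size == VECTOR) ? 0xFFFFu : ((1u << size) - 1u);
    return (uint16_t)(lanes << (address & 0xFu));  /* one select bit per byte of the wide word */
}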

4.4 Proposed SIMD ISA implementation

With the proposed SIMD ISA that was defined in this chapter, a pseudo-assembly implementa-

tion based on Farrar’s striped algorithm [11] is shown in Section B.1, of Appendix B. Furthermore,

a preliminary compilation result is also shown in Fig. 4.1(c), to compare with Farrar’s [11] algorithm

implementation based on Intel’s SSE2 ISA.

It can be observed that an immediate reduction in the number of instructions is achieved with the proposed ISA, with more visible advantages in the lazy loop. The major contribution to

this reduction arises from the new set of vectorized control instructions, that significantly reduce

the control overhead. Furthermore, besides the clear reduction in the number of instructions, it

is important to note that the most significant advantage of the proposed ASIP arises from the fact that it adopts a strict RISC paradigm based on a shallow pipeline structure, contrasting with Intel's Complex Instruction Set Computer (CISC) model and its deeper pipeline structures, thus mitigating the impact of pipeline flushes. Moreover, the higher latencies of Intel's CISC instructions, requiring multiple-cycle executions, are eliminated and give room to the proposed single-cycle instruction

set. As a consequence, the observed difference in the number of executed instructions, together

with the RISC single-IPC ratio, will significantly improve the processing efficiency of the algorithm,

as it will be demonstrated in Chapter 7.

4.5 Summary

The beginning of the chapter presents a requirement analysis of the current state-of-the-art

SIMD implementations of the SW algorithm, with a more detailed focus on the striped implemen-

tation by M. Farrar [11]. Then, the definition of an extended optimized ISA to efficiently execute


this class of algorithms was presented. In particular, the proposed SIMD register structure was described, establishing a parallel with the MMX technology and with the MB-LITE [12] soft-core architecture. The definition of the new SIMD instruction set, cross-referencing the MicroBlaze ISA [28] and MB-LITE's available resources, was also presented. Finally, the last section presents an implementation of the striped SW algorithm with the newly defined ISA, together with a performance prediction against the Intel SSE2 implementation by M. Farrar [11].


5. Proposed architecture for the defined SIMD extension

Contents
5.1 Register and memory support for SIMD vectors ... 48
5.2 Modification of the execution unit ... 48
5.3 Adaptation of the decode unit ... 54
5.4 GCC back-end extension ... 56
5.5 Summary ... 58


In this chapter, the MB-LITE [12] soft-core is extended to adopt the ISA proposed in the previous chapter. The extension was performed in different phases. Initially, the register file structure and

the datapath of the processor were extended for the new SIMD operations, including arithmetic,

logical, branch and memory access operations. The next phase was the changing of the decode

unit to support the new instructions and generate the corresponding control signals. The final

phase was the inclusion of the new instructions in the adopted compiler structure, for validation

and performance testing throughout the extension process.

5.1 Register and memory support for SIMD vectors

Despite being fully parameterizable, the configuration of the designed SIMD module that was

adopted by default uses 128-bit registers, each one with 16 8-bit SIMD elements. The operand

size can range from 2 bits (useful for DNA processing) to half the register size. By increasing the

number of operands, the amount of generated logic also increases, with consequences on the

processor hardware resource usage, although not necessarily increasing the critical path since

the generated logic is almost completely parallel. Similarly, several parameterizations can also

be used for the scalar operations, since the architecture is able to process values ranging from

32-bit data words to any other word size. However, it shall be taken into account that increasing

the scalar word size may decrease the maximum clock frequency.

As a consequence of the introduced extension of the data word size, changes must be also

applied to the memory organization. Therefore, the memory was also extended to support vector-

sized words (e.g: 128 bits by default). In that case, the LSBs of the memory address are used

to generate the byte select signal, in the MEM stage, upon performing the alignment procedures.

The byte select signal is used to indicate which bytes of a word were requested in a transfer. No changes were required, other than the extension of the data and byte select signals.

5.2 Modification of the execution unit

As a consequence of the described register and memory word expansions, coupled with the

decision of maintaining the scalar operands with 32 bits, the output data from the scalar part of the

ALU must be padded with zeros, before being multiplexed with results from the vector processing

units. Furthermore, since several scalar operations may generate a carry-out bit, this extra bit is

concatenated with the result throughout the selection process. After the desired result is selected,

the carry bit can then be removed and the corresponding position forced to 0, whenever this is

required by the operation in progress.

The following subsections present the developed architectures to implement the proposed new

instructions.


5.2.1 Maximum instruction and decision logic

There are several methods to obtain the maximum value between two operands, such as comparators or equality detectors, both combined with extra logic. However, a simpler approach can be used to calculate the maximum: when subtracting the two operands, the maximum between them is obtained by analysing the sign of the result. This is easily implemented with the help of an adder and a multiplexer, as shown in Fig. 5.1.

if sign(B - A) = 1:
    MAX = A    // B < A
else:
    MAX = B    // B >= A
end if

Figure 5.1: Pseudocode and hardware logic for the maximum operation. The adder uses a carry bit and the complemented A operand to perform a subtraction.

However, it was observed that the existing compare instruction already implements part of this

procedure, by using the sign of a subtraction result to determine which of the operands is greater

and then by asserting the MSB of the output to show the true relation between them. Therefore,

the introduced maximum instruction is based on the compare instruction, since the same logic

can be used to implement these two instructions, and only requiring an extra multiplexer, selected

by the MSB of the result, to choose the maximum between the two operands. Since the compare

instruction can handle both signed or unsigned operands, the maximum also adopts the same

policy. This allowed a simplification of the implementation, since no changes were required prior

to the decision logic. However, one issue must be taken into account before the final selection

of the greater value, since the operand selection logic must complement operand A before send-

ing it to the adder. So, before the output of the multiplexer, operand A must be complemented

again to match its original value. Although simple to implement, all the additional logic would

greatly increase the critical path of the execution unit when extending the logic to support SIMD

operations. To overcome this issue, it was decided to move the decision logic to the next pipeline

stage and to embed part of its circuitry in the pipeline forwarding path, as described in Fig. 5.2.

To accomplish this, both operands and the ctrl ex.operation control signal are propagated to

the next stage.


Figure 5.2: The maximum decision logic is postponed to the next pipeline stage and to the pipeline forwarding lines.

5.2.2 SIMD ALU module

To support the proposed extension of the ISA, the execution unit had to be modified, by ex-

tending its original ALU to include a new SIMD module. This SIMD module comprises all the

introduced arithmetic and logical operations, as well as the move-to-vector operation.

The addition and subtraction operations require one adder per SIMD vector element, together

with some extra multiplexing logic (illustrated in Fig. 5.3). Since different types of SIMD operations

are supported (vector-vector, vector-scalar and inner-vector ), the required elements have to be

selected from the corresponding vectors and only then the execution unit performs all the arith-

metic operations in parallel. The results are then chosen based on appropriate control signals,

due to the fact that the SIMD operations are performed in parallel with the original scalar unit.

Hence, multiplexing is required to select the results from the correct arithmetic unit.

The shift instruction is implemented by simply shifting the entire vector to the left by the vector

element size (8 bits by default) and the lower element is filled with zeros. Fig. 5.4 illustrates the

operation.
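A minimal behavioural C equivalent of this shift, for the default 16-element configuration, is sketched below.

#include <stdint.h>

#define N 16  /* 8-bit elements per 128-bit vector (default configuration) */

/* Behavioural sketch of the vector element-sized left shift: every element
 * moves up by one position (8 bits) and the lowest element is zero-filled. */
static void shift_left_by_element(uint8_t rD[N], const uint8_t rA[N])
{
    for (int i = N - 1; i > 0; i--)
        rD[i] = rA[i - 1];   /* element i receives element i - 1 */
    rD[0] = 0;               /* least-significant element filled with zeros */
}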

For the SIMD move instruction, the destination vector is first fetched from the register file, just

as in the store instruction. The value to be moved is given by the least-significant element-size

part of operand B. The position where the value is to be moved is provided by the LSBs from

operand A. The resulting vector is the concatenation of the remaining vector elements (originally

at operand D) with the moved value in the correct position (as seen in Fig. 5.5).

Figure 5.3: Block diagram of the ALU SIMD module, illustrating the logic required to implement (a) the vector-vector and vector-scalar operations and (b) the inner-vector operations.

Figure 5.4: Representation of a vector element-sized shift left operation.

Figure 5.5: Block diagram of the move-to-vector operation, illustrating the multiplexer interconnection. The lower bits from rA are used to select one of the multiplexers, placing the scalar from rB in the corresponding position.

As with the corresponding scalar version, the SIMD maximum instruction also requires appropriate compare logic, which is also used to implement the auxiliary compare instructions. Hence, the scalar compare logic was replicated for each SIMD instruction type. This was done by down-sizing the existing scalar compare logic to the vector element size and by selecting the correct elements to be analyzed. The element selection follows the same pattern, for the different SIMD instruction types, as the arithmetic operand selection in Fig. 5.3.

As in the scalar compare instruction, when the values are signed it is not necessary to change the result, since the result's MSB already expresses the true relation between both operands. On the other hand, if the values are unsigned, the MSB is changed accordingly, just as in the scalar instruction. For instance, assuming 8-bit values, the result of the subtraction 85h − 5h is 80h (or

−128 in decimal). If the input values are taken as signed, through the MSB of the result, it can be

determined that 5h (or 5) is greater than 85h (or −123). However, if the input values are unsigned,


the same resulting value is obtained, but now it is 85h (or 133) that is the greater value. Hence, for unsigned input values, additional logic is required to take the input operands' MSBs into account when asserting the true relation between them.
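The following short C program reproduces this example, confirming that the same subtraction result leads to opposite decisions for the signed and unsigned interpretations.

#include <stdint.h>
#include <stdio.h>

/* Reproduces the 85h vs 05h example: the same 8-bit subtraction result
 * leads to opposite "greater" decisions depending on whether the operands
 * are interpreted as signed or unsigned values. */
int main(void)
{
    uint8_t a = 0x85, b = 0x05;
    uint8_t diff = (uint8_t)(a - b);                      /* 80h in both interpretations */

    int8_t  signed_max   = ((int8_t)a > (int8_t)b) ? (int8_t)a : (int8_t)b;
    uint8_t unsigned_max = (a > b) ? a : b;

    printf("diff = %02Xh, signed max = %d, unsigned max = %u\n",
           diff, signed_max, unsigned_max);
    /* prints: diff = 80h, signed max = 5, unsigned max = 133 */
    return 0;
}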

The second step of the SIMD maximum instructions’ implementation is to apply the same

replication to the maximum decision logic. Hence, based on the MSB of the result obtained from

the compare logic, the same procedure is applied for each resulting vector element. In the vector-

vector and vector-scalar instructions, the result is selected between the corresponding pairs of

elements in both vectors, or between each element in vector B and the element-sized scalar A,

respectively. For the inner-vector instruction the selection is performed between the adjacent

pairs of elements in vector A and the results placed at the lower part of the result, with upper

part filled with zeros. The decision logic for the SIMD maximum is added to the scalar maximum

decision logic. A control signal, to choose between a scalar or a SIMD operation type, is included in the maximum control signals, which are propagated to the next pipeline stage and to the forwarding path.

5.2.3 Saturation logic

In order to correctly implement the saturation detection logic for the vector-vector and vector-

scalar addition and subtraction instructions, the arithmetic results must be obtained from the cor-

responding adders. Hence, the inclusion of the required logic would further increase the critical

path of the EX stage. To overcome this issue, the saturation logic is moved to the MEM stage and

to the corresponding forwarding lines, adopting the same solution of the maximum instruction.

Since the ALU result, both operands and the SIMD type are already propagated for the maximum

decision logic, it is only necessary to include a saturation enable signal. To implement the detec-

tion logic, the signs of both operands and of the computed result are taken into account. An 8-bit result is required to saturate at the lowest and highest values, i.e. at −128 (80h, 10000000b) and at 127 (7Fh, 01111111b), respectively. In other words, the saturation is implemented by preventing a wrap-around situation (7Fh → 80h, and vice-versa), which never occurs when the operands have opposite signs. It only occurs when the operand signs match and differ from the result's sign.

Table 5.1: Truth table for saturation detection based on the MSBs of the operands and the result. The two occurrences that activate the saturation are marked with an asterisk.

Sa  Sb  Sr  Saturate
 0   0   0      0
 0   0   1      1  *
 0   1   0      0
 0   1   1      0
 1   0   0      0
 1   0   1      0
 1   1   0      1  *
 1   1   1      0

Table 5.1 represents the saturation detection procedure based on the MSBs of the vector

element operands and the corresponding result. The detection condition is given by Eq. 5.1. The


saturation value takes the same sign as operand B and fills the remaining bits with the inverted sign, as defined in Eq. 5.2.

Saturate_i = (S_A[i] xnor S_B[i]) and (S_B[i] xor S_R[i])    (5.1)

R[i] = S_B[i] ¬S_B[i] ¬S_B[i] ¬S_B[i] · · · ¬S_B[i]    (5.2)

In both equations, i represents the number of the vector element. Hence, SA[i] represents the

MSB of element i in vector operand A. Finally, the saturation logic implementation is illustrated in

Fig. 5.6.

Figure 5.6: Schematic of the saturation detection logic, paired with an enable signal, providing the selection between the ALU result and the saturated value.
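The following C sketch applies Eqs. (5.1) and (5.2) to a single 8-bit element; it is a behavioural model of the detection logic, not the hardware implementation.

#include <stdint.h>

/* Sketch of the saturation logic for one 8-bit element: 'result' is the
 * raw adder output. Saturation fires when both operand sign bits match
 * and differ from the result's sign bit, and the saturated value takes
 * the sign of operand B followed by its complement (80h or 7Fh). */
static uint8_t saturate(uint8_t a, uint8_t b, uint8_t result)
{
    unsigned sa = a >> 7, sb = b >> 7, sr = result >> 7;
    unsigned overflow = (unsigned)(!(sa ^ sb)) & (sb ^ sr);  /* Eq. (5.1) */
    uint8_t  sat_value = sb ? 0x80u : 0x7Fu;                  /* Eq. (5.2) */
    return overflow ? sat_value : result;
}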

5.2.4 SIMD branch condition assertion

The SIMD conditional branches were implemented by replicating the condition evaluation logic

and by modifying it to separately evaluate all the elements in the vector. In order to implement

the different condition assertions in parallel, every partial condition is evaluated at the same time, i.e., the sign and equal-to-zero evaluations. Depending on the branch condition, one or both flags are used to assert the condition for each vector element.

As defined in the last chapter, both disjunctive and conjunctive conditions are used in SIMD

branches. Therefore, the existing branch conditions are also implemented with disjunctive asser-

tions, i.e., the condition must be valid for at least one vector element. These are implemented

by taking the condition validity bits from each vector element and using an OR gate to obtain the condition.

The new branch condition (mask), that was defined in the last chapter to allow conjunctive

assertion, takes the MSB of every vector element and verifies if they are all equal to zero. This is

performed by inverting the bits and using an AND gate to obtain the condition validity.
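A behavioural C sketch of this conjunctive (mask) assertion, for the default 16-element configuration, is given below.

#include <stdint.h>
#include <stdbool.h>

#define N 16  /* 8-bit elements per 128-bit vector (default configuration) */

/* Sketch of the mask-branch (BMEV) condition: the MSB of every vector
 * element is collected, each bit is inverted, and the results are ANDed,
 * so the condition holds only when all MSBs are zero. */
static bool mask_branch_taken(const uint8_t v[N])
{
    bool all_zero = true;
    for (int i = 0; i < N; i++)
        all_zero = all_zero && !((v[i] >> 7) & 1u);  /* invert the MSB, then AND */
    return all_zero;
}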

Although there is only one conjunctive branch instruction, from the software development point

of view, it is possible to implement all the other SIMD disjunctive conditions in a corresponding

conjunctive assertion. This can be performed by preceding the SIMD mask branch with a SIMD

compare instruction. Hence, explicitly implementing any branch condition with a conjunctive as-

sertion.


The branch target address is calculated within the ALU from the scalar operands provided in the instruction. This is performed in parallel with the condition assertions.

5.2.5 Vector-sized memory access

Non-SIMD load /store instructions provide memory transfers with byte, halfword or word sizes

(i.e. 8, 16 and 32 bits). Since the SIMD vector can be greater than the processor scalar word

size, a new vector transfer size option is implemented. This vector size is transparently considered

by the load /store instructions, so that the instructions can work independently of the predefined

vector size. The only required modifications are the inclusion of a new VECTOR option in the transfer

size, and the already mentioned memory alignment procedure extension to support vector-sized

alignments. No further changes are required to implement the SIMD load /store instructions, since

the remaining processor structure is already adapted to support SIMD operations with vectors.

5.3 Adaptation of the decode unit

After the described modifications to the processor data-path in order to support vectorized op-

erations, the final step is to provide the correct control signals based on the instruction decoding.

Since the encoding of most MB-LITE type A arithmetic instructions (ADD, RSUB and CMP) has an unused (or partially used) function field, the same opcode that was used for the original scalar instructions was assigned to the new SIMD arithmetic instructions.

Considering the scalar CMP instruction as an example, it can be observed that its opcode is the same as the RSUBK instruction's opcode. This is due to the fact that, for a compare operation, a subtraction is initially performed. Hence, the same opcode is used for both instructions to activate a subtraction in the ALU. The difference between the two instructions is the least-significant encoding bit which, in the CMP instruction, is used to activate the compare logic that completes the operation (see Fig. 5.7). Furthermore, since the CMP instruction can take unsigned values, the second least-significant encoding bit is used to indicate whether the operands are to be assumed signed or unsigned.

Figure 5.7: The compare and subtraction instructions use the same opcode to enable a subtraction in the ALU. The LSB of the instruction activates the compare logic at the ALU output, for the compare instruction. This greatly simplifies the decoding and control logic.

By following the same principle, the subsequent encoding bit is reserved to encode the maximum instruction (MAX), which uses the same compare logic and can also take unsigned operands by using the same encoding bits as the CMP instruction. The decision logic is activated by the third encoding bit.

Fig. 5.8 depicts the encoding field reservation policy for arithmetic instructions.

Function field layout:
Bit:    11-6      5    4-3     2    1    0
Field:  UNUSED    S    SIMD    M    U    C

Figure 5.8: Encoding field reservation for SIMD arithmetic instructions. S - Saturation, SIMD - SIMD type, M - Maximum, U - Unsigned, C - Compare.

Bits 3 and 4 encode the SIMD operation type in the addition, subtraction, compare and maximum instructions, as demonstrated in Table 5.2. The scalar type represents the previously existing scalar instructions. In this mode, all other encoding bits are ignored, with the exception of the lower two, corresponding to the CMP instruction. Finally, encoding bit 5 activates the saturation logic for the vector-vector and vector-scalar ADD/RSUB instructions; otherwise it is ignored.

Table 5.2: SIMD type encoding for function field bits 3 and 4.

SIMD type        Code
scalar            00
vector-vector     01
vector-scalar     10
inner-vector      11
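For illustration purposes only, the following C sketch decodes the function field according to Fig. 5.8 and Table 5.2; the structure and names used here are assumptions of this sketch rather than the actual decode-unit signals.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative decoding of the arithmetic function field: bit 0 - compare,
 * bit 1 - unsigned, bit 2 - maximum, bits 4:3 - SIMD type, bit 5 - saturation. */
enum simd_type { SCALAR = 0, VECTOR_VECTOR = 1, VECTOR_SCALAR = 2, INNER_VECTOR = 3 };

struct arith_ctrl {
    bool compare, is_unsigned, maximum, saturate;
    enum simd_type type;
};

static struct arith_ctrl decode_function_field(uint32_t func)
{
    struct arith_ctrl ctrl;
    ctrl.compare     = (func >> 0) & 1u;
    ctrl.is_unsigned = (func >> 1) & 1u;
    ctrl.maximum     = (func >> 2) & 1u;
    ctrl.type        = (enum simd_type)((func >> 3) & 3u);
    ctrl.saturate    = (func >> 5) & 1u;
    return ctrl;
}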

With the above options, it was possible to re-use most of the decoding structures already

implemented in the processor, except for a few control signals that had to be generated from

the added function bits. These choices also provide added flexibility for future extensions. For

instance, adding saturation or forcing unsigned values to all the addition/subtraction SIMD opera-

tions.

The encoding of the vector-sized shift instruction (SLLV) takes the existing shift right logi-

cal instruction (SRL) as base, i.e., uses the same opcode with a slightly different encoding field.

Therefore, the lower two bits are changed from 01b to 10b. Although only one shift instruction was

implemented, there is enough space in the encoding structure for future implementations of other

shift instructions.

The scalar branch instructions use the destination register identification field to encode the

different condition types. After analysing these 5 bits, it can be observed that the lower 3 bits

encode the 6 branch conditions (with two unused codes) and the fifth bit encodes the delay option.

Hence, the fourth bit can be used to distinguish the SIMD branch instructions from the scalar branch

instructions. As defined before, the existing conditions were ported to a disjunctive assertion

topology and the mask branch instruction (BMEV) takes an unused 3-bit function code, 111b.

In contrast with the previous definitions, unused opcodes were assigned to the SIMD


load /store instructions (LV/SV). The assigned opcodes are consecutive (the two lower bits of the

opcode are set to 11b) to those corresponding to the existing scalar load /store instructions, allow-

ing for an easier decoding of the instruction. In a similar way, the move-to-vector instruction (MTV)

was also assigned an unused opcode (010100b). Due to the absence of a similar instruction in the

original ISA, new decoding logic had to be added to the processor decode unit, to generate the

appropriate signals.

For more detailed information, Appendix A presents a list of all the implemented instruc-

tions in the proposed ISA extension, as well as the corresponding instruction encoding, a brief

description and the pseudocode of the operation.

The final ASIP architecture, with the above described modifications, is presented in Fig. 5.9.

Figure 5.9: Final ASIP architecture description. The new modules are presented in blue. The new instructions' decoding logic is included in the Instruction Decode/Interrupt Handle block; only the control signals were added. The SIMD addition, subtraction, compare, shift and move instructions are included in the SIMD ALU block. Notice the replication of the saturation detection and the maximum decision logic to the forwarding lines.

5.4 GCC back-end extension

Writing programs directly in machine language (i.e. byte code) is not practicable and would be

a serious limitation to the proposed processor’s usability. A compiler is thus a fundamental tool,

because it not only eases the task of writing programs, but also allows for some code optimizations

and reduces the interdependency between the program and the processor’s implementation, ide-

ally making low-level changes transparent to the programmer. In the scope of the HELIX project,

and parallel to the work described in this document, a compiler is currently being developed to

provide full support to the developed ASIP. In this section, the current compiler development

stage is presented.

Figure 5.10: Compiler structure.

A compiler is usually organized as a pipeline with three main modules: i) Front-End; ii) Middle-End; and iii) Back-End, as illustrated in Figure 5.10. The front-end is in charge of reading the

programmer’s code and creating the Abstract Syntax Tree (AST), a data structure representing

language constructs which are the input for the next stage. Fig 5.11 illustrates an example AST.

The middle-end, usually highly coupled to the front-end, is an intermediate step able to identify structures to undergo optimization, taking into account the intrinsics of the compiler and other factors. Finally, the back-end is responsible for machine code generation using known processor properties, such as the available registers and the instruction encoding format.

Figure 5.11: AST corresponding to: print((3 + variable) × 8.2)

In the context of this work, the compiler could be simply implemented as an assembler, al-

lowing for programs to be written in the target processor’s Assembly Language (ASM). In fact,


given the need to perform extensive tests on the implementation for each introduced feature dur-

ing the development process and obtain results for performance analysis, this was the adopted

approach. To this end, every new instruction implemented in the ASIP has a corresponding op-

code and mnemonic. Since the ASIP implementation was based on a subset of the MicroBlaze

ISA [28], it was only necessary to replicate the GCC MicroBlaze back-end structure, in order to allow the new instructions to be included.

Even though, at the present development stage, it is already possible to use the assembler, the other stages of the compiler are currently being addressed. The front-end will allow for greater abstraction and expressiveness when writing programs, since it allows programmers to use high-level programming languages. The middle-end requires more attention, since it optimizes and performs various validation checks on the code using the AST, allowing the specific aspects of the underlying ISA to be leveraged.

As mentioned above, the compiler back-end currently provides full support for the proposed

ISA. The new instructions can be used through inline assembly mnemonics embedded in the C

source code, providing a tool to accelerate algorithms with the specialized SIMD instructions.
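As a purely hypothetical usage sketch, the fragment below shows how one of the new mnemonics can currently be reached from C through inline assembly; the fixed register numbers and the absence of operand constraints are simplifying assumptions, not the back-end's actual interface.

/* Hypothetical sketch: the "maxvv" mnemonic belongs to the proposed ISA,
 * but the register choice and missing constraints below are assumptions. */
static inline void vector_max_step(void)
{
    __asm__ volatile ("maxvv r19, r19, r3");  /* r19 = max(r19, r3), element by element */
}

Once the planned intrinsics layer is in place, such raw mnemonics are expected to be hidden behind function-like wrappers mapped directly to the SIMD instructions.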

The final goal is to support the compiler’s full pipeline. The front-end will support the well-

known C programming language and the back-end supports the proposed ISA. The main body

of work in this development process is anticipated in two areas: i) how to expose the processor’s

improved functionality to programmers (e.g. SIMD operations), and ii) what optimizations to pro-

vide. To that end, the interface is going to be explicitly exposed, by creating processor intrinsics,

which are often used to implement vectorization on programming languages that do not explic-

itly support it. This is usually performed by implementing functions that are directly mapped to

SIMD instructions, eliminating the need to use inline assembly mnemonics. On the other hand,

advances to the compiler’s capabilities will be performed, both in the front-end (with syntax cues

for special behavior) and in the middle-end (with optimizations based on code analysis).

5.5 Summary

This chapter described the modifications made to the original MB-LITE architecture to support the new ISA. The first section presented the modifications to the register file and memory access structures. The second and third sections described the modifications to the execution and decode units, respectively, and presented the resulting ASIP architecture. The last section provided a brief and general description of the compiler structure, as well as the procedures that allowed the extension of the GCC MicroBlaze back-end structure in order to include the new ISA.


6. Multi-core platform architecture

Contents
6.1 Overview ... 60
6.2 Memory hierarchy ... 61
6.3 AMBA 3 AHB-Lite protocol ... 62
6.4 Shared bus arbitration ... 64
6.5 DMA controller ... 65
6.6 Hardware mutex ... 67
6.7 System implementation ... 68
6.8 Summary ... 68


6.1 Overview

6.1.1 Multi-core processing structure

Besides the fine-grained SIMD parallelization model described in the previous chapters, a

platform that also exploits a coarse parallelization model was developed, by implementing several identical processors in an SMP structure. Hence, to allow the parallel implementation of different

algorithms, the developed multi-core architecture (illustrated in Fig. 6.1) is composed of: a shared

memory element, to store the data sets; multiple processing cores; and a mutex circuit, to handle

core synchronization. All elements are interconnected by a high-bandwidth communication in-

terface. To reduce the amount of data that is transferred between the master and the processing

cores, and with the envisaged shared memory model (proposed in [43]), the proof-of-concept that

is presented here does not make use of the master core to manage the work queue and gather

the results. All processing cores initiate the algorithms’ execution independently and at the same

time, by first loading the instruction and data memories with the compiled implementation and

data sets. Furthermore, this proof-of-concept architecture does not include any cache level in the

cores, thus alleviating the need for any explicit coherence mechanism.

Hence, to perform the required algorithm, each processing core is composed of the specialized

SIMD ASIP presented in Chapter 5, an instruction memory, a local (scratchpad) memory, a DMA

controller and a network interface (see Fig. 6.2).

Figure 6.1: General overview of the multi-core architecture.

Figure 6.2: Schematic of each individual processing core.

6.1.2 Data communication

The proposed architecture supports several different types of interconnection mechanisms

(e.g., shared bus, ring/mesh, network on chip, etc.) depending on the implemented interface

structure. For the considered implementation, a shared bus was adopted, together with a specially designed arbiter to manage the bus access contention. The main reason for this option

arises from the need to adopt a simple communication infrastructure, requiring as little hardware


resources as possible, in order to allow the implementation of as many processing cores as pos-

sible. Furthermore, the viability of this solution has already been demonstrated [43] for this specific application domain.

The adopted bus protocol is AMBA 3 AHB-Lite [41] compatible, with multi-layer support, re-

quiring a minimum of two clock cycles to transmit data through the bus: the first to request the

access to the bus and the second to transmit the data. Naturally, whenever the bus is unavailable

(busy) when the access request is first made, additional cycles are required. To minimize the

data transfer time, the bus arbiter also supports a burst mode, where a single bus request is used

to transmit multiple data packets. In this case, a minimum of n + 1 clock cycles are required for

transmitting n data packets.

Furthermore, the design of the processing cores took into account the appropriate techniques

to reduce the contention in the system bus. For such purpose, the proposed architecture uses

a local (scratchpad) memory to store temporary and runtime data. Moreover, a DMA controller

was also provided to handle most accesses to the main memory (e.g., prefetching data to the

scratchpad memory). This reduces the number of bus requests and allows hiding most of the

communication time while the computation is taking place. Although polling mechanisms can be used, the DMA controller asserts an interrupt line of the processor whenever a given copy request is finished. This allows the implementation of a suitable interrupt routine that handles all the data

prefetching.

To simplify the programming task (and compiler development), the processor was connected to

a memory-mapped I/O organization. Thus, accesses to the DMA registers (for configuring the DMA transfers and checking their status) and to the scratchpad and main memories are performed by using the usual load/store instructions.
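To illustrate this memory-mapped access model, the C sketch below configures a hypothetical DMA copy from the shared memory to the scratchpad; every base address and the register layout shown here are assumptions, and only the mechanism (plain volatile loads/stores) follows the text.

#include <stdint.h>

#define SCRATCHPAD_BASE  0x00010000u   /* private scratchpad memory (assumed address)   */
#define SHARED_MEM_BASE  0x80000000u   /* shared main memory (assumed address)          */
#define DMA_BASE         0xC0000000u   /* DMA configuration registers (assumed address) */

struct dma_regs {
    volatile uint32_t src;    /* source address      */
    volatile uint32_t dst;    /* destination address */
    volatile uint32_t len;    /* transfer length     */
    volatile uint32_t ctrl;   /* start/status flags  */
};

#define DMA ((struct dma_regs *)DMA_BASE)

/* Prefetch a block of the shared data set into the local scratchpad. */
static void dma_prefetch(uint32_t shared_offset, uint32_t local_offset, uint32_t length)
{
    DMA->src  = SHARED_MEM_BASE + shared_offset;
    DMA->dst  = SCRATCHPAD_BASE + local_offset;
    DMA->len  = length;
    DMA->ctrl = 1u;   /* start; completion is signalled by polling or by the interrupt line */
}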

6.2 Memory hierarchy

Two levels of memory are used in the proposed architecture. The first level is composed of a private scratchpad memory, exclusively owned by each ASIP, with direct access and single-cycle latency. The second level is a shared main memory, concurrently accessed by all cores through an arbitrated shared bus and with multiple-cycle latency. The scratchpad memory

is used for runtime storage of program and temporary data, while program inputs and outputs are

stored in the shared memory. To allow an I/O memory mapped access, specific address spaces

are assigned to the private and shared memories. Since the memory map is the same for all

processing cores, convenient abstraction mechanisms are provided for memory access, making

it independent of the core in question (see Fig. 6.3).

At this point, it is important to recall the processing model that is adopted by the considered class of algorithms. In particular, the same algorithm is executed in the several Central Processing Units (CPUs), applying dual-level parallelism to process the large dataset, which corresponds to the significant number of query sequences that have to be aligned with the reference sequence. Hence, the loaded program can be the same for all processing cores, as any intermediate or runtime-specific data is stored in each core's private memory.

Figure 6.3: Diagram representing a memory-mapped interconnection for a multi-core system.

Accordingly, both the ASIP core and the DMA controller have direct access to the private

scratchpad memory. To simplify the architecture, the scratchpad is implemented as a dual-port

memory module, allowing simultaneous access by the ASIP core and the DMA. Any conflicts are managed by the programmer, by making sure that the same memory locations are not accessed at the same time by both components. This is easily achieved with the polling and interrupt mechanisms allowed by the DMA controller.

6.3 AMBA 3 AHB-Lite protocol

The implemented interconnection structure is based on ARM’s AMBA 3 AHB-Lite [41] pro-

tocol. This interface was implemented by modifying the original MB-LITE interface (based on the Wishbone standard), in order to translate the ASIP I/O to AHB-Lite compatible signals. The bus interconnection logic consists of a Decoder, attached to the Master interface, that monitors the transfer address, selects the corresponding Slave and then drives a Multiplexer block to route the Slave's output. All Slaves are therefore arranged in an I/O-mapped organization. Fig. 6.4 il-

lustrates the signal interfaces for both the AHB-Lite Master and Slaves. The Decoder compares

the address provided by the Master with a predefined memory map and activates the HSELx sig-

nal of the corresponding Slave. The signal is then latched and used as the select signal in the

corresponding Multiplexer, to route the Slave’s response (see Fig. 6.5).

The communication characteristics and capabilities of the implemented bus were properly

adapted to the actual needs of the ASIP. As such, it only features the single and the undefined-length incrementing burst transfer types, leaving out the fixed-size and wrapping burst transfer types, as they are not used in the target applications.

Figure 6.4: Signal interfaces for the AHB-Lite Master and Slave components.

Each transfer consists of two phases: address

phase and data phase. The address phase lasts only one single clock cycle, where the Master

drives the address bus, which is sampled by the Slave. The data phase can take any number of

cycles, depending on the time the Slave requires to complete the operation. The data phase is

extended through explicit insertion of wait states by the Slave, by setting the HREADY signal value

to LOW. However, the targeted Slave can not request the address phase to be extended. An

address phase can only be extended through the extension of the previous (and simultaneously

occurring) data phase.

Figure 6.5: AHB-Lite bus interconnection scheme between one Master and multiple Slaves.

The implemented ASIP only supports single transfers and does not generate any control sig-

nals except for the HADDR, HWRITE and HRDATA signals. Therefore, the wrapper interface generates

all the unmentioned signals, with the required default and fixed values, in order to comply with the

protocol. The HREADY signal is used to stall the core, through the ASIP dmemi.enai signal, whenever the transfer has to wait. In order to use the memory modules provided in the MB-LITE package [31],

a Slave interface is also implemented to sample the control signals and implement the wait state

generation logic. On the other hand, only the HSELx, HADDR, HWRITE, HWDATA and HRDATA signals

are used, hence all the other input signals are ignored. HRDATA outputs the read data and the


HREADYOUT signal is used to introduce wait states.

As described above, the AHB-Lite protocol only supports one Master. However, by introducing the Multi-layer AHB interconnection scheme, a multi-master system can be implemented. Assuming that the above bus architecture constitutes one layer, i.e., one Master connected to any number of Slaves, the multi-layer AHB consists of a number of layers connected through an interconnection matrix. This matrix enables parallel access paths between multiple Masters and Slaves, the latter being shared among all layers. To achieve this scheme, no changes are required to the AHB-Lite layers and components. It is only necessary to include a Multiplexer block for each Slave, controlled by a local arbiter, similar to those used by the Masters. Thus, each of those arbiters becomes the arbitration point of the corresponding Slave and is only needed when more than one Master requests access to the same Slave at the same time.

In order to keep Slave accesses atomic, the local arbiter drives the HREADY signal to introduce wait states for any Masters that are not granted access in the current transfer. The arbitration scheme can follow either a round-robin policy, changing upon every transfer or every burst, or a fixed-priority scheme, obtained by assigning different priorities to each layer. The chosen arbitration scheme is further discussed in the next section.

6.4 Shared bus arbitration

Since no AHB-Lite Master requires higher priority than any other, only a no-starvation guarantee is required from the arbiter, i.e., every Master waits at most a fixed number of time slots (equal to #Masters − 1) to access the bus. Therefore, the chosen arbitration protocol was a round-robin token-passing scheme, where at every time slot a new token is generated and a different Master is granted access to the shared resource.

Table 6.1: Truth table representing the priority function implemented by the arbiter's priority blocks.

EN | IN[0] IN[1] IN[2] IN[3] | OUT[0] OUT[1] OUT[2] OUT[3]
 0 |   X     X     X     X   |   0      0      0      0
 1 |   1     X     X     X   |   1      0      0      0
 1 |   0     1     X     X   |   0      1      0      0
 1 |   0     0     1     X   |   0      0      1      0
 1 |   0     0     0     1   |   0      0      0      1

The arbiter, whose design is based on [44], is composed of a ring counter and as many instantiations of a priority function as there are Masters. Each priority block implements the priority-encoder function described in Table 6.1 and given by Eq. 6.1.

out[0] = EN ∧ in[0]
out[1] = EN ∧ ¬in[0] ∧ in[1]
out[i] = EN ∧ ¬in[0] ∧ · · · ∧ ¬in[i−1] ∧ in[i]          (6.1)

As can be observed, the priority block gives the highest priority to the Master connected to in[0], then to in[1], and so on. The priority blocks are sequentially activated by the ring counter's token, one at a time. Therefore, by rotating the connection of the Masters in each priority block, the round-robin scheme is implemented (see Fig. 6.6). For instance, in a system with n Masters, Master 1 is connected to in[1] in block 0, to in[0] in block 1, to in[n−1] in block 2, and so on.

Figure 6.6: Arbiter block diagram featuring the Priority Block (PB) interconnection logic. Each PB implements the same function (see Table 6.1); the req[i] signals rotate through each PB's input signals.
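The following C sketch is a behavioural illustration of this arbitration logic, not the actual hardware description: it models one priority block (the function of Table 6.1 and Eq. 6.1) and the rotation of the request inputs that yields the round-robin behaviour, for an assumed system with four Masters.

```c
#include <stdbool.h>

#define N_MASTERS 4

/* One priority block (Table 6.1 / Eq. 6.1): when enabled by the ring
 * counter's token, grant the lowest-indexed asserted input. */
static void priority_block(bool en, const bool in[N_MASTERS], bool out[N_MASTERS])
{
    bool found = false;
    for (int i = 0; i < N_MASTERS; i++) {
        out[i] = en && !found && in[i];
        if (in[i])
            found = true;
    }
}

/* Round-robin arbitration: only the priority block selected by the token is
 * enabled, and Master m drives its input (m - token) mod N_MASTERS, so the
 * highest-priority Master rotates with the token.  Returns the granted
 * Master, or -1 if no request is pending. */
static int arbitrate(int token, const bool req[N_MASTERS])
{
    bool in[N_MASTERS], out[N_MASTERS];
    for (int i = 0; i < N_MASTERS; i++)
        in[i] = req[(i + token) % N_MASTERS];   /* rotated request inputs */
    priority_block(true, in, out);
    for (int i = 0; i < N_MASTERS; i++)
        if (out[i])
            return (i + token) % N_MASTERS;     /* map back to the Master index */
    return -1;
}
```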

The arbiter also features an ACK signal, to allow the Masters to signal the end of the time slot

and activate the ring counter. The time slot size depends on the bus topology: for instance in

the above described multi-layer AHB, the Master can signal the arbiter at the end of each single

transfer or at the end of a burst transfer.

6.5 DMA controller

Currently, a considerable number of IP and open-source DMA controllers are available; however, they typically feature extensive operation protocols and complex interfaces with multiple channel implementations. For the purpose of this structure, only a simple controller is required. Since the function of the DMA is to perform transfers between the processing core's scratchpad memory and the shared memory, it is simply implemented by two bidirectional channels, controlled by a Finite-State Machine (FSM), that accept requests from the ASIP.

The transfer request is performed in two consecutive phases, in which the ASIP writes two words to the DMA control registers (I/O mapped). In the first phase, the data word contains the number of memory positions to transfer (configurable, 4 bits by default), a bit indicating which channel is the source, and the address of the first position, concatenated into a single 32-bit word, as shown in Fig. 6.7. As can be observed, in the default configuration, the command word only features a 27-bit address, which is enough for the current application, since the shared memory is addressed with, at most, 18 bits. The data word of the second phase contains the address of the first position of the destination channel, which in this case may occupy the entire word, although not all bits are used.

1st phase word:  bits [31:28] #TRANSFERS | bit [27] CH | bits [26:0] SOURCE ADDRESS
2nd phase word:  bits [31:0] DESTINATION ADDRESS

Figure 6.7: Default DMA command configuration. The core writes two 32-bit command words toconfigure the DMA. The CH field indicates which channel is the source (0 or 1).
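As an illustration, the following C sketch shows how a core could compose and issue the two command words of Fig. 6.7 in the default configuration. The DMA_CMD_REG address is a hypothetical placeholder for the I/O-mapped DMA control register defined by the core's memory map.

```c
#include <stdint.h>

/* Hypothetical I/O-mapped address of the DMA command register; the actual
 * address is defined by the core's local memory map. */
#define DMA_CMD_REG ((volatile uint32_t *)0x00080000u)

/* Issues a DMA transfer request with the two consecutive command words of
 * Fig. 6.7 (default configuration: 4-bit transfer count, 1-bit source
 * channel, 27-bit source address). */
static void dma_request(uint32_t n_transfers,   /* number of positions (<= 15) */
                        uint32_t src_channel,   /* 0 or 1: which channel reads  */
                        uint32_t src_addr,      /* 27-bit source address        */
                        uint32_t dst_addr)      /* destination address          */
{
    uint32_t cmd1 = ((n_transfers & 0xFu) << 28) |
                    ((src_channel & 0x1u) << 27) |
                    (src_addr & 0x07FFFFFFu);

    *DMA_CMD_REG = cmd1;      /* 1st phase: count, channel and source address */
    *DMA_CMD_REG = dst_addr;  /* 2nd phase: destination address */
}
```

After issuing the request, the core can either poll the DMA state flag or wait for the interrupt flag to detect the end of the transfer.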

The DMA controller interface is composed of one request input and two memory connections (the same as in the ASIP, i.e., the dmemo and dmemi signals) to the scratchpad and shared memories. The interface also outputs one state flag and one interrupt flag.

Regarding the internal architecture of the DMA controller, it is composed of two channels, with an intermediate register to latch the transferred data, and the FSM. Each DMA channel is implemented by a transfer counter, whose current count is combined with the first position's address to generate the memory address. During the channel configuration phase, the first address and the desired transfer count are stored in the channel registers. Then, during regular operation, the count is compared with the desired number of transfers to detect the end of the operation. To comply with multiple-cycle bus transfers, the DMA can be stopped (wait-state) through any of the channels by an enable signal (as in the ASIP, with the dmemi.enai signal) until the bus transfer completes. An architectural overview of the DMA controller is presented in Fig. 6.8.


Figure 6.8: Block diagram representing the DMA controller (a) and the DMA channels (b) archi-tectures.

By default, the FSM (whose flowchart is depicted in Fig. 6.9) is in the idle state (READY), waiting for a request (cmd #1). After the first request, it expects the second command (cmd #2); otherwise, the FSM resets and reverts to the idle state. The channels are then configured based on the received information, i.e., one channel is set for reading, the other for writing, and the counters are configured for the transfers (INIT state). The transfers are then performed (BUSY state) until their completion (DONE state). At this point, the interrupt is flagged, the counters and registers are reset, and the DMA controller reverts to the idle state.

Figure 6.9: State diagram of the DMA’s FSM. cmd #1 and cmd #2 represent the two mandatoryconsecutive stores to the DMA.
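For clarity, the control flow of Fig. 6.9 can be summarised by the following behavioural C sketch, evaluated once per clock cycle. It is a simplified model under assumed signal names (cmd_written, bus_ready), not the actual hardware description of the controller.

```c
#include <stdbool.h>
#include <stdint.h>

enum dma_state { READY, INIT, BUSY, DONE };

/* Simplified state of the controller; the real channels also latch the
 * source/destination addresses carried by the two command words. */
struct dma_ctrl {
    enum dma_state state;
    int      cmds_received;   /* 0, 1 or 2 command words latched */
    uint32_t remaining;       /* transfers still to perform */
    bool     irq;             /* interrupt flag raised in DONE */
};

static void dma_fsm_step(struct dma_ctrl *d, bool cmd_written,
                         uint32_t n_transfers, bool bus_ready)
{
    switch (d->state) {
    case READY:                              /* idle, waiting for cmd #1 and cmd #2 */
        if (cmd_written)
            d->cmds_received++;
        else if (d->cmds_received == 1)
            d->cmds_received = 0;            /* cmd #2 did not follow: reset */
        if (d->cmds_received == 2)
            d->state = INIT;
        break;
    case INIT:                               /* configure channels and counters */
        d->remaining = n_transfers;
        d->state = BUSY;
        break;
    case BUSY:                               /* perform transfers, honouring wait states */
        if (bus_ready && d->remaining > 0)
            d->remaining--;
        if (d->remaining == 0)
            d->state = DONE;
        break;
    case DONE:                               /* flag interrupt, reset, return to idle */
        d->irq = true;
        d->cmds_received = 0;
        d->state = READY;
        break;
    }
}
```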

The DMA controller is regarded as a bus Master in the Multi-layer AHB interconnection. Therefore, a Master interface, entirely similar to the one implemented for the ASIP but supporting incrementing burst transfers, was implemented. This allows the DMA controller to maintain the connection to the Slave throughout the entire operation, without releasing the bus. The connection from the DMA to the ASIP is made directly, through an address decoder, as will be described in Section 6.7.

6.6 Hardware mutex

To allow an efficient and coordinated cooperation between the several cores, the multi-core

architecture includes a register-based mutex circuit, similar to the one proposed in [39]. Each

register supports two states: locked by core k and unlocked. The operation of the mutex circuit is

as follows: when one core tries to read from an unlocked mutex, a value of ’1’ is returned and the

mutex immediately locks. After that, all other cores that read from that mutex receive the value

’0’, until the first core unlocks it by writing the value ’1’. All the write and read operations are

atomic, assured by the shared bus logic, which only allows one core to access the mutex circuit at

a time. Each mutex register stores the core identification, provided by the bus arbiter grant signal,

concatenated with a state bit. Therefore, in lock or unlock requests, the upper part of the register

is checked against the grant signal, performing the correct request if both match.
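From the software perspective, a core would use the mutex circuit as sketched below. The register base address is a hypothetical placeholder for the range assigned in the shared bus memory map.

```c
#include <stdint.h>

/* Hypothetical bus address of mutex register 'id'; the real range is given
 * by the shared bus memory map. */
#define MUTEX_REG(id) ((volatile uint32_t *)(0x00040000u + ((id) << 2)))

/* Reading an unlocked mutex returns 1 and locks it atomically; reading a
 * locked mutex returns 0.  Spin until the lock is acquired. */
static void mutex_lock(unsigned id)
{
    while (*MUTEX_REG(id) == 0)
        ;   /* another core holds the mutex: keep polling */
}

/* The owning core releases the mutex by writing 1 to it; the circuit checks
 * the stored core identification against the bus grant signal. */
static void mutex_unlock(unsigned id)
{
    *MUTEX_REG(id) = 1;
}
```

A core typically calls mutex_lock() before touching a shared data structure and mutex_unlock() right after, relying on the shared bus logic to keep each individual read and write atomic.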

The mutex circuit is implemented as a single peripheral, with its control registers mapped to

an address space range provided in the shared bus memory map. The number of registers is

configurable, according to the implementation requirements, and the register size (in bits) is equal

to the number of bus masters plus the lock state bit. A block diagram illustrating the mutex circuit


Figure 6.10: Block diagram representing the mutex circuit architecture. Notice the translation of the memory-enable and write-enable signals into lock and unlock requests, hence providing an interface compliant with the memory-mapped I/O organization.

architecture is shown in Fig. 6.10.

To connect the mutex circuit as an AHB Slave, a simple interface was provided to translate the protocol signals into the I/O-compatible signals of this circuit.

6.7 System implementation

Given all the components described above, the multi-core platform was implemented as follows. As introduced in Fig. 6.2, each processing core is composed of the ASIP architecture described in Chapter 5, directly connected to an instruction memory and an address decoder. The adopted memory map of the decoder provides convenient connections to a dual-port scratchpad memory, a DMA controller and an AHB-Lite Master interface. The DMA controller is connected both to the free port of the local scratchpad memory and to the shared bus (through a dedicated AHB-Lite Master interface).

The shared bus adopted a reduced version of the AHB-Lite protocol, with a Multi-layer interconnection scheme. The main shared memory and the corresponding mutex circuit are connected as AHB-compatible Slaves. In order to allow the STDOUT stream to be easily used in the software development phase, the original character device from MB-LITE can still be used. To that end, it can either be paired with the main shared memory, with the help of a switch in the testbench file, or be connected to the bus through a Slave interface. The latter option can reuse the same bus interface used for the shared memory. A global schematic of the final multi-core platform is provided in Fig. 6.11 (without the character device).

Figure 6.11: Block diagram of the multi-core platform, featuring the processing cores, the AHB-Lite interconnection bus and the mutex circuit.

6.8 Summary

In this Chapter, the implementation of the conceived multi-core platform is presented. The first section presents an overview of the following sections and an introduction to the overall architecture. Then, each individual component, or structure, is described in a separate section, i.e., the

memory architecture, AHB-Lite and bus arbiter implementations, DMA controller and hardware

mutex architectures. The last section wraps up the work’s implementation, by bringing together

the ASIP, with the proposed ISA extension, and the developed multi-core structure.


7. Experimental Evaluation

Contents

7.1 Prototyping framework
7.2 SIMD ASIP evaluation
7.3 Multi-core processing structure evaluation
7.4 Summary


7.1 Prototyping framework

To evaluate the proposed multi-core processing framework, a thorough performance analysis of the complete system was performed. For such purpose, the performance of the implemented ASIP was compared with that of a high-performance GPP (Intel Core i7 950, running at 3.07 GHz, with 6 GB of RAM), by evaluating the relative speedup of the proposed ISA when compared with the Intel SSE2 implementation of the SW algorithm. A vanilla (sequential) implementation of the SW algorithm was used as reference in both cases (see Appendices B.2, B.3 and B.4). Accurate clock cycle measurements of the time required to execute the biological sequence analyses in the proposed platform were obtained by using Modelsim SE 10.0b. On the Intel architecture, cycle-accurate measurements were obtained by using the PAPI library, which provides access to the processor performance counters.

The complete system was then prototyped in a Xilinx Virtex-7 FPGA (XC7VX485T), which is

part of the Xilinx VC707 Evaluation Kit. Furthermore, with future embedded platform integrations

in view, the system was also prototyped in a Xilinx Zynq FPGA (XC7Z020), part of the Xilinx

Zynq-7000 SoC ZC702 Evaluation Kit, and in an Altera Aria II GX FPGA. To synthesize the design and perform the Post-place&Route procedures, the Xilinx ISE 14.4 and the Altera Quartus II platforms were used for the corresponding devices.

To complete the evaluation, an ASIC was synthesized by using a 90nm CMOS technology.

This synthesis was performed by using the Design Vision tool (version E-2010.12-SP4), targeting

the UMC L90 SP standard cell libraries.

The sequence alignment programs that were executed in the ASIP (see Appendices B.2 and B.3) were compiled with GCC 4.6.2 (with the modified back-end), using the optimization flags -O2 (sequential case) and -O (SIMD case), which proved to be the most favorable parametrization for each case. The DNA dataset used for the considered benchmarks is composed of several reference sequences ranging from 128 to 16384 elements and a set of query sequences

several reference sequences ranging from 128 to 16384 elements and a set of query sequences

of length 64. The reference sequences correspond to a random selection of sub-sequences of

the Homo Sapiens chromosome Y genomic contig (NT 011875.12), while the query sequences

were generated by randomly combining reads from run ERR004756 of study ERP000053 (human

DNA). The configuration parameters are the same for all implementations, i.e. the substitution

score matrix uses the values 5 (symbol match) and -4 (symbol mismatch) and the gap penalties

are set to 10 (open cost) and 6 (extend cost).

7.2 SIMD ASIP evaluation

In order to demonstrate the scalability of the implemented ASIP for different SIMD vector sizes,

a specific experimental evaluation was set up, where a single query sequence was aligned to all

the reference sequences. The tested configurations had vector sizes of 32, 64, 128 and 256


bits, all with 8-bit vector elements. The measured number of clock cycles required to execute the

alignment procedures with these configurations was compared against the number of clock cycles

for the sequential version. A plot of the achieved speedups is presented in Fig. 7.1.

Figure 7.1: Variation of the obtained speedup with the SIMD vector size.

From the presented results, it can be observed that the obtained speedup does not scale linearly. This can be explained by the fact that, in the striped implementation of the SW algorithm, the number of performed lazy loops increases with the number of SIMD elements. As can be observed from the values shown in Table 7.1, the number of expected lazy loops increases by an average factor of 1.47 when the number of vector elements is doubled. Also observable is a speedup of about 14x with 32-bit vectors, i.e., four 8-bit elements. This result may seem incoherent, given that the number of cells processed in parallel only increases from 1 to 4 between the sequential version and the 32-bit SIMD version, so the expected speedup would be at most 4x. However, it should be noted that the introduction of the query profile and of the maximum instructions in the SW algorithm already provides a performance increase, by eliminating the need to calculate the substitution score matrix indexes and by reducing the number of branch instructions. This improves the vanilla implementation even before SIMD operations are added.

Table 7.1: Lazy-loop counts, with different vector sizes, for all the reference sequences.

                  VECTOR SIZE
REFERENCE      32      64     128     256
      128     125     189     251     308
      256     294     485     669     823
      512     511     872    1237    1528
     1024     880    1561    2296    2866
     2048    2021    3546    5009    6140
     4096    3284    6339    9057   11166
     8192    6759   12971   18600   22957
    16384   14520   26521   37885   46542


From the obtained results, it was observed that the 128-bit SIMD vector configuration presents

a preliminary speedup of 30.65x against the sequential version. Although the configuration with

256-bit vectors does present a slightly better performance (37.22x speedup), the difference be-

tween both is not significant. This was mainly due to the fact that the vectors doubled in size and

there was only a 17.6% increase in performance. Moreover, it implies a potential cost in hardware

resources. As a consequence, the 128-bit configuration was regarded as the best compromise in

terms of speedup and hardware cost. Furthermore, it was regarded as the best configuration to

ensure a fair comparison with the Intel SSE2 implementation, hence the 128-bit configuration is

herein adopted as base architecture.

7.2.1 ISA evaluation

To better demonstrate the performance improvement of the SIMD ASIP when using the ex-

tended ISA, 10 different query sequences were separately aligned with all the reference se-

quences. Furthermore, to validate the conceived architecture, the ASIP was compared against

the Intel Core i7 processor, which is a state-of-the-art superscalar GPP, capable of multiple in-

struction issue, out-of-order and speculative execution. For this test, both the sequential and the

SIMD versions of the SW algorithm were considered. Table 7.2 presents the average execution time (in clock cycles) required to execute the DNA sequence alignment in the ASIP and in the

Intel Core i7.

Table 7.2: Clock cycles (×10^6) and obtained speedup values.

Sequence |           Intel              |            ASIP
  Size   | Sequential   SSE2   Speedup  | Sequential   SIMD   Speedup
   128   |   0.125     0.014     8.95   |    0.651    0.020    32.30
   256   |   0.251     0.027     9.27   |    1.302    0.042    31.26
   512   |   0.498     0.046    10.77   |    2.604    0.083    31.41
  1024   |   0.959     0.085    11.28   |    5.209    0.165    31.49
  2048   |   1.924     0.183    10.49   |   10.416    0.349    29.86
  4096   |   3.785     0.347    10.91   |   20.834    0.627    33.20
  8192   |   7.586     0.684    11.09   |   41.665    1.386    30.05
 16384   |  15.248     1.360    11.21   |   83.325    2.757    30.22
 Average |                      10.50   |                      31.23

As it can be observed by analyzing the speedup columns in this table, the Intel Core i7

achieves a maximum speedup of 11.3x (average value of 10.5x), while the ASIP achieves a

maximum speedup of 33.2x (average value of 31.2x).

Another interesting comparison can be made by analyzing the number of clock cycles per cell update for both the Intel Core i7 and the ASIP. By dividing the total number of clock cycles (c) by the product of the lengths of the query and reference sequences (m and n, respectively), the number of clock cycles per cell update is obtained as c/(m × n). For the SIMD implementations, the Intel Core i7 achieves a minimum of 1.3 cycles per cell update, whereas the ASIP achieves a best case of 2 cycles per cell update. While it may seem that the ASIP core has a lower performance, it should be noted that the Intel Core i7 uses ILP technologies, such as macro-operation fusion and multiple instruction issue, to achieve up to 6 micro-operations per clock cycle [4], which results, on average, in 2 IPC in the considered benchmarks. However, such an advantage is obtained at the cost of significantly higher hardware and energy resources, making it unsuitable for embedded and portable application domains.
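For reference, the cycles-per-cell-update figure is computed as sketched below; the worked example in the comment uses the 16384-element reference values of Table 7.2.

```c
/* Clock cycles per cell update, c / (m * n), for a single alignment:
 * c is the measured cycle count, m and n the query and reference lengths. */
static double cycles_per_cell(double c, unsigned m, unsigned n)
{
    return c / ((double)m * (double)n);
}

/* Example with values from Table 7.2 (Intel SSE2, 16384-symbol reference,
 * 64-symbol query): cycles_per_cell(1.360e6, 64, 16384) ~= 1.3 cycles/cell. */
```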

7.2.2 Hardware and timing analysis

To evaluate the performance and resource overheads introduced by the proposed extension to

the ISA, the original MB-LITE processor and the conceived ASIP were implemented on a Virtex-

7 FPGA. To ensure a comparison as fair as possible, the multiplier and the barrel-shifter were

deactivated in the original MB-LITE core.

Table 7.3: Hardware specifications for the MB-LITE and for the ASIP, in a Virtex-7 FPGA.

                      Virtex-7    MB-LITE       ASIP
Slice Registers        607200     399 (<1%)    1013 (<1%)
Slice LUTs             303600     929 (<1%)    4566 (1%)
Slice LUT-FF pairs          –     269 (26%)     960 (20%)
Block RAM/FIFO           1030       2 (<1%)       6 (<1%)
BUFG/BUFGCTRLs             32       1 (3%)        1 (3%)
Max. Freq. (MHz)            –     226.2         187.0

Table 7.3 presents the FPGA hardware resources and the maximum operating frequencies

for both the MB-LITE and the ASIP cores. From these results it is observed that the number

of LUTs increased approximately 5 times, mostly due to the extra logic that is required for the

parallel arithmetic operations and the additional multiplexing logic. The number of registers and

RAM/FIFO blocks increased almost 3 times, due to the increase in the register size from 32 to

128 bits. Despite this increase, the total resources occupied by the ASIP on the FPGA is still less

than 1%.

As expected, the maximum operating frequency decreased by about 40 MHz (17%), which

demonstrates that the ISA extension had some (but still reduced) impact on the critical path of

the core. It can also be verified by the Post-place&Route report that the clock period is limited by

the added multiplexing logic that is required to implement the SIMD instructions, as well as by the

non-SIMD 32-bit adder.

To conclude, it can be observed that the proposed ASIP provides a speedup factor which is 3

times higher than the speedup obtained with the Intel Core i7, although its operating frequency is

lower than 200 MHz on the adopted FPGA, i.e., it runs only at 1/15 of the clock frequency of the

Intel Core i7. This result demonstrates that the proposed extensions that were introduced in the

ASIP are well tuned for operations frequently used in sequence alignment algorithms.


7.2.3 ASIC synthesis

With a possible ASIC implementation in view, preliminary results were obtained by synthesizing the ASIP using a 90nm CMOS technology. Accordingly, a maximum operating frequency of 769 MHz was estimated. Then, a frequency scalability study was conducted. The obtained results, in terms of total circuit area, operating frequency and dynamic power consumption for different period constraints, are presented in Table 7.4 and plotted in Fig. 7.2. These results were obtained in an iterative process, starting at the same minimum period obtained in the Virtex-7 FPGA and ending at the minimum achievable period constraint for the 90nm technology.

Table 7.4: ASIC synthesis results for a set of different period constraints.

Period           Maximum           Total cell    Dynamic
Constraint (ns)  Frequency (MHz)   area (µm²)    Power (mW)
    5.0              200             185098        24.20
    4.5              222             184908        27.01
    4.0              250             184315        30.30
    3.5              286             184782        34.67
    3.0              333             185004        40.54
    2.5              400             184531        48.30
    2.0              500             185002        60.49
    1.5              667             194399        82.99
    1.3              769             207156        97.99

Figure 7.2: Frequency, area and power scaling of the synthesized ASIC. The bold line representsthe area, and the dotted line represents the dynamic power consumption, both in relation to theoperating frequency.

As expected, the main drawback of operating the circuit at higher frequencies is the increase of the dynamic power, which goes from 24.2 mW (200 MHz) to a maximum of 98 mW (769 MHz). This power consumption profile respects the requirements imposed by the targeted application domain, since the core will be embedded in a diagnosis system and will not be continuously operating, as opposed to a desktop GPP. In fact, its main goal is to execute computationally demanding algorithms in reduced amounts of time. Therefore, power consumption has a reduced impact on the applicability of the core in embedded systems, while still providing significant performance gains.

7.3 Multi-core processing structure evaluation

The scalability of the conceived multi-core architecture was evaluated by first measuring the

required amount of hardware resources and by conducting timing analyses on the Xilinx Virtex-7,

Xilinx Zynq-7000 and Altera Aria II GX FPGAs, with different numbers of instantiated ASIP cores.

These analyses were followed by a performance and power analysis of the proposed multi-core

processing platform, implemented in several technologies (Xilinx and Altera FPGAs, and 90nm

CMOS technology), and compared with that of off-the-shelf embedded and desktop processors

(namely, an Intel Core i7 950, an Intel Atom and an ARM Cortex-A9). This analysis is started

by evaluating the fitness of the proposed SIMD ISA extensions for biological sequence analysis,

in comparison with the one present on the off-the-shelf processors. Then it was performed an

evaluation of the multi-core implementations regarding performance, energy and Energy-Delay

Product (EDP). To perform this analysis, the benchmark datasets described in the beginning of

the chapter were used.

The implementation of the defined scalable multi-core processing structure adopted several in-

stantiations of the proposed ASIP (as described in Chapter 6), where all processing cores execute

the same SIMD version of the SW algorithm. In this configuration, all the integrated scratchpad

and shared memories were set to a 16-bit address width. In order to evaluate the shared bus contention, which constrains the multi-core scalability, the reference sequence used by the algorithm was stored in the shared memory, and each processing core concurrently requests access to the bus (once per iteration) to obtain the corresponding sequence symbol. All other private variables were stored in the scratchpad memory of each processing core.

7.3.1 Hardware and timing analysis

Table 7.5: Hardware specifications for the multi-core system prototyping in the Virtex-7 FPGA.

                     Virtex-7                            Multi-core (# ASIPs)
                     FPGA       2            4            8            16            32             38
Slice Registers      607200     2338 (<1%)   4687 (<1%)   9322 (1%)    18503 (3%)    37133 (6%)     44016 (7%)
Slice LUTs           303600     9424 (3%)    18231 (6%)   36557 (12%)  69723 (22%)   142098 (46%)   171637 (56%)
Slice LUT-FF pairs   –          2127 (22%)   4125 (22%)   8356 (22%)   16731 (23%)   34339 (23%)    40827 (23%)
Block RAM/FIFO       1030       68 (6%)      120 (11%)    224 (21%)    432 (41%)     848 (82%)      1004 (97%)
BUFG/BUFGCTRLs       32         1 (3%)       2 (6%)       2 (6%)       2 (6%)        2 (6%)         2 (6%)
Max. Freq. (MHz)     –          184.2        179.5        179.5        163.3         162.4          157.3

Table 7.5 presents the hardware resource usage and the maximum operating frequency of the multi-core processing structure, for different aggregates of processing cores, in the Virtex-7 FPGA. Due to the Block-RAM resource requirements of each core, a maximum of 38 processing cores can be instantiated on the prototyping Virtex-7 FPGA. However, a 64-core configuration may be instantiated, without exceeding the available amount of slices, if a different memory configuration is used. The maximum operating frequency obtained for the multi-core processing structure decreases more noticeably for the configurations with a greater number of processing cores. This results from the amount of multiplexing logic generated in the bus, which grows with the number of bus Masters, and from the resulting routing inside the FPGA device.

Table 7.6: Hardware specifications for the multi-core system prototyping in the Zynq FPGA.

                     Zynq-7000             Multi-core (# ASIPs)
                     FPGA        2             4             8
Slice Registers      106400      2365 (2%)     4722 (4%)     9409 (8%)
Slice LUTs           53200       9383 (17%)    18614 (34%)   37576 (70%)
Slice LUT-FF pairs   –           2163 (22%)    4313 (22%)    8843 (23%)
Block RAM/FIFO       104         32 (22%)      60 (42%)      116 (82%)
BUFG/BUFGCTRLs       32          1 (3%)        2 (6%)        2 (6%)
Max. Freq. (MHz)     –           157.4         155.8         156.4

Tables 7.6 and 7.7 present the hardware resource usage and maximum operating frequency

in the Zynq and Aria II GX FPGAs, respectively. In both devices a maximum of 8 processing

cores can be instantiated, given the reduced amount of resources available, when compared to

the Virtex-7 FPGA. The obtained maximum operating frequencies are almost constant across

the different core configurations and 3x higher in the Zynq FPGA than in the Aria II GX. Also, the

operating frequency in the Zynq FPGA is close to the operating frequency obtained in the Virtex-7

FPGA for similar configurations.

Table 7.7: Hardware specifications for the multi-core system prototyping in the Aria II GX FPGA.

                            Aria II GX             Multi-core (# ASIPs)
                            FPGA         2              4              8
Logic utilization           50600        10646 (21%)    21438 (42%)    36908 (73%)
Dedicated logic registers   51336        2796 (5%)      5442 (11%)     10734 (21%)
ALMs                        25300        5757 (23%)     12085 (48%)    20346 (80%)
LABs                        2530         619 (24%)      1303 (52%)     2135 (84%)
Total block memory bits     4561920      483328 (11%)   835584 (18%)   1540096 (34%)
Max. Freq. (MHz)            –            58.5           62.9           54.7

7.3.2 Performance evaluation

Fig. 7.3 presents the obtained speedup values, in terms of the number of clock cycles, for the proposed multi-core structure. Such speedup values were obtained by using a single

ASIP core configuration as the reference. The observed speedup increases almost linearly for

configurations of up to 16 cores. With additional cores, the contention in the shared bus becomes

a limiting factor, thus reducing the effectiveness of the extra cores and resulting in a minor speedup

increase. This is easily observed by comparing the 32-core and the 64-core configurations, when

executing the SIMD implementation of the SW algorithm.

Figure 7.3: Variation of the obtained clock cycle speedup with the number of ASIP cores in the multi-core processing structure.

When considering the non-SIMD sequential implementation as reference, a maximum 910x clock cycle speedup was projected for the 64-core configuration. It can also be ascertained that, even if more than 32 processing cores were instantiated in the Virtex-7 FPGA, the attainable speedups would not increase significantly, possibly even starting to decrease with more processing cores, due to bus contention effects. Therefore, the 38-core configuration is not taken into account in the evaluations presented in Section 7.3.3.

Figure 7.4: Variation of the obtained processing time speedup with the number of instantiatedASIP cores in the prototyped multi-core processing structure. The processing time speedup takesinto account the decrease of the operating frequency of the multi-core implementations, in thedifferent FPGAs.

Fig. 7.4 presents the achieved processing time speedup by taking into account the maxi-

mum operating frequencies obtained for the several considered multi-core configurations, in the

adopted FPGAs. When considering the non-SIMD sequential implementation as reference, the

obtained results demonstrated that a 745x processing time speedup can be obtained with a 38-

core parallel SIMD implementation of the proposed ASIP, with a maximum operating frequency

of 157.3 MHz, in the Virtex-7 FPGA. Maximum processing time speedups of 210x and 70x were

obtained for an 8-core configuration, in the Zynq and Aria II GX FPGA devices, respectively, with

maximum operating frequencies of 156.4 MHz and 54.7 MHz.

7.3.3 Performance-energy efficiency

Besides the presented evaluation in terms of the resulting speedup values, several efficiency

metrics were also used to study the efficiency balance of the devised multi-core platform. In


particular, three different metrics were used to characterize the multi-core ASIP in terms of the

attained raw throughputs, energetic efficiency and Performance-Energy efficiency, in comparison

with the same metrics for three current high-end GPPs. Aside from the already adopted Intel Core i7 processor, the ARM Cortex-A9 and the Intel Atom E665C processors were chosen for this comparison, not only because they are already integrated with the Xilinx Zynq-7000 and the Altera Aria II GX FPGAs, respectively, but also because they represent two important processor families relevant to this application domain: server and embedded platforms.

Table 7.8: GPP operating frequencies and power estimation parameters. The power estimate for the GPPs is TDP/#cores.

                    Maximum            TDP /
                    frequency (MHz)    Power estimate (W)
Intel Core i7 950        3070             130 / 32.5
Intel Atom E665C         1300             3.6 / 3.6
ARM Cortex-A9             533             1.9 / 0.95

To determine the achieved raw throughput, the Cell Updates per Second (CUPS) metric, traditionally used in this application domain, was adopted. It is based on the number of processed query sequences (q) in each architecture, the length of the query and reference sequences (m and n, respectively) and the corresponding runtime (t), in seconds. Hence, the CUPS metric is obtained as (q × m × n)/t. The attained throughputs are shown in Fig. 7.5. The obtained runtimes for the multi-core platform were measured with the system running at the different operating

frequencies for each FPGA, as presented in Tables 7.5, 7.6 and 7.7. For the GPPs, the operating

frequencies are depicted in Table 7.8.

Figure 7.5: Multi-core system raw throughput in comparison with high-end GPPs, given in CUPS.The ASIC values were extrapolated from the CMOS 90nm synthesis results for the ASIP.

As it can be ascertained from Fig. 7.5, the single-core ASIP implementations in the considered

FPGAs achieve throughputs close to the ARM Cortex-A9 processor. On the other hand, the two

considered Xilinx devices with the 8-core configuration attain throughputs similar to that of the

Atom processor. The implementation of the same 8-core configuration in the CMOS ASIC is

already able to surpass the performance of the Intel Core i7. In this respect, it is observed that the Virtex-7 device represents the most scalable implementation, capable of approaching the performance of the Core i7 GPP when instantiating 32 cores.

An energetic efficiency study was also performed with the help of the power estimation tools in the Xilinx ISE and Quartus II software frameworks and with the GPP Thermal Design Power (TDP) values, as depicted in the corresponding data-sheets and presented in Table 7.8. With the estimated worst-case power consumption values (see Table 7.9), obtained from the maximum power consumed by the used hardware resources when running at the maximum operating frequencies, efficiency metrics based on the energy consumption and on the attained CUPS metric were calculated. Furthermore, for the selected GPPs, the TDP was divided by the number of available cores in each processor, hence favouring the GPPs, since the normalized values are lower than the power consumed by the GPPs even in standby mode.

Table 7.9: Power consumption measurements for each FPGA implementation, with different core configurations.

                                    Multi-core power supply (mW)
             # ASIPs            1       2       4       8      16      32
Virtex-7     Dynamic Power     445     516    1002    1772    3044    5554
             Static Power      210     211     216     224     240     275
Zynq-7000    Dynamic Power     209     435     753    1343      -       -
             Static Power     79.5    81.1    83.2    87.6      -       -
Aria II GX   Dynamic Power      47     107     196     350      -       -
             Static Power      322     322     325     330      -       -
             I/O Power          92      17      17      17      -       -

The total energy consumption is given by the product of the execution time with the total

supplied power, as shown in Eq. 7.1.

Energy = Runtime × (Dynamic Power + Static Power)          (7.1)

Given the obtained energy consumption, a performance efficiency metric based on the attained cell updates can be derived in order to study the efficiency of the different configurations. The adopted Cell Updates per Joule (CUPJ) metric is given by the total number of processed cells divided by the total energy consumption. Fig. 7.6 represents the evolution of the average CUPJ values with different core configurations, for each FPGA.

Figure 7.6: Energy efficiency of the several platforms using the CUPJ metric (higher is better).

At first glance, it can be observed that the FPGA

and ASIC implementations of the proposed multi-core ASIP clearly surpass the energy efficiency

of all the considered GPPs. It can also be ascertained that, with configurations up to 8 cores,

the energy efficiency of the FPGA implementations increases up to a steady-state value, with the implementations on the Zynq FPGA being the most efficient of the three. This can be explained by the lower dynamic power values required by those configurations, coupled with the exponential growth in the number of processed cells and with the reduced shared bus contention, as

observed in the speedup analysis. For the Virtex-7 FPGA, it can be observed that the 16-core

configuration represents the peak energetic efficiency, hence corresponding to the best trade-off

between maximum operating frequency, amount of hardware resources, number of processed

cells and shared bus contention.

Figure 7.7: Performance-Energy efficiency of the several platforms using an inverted EDP metric,in CUPJS (higher is better).

The adopted Performance-Energy efficiency metric, given in Cell Updates per Joule-Second

(CUPJS), can be regarded as an inversion and normalization of the commonly used Energy-

Delay Product (EDP) metric. In fact, while the EDP is generally given by the product of the total

energy consumption and the corresponding processing runtime, the adopted CUPJS is obtained

by inverting the EDP and by multiplying it by the total number of processed cells. Fig. 7.7 depicts

the calculated Performance-Energy efficiency for the different platforms. It can be ascertained that

the considered implementations on both Xilinx devices are almost twice as efficient as those on the Altera Aria II GX FPGA. In comparison with the studied GPPs, it can also be observed that the ARM processor presents an efficiency close to that of the single-core ASIP implementation in the Virtex-7 FPGA, and is only more efficient than the same implementation on the Altera device. In comparison with the Intel processors, the Xilinx implementations surpass the Intel Atom processor when considering at least a 4-core configuration, while the Intel Core i7 is surpassed in terms of Performance-Energy efficiency by a 16-core configuration implemented on the Virtex-7 FPGA. Furthermore, aside from the multi-core extrapolation, a single-core ASIC (with 58 TCUPJS) is almost 3x as efficient as the Core i7 processor (with 21 TCUPJS).
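For reference, the metrics used throughout this evaluation can be summarised by the following C helpers, a minimal sketch of the definitions given above (CUPS, Eq. 7.1, CUPJ and the inverted-EDP CUPJS metric).

```c
/* Throughput and efficiency metrics used in this chapter, computed from the
 * number of processed query sequences (q), the sequence lengths (m, n), the
 * runtime in seconds and the supplied power in watts. */

static double cups(double q, double m, double n, double t)
{
    return (q * m * n) / t;                    /* Cell Updates per Second */
}

static double energy(double runtime_s, double dynamic_w, double static_w)
{
    return runtime_s * (dynamic_w + static_w); /* Eq. 7.1, in joules */
}

static double cupj(double cells, double energy_j)
{
    return cells / energy_j;                   /* Cell Updates per Joule */
}

static double cupjs(double cells, double energy_j, double runtime_s)
{
    return cells / (energy_j * runtime_s);     /* inverted, normalized EDP */
}
```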


7.4 Summary

This Chapter presented an evaluation of the developed multi-core platform. The prototyping

framework is presented, by describing the tools and data sets used to evaluate the developed

work. Then, the implemented ASIP is evaluated, by performing a scalability test in terms of the SIMD register size and by establishing a speedup comparison with the SSE2 implementation of the SW algorithm in an Intel Core i7 processor. Also, FPGA Post-place&Route

and ASIC synthesis results are studied. Next, a scalability evaluation of the multi-core platform

is presented, by varying the number of cores in implementations on three different FPGAs. The

obtained speedup results relative to a single ASIP, both running the sequential and SIMD versions

of the SW algorithm, are also presented. Finally, an extensive Performance-Energy efficiency

study of the FPGA and ASIC implementations is performed, and compared with high-end GPPs.


8. Conclusions

Contents

8.1 Future work


The currently established biological sequence alignment algorithms that allow the computation

of the optimal alignment solutions by using DP techniques require large run-times when executed

on current GPPs. Moreover, even with the highest attainable performances, implementations of

those algorithms with ASIC solutions are characterized by their lack of flexibility and the high

production costs. In order to fill the gap between both approaches, a new ASIP architecture,

specifically adapted for this class of algorithms, was proposed. It is able to achieve high pro-

cessing throughputs through an optimized architecture that exploits both fine and coarse-grained

parallelism.

The fine-grained parallelism is exploited by a new synthesizable processor core, featuring an

extended ISA that introduces multiple specialized SIMD vector instructions. With this ASIP, a

significant clock cycle speedup of about 33x was achieved when Farrar’s implementation [11]

(using the new ISA code generated by a GCC compiler back-end extension) was compared with

a sequential version of the SW algorithm. This contrasts with an equivalent implementation using

an Intel Core i7, based on the SSE2 instruction set extensions, where a speedup of only 11x was

achieved. These results demonstrate that the new ISA is especially fit for implementing biological

alignment algorithms. Furthermore, it was demonstrated that the hardware resource overhead introduced by the new architecture is small compared to the MB-LITE [12] processor, which was taken as the basis to implement the developed ISA. Maximum operating frequencies of 190 MHz, on

a Virtex-7 FPGA device, and 770 MHz, in a 90nm standard cell CMOS technology, were achieved

with the implemented ASIP.

It is also important to recall that the ASIP SIMD register size was configured to a 128-bit width

for the purpose of ensuring a fair comparison with Farrar’s implementation in an Intel processor,

as well as a convenient trade-off, given the algorithm’s performance. Nevertheless, this parame-

ter may be easily modified in order to increase the ASIP’s performance in other implementations.

Hence, the obtained results demonstrate that a simple RISC processor with the proposed set

of SIMD instructions can be very effective in executing bioinformatics algorithms. This is especially important when considering the integration of multi-core processing structures in low-power embedded platforms to allow fast, portable and fully autonomous biological sequence analysis

systems [45].

To exploit the coarse-grained parallelism, an extensive multi-core computational structure,

composed of multiple instantiations of the proposed ASIP, was developed and prototyped on three different FPGA devices. It was demonstrated that an almost linear speedup can be achieved

with up to 16 processing cores, since no relevant contention on the interconnection bus exists.

When the number of instantiated processors was further increased, a gradual (but expected)

sub-linear behavior was observed in the attained speedup. Nevertheless, when considering the

cumulative speedup resulting from using both the SIMD ISA and the multi-core architecture, the

proposed system is capable of achieving significant speedup values, as high as a 745x processing


time speedup, with 38 cores, and a 910x clock cycle speedup, with 64 cores.

In terms of Performance-Energy efficiency, it was observed that a 16-core configuration imple-

mented on a Virtex-7 FPGA, running at 163 MHz, surpassed an Intel Core i7 950 processor, run-

ning at 3.07 GHz, with an attained Performance-Energy efficiency measurement of 58 TCUPJS

(Tera-Cell Updates per Joule-Second). This confirms that the devised system is clearly

a viable solution for Bioinformatics applications, even surpassing the state-of-the-art implemen-

tations in high-end GPPs. Furthermore, by considering the achieved Energy and Performance-

Energy efficiency metrics, coupled with the attainable operating frequencies and low hardware

resource usage, it was proved that the proposed processing architecture is highly suitable for

future integration with low-power embedded platforms.

8.1 Future work

The developed SIMD ISA allows the acceleration of other DP algorithms, not only by optimizing

the implementation of the algorithms with the new instruction set, but also by taking advantage

of the flexible architecture and further including new specialized instructions. For instance,

by revisiting Wozniak’s implementation of the SW algorithm [24] and by solving the issue of the

non-trivial memory access patterns, the imposed overheads can be removed either with the aid of

dedicated instructions, or with specific modifications to the implemented DMA controller. Hence,

the variation in the SIMD vector size, which in the implemented algorithm was limited by the

increase in the lazy loop overhead, could prove to scale linearly, since the data dependencies are

eliminated in the processing of each matrix cell.

As proved by the obtained ASIC results, biochips and other portable diagnosis systems can

take advantage of the conceived ISA. Moreover, with the adoption of the implemented multi-

core structure concept, the paradigm proposed in [43] can be easily implemented. Based on

the results obtained with conducted study, it is possible to take advantage of the Xilinx Zynq

All Programmable System on Chip (SoC) Architecture, by integrating up to 8 processing cores

and connecting them with the device’s ARM processor (working as Master core), through an AXI

bus interconnection. This would provide a fully functional and high performance bioinformatics

platform, with full support of a Linux operating system, together with the broad set of I/O and user

interface mechanisms usually offered by conventional desktop computing systems.


References

[1] K. Shibu, Introduction to Embedded Systems, 1st Edition. Tata McGraw-Hill Education,

June 2009.

[2] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative

Approach. Morgan Kaufmann Publishers Inc., 2006.

[3] M. Flynn, “Some computer organizations and their effectiveness,” Computers, IEEE

Transactions on, vol. C-21, no. 9, pp. 948–960, September 1972.

[4] Intel 64 and IA-32 Architectures Software Developer’s Manual: Combined Volumes, Intel,

March 2013, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-

architectures-software-developer-manual-325462.html.

[5] K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scale, “AltiVec extension to powerpc accel-

erates media processing,” Micro, IEEE, vol. 20, no. 2, pp. 85–95, March/April 2000.

[6] D. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers,

“GenBank,” Nucleic acids research, vol. 41, no. D1, pp. D36–D42, January 2013.

[7] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal

of Molecular Biology, vol. 147, pp. 195–197, 1981.

[8] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similari-

ties in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3,

pp. 443–453, 1970.

[9] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic Local Alignment Search Tool

(BLAST),” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.

[10] W. R. Pearson, “Searching protein sequence libraries: comparison of the sensitivity and

selectivity of the Smith-Waterman and FASTA algorithms,” Genomics, vol. 11, no. 3, pp.

635–650, 1991.

[11] M. Farrar, “Striped Smith-Waterman speeds database searches six times over other SIMD

implementations,” Bioinformatics, vol. 23, no. 2, p. 156, 2007.


[12] T. Kranenburg and R. van Leuken, “MB-LITE: A robust, light-weight soft-core implementation

of the MicroBlaze architecture,” Design, Automation and Test in Europe Conference and

Exhibition (DATE), pp. 997–1000, March 2010.

[13] GCC, the GNU Compiler Collection, GNU Project, February 2013, http://gcc.gnu.org/.

[14] M. Mittal, A. Peleg, and U. Weiser, “MMX™ technology architecture overview,” Intel

Technology Journal Q3 ’97, 1997.

[15] S. Raman, V. Pentkovski, and J. Keshava, “Implementing streaming SIMD extensions on the

Pentium III processor,” Micro, IEEE, vol. 20, no. 4, pp. 47–57, July/August 2000.

[16] O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular

Biology, vol. 162, no. 3, pp. 705–708, 1982.

[17] E. T. Chow, J. C. Peterson, M. S. Waterman, T. Hunkapiller, and B. A. Zimmermann, “A

systolic array processor for biological information signal processing,” in Proceedings of the

5th international conference on Supercomputing. ACM, 1991, pp. 216–223.

[18] C. T. White, R. K. Singh, P. B. Reintjes, J. Lampe, B. W. Erickson, W. D. Dettloff, V. L.

Chi, and S. F. Altschul, “BioScan: A VLSI-based system for biosequence analysis,” in

Proceedings, 1991 IEEE International Conference on Computer Design: VLSI in Computers

and Processors, 1991, ICCD’91. IEEE, 1991, pp. 504–509.

[19] P. Guerdoux-Jamet and D. Lavenier, “SAMBA: hardware accelerator for biological sequence

comparison,” Computer applications in the biosciences: CABIOS, vol. 13, no. 6, pp. 609–615,

1997.

[20] N. Sebastiao, N. Roma, and P. Flores, “Integrated hardware architecture for efficient

computation of the n-best bio-sequence local alignments in embedded platforms,” IEEE

Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 7, pp. 1262–1275,

July 2012.

[21] Z. Nawaz, M. Nadeem, J. van Someren, and K. Bertels, “A parallel FPGA design of the Smith-

Waterman traceback,” in International Conference on Field-Programmable Technology, FPT

10, December 2010, pp. 454–459.

[22] K. Benkrid, Y. Liu, and A. Benkrid, “Design and implementation of a highly parameterised

FPGA-based skeleton for pairwise biological sequence alignment,” in 15th Annual IEEE

Symposium Field-Programmable Custom Computing Machines. FCCM 2007, April 2007,

pp. 275–278.


[23] X. Jiang, X. Liu, L. Xu, P. Zhang, and N. Sun, “A reconfigurable accelerator for Smith-

Waterman algorithm,” IEEE Transactions on Circuits and Systems - Part II: Express Briefs,

vol. 54, no. 12, pp. 1077–1081, December 2007.

[24] A. Wozniak, “Using video-oriented instructions to speed up sequence comparison,”

Computer Applications in the Biosciences, vol. 13, no. 2, pp. 145–150, 1997.

[25] T. Rognes and E. Seeberg, “Six-fold speed-up of Smith-Waterman sequence database

searches using parallel processing on common microprocessors,” Bioinformatics, vol. 16,

no. 8, p. 699, 2000.

[26] T. Rognes, “Faster Smith-Waterman database searches with inter-sequence SIMD paralleli-

sation,” Bioinformatics, vol. 12, no. 221, p. 11, 2011.

[27] GRLIB IP Core User’s Manual, Gaisler Research, February 2006.

[28] MicroBlaze Processor Reference Guide, Xilinx, January 2009.

[29] J. G. Tong, I. D. Anderson, and M. A. Khalid, “Soft-core processors for embedded systems,”

in Microelectronics, 2006. ICM’06. International Conference on. IEEE, 2006, pp. 170–173.

[30] Wishbone System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores,

B.3 Revision, OpenCores, September 2002.

[31] OpenCores: MB-LITE project page, March 2013, http://www.opencores.org/project,mblite.

[32] J. Gaisler, Fault-tolerant microprocessors for space applications, March 2013,

http://www.gaisler.com/doc/vhdl2proc.pdf.

[33] F. Sanchez Castano, A. Ramirez, and M. Valero, “Quantitative analysis of sequence

alignment applications on multiprocessor architectures,” in Proceedings of the 6th ACM

Conference on Computing Frontiers. ACM, 2009, pp. 61–70.

[34] P. Huerta, J. Castillo, C. Pedraza, J. Cano, and J. I. Martinez, “Symmetric multiprocessor

systems on FPGA,” in International Conference on Reconfigurable Computing and FPGAs,

2009. ReConFig’09. IEEE, 2009, pp. 279–283.

[35] P. Manoilov and P. Krivoshieva, “Shared memory design for multicore systems,” in

International Scientific Conference Computer Science, 2008, pp. 302–307.

[36] D. Wu, X. Zou, K. Dai, C. Deng, and S. Lin, “Mem-

ory system design and implementation for a multiprocessor,” in

2nd International Conference on Computer Engineering and Technology (ICCET), vol. 6.

IEEE, 2010, pp. V6–313.

91

Page 110: Multi-Core SIMD ASIP for DNA Sequence Alignmentpremio-vidigal.inesc.pt/pdf/NunoNevesMSc.pdf · Multi-Core SIMD ASIP for DNA Sequence Alignment Nuno Filipe Simoes Santos Moraes Neves˜

References

[37] G. Arroz, J. Monteiro, and A. Oliveira, Arquitectura de Computadores: dos Sistemas Digitais

aos Microprocessadores, 1a Edicao. IST Press, 2007.

[38] PrimeCell R© µDMA Controller (PL230), ARM, Ltd., March 2007, http://infocenter.arm.com.

[39] P. Huerta, J. Castillo, J. I. Martinez, and C. Pedraza, “Exploring

FPGA capabilities for building symmetric multiprocessor systems,” in

3rd Southern Conference on Programmable Logic, 2007. SPL’07. IEEE, 2007, pp.

113–118.

[40] On-Chip Peripheral Bus – Architecture Specifications, Version 2.1, IBM, April 2001.

[41] AMBA R© 3 AHB-Lite Protocol, v1.0, ARM, Ltd., June 2006, http://infocenter.arm.com.

[42] Multi-layer AHB Overview, ARM, Ltd., May 2004, http://infocenter.arm.com.

[43] N. Roma and P. Magalhaes, “System-level prototyping framework for heterogeneous multi-

core architecture applied to biological sequence analysis,” in IEEE International Symposium

on Rapid System Prototyping (RSP’2012), Tampere - Finland, Oct. 2012, pp. 156–162.

[44] E. S. Shin, V. J. Mooney III, and G. F. Riley, “Round-robin arbiter design and generation,” in

Proceedings of the 15th international symposium on System Synthesis. ACM, 2002, pp.

243–248.

[45] J. Germano, V. C. Martins, F. A. Cardoso, T. M. Almeida, L. Sousa, P. P. Freitas, and M. S.

Piedade, “A portable and autonomous magnetic detection platform for biosensing,” Sensors,

vol. 9, no. 6, pp. 4119–4137, 2009.

92

Page 111: Multi-Core SIMD ASIP for DNA Sequence Alignmentpremio-vidigal.inesc.pt/pdf/NunoNevesMSc.pdf · Multi-Core SIMD ASIP for DNA Sequence Alignment Nuno Filipe Simoes Santos Moraes Neves˜

A. SIMD Instruction Set

Contents

A.1 Scalar maximum
A.2 Vector addition
A.3 Vector reverse subtraction
A.4 Vector compare
A.5 Vector maximum
A.6 Vector element shift left
A.7 Move-to-vector
A.8 Load vector
A.9 Store vector
A.10 Vector Disjunctive Branch
A.11 Vector Disjunctive Branch w/ Immediate
A.12 Vector Conjunctive Branch
A.13 Vector Conjunctive Branch w/ Immediate


This appendix lists the instructions of the proposed ISA. Each section is named after the instruction, or set of instructions, defined in it. Each section is divided into three subsections: i) the mnemonics and the instruction code; ii) a brief description of the operation; and iii) pseudocode illustrating the operation.

A.1 Scalar maximum

Maximum between two scalar values

Assembly and instruction

max  rD,rA,rB   (signed)
maxu rD,rA,rB   (unsigned)

Instruction code: opcode[31:26]=000101 | rD[25:21] | rA[20:16] | rB[15:11] | [10:3]=00000000 | [2:0]=1U1

Description
The operand in register rA is subtracted from the operand in register rB. The MSB of the result is used to set rD with the contents of rA or rB, accordingly. If the U bit is clear, rA and rB are considered signed values.

Pseudocode

result ← (rB) - (rA)
if result(MSB) == 1 then:
    (rD) ← (rA)
else:
    (rD) ← (rB)
end if

A.2 Vector addition

Addition operation for SIMD vector operands

Assembly and instruction

addvv  rD,rA,rB   (vector-vector)
saddvv rD,rA,rB   (vector-vector w/ saturation)
addvs  rD,rA,rB   (vector-scalar)
saddvs rD,rA,rB   (vector-scalar w/ saturation)
addv   rD,rA,rB   (inner-vector)

Instruction code: opcode[31:26]=000000 | rD[25:21] | rA[20:16] | rB[15:11] | [10:6]=00000 | S[5] | V[4:3] | [2:0]=000

Description
According to the V field, the vector in register rB is added to a vector (vector-vector type) or to a scalar (vector-scalar type) in register rA, or the adjacent pairs of vector elements in register rA are added to each other and the results placed in the lower half of register rD (inner-vector type). If the S bit is set to 1, the resulting elements of the vector-vector and vector-scalar additions are saturated at the lowest or highest signed value; for instance, 8-bit vector elements saturate at -128 and 127.

Pseudocode

i ← vector element index
if V == "01" then:                                 /*vector-vector*/
    (rD)[i] ← (rB)[i] + (rA)[i]
else if V == "10" then:                            /*vector-scalar*/
    (rD)[i] ← (rB)[i] + (rA)[0]
else if V == "11" then:                            /*inner-vector*/
    (rD)[i] ← (rA)[i*2+1] + (rA)[i*2]
end if
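To make the three addressing modes concrete, the following C fragment emulates the addition variants on 8-bit elements. It is an illustrative sketch only; the lane count N and the helper names are assumptions, not part of the ISA.

#include <stdint.h>

#define N 16                                    /* assumed number of 8-bit vector elements */

static int8_t sat8(int v) {                     /* saturate to the signed 8-bit range */
    return (int8_t)(v > 127 ? 127 : (v < -128 ? -128 : v));
}

void saddvv(int8_t rD[N], const int8_t rA[N], const int8_t rB[N]) {
    for (int i = 0; i < N; i++) rD[i] = sat8(rB[i] + rA[i]);             /* vector-vector */
}

void saddvs(int8_t rD[N], const int8_t rA[N], const int8_t rB[N]) {
    for (int i = 0; i < N; i++) rD[i] = sat8(rB[i] + rA[0]);             /* vector-scalar */
}

void addv(int8_t rD[N], const int8_t rA[N]) {
    for (int i = 0; i < N/2; i++) rD[i] = (int8_t)(rA[2*i+1] + rA[2*i]); /* inner-vector  */
}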


A.3 Vector reverse subtraction

Reverse subtraction operation for SIMD vector operands

Assembly and instruction

rsubvv  rD,rA,rB   (vector-vector)
srsubvv rD,rA,rB   (vector-vector w/ saturation)
rsubvs  rD,rA,rB   (vector-scalar)
srsubvs rD,rA,rB   (vector-scalar w/ saturation)
rsubv   rD,rA,rB   (inner-vector)

Instruction code: opcode[31:26]=000001 | rD[25:21] | rA[20:16] | rB[15:11] | [10:6]=00000 | S[5] | V[4:3] | [2:0]=000

Description
According to the V field, the vector (vector-vector type) or scalar (vector-scalar type) operand in register rA is subtracted from the vector in register rB, or the adjacent pairs of vector elements in register rA are subtracted from each other and the results placed in the lower half of register rD (inner-vector type). If the S bit is set to 1, the resulting elements of the vector-vector and vector-scalar subtractions are saturated at the lowest or highest signed value; for instance, 8-bit vector elements saturate at -128 and 127.

Pseudocode

i ← vector element index
if V == "01" then:                                 /*vector-vector*/
    (rD)[i] ← (rB)[i] - (rA)[i]
else if V == "10" then:                            /*vector-scalar*/
    (rD)[i] ← (rB)[i] - (rA)[0]
else if V == "11" then:                            /*inner-vector*/
    (rD)[i] ← (rA)[i*2+1] - (rA)[i*2]
end if

A.4 Vector compare

Compare operation for SIMD vector operands

Assembly and instruction

cmpvv  rD,rA,rB   (vector-vector)
cmpuvv rD,rA,rB   (vector-vector unsigned)
cmpvs  rD,rA,rB   (vector-scalar)
cmpuvs rD,rA,rB   (vector-scalar unsigned)
cmpv   rD,rA,rB   (inner-vector)
cmpuv  rD,rA,rB   (inner-vector unsigned)

Instruction code: opcode[31:26]=000101 | rD[25:21] | rA[20:16] | rB[15:11] | [10:5]=000000 | V[4:3] | [2:0]=0U1

Description
According to the V field, the vector (vector-vector type) or scalar (vector-scalar type) operand in register rA is subtracted from the vector in register rB, or the adjacent pairs of vector elements in register rA are subtracted from each other and the results placed in the lower half of register rD (inner-vector type). The MSB of each vector element in rD is then adjusted to reflect the true relation between the corresponding elements of rA and rB. If the U bit is set, the elements of rA and rB are treated as unsigned values; if the U bit is clear, they are treated as signed values.

Pseudocode


i ← vector element index
if V == "01" then:                                 /*vector-vector*/
    (rD)[i] ← (rB)[i] - (rA)[i]
    (rD)[i](MSB) ← (rA)[i] > (rB)[i]
else if V == "10" then:                            /*vector-scalar*/
    (rD)[i] ← (rB)[i] - (rA)[0]
    (rD)[i](MSB) ← (rA)[0] > (rB)[i]
else if V == "11" then:                            /*inner-vector*/
    (rD)[i] ← (rA)[i*2+1] - (rA)[i*2]
    (rD)[i](MSB) ← (rA)[i*2+1] > (rA)[i*2]
end if

A.5 Vector maximum

Maximum operation between SIMD vector operands

Assembly and instruction

maxvv  rD,rA,rB   (vector-vector)
maxuvv rD,rA,rB   (vector-vector unsigned)
maxvs  rD,rA,rB   (vector-scalar)
maxuvs rD,rA,rB   (vector-scalar unsigned)
maxv   rD,rA,rB   (inner-vector)
maxuv  rD,rA,rB   (inner-vector unsigned)

Instruction code: opcode[31:26]=000101 | rD[25:21] | rA[20:16] | rB[15:11] | [10:5]=000000 | V[4:3] | [2:0]=1U1

Description
According to the V field, the vector (vector-vector type) or scalar (vector-scalar type) operand in register rA is subtracted from the vector in register rB, or the adjacent pairs of vector elements in register rA are subtracted from each other (inner-vector type). The MSB of each subtraction result is used to set the corresponding element of rD with the contents of rA or rB (or with the larger element of each adjacent pair), accordingly. If the U bit is clear, rA and rB are considered signed values.

Pseudocode

i ← vector element index
if V == "01" then:                                 /*vector-vector*/
    result[i] ← (rB)[i] - (rA)[i]
    if result(MSB) == 1 then:
        (rD)[i] ← (rA)[i]
    else:
        (rD)[i] ← (rB)[i]
    end if
else if V == "10" then:                            /*vector-scalar*/
    result[i] ← (rB)[i] - (rA)[0]
    if result(MSB) == 1 then:
        (rD)[i] ← (rA)[0]
    else:
        (rD)[i] ← (rB)[i]
    end if
else if V == "11" then:                            /*inner-vector*/
    result[i] ← (rA)[i*2+1] - (rA)[i*2]
    if result(MSB) == 1 then:
        (rD)[i] ← (rA)[i*2]
    else:
        (rD)[i] ← (rA)[i*2+1]
    end if
end if
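In the striped Smith-Waterman kernel listed in Section B.3, maxvv both merges the H, E and F dependencies and keeps the running best score. A minimal usage sketch, written with the same GCC-style inline assembly convention as that listing (the register variable names are only illustrative):

register int vH, E, vF, vMax;                               /* SIMD vector registers */
asm("maxvv %0,%1,%2" : "=r"(vH)   : "r"(vH),   "r"(E));     /* H = max(H, E)         */
asm("maxvv %0,%1,%2" : "=r"(vH)   : "r"(vH),   "r"(vF));    /* H = max(H, F)         */
asm("maxvv %0,%1,%2" : "=r"(vMax) : "r"(vMax), "r"(vH));    /* track the best score  */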

A.6 Vector element shift left

SIMD vector elements shift left

Assembly and instruction


sllv rD,rA

Instruction code: opcode[31:26]=100100 | rD[25:21] | rA[20:16] | [15:0]=0000 0000 0010 0010

Description
Shifts logically the contents of the SIMD vector in rA, one element to the left, and places the result in rD. Zeros are shifted into the shift chain and placed in the least significant element position of rD.

Pseudocode

N: number of vector elements
(rD)[0] ← "000..0"
(rD)[N-1:1] ← (rA)[N-2:0]
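This element shift is what carries the scores of the last stripe into the next column of the alignment matrix. The fragment below mirrors how the listing in Section B.3 uses it; the or with the bias value seeds the vacated lane, and all names follow that listing:

asm("lv   %0,%1,%2" : "=r"(vT) : "r"((S_LENGTH - 1)*SIMD_ELEMENTS), "r"(pHStore)); /* last stripe   */
asm("sllv %0,%1"    : "=r"(vH) : "r"(vT));                                          /* shift 1 lane  */
asm("or   %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(bias));                               /* seed new lane */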

A.7 Move-to-vector

Move a scalar value to a specified SIMD vector position

Assembly and instruction

mtv rD,rA,rB

Instruction code: opcode[31:26]=010100 | rD[25:21] | rA[20:16] | rB[15:11] | [10:0]=000 0000 0000

Description
Moves the scalar value in rB, of width equal to the SIMD vector element width, and places it in the element position of the SIMD vector in rD specified by rA. All the other vector elements in rD are maintained.

Pseudocode

Pos ← (rA)
(rD)[Pos] ← (rB)

A.8 Load vector

Load a SIMD vector from memory

Assembly and instruction

lv  rD,rA,rB
lvi rD,rA,Imm

Instruction code (lv): opcode[31:26]=110011 | rD[25:21] | rA[20:16] | rB[15:11] | [10:0]=000 0000 0000

Description
Loads a SIMD vector from the vector-aligned memory location that results from adding the contents of register rA with the contents of register rB, or with the immediate value Imm. The data is placed in register rD.

Pseudocode

Register:  Addr ← (rA) + (rB)
Immediate: Addr ← (rA) + Imm
(rD) ← Mem(Addr)


Instruction code (lvi): opcode[31:26]=111011 | rD[25:21] | rA[20:16] | Immediate[15:0]

A.9 Store vector

Store a SIMD vector to memory

Assembly and instruction

sv  rD,rA,rB
svi rD,rA,Imm

Instruction code (sv):  opcode[31:26]=110111 | rD[25:21] | rA[20:16] | rB[15:11] | [10:0]=000 0000 0000
Instruction code (svi): opcode[31:26]=111111 | rD[25:21] | rA[20:16] | Immediate[15:0]

Description
Stores the SIMD vector in register rD to the vector-aligned memory location that results from adding the contents of register rA with the contents of register rB, or with the immediate value Imm.

Pseudocode

Register:  Addr ← (rA) + (rB)
Immediate: Addr ← (rA) + Imm
Mem(Addr) ← (rD)
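The register-offset forms are the ones used to initialise and stream the H and E score arrays in the kernel of Section B.3. A minimal sketch with the same inline-assembly convention (the surrounding declarations come from that listing):

for (j = 0; j < S_LENGTH*SIMD_ELEMENTS; j += SIMD_ELEMENTS) {
    asm("sv %0,%1,%2" :          : "r"(vMax), "r"(j), "r"(pHStore)); /* store vMax at pHStore + j */
    asm("lv %0,%1,%2" : "=r"(vH) :            "r"(j), "r"(pHStore)); /* load it back into vH      */
}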

A.10 Vector Disjunctive Branch

SIMD branch (disjunctive assertion)

Assembly and instruction

beqv rA,rB    beqdv rA,rB (w/ delay)
bnev rA,rB    bnedv rA,rB (w/ delay)
bgev rA,rB    bgedv rA,rB (w/ delay)
bgtv rA,rB    bgtdv rA,rB (w/ delay)
bltv rA,rB    bltdv rA,rB (w/ delay)
blev rA,rB    bledv rA,rB (w/ delay)

Description
Branch if at least one element of the SIMD vector in rA satisfies the branch condition (see the condition table below), using the offset value in rB. The target of the branch is the instruction at address PC + rB.

The branch mnemonics with a d (e.g., beqdv) set the D bit shown in the instruction code. The D bit determines whether there is a branch delay slot or not. If the D bit is set, there is a delay slot and the instruction following the branch (that is, in the branch delay slot) is allowed to complete execution before the target instruction is executed. If the D bit is not set, there is no delay slot, so the instruction to be executed after the branch is the target instruction.

Pseudocode

if any vector element satisfies the condition then
    PC ← PC + (rB)
else
    PC ← PC + 4
end if
if D == 1 then
    allow the following instruction to complete execution
end if


Instruction code: opcode[31:26]=100111 | [25:21]=D 1 CND | rA[20:16] | rB[15:11] | [10:0]=000 0000 0000

CND    Mnemonic    Condition
000    beq         equal to 0
001    bne         not equal to 0
010    blt         less than 0
011    ble         less than or equal to 0
100    bgt         greater than 0
101    bge         greater than or equal to 0
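These vector branches are what allow the lazy-F correction loop of the striped kernel (Sections B.1 and B.3) to be skipped or repeated based on a whole-vector test. A sketch of that pattern, using the inline-assembly convention of Section B.3 (the immediate offsets are hand-encoded placeholders taken from that listing):

asm("cmpvv %0,%1,%2" : "=r"(vT) : "r"(vF), "r"(vT)); /* per-element test of F against H - Gopen  */
asm("bmeiv %0, 0x0064" : : "r"(vT));                  /* all MSBs clear: skip the correction loop */
/* ... lazy-F correction body (Section B.3) ... */
asm("bgtiv %0, 0xffb4" : : "r"(vT));                  /* some element still positive: loop again  */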

A.11 Vector Disjunctive Branch w/ Immediate

SIMD branch (disjunctive assertion), with immediate

Assembly and instruction

beqiv rA,Imm    beqidv rA,Imm (w/ delay)
bneiv rA,Imm    bneidv rA,Imm (w/ delay)
bgeiv rA,Imm    bgeidv rA,Imm (w/ delay)
bgtiv rA,Imm    bgtidv rA,Imm (w/ delay)
bltiv rA,Imm    bltidv rA,Imm (w/ delay)
bleiv rA,Imm    bleidv rA,Imm (w/ delay)

Instruction code: opcode[31:26]=101111 | [25:21]=D 1 CND | rA[20:16] | Immediate[15:0]

Description
Branch if at least one element of the SIMD vector in rA satisfies the branch condition (see the condition table below), using the immediate offset value Imm. The target of the branch is the instruction at address PC + Imm.

The branch mnemonics with a d (e.g., beqidv) set the D bit shown in the instruction code. The D bit determines whether there is a branch delay slot or not. If the D bit is set, there is a delay slot and the instruction following the branch (that is, in the branch delay slot) is allowed to complete execution before the target instruction is executed. If the D bit is not set, there is no delay slot, so the instruction to be executed after the branch is the target instruction.

CND    Mnemonic    Condition
000    beqi        equal to 0
001    bnei        not equal to 0
010    blti        less than 0
011    blei        less than or equal to 0
100    bgti        greater than 0
101    bgei        greater than or equal to 0

Pseudocode

if any vector element satisfies the condition then
    PC ← PC + Imm
else
    PC ← PC + 4
end if
if D == 1 then
    allow the following instruction to complete execution
end if

A.12 Vector Conjunctive Branch

SIMD branch (conjunctive assertion)

Assembly and instruction


bmev  rA,rB
bmedv rA,rB (w/ delay)

Instruction code: opcode[31:26]=100111 | [25:21]=D 1 111 | rA[20:16] | rB[15:11] | [10:0]=000 0000 0000

Description
Branch if the MSBs of all the SIMD vector elements in rA are equal to 0, using the offset value in rB. The target of the branch is the instruction at address PC + rB. The d mnemonic (bmedv) sets the D bit, which enables the branch delay slot as described in Section A.10.

Pseudocode

if all((rA)[i](MSB) == 0) then
    PC ← PC + (rB)
else
    PC ← PC + 4
end if
if D == 1 then
    allow the following instruction to complete execution
end if

A.13 Vector Conjunctive Branch w/ Immediate

SIMD branch (conjunctive assertion), with immediate

Assembly and instruction

bmeiv  rA,Imm
bmeidv rA,Imm (w/ delay)

Instruction code: opcode[31:26]=101111 | [25:21]=D 1 111 | rA[20:16] | Immediate[15:0]

Description
Branch if the MSBs of all the SIMD vector elements in rA are equal to 0, using the immediate offset value Imm. The target of the branch is the instruction at address PC + Imm.

Pseudocode

if all((rA)[i](MSB) == 0) then
    PC ← PC + Imm
else
    PC ← PC + 4
end if
if D == 1 then
    allow the following instruction to complete execution
end if


B. Code

Contents

B.1 Striped SW algorithm implementation pseudocode (new SIMD ISA)
B.2 Affine gap local alignment function
B.3 Striped SW algorithm implementation in C (new SIMD ISA)
B.4 Striped SW algorithm implementation (Intel SSE2)


B.1 Striped SW algorithm implementation pseudocode (new SIMD ISA)

// globals
Q_LEN
D_LEN
N_SIMD
S_LEN  = (Q_LEN + N_SIMD - 1) / N_SIMD
LAST_H = (S_LEN - 1) * N_SIMD

simd_striped_search():
    // declarations
    bias = 0x80
    // vector registers:
    vF, vH, vMax, vT, E, r0
    // vector arrays:
    vHLoad, vHStore, vE
    // sequence characters:
    d
    // pointers:
    *pHLoad  = vHLoad
    *pHStore = vHStore
    *pE      = vE
    *p

    // initialize arrays and vector elements to bias:
    addvs vF, bias, r0
    addvs vH, bias, r0
    addvs vMax, bias, r0

    for j = 0 :: N_SIMD :: S_LEN*N_SIMD do
        sv vMax, j, pHLoad
        sv vMax, j, pHStore
        sv vMax, j, pE
    end for

    //----------------- MAIN LOOP -----------------//
    for i = 0 :: 1 :: D_LEN do
        d = D[i]

        addvs vF, bias, r0
        lv vT, LAST_H, pHStore
        sllv vH, vT
        or vH, vH, bias

        p = pHLoad
        pHLoad = pHStore
        pHStore = p

        //---------------- INNER LOOP ----------------//
        for j = 0 :: N_SIMD :: S_LEN*N_SIMD do
            lv vT, j, queryProfile[d]
            saddvv vH, vT, vH
            maxvv vMax, vMax, vH
            lv E, j, pE
            maxvv vH, vH, E
            maxvv vH, vH, vF
            sv vH, j, pHStore
            srsubvs vH, GAP_OPEN_COST, vH
            srsubvs E, GAP_EXTEND_COST, E
            maxvv E, E, vH
            sv E, j, pE
            srsubvs vF, GAP_EXTEND_COST, vF
            maxvv vF, vF, vH
            lv vH, j, pHLoad
        end for

        //---------------- LAZY-F LOOP ----------------//
        j = 0
        sllv vF, vF
        or vF, vF, bias
        lv vH, j, pHStore
        srsubvs vT, GAP_OPEN_COST, vH
        cmpvv vT, vF, vT
        bmev vT, END-F
    LAZY-F:
        maxvv vH, vH, vF
        sv vH, j, pHStore
        srsubvs vH, GAP_OPEN_COST, vH
        lv E, j, pE
        maxvv E, E, vH
        sv E, j, pE
        srsubvs vF, GAP_EXTEND_COST, vF
        j += N_SIMD
        if j >= S_LEN*N_SIMD:
            j = 0
            sllv vF, vF
            or vF, vF, bias
        end if
        lv vH, j, pHStore
        srsubvs vT, GAP_OPEN_COST, vH
        rsubvv vT, vT, vF
        bgtv vT, LAZY-F
    END-F:
    end for
    //--------------- END MAIN LOOP ---------------//

    maxv vMax, vMax
    maxv vMax, vMax
    maxv vMax, vMax
    maxv vMax, vMax
    add vMax, vMax, bias

    return vMax
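As a concrete instance of the constants defined at the top of the listing, the values below are derived for a hypothetical query of 100 symbols on a 16-element SIMD datapath; the numbers are simply a worked example of the two formulas above.

Q_LEN  = 100
N_SIMD = 16
S_LEN  = (100 + 16 - 1) / 16 = 7      // 7 stripes of 16 elements cover the query
LAST_H = (7 - 1) * 16 = 96            // offset of the last stripe in the H arrays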


B.2 Affine gap local alignment function

#define ALPHABET_SIZE 4   /* DNA has 4 symbols: A, C, T, G */

extern int8_t D[];        /* Reference sequence, length: D_LENGTH */
extern int8_t Q[];        /* Query sequence, length: Q_LENGTH */

extern const int8_t costs[ALPHABET_SIZE][ALPHABET_SIZE];   /* Substitution score matrix */

int32_t pairwise_local_align_score() {

    int i, j;
    int32_t matrix_max;
    int32_t hd, fs, fu, es, el;

    int32_t h_values[Q_LENGTH+1], e_values[Q_LENGTH+1], f_values[Q_LENGTH+1];

    int32_t h_prev_it;

    /* there will be three different matrices (H, E, F) */
    /* for each matrix only one column is necessary, plus a temporary storage for a single value */

    /* just to make sure all matrix elements are 0!! */
    for (i = 0; i < (Q_LENGTH+1); i++) {
        h_values[i] = 0;
        e_values[i] = 0;
        f_values[i] = 0;
    }

    /*----------------------------------------------------------------------------*/

    matrix_max = 0;

    for (j = 0; j < D_LENGTH; j++) {
        h_prev_it = 0;
        for (i = 1; i <= Q_LENGTH; i++) {
            fu = h_values[i-1] - GAP_OPEN_COST;
            fs = f_values[i-1] - GAP_EXTEND_COST;
            es = h_values[i] - GAP_OPEN_COST;
            el = e_values[i] - GAP_EXTEND_COST;

            /* es and fs are now set to the result of the maximum calculation value for the cell */
            if (el > es) {
                es = el;
            }
            if (fu > fs) {
                fs = fu;
            }

            hd = h_prev_it + costs[(int)Q[i-1]][(int)D[j]];

            if (es > hd) {
                hd = es;
            }
            if (fs > hd) {
                hd = fs;
            }
            if (0 > hd) {
                hd = 0;
            }

            /* store scores from the current cell for the computation of the next cell's score (below) */
            h_prev_it = h_values[i];

            h_values[i] = hd;
            e_values[i] = es;
            f_values[i] = fs;

            if (hd > matrix_max) {
                matrix_max = hd;
            }
        }
    }

    /* return optimal local alignment score */
    return (matrix_max);
}
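A hypothetical driver for the function above, provided only as an illustration of the expected inputs: it supplies the external arrays and assumes that Q_LENGTH, D_LENGTH and the gap cost macros are made visible to the listing (e.g., through a shared header). None of this driver is part of the original code.

#include <stdio.h>
#include <stdint.h>

#define Q_LENGTH 4
#define D_LENGTH 8
#define GAP_OPEN_COST 3
#define GAP_EXTEND_COST 1

int32_t pairwise_local_align_score();

int8_t Q[Q_LENGTH] = {0, 1, 3, 2};               /* query, symbols encoded as 0..3      */
int8_t D[D_LENGTH] = {2, 0, 1, 3, 2, 0, 0, 1};   /* reference, symbols encoded as 0..3  */
const int8_t costs[4][4] = {                     /* +2 on a match, -1 on a mismatch     */
    { 2, -1, -1, -1},
    {-1,  2, -1, -1},
    {-1, -1,  2, -1},
    {-1, -1, -1,  2},
};

int main(void) {
    printf("best local alignment score: %d\n", (int) pairwise_local_align_score());
    return 0;
}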


B.3 Striped SW algorithm implementation in C (new SIMD ISA)

#define ALPHABET_SIZE 4
#define SIMD_ELEMENTS 16   /* default */

/* vector type emulation */
typedef union {
    int8_t b[SIMD_ELEMENTS];
    int32_t w[SIMD_ELEMENTS/2];
    void* v;
} vec;

/* runtime variables */
extern vec queryProfile[ALPHABET_SIZE][S_LENGTH];
extern vec vHLoad[S_LENGTH];
extern vec vHStore[S_LENGTH];
extern vec vE[S_LENGTH];

extern const int8_t D[D_LENGTH];   /* Reference sequence */
extern const int8_t Q[Q_LENGTH];   /* Query sequence */

extern const int8_t costs[ALPHABET_SIZE][ALPHABET_SIZE];

int8_t simd_striped_search() {

    int i, j, count = 0;
    int r0 = 0;
    int bias = 0x00000080;

    register int vF, vH, vMax, vT, E;

    int8_t d;

    void* addr;
    vec *pHLoad = vHLoad;
    vec *pHStore = vHStore;
    vec *pE = vE;
    vec *p;

    /* Initialize all vectors to the bias */
    asm("addvs %0,%1,%2" : "=r"(vF) : "r"(bias), "r"(r0));
    asm("addvs %0,%1,%2" : "=r"(vH) : "r"(bias), "r"(r0));
    asm("addvs %0,%1,%2" : "=r"(vMax) : "r"(bias), "r"(r0));

    for (j = 0; j < S_LENGTH*SIMD_ELEMENTS; j += SIMD_ELEMENTS) {
        asm("sv %0,%1,%2" : : "r"(vMax), "r"(j), "r"(pHLoad));
        asm("sv %0,%1,%2" : : "r"(vMax), "r"(j), "r"(pHStore));
        asm("sv %0,%1,%2" : : "r"(vMax), "r"(j), "r"(pE));
    }

    /*----------------------------------------------------------------------------*/

    for (i = 0; i < D_LENGTH; i++) {

        d = D[i];

        /* reset vF to the bias */
        asm("addvs %0,%1,%2" : "=r"(vF) : "r"(bias), "r"(r0));

        /* load and shift the segment from the previous iteration to align the
           horizontal dependencies */
        asm("lv %0,%1,%2" : "=r"(vT) : "r"((S_LENGTH - 1)*SIMD_ELEMENTS), "r"(pHStore));

        asm("sllv %0,%1" : "=r"(vH) : "r"(vT));
        asm("or %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(bias));

        p = pHLoad;
        pHLoad = pHStore;
        pHStore = p;

        for (j = 0; j < S_LENGTH*SIMD_ELEMENTS; j += SIMD_ELEMENTS) {

            /* load query profile vector and add it to vH */
            addr = &(queryProfile[d]);

            asm("lv %0,%1,%2" : "=r"(vT) : "r"(addr), "r"(j));
            asm("saddvv %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(vT));

            /* current maximum score vector */
            asm("maxvv %0,%1,%2" : "=r"(vMax) : "r"(vMax), "r"(vH));

            /* verify dependencies */
            asm("lv %0,%1,%2" : "=r"(E) : "r"(j), "r"(pE));
            asm("maxvv %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(E));
            asm("maxvv %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(vF));

            asm("sv %0,%1,%2" : : "r"(vH), "r"(j), "r"(pHStore));

            /* open and extend gaps from the current vector's cells */
            asm("srsubvs %0,%1,%2" : "=r"(vH) : "r"(GAP_OPEN_COST), "r"(vH));
            asm("srsubvs %0,%1,%2" : "=r"(E) : "r"(GAP_EXTEND_COST), "r"(E));

            /* check for gap insertion */
            asm("maxvv %0,%1,%2" : "=r"(E) : "r"(E), "r"(vH));
            asm("sv %0,%1,%2" : : "r"(E), "r"(j), "r"(pE));

            /* check for gap extension */
            asm("srsubvs %0,%1,%2" : "=r"(vF) : "r"(GAP_EXTEND_COST), "r"(vF));
            asm("maxvv %0,%1,%2" : "=r"(vF) : "r"(vF), "r"(vH));

            /* load next vector */
            asm("lv %0,%1,%2" : "=r"(vH) : "r"(j), "r"(pHLoad));
        }

        asm("sllv %0,%1" : "=r"(vF) : "r"(vF));
        asm("or %0,%1,%2" : "=r"(vF) : "r"(vF), "r"(bias));

        j = 0;
        asm("lv %0,%1,%2" : "=r"(vH) : "r"(j), "r"(pHStore));

        asm("srsubvs %0,%1,%2" : "=r"(vT) : "r"(GAP_OPEN_COST), "r"(vH));
        asm("cmpvv %0,%1,%2" : "=r"(vT) : "r"(vF), "r"(vT));

        /*** Lazy-F Loop ***/
        asm("bmeiv %0, 0x0064" : : "r"(vT));

        /* check vertical dependencies influence in the score */
        asm("maxvv %0,%1,%2" : "=r"(vH) : "r"(vH), "r"(vF));

        /* store re-computed values */
        asm("sv %0,%1,%2" : : "r"(vH), "r"(j), "r"(pHStore));

        /* check if the new scores influence the E values */
        asm("srsubvs %0,%1,%2" : "=r"(vH) : "r"(GAP_OPEN_COST), "r"(vH));

        asm("lv %0,%1,%2" : "=r"(E) : "r"(j), "r"(pE));
        asm("maxvv %0,%1,%2" : "=r"(E) : "r"(E), "r"(vH));
        asm("sv %0,%1,%2" : : "r"(E), "r"(j), "r"(pE));

        asm("srsubvs %0,%1,%2" : "=r"(vF) : "r"(GAP_EXTEND_COST), "r"(vF));

        /* shift the vector for lazy-loop restart */
        j += SIMD_ELEMENTS;
        if (j >= S_LENGTH*SIMD_ELEMENTS) {
            j = 0;
            asm("sllv %0,%1" : "=r"(vF) : "r"(vF));
            asm("or %0,%1,%2" : "=r"(vF) : "r"(vF), "r"(bias));
        }

        asm("lv %0,%1,%2" : "=r"(vH) : "r"(j), "r"(pHStore));

        asm("srsubvs %0,%1,%2" : "=r"(vT) : "r"(GAP_OPEN_COST), "r"(vH));
        asm("rsubvv %0,%1,%2" : "=r"(vT) : "r"(vT), "r"(vF));

        asm("bgtiv %0, 0xffb4" : : "r"(vT));
        /*** Lazy-F Loop ***/
    }

    /* obtain the maximum score within the vector */
    asm("maxv %0,%1" : "=r"(vMax) : "r"(vMax));
    asm("maxv %0,%1" : "=r"(vMax) : "r"(vMax));
    asm("maxv %0,%1" : "=r"(vMax) : "r"(vMax));
    asm("maxv %0,%1" : "=r"(vMax) : "r"(vMax));

    /* add the bias to the score to eliminate the offset */
    asm("add %0,%1,%2" : "=r"(vMax) : "r"(vMax), "r"(bias));

    return (vMax);
}
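A minimal, hypothetical caller for the kernel above, not part of the thesis code: it only assumes that the query profile, the vHLoad/vHStore/vE arrays and the sequence buffers have already been filled in elsewhere.

#include <stdio.h>
#include <stdint.h>

int8_t simd_striped_search();   /* defined in the listing above */

int main(void) {
    /* the returned score is 8-bit wide, as in the SSE2 version of Section B.4 */
    int score = (int) simd_striped_search();
    printf("striped SW score: %d\n", score);
    return 0;
}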


B.4 Striped SW algorithm implementation (Intel SSE2)

int
smith_waterman_sse2_byte(const unsigned char *query_sequence,
                         unsigned char *query_profile_byte,
                         const int query_length,
                         const unsigned char *db_sequence,
                         const int db_length,
                         unsigned char bias,
                         unsigned char gap_open,
                         unsigned char gap_extend,
                         struct f_struct *f_str)
{
    int i, j, k;
    int score;

    int dup;
    int cmp;
    int iter = (query_length + 15) / 16;

    __m128i *p;
    __m128i *workspace = (__m128i *) f_str->workspace;

    __m128i E, F, H;

    __m128i v_maxscore;
    __m128i v_bias;
    __m128i v_gapopen;
    __m128i v_gapextend;

    __m128i v_temp;
    __m128i v_zero;

    __m128i *pHLoad, *pHStore;
    __m128i *pE;

    __m128i *pScore;

    /* Load the bias to all elements of a constant */
    dup = ((short) bias << 8) | bias;
    v_bias = _mm_insert_epi16(v_bias, dup, 0);
    v_bias = _mm_shufflelo_epi16(v_bias, 0);
    v_bias = _mm_shuffle_epi32(v_bias, 0);

    /* Load gap opening penalty to all elements of a constant */
    dup = ((short) gap_open << 8) | gap_open;
    v_gapopen = _mm_insert_epi16(v_gapopen, dup, 0);
    v_gapopen = _mm_shufflelo_epi16(v_gapopen, 0);
    v_gapopen = _mm_shuffle_epi32(v_gapopen, 0);

    /* Load gap extension penalty to all elements of a constant */
    dup = ((short) gap_extend << 8) | gap_extend;
    v_gapextend = _mm_insert_epi16(v_gapextend, dup, 0);
    v_gapextend = _mm_shufflelo_epi16(v_gapextend, 0);
    v_gapextend = _mm_shuffle_epi32(v_gapextend, 0);

    /* initialize the max score */
    v_maxscore = _mm_xor_si128(v_maxscore, v_maxscore);

    /* create a constant of all zeros for comparison */
    v_zero = _mm_xor_si128(v_zero, v_zero);

    /* Zero out the storage vector */
    k = iter * 2;

    p = workspace;
    for (i = 0; i < k; i++)
    {
        _mm_store_si128(p++, v_maxscore);
    }

    pE = workspace;
    pHStore = pE + iter;
    pHLoad = pHStore + iter;

    for (i = 0; i < db_length; ++i)
    {
        /* fetch first data asap. */
        pScore = (__m128i *) query_profile_byte + db_sequence[i] * iter;

        /* zero out F value. */
        F = _mm_xor_si128(F, F);

        /* load the next h value */
        H = _mm_load_si128(pHStore + iter - 1);
        H = _mm_slli_si128(H, 1);

        p = pHLoad;
        pHLoad = pHStore;
        pHStore = p;

        for (j = 0; j < iter; j++)
        {
            /* load values E. */
            E = _mm_load_si128(pE + j);

            /* add score to H */
            H = _mm_adds_epu8(H, *pScore++);
            H = _mm_subs_epu8(H, v_bias);

            /* Update highest score encountered this far */
            v_maxscore = _mm_max_epu8(v_maxscore, H);

            /* get max from H, E and F */
            H = _mm_max_epu8(H, E);
            H = _mm_max_epu8(H, F);

            /* save H values */
            _mm_store_si128(pHStore + j, H);

            /* subtract the gap open penalty from H */
            H = _mm_subs_epu8(H, v_gapopen);

            /* update E value */
            E = _mm_subs_epu8(E, v_gapextend);
            E = _mm_max_epu8(E, H);

            /* update F value */
            F = _mm_subs_epu8(F, v_gapextend);
            F = _mm_max_epu8(F, H);

            /* save E values */
            _mm_store_si128(pE + j, E);

            /* load the next h value */
            H = _mm_load_si128(pHLoad + j);
        }

        /* reset pointers to the start of the saved data */
        j = 0;
        H = _mm_load_si128(pHStore + j);

        /* the computed F value is for the given column.  since */
        /* we are at the end, we need to shift the F value over */
        /* to the next column. */
        F = _mm_slli_si128(F, 1);
        v_temp = _mm_subs_epu8(H, v_gapopen);
        v_temp = _mm_subs_epu8(F, v_temp);
        v_temp = _mm_cmpeq_epi8(v_temp, v_zero);
        cmp = _mm_movemask_epi8(v_temp);

        while (cmp != 0xffff)
        {
            E = _mm_load_si128(pE + j);

            H = _mm_max_epu8(H, F);

            /* save H values */
            _mm_store_si128(pHStore + j, H);

            /* update E in case the new H value would change it */
            H = _mm_subs_epu8(H, v_gapopen);
            E = _mm_max_epu8(E, H);

            _mm_store_si128(pE + j, E);

            /* update F value */
            F = _mm_subs_epu8(F, v_gapextend);

            j++;
            if (j >= iter)
            {
                j = 0;
                F = _mm_slli_si128(F, 1);
            }
            H = _mm_load_si128(pHStore + j);

            v_temp = _mm_subs_epu8(H, v_gapopen);
            v_temp = _mm_subs_epu8(F, v_temp);
            v_temp = _mm_cmpeq_epi8(v_temp, v_zero);
            cmp = _mm_movemask_epi8(v_temp);
        }
    }

    /* find largest score in the v_maxscore vector */
    v_temp = _mm_srli_si128(v_maxscore, 8);
    v_maxscore = _mm_max_epu8(v_maxscore, v_temp);
    v_temp = _mm_srli_si128(v_maxscore, 4);
    v_maxscore = _mm_max_epu8(v_maxscore, v_temp);
    v_temp = _mm_srli_si128(v_maxscore, 2);
    v_maxscore = _mm_max_epu8(v_maxscore, v_temp);
    v_temp = _mm_srli_si128(v_maxscore, 1);
    v_maxscore = _mm_max_epu8(v_maxscore, v_temp);

    /* store in temporary variable */
    score = _mm_extract_epi16(v_maxscore, 0);
    score = score & 0x00ff;

    /* check if we might have overflowed */
    if (score + bias >= 255)
    {
        score = 255;
    }

    /* return largest score */
    return score;
}
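For context, the sketch below shows the declarations assumed by this routine, since the listing is an excerpt: the intrinsics require the SSE2 header, and the pointer arithmetic on f_str->workspace implies room for three blocks of iter 128-bit vectors (E, H-store and H-load). The struct definition shown here is an assumption, not taken from the original source.

#include <emmintrin.h>

struct f_struct { void *workspace; };   /* assumed minimal definition; must hold 3*iter __m128i */

int smith_waterman_sse2_byte(const unsigned char *query_sequence,
                             unsigned char *query_profile_byte,
                             const int query_length,
                             const unsigned char *db_sequence,
                             const int db_length,
                             unsigned char bias,
                             unsigned char gap_open,
                             unsigned char gap_extend,
                             struct f_struct *f_str);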
