Leveraging DSP: Basic Optimization

FAE Summit 2015

Embedded Processing

Agenda

• C6000 VLIW Architecture
• Software Pipeline
• Software Pipeline Optimization
– Estimate performance
– Using CCS to optimize code
– Software pipeline issues
• Hands‐on Lab: Optimize FIR filter

C6000 VLIW Architecture

TI DSP: Basic Optimization

C6000 DSP Core Architecture

• VLIW (Very Long Instruction Word) architecture:
– Two (almost independent) sides, A and B
– 8 functional units: M, L, S, D
– Up to 8 instructions sustained dispatch rate

[Figure: C6000 DSP core block diagram showing memory, the controller/decoder, register files A (A0 to A31) and B, the MACs, and the functional units .D1, .D2, .S1, .S2, .M1, .M2, .L1, .L2]

C6000 Cross‐Path

[Figure: cross paths between Register File A (up to A31) and Register File B (up to B31)]

Some C6000 Family Members

TMS320C6424 TMS320C6748 TMS320C6678

Partial List of .M Instructions

Partial List of .D Instructions

Partial List of .L Instructions

Partial List of .S Instructions

Software Pipeline


Non‐Pipelined vs. Pipelined CPU

CPU Type        Clock Cycles
                1    2    3    4    5    6    7    8    9
Non‐Pipelined   F1   D1   E1   F2   D2   E2   F3   D3   E3
Pipelined       F1   D1   E1
                     F2   D2   E2
                          F3   D3   E3

Pipeline full at cycle 3.

Stage   Name      Pipeline Function
F       Fetch     • Generate program fetch address
                  • Read opcode
D       Decode    • Route opcode to functional units
                  • Decode instructions
E       Execute   • Execute instructions

Now look at the C66x pipeline.

Program Fetch Phases

Phase   Description
PG      Generate fetch address
PS      Send address to memory
PW      Wait for data ready
PR      Read opcode

[Figure: the PG to PR fetch phases between memory and the C66x core functional units]

Pipeline Phases: Review

Phases: Program Fetch (PG PS PW PR), Decode (D), Execute (E)

PG  PS  PW  PR  D   E
    PG  PS  PW  PR  D   E
        PG  PS  PW  PR  D   E
            PG  PS  PW  PR  D   E
                PG  PS  PW  PR  D   E

Single‐cycle performance is not affected by adding three program fetch phases.

That is, there is still an execute every cycle.
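The cycle counts behind this claim can be checked with a small sketch. It assumes an idealized pipeline with no stalls, and the function names are hypothetical, chosen only for illustration. A deeper pipeline takes longer to fill, but once full it still completes one instruction per cycle.

```c
#include <assert.h>

/* Cycles to run n instructions when each one must pass through all
 * `stages` phases before the next one starts (no overlap). */
static unsigned cycles_non_pipelined(unsigned n, unsigned stages)
{
    assert(stages > 0);
    return n * stages;
}

/* Cycles to run n instructions when a new one enters the pipeline
 * every cycle: `stages` cycles until the first one completes, then
 * one more cycle per additional instruction. Assumes no stalls. */
static unsigned cycles_pipelined(unsigned n, unsigned stages)
{
    assert(stages > 0);
    return (n == 0) ? 0 : stages + (n - 1);
}
```

For the three‐stage F/D/E example with three instructions, `cycles_non_pipelined(3, 3)` gives 9 and `cycles_pipelined(3, 3)` gives 5, matching the table. With the six phases PG PS PW PR D E, `cycles_pipelined(3, 6)` gives 8: three more cycles to fill, and the same one‐per‐cycle rate afterwards.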

How about decode? Is it only one cycle?

Decode Phases

Decode Phase   Description
DP             Intelligently routes instruction to functional unit (dispatch)
DC             Instruction decoded at functional unit (decode)

[Figure: the DP and DC decode phases between the PG to PR fetch phases and the C66x core functional units]

Pipeline Phases

Program Fetch     Decode   Execute
PG  PS  PW  PR    DP  DC   E1

Pipeline Full

How many cycles does it take to execute an instruction?

Instruction Delays

All C66x instructions require only one cycle to execute, but some results are delayed.

Description                                     Instruction Example                         Delay
Single cycle                                    All instructions except those listed below  0
Integer multiplication and new floating point   MPY, FMPYSP                                 1
Legacy floating point multiplication            MPYSP                                       2
Load                                            LDW                                         4
Branch                                          B                                           5
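As a rough illustration of how these delays add up, the sketch below treats the delay as the number of extra cycles after the execute cycle before a result can be used, so a result is usable 1 + delay cycles after issue. The delay values are copied from the table above. The helper names, and the rule that any instruction missing from the table counts as single cycle, are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Delay values copied from the Instruction Delays table. */
struct insn_delay {
    const char *name;
    int delay;
};

static const struct insn_delay delays[] = {
    { "MPY",    1 },
    { "FMPYSP", 1 },
    { "MPYSP",  2 },
    { "LDW",    4 },
    { "B",      5 },
};

/* Look up the delay of an instruction. Assumption: anything not in
 * the table is treated as single cycle (delay 0). */
static int delay_of(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof delays / sizeof delays[0]; i++) {
        if (strcmp(delays[i].name, name) == 0)
            return delays[i].delay;
    }
    return 0;
}

/* Cycles from issue until the result can be used:
 * one execute cycle plus the delay. */
static int result_latency(const char *name)
{
    return 1 + delay_of(name);
}

/* Latency of a chain in which every instruction needs the result of
 * the previous one, with nothing overlapped. */
static int chain_latency(const char *const *names, size_t count)
{
    int total = 0;
    size_t i;
    assert(names != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += result_latency(names[i]);
    return total;
}
```

Under these assumptions, an LDW followed by a dependent FMPYSP needs 5 + 2 = 7 cycles before the product is available. Software pipelining fills this kind of gap with work from other loop iterations.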

Software Pipeline Optimization

• Estimating performance
• Using CCS to optimize code
• Software pipeline issues

Software Pipeline Example

void example(float *in, float *out, int N, float V)
{
    int i;
    float x, sum;

    sum = 1.0f;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

How many cycles would it take to perform the loop five times?

Non‐Pipelined Code Flow

Implementation of the loop in the example() function shown earlier.

Software Pipeline Code Flow

Implementation of the loop in the example() function shown earlier.

The compiler knows all the delays and is smart enough to build the correct software pipeline.

Software Pipeline Support

• The compiler is smart enough to schedule instructions efficiently.
• Software pipeline is the major speed‐up mechanism for VLIW architecture.
• Software pipeline requires deterministic execution:
– No if statements, branches, or calls
– No interrupts
– No dependencies

Software Pipeline Example: Interrupt

Implementation of the loop in the example() function shown earlier.

Software Pipeline Example: SPLOOP

Implementation of the loop in the example() function shown earlier.

[Figure: cycle‐by‐cycle schedule (cycles 1 to 21) of LD, MPY, ADD, and ST on units .D1, .D2, .M1, and .L1. After the interrupt arrives, the iterations already in flight complete (MPY, ADD, ST), the interrupt is served, and the loop then restarts with LD followed by MPY, ADD, and ST.]

Code Development

• Code Generation Tools can build executables from different code types:
– Generic C or C++ code
– C with intrinsics
– Linear Assembly
– Assembly (DETAI)
• Optimization is performed:
– In the front end
– Using the intrinsics
– Resource allocation and software pipeline search in optimized linear assembly
• To understand the quality of the optimization of a loop, compare the theoretical iteration interval (II: the number of cycles between two consecutive results of the loop) to the result of the assembler/optimizer.
– Was the software pipeline successful (if not, why)?
– Is the usage balanced between the two sides (if not, can it be improved)?
– What are the bottlenecks and how to mitigate them?
• To keep the assembly file, set the –k option

Keep Generated Assembly File

Build Options: Optimization and Debug

‐S and ‐MW Setting

And if You Don’t Find the GUI?

Dependencies

What if out = in + 1?

In that case, the code cannot start loading the next input before the previous output is ready.

Unless the compiler knows otherwise, the compiler assumes dependencies.

Implementation of the loop in the example() function shown earlier.

[Figure: schedule on units .D1, .D2, .M1, and .L1 over cycles 1 to 20, with the LD, MPY, and ADD of each iteration waiting for the previous iteration to finish]
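The out = in + 1 case can be reproduced on any host compiler. The sketch below reuses the loop from the slides, with variable declarations filled in as an assumption, and two hypothetical helper functions that run it once on separate buffers and once on overlapping ones. With overlapping buffers every output becomes the next input, so the results differ, which is why the compiler has to serialize the iterations unless it can rule this case out.

```c
#include <assert.h>

/* Same loop as example() in the slides; the declarations of i, x and
 * sum are assumed, since the slides omit them. */
static void example(float *in, float *out, int N, float V)
{
    int i;
    float x, sum;

    sum = 1.0f;
    for (i = 0; i < N; i++) {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;   /* when out == in + 1, this is the next input */
    }
}

/* Hypothetical helper: separate input and output buffers.
 * Returns the second output value. */
static float second_output_separate(void)
{
    float in[3] = { 1.0f, 1.0f, 1.0f };
    float out[3];
    example(in, out, 3, 2.0f);
    return out[1];
}

/* Hypothetical helper: out = in + 1 inside one buffer.
 * The second output value is stored at buf[2]. */
static float second_output_overlapping(void)
{
    float buf[4] = { 1.0f, 1.0f, 1.0f, 1.0f };
    example(buf, buf + 1, 3, 2.0f);
    return buf[2];
}
```

With all inputs set to 1.0 and V = 2.0, the separate buffers give 5.0 as the second output, while the overlapping buffer gives 9.0, because the second multiply reads the 3.0 that the first iteration just stored.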

Dependencies

The compiler knows that there are no dependencies in the following cases:
• It can understand it from the code (e.g., the calling function is in the same file as the routine).
• The code uses the restrict keyword.
• A compiler switch tells the compiler that there is no overlap between vector pointers (‐mt)

IF and Conditional Execution

• All assembly instructions are conditional instructions
• In a conditional instruction, the functional unit executes the instruction, but the result is written to the output register ONLY if the condition is true
• The condition needs to be known ONLY the cycle before the result is written to the output register
• Conditional execution can replace if statements as follows:

if (x < 1000.0) sum = sum + x --> [x <1000.0] sum=sum+x

• The compiler is smart enough to convert “simple” if statements into conditional execution
• The result of x < 1000.0 should be known just one cycle before the last step of execution
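The same idea can be sketched in portable C. This is only an analogy for what the compiler does with conditional instructions, not the generated code, and the function names are made up for the example. The second version always computes the add and uses the condition only to decide whether the result is kept, so the loop body has no branch.

```c
#include <assert.h>

/* Branching form: the add is skipped when the condition is false. */
static float accumulate_branch(const float *x, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] < 1000.0f)
            sum = sum + x[i];
    }
    return sum;
}

/* Predicated form: the add is always computed, and the condition only
 * selects whether the new value is kept. */
static float accumulate_predicated(const float *x, int n)
{
    float sum = 0.0f;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        float candidate = sum + x[i];   /* executed every iteration */
        int keep = (x[i] < 1000.0f);    /* the condition */
        sum = keep ? candidate : sum;   /* result kept only if true */
    }
    return sum;
}
```

Both functions return the same value for the same input. The predicated form is the shape that maps onto a single conditional instruction per statement.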

Function Calls

void example(float *in, float *out, int N, float V)
{
    int i;
    float x, sum;

    sum = 1.0f;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + f(x);
        *out++ = sum;
    }
}

• A function call prevents the compiler from generating the software pipeline.
• Inlining the function removes this limitation.
• The compiler does not inline functions unless it is told to. It is up to the user.
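One way to make the function available for inlining is to define it as `static inline` in the same file as the loop, as in the sketch below. The body of `f()` is a hypothetical stand‐in, since the slides do not define it, and the declarations inside `example()` are assumed. Whether a given compiler actually inlines the call still depends on its optimization settings.

```c
#include <assert.h>

/* Hypothetical stand-in for f(); the slides do not define it.
 * Defining it as static inline in the same file makes the body
 * visible to the compiler at the call site. */
static inline float f(float x)
{
    return x * x;
}

static void example(float *in, float *out, int N, float V)
{
    int i;
    float x, sum;

    assert(N >= 0);
    sum = 1.0f;
    for (i = 0; i < N; i++) {
        x = *in++ * V;
        sum = sum + f(x);   /* candidate for inlining */
        *out++ = sum;
    }
}
```

Once the call is inlined, the loop body contains only loads, arithmetic, and stores, which is the form the software pipeliner can schedule.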

Software Pipeline Example

void copyFunction(int *p1, int *p2, int N)
{
    int i ;
    for (i=0; i<N;i++)
    {
        *p2++ = *p1++ ;
    }
    return ;
}

Software Pipeline Example: Reminder

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../utility.c
;*      Loop source line                 : 12
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 6
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     0        2*
;*      .M units                     0        0
;*      .X cross paths               0        0
;*      .T address paths             0        2*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     0        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 6  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped

Restrict Qualifiers

• Loop iterations cannot be overlapped unless input and output are independent (do not reference the same memory locations).
• Most users write their loops so that loads and stores do not overlap.
• Compiler does not know this unless the compiler sees all callers or user tells compiler.
• Use restrict qualifiers to notify compiler.
• Restrict tells the compiler that any location addressed by the following pointer WILL NOT be accessed by any other vector.

void copyFunction(int *restrict p1, int *p2, int N)
{
    int i ;
    for (i=0; i<N;i++)
    {
        *p2++ = *p1++ ;
    }
    return ;
}

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../utility.c
;*      Loop source line                 : 12
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 1
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     1*       1*
;*      .M units                     0        0
;*      .X cross paths               0        1*
;*      .T address paths             1*       1*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1*       1*
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 1  Schedule found with 7 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
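To get a feel for what the change from ii = 6 to ii = 1 means, the sketch below gives a rough cycle estimate from the two values the compiler reports. The formula is an assumption for illustration and does not come from the slides or the compiler: it supposes that one iteration spans about ii times the number of iterations in parallel, that a new iteration starts every ii cycles, and that there are no stalls. The function name is hypothetical.

```c
#include <assert.h>

/* Rough cycle estimate for a software-pipelined loop.
 * Assumption: one iteration spans about ii * iterations_in_parallel
 * cycles, and a new iteration starts every ii cycles, with no stalls. */
static unsigned long estimate_cycles(unsigned long trip_count,
                                     unsigned ii,
                                     unsigned iterations_in_parallel)
{
    assert(ii > 0 && iterations_in_parallel > 0);
    if (trip_count == 0)
        return 0;
    return (unsigned long)ii * (trip_count - 1)
         + (unsigned long)ii * iterations_in_parallel;
}
```

For a trip count of 1000, this estimate gives about 6006 cycles for the first report (ii = 6, 2 iterations in parallel) and about 1006 cycles for the version with restrict (ii = 1, 7 iterations in parallel). For long loops the iteration interval dominates, which is why it is the number to compare against the theoretical bound.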

Agenda

• C6000 VLIW Architecture
• Software Pipeline
• Software Pipeline Optimization
– Estimate performance
– Using CCS to optimize code
– Software pipeline issues
• Hands‐on Lab: Optimize FIR filter

DSP Lab Instructions
