Leveraging DSP: Basic Optimization for C6000 Digital Signal Processors

TI Training - Leveraging DSP: Basic Optimization for Signal ...• Optimization Techniques for the TI C6000 Compiler • TMS320C6000 DSP Optimization Workshop • For questions regarding


Agenda

• C6000 VLIW Architecture
• Hardware Pipeline
• Software Pipeline Optimization
  – Estimating performance
  – Using CCS to optimize code
  – Software pipeline issues

C6000 VLIW Architecture

C6000 DSP Core Architecture

• VLIW (Very Long Instruction Word) architecture:
  – Two (almost independent) sides, A and B
  – 8 functional units: M, L, S, D
  – Up to 8 instructions sustained dispatch rate

[Figure: memory; register files A0 to A31 and B0 to B31; functional units .D1/.D2, .S1/.S2, .M1/.M2 (MACs), and .L1/.L2; controller/decoder]

C6000 Cross-Path

[Figure: Register File A (A0 to A31) and Register File B (B0 to B31) with the .D, .S, .M, and .L functional units, illustrating the cross-path between the A and B sides]


C6000 Processors

TMS320C6424 TMS320C6748 TMS320C6678


Partial List of .M Instructions


Partial List of .D Instructions


Partial List of .L Instructions


Partial List of .S Instructions

Hardware Pipeline

Non-Pipelined vs. Pipelined CPU

CPU Type         Clock Cycles
                 1    2    3    4    5    6    7    8    9
Non-Pipelined    F1   D1   E1   F2   D2   E2   F3   D3   E3
Pipelined        F1   D1   E1
                      F2   D2   E2
                           F3   D3   E3
                 (pipeline full from cycle 3)

Stage         Pipeline Function
F (Fetch)     • Generate program fetch address
              • Read opcode
D (Decode)    • Route opcode to functional units
              • Decode instructions
E (Execute)   • Execute instructions

Now look at the C66x pipeline.
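The cycle counts in the diagram can be reproduced with a small calculation. The following is a minimal sketch, assuming an ideal pipeline with no stalls and one instruction entering per cycle; the function names are hypothetical and are not part of any TI tool.

```c
#include <assert.h>

/* Non-pipelined CPU: each instruction passes through every stage
   before the next instruction starts. */
unsigned long non_pipelined_cycles(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return n * stages;
}

/* Ideal pipelined CPU (assumption: no stalls): the first instruction
   needs `stages` cycles, and each later instruction completes one
   cycle after its predecessor. */
unsigned long pipelined_cycles(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return (n == 0) ? 0 : stages + (n - 1);
}
```

For the three instructions and three stages (F, D, E) in the diagram, the first function gives 9 cycles and the second gives 5.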

Program Fetch Phases

Phase   Description
PG      Generate fetch address
PS      Send address to memory
PW      Wait for data ready
PR      Read opcode

[Figure: C66x core with its functional units and memory, showing where the PG, PS, PW, and PR phases take place]

Pipeline Phases: Review

Program Fetch    Decode   Execute
PG PS PW PR      D        E

[Figure: successive instructions, each passing through PG, PS, PW, PR, D, and E, staggered by one cycle]

Single-cycle performance is not affected by adding three program fetch phases. That is, there is still an execute every cycle.

How about decode? Is it only one cycle?

Decode Phases

Phase   Description
DP      Intelligently routes instruction to functional unit (dispatch)
DC      Instruction decoded at functional unit (decode)

[Figure: C66x core with its functional units and memory, showing where the PG, PS, PW, PR, DP, and DC phases take place]

Pipeline Phases

Program Fetch    Decode   Execute
PG PS PW PR      DP DC    E1

[Figure: successive instructions, each passing through PG, PS, PW, PR, DP, DC, and E1, staggered by one cycle until the pipeline is full]

How many cycles does it take to execute an instruction?

Instruction Delays

Most C66x instructions require only one cycle to execute. But some instruction results are delayed.

Description                                      Instruction Example                      Delay (cycles)
Single cycle                                     All instructions except those below      0
Integer multiplication and new floating point    MPY, FMPYSP                              1
Legacy floating point multiplication             MPYSP                                    2
Load                                             LDW                                      4
Branch                                           B                                        5
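To make the meaning of the delay column concrete, the sketch below encodes the table as a lookup and computes the first cycle in which a dependent instruction can use a result. The delay values are copied from the table above; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Instruction classes in the order of the table above. */
enum instr_class {
    SINGLE_CYCLE,        /* delay 0 */
    INT_MPY_OR_FMPYSP,   /* MPY, FMPYSP: delay 1 */
    LEGACY_MPYSP,        /* MPYSP: delay 2 (value as given in the table) */
    LOAD,                /* LDW: delay 4 */
    BRANCH               /* B: delay 5 */
};

static const int delay_slots[] = {0, 1, 2, 4, 5};

/* First cycle in which a dependent instruction can use the result of
   an instruction that executes (E1) in cycle `issue_cycle`. */
int result_ready_cycle(enum instr_class c, int issue_cycle)
{
    assert((int)c >= 0 && (int)c <= (int)BRANCH);
    return issue_cycle + delay_slots[c] + 1;
}
```

For example, a load that executes in cycle 1 delivers its data to an instruction that executes in cycle 6, while the result of a single-cycle instruction is usable in the very next cycle.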

Software Pipeline Optimization

• Estimating performance
• Using CCS to optimize code
• Software pipeline issues

Software Pipeline Example

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

How many cycles would it take to perform the loop five times?

Non-Pipeline Code Flow

Implementation of the loop in the following code:

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

Software Pipeline Code Flow

Implementation of the loop in the following code:

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

The compiler knows all the delays and is smart enough to build the correct software pipeline.
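The difference between the two code flows can be estimated with a short calculation. This is a rough sketch, not output from the TI tools: `iter_len` is the number of cycles one iteration takes from its load to its store, and `ii` is the iteration interval (cycles between the starts of consecutive iterations). The function names and any values passed in are illustrative assumptions.

```c
#include <assert.h>

/* Non-pipelined flow: each iteration finishes before the next starts. */
unsigned long sequential_loop_cycles(unsigned long n, unsigned long iter_len)
{
    return n * iter_len;
}

/* Software-pipelined flow: a new iteration starts every `ii` cycles,
   so the loop takes one full iteration plus (n - 1) intervals. */
unsigned long sw_pipelined_loop_cycles(unsigned long n,
                                       unsigned long iter_len,
                                       unsigned long ii)
{
    assert(ii > 0 && ii <= iter_len);
    return (n == 0) ? 0 : iter_len + (n - 1) * ii;
}
```

Assuming, purely for illustration, an 8-cycle iteration and an iteration interval of 1, five iterations take 40 cycles without software pipelining and 12 cycles with it.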

Software Pipeline Support

• The compiler is smart enough to schedule instructions efficiently.
• Software pipeline is the major speed-up mechanism for VLIW architecture.
• Software pipeline requires deterministic execution:
  – No if, branch, and/or call
  – No interrupts
  – No dependencies

Software Pipeline Example: Interrupt

[Figure: cycle-by-cycle schedule (cycles 1 to 21) of the example loop on units .D1, .D2, .M1, and .L1. The pipeline fills with LD, then MPY, ADD, and ST, and from cycle 8 on every cycle executes LD, ST, MPY, and ADD. Labels mark the interrupt, the ISR, and the return from the ISR.]

Implementation of the loop in the following code:

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

Software Pipeline Example: SPLOOP

[Figure: cycle-by-cycle schedule (cycles 1 to 21) of the example loop on units .D1, .D2, .M1, and .L1 with SPLOOP. After the interrupt arrives, no new LD is issued and the iterations in flight drain through MPY, ADD, and ST. The interrupt is then served, after which the pipeline refills with LD, MPY, ADD, and ST and is full again at cycle 21.]

Implementation of the loop in the following code:

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

Code Development

• Code Generation Tools can build executables from different code types:
  – Generic C or C++ code
  – C with intrinsics
  – Linear Assembly
  – Assembly (DETAI)
• Optimization is performed:
  – In the front end
  – Using the intrinsics
  – Resource allocation and software pipeline search in optimized linear assembly
• To understand the quality of the optimization of a loop, compare the theoretical iteration interval (II: the actual number of cycles between two results of the loop) to the result of the assembler/optimizer.
  – Was the software pipeline successful (if not, why)?
  – Is the usage balanced between the two sides (if not, can it be improved)?
  – What are the bottlenecks and how to mitigate them?
• To keep the assembly file, set the -k option.


Keep Generated Assembly File


Build Options: Optimization and Debug


‐S and ‐MW Settings


Set Additional Flags

Dependencies

What if out = in + 1?

In that case, the code cannot start loading the next input before the previous output is ready.

Unless the compiler knows otherwise, the compiler assumes dependencies.

[Figure: cycle-by-cycle schedule (cycles 1 to 21) on units .D1, .D2, .M1, and .L1 in which the LD of each iteration is not issued until the ST of the previous iteration, so LD, MPY, ADD, and ST of successive iterations do not overlap]

Implementation of the loop in the following code:

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}
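The out = in + 1 case can be tried directly. The sketch below restates the example loop (with the declarations filled in) and runs it once on separate buffers and once on overlapping buffers; the two helper functions and the input values are hypothetical and serve only to show why the compiler has to be conservative.

```c
#include <assert.h>

/* The example loop, with the missing declarations filled in. */
void example(float *in, float *out, int N, float V)
{
    float sum = 1.0f;
    int i;

    assert(in != 0 && out != 0);
    for (i = 0; i < N; i++) {
        float x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}

/* Separate buffers: every load reads an original input value. */
void run_separate(float out[3])
{
    float in[3] = {1.0f, 1.0f, 1.0f};
    example(in, out, 3, 1.0f);
}

/* Overlapping buffers (out = in + 1): each store overwrites the
   value that the next iteration is about to load. */
void run_overlapping(float buf[4])
{
    buf[0] = buf[1] = buf[2] = buf[3] = 1.0f;
    example(buf, buf + 1, 3, 1.0f);
}
```

With separate buffers the outputs are 2, 3, 4. With overlapping buffers the buffer ends up as 1, 2, 4, 8, because every iteration consumes the value stored by the previous one. A schedule that loaded the next input early would change that result, which is why the loads cannot be hoisted unless the compiler knows the pointers do not overlap.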

No Dependencies

The compiler concludes that there are no dependencies in the following cases:
• The compiler determines it from the code (e.g., the calling function is in the same file as the routine).
• The code uses the restrict keyword.
• A compiler switch tells the compiler that there is no overlap between vector pointers (-mt).
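As an illustration of the second case, the running example could be declared with restrict-qualified pointers. This is a sketch under the assumption that the caller never passes overlapping buffers; the function name `example_restrict` and the added `const` are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* Sketch: the example loop with C99 restrict qualifiers. The caller
   promises that `in` and `out` never refer to the same memory, so
   loads for later iterations may be scheduled before earlier stores. */
void example_restrict(const float *restrict in, float *restrict out,
                      int N, float V)
{
    float sum = 1.0f;
    int i;

    assert(in != 0 && out != 0);
    for (i = 0; i < N; i++) {
        float x = *in++ * V;
        sum = sum + x;
        *out++ = sum;
    }
}
```

The qualifier is a promise, not a check: calling this function with overlapping buffers is undefined behavior.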

IF and Conditional Execution

• All assembly instructions are conditional instructions.
• In a conditional instruction, the functional unit executes the instruction and writes the result to the output register ONLY if the condition is true.
• The true condition should be known one cycle, and ONLY one cycle, before the result is written to the output register.
• Conditional execution can replace if statements as follows:

    if (x < 1000.0) sum = sum + x  -->  [x < 1000.0] sum = sum + x

• The compiler is smart enough to convert "simple" if statements into conditional execution.
• The result of x < 1000.0 should be known just one cycle before the last step of execution.
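The effect of conditional execution can be mimicked in plain C. The sketch below is only an illustration of the idea, not what the compiler emits: the first function uses the if statement from the slide inside a loop, and the second computes the same sum without a branch by selecting the value to add. The function names are hypothetical.

```c
#include <assert.h>

/* Branching form: the if statement from the slide, inside a loop. */
float sum_below_if(const float *x, int n)
{
    float sum = 0.0f;
    int i;

    assert(x != 0 || n == 0);
    for (i = 0; i < n; i++) {
        if (x[i] < 1000.0f)
            sum = sum + x[i];
    }
    return sum;
}

/* Branch-free form: the condition selects what is added, so the loop
   body is the same straight-line code in every iteration. */
float sum_below_select(const float *x, int n)
{
    float sum = 0.0f;
    int i;

    assert(x != 0 || n == 0);
    for (i = 0; i < n; i++)
        sum = sum + ((x[i] < 1000.0f) ? x[i] : 0.0f);
    return sum;
}
```

Both forms return the same value for ordinary inputs. For a "simple" if like this one the compiler is expected to perform the conversion itself, so the second form is shown only to make the mechanism visible.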

Function Calls

void example(float *in, float *out, int N, float V)
{
    float sum, x;
    int i;

    sum = 1.0;
    for (i = 0; i < N; i++)
    {
        x = *in++ * V;
        sum = sum + f(x);
        *out++ = sum;
    }
}

• A function call prevents the compiler from generating the software pipeline.
• Inlining the function removes this limitation.
• The compiler does not inline a function (unless it is told to do so). It is up to the user.
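One way to tell the compiler that a call may be inlined is to define the callee as a static inline function in the same file. The sketch below assumes a hypothetical body for f(), since the slide does not define it; `example_inlined` is likewise an illustrative name.

```c
#include <assert.h>

/* Hypothetical body for f(); the slide does not say what f computes. */
static inline float f(float x)
{
    return x * x;
}

/* The loop from the slide. With f() visible and marked inline, the
   call can be replaced by its body, leaving a loop without calls. */
void example_inlined(float *in, float *out, int N, float V)
{
    float sum = 1.0f;
    int i;

    assert(in != 0 && out != 0);
    for (i = 0; i < N; i++) {
        float x = *in++ * V;
        sum = sum + f(x);
        *out++ = sum;
    }
}
```

The inline keyword is a request rather than a guarantee, so it is worth confirming in the generated assembly that the call is gone and that the loop was software pipelined.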

Software Pipeline Example

void copyFunction(int *p1, int *p2, int N)
{
    int i;
    for (i = 0; i < N; i++)
    {
        *p2++ = *p1++;
    }
    return;
}

Software Pipeline Example: Reminder

;*--------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../utility.c
;*      Loop source line                 : 12
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 6
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     0        2*
;*      .M units                     0        0
;*      .X cross paths               0        0
;*      .T address paths             0        2*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     0        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 6  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
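The report gives two lower bounds on the iteration interval: the Loop Carried Dependency Bound and the Partitioned Resource Bound. The schedule search cannot succeed below the larger of the two, which the following sketch computes; the function name is hypothetical.

```c
#include <assert.h>

/* Smallest iteration interval the scheduler can hope for: the larger
   of the loop-carried dependency bound and the partitioned resource
   bound reported by the compiler. */
int min_iteration_interval(int loop_carried_bound,
                           int partitioned_resource_bound)
{
    assert(loop_carried_bound >= 0 && partitioned_resource_bound >= 0);
    return (loop_carried_bound > partitioned_resource_bound)
               ? loop_carried_bound
               : partitioned_resource_bound;
}
```

For the report above the bounds are 6 and 2, so the minimum is 6, matching the reported ii = 6. Because the resources alone would allow 2, the assumed dependency between the loads and stores is the bottleneck in this loop.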

Restrict Qualifiers

• Loop iterations cannot be overlapped unless input and output are independent (do not reference the same memory locations).
• Most users write their loops so that loads and stores do not overlap.
• The compiler does not know this unless the compiler sees all callers or the user tells the compiler.
• Use restrict qualifiers to notify the compiler.
• Restrict tells the compiler that any location addressed by the following pointer WILL NOT be accessed through any other pointer.

void copyFunction(int *restrict p1, int *p2, int N)
{
    int i;
    for (i = 0; i < N; i++)
    {
        *p2++ = *p1++;
    }
    return;
}

Software Pipeline Example: Reminder

;*--------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../utility.c
;*      Loop source line                 : 12
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 6
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     0        2*
;*      .M units                     0        0
;*      .X cross paths               0        0
;*      .T address paths             0        2*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     0        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 6  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped


For More Information

• Optimization Techniques for the TI C6000 Compiler

• TMS320C6000 DSP Optimization Workshop

• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.