Page 1

A first attempt at learning about optimizing the TigerSHARC code

TigerSHARC assembly syntax

Page 2

04/20/23 TigerSHARC assemble code 3, M. Smith, ECE, University of Calgary, Canada

What we NOW KNOW!

Can we return from an assembly language routine without crashing the processor?

Return a parameter from assembly language routine (Is it same for ints and floats?)

Pass parameters into assembly language (Is it same for ints and floats?)

Do IF THEN ELSE statements

Read and write values to memory

Read and write values in a loop

Do some mathematics on the values fetched from memory

All this stuff is demonstrated by coding HalfWaveRectifyASM( )
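For orientation, here is a minimal C++ sketch of the kind of routine HalfWaveRectifyASM( ) stands for. The slides do not show the C++ source, so the function name HalfWaveRectifyCPP, the parameter list and the pointer return value are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical C++ counterpart of the routine the slides call HalfWaveRectifyASM().
// The parameter list and the pointer return value are assumptions, not the
// author's actual prototype.
int* HalfWaveRectifyCPP(const int* input, int* output, int n) {
    if (n < 1) {
        return nullptr;                        // IF THEN ELSE: nothing to process
    }
    assert(input != nullptr && output != nullptr);
    for (int count = 0; count < n; ++count) {  // read and write values in a loop
        int value = input[count];              // read a value from memory
        if (value < 0) {
            value = 0;                         // the "mathematics": clip negatives to zero
        }
        output[count] = value;                 // write the result back to memory
    }
    return output;                             // return a parameter to the caller
}
```

Each item on the list above (return, parameter passing, IF THEN ELSE, memory access, looping, arithmetic) shows up once in this sketch, which is why such a small routine makes a useful first exercise.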

Page 3

Not bad for a first effort. Faster than the compiler in debug mode.

Need to learn from the compiler how to speed up the code.

Page 4

How does the compiler do it? Look at the source code and use mixed mode to show the assembly code it generates.

Warning – the instructions are displayed out of order.

Page 5

Many new instructions. Many parallel instructions. The ones inside the loop are key.

How important is it to code whether the conditional jump is predicted or not (NP or not)?

BIG 25%

523 → 435 cycles

Page 6

Many new instructions. Many parallel instructions. The ones inside the loop are key.

How important is not using J registers when reading from memory?

XR1 rather than J1

Now need:

Condition XALT rather than JLT

XCOMP rather than COMP

JMP (NP): 523 → 435 cycles

XR1 not J1: 435 → 491 cycles

Page 7

Many new instructions. Many parallel instructions. The ones inside the loop are key.

How important is not using J registers as a destination when reading from memory, and using pointers (*pt++) rather than array indexing (pt[count])?

XR1 rather than J1

Now need:

Condition XALT rather than JLT

XCOMP rather than COMP

JMP (NP): 523 → 435 cycles

XR1 not J1: 435 → 491 cycles

and ++ operator: 491 → 435 cycles

Page 8

Redoing our code to this point. Note the new instructions using XR2 and R2.

Try a little thing. R2 = 0 is a constant – move it outside the loop. Found we had already set R2 = 0 outside the loop.

Difference: the instruction was executed about half the time – expect to improve by 12 cycles.

Got 491 - 476 = 15 – timing is only accurate to around 10 cycles.

Page 9

The IF THEN JUMPS in the loop are killing us. Rewrite the C++ code into an optimized form.

Reduce the loop size from 6 instructions (if > 0) and 7 instructions (if < 0) to 4 either way.

We go 24 times round the loop – expect an improvement of 48 cycles. We go from 476 to 250 cycles.

That’s 226 cycles, or roughly 9 cycles saved each time around the loop.

The jumps were costing us 9 cycles by disrupting the TigerSHARC pipeline.

Need to get rid of this jump and counter increment.

Blackfin has hardware loops. Does the TigerSHARC? Duh!!
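A minimal sketch of what the "optimized form" of the C++ loop could look like: pointers stepped with the ++ operator, and the IF THEN ELSE inside the loop replaced by a single conditional selection. The function name and signature are assumptions, and whether a particular compiler turns the selection into a conditional instruction rather than a jump is not guaranteed.

```cpp
#include <cassert>

// Hypothetical rewrite of the rectifier loop with no IF THEN ELSE jump inside it.
// Name and signature are illustrative assumptions.
int* HalfWaveRectifyNoInnerJump(const int* input, int* output, int n) {
    if (n < 1) {
        return nullptr;
    }
    assert(input != nullptr && output != nullptr);
    const int* in_pt = input;    // walk the arrays with pointers (*pt++)
    int* out_pt = output;        // rather than indexing (pt[count])
    for (int count = 0; count < n; ++count) {
        const int value = *in_pt++;
        // One selection per sample: the same work is done whatever the sign,
        // so there is no "6 instructions one way, 7 the other" split.
        *out_pt++ = (value > 0) ? value : 0;
    }
    return output;
}
```

The only jump left in this form is the loop jump itself with its counter increment, which is the part a hardware loop removes.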

Page 10

Many new instructions. Many parallel instructions. The ones inside the loop are key.

Hardware loop instructions

LC0 = loop counter 0 – there may only be a few hardware loops possible. SHARC ADSP-21061 allows 6, Blackfin ADSP-BF5XX allows 2, so we still need to understand software loops.

IF LC0E – if hardware loop expired; IF NLC0E – if not expired – MM!!

JMP (NP): 523 → 435 cycles

XR1 not J1: 491 cycles

and ++ operator: 435 cycles

Remove inner jumps from loop: 250 cycles

Page 11

With hardware loops – 166 cycles! Are we cooking or what!

Fine tuning – can we save N cycles (1 each time round the loop) by merging instructions?

Page 12

Merge those two instructions and use our fancy SIGN-BIT trick for float code

We are beating the optimized compiler on the float code by a factor of 2.

We need 1 cycle to beat the compiler on the optimized int code.

Find it for Assignment 1

I did 138 cycles
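The slides do not spell out the SIGN-BIT trick. One plausible reading, sketched here purely as an assumption, is that a 32-bit IEEE-754 float is negative exactly when its top bit is set, so the same "is bit 31 set?" test used for integers can be applied to the raw bits of a float. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// True when the sign bit (bit 31) of the float's bit pattern is set.
// Note: this also reports true for -0.0f and for NaNs with the sign bit set.
bool SignBitSet(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // well-defined way to inspect the bits
    return (bits & 0x80000000u) != 0;
}

// Half-wave rectify a float array using only the sign-bit test.
void HalfWaveRectifyFloat(const float* input, float* output, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        output[i] = SignBitSet(input[i]) ? 0.0f : input[i];
    }
}
```

For rectification the -0.0f case is harmless, since the output is zero either way.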

Page 13

My code passes the tests in 138 cycles. Extra 11 cycles from outside the loop (not worth the time and effort if the loop was larger, or there were more points to process).

Does turning off the cache make any difference to our code?

Find out in Assignment 1

Page 14

What is the theoretical maximum speed? This is something I always work out BEFORE optimizing.

I have a target to meet – normally finish all processing before next sample comes in.

If my code (in theory) can’t meet that target, I need to find a different approach, not spend days optimizing useless code.

In theory – if I have written the code with no hidden stalls – 1 cycle per instruction

6 instructions outside the loop

4 instructions inside the loop – N * 4 cycles

Very short loop – I have read that getting out of a very short loop stalls the pipeline – let’s add 5 cycles for that

6 + 24 * 4 + 5 = 107 in theory, 138 in practice

Difference 31 – in the region of 24, or roughly 1 stall each time round the loop

Can use the pipeline viewer to find out where the problem is occurring. In a long loop, done 4096 times, it might be worth it.
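The cycle budget on this slide can be written down as a small calculation. The sketch below uses the slide's own figures (6 instructions outside the loop, 4 inside, 24 trips, 5 cycles assumed for leaving a short loop); the function names are hypothetical.

```cpp
#include <cassert>

// Theoretical cycle count, assuming 1 cycle per instruction and no hidden stalls.
constexpr int TheoreticalCycles(int outside_loop, int inside_loop, int trips,
                                int loop_exit_penalty) {
    return outside_loop + inside_loop * trips + loop_exit_penalty;
}

// The slide's figures: 6 + 24 * 4 + 5.
constexpr int kHalfWaveBudget = TheoreticalCycles(6, 4, 24, 5);
static_assert(kHalfWaveBudget == 107, "matches the slide's theoretical figure");

// Cycles not accounted for by the budget, averaged per trip round the loop.
double UnexplainedCyclesPerTrip(int measured, int theoretical, int trips) {
    assert(trips > 0);
    return static_cast<double>(measured - theoretical) / trips;
}
```

Calling UnexplainedCyclesPerTrip(138, kHalfWaveBudget, 24) gives the average number of stall cycles per trip implied by the measured time, which is the figure to compare against "1 stall each time round the loop".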

Page 15

Trying to understand what we have done

Most TigerSHARC instructions can be made conditional.

WHY? Because doing a NOP instruction (if the condition is not met) is much less disruptive to the instruction pipeline than doing a JUMP (loss of 9 cycles if the jump is taken – probably more because of code format)

Page 16

Why mostly conditional instructions?

TigerSHARC has a very deep pipeline, so conditional jumps can cause a large disruption of the pipeline

Better to use non-jump instructions which don’t disrupt the pipeline, even if the instruction is not executed (acts as a NOP)

if (N < 1) return_value = NULL;

else return_value = value;

Page 17

Why mostly conditional instructions?

if (N < 1) return_value = NULL;

else return_value = value;

With jumps:

COMP(N, 1);;
IF NJLT, JUMP _ELSE;;
J5 = NULL;;
JUMP _END_IF;;
_ELSE: J5 = value;;
_END_IF:

With conditional instructions (same C++ statement):

COMP(N, 1);;
IF JLT; DO, J5 = NULL;;
IF NJLT; DO, J5 = value;;

The concept is there – we need to check whether the syntax is correct

Page 18

Trying to understand what we have done

Use J registers for address operations, but store values from memory in XR1 and YR1

WHY? An instruction like [J1] = XR1;; has the potential to be put in parallel with more operations

Page 19

Hardware – zero overhead loop. About 4 * N cycles better (N is the number of times round the loop)

LC0 = N;;    Load counter 0 with value N

Start_of_loop_LABEL:    Loop code here ;;

IF NLC0E, JUMP Start_of_loop_LABEL;;

NLC0E – Not LC0 expired – essentially: compare LC0 with 2. If less than 2, continue (don’t jump). If 2 or more, then decrement LC0 and jump.

All sorts of stall issues if not properly aligned – TigerSHARC manual 8-23

CAN’T USE WHEN THERE IS A FUNCTION CALL IN THE LOOP? WHY NOT? – WHAT HAPPENS – NEED TO EXPLORE MORE.

Using a software loop when there is a function call is okay, since calling a function is slow anyway – don’t need the efficiency
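To see how many times the loop body runs under the NLC0E rule as the slide words it, here is a small software model. It follows only the slide's description (test at the bottom of the loop, compare with 2, decrement and jump) and has not been checked against the processor manual; CountLoopTrips is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Counts trips round a loop controlled the way the slide describes NLC0E.
std::uint32_t CountLoopTrips(std::uint32_t n) {
    assert(n <= 1000000u);        // keep this software model quick to run
    std::uint32_t lc0 = n;        // LC0 = N;;
    std::uint32_t trips = 0;
    for (;;) {
        ++trips;                  // "Loop code here"
        if (lc0 < 2) {
            break;                // LC0 expired: continue (don't jump)
        }
        --lc0;                    // 2 or more: decrement LC0 and jump back
    }
    return trips;
}
```

In this model the body runs N times for any N of 1 or more. Because the test sits at the bottom of the loop, the body still runs once when N is 0, which is one more reason to check N before entering the loop.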

Page 20

Hardware – zero overhead loop. BIG WARNING

LC0 = N;;    Load counter 0 with value N

LC0 uses UNSIGNED ARITHMETIC – MAKE SURE N is not negative, as a negative number has the same bit pattern as a VERY large unsigned number, and the processor will go around the loop for a week

We did a check for N <= 0 before entering the hardware loop as another part of our code – so we lucked in – otherwise we could have had big problems.

This issue is so important (and time wasting in the laboratories) that I will be deducting marks for it in quizzes and exams
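The warning can be checked in C++, where converting a negative 32-bit integer to an unsigned one keeps the bit pattern. The sketch below shows the value a loop counter would see, plus a guard of the kind described above; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// What a signed N looks like once it is treated as an unsigned 32-bit loop count.
std::uint32_t AsLoopCounter(std::int32_t n) {
    return static_cast<std::uint32_t>(n);   // same bit pattern, read as unsigned
}

// Guarded version: N must be checked BEFORE it is loaded into the loop counter.
std::uint32_t CheckedLoopCount(std::int32_t n) {
    assert(n > 0 && "check N before loading the hardware loop counter");
    return static_cast<std::uint32_t>(n);
}
```

AsLoopCounter(-1) is 4294967295, so a loop started with N = -1 would go round more than four billion times.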

Page 21

What’s this XR1, YR1 and R1 stuff

TigerSHARC is designed to do many things at once

So you need appropriate syntax to control it

Page 22

What’s this XR1, YR1 and R1 stuff

XYR1 = R2 + R3;; does 2 adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3

You can add the X values and not the Y values with this syntax:

XR1 = R2 + R3;;

and NOT with

XR1 = XR2 + XR3;;

Ugly – but they (ADI) will not change the syntax (DAMY)

Page 23

What’s this XR1, YR1 and R1 stuff

XYR1 = [J0 += 0x1];;

Does a 32-bit fetch and puts the same value into XR1 and YR1. Same as doing XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; at the same time

XYR1 = L[J0 += 0x2];;

Does a 64-bit (dual 32-bit) fetch and is the same as doing XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

Page 24

What’s this XR1, YR1 and R1 stuff

XYR1 = [J0 += 0x1];; means
XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

XYR1 = L[J0 += 0x2];; means
XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

XR1:0 = L[J0 += 0x2];; means
XR0 = [J0 += 1];; AND XR1 = [J0 += 1];;

XYR1:0 = L[J0 += 0x2];; means
XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

Page 25

What’s this XR1, YR1 and R1 stuff

XYR1:0 = L[J0 += 0x2];; means
XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

XR3:0 = Q[J0 += 0x4];; means
XR0 = [J0 += 1];; AND XR1 = [J0 += 1];; AND XR2 = [J0 += 1];; AND XR3 = [J0 += 1];;

XYR3:0 = Q[J0 += 0x4];; means
XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; AND XR2 = [J0 += 0];; AND YR2 = [J0 += 1];; AND XR3 = [J0 += 0];; AND YR3 = [J0 += 1];;
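The fetch forms on these slides can be modelled in a few lines of C++ to make it clear which word lands in which register. This is only a software model of the behaviour as the slides state it: FetchModel and its member names are hypothetical, memory is treated as an array of 32-bit words indexed by J0, and the alignment rules of the real hardware are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the memory fetches described on the slides.
struct FetchModel {
    std::vector<std::int32_t> memory;   // word-addressed memory
    std::size_t j0;                     // J0, used as an index into memory
    std::int32_t xr[4];                 // XR0..XR3
    std::int32_t yr[4];                 // YR0..YR3

    explicit FetchModel(const std::vector<std::int32_t>& mem)
        : memory(mem), j0(0), xr(), yr() {}

    // Read one word and post-increment J0 by 1.
    std::int32_t NextWord() {
        assert(j0 < memory.size());
        return memory[j0++];
    }

    // XR1:0 = L[J0 += 2];; is LoadX(0, 2).   XR3:0 = Q[J0 += 4];; is LoadX(0, 4).
    void LoadX(int first_reg, int count) {
        for (int r = first_reg; r < first_reg + count; ++r) xr[r] = NextWord();
    }

    // XYR1 = [J0 += 1];; is LoadXYBroadcast(1, 1).
    // XYR1:0 = L[J0 += 2];; is LoadXYBroadcast(0, 2).
    // XYR3:0 = Q[J0 += 4];; is LoadXYBroadcast(0, 4).
    // Each fetched word goes into the same register number in both X and Y.
    void LoadXYBroadcast(int first_reg, int count) {
        for (int r = first_reg; r < first_reg + count; ++r) {
            const std::int32_t word = NextWord();
            xr[r] = word;
            yr[r] = word;
        }
    }

    // XYR1 = L[J0 += 2];; is LoadXYSplit(1): first word to XR1, second to YR1.
    void LoadXYSplit(int reg) {
        xr[reg] = NextWord();
        yr[reg] = NextWord();
    }
};
```

With memory holding 10, 20, 30, 40, a LoadXYSplit(1) from J0 = 0 leaves XR1 = 10 and YR1 = 20 and moves J0 on by 2, while LoadXYBroadcast(0, 2) from the same start leaves XR0 = YR0 = 10 and XR1 = YR1 = 20.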

Page 26

Float release code generated by the C++ compiler – identify the new instructions

I see 1 new instruction

Page 27

Difference between integer and float operations

XYR1 = R2 + R3;; does 2 INTEGER adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3

SYNTAX: XR1 = R2 + R3;; and NOT XR1 = XR2 + XR3;;

Use F syntax to make it a float operation

XYFR1 = R2 + R3;; does 2 FLOATING adds: XFR1 = R2 + R3 and YFR1 = R2 + R3
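The point of the F syntax is that the same 32-bit registers are used for both kinds of add; only the interpretation of the bits changes. A minimal C++ model of that idea follows. ComputeBlock, the helper names and the figure of 32 registers per block are assumptions for illustration, and the integer add here simply wraps on overflow, which may not match the hardware's behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Hypothetical model of one compute block: registers are just 32-bit patterns.
struct ComputeBlock {
    std::uint32_t r[32];
};

float BitsToFloat(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

std::uint32_t FloatToBits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Models XR1 = R2 + R3;; on one block: the bits are added as integers.
void IntAdd(ComputeBlock& b, int dest, int src1, int src2) {
    assert(dest >= 0 && dest < 32 && src1 >= 0 && src1 < 32 && src2 >= 0 && src2 < 32);
    b.r[dest] = b.r[src1] + b.r[src2];
}

// Models XFR1 = R2 + R3;; on one block: the same bits are added as floats.
void FloatAdd(ComputeBlock& b, int dest, int src1, int src2) {
    assert(dest >= 0 && dest < 32 && src1 >= 0 && src1 < 32 && src2 >= 0 && src2 < 32);
    b.r[dest] = FloatToBits(BitsToFloat(b.r[src1]) + BitsToFloat(b.r[src2]));
}

// Models XYR1 = R2 + R3;; : the same integer add in both blocks.
void IntAddXY(ComputeBlock& x, ComputeBlock& y, int dest, int src1, int src2) {
    IntAdd(x, dest, src1, src2);
    IntAdd(y, dest, src1, src2);
}

// Models XYFR1 = R2 + R3;; : the same float add in both blocks.
void FloatAddXY(ComputeBlock& x, ComputeBlock& y, int dest, int src1, int src2) {
    FloatAdd(x, dest, src1, src2);
    FloatAdd(y, dest, src1, src2);
}
```

Each block adds its own R2 and R3, so one XY instruction produces two independent results.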

Page 28

Exercise 1 – needed for Lab. 1

FIR filter operation -- data and filter-coefficients are both integer arrays – Write in C++

New_value from Audio A/D, output sent to Audio D/A

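$$
\begin{aligned}
&\text{for } j = 1 \text{ to } N-1:\quad \text{data}[N-j] = \text{data}[N-j-1];\\
&\text{data}[0] = \text{newvalue};\\
&\text{output} = \sum_{j=0}^{N-1} \text{data}[j] * \text{filter\_coeffs}[j];
\end{aligned}
$$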
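As a starting point for the exercise, here is a minimal C++ sketch of one FIR update step: shift the stored samples along by one place, put the new sample at data[0], then sum the products with the coefficients. The function name, the use of std::vector and the direction of the shift are assumptions, and the audio A/D and D/A calls are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One FIR step on integer data. data[0] is the newest sample, data[N-1] the oldest.
// Hypothetical helper for illustration, not the lab's required interface.
int FirFilterStep(std::vector<int>& data, const std::vector<int>& filter_coeffs,
                  int new_value) {
    assert(!data.empty() && data.size() == filter_coeffs.size());
    const std::size_t n = data.size();

    // Age the stored samples: work from the oldest slot down so nothing is overwritten early.
    for (std::size_t j = 1; j < n; ++j) {
        data[n - j] = data[n - j - 1];
    }
    data[0] = new_value;

    // output = sum over j of data[j] * filter_coeffs[j]
    int output = 0;
    for (std::size_t j = 0; j < n; ++j) {
        output += data[j] * filter_coeffs[j];
    }
    return output;
}
```

The function would be called once per incoming sample, with data kept between calls so that it always holds the last N samples.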

Page 29

Exercise – needed for Lab. 1

FIR filter operation -- data and filter-coefficients are both integer arrays -- ASM

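$$
\begin{aligned}
&\text{ReadAudioSource}(\&\text{newvalue});\\
&\text{for } j = 1 \text{ to } N-1:\quad \text{data}[N-j] = \text{data}[N-j-1];\\
&\text{data}[0] = \text{newvalue};\\
&\text{output} = \sum_{j=0}^{N-1} \text{data}[j] * \text{filter\_coeffs}[j];\\
&\text{WriteAudioSource}(\text{output});
\end{aligned}
$$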

Page 30

Insert C++ code – for Lab. 1

Page 31

Insert assembler code version (Lab. 2)

Page 32

What we NOW KNOW EVERYTHING FOR THE FINAL (REALLY -- ALMOST)!

Can we return from an assembly language routine without crashing the processor?

Return a parameter from assembly language routine (Is it same for ints and floats?)

Pass parameters into assembly language (Is it same for ints and floats?)

Do IF THEN ELSE statements

Read and write values to memory

Read and write values in a loop

Do some mathematics on the values fetched from memory

All this stuff was demonstrated by coding HalfWaveRectifyASM( )