Software Performance Tuning Project: Monkey’s Audio

Prepared by: Meni Orenbach, Roman Kaplan

Advisors: Liat Atsmon, Kobi Gottlieb


Page 2: Software Performance  Tuning Project Monkey’s Audio

MAC – Ape File Encoder

• Monkey’s Audio is a lossless audio codec

• It can compress at different levels

• The compressed file can be decompressed back to a WAV file

• It is used to save space while keeping all the original data

• The compressed file is playable

Page 3: Software Performance  Tuning Project Monkey’s Audio

Platform and Benchmark Used

• Platform: Intel Core i7 with 3 GB of RAM, running the Windows Vista operating system.

• Benchmark:

– A 238 MB song

– Original encoding duration: 98.9 s

Page 4: Software Performance  Tuning Project Monkey’s Audio

Algorithm Description

• The input file is read frame by frame

• Every frame contains a constant number of channels

• Channels are encoded with a dependency between them

• Every frame is encoded and immediately written
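The serial flow described above can be sketched as follows. This is a minimal illustration, not the Monkey’s Audio code: `encode_frame` is a hypothetical stand-in built on zlib, and the frame size and the 4-byte length prefix are assumptions made only so that the example round-trips.

```python
import io
import zlib


def encode_frame(frame: bytes) -> bytes:
    """Hypothetical stand-in for the real (lossless) per-frame encoder."""
    return zlib.compress(frame)


def encode_stream(src, dst, frame_size: int) -> int:
    """Read src frame by frame; encode each frame and write it immediately."""
    frames_written = 0
    while True:
        frame = src.read(frame_size)
        if not frame:
            break
        payload = encode_frame(frame)
        dst.write(len(payload).to_bytes(4, "little"))  # assumed framing
        dst.write(payload)
        frames_written += 1
    return frames_written


def decode_stream(src) -> bytes:
    """Inverse of encode_stream, used to check that nothing was lost."""
    out = bytearray()
    while True:
        header = src.read(4)
        if not header:
            break
        size = int.from_bytes(header, "little")
        out += zlib.decompress(src.read(size))
    return bytes(out)
```

Because each frame is written as soon as it is encoded, the output order is fixed by the read order. That is the property the multithreaded version later has to preserve.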

Page 5: Software Performance  Tuning Project Monkey’s Audio

The Encoding Process

[Diagram: the input file is split into Frame 1, Frame 2, …, Frame i, …, Frame n; each frame holds Channel 1, Channel 2, …, Channel 8; each frame becomes Encoded Frame 1, Encoded Frame 2, …, Encoded Frame i, …, Encoded Frame n, which are written to the output file. Three “MultiThread Here!” callouts mark the points considered for multithreading.]

Page 6: Software Performance  Tuning Project Monkey’s Audio

Function Data Flow

[Diagram: Encoding every frame → Encoding the error for every channel → Encode with a predictor. A callout marks the most time-consuming functions.]

Page 7: Software Performance  Tuning Project Monkey’s Audio

Optimization Method

• Dealing with the most time-consuming functions

• Two approaches were taken:

– Multi-threading

– SIMD

Page 8: Software Performance  Tuning Project Monkey’s Audio

Optimization Method 1: Threads

• Monkey’s Audio was managed by a single thread

• The threaded encoder should maintain 1:1 bit compatibility with the original output

• Changing the flow of the program is required

Page 9: Software Performance  Tuning Project Monkey’s Audio

Changing the Program Flow

Originally:

• Each frame is encoded and written immediately

After the change:

• Each frame is encoded and written to a buffer

• The buffer is filled during the encode process

• A buffer is written to the output once all previous frames have been encoded and written
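The rule “write a buffer only once all previous frames have been written” amounts to a reorder buffer. The sketch below is an assumption about how such a rule could look, not the project’s code; the class name `OrderedWriter` and its `submit` method are hypothetical.

```python
import io


class OrderedWriter:
    """Holds encoded frames until every earlier frame has been written."""

    def __init__(self, sink):
        self.sink = sink
        self.pending = {}    # frame index -> encoded bytes not yet written
        self.next_index = 0  # index of the next frame allowed to be written

    def submit(self, index: int, data: bytes) -> None:
        """Accept an encoded frame in any order; flush whatever is now in order."""
        self.pending[index] = data
        while self.next_index in self.pending:
            self.sink.write(self.pending.pop(self.next_index))
            self.next_index += 1
```

For example, submitting frames 2, 0, 1 writes nothing after the first call, frame 0 after the second, and frames 1 and 2 after the third, so the file content matches the serial encoder.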

Page 10: Software Performance  Tuning Project Monkey’s Audio

Our Implementation

We use the following threads:

• Main thread: transfers frame data to the encode threads

• Write thread: writes the encoded buffers to the output file

• Encode threads: each encodes the frame it is given

Note: we use N+2 threads, where N is the number of hardware threads available (N encode threads plus the main and write threads).

Page 11: Software Performance  Tuning Project Monkey’s Audio

Data Structures Used

ThreadParam – a linked list of objects that contain the encoded data

EncodeParam – an object containing the data needed to encode a frame

WriteParam – an object containing the data needed to write to the output

FramePredictor – a global array that signals dependencies between frames

Page 12: Software Performance  Tuning Project Monkey’s Audio

Threads Schema

[Diagram: a linked list of nodes running from Head to Tail. Each node holds: Buffer, Buffer Counter, Frame index, Thread ID, Encode Done (T/F), Mutexes, Next. Encode Thread 1, Encode Thread 2, the Write Thread and the Main Thread operate on the list.]
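A minimal sketch of how the schema could fit together, assuming one list node per frame. `FrameSlot` and `run_pipeline` are hypothetical names; zlib stands in for the real encoder; a `threading.Event` plays the role of the “Encode Done” flag and its mutexes; the “Buffer Counter” field is omitted.

```python
import io
import queue
import threading
import zlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FrameSlot:
    frame_index: int
    buffer: bytes = b""
    thread_id: Optional[int] = None
    encode_done: threading.Event = field(default_factory=threading.Event)
    next: Optional["FrameSlot"] = None


def run_pipeline(frames, sink, n_encoders: int) -> None:
    """Encode frames on n_encoders threads (n_encoders >= 1); write in list order."""
    work = queue.Queue()
    head = tail = None
    # Main thread: build the linked list and hand each frame to the encoders.
    for index, frame in enumerate(frames):
        slot = FrameSlot(index)
        if tail is None:
            head = slot
        else:
            tail.next = slot
        tail = slot
        work.put((slot, frame))

    def encode_worker():
        while True:
            try:
                slot, frame = work.get_nowait()
            except queue.Empty:
                return  # all frames were queued before the threads started
            slot.buffer = zlib.compress(frame)  # stand-in encoder
            slot.thread_id = threading.get_ident()
            slot.encode_done.set()

    def write_worker():
        slot = head
        while slot is not None:
            slot.encode_done.wait()  # block until this frame is encoded
            sink.write(slot.buffer)
            slot = slot.next

    threads = [threading.Thread(target=encode_worker) for _ in range(n_encoders)]
    threads.append(threading.Thread(target=write_worker))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The write thread walks the list from head to tail and waits on each node in turn, so frames may finish encoding in any order while the output order stays the same as in the single-threaded encoder.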

Page 13: Software Performance  Tuning Project Monkey’s Audio

Dependencies Between Frames

Once a frame has finished encoding, there may be leftover data, which can be handled in two ways:

1. Write the leftover data after the encoded frame

2. Re-encode the leftover data with the next frame

We always write the leftover data after the encoded frame

Page 14: Software Performance  Tuning Project Monkey’s Audio

Dealing With Dependencies Between Frames

• Using the write thread to start a new encode thread

• Remove the ‘wrongly encoded’ frame from the list

• Keep encoding the rest normally

• Keep writing to the output file in the right order!
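One way to read the steps above is as speculative encoding: frames are encoded under a guessed carry-in state, and the write stage redoes only the frames whose guess turned out wrong. The sketch below is a toy model under that assumption. The encoder, the carry rule and all names are hypothetical, and the two stages run sequentially here rather than on separate threads.

```python
def encode(frame: bytes, carry_in: int):
    """Toy encoder whose output depends on state carried from the previous frame."""
    data = bytes((b + carry_in) & 0xFF for b in frame)
    carry_out = len(frame) % 2  # toy rule: odd-length frames leave a carry
    return data, carry_out


def encode_serial(frames):
    """Reference: encode in order, threading the carry from frame to frame."""
    out, carry = [], 0
    for frame in frames:
        data, carry = encode(frame, carry)
        out.append(data)
    return out


def encode_speculative(frames, guess: int = 0):
    """Encode every frame under a guessed carry, then fix wrong guesses in order."""
    # Stage 1 (encode threads in the real design): no frame waits for another.
    slots = [encode(frame, guess) for frame in frames]
    # Stage 2 (write thread): walk the frames in order and check each guess.
    out, carry, redone = [], 0, 0
    for i, frame in enumerate(frames):
        if carry != guess:
            slots[i] = encode(frame, carry)  # drop the wrong result, re-encode
            redone += 1
        data, carry = slots[i]
        out.append(data)
    return out, redone
```

In this model the output is identical to the serial encoder and only mis-guessed frames cost extra work, which is in the spirit of the extra instructions that the speedup analysis attributes to dependency handling.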

Page 15: Software Performance  Tuning Project Monkey’s Audio

The Problem

• There is also leftover data between frames

• This dependency is unpredictable

• It is impossible to maintain 1:1 bit compatibility

• We ‘guess’ the best value so we don’t lose data!

Page 16: Software Performance  Tuning Project Monkey’s Audio

Results: VTune Thread Profiler

Page 17: Software Performance  Tuning Project Monkey’s Audio

Results: VTune Thread Checker

Page 18: Software Performance  Tuning Project Monkey’s Audio

Multithreading Conclusion

• Total speedup from using MT: x3.15!

[Chart: MT SpeedUp; bars for Original, 8 cores no dep, 8 cores with dep, and 4 cores; vertical axis from 0 to 4.]

Page 19: Software Performance  Tuning Project Monkey’s Audio

Explaining the Speedup

• Applying Amdahl’s law: there are two serial parts (reading the first frames and encoding the last frame) that take about 8% of our benchmark, so we get:

$$t_{\text{expected}} = \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{(1 - 0.93) + \frac{0.93}{4.8}} \approx 3.8$$

• In addition, to deal with the dependencies our solution added ~20% more instructions, thus we expect:

$$t_{\text{MT, not optimal}} = \frac{t_{\text{expected}}}{1.2} \approx 3.15$$
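The arithmetic can be checked with a few lines. The values P = 0.93 and N = 4.8 are taken from the slide’s formula; the function name is an assumption.

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Amdahl's law: speedup when parallel_fraction of the work runs on n units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


expected = amdahl_speedup(0.93, 4.8)  # about 3.79
adjusted = expected / 1.2             # about 3.16, after ~20% extra instructions
```

The adjusted figure of about 3.16 sits very close to the x3.15 measured with multithreading.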

Page 20: Software Performance  Tuning Project Monkey’s Audio

Optimization Method 2: SIMD

• The original code is written using MMX technology

• It operates only on arrays of 16-bit integers

• We used SSE on two main functions:

– Adapt()

– CalculateDotProduct()

Note: these functions are written entirely in ASM

Page 21: Software Performance  Tuning Project Monkey’s Audio

Adapt() - Improvements

• Add and Sub instructions on arrays of 16-bit integers (supported in MMX)

• Each iteration goes over 32 sequential array elements

• The input and output arrays were aligned to prevent ‘split loads’

Page 22: Software Performance  Tuning Project Monkey’s Audio

Adapt() – Main Loop

Old code:

movq mm0, [eax]
paddw mm0, [ecx]
movq [eax], mm0
movq mm1, [eax + 8]
...
movq mm3, [eax + 24]
paddw mm3, [ecx + 24]
movq [eax + 24], mm3

New code (aligned):

movdqa xmm0, [eax]
movdqa xmm2, [ecx]
paddw xmm0, xmm2
movdqa [eax], xmm0
movdqa xmm1, [eax + 16]
movdqa xmm3, [ecx + 16]
paddw xmm1, xmm3
movdqa [eax + 16], xmm1

Note: there is an equivalent loop with SUB operations

An MMX register is 8 bytes; an SSE register is 16 bytes

16 vs. 12 instructions per iteration
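What the two loops compute can be modelled in plain Python. This is an illustrative model of the packed 16-bit add (`paddw`, which wraps around rather than saturating), not a translation of the project’s assembly; `wrap16` and `packed_add` are hypothetical helper names.

```python
def wrap16(value: int) -> int:
    """Reduce an integer to a signed 16-bit value with wraparound."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def packed_add(dst, src, lanes: int):
    """Add src into dst `lanes` elements at a time, as one paddw would.

    lanes=4 models an 8-byte MMX register, lanes=8 a 16-byte SSE register.
    Returns the result and the number of register-wide additions used.
    """
    out = list(dst)
    register_adds = 0
    for i in range(0, len(out), lanes):
        out[i:i + lanes] = [wrap16(a + b)
                            for a, b in zip(out[i:i + lanes], src[i:i + lanes])]
        register_adds += 1
    return out, register_adds
```

For 32 elements the model needs 8 register-wide additions with 4 lanes and 4 with 8 lanes, and both widths produce the same values, which is why the wider registers can be swapped in without changing the encoder’s output.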

Page 23: Software Performance  Tuning Project Monkey’s Audio

SIMD - CalculateDotProduct()

• Multiply-add over arrays of 16-bit integers.

• Each iteration goes over 32 array elements.

• The speedup is calculated for both functions together.

Page 24: Software Performance  Tuning Project Monkey’s Audio

CalculateDotProduct()

Old code:

movq mm0, [eax]
pmaddwd mm0, [ecx]
paddd mm7, mm0
movq mm1, [eax + 8]
...
movq mm3, [eax + 24]
pmaddwd mm3, [ecx + 24]
paddd mm7, mm3

New code (aligned):

movdqa xmm0, [eax]
movdqa xmm4, [ecx]
pmaddwd xmm0, xmm4
paddd xmm7, xmm0
movdqa xmm1, [eax + 16]
movdqa xmm4, [ecx + 16]
pmaddwd xmm1, xmm4
paddd xmm7, xmm1

• Each iteration multiply-adds 32 array elements

16 vs. 12 instructions per iteration
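The `pmaddwd` / `paddd` pair can be modelled the same way. `pmaddwd` multiplies corresponding signed 16-bit elements and adds each adjacent pair of products into one 32-bit lane, and `paddd` accumulates those lanes. The final horizontal sum of the accumulator is not shown in the listing and is an assumption here, as are the helper names.

```python
def wrap32(value: int) -> int:
    """Reduce an integer to a signed 32-bit value with wraparound."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def pmaddwd(a, b):
    """Multiply 16-bit pairs and add adjacent products into 32-bit lanes."""
    return [wrap32(a[i] * b[i] + a[i + 1] * b[i + 1]) for i in range(0, len(a), 2)]


def dot_product(a, b, lanes16: int = 8) -> int:
    """Dot product using register-wide steps of lanes16 16-bit elements.

    lanes16=4 models MMX (mm7 as accumulator), lanes16=8 models SSE (xmm7).
    len(a) is assumed to be a multiple of lanes16.
    """
    acc = [0] * (lanes16 // 2)  # 32-bit accumulator lanes
    for i in range(0, len(a), lanes16):
        products = pmaddwd(a[i:i + lanes16], b[i:i + lanes16])
        acc = [wrap32(x + y) for x, y in zip(acc, products)]  # paddd
    return wrap32(sum(acc))  # assumed horizontal sum after the loop
```

Both register widths give the same result as a scalar dot product as long as the 32-bit lanes do not overflow; the wider register only halves the number of steps.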

Page 25: Software Performance  Tuning Project Monkey’s Audio

SIMD Speedup Achieved

• Adapt() local speedup: x1.72 (overall speedup: x1.2)

• CalculateDotProduct() local speedup: x1.62 (overall speedup: x1.2)

• Total speedup using SIMD: x1.4!
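As a rough consistency check, Amdahl’s law can be inverted to estimate how much of the run time each function must account for, given its local and overall speedups. This derivation is not in the slides; it assumes the two functions do not overlap and takes the rounded figures at face value.

```python
def implied_fraction(local_speedup: float, overall_speedup: float) -> float:
    """Share of the original run time spent in a function, from Amdahl's law."""
    return (1 - 1 / overall_speedup) / (1 - 1 / local_speedup)


def combined_speedup(parts) -> float:
    """Overall speedup for disjoint (fraction, local_speedup) pairs."""
    untouched = 1 - sum(fraction for fraction, _ in parts)
    return 1 / (untouched + sum(fraction / s for fraction, s in parts))


adapt_fraction = implied_fraction(1.72, 1.2)  # about 0.40
dot_fraction = implied_fraction(1.62, 1.2)    # about 0.44
total = combined_speedup([(adapt_fraction, 1.72), (dot_fraction, 1.62)])  # 1.5
```

Under these assumptions the two functions would cover roughly 83% of the run time and the combined speedup would come out at x1.5. The reported x1.4 is somewhat lower, which the rounding of the per-function figures could account for.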

Page 26: Software Performance  Tuning Project Monkey’s Audio

Intel Tuning Assistant

No micro-architectural problems were found in the optimized code.

Page 27: Software Performance  Tuning Project Monkey’s Audio

Final Results

A total speedup of x4.017 was achieved by using only MT and SIMD

[Chart: Overall SpeedUp; bars for Original, MT, SIMD and ALL; vertical axis from 0 to 4.5.]
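The reported figures can be put side by side with a little arithmetic. All inputs come from the slides; the variable names are assumptions, and the optimized encoding time is derived rather than reported.

```python
original_seconds = 98.9  # original encoding duration of the benchmark
mt_speedup = 3.15        # multithreading alone
simd_speedup = 1.4       # SIMD alone
total_speedup = 4.017    # both together, as reported

# If the two gains multiplied perfectly, the total would be about x4.41.
ideal_product = mt_speedup * simd_speedup

# Encoding time implied by the reported total speedup: about 24.6 s.
optimized_seconds = original_seconds / total_speedup
```

The measured total is a little below the product of the two individual speedups; the slides do not say why.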