25
1 / 25 FOSDEM 2018 | Michael Meeks Michael Meeks General Manager at Collabora Productivity [email protected] Skype - mmeeks, G+ - [email protected] Calc: The challenges of scalable arithmetic How threading can be challenging “Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and you will find rest for your souls...” - Jeremiah 6:16 www.collaboraoffice.com

Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

1 / 25 FOSDEM 2018 | Michael Meeks

Michael MeeksGeneral Manager at Collabora Productivity

[email protected] - mmeeks,

G+ - [email protected]

Calc: The challenges ofscalable arithmetic

How threading can be challenging

“Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and

you will find rest for your souls...” - Jeremiah 6:16

www.collaboraoffice.com

Page 2: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

2 2 / 25 FOSDEM 2018 | Michael Meeks

Calc threading - Overview

● LibreOffice 6.0 Calc

● Existing structure & parallelism

● Why thread ?

● The initial solution & problems● mis-factored code● dependency issues

● The group calculation piece

● Profiling & optimizing

● Future work & expansion …

Disclaimer & Thanks:Almost all of this

work was done by Tor Lillqvist & Dennis Francis – who can’t be here today.

Some great code reading & improvement.

Disclaimer & Thanks:Almost all of this

work was done by Tor Lillqvist & Dennis Francis – who can’t be here today.

Some great code reading & improvement.

Page 3: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

3 3 / 25 FOSDEM 2018 | Michael Meeks

LibreOffice 6.0 Calc ...

● A 30+ year old code-base

● Primary Data structures hugely improved recently● Still some scope for improvement:

FormulaGroup vs. FormulaCell, per-cell dependency records etc.

● Calculation Engine in need of love● Some insights into how it works● Some problems wrt. threading.

Page 4: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

4 / 25 FOSDEM 2018 | Michael Meeks

Core structures since 4.3(mdds::multi_type_vector)

ScDocument

ScTable

svl::SharedString block

double block

EditTextObject block

ScFormulaCell block

ScColumn

Broadcasters

Text widthsScript types

Cell values

Cell notes

This bit:This bit:

Page 5: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

5 / 25 FOSDEM 2018 | Michael Meeks

FormulaCellGroups

ScFormulaCell

ScTokenArray

ScFormulaCell

ScFormulaCell

ScFormulaCell

ScFormulaCell

ScFormulaCell

ScFormulaCell

ScFormulaCellGroup

… Tokens

… RPN

Sample Token types (StackVar) ● svSingleRef → A1● svDoubleRef → A1:C3● svExternalSingleRef etc.● svDouble → 42.0● svString → “hello world”● svByte → ocDiv, ocMacro ...

Sample Token types (StackVar) ● svSingleRef → A1● svDoubleRef → A1:C3● svExternalSingleRef etc.● svDouble → 42.0● svString → “hello world”● svByte → ocDiv, ocMacro ...

Page 6: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

6 / 25 FOSDEM 2018 | Michael Meeks

Normal Formula interpreting

double ScFormulaCell::GetValue(){ MaybeInterpret(); return GetRawValue();}

void ScFormulaCell::Interpret(){ … amazing recursion flattening … InterpretTail() // ie. ... { … new ScInterpreter( this, pDocument, rContext, aPos, *pCode /* those tokens */);

->Interpret()

StackVar ScInterpreter::Interpret(){ … execute reverse-polish stack … … execute functions … … get cell values from references …

Recursion++

Page 7: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

7 / 25 FOSDEM 2018 | Michael Meeks

InterpretFormulaGroup

ScTokenArray

ScFormulaCellGroup

… Tokens

… RPN

1

2

2

1

7

6

9

6

5

2

3

4

getValuesCollected to Matrix

Interpret:OpenCLSoftware

Examine for safe casesExamine for safe cases

Even non-threaded software case: fasterShares function input collection work.Aggregated / linearized doubles / strings in the matrix

Page 8: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

Why Thread ?

Page 9: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

9 / 25 FOSDEM 2018 | Michael Meeks

CPUs get wider not faster

● Sometimes CPUs get slower …

● Process clocks stymied at 3-4 GHz● IPC improvements ~stalled

● Real IPC wins:● Laptops minimum 4 threads→

– Mid-range 8 threads.→

● PC / Workstation– 8 16 threads: the new normal.→

● Affordable too ...

● Many thanks to AMD for sponsoring this work.

Page 10: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

10 / 25 FOSDEM 2018 | Michael Meeks

2017 Crash reporting stats

● Frustratingly ‘cores’ not threads.

2017

-01-

01

2017

-02-

01

2017

-03-

01

2017

-04-

01

2017

-05-

01

2017

-06-

01

2017

-07-

01

2017

-08-

01

2017

-09-

01

2017

-10-

01

2017

-11-0

1

2017

-12-

01

2018

-01-

010.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

Crash report % by CPU core count over time.

48

36

32

24

16

12

10

8

6

4

2

1

0

200

400

600

800

1000

1200

1400

1600

1800

2000

Reports from large core count machines.

48

40

36

32

24

16

12

10

Page 11: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

Initial Solution ...

Page 12: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

12 / 25 FOSDEM 2018 | Michael Meeks

Thread InterpretFormulaGroup

● Attempt re-use of existing formula core● Try to avoid special / sub-setting code-paths

for existing formula-group conversion: a more generic solution.

● Concept:● Pre-calculate dependent cells to control

recursion outside of threads.● Protect invariants with assertions● Black-list problematic functions ...● Parallelise using existing interpreter.

Page 13: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

13 / 25 FOSDEM 2018 | Michael Meeks

Parallelize existing interpreter

double ScFormulaCell::GetValue(){ MaybeInterpret(); return GetRawValue();}

void ScFormulaCell::Interpret(){ … amazing recursion flattening … InterpretTail() // ie. ... { … new ScInterpreter( this, pDocument, rContext, aPos, *pCode /* those tokens */);

->Interpret()StackVar ScInterpreter::Interpret(){ … execute reverse-polish stack … … execute functions … … get cell values from references …

Pre-fetch all dependent values – and lock-that down:void ScFormulaCell::MaybeInterpret() ...assert(!pDocument->mbThreadedGroupCalcInProgress);

Pre-calculated →No recursion

Page 14: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

14 / 25 FOSDEM 2018 | Michael Meeks

ScInterpreter: calcs formulae

ScDocumentScTable

ScFormulaCell block

Broadcasters

ScBroadcastAreaSlotMachine

ScColumn

DependenciesDependenciesScInterpreter

ScTokenArray

ScFormulaCellGroup

… Tokens

… RPN

Mutates: INDEX, OFFSET etc.

CloudWeb fn’s

MacrosExt’ns

Mutates!

VlookupCache

Number format, Link mgmt etc.

Page 15: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

15 / 25 FOSDEM 2018 | Michael Meeks

ScInterpreter: some fixes

● Basic iteration - broken:● class FormulaTokenArray

– sal_uInt16 nIndex; // Current step index

– FormulaToken* FirstRPN() { nIndex = 0;

return NextRPN(); }

● Now has an external iterator

– a man-week+ to un-wind this, and debug the last pieces that relied on this.

● Added mutation guards:● ScMutationGuard aGuard(this, ScMutationGuardFlags::CORE);

– In all likely-looking places: where core state is changed.

Page 16: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

16 / 25 FOSDEM 2018 | Michael Meeks

Disabling nasties:

● Dependency graph manipulation● During calculation:

– Indirect, Offset, Match, Cell, ocTableOp

● Other stuff● Macros – disabled for now.

– Could detect ‘pure’ ie. non-mutating functions

– Also parallelize the basic/ interpreter (?)

● Info grab-bag of bits.→

● ocExternal UNO extensions:→

– currently in: but can do ~un-controlled mutation (?)

Page 17: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

17 / 25 FOSDEM 2018 | Michael Meeks

More nasties ...

● Several global variables● No-where obvious to hang them

● Now some thread_local variables

– Calculation stack

– Current-document being calculated

– Matrix positions – nC,nR

● Somewhat horrific: fix obsolete Mac toolchain.

● ScInterpreterContext● Added – passed through all functions.

– Impacts eg. ‘GetValue’ though ...

Page 18: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

18 / 25 FOSDEM 2018 | Michael Meeks

single1 2 4 8 160.00

1.00

2.00

3.00

4.00

5.00

6.00

7.00

8.00

9.00

re-calculating 100k formulae on 1m doubles

Meeks/LinuxRyzen/Win10

Thread count

Se

con

ds

to c

alc

ula

te

How did that look: initially ...

● Faster● Getting some nice

speedups – ignoring the hyper-threaded-ness:

● 8.5s 2.5 with 4 →threads 3.4x→

● 4.7 0.86 - ~5.5x →with 8 threads

Page 19: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

19 / 25 FOSDEM 2018 | Michael Meeks

Up to this point:

● Plain Old calculation – single threaded (POC)

● Group calculation

A) Single Threaded Software Group calc (STSG)

B) OpenCL: GPU parallelism after conversion

C) New threaded calculation (NTC)

● Then: C) slower than A) in some cases …– Collecting data from sheets, branching, type handling, etc. again

and again for each formulacell …● Expensive – threading doesn’t help.

– A) collects once – and has some SSE2 goodness …

● So add a ‘threaded A)’ - simple & better …→

● Weighting decision: POC vs. ... based on complexity.

Page 20: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

20 / 25 FOSDEM 2018 | Michael Meeks

Improving performance ...

● Why don’t we get a 8x for 8 threads ?● Terrible profiling tools on Windows.● Linux – used ‘perf’ looking for threading

issues:– sudo perf record --call-graph dwarf \ --switch-events -c 1 # etc.

● Looking for false-sharing

– And other horrors.

Page 21: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

21 / 25 FOSDEM 2018 | Michael Meeks

Horror: rampant heap thrash

● RPN calculation – stack based:● Tons of stack operations: pushing values etc.● Do memory allocation & frees.

– Using the ancient / internal allocator – never intended for heavy parallel use.

→ drop the custom allocator hugely faster→

→ Re-use tokens where possible too.

● std::stack deque lists …→ →● Horrible: std::vector instead far better.→

● Re-using ScInterpreterContext ...

Page 22: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

22 / 25 FOSDEM 2018 | Michael Meeks

Other issues ...

● Where ‘GetDouble’ meets SfxItemSet ...● fixed SvNumberFormatter thread safety.

Page 23: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

23 / 25 FOSDEM 2018 | Michael Meeks

Threading & optimizing story:

Row 1 Row 2 Row 3 Row 40

2

4

6

8

10

12

Column 1

Column 2

Column 3

Baseline from recent master

Group Interpreter work by Tor

Thread Software Interpreter

Avoid TokenArray thrash

halve the number of threads if HT is active

disable custom allocator

use a cache for FormulaDoubleToken allocation

make token cache thread local

up the token cache size to 16

halve threadcount if HT active for group interpreter too

Use C++ threads0

200

400

600

800

1000

1200

1400

1600

1800

Benchmarking some of our sample sheets ...

Note this perf. Regression from threading for some workloads came from avoiding the SoftwareGroupInterpreter

Page 24: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

24 / 25 FOSDEM 2018 | Michael Meeks

Future work

● Stop the Crash-testing from asserting ...● Implicit intersection: killing us (again)

– Move RPN to have precise ranges

● Extend threaded unit tests further …

● Move more global variables to ScInterpreterContext

● Make FormulaCell a 1x item group● Make POC calcalation a forced-single-threaded calc

– Always thread SoftwareGroup Intepreter

● De-bong the format-typeuse● =J20 – should not change format type if J20

changes format.– A sheet-creation-time optimization …

– Intersects with ‘units’ work too.

Page 25: Calc: The challenges of scalable arithmetic How threading can be ... · single1 2 4 8 16 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 re-calculating 100k formulae on 1m doubles

25 25 / 25 FOSDEM 2018 | Michael Meeks

Conclusions

● Calculation can be threaded

● Significant speedups are possible

● Profiling & optimizing works● “it is slow” == “not enough invested yet”

– All problems are just economics

● Many thanks to AMD for their support.

Oh, that my words were recorded, that they were written on a scroll, that they were inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer lives, and that in the end he will stand upon the earth. And though this body has been destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not another. How my heart yearns within me. - Job 19: 23-27