44
1

Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

1

Page 2: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Inversion with Devito: Trading off memoryand compute with PyRevolve

Navjot Kukreja1 (With contributions from a lot of people!)

STMI WorkshopResearch Centre for Gas InnovationUniversidade de Sao PauloOctober 3, 2019

1Department of Earth Science and Engineering, Imperial College London, UK

1

Page 3: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Introduction

Checkpointing

Compression

1

Page 4: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Inverse Problems

The Forward Problem

(Estimated)Parameters

MathematicalModel/Phys-ical Theory

(Predicted)Observations

Observations∇ Mathematical

Model/Phys-ical Theory

(Predicted)Parameters

The Inverse Problem

2

Page 5: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Applications

3

Page 6: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

The Forward Problem

(Estimated)Parameters

MathematicalModel/Phys-ical Theory

(Predicted)Observations

Devito1

1Navjot Kukreja et al. “Devito: Automated fast finite difference computation”. In: 2016 Sixth International Workshop on Domain-SpecificLanguages and High-Level Frameworks for High Performance Computing (WOLFHPC). IEEE. 2016, pp. 11–19.

4

Page 7: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Data flow

Data flow for forward problem:

F (0) F (1) F (2) F (3) F (4) F (5) · · · F (n)

5

Page 8: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Raising the abstraction with Devito

m d2u(x,t)

dt2 −∇2u(x , t) = qs

u(.,0) = 0du(x,t)

dt |t=0 = 0

pde = m * u.dt2 - u.laplace

stencil = Eq(u.forward, solve(pde, u.forward)[0])

fwd_op = Operator([stencil], ...)

void finite_difference_solver(...) {

//...impenetrable "performance optimised" code

}

6

Page 9: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Figure 1: Illustration of the forward problem - simulating the received signal for a given structure

7

Page 10: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Inverse Problems

The Forward Problem

(Estimated)Parameters

MathematicalModel/Phys-ical Theory

(Predicted)Observations

Observations∇ Mathematical

Model/Phys-ical Theory

(Predicted)Parameters

The Inverse Problem

8

Page 11: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Full Waveform Inversion

Figure 2: Illustration of full waveform inversion - initial guess

9

Page 12: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Full Waveform Inversion

Figure 3: Illustration of full waveform inversion - in progress

10

Page 13: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Full Waveform Inversion

Figure 4: Illustration of full waveform inversion - convergence

11

Page 14: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Data flow

Data flow for gradient calculation:

F (0) F (1) F (2) F (3) F (4) F (5) · · · F (n)

R(n)· · ·R(5)R(4)R(3)R(2)R(1)R(0)

12

Page 15: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Adjoint mode - store all timesteps

Figure 5: Progression of the adjoint computation with wall-clock time on the x-axis and simulationtime on the y-axis. Each vertical cross-section represents the status at that time. The dots representcheckpoints stored in memory - here a checkpoint is stored at each time step. Image Source:2

2Qiqi Wang, Parviz Moin, and Gianluca Iaccarino. “Minimal repetition dynamic checkpointing algorithm for unsteady adjoint calculation”. In:SIAM Journal on Scientific Computing 31.4 (2009), pp. 2549–2567.

13

Page 16: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Dealing with high memory requirements

• For a typical 3D problem 3 the grid has 287 × 881 × 881 points in singleprecision, meaning each timestep is 900 MB.

• Such a problem would be run for 2500 timesteps, meaning the totalmemory required for a naive adjoint run would be 2.3 TB.

• One strategy to fit this into existing computer architectures would bedomain decomposition

• Might end up wasting computational power in order to use more memory• Communication overhead might start dominating soon, especially in a cloud

environment

• Another technique, that is domain-specific, is saving the wavefields onlyon the boundaries of the domain and recomputing from there

• Only works for a very small number of cases, i.e. where the equation istime-reversible

• Compression

• Checkpointing3e.g. Overthrust model

14

Page 17: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Introduction

Checkpointing

Compression

15

Page 18: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Adjoint mode - checkpointed

5

10

15

20

25

Tim

e s

tep in

dex

Wall clock time evolution

Figure 6: Progression of the adjoint computation with wall-clock time on the x-axis and thesimulation time on the y-axis. In this case the number of checkpoints is less than the timesteps,hence there is some recomputation involved.

16

Page 19: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Checkpointing - Revolve

For the problem where:

1. Number of steps is known in advance

2. Only one level of memory available

3. Checkpoint sizes are uniform

4. Saving/retrieving a checkpoint takes no time

5. Computational cost of the steps is uniform

6. Cost of restarting operators is zero

the optimal algorithm, Revolve, was given by4. Given a certainnumber of steps and a given amount of memory, Revolve providesthe start-stop-restart schedule that minimises the amount ofrecomputation.

4Andreas Griewank and Andrea Walther. “Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode ofcomputational differentiation”. In: ACM Transactions on Mathematical Software (TOMS) 26.1 (2000), pp. 19–45.

17

Page 20: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Checkpointing - Separation of concerns

Given the different kinds of checkpointing algorithms that apply todifferent kinds of problems and in different scenarios, it makes senseto have a library/tool manage checkpointing for Separation ofConcerns.

PyRevolve5

5Navjot Kukreja et al. “High-level python abstractions for optimal checkpointing in inversion problems”. In: arXiv preprint arXiv:1802.02474(2018).

18

Page 21: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

PyRevolve

Storage

Checkpointer

Forward Operator

Adjoint OperatorScheduler

19

Page 22: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Modelling the performance of Revolve

For a simple forward-adjoint computation with no recomputation, timeto solution would be:

TN(N) = 2 · C · N (1)

where C is the time taken to compute a single timestep of either theforward or the adjoint mode (assume they’re the same for now) and Nis the number of timesteps. Revolve introduces overheads:

• OR the overhead due to the recomputation involved

• OS the overhead due to repeatedly storing/loading checkpoints

Time to solution under Revolve is hence:

TR(N,M) = 2 · C · N + OR(N,M) + OS(N,M) (2)

20

Page 23: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Modelling the performance of Revolve

The storage overhead IS ZERO! (according to6)

Figure 7: Time spent in different actions during a Revolve-based forward-adjoint computation

6Griewank and Walther, “Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computationaldifferentiation”.

21

Page 24: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Keeping the storage overhead in mind

Figure 8: Updated Revolve timings after a modification to remove (some) redundant copies

22

Page 25: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Modelling the performance of Revolve

We can already use this simple performance model to answer somequestions: If we have slow but infinite memory, how manycheckpoints should we store?

104

105

106

107

Memory(MB)

45000

50000

55000

60000

65000

70000

75000

80000Ti

me

to so

lutio

n (s

)

Time taken for Revolve under non-negligible storage

Timesteps: 2526, Size of checkpoint (MB): 1782.065656, Time for compute step (s): 1.11, Bandwidth (MB/s): 200, Platform: Hypothetical

Figure 9: Time to solution when using different amounts of memory in Revolve, but when dealingwith slow memory

23

Page 26: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Introduction

Checkpointing

Compression

24

Page 27: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Memory-compute tradeoffs

• Revolve/Checkpointing is a way to trade off memory and compute. So iscompression.

• Compression has been used to compress the entire forward trajectory inpast work.

• With domain-specific compression algorithms (we live inside a DSL sodo not want to make too many assumptions about our problem)

• But which one is a better choice?

• Checkpointing + Compression can allow you to store more checkpoints,hence do less recomputation. i.e. can we combine the two?

• Is any of this even worth the trouble?

25

Page 28: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Trying Compression

Is F (compression ratio) constant? I tried ZFP7 to find out.

0

500

1000

1500

2000

2500

Simulation time (n)

100

101

102

103

Com

pres

sion

ratio

(x)

Variation of achievable compression ratio as the simulation progresses

Figure 10: Compression ratios achieved when I tried to compress every timestep of a seismic problem setup.

7Peter Lindstrom. “Fixed-rate compressed floating-point arrays”. In: IEEE transactions on visualization and computer graphics 20.12(2014), pp. 2674–2683.

26

Page 29: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Reference wavefield

0 5 10 15 20X (km)

0

2

4Dept

h (k

m)

0.01000.00750.00500.0025

0.00000.00250.00500.00750.0100

Pres

sure

Figure 11: Cross-section of the wavefield used as a reference sample for compression anddecompression. This field was formed after a Ricker wavelet source was placed at the surface ofthe model and the wave propagated for 2500 timesteps. This is a vertical (x-z) cross-section of a3D field, taken at the y source location

27

Page 30: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Lossless Compression

Compressor Chunk size(bytes) Shuffle Mode Setting Compression time(ms) Decompression time(ms) Compression RatioBloscLZ 1048576 SHUFFLE 6 4249.44 1288.86 1.188

LZ4 2965280 SHUFFLE 4 1371.26 920.98 1.199LZ4HC 2097152 SHUFFLE 8 31245.16 926.69 1.265

ZLib 524288 SHUFFLE 7 30218.81 2470.04 1.291ZStd 524288 SHUFFLE 9 117238.76 1477.34 1.312

Table 1: Some results from trying out all possible compressors and settings in blosc. We selectedthe best compression ratio seen for each compressor. ”Setting” here is the choice between speedand compression, where 0 is fastest and 9 is highest compression.

28

Page 31: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Tolerance settings within ZFP

1014

1012

1010

108

106

104

102

100

Tolerance

100

101

102

103

Com

pres

sion

Fact

or

Compression ratios for varying tolerance

Figure 12: Compression ratios achieved on compressing the wavefield. We define compressionratio as the ratio between the size of the uncompressed data and the compressed data. Thedashed line represents no compression. The highlighted point corresponds to the setting used forthe other results here unless otherwise specified. 29

Page 32: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Errors introduced during compression-decompression

0 5 10 15 20X (km)

0

2

4

6Dep

th (k

m)

0.000200.000150.000100.00005

0.000000.000050.000100.000150.00020

Abso

lute

erro

r

Figure 13: Cross-section of the field that shows errors introduced during compression anddecompression using the fixed-tolerance mode. It is interesting to note that the errors are more orless evenly distributed across the domain with only slight variations corresponding to the waveamplitude. A small block-like structure characteristic of ZFP can be seen.

30

Page 33: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Checkpoint error impact

Figure 14: L2 norm of gradient error for different settings of lossy compression

31

Page 34: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Compression-Revolve performance model

The performance model can now be extended to includecompression. With compression, the storage overhead goes up:

OSC(N,M) = W(N,M · F ) ·(

2 · SF · B

+ tc

)+ N ·

(2 · SF · B

+ td

)(3)

where F is the compression ratio (i.e. the ratio between theuncompressed and compressed checkpoint), and tc and td arecompression and decompression times, respectively. At the sametime, the recomputation overhead decreases because F times morecheckpoints are now available.

32

Page 35: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Understanding the model

104

105

106

Memory (MB)

1

2

3

4

5

6

Spee

dup

(x)

2250

748.

92

5489

6.32

Speedup for varying peak memoryOrig0.250.832.08.0

Timesteps: 2526, Size of checkpoint (MB): 891.032828, Time for compute step (s): 1.11, Bandwidth (MB/s): 8139.2, Compression Factor: 41, Compression Time (s): 0.36, Decompression time (s): 1.6, Theoretical decompression time (s): 0.396, Platform: Skylake

Figure 15: The speedups predicted by the performance model for varying memory. The baseline(1.0) is the performance of a Revolve-only implementation under the same conditions. The differentcurves represent kernels with differing compute times (represented here as a factor of the sum ofcompression and decompression times). 33

Page 36: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Future work

• Compression

• Optimal scheduling under compression• Also include SZ in the study• Study acceptable error tolerances (application dependent)• Extend to multi-level checkpointing

• Complex data dependencies (e.g. higher order in time, subsampling)

• Cost of restarting operators

34

Page 37: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Thank you

Thank you 8

Questions?

8This work was funded by the Intel Parallel Computing Centre at Imperial College London and EPSRC EP/R029423/1

35

Page 38: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Problem setup

• Grid size: 881 × 881 × 287

• SEG Overthrust model

• Ricker source placed in the x-y centre of the domain, just below thesurface

• 2500 timesteps

36

Page 39: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Scheduling strategy

One of the assumptions of Revolve is that all checkpoints are thesame size. This is clearly not true under (lossy) compression. Hencethis breaks the optimality of Revolve. One possible suggestion toimprove this schedule is:

• Use a(n) (underestimating) heuristic for compression ratio (F) for acheckpoint at a given timestep

• Report this pessimistic M to Revolve, one checkpoint at a time, to getthe location of the next checkpoint.

• When storing checkpoint, track how much memory left after actualcompression to calculate a new M for Revolve.

37

Page 40: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Understanding the model

103

102

101

100

101

102

Compute time (s)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Spee

dup

(x)

Speedup for varying compute time per timestepPeak memory: 4000Peak memory: 8000Peak memory: 16000Peak memory: 24000

Timesteps: 2526, Size of checkpoint (MB): 891.032828, Peak memory (MB): 4000, Bandwidth (MB/s): 8139.2, Compression Factor: 41, Compression Time (s): 0.36, Decompression time (s): 1.6, Platform: Skylake

Figure 16: The speedups predicted by the performance model for varying compute cost. The baseline (1.0) is the performance of aRevolve-only implementation under the same conditions. The benefits of compression drop rapidly if the computational cost of the kernelthat generated the data is much lower than the cost of compressing the data. For increasing computational costs, the benefits are bounded.

38

Page 41: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Understanding the model

28 29

210 211 212 213 214

Timesteps

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Spee

dup

(x)

Speedup for varying number of timesteps

Size of checkpoint (MB): 891.032828, Time for compute step (s): 1.11, Bandwidth (MB/s): 8139.2, Peak memory (MB): 4000, Compression Factor: 41, Compression Time (s): 0.36, Decompression time (s): 1.6, Platform: Skylake

Figure 17: The speedups predicted by the performance model for varying number of timesteps tobe reversed. The baseline (1.0) is the performance of a Revolve-only implementation under thesame conditions. It can be seen that compression becomes more beneficial as the number oftimesteps is increased.

39

Page 42: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Checkpointing - Multistage

For the problem where:

1. Number of steps is known in advance

2. Only one level of memory available

3. Checkpoint sizes are uniform

4. Saving/retrieving a checkpoint (to first level memory) takes no time(zero-cost checkpointing)

5. Computational cost of the steps is uniform

6. Cost of restarting operators is zero

the optimal algorithm was given by9.

9Guillaume Aupy et al. “Optimal multistage algorithm for adjoint computation”. In: SIAM Journal on Scientific Computing 38.3 (2016),pp. C232–C255.

40

Page 43: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Checkpointing - Online Multistage

For the problem where:

1. Number of steps is known in advance

2. Only one level of memory available

3. Checkpoint sizes are uniform

4. Saving/retrieving a checkpoint (to first level memory) takes no time(zero-cost checkpointing)

5. Computational cost of the steps is uniform

6. Cost of restarting operators is zero

the optimal algorithm was given by10 and11.

10Michel Schanen et al. “Asynchronous two-level checkpointing scheme for large-scale adjoints in the spectral-element solver Nek5000”. In:Procedia Computer Science 80 (2016), pp. 1147–1158.11Guillaume Aupy and Julien Herrmann. “Periodicity in optimal hierarchical checkpointing schemes for adjoint computations”. In:Optimization Methods and Software 32.3 (2017), pp. 594–624.

41

Page 44: Inversion with Devito: Trading off memory and …...Inversion with Devito: Trading off memory and compute with PyRevolve Navjot Kukreja1 (With contributions from a lot of people!)

Understanding the model

• The first vertical line at 55GB marks the spot where the compressedwavefield can completely fit in memory and Revolve is unnecessary ifusing compression.

• The second vertical line at 2.2 TB marks the spot where the entireuncompressed wavefield can fit in memory and neither Revolve norcompression is necessary.

• The region to the right is where these optimisations are not necessaryor relevant.

• The middle region has been the subject of past studies usingcompression in adjoint problems.

• The region to the left is the novelty here.

42