EKIVOLOS: Self-Contained, Accurate Precomputation Prefetching
Islam Atta, Xin Tong, Andreas Moshovos, Viji Srinivasan, Ioana Baldini

Micro-Architectural Techniques to Alleviate Memory-Related Stalls for Transactional and Emerging Workloads

4.4 ZB (2013) to 44 ZB (2020)
EMC² Digital Universe Study
Graphic Credit: www.editeddaily.com

Prefetching is the traditional remedy

Unconventional Data Sources: LOG, Unstructured & Semi-Structured, Sparse Matrices, Graphs, XML
Graphic Credit: www.editeddaily.com
Memory-Bound
Hardware Prefetchers: History of Accesses + Current State -> Predict Future Accesses
History-based predictions may not be sufficient!
Non-Repetitive, Irregular Accesses

Precomputation-based Prefetching (instead of Prediction-based)
Precomputation Prefetchers: Program Slice + Current State -> Precompute Future Accesses
Diagram labels: Main Thread (Context 0), Precomputation Slice (P-Slice) (Context 1), Shared Cache, Memory, LLC, Prefetch Load, Target Load, Hit, Time
Delinquent Load: a problematic load which accounts for a significant amount of memory stalls.

Yet Another Precomputation Prefetcher?
Past work constructed P-slices: Manually, At Compile Time, Traces from Binary
Burdensome Task; Requires Source Code; Dense P-slices; Inaccurate P-slices
P-slices ought to be Accurate and Fast
Re-design binary-based implementations to prioritize accuracy

Conventional Binary-based methods Over-Simplify P-slices
Correctness: Do not modify the state of the main thread.
Fast: Aggressively optimize a P-slice.
Ignore Control Flow
Ignore Memory Dependencies
Potential Inaccuracy
Monitor & Correct Mechanisms: Abort & Restart
Variable Run-ahead distance
Chart (Iterations vs. Time): Main Thread; Inaccurate, Light P-slice
Applications with intense code divergence or memory dependencies foil Monitor & Correct mechanisms.

Paradigm shift: Accuracy-First P-slice
Memory Dependencies: P-slice uses a local store buffer.
Control Flow: Merge multiple traces, instead of the single dominant trace.
All data dependencies can be maintained.
No monitoring
Accurate: accurately replicate the main thread's execution path.
Maybe slightly slower, but can still run ahead.
Eventually higher Run-ahead distance
Chart (Iterations vs. Time): Main Thread; Inaccurate, Light P-slice; More Accurate, Denser P-slice
EKIVOLOS: Slow and Steady Wins The Race

Sparse Matrices
Web Graphs, Circuit Simulation, DNA Analysis, Social Networks, Graph Partitioning, Clustering, Fluid Dynamics

Example of Hard-to-Predict Accesses
SpVM: Sparse-Vector Sparse-Matrix Multiplication
V[] x M[][] = RV[]
Arrays: V_val, V_idx, M_val, M_idx, M_begin
Access pattern labels: Linear, Fragmented Linear
Outer: Scan over V_idx[], find corresponding Row
Inner: Scan over M_idx[], find corresponding RV[]
RV Accesses: History does not entail Future. Random!

Binary-based P-Slice Construction
Pre-Compute RV Addresses
CPU: Execute
Collect Instruction Trace
Trace excerpt (the slide repeats the inner-loop iteration many times around the outer-loop transition):

  Inner-loop iteration:
  0x9566 ldr r4, [r0, #12]
  0x9568 add r1, #4
  0x956a cmp r3, r4
  0x956c bge 0x95d0
  0x956e ldr r4, [r9,r1]
  0x9572 ldr r6, [r8,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]
  0x957a mla r5, r6, ip, r5
  0x957e str r5, [r2,r4,lsl#2]
  0x9582 ldr r4, [r0, #16]
  0x9584 cmp r4, r7
  0x9586 blt 0x95b4
  0x9588 ldr r5, [sl]
  0x958c add r3, #1
  0x958e cmp r3, r5
  0x9590 blt 0x9566

  Outer-loop transition:
  0x9592 ldr r3, [sp,#16]
  0x9594 ldr r5, [r3,#8]
  0x9596 ldr r1, [sp,#4]
  0x9598 add r1, #1
  0x959a str r1, [sp,#4]
  0x959c cmp r1, r5
  0x959e bge 0x95c2
  0x95a2 ldr r3, [r1,#4]!
  0x9538 add r7, r3, #1
  0x953a ldr r3, [fp,r3,lsl#2]
  0x9546 add sl, fp, r7, lsl#2
  0x9554 ldr r1, [r0,#4]
  0x9556 mov r8, r3, lsl#2
  0x955a ldr r4, [r0,#0]
  0x955c add r9, r1, r8
  0x9560 mov r1, #0
  0x9562 add r8, r4
  0x9564 b 0x956e

Identify Dominant Loop

Apply Backward Slicing
Identify Delinquent Load
Time: steps 1, 2, 3, 4

SpVM P-Slice: Backward Slicing
Dominant loop from the trace:
  0x957a mla r5, r6, ip, r5
  0x957e str r5, [r2,r4,lsl#2]
  0x9582 ldr r4, [r0, #16]
  0x9584 cmp r4, r7
  0x9586 blt 0x95b4
  0x9588 ldr r5, [sl]
  0x958c add r3, #1
  0x958e cmp r3, r5
  0x9590 blt 0x9566
  0x9566 ldr r4, [r0, #12]
  0x9568 add r1, #4
  0x956a cmp r3, r4
  0x956c bge 0x95d0
  0x956e ldr r4, [r9,r1]
  0x9572 ldr r6, [r8,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]    <- Delinquent Load
Eliminate Control Flow:
  0x957a mla r5, r6, ip, r5
  0x957e str r5, [r2,r4,lsl#2]
  0x9582 ldr r4, [r0, #16]
  0x9588 ldr r5, [sl]
  0x958c add r3, #1
  0x9566 ldr r4, [r0, #12]
  0x9568 add r1, #4
  0x956e ldr r4, [r9,r1]
  0x9572 ldr r6, [r8,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]
Eliminate Stores:
  0x957a mla r5, r6, ip, r5
  0x9582 ldr r4, [r0, #16]
  0x9588 ldr r5, [sl]
  0x958c add r3, #1
  0x9566 ldr r4, [r0, #12]
  0x9568 add r1, #4
  0x956e ldr r4, [r9,r1]
  0x9572 ldr r6, [r8,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]
Retain Only Register Dependencies:
  0x9568 add r1, #4
  0x956e ldr r4, [r9,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]

Inner-most Dominant Loop
V[], RV[], V_val, V_idx, M_val, M_idx, M_begin, M[][]
Dominant-Path P-slice:
  0x9568 add r1, #4
  0x956e ldr r4, [r9,r1]
  0x9576 ldr r5, [r2,r4,lsl#2]
Fails to Pre-Compute RV Addresses for Multiple Rows

EKIVOLOS
Local Store Buffer: Memory Dependencies
Keep Control Flow: Merge Multiple Traces
Maintains All Data Dependencies
Accurately Replicates Main Thread's Execution Path
P-slice, Outer:
  0x9592 ldr r3, [sp,#16]
  0x9594 ldr r5, [r3,#8]
  0x9596 ldr r1, [sp,#4]
  0x9598 add r1, #1
  0x959a str r1, [sp,#4]
  0x959c cmp r1, r5
  0x959e bge 0x95c2
  0x95a2 ldr r3, [r1,#4]!
  0x9538 add r7, r3, #1
  0x953a ldr r3, [fp,r3,lsl#2]
  0x9546 add sl, fp, r7, lsl#2
  0x9554 ldr r1, [r0,#4]
  0x9556 mov r8, r3, lsl#2
  0x955a ldr r4, [r0,#0]
  0x955c add r9, r1, r8
  0x9560 mov r1, #0
  0x9562 add r8, r4
  0x9564 b 0x956e
P-slice, Inner: 0x9566 through 0x9590, the dominant loop listed above with its compares, branches, and store retained
Diagram labels: Prefetch Core, L1, L2, LSB
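The local store buffer on the EKIVOLOS slide can be illustrated with a minimal Python sketch. It is not the authors' hardware design: the class name `LocalStoreBuffer`, the capacity of 16 entries, the FIFO eviction, and the stack-slot address `0x7FF0` are all assumptions made for illustration. The idea it shows comes from the slides: the p-slice keeps its stores private, so it never modifies main-thread state, yet its later loads still see the values it stored. The outer-loop sequence `ldr r1, [sp,#4]` / `add r1, #1` / `str r1, [sp,#4]` from the trace serves as the example memory dependency.

```python
class LocalStoreBuffer:
    """Hypothetical store buffer private to the prefetch core (sketch only)."""

    def __init__(self, memory, capacity=16):
        self.memory = memory        # main thread's memory: read, never written
        self.capacity = capacity    # assumed size; the slides give none
        self.entries = {}           # address -> value, oldest entry first

    def store(self, addr, value):
        """Record a p-slice store locally instead of writing shared memory."""
        if addr in self.entries:
            del self.entries[addr]                      # re-insert as newest
        elif len(self.entries) >= self.capacity:
            del self.entries[next(iter(self.entries))]  # assumed FIFO eviction
        self.entries[addr] = value

    def load(self, addr):
        """Return the p-slice's own latest store if buffered, else memory."""
        if addr in self.entries:
            return self.entries[addr]
        return self.memory[addr]


def run_outer_counter(lsb, slot, iterations):
    """Mimic ldr r1,[sp,#4]; add r1,#1; str r1,[sp,#4] from the outer loop."""
    seen = []
    for _ in range(iterations):
        value = lsb.load(slot) + 1
        lsb.store(slot, value)
        seen.append(value)
    return seen


memory = {0x7FF0: 0}             # hypothetical stack slot standing in for [sp,#4]
lsb = LocalStoreBuffer(memory)
counts = run_outer_counter(lsb, 0x7FF0, 3)
print(counts, memory[0x7FF0])    # [1, 2, 3] 0
```

With stores simply eliminated, as in the dominant-path slice, every reload of the slot would return the stale value from memory and the slice could not advance past the first outer iteration on its own.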

Simple Algorithm, Much Better Accuracy

Evaluation Methodology
System Setup
ESESC Simulator, ARM ISA
Main core: Out-of-Order, 3 GHz
Prefetch core: In-Order, 3 GHz
Area & Energy: McPAT 1.2

Evaluated Workloads Represent
Computational Biology, Data Mining, Floating Point Differential, Graph Search, Hash Table Joins, Image Processing, Optimization, Scheduling, Simulation, Sorting, Sparse Matrix Multiplication, Support Vector Machines

Key Results
Configurations: Dominant-Path Precomputation Prefetcher; Ekivolos (Control Flow Only); Ekivolos (Control Flow and Memory Dependencies)
Metrics: Speedup, LLC Misses, Energy
Chart annotations, in slide order: 10%; Control Flow, Memory Dependencies: 70%, 267% (0-12X)
SMS: Spatial Address Correlation
AMPM: Pattern Matching
PC/AC: Address Correlation with PC-Localization
Ekivolos+ASP: Adding Simple Stream Prefetcher

Limitations of Ekivolos
Currently Requires Offline Profiling
Effectiveness depends on Profiling Input
Targets only Delinquent Loads

Future Work Directions
Enhancements: Online Profiling
P-Core Architecture: In-Memory or In-Cache Processing
Benchmarks: Suites suitable for Architectural Studies (Big Data, Diverse Memory Access Patterns)

What We Learned

P-slices Need Not be Aggressively Optimized
Simple Algorithm
Control Flow & Memory Dependencies
Emerging Algorithms not Studied Before
Prefetch-cores can be Simplified
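To make concrete what aggressive optimization discards, the conventional backward-slicing steps from the SpVM P-Slice slide (eliminate control flow, eliminate stores, retain only register dependencies) can be reproduced with a short Python sketch. This is not the authors' tool. The helper names `defs_uses` and `backward_slice` are hypothetical, and the operand parser is a simplification that only understands the mnemonics appearing in this one loop. The instruction listing is the dominant loop as shown on the slide, ending at the delinquent load.

```python
import re

# Register names that appear in the slide's listing.
REG = re.compile(r"\b(r\d+|ip|sl|fp|sp)\b")

# One iteration of the dominant loop, ending at the delinquent load.
LOOP = [
    "0x957a mla r5, r6, ip, r5",
    "0x957e str r5, [r2,r4,lsl#2]",
    "0x9582 ldr r4, [r0, #16]",
    "0x9584 cmp r4, r7",
    "0x9586 blt 0x95b4",
    "0x9588 ldr r5, [sl]",
    "0x958c add r3, #1",
    "0x958e cmp r3, r5",
    "0x9590 blt 0x9566",
    "0x9566 ldr r4, [r0, #12]",
    "0x9568 add r1, #4",
    "0x956a cmp r3, r4",
    "0x956c bge 0x95d0",
    "0x956e ldr r4, [r9,r1]",
    "0x9572 ldr r6, [r8,r1]",
    "0x9576 ldr r5, [r2,r4,lsl#2]",
]


def defs_uses(op, operands):
    """Return (defined register or None, set of registers read)."""
    regs = REG.findall(operands)
    if op in ("str", "cmp") or op.startswith("b"):
        return None, set(regs)           # stores, compares, branches define nothing
    dest, srcs = regs[0], set(regs[1:])
    if op == "add" and operands.count(",") == 1:
        srcs.add(dest)                   # two-operand add: rd = rd + x
    return dest, srcs


def backward_slice(trace):
    """Walk backward from the last instruction, keeping only register producers."""
    parsed = [line.split(None, 2) for line in trace]
    _, needed = defs_uses(parsed[-1][1], parsed[-1][2])
    kept = [trace[-1]]
    for line, (_, op, operands) in zip(reversed(trace[:-1]), reversed(parsed[:-1])):
        dest, srcs = defs_uses(op, operands)
        if dest is not None and dest in needed:
            needed.discard(dest)         # this instruction satisfies the need...
            needed |= srcs               # ...and introduces its own inputs
            kept.append(line)
    return kept[::-1], needed            # slice in program order, live-in registers


slice_lines, live_ins = backward_slice(LOOP)
print(slice_lines)
print(sorted(live_ins))                  # ['r1', 'r2', 'r9']
```

The result is the three-instruction dominant-path slice from the slides (`0x9568`, `0x956e`, `0x9576`). Its live-in registers `r2` and `r9` are set up outside the inner loop, which is consistent with the slide's observation that this slice cannot follow the computation across multiple rows.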