Micro-Architectural Techniques to Alleviate Memory-Related Stalls for Transactional and Emerging Workloads
EKIVOLOS: Self-Contained, Accurate Precomputation Prefetching
Islam Atta, Xin Tong, Andreas Moshovos, Viji Srinivasan, Ioana Baldini

Data growth: 4.4 ZB (2013) to 44 ZB (2020)
Source: EMC² Digital Universe Study. Graphic credit: www.editeddaily.com

Prefetching is the traditional remedy
Emerging workloads draw on unconventional data sources such as logs, XML, graphs, and sparse matrices. The data is unstructured or semi-structured, and the workloads are memory-bound.
Hardware prefetchers use the history of accesses and the current state to predict future accesses.
History-based predictions may not be sufficient for non-repetitive, irregular accesses.
Graphic credit: www.editeddaily.com

Precomputation Prefetchers
Prediction-based prefetching: history of accesses and current state, used to predict future accesses.
Precomputation-based prefetching: program slice and current state, used to precompute future accesses.
Delinquent load: a problematic load which accounts for a significant amount of memory stalls.
[Figure: the main thread runs on context 0 and the precomputation slice (P-slice) on context 1; both share the cache (LLC) and memory. Over time, the P-slice prefetches the target load so that the main thread's load hits.]

Yet Another Precomputation Prefetcher?
Past work constructed P-slices manually, at compile time, or from binary traces.
Drawbacks listed: burdensome task, requires source code, dense P-slices, inaccurate P-slices.
P-slices ought to be accurate and fast.
Goal: re-design binary-based implementations to prioritize accuracy.

Conventional Binary-based Methods Over-Simplify P-slices
Correctness: do not modify the state of the main thread.
Fast: aggressively optimize a P-slice.
Simplifications: ignore control flow, ignore memory dependencies.
Consequences: potential inaccuracy, monitor-and-correct mechanisms (abort and restart), variable run-ahead distance.
Applications with intense code divergence or memory dependencies foil monitor-and-correct mechanisms.
[Figure: iterations over time for the main thread and an inaccurate, light P-slice, with abort-and-restart events.]

Paradigm Shift: Accuracy-First P-slice
Memory dependencies: the P-slice uses a local store buffer.
Control flow: merge multiple traces instead of the single dominant trace.
All data dependencies can be maintained, so the P-slice accurately replicates the main thread's execution path.
No monitoring.
Maybe slightly slower, but it can still run ahead and eventually reaches a higher run-ahead distance.
[Figure: iterations over time for the main thread, the inaccurate light P-slice, and the more accurate, denser P-slice.]
EKIVOLOS: Slow and Steady Wins the Race

Sparse Matrices
Application domains: web graphs, circuit simulation, DNA analysis, social networks, graph partitioning, clustering, fluid dynamics.

Example of Hard-to-Predict Accesses
SpVM: Sparse-Vector Sparse-Matrix Multiplication
V[] x M[][] = RV[]
Arrays: V_val, V_idx, M_val, M_idx, M_begin
Outer loop: scan over V_idx[] and find the corresponding row.
Inner loop: scan over M_idx[] and find the corresponding RV[] entry.
Access patterns shown: linear, fragmented linear, and random.
RV accesses are random: history does not entail future.

Binary-based P-Slice Construction
Goal: pre-compute RV addresses.
The CPU executes the program and an instruction trace is collected. The trace on the slide repeats the inner-loop iteration below many times, with the outer-loop path in between.
Inner-loop iteration:
    0x957a  mla r5, r6, ip, r5
    0x957e  str r5, [r2,r4,lsl#2]
    0x9582  ldr r4, [r0, #16]
    0x9584  cmp r4, r7
    0x9586  blt 0x95b4
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x958e  cmp r3, r5
    0x9590  blt 0x9566
    0x9566  ldr r4, [r0, #12]
    0x9568  add r1, #4
    0x956a  cmp r3, r4
    0x956c  bge 0x95d0
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
Outer-loop path (taken when the branch at 0x9590 falls through; re-enters the inner loop at 0x956e):
    0x9592  ldr r3, [sp,#16]
    0x9594  ldr r5, [r3,#8]
    0x9596  ldr r1, [sp,#4]
    0x9598  add r1, #1
    0x959a  str r1, [sp,#4]
    0x959c  cmp r1, r5
    0x959e  bge 0x95c2
    0x95a2  ldr r3, [r1,#4]!
    0x9538  add r7, r3, #1
    0x953a  ldr r3, [fp,r3,lsl#2]
    0x9546  add sl, fp, r7, lsl#2
    0x9554  ldr r1, [r0,#4]
    0x9556  mov r8, r3, lsl#2
    0x955a  ldr r4, [r0,#0]
    0x955c  add r9, r1, r8
    0x9560  mov r1, #0
    0x9562  add r8, r4
    0x9564  b 0x956e
Identify the dominant loop in the trace.
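
For orientation, the SpVM kernel described in the example above can be sketched in Python. This is only an illustration under assumptions: the function name spvm, the CSR-style meaning of M_begin (row r occupies positions M_begin[r] up to M_begin[r+1]), and the sample data are not taken from the talk. The update of rv[col] is the irregular access; in the trace it appears to correspond to the ldr r5, [r2,r4,lsl#2] / mla / str sequence, whose index register r4 is itself loaded from memory.

```python
def spvm(v_val, v_idx, m_val, m_idx, m_begin, num_cols):
    """Multiply a sparse vector by a sparse matrix (assumed CSR-like layout)."""
    rv = [0.0] * num_cols
    for i in range(len(v_idx)):              # outer loop: scan over V_idx[]
        row = v_idx[i]                       # find the corresponding matrix row
        for j in range(m_begin[row], m_begin[row + 1]):  # inner loop: scan M_idx[]
            col = m_idx[j]                   # index value loaded from memory
            rv[col] += v_val[i] * m_val[j]   # RV[col]: address depends on loaded data
    return rv


# Hypothetical 3x4 matrix with rows [0,2,0,1], [0,0,0,0], [3,0,0,4]
# and vector [5,0,1]; values chosen only for illustration.
result = spvm(
    v_val=[5.0, 1.0], v_idx=[0, 2],
    m_val=[2.0, 1.0, 3.0, 4.0], m_idx=[1, 3, 0, 3], m_begin=[0, 2, 2, 4],
    num_cols=4,
)
print(result)  # [3.0, 10.0, 0.0, 9.0]
```

The sequence of col values is dictated by the contents of M_idx[] for whichever rows V_idx[] selects, which is why a history-based predictor has little to work with.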
Apply backward slicing to the dominant loop.
Timeline on the slide: (0) identify the delinquent load, (1) execute on the CPU, (2) collect the instruction trace, (3) identify the dominant loop, (4) apply backward slicing.

SpVM P-Slice: Backward Slicing
Dominant loop body, ending in the delinquent load:
    0x957a  mla r5, r6, ip, r5
    0x957e  str r5, [r2,r4,lsl#2]
    0x9582  ldr r4, [r0, #16]
    0x9584  cmp r4, r7
    0x9586  blt 0x95b4
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x958e  cmp r3, r5
    0x9590  blt 0x9566
    0x9566  ldr r4, [r0, #12]
    0x9568  add r1, #4
    0x956a  cmp r3, r4
    0x956c  bge 0x95d0
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]    (delinquent load)
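
As a rough illustration of backward slicing over the loop body listed above, the Python sketch below starts at the delinquent load and walks backwards, keeping only instructions that write a register the slice still needs. The tuple encoding of each instruction, the function name backward_slice, and the wrap-around handling of the loop back edge are assumptions made for this example, not the authors' tool. Condition flags, control flow, and memory dependencies are deliberately ignored, as in the conventional approach.

```python
# Hypothetical encoding: (address, text, registers written, registers read).
# Compares and branches are modelled as writing no general-purpose register.
LOOP_BODY = [
    (0x957a, "mla r5, r6, ip, r5",    {"r5"}, {"r6", "ip", "r5"}),
    (0x957e, "str r5, [r2,r4,lsl#2]", set(),  {"r5", "r2", "r4"}),
    (0x9582, "ldr r4, [r0, #16]",     {"r4"}, {"r0"}),
    (0x9584, "cmp r4, r7",            set(),  {"r4", "r7"}),
    (0x9586, "blt 0x95b4",            set(),  set()),
    (0x9588, "ldr r5, [sl]",          {"r5"}, {"sl"}),
    (0x958c, "add r3, #1",            {"r3"}, {"r3"}),
    (0x958e, "cmp r3, r5",            set(),  {"r3", "r5"}),
    (0x9590, "blt 0x9566",            set(),  set()),
    (0x9566, "ldr r4, [r0, #12]",     {"r4"}, {"r0"}),
    (0x9568, "add r1, #4",            {"r1"}, {"r1"}),
    (0x956a, "cmp r3, r4",            set(),  {"r3", "r4"}),
    (0x956c, "bge 0x95d0",            set(),  set()),
    (0x956e, "ldr r4, [r9,r1]",       {"r4"}, {"r9", "r1"}),
    (0x9572, "ldr r6, [r8,r1]",       {"r6"}, {"r8", "r1"}),
    (0x9576, "ldr r5, [r2,r4,lsl#2]", {"r5"}, {"r2", "r4"}),  # delinquent load
]


def backward_slice(body, target_index):
    """Return the instructions the target depends on through registers only."""
    n = len(body)
    in_slice = {target_index}
    needed = set(body[target_index][3])      # registers the target reads
    changed = True
    while changed:                           # extra passes catch loop-carried deps
        changed = False
        for step in range(1, n + 1):         # reverse order, wrapping the back edge
            i = (target_index - step) % n
            _, _, writes, reads = body[i]
            if writes & needed:              # this instruction produces a needed value
                if i not in in_slice:
                    in_slice.add(i)
                    changed = True
                needed = (needed - writes) | reads
    return [body[i] for i in sorted(in_slice)]


slice_addresses = [hex(addr) for addr, *_ in backward_slice(LOOP_BODY, len(LOOP_BODY) - 1)]
print(slice_addresses)  # ['0x9568', '0x956e', '0x9576']
```

On this encoding the slice keeps only add r1, #4, ldr r4, [r9,r1], and the delinquent load itself; the store, the compares, the branches, and the loads that feed only the loop bounds all drop out.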
Eliminate control flow:
    0x957a  mla r5, r6, ip, r5
    0x957e  str r5, [r2,r4,lsl#2]
    0x9582  ldr r4, [r0, #16]
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x9566  ldr r4, [r0, #12]
    0x9568  add r1, #4
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
Eliminate stores:
    0x957a  mla r5, r6, ip, r5
    0x9582  ldr r4, [r0, #16]
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x9566  ldr r4, [r0, #12]
    0x9568  add r1, #4
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
Retain only register dependencies:
    0x9568  add r1, #4
    0x956e  ldr r4, [r9,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]

Dominant-Path P-slice
    0x9568  add r1, #4
    0x956e  ldr r4, [r9,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
The slice covers only the inner-most dominant loop, so it fails to pre-compute RV addresses for multiple rows.
[Figure: V[] x M[][] = RV[] with the arrays V_val, V_idx, M_val, M_idx, M_begin.]

EKIVOLOS
Memory dependencies: local store buffer (LSB).
Keep control flow: merge multiple traces.
Maintains all data dependencies and accurately replicates the main thread's execution path.
Outer:
    0x957a  mla r5, r6, ip, r5
    0x957e  str r5, [r2,r4,lsl#2]
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x958e  cmp r3, r5
    0x9590  blt 0x9566
    0x9592  ldr r3, [sp,#16]
    0x9594  ldr r5, [r3,#8]
    0x9596  ldr r1, [sp,#4]
    0x9598  add r1, #1
    0x959a  str r1, [sp,#4]
    0x959c  cmp r1, r5
    0x959e  bge 0x95c2
    0x95a2  ldr r3, [r1,#4]!
    0x9538  add r7, r3, #1
    0x953a  ldr r3, [fp,r3,lsl#2]
    0x9546  add sl, fp, r7, lsl#2
    0x9554  ldr r1, [r0,#4]
    0x9556  mov r8, r3, lsl#2
    0x955a  ldr r4, [r0,#0]
    0x955c  add r9, r1, r8
    0x9560  mov r1, #0
    0x9562  add r8, r4
    0x9564  b 0x956e
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
Inner:
    0x957a  mla r5, r6, ip, r5
    0x957e  str r5, [r2,r4,lsl#2]
    0x9582  ldr r4, [r0, #16]
    0x9584  cmp r4, r7
    0x9586  blt 0x95b4
    0x9588  ldr r5, [sl]
    0x958c  add r3, #1
    0x958e  cmp r3, r5
    0x9590  blt 0x9566
    0x9566  ldr r4, [r0, #12]
    0x9568  add r1, #4
    0x956a  cmp r3, r4
    0x956c  bge 0x95d0
    0x956e  ldr r4, [r9,r1]
    0x9572  ldr r6, [r8,r1]
    0x9576  ldr r5, [r2,r4,lsl#2]
[Figure: prefetch core with L1, L2, and LSB.]
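
The local store buffer idea can be sketched as a small software model. This is a minimal illustration under stated assumptions, not the hardware design from the talk: the class name LocalStoreBuffer, the dictionary standing in for memory, the capacity, and the FIFO eviction policy are all invented for the example. P-slice stores stay in a private buffer and later P-slice loads check that buffer first, so memory dependencies inside the slice are honored while the main thread's state is never modified. In the outer-loop code above, ldr r1, [sp,#4] followed by str r1, [sp,#4] looks like a counter kept in memory, which is the kind of dependency such a buffer would preserve.

```python
class LocalStoreBuffer:
    """Hypothetical model of a P-slice's private store buffer."""

    def __init__(self, shared_memory, capacity=8):
        self.shared = shared_memory   # main-thread state: only ever read here
        self.capacity = capacity      # assumed small, fixed number of entries
        self.entries = {}             # address -> value written by the P-slice

    def store(self, addr, value):
        # Keep the value locally; never write through to shared memory.
        if addr not in self.entries and len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))  # dicts preserve insertion order
            del self.entries[oldest]           # assumed FIFO eviction; loses accuracy
        self.entries[addr] = value

    def load(self, addr):
        # Forward from the buffer if the P-slice wrote this address earlier.
        if addr in self.entries:
            return self.entries[addr]
        return self.shared.get(addr, 0)


memory = {0x1004: 7}                  # hypothetical stack slot holding a counter
lsb = LocalStoreBuffer(memory)
for _ in range(3):
    counter = lsb.load(0x1004)        # like: ldr r1, [sp,#4]
    lsb.store(0x1004, counter + 1)    # like: add r1, #1 ; str r1, [sp,#4]

print(lsb.load(0x1004), memory[0x1004])  # 10 7
```

The P-slice sees its own updated counter (10) while the shared memory still holds the main thread's value (7). Dropping the store instead would leave the slice re-reading 7 on every iteration.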
Simple Algorithm, Much Better Accuracy

Evaluation Methodology
System setup:
    ESESC simulator, ARM ISA
    Main core: out-of-order, 3 GHz
    Prefetch core: in-order, 3 GHz
    Area and energy: McPAT 1.2
Evaluated workloads represent: computational biology, data mining, floating point differential, graph search, hash table joins, image processing, optimization, scheduling, simulation, sorting, sparse matrix multiplication, support vector machines.

Key Results
Configurations compared:
    Dominant-Path Precomputation Prefetcher
    Ekivolos (Control Flow Only)
    Ekivolos (Control Flow and Memory Dependencies)
[Figure: speedup, LLC misses, and energy for these configurations. Annotations on the chart: 10%, 70%, 267% (0-12X), Control Flow, Memory Dependencies.]
Other prefetchers compared:
    SMS: spatial address correlation
    AMPM: pattern matching
    PC/AC: address correlation with PC-localization
    Ekivolos+ASP: adding a simple stream prefetcher

Limitations of Ekivolos
Currently requires offline profiling.
Effectiveness depends on the profiling input.
Targets only delinquent loads.

Future Work Directions
Enhancements: online profiling.
P-core architecture: in-memory or in-cache processing.
Benchmarks: suites suitable for architectural studies, covering big data and diverse memory access patterns.

What We Learned
P-slices need not be aggressively optimized.
Simple algorithm.
Control flow and memory dependencies.
Emerging algorithms not studied before.
Prefetch cores can be simplified.