5th International Conference, HiPEAC 2010

Memory-aware application mapping on coarse-grained reconfigurable arrays
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea
* Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
2010-01-25

Coarse-Grained Reconfigurable Array (CGRA)
- High computation throughput
- Low power consumption and scalability
- High flexibility with fast configuration

  Category     Processor        MIPS   mW     MIPS/mW
  Embedded     Xscale           1250   1600   0.78
  DSP          TI TM320C6455    9573   3300   2.9
  DSP (VLIW)   TI TM320C614T    4711   670    7
  * CGRA shows 10~100 MIPS/mW

Coarse-Grained Reconfigurable Array (CGRA)
- An array of PEs connected by a mesh-like network
- Each PE operates on the results of its neighbor PEs
- Executes computation-intensive kernels

Application mapping in CGRA
- Mapping a DFG onto the PE array (the mapping space)
- The mapping must satisfy several conditions (a minimal check is sketched below):
  - Nodes must be mapped onto PEs that have the right functionality
  - Data transfer between nodes must be guaranteed
  - Resource consumption should be minimized for performance
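To make these conditions concrete, here is a minimal sketch of such a validity check. The PE model, the single-cycle placement, and the `adjacent` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the mapping conditions above; the data structures
# are illustrative, not the actual compiler's.
from dataclasses import dataclass, field

@dataclass
class PE:
    row: int
    col: int
    ops: set = field(default_factory=lambda: {"add", "mul"})  # supported opcodes

def adjacent(a: PE, b: PE) -> bool:
    """Mesh connectivity: a PE can feed its 4-neighbors (or itself via its register)."""
    return abs(a.row - b.row) + abs(a.col - b.col) <= 1

def is_valid_placement(dfg_nodes, dfg_edges, placement):
    """placement maps DFG node -> PE (all nodes in the same cycle, for simplicity).

    1) every node sits on a PE with the right functionality
    2) every data edge can be routed between neighboring PEs
    """
    for node, op in dfg_nodes.items():
        if op not in placement[node].ops:
            return False
    return all(adjacent(placement[src], placement[dst]) for src, dst in dfg_edges)

# Tiny usage example with two hypothetical PEs and a two-node DFG:
pe00, pe01 = PE(0, 0, {"load", "add"}), PE(0, 1, {"add", "mul"})
print(is_valid_placement({"n0": "load", "n1": "add"}, [("n0", "n1")],
                         {"n0": pe00, "n1": pe01}))   # True
```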

CGRA execution & data mapping
- tc: computation time, td: data transfer time
- The PE array is driven by the configuration memory and reads its data from the local memory; a DMA engine streams data between main memory and the local memory banks (Bk1..Bk4), each of which is double-buffered (buf1/buf2)
- With double buffering, computation and data transfer overlap, so total runtime = max(tc, td)

The performance bottleneck: data transfer
- Many multimedia kernels show a larger td than tc
- On average, tc is just 22% of the total (100% = tc + td)
- Most applications are memory-bound
[Figure: the ratio between tc and td across kernels]

Computation mapping & data mapping
- Duplicating an array across banks increases the data transfer time
- Example: LD S[i] and LD S[i+1] can share a single copy of S when both land in the same bank; placing them on PEs tied to different banks forces S to be duplicated (a toy model of the effect follows below)
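The effect can be put into a toy model. This is my own simplification, not from the slides: td is assumed to scale with the number of array copies that must be DMA'd into local memory per tile, and the byte counts and DMA bandwidth are made-up parameters.

```python
# Illustrative model of why duplicating an array across banks inflates td:
# every extra copy is one more DMA transfer of the same tile of data.
def transfer_time(arrays_per_bank, tile_bytes, dma_bytes_per_cycle=4):
    """arrays_per_bank: one set of array names per bank.

    td is driven by the total bytes DMA'd into local memory per tile;
    an array mapped into two banks is transferred twice.
    """
    total_copies = sum(len(bank) for bank in arrays_per_bank)
    return total_copies * tile_bytes // dma_bytes_per_cycle

# S reused by LD S[i] and LD S[i+1]:
same_bank   = [{"S", "B"}, set()]     # one copy of S  -> smaller td
split_banks = [{"S"}, {"S", "B"}]     # S duplicated   -> larger td
assert transfer_time(same_bank, 256) < transfer_time(split_banks, 256)
```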

Contributions of this work
- First approach to consider computation mapping and data mapping together:
  - Balance tc and td
  - Minimize duplicate arrays (maximize data reuse)
  - Balance bank utilization
- A simple yet effective extension: a set of cost functions that can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)

Application mapping flow
- DFG -> performance bottleneck analysis (produces the DCR) and data reuse analysis (produces the DRG) -> memory-aware modulo scheduling -> mapping

Preprocessing 1: Performance bottleneck analysis
- Determines whether it is computation or data transfer that limits the overall performance
- Computes the DCR (data-transfer-to-computation time ratio): DCR = td / tc
- DCR > 1 means the loop is memory-bound (see the sketch below)
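As a rough illustration, the DCR could be estimated per tile as below. The II-based tc and bandwidth-based td are assumptions of this sketch; the slides do not spell out how tc and td are measured.

```python
# Rough sketch of the bottleneck analysis (illustrative models only):
# tc from the software-pipelined schedule, td from the DMA traffic of one tile.
def estimate_dcr(iters_per_tile, ii_cycles, bytes_per_tile, dma_bytes_per_cycle):
    tc = iters_per_tile * ii_cycles             # computation time of one tile
    td = bytes_per_tile / dma_bytes_per_cycle   # data transfer time of one tile
    return td / tc

dcr = estimate_dcr(iters_per_tile=64, ii_cycles=4,
                   bytes_per_tile=4096, dma_bytes_per_cycle=4)
print(dcr, "memory-bound" if dcr > 1 else "compute-bound")   # 4.0 memory-bound
```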

Preprocessing 2: Data reuse analysis
- Finds the amount of potential data reuse
- Creates a DRG (Data Reuse Graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
- An edge weight is estimated as TS - rd, where TS is the tile size and rd is the reuse distance in iterations
[Figure: example DRG over S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]

Application mapping flow
- The DCR and the DRG are used for the cost calculation inside memory-aware modulo scheduling

Mapping with the data reuse opportunity cost (DROC)
- New total cost = memory-unaware cost + DROC, so placements that keep an array and its reuse partners in the same bank are preferred
[Figure: two placements of LD A[i], LD A[i+1], LD B[i] across Bank1/Bank2, comparing the memory-unaware cost, the DROC, and the new total cost]

BBC (Bank Balancing Cost)
- Prevents allocating all data to just one bank
- BBC(b) = B * A(b), where B is the base balancing cost (a design parameter) and A(b) is the number of arrays already mapped onto bank b
- A sketch of how DROC and BBC might compose into the placement cost follows below
[Figure: BBC example for a candidate array placement]
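The following sketch shows one way these terms might compose into the placement cost used during memory-aware modulo scheduling. `base_cost` stands in for whatever memory-unaware cost the underlying scheduler (e.g., EMS) already computes; the DROC/BBC details and the DCR gating are an illustration of the slide's definitions, not the authors' code.

```python
# Illustrative composition of the cost terms (not the authors' implementation).
BASE_BALANCING_COST = 10   # "B", the design parameter in BBC(b) = B * A(b)

def bbc(bank, arrays_on_bank):
    """Bank Balancing Cost: proportional to the arrays already mapped onto the bank."""
    return BASE_BALANCING_COST * len(arrays_on_bank.get(bank, set()))

def droc(mem_op, bank, drg, placed_bank):
    """Data Reuse Opportunity Cost: charge the DRG edge weight (~ TS - rd) for every
    reuse partner already placed in a *different* bank, i.e. for reuse that is lost."""
    return sum(w for partner, w in drg.get(mem_op, [])
               if placed_bank.get(partner) not in (None, bank))

def placement_cost(base_cost, mem_op, bank, drg, placed_bank, arrays_on_bank, dcr):
    # Applying the memory-aware terms only when DCR > 1 (memory-bound) is one
    # plausible use of the DCR computed during preprocessing.
    if dcr <= 1.0:
        return base_cost
    return base_cost + droc(mem_op, bank, drg, placed_bank) + bbc(bank, arrays_on_bank)

# Toy example: placing LD A[i+1] on bank 1 while A[i] already sits in bank 0.
drg = {"A[i+1]": [("A[i]", 20)]}
print(placement_cost(40, "A[i+1]", 1, drg,
                     placed_bank={"A[i]": 0},
                     arrays_on_bank={0: {"A"}, 1: set()},
                     dcr=2.0))   # 40 + 20 + 0 = 60
```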

Application mapping flow (extended)
- DFG -> performance bottleneck analysis (DCR) and data reuse analysis (DRG) -> memory-aware modulo scheduling -> mapping -> partial shutdown exploration

Partial Shutdown Exploration
- For a memory-bound loop, the performance is often limited by the memory bandwidth rather than by computation, so computation resources are surplus
- Explores partially shutting down PE rows and memory banks
- Finds the configuration that gives the minimum EDP (Energy-Delay Product); a sketch of the exploration loop follows below
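A sketch of the exploration loop is given below. The runtime and energy models are placeholders standing in for whatever models the tool actually uses; only the enumerate-and-pick-minimum-EDP structure comes from the slides.

```python
# Illustrative partial shutdown exploration (placeholder runtime/energy models).
from itertools import product

def explore(max_rows, max_banks, runtime_model, energy_model):
    best = None
    for rows, banks in product(range(1, max_rows + 1), range(1, max_banks + 1)):
        r = runtime_model(rows, banks)      # fewer rows -> larger tc; fewer banks -> larger td
        e = energy_model(rows, banks, r)    # only active rows/banks consume power
        candidate = (r * e, rows, banks, r, e)   # minimize EDP = runtime * energy
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best

# Hypothetical models, just to make the sketch runnable:
runtime = lambda rows, banks: max(720 // rows, 576 // banks)
energy  = lambda rows, banks, r: (2.0 * rows + 1.0 * banks) * r * 1e-3
print(explore(max_rows=4, max_banks=4, runtime_model=runtime, energy_model=energy))
```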

Example of partial shutdown exploration

  Configuration      Tc    Td    Runtime (R)   Energy (E)   EDP (R*E)
  4 rows - 2 banks   180   288   288           10.46        3012
  2 rows - 2 banks   270   288   288           10.01        2882

[Figure: mappings of LD S[i], LD S[i+1], LD D[i], ST R[i] and the arrays S[], D[], R[] for the 4-row/2-bank and 2-row/2-bank configurations]

Experimental setup
- A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
- Target architecture:
  - 4x4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - Each PE is connected to its four neighbors and four diagonal ones
- Compared mapping flows:
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA + PSE: MA + partial shutdown exploration
  * Edge-Centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT '08

Runtime comparison
- Compared with MU, MA reduces the runtime by 30%

Energy consumption comparison
- MA + PSE reduces the energy consumption by 47%

Conclusion
- CGRAs provide very high power efficiency while being software-programmable
- While previous solutions have focused on computation speed, we also consider the data transfer to achieve higher performance
- We proposed an effective heuristic that takes the memory architecture into account
- It achieves a 62% reduction in the energy-delay product, which factors into a 47% reduction in energy consumption and a 28% reduction in runtime ((1 - 0.47) x (1 - 0.28) ≈ 0.38)

Thank you for your attention!