Claude Tadonki Laboratoire de l’Accélérateur Linéaire/IN2P3/CNRS University of Orsay

Claude Tadonki
Laboratoire de l’Accélérateur Linéaire/IN2P3/CNRS
University of Orsay
Orsay / France

[email protected]

1st Workshop on Applications for Multi and Many Core Architectures
22nd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2010)

October 27 – 30, 2010, Petrópolis, Rio de Janeiro, Brazil.



Ring pipelined algorithm for the algebraic path problem on the CELL Broadband Engine
C. Tadonki

The Algebraic Path Problem
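For readers without the slide figure, the standard textbook statement of the problem can be written as below. The notation (a semiring $(S, \oplus, \otimes)$, the path set $P(i,j)$, and the weight function $w$) is chosen here for illustration and may differ from the notation used in the talk.

```latex
% Weighted graph G = (V, E, w) with edge weights in a semiring (S, \oplus, \otimes).
% P(i,j) denotes the set of all paths from vertex i to vertex j.
\[
  d(i,j) \;=\; \bigoplus_{p \,\in\, P(i,j)} w(p),
  \qquad
  w(p) \;=\; \bigotimes_{e \,\in\, p} w(e),
  \qquad i, j \in V .
\]
% Well-known instances:
%   (\min, +)        all-pairs shortest paths
%   (\lor, \land)    transitive closure
%   (+, \times)      matrix inversion (Gauss-Jordan elimination)
```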


The Warshall-Floyd Algorithm
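As a reminder of the sequential baseline, the triple loop can be sketched as follows. This is a minimal illustration, not the kernel used in the talk: the function name `warshall_floyd` and the `add`/`mul` parameters are hypothetical, and the sketch omits the closure of the diagonal term, so it is only valid for idempotent instances such as (min, +) without negative cycles or the boolean semiring.

```python
import operator

INF = float("inf")  # stands for "no edge" in the (min, +) instance


def warshall_floyd(matrix, add=min, mul=operator.add):
    """Sequential Warshall-Floyd over a semiring given by (add, mul).

    Defaults to (min, +), i.e. all-pairs shortest paths.
    Hypothetical helper written for illustration only.
    """
    n = len(matrix)
    a = [list(row) for row in matrix]  # work on a copy
    for k in range(n):                 # pivot index
        for i in range(n):
            for j in range(n):
                # a[i][j] <- a[i][j] (+) a[i][k] (x) a[k][j]
                a[i][j] = add(a[i][j], mul(a[i][k], a[k][j]))
    return a


weights = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
distances = warshall_floyd(weights)
```

Passing `add=operator.or_` and `mul=operator.and_` with a boolean adjacency matrix gives the transitive closure instead of shortest paths.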




Shift-Toroidal Reindexing (Kung-Lo-Lewis, 1987)
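One common reading of the shift-toroidal idea is sketched below, under the assumption that it matches the reindexing shown on the slide: after every pivot step the matrix is rotated one position up and one position left (with wrap-around), so the pivot row and pivot column always sit at index 0. After N steps the matrix has been rotated N times and is back in its original indexing. The function name `shift_toroidal_app` is hypothetical, and the sketch uses the (min, +) instance with a non-negative diagonal.

```python
INF = float("inf")


def shift_toroidal_app(matrix):
    """(min, +) Warshall-Floyd where the pivot is always at index 0.

    Illustrative sketch of a toroidal shift, not the author's code.
    """
    n = len(matrix)
    a = [list(row) for row in matrix]
    for _ in range(n):
        # Pivot step using row 0 and column 0 of the current matrix.
        b = [[min(a[i][j], a[i][0] + a[0][j]) for j in range(n)]
             for i in range(n)]
        # Toroidal shift: entry (i+1, j+1) moves to (i, j), indices mod n.
        a = [[b[(i + 1) % n][(j + 1) % n] for j in range(n)]
             for i in range(n)]
    return a


weights = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shifted_result = shift_toroidal_app(weights)
```

Because every step touches the same positions (row 0 and column 0), the data movement becomes uniform across steps, which is what makes a regular pipeline possible.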




The CELL Broadband Engine




Ring Pipelined Algorithm for the APP (algorithm)




Ring Pipelined Algorithm for the APP (algorithm)



Interesting properties of our algorithm

Can run with any number of processors p <= N (natural LPGS)

Generic tiling applies (LSGP by blocking)

Each processor requires only a buffer of size bN (block of size b)

Fully pipelined process with local synchronization only

Perfect computation-communication overlap
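The properties listed above can be illustrated with a small sequential simulation. This is a sketch under stated assumptions, not the algorithm from the slides: each ring processor is modelled as a Python generator named `stage` (a hypothetical name), rows are streamed one at a time (block size b = 1), the (min, +) instance with a non-negative diagonal is used, and `ring_app` runs ceil(N/p) passes over a ring of p stages when p < N.

```python
INF = float("inf")


def stage(rows):
    """One ring processor: keeps the first row it receives as its pivot row.

    Only this single row is buffered locally (size N for block size 1).
    """
    rows = iter(rows)
    pivot = next(rows)
    for row in rows:
        c = row[0]  # entry of this row in the current pivot column
        updated = [min(x, c + p) for x, p in zip(row, pivot)]
        # Rotate left so the next pivot column arrives at index 0.
        yield updated[1:] + updated[:1]
    # The pivot row leaves last, which makes the next row the new first row.
    yield pivot[1:] + pivot[:1]


def ring_app(matrix, p):
    """Stream the matrix through a ring of p stages, pass after pass."""
    n = len(matrix)
    rows = [list(r) for r in matrix]
    done = 0
    while done < n:
        k = min(p, n - done)       # stages used in this pass
        stream = iter(rows)
        for _ in range(k):
            stream = stage(stream)  # each stage feeds only its neighbour
        rows = list(stream)
        done += k
    return rows


weights = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
result_p4 = ring_app(weights, 4)  # p = N, a single pass
result_p3 = ring_app(weights, 3)  # p < N, two passes
```

Since the generators are lazy, a row enters the second stage as soon as the first stage has produced it, which mirrors the pipelined behaviour; each stage only ever interacts with its immediate neighbour.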


Ring Pipelined Algorithm for the APP (implementation on the CELL BE)



PPE-DMA is issued only by the first and the last processor

Inner SPEs communicate and synchronize locally

Computation-communication overlap occurs for all communications

Can run on more SPEs or CELL Blades by natural extension
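The communication pattern described above can be mimicked with threads and queues. This is only a topology sketch and makes several assumptions: Python threads stand in for SPEs, `queue.Queue` links stand in for SPE-to-SPE transfers, the caller of `run_ring` plays the role of the PPE and main memory, and all names (`spe_worker`, `run_ring`, `DONE`) are hypothetical. It does not use or model the CELL SDK or real DMA.

```python
import queue
import threading

INF = float("inf")
DONE = object()  # end-of-stream marker passed along the ring


def spe_worker(inbox, outbox):
    """Stand-in for one SPE: talks only to its two neighbours."""
    pivot = inbox.get()  # first row received is the pivot row
    while True:
        row = inbox.get()
        if row is DONE:
            break
        c = row[0]
        updated = [min(x, c + p) for x, p in zip(row, pivot)]
        outbox.put(updated[1:] + updated[:1])  # forward to the next worker
    outbox.put(pivot[1:] + pivot[:1])          # pivot row leaves last
    outbox.put(DONE)


def run_ring(matrix, num_workers):
    """Feed rows to the first worker and collect them from the last one."""
    n = len(matrix)
    rows = [list(r) for r in matrix]
    for start in range(0, n, num_workers):
        k = min(num_workers, n - start)
        links = [queue.Queue() for _ in range(k + 1)]
        workers = [
            threading.Thread(target=spe_worker, args=(links[i], links[i + 1]))
            for i in range(k)
        ]
        for w in workers:
            w.start()
        for row in rows:          # main memory -> first worker only
            links[0].put(row)
        links[0].put(DONE)
        collected = []
        while True:               # last worker -> main memory only
            row = links[k].get()
            if row is DONE:
                break
            collected.append(row)
        for w in workers:
            w.join()
        rows = collected
    return rows


weights = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
threaded_result = run_ring(weights, 3)
```

Only `links[0]` and `links[k]` are touched by the caller, while the inner links are private to neighbouring workers; adding workers only lengthens the chain, which reflects the "natural extension" remark.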


Performance




Conclusion and Perspectives



Our ring SPMD algorithm suits the CELL BE and scales well

Communication and synchronization account for less than 5% overhead

Absolute performance can be improved by optimizing the APP kernel

Close to 80% of peak performance is expected

Our scheduling can be applied to similar problems


END & QUESTIONS


