Claude Tadonki Laboratoire de l’Accélérateur Linéaire/IN2P3/CNRS University of Orsay Orsay / France [email protected] 1st Workshop on Applications for Multi and Many Core Architectures 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2010)

Claude Tadonki
Laboratoire de l’Accélérateur Linéaire/IN2P3/CNRS
University of Orsay
Orsay / France

[email protected]

1st Workshop on Applications for Multi and Many Core Architectures
22nd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2010)

October 27–30, 2010, Petrópolis, Rio de Janeiro, Brazil.

Ring pipelined algorithm for the algebraic path problem on the CELL Broadband Engine C. TADONKI

The Algebraic Path Problem
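For reference, the usual textbook statement of the problem can be written as follows. This is an assumed formulation, not reproduced from the slide: the graph has n vertices, the edge weights w(u,v) lie in a closed semiring (S, ⊕, ⊗), and the symbols d and π are names chosen here for illustration.

```latex
d(i,j) \;=\; \bigoplus_{\pi \,:\, i \leadsto j} \;\; \bigotimes_{(u,v) \in \pi} w(u,v),
\qquad 1 \le i, j \le n
```

Here π ranges over all paths from i to j. With (min, +) this gives all-pairs shortest paths, with (or, and) it gives the transitive closure, and with ordinary (+, ×) it is closely related to matrix inversion.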

The Warshall-Floyd Algorithm
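As a reminder, the classical Warshall-Floyd recurrence can be sketched as below. This is the standard sequential version over the (min, +) semiring, assuming no negative cycles; the function name `warshall_floyd`, its `plus`/`times` parameters, and the sample `graph` are illustrative choices, not taken from the talk.

```python
import math

def warshall_floyd(w, plus=min, times=lambda a, b: a + b):
    """Return the closure of the square matrix w over the semiring (plus, times).

    Defaults to (min, +), i.e. all-pairs shortest paths.
    """
    n = len(w)
    d = [row[:] for row in w]          # work on a copy
    for k in range(n):                 # pivot index
        for i in range(n):
            for j in range(n):
                # relax path i -> j through the pivot vertex k
                d[i][j] = plus(d[i][j], times(d[i][k], d[k][j]))
    return d

INF = math.inf
graph = [
    [0,   3,   INF],
    [INF, 0,   1],
    [2,   INF, 0],
]
dist = warshall_floyd(graph)
```

Passing `plus=lambda a, b: a or b` and `times=lambda a, b: a and b` on a boolean adjacency matrix would give the transitive closure with the same loop nest.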

Shift-toroidal Reindexation (Kung-Lo-Lewis, 1987)
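One way to read this reindexation, stated here as an assumption about the general idea rather than a reproduction of the slide, is that the matrix is shifted toroidally by one row and one column after each step, so that the pivot always sits at index 0. The sketch below uses the (min, +) semiring and checks the result against the plain triple loop; both function names are hypothetical.

```python
import math

def closure_reference(w):
    """Plain Warshall-Floyd over (min, +), used only as a reference."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

def closure_toroidal(w):
    """Same computation, but the pivot is always row 0 / column 0."""
    n = len(w)
    d = [row[:] for row in w]
    for _ in range(n):
        # Update every entry through the pivot located at index 0.
        upd = [[min(d[i][j], d[i][0] + d[0][j]) for j in range(n)]
               for i in range(n)]
        # Toroidal shift by one in both dimensions: the next pivot
        # moves to index 0, and after n steps every entry is back
        # at its original position.
        d = [[upd[(i + 1) % n][(j + 1) % n] for j in range(n)]
             for i in range(n)]
    return d

INF = math.inf
graph = [
    [0,   3,   INF],
    [INF, 0,   1],
    [2,   INF, 0],
]
shifted = closure_toroidal(graph)
```

Because the pivot position never changes, every step has the same data-access pattern, which is what makes this form convenient for regular (systolic or ring) schedules.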

The CELL Broadband Engine

Ring Pipelined Algorithm for the APP (algorithm)

Interesting properties of our algorithm

Can run with any number of processors p <= N (natural LPGS)

Generic tiling applies (LSGP by blocking)

Each processor only requires a buffer of size bN (block of size b)

Fully pipelined process with local synchronization only

Perfect computation-communication overlap
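The properties above can be illustrated with a deliberately simplified simulation. It is a sketch under stated assumptions, not the schedule presented in the talk: rows are split into p contiguous blocks of b = N/p rows (p is assumed to divide N), the (min, +) semiring is used, and each pivot row hops from one processor to its ring neighbour. The hops are serialised here, whereas real hardware would overlap them with computation. The name `ring_closure` is hypothetical.

```python
import math

def ring_closure(w, p):
    """Simulate a row-block ring schedule for Warshall-Floyd over (min, +)."""
    n = len(w)
    assert n % p == 0, "this sketch assumes p divides N"
    b = n // p
    # blocks[q] is the private b x N buffer of processor q (size bN).
    blocks = [[row[:] for row in w[q * b:(q + 1) * b]] for q in range(p)]
    for k in range(n):
        owner = k // b
        # The owner emits a copy of pivot row k; this is the only message.
        pivot = blocks[owner][k % b][:]
        # The message visits each processor once, neighbour to neighbour.
        for hop in range(p):
            q = (owner + hop) % p
            for row in blocks[q]:
                rik = row[k]
                for j in range(n):
                    row[j] = min(row[j], rik + pivot[j])
    # Gather the blocks back into one matrix.
    return [row for block in blocks for row in block]

INF = math.inf
graph = [
    [0,   3,   INF],
    [INF, 0,   1],
    [2,   INF, 0],
]
one_proc = ring_closure(graph, 1)
three_procs = ring_closure(graph, 3)
```

Each simulated processor touches only its own b x N block plus the pivot row currently passing through, and it only ever exchanges data with its ring neighbour.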

Ring Pipelined Algorithm for the APP (implementation on the CELL BE)

PPE-DMA is issued only by the first and the last processor

Inner SPEs communicate and synchronize locally

Computation-communication overlap occurs for all communications

Can run on more SPEs or CELL Blades by natural extension

Performance

Conclusion and Perspectives

Our ring SPMD algorithm suits the CELL BE and scales well

Communication and synchronization account for less than 5% overhead

Absolute performance can be improved by optimizing the APP kernel

Close to 80% of peak performance is expected

Our scheduling can be applied to similar problems

END & QUESTIONS
