Christos Kozyrakis and Kunle Olukotun
http://ppl.stanford.edu
Hot Chips 21 – Stanford – August 2009
Applications: Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
Programming & software systems: Alex Aiken, Pat Hanrahan, John Ousterhout, Mendel Rosenblum
Architecture: Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (director)
Goal: the parallel computing platform for 2015
Parallel application development practical for the masses
Joe the programmer…
Parallel applications without parallel programming
PPL is a collaboration of:
Leading Stanford researchers across multiple domains
Applications, languages, software systems, architecture
Leading companies in computer systems and software
Sun, AMD, Nvidia, IBM, Intel, NEC, HP
PPL is open: any company can join; all results are in the public domain
1. Finding independent tasks
2. Mapping tasks to execution units
3. Implementing synchronization
Races, livelocks, deadlocks, … (see the sketch after this list)
4. Composing parallel tasks
5. Recovering from HW & SW errors
6. Optimizing locality and communication
7. Predictable performance & scalability
8. … and all the sequential programming issues
Even with new tools, can Joe handle these issues?
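To make issue 3 concrete, here is a minimal sketch (ours, not from the slides) of a data race in plain Scala threads, exactly the kind of low-level hazard PPL wants to hide from Joe:

// A sketch (ours): two threads increment a shared counter without
// synchronization. The read-modify-write is not atomic, so updates
// are lost and the final count is usually well below 2,000,000.
object RacyCounter {
  var count = 0L // shared mutable state, unprotected

  def main(args: Array[String]): Unit = {
    val threads = Seq.fill(2)(new Thread(() => {
      for (_ <- 1 to 1000000) count += 1 // race: load, add, store
    }))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(s"count = $count") // nondeterministic result
  }
}

Fixing this with locks or atomics is routine for experts, but it is precisely the kind of reasoning the DSL-based approach below takes off the application programmer's plate.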
Guiding observations
Must hide low-level issues from programmer
No single discipline can solve all problems
Top-down research driven by applications
Core techniques
Domain specific languages (DSLs)
Simple & portable programs
Heterogeneous hardware
Energy and area efficient computing
[Figure: the PPL stack. Applications (Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering) are written in domain-specific languages (Physics, Scripting, Probabilistic, Machine Learning, Rendering). A shared DSL infrastructure (a Parallel Object Language plus a Common Parallel Runtime with explicit/static and implicit/dynamic execution) maps them onto heterogeneous hardware: OOO, SIMD, and threaded cores with programmable hierarchies, scalable coherence, isolation & atomicity, and pervasive monitoring.]
Leverage domain expertise at Stanford: CS research groups and national centers for scientific computing.
[Figure: PPL alongside existing Stanford research centers (Media-X, NIH NCBC, DOE ASC) and existing Stanford CS research groups, spanning environmental science, geophysics, seismic modeling, web/mining/streaming DB, graphics/games, mobile/HCI, and AI/ML/robotics.]
Virtual Worlds: a next-generation web platform
Millions of players in vast landscapes
Immersive collaboration
Social gaming
Computing challenges
Client-side game engine
Graphics rendering
Server-side world simulation
Object scripting, geometric queries, AI, physics computation
Dynamic content, huge datasets
More at http://vw.stanford.edu/
High-level languages targeted at specific domains
E.g.: SQL, Matlab, OpenGL, Ruby/Rails, …
Usually declarative and simpler than GP languages
DSLs → higher productivity for developers
High-level data types & ops (e.g. relations, triangles, …)
Express high-level intent w/o implementation artifacts
DSLs → scalable parallelism for the system (sketched below)
Declarative description of parallelism & locality patterns
Can be ported or scaled to available machine
Allows for domain specific optimization
Automatically adjust structures, mapping, and scheduling
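As a small illustration of the declarative point (our sketch, using Scala's parallel-collections module rather than a PPL DSL): the program states only what to compute, and the library is free to choose decomposition and scheduling for the machine at hand.

// Declarative data parallelism: no threads, partitioning, or locks in
// user code, so the same line can scale from 2 cores to 64.
// Scala 2.13: requires the scala-parallel-collections module.
import scala.collection.parallel.CollectionConverters._

object DeclarativeSum {
  def main(args: Array[String]): Unit = {
    val xs = (1 to 10000000).toArray
    val sumSquares = xs.par.map(x => x.toLong * x).sum
    println(sumSquares)
  }
}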
Goal: simplify code of mesh-based PDE solvers
Write once, run on any type of parallel machine
From multi-cores and GPUs to clusters
Language features
Built-in mesh data types
Vertex, edge, face, cell
Collections of mesh elements
cell.faces(), face.edgesCCW()
Mesh-based data storage
Fields, sparse matrices
Parallelizable iterations
Map, reduce, forall statements
val position = vertexProperty[double3]("pos")
val A = new SparseMatrix[Vertex,Vertex]
for (c <- mesh.cells) {
  val center = average position of c.vertices
  for (f <- c.faces) {
    val face_dx = average position of f.vertices - center
    for (e <- f.edges With c CounterClockwise) {
      val v0 = e.tail
      val v1 = e.head
      val v0_dx = position(v0) - center
      val v1_dx = position(v1) - center
      val face_normal = v0_dx cross v1_dx
      // calculate flux for face …
      A(v0,v1) += …
      A(v1,v0) -= …
    }
  }
}
High-level data types & operations
Explicit parallelism using map/reduce/forall
Implicit parallelism with help from DSL & HW
No low-level code to manage parallelism
Liszt compiler & runtime manage parallel execution:
Data layout & access, domain decomposition, communication, …
Domain specific optimizations
Select mesh layout (grid, tetrahedral, unstructured, custom, …)
Select decomposition that improves locality of access
Optimize communication strategy across iterations
Optimizations are possible because
Mesh semantics are visible to compiler & runtime
Iterative programs with data accesses based on mesh topology
Mesh topology is known to runtime
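A rough sketch of why visible mesh semantics matter (ours; Liszt's actual decomposition machinery is more sophisticated): if the runtime knows which cells share vertices, a greedy coloring yields batches of cells that can update vertex data in parallel without locks.

// Conflict rule: two cells conflict if they share a vertex. Greedy
// coloring assigns each cell the smallest color unused by any cell
// it conflicts with; all cells of one color form a race-free batch.
object MeshColoring {
  type Cell = Seq[Int] // a cell, represented by its vertex ids

  def color(cells: IndexedSeq[Cell]): Map[Int, Int] = {
    val colorsAtVertex =
      scala.collection.mutable.Map.empty[Int, scala.collection.mutable.Set[Int]]
    cells.indices.map { ci =>
      // colors already taken by any cell sharing a vertex with this one
      val used = cells(ci).flatMap(v => colorsAtVertex.getOrElse(v, Set.empty)).toSet
      val c = Iterator.from(0).find(k => !used.contains(k)).get
      cells(ci).foreach { v =>
        colorsAtVertex.getOrElseUpdate(v, scala.collection.mutable.Set.empty) += c
      }
      ci -> c
    }.toMap
  }
}

Colors then execute one after another, each as a parallel forall. This kind of scheduling is only possible because, as above, the mesh topology is visible to the compiler and runtime.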
Provide a shared framework for DSL development
Features
Common parallel language that retains DSL semantics
Mechanism to express domain specific optimizations
Static compilation + dynamic management environment
For regular and unpredictable patterns respectively
Synthesize HW features into high-level solutions
E.g. from HW messaging to fast runtime for fine-grain tasks
Exploit heterogeneous hardware to improve efficiency
Required features
Support for functional programming (FP)
Declarative programming style for portable parallelism
Higher-order functions allow parallel control structures
Support for object-oriented programming (OOP)
Familiar model for complex programs
Allows mutable data-structures and domain-specific attributes
Managed execution environment
For runtime optimizations & automated memory management
Our approach: embed DSLs in the Scala language
Supports both FP and OOP features
Supports embedding of higher-level abstractions
Compiles to Java bytecode
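A minimal sketch of the embedding idea (ours, not Delite's actual API): DSL operations on a Matrix type build an expression graph instead of computing eagerly, which is the hook that lets a runtime apply transformations and pick a parallel mapping before anything runs.

// Ops record IR nodes instead of computing ("deferred execution").
// A runtime can then inspect the graph, fuse or reorder ops, and map
// independent nodes to different cores or accelerators.
sealed trait Exp
case class Const(data: Array[Double]) extends Exp
case class Add(a: Exp, b: Exp) extends Exp
case class Mul(a: Exp, b: Exp) extends Exp // elementwise, for brevity

class Matrix(val e: Exp) {
  def +(that: Matrix) = new Matrix(Add(e, that.e)) // records, runs nothing
  def *(that: Matrix) = new Matrix(Mul(e, that.e))
}

object TinyRuntime {
  // Sequential reference interpreter; a real runtime would parallelize
  // and dispatch ops here instead.
  def eval(e: Exp): Array[Double] = e match {
    case Const(d)  => d
    case Add(a, b) => eval(a).zip(eval(b)).map { case (x, y) => x + y }
    case Mul(a, b) => eval(a).zip(eval(b)).map { case (x, y) => x * y }
  }
}

In this style, writing val d = (a + b) * c executes nothing; per the Delite flow below, that deferral point is where generic and domain transformations are applied before a mapping is generated.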
[Figure: Delite execution flow. The application calls Matrix DSL methods; the DSL defers op execution to Delite; Delite applies generic & domain transformations to generate a mapping.]
[Plots: speedup vs. execution cores (1 to 128) for Gaussian Discriminant Analysis, in three versions: the original (low speedup due to loop dependencies), + domain optimizations (domain info used to refactor dependencies), and + data parallelism (exploiting data parallelism within tasks).]
Heterogeneous HW for energy & area efficiency
ILP, threads, data-parallel engines, custom engines
Q: what is the right balance of features in a chip?
Q: what tools can generate best chip for an app/domain?
Study: HW options for H.264 encoding
[Plots: H.264 encoding study. Area (left) and performance & energy savings (right) for 4 cores, + ILP, + SIMD, + custom instructions, and an ASIC.]
Revisit architectural support for parallelism
Which are the basic HW primitives needed?
Challenges: semantics, implementation, scalability, virtualization, interactions, granularity (fine-grain & bulk), …
HW primitives
Coherence & consistency, atomicity & isolation, memory partitioning, data and control messaging, event monitoring
Runtime synthesizes primitives into SW solutions
Streaming system: mem partitioning + bulk data messaging
TLS: isolation + fine-grain control communication
Transactional memory: atomicity + isolation + consistency
Security: mem partitioning + isolation
Fault tolerance: isolation + checkpoint + bulk data messaging
Parallel tasks with a few thousand instructions
Critical to exploit in large-scale chips
Tradeoff: load balance vs overheads vs locality
Software-only scheduling
Per-thread task queues + task stealing
Flexible algorithms but high stealing overheads
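For concreteness, a minimal sketch of the software-only scheme (ours, on plain JDK concurrency; the schedulers measured here are more elaborate): per-thread deques with random stealing, where every steal goes through shared memory and the coherence protocol, which is where the overhead comes from.

// Per-worker deques with random stealing over a fixed task set.
// Local pops are LIFO (cache-friendly); steals take the FIFO end.
import java.util.concurrent.ConcurrentLinkedDeque
import java.util.concurrent.atomic.AtomicInteger

object WorkStealing {
  def run(nWorkers: Int, tasks: Seq[Runnable]): Unit = {
    val queues = Array.fill(nWorkers)(new ConcurrentLinkedDeque[Runnable]())
    tasks.zipWithIndex.foreach { case (t, i) => queues(i % nWorkers).addLast(t) }
    val remaining = new AtomicInteger(tasks.size)

    val workers = (0 until nWorkers).map { id =>
      new Thread(() => {
        while (remaining.get() > 0) {
          val task = Option(queues(id).pollFirst()).orElse { // local pop
            val victim = scala.util.Random.nextInt(nWorkers)
            Option(queues(victim).pollLast()) // steal via shared memory
          }
          task match {
            case Some(t) => t.run(); remaining.decrementAndGet()
            case None    => Thread.onSpinWait() // idle, retry
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
  }
}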
[Plot: execution time breakdown (running, overhead, idle) for cg and gtfold on 32, 64, and 128 cores under SW-only, HW-only, and ADM scheduling. Overheads and scheduling get worse at larger scales.]
Hardware-only scheduling
HW task queues + HW stealing protocol
Minimal overheads (bypass coherence protocol)
But hard-wires the scheduling algorithm
Optimal approach varies across applications
Impractical to support all options in HW
[Plot: same breakdown. The wrong scheduling algorithm makes HW-only slower than SW-only.]
Simple HW feature: asynchronous direct messages
Register-to-register, received with user-level interrupt
Fast messaging for SW schedulers with flexible algorithms
E.g., gtfold scheduler tracks domain-specific dependencies
Also useful for fast barriers, reductions, IPC, …
Better performance, simpler HW, more flexibility & uses
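The slides give no code for ADM, so the following is only a sketch of the message-driven scheduling pattern it enables, with blocking queues standing in for the hardware: real ADM delivers register-to-register messages handled by user-level interrupts.

// Emulated ADM pattern: an idle thief asks a victim for work with an
// asynchronous message instead of touching the victim's deque through
// shared memory, so the scheduling policy stays entirely in software.
import java.util.ArrayDeque
import java.util.concurrent.LinkedBlockingQueue

object AdmSketch {
  sealed trait Msg
  case class StealRequest(thiefId: Int) extends Msg
  case class TaskReply(task: Option[Runnable]) extends Msg

  def main(args: Array[String]): Unit = {
    val mailboxes = Array.fill(2)(new LinkedBlockingQueue[Msg]())
    val victimTasks = new ArrayDeque[Runnable]()
    victimTasks.add(() => println("stolen task ran"))

    val victim = new Thread(() => mailboxes(0).take() match {
      case StealRequest(thiefId) => // stands in for the ADM interrupt handler
        mailboxes(thiefId).put(TaskReply(Option(victimTasks.pollLast())))
      case _ => ()
    })
    val thief = new Thread(() => {
      mailboxes(0).put(StealRequest(thiefId = 1)) // asynchronous send
      mailboxes(1).take() match {
        case TaskReply(Some(t)) => t.run()
        case _                  => println("victim had no work")
      }
    })
    victim.start(); thief.start()
    victim.join(); thief.join()
  }
}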
[Plot: same breakdown. ADM delivers scalable scheduling with simple HW.]
Goal: make parallel computing practical for the masses
Technical approach
Domain specific languages (DSLs)
Simple & portable programs
Heterogeneous hardware
Energy and area efficient computing
Working on the SW & HW techniques that bridge them
More info at: http://ppl.stanford.edu