SoC Architecture
Case studies: Multimedia
S. Mancini
Outline
- Context
  - Presentation
  - Video standards
  - 3D
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- PlayStation 2 & 3
- GPU
- SoC Architecture - Case studies - Multimedia
Applications
- Internet
- Education, games, etc.
- Interactive television
  - MPEG-4
  - Teleshopping
- Telecommunications
  - Teleconferencing
  - Videophony
- Medical imaging
Specific characteristics
- Low-level processing is regular and requires large computing power:
  - compression/decompression
  - composition
  - filtering
  - 3D (textures, etc.)
- Synchronizing the contents is complex and irregular.

Multimedia systems must therefore combine:
- application-specific operators, and
- microprocessors
Video formats

Uncompressed video formats for digital television, up to high definition:
- SD - CCIR BT-656: 625 or 525 lines of up to 720 pixels; 4:2:2 sampling (Cb, Y, Cr, Y, ...); data rate: 27 MB/s
- HD - 1080p: 1920 x 1080 at 24 or 30 Hz; 4:2:0 sampling; data rate: 52 Mpixel/s = 78 MB/s
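These data rates are simple arithmetic: samples per line x lines x frame rate x bytes per sample (2 bytes per pixel for 8-bit 4:2:2, 1.5 for 4:2:0). A small sketch of the computation; note that the SD figure of 27 MB/s only comes out if blanking is included, using the ITU-R BT.601 raster of 864 samples per line over 625 lines (an assumption not stated on the slide), and the HD figure matches the slide's 52 Mpixel/s at a 25 Hz rate:

```c
/* Raw video data rate in MB/s for a given raster and sampling scheme.
   bytes_per_sample: 2.0 for 8-bit 4:2:2 (Cb,Y,Cr,Y,...), 1.5 for 4:2:0. */
static double video_rate_mbytes(long samples_per_line, long lines,
                                double fps, double bytes_per_sample)
{
    return (double)samples_per_line * lines * fps * bytes_per_sample / 1e6;
}

/* SD, BT.656 over the full 625-line BT.601 raster (864 samples per line,
   blanking included), 4:2:2:
       video_rate_mbytes(864, 625, 25.0, 2.0)   -> 27 MB/s
   HD, 1080p at 25 Hz, active pixels only, 4:2:0:
       video_rate_mbytes(1920, 1080, 25.0, 1.5) -> 77.76 MB/s (~78 MB/s) */
```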
MPEG-2

A compressed video format. MPEG-2 defines several resolutions and bitrates:

Level       Max. sampling dimensions   Pixels/sec   Max. bitrate
Low         352 x 240 x 30 fps         3.05 M       4 Mb/s
Main        720 x 480 x 30 fps         10.40 M      15 Mb/s
High 1440   1440 x 1152 x 30 fps       47.00 M      60 Mb/s
High        1920 x 1080 x 30 fps       62.70 M      80 Mb/s
Principle of Z-buffer 3D

[Figure: a lit 3D scene is projected onto the screen plane in front of the observer; along the projection axis, Z gives the depth of each point.]
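The algorithm behind the figure fits in a few lines: the Z-buffer stores, per pixel, the depth of the closest fragment drawn so far, and a new fragment is kept only if it is closer. A minimal sketch (the resolution and the smaller-z-is-closer convention are arbitrary choices for illustration):

```c
#include <float.h>
#include <stdint.h>

enum { W = 320, H = 240 };

static float    zbuf[W * H];   /* depth of the closest fragment so far */
static uint32_t frame[W * H];  /* color of the closest fragment so far */

/* Reset every pixel to "infinitely far" before drawing a scene. */
static void zbuf_clear(void)
{
    for (int i = 0; i < W * H; i++) {
        zbuf[i]  = FLT_MAX;
        frame[i] = 0;
    }
}

/* Depth test: write the fragment only if it is nearer than what
   the Z-buffer already holds for that pixel. */
static void put_fragment(int x, int y, float z, uint32_t color)
{
    int i = y * W + x;
    if (z < zbuf[i]) {
        zbuf[i]  = z;
        frame[i] = color;
    }
}
```

Whatever order the scene's triangles are drawn in, each pixel ends up showing the fragment nearest to the observer.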
The 3D pipeline

[Figure: starting from the model database, the pipeline chains model transformation, lighting, perspective transformation, clipping, projection to pixels, texture application and the Z-buffer to produce the final image.]
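In textbook form, the geometry stages of this chain are 4x4 matrix products applied to each vertex, followed by a perspective divide and a viewport mapping for the projection to pixels. The sketch below shows only that skeleton; the matrix contents and the viewport convention are generic, not the PNX8500's:

```c
typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; }   mat4;

/* One geometry stage: multiply a homogeneous vertex by a 4x4 matrix
   (model, viewing or perspective transformation). */
static vec4 transform(const mat4 *t, vec4 v)
{
    float in[4] = { v.x, v.y, v.z, v.w };
    float out[4];
    for (int r = 0; r < 4; r++)
        out[r] = t->m[r][0] * in[0] + t->m[r][1] * in[1]
               + t->m[r][2] * in[2] + t->m[r][3] * in[3];
    return (vec4){ out[0], out[1], out[2], out[3] };
}

/* Projection to pixels: perspective divide, then map [-1,1] to the screen. */
static void project(vec4 v, int width, int height, float *px, float *py)
{
    *px = (v.x / v.w * 0.5f + 0.5f) * (float)width;
    *py = (v.y / v.w * 0.5f + 0.5f) * (float)height;
}
```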
Performance of 3D accelerators

The performance of a 3D accelerator depends essentially on the bandwidth to the texture memories and on the number of pixels displayed per second.

Range                 Clock (MHz)   Memory bandwidth      Pixels/s
GeForce 3 (2001)      175           2.4 GB/s              800 M
GeForce 6 (2004)      500           16 GB/s               4 G
GeForce 7 (2005)      600           51 GB/s               10 G
GeForce 8 (2007)      612/1500      51.2 GB/s             15 G
GeForce 9 (2008)      700/1500      51 GB/s               15 G ??
GTX 660 (2012)        980           144 GB/s              800 G ??
Console (2001)        300           10 GB/s               4 G
Console (2006)        3200          25.6 GB/s + 20 GB/s   16 G
Professional (2001)   500           32*11.2 GB/s          32*6.1 G
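These figures become meaningful against the demand side: refreshing a screen of a given resolution at a given rate, where each pixel is drawn on average `overdraw` times, needs resolution x refresh x overdraw pixels per second. A sketch of that estimate (the resolution, refresh and overdraw values are illustrative):

```c
/* Pixels per second needed to fill a screen, counting average overdraw. */
static double fill_rate_needed(int width, int height,
                               double refresh_hz, double overdraw)
{
    return (double)width * height * refresh_hz * overdraw;
}

/* e.g. 1280x1024 at 85 Hz is ~111 Mpix/s per pass, so the GeForce 3's
   800 Mpix/s fill rate leaves room for about 7x average overdraw. */
```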
Outline
- Context
- Set-top box PNX8500
  - Environment
  - The processors
  - Bus, memory and 3D
  - Software architecture
  - Example application
  - The chip
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- PlayStation 2 & 3
- GPU
Environment
Application class

The PNX8500 is dedicated to multimedia applications. It manages video streams, performing operations such as conversion, decompression and stream composition.

The on-chip CPUs ease the implementation of interactive environments. 2D acceleration enables windowing systems, and 3D acceleration allows the use of synthetic scenes.
General architecture

(Excerpt from "NEXPERIA PNX8500 Conceptual Architecture":)

[...] arbiter uses a programmable, cyclical list-based arbiter that assigns unused list slots to the CPUs. This guarantees required latency to all real-time agents while optimizing CPU performance.

COPROCESSORS/ACCELERATORS

The pnx8500 includes on-chip coprocessors for 2D and 3D graphics acceleration, MPEG decoding, image scaling and filtering, and display channel composition. All coprocessors read input and write results to memory.

2D drawing engine: The pnx8500 2D drawing accelerator block performs high-speed graphics operations including fast area fills, three-operand bitblt, monochrome data expansion, and line drawing. Monochrome data can be color-expanded to any supported pixel format. Anti-aliased lines are supported through a 16-level alpha-blend bitblt. A full 256-level alpha bitblt supports blending of source and destination images. The 2D engine can also be used as a generic DMA engine to transfer data between memory locations on a byte-aligned basis.

3D rendering engine: An on-chip 3D drawing engine accelerates texture-mapped pixel rendering. It implements Microsoft-compatible DirectX 6.1 3D functions, such as bilinear and trilinear texture filtering. Rendering speeds up to 60 Mpix/sec for bilinear- and trilinear-textured, mip-mapped, z-buffered, blended triangles deliver excellent 3D graphics quality for ASTB or television applications. In 3D operations, the TM32 core performs triangle setup and controls the hardware 3D engine; the MIPS core performs transformation and lighting.

High-level MPEG-2 decoder: A high-level MPEG-2 decoder block performs MPEG-2 program stream decoding below slice level. The TM32 core controls the decode process and handles MPEG-2 processing above slice level. BS-D, DVB, and all 18 ATSC formats are supported; three SD streams or one HD stream can be decoded simultaneously. The MPEG-2 decoder is capable of full-resolution decoding or decoding to a 2X horizontally compressed image format, saving memory space and bandwidth.

Memory-based scaler (MBS): The primary function of the MBS block is acceleration of scaling operations. It reads input from and writes results to memory. Compared to an 'on-the-fly' approach, this method decouples the scaling operation from the display, significantly reducing on-demand memory bandwidth requirements. Since the MBS block uses memory as a buffer, it can process many images in the same time required to display a single frame. In addition to basic scaling capabilities, the MBS performs many commonly required image manipulations prior to display, including linear and non-linear aspect ratio conversion (including panorama mode), anti-flicker filtering, de-interlacing, and pixel format conversion.
MIPS

The 32-bit MIPS core is dedicated to global system management.

Datapath units:
- ALU
- integer multiplier-accumulator and divider

Caches: 16 KB instruction, 32 KB data.

Six-stage pipeline at 150 MHz.

It is well suited to running an OS: it includes a memory management unit (MMU).
TriMedia: datapath

The TriMedia is dedicated to video and audio processing. It is a 32-bit VLIW issuing 5 operations per instruction, clocked at 200 MHz, with 27 functional units.

(Excerpt from Chapter 1, "Overview of the TriMedia System", Book 1—Getting Started, © 2000 TriMedia Technologies:)

[Each functional unit of the] TriMedia CPU is assigned to specific issue slots in a TriMedia VLIW instruction. Table 1 gives the functional unit assignments. Because of the number of available functional units and their assignment, some operations may have to wait for one or more cycles before they are executed. This means that in some cases not all issue slots are used. Good use of issue slots is one of the purposes of software code optimization.

What Makes TriMedia's VLIW Architecture Better

The instruction set architecture (the processor programming model) must be distinguished from implementation (the physical chip and its characteristics). VLIW microprocessors and superscalar implementations of traditional instruction sets share some characteristics, such as multiple execution units and the ability to execute multiple operations simultaneously. The techniques used to achieve high performance are different because the parallelism is explicit in VLIW instructions, but must be discovered by hardware at run time by superscalar processors. VLIW implementations are simpler for very high performance. Just as RISC architectures permit simpler, cheaper high-performance implementations than do CISC architectures, VLIW architectures are simpler and cheaper than RISC architectures because of hardware simplifications. However, they require more compiler support.

Table 1: Functional Unit Assignments
(Latency: clock cycles between the issue of an operation and the availability of its results; for Branch, cycles between the branch instruction executing and the branch taking place. Recovery: minimum number of clock cycles between successive operations on the same unit. The marks showing which of the 5 issue slots each unit occupies did not survive this transcription.)

Functional unit   Quantity   Latency   Recovery
Constant          5          1         1
Integer ALU       5          1         1
Load/Store        2          3         1
DSP ALU           2          2         1
DSP MUL           2          3         1
Shifter           2          1         1
Branch            3          3         1
Int/Float MUL     2          3         1
Float ALU         2          3         1
Float Compare     1          1         1
Float sqrt/div    1          17        16
TriMedia: memory architecture

The TriMedia accesses two distinct buses:
- DRAM
- MMIO (on the PI bus)

Caches: 32 KB instruction, 16 KB data; 8-way set-associative with 64-byte blocks. The MMIO region is non-cacheable.

[The memory-map diagram excerpted from the TriMedia documentation is unreadable in this transcription.]

Instructions are stored compressed in external memory.
TriMedia: system mechanisms

The TM32 implements a hardware semaphore mechanism. The semaphores are accessible as registers, both from the core and from the PI bus.

Access procedure:

    write ID to SEM (use a 32-bit store, with the ID in the 12 LSBs)
    retrieve SEM (use a 32-bit load; it returns 0x00000nnn)
    if (SEM == ID) {
        perform a short critical-section action
        write 0 to SEM
    } else
        try again later, or loop back to the write
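In C, this procedure behaves like a try-lock: a store to the semaphore register only "sticks" when the semaphore is free, and reading it back tells you whether you won. The sketch below models the register in software to make the protocol concrete; on the TM32 the retain-while-held store behavior is implemented by the hardware, and the function names here are illustrative:

```c
#include <stdint.h>

/* Software model of one TM32 hardware semaphore register: a nonzero
   store is honored only when the register is free; storing 0 releases. */
static uint32_t sem_reg;

static void sem_store(uint32_t v)
{
    if (v == 0 || sem_reg == 0)
        sem_reg = v & 0xFFFu;        /* the ID lives in the 12 LSBs */
}

static uint32_t sem_load(void)
{
    return sem_reg;                  /* reads back 0x00000nnn */
}

/* The documented access procedure: write our ID, read back, and we own
   the semaphore only if our own ID is what we read. */
static int sem_try_lock(uint32_t id)
{
    sem_store(id);
    return sem_load() == (id & 0xFFFu);
}

static void sem_unlock(void)
{
    sem_store(0);
}
```

On failure a task "tries again later, or loops back to the write", as the documentation puts it; the critical section guarded this way must stay short.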
The buses

- PI bus: three-state system bus, 32 bits at 50 MHz, i.e. about 200 MB/s
- DVP memory bus: point-to-point memory bus, 64 bits at 143 MHz, i.e. about 1 GB/s
(Excerpt from the VHDL PI-Bus ToolKit, Release 3.1:)

3. How to use the PI-Bus ToolKit

The purpose of the PI-Bus ToolKit is to allow the accelerated implementation of PI-Bus systems. The ToolKit provides much of the core of a potential system and facilitates combining the pieces for a given application. Figure 1 below shows the signals of the PI-Bus. The University of Sussex has built several PI-Bus systems that are discussed in detail later.

[Figure 1: PI-Bus signals between models. Master interface(s), slave interface(s), arbitration, decode and bus control are linked by the signals RESETN, CLK, READ, OPC[3:0], ACK[2:0], A[31:2], D[31:0], SELy, LOCK, REQx and GNTx.]

The interfaces provided to the host circuits attached to the slave and master devices are discussed in the chapters associated with the bus agents.

A model is provided for the master, slave and BCU (Bus Control Unit). The BCU model is a complex example that can be simplified for a small PI-Bus system. To provide compatibility with widely different platforms we have used structural VHDL to link PI-Bus modules.

Models are also provided for behavioural and synthesisable circuits to attach to the host interfaces of the master and slave models. The Centre for VLSI and Computer Graphics has developed hardware and software tools to test and confirm compliance with standards such as the PI-Bus; details can be found on page 34.

The ToolKit includes two distinct model libraries for the slave and master agents. In some systems it is practical to use one model as both the behavioural and synthesisable view, however it has become clear that this is not always the case. The behavioural model has been modified to create models more suited to synthesis, as follows:
- Removed AFTER clauses for systems such as Synopsys and ALTERA
- Component libraries all set to WORK for systems that require this (ALTERA)
(Excerpt from the Processor Local Bus (PLB) specification:)

In addition to providing a high-bandwidth data path, the PLB offers designers flexibility through the following features:
- Support for both multiple masters and slaves
- Four priority levels for master requests, allowing PLB implementations with various arbitration schemes
- Deadlock avoidance through slave-forced PLB rearbitration
- Master-driven atomic operations through a bus arbitration locking mechanism
- Byte-enable capability, supporting unaligned transfers
- A sequential burst protocol allowing byte, half-word, word and double-word burst transfers
- Support for 16-, 32- and 64-byte line data transfers
- Read word address capability, allowing slaves to return line data either sequentially or target word first
- DMA support for buffered, fly-by, peripheral-to-memory, memory-to-peripheral, and memory-to-memory transfers
- Guarded or unguarded memory transfers, allowing slaves to individually enable or disable prefetching of instructions or data
- Slave error reporting
- Architecture extendable to 256-bit data buses
- Fully synchronous

The PLB specification describes a system architecture along with a detailed description of the signals and transactions. PLB-based custom logic systems require the use of a PLB macro to interconnect the various master and slave macros.

Figure 2 illustrates the connection of multiple masters and slaves through the PLB macro. Each PLB master is attached to the PLB macro via separate address, read data and write data buses and a plurality of transfer qualifier signals. PLB slaves are attached to the PLB macro via shared, but decoupled, address, read data and write data buses along with transfer control and status signals for each data bus.

The PLB architecture supports up to 16 master devices. Specific PLB macro implementations, however, may support fewer masters. The PLB architecture also supports any number of slave devices.

[Figure 2: Example of PLB interconnection. PLB masters connect to the central arbiter of the PLB macro through individual address/transfer-qualifier, write-data and read-data buses plus status and control signals; PLB slaves sit behind the arbiter on a shared bus.]
Bus architecture

[Figure: PNX8500 bus topology. The MIPS and TM32 cores reach the external SDRAM through DMA ports on the memory controller (MMI bus). Peripherals (Video In, Video Out, 2D, PCI/XIO, I/O and the two interrupt controllers) sit on PI-bus segments, joined to the cores by bridges (MIPS bridge, fast C-bridge, TM32 C-bridge, MIPS C-bridge).]
Memory architecture

Data transfers between the different operators go through the main shared memory.
(Excerpt from the PNX8500 datasheet:)

The pnx8500 supports input of two CCIR656 video streams and up to four transport streams. After input, streams are routed to either the video input processor (VIP) blocks or MPEG system processor (MSP) blocks, depending on stream type.

Dual video input processors: Two VIP blocks input digital video (primarily CCIR656) at pixel rates up to 40 MHz in a variety of packed YUV422 memory video formats. To reduce the amount of memory required to store video data, the VIP blocks can compress video lines horizontally using a 6-tap polyphase scaler in normal or transposed polyphase modes. After input processing, video data is stored in memory. VIPs also support a raw data-streaming input mode for message passing and handling future 8-to-10-bit data formats.

Transport stream input and routing: Transport stream input is supported from several sources, including two transport stream input interfaces (TSIN), a 1394 receiver channel, an internal DMA agent dedicated to software-generated transport streams, and a feedback channel from each of the MPEG System Processor (MSP) blocks containing PID-filtered, optionally de-scrambled data. Each TSIN supports parallel and serial formats and input of up to four serial transport streams. Serial streams are converted to parallel within the TSIN block before further routing. After input, each transport stream is routed to one of several destinations, including an MSP block, a 1394 transmitter channel, or a transport stream output (TSOUT) block. Parallel streams are converted to serial as needed by the TSOUT block.

MPEG system processors: Two independent MSP blocks parse DVB and DSS transport streams, perform PID and section filtering, timestamping, descrambling, and demultiplexing, then write results to memory. Each MSP block handles one transport stream, thus up to two transport streams can be processed simultaneously. Prior to input, transport streams may pass through external POD or CI conditional access modules. Descrambler entitle- [the excerpt is cut off here]

Advanced image composition processor (AICP): Two AICP units perform color-space conversion and final image composition and refresh of the primary and secondary display channels. Compositing is performed in either RGB or YUV color space, depending on the AICP output mode. Compositing functionality is highly programmable and includes chroma, alpha or color-key based alpha-blending and pixel selection from the previous or current layer or the background color. Two physically and logically independent AICP units support all display modes. The primary AICP composites four layers to produce the primary output channel, intended for display on a TV or monitor. The secondary AICP composites two layers to create the secondary channel, intended for output to a VCR or other recording device or for viewing on another TV. Each layer has its own pixel clock and frame timing. All layers in both units are identical, so there are no restrictions on how each layer can be utilized. Thus the primary display can have four layers of video, graphics, or a mix of both. To support scenarios other than watch/record, one or both AICPs can be set to divide their layers over two pixel/timing-synchronous multiplexed outputs. This allows refresh of up to four screens with limited compositing capabilities. The output of each AICP unit is sent to a digital video output port.

MEMORY-BASED ARCHITECTURE

The pnx8500 operates as a memory-based system; memory serves as the buffer to decouple input and output data streams. As pnx8500 decodes input, it stores results in memory. When multiple data structures exist in memory for a given input stream, timestamping is used to determine when to display specific structures. Use of memory as a staging buffer simplifies synchronization between various on-chip blocks.

I/O UNITS

On-chip I/O units receive and/or route input from or transmit output to off-chip devices. Some perform additional stream processing and formatting.
Trade-off of this memory-based approach:
+ great flexibility;
- the required bus bandwidth is doubled, since every stream is written to memory and then read back.
"System" architecture

The shared memory requires hardware support for system mechanisms such as semaphores.

- The PI bus implements a "lock" function
- The TM32 implements semaphores
- Both processors have non-cacheable regions

The two processors can communicate through "pipe" channels built on these mechanisms by the operating systems.
Partitioning of the 3D pipeline

[Figure: the 3D pipeline (database, model transformation, lighting, perspective transformation, clipping, projection to pixels, texture application, Z-buffer merge, final image) partitioned across the chip: the geometry stages run on the MIPS and TM32 cores, while texture application and Z-buffer merge are performed by a dedicated ASIC, sustaining 60 million pixels per second.]
Video

- "Transport stream" decoding by 3 MSPs (MPEG System Processors)
  - an MSP is a 16-bit RISC processor
- Video processing is performed by ASICs:
  - MPEG-2 decoding
  - scaling
  - image composition
Operating systems

Both processors can run an operating system.

- The MIPS can run operating systems with "protected" processes.
  - Its MMU keeps the memory spaces of processes disjoint.
  - Drawback: context switches can be slow.
- The TM32's OS must be fast, since it performs the stream synchronization. To shorten context switches, only a small part of the 128 registers is saved.
Streaming software

The TriMedia Streaming Software Architecture (TSSA) simplifies the synchronization of processing functions. The user describes the processing functions and their connections; TSSA manages the movement, creation and destruction of the data.

Note: memory management is not left to the OS, so as to avoid fragmentation of the memory space.
Standardized data handled by TSSA

(Excerpt from Chapter 7, "TSSA Essentials", Book 3—Software Architecture, Part B, © 2000 TriMedia Technologies:)

Common Data Structures

Given countless ways to describe data, TSA has set some rules on formats of data used among TSA components, so that the data is understandable to all. The standard TSA data structures are declared in a header file called tmAvFormats.h. (Refer to Chapter 4, tmAvFormats.h: Multimedia Format Definitions, for documentation of these structures.) This chapter gives you an introduction to these commonly used data structures.

The TSA data structures are used in communication among components, and between an application and its components. Between an application and its components, configuration structures set up components. Among components, data packets transfer data.

Data Packets

Data travels in packets between components. Components that produce data packets are referred to as "senders" and components that consume data are called "receivers." The packets are always constructed according to the TSA data packet structure, with a pointer to the packet header as its first field. The header contains descriptive information on the type of data in the packet. The packet has a flexible, yet consistent structure. Specific hooks are provided where application needs can be satisfied, provided that you adhere to a few TSA rules.

[Figure 2: Example showing the hierarchical composition of a TSA packet.
  tmAvPacket_t: *Header, allocatedBuffers, buffersInUse, buffers[]
  tmAvBufferDescriptor_t: bufSize = 18432, dataSize = 18432, *Data Buffer
  tmAvHeader_t: id = 0, flags = 0, userSender, userReceiver, userPointer, time, format
  tmTimeStamp_t: ticks = 0, hiTicks = 0
  tmAudioFormat_t: size = sizeof(tmAudioFormat), hash = 0, referenceCount = 0, dataClass = avdcAudio, dataType = atfLinearPCM, dataSubtype = apfFiveDotOne16, description = 0, sampleRate = 44100.0]
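The hierarchy in Figure 2 can be transcribed as plain C structures. Field names and the example sizes follow the figure, but these declarations are a reconstruction for illustration, not the actual contents of tmAvFormats.h:

```c
#include <stdint.h>

typedef struct {                 /* tmTimeStamp_t in the figure         */
    uint32_t ticks, hiTicks;
} timeStamp_t;

typedef struct {                 /* tmAvHeader_t: describes the packet  */
    uint32_t    id, flags;
    void       *userSender, *userReceiver, *userPointer;
    timeStamp_t time;
    const void *format;          /* e.g. points to an audio format      */
} avHeader_t;

typedef struct {                 /* tmAvBufferDescriptor_t              */
    uint32_t bufSize;            /* capacity: 18432 bytes in the figure */
    uint32_t dataSize;           /* bytes actually valid                */
    uint8_t *data;
} bufferDescriptor_t;

typedef struct {                 /* tmAvPacket_t: header plus buffers   */
    avHeader_t         *header;
    uint32_t            allocatedBuffers;
    uint32_t            buffersInUse;
    bufferDescriptor_t *buffers; /* array of allocatedBuffers entries   */
} avPacket_t;
```

The point of the fixed layout is that any receiver can interpret a packet from any sender by first dereferencing the header and its format descriptor.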
A simple TSSA example

(Excerpt from Chapter 8, "Developing Applications Using a Streaming Model", Book 3—Software Architecture, Part B:)

Sample Application

The best way to understand how the streaming model works is to study a sample application using components in the TSSA OL layer. This chapter contains pseudo code from a basic TSSA application that uses three generic asynchronous components: a digitizer, a processor, and a renderer. After the basic application has been explained in its entirety, more advanced issues of TSSA applications will be presented starting on page 41.

[Figure 8: Overall flow of the sample application, showing a command/response queue between the application and each component (digitizer, filter, renderer) and bidirectional data queues between components: full buffers flow downstream, empty buffers flow back upstream.]

Following is the pseudo code of the sample application. Each part of the code will be explained on the following pages. Although error checking is critical in any TSSA application, it is omitted from this example for clarity.

    void tmosMain()
    {
        Int digInst;
        Int procInst;
        Int rendInst;
        ptmolDigitizerInstanceSetup_t digSetup;
        ptmolProcessorInstanceSetup_t procSetup;
        ptmolRendererInstanceSetup_t  rendSetup;
        ptmolDigitizerCapabilities_t  digCap;
        ptmolProcessorCapabilities_t  procCap;
        ptmolRendererCapabillities_t  rendCap;
        ptsaInOutDescriptor_t         iodesc1;
        ptsaInOutDescriptor_t         iodesc2;
        ptsaInOutDescriptorSetup_t    iosetup;

        tmAvFormat_t lformat = {
            sizeof(tmAvFormat_t),  /* size           */
            0,                     /* hash           */
            0,                     /* referenceCount */
            avdcGeneric,           /* dataClass      */
            ... [the listing is cut off in this transcription]

The following is an example of the scheduling of a number of tasks in a TSSA audio application. The tasks involved are the root task, the idle task, the processor component task, and their interaction with two interrupt service routines (an audio digitizer and an audio renderer).

[Figure 9: Processor control flow in a TSSA system: digitizer (interrupt), processor (task), renderer (interrupt), application (task) and idle task over a timeline of 5 ms = 256 audio samples; the animation TIMELINE.MOV shows the dynamic interaction between components, packets, and the application.]

Following are descriptions of the task switches illustrated in Figure 9:

Task switch t0: The audio input interrupt triggers the audio digitizer. The digitizer procures an empty packet, fills it, and sends it to the filter.
Task switch t1: The filter task has been waiting for a full packet. Since a pSOS call was made in the digitizer's ISR, the pSOS scheduler is invoked on return from the ISR. Control is transferred to the waiting filter task.
Task switch t2: The filter finishes processing. An input buffer has been consumed and an output buffer is produced. Control is transferred to the next waiting task, which is the application (root task).
Task switch t3: The application completes its processing and control is transferred to some other task.
Task switch t4: The audio renderer is triggered by the audio-out interrupt service request. The audio renderer consumes a full buffer and produces an empty buffer.
Description of the application

    // Initialize each module
    ...
    // Initialize the connections
    iodesc1 = fct( packet size, etc. ... );

    // Connect the modules
    digSetup->outputDescriptors[0] = iodesc1;
    procSetup->inputDescriptors[0] = iodesc1;
    ...

    // Start the application
    tmolDigitizerStart( digInst );
    tmolProcessStart  ( procInst );
    tmolRendererStart ( rendInst );
    ...
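The shape of such an application (components exchanging full packets downstream and recycled empty packets upstream through queues) can be mimicked in a few lines of ordinary C. Everything below, from the queue type to the three component functions and the x2 "filter", is a self-contained illustration of the streaming model, not the TSSA API:

```c
#include <assert.h>

enum { QCAP = 4 };

typedef struct { int buf[QCAP]; int head, count; } queue_t;

static void q_push(queue_t *q, int v)
{
    assert(q->count < QCAP);
    q->buf[(q->head + q->count++) % QCAP] = v;
}

static int q_pop(queue_t *q)
{
    int v;
    assert(q->count > 0);
    v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return v;
}

/* Each connection carries full packets one way and recycled empty
   packets the other, as in Figure 8. A packet is just an int here. */
static queue_t dig_full, dig_empty;    /* digitizer -> filter   */
static queue_t rend_full, rend_empty;  /* filter    -> renderer */
static int last_rendered;

static void pipeline_init(void)
{
    for (int i = 0; i < QCAP; i++) {   /* prime the empty-packet queues */
        q_push(&dig_empty, 0);
        q_push(&rend_empty, 0);
    }
}

static void digitizer(int sample)      /* procure empty, fill, send  */
{
    q_pop(&dig_empty);
    q_push(&dig_full, sample);
}

static void filter(void)               /* consume input, produce x2  */
{
    int in = q_pop(&dig_full);
    q_push(&dig_empty, 0);             /* recycle the input packet   */
    q_pop(&rend_empty);
    q_push(&rend_full, 2 * in);
}

static void renderer(void)             /* consume full, recycle it   */
{
    last_rendered = q_pop(&rend_full);
    q_push(&rend_empty, 0);
}
```

Calling `digitizer(s); filter(); renderer();` drives one packet through the chain; in the real system each component is a separate task or interrupt service routine and the queues block, which is what the pSOS scheduling trace illustrates.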
Objectives

An animated 3D image overlaid on a video stream.
Organization of the application

[Figure: data flow. Video In (hardware) captures the incoming video; the 3D model, driven by the user, goes through preparation (MIPS) and rendering (MIPS, TM32 and the 3D engine) to produce an image; the AICP composes the rendered image with the video input; Video Out (hardware) displays the result.]
Organization of the control

[Figure: hardware/software split. The hardware blocks (3D, Video In, AICP, Video Out) are driven on the TM32 side by the streaming software and on the MIPS side by its OS and I/O; the two processors cooperate through inter-processor communication (RPC + shared memory).]
Execution of the application

[Figure: timeline of one frame, annotated with full/empty states of the image buffers (Vidéo In A/B, 3D, Composition A/B). The next video image arrives and is stored in Vidéo In B while the previous one is being processed; software events (start/end of 3D, start/end of composition, end of output, reception) synchronize the hardware blocks and the software tasks around these buffers.]
The chip

- 35 million transistors
- 0.18 µm process
- 82 clock domains:
  - TM32: 200 MHz
  - PR3940 (MIPS): 150 MHz
  - SDRAM: 143 MHz
- 243 on-chip memories, 750 Kbits in total
- Peripherals are grouped into two blocks: M-PI, T-PI
- Blocks are connected by abutment
… chiplet timing, clock matching, and I/O timing analysis.

To achieve timing closure, we made engineering change orders to the netlist after routing. Following each manipulation step, formal verification ensured that the modified netlist was functionally equivalent to the one after test insertion.

We aligned all clock domains having synchronous chiplet crossings. For example, if the memory interface clock in one chiplet was synchronously connected to the same clock in another chiplet, we phase-aligned these clocks and analyzed the signal paths to meet timing constraints. We achieved clock alignment by tweaking the clock insertion delays, using aligners in the clock module. Similarly, we made the clock trees as structurally identical as possible.

As part of the physical design process, we met design completion and manufacturability goals by implementing techniques such as design rule checks, antenna fixes, track filling, and doubling of vias wherever possible. Figure 4 shows the layout plot for the Viper design's initial version. Table 3 summarizes the major design parameters.

We have learned much from the Viper design experience and trust it will guide us in the future, particularly since the next-generation SOC designs are significantly more complex, calling for still higher levels of integration. Some of our current activities, in addition to regular chip-development tasks, are investigating more efficient on-chip bus architectures and better design-reuse methodologies.
(Excerpt: "Application-Specific SOC Multiprocessors," IEEE Design & Test of Computers.)
[Figure 4: Layout of Viper (PNX8500). Visible blocks include CAB, MPEG, MBS+, VIP1+, VIP2, ICP1 + ICP2 + MMI, conditional access (MSP1 + MSP2), T-PI, M-PI, TM32, 1394, MSP3, and PR3940.]
Table 3. Design statistics.

Parameter           Value
Process technology  TSMC 0.18 µm, six metal layers
Transistors         About 35 million
Instances           1.2 million instances, or 8 million gates
Memories            243 instances, 750-Kbit memory
CPUs                2 (TriMedia TM32 and MIPS PR3940)
Peripherals         50
Clock domains       82
Clock speed         TM32: 200 MHz; PR3940: 150 MHz; SDRAM: 143 MHz
Power               4.5 W
Supply voltage      1.8-V core and 3.3-V I/O
Package             BGA456
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx ◀
- "Portable" multimedia platform
- Playstation 2 & 3
- GPU
Set top box IBM STB03xxx
Set top box IBM STB04xxx
Memory architecture
As more functions are integrated in system-on-a-chip (SOC) designs, memory subsystem performance becomes ever more critical to achieving the requirements of the design. While the number of cores increases, the number of memory controllers usually remains constant for cost and complexity reasons. Adding memory controllers can increase the size and complexity of the memory subsystem design and can increase costs by increasing the chip area and the number of I/Os. By understanding the system demands on the memory subsystem and determining ways to enhance its efficiency, the additional complexity and cost can be avoided.

This article describes a set-top-box (STB) integrated controller design and the process of modeling, analyzing, and improving memory performance for this SOC during the design and simulation phases of the project. While the techniques have been developed in the STB design environment, they can be readily adapted for memory subsystem analysis for a wide range of SOC applications.

Set-Top-Box Overview

A set-top box (STB) is a consumer device used to decode digital video and audio signals from a digital cable or satellite service provider. The STB034xx integrated controller is an example of an SOC for an STB application. The STB034xx SOC is a PowerPC 405 processor-based design and includes the following subsystems: 405 processor with I-cache and D-cache, digital audio/video decoders, memory interface, direct memory access (DMA) controller, and peripherals. An overview of the architecture is shown in Figure 1.

The focus of the performance analysis for this design was on the SDRAM memory interface. The MPEG-2 video decoder, 405 processor, and DMA controller can each place significant demands on memory bandwidth. For example, decoding video with on-screen display (OSD) planes active can require over 90 MB/s. Typically, the continuous bandwidth requirements of the DMA controller would not be as demanding, but it can still have bursts of high-data-rate activity. Requirements for 405 bandwidth are application-dependent. Usually, the audio decoder and digital video broadcast (DVB) transport demultiplexer will require only about five MB/s.
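The 90-MB/s figure can be sanity-checked with a back-of-the-envelope traffic model. The frame size follows from the SD 4:2:0 format; the per-frame access multipliers and the OSD share below are assumptions for illustration, not measured STB034xx numbers:

```python
# Rough SDRAM-bandwidth accounting for SD MPEG-2 decode (assumed model).
W, H, FPS = 720, 480, 30
BYTES_PER_PIXEL = 1.5            # 4:2:0 chroma subsampling

frame_bytes = W * H * BYTES_PER_PIXEL          # 518,400 bytes per frame

# Assumed per-frame memory traffic (illustrative multipliers):
#   1x write of the decoded frame, ~2x reads for motion compensation,
#   1x read for the display refresh.
traffic_per_frame = frame_bytes * (1 + 2 + 1)
osd_bytes_per_s = 20e6                          # assumed OSD-plane contribution

total_MB_per_s = (traffic_per_frame * FPS + osd_bytes_per_s) / 1e6
```

Under these assumptions the aggregate lands in the same range as the article's figure, which is why the SDRAM interface, rather than raw compute, drives the analysis.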
The STB034xx memory subsystem includes a crossbar switch and two SDRAM controllers. A crossbar switch core provides nonblocking access to system memory. This architecture is flexible because cores connected to either of the processor local buses (PLBs) may access either of the two SDRAM interfaces, thus enabling the even distribution of the bandwidth load between the two interfaces.

The SOC audio and video decode functions have to meet a real-time deadline, thereby determining when the needed data must reach the video/audio decoder. Provisions were made in the SDRAM controllers to handle the overall bandwidth of the design and to meet these real-time requirements.

Although the memory controllers have a theoretical bandwidth that is well above the aggregate requirements, queuing and latency variations during actual operation can reduce the efficiency of the memory subsystem performance.
MicroNews, Fourth Quarter 2001.
"Memory Performance Analysis in a System-on-a-Chip Design," Everett G. Vail III, Steven B. Herndon, Dennis E. Franklin, Eric M. Foster, and Jeffrey S. Graves.
[Figure 1: Set-top-box system-on-a-chip architecture. A PowerPC 405 (16-KB I-cache, 8-KB D-cache) and peripherals sit on two processor local buses (PLBs); a crossbar switch links them to the SDRAM 0 and SDRAM 1 controllers (2/8-MB SDRAM each) and a 16-bit SRAM interface (1-MB flash); a DMA controller, the MPEG-2/DVB transport demultiplexer fed by the satellite or cable input stream, MPEG-2 video and audio decoders, an NTSC/PAL digital encoder driving the TV/VCR, and an external graphics and video (EGV) port complete the design.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform ◀
- Playstation 2 & 3
- GPU
Nomadik: portable video
STn8815 block diagram
[Figure: STn8815 block diagram. An ARM926EJ with I/D caches and L2 cache; smart video, audio, imaging and graphics accelerators with an eSRAM buffer; DDR SDRAM and NAND/NOR flash controllers, multichannel DMA, secured RAM/ROM and a security toolbox; display and capture interfaces (color LCD controller, TV output, camera interfaces, USB-OTG high speed, host port interface); peripherals (I2C, HSI, MSP, UART, SSP, FIrDA, 1-wire/HDQ, GPIO, SD/MMC/MS, rotary encoder and keypad interfaces); timers, watchdog, RTC, interrupt controller, PLLs, system controller, power management and JTAG/trace.]
Nomadik architecture is based on the distributed processing of audio, video and imaging, advanced security features, pervasive low-power techniques for increased autonomy, and an optimized memory architecture for the best cost/performance ratio. The STn8815 combines an ARM9 core up to 332 MHz with level-two cache to audio, video, imaging and graphics accelerators, allowing both low-power multimedia performance and powerful general-purpose software processing and OS support.

The STn8815 smart imaging accelerator (SIA) delivers impressive multimedia quality without sacrificing battery life. It operates as a real-time, programmable image-reconstruction engine at up to 80 Mpixel/s. This capability enables camera-phone systems, based on 5-Mpixel sensors, to execute noise reduction, autofocus and exposure control, and other fundamental algorithms, therefore eliminating the need for an external imaging coprocessor, and reducing the system BOM (bill of materials). The SIA, coupled with the smart video accelerator (SVA), which is capable of 30-Mpixel/s JPEG-image encoding, allows impressive multi-shot camera performance, as well as low-power video encoding. The STn8815 integrates two SMIA (standard mobile imaging architecture), CCP2 (compact camera port 2) camera interfaces, and supports 10-bit raw Bayer RGB data formats.

The Nomadik platform gives customers the ease to differentiate products, an openness to industry standards (such as OMA and MIPI), and expertise in open-OS complex platform integration. Nomadik is reinforced by a full system offering with multiple connectivity, camera, energy management, TV out, and companion devices.

STn8815-based development kits (NDK-15, NHK-15) offer a complete, flexible design environment, including a rich set of peripherals such as cameras, audio codecs, wired and wireless connectivity, and LCD displays, among others. A complete set of development tools is available from ST-Ericsson or tools partners (ARM Ltd, Lauterbach) to support a full range of application development, from firmware customization up to OS-level applications.
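The datasheet rates translate directly into sustained frame rates for the quoted 5-Mpixel sensor; a quick arithmetic check on the numbers above:

```python
# Frame rates implied by the STn8815 accelerator throughputs.
SENSOR_MPIXEL = 5.0        # 5-Mpixel camera sensor
SIA_MPIXEL_S = 80.0        # smart imaging accelerator: 80 Mpixel/s
SVA_MPIXEL_S = 30.0        # smart video accelerator: 30 Mpixel/s JPEG encode

sia_fps = SIA_MPIXEL_S / SENSOR_MPIXEL   # full-resolution reconstructions/s
sva_fps = SVA_MPIXEL_S / SENSOR_MPIXEL   # full-resolution JPEG encodes/s
```

So the SIA can reconstruct 16 full-resolution frames per second and the SVA can JPEG-encode 6, which is what makes the multi-shot use case possible without an external coprocessor.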
STn8815 product table

Part number   Package type                    Package size       Availability
STn8815A09    Standalone                      9 x 9 x 1.0 mm     Production
STn8815A12    Standalone                      12 x 12 x 1.2 mm   Production
STn8815P14    MAP PoP (package-on-package)    14 x 14 x 1.0 mm   Engineering samples
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context ◀
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell
- GPU
General architecture
Performance
Triple-processor system at 250 MHz:
- one 128-bit processor
- two vector processors
→ 6.2 Gflops

Mixed shared/distributed memory architecture:
- bandwidth to main memory: 500 MB/s
- internal bandwidth: 2 GB/s

3D performance:
- 75 million polygons per second
- 2.4 GPixels/s
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture ◀
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell
- GPU
Overview
[Figure: Emotion Engine overview. A 128-bit CPU with FPU, the vector units VU0 and VU1, an image-processing unit, the GIF, DMA, the RDRAM interface and I/O; the CPU, VU0 and VU1 run at 250 MHz, all other modules and buses at 125 MHz; the main bus is 128 bits wide, with 64-bit and 32-bit side buses.]
Main processor
[Figure: main processor. A 32 x 128-bit register file with ITLB/DTLB; two integer pipes, a load/store pipe, a branch pipe, a COP FPU pipe and the COP2/VU0 pipe; 16-KB I-cache, 8-KB D-cache and 16-KB SPRAM; a response buffer, WBB and UCAB feeding a 64-bit bus interface unit on the system bus.]
VU1 (and VU0) architecture
[Figure: VU1 (and VU0) architecture. Four FMACs (x, y, z, w) and an FDIV, plus units absent from VU0 (extra FMAC, FDIV, EFU, IALU, LSU); upper and lower instruction streams; 32 x 128-bit floating-point registers and 16 x 16-bit integer registers; 16-KB (4-KB on VU0) instruction and data memories; a VIF link to the system bus and a 128-bit path to the GIF.]
Memory architecture
[Figure: memory architecture. The CPU (16-KB I-cache, 8-KB D-cache, 16-KB SPRAM), VU0 (4-KB instruction and data RAMs) and VU1 (16-KB instruction and data RAMs) with their register files; VIF0/VIF1, the GIF with four buffers (Buff0 to Buff3), the DMA engine and the RDRAM, linked by 128-bit, 64-bit and 32-bit paths.]
Partitioning of the 3D pipeline
[Figure: partitioning of the 3D pipeline. CPU+VU0: database traversal, model transformation, lighting, culling; VU1: perspective transformation and projection to pixels; Graphic Synthesizer: texture application, Z-buffer and final image.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture ◀
  - EE: the circuit
  - PS3 & Cell
- GPU
Programming model
VPU0 can operate either as a coprocessor of the CPU or autonomously; VPU1 operates exclusively in autonomous mode.

The VIF of each VU receives data packets to be stored either in the program memory or in the associated data memory, according to an identification code. This code can also give the address of a program to launch.

Programs for VPU0 and VPU1 are written in assembly language.
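The VIF dispatch described above can be sketched as a small tag-driven router. The id codes and field layout here are purely illustrative (the real VIF code encoding is more involved):

```python
# Illustrative VIF-style dispatcher: an id code in each packet header
# selects the destination memory, and may request a program start.
PROGRAM, DATA, START = 0, 1, 2    # hypothetical id codes

class VUFront:
    def __init__(self):
        self.program_mem = {}     # stand-in for the VU program memory
        self.data_mem = {}        # stand-in for the VU data memory
        self.pc = None            # address of a launched microprogram

    def receive(self, packet):
        code, addr, payload = packet
        if code == PROGRAM:
            self.program_mem[addr] = payload
        elif code == DATA:
            self.data_mem[addr] = payload
        elif code == START:
            self.pc = addr        # the code can name a program to launch

vu = VUFront()
vu.receive((PROGRAM, 0x000, ["microcode..."]))
vu.receive((DATA, 0x100, [1.0, 2.0, 3.0, 4.0]))
vu.receive((START, 0x000, None))
```

The point is that the host never pokes VU memories directly: it streams tagged packets, and the front end routes and launches them.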
Data flow
[Figure: data-flow configurations. Several DMA-driven routings between the RDRAM, the CPU and its scratchpad (SPR), VPU0, VU1, the IPU and the GIF, combining load/store accesses and DMA transfers.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit ◀
  - PS3 & Cell
- GPU
The circuit
(Figures: 1999 IEEE International Solid-State Circuits Conference, © IEEE 1999.)
[Figures 15.1.1 to 15.1.5: chip block diagram, vector operation units block diagram, IPU block diagram, example of data flow, and die micrograph. Labeled blocks include the superscalar RISC core with I-cache, D-cache and SPR, VU0 and VU1 with their memories and FMAC/FDIV arrays, the GIF, the IPU, the ALU/shifter/LSU and DMAC, and the memory and I/O interfaces.]
- 10.5 million transistors
- 17 x 14.1 mm² in 0.18 µm
- 15 W
- 250 MHz (CPU, VU0, VU1)
- 125 MHz (buses and other modules)
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell ◀
- GPU
PS3 architecture
[Figure: PS3 architecture. The Cell at 3.2 GHz connects to 256 MB of XDR DRAM at 25.6 GB/s and to the RSX (20 GB/s out, 15 GB/s in); the RSX drives 256 MB of GDDR3 at 22.4 GB/s and the HD/SD AV output; an I/O bridge (2.5 GB/s each way) serves the BD/DVD/CD drive, six USB 2.0 ports, Gbit Ethernet/WiFi, removable storage (Memory Stick, SD, CF) and the Bluetooth controller.]
Cell: block diagram
Cell: EIB
… The two-way double-precision floating-point operation is partially pipelined, so its instructions issue at a lower rate (two double-precision flops every seven SPU clock cycles). When using single-precision floating-point fused multiply-add instructions (which count as two operations), the eight SPUs can perform a total of 64 operations per cycle.

Communication architecture

To take advantage of all the computation power available on the Cell processor, work must be distributed and coordinated across the PPE and the SPEs. The processor's specialized communication mechanisms allow efficient data collection and distribution as well as coordination of concurrent activities across the computation elements. Because the SPU can act directly only on programs and data in its own local store, each SPE has a DMA controller that performs high-bandwidth data transfer between local store and main memory. These DMA engines also allow direct transfers between the local stores of two SPUs for pipeline or producer-consumer-style parallel applications.

At the other end of the spectrum, the SPU can use either signals or mailboxes to perform simple low-latency signaling to the PPE or other SPEs. Supporting more-complex synchronization mechanisms is a set of atomic operations available to the SPU, which operate in a similar manner as the PowerPC architecture's lwarx/stwcx atomic instructions. In fact, the SPU's atomic operations interoperate with PPE atomic instructions to build locks and other synchronization mechanisms that work across the SPEs and the PPE. Finally, the Cell allows memory-mapped access to nearly all SPE resources, including the entire local store. This provides a convenient and consistent mechanism for special communications needs not met by the other techniques.

The rich set of communications mechanisms in the Cell architecture enables programmers to efficiently implement widely used programming models for parallel and distributed applications. These models include the function-offload, device-extension, computational-acceleration, streaming, shared-memory-multiprocessor, and asymmetric-thread-runtime models.8

Element interconnect bus

Figure 2 shows the EIB, the heart of the Cell processor's communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands (requests to …)
(IEEE Micro, May-June 2006.)
[Figure 2: Element interconnect bus (EIB). The PPE, SPE0 to SPE7, the MIC and the IOIF0/IOIF1 and BIF interfaces (BIF: broadband interface; IOIF: I/O interface) all attach to the data network under a data bus arbiter.]
Cell: DMA performance
… DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).

6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB, which then informs the BIU that the command was accepted and data transfer can begin.

7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.

8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles: the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention.

The overall latency of generating the DMA command, initially selecting the command, …
[Figure 3: Basic flow of a DMA transfer. The SPU (3.2 GHz) with its channel interface and local store (LS); the MFC (1.6 GHz) with DMAC, SPU and proxy command queues, MMU/TLB, MMIO and BIU; the EIB (204.8 Gbytes/s) and the MIC (25.6 Gbytes/s in/out) to off-chip memory; local-store bandwidth is 51.2 Gbytes/s, and numbered steps (1) to (7) trace a command from issue to data transfer.]
… between puts and gets to local store, because local-store access latency was remarkably low: only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory and 15 Gbytes/s for gets from main memory.

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.

Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking (the most constrained case) and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figures 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
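The shape of these curves follows a simple fixed-overhead model: each blocking DMA pays a startup latency, so small messages are latency-bound, large messages approach the link peak, and batching amortizes the startup cost. A sketch of that model (parameters chosen to echo the measurements, not taken from the paper):

```python
# Effective DMA bandwidth under a fixed-overhead model (assumed parameters).
PEAK_GB_S = 25.6       # EIB endpoint capacity, GB/s (1 GB/s == 1 byte/ns)
LATENCY_NS = 90.0      # per-command startup latency, order of Table 1

def effective_gbytes_s(size_bytes, batch=1):
    """Bandwidth for `batch` pipelined DMAs of `size_bytes` each."""
    transfer_ns = size_bytes / PEAK_GB_S       # wire time for one DMA
    # One startup latency for the batch, then back-to-back transfers.
    total_ns = LATENCY_NS + batch * transfer_ns
    return batch * size_bytes / total_ns       # bytes/ns == GB/s

small = effective_gbytes_s(128)            # latency-dominated
large = effective_gbytes_s(16384)          # approaches the peak
batched = effective_gbytes_s(256, batch=16)
```

The model reproduces the qualitative behavior of Figure 4 and of the batched runs: tiny transfers waste most of the time in startup, while batching intermediate sizes recovers a large fraction of the peak.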
Collective DMA performance

Parallelization of scientific applications generates far more sophisticated collective …
Table 1. DMA latency components for a clock frequency of 3.2 GHz.

Latency component                 Cycles   Nanoseconds
DMA issue                             10     3.125
DMA to EIB                            30     9.375
List element fetch                    10     3.125
Coherence protocol                   100    31.25
Data transfer for inter-SPE put      140    43.75
Total                                290    90.625
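The nanosecond column of Table 1 is just the cycle count divided by the 3.2-GHz clock; a quick check that the components and the total are consistent:

```python
# Convert Table 1 DMA latency components from cycles to nanoseconds.
CLOCK_GHZ = 3.2
components = {
    "DMA issue": 10,
    "DMA to EIB": 30,
    "List element fetch": 10,
    "Coherence protocol": 100,
    "Inter-SPE put data transfer": 140,
}

ns = {name: cycles / CLOCK_GHZ for name, cycles in components.items()}
total_cycles = sum(components.values())          # 290 cycles
total_ns = total_cycles / CLOCK_GHZ              # 90.625 ns
```

So roughly a third of the end-to-end latency is the coherence protocol and almost half is the inter-SPE data movement itself.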
[Figure 4: Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention. Message sizes span 4 bytes to 16,384 bytes; latencies reach about 1,200 ns and bandwidths about 25 Gbytes/s, with separate curves for blocking get/put to main memory and to SPE local store.]
Cell: the circuit
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3
- GPU ◀
(GP)GPU objectives
- Unify the processors involved in the graphics pipeline
- Make them programmable for general-purpose use

Their architecture remains shaped by the graphics pipeline:

- massive parallelism for pixel computations
- a memory hierarchy adapted to it
- built-in support for systematic operations (texture interpolation and filtering, triangle rasterization, etc.)
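Pixel computation is embarrassingly parallel: the same small function runs independently on every pixel, which is the workload the SP arrays below are organized around. A toy CPU-side illustration (a GPU would launch one thread per pixel instead of looping):

```python
# Data-parallel pixel shading: one independent evaluation per pixel.
W, H = 4, 2

def shade(x, y):
    """Toy 'shader': a horizontal gradient from 0 to 255."""
    return (255 * x) // (W - 1)

# Conceptually, a GPU launches W*H threads, one per (x, y) coordinate;
# no pixel depends on another, so any execution order gives the same image.
framebuffer = [[shade(x, y) for x in range(W)] for y in range(H)]
```

Because no pixel depends on another, throughput scales almost linearly with the number of processing elements, which is exactly what makes the wide SP arrays worthwhile.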
GeForce 8, 9 and GT200 (Tesla): overall architecture
GeForce Tesla: TPC architecture
… work distribution is based on the pixel location.

Streaming processor array

The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster

As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller

The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for …
Figure 2. Texture/processor cluster (TPC).
(Hot Chips 19; IEEE Micro.)
GeForce Tesla: SP architecture
GeForce Tesla: circuit
… market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.

CUDA C/C++ applications executing on Tesla computing platforms, Quadro workstations, and GeForce GPUs deliver compelling computing performance on a range of large problems, including more than 100x speedups on molecular modeling, more than 200 Gflops on n-body problems, and real-time 3D magnetic-resonance imaging.12-14 For graphics, the GeForce 8800 GPU delivers high performance and image quality for the most demanding games.15

Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include:

- 681 million transistors, 470 mm²
- TSMC 90-nm CMOS
- 128 SP cores in 16 SMs
- 12,288 processor threads
- 1.5-GHz processor clock rate
- peak 576 Gflops in processors
- 768-Mbyte GDDR3 DRAM
- 384-pin DRAM interface
- 1.08-GHz DRAM clock
- 104-Gbyte/s peak bandwidth
- typical power of 150 W at 1.3 V
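The listed peaks are consistent with the clocks and widths: 576 Gflops corresponds to 128 SPs issuing 3 flops per cycle at 1.5 GHz (a multiply-add plus a multiply, as commonly counted for this generation), and 104 Gbytes/s to a 384-bit GDDR3 interface at 1.08 GHz, double data rate:

```python
# Sanity-check the GeForce 8800 Ultra peak numbers from the spec list.
SP_CORES = 128
SP_CLOCK_GHZ = 1.5
FLOPS_PER_CYCLE = 3       # multiply-add + multiply dual issue (common count)

peak_gflops = SP_CORES * SP_CLOCK_GHZ * FLOPS_PER_CYCLE      # 576 Gflops

BUS_BITS = 384            # the "384-pin DRAM interface"
DRAM_CLOCK_GHZ = 1.08
DDR = 2                   # GDDR3 transfers on both clock edges

peak_gbytes_s = BUS_BITS / 8 * DRAM_CLOCK_GHZ * DDR          # ~103.7 GB/s
```

Both derived values round to the figures in the list above, which is a useful habit when reading GPU spec sheets.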
The Tesla architecture is the first ubiquitous supercomputing platform. NVIDIA has shipped more than 50 million Tesla-based systems. This wide availability, coupled with C programmability and the CUDA software development environment, enables broad deployment of demanding parallel-computing and graphics applications.

With future increases in transistor density, the architecture will readily scale processor parallelism, memory partitions, and overall performance. Increased numbers of multiprocessors and memory partitions will support larger data sets and richer graphics and computing, without a change to the programming model.

We continue to investigate improved scheduling and load-balancing algorithms for the unified processor. Other areas of improvement are enhanced scalability for derivative products, reduced synchronization and communication overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency.
Acknowledgments
We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.
References
1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
Figure 9. GeForce 8800 Ultra die layout.
HOT CHIPS 19
71/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: Overall Architecture
72/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: SM Architecture
73/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: Chip
74/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
75/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
76/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
77/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 4xxx vs 5xxx
78/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Architecture des SoC
Etudes de cas - Multimédia
S. Mancini
Detailed Outline

6 Context
q Overview
P Applications
P Specific features
q Video standards
P Video formats
P MPEG2
q 3D
P Principle of Z-buffer 3D
P 3D pipeline
P Performance of 3D accelerators

6 Set-top box PNX8500
q Environment
P Environment
P Application class
P Overall architecture
q The processors
P MIPS
P TriMedia: datapaths
P TriMedia: memory architecture
P TriMedia: system mechanisms
q Buses, memory and 3D
P The buses
P Bus architecture
P Memory architecture
P "System" architecture
P Partitioning of the 3D pipeline
P Video
q Software architecture
P Operating systems
P Streaming software
P Standardized data managed by TSSA
P A simple TSSA example
P Description of the application
q Example application
P Goals
P Organization of the application
P Organization of the control
P Execution of the application
q The chip
P The chip

6 Set-top box IBM STB03xxx
P Set-top box IBM STB03xxx
P Set-top box IBM STB04xxx
P Memory architecture

6 "Portable" multimedia platform
P Nomadik portable video

6 Playstation 2 & 3
q PS2: context
P Overall architecture
P Performance
q EE: hardware architecture
P Overview
P Main processor
P Architecture of VU1 (and VU0)
P Memory architecture
P Partitioning of the 3D pipeline
q EE: software architecture
P Programming model
P Data flow
q EE: the chip
P The chip
q PS3 & Cell
P PS3 architecture
P Cell: block diagram
P Cell: EIB
P Cell: DMA performance
P Cell: the chip

6 GPU
P Goals of (GP)GPUs
P GeForce 8, 9 and GT200 (Tesla): overall architecture
P GeForce Tesla: TPC architecture
P GeForce Tesla: SP architecture
P GeForce Tesla: chip
P Fermi: overall architecture
P Fermi: SM architecture
P Fermi: chip
P ATI HD 5xxx
P ATI HD 5xxx
P ATI HD 5xxx
P ATI HD 4xxx vs 5xxx
81