SoC Architecture
Case studies: Multimedia
S. Mancini
Outline
- Context
  - Presentation
  - Video standards
  - 3D
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- PlayStation 2 & 3
- GPU
- SoC Architecture - Case studies - Multimedia
Applications
- Internet
- Education, games, etc.
- Interactive television
  - MPEG-4
  - Teleshopping
- Telecommunications
  - Teleconferencing
  - Videophony
- Medical imaging
Specific characteristics
- Low-level processing is regular and requires large computing power:
  - compression/decompression
  - composition
  - filtering
  - 3D (textures, etc.)
- Synchronizing the contents is complex and irregular.

Multimedia systems must therefore combine:
- application-specific operators, and
- microprocessors
Video formats

Uncompressed video formats for digital television, up to high definition:
- SD - CCIR BT-656: 625 or 525 lines of up to 720 pixels; 4:2:2 sampling (Cb, Y, Cr, Y, ...); data rate: 27 MB/s
- HD - 1080p: 1920 x 1080 at 24 or 30 Hz; 4:2:0 sampling; data rate: 52 Mpixel/s = 78 MB/s
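These data rates are simple arithmetic: samples per line x lines x frame rate x bytes per sample (2 bytes per pixel for 8-bit 4:2:2, 1.5 for 4:2:0). A small sketch of the computation; note that the SD figure of 27 MB/s only comes out if blanking is included, using the ITU-R BT.601 raster of 864 samples per line over 625 lines (an assumption not stated on the slide), and the HD figure matches the slide's 52 Mpixel/s at a 25 Hz rate:

```c
/* Raw video data rate in MB/s for a given raster and sampling scheme.
   bytes_per_sample: 2.0 for 8-bit 4:2:2 (Cb,Y,Cr,Y,...), 1.5 for 4:2:0. */
static double video_rate_mbytes(long samples_per_line, long lines,
                                double fps, double bytes_per_sample)
{
    return (double)samples_per_line * lines * fps * bytes_per_sample / 1e6;
}

/* SD, BT.656 over the full 625-line BT.601 raster (864 samples per line,
   blanking included), 4:2:2:
       video_rate_mbytes(864, 625, 25.0, 2.0)   -> 27 MB/s
   HD, 1080p at 25 Hz, active pixels only, 4:2:0:
       video_rate_mbytes(1920, 1080, 25.0, 1.5) -> 77.76 MB/s (~78 MB/s) */
```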
MPEG-2

A compressed video format. MPEG-2 defines several resolutions and bitrates:

Level       Max. sampling dimensions   Pixels/sec   Max. bitrate
Low         352 x 240 x 30 fps         3.05 M       4 Mb/s
Main        720 x 480 x 30 fps         10.40 M      15 Mb/s
High 1440   1440 x 1152 x 30 fps       47.00 M      60 Mb/s
High        1920 x 1080 x 30 fps       62.70 M      80 Mb/s
Principle of Z-buffer 3D

[Figure: a lit 3D scene is projected onto the screen plane in front of the observer; along the projection axis, Z gives the depth of each point.]
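The algorithm behind the figure fits in a few lines: the Z-buffer stores, per pixel, the depth of the closest fragment drawn so far, and a new fragment is kept only if it is closer. A minimal sketch (the resolution and the smaller-z-is-closer convention are arbitrary choices for illustration):

```c
#include <float.h>
#include <stdint.h>

enum { W = 320, H = 240 };

static float    zbuf[W * H];   /* depth of the closest fragment so far */
static uint32_t frame[W * H];  /* color of the closest fragment so far */

/* Reset every pixel to "infinitely far" before drawing a scene. */
static void zbuf_clear(void)
{
    for (int i = 0; i < W * H; i++) {
        zbuf[i]  = FLT_MAX;
        frame[i] = 0;
    }
}

/* Depth test: write the fragment only if it is nearer than what
   the Z-buffer already holds for that pixel. */
static void put_fragment(int x, int y, float z, uint32_t color)
{
    int i = y * W + x;
    if (z < zbuf[i]) {
        zbuf[i]  = z;
        frame[i] = color;
    }
}
```

Whatever order the scene's triangles are drawn in, each pixel ends up showing the fragment nearest to the observer.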
The 3D pipeline

[Figure: starting from the model database, the pipeline chains model transformation, lighting, perspective transformation, clipping, projection to pixels, texture application and the Z-buffer to produce the final image.]
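In textbook form, the geometry stages of this chain are 4x4 matrix products applied to each vertex, followed by a perspective divide and a viewport mapping for the projection to pixels. The sketch below shows only that skeleton; the matrix contents and the viewport convention are generic, not the PNX8500's:

```c
typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; }   mat4;

/* One geometry stage: multiply a homogeneous vertex by a 4x4 matrix
   (model, viewing or perspective transformation). */
static vec4 transform(const mat4 *t, vec4 v)
{
    float in[4] = { v.x, v.y, v.z, v.w };
    float out[4];
    for (int r = 0; r < 4; r++)
        out[r] = t->m[r][0] * in[0] + t->m[r][1] * in[1]
               + t->m[r][2] * in[2] + t->m[r][3] * in[3];
    return (vec4){ out[0], out[1], out[2], out[3] };
}

/* Projection to pixels: perspective divide, then map [-1,1] to the screen. */
static void project(vec4 v, int width, int height, float *px, float *py)
{
    *px = (v.x / v.w * 0.5f + 0.5f) * (float)width;
    *py = (v.y / v.w * 0.5f + 0.5f) * (float)height;
}
```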
Performance of 3D accelerators

The performance of a 3D accelerator depends essentially on the bandwidth to the texture memories and on the number of pixels displayed per second.

Range                 Clock (MHz)   Memory bandwidth      Pixels/s
GeForce 3 (2001)      175           2.4 GB/s              800 M
GeForce 6 (2004)      500           16 GB/s               4 G
GeForce 7 (2005)      600           51 GB/s               10 G
GeForce 8 (2007)      612/1500      51.2 GB/s             15 G
GeForce 9 (2008)      700/1500      51 GB/s               15 G ??
GTX 660 (2012)        980           144 GB/s              800 G ??
Console (2001)        300           10 GB/s               4 G
Console (2006)        3200          25.6 GB/s + 20 GB/s   16 G
Professional (2001)   500           32*11.2 GB/s          32*6.1 G
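These figures become meaningful against the demand side: refreshing a screen of a given resolution at a given rate, where each pixel is drawn on average `overdraw` times, needs resolution x refresh x overdraw pixels per second. A sketch of that estimate (the resolution, refresh and overdraw values are illustrative):

```c
/* Pixels per second needed to fill a screen, counting average overdraw. */
static double fill_rate_needed(int width, int height,
                               double refresh_hz, double overdraw)
{
    return (double)width * height * refresh_hz * overdraw;
}

/* e.g. 1280x1024 at 85 Hz is ~111 Mpix/s per pass, so the GeForce 3's
   800 Mpix/s fill rate leaves room for about 7x average overdraw. */
```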
Outline
- Context
- Set-top box PNX8500
  - Environment
  - The processors
  - Bus, memory and 3D
  - Software architecture
  - Example application
  - The chip
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- PlayStation 2 & 3
- GPU
Environment
Application class

The PNX8500 is dedicated to multimedia applications. It manages video streams, performing operations such as conversion, decompression and stream composition.

The on-chip CPUs ease the implementation of interactive environments. 2D acceleration enables windowing systems, and 3D acceleration allows the use of synthetic scenes.
General architecture

(Excerpt from "NEXPERIA PNX8500 Conceptual Architecture":)

[...] arbiter uses a programmable, cyclical list-based arbiter that assigns unused list slots to the CPUs. This guarantees required latency to all real-time agents while optimizing CPU performance.

COPROCESSORS/ACCELERATORS

The pnx8500 includes on-chip coprocessors for 2D and 3D graphics acceleration, MPEG decoding, image scaling and filtering, and display channel composition. All coprocessors read input and write results to memory.

2D drawing engine: The pnx8500 2D drawing accelerator block performs high-speed graphics operations including fast area fills, three-operand bitblt, monochrome data expansion, and line drawing. Monochrome data can be color-expanded to any supported pixel format. Anti-aliased lines are supported through a 16-level alpha-blend bitblt. A full 256-level alpha bitblt supports blending of source and destination images. The 2D engine can also be used as a generic DMA engine to transfer data between memory locations on a byte-aligned basis.

3D rendering engine: An on-chip 3D drawing engine accelerates texture-mapped pixel rendering. It implements Microsoft-compatible DirectX 6.1 3D functions, such as bilinear and trilinear texture filtering. Rendering speeds up to 60 Mpix/sec for bilinear- and trilinear-textured, mip-mapped, z-buffered, blended triangles deliver excellent 3D graphics quality for ASTB or television applications. In 3D operations, the TM32 core performs triangle setup and controls the hardware 3D engine; the MIPS core performs transformation and lighting.

High-level MPEG-2 decoder: A high-level MPEG-2 decoder block performs MPEG-2 program stream decoding below slice level. The TM32 core controls the decode process and handles MPEG-2 processing above slice level. BS-D, DVB, and all 18 ATSC formats are supported; three SD streams or one HD stream can be decoded simultaneously. The MPEG-2 decoder is capable of full-resolution decoding or decoding to a 2X horizontally compressed image format, saving memory space and bandwidth.

Memory-based scaler (MBS): The primary function of the MBS block is acceleration of scaling operations. It reads input from and writes results to memory. Compared to an 'on-the-fly' approach, this method decouples the scaling operation from the display, significantly reducing on-demand memory bandwidth requirements. Since the MBS block uses memory as a buffer, it can process many images in the same time required to display a single frame. In addition to basic scaling capabilities, the MBS performs many commonly required image manipulations prior to display, including linear and non-linear aspect ratio conversion (including panorama mode), anti-flicker filtering, de-interlacing, and pixel format conversion.
MIPS

The 32-bit MIPS core is dedicated to global system management.

Datapath units:
- ALU
- integer multiplier-accumulator and divider

Caches: 16 KB instruction, 32 KB data.

Six-stage pipeline at 150 MHz.

It is well suited to running an OS: it includes a memory management unit (MMU).
TriMedia: datapath

The TriMedia is dedicated to video and audio processing. It is a 32-bit VLIW issuing 5 operations per instruction, clocked at 200 MHz, with 27 functional units.

(Excerpt from Chapter 1, "Overview of the TriMedia System", Book 1—Getting Started, © 2000 TriMedia Technologies:)

[Each functional unit of the] TriMedia CPU is assigned to specific issue slots in a TriMedia VLIW instruction. Table 1 gives the functional unit assignments. Because of the number of available functional units and their assignment, some operations may have to wait for one or more cycles before they are executed. This means that in some cases not all issue slots are used. Good use of issue slots is one of the purposes of software code optimization.

What Makes TriMedia's VLIW Architecture Better

The instruction set architecture (the processor programming model) must be distinguished from implementation (the physical chip and its characteristics). VLIW microprocessors and superscalar implementations of traditional instruction sets share some characteristics, such as multiple execution units and the ability to execute multiple operations simultaneously. The techniques used to achieve high performance are different because the parallelism is explicit in VLIW instructions, but must be discovered by hardware at run time by superscalar processors. VLIW implementations are simpler for very high performance. Just as RISC architectures permit simpler, cheaper high-performance implementations than do CISC architectures, VLIW architectures are simpler and cheaper than RISC architectures because of hardware simplifications. However, they require more compiler support.

Table 1: Functional Unit Assignments
(Latency: clock cycles between the issue of an operation and the availability of its results; for Branch, cycles between the branch instruction executing and the branch taking place. Recovery: minimum number of clock cycles between successive operations on the same unit. The marks showing which of the 5 issue slots each unit occupies did not survive this transcription.)

Functional unit   Quantity   Latency   Recovery
Constant          5          1         1
Integer ALU       5          1         1
Load/Store        2          3         1
DSP ALU           2          2         1
DSP MUL           2          3         1
Shifter           2          1         1
Branch            3          3         1
Int/Float MUL     2          3         1
Float ALU         2          3         1
Float Compare     1          1         1
Float sqrt/div    1          17        16
TriMedia: memory architecture

The TriMedia accesses two distinct buses:
- DRAM
- MMIO (on the PI bus)

Caches: 32 KB instruction, 16 KB data; 8-way set-associative with 64-byte blocks. The MMIO region is non-cacheable.

[The memory-map diagram excerpted from the TriMedia documentation is unreadable in this transcription.]

Instructions are stored compressed in external memory.
TriMedia: system mechanisms

The TM32 implements a hardware semaphore mechanism. The semaphores are accessible as registers, both from the core and from the PI bus.

Access procedure:

    write ID to SEM (use a 32-bit store, with the ID in the 12 LSBs)
    retrieve SEM (use a 32-bit load; it returns 0x00000nnn)
    if (SEM == ID) {
        perform a short critical-section action
        write 0 to SEM
    } else
        try again later, or loop back to the write
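In C, this procedure behaves like a try-lock: a store to the semaphore register only "sticks" when the semaphore is free, and reading it back tells you whether you won. The sketch below models the register in software to make the protocol concrete; on the TM32 the retain-while-held store behavior is implemented by the hardware, and the function names here are illustrative:

```c
#include <stdint.h>

/* Software model of one TM32 hardware semaphore register: a nonzero
   store is honored only when the register is free; storing 0 releases. */
static uint32_t sem_reg;

static void sem_store(uint32_t v)
{
    if (v == 0 || sem_reg == 0)
        sem_reg = v & 0xFFFu;        /* the ID lives in the 12 LSBs */
}

static uint32_t sem_load(void)
{
    return sem_reg;                  /* reads back 0x00000nnn */
}

/* The documented access procedure: write our ID, read back, and we own
   the semaphore only if our own ID is what we read. */
static int sem_try_lock(uint32_t id)
{
    sem_store(id);
    return sem_load() == (id & 0xFFFu);
}

static void sem_unlock(void)
{
    sem_store(0);
}
```

On failure a task "tries again later, or loops back to the write", as the documentation puts it; the critical section guarded this way must stay short.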
The buses

- PI bus: three-state system bus, 32 bits at 50 MHz, i.e. about 200 MB/s
- DVP memory bus: point-to-point memory bus, 64 bits at 143 MHz, i.e. about 1 GB/s
(Excerpt from the VHDL PI-Bus ToolKit, Release 3.1:)

3. How to use the PI-Bus ToolKit

The purpose of the PI-Bus ToolKit is to allow the accelerated implementation of PI-Bus systems. The ToolKit provides much of the core of a potential system and facilitates combining the pieces for a given application. Figure 1 below shows the signals of the PI-Bus. The University of Sussex has built several PI-Bus systems that are discussed in detail later.

[Figure 1: PI-Bus signals between models. Master interface(s), slave interface(s), arbitration, decode and bus control are linked by the signals RESETN, CLK, READ, OPC[3:0], ACK[2:0], A[31:2], D[31:0], SELy, LOCK, REQx and GNTx.]

The interfaces provided to the host circuits attached to the slave and master devices are discussed in the chapters associated with the bus agents.

A model is provided for the master, slave and BCU (Bus Control Unit). The BCU model is a complex example that can be simplified for a small PI-Bus system. To provide compatibility with widely different platforms we have used structural VHDL to link PI-Bus modules.

Models are also provided for behavioural and synthesisable circuits to attach to the host interfaces of the master and slave models. The Centre for VLSI and Computer Graphics has developed hardware and software tools to test and confirm compliance with standards such as the PI-Bus; details can be found on page 34.

The ToolKit includes two distinct model libraries for the slave and master agents. In some systems it is practical to use one model as both the behavioural and synthesisable view, however it has become clear that this is not always the case. The behavioural model has been modified to create models more suited to synthesis, as follows:
- Removed AFTER clauses for systems such as Synopsys and ALTERA
- Component libraries all set to WORK for systems that require this (ALTERA)
(Excerpt from the Processor Local Bus (PLB) specification:)

In addition to providing a high-bandwidth data path, the PLB offers designers flexibility through the following features:
- Support for both multiple masters and slaves
- Four priority levels for master requests, allowing PLB implementations with various arbitration schemes
- Deadlock avoidance through slave-forced PLB rearbitration
- Master-driven atomic operations through a bus arbitration locking mechanism
- Byte-enable capability, supporting unaligned transfers
- A sequential burst protocol allowing byte, half-word, word and double-word burst transfers
- Support for 16-, 32- and 64-byte line data transfers
- Read word address capability, allowing slaves to return line data either sequentially or target word first
- DMA support for buffered, fly-by, peripheral-to-memory, memory-to-peripheral, and memory-to-memory transfers
- Guarded or unguarded memory transfers, allowing slaves to individually enable or disable prefetching of instructions or data
- Slave error reporting
- Architecture extendable to 256-bit data buses
- Fully synchronous

The PLB specification describes a system architecture along with a detailed description of the signals and transactions. PLB-based custom logic systems require the use of a PLB macro to interconnect the various master and slave macros.

Figure 2 illustrates the connection of multiple masters and slaves through the PLB macro. Each PLB master is attached to the PLB macro via separate address, read data and write data buses and a plurality of transfer qualifier signals. PLB slaves are attached to the PLB macro via shared, but decoupled, address, read data and write data buses along with transfer control and status signals for each data bus.

The PLB architecture supports up to 16 master devices. Specific PLB macro implementations, however, may support fewer masters. The PLB architecture also supports any number of slave devices.

[Figure 2: Example of PLB interconnection. PLB masters connect to the central arbiter of the PLB macro through individual address/transfer-qualifier, write-data and read-data buses plus status and control signals; PLB slaves sit behind the arbiter on a shared bus.]
Bus architecture

[Figure: PNX8500 bus topology. The MIPS and TM32 cores reach the external SDRAM through DMA ports on the memory controller (MMI bus). Peripherals (Video In, Video Out, 2D, PCI/XIO, I/O and the two interrupt controllers) sit on PI-bus segments, joined to the cores by bridges (MIPS bridge, fast C-bridge, TM32 C-bridge, MIPS C-bridge).]
Memory architecture

Data transfers between the different operators go through the main shared memory.
(Excerpt from the PNX8500 datasheet:)

The pnx8500 supports input of two CCIR656 video streams and up to four transport streams. After input, streams are routed to either the video input processor (VIP) blocks or MPEG system processor (MSP) blocks, depending on stream type.

Dual video input processors: Two VIP blocks input digital video (primarily CCIR656) at pixel rates up to 40 MHz in a variety of packed YUV422 memory video formats. To reduce the amount of memory required to store video data, the VIP blocks can compress video lines horizontally using a 6-tap polyphase scaler in normal or transposed polyphase modes. After input processing, video data is stored in memory. VIPs also support a raw data-streaming input mode for message passing and handling future 8-to-10-bit data formats.

Transport stream input and routing: Transport stream input is supported from several sources, including two transport stream input interfaces (TSIN), a 1394 receiver channel, an internal DMA agent dedicated to software-generated transport streams, and a feedback channel from each of the MPEG System Processor (MSP) blocks containing PID-filtered, optionally de-scrambled data. Each TSIN supports parallel and serial formats and input of up to four serial transport streams. Serial streams are converted to parallel within the TSIN block before further routing. After input, each transport stream is routed to one of several destinations, including an MSP block, a 1394 transmitter channel, or a transport stream output (TSOUT) block. Parallel streams are converted to serial as needed by the TSOUT block.

MPEG system processors: Two independent MSP blocks parse DVB and DSS transport streams, perform PID and section filtering, timestamping, descrambling, and demultiplexing, then write results to memory. Each MSP block handles one transport stream, thus up to two transport streams can be processed simultaneously. Prior to input, transport streams may pass through external POD or CI conditional access modules. Descrambler entitle- [the excerpt is cut off here]

Advanced image composition processor (AICP): Two AICP units perform color-space conversion and final image composition and refresh of the primary and secondary display channels. Compositing is performed in either RGB or YUV color space, depending on the AICP output mode. Compositing functionality is highly programmable and includes chroma, alpha or color-key based alpha-blending and pixel selection from the previous or current layer or the background color. Two physically and logically independent AICP units support all display modes. The primary AICP composites four layers to produce the primary output channel, intended for display on a TV or monitor. The secondary AICP composites two layers to create the secondary channel, intended for output to a VCR or other recording device or for viewing on another TV. Each layer has its own pixel clock and frame timing. All layers in both units are identical, so there are no restrictions on how each layer can be utilized. Thus the primary display can have four layers of video, graphics, or a mix of both. To support scenarios other than watch/record, one or both AICPs can be set to divide their layers over two pixel/timing-synchronous multiplexed outputs. This allows refresh of up to four screens with limited compositing capabilities. The output of each AICP unit is sent to a digital video output port.

MEMORY-BASED ARCHITECTURE

The pnx8500 operates as a memory-based system; memory serves as the buffer to decouple input and output data streams. As pnx8500 decodes input, it stores results in memory. When multiple data structures exist in memory for a given input stream, timestamping is used to determine when to display specific structures. Use of memory as a staging buffer simplifies synchronization between various on-chip blocks.

I/O UNITS

On-chip I/O units receive and/or route input from or transmit output to off-chip devices. Some perform additional stream processing and formatting.
Trade-off of this memory-based approach:
+ great flexibility;
- the required bus bandwidth is doubled, since every stream is written to memory and then read back.
"System" architecture

The shared memory requires hardware support for system mechanisms such as semaphores.

- The PI bus implements a "lock" function
- The TM32 implements semaphores
- Both processors have non-cacheable regions

The two processors can communicate through "pipe" channels built on these mechanisms by the operating systems.
Partitioning of the 3D pipeline

[Figure: the 3D pipeline (database, model transformation, lighting, perspective transformation, clipping, projection to pixels, texture application, Z-buffer merge, final image) partitioned across the chip: the geometry stages run on the MIPS and TM32 cores, while texture application and Z-buffer merge are performed by a dedicated ASIC, sustaining 60 million pixels per second.]
Video

- "Transport stream" decoding by 3 MSPs (MPEG System Processors)
  - an MSP is a 16-bit RISC processor
- Video processing is performed by ASICs:
  - MPEG-2 decoding
  - scaling
  - image composition
Operating systems

Both processors can run an operating system.

- The MIPS can run operating systems with "protected" processes.
  - Its MMU keeps the memory spaces of processes disjoint.
  - Drawback: context switches can be slow.
- The TM32's OS must be fast, since it performs the stream synchronization. To shorten context switches, only a small part of the 128 registers is saved.
Streaming software

The TriMedia Streaming Software Architecture (TSSA) simplifies the synchronization of processing functions. The user describes the processing functions and their connections; TSSA manages the movement, creation and destruction of the data.

Note: memory management is not left to the OS, so as to avoid fragmentation of the memory space.
Standardized data handled by TSSA

(Excerpt from Chapter 7, "TSSA Essentials", Book 3—Software Architecture, Part B, © 2000 TriMedia Technologies:)

Common Data Structures

Given countless ways to describe data, TSA has set some rules on formats of data used among TSA components, so that the data is understandable to all. The standard TSA data structures are declared in a header file called tmAvFormats.h. (Refer to Chapter 4, tmAvFormats.h: Multimedia Format Definitions, for documentation of these structures.) This chapter gives you an introduction to these commonly used data structures.

The TSA data structures are used in communication among components, and between an application and its components. Between an application and its components, configuration structures set up components. Among components, data packets transfer data.

Data Packets

Data travels in packets between components. Components that produce data packets are referred to as "senders" and components that consume data are called "receivers." The packets are always constructed according to the TSA data packet structure, with a pointer to the packet header as its first field. The header contains descriptive information on the type of data in the packet. The packet has a flexible, yet consistent structure. Specific hooks are provided where application needs can be satisfied, provided that you adhere to a few TSA rules.

[Figure 2: Example showing the hierarchical composition of a TSA packet.
  tmAvPacket_t: *Header, allocatedBuffers, buffersInUse, buffers[]
  tmAvBufferDescriptor_t: bufSize = 18432, dataSize = 18432, *Data Buffer
  tmAvHeader_t: id = 0, flags = 0, userSender, userReceiver, userPointer, time, format
  tmTimeStamp_t: ticks = 0, hiTicks = 0
  tmAudioFormat_t: size = sizeof(tmAudioFormat), hash = 0, referenceCount = 0, dataClass = avdcAudio, dataType = atfLinearPCM, dataSubtype = apfFiveDotOne16, description = 0, sampleRate = 44100.0]
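The hierarchy in Figure 2 can be transcribed as plain C structures. Field names and the example sizes follow the figure, but these declarations are a reconstruction for illustration, not the actual contents of tmAvFormats.h:

```c
#include <stdint.h>

typedef struct {                 /* tmTimeStamp_t in the figure         */
    uint32_t ticks, hiTicks;
} timeStamp_t;

typedef struct {                 /* tmAvHeader_t: describes the packet  */
    uint32_t    id, flags;
    void       *userSender, *userReceiver, *userPointer;
    timeStamp_t time;
    const void *format;          /* e.g. points to an audio format      */
} avHeader_t;

typedef struct {                 /* tmAvBufferDescriptor_t              */
    uint32_t bufSize;            /* capacity: 18432 bytes in the figure */
    uint32_t dataSize;           /* bytes actually valid                */
    uint8_t *data;
} bufferDescriptor_t;

typedef struct {                 /* tmAvPacket_t: header plus buffers   */
    avHeader_t         *header;
    uint32_t            allocatedBuffers;
    uint32_t            buffersInUse;
    bufferDescriptor_t *buffers; /* array of allocatedBuffers entries   */
} avPacket_t;
```

The point of the fixed layout is that any receiver can interpret a packet from any sender by first dereferencing the header and its format descriptor.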
A simple TSSA example

(Excerpt from Chapter 8, "Developing Applications Using a Streaming Model", Book 3—Software Architecture, Part B:)

Sample Application

The best way to understand how the streaming model works is to study a sample application using components in the TSSA OL layer. This chapter contains pseudo code from a basic TSSA application that uses three generic asynchronous components: a digitizer, a processor, and a renderer. After the basic application has been explained in its entirety, more advanced issues of TSSA applications will be presented starting on page 41.

[Figure 8: Overall flow of the sample application, showing a command/response queue between the application and each component (digitizer, filter, renderer) and bidirectional data queues between components: full buffers flow downstream, empty buffers flow back upstream.]

Following is the pseudo code of the sample application. Each part of the code will be explained on the following pages. Although error checking is critical in any TSSA application, it is omitted from this example for clarity.

    void tmosMain()
    {
        Int digInst;
        Int procInst;
        Int rendInst;
        ptmolDigitizerInstanceSetup_t digSetup;
        ptmolProcessorInstanceSetup_t procSetup;
        ptmolRendererInstanceSetup_t  rendSetup;
        ptmolDigitizerCapabilities_t  digCap;
        ptmolProcessorCapabilities_t  procCap;
        ptmolRendererCapabillities_t  rendCap;
        ptsaInOutDescriptor_t         iodesc1;
        ptsaInOutDescriptor_t         iodesc2;
        ptsaInOutDescriptorSetup_t    iosetup;

        tmAvFormat_t lformat = {
            sizeof(tmAvFormat_t),  /* size           */
            0,                     /* hash           */
            0,                     /* referenceCount */
            avdcGeneric,           /* dataClass      */
            ... [the listing is cut off in this transcription]

The following is an example of the scheduling of a number of tasks in a TSSA audio application. The tasks involved are the root task, the idle task, the processor component task, and their interaction with two interrupt service routines (an audio digitizer and an audio renderer).

[Figure 9: Processor control flow in a TSSA system: digitizer (interrupt), processor (task), renderer (interrupt), application (task) and idle task over a timeline of 5 ms = 256 audio samples; the animation TIMELINE.MOV shows the dynamic interaction between components, packets, and the application.]

Following are descriptions of the task switches illustrated in Figure 9:

Task switch t0: The audio input interrupt triggers the audio digitizer. The digitizer procures an empty packet, fills it, and sends it to the filter.
Task switch t1: The filter task has been waiting for a full packet. Since a pSOS call was made in the digitizer's ISR, the pSOS scheduler is invoked on return from the ISR. Control is transferred to the waiting filter task.
Task switch t2: The filter finishes processing. An input buffer has been consumed and an output buffer is produced. Control is transferred to the next waiting task, which is the application (root task).
Task switch t3: The application completes its processing and control is transferred to some other task.
Task switch t4: The audio renderer is triggered by the audio-out interrupt service request. The audio renderer consumes a full buffer and produces an empty buffer.
Description of the application

    // Initialize each module
    ...
    // Initialize the connections
    iodesc1 = fct( packet size, etc. ... );

    // Connect the modules
    digSetup->outputDescriptors[0] = iodesc1;
    procSetup->inputDescriptors[0] = iodesc1;
    ...

    // Start the application
    tmolDigitizerStart( digInst );
    tmolProcessStart  ( procInst );
    tmolRendererStart ( rendInst );
    ...
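The shape of such an application (components exchanging full packets downstream and recycled empty packets upstream through queues) can be mimicked in a few lines of ordinary C. Everything below, from the queue type to the three component functions and the x2 "filter", is a self-contained illustration of the streaming model, not the TSSA API:

```c
#include <assert.h>

enum { QCAP = 4 };

typedef struct { int buf[QCAP]; int head, count; } queue_t;

static void q_push(queue_t *q, int v)
{
    assert(q->count < QCAP);
    q->buf[(q->head + q->count++) % QCAP] = v;
}

static int q_pop(queue_t *q)
{
    int v;
    assert(q->count > 0);
    v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return v;
}

/* Each connection carries full packets one way and recycled empty
   packets the other, as in Figure 8. A packet is just an int here. */
static queue_t dig_full, dig_empty;    /* digitizer -> filter   */
static queue_t rend_full, rend_empty;  /* filter    -> renderer */
static int last_rendered;

static void pipeline_init(void)
{
    for (int i = 0; i < QCAP; i++) {   /* prime the empty-packet queues */
        q_push(&dig_empty, 0);
        q_push(&rend_empty, 0);
    }
}

static void digitizer(int sample)      /* procure empty, fill, send  */
{
    q_pop(&dig_empty);
    q_push(&dig_full, sample);
}

static void filter(void)               /* consume input, produce x2  */
{
    int in = q_pop(&dig_full);
    q_push(&dig_empty, 0);             /* recycle the input packet   */
    q_pop(&rend_empty);
    q_push(&rend_full, 2 * in);
}

static void renderer(void)             /* consume full, recycle it   */
{
    last_rendered = q_pop(&rend_full);
    q_push(&rend_empty, 0);
}
```

Calling `digitizer(s); filter(); renderer();` drives one packet through the chain; in the real system each component is a separate task or interrupt service routine and the queues block, which is what the pSOS scheduling trace illustrates.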
Objectives

An animated 3D image overlaid on a video stream.
Organization of the application

[Figure: data flow. Video In (hardware) captures the incoming video; the 3D model, driven by the user, goes through preparation (MIPS) and rendering (MIPS, TM32 and the 3D engine) to produce an image; the AICP composes the rendered image with the video input; Video Out (hardware) displays the result.]
Organization of the control

[Figure: hardware/software split. The hardware blocks (3D, Video In, AICP, Video Out) are driven on the TM32 side by the streaming software and on the MIPS side by its OS and I/O; the two processors cooperate through inter-processor communication (RPC + shared memory).]
Execution of the application

[Figure: timeline of one frame, annotated with full/empty states of the image buffers (Vidéo In A/B, 3D, Composition A/B). The next video image arrives and is stored in Vidéo In B while the previous one is being processed; software events (start/end of 3D, start/end of composition, end of output, reception) synchronize the hardware blocks and the software tasks around these buffers.]
The chip

- 35 million transistors
- 0.18 µm process
- 82 clock domains:
  - TM32: 200 MHz
  - PR3940 (MIPS): 150 MHz
  - SDRAM: 143 MHz
- 243 on-chip memories, 750 Kbits in total
- Peripherals are grouped into two blocks: M-PI, T-PI
- Blocks are connected by abutment
… chiplet timing, clock matching, and I/O timing analysis.

To achieve timing closure, we made engineering change orders to the netlist after routing. Following each manipulation step, formal verification ensured that the modified netlist was functionally equivalent to the one after test insertion.

We aligned all clock domains having synchronous chiplet crossings. For example, if the memory interface clock in one chiplet was synchronously connected to the same clock in another chiplet, we phase-aligned these clocks and analyzed the signal paths to meet timing constraints. We achieved clock alignment by tweaking the clock insertion delays, using aligners in the clock module. Similarly, we made the clock trees as structurally identical as possible.

As part of the physical design process, we met design completion and manufacturability goals by implementing techniques such as design rule checks, antenna fixes, track filling, and doubling of vias wherever possible. Figure 4 shows the layout plot for the Viper design's initial version. Table 3 summarizes the major design parameters.

We have learned much from the Viper design experience and trust it will guide us in the future, particularly since the next-generation SOC designs are significantly more complex, calling for still higher levels of integration. Some of our current activities, in addition to regular chip-development tasks, are investigating more efficient on-chip bus architectures and better design-reuse methodologies.
(Excerpt: "Application-Specific SOC Multiprocessors," IEEE Design & Test of Computers.)
[Figure 4: Layout of Viper (PNX8500). Visible blocks include CAB, MPEG, MBS+, VIP1+, VIP2, ICP1 + ICP2 + MMI, conditional access (MSP1 + MSP2), T-PI, M-PI, TM32, 1394, MSP3, and PR3940.]
Table 3. Design statistics.

Parameter           Value
Process technology  TSMC 0.18 µm, six metal layers
Transistors         About 35 million
Instances           1.2 million instances, or 8 million gates
Memories            243 instances, 750-Kbit memory
CPUs                2 (TriMedia TM32 and MIPS PR3940)
Peripherals         50
Clock domains       82
Clock speed         TM32: 200 MHz; PR3940: 150 MHz; SDRAM: 143 MHz
Power               4.5 W
Supply voltage      1.8-V core and 3.3-V I/O
Package             BGA456
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx ◀
- "Portable" multimedia platform
- Playstation 2 & 3
- GPU
Set top box IBM STB03xxx
Set top box IBM STB04xxx
Memory architecture
As more functions are integrated in system-on-a-chip (SOC) designs, memory subsystem performance becomes ever more critical to achieving the requirements of the design. While the number of cores increases, the number of memory controllers usually remains constant for cost and complexity reasons. Adding memory controllers can increase the size and complexity of the memory subsystem design and can increase costs by increasing the chip area and the number of I/Os. By understanding the system demands on the memory subsystem and determining ways to enhance its efficiency, the additional complexity and cost can be avoided.

This article describes a set-top-box (STB) integrated controller design and the process of modeling, analyzing, and improving memory performance for this SOC during the design and simulation phases of the project. While the techniques have been developed in the STB design environment, they can be readily adapted for memory subsystem analysis for a wide range of SOC applications.

Set-Top-Box Overview

A set-top box (STB) is a consumer device used to decode digital video and audio signals from a digital cable or satellite service provider. The STB034xx integrated controller is an example of an SOC for an STB application. The STB034xx SOC is a PowerPC 405 processor-based design and includes the following subsystems: 405 processor with I-cache and D-cache, digital audio/video decoders, memory interface, direct memory access (DMA) controller, and peripherals. An overview of the architecture is shown in Figure 1.

The focus of the performance analysis for this design was on the SDRAM memory interface. The MPEG-2 video decoder, 405 processor, and DMA controller can each place significant demands on memory bandwidth. For example, decoding video with on-screen display (OSD) planes active can require over 90 MB/s. Typically, the continuous bandwidth requirements of the DMA controller would not be as demanding, but it can still have bursts of high-data-rate activity. Requirements for 405 bandwidth are application-dependent. Usually, the audio decoder and digital video broadcast (DVB) transport demultiplexer will require only about five MB/s.
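The 90-MB/s figure can be sanity-checked with a back-of-the-envelope traffic model. The frame size follows from the SD 4:2:0 format; the per-frame access multipliers and the OSD share below are assumptions for illustration, not measured STB034xx numbers:

```python
# Rough SDRAM-bandwidth accounting for SD MPEG-2 decode (assumed model).
W, H, FPS = 720, 480, 30
BYTES_PER_PIXEL = 1.5            # 4:2:0 chroma subsampling

frame_bytes = W * H * BYTES_PER_PIXEL          # 518,400 bytes per frame

# Assumed per-frame memory traffic (illustrative multipliers):
#   1x write of the decoded frame, ~2x reads for motion compensation,
#   1x read for the display refresh.
traffic_per_frame = frame_bytes * (1 + 2 + 1)
osd_bytes_per_s = 20e6                          # assumed OSD-plane contribution

total_MB_per_s = (traffic_per_frame * FPS + osd_bytes_per_s) / 1e6
```

Under these assumptions the aggregate lands in the same range as the article's figure, which is why the SDRAM interface, rather than raw compute, drives the analysis.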
The STB034xx memory subsystem includes a crossbar switch and two SDRAM controllers. A crossbar switch core provides nonblocking access to system memory. This architecture is flexible because cores connected to either of the processor local buses (PLBs) may access either of the two SDRAM interfaces, thus enabling the even distribution of the bandwidth load between the two interfaces.

The SOC audio and video decode functions have to meet a real-time deadline, thereby determining when the needed data must reach the video/audio decoder. Provisions were made in the SDRAM controllers to handle the overall bandwidth of the design and to meet these real-time requirements.

Although the memory controllers have a theoretical bandwidth that is well above the aggregate requirements, queuing and latency variations during actual operation can reduce the efficiency of the memory subsystem performance.
MicroNews, Fourth Quarter 2001.
"Memory Performance Analysis in a System-on-a-Chip Design," Everett G. Vail III, Steven B. Herndon, Dennis E. Franklin, Eric M. Foster, and Jeffrey S. Graves.
[Figure 1: Set-top-box system-on-a-chip architecture. A PowerPC 405 (16-KB I-cache, 8-KB D-cache) and peripherals sit on two processor local buses (PLBs); a crossbar switch links them to the SDRAM 0 and SDRAM 1 controllers (2/8-MB SDRAM each) and a 16-bit SRAM interface (1-MB flash); a DMA controller, the MPEG-2/DVB transport demultiplexer fed by the satellite or cable input stream, MPEG-2 video and audio decoders, an NTSC/PAL digital encoder driving the TV/VCR, and an external graphics and video (EGV) port complete the design.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform ◀
- Playstation 2 & 3
- GPU
Nomadik: portable video
STn8815 block diagram
[Figure: STn8815 block diagram. An ARM926EJ with I/D caches and L2 cache; smart video, audio, imaging and graphics accelerators with an eSRAM buffer; DDR SDRAM and NAND/NOR flash controllers, multichannel DMA, secured RAM/ROM and a security toolbox; display and capture interfaces (color LCD controller, TV output, camera interfaces, USB-OTG high speed, host port interface); peripherals (I2C, HSI, MSP, UART, SSP, FIrDA, 1-wire/HDQ, GPIO, SD/MMC/MS, rotary encoder and keypad interfaces); timers, watchdog, RTC, interrupt controller, PLLs, system controller, power management and JTAG/trace.]
Nomadik architecture is based on the distributed processing of audio, video and imaging, advanced security features, pervasive low-power techniques for increased autonomy, and an optimized memory architecture for the best cost/performance ratio. The STn8815 combines an ARM9 core up to 332 MHz with level-two cache to audio, video, imaging and graphics accelerators, allowing both low-power multimedia performance and powerful general-purpose software processing and OS support.

The STn8815 smart imaging accelerator (SIA) delivers impressive multimedia quality without sacrificing battery life. It operates as a real-time, programmable image-reconstruction engine at up to 80 Mpixel/s. This capability enables camera-phone systems, based on 5-Mpixel sensors, to execute noise reduction, autofocus and exposure control, and other fundamental algorithms, therefore eliminating the need for an external imaging coprocessor, and reducing the system BOM (bill of materials). The SIA, coupled with the smart video accelerator (SVA), which is capable of 30-Mpixel/s JPEG-image encoding, allows impressive multi-shot camera performance, as well as low-power video encoding. The STn8815 integrates two SMIA (standard mobile imaging architecture), CCP2 (compact camera port 2) camera interfaces, and supports 10-bit raw Bayer RGB data formats.

The Nomadik platform gives customers the ease to differentiate products, an openness to industry standards (such as OMA and MIPI), and expertise in open-OS complex platform integration. Nomadik is reinforced by a full system offering with multiple connectivity, camera, energy management, TV out, and companion devices.

STn8815-based development kits (NDK-15, NHK-15) offer a complete, flexible design environment, including a rich set of peripherals such as cameras, audio codecs, wired and wireless connectivity, and LCD displays, among others. A complete set of development tools is available from ST-Ericsson or tools partners (ARM Ltd, Lauterbach) to support a full range of application development, from firmware customization up to OS-level applications.
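The datasheet rates translate directly into sustained frame rates for the quoted 5-Mpixel sensor; a quick arithmetic check on the numbers above:

```python
# Frame rates implied by the STn8815 accelerator throughputs.
SENSOR_MPIXEL = 5.0        # 5-Mpixel camera sensor
SIA_MPIXEL_S = 80.0        # smart imaging accelerator: 80 Mpixel/s
SVA_MPIXEL_S = 30.0        # smart video accelerator: 30 Mpixel/s JPEG encode

sia_fps = SIA_MPIXEL_S / SENSOR_MPIXEL   # full-resolution reconstructions/s
sva_fps = SVA_MPIXEL_S / SENSOR_MPIXEL   # full-resolution JPEG encodes/s
```

So the SIA can reconstruct 16 full-resolution frames per second and the SVA can JPEG-encode 6, which is what makes the multi-shot use case possible without an external coprocessor.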
STn8815 product table

Part number   Package type                    Package size       Availability
STn8815A09    Standalone                      9 x 9 x 1.0 mm     Production
STn8815A12    Standalone                      12 x 12 x 1.2 mm   Production
STn8815P14    MAP PoP (package-on-package)    14 x 14 x 1.0 mm   Engineering samples
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context ◀
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell
- GPU
General architecture
Performance
Triple-processor system at 250 MHz:
- one 128-bit processor
- two vector processors
→ 6.2 Gflops

Mixed shared/distributed memory architecture:
- bandwidth to main memory: 500 MB/s
- internal bandwidth: 2 GB/s

3D performance:
- 75 million polygons per second
- 2.4 GPixels/s
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture ◀
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell
- GPU
Overview
[Figure: Emotion Engine overview. A 128-bit CPU with FPU, the vector units VU0 and VU1, an image-processing unit, the GIF, DMA, the RDRAM interface and I/O; the CPU, VU0 and VU1 run at 250 MHz, all other modules and buses at 125 MHz; the main bus is 128 bits wide, with 64-bit and 32-bit side buses.]
Main processor
[Figure: main processor. A 32 x 128-bit register file with ITLB/DTLB; two integer pipes, a load/store pipe, a branch pipe, a COP FPU pipe and the COP2/VU0 pipe; 16-KB I-cache, 8-KB D-cache and 16-KB SPRAM; a response buffer, WBB and UCAB feeding a 64-bit bus interface unit on the system bus.]
VU1 (and VU0) architecture
[Figure: VU1 (and VU0) architecture. Four FMACs (x, y, z, w) and an FDIV, plus units absent from VU0 (extra FMAC, FDIV, EFU, IALU, LSU); upper and lower instruction streams; 32 x 128-bit floating-point registers and 16 x 16-bit integer registers; 16-KB (4-KB on VU0) instruction and data memories; a VIF link to the system bus and a 128-bit path to the GIF.]
Memory architecture
[Figure: memory architecture. The CPU (16-KB I-cache, 8-KB D-cache, 16-KB SPRAM), VU0 (4-KB instruction and data RAMs) and VU1 (16-KB instruction and data RAMs) with their register files; VIF0/VIF1, the GIF with four buffers (Buff0 to Buff3), the DMA engine and the RDRAM, linked by 128-bit, 64-bit and 32-bit paths.]
Partitioning of the 3D pipeline
[Figure: partitioning of the 3D pipeline. CPU+VU0: database traversal, model transformation, lighting, culling; VU1: perspective transformation and projection to pixels; Graphic Synthesizer: texture application, Z-buffer and final image.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture ◀
  - EE: the circuit
  - PS3 & Cell
- GPU
Programming model
VPU0 can operate either as a coprocessor of the CPU or autonomously; VPU1 operates exclusively in autonomous mode.

The VIF of each VU receives data packets to be stored either in the program memory or in the associated data memory, according to an identification code. This code can also give the address of a program to launch.

Programs for VPU0 and VPU1 are written in assembly language.
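The VIF dispatch described above can be sketched as a small tag-driven router. The id codes and field layout here are purely illustrative (the real VIF code encoding is more involved):

```python
# Illustrative VIF-style dispatcher: an id code in each packet header
# selects the destination memory, and may request a program start.
PROGRAM, DATA, START = 0, 1, 2    # hypothetical id codes

class VUFront:
    def __init__(self):
        self.program_mem = {}     # stand-in for the VU program memory
        self.data_mem = {}        # stand-in for the VU data memory
        self.pc = None            # address of a launched microprogram

    def receive(self, packet):
        code, addr, payload = packet
        if code == PROGRAM:
            self.program_mem[addr] = payload
        elif code == DATA:
            self.data_mem[addr] = payload
        elif code == START:
            self.pc = addr        # the code can name a program to launch

vu = VUFront()
vu.receive((PROGRAM, 0x000, ["microcode..."]))
vu.receive((DATA, 0x100, [1.0, 2.0, 3.0, 4.0]))
vu.receive((START, 0x000, None))
```

The point is that the host never pokes VU memories directly: it streams tagged packets, and the front end routes and launches them.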
Data flow
[Figure: data-flow configurations. Several DMA-driven routings between the RDRAM, the CPU and its scratchpad (SPR), VPU0, VU1, the IPU and the GIF, combining load/store accesses and DMA transfers.]
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit ◀
  - PS3 & Cell
- GPU
The circuit
(Figures: 1999 IEEE International Solid-State Circuits Conference, © IEEE 1999.)
[Figures 15.1.1 to 15.1.5: chip block diagram, vector operation units block diagram, IPU block diagram, example of data flow, and die micrograph. Labeled blocks include the superscalar RISC core with I-cache, D-cache and SPR, VU0 and VU1 with their memories and FMAC/FDIV arrays, the GIF, the IPU, the ALU/shifter/LSU and DMAC, and the memory and I/O interfaces.]
- 10.5 million transistors
- 17 x 14.1 mm² in 0.18 µm
- 15 W
- 250 MHz (CPU, VU0, VU1)
- 125 MHz (buses and other modules)
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3 ◀
  - PS2: context
  - EE: hardware architecture
  - EE: software architecture
  - EE: the circuit
  - PS3 & Cell ◀
- GPU
PS3 architecture
[Figure: PS3 architecture. The Cell at 3.2 GHz connects to 256 MB of XDR DRAM at 25.6 GB/s and to the RSX (20 GB/s out, 15 GB/s in); the RSX drives 256 MB of GDDR3 at 22.4 GB/s and the HD/SD AV output; an I/O bridge (2.5 GB/s each way) serves the BD/DVD/CD drive, six USB 2.0 ports, Gbit Ethernet/WiFi, removable storage (Memory Stick, SD, CF) and the Bluetooth controller.]
Cell: block diagram
Cell: EIB
… The two-way double-precision floating-point operation is partially pipelined, so its instructions issue at a lower rate (two double-precision flops every seven SPU clock cycles). When using single-precision floating-point fused multiply-add instructions (which count as two operations), the eight SPUs can perform a total of 64 operations per cycle.

Communication architecture

To take advantage of all the computation power available on the Cell processor, work must be distributed and coordinated across the PPE and the SPEs. The processor's specialized communication mechanisms allow efficient data collection and distribution as well as coordination of concurrent activities across the computation elements. Because the SPU can act directly only on programs and data in its own local store, each SPE has a DMA controller that performs high-bandwidth data transfer between local store and main memory. These DMA engines also allow direct transfers between the local stores of two SPUs for pipeline or producer-consumer-style parallel applications.

At the other end of the spectrum, the SPU can use either signals or mailboxes to perform simple low-latency signaling to the PPE or other SPEs. Supporting more-complex synchronization mechanisms is a set of atomic operations available to the SPU, which operate in a similar manner as the PowerPC architecture's lwarx/stwcx atomic instructions. In fact, the SPU's atomic operations interoperate with PPE atomic instructions to build locks and other synchronization mechanisms that work across the SPEs and the PPE. Finally, the Cell allows memory-mapped access to nearly all SPE resources, including the entire local store. This provides a convenient and consistent mechanism for special communications needs not met by the other techniques.

The rich set of communications mechanisms in the Cell architecture enables programmers to efficiently implement widely used programming models for parallel and distributed applications. These models include the function-offload, device-extension, computational-acceleration, streaming, shared-memory-multiprocessor, and asymmetric-thread-runtime models.8

Element interconnect bus

Figure 2 shows the EIB, the heart of the Cell processor's communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands (requests to …)
(IEEE Micro, May-June 2006.)
[Figure 2: Element interconnect bus (EIB). The PPE, SPE0 to SPE7, the MIC and the IOIF0/IOIF1 and BIF interfaces (BIF: broadband interface; IOIF: I/O interface) all attach to the data network under a data bus arbiter.]
Cell: DMA performance
… DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).

6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB, which then informs the BIU that the command was accepted and data transfer can begin.

7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.

8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles: the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention.

The overall latency of generating the DMA command, initially selecting the command, …
[Figure 3: Basic flow of a DMA transfer. The SPU (3.2 GHz) with its channel interface and local store (LS); the MFC (1.6 GHz) with DMAC, SPU and proxy command queues, MMU/TLB, MMIO and BIU; the EIB (204.8 Gbytes/s) and the MIC (25.6 Gbytes/s in/out) to off-chip memory; local-store bandwidth is 51.2 Gbytes/s, and numbered steps (1) to (7) trace a command from issue to data transfer.]
… between puts and gets to local store, because local-store access latency was remarkably low: only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory and 15 Gbytes/s for gets from main memory.

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.

Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking (the most constrained case) and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figures 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
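The shape of these curves follows a simple fixed-overhead model: each blocking DMA pays a startup latency, so small messages are latency-bound, large messages approach the link peak, and batching amortizes the startup cost. A sketch of that model (parameters chosen to echo the measurements, not taken from the paper):

```python
# Effective DMA bandwidth under a fixed-overhead model (assumed parameters).
PEAK_GB_S = 25.6       # EIB endpoint capacity, GB/s (1 GB/s == 1 byte/ns)
LATENCY_NS = 90.0      # per-command startup latency, order of Table 1

def effective_gbytes_s(size_bytes, batch=1):
    """Bandwidth for `batch` pipelined DMAs of `size_bytes` each."""
    transfer_ns = size_bytes / PEAK_GB_S       # wire time for one DMA
    # One startup latency for the batch, then back-to-back transfers.
    total_ns = LATENCY_NS + batch * transfer_ns
    return batch * size_bytes / total_ns       # bytes/ns == GB/s

small = effective_gbytes_s(128)            # latency-dominated
large = effective_gbytes_s(16384)          # approaches the peak
batched = effective_gbytes_s(256, batch=16)
```

The model reproduces the qualitative behavior of Figure 4 and of the batched runs: tiny transfers waste most of the time in startup, while batching intermediate sizes recovers a large fraction of the peak.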
Collective DMA performance

Parallelization of scientific applications generates far more sophisticated collective …
Table 1. DMA latency components for a clock frequency of 3.2 GHz.

Latency component                 Cycles   Nanoseconds
DMA issue                             10     3.125
DMA to EIB                            30     9.375
List element fetch                    10     3.125
Coherence protocol                   100    31.25
Data transfer for inter-SPE put      140    43.75
Total                                290    90.625
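The nanosecond column of Table 1 is just the cycle count divided by the 3.2-GHz clock; a quick check that the components and the total are consistent:

```python
# Convert Table 1 DMA latency components from cycles to nanoseconds.
CLOCK_GHZ = 3.2
components = {
    "DMA issue": 10,
    "DMA to EIB": 30,
    "List element fetch": 10,
    "Coherence protocol": 100,
    "Inter-SPE put data transfer": 140,
}

ns = {name: cycles / CLOCK_GHZ for name, cycles in components.items()}
total_cycles = sum(components.values())          # 290 cycles
total_ns = total_cycles / CLOCK_GHZ              # 90.625 ns
```

So roughly a third of the end-to-end latency is the coherence protocol and almost half is the inter-SPE data movement itself.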
[Figure 4: Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention. Message sizes span 4 bytes to 16,384 bytes; latencies reach about 1,200 ns and bandwidths about 25 Gbytes/s, with separate curves for blocking get/put to main memory and to SPE local store.]
Cell: the circuit
Plan
- Context
- Set-top box PNX8500
- Set-top box IBM STB03xxx
- "Portable" multimedia platform
- Playstation 2 & 3
- GPU ◀
(GP)GPU objectives
- Unify the processors involved in the graphics pipeline
- Make them programmable for general-purpose use

Their architecture remains shaped by the graphics pipeline:

- massive parallelism for pixel computations
- a memory hierarchy adapted to it
- built-in support for systematic operations (texture interpolation and filtering, triangle rasterization, etc.)
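Pixel computation is embarrassingly parallel: the same small function runs independently on every pixel, which is the workload the SP arrays below are organized around. A toy CPU-side illustration (a GPU would launch one thread per pixel instead of looping):

```python
# Data-parallel pixel shading: one independent evaluation per pixel.
W, H = 4, 2

def shade(x, y):
    """Toy 'shader': a horizontal gradient from 0 to 255."""
    return (255 * x) // (W - 1)

# Conceptually, a GPU launches W*H threads, one per (x, y) coordinate;
# no pixel depends on another, so any execution order gives the same image.
framebuffer = [[shade(x, y) for x in range(W)] for y in range(H)]
```

Because no pixel depends on another, throughput scales almost linearly with the number of processing elements, which is exactly what makes the wide SP arrays worthwhile.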
GeForce 8, 9 and GT200 (Tesla): overall architecture
GeForce Tesla: TPC architecture
… work distribution is based on the pixel location.

Streaming processor array

The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster

As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller

The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for …
Figure 2. Texture/processor cluster (TPC).
(Hot Chips 19; IEEE Micro.)
GeForce Tesla: SP architecture
GeForce Tesla: circuit
… market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.

CUDA C/C++ applications executing on Tesla computing platforms, Quadro workstations, and GeForce GPUs deliver compelling computing performance on a range of large problems, including more than 100x speedups on molecular modeling, more than 200 Gflops on n-body problems, and real-time 3D magnetic-resonance imaging.12-14 For graphics, the GeForce 8800 GPU delivers high performance and image quality for the most demanding games.15

Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include:

- 681 million transistors, 470 mm²
- TSMC 90-nm CMOS
- 128 SP cores in 16 SMs
- 12,288 processor threads
- 1.5-GHz processor clock rate
- peak 576 Gflops in processors
- 768-Mbyte GDDR3 DRAM
- 384-pin DRAM interface
- 1.08-GHz DRAM clock
- 104-Gbyte/s peak bandwidth
- typical power of 150 W at 1.3 V
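The listed peaks are consistent with the clocks and widths: 576 Gflops corresponds to 128 SPs issuing 3 flops per cycle at 1.5 GHz (a multiply-add plus a multiply, as commonly counted for this generation), and 104 Gbytes/s to a 384-bit GDDR3 interface at 1.08 GHz, double data rate:

```python
# Sanity-check the GeForce 8800 Ultra peak numbers from the spec list.
SP_CORES = 128
SP_CLOCK_GHZ = 1.5
FLOPS_PER_CYCLE = 3       # multiply-add + multiply dual issue (common count)

peak_gflops = SP_CORES * SP_CLOCK_GHZ * FLOPS_PER_CYCLE      # 576 Gflops

BUS_BITS = 384            # the "384-pin DRAM interface"
DRAM_CLOCK_GHZ = 1.08
DDR = 2                   # GDDR3 transfers on both clock edges

peak_gbytes_s = BUS_BITS / 8 * DRAM_CLOCK_GHZ * DDR          # ~103.7 GB/s
```

Both derived values round to the figures in the list above, which is a useful habit when reading GPU spec sheets.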
The Tesla architecture is the first ubiquitous supercomputing platform. NVIDIA has shipped more than 50 million Tesla-based systems. This wide availability, coupled with C programmability and the CUDA software development environment, enables broad deployment of demanding parallel-computing and graphics applications.

With future increases in transistor density, the architecture will readily scale processor parallelism, memory partitions, and overall performance. Increased numbers of multiprocessors and memory partitions will support larger data sets and richer graphics and computing, without a change to the programming model.

We continue to investigate improved scheduling and load-balancing algorithms for the unified processor. Other areas of improvement are enhanced scalability for derivative products, reduced synchronization and communication overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency.
Acknowledgments
We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.
References
1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
Figure 9. GeForce 8800 Ultra die layout.
HOT CHIPS 19
71/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: Overall Architecture
72/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: SM Architecture
73/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Fermi: Chip
74/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
75/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
76/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 5xxx
77/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
ATI HD 4xxx vs 5xxx
78/82 - Architecture des SoC - Etudes de cas - Multimédia - GPU
S. Mancini
Architecture des SoC
Etudes de cas - Multimédia
S. Mancini
Detailed Outline

6 Context
q Overview
P Applications
P Specific features
q Video standards
P Video formats
P MPEG2
q 3D
P Principle of Z-buffer 3D
P 3D pipeline
P Performance of 3D accelerators

6 Set-top box PNX8500
q Environment
P Environment
P Application class
P Overall architecture
q The processors
P MIPS
P TriMedia: datapaths
P TriMedia: memory architecture
P TriMedia: system mechanisms
q Buses, memory and 3D
P The buses
P Bus architecture
P Memory architecture
P "System" architecture
P Partitioning of the 3D pipeline
P Video
q Software architecture
P Operating systems
P Streaming software
P Standardized data managed by TSSA
P A simple TSSA example
P Description of the application
q Example application
P Goals
P Organization of the application
P Organization of the control
P Execution of the application
q The chip
P The chip

6 Set-top box IBM STB03xxx
P Set-top box IBM STB03xxx
P Set-top box IBM STB04xxx
P Memory architecture

6 "Portable" multimedia platform
P Nomadik portable video

6 Playstation 2 & 3
q PS2: context
P Overall architecture
P Performance
q EE: hardware architecture
P Overview
P Main processor
P Architecture of VU1 (and VU0)
P Memory architecture
P Partitioning of the 3D pipeline
q EE: software architecture
P Programming model
P Data flow
q EE: the chip
P The chip
q PS3 & Cell
P PS3 architecture
P Cell: block diagram
P Cell: EIB
P Cell: DMA performance
P Cell: the chip

6 GPU
P Goals of (GP)GPUs
P GeForce 8, 9 and GT200 (Tesla): overall architecture
P GeForce Tesla: TPC architecture
P GeForce Tesla: SP architecture
P GeForce Tesla: chip
P Fermi: overall architecture
P Fermi: SM architecture
P Fermi: chip
P ATI HD 5xxx
P ATI HD 5xxx
P ATI HD 5xxx
P ATI HD 4xxx vs 5xxx
81