Multi-processor System on Chip Design

Jun Dong Cho, Sungkyunkwan University

Mobile SoC Design Automation Lab. Sungkyunkwan Univ. S. Korea

© Jun Dong Cho, 2007.7

Source: vada.skku.ac.kr/ClassInfo/comp-arch/SoC-arch/MP-SoC-design(185p).pdf

Contents

• Requirements for Next-Generation SoC (System on Chip)
• The Need for MP-SoC
• History of Multiprocessors
• MP-SoC Examples and Applications
• Homogeneity and Heterogeneity
• MP-SoC Design Automation
• Network on Chip
• SKKU's Mobile MP-SoC Platform



Requirements for Next-Generation SoC (System on Chip)

What is System on Chip?

[Figure: SoC at the center, SIP / MCP, and the component groups below]
• Processor: AP - MC
• Modem: GSM/GPRS - WCDMA - CDMA2000
• Connectivity: Wireless LAN - GPS - Bluetooth
• RF/Analog: Rx - Tx - Zero IF - PM
• Camera Chipset: CIS - CCD - ISP
• Display Driver IC (DDI): STN - TFT - OLED
• Smart Card: SIM
• Flash Memory: Code/Data Storage
• RAM: Mobile DRAM - SRAM - UtRAM

The Need for High Performance and Low Power

[Figure: growth trends over 1995, 2003 and 2012]
• Trend curves: Moore's law; Shannon's law (2.8x / 18 months); battery capacity; design complexity
• Wireless: 2G (IS-95) 9.6 kbps; 3G (CDMA 1xEV) 3,100 kbps; 4G (100 Mbps ~ 1 Gbps)
• Mobile multimedia: QVGA; D1; HD (720p); Full HD (1080i); 3D graphics
• Productivity Gap: design complexity vs. Moore's law
• Power Gap: design complexity vs. battery

Flexibility-Energy Gap

[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility]
• Embedded processors (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• FPFA (Field Programmable Function Array): 10-80 MOPS/mW
• Signal-processing ASICs: 200 MOPS/mW
• Regions marked on the plot: sensor network design space; wireless embedded systems design space

Five Requirements for Improving Next-Generation SoC Productivity

1. High Performance
2. Fast Verification
3. Small Form Factor
4. Low Power Solutions
5. Design-Technology Integration for Manufacturability

1. High Performance: CMP + NoC

Heterogeneous Chip Multi-processor Architecture

[Figure: μP, IP, Mem and PE blocks connected through a NoC]

[Figure: Technology Evolution. Number of PEs (0 to 400) versus year (2004, 2007, 2010, 2013, 2016). Source: ITRS 2005 draft]

2. Fast Verification: Embedded System Level

Complexity growth:
• Moore's Law: 2x / 18 months
• Nielsen's Law: 2x / 12 months
• Embedded SW: 2x / 10 months

Design levels and languages:
• System specification: UML / Java / MatLab
• Architecture design: SystemC / ADL
• RTL design: Verilog / VHDL

[Figure: ESL and TLM; signals ctrl1/cmd1, Req, Addr, Grant, Data, ack0, ack1]

3. Small Form Factor

SiP: Mobile Application Processor + Mobile Memory

[Figure: discrete board (Mobile AP, 32MB NAND, two 16MB SDRAM; about 35 mm x 25 mm) versus a 17 mm x 17 mm SiP containing Flash, SDRAM, SDRAM and Mobile AP]
• EMI reduction
• 60% smaller area
▷ 8 layers of MCP
▷ Cost reduction by 15%

4. Low Power Solutions

[Figure: low-power techniques across the Device, Circuit, Architecture and Runtime levels (DAC 2004)]
• MTCMOS
• Clock Gating
• Multi-Vdd
• Tr Sizing
• VTCMOS
• Multi-Vt
• SOI
• High-κ Metal Gate
• Parallelization
• GALS
• DPM/DVS

[Figure: DVFS operating points: 1.5V, 500MHz; 1.2V, 350MHz; 1.0V, 200MHz]

[Figure: Multi-Vdd, VTCMOS and MTCMOS circuits (VBP, VBN, VDD, VSS; Active and Standby modes)]

5. Design-Technology Integration for Manufacturability (DfM)

[Figure: variation information exchanged between design levels; statistical analysis of critical timing and power versus the designer's intention]
• Design levels: Quantum Physics; Mask / Process Design; Logic / Physical Design; Architecture Design; Algorithm Design
• Variation information: NA, Tox; Vt, Lg, L, t, tILD; Vdd, Temp; Latency, Power; Fault Probability
• Techniques: Statistical STA; Yield-improving architecture; Fault-tolerant algorithm

More SoC topics …

• Platform optimization
– Power management
– BW allocation
– Resource sharing
– Task distribution
– Efficient communications
• Low Power
• Verification
• Training talent (System Architects)



The Need for MP-SoC (Multi-Processor System on Chip)

Definition of MP-SoC?

Usually a heterogeneous multiprocessor:
• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Definition of multiprocessor by Enslow Jr.: MIMD machines with shared memory
• Shared memory
• Shared I/O
• Distributed OS
• Homogeneous

Extended definition: all parallel machines (wrong usage)

Future Microprocessors

Why MP?

• Uniprocessors have hit the ceiling
• Get performance from better architecture instead of more MHz

Anatomy of a Cellular Phone

3G Wireless Protocols

MP-SoC Application Area: 4G with Multiple Standards

• Communications.
• Networking.
• Multimedia.
• Security.

Digital RF supporting multiband/multimode

Evolution of the MP-SoC Platform (Example: WCDMA + CDMA2000)

System Architecture for 3G

• 4 PEs
– static kernel mapping and scheduling
– SIMD + scalar units
• 1 ARM GPP controller
– scalar algorithms and protocol controls

ARM MPCore™ Architecture

Reconfigurable and Scalable MP-SoC Platform

Road Map to MP-SoC Trends

• Mask NRE: over $1M
• Design NRE: $10M to $75M
– ASICs replaced by programmable ASSPs, FPGAs
• Number of embedded processors
– DVD/STB/HDTV, mobile phones: 5 to 8
– Image processing, networking, base station: 8 to 100+
• E-S/W complexity
– Set-top box, audio: >1 million lines of code
– E-S/W becoming an essential part of SoCs

Who's Law?

Why is MP-SOC Challenging?

Software Defined Wireless Multimedia Terminals

• Lower costs
– Platform longevity, higher volume
– SW has lower development costs
• Time to market
– Future protocols will have complex implementations
– Overlap testing/development cycles
• Adaptability
– Standards change over time
– Multi-mode operation
– Sharing hardware resources

Multistandard Radio
• UMTS
• GSM/GPRS/EDGE
• WLAN
• Bluetooth
• UWB

Multistandard M/M
• H.264
• MP3
• AAC
• GPS
• DVB-H
• TPEG

SDR = Reconfigurable Radios

SDR Configuration

Soft Radio Digital Signal Processing Engine

• Modulation Format
– QPSK
– DQPSK
– π/4 DQPSK
– {16,64,256,1024} QAM
– OFDM
– OFDM CDMA
• Digital Down/Up Conversion (DDC)
– Channel Center
– Decimation/Interpolation rates
– Compensation Filters
– Matched Filter α = {0.25,0.35,...}
• FEC
– Convolutional
– Reed-Solomon
– Concatenated Coding
– Turbo CC/PC
– (De-)Interleave
• Network Interface Definition
• Channel Access
– CDMA
– TDMA
• Security
• Beam Forming
• DSSS
– Rake, track, acquire
– Multi User Detect. (MUD)
– ICU

Future mobile applications?

Mudge et al.:
• Mobile supercomputing
– Speech recognition.
– Cryptography.
– Augmented reality.
– Typical applications (email, etc.).
• Requires 16x 2 GHz Pentium 4?

Culture and Education? Personal Entertainment?

MP-SoC Application Areas

[Figure: application areas]
• Categories: Broadcasting, Ubiquitous; Health, Human, Bio; Automotive & Robotics
• Items: D-TV, CIS, Mobile, Recorder, Health, HCI, Bio, Data Broadcasting, RFID, Telematics, Unmanned Driving, Robot

MPSoC today

• High performance, low power: there is no other way than MPSoC!
• Virtually all processor vendors are on the MPSoC route
– TI: OMAP, DaVinci
– STM: Nomadik
– IBM: Cell
– Intel: IXP, CoreDuo
– Philips: Nexperia
– Atmel: Diopsis
– ARM: MPCore
– ARC: VRaptor
• Urgent need for MPSoC design tools
– Application design and platform capture
– Architecture exploration and optimization
– Simulation and verification
– Application to architecture mapping

The triangle, Chicken and Egg?

[Figure: triangle linking architectures, applications and methodologies]
• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.

Should the SoC designer work hard?

[Figure: design flow in which every stage is paired with a Verify step. Stages shown: Requirements; Compose the system; Simulate; SoC Composer; Simulate (performance); Synthesis + P&R (verify timing, area); Tape Out]

• Why is verification important in mobile SoC?
• Why has our verification become weak?

Some statements from MPSoC 2006 Symposium

• "The ad-hoc approach to SoC design cannot scale with Moore's Law... The 'SW development environment as afterthought' era of IC design is rapidly drawing to a close" (K. Keutzer, UCB)

• "Power-constrained CPUs are mandatory, but the most exciting features require system-level SW optimization" (M. Kuulusa, Nokia)

• "Multi-core platforms are a reality – but where is the SW support?" (R. Lauwereins, IMEC)

Gartner Announces Top 10 Technologies for 2007

• Gartner announced the 10 technologies expected to reach maturity within the next three years (25th Annual Gartner Data Center Conference, Nov. 28 - Dec. 1, 2006).

• Open Source, Virtualization, Information Access, Ubiquitous Computing, Grid Computing, Compute Utilities, Multicore Processors, Web 2.0, Network Convergence, Water Cooling

http://www.gartner.com/2_events/conferences/lsc25.jsp



Fundamental to Parallel Machines

Purposes of multiple processors

• Performance
– A job can be executed quickly with multiple processors
• Fault tolerance
– If a processing unit is damaged, the total system can remain available: redundant systems
• Resource sharing
– Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
– High performance with low-frequency operation

Why Multi-Threaded Cores?

[Figure: NoC-based platform with In/Out ports, SRAM, DSPs, H/W-MT RISC, H/W processing element, and GPP with I$/D$ caches]
• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock)
• More parallel processing (lower power, higher performance/mm²)

Flynn's Classification

• Number of instruction streams: M (Multiple) / S (Single)
• Number of data streams: M / S
– SISD: uniprocessors (including superscalar, VLIW)
– MISD: not existing (analog computer)
– SIMD
– MIMD
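Flynn's two axes can be turned into a tiny lookup. The sketch below is only an illustration; the function name `flynn_class` and the example machines are assumptions introduced here, not part of the slides.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class (SISD, SIMD, MISD or MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"  # Single / Multiple instruction streams
    d = "S" if data_streams == 1 else "M"         # Single / Multiple data streams
    return f"{i}I{d}D"


# Hypothetical example machines: (instruction streams, data streams)
examples = {
    "uniprocessor": (1, 1),
    "16-lane vector unit": (1, 16),
    "4-core CMP": (4, 4),
}
for name, (n_i, n_d) in examples.items():
    print(f"{name}: {flynn_class(n_i, n_d)}")
```

A superscalar or VLIW uniprocessor still has one instruction stream, so it stays in SISD, as the slide notes.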

MIMD

[Figure: processors connected to memory modules (instructions and data) through interconnection networks]
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible

Classification of MIMD machines

• UMA (Uniform Memory Access model): provides shared memory that all processors access in the same manner.

• NUMA (Non-Uniform Memory Access model): provides shared memory, but access is not uniform.

• NORA/NORMA (No Remote Memory Access model): provides no shared memory. Communication is done with message passing.

An example of UMA: Bus connected

[Figure: four PUs, each with a snoop cache, attached to main memory over a shared bus]
• SMP (Symmetric MultiProcessor)
• On-chip multiprocessor

Switch connected UMA

[Figure: CPUs with local memory and interfaces connected through a switch to main memory]

NUMA

• Each processor has a local memory and accesses other processors' memory through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable:
– Programs for UMA can run without modification.
– Performance improves as the system size grows.
• Competitive with WS/PC clusters with software DSM

Typical structure of NUMA

[Figure: Nodes 0 to 3 connected by an interconnection network; the memories of nodes 0, 1, 2 and 3 together form one logical address space]

Classification of NUMA

• Simple NUMA:
– Remote memory is not cached.
– Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
– Cache consistency is maintained by hardware.
– The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
– No home memory
– Complicated control mechanism

Cray's T3D: A simple NUMA supercomputer (1993)

• Uses the Alpha 21064

The Earth Simulator (2002)

NORA/NORMA

• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance
• Hard to program
• The fastest processor is always NORA (except The Earth Simulator)
• Inter-PU communications; cluster computing
• Early hypercube machine: nCUBE2



MP-SoC Examples and Applications

Dual-Core (DSP+ARM) Platform

IBM Power4

– 2 cores
– F = 1.4GHz
– Single clock over entire die
– Balanced H-tree driving global grid
– Measured clock skew below 25ps
– Power ~85W
– 180nm SOI process, 174M transistors

IBM's Multiple Processors on MCM

– 4 POWER4 chips in a single module (MCM)
– The POWER4 chips are connected via four 128-bit buses
– Up to 128MB L3 cache
– Bus speed ½ processor speed
– Total throughput ~35 GB/s

MPSoC "Bus" Alternatives

• Fixed Bus [Bergamaschi, DAC, 2000]
– Point-to-point communication
– Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
– Programmable interconnects
– Employ static network
• Arbitrated Bus [IDT Inc., 2000]
– Time-shared multiple core connectivity
– Use arbitrator
• Hierarchical Bus [AMBA, ARM Inc]
– Combine multiple buses using bus
– Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
– Resources communicate with data packets
– Use switch fabric
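To make the "Arbitrated Bus" alternative concrete, here is a minimal sketch of a round-robin arbiter that time-shares one bus among several cores. Round-robin is only one possible policy and is an assumption here; the slides do not say which policy the cited design uses, and the class name `RoundRobinArbiter` is hypothetical.

```python
class RoundRobinArbiter:
    """Grant a shared bus to one requesting master per cycle, rotating priority."""

    def __init__(self, n_masters: int):
        self.n = n_masters
        # Start so that master 0 has the highest priority in the first cycle.
        self.last = n_masters - 1

    def grant(self, requests):
        """requests: list of bools, one per master. Return the granted index or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx  # the winner gets the lowest priority next time
                return idx
        return None


arbiter = RoundRobinArbiter(3)
# Masters 0 and 1 request the bus continuously; master 2 stays idle.
for cycle in range(4):
    print(cycle, arbiter.grant([True, True, False]))
```

With two persistent requesters the grants alternate, so no core can starve another, which is the property a time-shared bus needs from its arbitrator.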

Available Mobile Processors

• The ARM Family
– The ARM7 Generation
– The StrongARM
– The ARM Thumb Option
– The ARM Piccolo Option
– The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
– The Motorola-Lucent Star*Core
– The Philips TriMedia
– The HP/Intel IA-64

Available MP-Cores

• TI OMAP
• Philips's Nexperia™ DVP
• ST Nomadik
• Intel® Itanium® Montecito
• CELL Processor
• CT 3400 Multi-core DSP
• HiBRID-SoC
• Systolic Ring
• Virtual platform in SHAPES project

TI OMAP

• Targets communications, multimedia.
• Dual-processor (DSP, RISC) with shared memory
• Hierarchical Definition of Platform
• Critical Role of Software as well as Hardware
• OCP (Open Core Protocol) based SoC platform

[Figure: OMAP 5910 block diagram: C55x DSP, ARM9, MMU, memory controller, MPU interface, system DMA control, bridge, I/O]

Platform Layers and Classification

• Level 0: Foundation Platform
– Infrastructure & standards: basic architecture
– Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-Specific Integration Platform
– Application-specific SoC: HW & SW
– Mobile platform, home platform
• Level 2: System Platform
– Terminal platform
– Handset case: RF + Modem + AP + Memory + MMI

Hierarchy of Platforms in OMAP

[Figure: platform stack ranging from application specific to broadly applicable: Reference Design; Application Platform; SoC Platform; ASIC Library & Tools; Silicon Technology. Labels: System Platform, OMAP Products, OMAP Infrastructure, Reuse]

Scalable Multi-processors

TI OMAP 1510 Platform Architecture

[Figure: block diagram. TI925 and C55X cores, each with a peripheral bus; System DMA; Mail Box; DSP MMU; IMIF; Traffic Controller; EMIFF; EMIFS; SRAM; LB MMU; HASB MMU; SDRAM bus (16); Flash bus (16); LB (32); HASB (32); peripheral buses (8/16/32). Peripherals: LCD controller, interrupt handlers, timers, GPIO, UARTs, McBSP]

• GPP: TI925 core
– 16KB I-Cache
– Write buffer
– MMU and D-MMU
– Dual TLB
• DSP: C55x core
– Internal memory: 48KW SARAM, 32KW DARAM, 16KW PDRAM
– 24KB I-Cache
– Graphic HW accelerator
– ARM port interface
• IPC
– Mail boxes
– API
– DSP MMU
• System DMA
• Traffic controller
• Internal SRAM
• Busses
• Peripherals

TI OMAP 1510 Platform S/W

[Figure: software stack]
• TI925 general-purpose processor: media APIs, resource manager, node database, OS adapter, MCU Bridge kernel, LINK driver, OS kernel & drivers
• TMS320 DSP: XDAIS algorithms (MPEG4, MP3, AMR) encapsulated in socket nodes, RM server, DSP/BIOS kernel, LINK driver, other drivers
• Raw data streams: video, audio, speech

Philips's Nexperia™ DVP

(source: Th. Claasen, Philips, DAC 2000)

Philips Nexperia™ DVP S/W Reference Architecture

[Figure: software reference architecture]
• Inputs: analog inputs (analog front ends), digital inputs (digital front ends), optical drive, hard disk, network protocols
• Buses: compressed A/V input bus, uncompressed A/V input bus
• Players: Broadcast-MPEG2, VCD/SVCD, DVD, CD/SACD, WMT, RN, Broadcast-MPEG4
• Recorders: DVD+RW Auth, PVR-SPTS, Lo-Rate SPTS, CD/DVD-MP3
• Transcoders and translators: TS-SPTS filter, loopback / feedthrough
• Outputs: digital outputs, protocol stack, network protocols, driver, HDD/Ethernet
• Presentation engine: audio and video processing

Philips Nexperia™ DVP MP-SoC

• Philips's advanced set-top box and digital TV SoC (Viper2)
• 0.13 μm
• 50 M transistors
• 100 clock domains
• > 60 IP blocks

ST Nomadik

• Targets mobile multimedia.
• A heterogeneous multiprocessor-of-multiprocessors.

Power Distribution: Intel Xeon Processor

Clock and Power Convergence

Dynamic voltage and frequency scaling (DVS)

Intel® Itanium® Montecito - Clock system architecture

– Each core is split into 3 clock domains on a variable power supply

Intel® Itanium® Montecito - Power management

– Dynamic voltage-scaling power management system
– 4 on-die sensors
– On-die microcontroller
– Power and temperature measurement
– Voltage and frequency modulation
– 8μs power/temperature sampling interval
– Embedded firmware
– Power, temperature, or calibration measurements
– Power: closed-loop power control and system stability check
– Temperature: thermal sensor readout (junction temperature below 90°C monitoring) and power-control communication
– Calibration: power-measurement accuracy check

The implementation of a first-generation CELL Processor

The Cell Processor

• Fclock > 4 GHz.
• Memory bandwidth: 25.6 GBytes per second.
• I/O bandwidth: 76.8 GBytes per second.
• Performance:
– 256 GFLOPS (single precision at 4 GHz).
– 256 GOPS (integer at 4 GHz).
– 25 GFLOPS (double precision at 4 GHz).
• 235 square mm.
• 235 million transistors.
• Power consumption estimated at 60 - 80 W @ 4GHz
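The 256 GFLOPS single-precision figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 8 SIMD units (the SPEs), 4 single-precision lanes per unit and a fused multiply-add counted as 2 operations per lane per cycle; these three numbers are not stated on the slide and are labeled as assumptions in the code. The helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(n_units, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak throughput in GFLOPS: units x lanes x flops per lane per cycle x GHz."""
    return n_units * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumptions (not from the slide): 8 SPEs, 4 lanes each, FMA = 2 flops per lane.
single_precision = peak_gflops(n_units=8, simd_lanes=4,
                               flops_per_lane_per_cycle=2, clock_ghz=4.0)
print(single_precision)  # matches the 256 GFLOPS quoted at 4 GHz

# Memory bandwidth from the slide, expressed per peak floating-point operation.
bytes_per_flop = 25.6 / single_precision
print(round(bytes_per_flop, 3))
```

The second number shows why on-chip local storage matters: at peak rate only about a tenth of a byte of external memory traffic is available per operation.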

Cell's Element Interconnect Bus

• 4 rings (2 clockwise + 2 counter-clockwise)
• No token rings, still request/grant arbitrations

CT 3400 Multi-core DSP

• Eight 32-bit DSP cores
• Six 32-bit general-purpose processor cores
• 128-pin programmable I/O subsystem
• Programmable in C
• Supports H.264 and MPEG4 codecs

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

[Figure: H.264 encoder, decoder, audio codecs and system control]

H.264 codec onto CT3400 MDSP

[Figure from Cradle]


CT3400 DSP Engine

Each DSP engine contains:
• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register
• A 512 x 20 program memory

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

CT3600 Multiprocessor DSP Family Members

• The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product

• The industry's first single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs

• 0.13-micron technology, 16 DSPs and 8 general-purpose processors quadruple overall performance

• Priced from $40 to $90

http://www.cradle.com/downloads/CT3600-PB.pdf

HiBRID-SoC Architecture

• Multi-core SoC architecture
• Dedicated chips for the MPEG-4 Simple Profile
• Integrates a powerful on-chip communication structure
• Three programmable cores, each adapted towards a specific class of algorithms
– Instruction level: VLIW (very long instruction word)
– Data level: SIMD (single instruction, multiple data)
– Task level: simultaneous multithreading
• Developed at the University of Hannover

Multi-Core SoC Architecture

• HiPAR-DSP
– 16-datapath SIMD processor core controlled by VLIW
– Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
• Stream Processor (SP)
– 32-bit RISC architecture optimized towards control-dominated tasks
– Bitstream processing or global system control
• Macroblock Processor (MP)
– Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
– Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps

HiBRID-SoC multi-core architecture

• 64-bit AMBA AHB system bus connects all cores
• SDRAM memory via a 64-bit SDRAM interface
• Two versatile 32-bit host interfaces for access (e.g., host PC via PCI and to serial flash memory)

HiPAR-DSP

• Highly parallel DSP core with a VLIW-controlled SIMD architecture
• DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
• At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
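The 2.3 GMACs figure is consistent with the 16-datapath SIMD core mentioned earlier in the deck, if each datapath completes one multiply-accumulate per cycle. That one-MAC-per-cycle rate is an assumption, and `peak_gmacs` is a hypothetical helper used only for this check.

```python
def peak_gmacs(datapaths, clock_mhz, macs_per_datapath_per_cycle=1):
    """Peak multiply-accumulate rate in GMAC/s."""
    return datapaths * macs_per_datapath_per_cycle * clock_mhz / 1000.0


# 16 SIMD datapaths at the targeted 145 MHz clock.
print(peak_gmacs(datapaths=16, clock_mhz=145))  # about 2.3 GMACs
```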

Macroblock processor

• Heterogeneous data path structure consisting of a scalar and a vector data path
• The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
• The vector data path is equipped with a 64-entry register file of 64-bit width
• Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
• MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
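Subword parallelism can be emulated in software to see what the hardware does in one instruction. The sketch below packs lanes into a 64-bit integer and adds them lane by lane with wrap-around, so that carries never cross lane boundaries. It models only the arithmetic, not the macroblock processor's actual instruction set; all function names are hypothetical.

```python
def lane_mask(lane_bits):
    """Bit mask covering one lane of the given width."""
    return (1 << lane_bits) - 1


def pack(values, lane_bits):
    """Pack unsigned lane values into one 64-bit word (lane 0 in the low bits)."""
    if len(values) * lane_bits != 64:
        raise ValueError("lanes must fill exactly 64 bits")
    word = 0
    for i, v in enumerate(values):
        word |= (v & lane_mask(lane_bits)) << (i * lane_bits)
    return word


def unpack(word, lane_bits):
    """Split a 64-bit word back into its unsigned lane values."""
    return [(word >> (i * lane_bits)) & lane_mask(lane_bits)
            for i in range(64 // lane_bits)]


def packed_add(a, b, lane_bits):
    """Lane-wise wrap-around add of two 64-bit words."""
    sums = [x + y for x, y in zip(unpack(a, lane_bits), unpack(b, lane_bits))]
    return pack(sums, lane_bits)  # pack() masks each lane, dropping the carry


a = pack([250, 1, 2, 3, 4, 5, 6, 7], 8)   # eight 8-bit entities
b = pack([10] * 8, 8)
print(unpack(packed_add(a, b, 8), 8))     # lane 0 wraps: 250 + 10 -> 4
```

The same three functions cover the four 16-bit and two 32-bit cases by changing `lane_bits`, which mirrors how one 64-bit register operand serves all three subword formats.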

HiBRID-SoC Implementations

[Figure: chip layout of the HiBRID-SoC]

• MPEG-4 ASP decoder (full TV resolution) performance on MP and SP: 720x576 @ 25 Hz, 1.5-3 Mbits
• Fabricated in a 0.18 µm, 6LM standard-cell technology
• 14 million transistors, 3.5 W
• 82 mm², 145 MHz

New Taxonomy/Metric

• Flynn: triple (d, i, c)
– d: # of data streams
– i: # of instruction streams
– c: # of configuration states
– SISD, SIMD, MIMD, MISD
• RA: (c, g, a)
– c: configurability to various environments
– g: size of granularity
– a: adaptability to various components
– SCSG, SCMG, SCLG
– MCSG, MCMG, MCLG

Systolic Ring

• Based on a coarse-grained configurable PE
• Circular datapaths
– C: # of layers (C = 4)
– N: # of Dnodes per layer (N = 2)
– S: # of rings (S = 1)
• Control units (sequencers)
– Local Dnode unit
– Local Ring unit
– Global unit

[Figure: ring of four layers (layer 1 to layer 4), each with two Dnodes, separated by switches; Dnode sequencer and local ring sequencer]

Motivation For Using Hierarchical Rings

• Relatively simple switching logic reduces the complexity at each node resulting in reduced buffer, area and energy requirements.

• Low latency since packets are forwarded in 1 clock cycle.

• Packets will always arrive in-order at the destination.

• Broadcast and Multicast packets are efficiently implemented.

• Hierarchical rings can be partitioned into independent clock domains.

Remanence

$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$

• NPE: # of processing elements (PE)
• Nc: # of PEs configurable per cycle
• Fe: operating frequency
• Fc: configuration frequency

Characterizes the dynamism:
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations

[Figure: sequencer issuing inst0 ... instn from the configuration memory through a sequencing unit and routing to processing elements (PEs) linked by an interconnection]

Operative Density

$$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$$

• NPE: # of PEs
• A: core area (relative unit λ²)
• Area can be expressed as a function of NPE

[Figure: sequencer, sequencing unit, routing and processing elements (PEs) linked by an interconnection]

Remanence formalisation

• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1

$$R(N_{PE}) = k \cdot N_{PE}, \qquad k = C/N$$

[Figure: remanence (0 to 40) versus # of Dnodes (0 to 180) for k = 1, 2, 4, 8, shown next to a Systolic Ring of eight layers (layer 1 to layer 8) with two Dnodes per layer and switches between layers]

Architectural model characterization

[Figure: four Systolic Rings, each with its local ring sequencer, connected by a global bus under a global sequencer]

• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)

Control units:
• Local Dnode unit
• Local Ring unit
• Global unit

www.qstech.com

Best OD and remanence

[Figure: operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140) for S = 1, 2, 4, 8. Design space: worst interconnect resources and processing power]

Worst OD and remanence

[Figure: operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140) for S = 1, 2, 4, 8. Design space: best interconnect resources and processing power]

Comparisons of RA

1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (≤8)
3. Many cycles to (re)configure fine grain RA

Name            Type              NPE    Nc     F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
MorphoSys       Coarse Grain RA   128    16     100       8
TMS320C62       DSP VLIW          8      8      300       1

$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$

Pascal BENOIT
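The R column of the comparison can be recomputed from NPE and Nc using the remanence definition R = (NPE · Fe) / (Nc · Fc). The sketch below assumes Fe = Fc, i.e. configuration runs at the operating frequency; this is an assumption, chosen because it reproduces the tabulated values. The function and dictionary names are hypothetical.

```python
def remanence(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * F_e) / (N_c * F_c); with F_e == F_c this reduces to N_PE / N_c."""
    return (n_pe * f_e) / (n_c * f_c)


# (N_PE, N_c) pairs taken from the comparison table.
architectures = {
    "ARDOISE": (2304, 0.14),
    "Systolic Ring": (24, 4),
    "DART": (24, 4),
    "MorphoSys": (128, 16),
    "TMS320C62": (8, 8),
}
for name, (n_pe, n_c) in architectures.items():
    print(f"{name:14s} R = {remanence(n_pe, n_c):.0f}")
```

A remanence of 1 (the VLIW DSP) means the whole machine is reconfigured every cycle, while the fine-grain fabric needs thousands of cycles, which matches the three observations above the table.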

Virtual platform in SHAPES project



Homogeneity and Heterogeneity

MPSoC Architecture Trends

Exploitable Parallelism

[Figure: exploitable task parallelism versus minimum parallel grain size (instructions)]
• Regions: instruction-level parallelism; GP O/S thread-level parallelism; MultiFlex thread-level parallelism
• Grain sizes shown: 1, 100's and 10 000's of instructions
• Parallelism ranges shown: 1~8, 2~6 and 1~100

Three Levels of Parallelism

Parallel Heterogeneous Platforms (PHPs)

• Challenges:
– Explore the theoretically high performance

Platform        Company             PEs    Het?
Cell            IBM/Sony/Toshiba    9      Y
DRP             NEC                 512    N
Nomadik         ST                  3+     Y
OMAP2420        TI                  4      Y
Nexperia        Philips             3+     Y
X-Fi            Creative            7      Y
ARM11 MPCore    ARM                 1-4    N
IXP2800         Intel               17     Y
MXP5800         Intel               54     Y
…               …                   …      …

(From Abhijit Davare's Quals Presentation)

Homogeneous MP-SOC

• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA Bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC

NEC MP211: Homogeneous MP core

• Asymmetric MP with very coarse-grain multitasking
• 3 ARM9s utilized as predefined function units
• No complex overhead: e.g., no cache coherency, dynamic scheduling or load balancing

MP211 block diagram

Power consumption of H.264+AAC

• H.264 video decoder (QVGA, 15 fps) and MPEG2 AAC decoder (48K stereo, 128 kbps) for DTV: 87 mW (excluding I/O and SDRAM), 124 mW (including I/O and SDRAM)
• The L0 region represents baseline power consumption; the L1 region represents the periods in which IP blocks are running at high power.

Problems of Homogeneous MP

▷ Under power constraints, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power consumption grows linearly.
▷ Interconnect is logic-dependent as well as wire-dependent.
▷ Processors are bound by wire and memory latencies.
▷ Peak performance is reached only for specific applications.
▷ The on-chip interconnect has been designed independently of the cores and caches.

Heterogeneous MP Core

• If two or more cores share L2, as many present CMPs do, a crossbar provides a high-bandwidth connection.

• A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.

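Heterogeneous MP core

• Easy hardware implementation: using widely available processor cores shortens hardware development time and lowers cost.

• Lower power consumption: the distributed tasks are handled by multiple processors running at lower clock frequencies. A lower clock frequency permits a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

• Boosted real-time performance: each application can run on a different processor, which reduces interference between applications.

• Higher system safety: system software and untrusted applications can be separated by running them on different processors.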
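The power argument can be checked with the usual first-order model in which dynamic power is proportional to Vdd² · f. The sketch below compares one core at 1.5 V / 500 MHz with two cores at 1.2 V / 350 MHz, two of the DVFS operating points listed on the low-power slide. It assumes equal switched capacitance per core, perfectly parallel work and no leakage; these are simplifying assumptions, and `relative_dynamic_power` is a hypothetical helper.

```python
def relative_dynamic_power(vdd, freq_mhz, n_cores=1):
    """Dynamic power in arbitrary units, using P proportional to n * Vdd^2 * f."""
    return n_cores * vdd ** 2 * freq_mhz


single = relative_dynamic_power(1.5, 500)            # one fast core
dual = relative_dynamic_power(1.2, 350, n_cores=2)   # two slower cores

print(f"single core: {single:.0f} units at 500 MHz aggregate")
print(f"dual core:   {dual:.0f} units at 700 MHz aggregate")
print(f"power ratio dual/single: {dual / single:.2f}")
```

Under these assumptions the dual-core configuration offers 40% more aggregate clock cycles while drawing roughly 10% less dynamic power, which is the effect the second bullet describes.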

Heterogeneous MP Core

▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large cores.

▷ The multi-ISA multicore architecture is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. The number, size and type of cores, and the caches, must be decided. For an 8-core processor, the interconnect consumes as much power as one core.

▷ For a dual processor, for example, heterogeneous processors that exploit both low and high thread-level parallelism improve performance by 63% over homogeneous ones. With 5-8 threads, the average improvement is 29%.

▷ Serial parts run quickly on a large core, and parallel parts run on small, low-power cores, which maximizes the performance-to-power ratio. [Annavaram, et al]

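NEC's Asymmetric (or Heterogeneous) Multiprocessing

• Using widely available processor cores shortens hardware development time and lowers cost.

• Lower power consumption: the distributed tasks are handled by multiple processors running at lower clock frequencies. A lower clock frequency permits a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

• Boosted real-time performance: each application can run on a different processor, which reduces interference between applications.

• Higher system safety: system software and untrusted applications can be separated by running them on different processors.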

Heterogeneous MP-SoC 문제점들

• Processors are bound by wire and memory latencies

• Peak performance on only a small class of applications.

• How well they map to a given design

• Diversification of workloads

• Increased hardware complexity

• Poor resource utilization

AMP task allocation image


Bus and Memory Architecture


Alpha cores scaled to 0.10 um.

EV8 is 80 times bigger but provides only two to three times more single-threaded performance


Equal-area heterogeneous architectures with multithreaded cores


Exploring the potential from heterogeneity



MP-SoC Design Automation


Optimization and Synthesis

• Computation Synthesis:
– Task Allocation
– Task Scheduling

• Communication Synthesis:
– Interconnection Synthesis
– Buffer sizing

Energy-Aware Task mapping

Minimize energy consumption, given a CTG and a heterogeneous NoC

• Find:
– A mapping function M : tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned

• Solution formulated as a quadratic assignment problem and solved using branch and bound.

• Communication-optimal task mapping
– Minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification.
– Given a multiprocessor network, find a mapping of the application that satisfies the timing constraints.

• Genetic algorithm (chromosome, generation, crossover, mutation)

Addressed by Hu et al., 2002.
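As a rough illustration of the mapping problem, the sketch below assigns tasks to tiles of a small 2D mesh so that communication energy is minimal. It makes several simplifying assumptions that are not in the slides: tiles are identical, energy is modeled as traffic volume × Manhattan hop count × a constant per-hop cost, and the search is plain exhaustive enumeration rather than branch and bound or a genetic algorithm. The task graph and all names are hypothetical.

```python
from itertools import permutations


def manhattan(a, b):
    """Hop count between two mesh tiles given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_energy(mapping, traffic, tiles, e_hop=1.0):
    """Sum of volume * hops * energy-per-hop over all communicating task pairs."""
    return sum(
        vol * manhattan(tiles[mapping[src]], tiles[mapping[dst]]) * e_hop
        for (src, dst), vol in traffic.items()
    )


def best_mapping(tasks, traffic, tiles):
    """Exhaustive search over one-task-per-tile assignments (small cases only)."""
    best, best_e = None, float("inf")
    for perm in permutations(range(len(tiles)), len(tasks)):
        mapping = dict(zip(tasks, perm))
        e = mapping_energy(mapping, traffic, tiles)
        if e < best_e:
            best, best_e = mapping, e
    return best, best_e


# Hypothetical 4-task graph on a 2x2 mesh; edge weights are traffic volumes.
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
tasks = ["t0", "t1", "t2", "t3"]
traffic = {("t0", "t1"): 10, ("t1", "t2"): 10, ("t2", "t3"): 10, ("t0", "t3"): 1}

mapping, energy = best_mapping(tasks, traffic, tiles)
print(mapping, energy)
```

Exhaustive search grows factorially with the number of tiles, which is why the slide points to branch and bound and genetic algorithms for realistic sizes.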


Interconnection Synthesis

– With each new technology:
– Gate delay decreases ~25%
– Wire delay increases ~100%
– Cross-chip communication increases
– Clock needs multiple cycles to cover die

Source: SIA NTRS Projection

Interconnect Delays & Density

Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology


Buffer Sizing

• Architectures have bounded buffer resources.

• If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.

• Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.
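The first half of this trade-off (more buffering, less waiting) can be seen in a toy cycle-level model. The sketch below is an assumption-laden illustration: a producer sends bursts of one item per cycle, a consumer drains one item every second cycle, and the function counts how many cycles the producer stalls for a given FIFO depth. It does not model the area, clock-speed or design-closure costs mentioned in the last bullet.

```python
def producer_stalls(depth, burst=8, gap=8, drain_period=2, bursts=4):
    """Count cycles the producer waits because a FIFO of `depth` entries is full."""
    fifo = 0      # current FIFO occupancy
    stalls = 0    # cycles in which the producer had data but no free slot
    cycle = 0
    for _ in range(bursts):
        to_send = burst   # items left in the current burst
        idle = gap        # idle cycles after the burst
        while to_send > 0 or idle > 0:
            # Consumer removes one item every `drain_period` cycles.
            if cycle % drain_period == 0 and fifo > 0:
                fifo -= 1
            if to_send > 0:
                if fifo < depth:
                    fifo += 1
                    to_send -= 1
                else:
                    stalls += 1
            else:
                idle -= 1
            cycle += 1
    return stalls


for d in (1, 2, 4, 8):
    print(f"depth {d}: {producer_stalls(d)} stall cycles")
```

In this model the stall count falls as the depth grows and reaches zero once the FIFO can absorb a whole burst; beyond that point extra buffering buys nothing.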


Multiple Clocks due to Interconnect limitation


MPSoC HW platform perspective

• Today's platforms are quite heterogeneous
– Reasons: efficiency and legacy IP

• Homogeneous MPSoC would scale well and would simplify programming
– Works well for desktop PCs
– Too inefficient for embedded apps

• Mixed MPSoC as a compromise?
– Globally homogeneous, locally heterogeneous
– (Re)configurable PEs

www.iss.rwth-aachen.de

Future MPSoC programming

• Sequential-to-parallel code generation
– C code (and platform/RTOS model) in, parallel C codes out
» Step 1: exhibit parallelism at block/task level to the user for manual mapping
» Step 2: automate code partitioning/mapping

• Massive use of compiler technology, e.g. data flow analysis

• Use of "platform refinement" technology as backend for machine code generation and simulation

www.iss.rwth-aachen.de

The von Neumann inheritance

• Sequential programming of sequential machines
– Pascal, Modula-2, C, C++, Java, ...

• Sequential programming of parallel machines?
– VLIW: handled by sophisticated compilers
– SIMD: will be accomplished by compilers
– Does not scale to heterogeneous MPSoC with distributed control paths!

• Parallel programming of parallel machines!
– "We need to move ... to parallel thinking and programming... We are standing at the very beginning... It's a huge area." (J. Gutknecht, ETH Zurich)

• What to do in the meantime?

© 2007 R. Leupers

Block clustering approach


Block clustering approach (cont'd)


Block clustering approach (cont'd)

Multicore SoC Design Methodology


Network On Chip


Technology Evolution


What are NoC’s?

• According to Wikipedia:

– “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”


Network-on-Chip (NoC)

• Communication is achieved by connecting switches together to form a network topology.

• Offers much greater scalability.

• Parallelism: multiple components can send data simultaneously.

• Energy efficient: point-to-point connections require less energy than a bus.

• Global synchronization is no longer needed.

NoC Design Considerations (I)

• There are several popular topologies:
– 2D Mesh (most popular).
– Torus (rings)
– Tree (fat-tree, butterfly fat-tree)

• The on-chip interconnection network will soon be a limiting factor for performance and energy consumption:
– has been reported to account for over 50% of the total energy requirement!

• The interconnect should consume the fewest resources possible and should be:
– area efficient: switches should be as small (simple) as possible.
– energy efficient: related to area efficiency
– fast: simple routing algorithms should be used.
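One example of a "simple routing algorithm" for a 2D mesh is dimension-ordered (XY) routing, in which a packet first travels along X and then along Y. The slides do not name a specific algorithm, so the sketch below is only an illustration; the (x, y) tile addressing is an assumption.

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2D mesh: move along X first, then Y.

    Returns the list of (x, y) tiles visited, including source and destination.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)           # tiles visited in order
print(len(route) - 1)  # hop count equals the Manhattan distance
```

Each router only compares the destination coordinates with its own, which keeps the switch logic small.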

[Figure: NoC exemplified. Nine processor masters, one global memory slave and two global I/O slaves, connected through nine routing nodes.]

NoC: Good news

☺ Only point-to-point one-way wires are used, for all network sizes.

☺ Aggregated bandwidth scales with the network size.

☺ Routing decisions are distributed and the same router is re-instantiated, for all network sizes.

☺ NoCs increase wire utilization (as opposed to ad-hoc p2p wires)

Sergio Tota and Mario R. Casu

There’s no free lunch…

• Internal network contention causes (often unpredictable) latency.
• The network has a significant silicon area.
• Bus-oriented IPs need smart wrappers.
• Software needs clean synchronization in multiprocessor systems.
• System designers need reeducation for new concepts.

Sergio Tota and Mario R. Casu

Facts about NoC’s

• It is a way to decouple computation from communication

• The design is layered (physical, network, application…): Taming complexity is made easier

• Communication between processing elements in NoC takes place by encapsulating data in packets

• The elementary packet piece to which switch and routing operations apply is the flit
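The packet/flit relationship can be sketched as follows. This is a simplified illustration under assumed conventions: a head flit that carries only the destination, fixed-size body flits, and a final flit tagged as the tail. Real NoCs differ in flit width and header contents.

```python
def packetize(dest, payload, flit_bytes=4):
    """Split a payload into flits: one head flit, then body flits, the last marked tail."""
    flits = [("head", dest)]
    chunks = [payload[i:i + flit_bytes] for i in range(0, len(payload), flit_bytes)]
    for i, chunk in enumerate(chunks):
        kind = "tail" if i == len(chunks) - 1 else "body"
        flits.append((kind, chunk))
    return flits


def depacketize(flits):
    """Reassemble (dest, payload) from the flits of a single packet."""
    kind, dest = flits[0]
    if kind != "head":
        raise ValueError("packet must start with a head flit")
    payload = b"".join(data for _, data in flits[1:])
    return dest, payload


flits = packetize((2, 1), b"0123456789")
for flit in flits:
    print(flit)
```

Only the head flit carries routing information; in wormhole switching the remaining flits simply follow the path the head has reserved.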


Topologies

• Heritage of networks with new constraints
– Need to accommodate interconnects in a 2D layout
– Cannot route long wires (clock frequency bound)

a) SPIN, b) CLICHÉ, c) Torus, d) Folded torus, e) Octagon, f) BFT.

SPIN (Guerrier et al., DATE ’00/’03)

• Wormhole switching, adaptive routing and credit-based flow control.
• It is based on a fat-tree topology.
• A flit is only one word (36 bits, 4 bits are for packet framing).
• The input buffers have a depth of 4 words.

Dally et al., DAC’01

• 2D folded torus topology
• Wormhole routing and Virtual Channels (VC)

Kumar et al., ISLVLSI’02

• Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ)
• 2D Mesh Topology
• Message Passing

Pande et al., TCOMP’05

• Butterfly Fat Tree
• Wormhole, Virtual channels
• Header flits: 3 clock cycles latency (input arbitration, routing, output arbitration)
• "Body" flits: 3 clock cycles (input arbitration, switch traversal, output arbitration)

Goossens et al., IEE CDT’03

• Both VCT and WH, GT and BE, IQ and VOQ

• GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from In "j" to Out "k" in a S&F fashion.

• BE uses Matrix Scheduling

• GT connections set up by BE special system packets

• Prototype with WH and IQ
– 5 ports
– 0.13 um, 0.26 mm2, 500/166 MHz
– Flit size = 3 words, each 32 bits
– 80 Gb/s aggregate bandwidth


SKKU’s Mobile MP-SoC Platform

Koonshik Cho & Jun Dong Cho
Mobile SoC Design Automation Lab.

Sungkyunkwan Univ.

1. Multiprocessor SoC Design Platform

• SW: copes actively with performance improvements and changes in standards

• HW: modular, flexible and scalable architecture

• Platform based design

2. Multiprocessor SoC Platform test

• DVB-T Receiver

3. Tools

• Seamless CVE (Mentor Graphics)

• ADS(ARM)

SoC (DSP+ARM) Platform


SoC (DSP+ARM) Platform


Extended multi-processor platform


ARM Platform


AMBA BUS (1)

The AMBA bus contains a multiplexer, an arbiter and a decoder, and arbitrates among multiple masters and slaves.

AMBA BUS (2)

AMBA Bus (Master to slave multiplexer)

• A bus master is a device that initiates read and write operations by driving address and control signals to a slave. Only one master may transfer at any given time. Multiple masters are supported.

AMBA BUS (3)

AMBA Bus (Slave to master multiplexer)

• A bus slave is a device that responds to a master's reads and writes within its assigned address space. The slave reports its status to the master through the ready and response signals. Multiple slaves are supported.

AMBA BUS (4)

• AHB Arbiter: the bus arbiter ensures that only one master is selected at a time. It arbitrates according to its own priority scheme, and an AHB contains exactly one arbiter.

• AHB Decoder: the AHB decoder selects the appropriate slave from the upper bits of the address driven by the master. An AHB likewise contains a single decoder.

• APB Bridge: the only bus master on the APB (Advanced Peripheral Bus). The APB bridge is a slave on the ASB; when the decoder selects the APB, the bridge acts as the master on the APB. As a slave module, the APB bridge handles the bus handshake and signal retiming on behalf of the local peripheral bus.
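The decoder and arbiter roles described above can be sketched as two small functions. This is a behavioral illustration only, not the AMBA specification: the 32-bit address width, the use of the top 4 bits as the slave select, and the fixed "lowest index wins" priority are all assumptions made for the example.

```python
def decode_slave(addr, region_bits=4, addr_bits=32):
    """Select a slave index from the upper `region_bits` bits of the address."""
    return (addr >> (addr_bits - region_bits)) & ((1 << region_bits) - 1)


def arbitrate(requests):
    """Fixed-priority arbiter: grant the lowest-numbered requesting master.

    `requests` is a list of booleans, one per master. Returns the granted
    master index, or None when no master is requesting.
    """
    for master, req in enumerate(requests):
        if req:
            return master
    return None


# Masters 1 and 2 request the bus in the same cycle; only one is granted.
granted = arbitrate([False, True, True])
slave = decode_slave(0x80001234)
print(f"granted master {granted}, selected slave {slave}")
```

Because exactly one master is granted per transfer and exactly one slave is decoded, the bus needs only one arbiter and one decoder, as the slide notes.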


AMBA BUS (5)

• Interrupt controller: receives interrupt request signals from up to 32 interrupt sources and generates the nIRQ or nFIQ signal applied to the ARM9 processor. Of the 32 sources, sources 0 to 3 generate nFIQ and sources 4 to 31 generate nIRQ. A lower number means a higher priority.

• Timer: the timer module provides three timers. Each timer is a 16-bit counter that supports three prescale values (16, 256 and 4096), decrements its count by 1 every period, and raises an interrupt when the count reaches 0. It holds the interrupt request until the ARM9 processor acknowledges it through the timer interrupt clear register.
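A behavioral sketch of the two peripherals follows. It mirrors the description above but fills in details the slide leaves open, so treat these as assumptions: the prescale value is modeled as a clock divider, the timer simply stops at zero rather than reloading, and the interrupt status is passed in as a 32-bit mask. All names are hypothetical.

```python
def pending_lines(requests):
    """Map a 32-bit request mask to (fiq_asserted, irq_asserted, top_source).

    Sources 0-3 drive nFIQ, sources 4-31 drive nIRQ; the lowest-numbered
    active source has the highest priority.
    """
    active = [i for i in range(32) if (requests >> i) & 1]
    fiq = any(i <= 3 for i in active)
    irq = any(i >= 4 for i in active)
    top = min(active) if active else None
    return fiq, irq, top


class Timer:
    """16-bit down counter with a prescaler; holds its IRQ until cleared."""

    PRESCALES = (16, 256, 4096)

    def __init__(self, load, prescale=16):
        if prescale not in self.PRESCALES:
            raise ValueError("unsupported prescale")
        self.count = load & 0xFFFF
        self.prescale = prescale
        self._div = 0
        self.irq = False

    def tick(self):
        """Advance one input clock; the count drops once every `prescale` clocks."""
        self._div += 1
        if self._div == self.prescale:
            self._div = 0
            if self.count > 0:
                self.count -= 1
                if self.count == 0:
                    self.irq = True

    def clear_irq(self):
        """Model a write to the timer interrupt clear register."""
        self.irq = False


print(pending_lines(0b10100))  # sources 2 and 4 are active
```

The request stays asserted across ticks until `clear_irq()` is called, matching the "hold until acknowledged" behavior in the text.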


Teak DSP Platform

• Structure of the Teak DSP platform, which acts as the co-processor in the overall platform

Configuration of crossbar switch

• Communication interface architecture (crossbar structure)

Reconfigurable crossbar switch

Uses the VHDL generate statement

Reconfigurable crossbar switch (VHDL code)

entity CI_TOP is
  generic ( number_of_masters , number_of_slaves : integer);
  port ( … omitted );
end CI_TOP;

Entity of the CI module (ci_top.vhd)

COMMUNICATION_INTERFACE : CI_TOP
  generic map( number_of_masters=>4 , number_of_slaves =>6)
  port map( … omitted );

Use of the CI module (multiplatform.vhd)

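Communication Interface: Mux vs Crossbar

Mux
  Advantages:
  ‣ Relatively easy to implement
  ‣ Effective when there are few masters and slaves
  Disadvantages:
  ‣ Parallel processing among processors is difficult
  ‣ Causes a bottleneck when the system is expanded

Crossbar
  Advantages:
  ‣ Enables effective parallel processing among processors
  ‣ Delay stays the same even when the system is expanded
  Disadvantages:
  ‣ Difficult to implement
  ‣ Relatively unfavorable in size and power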

Interconnection network

Omega interconnection Octagon interconnection

Mesh interconnection


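Interconnection network: advantages and disadvantages

Shared bus
‣ Relatively easy to implement
‣ Effective when there are few masters and slaves
‣ Delay stays the same even when the system is expanded
‣ Parallel processing among processors is difficult
‣ Low bus efficiency
‣ High power consumption (broadcasting)
‣ Implementation complexity: low

Crossbar
‣ Parallel processing among processors is possible
‣ Scalability and flexibility: good
‣ Data path: moderate
‣ Implementation complexity (routing and scheduling): moderate
‣ Size and wiring grow as the system is expanded

Omega network
‣ Parallel processing among processors is possible
‣ Scalability and flexibility: moderate
‣ Data path: good (short)
‣ Implementation complexity (routing and scheduling): high
‣ Somewhat limited scalability

Octagon
‣ Parallel processing among processors is possible
‣ Scalability and flexibility: moderate
‣ Data path: good (shortest)
‣ Implementation complexity (routing and scheduling): high
‣ Somewhat limited scalability (the number of masters and slaves is limited to 8)

Mesh
‣ Parallel processing among processors is possible
‣ Scalability and flexibility: good
‣ Data path: moderate
‣ Implementation complexity (routing and scheduling): very high
‣ Suitable for medium and large systems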

Network router cell

– Multiprocessor platform with 4 masters and 6 slaves

– The CI cell is built from 24 2-by-2 muxes

– CI controller => Req, Grant, mux control, etc.

– With Seamless CVE and ModelSim running in co-simulation, the ARM926 and the Teak DSP access the slaves at the same time and each reads and writes its own data. [Figure: platform function block]

– When a master accesses a slave (IP), the CI controller handles the request and grant signals, the mux control signals, round-robin arbitration and address decoding. [Figure: CI controller inner block]
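The round-robin and grant behavior of such a controller can be sketched as below for the 4-master, 6-slave configuration. The slides do not show the controller's internal structure, so the arrangement here (one round-robin arbiter per slave, masters naming the slave they want each cycle) is an assumption for illustration, and all names are hypothetical.

```python
class RoundRobinArbiter:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, n_masters):
        self.n = n_masters
        self.last = n_masters - 1  # so master 0 has top priority at start

    def grant(self, requests):
        """Return the granted master index, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            m = (self.last + offset) % self.n
            if requests[m]:
                self.last = m
                return m
        return None


def crossbar_cycle(arbiters, wanted):
    """One arbitration cycle of a crossbar.

    `wanted[m]` is the slave index master m requests (or None).
    Returns {slave: granted_master} for every slave that was requested.
    """
    grants = {}
    for slave, arb in enumerate(arbiters):
        g = arb.grant([w == slave for w in wanted])
        if g is not None:
            grants[slave] = g
    return grants


N_MASTERS, N_SLAVES = 4, 6
arbiters = [RoundRobinArbiter(N_MASTERS) for _ in range(N_SLAVES)]

# Masters 0 and 1 contend for slave 2 while master 2 uses slave 5 in parallel.
wanted = [2, 2, 5, None]
first = crossbar_cycle(arbiters, wanted)
second = crossbar_cycle(arbiters, wanted)
print(first, second)
```

Masters that target different slaves are served in the same cycle, which is the parallelism advantage of the crossbar; contention for one slave is resolved by rotation.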


CI-controller State Diagram


CI controller simulation waveform


DVB-T Baseband Receiver


Hardware-software co-design flows


A shared memory structure and hardware-software partitioning


Frequency offset compensator hardware


Fine and Coarse Frequency Synchronizer (Beek & Classen)


FFT block diagram


Equalizer hardware block diagrams


DVB-T baseband Receiver Scheduling


DVB-T baseband Receiver Scheduling


Performance evaluation

Processing type / Functional block      SW        SW & HW (Teaks + ARM + HW IP)   HW (IP) only   MAL
Frequency compensator & Remove Guard    -         182.5 us                        13.8 us        10.5 us
Fine freq. sync. (Beek)                 -         56.3 us                         1.5 us         7.8 us
Symbol Timing Recovery                  144 us    -                               -              5.2 us
FFT                                     -         188.9 us                        38.6 us        13.6 us
Coarse freq. sync. (Classen)            -         241 us                          3.3 us         11 us
Scattered Pilot Detection               46.5 us   -                               -              3.3 us
Equalizer                               -         219.5 us                        11.2 us        9.5 us
De-mapping                              19.9 us   -                               -              4.9 us

Task Chart of Multi-processor platform for DVB-T baseband receiver


Task Chart of Multi-processor platform for DVB-T baseband receiver



Modeling of Motion Compensation IP using SCML

Le Minh Nghia & Jun Dong Cho
Mobile SoC Design Automation Lab.

Sungkyunkwan Univ.

Introduction of SystemC Modeling in CoWare

• TLM Peripheral Modeling with CoWare
• SystemC Modeling Library (SCML)
• Motion Compensation Modeling using SCML

TLM Peripheral Modeling with CoWare

• Four use-cases for Transaction-Level Modeling (TLM)
– Functional View (FV)
– Architect's View (AV)
– Programmer's View (PV)
– Verification View (VV)


• General pattern for modeling a peripheral component
– Separate behavior, communication and timing
– Platform-dependent initiators and targets are created by the user.
– A bus transactor converts generic communication into a bus-specific TLM interface.
– Timing accuracy depends on the use case


• Communication
– Communication through function calls
– Simulation speed strongly depends on the bus model
– A PV bus model used for software development can be simulated very fast

• Behavior
– Functionality
– Synchronization
– Storage

• Timing
– The timing model is based on the clock object in the SystemC Modeling Library (SCML)


• Modeling Target pattern
– Communication: bus transactor
– Storage and synchronization: register bank as interface
– Behavior: collection of call-back functions, each call-back corresponding to a bitfield or register in the register bank
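The idea behind the target pattern (a register bank whose behavior is attached as call-backs on individual registers) can be shown with a small language-neutral sketch. This is a plain Python analogue written for illustration; it is not the SCML API, and the class, method and register names are hypothetical.

```python
class RegisterBank:
    """Storage for memory-mapped registers with optional write call-backs."""

    def __init__(self, size):
        self.regs = [0] * size
        self._on_write = {}

    def on_write(self, index, callback):
        """Attach behavior to a register: `callback(value)` runs on each write."""
        self._on_write[index] = callback

    def write(self, index, value):
        """Bus-side write: store the value, then trigger the attached behavior."""
        self.regs[index] = value
        callback = self._on_write.get(index)
        if callback is not None:
            callback(value)

    def read(self, index):
        """Bus-side read: return the stored value."""
        return self.regs[index]


START_STOP = 0  # hypothetical register offsets
PARAM = 1

events = []
bank = RegisterBank(size=4)
bank.on_write(START_STOP, lambda v: events.append("start" if v else "stop"))

bank.write(PARAM, 16)      # plain storage, no behavior attached
bank.write(START_STOP, 1)  # triggers the call-back
bank.write(START_STOP, 0)
print(events, bank.read(PARAM))
```

Keeping storage in the bank and behavior in the call-backs is what lets the same peripheral model sit behind different bus transactors.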



• Modeling Initiator pattern
– Communication: the bus transactor converts the posted transactions in the queue into real bus transactions
– Storage and synchronization: includes a post port and the initiator storage element scml_array (in SCML). The post port posts transactions in a nonblocking manner. The actual synchronization depends on the data and space available in the storage element, which is tied to the scml_array object
– Behavior: modeled by autonomous SystemC processes

Initiator Synchronization

• Two classes of initiator blocks:
– Free-running initiator: transfers initiated by the block do not need any access from another peripheral
– Initiator block with a target port: transfers are only initiated when the target port is accessed

• Three synchronization patterns for an initiator block:
– Free-running Initiator
– Fully Slaved Initiator
– Semi-free running Initiator


• Modeling a Free-Running Initiator Peripheral
– The thread is modeled by SC_THREAD and posts transactions
– wait(sc_time): schedules the next execution of the thread


• Modeling a Fully-Slaved Initiator Peripheral
– A slaved initiator only sends a transaction when its target port is accessed
– The loop in a fully-slaved initiator returns control to the master thread after it has posted the transaction


• Modeling a Semi-Free-Running-Slaved Initiator Peripheral
– The thread containing the loop is triggered by a start event
– The start event is generated by accessing the target port of the initiator

SystemC Modeling Library (SCML)

• Memories and bitfield objects:
– Model bitfields and memory-mapped registers
– Memory objects support posting nonblocking transactions
– Support synchronization through data reads and writes based on blocking access

• Clock object
– Models timing or clocks in IPs

• Initiator-side object
– Models the communication of initiator peripherals to support re-use.

Modeling TLM Motion Compensation

• Outline features of the Motion Compensation IP
– Synchronization: Semi-Free-Running-Slaved Initiator
– Behavior: algorithm extracted from J.M source code
– The structure has two parts
• Target part: interfaces with the master processor through a register bank, modeled following the Target pattern of SCML
• Initiator part: models the posting of transactions and the synchronization of their transmission, following the Initiator pattern of SCML




• Three ports:
– pConfig: interface with the master processor to receive parameters.
– p_Irq: generates an interrupt to synchronize with the master processor
– p_Post: posts transactions to a specific bus through the bus transactor

• Register bank: for the parameters of the Motion Compensation block

• StartStopReg and IrqReg: for the interface with the master processor

• Behavior block: for the Motion Compensation algorithm and transaction modeling

• Call-back functions: for events caused by writing StartStopReg and IrqReg.


• Functions in the TLM Motion Compensation model:
– f_initialize(): initializes the parameters of the model
– f_thread(): waits for the event generated by writing to StartStopReg
– f_write_start_stop(): call-back function for the event of writing to StartStopReg. It activates or deactivates the model by generating an sc_event to signal f_thread().
– f_clear_irq(): clears IrqReg
– f_MotionCompensation(): Motion Compensation behavior based on the original source code in the J.M reference software.
– f_do_post(): posts the transactions in storage (transaction pool) to the bus transactor and manages posting synchronization
– f_postTransfer(): posts a transaction to the bus transactor
– f_release_trans(): releases the transaction pool

Next…

• Extract parameters as TestVector from J.M source code

• Build a platform in CoWare

• Test Motion Compensation IP with TestVector

Concluding Remarks

• As the complexity and cost of (mobile) SoCs increase, a design process based on an MP-SoC platform becomes important.

• Challenges for mobile platforms: low power, verification that includes the RF interface, variety of standards, and platform optimization.

• It is desirable to develop a platform that takes into account the strengths and weaknesses of the various platforms and methodologies.

• Train engineers (system architects) who understand, and can design across, HW, SW and algorithms.