Source: vada.skku.ac.kr/ClassInfo/comp-arch/SoC-arch/MP-SoC-design(185p).pdf

Multi-processor System on Chip Design

Jun Dong Cho, Sungkyunkwan University
Mobile SoC Design Automation Lab., Sungkyunkwan Univ., S. Korea

© Jun Dong Cho, 2007.7

Contents

• Requirements for next-generation SoC (System on Chip)
• The need for MP-SoC
• History of Multiprocessors
• MP-SoC Examples and Applications
• Homogeneity and Heterogeneity
• MP-SoC Design Automation
• Network on Chip
• SKKU's Mobile MP-SoC Platform

Requirements for Next-Generation SoC (System on Chip)

What is System on Chip?

[Figure: handset components grouped around the SoC.]

• Processor: AP - MC
• Modem: GSM/GPRS - WCDMA - CDMA2000
• Connectivity: Wireless LAN - GPS - Bluetooth
• RF/Analog: Rx - Tx - Zero IF - PM
• Camera Chipset: CIS - CCD - ISP
• Display Driver IC (DDI): STN - TFT - OLED
• Smart Card: SIM
• Flash Memory: Code/Data Storage
• SIP / MCP
• RAM: Mobile DRAM - SRAM - UtRAM

The Need for High Performance and Low Power

[Figure: growth trends over time (axis marks: 1995, 2003, 2012). Curves: design complexity, Shannon's law (2.8x / 18 months), Moore's law, battery capacity. Wireless data rates: 2G (IS-95) 9.6 kbps, 3G (CDMA 1xEV) 3,100 kbps, 4G (100 Mbps to 1 Gbps). Mobile multimedia: 3D graphics; QVGA, D1, HD (720p), Full HD (1080i).]

• Productivity gap: design complexity vs. Moore's law
• Power gap: design complexity vs. battery capacity

Flexibility-Energy Gap

[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility.]

• Embedded processor (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• FPFA (Field Programmable Function Array): 10-80 MOPS/mW
• Signal-processing ASIC: 200 MOPS/mW

Regions marked on the chart: sensor network design space; wireless embedded systems design space.
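
The efficiency figures on this slide convert directly into a power budget, since power (mW) = workload (MOPS) / efficiency (MOPS/mW). The sketch below is illustrative only: the 10,000 MOPS workload is a hypothetical value that does not appear in the slides, the FPFA entry uses the lower end of its quoted range, and the pairing of 200 MOPS/mW with the ASIC is read from the chart layout.

```python
# Energy-efficiency figures quoted on the slide, in MOPS/mW.
# The FPFA range is 10-80 MOPS/mW; the lower bound is used here.
EFFICIENCY_MOPS_PER_MW = {
    "Embedded processor (ARM)": 0.5,
    "ASIP/DSP": 3.0,
    "FPFA (lower bound)": 10.0,
    "ASIC": 200.0,
}


def power_mw(workload_mops: float, efficiency_mops_per_mw: float) -> float:
    """Power in mW needed to sustain a workload at a given efficiency."""
    return workload_mops / efficiency_mops_per_mw


if __name__ == "__main__":
    workload = 10_000.0  # hypothetical workload in MOPS (assumption)
    for name, eff in EFFICIENCY_MOPS_PER_MW.items():
        print(f"{name}: {power_mw(workload, eff):.0f} mW")
```

The same workload costs 400 times more power on the embedded processor than on the ASIC (200 / 0.5), which is the gap the slide title refers to.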

Five Requirements for Raising the Productivity of Next-Generation SoC Design

1. High Performance
2. Fast Verification
3. Small Form Factor
4. Low Power Solutions
5. Design-Technology Integration for Manufacturability

1. High Performance: CMP + NoC

Heterogeneous Chip Multi-processor Architecture

[Figure: μP, IP, Mem, and PE blocks connected through a NoC.]

[Figure: Technology Evolution. Number of PEs (0 to 400) versus year (2004 to 2016). Source: ITRS 2005 draft.]

2. Fast Verification: Embedded System Level

Complexity growth rates:
• Moore's Law: 2x / 18 months
• Nielsen's Law: 2x / 12 months
• Embedded SW: 2x / 10 months

Design levels and their languages:
• System specification: UML / Java / MatLab
• Architecture design: SystemC / ADL
• RTL design: Verilog / VHDL

[Figure: ESL and TLM, with bus signals ctrl1/cmd1/, Req, Addr, Grant, Data, ack0, ack1.]
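
The doubling periods quoted on this slide (18, 12, and 10 months) can be compared numerically. The sketch below assumes pure exponential growth; the five-year horizon is an arbitrary choice for illustration and is not taken from the slides.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Growth after elapsed_months for a quantity doubling every doubling_months."""
    return 2.0 ** (elapsed_months / doubling_months)


# Doubling periods quoted on the slide, in months.
DOUBLING_MONTHS = {"Moore's Law": 18, "Nielsen's Law": 12, "Embedded SW": 10}

if __name__ == "__main__":
    horizon = 60  # five years, chosen only for illustration
    for name, period in DOUBLING_MONTHS.items():
        print(f"{name}: x{growth_factor(period, horizon):.1f}")
    gap = growth_factor(10, horizon) / growth_factor(18, horizon)
    print(f"Embedded SW vs. Moore's Law after {horizon} months: x{gap:.1f}")
```

Over five years embedded software complexity grows 2^6 = 64 times under these rates, while transistor count grows about 10 times, which is the motivation for faster, higher-level verification.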

3. Small Form Factor

SiP: Mobile Application Processor + Mobile Memory

[Figure: a board area of about 35 mm x 25 mm holding a mobile AP, 32 MB NAND, and two 16 MB SDRAMs, compared with a single 17 mm x 17 mm package containing flash, two SDRAMs, and the mobile AP.]

• EMI reduction
• 60% smaller area
▷ 8-layer MCP
▷ Cost reduction by 15%

4. Low Power Solutions

Techniques span the device, circuit, architecture, and runtime levels (source: DAC 2004):
• MTCMOS
• Clock Gating
• Multi-Vdd
• Tr Sizing
• VTCMOS
• Multi-Vt
• SOI
• High-κ Metal Gate
• Parallelization
• GALS
• DPM/DVS

[Figure: DVFS and Multi-Vdd example with operating points 1.0 V at 200 MHz, 1.2 V at 350 MHz, and 1.5 V at 500 MHz.]

[Figure: VTCMOS and MTCMOS circuits in active and standby modes; signals VBP, VBN, VDD, VSS.]
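
To see why the three DVFS operating points matter, the first-order CMOS dynamic power model P ≈ α·C·V²·f can be applied to them. This is a sketch under stated assumptions: the model is the standard textbook approximation, not something given on the slide, and it ignores leakage and assumes the switched capacitance and activity factor are identical at every operating point.

```python
def relative_dynamic_power(v: float, f_mhz: float,
                           v_ref: float, f_ref_mhz: float) -> float:
    """Dynamic power relative to a reference point, using P ~ V^2 * f."""
    return (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)


# (Vdd in volts, frequency in MHz) operating points quoted on the slide.
OPERATING_POINTS = [(1.5, 500.0), (1.2, 350.0), (1.0, 200.0)]

if __name__ == "__main__":
    v_ref, f_ref = OPERATING_POINTS[0]  # fastest point as the reference
    for v, f in OPERATING_POINTS:
        ratio = relative_dynamic_power(v, f, v_ref, f_ref)
        print(f"{v:.1f} V @ {f:.0f} MHz: {ratio:.3f} of reference power")
```

Under this model the 1.0 V / 200 MHz point delivers 40% of the peak clock rate for under 20% of the peak dynamic power, because the voltage term enters squared.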

5. Design-Technology Integration for Manufacturability (DfM)

[Figure: design levels exchanging variation information and the designer's intention (critical timing, power) through statistical analysis.]

• Design levels: Quantum Physics; Mask / Process Design; Logic / Physical Design; Architecture Design; Algorithm Design
• Variation information: NA, Tox; Vt, Lg, L, t, tILD; Vdd, Temp; Latency, Power; Fault Probability
• Techniques: Statistical STA; yield-improving architecture; fault-tolerant algorithm

More SoC topics …

• Platform optimization
  – Power management
  – BW allocation
  – Resource sharing
  – Task distribution
  – Efficient communications
• Low Power
• Verification
• Training system architects

The Need for MP-SoC (Multi-Processor System on Chip)

Definition of MP-SoC?

Usually a heterogeneous multiprocessor:
• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Definition of multiprocessor by Enslow Jr.: MIMD machines with shared memory
• Shared memory
• Shared I/O
• Distributed OS
• Homogeneous

Extended definition: all parallel machines (wrong usage)

Future Microprocessors

Why MP?

• Uniprocessors have hit the ceiling
• Get performance from better architecture instead of more MHz

Anatomy of a Cellular Phone

3G Wireless Protocols

MP-SoC Application Areas: 4G: Multiple standards

• Communications.
• Networking.
• Multimedia.
• Security.

Digital RF supporting multiband/multimode

Evolution of the MP-SoC Platform (example: WCDMA + CDMA2000)

System Architecture for 3G

• 4 PEs
  – static kernel mapping and scheduling
  – SIMD + scalar units
• 1 ARM GPP controller
  – scalar algorithms and protocol controls

ARM MPCore™ Architecture

Reconfigurable and Scalable MP-SoC Platform

Road Map to MP-SoC Trends

• Mask NRE: over $1M; design NRE: $10M to $75M
  – ASICs replaced by programmable ASSPs, FPGAs
• Number of embedded processors
  – DVD/STB/HDTV, mobile phones: 5 to 8
  – Image processing, networking, base station: 8 to 100+
• E-S/W complexity
  – Set-top box, audio: >1 million lines of code
  – E-S/W becoming an essential part of SoCs

Who's Law?


Why is MP-SOC Challenging?

Software Defined Wireless Multimedia Terminals

• Lower costs
  – Platform longevity, higher volume
  – SW has lower development costs
• Time to market
  – Future protocols will have complex implementations
  – Overlap testing/development cycles
• Adaptability
  – Standards change over time
  – Multi-mode operation
  – Sharing hardware resources

Multistandard Radio
• UMTS
• GSM/GPRS/EDGE
• WLAN
• Bluetooth
• UWB

Multistandard M/M
• H.264
• MP3
• AAC
• GPS
• DVB-H
• TPEG

SDR = Reconfigurable Radios

SDR Configuration

Soft Radio Digital Signal Processing Engine

• Modulation Format
  – QPSK
  – DQPSK
  – π/4 DQPSK
  – {16,64,256,1024} QAM
  – OFDM
  – OFDM CDMA
• Digital Down/Up Conversion (DDC)
  – Channel Center
  – Decimation/Interpolation rates
  – Compensation Filters
  – Matched Filter α = {0.25,0.35,...}
• FEC
  – Convolutional
  – Reed-Solomon
  – Concatenated Coding
  – Turbo CC/PC
  – (De-)Interleave
• Network Interface Definition
• Channel Access
  – CDMA
  – TDMA
• Security
• Beam Forming
• DSSS
  – Rake, track, acquire
  – Multi User Detect. (MUD)
  – ICU
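
One consequence of making the modulation format configurable is that the number of bits carried per symbol changes with the constellation size M, as log2(M). The sketch below computes it for the QAM orders listed on the slide; the function name is hypothetical and the check that M is a power of two is an assumption made for this example.

```python
# QAM constellation sizes listed on the slide.
QAM_ORDERS = [16, 64, 256, 1024]


def bits_per_symbol(order: int) -> int:
    """Bits carried by one symbol of a constellation with `order` points."""
    if order < 2 or order & (order - 1):
        raise ValueError("constellation size must be a power of two")
    return order.bit_length() - 1  # exact log2 for powers of two


if __name__ == "__main__":
    for m in QAM_ORDERS:
        print(f"{m}-QAM: {bits_per_symbol(m)} bits/symbol")
```

A reconfigurable baseband therefore has to handle symbol mappers from 2 bits (QPSK, 4 points) up to 10 bits (1024-QAM) per symbol with the same hardware resources.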

Future mobile applications?

• Mobile supercomputing (Mudge et al.)
  – Speech recognition.
  – Cryptography.
  – Augmented reality.
  – Typical applications (email, etc.).
• Requires 16x 2 GHz Pentium 4?

Culture and Education? Personal Entertainment?

MP-SoC Application Areas

[Figure: application areas grouped as Broadcasting, Ubiquitous; Health, Human, Bio; Automotive & Robotics. Labeled items: D-TV, CIS, Mobile, Recorder, Health, HCI, Bio, Data Broadcasting, RFID, Telematics, Unmanned Driving, Robot.]

MPSoC today

• High performance, low power: there is no other way than MPSoC!
• Virtually all processor vendors are on the MPSoC route
  – TI: OMAP, DaVinci
  – STM: Nomadik
  – IBM: Cell
  – Intel: IXP, CoreDuo
  – Philips: Nexperia
  – Atmel: Diopsis
  – ARM: MPCore
  – ARC: VRaptor
• Urgent need for MPSoC design tools
  – Application design and platform capture
  – Architecture exploration and optimization
  – Simulation and verification
  – Application to architecture mapping

The triangle, Chicken and Egg?

[Figure: triangle linking architectures, applications, and methodologies.]

• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.

Should the SoC designer work hard?

[Figure: design flow between Requirements and Tape Out, with a Verify step paired with each stage.]

• Simulate (performance): Verify
• Synthesis + P&R: Verify (timing, area)
• SoC Composer: Verify
• Simulate: Verify
• Compose the system: Verify

• Why is verification important in mobile SoC?
• Why has our verification become weak?

Some statements from MPSoC 2006 Symposium

• "The ad-hoc approach to SoC design cannot scale with Moore's Law... The SW development environment as afterthought era of IC design is rapidly drawing to a close" (K. Keutzer, UCB)

• "Power-constrained CPUs are mandatory, but the most exciting features require system-level SW optimization" (M. Kuulusa, Nokia)

• "Multi-core platforms are a reality – but where is the SW support?" (R. Lauwereins, IMEC)

Gartner Announces Top 10 Technologies for 2007

• Gartner announced the 10 technologies expected to reach maturity within the next three years (25th Annual Gartner Data Center Conference, November 28 to December 1, 2006)

• Open Source, Virtualization, Information Access, Ubiquitous Computing, Grid Computing, Compute Utilities, Multicore Processors, Web 2.0, Network Convergence, Water Cooling

http://www.gartner.com/2_events/conferences/lsc25.jsp

Fundamental to Parallel Machines

Purposes of multiple processors

• Performance
  – A job can be executed quickly with multiple processors
• Fault tolerance
  – If a processing unit is damaged, the total system can remain available: redundant systems
• Resource sharing
  – Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
  – High performance with low-frequency operation

Why Multi-Threaded Cores?

[Figure: NoC-based platform with In and Out ports, SRAM, DSPs, a hardware multi-threaded (H/W-MT) RISC, H/W processing elements, and a GPP with I$ and D$ caches.]

• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock)
• More parallel processing (lower power, higher performance/mm²)

Flynn's Classification

• The number of instruction streams: M (Multiple) / S (Single)
• The number of data streams: M / S
  – SISD: uniprocessors (including superscalar, VLIW)
  – MISD: not existing (analog computer)
  – SIMD
  – MIMD
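
Flynn's scheme is a two-way lookup on the number of instruction streams and the number of data streams. The sketch below encodes exactly that rule; the function name `flynn_class` is hypothetical and chosen for this example.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"  # Single or Multiple instruction
    d = "S" if data_streams == 1 else "M"         # Single or Multiple data
    return f"{i}I{d}D"


if __name__ == "__main__":
    # A uniprocessor, a 4-lane vector unit, and a 4-core multiprocessor.
    for streams in [(1, 1), (1, 4), (4, 4)]:
        print(streams, flynn_class(*streams))
```

Superscalar and VLIW machines still classify as SISD under this rule because they execute a single instruction stream, however many operations issue per cycle.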

MIMD

[Figure: processors connected to memory modules (instructions, data) through interconnection networks.]

• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible

Classification of MIMD machines

• UMA (Uniform Memory Access model) provides shared memory that all processors access in the same manner.

• NUMA (Non-Uniform Memory Access model) provides shared memory, but access to it is not uniform.

• NORA/NORMA (No Remote Memory Access model) provides no shared memory. Communication is done with message passing.

An example of UMA: Bus connected

[Figure: four PUs, each with a snoop cache, attached to main memory over a shared bus.]

SMP (Symmetric MultiProcessor)

On-chip multiprocessor
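
The role of the snoop caches on the shared bus can be illustrated with a toy model. This is a minimal sketch, assuming a write-through, write-invalidate policy; the slide names only "snoop cache" and does not say which coherence protocol is used, and the class names are hypothetical.

```python
class SnoopBus:
    """Shared bus that broadcasts writes so other caches can snoop them."""

    def __init__(self):
        self.memory = {}   # main memory: address -> value
        self.caches = []   # every cache attached to this bus

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, addr):
        return self.memory.get(addr, 0)

    def write(self, writer, addr, value):
        self.memory[addr] = value              # write-through to main memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)   # other caches observe the write


class SnoopCache:
    def __init__(self, bus):
        self.lines = {}    # valid cached lines: address -> value
        self.bus = bus
        bus.attach(self)

    def load(self, addr):
        if addr not in self.lines:             # miss: fetch over the shared bus
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)             # drop a stale copy, if present


if __name__ == "__main__":
    bus = SnoopBus()
    pu0, pu1 = SnoopCache(bus), SnoopCache(bus)
    pu0.store(0x10, 1)
    print(pu1.load(0x10))   # pu1 misses and fetches the value from memory
    pu0.store(0x10, 2)      # the bus write invalidates pu1's copy
    print(pu1.load(0x10))   # pu1 misses again and sees the new value
```

Because every write appears on the one shared bus, each cache can keep itself consistent just by watching it, which is also why this organization stops scaling once the bus saturates.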

Switch connected UMA

[Figure: CPUs, each with an interface and local memory, connected through a switch to main memory.]

NUMA

• Each processor provides a local memory and accesses other processors' memory through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable:
  – Programs for UMA can run without modification.
  – Performance improves with the system size.

Competitive with WS/PC clusters with software DSM

Typical structure of NUMA

[Figure: Node 0 to Node 3 connected by an interconnection network and mapped onto a single logical address space.]

Classification of NUMA

• Simple NUMA:
  – Remote memory is not cached.
  – Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
  – Cache consistency is maintained with hardware.
  – The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
  – No home memory
  – Complicated control mechanism

Cray's T3D: A simple NUMA supercomputer (1993)

• Using Alpha 21064

The Earth Simulator (2002)

NORA/NORMA

• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance

The fastest processor is always NORA (except the Earth Simulator)

Hard to program

Inter-PU communications
Cluster computing
Early hypercube machine: nCUBE2

MP-SoC Examples and Applications


Dual-Core (DSP+ARM) Platform

IBM Power4

– 2 cores
– F = 1.4 GHz
– Single clock over entire die
– Balanced H-tree driving global grid
– Measured clock skew below 25 ps
– Power ~85 W
– 180 nm SOI process, 174M transistors

IBM's Multiple processors on MCM

– 4 POWER4 chips in a single module (MCM)
– The POWER4 chips are connected via four 128-bit buses
– Up to 128 MB L3 cache
– Bus speed ½ processor speed
– Total throughput ~35 GB/s
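
The throughput figure can be related to the bus parameters with simple arithmetic: peak bandwidth = buses × (width / 8) bytes × bus clock. The sketch below is a back-of-the-envelope check that assumes one transfer per bus per bus-clock cycle, which the slide does not state.

```python
def peak_bandwidth_gb_s(num_buses: int, bus_width_bits: int,
                        bus_clock_ghz: float) -> float:
    """Peak bandwidth in GB/s, assuming one transfer per bus per cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return num_buses * bytes_per_transfer * bus_clock_ghz


def bus_clock_for(total_gb_s: float, num_buses: int,
                  bus_width_bits: int) -> float:
    """Bus clock in GHz that would yield the given total bandwidth."""
    return total_gb_s / (num_buses * bus_width_bits / 8)


if __name__ == "__main__":
    # Four 128-bit buses at half of a 1.4 GHz processor clock.
    print(peak_bandwidth_gb_s(4, 128, 1.4 / 2))
    # Bus clock implied by the quoted ~35 GB/s total.
    print(bus_clock_for(35.0, 4, 128))
```

Under that assumption, ~35 GB/s corresponds to a bus clock of roughly 0.55 GHz, that is a processor clock near 1.1 GHz, whereas half of the 1.4 GHz quoted for the single chip would give about 45 GB/s. The two slides may therefore refer to different clock speeds.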

MPSoC "Bus" Alternatives

• Fixed Bus [Bergamaschi, DAC, 2000]
  – Point-to-point communication
  – Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
  – Programmable interconnects
  – Employ static network
• Arbitrated Bus [IDT Inc., 2000]
  – Time-shared multiple core connectivity
  – Use arbitrator
• Hierarchical Bus [AMBA, ARM Inc]
  – Combine multiple buses using bus
  – Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
  – Resources communicate with data packets
  – Use switch fabric
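
For the arbitrated bus, the arbitrator decides in each cycle which requesting core gets the time-shared bus. The sketch below uses a round-robin policy as an assumption; the slide only says that an arbitrator is used, not which policy, and the class name is hypothetical.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants a shared bus to one requesting core per cycle, in rotation."""

    def __init__(self, num_cores: int):
        self.num_cores = num_cores
        self.last_grant = num_cores - 1  # so core 0 is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the granted core index, or None if no core is requesting."""
        for offset in range(1, self.num_cores + 1):
            core = (self.last_grant + offset) % self.num_cores
            if requests[core]:
                self.last_grant = core   # the search restarts after this core
                return core
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # Cores 0 and 1 request the bus continuously; core 2 stays idle.
    for cycle in range(4):
        print(cycle, arbiter.grant([True, True, False]))
```

Round-robin keeps any single core from starving the others, at the cost of ignoring priorities; a fixed-priority arbitrator would replace the rotating start index with a constant one.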

Available Mobile Processors

• The ARM Family
  – The ARM7 Generation
  – The StrongARM
  – The ARM Thumb Option
  – The ARM Piccolo Option
  – The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
  – The Motorola-Lucent Star*Core
  – The Philips TriMedia
  – The HP/Intel IA-64

Available MP-Cores

• TI OMAP
• Philips's Nexperia™ DVP
• ST Nomadik
• Intel® Itanium® Montecito
• CELL Processor
• CT 3400 Multi-core DSP
• Hibrid SoC
• Systolic Ring
• Virtual platform in SHAPES project

TI OMAP

• Targets communications, multimedia.
• Dual-processor (DSP, RISC) with shared memory
• Hierarchical definition of platform
• Critical role of software as well as hardware
• OCP (Open Core Protocol) based SoC platform

[Figure: OMAP 5910 block diagram with C55x DSP, ARM9, MMU, memory controller, MPU interface, system DMA control, bridge, and I/O.]

Platform Layers and Classification

• Level 0: Foundation Platform
  – Infrastructure & standards: basic architecture
  – Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-specific Integration Platform
  – Application-specific SoC: HW & SW
  – Mobile platform, home platform
• Level 2: System Platform
  – Terminal platform
  – Handset case: RF + Modem + AP + Memory + MMI

Hierarchy of Platforms in OMAP

[Figure: platform hierarchy ranging from broadly applicable to application specific, with reuse across levels. Layers: Silicon Technology; ASIC Library & Tools; SoC Platform; Application Platform; System Platform; Reference Design. Groupings: OMAP Infrastructure; OMAP Products.]


Scalable Multi-processors

TI OMAP 1510 Platform Architecture

[Figure: block diagram. TI925 and C55x cores, each with a peripheral bus; system DMA; mailbox; DSP MMU; traffic controller with IMIF, EMIFF, and EMIFS; SRAM; LB MMU and HASB MMU; SDRAM bus (16), flash bus (16), LB (32), HASB (32); peripheral buses (8/16/32). Peripherals: LCD controller, interrupt handlers, timers, GPIO, UARTs, McBSP.]

GPP: TI925 core
- 16KB I-Cache
- Write buffer
- MMU and D-MMU
- Dual TLB

DSP: C55x core
Internal memory
- 48KW SARAM
- 32KW DARAM
- 16KW PDRAM
24KB I-Cache
Graphic HW accelerator
ARM port interface

IPC
- Mail boxes
- API
- DSP MMU

System DMA
Traffic controller
Internal SRAM
Busses
Peripherals

TI OMAP 1510 Platform S/W

[Figure: software stack split between the TI925 general-purpose processor and the TMS320 DSP. Labels: media APIs; MPEG4, MP3, AMR; resource manager; MCU bridge kernel; OS adapter; LINK driver; OS kernel & drivers; RM server; DSP/BIOS kernel; LINK driver; other drivers; node database; XDAIS algorithms encapsulated in socket nodes. Raw data streams: video, audio, speech.]

Philips's Nexperia™ DVP

(source: Th. Claasen, Philips, DAC 2000)

Philips Nexperia™ DVP S/W Reference Architecture

[Figure: software reference architecture.
Inputs: analog inputs (analog front ends), digital inputs (digital front ends), optical drive, hard disk, network protocols.
Buses: compressed A/V input bus, uncompressed A/V input bus.
Players: Broadcast-MPEG2, VCD/SVCD, DVD, CD/SACD, WMT, RN, Broadcast-MPEG4.
Recorders: DVD+RW Auth, PVR-SPTS, Lo-Rate SPTS, CD/DVD-MP3, ...
Transcoders and translators: TS-SPTS filter, loopback / feedthrough.
Outputs: digital outputs, protocol stack, network protocols, driver (HDD/Ethernet), presentation engine, audio and video processing.]

Philips Nexperia™ DVP MP-SoC

• Philips's advanced set-top box and digital TV SoC (Viper2)
• 0.13 μm
• 50 M transistors
• 100 clock domains
• > 60 IP blocks

ST Nomadik

• Targets mobile multimedia. A heterogeneous multiprocessor-of-multiprocessors.

Power Distribution: Intel Xeon Processor

Clock and Power Convergence

Dynamic voltage and frequency scaling (DVS)

Intel® Itanium® Montecito - Clock system architecture

– Each core split into 3 clock domains on variable power supply

Intel® Itanium® Montecito - Power management

– Dynamic voltage-scaling power management system
– 4 on-die sensors
– On-die microcontroller
– Power and temperature measurement
– Voltage and frequency modulation
– 8 μs power/temperature sampling interval
– Embedded firmware
– Power, temperature, or calibration measurements
– Power: closed-loop power control and system stability check
– Temperature: thermal sensor readout (junction temperature below 90°C monitoring) and power-control communication
– Calibration: power-measurement accuracy check

The implementation of a first-generation CELL Processor

The Cell Processor

• Fclock > 4 GHz.
• Memory bandwidth: 25.6 GBytes per second.
• I/O bandwidth: 76.8 GBytes per second.
• Performance:
  – 256 GFLOPS (single precision at 4 GHz).
  – 256 GOPS (integer at 4 GHz).
  – 25 GFLOPS (double precision at 4 GHz).
• 235 square mm.
• 235 million transistors.
• Power consumption estimated at 60 - 80 W @ 4 GHz
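
The 256 GFLOPS single-precision figure can be reproduced as units × SIMD lanes × flops per lane per cycle × clock. The slide gives only the total; the breakdown used below (8 synergistic processing elements, 4 single-precision lanes each, a fused multiply-add counted as 2 flops) is the commonly cited first-generation Cell configuration and should be read as an assumption here.

```python
def peak_gflops(num_units: int, simd_lanes: int,
                flops_per_lane_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS for a set of identical SIMD units."""
    return num_units * simd_lanes * flops_per_lane_per_cycle * clock_ghz


if __name__ == "__main__":
    # Assumed breakdown: 8 SPEs x 4 lanes x 2 flops (multiply-add) at 4 GHz.
    print(peak_gflops(num_units=8, simd_lanes=4,
                      flops_per_lane_per_cycle=2, clock_ghz=4.0))
```

The same formula shows why peak numbers of this kind are upper bounds: they assume every lane of every unit issues a multiply-add on every cycle.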

Page 72: Multi-processor System on Chip Designvada.skku.ac.kr/ClassInfo/comp-arch/SoC-arch/MP-SoC-design(185p).pdf · Verilog / VHDL ESL ctrl1/cmd1/ Req Addr Grant ... – Turbo CC/PC –

© 조준동, 2007년 여름 72

Cell’s Element Interconnect Bus

• 4 rings (2 clockwise + 2 counter-clockwise)
• No token rings; arbitration is still request/grant


CT 3400 Multi-core DSP

• Eight 32-bit DSP cores

• Six 32-bit general-purpose processor cores

• 128-pin programmable I/O subsystem

• Programmable in C

• Supports H.264 and MPEG4 codecs

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 encoder, decoder, audio codecs, and system control


H.264 codec onto CT3400 MDSP

From cradle


CT 3400 Multi-core DSP

CT3400 DSP Engine

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

Each DSP engine contains:
• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register
• A 512 x 20 program memory


CT3600 Multiprocessor DSP Family Members

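• The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product

• It is the industry's first single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs

• 0.13-micron technology with 16 DSPs and 8 general-purpose processors quadruples overall performance

• Priced from $40 to $90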

http://www.cradle.com/downloads/CT3600-PB.pdf


CT 3616 Multi-core DSP

http://www.cradle.com/downloads/CT3600-PB.pdf


HiBRID-SoC Architecture

Multi-core SoC architecture
Dedicated chips for the MPEG-4 Simple Profile
Integrates a powerful on-chip communication structure
Three programmable cores, each adapted towards a specific class of algorithms:
– Instruction level: VLIW (very long instruction word)
– Data level: SIMD (single instruction, multiple data)
– Task level: simultaneous multithreading
Developed at the University of Hannover


Multi-Core SoC Architecture

• HiPAR-DSP
– 16-datapath SIMD processor core controlled by VLIW
– Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
• Stream Processor (SP)
– 32-bit RISC architecture optimized towards control-dominated tasks
– Bitstream processing or global system control
• Macroblock Processor (MP)
– Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
– Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps


HiBRID-SoC multi-core architecture

64-bit AMBA AHB system bus
Connects all cores
SDRAM memory via a 64-bit SDRAM interface
Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)


HiPAR-DSP

Highly parallel DSP core with a VLIW-controlled SIMD architecture

DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory

At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
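The 2.3 GMACs figure is consistent with the 16-datapath SIMD core described earlier in the deck. The sketch below assumes one multiply-accumulate per datapath per cycle, which the slides do not state explicitly; `peak_gmacs` is an illustrative helper.

```python
def peak_gmacs(datapaths, clock_mhz, macs_per_datapath_per_cycle=1):
    """Peak multiply-accumulates per second, in units of 10^9 (GMAC/s)."""
    return datapaths * macs_per_datapath_per_cycle * clock_mhz / 1000.0

# 16 datapaths at 145 MHz, assuming one MAC per datapath per cycle.
print(round(peak_gmacs(16, 145), 2))  # 2.32
```

With these numbers the peak is 2.32 GMAC/s, which rounds to the 2.3 quoted on the slide.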


Macroblock processor

Heterogeneous data path structure consisting of a scalar and a vector data path

The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)

The vector data path is equipped with a 64-entry register file of 64-bit width

Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms.

MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
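Subword parallelism can be illustrated in software by treating a 64-bit integer as a pack of independent lanes. The following is a minimal sketch of a packed add with wrap-around per lane; it only models the idea and is not the macroblock processor's actual instruction semantics. The name `packed_add` and the test operands are hypothetical.

```python
def packed_add(a, b, lane_bits):
    """Add two 64-bit operands lane by lane; carries never cross a lane boundary."""
    assert 64 % lane_bits == 0
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        # Extract one lane from each operand, add, and wrap within the lane.
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane_sum << shift
    return result

a = 0x00FF_0001_FFFF_0010
b = 0x0001_0001_0001_0001
print(hex(packed_add(a, b, 16)))  # four 16-bit lanes
print(hex(packed_add(a, b, 8)))   # eight 8-bit lanes
```

The same 64-bit operands give different results depending on the lane width, because the lane boundaries decide where carries are cut.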


HiBRID-SoC Implementations

Chip layout of the HiBRID-SoC.

MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, 720×576 @ 25 Hz, 1.5-3 Mbit/s:

HiBRID-SoC is fabricated in a 0.18 μm, 6LM standard-cell technology:
14 million transistors, 3.5 W, 82 mm², 145 MHz


New Taxonomy/Metric

• Flynn: triple (d, i, c)
– d: # of data streams
– i: # of instruction streams
– c: # of configuration states
– SISD, SIMD, MIMD, MISD
• RA: (c, g, a)
– c: configurability to various environments
– g: size of granularity
– a: adaptability to various components
– SCSG, SCMG, SCLG
– MCSG, MCMG, MCLG


Systolic Ring

• Based on a coarse-grained configurable PE
• Circular datapaths
– C: # of layers (C = 4)
– N: # of Dnodes per layer (N = 2)
– S: # of rings (S = 1)
• Control units (sequencers)
– Local Dnode unit
– Local Ring unit
– Global unit

[Figure: systolic ring with four layers (layer 1 to layer 4) of two Dnodes each, connected through switches; Dnode sequencer and local ring sequencer]


Motivation For Using Hierarchical Rings

• Relatively simple switching logic reduces the complexity at each node resulting in reduced buffer, area and energy requirements.

• Low latency since packets are forwarded in 1 clock cycle.

• Packets will always arrive in-order at the destination.

• Broadcast and Multicast packets are efficiently implemented.

• Hierarchical rings can be partitioned into independent clock domains.


Remanence

\[ R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c} \]

• NPE: # of processing elements (PE)
• Nc: # of PEs configurable per cycle
• Fe: operating frequency
• Fc: configuration frequency

Characterizes the dynamism:
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations

[Figure: architecture model with a sequencing unit (sequencer and configuration memory holding inst0 ... instn), routing (interconnection), and processing elements (PE)]


Operative Density

NPE: # of PEs

A: core area (relative unit λ²)

Area can be expressed as a function of NPE

\[ OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})} \]

[Figure: architecture model with a sequencing unit (sequencer and configuration memory holding inst0 ... instn), routing (interconnection), and processing elements (PE)]


Remanence formalisation

• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 systolic ring: S = 1

\[ R(N_{PE}) = k \cdot N_{PE}, \qquad k = C/N \]

[Figure: remanence (0 to 40) versus # of Dnodes (0 to 180), with curves for k = 1, 2, 4, 8; systolic ring with eight layers (layer 1 to layer 8) of Dnodes connected through switches]


Architectural model Characterization

[Figure: four systolic rings, each with four layers of Dnodes connected through switches, attached to a global bus; one local ring sequencer per ring and one global sequencer]

• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 systolic rings (S = 4)

Control units
• Local Dnode unit
• Local Ring unit
• Global unit

www.qstech.com


Best OD and remanence

[Figure: operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140), with curves for S = 1, 2, 4, 8]

Design space: worst interconnect resources and processing power


Worst OD and remanence

[Figure: operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140), with curves for S = 1, 2, 4, 8]

Design space: best interconnect resources and processing power


Comparisons of RA

1. Only 1 cycle to (re)configure the DSP

2. Few cycles to (re)configure coarse grain RA (≤8)

3. Many cycles to (re)configure fine grain RA

Name            Type              NPE    Nc     F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
MorphoSys       Coarse Grain RA   128    16     100       8
TMS320C62       DSP VLIW          8      8      300       1

\[ R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c} \]

Source: Pascal BENOIT
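The R column of the comparison can be reproduced from the remanence formula. The sketch below assumes that the single frequency F listed for each architecture serves as both the operating frequency Fe and the configuration frequency Fc, so that R reduces to NPE / Nc; the slide does not state this, but the listed values are consistent with it.

```python
def remanence(n_pe, n_c, f_e, f_c):
    """R = (N_PE * F_e) / (N_c * F_c)."""
    return (n_pe * f_e) / (n_c * f_c)

# (name, N_PE, N_c, F in MHz) as listed in the comparison table.
rows = [
    ("ARDOISE", 2304, 0.14, 33),
    ("Systolic Ring", 24, 4, 200),
    ("DART", 24, 4, 130),
    ("MorphoSys", 128, 16, 100),
    ("TMS320C62", 8, 8, 300),
]

for name, n_pe, n_c, f in rows:
    r = remanence(n_pe, n_c, f_e=f, f_c=f)  # assumption: F_e = F_c = F
    print(f"{name:14s} R = {r:.0f}")
```

A large R, as for the fine-grain ARDOISE, means many cycles are needed to reconfigure the whole fabric, while R = 1 for the DSP means it can be fully reconfigured every cycle.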


Virtual platform in SHAPES project


Homogeneity and Heterogeneity


MPSoC Architecture Trends


[Figure: exploitable task parallelism versus minimum parallel grain size (instructions). Regions: instruction-level parallelism, MultiFlex thread-level parallelism, and GP O/S thread-level parallelism. Grain-size marks: 1, 100's, 10 000's instructions. Parallelism ranges shown: 1~8, 2~6, 1~100]


Three Levels of Parallelism


Parallel Heterogeneous Platforms (PHPs)

• Challenges:
– Explore the theoretically high performance

Platform Company PEs Het?

Cell IBM/Sony/Toshiba 9 Y

DRP NEC 512 N

Nomadik ST 3+ Y

OMAP2420 TI 4 Y

Nexperia Philips 3+ Y

X-Fi Creative 7 Y

ARM11 MPCore ARM 1-4 N

IXP2800 Intel 17 Y

MXP5800 Intel 54 Y

… … … …

(From Abhijit Davare’s Quals Presentation)


Homogeneous MP-SOC

• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA Bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC


NEC MP211: Homogeneous MP core

• Asymmetric mp with very coarse grain multitasking

• 3 ARM9’s utilized as predefined function units

• NO complex overhead : e.g. no cache coherency, dynamic scheduling/load balancing



MP211 block diagram


Power consumption of H.264+AAC

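H.264 video decoder (QVGA, 15 fps) and MPEG2 AAC decoder (48K stereo, 128 kbps)
DTV: 87 mW (excluding I/O and SDRAM), 124 mW (including I/O and SDRAM)
The L0 region denotes baseline power consumption; the L1 region denotes the region in which IP blocks are running with high power consumption.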


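Problems of Homogeneous MP

▷ Under power constraints, a monolithic processor consumes too much power.

▷ Using several identical (homogeneous) processors gives low resource utilization, so power grows linearly.

▷ The interconnect depends not only on wires but also on logic.

▷ Processors are bound by wire and memory latencies.

▷ Peak performance is reached only for particular application classes.

▷ The on-chip interconnect has been designed independently, separate from the cores and caches.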


Heterogeneous MP Core

If two or more cores share L2, as many present CMPs do, a crossbar provides a high-bandwidth connection.

A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.


Heterogeneous MP core

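• Easy hardware implementation: using processor cores that are already in wide use shortens hardware development time and lowers cost.

• Lower power consumption: the distributed tasks are handled by multiple processors running at a lower clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be tuned by increasing or decreasing the number of processor cores.

• Boosting real-time performance: each application can run on a different processor, which reduces the interfaces between multiple applications.

• Higher system security: system software and untrusted applications can be separated by running them on different processors.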
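The power argument above follows from the standard dynamic power relation P = C · V² · f. The sketch below compares one core at full clock with two cores at half the clock and a reduced supply voltage. The capacitance, voltages, and frequencies are hypothetical values chosen for illustration, not figures from the slides.

```python
def dynamic_power(c_eff, v_dd, f_hz):
    """Dynamic switching power in watts: P = C_eff * V_dd^2 * f."""
    return c_eff * v_dd ** 2 * f_hz

C_EFF = 1e-9  # hypothetical effective switched capacitance per core (farads)

# One core at 400 MHz and 1.2 V versus two cores at 200 MHz and 0.9 V
# (illustrative operating points with the same aggregate cycle rate).
one_fast = dynamic_power(C_EFF, 1.2, 400e6)
two_slow = 2 * dynamic_power(C_EFF, 0.9, 200e6)

print(f"one fast core : {one_fast:.3f} W")
print(f"two slow cores: {two_slow:.3f} W")
```

Because voltage enters quadratically, the two slower cores deliver the same total cycle rate at lower dynamic power, provided the workload actually parallelizes across them.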


Heterogeneous MP Core

▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous (or asymmetric) CMP has many advantages. Many applications want to use small cores as well as large ones.

▷ A multi-ISA multicore architecture is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. The number, size, and type of the cores, as well as the caches, must be decided. In an 8-core processor, the interconnect consumes as much power as one core.

▷ For example, in the dual-processor case, heterogeneous processors that exploit both low and high thread-level parallelism improve performance by 63% over homogeneous ones. With 5 to 8 threads, the average improvement is 29%.

▷ Serial sections run quickly on a large core, while parallel sections run on small, low-power cores, maximizing the performance-to-power ratio. [Annavaram, et al]


NEC’s Asymmetric(or Heterogeneous) Multi processing

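• Using processor cores that are already in wide use shortens hardware development time and lowers cost.

• Lower power consumption: the distributed tasks are handled by multiple processors running at a lower clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be tuned by increasing or decreasing the number of processor cores.

• Boosting real-time performance: each application can run on a different processor, which reduces the interfaces between applications.

• Higher system security: system software and untrusted applications can be separated by running them on different processors.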


Problems of Heterogeneous MP-SoC

• Processors are bound by wire and memory latencies

• Peak performance on only a small class of applications.

• How well they map to a given design
• Diversification of workloads
• Increased hardware complexity
• Poor resource utilization


AMP task allocation image


Bus and Memory Architecture


Alpha cores scaled to 0.10 um.

EV8 is 80 times bigger but provides only two to three times more single-threaded performance


Equal-area heterogeneous architectures with multithreaded cores


Exploring the potential from heterogeneity


MP-SoC Design Automation


Optimization and Synthesis

• Computation Synthesis:
– Task allocation
– Task scheduling
• Communication Synthesis:
– Interconnection synthesis
– Buffer sizing


Energy-Aware Task mapping

Minimize energy consumption, given a CTG and a heterogeneous NoC
• Find:
– A mapping function M: tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned
• Solution formulated as a quadratic assignment problem and solved using branch and bound.
• Communication-optimal task mapping
– Minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification.
– Given a multiprocessor network, find a mapping of the application that satisfies the timing constraints.
• Genetic algorithm (chromosome, generation, crossover, mutation)

Addressed by Hu et al., 2002.
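A minimal sketch of the mapping search follows. It is not the formulation of Hu et al.; it assumes a simplified energy model in which communication energy is proportional to traffic volume times the Manhattan hop count on a 2D mesh, with homogeneous tiles and no timing constraints. The function `map_tasks`, the edge volumes, and the 2×2 mesh are hypothetical.

```python
def hops(tile_a, tile_b):
    """Manhattan distance between two mesh tiles given as (x, y)."""
    return abs(tile_a[0] - tile_b[0]) + abs(tile_a[1] - tile_b[1])

def map_tasks(num_tasks, edges, tiles):
    """Branch and bound: place each task on a distinct tile, minimizing
    the sum over edges of volume * hops (assumed energy proxy)."""
    best = {"cost": float("inf"), "mapping": None}

    def partial_cost(mapping):
        # Cost of the edges whose two endpoints are already placed.
        return sum(vol * hops(mapping[u], mapping[v])
                   for (u, v), vol in edges.items()
                   if u in mapping and v in mapping)

    def branch(task, mapping, used):
        cost = partial_cost(mapping)
        if cost >= best["cost"]:
            return  # bound: remaining edges can only add cost
        if task == num_tasks:
            best["cost"], best["mapping"] = cost, dict(mapping)
            return
        for tile in tiles:
            if tile not in used:
                mapping[task] = tile
                used.add(tile)
                branch(task + 1, mapping, used)
                used.remove(tile)
                del mapping[task]

    branch(0, {}, set())
    return best["mapping"], best["cost"]

# Hypothetical task graph: (source task, destination task) -> traffic volume.
tiles = [(x, y) for x in range(2) for y in range(2)]
edges = {(0, 1): 10, (1, 2): 4, (2, 3): 10, (0, 3): 1}
mapping, cost = map_tasks(4, edges, tiles)
print(mapping, cost)
```

The partial cost is a valid lower bound because every unplaced edge contributes a non-negative term, so pruning never discards a strictly better mapping.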


Interconnection Synthesis

– With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%
– Cross-chip communication increases
– Clock needs multiple cycles to cover die
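The widening gap between gate and wire delay compounds across generations. The sketch below reads the two percentages as multiplicative factors per technology generation (0.75 for gates, 2.0 for wires) and starts both delays at a normalized value of 1.0; both choices are assumptions made for illustration.

```python
def delay_after(generations, gate0=1.0, wire0=1.0, gate_factor=0.75, wire_factor=2.0):
    """Normalized gate and wire delay after a number of technology generations."""
    gate = gate0 * gate_factor ** generations
    wire = wire0 * wire_factor ** generations
    return gate, wire

for g in range(4):
    gate, wire = delay_after(g)
    # Columns: generation, gate delay, wire delay, wire/gate ratio.
    print(g, round(gate, 3), round(wire, 1), round(wire / gate, 1))
```

Under this reading the wire-to-gate delay ratio grows by a factor of about 2.7 per generation, which is why a signal eventually needs several clock cycles to cross the die.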

Source: SIA NTRS Projection


Interconnect Delays & Density

Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology


Buffer Sizing

• Architectures have bounded buffer resources.
• If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.
• Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.
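The trade-off between buffer depth and waiting time can be seen in a toy cycle-level model. The sketch below assumes a bursty producer (4 words at the start of every 8-cycle period) and a consumer that drains one word every second cycle; these traffic parameters and the function name are hypothetical.

```python
def producer_stall_cycles(depth, cycles=1000, burst=4, period=8, drain_every=2):
    """Count cycles in which the producer has a word ready but the FIFO is full."""
    fifo = 0      # words currently buffered
    pending = 0   # words the producer still has to push
    stalls = 0
    for t in range(cycles):
        if t % period < burst:
            pending += 1              # a new word becomes ready this cycle
        if pending:
            if fifo < depth:
                fifo += 1             # push succeeds
                pending -= 1
            else:
                stalls += 1           # FIFO full: producer waits
        if t % drain_every == 0 and fifo:
            fifo -= 1                 # consumer takes one word
    return stalls

for depth in (1, 2, 4):
    print(depth, producer_stall_cycles(depth))
```

In this model a deeper FIFO absorbs the burst and removes the stalls, while any depth beyond what the burst needs adds area without further benefit.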


Multiple Clocks due to Interconnect limitation


MPSoC HW platform perspective

• Today's platforms are quite heterogeneous
– Reasons: efficiency and legacy IP
• Homogeneous MPSoC would scale well and would simplify programming
– Works well for desktop PCs
– Too inefficient for embedded apps
• Mixed MPSoC as a compromise?
– Globally homogeneous, locally heterogeneous
– (Re)configurable PEs

www.iss.rwth-aachen.de


Future MPSoC programming

• Sequential-to-parallel code generation
– C code (and platform/RTOS model) in, parallel C codes out
» Step 1: exhibit parallelism at block/task level to the user for manual mapping
» Step 2: automate code partitioning/mapping
• Massive use of compiler technology, e.g. data flow analysis
• Use of "platform refinement" technology as backend for machine code generation and simulation

www.iss.rwth-aachen.de


The von Neumann inheritance

• Sequential programming of sequential machines
– Pascal, Modula-2, C, C++, Java, ...
• Sequential programming of parallel machines?
– VLIW: handled by sophisticated compilers
– SIMD: will be accomplished by compilers
– Does not scale to heterogeneous MPSoC with distributed control paths!
• Parallel programming of parallel machines!
– "We need to move ... to parallel thinking and programming... We are standing at the very beginning... It's a huge area." (J. Gutknecht, ETH Zurich)
• What to do in the meantime?

© 2007 R. Leupers


Block clustering approach

www.iss.rwth-aachen.de


Block clustering approach (cont'd)

www.iss.rwth-aachen.de


Block clustering approach (cont'd)

www.iss.rwth-aachen.de


Multi-core SoC Design Methodology


Network On Chip


Technology Evolution


What are NoC’s?

• According to Wikipedia:

– “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”


Network-on-Chip (NoC)

• Communication is achieved by connecting switches together to form a network topology.
• Offers much greater scalability.
• Parallelism: multiple components can send data simultaneously.
• Energy efficient: point-to-point connections require less energy than a bus.
• Global synchronization is no longer needed.


NoC Design Considerations (I)

• There are several popular topologies:
– 2D mesh (most popular)
– Torus (rings)
– Tree (fat-tree, butterfly fat-tree)
• The on-chip interconnection network will soon be a limiting factor for performance and energy consumption:
– It has been reported to account for over 50% of the total energy requirement!
• The interconnect should consume the fewest resources possible and should be:
– Area efficient: switches should be as small (simple) as possible.
– Energy efficient: related to area efficiency.
– Fast: simple routing algorithms should be used.
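One commonly used simple routing algorithm for a 2D mesh is dimension-ordered (XY) routing: a packet first travels along X until its column matches the destination, then along Y. The slide does not name a specific algorithm, so XY routing is shown here only as an example of the "simple routing" it calls for; the coordinate convention and function name are assumptions.

```python
def xy_route(src, dst):
    """Return the list of (x, y) tiles visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1   # move one hop along X
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1   # then one hop at a time along Y
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Each switch only needs to compare its own coordinates with the destination, which keeps the router logic small and the routing decision fast.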


NoC exemplified

[Figure: routing nodes forming the network, each attached to a processor (master), the global memory (slave), or a global I/O (slave)]


NoC: Good news

☺ Only point-to-point one-way wires are used, for all network sizes.

☺ Aggregated bandwidth scales with the network size.

☺ Routing decisions are distributed and the same router is re-instantiated, for all network sizes.

☺ NoCs increase wire utilization (as opposed to ad-hoc p2p wires)

Sergio Tota and Mario R. Casu


There’s no free lunch…

• Internal network contention causes (often unpredictable) latency.
• The network has a significant silicon area.
• Bus-oriented IPs need smart wrappers.
• Software needs clean synchronization in multiprocessor systems.
• System designers need reeducation for new concepts.

Sergio Tota and Mario R. Casu


Facts about NoCs

• It is a way to decouple computation from communication

• The design is layered (physical, network, application…): Taming complexity is made easier

• Communication between processing elements in NoC takes place by encapsulating data in packets

• The elementary packet piece to which switch and routing operations apply is the flit
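As a rough illustration of packets and flits, the sketch below splits a message into a head flit (carrying the destination), body flits, and a tail flit. The flit layout, the tags `HEAD`/`BODY`/`TAIL`, and the function names are assumptions for illustration; the slide does not define a packet format.

```python
def packetize(dest, payload_words):
    """Encapsulate payload words into a list of flits (hypothetical format).

    The head flit carries routing information (the destination), so a
    switch only needs to inspect it to make the routing decision; the
    remaining flits simply follow the same path.
    """
    if not payload_words:
        raise ValueError("payload must contain at least one word")
    flits = [("HEAD", dest)]
    flits += [("BODY", word) for word in payload_words[:-1]]
    flits.append(("TAIL", payload_words[-1]))   # tail closes the packet
    return flits


def depacketize(flits):
    """Recover (dest, payload_words) from a flit list built by packetize."""
    kind, dest = flits[0]
    if kind != "HEAD":
        raise ValueError("packet must start with a head flit")
    return dest, [word for _, word in flits[1:]]


packet = packetize(5, [10, 20, 30])
```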


Topologies
• Heritage of networks with new constraints
– Need to accommodate interconnects in a 2D layout
– Cannot route long wires (clock frequency bound)

a) SPIN, b) CLICHÉ, c) Torus, d) Folded torus, e) Octagon, f) BFT.


SPIN (Guerrier et al., DATE ’00/’03)

• Wormhole switching, adaptive routing and credit-based flow control.
• It is based on a fat-tree topology.
• A flit is only one word (36 bits, of which 4 bits are for packet framing).
• The input buffers have a depth of 4 words.
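Credit-based flow control can be sketched as a sender-side counter that mirrors the free slots in the receiver's input buffer. The model below uses the 4-word buffer depth quoted on the slide; the class and method names are hypothetical and this is not the SPIN implementation.

```python
from collections import deque


class CreditLink:
    """Minimal model of one link with credit-based flow control."""

    def __init__(self, depth=4):
        self.credits = depth      # sender's view of free receiver slots
        self.buffer = deque()     # receiver input buffer

    def send(self, flit):
        """Send a flit if a credit is available; return True on success."""
        if self.credits == 0:
            return False          # sender must stall, buffer is full
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def drain(self):
        """Receiver forwards one flit and returns a credit to the sender."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(depth=4)
accepted = [link.send(i) for i in range(5)]   # fifth send has no credit
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow and no flit has to be dropped or retransmitted.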


Dally et al., DAC’01
• 2D folded torus topology
• Wormhole routing and Virtual Channels (VC)


Kumar et al., ISVLSI’02

• Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ)
• 2D Mesh Topology
• Message Passing


Pande et al., TCOMP’05
• Butterfly Fat Tree
• Wormhole, Virtual channels
• Header flits: 3 clock cycles latency (input arbitration, routing, output arbitration)
• “Body” flits: 3 clock cycles (input arbitration, switch traversal, output arbitration)


Goossens et al., IEE CDT’03

• Both VCT (virtual cut-through) and WH (wormhole), GT (guaranteed throughput) and BE (best effort), IQ (input queuing) and VOQ (virtual output queuing)

• GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from In “j” to Out “k” in a store-and-forward (S&F) fashion.

• BE uses Matrix Scheduling
• GT connections are set up by special BE system packets
• Prototype with WH and IQ
– 5 ports
– 0.13 µm, 0.26 mm², 500/166 MHz
– Flit size = 3 words, each 32 bits
– 80 Gb/s aggregate bandwidth
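The prototype figures are mutually consistent if one assumes that each of the 5 ports moves one 32-bit word per cycle at 500 MHz, and that 166 MHz is the resulting flit rate (500 MHz divided by 3 words per flit). The slide does not state this derivation, so the short check below is an interpretation, not the authors' calculation.

```python
# Figures quoted on the slide
ports = 5
word_bits = 32
words_per_flit = 3
word_clock_hz = 500_000_000

# Assumption: every port transfers one word per word-clock cycle
aggregate_bits_per_s = ports * word_bits * word_clock_hz
aggregate_gbps = aggregate_bits_per_s // 10**9

# Assumption: the 166 MHz figure is the flit rate (one flit = 3 words)
flit_rate_mhz = word_clock_hz / words_per_flit / 1e6
```

Under these assumptions the aggregate comes out at 80 Gb/s and the flit rate at about 166.7 MHz, matching the quoted 80 Gb/s and 500/166 MHz.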


SKKU’s Mobile MP-SoC Platform

Koonshik Cho & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.


1. Multiprocessor SoC design platform

• SW: responds actively to performance improvements and to changes in standards

• HW: modular, flexible and scalable architecture

• Platform based design

2. Multiprocessor SoC Platform test

• DVB-T Receiver

3. Tools

• Seamless CVE (Mentor Graphics)

• ADS(ARM)

SoC (DSP+ARM) Platform


SoC (DSP+ARM) Platform


Extended multi-processor platform


ARM Platform


AMBA BUS (1)

The AMBA bus contains a multiplexer, an arbiter, and a decoder, and arbitrates among multiple masters and slaves.


AMBA BUS (2)

AMBA Bus (Master to slave multiplexer)

• A bus master is a device that initiates operations such as reads and writes by driving address and control signals to a slave. Only one master may perform a transfer at any given time. Multiple masters are supported.


AMBA BUS (3)

AMBA Bus (Slave to master multiplexer)

• A bus slave is a device that serves a master's reads and writes within its assigned address space. The slave reports its operating status to the master through the ready and response signals. Multiple slaves are supported.


AMBA BUS (4)

• AHB Arbiter: The bus arbiter ensures that only one master is selected at a time. It arbitrates according to its own priority scheme, and there is exactly one arbiter on an AHB.

• AHB Decoder: The AHB decoder selects the appropriate slave using the upper bits of the address driven by the master. There is likewise exactly one decoder on an AHB.

• APB Bridge: The only bus master on the APB (Advanced Peripheral Bus). The APB bridge is a slave of the ASB; when the decoder selects the APB, the bridge acts as the master on the APB. As a slave module, the APB bridge handles the bus handshake and signal retiming on behalf of the local peripheral bus.
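The arbiter and decoder roles described above can be sketched behaviorally. The fixed lowest-index-wins priority, the 4-bit decode field, and the address map in `SLAVE_MAP` are all hypothetical choices for illustration; the slide only says that the arbiter has its own priority scheme and that the decoder uses the upper address bits.

```python
def arbitrate(requests):
    """Grant the bus to exactly one requesting master.

    requests is a list of booleans indexed by master number.
    Assumed scheme: fixed priority, lowest index wins.
    Returns the granted master index, or None if nobody requests.
    """
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


# Hypothetical address map: top 4 address bits select the slave
SLAVE_MAP = {0x0: "sram", 0x1: "apb_bridge", 0x2: "sdram"}


def decode(addr, top_bits=4, addr_width=32):
    """Select a slave from the upper bits of the master's address."""
    return SLAVE_MAP.get(addr >> (addr_width - top_bits))


granted = arbitrate([False, True, True])   # masters 1 and 2 request
selected = decode(0x1000_0040)             # falls in the 0x1... region
```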


AMBA BUS (5)

• Interrupt controller: Receives interrupt request signals from up to 32 interrupt sources and generates the nIRQ or nFIQ signal applied to the ARM9 processor. Of the 32 interrupt sources, sources 0 to 3 generate nFIQ and sources 4 to 31 generate nIRQ. The lower the number, the higher the priority.

• Timer: The timer module provides three timers. Each timer is a 16-bit counter that supports three prescale values (16, 256, and 4096). It decrements the counter by 1 every period and raises an interrupt when the count reaches 0. The interrupt request is held until the ARM9 processor acknowledges it through the timer interrupt clear register.
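The interrupt mapping and the timer arithmetic can be sketched as follows. The function names are hypothetical, and the timer formula assumes that the counter decrements once every `prescale` input clock cycles, which is one plausible reading of the slide rather than a documented register behavior.

```python
def resolve_interrupts(pending):
    """Map a 32-bit pending mask to (fiq, irq, highest_priority_source).

    Sources 0..3 drive nFIQ, sources 4..31 drive nIRQ, and the lowest
    numbered pending source has the highest priority.
    """
    pending &= 0xFFFFFFFF
    fiq = bool(pending & 0x0000000F)
    irq = bool(pending & 0xFFFFFFF0)
    if pending == 0:
        return fiq, irq, None
    lowest_set_bit = pending & -pending          # isolate lowest set bit
    return fiq, irq, lowest_set_bit.bit_length() - 1


def timer_cycles_to_interrupt(load, prescale):
    """Input clock cycles until a 16-bit down-counter reaches zero."""
    if prescale not in (16, 256, 4096):
        raise ValueError("prescale must be 16, 256 or 4096")
    if not 0 < load <= 0xFFFF:
        raise ValueError("load must fit in a 16-bit counter")
    return load * prescale


state = resolve_interrupts((1 << 3) | (1 << 5))  # sources 3 and 5 pending
```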


Teak DSP Platform

• Structure of the Teak DSP platform, which serves as the co-processor in the overall platform


Configuration of crossbar switch

• Communication interface architecture (crossbar structure)


Reconfigurable crossbar switch

Uses the VHDL generate statement


Reconfigurable crossbar switch (VHDL code)

    entity CI_TOP is
      generic ( number_of_masters, number_of_slaves : integer );
      port ( … omitted );
    end CI_TOP;

Entity of the CI module (ci_top.vhd)

    COMMUNICATION_INTERFACE : CI_TOP
      generic map ( number_of_masters => 4, number_of_slaves => 6 )
      port map ( … omitted );

Use of the CI module (multiplatform.vhd)


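Communication Interface: Mux vs Crossbar

| | Advantage | Disadvantage |
|---|---|---|
| Mux | ‣ Relatively easy to implement ‣ Effective when there are few masters and slaves | ‣ Parallel processing between processors is difficult ‣ Causes a bottleneck when the system is extended |
| Crossbar | ‣ Enables effective parallel processing between processors ‣ Keeps the same delay even when the system is extended | ‣ Difficult to implement ‣ Relatively unfavorable in terms of size and low power |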


Interconnection network

Omega interconnection Octagon interconnection

Mesh interconnection


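Interconnection network: advantages and disadvantages

Shared bus
‣ Relatively easy to implement
‣ Effective when there are few masters and slaves. Keeps the same delay even when the system is extended
‣ Parallel processing between processors is difficult
‣ Low bus efficiency
‣ High power consumption (broadcasting)
‣ Implementation complexity: low

Crossbar
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: average
‣ Implementation complexity (routing and scheduling): average
‣ Size and wiring grow as the system is extended

Omega network
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: average
‣ Data path: excellent (short)
‣ Implementation complexity (routing and scheduling): high
‣ Scalability is somewhat limited

Octagon
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: average
‣ Data path: excellent (shortest)
‣ Implementation complexity (routing and scheduling): high
‣ Scalability is somewhat limited (the number of masters and slaves is limited to 8)

Mesh
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: average
‣ Implementation complexity (routing and scheduling): very high
‣ Suitable for medium and large systems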
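The short data path attributed to the Octagon can be illustrated with the routing rule commonly described for that topology: eight nodes on a ring, plus a link from each node to the one directly across. The rule below is recalled from general descriptions of the Octagon architecture and is not taken from these slides, so treat it as an assumption; the function names are hypothetical.

```python
def octagon_next_hop(node, dest):
    """Next node on the way from node to dest in an 8-node octagon."""
    rel = (dest - node) % 8
    if rel == 0:
        return node               # already at the destination
    if rel in (1, 2):
        return (node + 1) % 8     # go clockwise
    if rel in (6, 7):
        return (node - 1) % 8     # go counter-clockwise
    return (node + 4) % 8         # take the link across the octagon


def octagon_path(src, dst):
    """List of nodes visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(octagon_next_hop(path[-1], dst))
    return path


worst_case_hops = max(
    len(octagon_path(s, d)) - 1 for s in range(8) for d in range(8)
)
```

With this rule any pair of nodes is at most two hops apart, which is consistent with the table rating the Octagon data path as the shortest.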


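Network router cell

– Multiprocessor platform with a structure of 4 masters and 6 slaves

– The CI cell is designed as a structure of 24 2-by-2 muxes

– CI controller => Req, Grant, mux control, etc.

– With Seamless CVE and ModelSim running in co-simulation, the ARM926 and the Teak DSP access the slaves simultaneously, and each reads and writes its own data (figure: platform function block)

– When a master accesses a slave (IP), the CI controller internally handles the request signals, the grant signals, the control signals for each mux, round-robin arbitration, and decoding (figure: CI controller inner block)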
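The round-robin function of the CI controller can be sketched as an arbiter that starts its search just after the most recently granted master, so that no requester is starved. The class below is a behavioral illustration with hypothetical names, sized for the 4 masters mentioned above; it is not the platform's VHDL.

```python
class RoundRobinArbiter:
    """Behavioral round-robin arbiter for n masters."""

    def __init__(self, n_masters=4):
        self.n = n_masters
        self.last = n_masters - 1   # so that master 0 is checked first

    def grant(self, requests):
        """Return the granted master index, or None if nobody requests.

        requests is a list of booleans indexed by master number.
        """
        for offset in range(1, self.n + 1):
            master = (self.last + offset) % self.n
            if requests[master]:
                self.last = master  # next search starts after this one
                return master
        return None


arbiter = RoundRobinArbiter(4)
requests = [True, False, True, True]      # masters 0, 2, 3 keep requesting
grants = [arbiter.grant(requests) for _ in range(4)]
```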


CI-controller State Diagram


CI controller simulation waveform


DVB-T Baseband Receiver


Hardware-software co-design flows


A shared memory structure and hardware-software partitioning


Frequency offset compensator hardware


Fine and Coarse Frequency Synchronizer (Beek & Classen)


FFT block diagram


Equalizer hardware block diagrams


DVB-T baseband Receiver Scheduling


DVB-T baseband Receiver Scheduling


Performance evaluation

| Functional Block / Processing Type | SW | SW & HW (Teaks + ARM + HW IP) | HW (IP) only | MAL |
|---|---|---|---|---|
| Frequency compensator & Remove Guard | - | 182.5 µs | 13.8 µs | 10.5 µs |
| Fine Freq. Sync. (Beek) | - | 56.3 µs | 1.5 µs | 7.8 µs |
| Symbol Timing Recovery | 144 µs | - | - | 5.2 µs |
| FFT | - | 188.9 µs | 38.6 µs | 13.6 µs |
| Coarse Freq. Sync. (Classen) | - | 241 µs | 3.3 µs | 11 µs |
| Scattered Pilot Detection | 46.5 µs | - | - | 3.3 µs |
| Equalizer | - | 219.5 µs | 11.2 µs | 9.5 µs |
| De-mapping | 19.9 µs | - | - | 4.9 µs |
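For the five blocks that have both a "SW & HW" and a "HW (IP) only" figure, the ratio between the two columns can be computed directly. The assignment of values to columns follows the order in which they appear on the slide, which is an assumption about the flattened table, and the variable names are hypothetical.

```python
# (SW & HW time, HW-only time) in microseconds, as read from the slide
timings_us = {
    "Frequency compensator & Remove Guard": (182.5, 13.8),
    "Fine Freq. Sync. (Beek)": (56.3, 1.5),
    "FFT": (188.9, 38.6),
    "Coarse Freq. Sync. (Classen)": (241.0, 3.3),
    "Equalizer": (219.5, 11.2),
}

# Ratio of the SW & HW time to the HW-only time for each block
ratios = {
    block: sw_hw / hw_only
    for block, (sw_hw, hw_only) in timings_us.items()
}

largest_gap = max(ratios, key=ratios.get)
```

Under this reading the coarse frequency synchronizer shows the largest gap between the two configurations and the FFT the smallest.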


Task Chart of Multi-processor platform for DVB-T baseband receiver


Task Chart of Multi-processor platform for DVB-T baseband receiver


Modeling of Motion Compensation IP using SCML

Le Minh Nghia & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.


Introduction of SystemC Modeling in CoWare

• TLM Peripheral Modeling with CoWare
• SystemC Modeling Library (SCML)
• Motion Compensation Modeling using SCML


TLM Peripheral Modeling with CoWare

• Four use-cases for Transaction-Level Modeling (TLM)
– Functional View (FV)
– Architect’s View (AV)
– Programmer’s View (PV)
– Verification View (VV)


TLM Peripheral Modeling with CoWare

• General pattern for modeling a peripheral component
– Separate behavior, communication and timing
– Platform-dependent initiators and targets are created by the user.
– A bus transactor converts a generic communication into a bus-specific TLM interface.
– The accuracy of timing depends on the use-case


TLM Peripheral Modeling with CoWare

• Communication
– Communication through function calls
– Simulation speed strongly depends on the bus model
– The PV bus model used for software development can be simulated very fast

• Behavior
– Functionality
– Synchronization
– Storage

• Timing
– The timing model is based on the clock object in the SystemC Modeling Library (SCML)


TLM Peripheral Modeling with CoWare

• Modeling Target pattern
– Communication: bus transactor
– Storage and Synchronization: register bank as interface
– Behavior: collection of call-back functions, each call-back corresponding to a bitfield or register in the register bank
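The target pattern (a register bank whose registers trigger call-back functions) can be illustrated in a language-neutral way. The Python sketch below is not the SCML API; `RegisterBank`, `add_register`, and the `CTRL` register are hypothetical names used only to show the idea.

```python
class RegisterBank:
    """Storage plus per-register write call-backs (illustrative only)."""

    def __init__(self):
        self._values = {}
        self._on_write = {}

    def add_register(self, name, reset=0, on_write=None):
        self._values[name] = reset
        if on_write is not None:
            self._on_write[name] = on_write

    def write(self, name, value):
        """Called by the bus side; stores the value, then runs behavior."""
        self._values[name] = value
        callback = self._on_write.get(name)
        if callback is not None:
            callback(value)

    def read(self, name):
        return self._values[name]


# Hypothetical peripheral: writing 1 to CTRL starts it, 0 stops it
events = []
bank = RegisterBank()
bank.add_register("CTRL", on_write=lambda v: events.append("start" if v else "stop"))
bank.add_register("STATUS", reset=0)

bank.write("CTRL", 1)
bank.write("CTRL", 0)
```

Keeping the storage in the register bank and the behavior in call-backs means the same behavior code can sit behind different bus transactors.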


TLM Peripheral Modeling with CoWare

• Modeling Initiator pattern
– Communication: the bus transactor converts the posted transactions in its queue into real bus transactions
– Storage and Synchronization: includes a post port and an initiator storage element, scml_array (in SCML). The post port posts transactions in a non-blocking manner. The actual synchronization depends on the data and space available in the storage element, that is, the scml_array object
– Behavior: modeled by autonomous SystemC processes


Initiator Synchronization

• Two classes of initiator blocks:
– Free-running initiator: the transfers initiated by the block do not need any access from another peripheral
– Slaved initiator: the block has a target port, and transfers are only initiated when that target port is accessed

• Three synchronization patterns for initiator blocks:
– Free-running Initiator
– Fully Slaved Initiator
– Semi-free running Initiator


Initiator Synchronization

• Modeling a Free-Running Initiator Peripheral
– The thread is modeled by an SC_THREAD and posts transactions
– wait(sc_time): schedules the next execution of the thread


Initiator Synchronization

• Modeling a Fully-Slaved Initiator Peripheral
– The slaved initiator only sends a transaction when its target port is accessed
– The loop in the fully-slaved initiator returns control to the master thread after it has posted the transaction


Initiator Synchronization

• Modeling a Semi-Free-Running-Slaved Initiator Peripheral
– The thread containing the loop is triggered by a start event
– The start event is generated by accessing the target port of the initiator
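The semi-free-running pattern (a thread that sleeps until a start event, then posts transactions on its own) can be imitated with a Python generator standing in for a SystemC thread. This is only an analogy: each `yield` plays the role of a `wait`, the dictionary `start_event` stands in for an event object, and none of the names below come from SCML.

```python
def initiator_thread(start_event, post, burst_length):
    """Generator modeling a thread that loops after a start event."""
    while True:
        while not start_event["set"]:
            yield                      # wait for the start event
        start_event["set"] = False     # consume the event
        for index in range(burst_length):
            post(index)                # post one transaction
            yield                      # wait until the next time step


def target_port_write(start_event, value):
    """Access to the target port generates the start event."""
    if value == 1:
        start_event["set"] = True


posted = []
start = {"set": False}
thread = initiator_thread(start, posted.append, burst_length=3)

for _ in range(3):                     # scheduler ticks before start
    next(thread)
posted_before_start = list(posted)

target_port_write(start, 1)            # master writes the start register
for _ in range(4):                     # scheduler ticks after start
    next(thread)
```

After the single write to the target port, the thread posts its whole burst without any further access from the master, then goes back to waiting.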


SystemC Modeling Library (SCML)

• Memories and Bitfield object:
– To model bit-fields and memory-mapped registers
– Memory objects support posting non-blocking transactions
– Support synchronization by reading and writing data based on blocking access

• Clock object
– To model timing or clocks in IPs

• Initiator-side object
– Models the communication of initiator peripherals to support re-use.


Modeling TLM Motion Compensation

• Outline of the features of the Motion Compensation IP
– Synchronization: Semi-Free-Running-Slaved Initiator
– Behavior: algorithm extracted from the J.M source code
– The structure includes two parts
  • Target part: interfaces with the master processor using a register bank; modeled following the Target pattern of SCML
  • Initiator part: models the posting of transactions and the synchronization of transaction transmission, following the Initiator pattern of SCML


Modeling TLM Motion Compensation


Modeling TLM Motion Compensation

• Three ports:
– pConfig: interface with the master processor to receive parameters.
– p_Irq: generates an interrupt to synchronize with the master processor
– p_Post: posts transactions to a specific bus through the bus transactor

• Register bank: for the parameters of the Motion Compensation block

• StartStopReg and IrqReg: for the interface with the master processor

• Behavior block: for the Motion Compensation algorithm and transaction modeling

• Call-back functions: for events caused by writing StartStopReg and IrqReg.


Modeling TLM Motion Compensation

• Functions in the TLM Motion Compensation model:
– f_initialize(): initializes the parameters of the model
– f_thread(): waits for the event generated by writing to StartStopReg
– f_write_start_stop(): call-back function for the event of writing to StartStopReg. It activates or deactivates the model by generating an sc_event to signal f_thread().
– f_clear_irq(): clears IrqReg
– f_MotionCompensation(): Motion Compensation behavior based on the original source code in the J.M reference software.
– f_do_post(): posts the transactions in storage (the transaction pool) to the bus transactor and manages the synchronization of posting
– f_postTransfer(): posts a transaction to the bus transactor
– f_release_trans(): releases the transaction pool


Next…

• Extract parameters as TestVector from the J.M source code

• Build a platform in CoWare

• Test the Motion Compensation IP with TestVector


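Concluding remarks

• As the complexity and cost of (mobile) SoCs increase, a design process based on an MP-SoC platform becomes important

• Challenges identified for mobile platforms: low power, verification including the RF interface, variety of standards, and platform optimization

• It is desirable to develop a platform that takes into account the strengths and weaknesses of the various platforms and methodologies

• Train people (system architects) who can understand and design across HW, SW, and algorithms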