© Jun Dong Cho, 2007.7 1
Mobile SoC Design Automation Lab. Sungkyunkwan Univ. S. Korea
Multi-processor System on Chip Design
Jun Dong Cho, Sungkyunkwan University
Contents
• Requirements for next-generation SoC (System on Chip)
• The need for MP-SoC
• History of Multiprocessors
• MP-SoC Examples and Applications
• Homogeneity and Heterogeneity
• MP-SoC Design Automation
• Network on Chip
• SKKU's Mobile MP-SoC Platform
Requirements for Next-Generation SoC (System on Chip)
What is System on Chip?
SoC
• Processor: AP - MC
• Modem: GSM/GPRS - WCDMA - CDMA2000
• Connectivity: Wireless LAN - GPS - Bluetooth
• RF/Analog: Rx - Tx - Zero IF - PM
• Camera Chipset: CIS - CCD - ISP
• Display Driver IC (DDI): STN - TFT - OLED
• Smart Card: SIM
• Flash Memory: Code/Data Storage
• SIP / MCP
• RAM: Mobile DRAM - SRAM - UtRAM
The Need for High Performance and Low Power
[Chart: growth trends over the years 1995, 2003, and 2012]
• Moore's law
• Shannon's law (2.8x / 18m): 2G (IS-95) 9.6 kbps; 3G (CDMA 1xEV) 3,100 kbps; 4G (1 Gbps ~ 100 Mbps)
• Mobile multimedia: QVGA, D1, HD (720p), Full HD (1080i), 3D graphics
• Battery capacity
• Design complexity
Productivity gap: design complexity vs. Moore's law
Power gap: design complexity vs. battery
Flexibility-Energy Gap
[Chart: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) vs. flexibility]
• Embedded processor (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• FPFA: 10-80 MOPS/mW
• Signal-processing ASIC: 200 MOPS/mW
FPFA: Field Programmable Function Array
Design spaces shown: sensor network design space; wireless embedded systems design space
Five Requirements for Improving Next-Generation SoC Productivity
1. High Performance
2. Fast Verification
3. Small Form Factor
4. Low Power Solutions
5. Design-Technology Integration for Manufacturability
1. High-Performance: CMP + NoC
Heterogeneous Chip Multi-processor Architecture
[Figure: μP, IP, Mem, and PE blocks connected through a NoC]
[Chart: Technology Evolution, number of PEs (0 to 400) vs. year (2004 to 2016). Source: ITRS 2005 draft]
2. Fast Verification: Embedded System Level
Complexity growth:
• Moore's Law: 2x / 18m
• Nielsen's Law: 2x / 12m
• Embedded SW: 2x / 10m
ESL design flow:
• System specification: UML / Java / MatLab
• Architecture design: SystemC / ADL
• RTL design: Verilog / VHDL
[Figure: TLM example with signals ctrl1/cmd1, Req, Addr, Grant, Data, ack0, ack1]
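The three doubling periods quoted on this slide (18, 12, and 10 months) can be turned into growth multiples to see how quickly embedded software complexity pulls away from Moore's Law. The following is a minimal sketch; the function names and the three-year horizon are assumptions chosen for illustration, not figures from the slides.

```python
# Doubling periods in months, taken from the slide.
DOUBLING_MONTHS = {
    "Moore's Law": 18.0,
    "Nielsen's Law": 12.0,
    "Embedded SW": 10.0,
}


def growth_factor(doubling_months: float, years: float) -> float:
    """Growth multiple after `years` for a quantity doubling every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)


def gap_vs_moore(doubling_months: float, years: float) -> float:
    """How many times faster than Moore's Law a trend has grown after `years`."""
    return growth_factor(doubling_months, years) / growth_factor(18.0, years)


if __name__ == "__main__":
    horizon_years = 3.0  # hypothetical horizon, chosen only for illustration
    for name, months in DOUBLING_MONTHS.items():
        print(f"{name}: x{growth_factor(months, horizon_years):.1f} "
              f"(gap vs. Moore: x{gap_vs_moore(months, horizon_years):.2f})")
```

Under these rates the gap itself grows exponentially, which is the argument for raising the abstraction level of verification.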
3. Small Form Factor
SiP: Mobile Application Processor + Mobile Memory
[Figure: discrete layout (~35 mm x ~25 mm) with Mobile AP, 32MB NAND, and two 16MB SDRAMs, compared with a 17 mm x 17 mm SiP containing Mobile AP, Flash, and two SDRAMs]
• EMI reduction
• 60% smaller area
▷ 8-layers of MCP
▷ Cost reduction by 15%
4. Low Power Solutions
Levels: Device, Circuit, Architecture, Runtime (DAC 2004)
• MTCMOS
• Clock Gating
• Multi-Vdd
• Tr Sizing
• VTCMOS
• Multi-Vt
• SOI
• High-κ Metal Gate
• Parallelization
• GALS
• DPM/DVS
[Figure: DVFS operating points: 1.5V, 500MHz; 1.2V, 350MHz; 1.0V, 200MHz]
[Figure: Multi-Vdd; VTCMOS body biasing (VBP, VBN, VDD, VSS) in Active and Standby modes; MTCMOS]
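One way to read the three DVFS operating points on this slide is through the first-order dynamic power model P = C · Vdd² · f. The sketch below is illustrative only: it assumes a constant effective switched capacitance and ignores leakage and short-circuit power, and the function names are hypothetical.

```python
# (Vdd in volts, clock in MHz) operating points taken from the slide.
OPERATING_POINTS = [(1.5, 500.0), (1.2, 350.0), (1.0, 200.0)]


def dynamic_power(vdd: float, f_mhz: float, c_eff: float = 1.0) -> float:
    """First-order dynamic power C * Vdd^2 * f, in arbitrary units."""
    return c_eff * vdd ** 2 * f_mhz


def relative_power(point, reference) -> float:
    """Dynamic power of `point` as a fraction of `reference`."""
    return dynamic_power(*point) / dynamic_power(*reference)


def relative_energy_per_cycle(point, reference) -> float:
    """Energy per clock cycle scales with Vdd^2 only."""
    return (point[0] / reference[0]) ** 2


if __name__ == "__main__":
    fastest = OPERATING_POINTS[0]
    for vdd, f_mhz in OPERATING_POINTS:
        print(f"{vdd:.1f} V, {f_mhz:.0f} MHz: "
              f"power x{relative_power((vdd, f_mhz), fastest):.3f}, "
              f"energy/cycle x{relative_energy_per_cycle((vdd, f_mhz), fastest):.3f}")
```

Under this model the slowest point runs at 40% of the peak clock for well under 20% of the peak dynamic power, which is why voltage is scaled together with frequency.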
5. Design-Technology Integration for Manufacturability (DfM)
[Figure: variation information passed between design levels; open question: how critical timing and power relate to the designer's intention]
Design levels:
• Quantum Physics
• Mask / Process Design
• Logic / Physical Design
• Architecture Design
• Algorithm Design
Variation information:
• NA, Tox
• Vt, Lg, L, t, tILD
• Vdd, Temp
• Latency, Power
• Fault Probability
Techniques:
• Statistical Analysis
• Statistical STA
• Yield-improving architecture
• Fault-tolerant algorithm
More SoC topics …
• Platform optimization
  – Power management
  – BW allocation
  – Resource sharing
  – Task distribution
  – Efficient communications
• Low Power
• Verification
• Training talent (System Architects)
The Need for MP-SoC (Multi-Processor System on Chip)
Definition of MP-SoC?
Usually a heterogeneous multiprocessor:
• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.
Definition of multiprocessor by Enslow Jr.:
MIMD machines with shared memory
• Shared memory
• Shared I/O
• Distributed OS
• Homogeneous
Extended definition: all parallel machines (wrong usage)
Future Microprocessors
Why MP?
• Uniprocessors have hit the ceiling
• Get performance from better architecture instead of more MHz
Anatomy of a Cellular Phone
3G Wireless Protocols
MP-SoC Application Areas: 4G, Multiple Standards
• Communications.
• Networking.
• Multimedia.
• Security.
Digital RF supporting multiband/multimode
Evolution of the MP-SoC Platform (Example: WCDMA + CDMA2000)
System Architecture for 3G
• 4 PEs
  – Static kernel mapping and scheduling
  – SIMD + scalar units
• 1 ARM GPP controller
  – Scalar algorithms and protocol controls
ARM MPCore™ Architecture
Reconfigurable and Scalable MP-SoC Platform
Road Map to MP-SoC Trends
• Mask NRE: over $1M; design NRE: $10M to $75M
  – ASICs replaced by programmable ASSPs, FPGAs
• Number of embedded processors
  – DVD/STB/HDTV, mobile phones: 5 to 8
  – Image processing, networking, base station: 8 to 100+
• E-S/W complexity
  – Set-top box, audio: >1 million lines of code
E-S/W is becoming an essential part of SoCs
Who's Law?
Why is MP-SOC Challenging?
Software Defined Wireless Multimedia Terminals
SDR = Reconfigurable Radios
• Lower costs
  – Platform longevity, higher volume
  – SW has lower development costs
• Time to market
  – Future protocols will have complex implementations
  – Overlap testing/development cycles
• Adaptability
  – Standards change over time
  – Multi-mode operation
  – Sharing hardware resources
Multistandard Radio
• UMTS
• GSM/GPRS/EDGE
• WLAN
• Bluetooth
• UWB
Multistandard M/M
• H.264
• MP3
• AAC
• GPS
• DVB-H
• TPEG
SDR Configuration
Soft Radio Digital Signal Processing Engine
• Modulation Format
  – QPSK
  – DQPSK
  – π/4 DQPSK
  – {16, 64, 256, 1024} QAM
  – OFDM
  – OFDM CDMA
• Digital Down/Up Conversion (DDC)
  – Channel Center
  – Decimation/Interpolation rates
  – Compensation Filters
  – Matched Filter α = {0.25, 0.35, ...}
• FEC
  – Convolutional
  – Reed-Solomon
  – Concatenated Coding
  – Turbo CC/PC
  – (De-)Interleave
• Network Interface Definition
• Channel Access
  – CDMA
  – TDMA
• Security
• Beam Forming
• DSSS
  – Rake, track, acquire
  – Multi User Detect. (MUD)
  – ICU
Future mobile applications?
Mudge et al.:
• Mobile supercomputing
  – Speech recognition.
  – Cryptography.
  – Augmented reality.
  – Typical applications (email, etc.).
• Requires 16x 2 GHz Pentium 4?
Culture and Education? Personal Entertainment?
MP-SoC Application Areas
[Figure: application map]
Categories:
• Broadcasting, Ubiquitous
• Health, Human, Bio
• Automotive & Robotics
Items: D-TV, CIS, Mobile, Recorder, Health, HCI, Bio, Data Broadcasting, RFID, Telematics, Unmanned Driving, Robot
MPSoC today
• High performance, low power: there is no other way than MPSoC!
• Virtually all processor vendors are on the MPSoC route
  – TI: OMAP, DaVinci
  – STM: Nomadik
  – IBM: Cell
  – Intel: IXP, CoreDuo
  – Philips: Nexperia
  – Atmel: Diopsis
  – ARM: MPCore
  – ARC: VRaptor
• Urgent need for MPSoC design tools
  – Application design and platform capture
  – Architecture exploration and optimization
  – Simulation and verification
  – Application to architecture mapping
The triangle, Chicken and Egg?
[Figure: triangle of architectures, applications, and methodologies]
• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.
Should the SoC designer work hard?
[Figure: design flow in which every step is followed by a Verify stage]
• Requirements
• Simulate (performance)
• Synthesis + P&R, then verify (timing, area)
• SoC Composer
• Simulate
• Compose the system
• Tape Out
Why is verification important in mobile SoC?
Why has our verification become weak?
Some statements from the MPSoC 2006 Symposium
• "The ad-hoc approach to SoC design cannot scale with Moore's Law... The SW development environment as afterthought era of IC design is rapidly drawing to a close" (K. Keutzer, UCB)
• "Power-constrained CPUs are mandatory, but the most exciting features require system-level SW optimization" (M. Kuulusa, Nokia)
• "Multi-core platforms are a reality, but where is the SW support?" (R. Lauwereins, IMEC)
Gartner Announces Its Top 10 Technologies for 2007
• Gartner announced the ten technologies expected to reach maturity over the next three years (25th annual Gartner Data Center Conference, Nov. 28 to Dec. 1, 2006)
• Open Source, Virtualization, Information Access, Ubiquitous Computing, Grid Computing, Compute Utilities, Multicore Processors, Web 2.0, Network Convergence, Water Cooling
http://www.gartner.com/2_events/conferences/lsc25.jsp
Fundamental to Parallel Machines
Purposes of multiple processors
• Performance
  – A job can be executed quickly with multiple processors
• Fault tolerance
  – If a processing unit is damaged, the total system can remain available: redundant systems
• Resource sharing
  – Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
  – High performance with low-frequency operation
Why Multi-Threaded Cores?
[Figure: NoC-based platform with In/Out ports, SRAM, DSPs, a H/W multithreaded RISC, a H/W processing element, and a GPP with I$ and D$]
• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock)
• More parallel processing (lower power, higher performance/mm²)
Flynn's Classification
• The number of instruction streams: M (Multiple) / S (Single)
• The number of data streams: M / S
  – SISD: uniprocessors (including superscalar, VLIW)
  – MISD: not existing (analog computer)
  – SIMD
  – MIMD
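Flynn's scheme above reduces to two yes/no questions: is there more than one instruction stream, and is there more than one data stream. A small sketch of that rule follows; the function name `flynn_class` is hypothetical and chosen only for illustration.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class (SISD, SIMD, MISD, or MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one instruction and one data stream")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))    # uniprocessor, including superscalar and VLIW
    print(flynn_class(1, 16))   # one instruction stream driving 16 data lanes
    print(flynn_class(4, 4))    # four independent processors
```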
MIMD
[Figure: processors connected through interconnection networks to memory modules (instructions and data)]
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
Classification of MIMD machines
• UMA (Uniform Memory Access model): provides shared memory that all processors can access in the same manner.
• NUMA (Non-Uniform Memory Access model): provides shared memory, but it is not uniformly accessed.
• NORA/NORMA (No Remote Memory Access model): provides no shared memory. Communication is done with message passing.
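The three MIMD memory models above differ in only two properties: whether memory is shared at all, and whether every processor reaches it in the same manner. The sketch below encodes that decision; the function and parameter names are assumptions made for illustration.

```python
def mimd_memory_model(shared_memory: bool, uniform_access: bool = False) -> str:
    """Classify a MIMD machine by its memory access model."""
    if not shared_memory:
        # No remote memory access: processors communicate by message passing.
        return "NORA/NORMA"
    return "UMA" if uniform_access else "NUMA"


if __name__ == "__main__":
    print(mimd_memory_model(shared_memory=True, uniform_access=True))   # bus-connected SMP
    print(mimd_memory_model(shared_memory=True, uniform_access=False))  # local + remote memory
    print(mimd_memory_model(shared_memory=False))                       # message-passing cluster
```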
An example of UMA: Bus connected
[Figure: four PUs, each with a snoop cache, connected through a shared bus to main memory]
SMP (Symmetric MultiProcessor)
On-chip multiprocessor
Switch connected UMA
[Figure: CPUs, each with an interface and local memory, connected through a switch to main memory modules]
NUMA
• Each processor provides a local memory and accesses other processors' memory through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable:
  – Programs for UMA can run without modification.
  – Performance improves as the system size grows.
Competitive with WS/PC clusters using software DSM
Typical structure of NUMA
[Figure: Node 0 to Node 3 connected by an interconnection network; the nodes' memories form regions 0 to 3 of one logical address space]
Classification of NUMA
• Simple NUMA:
  – Remote memory is not cached.
  – Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
  – Cache consistency is maintained with hardware.
  – The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
  – No home memory
  – Complicated control mechanism
Cray's T3D: A simple NUMA supercomputer (1993)
• Using Alpha 21064
The Earth Simulator (2002)
NORA/NORMA
• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance
• Hard to program
The fastest processor is always NORA (except the Earth Simulator)
Inter-PU communications
Cluster computing
Early hypercube machine: nCUBE2
MP-SoC Examples and Applications
Dual-Core (DSP+ARM) Platform
IBM Power4
  – 2 cores
  – F = 1.4 GHz
  – Single clock over entire die
  – Balanced H-tree driving global grid
  – Measured clock skew below 25 ps
  – Power ~85 W
  – 180 nm SOI process, 174M transistors
IBM's Multiple processors on MCM
  – 4 POWER4 chips in a single module (MCM)
  – The POWER4 chips are connected via four 128-bit buses
  – Up to 128 MB L3 cache
  – Bus speed ½ processor speed
  – Total throughput ~35 GB/s
MPSoC "Bus" Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
  – Point-to-point communication
  – Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
  – Programmable interconnects
  – Employ static network
• Arbitrated Bus [IDT Inc., 2000]
  – Time-shared multiple core connectivity
  – Use arbitrator
• Hierarchical Bus [AMBA, ARM Inc]
  – Combine multiple buses using bus
  – Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
  – Resources communicate with data packets
  – Use switch fabric
Available Mobile Processors
• The ARM Family
  – The ARM7 Generation
  – The StrongARM
  – The ARM Thumb Option
  – The ARM Piccolo Option
  – The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
  – The Motorola-Lucent Star*Core
  – The Philips TriMedia
  – The HP/Intel IA-64
Available MP-Cores
• TI OMAP
• Philips's Nexperia™ DVP
• ST Nomadik
• Intel® Itanium® Montecito
• CELL Processor
• CT 3400 Multi-core DSP
• HiBRID-SoC
• Systolic Ring
• Virtual platform in SHAPES project
TI OMAP
• Targets communications, multimedia.
• Dual-processor (DSP, RISC) with shared memory
• Hierarchical definition of platform
• Critical role of software as well as hardware
• OCP (Open Core Protocol) based SoC platform
[Figure: OMAP 5910 block diagram with C55x DSP, ARM9, MMU, memory controller, MPU interface, system DMA control, bridge, and I/O]
Platform Layers and Classification
• Level 0: Foundation Platform
  – Infrastructure & standards: basic architecture
  – Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-Specific Integration Platform
  – Application-specific SoC: HW & SW
  – Mobile platform, home platform
• Level 2: System Platform
  – Terminal platform
  – Handset case: RF + Modem + AP + Memory + MMI
Hierarchy of Platforms in OMAP
[Figure: platform stack ranging from application specific to broadly applicable, with reuse across levels]
• Reference Design
• Application Platform
• SoC Platform
• ASIC Library & Tools
• Silicon Technology
Groupings shown: OMAP Products, OMAP Infrastructure, System Platform
Scalable Multi-processors
TI OMAP 1510 Platform Architecture
[Figure: block diagram with the TI925 and C55x cores (each with a peripheral bus), System DMA, Mail Box, DSP MMU, IMIF, Traffic Controller, EMIFF, EMIFS, SRAM, LB MMU, HASB MMU, SDRAM Bus (16), Flash Bus (16), LB (32), HASB (32), and Peripheral Buses (8/16/32)]
Peripherals: LCD Controller, Interrupt Handlers, Timers, GPIO, UARTs, McBSP
• GPP: TI925 Core
  – 16KB I-Cache
  – Write Buffer
  – MMU and D-MMU
  – Dual TLB
• DSP: C55x Core
  – Internal Memory: 48KW SARAM, 32KW DARAM, 16KW PDRAM
  – 24KB I-Cache
  – Graphic HW Accelerator
  – ARM Port Interface
• IPC
  – Mail Boxes
  – API
  – DSP MMU
• System DMA
• Traffic Controller
• Internal SRAM
• Busses
• Peripherals
TI OMAP 1510 Platform S/W
[Figure: software stack spanning the TI925 General-Purpose Processor and the TMS320 DSP]
• Processors: TI925 General-Purpose Processor, TMS320 DSP
• Components: OS kernel & drivers, OS adapter, MCU Bridge Kernel, LINK drivers, RESOURCE MANAGER, Other drivers, DSP/BIOS Kernel, RM Server, MEDIA APIs, Node Data Base
• XDAIS algorithms encapsulated in socket nodes: MPEG4, MP3, AMR
• Raw data streams: video, audio, speech
Philips's Nexperia™ DVP
(source: Th. Claasen, Philips, DAC 2000)
Philips Nexperia™ DVP S/W Reference Architecture
[Figure: software reference architecture]
• Inputs: Analog Inputs (Analog Front Ends), Digital Inputs (Digital Front Ends), Optical Drive, Hard Disk, Network Protocols
• Buses: Compressed A/V Input Bus, Uncompressed A/V Input Bus
• Players: Broadcast-MPEG2, VCD/SVCD, DVD, CD/SACD, WMT, RN, Broadcast-MPEG4
• Recorders: DVD+RW Auth, PVR-SPTS, Lo-Rate SPTS, CD/DVD-MP3, ...
• Transcoders and Translators: TS-SPTS Filter, Loopback / Feedthrough
• Outputs: Digital Outputs, Protocol Stack, Network Protocols, Driver, HDD/Ethernet
• Presentation Engine: Audio and Video Processing
Philips Nexperia™ DVP MP-SoC
• Philips's advanced set-top box and digital TV SoC (Viper2)
• 0.13 μm
• 50 M transistors
• 100 clock domains
• > 60 IP blocks
ST Nomadik
• Targets mobile multimedia. A heterogeneous multiprocessor-of-multiprocessors.
Power Distribution: Intel Xeon Processor
Clock and Power Convergence
Dynamic voltage and frequency scaling (DVS)
Intel® Itanium® Montecito - Clock system architecture
  – Each core split into 3 clock domains on a variable power supply
Intel® Itanium® Montecito - Power management
  – Dynamic voltage-scaling power management system
  – 4 on-die sensors
  – On-die microcontroller
  – Power and temperature measurement
  – Voltage and frequency modulation
  – 8 μs power/temperature sampling interval
  – Embedded firmware
  – Power, temperature, or calibration measurements
  – Power: closed-loop power control and system stability check
  – Temperature: thermal sensor readout (monitoring that junction temperature stays below 90°C) and power-control communication
  – Calibration: power-measurement accuracy check
The implementation of a first-generation CELL Processor
The Cell Processor
• Fclock > 4 GHz.
• Memory bandwidth: 25.6 GBytes per second.
• I/O bandwidth: 76.8 GBytes per second.
• Performance:
  – 256 GFLOPS (single precision at 4 GHz).
  – 256 GOPS (integer at 4 GHz).
  – 25 GFLOPS (double precision at 4 GHz).
• 235 square mm.
• 235 million transistors.
• Power consumption estimated at 60 - 80 W @ 4 GHz
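The 256 GFLOPS single-precision figure can be reproduced with simple peak-rate arithmetic. The sketch below assumes details the slide does not state: eight SIMD processing elements, four single-precision lanes each, and a fused multiply-add counted as two floating-point operations per lane per cycle. The function name is hypothetical.

```python
def peak_gflops(cores: int, simd_lanes: int, flops_per_lane_per_cycle: int,
                clock_ghz: float) -> float:
    """Peak throughput in GFLOPS = cores * lanes * flops per lane per cycle * GHz."""
    return cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


if __name__ == "__main__":
    # Assumed configuration (not on the slide): 8 SIMD cores, 4 lanes, FMA = 2 flops.
    print(peak_gflops(cores=8, simd_lanes=4, flops_per_lane_per_cycle=2, clock_ghz=4.0))
```

Peak numbers of this kind are upper bounds; sustained performance depends on keeping every lane fed, which is where the memory and I/O bandwidth figures matter.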
Cell's Element Interconnect Bus
• 4 rings (2 clockwise + 2 counter-clockwise)
• No token rings, still request/grant arbitration
CT 3400 Multi-core DSP
• Eight 32-bit DSP cores
• Six 32-bit general-purpose processor cores
• 128-pin programmable I/O subsystem
• C programmable
• Supports H.264 and MPEG4 codecs
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
H.264 encoder, decoder, audio codecs, and system control
H.264 codec onto CT3400 MDSP
From Cradle
CT3400 DSP Engine
http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
Each DSP engine contains:
• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register
• A 512 x 20 program memory
CT3600 Multiprocessor DSP Family Members
• At $5.50 per channel (MPEG4 SP L3), the CT3616 offers the industry's best price/performance encoding solution, more than twice as good as the nearest competing product
• Industry's first single-chip, real-time D1 H.264 Main Profile video encoder based on a programmable DSP
• 0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple overall performance
• $40 to $90
http://www.cradle.com/downloads/CT3600-PB.pdf
CT 3616 Multi-core DSP
HiBRID-SoC Architecture
• Multi-core SoC architecture; dedicated chips for the MPEG-4 Simple Profile
• Integrates a powerful on-chip communication structure
• Three programmable cores, each adapted towards a specific class of algorithms
  – Instruction level: VLIW (very long instruction word)
  – Data level: SIMD (single instruction multiple data)
  – Task level: simultaneous multithreading
• Developed at the University of Hannover
Multi-Core SoC Architecture
• HiPAR-DSP
  – 16-datapath SIMD processor core controlled by VLIW
  – Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
• Stream Processor (SP)
  – 32-bit RISC architecture optimized towards control-dominated tasks
  – Bitstream processing or global system control
• Macroblock Processor (MP)
  – Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
  – Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps
HiBRID-SoC multi-core architecture
• A 64-bit AMBA AHB system bus connects all cores
• SDRAM memory via a 64-bit SDRAM interface
• Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)
HiPAR-DSP
• Highly parallel DSP core with a VLIW-controlled SIMD architecture
• DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
• At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMAC/s
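The 2.3 GMAC/s figure is consistent with the 16-datapath SIMD core described for the HiPAR-DSP, if one assumes each datapath completes one multiply-accumulate per clock cycle. That one-MAC-per-cycle rate is an assumption for this sketch, not a statement from the slides.

```python
def peak_gmacs(datapaths: int, clock_mhz: float, macs_per_cycle: int = 1) -> float:
    """Peak multiply-accumulate rate in GMAC/s."""
    return datapaths * macs_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    # 16 datapaths at 145 MHz, assuming one MAC per datapath per cycle.
    print(f"{peak_gmacs(16, 145.0):.2f} GMAC/s")
```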
Macroblock processor
• Heterogeneous data path structure consisting of a scalar and a vector data path
• The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
• The vector data path is equipped with a 64-entry register file of 64-bit width
• Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
• MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
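Subword parallelism as described above can be imitated in software: several narrow values are packed into one 64-bit word and added lane by lane without letting carries cross lane boundaries. The sketch below uses the classic mask-and-XOR technique on Python integers; it illustrates the idea only and makes no claim about how the macroblock processor implements it in hardware. All names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def pack(values, lane_bits):
    """Pack 64 // lane_bits unsigned values into one 64-bit word, lane 0 lowest."""
    assert len(values) == 64 // lane_bits
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word


def unpack(word, lane_bits):
    """Split a 64-bit word back into its unsigned lanes."""
    lane_mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & lane_mask for i in range(64 // lane_bits)]


def packed_add(a, b, lane_bits):
    """Lane-wise wrap-around add of two packed 64-bit words."""
    high = 0
    for i in range(64 // lane_bits):
        high |= 1 << (i * lane_bits + lane_bits - 1)  # top bit of every lane
    low = ~high & MASK64
    # Add the lower bits of each lane, then restore the top bit with XOR so
    # that no carry propagates into the neighbouring lane.
    return (((a & low) + (b & low)) ^ ((a ^ b) & high)) & MASK64


if __name__ == "__main__":
    x = pack([1, 2, 3, 4, 5, 6, 7, 8], 8)
    y = pack([10, 20, 30, 40, 50, 60, 70, 250], 8)
    print(unpack(packed_add(x, y, 8), 8))
```

The same `packed_add` handles four 16-bit or two 32-bit lanes by changing `lane_bits`, mirroring the three operand layouts listed on the slide.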
HiBRID-SoC Implementations
[Figure: chip layout of the HiBRID-SoC]
MPEG-4 ASP decoder (full TV resolution) performance on MP and SP: 720x576 @ 25 Hz, 1.5-3 Mbit/s
HiBRID-SoC is fabricated in a 0.18 μm, 6LM standard-cell technology:
• 14 million transistors
• 3.5 W
• 82 mm²
• 145 MHz
New Taxonomy/Metric
• Flynn: triple (d, i, c)
  – d: # of data streams
  – i: # of instruction streams
  – c: # of configuration states
  – SISD, SIMD, MIMD, MISD
• RA: (c, g, a)
  – c: configurability to various environments
  – g: size of granularity
  – a: adaptability to various components
  – SCSG, SCMG, SCLG
  – MCSG, MCMG, MCLG
Systolic Ring
• Based on a coarse-grained configurable PE
• Circular datapaths
  – C: # of layers (C = 4)
  – N: # of Dnodes per layer (N = 2)
  – S: # of rings (S = 1)
• Control units (sequencer)
  – Local Dnode unit
  – Local Ring unit
  – Global unit
[Figure: one ring of four layers (layer 1 to layer 4), two Dnodes per layer, with a switch between layers, a Dnode sequencer, and a local ring sequencer]
Motivation For Using Hierarchical Rings
• Relatively simple switching logic reduces the complexity at each node resulting in reduced buffer, area and energy requirements.
• Low latency since packets are forwarded in 1 clock cycle.
• Packets will always arrive in-order at the destination.
• Broadcast and Multicast packets are efficiently implemented.
• Hierarchical rings can be partitioned into independent clock domains.
Remanence
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
• $N_{PE}$: # of processing elements (PE)
• $N_c$: # of PEs configurable per cycle
• $F_e$: operating frequency
• $F_c$: configuration frequency
Characterizes the dynamism:
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations
[Figure: reconfigurable architecture model: a sequencer (sequencing unit) issuing inst0 ... instn into a configuration memory that controls the routing (interconnection) and the processing elements (PEs)]
Operative Density
$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$
• $N_{PE}$: # of PEs
• $A$: core area (relative unit λ²)
• Area can be expressed as a function of $N_{PE}$
[Figure: reconfigurable architecture model: sequencer, configuration memory, routing (interconnection), and processing elements (PEs)]
Remanence formalisation
• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1
[Equation: R(N_PE) expressed in terms of k and N_PE, with k = C/N]
[Chart: remanence (0 to 40) vs. # Dnodes (0 to 180) for k = 1, 2, 4, 8]
[Figure: a single systolic ring with eight layers (layer 1 to layer 8), two Dnodes per layer, and a switch between layers]
Architectural model Characterization
[Figure: four systolic rings, each with four layers of two Dnodes and switches between layers, connected by a global bus; one local ring sequencer per ring and one global sequencer]
• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)
Control units
• Local Dnode unit
• Local Ring unit
• Global unit
www.qstech.com
Best OD and remanence
[Chart: operative density (0.000 to 0.040) and remanence (0 to 20) vs. # Dnodes (0 to 140) for S = 1, 2, 4, 8]
Design space: worst interconnect resources and processing power
Worst OD and remanence
[Chart: operative density (0.000 to 0.040) and remanence (0 to 20) vs. # Dnodes (0 to 140) for S = 1, 2, 4, 8]
Design space: best interconnect resources and processing power
Comparisons of RA
1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (≤8)
3. Many cycles to (re)configure fine grain RA

Name          | Type            | NPE  | Nc   | R     | F (MHz)
ARDOISE       | Fine Grain RA   | 2304 | 0.14 | 16457 | 33
Systolic Ring | Coarse Grain RA | 24   | 4    | 6     | 200
DART          | Coarse Grain RA | 24   | 4    | 6     | 130
MorphoSys     | Coarse Grain RA | 128  | 16   | 8     | 100
TMS320C62     | DSP VLIW        | 8    | 8    | 1     | 300

$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
Pascal BENOIT
Virtual platform in SHAPES project
Homogeneity and Heterogeneity
MPSoC Architecture Trends
Exploitable Parallelism
[Chart: exploitable task parallelism vs. minimum parallel grain size (instructions); parallelism ranges shown: 1~8, 2~6, 1~100]
• Instruction-Level Parallelism: grain size 1 instruction
• MultiFlex Thread-Level Parallelism: grain size 100's of instructions
• GP O/S Thread-Level Parallelism: grain size 10 000's of instructions
Three Levels of Parallelism
Parallel Heterogeneous Platforms (PHPs)
• Challenges:
  – Explore the theoretically high performance

Platform      | Company          | PEs | Het?
Cell          | IBM/Sony/Toshiba | 9   | Y
DRP           | NEC              | 512 | N
Nomadik       | ST               | 3+  | Y
OMAP2420      | TI               | 4   | Y
Nexperia      | Philips          | 3+  | Y
X-Fi          | Creative         | 7   | Y
ARM11 MPCore  | ARM              | 1-4 | N
IXP2800       | Intel            | 17  | Y
MXP5800       | Intel            | 54  | Y
…             | …                | …   | …

(From Abhijit Davare's Quals Presentation)
Homogeneous MP-SoC
• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC
NEC MP211: Homogeneous MP core
• Asymmetric MP with very coarse grain multitasking
• 3 ARM9s utilized as predefined function units
• No complex overhead: e.g., no cache coherency, dynamic scheduling, or load balancing
MP211 block diagram
Power consumption of H.264+AAC
• H.264 video decoder (QVGA, 15 fps) and MPEG2 AAC decoder (48K stereo, 128 kbps)
• DTV: 87 mW (excluding I/O and SDRAM), 124 mW (including I/O and SDRAM)
• The L0 region represents baseline power consumption; the L1 region represents where IPs with high power consumption are running.
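Problems of Homogeneous MP
▷ Under power constraints, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power grows linearly.
▷ The interconnect depends not only on wires but also on logic.
▷ Processors are bound by wire and memory latencies.
▷ Peak performance is reached only for specific application areas.
▷ The on-chip interconnect has been designed independently of the cores and caches.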
Heterogeneous MP Core
If two or more cores share L2, the way a lot of present CMPs do, a crossbar provides a high-bandwidth connection.
A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.
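Heterogeneous MP core
• Easy hardware implementation: using widely available processor cores shortens hardware development time and lowers cost.
• Lower power consumption: the multiprocessor handles the distributed tasks at a reduced clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power.
• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.
• Boosting real-time performance: each application can run on a different processor, which reduces interference between applications.
• Higher system security: system software and untrusted applications can be separated by running them on different processors.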
Heterogeneous MP Core
▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large ones.
▷ A multi-ISA multicore architecture is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. The number, size, and type of cores, and the caches, must be decided. In an 8-core processor, the interconnect consumes as much power as one core.
▷ In the dual-processor case, for example, heterogeneous processors that exploit low and high thread levels improve performance by 63% over homogeneous ones. With 5 to 8 threads, the average improvement is 29%.
▷ Serial portions run quickly on a large core, while parallel portions run on small, low-power cores, maximizing performance per unit of power. [Annavaram, et al]
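NEC's Asymmetric (or Heterogeneous) Multiprocessing
• Using widely available processor cores shortens hardware development time and lowers cost.
• Lower power consumption: the multiprocessor handles the distributed tasks at a reduced clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power.
• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.
• Boosting real-time performance: each application can run on a different processor, which reduces interference between applications.
• Higher system security: system software and untrusted applications can be separated by running them on different processors.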
Problems of Heterogeneous MP-SoC
• Processors are bound by wire and memory latencies
• Peak performance on only a small class of applications
• How well they map to a given design
• Diversification of workloads
• Increased hardware complexity
• Poor resource utilization
AMP task allocation image
Bus and Memory Architecture
Alpha cores scaled to 0.10 um.
EV8 is 80 times bigger but provides only two to three times more single-threaded performance
Equal-area heterogeneous architectures with multithreaded cores
Exploring the potential from heterogeneity
MP-SoC Design Automation
Optimization and Synthesis
• Computation Synthesis:
– Task Allocation
– Task Scheduling
• Communication Synthesis:
– Interconnection Synthesis
– Buffer sizing
Energy-Aware Task mapping
Addressed by Hu et al 2002:
Minimize energy consumption, given a CTG and a heterogeneous NoC
• Find:
– A mapping function M : tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned
• Solution formulated as a quadratic assignment problem and solved using Branch and Bound.
• Communication-optimal task mapping
– minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification.
– given a multiprocessor network, find a mapping of the application that satisfies the timing constraints.
• Genetic algorithm (Chromosome, Generation, Crossover, Mutation)
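The mapping problem can be sketched as a small branch-and-bound search. This is an illustrative toy, not the formulation from the cited work: it assumes a 2D mesh, uses "traffic volume × hop count" as a stand-in for communication energy, and prunes only on the cost of the already placed tasks. The names `hops`, `map_tasks`, and the traffic values are hypothetical.

```python
def hops(tile_a, tile_b, width):
    """Manhattan distance between two tiles on a mesh that is `width` tiles wide."""
    ax, ay = tile_a % width, tile_a // width
    bx, by = tile_b % width, tile_b // width
    return abs(ax - bx) + abs(ay - by)


def map_tasks(n_tasks, traffic, width, n_tiles):
    """Find a task-to-tile mapping that minimizes sum(volume * hops).

    traffic maps (src_task, dst_task) to a non-negative communication volume.
    Returns (best_cost, mapping) where mapping[task] is the chosen tile.
    """
    best = {"cost": float("inf"), "mapping": None}

    def partial_cost(mapping):
        # Only edges whose two endpoints are already placed contribute.
        placed = len(mapping)
        return sum(volume * hops(mapping[s], mapping[d], width)
                   for (s, d), volume in traffic.items()
                   if s < placed and d < placed)

    def search(mapping, used):
        cost = partial_cost(mapping)
        if cost >= best["cost"]:
            return  # bound: placing more tasks can never lower the cost
        if len(mapping) == n_tasks:
            best["cost"], best["mapping"] = cost, tuple(mapping)
            return
        for tile in range(n_tiles):  # branch on the next task's tile
            if tile not in used:
                search(mapping + [tile], used | {tile})

    search([], frozenset())
    return best["cost"], best["mapping"]


# Hypothetical communication task graph on a 2x2 mesh.
traffic = {(0, 1): 10, (1, 2): 5, (0, 3): 1}
cost, mapping = map_tasks(4, traffic, width=2, n_tiles=4)
print(cost, mapping)
```

Because volumes are non-negative, the partial cost never decreases as tasks are added, which is what makes the pruning step safe.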
Interconnection Synthesis
– With each new technology:
– Gate delay decreases ~25%
– Wire delay increases ~100%
– Cross-chip communication increases
– Clock needs multiple cycles to cover die
Source: SIA NTRS Projection
Interconnect Delays & Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology
Buffer Sizing
• Architectures have bounded buffer resources.
• If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.
• Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.
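The effect of buffer depth on waiting time can be seen in a tiny cycle-based model. This is a sketch under simplifying assumptions that are not from the slides: one producer, one consumer, a single FIFO, at most one word written per cycle, and a consumer that drains one word every `consume_every` cycles. All names and the traffic pattern are hypothetical.

```python
def producer_stall_cycles(buffer_size, send_requests, consume_every):
    """Count cycles in which the producer has data but the FIFO is full.

    send_requests[c] is 1 if the producer generates a new word in cycle c.
    """
    fifo = 0      # words currently buffered
    pending = 0   # words generated but not yet written into the FIFO
    stalls = 0
    for cycle, new_word in enumerate(send_requests):
        if cycle % consume_every == 0 and fifo > 0:
            fifo -= 1                 # consumer drains one word
        pending += new_word
        if pending > 0:
            if fifo < buffer_size:
                fifo += 1             # producer writes one word
                pending -= 1
            else:
                stalls += 1           # producer waits for space
    return stalls


burst = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical bursty producer
for depth in (1, 2, 4):
    print(depth, producer_stall_cycles(depth, burst, consume_every=2))
```

The model covers only the waiting-time side of the trade-off; the area, clock-speed, and design-closure costs of deeper buffers are not represented.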
Multiple Clocks due to Interconnect limitation
MPSoC HW platform perspective
• Today's platforms are quite heterogeneous
– Reasons: efficiency and legacy IP
• Homogeneous MPSoC would scale well and would simplify programming
– Works well for desktop PCs
– Too inefficient for embedded apps
• Mixed MPSoC as a compromise?
– Globally homogeneous, locally heterogeneous
– (re)configurable PEs
www.iss.rwth-aachen.de
Future MPSoC programming
• Sequential-to-parallel code generation
– C code (and platform/RTOS model) in, parallel C codes out
» Step 1: exhibit parallelism at block/task level to the user for manual mapping
» Step 2: automate code partitioning/mapping
• Massive use of compiler technology, e.g. data flow analysis
• Use of "platform refinement" technology as backend for machine code generation and simulation
www.iss.rwth-aachen.de
The von Neumann inheritance
• Sequential programming of sequential machines
– Pascal, Modula-2, C, C++, Java, ...
• Sequential programming of parallel machines?
– VLIW: handled by sophisticated compilers
– SIMD: will be accomplished by compilers
– Does not scale to heterogeneous MPSoC with distributed control paths!
• Parallel programming of parallel machines!
– "We need to move ... to parallel thinking and programming... We are standing at the very beginning... It's a huge area." (J. Gutknecht, ETH Zurich)
• What to do in the meantime?
© 2007 R. Leupers
Block clustering approach
www.iss.rwth-aachen.de
Block clustering approach (cont'd)
www.iss.rwth-aachen.de
Multicore SoC Design Methodology
Network On Chip
Technology Evolution
What are NoC’s?
• According to Wikipedia:
– “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”
Network-on-Chip (NoC)
• Communication is achieved by connecting switches together to form a network topology:
• Offers much greater scalability.
• parallelism: multiple components can send data simultaneously
• energy efficient: point-to-point connections require less energy than a bus.
• Global synchronization is no longer needed.
NoC Design Considerations (I)
• There are several popular topologies:
– 2D Mesh (most popular).
– Torus (rings)
– Tree (fat-tree, butterfly fat-tree)
• The on-chip interconnection network will soon be a limiting factor for performance and energy consumption:
– has been reported to account for over 50% of the total energy requirement!
• The interconnect should consume the fewest resources possible and should be:
– area efficient: switches should be as small (simple) as possible.
– energy efficient: related to area efficiency
– fast: simple routing algorithms should be used.
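One commonly used example of a simple routing algorithm on a 2D mesh is dimension-ordered (XY) routing: a packet first travels along X until the column matches, then along Y. The slides do not name a specific algorithm, so this is an illustrative choice; the coordinate convention and the function name are assumptions.

```python
def xy_route(src, dst):
    """Return the list of (x, y) tiles visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1   # move one hop along X
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1   # then one hop at a time along Y
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Each router only compares the destination coordinates with its own, which keeps the switch logic small, in line with the area and speed goals listed above.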
NoC exemplified
[Figure: processor masters, a global memory slave, and global I/O slaves attached to a network of routing nodes]
NoC: Good news
☺ Only point-to-point one-way wires are used, for all network sizes.
☺ Aggregated bandwidth scales with the network size.
☺ Routing decisions are distributed and the same router is re-instantiated, for all network sizes.
☺ NoCs increase wire utilization (as opposed to ad-hoc p2p wires)
Sergio Tota and Mario R. Casu
There’s no free lunch…
Internal network contention causes (often unpredictable) latency.
The network has a significant silicon area.
Bus-oriented IPs need smart wrappers.
Software needs clean synchronization in multiprocessor systems.
System designers need reeducation for new concepts.
Sergio Tota and Mario R. Casu
Facts about NoC’s
• It is a way to decouple computation from communication
• The design is layered (physical, network, application…): Taming complexity is made easier
• Communication between processing elements in NoC takes place by encapsulating data in packets
• The elementary packet piece to which switch and routing operations apply is the flit
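Packetization can be sketched as splitting a message into a head flit followed by body flits and a tail flit. The flit format below (a tuple of kind and content, one payload word per flit, the destination carried in the head) is a hypothetical simplification, not a format taken from any of the NoCs discussed here.

```python
def packetize(dest, payload_words):
    """Split a message into flits: one HEAD with the destination, then the payload.

    The last payload word is marked TAIL so a switch can release the path.
    """
    flits = [("HEAD", dest)]
    for i, word in enumerate(payload_words):
        kind = "TAIL" if i == len(payload_words) - 1 else "BODY"
        flits.append((kind, word))
    return flits


for flit in packetize(dest=5, payload_words=[10, 11, 12]):
    print(flit)
```

Only the head flit carries routing information here; the remaining flits simply follow the path it opens, which is the idea behind wormhole switching used by several of the designs on the following slides.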
Topologies
• Heritage of networks with new constraints
– Need to accommodate interconnects in a 2D layout
– Cannot route long wires (clock frequency bound)
a) SPIN, b) CLICHÉ, c) Torus, d) Folded torus, e) Octagon, f) BFT.
SPIN (Guerrier et al., DATE ’00/’03)
• Wormhole switching, adaptive routing and credit-based flow control.
• It is based on a fat-tree topology.
• A flit is only one word (36 bits, 4 bits are for packet framing).
• The input buffers have a depth of 4 words
Dally et al., DAC’01
• 2D folded torus topology
• Wormhole routing and Virtual Channels (VC)
Kumar et al., ISVLSI’02
• Chip-Level Integration of Communicating Heterogeneous Elements, CLICHÉ
• 2D Mesh Topology
• Message Passing
Pande et al., TCOMP’05
• Butterfly Fat Tree
• Wormhole, Virtual channels
• Header flits: 3 ck cycles latency (input arbitration, routing, output arbitration)
• “Body” flits: 3 ck cycles (input arbitration, switch traversal, output arbitration)
Goossens et al., IEE CDT’03
• Both VCT and WH, GT and BE, IQ and VOQ
• GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from In "j" to Out "k" in a S&F fashion.
• BE uses Matrix Scheduling
• GT connections set up by BE special system packets
• Prototype with WH and IQ
– 5 ports
– 0.13 um, 0.26 mm², 500/166 MHz
– Flit size = 3 words, each 32 bits
– 80 Gb/s aggregate bandwidth
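Two of the points above lend themselves to a quick check. The first function shows one reading of the prototype figures that reproduces the quoted 80 Gb/s (5 ports × 32-bit words × 500 MHz); whether the authors computed it this way is an assumption. The second function checks that a TDM slot table is contention-free, meaning no two inputs use the same output in the same slot. The slot-table layout is hypothetical.

```python
def aggregate_bandwidth_bps(ports, word_bits, clock_hz):
    """Peak aggregate bandwidth if every port moves one word per clock cycle."""
    return ports * word_bits * clock_hz


def slot_table_is_contention_free(slot_table):
    """slot_table[s] maps input port -> output port for time slot s."""
    return all(len(set(slot.values())) == len(slot) for slot in slot_table)


print(aggregate_bandwidth_bps(5, 32, 500_000_000) / 1e9, "Gb/s")

good_table = [{0: 1, 1: 2}, {0: 2, 1: 0}]   # every slot uses distinct outputs
bad_table = [{0: 1, 1: 2}, {0: 2, 1: 2}]    # slot 1: inputs 0 and 1 both target output 2
print(slot_table_is_contention_free(good_table))
print(slot_table_is_contention_free(bad_table))
```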
SKKU’s Mobile MP-SoC Platform
Koonshik Cho & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.
1. Multiprocessor SoC Design Platform
• SW: respond actively to performance improvements and changes in standards
• HW: Modular, Flexible and Scalable Architecture
• Platform-based design
2. Multiprocessor SoC Platform Test
• DVB-T Receiver
3. Tools
• Seamless CVE (Mentor Graphics)
• ADS (ARM)
SoC (DSP+ARM) Platform
Extended multi-processor platform
ARM Platform
AMBA BUS (1)
The AMBA bus contains a multiplexer, an arbiter, and a decoder, and mediates between multiple masters and slaves.
AMBA BUS (2)
AMBA Bus (Master to slave multiplexer)
• A bus master drives address and control signals to a slave to initiate operations such as reads and writes. Only one master may transfer at any given time. Multiple masters are supported.
AMBA BUS (3)
AMBA Bus (Slave to master multiplexer)
• A bus slave serves reads and writes from a master within its assigned address space. The slave reports its operating status to the master through the Ready and Response signals. Multiple slaves are supported.
AMBA BUS (4)
• AHB Arbiter: the bus arbiter ensures that only one master is selected at a time. It arbitrates according to its own priority scheme, and an AHB has exactly one arbiter.
• AHB Decoder: the AHB decoder selects the appropriate slave using the upper bits of the address driven by the master. An AHB likewise has a single decoder.
• APB Bridge: the only bus master on the APB (Advanced Peripheral Bus). The APB bridge is a slave on the ASB; when the decoder selects the APB, the bridge acts as the master on the APB. As a slave module, the APB bridge handles the bus handshake and signal retiming on behalf of the local peripheral bus.
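The decoder and arbiter roles can be modeled in a few lines. This is a behavioral sketch, not the AMBA specification: the number of address bits used for slave selection, the address map, and the fixed-priority policy (lowest master index wins) are all assumptions made for illustration.

```python
def select_slave(address, select_bits=4, address_bits=32):
    """Decoder: pick a slave index from the upper bits of the address."""
    return address >> (address_bits - select_bits)


def fixed_priority_grant(requests):
    """Arbiter: grant the bus to the lowest-numbered requesting master, or None."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


print(select_slave(0x40001000))                   # upper nibble 0x4 selects slave 4
print(fixed_priority_grant([False, True, True]))  # master 1 wins over master 2
```

Exactly one grant is produced per arbitration round, matching the rule that only one master may own the bus at a time.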
AMBA BUS (5)
• Interrupt controller: receives interrupt request signals from up to 32 interrupt sources and generates the nIRQ or nFIQ signal applied to the ARM9 processor. Of the 32 interrupt sources, sources 0-3 generate nFIQ and sources 4-31 generate nIRQ. A lower number means a higher priority.
• Timer: the timer module provides three timers. Each timer is a 16-bit counter that supports three prescale values (16, 256, and 4096). It decrements the counter by 1 every period and raises an interrupt when the count reaches 0. The interrupt request is held until the ARM9 processor acknowledges it through the timer interrupt clear register.
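The interrupt controller's source-to-line mapping and priority rule can be sketched as follows. The pending-mask representation (bit n set means source n is requesting) and the returned dictionary are hypothetical; only the 0-3 / 4-31 split and the "lower number wins" rule come from the description above.

```python
FIQ_SOURCES = range(0, 4)    # sources 0-3 drive nFIQ
IRQ_SOURCES = range(4, 32)   # sources 4-31 drive nIRQ


def highest_priority(pending_mask, sources):
    """Return the lowest-numbered pending source in `sources`, or None."""
    for n in sources:
        if pending_mask & (1 << n):
            return n
    return None


def interrupt_outputs(pending_mask):
    """Model which lines are asserted and which source would be served first."""
    fiq = highest_priority(pending_mask, FIQ_SOURCES)
    irq = highest_priority(pending_mask, IRQ_SOURCES)
    return {
        "nFIQ_asserted": fiq is not None, "fiq_source": fiq,
        "nIRQ_asserted": irq is not None, "irq_source": irq,
    }


# Sources 2, 7 and 9 are pending in this hypothetical snapshot.
print(interrupt_outputs((1 << 2) | (1 << 7) | (1 << 9)))
```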
Teak DSP Platform
• Structure of the Teak DSP platform, which serves as the coprocessor in the overall platform
Configuration of crossbar switch
• Communication interface architecture (crossbar structure)
Reconfigurable crossbar switch
Uses the VHDL generate statement
Reconfigurable crossbar switch (VHDL code)
Entity of the CI module (ci_top.vhd):
entity CI_TOP is
  generic ( number_of_masters, number_of_slaves : integer );
  port ( … );  -- port list omitted
end CI_TOP;
Use of the CI module (multiplatform.vhd):
COMMUNICATION_INTERFACE : CI_TOP
  generic map ( number_of_masters => 4, number_of_slaves => 6 )
  port map ( … );  -- port map omitted
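Communication Interface: Mux vs Crossbar
Mux
Advantages:
‣Relatively easy to implement
‣Effective when there are few masters and slaves
Disadvantages:
‣Parallel processing among processors is difficult
‣Causes a bottleneck when the system is expanded
Crossbar
Advantages:
‣Effective parallel processing among processors
‣Delay stays the same even when the system is expanded
Disadvantages:
‣Difficult to implement
‣Relatively unfavorable in size and power consumption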
Interconnection network
Omega interconnection Octagon interconnection
Mesh interconnection
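Interconnection network: advantages and disadvantages
Shared bus
‣Relatively easy to implement
‣Effective when there are few masters and slaves. The delay stays the same even when the system is expanded
‣Parallel processing among processors is difficult
‣Low bus efficiency
‣High power consumption (broadcasting)
‣Implementation complexity: low
Crossbar
‣Parallel processing among processors is possible
‣Scalability and flexibility: excellent
‣Data path: average
‣Implementation complexity (routing and scheduling): average
‣Size and wiring grow as the system is expanded
Omega network
‣Parallel processing among processors is possible
‣Scalability and flexibility: average
‣Data path: excellent (short)
‣Implementation complexity (routing and scheduling): high
‣Scalability is somewhat limited
Octagon
‣Parallel processing among processors is possible
‣Scalability and flexibility: average
‣Data path: excellent (shortest)
‣Implementation complexity (routing and scheduling): high
‣Scalability is somewhat limited (the number of masters and slaves is limited to 8)
Mesh
‣Parallel processing among processors is possible
‣Scalability and flexibility: excellent
‣Data path: average
‣Implementation complexity (routing and scheduling): very high
‣Suitable for medium and large systems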
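Network Router Cell
– The multiprocessor platform has 4 masters and 6 slaves
– The CI cell is built from 24 2-by-2 muxes
– CI Controller => Req, Grant, mux control etc.
– With Seamless CVE and ModelSim linked together, the ARM926 and the Teak DSP access the slaves at the same time, and each reads and writes its own data (figure: platform function block)
– When a master accesses a slave (IP), the CI controller handles the request signals, the grant signals, the control signals for each mux, round-robin arbitration, and address decoding (figure: CI controller inner block)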
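The round-robin function of the CI controller can be sketched as a small behavioral model. The slides do not show the controller's internals, so the class name, the interface, and the choice to start the search just after the previously granted master are assumptions.

```python
class RoundRobinArbiter:
    """Grant one requesting master per round, rotating priority after each grant."""

    def __init__(self, n_masters):
        self.n_masters = n_masters
        self.last_granted = n_masters - 1  # so master 0 is checked first

    def grant(self, requests):
        """requests[m] is truthy if master m wants the slave. Returns a master or None."""
        for offset in range(1, self.n_masters + 1):
            candidate = (self.last_granted + offset) % self.n_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)          # 4 masters, as in the platform above
for _ in range(3):
    print(arbiter.grant([1, 1, 0, 0]))  # masters 0 and 1 keep requesting
```

Because the starting point rotates, two masters that request continuously are served alternately, so neither can starve the other.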
CI-controller State Diagram
CI controller simulation waveform
DVB-T Baseband Receiver
Hardware-software co-design flows
A shared memory structure and hardware-software partitioning
Frequency offset compensator hardware
Fine and Coarse Frequency Synchronizer (Beek & Classen)
FFT block diagram
Equalizer hardware block diagrams
DVB-T baseband Receiver Scheduling
Performance evaluation
Processing Types /
Functional BlockSW
SW & HW (Teaks + ARM +
HW IP)
HW(IP) only MAL
Frequency compensator & Remove Guard - 182.5us 13.8us 10.5us
Fine Freq. sync. (Beek) - 56.3us 1.5us 7.8us
Symbol Timing Recovery 144 us - - 5.2us
FFT - 188.9us 38.6 us 13.6us
Coarse Freq. Sync. (Classen) - 241us 3.3us 11us
Scattered Pilot Detection 46.5us - - 3.3us
Equalizer - 219.5us 11.2us 9.5us
De-mapping 19.9us - - 4.9us
Task Chart of Multi-processor platform for DVB-T baseband receiver
Modeling of Motion Compensation IP using SCML
Le Minh Nghia & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.
Introduction to SystemC Modeling in CoWare
• TLM Peripheral Modeling with CoWare
• SystemC Modeling Library (SCML)
• Motion Compensation Modeling using SCML
TLM Peripheral Modeling with CoWare
• Four use-cases for Transaction-Level Modeling (TLM)
– Functional View (FV)
– Architecture’s View (AV)
– Programmer’s View (PV)
– Verification View (VV)
TLM Peripheral Modeling with CoWare
• General pattern for modeling a peripheral component
– Separate Behavior, Communication and Timing
– Platform-dependent Initiators and Targets are created by the user.
– A bus transactor converts a generic communication into a bus-specific TLM interface.
– Accuracy of Timing depends on the use-case
TLM Peripheral Modeling with CoWare
• Communication
– Communication through function calls
– Simulation speed strongly depends on the bus model
– The PV bus model used for software development can be simulated very fast
• Behavior
– Functionality
– Synchronization
– Storage
• Timing
– Timing is modeled with the clock object in the SystemC Modeling Library (SCML)
TLM Peripheral Modeling with CoWare
• Modeling Target pattern
– Communication: Bus transactor
– Storage and Synchronization: Register bank as interface
– Behavior: Collection of call-back functions, each call-back corresponding to a bitfield or register in the register bank
TLM Peripheral Modeling with CoWare
• Modeling Initiator pattern
– Communication: the bus transactor converts the posted transactions in the queue into real bus transactions
– Storage and Synchronization: includes a Post port and an initiator storage element, scml_array (in SCML). The Post port posts transactions in a nonblocking manner. The actual synchronization depends on the data and space available in the storage element, which is tied to the scml_array object
– Behavior: Modeled by autonomous SystemC processes
Initiator Synchronization
• Two classes of initiator blocks:
– Free-running initiator: the transfers initiated by the block do not need any access from another peripheral
– Slaved initiator: the block has a target port, and transfers are only initiated when that port is accessed
• Three synchronization patterns for an initiator block:
– Free-running Initiator
– Fully Slaved Initiator
– Semi-free running Initiator
Initiator Synchronization
• Modeling a Free-Running Initiator Peripheral
– The thread is modeled by SC_THREAD and posts transactions
– wait(sc_time): schedules the next execution of the thread
Initiator Synchronization
• Modeling a Fully-Slaved Initiator Peripheral
– A slaved initiator only sends a transaction when its target port is accessed
– The loop in a fully slaved initiator returns control to the master thread after it has posted a transaction
Initiator Synchronization
• Modeling a Semi-Free-Running-Slaved Initiator Peripheral
– The thread containing the loop is triggered by a start event
– The start event is generated by accessing the target port of the initiator
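The semi-free-running-slaved pattern can be mimicked outside SystemC with an ordinary thread that blocks on a start event. This is only an analogy in Python: `threading.Event` stands in for an `sc_event`, a `queue.Queue` stands in for the post port, and the class and method names are hypothetical. It does not use or reproduce the SCML API.

```python
import queue
import threading


class SemiFreeRunningInitiator:
    """A thread that stays idle until its 'target port' is written, then runs freely."""

    def __init__(self, n_transactions):
        self.n_transactions = n_transactions
        self.start_event = threading.Event()   # stands in for the start event
        self.posted = queue.Queue()            # stands in for the post port
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def write_start_register(self, value):
        """Target-side access: writing a non-zero value triggers the thread."""
        if value:
            self.start_event.set()

    def _run(self):
        self.start_event.wait()                # blocked until the target port is accessed
        for i in range(self.n_transactions):
            self.posted.put(("write", i))      # post transactions without further triggers


initiator = SemiFreeRunningInitiator(n_transactions=3)
print("posted before start:", initiator.posted.qsize())
initiator.write_start_register(1)
initiator.thread.join(timeout=2)
print("posted after start:", initiator.posted.qsize())
```

A fully slaved variant would instead post a single transaction per target access and then wait again, while a free-running one would skip the start event altogether.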
SystemC Modeling Library (SCML)
• Memories and Bitfield objects:
– To model bit-fields and memory-mapped registers
– Memory objects support posting non-blocking transactions
– Support synchronization by reading and writing data based on blocking access
• Clock object
– To model timing or clocks in IPs
• Initiator-side object
– Models the communication of initiator peripherals to support re-use.
Modeling TLM Motion Compensation
• Outline features of the Motion Compensation IP
– Synchronization: Semi-Free-Running-Slaved Initiator
– Behavior: Algorithm extracted from J.M source code
– Structure includes two parts
• Target part: interfaces with the Master Processor using a register bank, modeled following the Target pattern of SCML
• Initiator part: models the posting of transactions and the synchronization of transaction transfers, following the Initiator pattern of SCML
Modeling TLM Motion Compensation
• Three ports:
– pConfig: interface with the Master Processor to receive parameters.
– p_Irq: generates an interrupt to synchronize with the Master Processor
– p_Post: posts transactions to a specific bus through the bus transactor
• Register bank: for the parameters of the Motion Compensation block
• StartStopReg and IrqReg: for the interface with the Master Processor
• Behavior block: for the Motion Compensation algorithm and transaction modeling
• Call-back functions: for events caused by writing StartStopReg and IrqReg.
Modeling TLM Motion Compensation
• Functions in the TLM Motion Compensation Model:
– f_initialize(): initializes the parameters of the model
– f_thread(): waits for the event generated by writing to StartStopReg
– f_write_start_stop(): call-back function for the event of writing to StartStopReg. It activates or deactivates the model by generating an sc_event to signal f_thread().
– f_clear_irq(): clears IrqReg
– f_MotionCompensation(): Motion Compensation behavior based on the original source code in the J.M reference software.
– f_do_post(): posts the transactions in storage (transaction pool) to the bus transactor and manages the synchronization of posting
– f_postTransfer(): posts a transaction to the bus transactor
– f_release_trans(): releases the transaction pool
Next…
• Extract parameters as TestVector from J.M source code
• Build a platform in CoWare
• Test Motion Compensation IP with TestVector
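Concluding Remarks
• As the complexity and cost of (mobile) SoCs grow, a design process based on an MP-SoC platform becomes important
• The challenges for mobile platforms are low power, verification that includes the RF interface, the variety of standards, and platform optimization
• It is desirable to develop a platform that takes into account the strengths and weaknesses of the various platforms and methodologies
• Train people (system architects) who understand and can design across HW, SW, and algorithms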