1
QoS for High-Performance
and Power-Efficient HD
Multimedia Systems
Rob Kaye
2
Convergence is Happening – For Real
1GHz+ Processor, increasingly multi-core SMP-capable
CPUs
1080p HD video & graphics
Internet connectivity: either wired, wireless, or both
3
The Need for Quality of Service
Communication explosion: more masters, more functions, more data
Multiple high-performance masters competing for limited memory bandwidth
QoS employed to manage traffic flows through interconnect and memory controller
Allocate bandwidth and manage latency appropriately
Allocate any excess capacity for greatest benefit
4
Little’s Law for Queuing Latency
NT = RT × LT
where
NT = number of requests waiting
(“outstanding transactions”)
RT = arrival rate
(bandwidth requested)
LT = latency
(delay in request being completed)
Note: To achieve the maximum theoretical bandwidth from the memory system:
Replace RT with the theoretical peak memory bandwidth
NT = Bandwidth × Latency
Gives the minimum number of queued outstanding transactions needed to achieve the peak theoretical bandwidth of the memory system
[Chart: transaction population vs latency in clocks, showing system latency as static latency plus queuing latency]
http://crd.lbl.gov/~dhbailey/dhbpapers/little.pdf
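Little's Law can be applied numerically. A minimal Python sketch, using an illustrative memory system whose numbers (6.4 GB/s peak bandwidth, 64-byte bursts, 60 ns latency) are assumptions, not figures from the slides:

```python
# Little's Law: NT = RT * LT, with RT expressed as transactions per second.
def min_outstanding(peak_bw_bytes_per_s, burst_bytes, latency_s):
    """Minimum outstanding transactions (NT) to sustain peak bandwidth."""
    request_rate = peak_bw_bytes_per_s / burst_bytes  # RT, transactions/s
    return request_rate * latency_s                   # NT = RT * LT

# Hypothetical memory system: 6.4 GB/s peak, 64-byte bursts, 60 ns latency.
print(f"{min_outstanding(6.4e9, 64, 60e-9):.2f}")  # 6.00
```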
6
How Much Buffering is Needed?
NT = Bandwidth * Latency
Simplistically, NT = Latency / Time per transaction
If latency is 20 cycles and each burst takes 4 cycles of active data, then to maintain 100% active data cycles there must be at least 20 / 4 = 5 outstanding transactions
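The arithmetic above can be written as a tiny helper. A sketch (the function is hypothetical, not from the slides), using ceiling division so a partial burst still gets a slot:

```python
def min_outstanding_txns(latency_cycles, data_cycles_per_burst):
    """NT = latency / time per transaction, rounded up."""
    return -(-latency_cycles // data_cycles_per_burst)  # ceiling division

print(min_outstanding_txns(20, 4))  # 5, matching the worked example above
```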
[Charts from a PL340 SDR SDRAM controller study: total utilization and read-first latency vs average DMC queue depth, comparing theoretical, adjusted, and observed curves, with static latency and processing rate annotated]
7
CPU Latency Sensitivity: Browser
Memory latency baseline is 130ns
~50ns increments up to 330ns
Measured cached time
Cortex-A8 768:192:192MHz, 32KB L1, 256KB L2
33% performance loss from 130ns to 330ns latency
[Chart: normalized execution time vs effective memory latency (130ns to 330ns), showing memory latency sensitivity with varying L2 size: 0KB, 256KB, 512KB, 1024KB]
Averaged over three runs with different sleep values on SystemBench; 45–50B cycles/run
8
Reducing CPU Latency
Make CPU high priority
Put it in highest priority group
Cache memory reduces latency seen by the master (e.g. CPU)
Reduces memory bandwidth, which reduces latency to other masters and saves power
Diminishing returns from increasing cache size
Write data can be buffered
Coherency must be observed
The latency for write traffic seen by the system is significantly reduced
Read latency reduced by prioritizing reads
9
Dealing with Latency-Critical Masters
Real-time latency-critical masters like LCD controllers
Adding latency does not affect performance
Until latency limit is reached
Increase latency tolerance by inserting an additional buffering FIFO
Priority lower than CPU
Reduces the latency to CPU
If the transaction is still waiting after a time-out period
Promote to highest priority
Only higher priority than CPU if/when necessary
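The time-out promotion scheme above can be modelled in a few lines. A sketch (a hypothetical model, not an ARM interface; the priority encoding and time-out value are assumptions):

```python
TIMEOUT = 64  # assumed time-out, in cycles, before promotion

def effective_priority(base_priority, wait_cycles, latency_critical):
    """Lower value = higher priority; the CPU defaults to priority 0."""
    if latency_critical and wait_cycles > TIMEOUT:
        return -1  # promoted above the CPU, only when necessary
    return base_priority

# A latency-critical request within its time-out stays below the CPU...
assert effective_priority(1, 10, True) > effective_priority(0, 10, False)
# ...but jumps the queue once the time-out expires.
assert effective_priority(1, 100, True) < effective_priority(0, 10, False)
```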
[Diagram: priority order, highest to lowest: CPU; time-out promotion; latency-critical masters; GPU, mem-mem DMA etc]
10
Handling Batch Processing Masters (eg GPUs)
These devices can soak up almost unlimited bandwidth
Memory to memory DMA another example
Can swamp system with transactions
Can typically support multiple outstanding
transactions
An SDRAM controller with page-hit detection exacerbates the issue
Make these devices lowest priority
Option to increase priority to
ensure a certain minimum bandwidth is obtained
11
System-Level QoS Study
[Diagram: Video, GPU, HDLCD, and 2 x CPU + L2 connect through bus switches I1 and I2 of the AMBA Network Interconnect NIC-301 to the DMC]
12
What Bandwidth is Needed for 1080p?
Item                                   | Value
Display refresh (1920x1080 @ 60Hz)     | 497.6MB/s ≈ 500MB/s
GPU bandwidth (estimate)               | 1.5GB/s
Video decode (approx)                  | 500MB/s
Total (no video)                       | 2.0GB/s
Total (with video & GPU)               | 2.5GB/s
Excludes CPU and other DMA bandwidth
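The display-refresh figure can be reproduced directly. A quick check, assuming 32-bit (4 bytes per pixel) framebuffer reads, which is consistent with the number on the slide:

```python
# Display refresh bandwidth for 1080p at 60Hz, 4 bytes per pixel (assumed).
width, height, fps, bytes_per_pixel = 1920, 1080, 60, 4
bw_bytes_per_s = width * height * fps * bytes_per_pixel
print(f"{bw_bytes_per_s / 1e6:.3f} MB/s")  # 497.664 MB/s ≈ 500MB/s
```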
13
How Important Is Interconnect to QoS?
SDRAM QoS scheme relies on there being space for QoS masters in the SDRAM queue
High outstanding transactions & high latency cause queue to fill
Stalls interconnect
Time-out measures time in SDRAMC only
Real-time masters cannot jump the queue
QoS mechanism breaks down
Interconnect needs to 'regulate' outstanding transactions
[Diagram: masters connect through the interconnect to the memory controller (format stage and PHY) and then to memory]
14
Transaction Issue Rate Regulation: Little's Law
NT = RT × LT
Queue length = Arrival rate * Latency
Regulate arrival rate to control queue length & latency
Latency = Queue length / Arrival rate
Issue rate regulation sometimes known as TSPEC
From Traffic SPECification, used in networking QoS terminology
Approximates to bandwidth regulation (burst size)
Gives a 'hard' limit to the max bandwidth of a master
Like a speed limit on the master
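Issue-rate regulation behaves like a token bucket. A minimal sketch (a hypothetical model, not ARM's implementation; the rate and burst parameters are assumptions):

```python
class IssueRateRegulator:
    """Token-bucket model of TSPEC-style issue-rate regulation."""
    def __init__(self, rate_per_cycle, burst_size):
        self.rate = rate_per_cycle   # permitted transactions per cycle
        self.burst = burst_size      # max tokens that can accumulate
        self.tokens = burst_size

    def tick(self):
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_issue(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # master stalled: the 'hard' speed limit

# A master allowed 1 transaction per 4 cycles issues at most 25 of 100 attempts.
reg = IssueRateRegulator(rate_per_cycle=0.25, burst_size=1)
issued = sum(1 for _ in range(100) if (reg.tick() or True) and reg.try_issue())
print(issued)  # 25
```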
15
Outstanding Transaction Regulation (OT)
Latency = Queue length / Arrival rate
Reducing queue length (outstanding) reduces Latency
Regulate number of outstanding transactions to control
SDRAMC queue
Avoid over-regulation as that could affect SDRAM efficiency
Nicely adaptive – Regulated masters get additional bandwidth
when system is lightly loaded – no hard limit
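OT regulation can be sketched as a simple in-flight counter (a hypothetical model, not an ARM interface): the interconnect stalls new requests from a master once it has its limit of transactions in flight, bounding that master's contribution to queue length NT.

```python
class OTRegulator:
    """Caps a master's outstanding (in-flight) transactions."""
    def __init__(self, max_ot):
        self.max_ot = max_ot
        self.in_flight = 0

    def try_issue(self):
        if self.in_flight < self.max_ot:
            self.in_flight += 1
            return True
        return False  # stalled until a response returns

    def response_received(self):
        self.in_flight -= 1

# A limit of 3 outstanding reads, as used for the GPU in the study later on:
reg = OTRegulator(max_ot=3)
assert all(reg.try_issue() for _ in range(3))
assert not reg.try_issue()       # 4th request stalls
reg.response_received()
assert reg.try_issue()           # issuing resumes once a response returns
```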
[Chart: % of total cycles vs queue depth, comparing DMC queue fill with DMC outstanding transactions]
16
Latency Regulation
Controlling the third variable in Little's Law: latency (LT)
Cannot directly control LT
Dynamically adjust priority in a closed loop
Set the lowest priority that meets the latency requirement
Adaptive to lightly loaded systems: masters get more bandwidth when the system is lightly loaded
Requires co-operation from the slave (memory controller) to prioritize
Default low priority
Measure latency
Compare with target
Increase priority if latency > target, and vice versa
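The closed loop above can be sketched in a few lines (a hypothetical model, not ARM's controller; the priority range and measured latencies are assumptions):

```python
LOWEST, HIGHEST = 7, 0  # lower number = higher priority (assumed encoding)

def adjust_priority(priority, measured_latency, target_latency):
    """One step of the closed loop: raise priority when over target."""
    if measured_latency > target_latency:
        return max(HIGHEST, priority - 1)  # increase priority
    if measured_latency < target_latency:
        return min(LOWEST, priority + 1)   # relax toward default low priority
    return priority

# Priority rises while latency exceeds the 100-cycle target, then relaxes.
p = LOWEST
for latency in [120, 110, 90, 80]:
    p = adjust_priority(p, latency, 100)
print(p)  # 7
```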
17
QoS Validation with VPE
Used VPE (Verification and Performance Exploration)
VPE executes much faster than RTL
Reduced a 2M-cycle testbench simulation from 4 hours to 4 minutes
Statistically matches the pattern of traffic
18
Performance Without QoS-301
Bandwidth per master is calculated for:
GPU active phase
GPU idle phase
Aggregate (total) for frame
Results for unconstrained system
GPU was active for 55% of the frame, during which it achieved a bandwidth of 2734MB/s
1500MB/s overall
CPU achieved bandwidth of
32MB/s when GPU was active
163 MB/s when GPU was idle (5x)
91 MB/s overall
19
Issue Rate Regulation
Regulate GPU RT (transaction rate) to:
2.4 GB/s (vs 2.66 GB/s unregulated)
Results
GPU bandwidth
Active for 61% frame (+11%)
2441MB/s active (-11%)
1500MB/s overall (+0%)
Maximum NT = 2.65 (measured)
CPU achieved bandwidth of
119MB/s when GPU active (+279%)
163 MB/s when GPU idle (+0%)
136 MB/s overall (+50%)
Factor improvement over
unconstrained case
+50% CPU
bandwidth
20
Outstanding Transactions (OT) Regulation
Regulate GPU number of transactions at input to system:
3 outstanding read transactions
1 outstanding write transaction
Results
GPU bandwidth
Active for 56% of frame (+1%)
2664MB/s active (-3%)
1500MB/s overall (+0%)
CPU achieved bandwidth of
99MB/s when GPU active (+215%)
163 MB/s when GPU idle (+0%)
127 MB/s overall (+40%)
Suffered from lack of granularity in OT level
Factor improvement over
unconstrained case
CPU bandwidth increased by 40%
+40% CPU
bandwidth
21
Fractional Outstanding Regulation
Regulating maximum outstanding transactions is often preferable to regulating bandwidth
More adaptive to loading
Integer NT provided too coarse-grained control – Needed ~2.5 OT
Added average number of outstanding transactions to QoS-301
By varying duty cycle, e.g. NT = 2.4
Finer degree of control
Useful when many low-bandwidth masters
Each may only require NT <<1
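Duty-cycling an integer limit to hit a fractional average can be sketched as follows (a hypothetical illustration of the idea, not ARM's QoS-301 mechanism; the window length is an assumption):

```python
def duty_cycle_limits(target_ot, window):
    """Per-cycle integer OT limits whose average over the window is target_ot."""
    lo = int(target_ot)
    hi_cycles = round((target_ot - lo) * window)  # cycles at the higher limit
    return [lo + 1] * hi_cycles + [lo] * (window - hi_cycles)

# For an average NT = 2.4 over a 10-cycle window: limit 3 for 4 cycles, 2 for 6.
limits = duty_cycle_limits(2.4, 10)
print(limits, sum(limits) / len(limits))  # [3, 3, 3, 3, 2, 2, 2, 2, 2, 2] 2.4
```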
22
Latency Reduction with OT Regulation
Unconstrained system
Large number of queuing transactions (NT) from GPU
NT = 14 (read), 8 (write)
Little or no benefit to GPU: the DMC cannot supply more bandwidth in this example system
Queuing latency affects CPU bandwidth
NT = 0.74 (read), 0.21 (write)
CPU cannot issue more simultaneous requests
Regulated system
NT sufficient for GPU bandwidth
Queuing latency (LT) reduced
CPU gains BW
Fewer request buffers required
23
OT versus TSPEC Regulation
Outstanding Transaction Regulation (OT)
[Chart: bandwidth vs queuing latency under OT regulation: adaptive]
Issue Rate Regulation (TSPEC)
[Chart: bandwidth vs queuing latency under TSPEC regulation: bandwidth fixed by definition]
Consider what happens when the system bandwidth requirement reduces
System queuing latency reduces
Under OT regulation the regulated master adapts, gaining bandwidth as queuing latency falls
Under TSPEC the bandwidth is fixed by definition, so it doesn't degrade as system workload increases
24
How QoS-301 is Inserted into NIC-301
The QoS-301 hardware can be configured at any NIC-301 slave interface (ASIB) or internal interface block (IB) with AMBA® Designer
25
QoS Techniques and Their Applications
[Table: QoS techniques (Issue Rate Regulation, Latency Regulation via priority, Outstanding Transaction Regulation) against the capabilities Min Bandwidth, Max Bandwidth, Max Latency, and Adaptive?]
These techniques can be used in isolation or together in combination
26
Future technology challenges with QoS
Cortex™-A15 and ARM's next-generation CoreLink™ system IP and Mali™ graphics bring higher performance and new technology
AMBA 4 Phase 2 in 2011 brings coherency, barriers and virtualisation
ARM is developing roadmap interconnect products for
release in 2011
Network interconnect for efficient connectivity with packetization,
clock management and QoS extensions
High performance coherent interconnect
QoS is critical to system performance, bandwidth and latency
New technologies including virtual networks are in development
27
QoS for Cortex-A15 and Mali
Optimized non-blocking interconnect with
Cache coherency up to 8 Cortex-A15 cores
End to end QoS
Lowest latency for CPU
Highest bandwidth for GPU
New high-efficiency memory controller
1/2/4 channels DDR3 or LPDDR2 up to 1066MHz
System MMU for I/O virtualization
Complements Cortex-A15 virtualization extensions
ARM is building systems with processor, graphics, interconnect and memory to test QoS for real applications
[Diagram: two quad Cortex-A15 clusters and Mali 3D graphics (behind an MMU-400) on the AMBA 4 Cache Coherent Interconnect CCI-400; I/O devices attach via MMU-400; AXI Network Interconnects NIC-400 connect LCD, video, and other slaves; the DMC-400 Dynamic Memory Controller drives DDR3/LPDDR2 PHYs; GIC-400 interrupt controller]
28
Summary
Little's Law shows there are three ways to regulate latency with QoS
Outstanding transactions
Issue Rate
Latency – via dynamic priority
ARM CoreLink NIC-301 with Advanced Quality of Service QoS-301 supports all three singly or in combination
Simulation tuning enabled by fast turn-around of VPE simulations
Programmable for tuning and optimization in silicon
Latency regulation supported in conjunction with DMC-400
QoS is important part of the CoreLink system IP mission to maximize performance and power efficiency
LET 'ER ROLL!
29
Thank You
Please visit www.arm.com for ARM related technical details
For any queries contact < Salesinfo-IN@arm.com >