1
QoS for High-Performance
and Power-Efficient HD
Multimedia Systems
Rob Kaye
2
Convergence is Happening – For Real
1GHz+ Processor, increasingly multi-core SMP-capable
CPUs
1080p HD video & graphics
Internet connectivity: either wired, wireless, or both
3
The Need for Quality of Service
Communication explosion: more masters, more functions, more data
Multiple high-performance masters competing for limited memory bandwidth
QoS employed to manage traffic flows through interconnect and memory controller
Allocate bandwidth and manage latency appropriately
Allocate any excess capacity for greatest benefit
4
Little’s Law for Queuing Latency
NT = RT × LT
where
NT = number of requests waiting
(“outstanding transactions”)
RT = arrival rate
(bandwidth requested)
LT = latency
(delay in request being completed)
Note: To achieve the maximum theoretical bandwidth from the memory system:
Replace RT with the theoretical peak memory bandwidth
NT = Bandwidth × Latency
Gives the minimum number of queued outstanding transactions needed to achieve the peak theoretical bandwidth of the memory system
[Chart: transaction population vs latency in clocks, showing system latency as static latency plus queuing latency]
http://crd.lbl.gov/~dhbailey/dhbpapers/little.pdf
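Little's Law can be applied numerically. A minimal Python sketch, using an illustrative memory system whose numbers (6.4 GB/s peak bandwidth, 64-byte bursts, 60 ns latency) are assumptions, not figures from the slides:

```python
# Little's Law: NT = RT * LT, with RT expressed as transactions per second.
def min_outstanding(peak_bw_bytes_per_s, burst_bytes, latency_s):
    """Minimum outstanding transactions (NT) to sustain peak bandwidth."""
    request_rate = peak_bw_bytes_per_s / burst_bytes  # RT, transactions/s
    return request_rate * latency_s                   # NT = RT * LT

# Hypothetical memory system: 6.4 GB/s peak, 64-byte bursts, 60 ns latency.
print(f"{min_outstanding(6.4e9, 64, 60e-9):.2f}")  # 6.00
```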
6
How Much Buffering is Needed?
NT = Bandwidth * Latency
Simplistically, NT = Latency / Time per transaction
If latency is 20 cycles and each burst takes 4 cycles of active data, then to maintain 100% active data cycles there must be at least 20 / 4 = 5 outstanding transactions
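The arithmetic above can be written as a tiny helper. A sketch (the function is hypothetical, not from the slides), using ceiling division so a partial burst still gets a slot:

```python
def min_outstanding_txns(latency_cycles, data_cycles_per_burst):
    """NT = latency / time per transaction, rounded up."""
    return -(-latency_cycles // data_cycles_per_burst)  # ceiling division

print(min_outstanding_txns(20, 4))  # 5, matching the worked example above
```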
[Charts from a PL340 SDR SDRAM controller study: total utilization and read-first latency vs average DMC queue depth, comparing theoretical, adjusted, and observed curves, with static latency and processing rate annotated]
7
CPU Latency Sensitivity: Browser
Memory latency baseline is 130ns
~50ns increments up to 330ns
Measured cached time
Cortex-A8 768:192:192MHz, 32KB L1, 256KB L2
33% performance loss from 130ns to 330ns latency
[Chart: normalized execution time vs effective memory latency (130ns to 330ns), showing memory latency sensitivity with varying L2 size: 0KB, 256KB, 512KB, 1024KB]
Averaged over three runs with different sleep values on SystemBench; 45–50B cycles/run
8
Reducing CPU Latency
Make CPU high priority
Put it in highest priority group
Cache memory reduces latency seen by the master (e.g. CPU)
Reduces memory bandwidth, which reduces latency to other masters and saves power
Diminishing returns from increasing cache size
Write data can be buffered
Coherency must be observed
The latency for write traffic seen by the system is significantly reduced
Read latency reduced by prioritizing reads
9
Dealing with Latency-Critical Masters
Real-time latency-critical masters like LCD controllers
Adding latency does not affect performance
Until latency limit is reached
Increase latency tolerance by inserting an additional buffering FIFO
Priority lower than CPU
Reduces the latency to CPU
If the transaction is still waiting after a time-out period
Promote to highest priority
Only higher priority than CPU if/when necessary
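The time-out promotion scheme above can be modelled in a few lines. A sketch (a hypothetical model, not an ARM interface; the priority encoding and time-out value are assumptions):

```python
TIMEOUT = 64  # assumed time-out, in cycles, before promotion

def effective_priority(base_priority, wait_cycles, latency_critical):
    """Lower value = higher priority; the CPU defaults to priority 0."""
    if latency_critical and wait_cycles > TIMEOUT:
        return -1  # promoted above the CPU, only when necessary
    return base_priority

# A latency-critical request within its time-out stays below the CPU...
assert effective_priority(1, 10, True) > effective_priority(0, 10, False)
# ...but jumps the queue once the time-out expires.
assert effective_priority(1, 100, True) < effective_priority(0, 10, False)
```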
[Diagram: priority order, highest to lowest: CPU; time-out promotion; latency-critical masters; GPU, mem-mem DMA etc]
10
Handling Batch Processing Masters (eg GPUs)
These devices can soak up almost unlimited bandwidth
Memory to memory DMA another example
Can swamp system with transactions
Can typically support multiple outstanding
transactions
An SDRAM controller with page-hit detection exacerbates the issue
Make these devices lowest priority
Option to increase priority to
ensure a certain minimum bandwidth is obtained
11
System-Level QoS Study
[Diagram: Video, GPU, HDLCD, and 2 x CPU + L2 connect through bus switches I1 and I2 of the AMBA Network Interconnect NIC-301 to the DMC]
12
What Bandwidth is Needed for 1080p?
Item                                   | Value
Display refresh (1920x1080 @ 60Hz)     | 497.6MB/s ≈ 500MB/s
GPU bandwidth (estimate)               | 1.5GB/s
Video decode (approx)                  | 500MB/s
Total (no video)                       | 2.0GB/s
Total (with video & GPU)               | 2.5GB/s
Excludes CPU and other DMA bandwidth
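The display-refresh figure can be reproduced directly. A quick check, assuming 32-bit (4 bytes per pixel) framebuffer reads, which is consistent with the number on the slide:

```python
# Display refresh bandwidth for 1080p at 60Hz, 4 bytes per pixel (assumed).
width, height, fps, bytes_per_pixel = 1920, 1080, 60, 4
bw_bytes_per_s = width * height * fps * bytes_per_pixel
print(f"{bw_bytes_per_s / 1e6:.3f} MB/s")  # 497.664 MB/s ≈ 500MB/s
```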
13
How Important Is Interconnect to QoS?
SDRAM QoS scheme relies on there being space for QoS masters in the SDRAM queue
High outstanding transactions & high latency cause queue to fill
Stalls interconnect
Time-out measures time in SDRAMC only
Real-time masters cannot jump the queue
QoS mechanism breaks down
Interconnect needs to 'regulate' outstanding transactions
[Diagram: masters connect through the interconnect to the memory controller (format stage and PHY) and then to memory]
14
Transaction Issue Rate Regulation: Little's Law
NT = RT × LT
Queue length = Arrival rate * Latency
Regulate arrival rate to control queue length & latency
Latency = Queue length / Arrival rate
Issue rate regulation sometimes known as TSPEC
From Traffic SPECification, used in networking QoS terminology
Approximates to bandwidth regulation (burst size)
Gives a 'hard' limit to the max bandwidth of a master
Like a speed limit on the master
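Issue-rate regulation behaves like a token bucket. A minimal sketch (a hypothetical model, not ARM's implementation; the rate and burst parameters are assumptions):

```python
class IssueRateRegulator:
    """Token-bucket model of TSPEC-style issue-rate regulation."""
    def __init__(self, rate_per_cycle, burst_size):
        self.rate = rate_per_cycle   # permitted transactions per cycle
        self.burst = burst_size      # max tokens that can accumulate
        self.tokens = burst_size

    def tick(self):
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_issue(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # master stalled: the 'hard' speed limit

# A master allowed 1 transaction per 4 cycles issues at most 25 of 100 attempts.
reg = IssueRateRegulator(rate_per_cycle=0.25, burst_size=1)
issued = sum(1 for _ in range(100) if (reg.tick() or True) and reg.try_issue())
print(issued)  # 25
```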
15
Outstanding Transaction Regulation (OT)
Latency = Queue length / Arrival rate
Reducing queue length (outstanding) reduces Latency
Regulate number of outstanding transactions to control
SDRAMC queue
Avoid over-regulation as that could affect SDRAM efficiency
Nicely adaptive – Regulated masters get additional bandwidth
when system is lightly loaded – no hard limit
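OT regulation can be sketched as a simple in-flight counter (a hypothetical model, not an ARM interface): the interconnect stalls new requests from a master once it has its limit of transactions in flight, bounding that master's contribution to queue length NT.

```python
class OTRegulator:
    """Caps a master's outstanding (in-flight) transactions."""
    def __init__(self, max_ot):
        self.max_ot = max_ot
        self.in_flight = 0

    def try_issue(self):
        if self.in_flight < self.max_ot:
            self.in_flight += 1
            return True
        return False  # stalled until a response returns

    def response_received(self):
        self.in_flight -= 1

# A limit of 3 outstanding reads, as used for the GPU in the study later on:
reg = OTRegulator(max_ot=3)
assert all(reg.try_issue() for _ in range(3))
assert not reg.try_issue()       # 4th request stalls
reg.response_received()
assert reg.try_issue()           # issuing resumes once a response returns
```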
[Chart: % of total cycles vs queue depth, comparing DMC queue fill with DMC outstanding transactions]
16
Latency Regulation
Controlling the third variable in Little's Law: latency (LT)
Cannot directly control LT
Dynamically adjust priority in a closed loop
Set the lowest priority that meets the latency requirement
Adaptive to lightly loaded systems: masters get more bandwidth when the system is lightly loaded
Requires co-operation from the slave (memory controller) to prioritize
Default low priority
Measure latency
Compare with target
Increase priority if latency > target, and vice versa
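The closed loop above can be sketched in a few lines (a hypothetical model, not ARM's controller; the priority range and measured latencies are assumptions):

```python
LOWEST, HIGHEST = 7, 0  # lower number = higher priority (assumed encoding)

def adjust_priority(priority, measured_latency, target_latency):
    """One step of the closed loop: raise priority when over target."""
    if measured_latency > target_latency:
        return max(HIGHEST, priority - 1)  # increase priority
    if measured_latency < target_latency:
        return min(LOWEST, priority + 1)   # relax toward default low priority
    return priority

# Priority rises while latency exceeds the 100-cycle target, then relaxes.
p = LOWEST
for latency in [120, 110, 90, 80]:
    p = adjust_priority(p, latency, 100)
print(p)  # 7
```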
17
QoS Validation with VPE
Used VPE (Verification and Performance Exploration)
VPE executes much faster than RTL
Reduced a 2M-cycle testbench simulation from 4 hours to 4 minutes
Statistically matches the pattern of traffic
18
Performance Without QoS-301
Bandwidth per master is calculated for:
GPU active phase
GPU idle phase
Aggregate (total) for frame
Results for unconstrained system
GPU was active for 55% of the frame, during which it achieved a bandwidth of 2734MB/s
1500MB/s overall
CPU achieved bandwidth of
32MB/s when GPU was active
163 MB/s when GPU was idle (5x)
91 MB/s overall
19
Issue Rate Regulation
Regulate GPU RT (transaction rate) to:
2.4 GB/s (vs 2.66 GB/s unregulated)
Results
GPU bandwidth
Active for 61% frame (+11%)
2441MB/s active (-11%)
1500MB/s overall (+0%)
Maximum NT = 2.65 (measured)
CPU achieved bandwidth of
119MB/s when GPU active (+279%)
163 MB/s when GPU idle (+0%)
136 MB/s overall (+50%)
Factor improvement over
unconstrained case
+50% CPU
bandwidth
20
Outstanding Transactions (OT) Regulation
Regulate GPU number of transactions at input to system:
3 outstanding read transactions
1 outstanding write transaction
Results
GPU bandwidth
Active for 56% of frame (+1%)
2664MB/s active (-3%)
1500MB/s overall (+0%)
CPU achieved bandwidth of
99MB/s when GPU active (+215%)
163 MB/s when GPU idle (+0%)
127 MB/s overall (+40%)
Suffered from lack of granularity in OT level
Factor improvement over
unconstrained case
CPU bandwidth increased by 40%
+40% CPU
bandwidth
21
Fractional Outstanding Regulation
Regulating maximum outstanding transactions is often preferable to regulating bandwidth
More adaptive to loading
Integer NT provided too coarse-grained control – Needed ~2.5 OT
Added average number of outstanding transactions to QoS-301
By varying duty cycle, e.g. NT = 2.4
Finer degree of control
Useful when many low-bandwidth masters
Each may only require NT <<1
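Duty-cycling an integer limit to hit a fractional average can be sketched as follows (a hypothetical illustration of the idea, not ARM's QoS-301 mechanism; the window length is an assumption):

```python
def duty_cycle_limits(target_ot, window):
    """Per-cycle integer OT limits whose average over the window is target_ot."""
    lo = int(target_ot)
    hi_cycles = round((target_ot - lo) * window)  # cycles at the higher limit
    return [lo + 1] * hi_cycles + [lo] * (window - hi_cycles)

# For an average NT = 2.4 over a 10-cycle window: limit 3 for 4 cycles, 2 for 6.
limits = duty_cycle_limits(2.4, 10)
print(limits, sum(limits) / len(limits))  # [3, 3, 3, 3, 2, 2, 2, 2, 2, 2] 2.4
```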
22
Latency Reduction with OT Regulation
Unconstrained system
Large number of queuing transactions (NT) from GPU
NT = 14 (read), 8 (write)
Little or no benefit to GPU: the DMC cannot supply more bandwidth in this example system
Queuing latency affects CPU bandwidth
NT = 0.74 (read), 0.21 (write)
CPU cannot issue more simultaneous requests
Regulated system
NT sufficient for GPU bandwidth
Queuing latency (LT) reduced
CPU gains BW
Fewer request buffers required
23
OT versus TSPEC Regulation
Outstanding Transaction Regulation (OT)
[Chart: bandwidth vs queuing latency under OT regulation: adaptive]
Issue Rate Regulation (TSPEC)
[Chart: bandwidth vs queuing latency under TSPEC regulation: bandwidth fixed by definition]
Consider what happens when the system bandwidth requirement reduces
System queuing latency reduces
Under OT regulation the regulated master adapts, gaining bandwidth as queuing latency falls
Under TSPEC the bandwidth is fixed by definition, so it doesn't degrade as system workload increases
24
How QoS-301 is Inserted into NIC-301
The QoS-301 hardware can be configured at any NIC-301 slave interface (ASIB) or internal interface block (IB) with AMBA® Designer
25
QoS Techniques and Their Applications
[Table: QoS techniques (Issue Rate Regulation, Latency Regulation via priority, Outstanding Transaction Regulation) against the capabilities Min Bandwidth, Max Bandwidth, Max Latency, and Adaptive?]
These techniques can be used in isolation or together in combination
26
Future technology challenges with QoS
Cortex™-A15 and ARM's next-generation CoreLink™ system IP and Mali™ graphics bring higher performance and new technology
AMBA 4 Phase 2 in 2011 brings coherency, barriers and virtualisation
ARM is developing roadmap interconnect products for
release in 2011
Network interconnect for efficient connectivity with packetization,
clock management and QoS extensions
High performance coherent interconnect
QoS is critical to system performance, bandwidth and latency
New technologies including virtual networks are in development
27
QoS for Cortex-A15 and Mali
Optimized non-blocking interconnect with
Cache coherency up to 8 Cortex-A15 cores
End to end QoS
Lowest latency for CPU
Highest bandwidth for GPU
New high-efficiency memory controller
1/2/4 channels DDR3 or LPDDR2 up to 1066MHz
System MMU for I/O virtualization
Complements Cortex-A15 virtualization extensions
ARM is building systems with processor, graphics, interconnect and memory to test QoS for real applications
[Diagram: two quad Cortex-A15 clusters and Mali 3D graphics (behind an MMU-400) on the AMBA 4 Cache Coherent Interconnect CCI-400; I/O devices attach via MMU-400; AXI Network Interconnects NIC-400 connect LCD, video, and other slaves; the DMC-400 Dynamic Memory Controller drives DDR3/LPDDR2 PHYs; GIC-400 interrupt controller]
28
Summary
Little's Law shows there are three ways to regulate latency with QoS
Outstanding transactions
Issue Rate
Latency – via dynamic priority
ARM CoreLink NIC-301 with Advanced Quality of Service QoS-301 supports all three singly or in combination
Simulation tuning enabled by fast turn-around of VPE simulations
Programmable for tuning and optimization in silicon
Latency regulation supported in conjunction with DMC-400
QoS is important part of the CoreLink system IP mission to maximize performance and power efficiency
LET 'ER ROLL!
29
Thank You
Please visit www.arm.com for ARM related technical details
For any queries contact < Salesinfo-IN@arm.com >