Reliability and Maintainability of IoT systems Prof. Tajana Simunic Rosing UCSD

Reliability and Maintainability of IoT systems · or s.t. Temp < Threshold Integrating Temperature: Proactive Approach Baseline Energy savings: Proposed. Reliability Degradation due


Page 1

Reliability and Maintainability of

IoT systems

Prof. Tajana Simunic Rosing

UCSD

Page 2

Internet Evolution

Starts with a network and evolves into everything that can be connected with a network

Source: Perera et al. 2014

Sensor Networks

Page 3

Sensor Networks (SNs)

Consist of a number of sensor nodes communicating in a wireless multi-hop fashion

Source: Perera et al. 2014

Page 4

Sensor Networks (SNs)

• SNs have been designed, developed, and used for specific application purposes

– Environmental monitoring, agriculture, medical care, event detection etc.

• SNs use middleware for:

– Abstraction support, data fusion, resource constraints, dynamic topology, adaptability, scalability, security, and QoS support, etc.

Page 5

Large Scale High Performance Wireless Research and Educational Network - HPWREN

[Figure: HPWREN coverage map with site labels. Callouts: motion-detect cameras; environmental sensors & cams; acoustic sensors (wolf howls at the CA Wolf Center); wildfire tracking cams (Cal Fire Dept.). Map annotations: approximately 50 miles; to CI and PEMEX; 70+ miles to SCI.]

HPWREN wireless connectivity covers a 20k sq. mile area with numerous sensors

HW Braun, F. Vernon, T. S. Rosing, UCSD

Page 6

HPWREN: 16 Years of Experience

• Began in 1999 due to the need to provide reliable connectivity for seismic sensors; started by Frank Vernon, SIO & Hans W. Braun, SDSC, UCSD.

– A solar cell charges a battery; seismic sensors and computation are housed underground; large-scale antennas

• It grew to 20,000 square miles, from San Clemente Island in the Pacific Ocean (72 miles of wireless link), to the inland valleys and onto the high mountains, reaching an altitude of more than 8,700 feet, across the desert, reaching sites close to the Arizona border

• Original funding was from NSF (2000, 2004, 2009 grants), but the public safety sector and other agencies have since become involved:

– San Diego Gas & Electric

– NASA, US Navy

– California Department of Forestry and Fire Protection, San Diego City Fire and Life Safety Departments, Ramona Fire Department, Gillespie Helitack Base

– San Diego County Sheriff's Department, San Diego City Police Department

– Mt Laguna & Palomar observatories

– Three Native American reservations

– US National Park Service, NOAA, Santa Margarita Reserve, California Wolf Center, SIO, UCSB Earth Research Institute, Seismic Warning Systems, UNAVCO

Page 7

Mount Laguna Sensors

• temperature

• relative humidity

• fuel moisture

• fuel temperature

• data logger

• barometric pressure

• pan-tilt-zoom camera

• support equipment

• 3D ultrasonic anemometer

• solar radiation

• tipping rain bucket

• anemometer

Page 8

Wind Gusts on Mt. Laguna

Page 9

Interactive Spectrum Analysis of Real-Time Seismic Data (magnitude 1.6, near Banning, Riverside County, CA)

[Figure: four panels labeled ~59 miles, ~73 miles, ~34 miles, ~75 miles]

Page 10

Santa Margarita Ecological Reserve

75 cluster heads connected via WLAN

Page 11

U.S. Navy Deep Submergence Unit

http://www.csp.navy.mil/csda5/dsu/dsu.htm

Page 12

CitiSense Deployment: Wearables and Environmental Sensors

• Sensors and phones given to commuters using various transportation means throughout the greater San Diego area

• Results found “urban valleys” where buildings trapped pollution

• Major routes reported contrastingly high/low AQI for the same location

• Ease of deployment and in-network adaptation is key: learn from context!

Page 13

Gillespie Helitack Base:

Monitoring Wild Fires by Helicopters

Page 14

Eagle Fire HPWREN connection, May 2004

[Photo captions and diagram labels: HPWREN access at SDSU/SMER; relay; Incident Command Post; local SAM connection at the ICP; preparation for relay battery transport at the ICP site; Eagle Fire as seen from HWY79 on May 3rd, 2004]

Page 15

Volcan Fire HPWREN connection, September 2005

[Photo captions: Incident Command Post site; Volcan relay site]

Page 16

Summary of Learnings from Firefighting Over the Last Decade

• Rapid response is key:

– It started with enabling connectivity for Incident Command Post deployed for large wildfires, and connectivity to fire stations, camps and air bases

• Successfully done for fires in 2003, 2004, 2005, 2006 & 2007

– Real-time event detection based on weather sensors sent automated pager and email alarms to public safety personnel, alerting them in real time of Santa Ana conditions

• During the 2007 fire, HPWREN provided data on current conditions to both firefighters and the public, enabling much more accurate and scalable troop deployment

• Key problem: we need scalable, adaptable, manageable infrastructure

– The question is not if failures will happen, but how proactive the infrastructure will be at avoiding them and solving them -> big issue for the IoT!

Page 17

What is IoT?

The Internet of Things allows people and things to be connected Anytime, Anyplace, with Anything and Anyone, ideally using Any path/network and Any service.

Source: Perera et al. 2014

Page 18

Characteristics of IoT

1. Intelligence – Knowledge extraction from the generated data

2. Heterogeneous Architecture – Supports many different types of devices and connectivity

3. Complexity – A diverse set of dynamically changing objects

4. Huge Size – Scalability

5. Timing – Billions of parallel and simultaneous events

6. Space – Localization

7. Everything-as-a-service – Consuming resources as a service

Page 19


Source: Beecham Research

Page 20

Top Industries in IoT

Page 21

IoT Architecture

HW: Sensing And Communication

SW: Middleware and Applications

Source: ZTE

Page 22

IoT HW & SW

HW components used

SW features used

Page 23

The changing nature of computing

50 billion connected devices

• Impossible to manage using centralized control -> distributed resource management and control are key

Cisco 2011

Page 24

Hidden Costs of IoT

• Initial costs:

– Design, HW sourcing and manufacturing, implementation

• Operational Expenses (OpEx) are recurring costs:

Source: Jasper, Cisco

Page 25

Average Annual Administrative Costs for 100,000 Devices

• Data analysis of more than 3,500 companies worldwide

• Assumptions:

– Average salary for professional operations administrator in the USA is $48/hr

– Each interaction requires 5 minutes of labor

– Number of interactions varies by industry:
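
As a rough illustration of how these assumptions turn into an annual figure, the sketch below multiplies the labor cost of one interaction ($48/hr for 5 minutes, i.e. $4) by a per-device interaction count. The per-industry interaction counts from the slide are not reproduced in this text, so `interactions_per_device_per_year` is a hypothetical input, and the function name is illustrative only.

```python
def annual_admin_cost(devices, interactions_per_device_per_year,
                      hourly_rate=48.0, minutes_per_interaction=5.0):
    """Annual administrative labor cost in dollars.

    hourly_rate and minutes_per_interaction default to the slide's assumptions;
    the interaction count is a placeholder the caller must supply.
    """
    cost_per_interaction = hourly_rate * minutes_per_interaction / 60.0
    return devices * interactions_per_device_per_year * cost_per_interaction


if __name__ == "__main__":
    # Hypothetical example: 2 interactions per device per year (not from the slide).
    print(annual_admin_cost(100_000, 2))
```

Under these assumptions every additional interaction per device per year adds $400,000 for a 100,000-device fleet, which is why the interaction count dominates the estimate.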

Page 26

Average Annual Technical Support Costs for 100,000 Devices

• Data analysis of more than 3,500 companies worldwide

• Assumptions:

– Average tier 1 engineer salary is $20/hr, and call time is 25 min

– Average tier 2 engineer salary is $40/hr, and the mean time to resolution is 4 hrs

– 20% of support has to be escalated from Tier 1 to Tier 2

– 10-30% of devices require technical support
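
The support-cost assumptions can be combined into a back-of-the-envelope estimate. The sketch below is one possible reading, not the method behind the slide's chart: it assumes each device that needs support generates exactly one case per year, every case starts with a tier 1 call, and 20% of cases additionally incur a full tier 2 resolution. The function name and these simplifications are assumptions.

```python
def annual_support_cost(devices, support_fraction,
                        tier1_rate=20.0, tier1_minutes=25.0,
                        tier2_rate=40.0, tier2_hours=4.0,
                        escalation_fraction=0.20):
    """Annual technical support labor cost in dollars.

    Assumes one support case per affected device per year (an assumption,
    not stated on the slide).
    """
    tier1_cost = tier1_rate * tier1_minutes / 60.0   # cost of one tier 1 call
    tier2_cost = tier2_rate * tier2_hours            # cost of one tier 2 resolution
    cost_per_case = tier1_cost + escalation_fraction * tier2_cost
    return devices * support_fraction * cost_per_case


if __name__ == "__main__":
    for fraction in (0.10, 0.30):
        print(fraction, round(annual_support_cost(100_000, fraction), 2))
```

Under this reading the escalated 20% of cases account for most of the expected cost per case, so reducing escalations matters more than shortening tier 1 calls.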

Page 27

Sources of failure

• User errors

• Installation errors

• Communication problems

• Power issues (battery lifetime)

• Hardware failures

– Soft errors: transient

– Hard errors: permanent

Page 28

Reliability, MTTF & Maintainability

• Reliability $R(t)$:

– Probability that a system will operate correctly in $[0, t)$

• Mean time to failure (MTTF): $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$

• Maintainability: a measure of the ability to restore a device to a specified condition when maintenance is performed

– If we maintain systems at an interval $T$, with the total number of maintenance events $N$, the reliability of the maintained system during the time $NT \le t < (N+1)T$ is:

$R_M(t) = R(T)^N \, R(t - NT)$
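
To make the maintained-reliability expression R_M(t) = R(T)^N R(t − NT) and the MTTF integral concrete, here is a minimal sketch. The Weibull lifetime model, its parameters, and the function names are assumptions chosen for illustration (a shape parameter above 1 models wear-out, which is the case where periodic maintenance helps); the formula itself implies that each maintenance event restores the device to as-good-as-new.

```python
import math


def weibull_reliability(t, scale=10.0, shape=2.0):
    """R(t) for a Weibull lifetime; parameters are illustrative assumptions."""
    return math.exp(-((t / scale) ** shape))


def maintained_reliability(t, interval, reliability=weibull_reliability):
    """R_M(t) = R(T)^N * R(t - N*T), where N maintenance events are completed by time t."""
    n = int(t // interval)
    return reliability(interval) ** n * reliability(t - n * interval)


def mttf(reliability, horizon=200.0, steps=20_000):
    """Approximate MTTF as the integral of R(t) over [0, horizon] (trapezoidal rule)."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    total += sum(reliability(i * dt) for i in range(1, steps))
    return total * dt


if __name__ == "__main__":
    print("unmaintained R(5):", weibull_reliability(5.0))
    print("maintained  R_M(5), T=2:", maintained_reliability(5.0, 2.0))
    print("MTTF:", mttf(weibull_reliability))
```

With a constant failure rate (shape = 1) the two reliabilities coincide, because an exponential lifetime is memoryless and maintenance cannot improve it.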

Page 29

Permanent Failures

• Reliability during useful lifetime:

– Modeled with exponential distribution

– Constant failure rate ($\lambda_f$): $R(t) = e^{-\lambda_f t}$
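
Under the constant-failure-rate model $R(t) = e^{-\lambda_f t}$, the MTTF integral has a closed form. The step below is standard calculus written with the slide's symbol $\lambda_f$; the numeric example that follows uses an illustrative value, not one from the slides.

```latex
\mathrm{MTTF} = \int_0^\infty R(t)\,dt
             = \int_0^\infty e^{-\lambda_f t}\,dt
             = \left[ -\frac{1}{\lambda_f} e^{-\lambda_f t} \right]_0^\infty
             = \frac{1}{\lambda_f}
```

For instance, a failure rate of 0.1 failures per year corresponds to an MTTF of 10 years.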

Page 30

Power, Thermals, Reliability & Variability

Power:

- Multiple power-hungry subsystems

- Intense application requirements

- Limited battery capacity

Temperature:

- Power dissipation increases temperature

- Big issue with no active cooling (fans)

Reliability:

- Utilization and temperature stress reduce device lifetime

- Degradation mechanisms: R(t) over time depends on temperature and voltage

Variability:

- Technology is rapidly scaling

- Accuracy of the fabrication process is limited

Inputs: workload, user, ambient temperature

Performance-related metrics (Perf/Pow/DR):

- Performance

- Power

- Degradation rate

Interrelated problems!

Page 31

Energy Optimization

[Figure: energy consumed vs. time. Labels: target battery lifetime; final battery energy; (algorithm activation time, available energy); maximum performance; minimum performance; available range for selected target; normal operation; battery lifetime extension]

Max(Perf) s.t. Target Battery Lifetime
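
One simple way to read "Max(Perf) s.t. Target Battery Lifetime" is as a budget check: from the moment the algorithm activates, the remaining energy divided by the time left until the target gives an average power budget, and the fastest operating point that fits the budget is chosen. The sketch below illustrates only that idea; the level tuples, numbers, and function name are hypothetical and are not taken from the slides.

```python
def pick_performance_level(available_energy_j, time_to_target_s, levels):
    """Return the highest-performance level whose power fits the energy budget.

    levels: list of (name, performance, power_watts) tuples (hypothetical values).
    Falls back to the lowest-power level if nothing fits the budget.
    """
    budget_w = available_energy_j / time_to_target_s
    feasible = [level for level in levels if level[2] <= budget_w]
    if not feasible:
        return min(levels, key=lambda level: level[2])
    return max(feasible, key=lambda level: level[1])


if __name__ == "__main__":
    levels = [("min", 1.0, 0.5), ("mid", 2.0, 1.0), ("max", 3.0, 2.0)]
    # 3600 J left and 3600 s to the target lifetime gives a 1 W budget.
    print(pick_performance_level(3600.0, 3600.0, levels))
```

Re-running the selection periodically lets the budget track the actual energy consumed so far.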

Page 32

Integrating Temperature: Proactive Approach

Max(Battery Lifetime) s.t. App Perf

Max(Perf) s.t. Target Battery Lifetime, or s.t. Temp < Threshold

Compact Thermal Model:

$T_{k+1} = A T_k + B F_{k+1}$

Example: $F = 0 \Rightarrow T_{k+1} = A T_k$

[Figure: model-based prediction, AnTuTu benchmark. Mean error = 1.1°C]
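
The compact thermal model T_{k+1} = A T_k + B F_{k+1} lends itself to a proactive check: predict the next temperature for each candidate setting and keep only those that stay below the threshold. The sketch below is a scalar simplification with hypothetical coefficients; it treats F as the applied frequency and T as a single temperature, both of which are assumptions (the model on the slide may well be a matrix form over several thermal nodes).

```python
def predict_temperature(temp, freq, a, b):
    """One step of T[k+1] = A*T[k] + B*F[k+1], scalar form."""
    return a * temp + b * freq


def highest_safe_frequency(temp, frequencies, a, b, threshold):
    """Highest frequency whose predicted next temperature stays below the threshold.

    Falls back to the lowest frequency if no candidate is predicted to be safe.
    """
    safe = [f for f in frequencies
            if predict_temperature(temp, f, a, b) < threshold]
    return max(safe) if safe else min(frequencies)


if __name__ == "__main__":
    A, B = 0.9, 0.005               # hypothetical model coefficients
    freqs = [500, 1000, 1500, 2260]  # MHz
    print(highest_safe_frequency(30.0, freqs, A, B, threshold=35.0))
```

With F = 0 the prediction reduces to T_{k+1} = A T_k, matching the example on the slide.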

Page 33

Integrating Temperature: Proactive Approach

Max(Battery Lifetime) s.t. App Perf

Max(Perf) s.t. Target Battery Lifetime, or s.t. Temp < Threshold

- Nexus 5 smartphone

- Profile of popular applications

- Comparison between off-the-shelf baseline and proposed framework

Application   CPU Freq (MHz)   GPU Freq (MHz)   Display Brightness
Chrome        1000             200              150
Camera        1500             350              150
Gmail         800              200              70
Hangouts      900              200              100
Dialer        1300             200              55
Youtube       1500             450              150
Angrybirds    2260             450              255
Default       500              200              100

[Figure: power (mW, axis 0 to 3500) per application, baseline vs. proposed. Energy savings labels as extracted: 32%, 20%, 30%, 46%, 39%, 31%, -0.9%, 35%]
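
A per-application profile like the one in the table can be represented as a simple lookup with a fallback row. The sketch below copies the table values; the dictionary layout, the function name, and the choice to fall back to the "Default" row for unlisted applications are assumptions about how such a profile might be used, not a description of the proposed framework.

```python
# (CPU frequency in MHz, GPU frequency in MHz, display brightness), from the table.
APP_PROFILES = {
    "Chrome": (1000, 200, 150),
    "Camera": (1500, 350, 150),
    "Gmail": (800, 200, 70),
    "Hangouts": (900, 200, 100),
    "Dialer": (1300, 200, 55),
    "Youtube": (1500, 450, 150),
    "Angrybirds": (2260, 450, 255),
    "Default": (500, 200, 100),
}


def settings_for(app_name):
    """Return the profiled settings for an app, or the Default row if it is not profiled."""
    return APP_PROFILES.get(app_name, APP_PROFILES["Default"])


if __name__ == "__main__":
    print(settings_for("Gmail"))
    print(settings_for("SomeUnprofiledApp"))
```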

Page 34

Reliability Degradation due to Hot Spots

• Thermal hot spots [Failure Mechanisms for Semiconductor Devices, JEDEC]

– Electromigration:

• Transport of material caused by gradual movement of ions in the interconnect lattice

• Eventually leads to opens and shorts

– Stress migration:

• Mechanical stress causes metal atoms in the interconnect to migrate

• Similar failures as electromigration

– Time dependent dielectric breakdown:

• Result of the gate dielectric's gradual wear-out

• Leads to transistor failure

$\lambda \propto e^{-E_a / (kT)}$

λ: failure rate; T: temperature; Ea: activation energy; k: Boltzmann's constant

• Increasing temperature increases λ

• A 10–15 °C increase in temperature causes a ~2X increase in failure rate
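
The "~2X per 10–15 °C" rule of thumb can be checked numerically with the Arrhenius form λ ∝ exp(−Ea/(kT)). The sketch below computes the ratio of failure rates at two temperatures. The activation energy of 0.7 eV and the 60 °C starting point are illustrative assumptions, not values from the slide; actual activation energies depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann's constant in eV/K


def arrhenius_acceleration(temp_low_c, temp_high_c, activation_energy_ev):
    """Ratio lambda(T_high) / lambda(T_low) for lambda proportional to exp(-Ea / (k*T))."""
    t_low = temp_low_c + 273.15    # convert to kelvin
    t_high = temp_high_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_low - 1.0 / t_high)
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed Ea = 0.7 eV, going from 60 C to 70 C.
    print(arrhenius_acceleration(60.0, 70.0, 0.7))
```

With these assumed values the ratio comes out close to 2, consistent with the rule of thumb.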

Page 35

Reliability Degradation due to Thermal Cycles

• Thermal cycling [JEDEC]

– Different expansion coefficients

– Fatigue failures

– Failure rate increases with $|\Delta T|^q f$:

• ∆T: magnitude of variation

• f: frequency of cycles

– ΔT increases by 10°C (q = 4 for metallic structures)

• Failures happen 16 times sooner

[Figure: MTTF (years) vs. power savings (%), with curves for EM, TDDB, TC, and System. Source: Rosing et al., TVLSI'07]

[Figure: temperature T (°C) vs. time (sec), illustrating ΔT]
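
The thermal-cycling dependence on |ΔT|^q and f can be turned into a relative failure-rate factor. The sketch below assumes the failure rate is directly proportional to |ΔT|^q · f; with q = 4, a factor of 16 corresponds to ΔT doubling, so the 10 °C to 20 °C example is one reading of the slide's "ΔT increases by 10 °C", stated here as an assumption. The function name is illustrative.

```python
def cycling_failure_rate_ratio(delta_t_new, delta_t_old, q=4.0,
                               freq_new=1.0, freq_old=1.0):
    """Relative failure rate, assuming rate proportional to |dT|^q * f."""
    magnitude_factor = (abs(delta_t_new) / abs(delta_t_old)) ** q
    frequency_factor = freq_new / freq_old
    return magnitude_factor * frequency_factor


if __name__ == "__main__":
    # Assumed baseline: cycles of 10 C growing to 20 C at the same frequency.
    print(cycling_failure_rate_ratio(20.0, 10.0))
```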

Page 36

Reliability Management

Dynamic Reliability Management: $R(t_{LIFE}) \ge R_{TARGET}$

- Reliability: “measure” of device integrity

- Sensors (available only on prototypes)

- Mathematical models (Online Emulation)

- Depends on temperature and voltage stress: $R(T, V, t)$

H application: requires max voltage/frequency

L application: enables reliability borrowing

Borrowing strategy: $V_{average} \le V_{LTC}$, $T_{average} \le T_{LTC}$

[Figure: reliability (1 to 0) vs. time, with the target reliability ($R_{TARGET}$) and target lifetime ($t_{LIFE}$) marked]

[Figure: performance vs. time over a long interval, showing the current performance level and the average target]

[Figure: control loop. Labels: Long Term Controller, Short Term Controller, Hardware Platform, Emulation, WL Ctrl, R(t), Avg target, Request, Decision]
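
The borrowing strategy constrains averages rather than instantaneous values: an H application may run at maximum voltage/frequency as long as the average voltage and temperature over the long interval stay at or below V_LTC and T_LTC. The sketch below checks that condition for one more interval; the sample lists, numeric values, and function name are hypothetical, and a plain arithmetic mean is an assumption about how the averages are formed.

```python
def allow_high_performance(v_samples, t_samples, v_high, t_high, v_ltc, t_ltc):
    """True if one more interval at (v_high, t_high) keeps both averages within targets.

    v_samples, t_samples: voltage and temperature observed so far in the long interval.
    """
    n = len(v_samples) + 1
    v_avg = (sum(v_samples) + v_high) / n
    t_avg = (sum(t_samples) + t_high) / n
    return v_avg <= v_ltc and t_avg <= t_ltc


if __name__ == "__main__":
    # Three earlier low-stress (L) intervals leave room to "borrow" for one H interval.
    print(allow_high_performance([0.9, 0.9, 0.9], [50, 50, 50],
                                 v_high=1.2, t_high=70, v_ltc=1.0, t_ltc=60))
```

Time spent in L applications lowers the running averages, which is what creates headroom for later H intervals.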

Page 37

Reliability Emulation (“RelDroid”)

[Figure: RelDroid architecture]

Layers: Hardware; O.S./Run-time; User-space

Blocks: MPSoC; Temperature Sensor; Voltage Sensor; Voltage/Frequency Selection; Sensor Driver; Reliability Emulator Module and Driver (Reliability Model, Reliability Log); Reliability Management; Power Management Driver; APP1, APP2, ..., APPN

Signals: T(t), V(t), T_AVG(t), V_AVG(t), R(t), Freq(t+1), perf

Update rates: slow rate, fast rate

RelDroid is used for prototyping on real devices

Page 38

Optimal Reliability Controller

Optimal Long Term Controller (long intervals): computes the average target values for voltage and temperature, $V_{LTC}$ and $T_{LTC}$

Optimal Short Term Controller (short intervals): determines the frequency to apply, $F$, and the task allocation, $X_{alloc}$

[Figure: Reliability Management block diagram. Labels: Optimal Long Term Controller; $V_{LTC}$, $T_{LTC}$; Optimal Short Term Controller; applied frequency $F$; allocation $X_{alloc}$; Processor with core 1 ... core N]

Solve the optimal problem using convex optimization

Large computation overhead for OSTC

Page 39

Workload Aware Reliability Manager

[Figure: WARM block diagram. Labels: OLTC; Local Reliability Manager; Thermal Controller; STC; task 1 ... task M; task allocation; Processor with core 1 ... core N; long intervals, medium intervals, short intervals]

$V_{LTC}$: reference voltage

$T_{LTC}$: reference temperature

$f_{APP}/V_{APP}$: applied frequency/voltage

Thermal Controller:

- Medium Intervals (MI, s)

- Assigns H tasks to cooler cores first ($X_{alloc}$)

Short Term Controller:

- Short Interval (SI, ms)

- Borrowing strategy: $V_{average} \le V_{LTC}$, $T_{average} \le T_{LTC}$
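
The Thermal Controller's rule, "assign H tasks to cooler cores first", can be sketched as a sort-and-pair step. The code below is an illustrative heuristic under stated assumptions, not the WARM implementation: it places at most one task per core, represents priority as the strings "H" and "L", and uses hypothetical names and temperatures.

```python
def allocate_tasks(core_temps, tasks):
    """Assign H tasks to the coolest cores first, then L tasks to the remaining cores.

    core_temps: dict mapping core id -> current temperature.
    tasks: list of (task_name, priority) with priority "H" or "L".
    Returns a dict mapping task_name -> core id. Tasks beyond the number of
    cores are left unassigned (one task per core is an assumption).
    """
    cores_coolest_first = sorted(core_temps, key=core_temps.get)
    # Stable sort: H tasks (key False) come before L tasks (key True).
    tasks_h_first = sorted(tasks, key=lambda task: task[1] != "H")
    return {name: core
            for (name, _), core in zip(tasks_h_first, cores_coolest_first)}


if __name__ == "__main__":
    temps = {0: 70.0, 1: 55.0, 2: 60.0, 3: 65.0}
    tasks = [("bg", "L"), ("game", "H"), ("sync", "L"), ("video", "H")]
    print(allocate_tasks(temps, tasks))
```

Re-running the allocation at each medium interval lets the mapping follow the cores as they heat up and cool down.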

Page 40

Implementation

Layers: USER; APPLICATIONS; USERSPACE; KERNEL; HARDWARE

Legend: WARM; RelDroid; App Monitor

Blocks: APP MONITOR; APP MONITOR CONFIG; APP MONITOR DRIVER; WARM GOVERNOR (OLTC, TC, STC); WARM DRIVER; SCHED; RELIABILITY MODULE; RELIABILITY MODEL; RELIABILITY DRIVER

Signals:

- Favorite apps

- Currently executing apps

- Foreground app PID

- Current PID

- Reference voltage and temperature

- Average voltage and temperature

- Process allocation

- Voltage and frequency settings

- Voltage and temperature sensors

Update rates: slow rate, medium rate, fast rate

Page 41


Implementation overhead

Page 42

Testing and Validation

Odroid XU3 development board:

- ARM big.LITTLE architecture

- OS is Android 4.4.4 with Linux kernel 3.10.9

- The device has 4 integrated voltage/current/power monitoring sensors (LITTLE cluster, big cluster, GPU and memory)

Virtual platform:

- 4 cores

- Configured based on measurements from the Cortex A15 cores of the ARM big.LITTLE

- CVX for solving optimization

Power model: $P(f) = a f^3 + b f$

Temperature model (state space representation): $T_{next} = \mathrm{thermal\_model}(T, P)$

[Figure: virtual platform. Labels: control policy; frequency adaptation ($f_0$, $f_1$, $f_2$, $f_3$); virtual cores, each with $P(f)$; temperature model]
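
The virtual platform chains two models: a per-core power model P(f) = a f^3 + b f and a temperature model T_next = thermal_model(T, P). The sketch below strings them together for four virtual cores. The coefficients `a` and `b`, and the first-order per-core `thermal_model` with its `decay` and `gain` parameters, are placeholders chosen for illustration; the platform on the slide uses a state-space temperature model fitted to Cortex A15 measurements, which is not reproduced here.

```python
def core_power(freq, a, b):
    """Power model from the slide: P(f) = a*f^3 + b*f."""
    return a * freq ** 3 + b * freq


def thermal_model(temps, powers, decay=0.9, gain=2.0):
    """Placeholder per-core first-order model (an assumption, not the slide's model)."""
    return [decay * t + gain * p for t, p in zip(temps, powers)]


def simulate_step(temps, freqs, a, b):
    """One virtual-platform step: frequencies -> per-core power -> next temperatures."""
    powers = [core_power(f, a, b) for f in freqs]
    return thermal_model(temps, powers)


if __name__ == "__main__":
    temps = [10.0, 10.0, 10.0, 10.0]   # hypothetical temperatures above ambient
    freqs = [1.0, 1.0, 2.0, 2.0]       # hypothetical frequencies f0..f3 in GHz
    print(simulate_step(temps, freqs, a=0.1, b=0.2))
```

Because of the cubic term, doubling the frequency in this example quadruples the power (from about 0.3 to 1.2 in the assumed units), which is why frequency adaptation is such an effective control knob.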

Page 43

Comparison Results: Optimal Policy vs. Heuristic (simulation)

Both optimal and WARM are effective in meeting the average constraints

WARM runs more than 400x faster

1 jiffy = 5ms (sched. tick)

[Figure label: Long interval]

Page 44

Measurement Comparison with Standard Governors

Without background activity: performance comparable to max frequency for high quality (H) applications (within 4%)

With background activity: it improves performance while meeting the target lifetime

100% lifetime improvement

H: critical tasks, require max performance

(1 LI = 30 days, emulated time)

[Figure: AnTuTu Benchmark (H), normalized score, for the governors max frequency, proposed, ondemand, interactive, and min frequency]

Page 45

Impact of Variability

STATIC VARIATIONS:

- TRANSISTOR: channel length, threshold voltage

- PROCESSOR: operating frequency (performance), power consumption, degradation rate (DR)

DYNAMIC VARIATIONS:

- Temperature, voltage

- Inputs: workload, user, ambient temperature

- Affected metrics: performance / power / degradation rate

DYNAMIC VARIABILITY MANAGEMENT (DVM)

[Figure: # occurrences, without DVM and with DVM, relative to the product requirement]

Page 46

Comprehensive Management Framework

A comprehensive management framework for power, temperature, reliability, & variability management in individual devices

LTC: Long Term Controller

STC: Short Term Controller

[Figure: framework block diagram. Labels: User Experience Requirements; App perf; Battery; Unexploited margins; Average target; LTC; STC; Control decisions; Target system; System feedback; Online Reliability Emulation; Online Variability Emulation; Power, Temperature, Reliability, Variability]

Page 47

Networks of IoT Devices

Internet of Things:

- Energy and reliability are important issues -> maintenance costs

- Challenges: heterogeneity, distributed nature, large scale

Research directions:

- Framework for energy / reliability management of the IoT infrastructure

- Control of multiple heterogeneous control variables

- Automatic recognition of user experience requirements

[Figure: two-layer control of an IoT network. Labels: Upper Layer; Lower Layer; Control engine; Reliability (R), Power (P), QoS; Utilization log stream; Control decisions stream; Utilization metrics; Control utilizations; Average utilization constraint; IoT Network]

Page 48

Reliability of Networks of Devices

• Series combination: all components must operate for the system to function correctly: $R = R_A \times R_B \times R_C$

• Parallel combination: the system fails only if all components fail: $F_S(t) = F_A(t)\,F_B(t)\,F_C(t)$, where $F(t) = 1 - R(t)$

• Most IoT systems are mixed

– Soft redundancy: imperfect replacements

[Figure: series chain Input, A, B, C, Output; parallel arrangement of A, B, C]
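
The series and parallel rules compose directly, which is how a mixed system can be evaluated block by block. The sketch below implements both rules, assuming independent component failures; the component reliabilities of 0.9 and the mixed topology in the usage example are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work: R = R_A * R_B * R_C * ..."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """System fails only if every component fails: R = 1 - F_A * F_B * F_C * ..."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    r = [0.9, 0.9, 0.9]  # hypothetical component reliabilities
    print("series:  ", series_reliability(r))
    print("parallel:", parallel_reliability(r))
    # Mixed: component A in series with a redundant pair (B, C).
    print("mixed:   ", series_reliability([0.9, parallel_reliability([0.9, 0.9])]))
```

Soft redundancy, where a replacement is imperfect, would lower the reliability assigned to the backup component rather than change the composition rules.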

Page 49

Research Challenge

Designing an IoT infrastructure that meets performance requirements while maximizing reliability and minimizing maintenance costs