41
COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors Executing Multi-threaded Workloads Masab Ahmad, Syed Kamran Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan University of ConnecAcut

Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Exploring the Performance Implications of Memory Safety

Primitives in Many-core Processors Executing Multi-threaded Workloads

Masab  Ahmad,  Syed  Kamran  Haider,  Farrukh  Hijaz,    Marten  van  Dijk,  Omer  Khan  

 University  of  ConnecAcut  

Page 2: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Agenda MoAvaAon    CharacterizaAon  Methodology  

CharacterizaAon  Results    Insights  and  Possible  Improvements  

Page 3: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Agenda Mo#va#on    CharacterizaAon  Methodology  

CharacterizaAon  Results    Insights  and  Possible  Improvements  

Page 4: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Security Vulnerabilities in Processors

Vulnerable Applications Buffer Overflow Attacks Code Injection

OS

CPU

DRAM I/O Disk

Applications Compromised OS Privacy Leakage

Malicious I/Os DoS Attacks

Other Devices Data Tampering

Side Channels

Page 5: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Security Vulnerabilities in Processors

Vulnerable Applications Buffer Overflow Attacks Code Injection

OS

CPU

DRAM I/O Disk

Applications Compromised OS Privacy Leakage

Malicious I/Os DoS Attacks

Other Devices Data Tampering

Side Channels

Performance  

Security  

Page 6: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Key Questions • What  are  the  performance  implicaAons  of  these  security  schemes?  

•  How  do  they  affect  performance  in  the  context  of  mulAcores?  

•  IEEE  Computer  Vision  2022  predicts  that  MulAcores  as  a  key  enabling  technology  by  2022  

•  To  get  started,  we  look  at  Buffer  Overflow  Protec/on  Schemes  

•  How  do  they  affect  tradeoffs  in  MulAcores?  

Page 7: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

A Primer on Buffer Overflow Attacks •  The  stack  return  address  is  located  aRer  the  program  contents  

•  An  aSacker  enters  a  modified  string,  overwrites  the  return  address  

•  The  return  address  now  points  to  the  malicious  code  

a+2  

a+2  

Malicious  Code  

a+1  

a  

a+n  

a+2  

……  

a+3  

a+2  

a+2  

Return  Address  

Other  Variables  

Buffer  

a+1  

a  

a+n  

a+2  

……  

a+3  

Overflow  ASack    

overwrites  Return  Address  

A)acker  Comes  In  

A)acker  takes  Control  

Page 8: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Security Solutions and Prior Works •  Dynamic  InformaAon  Flow  Tracking  (DIFT)  :  

•  Flags  and  Tags  all  data  from  I/O  •  Raises  excepAons  of  tagged  data  propagates  into  program  control  flow  •  High  False  posiAve  rates,  Zero  false  negaAves  (Hence  not  used  in  applicaAons)    

•  Context  based  Checking  :    •  Creates  a  “context”,  an  idenAfier  for  each  variable/data  structure  in  the  program  

•  Checks  on  each  variable  read/write  •  Large  data  structure  required,  1  element  for  each  array  element.  Pointer  aliasing  problems  (Hence  also  not  used  in  applicaAons)  

•  Bounds  Checking  :    •  Creates  a  base  and  bounds  Metadata  data  structure  for  a  program  •  Reads  and  writes  are  checked  and  updated  as  meta  data  •  Close  to  Zero  False  posiAves  and  negaAves  (Used  a  lot  in  Safe  languages  etc)  

Page 9: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Security Solutions and Prior Works •  Dynamic  InformaAon  Flow  Tracking  (DIFT)  :  

•  Flags  and  Tags  all  data  from  I/O  •  Raises  excepAons  of  tagged  data  propagates  into  program  control  flow  •  High  False  posiAve  rates,  Zero  false  negaAves  (Hence  not  used  in  applicaAons)    

•  Context  based  Checking  :    •  Creates  a  “context”,  an  idenAfier  for  each  variable/data  structure  in  the  program  

•  Checks  on  each  variable  read/write  •  Large  data  structure  required,  1  element  for  each  array  element.  Pointer  aliasing  problems  (Hence  also  not  used  in  applicaAons)  

•  Bounds  Checking  :    •  Creates  a  base  and  bounds  Metadata  data  structure  for  a  program  •  Reads  and  writes  are  checked  and  updated  as  meta  data  •  Close  to  Zero  False  posiAves  and  negaAves  (Used  a  lot  in  Safe  languages  etc)  

Page 10: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Bound Checking Schemes • Hardware  based  Schemes:  

•  CHERI  •  HardBound  •  WatchdogLite  

•  SoRware  based  Schemes:  •  So5Bound  •  Cyclone  •  SafeCode  •  Ccured  •  And  many  others  •  Safe  Languages  such  as  Python  and  Java  

Return  Address  Other  Variables  

Buffer  

a+1  

a  

a+n  

a+2  

……  

a+3  

Overflow  ASack    

cannot  write  beyond  the    Bound  

Base  

Bound  

Compute  Pipeline  

 Bounds    Check  

Extra  Memory  Accesses  

Caches    Bounds    Metadata  

DRAM    Bounds    Metadata  

Bounds  Checks  

Page 11: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Performance Implications in Multicores

Compute  Pipeline  

 Bounds    Check  

Increased  Coherency  Bo)lenecks  

Shared  Cache  

 Bounds    Metadata  

Compute  Pipeline  

 Bounds    Check  

Core   Core  

 Bounds    Metadata  

Cache  Pollu#on  and  Interference  

Synchroniza#on  Bo)lenecks  as  more  #me  spent  in  cri#cal  

code  sec#ons  

Compute  Stalls  for  Bound  Checks  

Reduced  Scalability  

 …    

•  Previous  works  have  not  analyzed  memory  safety  schemes  in  the  context  of  mulAcores  

•  All  prior  works  use  sequenAal  benchmarks  such  as  SPEC  

Extra  Memory  Accesses  

DRAM    Bounds    Metadata  

Page 12: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Performance Implications in Multicores : Critical Code •  CriAcal  code  secAons  such  as  atomically  locked  program  funcAons  are  most  affected.  

•  Due  to  bounds  checking  stalls,  a  thread  keeps  a  locked  secAon  for  a  longer  Ame,  depriving  other  threads  from  exploiAng  scalability.  

Code  Without  So5Bound  

1.    int  a  =  1;  

2.    pthread_mutex_lock(&lock);  

3.    do_parallel_work();  

4.    pthread_mutex_unlock(&lock);  

5.  return  0;  

 Code  With  So5Bound  

1.    int  a  =  1;  

2.    pthread_mutex_lock(&lock);  

3.    do_parallel_work();  

4.  Get  Security  Metadata();  

5.  SoRBound  Checks  ();  

6.    pthread_mutex_unlock(&lock);  

7.  return  0;  

 

Page 13: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Agenda MoAvaAon    Characteriza#on  Methodology  

CharacterizaAon  Results    Insights  and  Possible  Improvements  

Page 14: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

System Model

Mul#core  Processor  

SoRBound  

.EXE  Executables  compiled  with  SoRBound  

ApplicaAon  Data  

DRAM

 

SoRBound  Metadata  

Tile  

Graph

ite  Sim

ulator  

Page 15: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

System Parameters •  The  Graphite  Simulator  is  used  to  analyze  Parallel  Benchmarks  

•  256  Cores  @  1  GHz  •  Results  for  both  in-­‐order  and    

Out-­‐of-­‐Order  cores  •  L1-­‐I  Cache  :  32  KB  per  core  •  L1-­‐D  Cache  :  32  KB  per  core  •  L2  Cache  :  256  KB  per  core  •  OOO  ROB  Buffer  Size:  168  

•  An  8  Core  Intel  machine  used  as  well  (Trends  similar  as  Graphite  results)  (Results  in  Paper)  

•  POSIX  Threading  Model  is  used  •  Evalua#on  Includes  results  for  1-­‐256  Threads  

Page 16: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Evaluated Benchmarks •  Prefix  Scan  –  16M  elements  

•  Each  Thread  gets  a  chunk  to  scan,  then  the  Master  Thread  reduces  the  scans  of  other  threads  to  get  the  final  soluAon  

•  Barriers  used  to  do  explicit  SynchronizaAon  

•  Matrix  Mul#ply  –  1K  x  1K  •  Each  Thread  Tiles  a  chunk  of  the  matrix,  and  mulAplies  to  get  the  final  matrix  •  Barriers  used  to  do  explicit  SynchronizaAon  

•  Breadth  First  Search  (BFS)  –  1M  ver#ces,  16M  edges  •  Each  Thread  gets  a  chunk  of  the  graph  to  search  on  •  Atomic  locks  used  on  all  graph  verAces  to  ensure  that  no  race  condiAons  occur  in  shared  verAces  

•  Dijkstra  –  16K  ver#ces,  134M  edges  •  Checking  neighboring  nodes  for  shortest  path  distances  parallelized  among  threads  •  Explicit  Barriers  to  progress  each  node  check  

Page 17: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Agenda MoAvaAon    CharacterizaAon  Methodology  

Characteriza#on  Results    Insights  and  Possible  Improvements  

Page 18: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Page 19: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Page 20: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Page 21: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Page 22: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Page 23: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1   2   4   8   16   32   64   96   128   160   192   224   256  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Shows  Weak  Scaling  

Weak  Scaling  

Page 24: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1,4  

1,6  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  

Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  SoRBound  has  higher  overheads  due  to  extra  compute  and  memory  accesses  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

Page 25: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Prefix Scan

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1,4  

1,6  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  

Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

•  Concurrency  hides  SoRBound’s  overhead  at  high  thread  counts  

•  More  Compute  than  CommunicaAon  in  this  Workload  

•  ‘S’  shows  results  with  SoRBound  

Concurrency  Improvement!  

Baseline      So5Bound  

Page 26: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Matrix Multiply

0  

0,5  

1  

1,5  

2  

2,5  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,05  

0,1  

160   S  

192   S  

224   S  

256   S  

•  Significant  Overhead  with  SoRBound  

•  Memory  Bounds  ApplicaAons  lead  to  more  Metadata  and  security  checks  

•  ‘S’  shows  results  with  SoRBound   Baseline      So5Bound  

Page 27: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Matrix Multiply

0  

0,5  

1  

1,5  

2  

2,5  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,05  

0,1  

160   S  

192   S  

224   S  

256   S  

•  Concurrency  does  reduce  SoRBound’s  overhead  at  high  thread  counts  

•  However,  overhead  sAll  quite  significant  

•  ‘S’  shows  results  with  SoRBound   Baseline      So5Bound  

Page 28: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Matrix Multiply

0  

0,5  

1  

1,5  

2  

2,5  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,05  

0,1  

160   S  

192   S  

224   S  

256   S  

•  One  can  get  the  Efficiency  of  the  baseline  by  using  addiAonal  Threads  for  SoRBound  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

So5Bound  Equal  to  Baseline  

Page 29: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Breadth First Search (BFS)

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1,4  

1,6  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  

Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,1  

0,2  

160   S  

192   S  

224   S  

256   S  

•  Another  Graph  Workload,  however  with  higher  scalability  and  Locality  

•  Concurrency  does  reduce  SoRBound’s  overhead  at  high  thread  counts  

•  However,  overhead  sAll  quite  significant  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

Page 30: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Breadth First Search (BFS)

0  

0,2  

0,4  

0,6  

0,8  

1  

1,2  

1,4  

1,6  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  

Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,1  

0,2  

160   S  

192   S  

224   S  

256   S  

•  Concurrency  helps  reduce  SoRBound  overhead  at  high  thread  counts  

•  However,  overhead  sAll  quite  significant  because  of  fine  grained  synchronizaAon  in  this  workload  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

Page 31: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Dijkstra

0  

0,5  

1  

1,5  

2  

2,5  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,2  

16  S   32  S   64  S   96  S  

•  SoRBound  results  in  higher  on-­‐chip  and  off-­‐chip  data  accesses,  due  to  large  working  set  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

Page 32: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Dijkstra

0  

0,5  

1  

1,5  

2  

2,5  

1   S   2   S   4   S   8   S   16   S   32   S   64   S   96   S  

128   S  

160   S  

192   S  

224   S  

256   S  Normalized

 Com

ple#

on  Tim

e  

Thread  Count  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

0  

0,2  

16  S   32  S   64  S   96  S  

•  Scalability  is  limited  due  to  fine  grained  communicaAon  

•  SoRBound  checks  within  criAcal  secAons  hurts  performance  

•  ‘S’  shows  results  with  SoRBound  

Baseline      So5Bound  

Page 33: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Summary of Slowdowns

1  

1,5  

2  

2,5  

3  

3,5  

4  

0   50   100   150   200   250  

So5Bo

und’s  N

ormalized

 Overhead  

Thread  Count  

PREFIX  

Concurrency  Improvement!  

Baseline  Scales  Faster    Ini#ally  

•  Concurrency  helps  hide  SoRBound  Overheads  

Page 34: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Summary of Slowdowns

1  

1,5  

2  

2,5  

3  

3,5  

4  

0   50   100   150   200   250  

Slow

down  du

e  to  So5

Boun

d  

Thread  Count  

MATMUL  

Larger  Metadata  Size  Inhibits  Scalability    

So5Bound  Scales  Ini#ally  

•  Concurrency  helps  hide  SoRBound  Overheads  iniAally  at  ~32-­‐64  threads  •  Then  worsens  since  metadata  impacts  the  available  local  cache  size  

Page 35: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Summary of Slowdowns

1  

1,5  

2  

2,5  

3  

3,5  

4  

0   50   100   150   200   250  

Slow

down  du

e  to  So5

Boun

d  

Thread  Count  

BFS  

Larger  Metadata  Size  Inhibits  Scalability    

So5Bound  Scales  Ini#ally  

•  Concurrency  helps  hide  SoRBound  Overheads  iniAally  at  ~2-­‐8  threads  •  Then  worsens  since  metadata  impacts  criAcal  secAons    

Page 36: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Summary of Slowdowns

1  

1,5  

2  

2,5  

3  

3,5  

4  

0   50   100   150   200   250  

Slow

down  du

e  to  So5

Boun

d  

Thread  Count  

DIJK  

Larger  Metadata  Size  Inhibits  Scalability    

So5Bound  Scales  a  bit  

Cache  Effects    

•  Concurrency  helps  hide  SoRBound  Overheads  iniAally  at  ~2-­‐8  threads  •  Then  worsens  since  metadata  impacts  criAcal  secAons  

Page 37: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Benchmark Characterization : Slowdowns in Out-of-Order Cores

1  

1,5  

2  

2,5  

3  

3,5  

4  

0   50   100   150   200   250  

Slow

down  du

e  to  So5

Boun

d  

Thread  Count  

DIJK   PREFIX   BFS   MATMUL  •  All  Prior  Results  had  In-­‐Order  Cores  

•  ReducAon  in  performance  degradaAon  observed  

•  OOOs  improve  criAcal  code  secAons  

•  Latency  hiding  helps  So5Bound  scale  be)er  

Page 38: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

0  

0,5  

1  

1,5  

2  

2,5  

IN   OOC   IN   OOC   IN   OOC   IN   OOC  

So5Bo

und  Overhead  

DIJK  

Compute   L1Cache-­‐L2Home  

L2Home-­‐OffChip   Synchroniza#on  

Benchmark Characterization : In-Order vs. Out-of-Order

Prefix   BFS   Matrix  Mul#ply  

•  Parallel  SoRBound  results  for  both  In-­‐Order  and  OOO  core  types  

•  At  thread  counts  showing  highest  speedups  

 

OOO  Improvements  in  Compute  Bound  Workloads  

•  OOOs  can  not  improve  parallel  performance  much  •  Alterna#ve  

architectural  improvements  needed  

 

Page 39: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Agenda MoAvaAon    CharacterizaAon  Methodology  

CharacterizaAon  Results    Insights  and  Possible  Improvements  

Page 40: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Insights from Characterization •  Most  boSlenecks  in  MulA-­‐threaded  SoRBound  workloads  stem  from  Memory  Accesses  and  SynchronizaAon  

•  Latency  Hiding  does  improve  SoRBound  using  OoO  Cores  slightly  

•  Increase  in  thread  count  allows  SoRBound  to  perform  as  well  as  a  baseline  at  a  lower  thread  counts  

•  However,  addi#onal  so5ware/architectural  mechanisms  are  needed  to  further  improve  Performance  

Page 41: Exploring the Performance Implications of Memory Safety ... · COMPUTER ARCHITECTURE GROUP Exploring the Performance Implications of Memory Safety Primitives in Many-core Processors

COMPUTER ARCHITECTURE GROUP

Potential Future Improvements •  Improve  SoRBound’s  Metadata  structures  

•  Compression  •  BeSer  Layout  (e.g.  a  beSer  tree  type  of  structure)  

•  Use  ParallelizaAon  Strategies  (e.g.  Blocking)  that  are  aware  of  Security  Metadata’s  presence    

•  Devise  a  Prefetcher  for  Security  Metadata  •  Prefetch  SoRBound  metadata  from  DRAM  •  Hides  SoRBounds  latency