HawkEye: Efficient Fine-grained OS Support for Huge Pagessbansal/pubs/HawkEye_Slides.pdf · Ashish Panwar 1, Sorav Bansal2, K. Gopinath1 Indian Institute of Science (IISc), Bangalore1

Ashish Panwar1, Sorav Bansal2, K. Gopinath1

Indian Institute of Science (IISc), Bangalore1

Indian Institute of Technology, Delhi 2

1Architectural Support for Programming Languages and Operating Systems (ASPLOS) - 2019.

HawkEye: Efficient Fine-grained OS Support for Huge Pages

2

Virtual address space

3

Physical address space


4



5



6

Too much TLB pressure!



7



8


Virtual address spaceHuge

pages

Fewer misses

OS Challenges

11

❑ Complex trade-offs

• Memory bloat vs. performance

• Page fault latency vs. the number of page faults

❑ Challenges due to (external) fragmentation• How to leverage limited memory contiguity

• Fairness in huge page allocation

Memory bloatvs.

performance

13

Internal fragmentation

14

Virtual memory

Physical memory

huge page mapping

aggressive allocation

15

Virtual memory

Physical memory

huge page mapping

aggressive allocation conservative allocation


16

Virtual memory

Physical memory

huge page mapping


unused pages


17

Virtual memory

Physical memory

huge page mapping


unused pages

bloat


18

Virtual memory

Physical memory

huge page mapping


Lower TLB reach (impacts performance)bloat


unused pages

Bloat vs. performance

Aggressive

Higher perf

Higher bloat

Conservative

Lower perf

Lower bloat

20

Latencyvs.

# page faults

21

▪ Find a page

pre4-KB

22

▪ Find a page, zero-fill

pre zero-fill post4-KB

23

▪ Find a page, zero-fill, map


24



25%

25


pre zero-fill post

pre

zero-fill

post

4-KB

2-MB

25%

26


pre zero-fill post

pre

zero-fill

post

4-KB

2-MB

25%

27


pre zero-fill post

pre

zero-fill

post

4-KB

2-MB

25%

28


pre zero-fill post

pre

zero-fill

post

4-KB

dominated by zero-filling (97%)

2-MB

25%

Latency vs. # page faults

32

Aggressive

High latency

Fewer faults

Conservative

Low latency

Higher faults

33

FreeBSD Linux

Memory bloat Low High

Performance Low High

Allocation latency Low High

# page faults High Low

conservative vs. aggressive

Tradeoff-1:

Tradeoff-2:

Current systems favor opposite ends of the design spectrum

• FreeBSD is conservative (compromise on performance)

• Linux is throughput-oriented (compromise on latency and bloat)

Ingens (OSDI’16)

34

▪ Asynchronous allocation

• Huge pages allocated in the background

▪ Utilization-threshold based allocation

• Tunable bloat vs. performance

• Adaptive based on memory pressure

▪ Fairness driven by per-process fairness metric

• Heuristic based on past behavior

Ingens (OSDI’16)

35








low latency

too many page faults

Ingens (OSDI’16)

36








low latency


manual

Ingens (OSDI’16)

37








low latency


manual

weak correlation with page walk overhead

Current state-of-the-art

38

FreeBSD Linux Ingens

Memory bloat Low High Tunable

Performance Low High Tunable

Allocation latency Low High Low

# page faults High Low High

Tradeoff-1:

Tradeoff-2:

▪ Hard to find the sweet-spot for utilization-threshold in Ingens

• Application dependent, phase dependent

HawkEye

39

Key Optimizations

40

➢ Asynchronous page pre-zeroing[1]

➢ Content deduplication based bloat mitigation

➢ Fine-grained intra-process allocation

➢ Fairness driven by hardware performance counters

[1] Optimizing the Idle Task and Other MMU Tricks, OSDI'99

Asynchronous page pre-zeroing

41

▪ Pages zero-filled in the background

▪ Potential issues:

• Cache pollution – leverage non-temporal writes

• DRAM bandwidth consumption – rate-limited

o Limit CPU utilization (e.g., 5%)

Asynchronous page pre-zeroing

42

Enables aggressive allocation with low latency

✓ 13.8x faster VM spin-up

✓ 1.26x higher throughput (Redis)

Mitigating bloat

43

Mitigating bloat

44

Virtual memory

Physical memory

huge page mapping

Mitigating bloat

45

Virtual memory

Physical memory

huge page mapping

unused

Mitigating bloat

46

Virtual memory

Physical memory

huge page mapping

unused

zero-filled

Mitigating bloat

47

▪ Observation: Unused base pages remain zero-filled

▪ Identify bloat by scanning memory

▪ Dedup zero-filled base pages to remove bloat

Virtual memory

Physical memory

huge page mapping

unused

zero-filled

Mitigating bloat

48

67.555.4

115.5

3.9 2.8 1.2 1 6.63

27.49.11

0

30

60

90

120

dis

tan

ce (

byte

s)▪ Ease of detecting non-zero pages

offse

t (b

yte

s)

Mitigating bloat

49

✓ Automated "bloat vs. performance" management

0

8

16

24

32

40

48

1101

201301

401501

601701

801901

10011101

12011301

14011501

1601

RSS

(GB

)

Time (seconds)

Linux Ingens HawkEye

out-of-memory successout-of-memory successP

1

P2

P3

Redis

P1: insert

P2: delete

P3: insert

50

FreeBSD Linux Ingens HawkEye

Memory bloat Low High Tunable Automated

Performance Low High Tunable Automated

Allocation latency Low High Low Low

# page faults High Low High Low

Tradeoff-1:

Tradeoff-2:

Fine-grained (intra-process) allocation

51

▪ Maximizing performance with limited contiguity


52

hot regions

access-coverage: # base pages accessed per second

❖ A good indicator of TLB-contention due to a region

▪ Maximizing performance with limited contiguity

XSBench

acce

ss-c

ove

rage


53

access_map

▪ Track access-coverage (access_map)

▪ Allocate in the sorted order

(top to bottom)

✓ Yields higher profit per allocation


54

0

10

20

30

40

50

1 101 201 301 401 501

MM

U O

verh

ead

(%)

Time (seconds)


Workload: XSBench

Page

Walk

Ove

rhead (

%)

access-c

overa

ge


55

0

300

600

900

1200

Graph500 XSBench NPB_CG.D

ms

save

d p

er h

uge

pag

e Linux Ingens HawkEyeE

xe

cu

tio

n tim

e (

ms)

sa

ve

dp

er

hu

ge

pa

ge

allo

ca

tio

n

Fair (inter-process) allocation

56

▪ Prioritize allocation to the process with highest

expected improvement

▪ How to estimate page walk overhead

• Profile hardware performance counters

• Low cost, accurate!

Fair (inter-process) allocation

57

-10

0

10

20

30

40

50

60

70

cactusADM tigr Graph500 lbm_s SVM XSBench CG.D

% s

pe

ed

up


Workloads running alongside a TLB-insensitive process

% s

pe

ed

up

Summary

58

▪ OS support for huge pages involves complex tradeoffs

▪ Balancing fine-grained control with high performance

▪ Dealing with fragmentation for efficiency and fairness

Summary

59

▪ OS support for huge pages involves complex tradeoffs

▪ Balancing fine-grained control with high performance

▪ Dealing with fragmentation for efficiency and fairness

HawkEye: Resolving fundamental conflictsfor huge page optimizations

https://github.com/apanwariisc/HawkEye

https://github.com/apanwariisc/HawkEye

60

Documents

HawkEye: Efficient Fine-grained OS Support for Huge Pagessbansal/pubs/HawkEye_Slides.pdf · Ashish Panwar 1, Sorav Bansal2, K. Gopinath1 Indian Institute of Science (IISc), Bangalore1