XPDS16: Xen Scalability Analysis - Weidong Han, Zhichao Huang & Wei Yang, Huawei

HUAWEI TECHNOLOGIES CO., LTD.

www.huawei.com

Huawei Confidential

Security Level:

47pt

30pt

反白

:

FrutigerNext LT Medium

: Arial

47pt

黑体

28pt

反白

细黑体

August, 2016

Weidong Han <[email protected]>

Wei Yang <[email protected]>

Zhichao Huang <[email protected]>

>

Xen Scalability Analysis

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential

35pt

32pt

) :18pt

Page 2

Agenda

Current Status

Issues & Proposals

Summary


35pt

32pt

) :18pt

Page 3

What’s Scalability

Scalability is the capability of a system, network, or process to

handle a growing amount of work, or its potential to be enlarged in

order to accommodate that growth [wikipedia]

We care about below dimensions of scalability on a single server

Horizontal scaling: more VMs , more users

Vertical scaling: larger VM

Source https://en.wikipedia.org/wiki/Scalability


35pt

32pt

) :18pt

Page 4

Current Status: Xen Max Limits

Xen 4.7

Host Limits (x86)

Up to 4095 physical CPUs

Up to 16TB physical memory

HVM Guest Limits (x86)

Up to 512 virtual CPUs

Up to 1TB virtual memory


35pt

32pt

) :18pt

Page 5

Current Status: Scale to Thousands of VMs

3000 VMs were experimented successfully

Improving Scalability of Xen: the 3,000 domains experiment

(https://events.linuxfoundation.org/images/stories/slides/lfcs2013_liu.pdf

)

https://events.linuxfoundation.org/images/stories/slides/lfcs2013_liu.pdf


35pt

32pt

) :18pt

Page 6

Scalability was improved a lot in upstream

Grant table

Split grant table lock into maptrack lock and grant table lock

per-vCPU maptrack free lists

Read-write lock

Per-active entry lock

Persistent grants for virtual block scalability

Avoid grant operations and TLB flushes

Event channel

Extend event channel limit: up to 131072

per-event channel lock for sending events

Scheduler

node-affinity for vCPU load balance

P2M

Use unlocked p2m lookups in hvmemul_rep_movs

defer the invalidation until the p2m lock is released


35pt

32pt

) :18pt

Page 7

Agenda

Current Status

Bottleneck & Proposals

Summary


35pt

32pt

) :18pt

Page 8

More Cores are Integrated

More and more cores are integrated into a CPU. There are hundreds of cores in 8P servers. It requires better many core scalability.


35pt

32pt

) :18pt

Page 9

Bottleneck 1: ticket spinlock is non-scalable

Xen uses ticket spinlock by default

Ticket spinlock is fair, FIFO

But, it’s non-scalable for many core

Spin on global shared variable

Expensive cache entries invalidation


35pt

32pt

) :18pt

Page 10

Proposal: scalable lock

MCS lock is scalable

Spin on local variable

Generate a constant number of cache misses per acquisition, avoid the

performance collapse with many cores.


35pt

32pt

) :18pt

Page 11

Bottleneck 2: call lock

Call lock is global lock, used in on_selected_cpus to protect

IPIs on targeted CPUs

Bottleneck case

Frequent EPT entry changes invalidate EPT on related

PCPUs

ept_sync_domain -> on_selected_cpus

heavy contention of call lock


35pt

32pt

) :18pt

Page 12

Proposal: scalable lock and finer-grain lock

Replace ticket lock with MCS lock for call lock

Change global lock to per-cpu lock

Hold a per-cpu lock of each target CPU, instead of a global lock

(call lock)

Then do IPIs: smp_send_call_function_mask(&call_data.selected);

1

2

3

N

…

It locks all even though the IPI

target CPUs are only CPU 1 and 2

src

1

2

3

N

…

src

It only locks IPI target CPUs (1 and 2),

other CPUs are still available for IPI


35pt

32pt

) :18pt

Page 13

Evaluation Environment and Method

Hardware

8 sockets, 96 cores, 192 threads

2TB RAM

Host

Xen 4.7.0

Dom0: SLES 11SP3

VM

Win 7 64bits

40GB RAM

64 vCPUs

Test case

Launch 8 VMs with POD in parallel, it will result in lots of EPT changes due to

Windows OS will zero pages during booting, then scan and recycle zero pages

when VM memory usage beyonds reserved threshold.


35pt

32pt

) :18pt

Page 14

Evaluation Results

0

10000

20000

30000

40000

50000

60000

70000

1

12

23

34

45

56

67

78

89

10

0

11

1

12

2

13

3

14

4

15

5

16

6

17

7

18

8

19

9

21

0

22

1

23

2

24

3

25

4

26

5

27

6

28

7

29

8

30

9

32

0

33

1

34

2

35

3

36

4

37

5

38

6

39

7

40

8

41

9

43

0

44

1

45

2

46

3

47

4

48

5

49

6

50

7

51

8

52

9

54

0

55

1

56

2

57

3

58

4

59

5

Number of Lock Wait

Xen 4.7 Xen 4.7 with Optimization

0

2000000

4000000

6000000

8000000

10000000

1

12

23

34

45

56

67

78

89

10

0

11

1

12

2

13

3

14

4

15

5

16

6

17

7

18

8

19

9

21

0

22

1

23

2

24

3

25

4

26

5

27

6

28

7

29

8

30

9

32

0

33

1

34

2

35

3

36

4

37

5

38

6

39

7

40

8

41

9

43

0

44

1

45

2

46

3

47

4

48

5

49

6

50

7

51

8

52

9

54

0

55

1

56

2

57

3

58

4

59

5

Avg Consuming of Each Lock Wait (ns)


With the optimization, the number of lock wait and each wait consuming are both reduced significantly


35pt

32pt

) :18pt

Page 15

Bottleneck 3: vCPU load balance in credit scheduler

vCPU load balance

If the next highest priority local runnable vCPU has already eaten

through its credits, look on other PCPUs to see if we have more urgent

work

Select non-idling CPUs

Try to acquire the schedule lock of a non-idling CPU

Then try to steal a task from this non-idling CPU

If stealing not succeed, go to next non-idling CPU

The bottleneck analysis

In many core case, there will be lots of non-idling CPUs, it’s a big waste

if it often acquired the schedule lock but failed to steal a task.


35pt

32pt

) :18pt

Page 16

Proposal: Add a check before acquiring schedule

lock

Add a check for each non-idling CPU before acquiring

schedule lock

Add a bitmap of each PCPU, record what PCPUs its runq vCPUs can run

(vCPU may be pinned on some PCPUs)

If this CPU (which wants to steal task from other CPUs) is not in the

bitmap of a non-idling CPU, that means it cannot steal a task from this

non-idling CPU. Save the cost of acquiring schedule lock.


35pt

32pt

) :18pt

Page 17

Evaluation Environment

Hardware

8 socket, 96 cores, 192 threads

2TB RAM

Host

Xen 4.7.0

Dom0: SLES 11SP3

VM

Win 7 64bits

40GB RAM

24 vCPUs

Test case

1:1 pin vCPU to PCPU, and run NotMyFault in VM


35pt

32pt

) :18pt

Page 18

Evaluation Results: 1:1 pin VCPU to PCPU

0

10000

20000

30000

40000

50000

60000

1

7

13

19

25

31

37

43

49

55

61

67

73

79

85

91

97

10

3

10

9

11

5

12

1

12

7

13

3

13

9

14

5

15

1

15

7

16

3

16

9

17

5

18

1

18

7

19

3

19

9

20

5

21

1

21

7

22

3

22

9

23

5

24

1

24

7

25

3

25

9

26

5

27

1

27

7

28

3

28

9

29

5

Number of Lock Wait


0

5000000

10000000

15000000

20000000

1

7

13

19

25

31

37

43

49

55

61

67

73

79

85

91

97

10

3

10

9

11

5

12

1

12

7

13

3

13

9

14

5

15

1

15

7

16

3

16

9

17

5

18

1

18

7

19

3

19

9

20

5

21

1

21

7

22

3

22

9

23

5

24

1

24

7

25

3

25

9

26

5

27

1

27

7

28

3

28

9

29

5

Avg. Consuming of Each Lock Wait (ns)


With the optimization, the number of lock wait and each wait consuming are both reduced significantly


35pt

32pt

) :18pt

Page 19

Summary

Some scalability bottlenecks and proposals

Future work

Virtualization scalability benchmark

Scenarios

Measurement

Huge VM (vcpu > 128, memory > 1TB) support

Thank you www.huawei.com

Technology

XPDS16: Xen Scalability Analysis - Weidong Han, Zhichao Huang & Wei Yang, Huawei