DESCRIPTION
An investigation of the basics of Linux's page reclaim function.
Linux Kernel Page Reclaim
吉田雅徳 @siburu
2014/7/27 (Sun)
1. Recap of the Previous Session
What’s a Page Frame
❖ page frame = a page-sized, page-aligned piece of RAM
❖ struct page = a kernel structure, one per page frame
❖ mem_map
❖ A single array of struct page's covering all RAM that the kernel manages.
❖ But in a CONFIG_SPARSEMEM environment:
❖ There is no single mem_map.
❖ Instead, there is a list of 2MB-sized arrays of struct page's.
❖ You must use __pfn_to_page(), __page_to_pfn(), or wrappers of them.
What’s NUMA
❖ NUMA (Non-Uniform Memory Access)
❖ The system is composed of nodes.
❖ Each node is defined by a set of CPUs and one physical memory range.
❖ Memory access latency differs depending on the source and destination nodes.
❖ NUMA configuration
❖ ACPI provides the NUMA configuration:
❖ SRAT (System Resource Affinity Table)
❖ Describes which CPUs and memory ranges belong to which NUMA node.
❖ SLIT (System Locality Information Table)
❖ Describes how far each NUMA node is from every other node.
What’s a Memory Zone
❖ Physical memory is divided by address range:
❖ ZONE_DMA: < 16MB
❖ ZONE_DMA32: < 4GB
❖ ZONE_NORMAL: the rest
❖ ZONE_MOVABLE: empty by default.
❖ Used to define a hot-removable physical memory range.
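As a rough illustration of the zone boundaries above, here is a minimal sketch in plain user-space C (not kernel code; `phys_addr_to_zone` is a hypothetical helper) that classifies a physical address by those x86-64 ranges:

```c
#include <stdint.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };

/* Hypothetical helper: map a physical address to the zone that
 * contains it, using the x86-64 boundaries listed above. */
static enum zone_type phys_addr_to_zone(uint64_t paddr)
{
    if (paddr < (16ULL << 20))   /* below 16MB -> ZONE_DMA */
        return ZONE_DMA;
    if (paddr < (4ULL << 30))    /* below 4GB -> ZONE_DMA32 */
        return ZONE_DMA32;
    return ZONE_NORMAL;          /* everything else */
}
```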
Memory node, zone
(diagram: two physical address ranges, Range1 and Range2, and CPUs 1-4 are grouped into NUMA node1 and NUMA node2; each node is described by a struct pglist_data)

struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	…
};

❖ Every pglist_data provides a zone structure for each ZONE (DMA through MOVABLE), though some of them may be empty.
Memory Allocation
1. First, check the threshold for each zone (threshold = watermark and dirty ratio).
❖ If every zone fails the check, the kernel enters the page reclaim path (= today’s topic).
2. If some zone is OK, allocate a page from that zone’s buddy system.
❖ An order-0 page is allocated from the per-cpu cache.
❖ A higher-order page is obtained from the per-order lists of pages.
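The "order" in step 2 is the power-of-two block size the buddy system deals in. A minimal user-space C sketch (the helper `order_for_size` is hypothetical, mirroring the idea behind the kernel's get_order()) of picking the order for a request:

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Hypothetical helper: the smallest order such that a block of
 * (PAGE_SIZE << order) bytes covers the requested size. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;

    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}
```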
Memory Deallocation
❖ Pages are returned to the buddy system.
❖ An order-0 page is returned to the per-cpu cache via free_hot_cold_page().
❖ Cold page: a page estimated not to be in the CPU cache
❖ Linked to the tail of the per-cpu cache’s LRU list.
❖ Hot page: a page estimated to be in the CPU cache
❖ Linked to the head of the per-cpu cache’s LRU list.
❖ A higher-order page is returned directly to the per-order lists of pages.
Buddy System
(diagram: order-0 pages are (de)allocated through a per-cpu cache with HOT pages at the head and COLD pages at the tail; the per-zone buddy system keeps per-order free lists from order 0 (4KB) up to order 10 (4MB))
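The per-order free lists above merge and split blocks along "buddy" boundaries. A minimal user-space C sketch, modeled on the XOR trick used by the kernel's __find_buddy_pfn() (function names here are illustrative):

```c
#include <stdint.h>

/* The buddy of a 2^order-page block differs from it only in bit
 * `order` of the page frame number. */
static uint64_t buddy_pfn(uint64_t pfn, unsigned int order)
{
    return pfn ^ (1ULL << order);
}

/* When a block and its buddy are both free, they merge into one
 * block of order+1, starting at the lower of the two pfns. */
static uint64_t merged_pfn(uint64_t pfn, unsigned int order)
{
    return pfn & ~(1ULL << order);
}
```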
2. Page Reclaim
2.1 Direct reclaim
2.2 Daemon reclaim
Review of the page allocation flow
❖ __alloc_pages_nodemask (the core page allocation function)
❖ get_page_from_freelist (1st try: local zones, low wmark) → get_page_from_freelist (2nd try: all zones)
❖ __alloc_pages_slowpath
1. wake_all_kswapds (wake the kswapd threads)
2. get_page_from_freelist (3rd try: all zones, min wmark)
3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority
4. __alloc_pages_direct_compact (asynchronous)
5. __alloc_pages_direct_reclaim (reclaim pages directly in this context)
6. if no progress was made → __alloc_pages_may_oom
7. retry (back to 2) or __alloc_pages_direct_compact (synchronous)
2.1 Direct Reclaim (reclaim performed by the allocation requester itself)
__alloc_pages_direct_reclaim()
❖ __perform_reclaim
❖ current->flags |= PF_MEMALLOC
❖ So that page allocations made in the course of reclaim can draw on the emergency reserves
❖ try_to_free_pages
❖ throttle_direct_reclaim
❖ if !pfmemalloc_watermark_ok → wait until kswapd makes it OK
❖ do_try_to_free_pages
❖ current->flags &= ~PF_MEMALLOC
❖ get_page_from_freelist
❖ drain_all_pages
❖ get_page_from_freelist
pfmemalloc_watermark_ok()
❖ ARGS
❖ pgdat (type: struct pglist_data)
❖ RETURN
❖ type: bool
❖ node’s free_pages > 0.5 * node’s min_wmark
❖ DESC
❖ Per node (not per zone), compares the amount of free pages against half of the min watermark; OK if above it.
❖ If below, returns false and wakes that node’s kswapd.
❖ This function sets the threshold at which a memory-starved node gives up on direct reclaim and leaves the work to kswapd.
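The check described above can be sketched as follows (user-space C; the struct and field names are illustrative stand-ins, not the kernel's):

```c
#include <stdbool.h>

/* Illustrative stand-in for the counters kept in struct pglist_data. */
struct node_stats {
    unsigned long free_pages;      /* node-wide free pages */
    unsigned long min_wmark_pages; /* sum of the node's min watermarks */
};

/* OK (direct reclaim may proceed) only while free pages exceed
 * half of the node's min watermark. */
static bool pfmemalloc_watermark_ok(const struct node_stats *node)
{
    return node->free_pages > node->min_wmark_pages / 2;
}
```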
do_try_to_free_pages()
❖ The core page reclaim function, called from three different paths
❖ try_to_free_pages() → global reclaim path via __alloc_pages_nodemask()
❖ try_to_free_mem_cgroup_pages() → per-memcg reclaim path
❖ Right before a per-memcg slab allocation
❖ Right before a per-memcg file page allocation
❖ Right before a per-memcg anon page allocation
❖ Right before a per-memcg swapin allocation
❖ shrink_all_memory() → hibernation path
❖ Arguments: (1) struct zonelist *zonelist, (2) struct scan_control *sc
struct scan_control

struct scan_control {
	unsigned long nr_scanned;
	unsigned long nr_reclaimed;
	unsigned long nr_to_reclaim;
	…
	int swappiness; // 0..100
	…
	struct mem_cgroup *target_mem_cgroup;
	…
	nodemask_t *nodemask;
};
What do_try_to_free_pages does
❖ A loop over the following two calls:
❖ shrink_zones()
❖ Described later
❖ wakeup_flusher_threads()
❖ Called every time shrink_zones has scanned at least 1.5x the reclaim target (scan_control::nr_to_reclaim).
❖ Asks all block devices (bdi) to write back at most as many pages as were scanned.
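The 1.5x trigger for writeback can be sketched like this (user-space C; the helper name is hypothetical):

```c
#include <stdbool.h>

/* Wake the flusher threads once at least 1.5x the reclaim target
 * (scan_control::nr_to_reclaim) has been scanned. */
static bool should_wake_flushers(unsigned long nr_scanned,
                                 unsigned long nr_to_reclaim)
{
    return nr_scanned >= nr_to_reclaim + nr_to_reclaim / 2;
}
```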
shrink_zones()
1. for_each_zone_zonelist_nodemask:
1. mem_cgroup_soft_limit_reclaim
❖ while mem_cgroup_largest_soft_limit_node:
❖ mem_cgroup_soft_reclaim
❖ Before proceeding to shrink_zone, reclaims pages from those memcgs using this zone that have exceeded their soft limit.
2. shrink_zone
❖ foreach mem_cgroup_iter:
❖ shrink_lruvec
❖ For global reclaim, this iteration reclaims starting from the root memcg.
2. shrink_slab
❖ Slab reclaim will be covered in a later session…
shrink_lruvec()
❖ The per-zone page freer
1. get_scan_count
❖ Determines the target number of pages to reclaim
2. while the target is not met:
❖ shrink_list(LRU_INACTIVE_ANON)
❖ shrink_list(LRU_ACTIVE_ANON)
❖ shrink_list(LRU_INACTIVE_FILE)
❖ shrink_list(LRU_ACTIVE_FILE)
3. if inactive anonymous pages alone are not enough:
❖ shrink_active_list
shrink_list()
❖ Calls shrink_{active or inactive}_list; the active list, however, is shrunk only when it is larger than its paired inactive list
1. if an ACTIVE list is specified:
❖ if size of lru(ACTIVE) > size of lru(INACTIVE):
❖ shrink_active_list
2. else:
❖ shrink_inactive_list
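The decision above can be sketched as follows (user-space C; the enum and helper names are illustrative):

```c
#include <stdbool.h>

enum lru_action { LRU_DO_NOTHING, LRU_SHRINK_ACTIVE, LRU_SHRINK_INACTIVE };

/* An active list is shrunk only when it is larger than its paired
 * inactive list; an inactive list is always eligible. */
static enum lru_action shrink_list_choice(bool active_lru,
                                          unsigned long nr_active,
                                          unsigned long nr_inactive)
{
    if (active_lru)
        return nr_active > nr_inactive ? LRU_SHRINK_ACTIVE
                                       : LRU_DO_NOTHING;
    return LRU_SHRINK_INACTIVE;
}
```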
shrink_{active,inactive}_list
❖ shrink_active_list()
1. Traverses the pages on an active list
2. Finds inactive pages in the list and moves them to the inactive list
❖ shrink_inactive_list()
❖ foreach page:
1. if page_mapped(page) → try_to_unmap(page)
2. if PageDirty(page) → pageout(page)
What counts as an inactive page
❖ If !laptop_mode:
❖ Simply takes the requested number of pages from the tail of the active LRU list as inactive pages.
❖ If laptop_mode:
❖ Takes only clean pages, up to the requested number, from the tail of the active LRU list as inactive pages.
try_to_unmap()
❖ Unmaps a specified page from all of its mappings
1. Sets up a struct rmap_walk_control.
2. rmap_walk_{file, anon, or ksm}
❖ An rmap walk iterates over the VMAs mapping the page and unmaps the page from each
A. file: traverses the address_space::i_mmap tree
B. anon: traverses the anon_vma tree
C. ksm: traverses all merged anon_vma trees
❖ Each per-VMA operation is similar to the anon case
A. rmap_walk_file
(diagram: the page’s address_space (inode) holds an i_mmap tree (type: rb_root) of VMAs; the page is unmapped from each VMA’s page tables)
B. rmap_walk_anon
(diagram: the page’s anon_vma holds an rb_root tree of VMAs; the page is unmapped from each VMA’s page tables)
C. rmap_walk_ksm
(diagram: the page’s stable_node holds an hlist of anon_vmas; each anon_vma’s VMAs are walked and the page is unmapped from their page tables)
2.2 Daemon Reclaim (reclaim performed on the allocator’s behalf by kswapd)
kswapd
❖ Processing overview
1. Wake up
2. balance_pgdat()
3. Sleep
❖ balance_pgdat()
❖ Works until all zones of the pgdat are at or above the high watermark.
❖ Reclaim function: kswapd_shrink_zone()