
内存管理

Linux的内存管理实际上非常复杂,包括物理内存的管理、设备内存的管理以及虚拟内存的管理,我们就一点点来分析。首先要说明的是,Linux可访问的内存并不仅仅是内存条上的物理内存,还包括很多设备内存和设备寄存器:它们会以MMIO的方式映射进物理地址空间,由Linux统一纳入内存管理。

物理内存管理

我们先来看看物理内存是怎么管理的。

首先,现代处理器基本都是NUMA架构:每个处理器(或每组core)带有自己的内存控制器和本地内存,整体上提高了内存访问带宽,但不同的core访问不同内存区域时会存在速度差异,这是由内存控制器和内存在物理上的分布位置决定的。因此,物理内存会按node来划分管理。

Linux中物理内存的管理大致可以分为Node -> Zone -> Page三个层级。

  • Node(pglist_data):在 NUMA(非一致内存访问)系统中,每个 CPU 节点都有自己独立的内存区域。每个这样的节点由一个 pglist_data 结构体管理。

  • Zone(zone):节点内部又分为不同“类型”的内存区域(例如 DMA 区、普通区、HighMem 区等)。

  • Page(struct page):Zone 内部的最小管理单位,即页框。

Node

在内核源码中,用pglist_data结构体来表示一个node结点上的内存相关数据。在UMA架构下只有一个pglist_data结构体,而在NUMA架构下会有多个。

//arch/x86/mm/numa.c
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);

typedef struct pglist_data {
/*
* node_zones contains just the zones for THIS node. Not all of the
* zones may be populated, but it is the full list. It is referenced by
* this node's node_zonelists as well as other node's node_zonelists.
*/
struct zone node_zones[MAX_NR_ZONES]; //一个节点包含多个内存区(zone)例如: ZONE_DMA,ZONE_DMA32,ZONE_NORMAL,ZONE_HIGHMEM(在x86 32bit).每个 struct zone 管理该区域的页框分配、回收、空闲页、伙伴系统等。

/*
* node_zonelists contains references to all zones in all nodes.
* Generally the first zones will be references to this node's
* node_zones.
*/
struct zonelist node_zonelists[MAX_ZONELISTS];

int nr_zones; /* number of populated zones in this node */ //当前节点上实际被初始化的 zone 数量。
#ifdef CONFIG_FLATMEM /* means !SPARSEMEM */
struct page *node_mem_map; //这是节点的 struct page 数组(即 “mem_map”),每个物理页框对应一个 struct page。
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
#endif
#endif
......
} pg_data_t;

具体连接的方式参考下图。结构体中还有其他一些成员,主要是锁、统计数据以及内存热插拔相关的字段,这里就不做细说。

(图:pglist_data 与 node_zones、node_zonelists 的连接关系)
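
为了更直观地感受 Node -> Zone 的层次关系,下面给一个最小的示意代码(假设在内核模块环境中编译,函数名 nodezone_demo_init 为自拟,并非内核既有接口):遍历所有在线节点的 pglist_data,打印每个已填充 zone 的页数信息。

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/mmzone.h>

static int __init nodezone_demo_init(void)
{
    struct pglist_data *pgdat;
    struct zone *zone;
    int i;

    for_each_online_pgdat(pgdat) {              /* 遍历每个节点的 pglist_data */
        pr_info("node %d: %d populated zones\n",
                pgdat->node_id, pgdat->nr_zones);
        for (i = 0; i < MAX_NR_ZONES; i++) {
            zone = &pgdat->node_zones[i];
            if (!populated_zone(zone))          /* 跳过没有实际页面的 zone */
                continue;
            pr_info("  zone %-8s spanned=%lu present=%lu managed=%lu\n",
                    zone->name, zone->spanned_pages,
                    zone->present_pages, zone_managed_pages(zone));
        }
    }
    return 0;
}

static void __exit nodezone_demo_exit(void) { }

module_init(nodezone_demo_init);
module_exit(nodezone_demo_exit);
MODULE_LICENSE("GPL");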

Zone

首先是Zone的类型有以下几种,具体每一种Zone的用法看注释即可。

enum zone_type {
/*
* ZONE_DMA and ZONE_DMA32 are used when there are peripherals not able
* to DMA to all of the addressable memory (ZONE_NORMAL).
* On architectures where this area covers the whole 32 bit address
* space ZONE_DMA32 is used. ZONE_DMA is left for the ones with smaller
* DMA addressing constraints. This distinction is important as a 32bit
* DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
* platforms may need both zones as they support peripherals with
* different DMA addressing limitations.
*/
#ifdef CONFIG_ZONE_DMA
ZONE_DMA, //低端 DMA 可达内存区(通常为 <16MB 或 <32MB)
#endif
#ifdef CONFIG_ZONE_DMA32
ZONE_DMA32, //32位 DMA 可达内存区(通常为 <4GB)
#endif
/*
* Normal addressable memory is in ZONE_NORMAL. DMA operations can be
* performed on pages in ZONE_NORMAL if the DMA devices support
* transfers to all addressable memory.
*/
ZONE_NORMAL,//普通的、内核直接线性映射的物理内存区。
#ifdef CONFIG_HIGHMEM
/*
* A memory area that is only addressable by the kernel through
* mapping portions into its own address space. This is for example
* used by i386 to allow the kernel to address the memory beyond
* 900MB. The kernel will set up special mappings (page
* table entries on i386) for each page that the kernel needs to
* access.
*/
ZONE_HIGHMEM, //高端内存区:CPU 不能直接线性映射访问的物理内存。
#endif
/*
* ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains
* movable pages with few exceptional cases described below. Main use
* cases for ZONE_MOVABLE are to make memory offlining/unplug more
* likely to succeed, and to locally limit unmovable allocations - e.g.,
* to increase the number of THP/huge pages. Notable special cases are:
*
* 1. Pinned pages: (long-term) pinning of movable pages might
* essentially turn such pages unmovable. Therefore, we do not allow
* pinning long-term pages in ZONE_MOVABLE. When pages are pinned and
* faulted, they come from the right zone right away. However, it is
* still possible that address space already has pages in
* ZONE_MOVABLE at the time when pages are pinned (i.e. user has
* touches that memory before pinning). In such case we migrate them
* to a different zone. When migration fails - pinning fails.
* 2. memblock allocations: kernelcore/movablecore setups might create
* situations where ZONE_MOVABLE contains unmovable allocations
* after boot. Memory offlining and allocations fail early.
* 3. Memory holes: kernelcore/movablecore setups might create very rare
* situations where ZONE_MOVABLE contains memory holes after boot,
* for example, if we have sections that are only partially
* populated. Memory offlining and allocations fail early.
* 4. PG_hwpoison pages: while poisoned pages can be skipped during
* memory offlining, such pages cannot be allocated.
* 5. Unmovable PG_offline pages: in paravirtualized environments,
* hotplugged memory blocks might only partially be managed by the
* buddy (e.g., via XEN-balloon, Hyper-V balloon, virtio-mem). The
* parts not manged by the buddy are unmovable PG_offline pages. In
* some cases (virtio-mem), such pages can be skipped during
* memory offlining, however, cannot be moved/allocated. These
* techniques might use alloc_contig_range() to hide previously
* exposed pages from the buddy again (e.g., to implement some sort
* of memory unplug in virtio-mem).
* 6. ZERO_PAGE(0), kernelcore/movablecore setups might create
* situations where ZERO_PAGE(0) which is allocated differently
* on different platforms may end up in a movable zone. ZERO_PAGE(0)
* cannot be migrated.
* 7. Memory-hotplug: when using memmap_on_memory and onlining the
* memory to the MOVABLE zone, the vmemmap pages are also placed in
* such zone. Such pages cannot be really moved around as they are
* self-stored in the range, but they are treated as movable when
* the range they describe is about to be offlined.
*
* In general, no unmovable allocations that degrade memory offlining
* should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
* have to expect that migrating pages in ZONE_MOVABLE can fail (even
* if has_unmovable_pages() states that there are no unmovable pages,
* there can be false negatives).
*/
ZONE_MOVABLE, //可迁移内存区,主要用于内存热插拔和大页管理。
#ifdef CONFIG_ZONE_DEVICE
ZONE_DEVICE, //设备专用内存区(非标准 DRAM 页),如持久内存(NVDIMM)、GPU BAR、CXL Memory。
#endif
__MAX_NR_ZONES

};
/*
物理地址空间
┌──────────────────────────────┐
│ ZONE_DMA (<16MB) │ → legacy ISA DMA
├──────────────────────────────┤
│ ZONE_DMA32 (<4GB) │ → 32-bit PCIe DMA
├──────────────────────────────┤
│ ZONE_NORMAL (>=4GB) │ → kernel, user, pagecache
├──────────────────────────────┤
│ ZONE_MOVABLE │ → movable pages, hotplug
├──────────────────────────────┤
│ ZONE_DEVICE │ → device memory (PMEM, GPU)
└──────────────────────────────┘
*/
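
不同的 GFP 标志最终会通过 gfp_zone() 决定首选的 zone。下面是一个最小的示意(函数名 gfp_zone_demo 为自拟,仅演示接口用法):同一个 alloc_pages 接口,带上不同的标志就会落到不同的 zone。

#include <linux/gfp.h>
#include <linux/mm.h>

static void gfp_zone_demo(void)
{
    struct page *normal, *dma32, *movable;

    normal  = alloc_pages(GFP_KERNEL, 0);               /* 通常落在 ZONE_NORMAL */
    dma32   = alloc_pages(GFP_KERNEL | __GFP_DMA32, 0); /* 限制在 4GB 以下,即 ZONE_DMA32 */
    movable = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);     /* 用户可迁移页,可落在 ZONE_MOVABLE */

    if (normal)
        __free_pages(normal, 0);
    if (dma32)
        __free_pages(dma32, 0);
    if (movable)
        __free_pages(movable, 0);
}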

看完了不同种类的Zone,我们来看看Zone结构体。

struct zone {
/* Read-mostly fields */

/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK]; //本内存区域的三个水线,高中低
unsigned long watermark_boost;

unsigned long nr_reserved_highatomic;

/*
* We don't know if the memory that we're going to allocate will be
* freeable or/and it will be released eventually, so to avoid totally
* wasting several GB of ram we must reserve some of the lower zone
* memory (otherwise we risk to run OOM on the lower zones despite
* there being tons of freeable ram on the higher zones). This array is
* recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
* changes.
*/
long lowmem_reserve[MAX_NR_ZONES]; //为各内存域预留的若干页面,用于无论如何都不能失败的关键内存分配

#ifdef CONFIG_NUMA
int node; //本Zone所属Node的id
#endif
struct pglist_data *zone_pgdat; //指向对应Node的pglist_data结构体
struct per_cpu_pages __percpu *per_cpu_pageset;
struct per_cpu_zonestat __percpu *per_cpu_zonestats;
/*
* the high and batch values are copied to individual pagesets for
* faster access
*/
int pageset_high_min;
int pageset_high_max;
int pageset_batch;

#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn; //起始页帧号

/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* present_early_pages is present pages existing within the zone
* located on memory available since early boot, excluding hotplugged
* memory.
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* cma pages is present pages that are assigned for CMA use
* (MIGRATE_CMA).
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/done(). Any reader who can't tolerant drift of
* present_pages should use get_online_mems() to get a stable value.
*/
atomic_long_t managed_pages; //也即 managed_pages 是这个 zone 被伙伴系统管理的所有的 page 数目。
unsigned long spanned_pages; // 指的是不管中间有没有物理内存空洞,反正就是最后的页号减去起始的页号,即spanned_pages包含内存空洞区域页。
unsigned long present_pages; //也即 present_pages 是这个 zone 在物理内存中真实存在的所有 page 数目。
#if defined(CONFIG_MEMORY_HOTPLUG)
unsigned long present_early_pages;
#endif
#ifdef CONFIG_CMA
unsigned long cma_pages;
#endif

const char *name;

#ifdef CONFIG_MEMORY_ISOLATION
/*
* Number of isolated pageblock. It is used to solve incorrect
* freepage counting problem due to racy retrieving migratetype
* of pageblock. Protected by zone->lock.
*/
unsigned long nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif

int initialized;

/* Write-intensive fields used from the page allocator */
CACHELINE_PADDING(_pad1_);

/* free areas of different sizes */
struct free_area free_area[NR_PAGE_ORDERS]; //Buddy伙伴系统的核心数据结构,管理空闲页块链表的数组。

#ifdef CONFIG_UNACCEPTED_MEMORY
/* Pages to be accepted. All pages on the list are MAX_PAGE_ORDER */
struct list_head unaccepted_pages;
#endif

/* zone flags, see below */
unsigned long flags;

/* Primarily protects free_area */
spinlock_t lock;

/* Write-intensive fields used by compaction and vmstats. */
CACHELINE_PADDING(_pad2_);

/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* pfn where compaction free scanner should start */
unsigned long compact_cached_free_pfn;
/* pfn where compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[ASYNC_AND_SYNC];
unsigned long compact_init_migrate_pfn;
unsigned long compact_init_free_pfn;
#endif

#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
* compact_order_failed is the minimum compaction failed order.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif

bool contiguous;

CACHELINE_PADDING(_pad3_);
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;
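
结合上面的 _watermark 字段,下面给一个读取某个 zone 三条水线和空闲页数的示意代码(函数名 zone_watermark_demo 为自拟,非内核既有接口):min/low/high_wmark_pages 这几个宏就是对 _watermark[] 的封装。

#include <linux/mmzone.h>
#include <linux/vmstat.h>
#include <linux/printk.h>

static void zone_watermark_demo(struct zone *zone)
{
    unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

    pr_info("%s: free=%lu min=%lu low=%lu high=%lu\n",
            zone->name, free,
            min_wmark_pages(zone),   /* 低于 min:只允许特殊分配,回收压力最大 */
            low_wmark_pages(zone),   /* 低于 low:唤醒 kswapd 做后台回收 */
            high_wmark_pages(zone)); /* 高于 high:后台回收可以停止 */
}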

Page

看完了结点和内存域之后,我们来讨论内存管理的核心元素:页帧。每个物理页帧都由一个struct page结构来表示。

struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
/*
* Five words (20/40 bytes) are available in this union.
* WARNING: bit 0 of the first word is used for PageTail(). That
* means the other users of this union MUST NOT use the bit to
* avoid collision and false-positive PageTail().
*/
union {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
union {
struct list_head lru;

/* Or, for the Unevictable "LRU list" slot */
struct {
/* Always even, to negate PageTail */
void *__filler;
/* Count page's or folio's mlocks */
unsigned int mlock_count;
};

/* Or, free page */
struct list_head buddy_list;
struct list_head pcp_list;
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
union {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
struct { /* page_pool used by netstack */
/**
* @pp_magic: magic value to avoid recycling non
* page_pool allocated pages.
*/
unsigned long pp_magic;
struct page_pool *pp;
unsigned long _pp_mapping_pad;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
* mapped so the next 3 words hold the mapping, index,
* and private fields from the source anonymous or
* page cache page while the page is migrated to device
* private memory.
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
* use the mapping, index, and private fields when
* pmem backed DAX files are mapped.
*/
};

/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};

union { /* This union is 4 bytes in size. */
/*
* For head pages of typed folios, the value stored here
* allows for determining what this page is used for. The
* tail pages of typed folios will not store a type
* (page_type == _mapcount == -1).
*
* See page-flags.h for a list of page types which are currently
* stored here.
*
* Owners of typed folios may reuse the lower 16 bit of the
* head page page_type field after setting the page type,
* but must reset these 16 bit to -1 before clearing the
* page type.
*/
unsigned int page_type;

/*
* For pages that are part of non-typed folios for which mappings
* are tracked via the RMAP, encodes the number of times this page
* is directly referenced by a page table.
*
* Note that the mapcount is always initialized to -1, so that
* transitions both from it and to it can be tracked, using
* atomic_inc_and_test() and atomic_add_negative(-1).
*/
atomic_t _mapcount;
};

/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;

#ifdef CONFIG_MEMCG
unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
unsigned long _unused_slab_obj_exts;
#endif

/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif

#ifdef CONFIG_KMSAN
/*
* KMSAN metadata for this page:
* - shadow page: every bit indicates whether the corresponding
* bit of the original page is initialized (0) or not (1);
* - origin page: every 4 bytes contain an id of the stack trace
* where the uninitialized value was created.
*/
struct page *kmsan_shadow;
struct page *kmsan_origin;
#endif
} _struct_page_alignment;

看着非常复杂,其实就是两个union加上几个独立字段,换个方式展开就能明白了。

struct page
├─ flags // 页标志位 (PG_locked, PG_dirty, etc.)
├─ union { // 主要根据页类型复用
│ ├─ Page cache / anon page
│ │ ├─ lru / buddy_list / pcp_list / mlock_count
│ │ ├─ mapping → struct address_space* //如果页帧与文件相关,指向address_space对象,且最低位为0
│ │ ├─ index/share // 页在 mapping 中的偏移 (file offset / page index)
│ │ └─ private
│ │
│ ├─ page_pool (网络栈专用)
│ │ ├─ pp_magic
│ │ ├─ pp
│ │ ├─ dma_addr
│ │ └─ pp_ref_count
│ │
│ ├─ Tail page
│ │ └─ compound_head (bit0=1 指向 head)
│ │
│ ├─ ZONE_DEVICE
│ │ ├─ pgmap → struct dev_pagemap*
│ │ └─ zone_device_data
│ │
│ └─ rcu_head

├─ union { page_type / _mapcount } // 页类型或映射计数(被多少个页表项映射(PTE 引用数))
├─ _refcount // 引用计数(核心字段),表示内核是否仍持有该页(缓存、IO、LRU等)。
├─ memcg_data / _unused_slab_obj_exts
├─ (optional) virtual // 高端内存虚拟地址(高端内存)
├─ (optional) _last_cpupid // 最近访问CPU
└─ (optional) kmsan_shadow/origin // KMSAN 元数据

以下给出几种常见用途下,struct page去除union之后实际使用的成员(列表之后再给一个操作struct page的小例子)。

  • 普通页缓存/匿名页

    struct page {
    unsigned long flags; // PG_locked, PG_uptodate, PG_dirty...
    struct list_head lru; // 在 zone->lruvec 的链表上
    struct address_space *mapping;// 所属的文件或 anon_vma
    pgoff_t index; // 页在 mapping 中的偏移 (file offset / page index)
    unsigned long private; // 私有数据: buffer_head, swap entry, 等
    atomic_t _mapcount; // 被多少个页表项映射(PTE 引用数)
    atomic_t _refcount; // 内核总引用计数
    };
  • 复合页(Compound Page,Hugepage或order>0页)

    [Head page] struct page
    ├─ flags (标记为 head)
    ├─ private (order)
    ├─ ...
    └─ compound_mapcount, refcount

    [Tail page] struct page
    ├─ compound_head = (head | 1)
    └─ 其他字段未使用
  • 伙伴系统空闲页(当页处于空闲状态时,它并不属于文件映射,而是链在 zone->free_area[order].free_list 上)

    struct page {
    unsigned long flags; // PG_buddy
    struct list_head buddy_list; // 连接到 buddy freelist
    unsigned long private; // 存储页阶 (order)
    atomic_t _refcount = 0;
    };
  • 网络页池,作为内核高速回收 DMA buffer。

    struct page {
    unsigned long pp_magic;
    struct page_pool *pp;
    unsigned long dma_addr;
    atomic_long_t pp_ref_count;
    };
  • Zone_device页(这些页并非普通 RAM,而是设备内存(如 GPU 显存或持久内存)。它们通过 struct dev_pagemap 管理。)

    struct page {
    struct dev_pagemap *pgmap; // 设备页映射信息
    void *zone_device_data; // 驱动自定义
    struct address_space *mapping;
    pgoff_t index;
    unsigned long private;
    };
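
下面再给一个操作 struct page 的小例子(函数名 page_helpers_demo 为自拟,仅作示意):演示页帧号与 struct page 的互转,以及 _refcount 的增减接口。

#include <linux/mm.h>
#include <linux/gfp.h>

static void page_helpers_demo(void)
{
    struct page *page = alloc_page(GFP_KERNEL);
    unsigned long pfn;

    if (!page)
        return;

    pfn = page_to_pfn(page);            /* struct page -> 物理页帧号 */
    pr_info("pfn=%lx refcount=%d\n", pfn, page_count(page));

    get_page(page);                     /* _refcount + 1 */
    put_page(page);                     /* _refcount - 1 */

    WARN_ON(page != pfn_to_page(pfn));  /* 页帧号 -> struct page,应当可逆 */
    __free_page(page);
}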

伙伴系统

前面说过每个Zone结构内部都有一个free_area字段, 那我们看看free_area到底是什么东西。

//include/linux/mmzone.h
struct zone{
...
struct free_area free_area[NR_PAGE_ORDERS];
...
};

struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};

大致结构如图所示:free_area[0]管理order为0(单个4K页)的空闲块,free_area[1]管理order为1(连续两页,共8K)的空闲块,依次类推。每个free_area内部又分成多个free_list,这是因为页面有不同的迁移属性,例如MIGRATE_UNMOVABLE、MIGRATE_MOVABLE、MIGRATE_RECLAIMABLE、MIGRATE_PCPTYPES等,所以按迁移类型各挂一条链表,而该order下空闲块的总数记录在nr_free中。
(图:free_area 按 order 和迁移类型组织空闲块)
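
配合上图,下面给一个按 order 打印某个 zone 各 free_area 空闲块数的示意代码(函数名 buddyinfo_demo 为自拟),效果类似 /proc/buddyinfo 中的一行;访问 free_area 前先持有 zone->lock。

#include <linux/mmzone.h>
#include <linux/spinlock.h>
#include <linux/printk.h>

static void buddyinfo_demo(struct zone *zone)
{
    unsigned int order;
    unsigned long flags;

    pr_info("zone %s nr_free per order:", zone->name);
    spin_lock_irqsave(&zone->lock, flags);
    for (order = 0; order < NR_PAGE_ORDERS; order++)
        pr_cont(" %lu", zone->free_area[order].nr_free);  /* 每个 order 的空闲块总数 */
    spin_unlock_irqrestore(&zone->lock, flags);
    pr_cont("\n");
}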

对于不同结点上不同的zone的组织方式就简化成如下图:

(图:多个 Node 上各 Zone 的组织方式)

虚拟内存管理

讲完了物理内存管理,接下来讲讲Linux最核心的机制,虚拟内存,即地址映射和页表相关的内容。

在x86架构下,实际有三种地址:逻辑地址、线性地址和物理地址。逻辑地址经过分段单元转换成线性地址,线性地址再经过页表和MMU转换成物理地址。虽然听起来比较复杂,但x86和x86_64上的Linux采用的是平坦内存模型,相当于跳过了逻辑地址到线性地址这层翻译:虽然仍然使用段寄存器和段选择子等分段机制,但所有段的基址都是0、界限也相同,也就是说虽然有用户代码段、用户数据段、内核代码段、内核数据段之分,但它们覆盖的是同一段线性地址空间,只是利用各自的特权级来做权限控制。

五级页表地址翻译

五级页表的内容相信都耳熟能详了,这里便不再过多叙述。PGD,P4D,PUD,PMD,PTE。

具体就贴几张网上blog里截的图吧。

(图:五级页表 PGD→P4D→PUD→PMD→PTE 的地址翻译流程)
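
作为补充,下面给一个软件页表遍历的示意代码(函数名 va_to_pfn_demo 为自拟,未处理大页,也省略了加锁),完整走一遍 PGD→P4D→PUD→PMD→PTE 五级,最终取出物理页帧号。

#include <linux/mm.h>
#include <linux/pgtable.h>

static unsigned long va_to_pfn_demo(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long pfn = 0;

    pgd = pgd_offset(mm, addr);               /* 第 1 级:PGD */
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return 0;
    p4d = p4d_offset(pgd, addr);              /* 第 2 级:P4D(未启用五级页表时透传) */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return 0;
    pud = pud_offset(p4d, addr);              /* 第 3 级:PUD */
    if (pud_none(*pud) || pud_bad(*pud))
        return 0;
    pmd = pmd_offset(pud, addr);              /* 第 4 级:PMD */
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return 0;
    pte = pte_offset_map(pmd, addr);          /* 第 5 级:PTE */
    if (!pte)
        return 0;
    if (pte_present(*pte))
        pfn = pte_pfn(*pte);                  /* 取出物理页帧号 */
    pte_unmap(pte);
    return pfn;
}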

进程如何管理地址空间与页表

task_struct

我们首先了解一下Linux中的task_struct结构,也就是所谓的PCB(进程控制块)。

struct task_struct {
struct thread_info thread_info;
void *stack;//内核栈地址

// 进程打开的文件信息
struct files_struct *files;
// 内存描述符表示进程虚拟地址空间
struct mm_struct *mm;
struct mm_struct *active_mm; //内核线程没有自己的mm_struct(mm域为NULL),它被调度运行时会直接借用上一个用户态进程的mm_struct,记录在active_mm中。内核线程只访问内核空间、不会访问用户空间内存,所以复用上一个进程的地址空间即可,这样既省去了为内核线程分配mm_struct和页表的开销,也使得内核线程之间的调度不涉及地址空间切换。

// 进程id
pid_t pid;
// 用于标识线程所属的进程 pid
pid_t tgid;

/* Real parent process: */
struct task_struct __rcu *real_parent;

/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;

...
}

mm_struct

其中最重要的一个结构就是mm_struct,用来表示进程的虚拟地址空间,我们看看它包含了什么。

struct mm_struct {
struct maple_tree mm_mt; //枫树!用来代替之前的红黑树+链表,是6.1版本合入的新结构
unsigned long mmap_base; /* base of mmap area */ //现版本内核从上往下,从高地址往低地址增长
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */ //早版本内核mmap从下往上,从低地址到高地址增长
unsigned long task_size; /* size of task vm space */ //task_size表示用户态虚拟地址空间的大小,即用户空间的上界:x86(32位)下为`0xC0000000`,x86_64下为`0x0000 7FFF FFFF F000`
pgd_t * pgd; //页全局目录(PGD)的基址,这里存的是虚拟地址,即该物理页在直接映射区(direct mapping)中对应的虚拟地址
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
......
}
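
下面给一个读取 mm_struct 布局字段的示意代码(函数名 dump_mm_layout_demo 为自拟):从 current->mm 里取出代码段、堆、栈等区间的位置,内核线程的 mm 为 NULL,需要判空。

#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/printk.h>

static void dump_mm_layout_demo(void)
{
    struct mm_struct *mm = current->mm;

    if (!mm)        /* 内核线程没有用户地址空间 */
        return;

    pr_info("code  [%lx-%lx)\n", mm->start_code, mm->end_code);
    pr_info("data  [%lx-%lx)\n", mm->start_data, mm->end_data);
    pr_info("heap  [%lx-%lx)\n", mm->start_brk, mm->brk);
    pr_info("stack top %lx, mmap_base %lx\n", mm->start_stack, mm->mmap_base);
}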

其中有一个关键的成员struct maple_tree mm_mt;,它取代了以前的红黑树,通过这棵枫树可以快速找到各个vm_area_struct结构。替换的主要原因是:红黑树还需要额外维护一条链表来支持顺序遍历,在多线程场景下mmap/munmap存在以下问题:

  • cache locality 差:红黑树节点分散,cache miss 高

  • 链表 + 树双维护:需要同时维护两种结构,比较复杂

  • mmap_lock 粒度粗:读写都要竞争 mmap_lock,造成全局竞争

  • RCU 不友好:查找时需要加锁或引用计数

而maple_tree有以下优点(列表之后给出一个用maple tree迭代器遍历VMA的小示例):

  • 使用「块状节点」存储多个 VMA 指针;

  • 节点内连续数组便于缓存;

  • 支持 RCU 安全遍历;

  • 极大减少锁冲突。
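
下面给一个用 maple tree 迭代器遍历进程所有 VMA 的示意代码(函数名 dump_vmas_demo 为自拟),用来取代老内核里沿 vma->vm_next 链表遍历的写法,遍历期间需要持有 mmap_lock 读锁。

#include <linux/mm.h>
#include <linux/mmap_lock.h>

static void dump_vmas_demo(struct mm_struct *mm)
{
    struct vm_area_struct *vma;
    VMA_ITERATOR(vmi, mm, 0);        /* 从地址 0 开始的迭代器,底层就是 mm->mm_mt */

    mmap_read_lock(mm);
    for_each_vma(vmi, vma)
        pr_info("vma [%lx-%lx) flags=%lx\n",
                vma->vm_start, vma->vm_end, vma->vm_flags);
    mmap_read_unlock(mm);
}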

vm_area_struct

先看看结构体里有哪些东西。可以发现老版本中的struct rb_node vm_rb;成员已经被删除了,其余成员还是很直观的:一段区域的起始和结束地址、文件映射相关的内容,以及权限等标志位,具体含义看注释即可。

struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */

union {
struct {
/* VMA covers [vm_start; vm_end) addresses within mm */
unsigned long vm_start;
unsigned long vm_end;
};
#ifdef CONFIG_PER_VMA_LOCK
struct rcu_head vm_rcu; /* Used for deferred freeing. */
#endif
};

struct mm_struct *vm_mm; /* The address space we belong to. */
pgprot_t vm_page_prot; /* Access permissions of this VMA. */

/*
* Flags, see mm.h.
* To modify use vm_flags_{init|reset|set|clear|mod} functions.
*/
union {
const vm_flags_t vm_flags;
vm_flags_t __private __vm_flags;
};

#ifdef CONFIG_PER_VMA_LOCK
/* Flag to indicate areas detached from the mm->mm_mt tree */
bool detached;

/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
* - vm_lock->lock (in write mode)
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
* - vm_lock->lock (in read or write mode)
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
* This sequence counter is explicitly allowed to overflow; sequence
* counter reuse can only lead to occasional unnecessary use of the
* slowpath.
*/
int vm_lock_seq;
struct vma_lock *vm_lock;
#endif

/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
*
*/
struct {
struct rb_node rb;
unsigned long rb_subtree_last;
} shared;

/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
struct list_head anon_vma_chain; /* Serialized by mmap_lock &
* page_table_lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */

/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;

/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */

#ifdef CONFIG_ANON_VMA_NAME
/*
* For private and shared anonymous mappings, a pointer to a null
* terminated string containing the name given to the vma, or NULL if
* unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
*/
struct anon_vma_name *anon_name;
#endif
#ifdef CONFIG_SWAP
atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
#ifdef CONFIG_NUMA_BALANCING
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

Maple Tree 枫树

既然都看到这了,就顺便学习一下枫树吧,把B树相关的内容也一起看了。网上这方面的教程很少,基本只能自己啃源码。

首先是maple tree的定义,具体如下:

/*
* If the tree contains a single entry at index 0, it is usually stored in
* tree->ma_root. To optimise for the page cache, an entry which ends in '00',
* '01' or '11' is stored in the root, but an entry which ends in '10' will be
* stored in a node. Bits 3-6 are used to store enum maple_type.
*
* The flags are used both to store some immutable information about this tree
* (set at tree creation time) and dynamic information set under the spinlock.
*
* Another use of flags are to indicate global states of the tree. This is the
* case with the MAPLE_USE_RCU flag, which indicates the tree is currently in
* RCU mode. This mode was added to allow the tree to reuse nodes instead of
* re-allocating and RCU freeing nodes when there is a single user.
*/
struct maple_tree {
union {
//写操作(例如插入、删除 VMA)时通常需要持有这个锁。
spinlock_t ma_lock; //如果本结构体需要加锁则使用这个写锁,读结构一般不加锁,所以可以RCU支持。RCU(read-copy update),指的是要写指针的时候拷贝一份再更新,因此读写不用互斥,仅有写写需要互斥。
lockdep_map_p ma_external_lock; //或者使用外部代码的锁,这里可以忽略。当外部代码(例如 mmap_lock)负责保护整个树时,Maple Tree 就可以复用外部锁的 lockdep 信息,而不用自己的 spinlock。lockdep_map_p 是一个 “lockdep 伪锁” 类型,用来让 lockdep 工具追踪锁顺序,而不实际参与加锁。
};
unsigned int ma_flags; //用于记录树的全局状态与配置选项,至于maple_flags看下面定义
void __rcu *ma_root; //这个是 Maple Tree 的根指针(root pointer),但它是一个 “encoded pointer”(即 maple_enode 类型),带有位域信息。
};

/**
* DOC: Maple tree flags
*
* * MT_FLAGS_ALLOC_RANGE - Track gaps in this tree
* * MT_FLAGS_USE_RCU - Operate in RCU mode
* * MT_FLAGS_HEIGHT_OFFSET - The position of the tree height in the flags
* * MT_FLAGS_HEIGHT_MASK - The mask for the maple tree height value
* * MT_FLAGS_LOCK_MASK - How the mt_lock is used
* * MT_FLAGS_LOCK_IRQ - Acquired irq-safe
* * MT_FLAGS_LOCK_BH - Acquired bh-safe
* * MT_FLAGS_LOCK_EXTERN - mt_lock is not used
*
* MAPLE_HEIGHT_MAX The largest height that can be stored
*/

#define MT_FLAGS_ALLOC_RANGE 0x01
#define MT_FLAGS_USE_RCU 0x02
#define MT_FLAGS_HEIGHT_OFFSET 0x02
#define MT_FLAGS_HEIGHT_MASK 0x7C
#define MT_FLAGS_LOCK_MASK 0x300
#define MT_FLAGS_LOCK_IRQ 0x100
#define MT_FLAGS_LOCK_BH 0x200
#define MT_FLAGS_LOCK_EXTERN 0x300
#define MT_FLAGS_ALLOC_WRAPPED 0x0800

/*
* The Maple Tree squeezes various bits in at various points which aren't
* necessarily obvious. Usually, this is done by observing that pointers are
* N-byte aligned and thus the bottom log_2(N) bits are available for use. We
* don't use the high bits of pointers to store additional information because
* we don't know what bits are unused on any given architecture.
*
* Nodes are 256 bytes in size and are also aligned to 256 bytes, giving us 8
* low bits for our own purposes. Nodes are currently of 4 types:
* 1. Single pointer (Range is 0-0)
* 2. Non-leaf Allocation Range nodes
* 3. Non-leaf Range nodes
* 4. Leaf Range nodes All nodes consist of a number of node slots,
* pivots, and a parent pointer.
*/

struct maple_node {
union {
struct {
struct maple_pnode *parent;
void __rcu *slot[MAPLE_NODE_SLOTS];
};
struct {
void *pad;
struct rcu_head rcu;
struct maple_enode *piv_parent;
unsigned char parent_slot;
enum maple_type type;
unsigned char slot_len;
unsigned int ma_flags;
};
struct maple_range_64 mr64;
struct maple_arange_64 ma64;
struct maple_alloc alloc;
};
};

OK,看完了树的定义,可以发现核心其实都在ma_root这个指针里,我们来看看不同情况下该指针存放的内容。当树为空时,ma_root存放的就是NULL,非常好理解。当树只存放一个entry时,根据注释可知,指针的低位被用来编码信息:如果该entry的低两位是00/01/11,就直接存放在ma_root中;如果低两位是10,则会与内部节点指针的编码规则冲突(编码后的节点指针用bit 3-6存放enum maple_type),所以这种entry会被强制放进一个node里再挂到根上。
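
在继续看内部实现之前,先给一个 maple tree 基本用法的示意代码(函数名 maple_demo 为自拟,entry 的取值仅作演示):先插入一个单点,再插入一个区间,然后用区间内任意 index 查询。

#include <linux/maple_tree.h>
#include <linux/gfp.h>
#include <linux/printk.h>

static DEFINE_MTREE(demo_mt);

static int maple_demo(void)
{
    void *v1 = (void *)0x1000;   /* entry 低两位不是 10,可以直接存放在根指针 */
    void *v2 = (void *)0x2000;
    void *found;
    int ret;

    ret = mtree_insert(&demo_mt, 5, v1, GFP_KERNEL);              /* 单个 index=5 */
    if (ret)
        return ret;
    ret = mtree_insert_range(&demo_mt, 100, 199, v2, GFP_KERNEL); /* 区间 [100, 199] */
    if (ret)
        return ret;

    found = mtree_load(&demo_mt, 150);   /* 区间内任意 index 都能查到 v2 */
    pr_info("load(150) = %px\n", found);

    mtree_erase(&demo_mt, 5);            /* 删除 index=5 上的 entry */
    mtree_destroy(&demo_mt);
    return 0;
}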

然后我们再来看一下maple state的状态枚举,用来标识一次树操作当前所处的状态。

/*
* Maple State Status
* ma_active means the maple state is pointing to a node and offset and can
* continue operating on the tree.
* ma_start means we have not searched the tree.
* ma_root means we have searched the tree and the entry we found lives in
* the root of the tree (ie it has index 0, length 1 and is the only entry in
* the tree).
* ma_none means we have searched the tree and there is no node in the
* tree for this entry. For example, we searched for index 1 in an empty
* tree. Or we have a tree which points to a full leaf node and we
* searched for an entry which is larger than can be contained in that
* leaf node.
* ma_pause means the data within the maple state may be stale, restart the
* operation
* ma_overflow means the search has reached the upper limit of the search
* ma_underflow means the search has reached the lower limit of the search
* ma_error means there was an error, check the node for the error number.
*/
enum maple_status {
ma_active, //指向一个node,可以继续对树进行操作
ma_start, //还没开始搜索树
ma_root, //已经搜索过树,找到的entry就存放在根指针中(index为0,且是树里唯一的entry)
ma_none, //已经搜索过树,但没有找到对应的entry,例如在空树中查找index 1
ma_pause,
ma_overflow,
ma_underflow,
ma_error,
};

然后看看往一棵空树添加节点的操作,分为插入单个值和插入一个range两种。

int mtree_insert(struct maple_tree *mt, unsigned long index, void *entry,
gfp_t gfp)
{
return mtree_insert_range(mt, index, index, entry, gfp);
}
EXPORT_SYMBOL(mtree_insert);

/**
* mtree_insert_range() - Insert an entry at a given range if there is no value.
* @mt: The maple tree
* @first: The start of the range
* @last: The end of the range
* @entry: The entry to store
* @gfp: The GFP_FLAGS to use for allocations.
*
* Return: 0 on success, -EEXISTS if the range is occupied, -EINVAL on invalid
* request, -ENOMEM if memory could not be allocated.
*/
int mtree_insert_range(struct maple_tree *mt, unsigned long first,
unsigned long last, void *entry, gfp_t gfp)
{
MA_STATE(ms, mt, first, last);

if (WARN_ON_ONCE(xa_is_advanced(entry)))
return -EINVAL;

if (first > last)
return -EINVAL;

mtree_lock(mt);
retry:
mas_insert(&ms, entry);
if (mas_nomem(&ms, gfp))
goto retry;

mtree_unlock(mt);
if (mas_is_err(&ms))
return xa_err(ms.node);

return 0;
}
EXPORT_SYMBOL(mtree_insert_range);

可以看到核心在于声明了一个ms变量,即ma_state结构体:

#define MA_STATE(name, mt, first, end)					\
struct ma_state name = { \
.tree = mt, \
.index = first, \
.last = end, \
.node = NULL, \ //初始状态树没有任何节点
.status = ma_start, \
.min = 0, \
.max = ULONG_MAX, \
.alloc = NULL, \
.mas_flags = 0, \
}


struct ma_state {
struct maple_tree *tree; /* The tree we're operating in */// 指向目标树
unsigned long index; // 当前操作的起始index(first)
unsigned long last; // 当前操作的终止index(last)
struct maple_enode *node; /* The node containing this entry */
unsigned long min; /* The minimum index - implied pivot min */// 当前节点管理的index区间
unsigned long max; /* The maximum index - implied pivot max */// 当前节点管理的index区间
struct maple_alloc *alloc; /* Allocated nodes for this operation */ // 若需要分配新节点,指向预分配页
enum maple_status status; /* The status of the state (active, start, none, etc) */ // 状态机标记(ma_start, ma_none, ma_data等)
unsigned char depth; /* depth of tree descent during write */
unsigned char offset;
unsigned char mas_flags;
unsigned char end; /* The end of the node */
};

然后调用mas_insert(&ms, entry)。那我们来看看mas_insert的相关代码:在该函数中,同样用宏声明了一个MA_WR_STATE。

/**
* mas_insert() - Internal call to insert a value
* @mas: The maple state
* @entry: The entry to store
*
* Return: %NULL or the contents that already exists at the requested index
* otherwise. The maple state needs to be checked for error conditions.
*/
static inline void *mas_insert(struct ma_state *mas, void *entry)
{
/*
#define MA_WR_STATE(name, ma_state, wr_entry) \
struct ma_wr_state name = { \
.mas = ma_state, \
.content = NULL, \
.entry = wr_entry, \
}
进入函数后首先创建一个ma_wr_state写操作上下文:
struct ma_wr_state {
struct ma_state *mas; // 指向当前 ma_state
void *content; // 现有内容(如果已存在)
void *entry; // 要插入的新 entry
};
*/
MA_WR_STATE(wr_mas, mas, entry);

/*
* Inserting a new range inserts either 0, 1, or 2 pivots within the
* tree. If the insert fits exactly into an existing gap with a value
* of NULL, then the slot only needs to be written with the new value.
* If the range being inserted is adjacent to another range, then only a
* single pivot needs to be inserted (as well as writing the entry). If
* the new range is within a gap but does not touch any other ranges,
* then two pivots need to be inserted: the start - 1, and the end. As
* usual, the entry must be written. Most operations require a new node
* to be allocated and replace an existing node to ensure RCU safety,
* when in RCU mode. The exception to requiring a newly allocated node
* is when inserting at the end of a node (appending). When done
* carefully, appending can reuse the node in place.
*/
wr_mas.content = mas_start(mas);
if (wr_mas.content)
goto exists; //已存在,不允许重复插入

if (mas_is_none(mas) || mas_is_ptr(mas)) {
mas_store_root(mas, entry); //如果是空树,直接在root插入entry
return NULL;
}

/* spanning writes always overwrite something */
if (!mas_wr_walk(&wr_mas))
goto exists;

/* At this point, we are at the leaf node that needs to be altered. */
wr_mas.offset_end = mas->offset;
wr_mas.end_piv = wr_mas.r_max;

if (wr_mas.content || (mas->last > wr_mas.r_max))
goto exists;

if (!entry)
return NULL;

mas_wr_modify(&wr_mas);
return wr_mas.content;

exists:
mas_set_err(mas, -EEXIST);
return wr_mas.content;

}

也就是说如果是空树,则到mas_store_root函数中进行处理。

static inline void mas_store_root(struct ma_state *mas, void *entry)
{
if (likely((mas->last != 0) || (mas->index != 0))) //插入的不是index 0上的单点entry(是一个范围或非0下标)
mas_root_expand(mas, entry);
else if (((unsigned long) (entry) & 3) == 2) //低两位是10
mas_root_expand(mas, entry); //将根扩大
else {
rcu_assign_pointer(mas->tree->ma_root, entry); //如果是插入单独一个数字,则直接把该entry作为根
mas->status = ma_start; //状态改成还没开始搜索树
}
}

/*
* mas_root_expand() - Expand a root to a node
* @mas: The maple state
* @entry: The entry to store into the tree
*/
static inline int mas_root_expand(struct ma_state *mas, void *entry)
{
void *contents = mas_root_locked(mas);
enum maple_type type = maple_leaf_64;
struct maple_node *node;
void __rcu **slots;
unsigned long *pivots;
int slot = 0;

mas_node_count(mas, 1); //确保allocated数量要大于1,不然就多预分配几个
if (unlikely(mas_is_err(mas)))
return 0;

node = mas_pop_node(mas); // 分配一个新的 maple_node
pivots = ma_pivots(node, type);
slots = ma_slots(node, type);
node->parent = ma_parent_ptr(mas_tree_parent(mas));
mas->node = mt_mk_node(node, type);
mas->status = ma_active;

if (mas->index) {
if (contents) {
rcu_assign_pointer(slots[slot], contents);
if (likely(mas->index > 1))
slot++;
}
pivots[slot++] = mas->index - 1;
}

rcu_assign_pointer(slots[slot], entry); //slots成员用于存储entry,也就是vm_area_struct
mas->offset = slot;
pivots[slot] = mas->last;
if (mas->last != ULONG_MAX)
pivots[++slot] = ULONG_MAX;

mas->depth = 1;
mas_set_height(mas);
ma_set_meta(node, maple_leaf_64, 0, slot);
/* swap the new root into the tree */
rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node));
return slot;
}

最后给出一个完整的例子,看看vm_area_struct是如何组织成maple_tree的。这方面网上资料太少,找到一篇论文里的图,能够很好地说明各个节点和内容。具体来说,pivot用来表示各个vm_area_struct覆盖的地址范围,slots用来存储vm_area_struct(叶子节点)或者下一级节点(非叶子节点)的地址。这样看就非常了然了。

(图:vm_area_struct 在 maple tree 中按 pivot/slot 组织的示意,引自论文)

可以看到非leaf node用的是maple_arange_64结构体,其slots存储的是下一级节点的指针;而leaf node使用的是maple_range_64结构体,其slots存储user data(这里就是vm_area_struct指针)。

/*
* Leaf nodes do not store pointers to nodes, they store user data. Users may
* store almost any bit pattern. As noted above, the optimisation of storing an
* entry at 0 in the root pointer cannot be done for data which have the bottom
* two bits set to '10'. We also reserve values with the bottom two bits set to
* '10' which are below 4096 (ie 2, 6, 10 .. 4094) for internal use. Some APIs
* return errnos as a negative errno shifted right by two bits and the bottom
* two bits set to '10', and while choosing to store these values in the array
* is not an error, it may lead to confusion if you're testing for an error with
* mas_is_err().
*
* Non-leaf nodes store the type of the node pointed to (enum maple_type in bits
* 3-6), bit 2 is reserved. That leaves bits 0-1 unused for now.
*
* In regular B-Tree terms, pivots are called keys. The term pivot is used to
* indicate that the tree is specifying ranges, Pivots may appear in the
* subtree with an entry attached to the value whereas keys are unique to a
* specific position of a B-tree. Pivot values are inclusive of the slot with
* the same index.
*/

struct maple_range_64 {
struct maple_pnode *parent;
unsigned long pivot[MAPLE_RANGE64_SLOTS - 1];
union {
void __rcu *slot[MAPLE_RANGE64_SLOTS];
struct {
void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
struct maple_metadata meta;
};
};
};

/*
* At tree creation time, the user can specify that they're willing to trade off
* storing fewer entries in a tree in return for storing more information in
* each node.
*
* The maple tree supports recording the largest range of NULL entries available
* in this node, also called gaps. This optimises the tree for allocating a
* range.
*/
struct maple_arange_64 {
struct maple_pnode *parent;
unsigned long pivot[MAPLE_ARANGE64_SLOTS - 1];
void __rcu *slot[MAPLE_ARANGE64_SLOTS];
unsigned long gap[MAPLE_ARANGE64_SLOTS];
struct maple_metadata meta;
};

内核中一些内存分配函数的细节与区别

伙伴系统分配函数

首先是前面说过的伙伴系统,最主要的分配接口是alloc_page和alloc_pages。

//include/linux/gfp.h

#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))

//NUMA架构
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;

/*
* No reference counting needed for current->mempolicy
* nor system default_policy
*/
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);

return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX,
numa_node_id());
}

//非NUMA架构
static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order);
}

然后可以看到,alloc_pages_noprof在NUMA架构和非NUMA架构下是两个不同的实现,调用路径也不同,下面分别来看。

NUMA架构

NUMA架构下会根据进程的内存策略(mempolicy)选择节点,从伙伴系统中获取1<<order个连续页面。

struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
nodemask_t *nodemask; //准备node掩码
struct page *page; //准备return value

nodemask = policy_nodemask(gfp, pol, ilx, &nid); //根据mempolicy的策略,分析可以从哪些node上分配页面。

if (pol->mode == MPOL_PREFERRED_MANY) //当策略允许多个“首选节点”时,跳过后续逻辑,用专门路径分配。
return alloc_pages_preferred_many(gfp, order, nid, nodemask);

if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
/* filter "hugepage" allocation, unless from alloc_pages() */
order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) { //分配透明大页,优先从本节点获取
/*
* For hugepage allocation and non-interleave policy which
* allows the current node (or other explicitly preferred
* node) we only try to allocate from the current/preferred
* node and don't fall back to other nodes, as the cost of
* remote accesses would likely offset THP benefits.
*
* If the policy is interleave or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
if (pol->mode != MPOL_INTERLEAVE &&
pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
(!nodemask || node_isset(nid, *nodemask))) {
/*
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
page = __alloc_pages_node_noprof(nid,
gfp | __GFP_THISNODE | __GFP_NORETRY, order); //尝试先在本地节点路径上快速分配,不触发内存回收
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
/*
* If hugepage allocations are configured to always
* synchronous compact or the vma has been madvised
* to prefer hugepage backing, retry allowing remote
* memory with both reclaim and compact as well.
*/
}
}

page = __alloc_pages_noprof(gfp, order, nid, nodemask); //如果前面失败了,就调用正常路径分配,可以跨节点,可能触发内存回收,会根据nodemask限制分配节点

if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
/* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
if (static_branch_likely(&vm_numa_stat_key) &&
page_to_nid(page) == nid) {
preempt_disable();
__count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
preempt_enable();
}
}

return page;
}

注意上面THP的本地节点快速尝试不会触发直接内存回收,因为额外加上了__GFP_THISNODE | __GFP_NORETRY;后面的慢速路径仍然使用原来的gfp mask。

static inline struct page *
__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask); //检查节点是否已下线,若下线则给出警告

return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
}

/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW; //低水位线
gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = { };

/*
* There are several places where we assume that the order value is sane
* so bail out early if the request is out of bound.
*/
if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) //防止分配过大的页面 这里指的是大于2^10
return NULL;

gfp &= gfp_allowed_mask;
/*
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
* resp. GFP_NOIO which has to be inherited for all allocation requests
* from a particular context which has been marked by
* memalloc_no{fs,io}_{save,restore}. And PF_MEMALLOC_PIN which ensures
* movable zones are not used during allocation.
*/
gfp = current_gfp_context(gfp);
alloc_gfp = gfp;
if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
&alloc_gfp, &alloc_flags)) //做一些准备工作,比如确定目标Zone,调整gfp,检查CPUSET相关等。
return NULL;

/*
* Forbid the first pass from falling back to types that fragment
* memory until all local zones are considered.
*/
alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

/* First allocation attempt */
page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); //快速路径尝试,最后调用rmqueue() 从伙伴系统取页
if (likely(page))
goto out;

alloc_gfp = gfp;
ac.spread_dirty_pages = false;

/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
*/
ac.nodemask = nodemask;

page = __alloc_pages_slowpath(alloc_gfp, order, &ac); //执行慢速路径分配,可能进行内存压缩(compaction)和直接回收,甚至触发OOM。

out:
if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
__free_pages(page, order);
page = NULL;
}

trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);

return page;
}

非NUMA架构

非NUMA架构最终也是调用上面的__alloc_pages_noprof,只是传入的参数不太一样,这里就不详细展开了。

static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

return __alloc_pages_node_noprof(nid, gfp_mask, order);
}

其他buddy之上的分配函数

__get_free_page / __get_free_pages

__get_dma_pages

alloc_pages_exact

get_zeroed_page

这些接口最终都会落到__get_free_pages(get_zeroed_page、__get_dma_pages是对它的简单封装,alloc_pages_exact在它的基础上把多余的尾部页释放掉),__get_free_pages再调用底层的buddy分配函数alloc_pages,并通过page_address()返回直接映射区的虚拟地址。
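
下面给一个这几个封装接口的典型用法示意(函数名 getfreepages_demo 为自拟):注意它们返回的是直接映射区的虚拟地址(unsigned long)而不是 struct page *,因此不适用于高端内存页。

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static void getfreepages_demo(void)
{
    unsigned long addr, zero;

    addr = __get_free_pages(GFP_KERNEL, 2);  /* 2^2 = 4 个连续页 */
    if (addr) {
        memset((void *)addr, 0xab, 4 * PAGE_SIZE);
        free_pages(addr, 2);                 /* 与分配时的 order 配对释放 */
    }

    zero = get_zeroed_page(GFP_KERNEL);      /* 单页,内容已清零 */
    if (zero)
        free_page(zero);
}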