Description
[This change request best read in a mono-spaced font]
Brief summary
The default behavior of the kernel appears to be to prioritize page
size over the home lgroup for memory allocation. This is not always
the best choice. This behavior should be customer-settable.
Slightly longer summary: three intertwined issues regarding fallback
When the NUMA VM code is unable to allocate a page of the size and
location requested, the operating system needs to become more
intelligent about fallbacks. As things stand today:
- 1. The kernel is too likely to make the wrong fallback decision.
- 2. If subjected to a long lasting workload, the kernel does not
appear to learn to improve the fallback.
- 3. Users do not have a way to teach it to make better fallback
decisions.
Politeness interrupt
It is understood that this CR touches on long standing issues, and
that work has been done over the years on these issues. That work is
appreciated. Thank you for progress thus far, and apologies if the
above seems too sharply stated. But more work is needed.
Only the present
Rather than getting lost in a twisty little maze of change request
history, this CR intentionally confines itself to the present moment:
Solaris 10 Update 4 plus patches. Details of patch levels are in the
comments, along with a description of behavior on an unannounced
processor.
Bug or RFE?
This report is classified as a bug on the grounds that it is the _job_
of the NUMA VM code to make intelligent decisions about page placement,
and the submitter respectfully suggests that this job is not being
fulfilled as intelligently as it should be.
Thank you for, at least, considering this point of view.
Details follow on the three numbered points from above.
=================================
1. The kernel is too likely to make the wrong fallback decision.
What should be done when the requested page size is not available in
the requested locality group? Roughly, it would seem that there are
three basic fallback alternatives:
- (a) Settle for a different size.
- (b) Settle for a different location.
- (c) Attempt to manufacture the requested size in the requested
location.
From external observation, it appears that the OS often (typically?
always?) picks (b). If this decision was based on some analysis of
the cost of TLB handling vs. the cost of remote accesses, perhaps that
analysis should be done afresh, in light of timings on contemporary
processors, which have worked to reduce the cost of TLB misses.
Presumably the cost is highly dependent on what the program does with
the page. If you're going to allocate it, peek at a few bytes, and
throw it away, the cost calculation is very different than if you
allocate it and then use all of it.
To give an example with a simple model that has round numbers, suppose
that taking a TLB miss costs on the order of 1000 ns, local pages are
accessible in 100ns, and remote pages are accessible in 200ns.
[Aside: Yes, yes, these round numbers aren't right. Fix them with
better numbers, if you wish! When you fix them, though, please
take into account that pure lmbench numbers may understate the
costs of remote accesses, because lmbench may effectively assume
that the remote processor and memory have nothing better to do
than to satisfy your request.]
Sparse access (worst case)
A program requests a 4MB page and touches 64 bytes on it, scattered
evenly across the page (that is, 64KB apart). In our simple model,
the costs would be one of these:
     7,400 ns   Local bigpage   64 local accesses,  1 TLB miss
    70,400 ns   Fallback (a)    64 local accesses,  64 TLB misses
    13,800 ns   Fallback (b)    64 remote accesses, 1 TLB miss
For this example, fallback (b) is indeed the better choice.
Dense access (NOT worst case)
Suppose a program requests a 4MB page and updates all of it, reading
each 64-byte cache line and writing it back. This would be 131072
accesses to the page (65536 cache lines, each read once and written
once). In our simple model, the costs would be:
    13,108,200 ns   Local bigpage   131072 local accesses,  1 TLB miss
    13,171,200 ns   Fallback (a)    131072 local accesses,  64 TLB misses
    26,215,400 ns   Fallback (b)    131072 remote accesses, 1 TLB miss
For this example, we're better off with fallback (a).
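The arithmetic in the two examples above can be reproduced with a short
sketch. It assumes the round numbers of the simple model (1000 ns per TLB
miss, 100 ns local, 200 ns remote) and assumes that fallback (a) means
sixty-four 64KB pages in place of one 4MB page. The function and constant
names are illustrative only; nothing here is kernel code.

```python
# Round numbers from the simple model above; these are not measured values.
TLB_MISS_NS = 1000
LOCAL_NS = 100
REMOTE_NS = 200


def cost_ns(accesses, access_ns, tlb_misses):
    """Total cost: every access pays its latency, plus the TLB misses taken."""
    return accesses * access_ns + tlb_misses * TLB_MISS_NS


def scenario(accesses, small_pages=64):
    """Costs for one 4MB request served three ways.

    small_pages=64 assumes fallback (a) hands out 64KB pages (4MB / 64KB).
    """
    return {
        "local bigpage": cost_ns(accesses, LOCAL_NS, 1),
        "fallback (a)": cost_ns(accesses, LOCAL_NS, small_pages),
        "fallback (b)": cost_ns(accesses, REMOTE_NS, 1),
    }


sparse = scenario(64)                            # 64 scattered touches
dense = scenario(2 * (4 * 1024 * 1024 // 64))    # read + write of every 64-byte line

for name, costs in (("sparse", sparse), ("dense", dense)):
    for choice, ns in costs.items():
        print(f"{name:6s} {choice:14s} {ns:>12,d} ns")
```

Substituting better latency numbers for the three constants is the
intended way to redo the comparison for a particular machine.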
Dense access (worst case)
Please note that the previous example is not "worst case" for dense
access. In the worst case, we would indefinitely iterate over the
data -- and certainly there are real-life algorithms that do
repeated accesses to relatively small address spaces.
As a program iterates indefinitely over a small dataset, the advantage
of fallback (a) over fallback (b) grows without bound.
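To put a rough number on this, the sketch below uses the same round-number
model to find the access count at which fallback (a) starts to win, and to
show how the gap grows with repeated passes. It assumes 64KB pages for
fallback (a), and it is deliberately pessimistic for (a): the 64 TLB misses
are charged again on every pass, as if no TLB entry survived between passes.
All names are illustrative.

```python
# Round numbers from the simple model; these are not measured values.
TLB_MISS_NS, LOCAL_NS, REMOTE_NS = 1000, 100, 200
SMALL_PAGES = 64  # assumes 64KB pages backing a 4MB request


def break_even_accesses():
    """Smallest access count at which fallback (a) is strictly cheaper than (b).

    Solves n*LOCAL + SMALL_PAGES*TLB < n*REMOTE + 1*TLB for integer n.
    """
    extra_tlb_ns = (SMALL_PAGES - 1) * TLB_MISS_NS
    per_access_saving_ns = REMOTE_NS - LOCAL_NS
    return extra_tlb_ns // per_access_saving_ns + 1


def advantage_ns(passes, accesses_per_pass=131072):
    """How much cheaper (a) is than (b) after `passes` full passes.

    Pessimistic for (a): its TLB misses are charged again on every pass.
    """
    a = accesses_per_pass * LOCAL_NS + SMALL_PAGES * TLB_MISS_NS
    b = accesses_per_pass * REMOTE_NS + 1 * TLB_MISS_NS
    return passes * (b - a)


print("break-even accesses:", break_even_accesses())
for passes in (1, 10, 100):
    print(f"{passes:4d} passes: (a) ahead by {advantage_ns(passes):,d} ns")
```

Under these assumptions the advantage is linear in the number of passes, so
it has no upper limit for a program that keeps iterating.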
=================================
2. If subjected to a long lasting workload, the kernel does not
appear to learn to improve the fallback.
Workloads with heavy demand for large pages do not appear to encourage
the operating system to increase the supply of local bigpages, even
when the workload persists for days. For example, a workload was run
with 12 applications that made varying, but sometimes heavy, demand for
4MB pages. The largest of the 12 applications requests ~950 MB. All
the applications were compiled with -xpagesize=4M.
The tested system had 18 locality groups, each with 16 GB, _except_ for
the locality group with Solaris itself, which had 32GB. Notice in this
extract from 'lgrpinfo' that each leaf lgroup has about the same amount
of memory allocated:
Memory: installed 32768 Mb, allocated 8577 Mb, free 24191 Mb
Memory: installed 16384 Mb, allocated 8307 Mb, free 8077 Mb
Memory: installed 16384 Mb, allocated 8344 Mb, free 8040 Mb
Memory: installed 16384 Mb, allocated 8286 Mb, free 8098 Mb
Memory: installed 16384 Mb, allocated 8289 Mb, free 8095 Mb
Memory: installed 16384 Mb, allocated 8267 Mb, free 8117 Mb
Memory: installed 16384 Mb, allocated 8251 Mb, free 8133 Mb
Memory: installed 16384 Mb, allocated 9293 Mb, free 7091 Mb
Memory: installed 16384 Mb, allocated 8270 Mb, free 8114 Mb
Memory: installed 16384 Mb, allocated 8228 Mb, free 8156 Mb
Memory: installed 16384 Mb, allocated 8219 Mb, free 8165 Mb
Memory: installed 16384 Mb, allocated 8185 Mb, free 8199 Mb
Memory: installed 16384 Mb, allocated 8275 Mb, free 8109 Mb
Memory: installed 16384 Mb, allocated 8208 Mb, free 8176 Mb
Memory: installed 16384 Mb, allocated 8191 Mb, free 8193 Mb
Memory: installed 16384 Mb, allocated 8277 Mb, free 8107 Mb
Memory: installed 16384 Mb, allocated 8218 Mb, free 8166 Mb
Memory: installed 16384 Mb, allocated 8217 Mb, free 8167 Mb
In each locality group, 8 user processes were run, EXCEPT in the first
lgroup, where only 7 were run. Thus the first lgroup is more lightly
loaded, AND has plenty of free memory (24191 MB above). Nevertheless,
even after more than 24 hours, processes run in the first lgroup (and
only processes run in the first lgroup) show page scattering. For
example:
Address Bytes Pgsz Mode Lgrp Mapped File
00800000 4096K 4M rwx-- 9 [ heap ]
00C00000 4096K 4M rwx-- 3 [ heap ]
01000000 4096K 4M rwx-- 10 [ heap ]
01400000 4096K 4M rwx-- 7 [ heap ]
01800000 4096K 4M rwx-- 15 [ heap ]
01C00000 4096K 4M rwx-- 3 [ heap ]
02000000 8192K 4M rwx-- 12 [ heap ]
02800000 4096K 4M rwx-- 14 [ heap ]
02C00000 4096K 4M rwx-- 8 [ heap ]
03000000 4096K 4M rwx-- 4 [ heap ]
03400000 4096K 4M rwx-- 6 [ heap ]
03800000 4096K 4M rwx-- 4 [ heap ]
03C00000 4096K 4M rwx-- 9 [ heap ]
04000000 4096K 4M rwx-- 14 [ heap ]
04400000 4096K 4M rwx-- 5 [ heap ]
04800000 8192K 4M rwx-- 10 [ heap ]
05000000 4096K 4M rwx-- 8 [ heap ]
05400000 4096K 4M rwx-- 5 [ heap ]
05800000 4096K 4M rwx-- 8 [ heap ]
05C00000 4096K 4M rwx-- 14 [ heap ]
06000000 4096K 4M rwx-- 16 [ heap ]
06400000 4096K 4M rwx-- 6 [ heap ]
06800000 4096K 4M rwx-- 4 [ heap ]
06C00000 4096K 4M rwx-- 17 [ heap ]
07000000 4096K 4M rwx-- 4 [ heap ]
07400000 4096K 4M rwx-- 5 [ heap ]
(continues for 200 more lines)
Emphasis: there is plenty of free memory in the first locality group.
There is a processor that is almost entirely free. Nevertheless, even
after more than 24 hours, local bigpage allocations are not completed.
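For anyone wishing to quantify the scattering rather than eyeball it, here
is a minimal sketch that tallies heap kilobytes per lgroup. It assumes input
lines shaped exactly like the listing above (address, size in K, page size,
mode, lgroup, mapping name); the sample is the first seven lines of that
listing, and the function name is illustrative.

```python
from collections import Counter

# First seven heap lines from the listing above.
SAMPLE = """\
00800000 4096K 4M rwx-- 9 [ heap ]
00C00000 4096K 4M rwx-- 3 [ heap ]
01000000 4096K 4M rwx-- 10 [ heap ]
01400000 4096K 4M rwx-- 7 [ heap ]
01800000 4096K 4M rwx-- 15 [ heap ]
01C00000 4096K 4M rwx-- 3 [ heap ]
02000000 8192K 4M rwx-- 12 [ heap ]
"""


def kbytes_per_lgroup(listing):
    """Sum mapped kilobytes per lgroup from lines shaped like the listing."""
    totals = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Expect: address, size ending in 'K', page size, mode, lgroup, name...
        if len(fields) < 5 or not fields[1].endswith("K") or not fields[4].isdigit():
            continue  # skip the header line and anything in another format
        totals[int(fields[4])] += int(fields[1][:-1])
    return totals


totals = kbytes_per_lgroup(SAMPLE)
for lgrp, kb in sorted(totals.items()):
    print(f"lgroup {lgrp:2d}: {kb:6d} KB")
print(f"{len(totals)} lgroups hold {sum(totals.values())} KB of heap")
```

Even these seven lines spread 32 MB of heap across six lgroups.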
=================================
3. Users do not have a way to teach it to make better fallback decisions.
If the fallback decisions are not as intelligent as we might wish, and
if the OS does not adjust to the load over time to make better fallback
decisions, then can the user at least provide information to improve
the decisions? Not at this time.
One could imagine tunables that would indicate a preference among
fallback options (a), (b), and (c). These preferences might be
expressed as direction to the OS, or as mere guidance. While many
solutions might be considered, for the sake of specificity, here's a
proposal for a pair of parameters. The intent is that these be
considered as a pair: both would be implemented.
========================
numa_fallback_preference
========================
Description
Controls a heuristic for allocation of memory pages when the
requested page size is not immediately available in the local
memory group, but could be satisfied from a remote memory group.
Lower values suggest that the operating system should give more
weight to the requested size (potentially allocating the page
in a remote memory locality group). Higher values give more
weight to location (potentially allocating a different size than
requested).
Note: when a program requests smaller pages than are immediately
available in the current locality group, the operating system may
be able to create them by dividing larger pages into smaller
pieces. Although page division is straightforward, it may lead to
page fragmentation; a better choice might be to use a
small page in a remote locality group. When
numa_fallback_preference is decreased, page division becomes
less likely; when numa_fallback_preference is increased, page
division becomes more likely.
Default
- 4 (Moderate preference to allocate pages locally, with
moderately reduced priority for the requested size)
Range
- 1 (Strong preference to allocate pages with the requested size) to
- 5 (Strong preference to allocate pages locally)
When to change
This parameter may be decreased if the workload on the system
tends to use large address spaces with relatively sparse access,
because in this situation TLB misses will be more costly than
remote latency.
This parameter may be increased if the workload tends to use
smaller address spaces, or for programs that touch most of the
memory that they allocate. In such situations, the cost of a
remote memory access may dominate the cost of TLB misses.
TLB misses may be observed using the -T option to trapstat (1M).
Page locations and sizes may be observed via the NUMA
Observability Tools, available at opensolaris.org.
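As a rough aid to the "When to change" guidance, the sketch below applies
the round-number cost model from point 1 to suggest a direction for the
proposed parameter. Everything in it is an assumption for illustration: the
latencies are the round numbers used earlier, the function name is
hypothetical, and numa_fallback_preference itself exists only as a proposal
in this CR.

```python
# Round numbers from the simple model in point 1; not measurements.
TLB_MISS_NS, LOCAL_NS, REMOTE_NS = 1000, 100, 200


def suggest_change(accesses, small_pages_touched):
    """Suggest which way to move the proposed numa_fallback_preference.

    accesses: memory accesses made to the region while it is mapped.
    small_pages_touched: TLB entries needed if smaller local pages are used.
    """
    local_smaller = accesses * LOCAL_NS + small_pages_touched * TLB_MISS_NS
    remote_requested = accesses * REMOTE_NS + 1 * TLB_MISS_NS
    if local_smaller < remote_requested:
        return "increase"   # location matters more than size
    if local_smaller > remote_requested:
        return "decrease"   # requested size matters more than location
    return "keep"


print(suggest_change(64, 64))        # sparse example from point 1
print(suggest_change(131072, 64))    # dense example from point 1
```

In practice the two inputs would have to be estimated from observation,
for example from TLB miss counts, rather than known in advance.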
========================
page_coalesce_priority
========================
Description
Controls a heuristic for the allocation of pages when the
requested page size is not immediately available, but smaller
pages are available.
When a program requests larger pages than are immediately
available, it may not be possible to fulfill the request without
substantial cost, such as moving smaller pages around to coalesce
larger pages. Page coalesce operations may be done either by a
background thread at low priority, or at the time that a large
page is requested.
Default
- 2 (a background thread runs at low priority to attempt to
coalesce pages, but coalesce operations are not attempted at the
time of a large page request)
Range
- 1 (Allocation of the requested size is considered relatively low
priority. Page coalesce operations are rarely attempted.)
- 5 (Allocation of the requested size is considered a high
priority. Page coalesce operations are frequently attempted,
including both by a background thread and at the time that page
requests are made.)
When to change
This parameter may be decreased when the workload consists
primarily of programs with relatively small address spaces. It
may be increased if running programs with large address spaces,
especially if these programs explicitly request large pages, and
the requests often fail, resulting in increased TLB miss activity.
TLB misses may be observed using the -T option to trapstat (1M).
Page sizes may be observed via the -s option for pmap (1).
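To make the defaults and ranges of the two proposed parameters concrete,
here is a minimal sketch of reading and validating them. It assumes they
would be set with 'set name=value' lines and that lines beginning with '*'
are comments; that syntax, the parser, and the parameters themselves are
assumptions made for illustration, not an existing interface.

```python
# Defaults and range as proposed above; the parameters are only a proposal.
DEFAULTS = {"numa_fallback_preference": 4, "page_coalesce_priority": 2}
VALID_RANGE = range(1, 6)  # 1 through 5 inclusive


def parse_tunables(text):
    """Read 'set name=value' lines (assumed syntax); return validated settings."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # assumed comment marker
            continue
        keyword, _, assignment = line.partition(" ")
        name, _, value = assignment.partition("=")
        name, value = name.strip(), value.strip()
        if keyword != "set" or name not in DEFAULTS:
            continue  # not one of the two proposed parameters
        if not value.isdigit() or int(value) not in VALID_RANGE:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        settings[name] = int(value)
    return settings


example = """\
* favor local placement strongly; coalesce aggressively
set numa_fallback_preference=5
set page_coalesce_priority=5
"""
print(parse_tunables(example))
```

Any parameter that is not mentioned keeps its proposed default, so an
empty input yields 4 and 2.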
Business impact on an unannounced processor is discussed under comments.
Observations at a customer site will also be added under comments.