Add numa.md design sketch #6719
Conversation
doc/content/design/numa.md
Outdated
| NUMA support is only available in XenServer 9. For now, I would suggest
| to make the required changes in XenServer 8 as well but no VM there
| will be NUMA optimised. Some parts of the code will have to be managed
| by patches.
Xapi has been able to produce NUMA-optimised VMs for at least 2 years for hosts configured to do so. Without patches in some cases it will fail to create NUMA-optimised VMs, but I don't see a technical reason to prevent xapi on master from showing these stats
In some cases I think the API to query the needed information is only available with a newer Xen. In particular the one that tells you how the memory of a VM actually got split over NUMA nodes (not how XAPI wanted it to get split, which are 2 different things)
So perhaps we should refer to Xen versions here, rather than XenServer versions.
| * `numa_nodes`: 0
| * `numa_node_memory`: []
What are the values for halted and suspended VMs? The defaults, or the ones from when they last ran?
Good question. The values would be updated when the VM changes into the running state. I also don't know what the underlying Xen API returns in such a case - we are passing these through. A halted VM does not have a domain, so there is no way to call the Xen API.
We probably need to know the number of nodes to resume a VM, so for suspended VMs it might make sense to store these.
For a halted VM these should be empty, or we could have a last_booted_ like we do for CPUID, but I'm not sure the complication would be worth it.
> What are the values for halted and suspended VMs? The defaults, or the ones from when they last ran?

Common practice would be to keep the values from when the VM was last running.
My current implementation resets them. But we can adjust that.
doc/content/design/numa.md
Outdated
| where the two fields would be set. Xenopsd relies on bindings to Xen to
| observe NUMA-related properties of a domain.
|
| Given that NUMA-related functionality is only available for Xen in XS
Last I checked, xenops only supports allocating the memory of a VM on a single node, or distributing it among all nodes. I guess with the second one xapi wouldn't have the information of the pages allocated on each NUMA node, but otherwise, if xenopsd decides to NUMA-optimise a VM, it already knows how much memory to allocate on each node before the domain is built.
Am I missing something?
I think we want to double check that Xen actually did what we asked it to do.
E.g. there was a situation where a stray page got allocated on another node.
Also we want to double check what happened after migration.
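To make the cross-check concrete, here is a minimal OCaml sketch; both arrays (per-node page counts, indexed by host node) are hypothetical inputs, not an existing xenopsd interface:

```ocaml
(* Compare the per-node allocation we requested against what Xen
   reports after domain build or migration. A single stray page on
   another node makes the check fail, which is exactly the kind of
   discrepancy we want to surface. *)
let placement_matches ~requested ~reported =
  Array.length requested = Array.length reported
  && Array.for_all2 Int64.equal requested reported
```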
Also, with all the patching and conditional builds I'm not actually sure we'd be calling the right APIs on the right XS versions, but that is a separate matter to improve. So double-checking that we got what we intended, at least for testing purposes, is very useful. It might also be useful for support/customers to be able to open better bug reports.
It is also very difficult to develop with all the patches in the way: everything other than XS8 is a second-class citizen, and that has been very visible with the long delays to get the NUMA patches merged.
I discussed with @mg12 that we could try to upstream large parts of the patches (e.g. we have Seq refactoring mixed with introducing better NUMA support), leaving only a very small boolean and function call in the patch.
Then those could be moved to a separate module, and we could use Dune's conditional compilation support (select) to build XAPI for a particular version of Xen from the same (unpatched) source code based on an environment variable.
That'd also help reduce the maintenance overhead with different Xen versions (we currently support 3), we'd only need 1 line for each Xen version in the .spec file: to set the appropriate env var.
But I think all those improvements will be made independently in other PRs.
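For illustration, a minimal dune sketch of the `select` form; the library and file names here are hypothetical:

```
(library
 (name numa_backend)
 ; select picks exactly one implementation file at build time,
 ; based on which versioned xenctrlext library is available
 (libraries
  (select numa_calls.ml from
   (xenctrlext_xen420 -> numa_calls.xen420.ml)
   (xenctrlext_xen417 -> numa_calls.xen417.ml)
   (-> numa_calls.default.ml))))
```

The environment variable set in the .spec file would then only need to control which versioned library is visible to the build.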
> It is also very difficult to develop with all the patches in the way: everything other than XS8 is a second-class citizen, and that has been very visible with the long delays to get the NUMA patches merged.
That's why I tried to minimize the patches for supporting the memory reservation call of future xen versions in xapi. I think I got it down to a 5-line patch that was trivial to review. This patch was also a revert of the last commit in the PR that added this in xapi-master. This last commit disabled the new feature to make it compatible with xen 4.17.
It has evolved a bit since then (and there are more functions being added there), and we have 3 versions of Xen in xapi.spec, which is why I'm looking at how to simplify this to make both development and maintenance easier.
Here is what I currently have.
But I think we don't need cppo (we could also use the config ppx instead if needed, but it doesn't appear to be widely adopted). cppo is not compatible with ocamlformat.
The conditional select OTOH is built into Dune, and can select a different file as input based on the version of a dependency (xenctrlext already has a version, except the mock one, which we'll need to fix for the CI, but should be easy).
Some of the patches could also be made smaller, e.g. 0005-xenopsd-xc-do-not-try-keep-track-of-free-memory-when.patch: if we move the list->seq refactoring upstream, then I think the only remaining part is deleting a single line, `numa_resources := Some nodea`. And then we wouldn't need to delete the line either: we could do that based on a boolean, and that boolean could be defined in `Xenver.4*.ml`, which would allow us to delete the patch completely.
I always wanted to look into how to simplify this, unfortunately I haven't found the time to do that until now (would've been good if we worked out these details when work on the NUMA feature got started, but I was busy with other things then).
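As a sketch of that idea (module and flag names are hypothetical), the per-Xen-version module selected by dune could shrink the patch to nothing:

```ocaml
(* xenver_4_17.ml: on older Xen, xenopsd keeps its own account of
   free memory per NUMA node *)
let track_free_memory = true

(* xenver_4_20.ml: on newer Xen, the hypervisor's reservation call
   does the accounting, so xenopsd must not second-guess it *)
let track_free_memory = false
```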
There is also https://github.com/xapi-project/xen-api/blob/master/doc/content/toolstack/features/NUMA/index.md; it would be good to link the two documents.
| handle -> int -> int -> domain_numainfo_node_pages
|   = "stub_xc_domain_get_numa_info_node_pages"
I don't understand why the number of nodes used by the domain is needed as a parameter here; doesn't the returned data structure need to contain as many array elements as there are nodes in the host? Otherwise I don't see how the amount of pages can be mapped to only the nodes that are actually used by the domain.
Also, why is the _size call exposed if the information can be derived from domain_numainfo_node_pages?
I share the confusion and don't like how this function is named.
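For comparison, an illustrative alternative shape for the binding (this is not the actual xenctrl API): return one element per host node, so callers need not pass a node count in, and "number of nodes used" falls out of the result:

```ocaml
(* Hypothetical: per-node page counts for a domain, one array element
   per host NUMA node; nodes the domain does not use report 0L. *)
external domain_numa_node_pages : handle -> domid:int -> int64 array
  = "stub_xc_domain_numa_node_pages"

(* "nodes used" is then derivable, making a separate _size call redundant *)
let nr_nodes_used handle ~domid =
  domain_numa_node_pages handle ~domid
  |> Array.fold_left (fun n pages -> if pages > 0L then n + 1 else n) 0
```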
| NUMA stands for Non-Uniform Memory Access and describes that RAM access
| for CPUs in a multi-CPU system is not equally fast for all of them. CPUs
| are grouped into so-called nodes and each node has fast access to RAM
NUMA is a rather abstract way to represent topology: we can think of a node as an abstract container that holds some CPUs and some memory, with a cost associated with accessing a different NUMA node.
There can be NUMA nodes without any memory, for example.
A node without any memory does make physical sense: that CPU has no (local) memory access at all; the only way it can access memory is through another CPU.
A node without any CPU is more peculiar, and I'm not sure whether that'd be allowed.
Linux might try to abstract away these corner cases and give you an equivalent representation where each NUMA node is guaranteed to have at least one CPU and some memory (with all other distances updated appropriately).
I don't know what Xen would do, but I think it has 2 different ways to refer to NUMA nodes and ACPI proximity domains internally.
A definition can be found here in the ACPI spec: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html
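As a small illustration of the "cost" part, here is a SLIT-style distance matrix for a hypothetical two-node machine (values are examples only; 10 conventionally means local access):

```ocaml
(* distances.(i).(j) = relative cost for node i to access node j's memory *)
let distances = [| [| 10; 21 |]
                 ; [| 21; 10 |] |]
```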
doc/content/design/numa.md
Outdated
| * RO `VM.numa_optimised`: boolean: if the VM is optimised for NUMA
| * RO `VM.numa_nodes`: integer: number of NUMA nodes of the host the VM
|   is using
| * MRO `VM.numa_node_memory`: list of tuple (`node_X`: integer,
I'm not sure our IDL supports this currently. I think we usually have string to string maps.
We should probably avoid storing an S-expression inside a string field like we do in the database though.
Might need to update the IDL to support integers as values, but if that is not already supported it'd be a useful extension.
If we can represent a list, we could store a list of strings using "x:y" or "x,y" to represent pairs. I believe lists of pairs are supported, though.
API doesn't have lists or tuples. In this case you'd want an (int -> int) map. And we have (at least) one field with that type already: VM_metrics.VCPUs_CPU (see https://xapi-project.github.io/xen-api/classes/vm_metrics.html). In fact, VM_metrics may be a suitable class for these three new fields, as it has similar run-time values.
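A rough sketch of what the declarations might look like in the datamodel IDL, mirroring `VM_metrics.VCPUs_CPU` (which already uses `Map (Int, Int)`); field names, descriptions, and final placement are still up for discussion:

```ocaml
(* candidate fields inside a class's contents list, in the style of
   xapi's datamodel *)
field ~qualifier:DynamicRO ~ty:Bool
  ~default_value:(Some (VBool false)) "numa_optimised"
  "true if the VM's memory placement is NUMA-optimised"
; field ~qualifier:DynamicRO ~ty:Int ~default_value:(Some (VInt 0L))
    "numa_nodes" "number of host NUMA nodes used by the VM"
; field ~qualifier:DynamicRO ~ty:(Map (Int, Int))
    ~default_value:(Some (VMap []))
    "numa_node_memory" "host NUMA node -> memory the VM has allocated there"
```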
doc/content/design/numa.md
Outdated
| `records.ml`; we could decide to make `numa_optimised` visible by
| default in `xe vm-list.
|
| Introducing new fields requires defaults; these woould be:
typo
doc/content/design/numa.md
Outdated
| As far as Xapi clients are concerned, we implement new fields in the VM
| class of the data model and surface the values in the CLI via
| `records.ml`; we could decide to make `numa_optimised` visible by
| default in `xe vm-list.
missing `
doc/content/design/numa.md
Outdated
| with three fields:
|
| * `numae_optimised: bool`
typo, although the XAPI<->xenopsd interface could contain only the ground truth values reported by Xen, and any higher-level derived property could be computed purely in XAPI.
True. However, computing these in Xenopsd saves RPC calls.
doc/content/design/numa.md
Outdated
| = "stub_xc_domain_get_numa_info_node_pages_size" | ||
| ``` | ||
|
|
||
| Thia function reports the nunber of NUMA nodes used by a domain |
In XAPI we usually refer to guests as (Xen) domains, but I wonder whether it'd be useful to explicitly spell that out in this document at least once.
To distinguish them from PCI domains, which we might need to support before long.
Or OCaml 5 domains, which we might also have in the future.
| We are not reporting `nr_nodes` directly but use it to determine the
| value of `numa_optimised` for a domain/VM:
|
|     numa_optimised =
This field might change during migration, e.g. if we move to a host with a different number of NUMA nodes, or one that has fewer NUMA nodes free (requiring us to fall back to interleaving memory across all nodes).
So we must make sure to reevaluate it post-migration.
doc/content/design/numa.md
Outdated
| # NUMA
|
| NUMA stands for Non-Uniform Memory Access and describes that RAM access
| for CPUs in a multi-CPU system is not equally fast for all of them. CPUs
> for CPUs in a multi-CPU system

This has type confusion over CPUs. The closest would be "multi-socket system", except this is no longer true either: plenty of single-socket server chips have 2 or more NUMA nodes these days.
I'd suggest "larger systems", as the only form which probably won't bitrot.
|   capabilities : physinfo_cap_flag list;
|   max_nr_cpus : int;
|   arch_capabilities : arch_physinfo_cap_flags;
| + nr_nodes : int;
FWIW this is available on both XS8 and XS9 already. So although we can use that field when available as an optimization, for now it'd probably be easier to keep using the existing mechanism (one fewer conditionally applied patch).
It got added back in July:

```
commit 495b0de
Author: Changlei Li [email protected]
Date:   Tue Jul 29 15:07:46 2025 +0800

    CP-308927 Add nr_nodes in host.cpu_info to expose numa nodes count

    Add Xenctrlext.get_nr_nodes to get numa nodes count, get_nr_nodes gets
    the count from numainfo.memory array size.

    Signed-off-by: Changlei Li <[email protected]>
```
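For reference, the existing mechanism that commit describes amounts to something like this sketch (record and field names as I recall them from xenctrlext; treat it as illustrative rather than exact):

```ocaml
(* Derive the host's NUMA node count from the size of the numainfo
   memory array rather than from the new physinfo field. *)
let get_nr_nodes handle =
  let info = Xenctrlext.numainfo handle in
  Array.length info.memory
```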
| ## XAPI Low-Level Implementation
|
| NUMA properties are observed by Xenopsd and Xapi learns about them as
May I suggest the memory stats be exposed as RRDs as well?
This would allow following a similar architecture to the other memory stats: squeezed generates them, and xapi reads them with a thread to update the database. See #6561.
There are already some RRDs for NUMA (outside of this) in an unmerged patch from @bernhardkaindl.
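A sketch of what such a data source could look like, assuming the xcp-rrdd plugin API (`Ds.ds_make`, `Rrd`); the data-source name is hypothetical:

```ocaml
(* One gauge per NUMA node, reporting the VM's memory currently
   allocated on that node, in bytes. *)
let numa_node_memory_ds ~node ~bytes =
  Ds.ds_make
    ~name:(Printf.sprintf "numa_node%d_memory" node)
    ~description:"VM memory currently allocated on this NUMA node"
    ~value:(Rrd.VT_Int64 bytes) ~ty:Rrd.Gauge ~default:true ~units:"B"
    ~min:0.0 ()
```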
Signed-off-by: Christian Lindig <[email protected]>
Force-pushed from c1b271b to c058215.
Design sketch for exposing a VM's NUMA properties, observable through the API.