Add numa.md design sketch #6719
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| +++ title = "NUMA" +++ | ||
|
|
||
| # NUMA | ||
|
|
||
| NUMA stands for Non-Uniform Memory Access: in a large system, not every | ||
| CPU can access all RAM equally fast. CPUs are grouped into so-called | ||
| nodes; each node has fast access to the RAM that is local to it and | ||
| slower access to other RAM. Conceptually, a node is a container that | ||
| bundles some CPUs and RAM, and accessing RAM in a different node has an | ||
| associated cost. In the context of CPU virtualisation, assigning vCPUs | ||
| to NUMA nodes is an optimisation strategy to reduce memory latency. | ||
| This document describes a design to make NUMA-related assignments for | ||
| Xen domains (hence, VMs) visible to the user. Below we refer to these | ||
| assignments and optimisations collectively as NUMA for simplicity. | ||
|
|
||
| NUMA is more generally discussed as | ||
| [NUMA Feature](../toolstack/features/NUMA/index.md). | ||
|
|
||
|
|
||
| ## NUMA Properties | ||
|
|
||
| Xen 4.20 implements NUMA optimisation. We want to expose the following | ||
| NUMA-related properties of VMs to API clients, and in particular | ||
| XenCenter. Each one is represented by a new field in XAPI's `VM_metrics` | ||
| data model: | ||
|
|
||
| * RO `VM_metrics.numa_optimised`: boolean: whether the VM is | ||
| optimised for NUMA | ||
| * RO `VM_metrics.numa_nodes`: integer: number of NUMA nodes of the host | ||
| the VM is using | ||
| * MRO `VM_metrics.numa_node_memory`: int -> int map; mapping a NUMA node | ||
| (int) to an amount of memory (bytes) in that node. | ||
|
|
||
| Required NUMA support is only available in Xen 4.20. Some parts of the | ||
| code will have to be managed by patches. | ||
|
|
||
| ## XAPI High-Level Implementation | ||
|
|
||
| As far as Xapi clients are concerned, we implement new fields in the | ||
| `VM_metrics` class of the data model and surface the values in the CLI | ||
| via `records.ml`; we could decide to make `numa_optimised` visible by | ||
| default in `xe vm-list`. | ||
|
|
||
| Introducing new fields requires defaults; these would be: | ||
|
|
||
| * `numa_optimised`: false | ||
| * `numa_nodes`: 0 | ||
| * `numa_node_memory`: [] | ||
|
Comment on lines +48 to +49
What are the values for halted and suspended VMs? The defaults, or the ones from when they were last run?
Good question. The values would be updated when the VM changes into the run state. I also don't know what the underlying Xen API returns in such a case - we are passing these through. A halted VM does not have a domain - so there is no way to call the Xen API.
We probably need to know the number of nodes to resume a VM, so for suspended VMs it might make sense to store these.
Common practice would be to keep the values as they were when the VM was still running.
My current implementation resets them. But we can adjust that. |
||
|
|
||
| The data model ensures that the values are visible to API clients. | ||
|
|
||
| ## XAPI Low-Level Implementation | ||
|
|
||
| NUMA properties are observed by Xenopsd and Xapi learns about them as | ||
|
May I suggest the memory stats be exposed as RRDs as well? This would allow following a similar architecture to the other memory stats: squeezed generates them, and xapi reads them with a thread to update the database. See #6561.
There are already some RRDs for NUMA (outside of this) in an unmerged patch from @bernhardkaindl |
||
| part of the `Client.VM.stat` call implemented by Xenopsd. Xapi makes | ||
| these calls frequently and we will update the Xapi VM fields related to | ||
| NUMA simply as part of processing the result of such a call in Xapi. | ||
|
|
||
| For this to work, we extend the return type of `VM.stat` in | ||
|
|
||
| * `xenops_types.ml`, type `Vm.state` | ||
|
|
||
| with three fields: | ||
|
|
||
| * `numa_optimised: bool` | ||
| * `numa_nodes: int` | ||
| * `numa_node_memory: (int * int64) list` | ||
|
|
||
| matching the semantics from above. | ||
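
As a rough illustration of the extension described above, the three fields and their defaults could look like the following OCaml sketch. The record name `numa_state` and the value `default_numa_state` are hypothetical and exist only for this example; in the design the fields are added to the existing `Vm.state` record rather than to a separate type.

```ocaml
(* Hypothetical record grouping the three proposed fields. In the design
   they would be added to the existing Vm.state record instead. *)
type numa_state = {
    numa_optimised: bool
  ; numa_nodes: int
  ; numa_node_memory: (int * int64) list  (* NUMA node id -> bytes *)
}

(* Defaults as listed for the data model fields:
   not optimised, no nodes, empty node-to-memory map. *)
let default_numa_state =
  {numa_optimised= false; numa_nodes= 0; numa_node_memory= []}
```

An association list keeps the field easy to marshal; the data model side would present it as an int -> int map.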
|
|
||
| ## Xenopsd Implementation | ||
|
|
||
| Xenopsd implements the `VM.stat` return value in | ||
|
|
||
| * `Xenops_server_xen.get_state` | ||
|
|
||
| where the three fields would be set. Xenopsd relies on bindings to Xen to | ||
| observe NUMA-related properties of a domain. | ||
|
|
||
| Given that NUMA-related functionality is only available for Xen 4.20, we | ||
| probably will have to maintain a patch in xapi.spec for compatibility | ||
| with earlier Xen versions. | ||
|
|
||
| The (existing) C bindings and changes come in two forms: new functions | ||
| and an extension of a type used by an existing function. | ||
|
|
||
| ```ocaml | ||
| external domain_get_numa_info_node_pages_size : handle -> int -> int | ||
| = "stub_xc_domain_get_numa_info_node_pages_size" | ||
| ``` | ||
|
|
||
| This function reports the number of NUMA nodes used by a Xen domain | ||
| (supplied as an argument). | ||
|
|
||
| ```ocaml | ||
| type domain_numainfo_node_pages = { | ||
| tot_pages_per_node : int64 array; | ||
| } | ||
| external domain_get_numa_info_node_pages : | ||
| handle -> int -> int -> domain_numainfo_node_pages | ||
| = "stub_xc_domain_get_numa_info_node_pages" | ||
|
Comment on lines +101 to +102
I don't understand why the number of nodes used by the domain is needed as a parameter here, doesn't the returning datastructure need to contain as many array elements as nodes in the host? Otherwise I don't see how the amount of pages can be mapped to only the nodes that are actually used by the domain. Also, why is the
I share the confusion and don't like how this function is named. |
||
| ``` | ||
|
|
||
| This function receives as arguments a domain ID and the number of nodes | ||
| this domain is using (acquired using `domain_get_numa_info_node_pages_size`). | ||
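
To illustrate how the per-node page counts might be turned into the node-to-bytes mapping of `numa_node_memory`, here is a minimal sketch. It rests on several assumptions that the design does not state: a page size of 4 KiB, array index `i` corresponding to NUMA node `i`, and nodes without pages being omitted from the result. The helper `node_memory_of_pages` is hypothetical and not part of the bindings.

```ocaml
(* Assumption: 4 KiB pages, as is usual on x86. *)
let page_size_bytes = 4096L

(* Hypothetical helper: convert per-node page counts into a
   (node id, bytes) association list.
   Assumption: index i of the array describes NUMA node i.
   Nodes on which the domain holds no pages are dropped. *)
let node_memory_of_pages (pages : int64 array) : (int * int64) list =
  pages
  |> Array.to_list
  |> List.mapi (fun node n -> (node, Int64.mul n page_size_bytes))
  |> List.filter (fun (_, size) -> Int64.compare size 0L > 0)
```

For example, `node_memory_of_pages [|2L; 0L; 1L|]` evaluates to `[(0, 8192L); (2, 4096L)]`.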
|
|
||
| The number of NUMA nodes of the host (not domain) is reported by | ||
| `Xenctrl.physinfo` which returns a value of type `physinfo`. | ||
|
|
||
| ```diff | ||
| index b4579862ff..491bd3fc73 100644 | ||
| --- a/tools/ocaml/libs/xc/xenctrl.ml | ||
| +++ b/tools/ocaml/libs/xc/xenctrl.ml | ||
| @@ -155,6 +155,7 @@ type physinfo = | ||
| capabilities : physinfo_cap_flag list; | ||
| max_nr_cpus : int; | ||
| arch_capabilities : arch_physinfo_cap_flags; | ||
| + nr_nodes : int; | ||
|
FWIW this is available on both XS8 and XS9 already. So although we can use that field when available as an optimization, for now it'd probably be easier to keep using the existing mechanism (one fewer conditionally applied patch). It got added back in July |
||
| } | ||
| ``` | ||
|
|
||
| We are not reporting `nr_nodes` directly but use it to determine the | ||
| value of `numa_optimised` for a domain/VM: | ||
|
|
||
| numa_optimised = | ||
|
This field might change during migration, e.g. if we move to a host with a different number of NUMA nodes, or one that has fewer NUMA nodes free (requiring us to fall back to interleaving memory across all nodes). |
||
| (VM.numa_nodes = 1) | ||
| or (VM.numa_nodes < physinfo.Xenctrl.nr_nodes) | ||
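
The rule above can be written as a small OCaml predicate. This is only a sketch: the function name and the labelled arguments `vm_nodes` (the domain's node count) and `host_nodes` (`nr_nodes` from `physinfo`) are chosen for illustration.

```ocaml
(* A VM counts as NUMA-optimised if it uses exactly one node, or fewer
   nodes than the host has. The labels are hypothetical names for
   VM.numa_nodes and physinfo.nr_nodes. *)
let numa_optimised ~vm_nodes ~host_nodes =
  vm_nodes = 1 || vm_nodes < host_nodes
```

Read literally, the rule also yields `true` for a node count of 0 on a host with at least one node, so a caller would presumably evaluate it only for running domains.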
|
|
||
| ### Details | ||
|
|
||
| The three new fields that become part of type `Vm.state` are updated as | ||
| part of `get_state()` using the primitives above. | ||
|
|
||
|
|
||
|
|
||
NUMA is a rather abstract way to represent topology, we can think of it as an abstract container that contains some CPUs and some memory, and a cost associated with accessing a different NUMA node.
There can be NUMA nodes without any memory for example.
A node without any memory does make physical sense: that CPU has no (local) memory access at all, the only way it can access memory is through another CPU.
A node without any CPU is more peculiar, and I'm not sure whether that'd be allowed.
Linux might try to abstract away these corner cases and give you an equivalent representation where each NUMA node is guaranteed to have at least one CPU and memory (and all other distances updated appropriately).
I don't know what Xen would do, but I think it has 2 different ways to refer to NUMA nodes and ACPI proximity domains internally.
A definition can be found here in the ACPI spec: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html