`doc/content/design/numa.md` (new file, 136 additions)
+++
title = "NUMA"
+++

# NUMA

NUMA stands for Non-Uniform Memory Access and describes the fact that RAM
access for CPUs in a large system is not equally fast for all of them. CPUs
are grouped into so-called nodes and each node has fast access to RAM
that is considered local to its node and slower access to other RAM.
Conceptually, a node is a container that bundles some CPUs and RAM, and
there is an associated cost when accessing RAM in a different node. In
the context of CPU virtualisation, assigning vCPUs to NUMA nodes is an
optimisation strategy to reduce memory latency. This document describes
a design to make NUMA-related assignments for Xen domains (hence, VMs)
visible to the user. Below we refer to these assignments and
optimisations collectively as NUMA for simplicity.

> **@edwintorok** (Contributor, Oct 21, 2025): NUMA is a rather abstract way to
> represent topology; we can think of it as an abstract container that contains
> some CPUs and some memory, and a cost associated with accessing a different
> NUMA node.
>
> There can be NUMA nodes without any memory, for example. A node without any
> memory does make physical sense: that CPU has no (local) memory access at all;
> the only way it can access memory is through another CPU. A node without any
> CPU is more peculiar, and I'm not sure whether that'd be allowed. Linux might
> try to abstract away these corner cases and give you an equivalent
> representation where each NUMA node is guaranteed to have at least one CPU and
> memory (and all other distances updated appropriately). I don't know what Xen
> would do, but I think it has 2 different ways to refer to NUMA nodes and ACPI
> proximity domains internally.
>
> A definition can be found in the ACPI spec:
> https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html

NUMA is discussed more generally in the
[NUMA Feature](../toolstack/features/NUMA/index.md) documentation.


## NUMA Properties

Xen 4.20 implements NUMA optimisation. We want to expose the following
NUMA-related properties of VMs to API clients, and in particular
XenCenter. Each one is represented by a new field in XAPI's `VM_metrics`
data model:

* RO `VM_metrics.numa_optimised`: boolean: if the VM is
optimised for NUMA
* RO `VM_metrics.numa_nodes`: integer: number of NUMA nodes of the host
the VM is using
* MRO `VM_metrics.numa_node_memory`: int -> int map; mapping a NUMA node
(int) to an amount of memory (bytes) in that node.

The required NUMA support is only available in Xen 4.20, so some parts of
the code will have to be maintained as patches.

## XAPI High-Level Implementation

As far as Xapi clients are concerned, we implement new fields in the
`VM_metrics` class of the data model and surface the values in the CLI
via `records.ml`; we could decide to make `numa_optimised` visible by
default in `xe vm-list`.
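
For illustration only, surfacing these in `records.ml` might look roughly like
the sketch below; it calls the client getters that would be generated for the
new fields directly (rather than through the cached record) purely for brevity,
and all names are illustrative.

```ocaml
(* Sketch only: assumes the make_field helper used throughout records.ml and
   the VM_metrics getters that would be generated once the fields exist. *)
let numa_cli_fields rpc session_id metrics =
  [
    make_field ~name:"numa-optimised"
      ~get:(fun () ->
        string_of_bool
          (Client.VM_metrics.get_numa_optimised ~rpc ~session_id ~self:metrics))
      ()
  ; make_field ~name:"numa-nodes"
      ~get:(fun () ->
        Int64.to_string
          (Client.VM_metrics.get_numa_nodes ~rpc ~session_id ~self:metrics))
      ()
  ]
```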

Introducing new fields requires defaults; these would be:

* `numa_optimised`: false
* `numa_nodes`: 0
* `numa_node_memory`: []
> **@psafont** (Member, Oct 21, 2025), on the defaults above: What are the
> values for halted and suspended VMs? The defaults, or the ones from when they
> were last run?
>
> **@lindig** (Contributor, Author, Oct 21, 2025): Good question. The values
> would be updated when the VM changes into the running state. I also don't
> know what the underlying Xen API returns in such a case; we are passing these
> through. A halted VM does not have a domain, so there is no way to call the
> Xen API.
>
> **Contributor**: We probably need to know the number of nodes to resume a VM,
> so for suspended VMs it might make sense to store these. For a halted VM
> these should be empty, or we could have a `last_booted_` variant like we do
> for CPUID, but I'm not sure the complication would be worth it.
>
> **Member**: Common practice would be to keep the values as before when still
> running.

The data model ensures that the values are visible to API clients.
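
As a sketch, the new fields and their defaults might be declared roughly as
follows, assuming the usual `field` helper from the datamodel in `ocaml/idl`;
qualifiers, lifecycle entries and descriptions are illustrative, not final.

```ocaml
(* Sketch only: qualifiers, lifecycle and wording to be decided when the
   fields are actually added to VM_metrics. *)
let numa_optimised =
  field ~qualifier:DynamicRO ~ty:Bool ~lifecycle:[]
    ~default_value:(Some (VBool false)) "numa_optimised"
    "True if the VM is optimised for NUMA"

let numa_nodes =
  field ~qualifier:DynamicRO ~ty:Int ~lifecycle:[]
    ~default_value:(Some (VInt 0L)) "numa_nodes"
    "Number of NUMA nodes of the host the VM is using"

let numa_node_memory =
  field ~qualifier:DynamicRO ~ty:(Map (Int, Int)) ~lifecycle:[]
    ~default_value:(Some (VMap [])) "numa_node_memory"
    "Maps a host NUMA node to the amount of memory (bytes) the VM uses in it"
```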

## XAPI Low-Level Implementation

NUMA properties are observed by Xenopsd and Xapi learns about them as
part of the `Client.VM.stat` call implemented by Xenopsd. Xapi makes
these calls frequently and we will update the Xapi VM fields related to
NUMA simply as part of processing the result of such a call in Xapi.

> **Member**: May I suggest the memory stats be exposed as RRDs as well? This
> would allow us to follow a similar architecture to the other memory stats:
> squeezed generates them, and xapi reads them with a thread to update the
> database. See #6561.
>
> **Contributor Author**: There are already some RRDs for NUMA (outside of
> this) in an unmerged patch from @bernhardkaindl.
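
Returning to how Xapi would apply these values: a rough sketch is shown below.
The `Db.VM_metrics` setters are the ones that would be generated once the new
fields exist; the function name and argument shapes are illustrative.

```ocaml
(* Sketch only: setters are generated from the datamodel; callers and naming
   are illustrative. *)
let update_numa_metrics ~__context ~metrics ~numa_optimised ~numa_nodes
    ~numa_node_memory =
  Db.VM_metrics.set_numa_optimised ~__context ~self:metrics
    ~value:numa_optimised ;
  Db.VM_metrics.set_numa_nodes ~__context ~self:metrics
    ~value:(Int64.of_int numa_nodes) ;
  (* assuming the generated map type uses int64 keys, convert the node indices *)
  Db.VM_metrics.set_numa_node_memory ~__context ~self:metrics
    ~value:
      (List.map (fun (node, bytes) -> (Int64.of_int node, bytes)) numa_node_memory)
```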

For this to work, we extend the return type of `VM.stat` in

* `xenops_types.ml`, type `Vm.state`

with three fields:

* `numa_optimised: bool`
* `numa_nodes: int`
* `numa_node_memory: (int * int64) list`

matching the semantics from above.
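
A sketch of the extension (existing fields of `Vm.state` elided; exact
placement and formatting are illustrative):

```ocaml
(* Sketch only: the existing fields of Vm.state are elided. *)
type state = {
    (* ... existing fields ... *)
    numa_optimised : bool
  ; numa_nodes : int
  ; numa_node_memory : (int * int64) list  (* node index -> bytes in that node *)
}
```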

## Xenopsd Implementation

Xenopsd implements the `VM.stat` return value in

* `Xenops_server_xen.get_state`

where the three fields would be set. Xenopsd relies on bindings to Xen to
observe NUMA-related properties of a domain.

Given that NUMA-related functionality is only available in Xen 4.20, we
will probably have to maintain a patch in xapi.spec for compatibility
with earlier Xen versions.

The (existing) C bindings and changes come in two forms: new functions
and an extension of a type used by an existing function.

```ocaml
external domain_get_numa_info_node_pages_size : handle -> int -> int
= "stub_xc_domain_get_numa_info_node_pages_size"
```

This function reports the number of NUMA nodes used by a Xen domain
(supplied as an argument).

```ocaml
type domain_numainfo_node_pages = {
  tot_pages_per_node : int64 array;
}

external domain_get_numa_info_node_pages :
  handle -> int -> int -> domain_numainfo_node_pages
  = "stub_xc_domain_get_numa_info_node_pages"
```

This function receives as arguments a domain ID and the number of NUMA nodes
this domain is using (acquired using `domain_get_numa_info_node_pages_size`).

> **@psafont** (Member, Oct 21, 2025), on the binding above: I don't understand
> why the number of nodes used by the domain is needed as a parameter here;
> doesn't the returned data structure need to contain as many array elements as
> there are nodes in the host? Otherwise I don't see how the amount of pages can
> be mapped to only the nodes that are actually used by the domain.
>
> Also, why is the `_size` call exposed if the information can be derived from
> `domain_numainfo_node_pages`?
>
> **Contributor Author**: I share the confusion and don't like how this
> function is named.

The number of NUMA nodes of the host (not domain) is reported by
`Xenctrl.physinfo`, which returns a value of type `physinfo`.

```diff
index b4579862ff..491bd3fc73 100644
--- a/tools/ocaml/libs/xc/xenctrl.ml
+++ b/tools/ocaml/libs/xc/xenctrl.ml
@@ -155,6 +155,7 @@ type physinfo =
     capabilities : physinfo_cap_flag list;
     max_nr_cpus : int;
     arch_capabilities : arch_physinfo_cap_flags;
+    nr_nodes : int;
   }
```

> **Contributor**: FWIW this is available on both XS8 and XS9 already. So
> although we can use that field when available as an optimization, for now
> it'd probably be easier to keep using the existing mechanism (one fewer
> conditionally applied patch). It got added back in July:
>
>     commit 495b0de
>     Author: Changlei Li [email protected]
>     Date:   Tue Jul 29 15:07:46 2025 +0800
>
>         CP-308927 Add nr_nodes in host.cpu_info to expose numa nodes count
>
>         Add Xenctrlext.get_nr_nodes to get numa nodes count, get_nr_nodes gets
>         the count from numainfo.memory array size.
>
>         Signed-off-by: Changlei Li <[email protected]>

We are not reporting `nr_nodes` directly but use it to determine the
value of `numa_optimised` for a domain/VM:

    numa_optimised =
         (VM.numa_nodes = 1)
      or (VM.numa_nodes < physinfo.Xenctrl.nr_nodes)

> **Contributor**: This field might change during migration, e.g. if we move to
> a host with a different number of NUMA nodes, or one that has fewer NUMA
> nodes free (requiring us to fall back to interleaving memory across all
> nodes). So we must make sure to re-evaluate it post-migration.

### Details

The three new fields that become part of type `VM.state` are updated as
part of `get_state()` using the primitives above.
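
A minimal sketch of that derivation, assuming the bindings above are exposed
via `Xenctrl`, that pages are 4 KiB, and that `xc` is an open Xenctrl handle;
all names are illustrative:

```ocaml
(* Sketch only: assumes the new bindings live in Xenctrl and a 4 KiB page
   size; error handling for older Xen versions is omitted. *)
let numa_state xc domid =
  let host_nodes = (Xenctrl.physinfo xc).Xenctrl.nr_nodes in
  let vm_nodes = Xenctrl.domain_get_numa_info_node_pages_size xc domid in
  let info = Xenctrl.domain_get_numa_info_node_pages xc domid vm_nodes in
  let bytes_of_pages pages = Int64.mul pages 4096L in
  let numa_node_memory =
    Array.to_list info.Xenctrl.tot_pages_per_node
    |> List.mapi (fun node pages -> (node, bytes_of_pages pages))
  in
  let numa_optimised = vm_nodes = 1 || vm_nodes < host_nodes in
  (numa_optimised, vm_nodes, numa_node_memory)
```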


