Conversation

@ankita-nv
Contributor

Address the following bugs:

Correct setting of NUMA distances
GPU memory VMA alignment adjustment for hugepfnmap

@nvmochs
Collaborator

nvmochs commented Dec 17, 2025

General question: What is the upstream plan for these?


NVIDIA: SAUCE: hw/vfio: adjust alignment for hugepfnmap

Can you add an example in the commit message where this was causing an issue?


NVIDIA: SAUCE: acpi: generic initiator in sorted order

Does this need a fixes tag?

Can you add an example in the commit message of the mismatch and how it looks when fixed up?

On the iterator, in other parts of the QEMU source I see g_slist_next() being used to traverse ‘next’. I don’t care but upstream might.

@ankita-nv
Contributor Author

General question: What is the upstream plan for these?
I will start a formal internal review in the next few days but want to unblock users at this point. I see several folks doing verification.

NVIDIA: SAUCE: hw/vfio: adjust alignment for hugepfnmap

Can you add an example in the commit message where this was causing an issue?

Sure, will do.

NVIDIA: SAUCE: acpi: generic initiator in sorted order

Does this need a fixes tag?

Can you add an example in the commit message of the mismatch and how it looks when fixed up?

Sure, will address them.

On the iterator, in other parts of the QEMU source I see g_slist_next() being used to traverse ‘next’. I don’t care but upstream might.

Thanks for the heads-up. I'll fix this before posting for upstream.

During creation of the VM's SRAT table, the generic initiator
entries are added. Currently, the code queries the objects, which
may not be in sorted order. This results in a mismatch between the
VM's view of the PXM and the NUMA node IDs.

As a fix, the patch builds a list of the generic initiator objects,
sorts it, and then puts the entries in the VM's SRAT table.

Original (unsorted) PXM in the VM SRAT table
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000007
[1C4h 0452 004h]            Proximity Domain : 00000006
[1E4h 0484 004h]            Proximity Domain : 00000005
[204h 0516 004h]            Proximity Domain : 00000004
[224h 0548 004h]            Proximity Domain : 00000003
[244h 0580 004h]            Proximity Domain : 00000009
[264h 0612 004h]            Proximity Domain : 00000002
[284h 0644 004h]            Proximity Domain : 00000008
[2A2h 0674 004h]            Proximity Domain : 00000009

After the patch (sorted)
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000002
[1C4h 0452 004h]            Proximity Domain : 00000003
[1E4h 0484 004h]            Proximity Domain : 00000004
[204h 0516 004h]            Proximity Domain : 00000005
[224h 0548 004h]            Proximity Domain : 00000006
[244h 0580 004h]            Proximity Domain : 00000007
[264h 0612 004h]            Proximity Domain : 00000008
[284h 0644 004h]            Proximity Domain : 00000009

Fixes: 0a5b5ac ("hw/acpi: Implement the SRAT GI affinity structure")
Signed-off-by: Ankit Agrawal <[email protected]>
QEMU's determination of the VMA address for a region needs an
update to handle regions that may be a BAR, but whose actual
mapping size is not power-of-2 aligned.

This happens on Grace-based systems, where the device memory is
exposed as a BAR. The mapping, however, covers only the actual
physical memory, whose size may not be power-of-2 aligned. This
affects hugepfnmap mappings on such regions.

The current algorithm determines the VMA address alignment from
the mapping's natural alignment. It should instead be based on
the next power of 2 of the mapping size.

This patch updates the algorithm to achieve that alignment.

Original VMA mapping to the device memory of size 0x2F00F00000 on a GB200
ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

After the patch application (aligned at order 13 PMD)
ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

Signed-off-by: Ankit Agrawal <[email protected]>
@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-1216 branch from f869f1b to c12786a Compare December 17, 2025 20:44
@ankita-nv
Contributor Author

Hi Matt, the branch is ready for review after addressing your suggestions. Thanks!

@nvmochs nvmochs self-requested a review December 17, 2025 21:30
Collaborator

@nvmochs nvmochs left a comment


No further issues from me.

Acked-by: Matthew R. Ochs <[email protected]>

@MitchellAugustin

MitchellAugustin commented Dec 17, 2025

I don't see links to any upstream discussions included in the commits. Could you add those so we can track the upstreaming progress?

@ankita-nv
Contributor Author

I don't see links to any upstream discussions included in the commits. Could you add those so we can track the upstreaming progress?

Hi Mitchell, it has not been posted upstream just yet.


@MitchellAugustin MitchellAugustin left a comment


Functionally, all of these changes look good to me. Please add the upstream discussion links to these commit messages once they are available (preferably before we initiate the next build).


     for (i = 0; i < region->nr_mmaps; i++) {
-        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+        size_t align = MIN(pow2ceil(region->mmaps[i].size), 1 * GiB);


LGTM

}

-static int build_acpi_generic_initiator(Object *obj, void *opaque)
+static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)


LGTM

return 0;
}

static int acpi_generic_initiator_list(Object *obj, void *opaque)


LGTM

return 0;
}

static void build_all_acpi_generic_initiators(GArray *table_data)


LGTM

return list;
}

static int build_acpi_generic_initiator(AcpiGenericInitiator *gi,


LGTM


void build_srat_generic_affinity_structures(GArray *table_data)
{
object_child_foreach_recursive(object_get_root(),


LGTM

@ankita-nv
Contributor Author

ankita-nv commented Dec 18, 2025

Thanks Matt and Mitchell for the review. Yes, I'll update with the links once I post the patches upstream.


@shamiali2008 shamiali2008 left a comment


Is this going to be the upstream proposal as well? Based on the earlier discussion, there was a suggestion to map the entire BAR region and then unmap the non-mmap regions. If this is a temporary fix, it may be good to mention that in the commit log.
