Skip to content

Conversation

@winterddd
Copy link

New aer log like follow:

        <...>-2682840 [125] ....     0.017661 aer_event 2025-03-27
17:34:44 +0800 0000:99:00.0 (Intel Corporation Device 0b60 -
vendor_id: 0x8086 device_id: 0xb60) Data Link Protocol Uncorrected
(Non-Fatal)

And introduce EDPC config in rasdaemon

New aer log like follow:

	<...>-2682840 [125] ....     0.017661 aer_event 2025-03-27
17:34:44 +0800 0000:99:00.0 (Intel Corporation Device 0b60 -
vendor_id: 0x8086 device_id: 0xb60) Data Link Protocol Uncorrected
(Non-Fatal)

Signed-off-by: Ruidong Tian <[email protected]>
ras-pcie-edpc.c Outdated
Comment on lines 20 to 23
#define PCI_EXP_DPC_CTL 0x06 /* DPC control */
#define PCI_EXP_DPC_CTL_EN_FATAL 0x0001 /* Enable trigger on ERR_FATAL message */
#define PCI_EXP_DPC_CTL_EN_NONFATAL 0x0002 /* Enable trigger on ERR_NONFATAL message */
#define PCI_EXP_DPC_CTL_INT_EN 0x0008 /* DPC Interrupt Enable */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linux provides a header file for PCI-related constants and macros, including those for Downstream Port Containment (DPC). The relevant definitions for PCI_EXP_DPC_CTL and related constants can be found in the Linux kernel header file:

#include <linux/pci_regs.h>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted

ras-pcie-edpc.c Outdated
return;

control = pci_read_word(dev, cap->addr + PCI_EXP_DPC_CTL);
need_config = ((control & (PCI_EXP_DPC_CTL_INT_EN | PCI_EXP_DPC_CTL_EN_MASK)) == EDPC_CONTROL_CONFIG) ? 0 : 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PCI_EXP_DPC_CTL_INT_EN is set to enable msi or msix interrupt when DPC is triggered, we should not set it in firmware first.

The correct logic is check if PCI_EXP_DPC_CTL_EN_FATAL is set:
need_config = (control & PCI_EXP_DPC_CTL_EN_MASK) != PCI_EXP_DPC_CTL_EN_FATAL;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted and fixed

ras-pcie-edpc.c Outdated
Comment on lines 25 to 26
#define PCI_EXP_DPC_CTL_EN_MASK (PCI_EXP_DPC_CTL_EN_FATAL | \
PCI_EXP_DPC_CTL_EN_NONFATAL)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer hard code the mask with 0x3,aka bit 1:0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use PCI_DPC_CTL_TRIGGER in pci/pci.h

ras-pcie-edpc.c Outdated

if (need_config) {
control &= ~(PCI_EXP_DPC_CTL_INT_EN | PCI_EXP_DPC_CTL_EN_MASK);
control |= EDPC_CONTROL_CONFIG;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just clear bit 1:0 and set PCI_EXP_DPC_CTL_EN_FATAL

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted

ras-pcie-edpc.c Outdated
struct pci_filter *filter = NULL;
char *token, *err, pci_names[MAX_PATH + 1];

strscpy(pci_names, names, sizeof(names));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use sizeof(pci_names) instead of sizeof(names) to ensure the destination buffer size is respected.

ras-pcie-edpc.c Outdated
for (dev = pacc->devices; dev; dev = dev->next) {
if (len) {
for (i = 0; i < len; i++)
if (pci_filter_match(&filter[i], dev))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CXL.io also supports PCIe Enhanced Downstream Port Containment (eDPC) to disable links upon
detecting uncorrectable errors to contain and enable recovery from such errors. Enabling eDPC is not
recommended in most CXL 2.0 systems because eDPC containment flow brings the link down, disrupting
CXL.cache and CXL.mem traffic which can lead to host timeouts.

https://computeexpresslink.org/wp-content/uploads/2023/12/CXL-RAS-Whitepaper-Post-WG-Revision_FINAL.pdf

Should we also add a filter to avoid enabling EDPC for CXL.mem and CXL.cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accept, test on qemu
image

System with EDPC enabled device can recovery from fatal aer error.
Rasdaemon now helps users correctly configure EDPC functionality.

Rasdaemon will enable EDPC for fatal error if PCIE_EDPC_ENABLE set
to 1. All device with EDPC capability will be enabled by default
if EDPC_DEVICE is specified, only the specified device will be
enabled. For example:
	PCIE_EDPC_ENABLE=1
	EDPC_DEVICE=0000:01:00.0
only enable device 0000:01:00.0.

Signed-off-by: Ruidong Tian <[email protected]>
pci_fill_info(dev, PCI_FILL_IDENT);
*vendor_id = dev->vendor_id;
*device_id = dev->device_id;
pci_lookup_name(pacc, pci_name, len,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cost to iterate all devices for each event is painful, do you consider adding a cache for bdf to device info?

pci_init(pacc);
pci_scan_bus(pacc);

pci_names = getenv(EDPC_DEVICE);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean pci_names to pci_bdf_names

{
struct pci_access *pacc;
struct pci_dev *dev, *dev_head, *tmp;
int ret = 0, len = 1, i;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to init len = 0

token = strtok(pci_names, ",");
while (token) {
pci_filter_init(pacc, &filter[i]);
err = pci_filter_parse_slot(&filter[i++], token);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a check to filter out pcie device and only collect down stream port

}

pci_cleanup(pacc);
free(filter);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter is double freed, the first is in config_pcie_edpc_device, line 135

@mchehab
Copy link
Owner

mchehab commented Nov 14, 2025

This one didn't apply. Please rebase on the top of upstream:

ras-pcie-edpc.c: In function ‘is_cxl_mem_or_cache’:
ras-pcie-edpc.c:44:50: error: ‘PCI_CXL_CAP’ undeclared (first use in this function); did you mean ‘PCI_DLF_CAP’?
44 | cxl_cap = pci_read_word(dev, cap->addr + PCI_CXL_CAP);
| ^~~~~~~~~~~
| PCI_DLF_CAP
ras-pcie-edpc.c:44:50: note: each undeclared identifier is reported only once for each function it appears in
ras-pcie-edpc.c:45:24: error: ‘PCI_CXL_CAP_CACHE’ undeclared (first use in this function); did you mean ‘PCI_CXL_FB_CAP_CACHE’?
45 | if (cxl_cap & (PCI_CXL_CAP_CACHE | PCI_CXL_CAP_MEM))
| ^~~~~~~~~~~~~~~~~
| PCI_CXL_FB_CAP_CACHE
ras-pcie-edpc.c:45:44: error: ‘PCI_CXL_CAP_MEM’ undeclared (first use in this function); did you mean ‘PCI_CXL_FB_CAP_MEM’?
45 | if (cxl_cap & (PCI_CXL_CAP_CACHE | PCI_CXL_CAP_MEM))
| ^~~~~~~~~~~~~~~
| PCI_CXL_FB_CAP_MEM
ras-pcie-edpc.c: In function ‘cxl_check_rp’:
ras-pcie-edpc.c:66:38: error: ‘struct pci_dev’ has no member named ‘aux’
66 | dpc_p->aux = (void *)true;
| ^~
ras-pcie-edpc.c: In function ‘config_pcie_edpc’:
ras-pcie-edpc.c:193:25: error: ‘struct pci_dev’ has no member named ‘aux’
193 | if (!dev->aux) {
| ^~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants