-
Notifications
You must be signed in to change notification settings - Fork 103
print pci device name and intruduce EDPC config #208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
New aer log like follow: <...>-2682840 [125] .... 0.017661 aer_event 2025-03-27 17:34:44 +0800 0000:99:00.0 (Intel Corporation Device 0b60 - vendor_id: 0x8086 device_id: 0xb60) Data Link Protocol Uncorrected (Non-Fatal) Signed-off-by: Ruidong Tian <[email protected]>
ras-pcie-edpc.c
Outdated
| #define PCI_EXP_DPC_CTL 0x06 /* DPC control */ | ||
| #define PCI_EXP_DPC_CTL_EN_FATAL 0x0001 /* Enable trigger on ERR_FATAL message */ | ||
| #define PCI_EXP_DPC_CTL_EN_NONFATAL 0x0002 /* Enable trigger on ERR_NONFATAL message */ | ||
| #define PCI_EXP_DPC_CTL_INT_EN 0x0008 /* DPC Interrupt Enable */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linux provides a header file for PCI-related constants and macros, including those for Downstream Port Containment (DPC). The relevant definitions for PCI_EXP_DPC_CTL and related constants can be found in the Linux kernel header file:
#include <linux/pci_regs.h>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepted
ras-pcie-edpc.c
Outdated
| return; | ||
|
|
||
| control = pci_read_word(dev, cap->addr + PCI_EXP_DPC_CTL); | ||
| need_config = ((control & (PCI_EXP_DPC_CTL_INT_EN | PCI_EXP_DPC_CTL_EN_MASK)) == EDPC_CONTROL_CONFIG) ? 0 : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PCI_EXP_DPC_CTL_INT_EN is set to enable msi or msix interrupt when DPC is triggered, we should not set it in firmware first.
The correct logic is check if PCI_EXP_DPC_CTL_EN_FATAL is set:
need_config = (control & PCI_EXP_DPC_CTL_EN_MASK) != PCI_EXP_DPC_CTL_EN_FATAL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepted and fixed
ras-pcie-edpc.c
Outdated
| #define PCI_EXP_DPC_CTL_EN_MASK (PCI_EXP_DPC_CTL_EN_FATAL | \ | ||
| PCI_EXP_DPC_CTL_EN_NONFATAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer hard code the mask with 0x3,aka bit 1:0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use PCI_DPC_CTL_TRIGGER in pci/pci.h
ras-pcie-edpc.c
Outdated
|
|
||
| if (need_config) { | ||
| control &= ~(PCI_EXP_DPC_CTL_INT_EN | PCI_EXP_DPC_CTL_EN_MASK); | ||
| control |= EDPC_CONTROL_CONFIG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just clear bit 1:0 and set PCI_EXP_DPC_CTL_EN_FATAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepted
ras-pcie-edpc.c
Outdated
| struct pci_filter *filter = NULL; | ||
| char *token, *err, pci_names[MAX_PATH + 1]; | ||
|
|
||
| strscpy(pci_names, names, sizeof(names)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use sizeof(pci_names) instead of sizeof(names) to ensure the destination buffer size is respected.
ras-pcie-edpc.c
Outdated
| for (dev = pacc->devices; dev; dev = dev->next) { | ||
| if (len) { | ||
| for (i = 0; i < len; i++) | ||
| if (pci_filter_match(&filter[i], dev)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CXL.io also supports PCIe Enhanced Downstream Port Containment (eDPC) to disable links upon
detecting uncorrectable errors to contain and enable recovery from such errors. Enabling eDPC is not
recommended in most CXL 2.0 systems because eDPC containment flow brings the link down, disrupting
CXL.cache and CXL.mem traffic which can lead to host timeouts.
Should we also add a filter to avoid enabling EDPC for CXL.mem and CXL.cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c84ffde to
2e91426
Compare
System with EDPC enabled device can recovery from fatal aer error. Rasdaemon now helps users correctly configure EDPC functionality. Rasdaemon will enable EDPC for fatal error if PCIE_EDPC_ENABLE set to 1. All device with EDPC capability will be enabled by default if EDPC_DEVICE is specified, only the specified device will be enabled. For example: PCIE_EDPC_ENABLE=1 EDPC_DEVICE=0000:01:00.0 only enable device 0000:01:00.0. Signed-off-by: Ruidong Tian <[email protected]>
| pci_fill_info(dev, PCI_FILL_IDENT); | ||
| *vendor_id = dev->vendor_id; | ||
| *device_id = dev->device_id; | ||
| pci_lookup_name(pacc, pci_name, len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cost to iterate all devices for each event is painful, do you consider adding a cache for bdf to device info?
| pci_init(pacc); | ||
| pci_scan_bus(pacc); | ||
|
|
||
| pci_names = getenv(EDPC_DEVICE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean pci_names to pci_bdf_names
| { | ||
| struct pci_access *pacc; | ||
| struct pci_dev *dev, *dev_head, *tmp; | ||
| int ret = 0, len = 1, i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to init len = 0
| token = strtok(pci_names, ","); | ||
| while (token) { | ||
| pci_filter_init(pacc, &filter[i]); | ||
| err = pci_filter_parse_slot(&filter[i++], token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a check to filter out pcie device and only collect down stream port
| } | ||
|
|
||
| pci_cleanup(pacc); | ||
| free(filter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter is double freed, the first is in config_pcie_edpc_device, line 135
|
This one didn't apply. Please rebase on the top of upstream: ras-pcie-edpc.c: In function ‘is_cxl_mem_or_cache’: |

New aer log like follow:
And introduce EDPC config in rasdaemon