-
Notifications
You must be signed in to change notification settings - Fork 115
drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages #194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages #194
Conversation
Add register list and enable devcoredump for JPEG4_0_3 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG5_0_1 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG4_0_0 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG4_0_5 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG3_0_0 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG2_0_0 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG2_5_0 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Add register list and enable devcoredump for JPEG5_0_0 V2: (Lijo) - remove version specific callbacks and use simplified helper functions V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here Signed-off-by: Sathishkumar S <[email protected]> Reviewed-by: Leo Liu <[email protected]> Acked-by: Lijo Lazar <[email protected]>
JPEG 5.0.1 supports upto 10 rings, however PMFW support for SMU v13.0.6 variants is now limited to 8 per instance. Limit to 8 temporarily to avoid out of bounds access. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]>
Currently, there are several files in drm/amd/display that aim to have a higher -Wframe-larger-than value to avoid instances of that warning with a lower value from the user's configuration. However, with the way that it is currently implemented, it does not respect the user's request via CONFIG_FRAME_WARN for a higher stack frame limit, which can cause pain when new instances of the warning appear and break the build due to CONFIG_WERROR. Adjust the logic to switch from a hard coded -Wframe-larger-than value to only using the value as a minimum clamp and deferring to the requested value from CONFIG_FRAME_WARN if it is higher. Suggested-by: Harry Wentland <[email protected]> Reported-by: Greg Kroah-Hartman <[email protected]> Closes: https://lore.kernel.org/2025013003-audience-opposing-7f95@gregkh/ Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
mpc1_is_mpcc_idle() was added in 2017 by commit feb4a3c ("drm/amd/display: Integrating MPC pseudocode") but never used. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
mod_freesync_get_vmin_vmax() and mod_freesync_get_v_position() were added in 2017 by commit 72ada5f ("drm/amd/display: FreeSync Auto Sweep Support") mod_freesync_is_valid_range() was added in 2018 by commit e80e944 ("drm/amd/display: add method to check for supported range") mod_freesync_get_settings() was added in 2018 by commit a3e1737 ("drm/amd/display: Implement stats logging") and mod_freesync_calc_field_rate_from_timing() was added in 2020 by commit 49c70ec ("drm/amd/display: Change input parameter for set_drr") None of these have been used. Remove them. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
The last user of dc_stream_get_crtc_position() was mod_freesync_get_v_position() which is removed in a previous patch in this series. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
get_clock_requirements_for_state() was added in 2018 by commit 8ab2180 ("drm/amd/display: Add function to fetch clock requirements") but never used. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
hubbub1_toggle_watermark_change_req() last use was removed in 2017 by commit b8fce2c ("drm/amd/display: Optimize programming front end") Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
get_max_support_fbc_buffersize() is unused since 2021's commit 94f0d0c ("drm/amd/display/dc/dce110/dce110_compressor: Remove unused function 'dce110_get_required_compressed_surfacesize") removed it's only caller. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
link_enc_cfg_get_link_enc_used_by_stream() is no longer used after 2021's: commit 6366b00 ("drm/amd/display: Maintain consistent mode of operation during encoder assignment") which introduces and uses the _current version instead. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
This reverts commit
2a69ae1e1354 ("drm/amd/display: Use HW lock mgr for PSR1")
Because it may cause system hang while connect with two edp panel.
Acked-by: Wayne Lin <[email protected]>
Signed-off-by: Tom Chung <[email protected]>
It is possible for some waves in a workgroup to finish their save sequence before the group leader has had time to capture the workgroup barrier state. When this happens, having those waves exit do impact the barrier state. As a consequence, the state captured by the group leader is invalid, and is eventually incorrectly restored. This patch proposes to have all waves in a workgroup wait for each other at the end of their save sequence (just before calling s_endpgm_saved). Signed-off-by: Lancelot SIX <[email protected]> Reviewed-by: Jay Cornwall <[email protected]>
Use DC_LOG_DEBUG instead of pr_info to match other uses in dc.c.
Fixes: eb8eec752038 ("drm/amd/display: Add debug messages for dc_validate_boot_timing()")
Reviewed-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Hung <[email protected]>
SDMA 4.x is not part of the GFX power domain so this is not necessary. Reviewed-by: Kenneth Feng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
atom bios header files are not required in these files. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]>
Wrap the checks before device access in helper functions and use them for device access. The generic order of APIs now is to do input argument validation first and check if device access is allowed. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
If a device supports runtime pm, then pm_runtime_get_if_active returns 0 if a device is not active and 1 if already active. However, if a device doesn't support runtime pm, the API returns -EINVAL. A device not supporting runtime pm implies it's not affected by runtime pm and it's active. Hence no need to get() to increment usage count. Remove < 0 return value check. Also, ignore runpm state to determine active status. If the device is already in suspend state, disallow access. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
For amdgpu_get_pp_force_state, amdgpu_get_pp_cur_state already takes care of device state check. In other cases, values are returned from driver cached variables and are not dependent on device state. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
These were missed in the cleanup patch, so remove them to allow the kernel to compile again Fixes: 083e622 ("drm/amdkcl: cleanup macro AMDKCL_AMDGPU_DMABUF_OPS") Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Bob Zhou <[email protected]>
MES requires driver set cleaner_shader_fence_mc_addr for cleaner shader support. Signed-off-by: Shaoyun Liu <[email protected]> Acked-by: Alex Deucher <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Change-Id: Ie7a20254683948735c6c3b9c20e6f0f842ab0720
When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips. v2: limit to compute v3: limit to APUs v4: limit to Raven/PCO v5: only update the compute ring_funcs v6: Disable GFX PG v7: adjust order Reviewed-by: Lijo Lazar <[email protected]> Suggested-by: Błażej Szczygieł <[email protected]> Suggested-by: Sergey Kovalenko <[email protected]> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861 Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html Signed-off-by: Alex Deucher <[email protected]>
Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO. Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Same as amdgpu_gfx_off_ctrl(), but without the delay for gfxoff disallow. Reviewed-by: Lijo Lazar <[email protected]> Suggested-by: Błażej Szczygieł <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
[Why] Register access print missed device info. [How] Using dev_xxx instead of DRM_xxx to indicate which device of a hive is the message for. Signed-off-by: Yifan Zha <[email protected]> Reviewed-by: Alex Deucher <[email protected]>
This patch updates the cleaner shader, which is responsible for initializing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). Changes include adjustments to register clearing and shader configuration. - Updated GPU resource initialization addresses in the cleaner shader from `be803080` to `be803000`. - Simplified the logic in the SGPR clearing section, ensuring all SGPRs are set to zero. Fixes: 25961ba ("drm/amdgpu/gfx10: Add cleaner shader for GFX10.1.10") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Manu Rastogi <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Acked-by: Alex Deucher <[email protected]>
Firmware algorithm changed and the values in this version are not accurate thereby remove host limit metric support for smu_v13_0_6, smu_v13_0_12 & smu_v13_0_14 Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]>
Update smu metrics table to vesrion 0x10 for smu_v13_0_6 v2: Host metrics support removal moved to separate patch (Lijo) Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]>
Add pldm version reporting through sysfs node Signed-off-by: Asad Kamal <[email protected]> Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Update pmfw headers for smu_v_13_0_6 to include pldm version as part of statics metrics table Signed-off-by: Asad Kamal <[email protected]> Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Fetch pldm version from static metrics table for SMU v13.0.6 SOCs Signed-off-by: Asad Kamal <[email protected]> Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Enable static metrics support to fetch board voltage and pldm version for other smu_v13_0_6 program Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Enable static metrics support to fetch board voltage and pldm version for smu_v13_0_14 Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
KIQ invalidate_tlbs request has been seen to marginally exceed the configured 100 ms timeout on systems under load. All other KIQ requests in the driver use a 10 second timeout. Use a similar timeout implementation on the invalidate_tlbs path. v2: Poll once before msleep v3: Fix return value Signed-off-by: Jay Cornwall <[email protected]> Cc: Kent Russell <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> (cherry picked from commit efc206db3bcf6a5ce4dc3ecb97aba4b4548ffcc2) Change-Id: I932ad73e0a6bc71d4a917ab9ed69f1f071f29fef
To unmap and free seq64 memory when drm node close to free vm, if there is signal accepted, then taking vm lock failed and leaking seq64 va mapping, and then dmesg has error log "still active bo inside vm". Change to use uninterruptible lock fix the mapping leaking and no dmesg error log. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Christian König <[email protected]>
When an entity from application B is killed, drm_sched_entity_kill() removes all jobs belonging to that entity through drm_sched_entity_kill_jobs_work(). If application A's job depends on a scheduled fence from application B's job, and that fence is not properly signaled during the killing process, application A's dependency cannot be cleared. This leads to application A hanging indefinitely while waiting for a dependency that will never be resolved. Fix this issue by ensuring that scheduled fences are properly signaled when an entity is killed, allowing dependent applications to continue execution. Signed-off-by: Lin.Cao <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Philipp Stanner <[email protected]>
1. add kicker device list 2. add kicker device checking helper function Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
1. Add kicker firmwares loading for gfx11/smu13/psp13 2. Register additional MODULE_FIRMWARE entries for kicker fws - gc_11_0_0_rlc_kicker.bin - gc_11_0_0_imu_kicker.bin - psp_13_0_0_sos_kicker.bin - psp_13_0_0_ta_kicker.bin - smu_13_0_0_kicker.bin Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
Enable the cleaner shader for other GFX9.x series of GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results. This update extends cleaner shader support to GFX9.x GPUs, previously available for GFX9.4.2. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads. Cc: Manu Rastogi <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Acked-by: Alex Deucher <[email protected]>
Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation: - Disabled (0) - Enabled (serialization and cleaner shader) (1) - Enabled in legacy mode (no serialization or cleaner shader) (2) This provides better flexibility for more use cases. Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Adjusted the enforce isolation setting handling to include the ability to disable the cleaner shader without affecting isolation between tasks. v2: Updated enforce isolation documentation and parameters. (Alex) Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Alex Deucher <[email protected]>
If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers deadlock with below backtrace. The deadlock will leak kfd process as mmu notifier release is not called and cause VRAM leaking. The fix is to take mm reference mmget_non_zero when adding prange to the deferred list to pair with mmput in deferred list work. If prange split and add into pchild list, the pchild work_item.mm is not used, so remove the mm parameter from svm_range_unmap_split and svm_range_add_child. The backtrace of hung task: INFO: task python:348105 blocked for more than 64512 seconds. Call Trace: __schedule+0x1c3/0x550 schedule+0x46/0xb0 rwsem_down_write_slowpath+0x24b/0x4c0 unlink_anon_vmas+0xb1/0x1c0 free_pgtables+0xa9/0x130 exit_mmap+0xbc/0x1a0 mmput+0x5a/0x140 svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu] mn_itree_invalidate+0x72/0xc0 __mmu_notifier_invalidate_range_start+0x48/0x60 try_to_unmap_one+0x10fa/0x1400 rmap_walk_anon+0x196/0x460 try_to_unmap+0xbb/0x210 migrate_page_unmap+0x54d/0x7e0 migrate_pages_batch+0x1c3/0xae0 migrate_pages_sync+0x98/0x240 migrate_pages+0x25c/0x520 compact_zone+0x29d/0x590 compact_zone_order+0xb6/0xf0 try_to_compact_pages+0xbe/0x220 __alloc_pages_direct_compact+0x96/0x1a0 __alloc_pages_slowpath+0x410/0x930 __alloc_pages_nodemask+0x3a9/0x3e0 do_huge_pmd_anonymous_page+0xd7/0x3e0 __handle_mm_fault+0x5e3/0x5f0 handle_mm_fault+0xf7/0x2e0 hmm_vma_fault.isra.0+0x4d/0xa0 walk_pmd_range.isra.0+0xa8/0x310 walk_pud_range+0x167/0x240 walk_pgd_range+0x55/0x100 __walk_page_range+0x87/0x90 walk_page_range+0xf6/0x160 hmm_range_fault+0x4f/0x90 amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu] amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu] init_user_pages+0xb1/0x2a0 [amdgpu] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu] kfd_ioctl+0x29d/0x500 [amdgpu] Fixes: fa582c6 ("drm/amdkfd: Use mmget_not_zero in MMU notifier") Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> (cherry picked from commit 382c280) Change-Id: Iec202cbf4ad7085cfca1083474e136dc0a822cff
Set HDP_MMHUB_CNTL.HDP_MMHUB_RO_OVERRIDE = 0x0 for gfx943 dGPU. This is needed for enhanced RCCL performance v2: Set the register only if not already set Change-Id: Ifee2fe308bdb9ce4d8b2c613cc13a09c429d0e7d Signed-off-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> (cherry picked from commit 2ab8116dfd6351952a74756aba2f9e10a9e26543)
On multi-aid SOCs, there could be a continuous stream of interrupts from GC after poison consumption. Suspend IH to disable them before doing mode-2 reset. This avoids conflicts in hardware accesses during interrupt handlers while a reset is ongoing. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
Once RAS errors are cleared with appropriate recovery mechanism, clear reset flags also from RAS context. Otherwise, stale flag values could affect the subsequent RAS reset handling on the device. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> (cherry picked from commit 2ed5d493e5ae6b86988e70fccbd9554330762080) Change-Id: I106126dc01e203419a4d59e8100e4b249d9fff41
amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails. Originally found by Chuck Fossen from HPE Signed-off-by: Matt Ezell <[email protected]>
|
Thanks Matt (and Chuck by extension). I've pulled this internally to the dkms-branch-review process. I'll leave this open until it closes off (and will share any review comments in the interim, should they arise). I'll have an idea on the timeline for which ROCm release will have it once we merge it. |
|
Thanks!
We are actually getting kernel panics in the cassini driver on Frontier that Chuck thinks are likely due to this. I haven't tracked the whole process, but Cassini thinks it's getting device memory from amdgpu that isn't due to this. The code team running the app that triggers this found a bug in their code that was sending the driver down this path. They have fixed that in the interim and panics have stopped, so it's not an emergency at the moment. |
|
So it's not in ROCm 7.0.* or older releases yet. It's going to be out in ROCm 7.1. At long last. Keeping it open still until then though |
amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails.
Originally found by Chuck Fossen from HPE