33 commits
3c185a8
PCI: Prepare to protect against concurrent isolated cpuset change
Jan 1, 2026
4867dea
cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug"
Jan 1, 2026
be96896
memcg: Prepare to protect against concurrent isolated cpuset change
Jan 1, 2026
2170e0e
mm: vmstat: Prepare to protect against concurrent isolated cpuset change
Jan 1, 2026
63a8f52
sched/isolation: Save boot defined domain flags
Jan 1, 2026
77fcc39
cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
Jan 1, 2026
d03291e
driver core: cpu: Convert /sys/devices/system/cpu/isolated to use HK_…
Jan 1, 2026
2b3a30e
net: Keep ignoring isolated cpuset change
Jan 1, 2026
c4d75fc
block: Protect against concurrent isolated cpuset change
Jan 1, 2026
d894aaf
timers/migration: Prevent from lockdep false positive warning
Jan 1, 2026
67ee4c8
cpu: Provide lockdep check for CPU hotplug lock write-held
Jan 1, 2026
8293709
cpuset: Provide lockdep check for cpuset lock held
Jan 1, 2026
dea57fb
sched/isolation: Convert housekeeping cpumasks to rcu pointers
Jan 1, 2026
f02ffa4
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
Jan 1, 2026
d507be9
sched/isolation: Flush memcg workqueues on cpuset isolated partition …
Jan 1, 2026
15c760d
sched/isolation: Flush vmstat workqueues on cpuset isolated partition…
Jan 1, 2026
978d49f
PCI: Flush PCI probe workqueue on cpuset isolated partition change
Jan 1, 2026
b92e715
cpuset: Propagate cpuset isolation update to workqueue through housek…
Jan 1, 2026
f4f3778
cpuset: Propagate cpuset isolation update to timers through housekeeping
Jan 1, 2026
197cf7d
timers/migration: Remove superfluous cpuset isolation test
Jan 1, 2026
7de59cf
cpuset: Remove cpuset_cpu_is_isolated()
Jan 1, 2026
a390f68
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
Jan 1, 2026
dd3b9e2
PCI: Remove superfluous HK_TYPE_WQ check
Jan 1, 2026
c36af65
kthread: Refine naming of affinity related fields
Jan 1, 2026
b58c120
kthread: Include unbound kthreads in the managed affinity list
Jan 1, 2026
4905f3c
kthread: Include kthreadd to the managed affinity list
Jan 1, 2026
f1b3f12
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
Jan 1, 2026
f3fa209
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
Jan 1, 2026
0bf865b
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
Jan 1, 2026
f131db7
kthread: Honour kthreads preferred affinity after cpuset changes
Jan 1, 2026
3a2778f
kthread: Comment on the purpose and placement of kthread_affine_node(…
Jan 1, 2026
979a102
kthread: Document kthread_affine_preferred()
Jan 1, 2026
f1b72fe
doc: Add housekeeping documentation
Jan 1, 2026
111 changes: 111 additions & 0 deletions Documentation/core-api/housekeeping.rst
@@ -0,0 +1,111 @@
======================================
Housekeeping
======================================


CPU isolation moves away kernel work that may otherwise run on any CPU.
The purpose of the related features is to reduce the OS jitter that some
extreme workloads, such as certain DPDK use cases, can't tolerate.

The kernel work moved away by CPU isolation is commonly described as
"housekeeping" because it includes ground work such as cleanups,
statistics maintenance and the actions relying on them, memory release,
various deferrals, etc.

Sometimes housekeeping is just unbound work (unbound workqueues,
unbound timers, ...) that is easily assigned to non-isolated CPUs.
But sometimes housekeeping is tied to a specific CPU and requires
elaborate tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote
scheduler tick, etc.).

Thus, a housekeeping CPU can be considered the opposite of an isolated
CPU: it is simply a CPU that can execute housekeeping work. There must
always be at least one online housekeeping CPU. The CPUs that are not
isolated are automatically assigned as housekeeping.
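
Whether a given CPU is housekeeping for a given feature can be tested
with ``housekeeping_cpu()``, as done elsewhere in this series::

    if (housekeeping_cpu(cpu, HK_TYPE_DOMAIN)) {
        /* @cpu may run work moved away by domain isolation */
    }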

Housekeeping is currently divided into four features described
by ``enum hk_type``:

1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain
isolation performed through the ``isolcpus=domain`` boot parameter or
isolated cpuset partitions in cgroup v2. This includes scheduler
load balancing, unbound workqueues and timers.

2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation
performed through the ``nohz_full=`` or ``isolcpus=nohz`` boot
parameters. This includes the remote scheduler tick, vmstat and the
lockup watchdog.

3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed
IRQ isolation performed through ``isolcpus=managed_irq``.

4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain
isolation performed through ``isolcpus=domain`` only. It is similar
to HK_TYPE_DOMAIN except that it ignores the isolation performed by
cpusets.
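
For reference, a sketch of what the enum might look like after this
series (the names are taken from the patches; the order and the
terminating entry are illustrative)::

    enum hk_type {
        HK_TYPE_DOMAIN,
        HK_TYPE_MANAGED_IRQ,
        HK_TYPE_KERNEL_NOISE,
        HK_TYPE_DOMAIN_BOOT,
        HK_TYPE_MAX
    };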


Housekeeping cpumasks
=================================

Housekeeping cpumasks include the CPUs that can execute the work moved
away by the matching isolation feature. These cpumasks are returned by
the following function::

const struct cpumask *housekeeping_cpumask(enum hk_type type)

By default, if neither ``nohz_full=``, nor ``isolcpus=``, nor cpuset's
isolated partitions are used, which covers most use cases, this function
returns ``cpu_possible_mask``.

Otherwise the function returns the complement of the feature's
isolated cpumask. For example:

With ``isolcpus=domain,7`` the following returns a mask with all
possible CPUs except CPU 7::

housekeeping_cpumask(HK_TYPE_DOMAIN)

Similarly, with ``nohz_full=5,6`` the following returns a mask with
all possible CPUs except CPUs 5 and 6::

housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
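
Callers that need a specific target CPU typically pick one from the
relevant housekeeping cpumask. A minimal sketch, assuming a caller that
prefers a CPU close to a given NUMA node (see the synchronization
section below for the RCU protection this election may require)::

    int cpu;

    /* Prefer a housekeeping CPU on the local node... */
    cpu = cpumask_any_and(cpumask_of_node(node),
                          housekeeping_cpumask(HK_TYPE_DOMAIN));
    /* ...otherwise fall back to any housekeeping CPU */
    if (cpu >= nr_cpu_ids)
        cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);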


Synchronization against cpusets
=================================

Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating,
modifying or deleting an isolated partition.

Users of the HK_TYPE_DOMAIN cpumask must therefore synchronize against
cpuset in order to ensure that:

1. The cpumask snapshot stays coherent.

2. No housekeeping work is queued to a newly isolated CPU.

3. Pending housekeeping work that was queued to a non-isolated CPU
which has just become isolated through cpuset is flushed before the
related created/modified isolated partition is made available to
userspace.

This synchronization is maintained by an RCU-based scheme. The cpuset
update side waits for an RCU grace period after updating the
HK_TYPE_DOMAIN cpumask and before flushing pending works. On the read
side, care must be taken to perform the housekeeping target election
and the work enqueue within the same RCU read-side critical section.

A typical layout would look like this on the update side
(``housekeeping_update()``)::

rcu_assign_pointer(housekeeping_cpumasks[type], trial);
synchronize_rcu();
flush_workqueue(example_workqueue);

And then on the read side::

rcu_read_lock();
cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);
queue_work_on(cpu, example_workqueue, work);
rcu_read_unlock();
1 change: 1 addition & 0 deletions Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ it.
symbol-namespaces
asm-annotations
real-time/index
housekeeping.rst

Data structures and low-level utilities
=======================================
18 changes: 15 additions & 3 deletions arch/arm64/kernel/cpufeature.c
@@ -1656,6 +1668,18 @@ has_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
return feature_matches(val, entry);
}

/*
* CPUs supporting 32-bit EL0 can't be isolated because tasks may be
* arbitrarily affine to them, defeating the purpose of isolation.
*/
bool arch_isolated_cpus_can_update(struct cpumask *new_cpus)
{
if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
return !cpumask_intersects(cpu_32bit_el0_mask, new_cpus);
else
return true;
}
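
On architectures that don't define this hook, a generic default
presumably allows any update; a sketch, assuming a __weak fallback in
common code:

	/* Assumed generic fallback: no arch constraint on isolated CPUs */
	bool __weak arch_isolated_cpus_can_update(struct cpumask *new_cpus)
	{
		return true;
	}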

const struct cpumask *system_32bit_el0_cpumask(void)
{
if (!system_supports_32bit_el0())
@@ -1669,7 +1681,7 @@ const struct cpumask *system_32bit_el0_cpumask(void)

const struct cpumask *task_cpu_fallback_mask(struct task_struct *p)
{
return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_TICK));
return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_DOMAIN));
}

static int __init parse_32bit_el0_param(char *str)
@@ -3987,8 +3999,8 @@ static int enable_mismatched_32bit_el0(unsigned int cpu)
bool cpu_32bit = false;

if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
if (!housekeeping_cpu(cpu, HK_TYPE_TICK))
pr_info("Treating adaptive-ticks CPU %u as 64-bit only\n", cpu);
if (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
pr_info("Treating domain isolated CPU %u as 64-bit only\n", cpu);
else
cpu_32bit = true;
}
6 changes: 5 additions & 1 deletion block/blk-mq.c
@@ -4257,12 +4257,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)

/*
* Rule out isolated CPUs from hctx->cpumask to avoid
* running block kworker on isolated CPUs
* running block kworker on isolated CPUs.
* FIXME: cpuset should propagate further changes to isolated CPUs
* here.
*/
rcu_read_lock();
for_each_cpu(cpu, hctx->cpumask) {
if (cpu_is_isolated(cpu))
cpumask_clear_cpu(cpu, hctx->cpumask);
}
rcu_read_unlock();

/*
* Initialize batch roundrobin counts
2 changes: 1 addition & 1 deletion drivers/base/cpu.c
@@ -291,7 +291,7 @@ static ssize_t print_cpus_isolated(struct device *dev,
return -ENOMEM;

cpumask_andnot(isolated, cpu_possible_mask,
housekeeping_cpumask(HK_TYPE_DOMAIN));
housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
len = sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(isolated));

free_cpumask_var(isolated);
71 changes: 52 additions & 19 deletions drivers/pci/pci-driver.c
@@ -302,9 +302,8 @@ struct drv_dev_and_id {
const struct pci_device_id *id;
};

static long local_pci_probe(void *_ddi)
static int local_pci_probe(struct drv_dev_and_id *ddi)
{
struct drv_dev_and_id *ddi = _ddi;
struct pci_dev *pci_dev = ddi->dev;
struct pci_driver *pci_drv = ddi->drv;
struct device *dev = &pci_dev->dev;
@@ -338,6 +337,21 @@ static long local_pci_probe(void *_ddi)
return 0;
}

static struct workqueue_struct *pci_probe_wq;

struct pci_probe_arg {
struct drv_dev_and_id *ddi;
struct work_struct work;
int ret;
};

static void local_pci_probe_callback(struct work_struct *work)
{
struct pci_probe_arg *arg = container_of(work, struct pci_probe_arg, work);

arg->ret = local_pci_probe(arg->ddi);
}

static bool pci_physfn_is_probed(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_IOV
@@ -362,40 +376,55 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	dev->is_probed = 1;
 
 	cpu_hotplug_disable();
 
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
 	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
+		error = local_pci_probe(&ddi);
 	} else {
-		cpumask_var_t wq_domain_mask;
+		struct pci_probe_arg arg = { .ddi = &ddi };
 
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
-		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+		INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
+		/*
+		 * The target election and the enqueue of the work must be within
+		 * the same RCU read side section so that when the workqueue pool
+		 * is flushed after a housekeeping cpumask update, further readers
+		 * are guaranteed to queue the probing work to the appropriate
+		 * targets.
+		 */
+		rcu_read_lock();
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(HK_TYPE_DOMAIN));
+
+		if (cpu < nr_cpu_ids) {
+			struct workqueue_struct *wq = pci_probe_wq;
+
+			if (WARN_ON_ONCE(!wq))
+				wq = system_percpu_wq;
+			queue_work_on(cpu, wq, &arg.work);
+			rcu_read_unlock();
+			flush_work(&arg.work);
+			error = arg.ret;
+		} else {
+			rcu_read_unlock();
+			error = local_pci_probe(&ddi);
+		}
+		destroy_work_on_stack(&arg.work);
 	}
 
-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
-out:
 	dev->is_probed = 0;
 	cpu_hotplug_enable();
 	return error;
 }
 
+void pci_probe_flush_workqueue(void)
+{
+	flush_workqueue(pci_probe_wq);
+}
+
 /**
  * __pci_device_probe - check if a driver wants to claim a specific PCI device
  * @drv: driver to call to check if it wants the PCI device
@@ -1733,6 +1762,10 @@ static int __init pci_driver_init(void)
{
int ret;

pci_probe_wq = alloc_workqueue("sync_wq", WQ_PERCPU, 0);
if (!pci_probe_wq)
return -ENOMEM;

ret = bus_register(&pci_bus_type);
if (ret)
return ret;
4 changes: 4 additions & 0 deletions include/linux/cpu.h
@@ -229,4 +229,8 @@ static inline bool cpu_attack_vector_mitigated(enum cpu_attack_vectors v)
#define smt_mitigations SMT_MITIGATIONS_OFF
#endif

struct cpumask;

bool arch_isolated_cpus_can_update(struct cpumask *new_cpus);

#endif /* _LINUX_CPU_H_ */
1 change: 1 addition & 0 deletions include/linux/cpuhplock.h
@@ -13,6 +13,7 @@
struct device;

extern int lockdep_is_cpus_held(void);
extern int lockdep_is_cpus_write_held(void);

#ifdef CONFIG_HOTPLUG_CPU
void cpus_write_lock(void);
8 changes: 2 additions & 6 deletions include/linux/cpuset.h
@@ -18,6 +18,8 @@
#include <linux/mmu_context.h>
#include <linux/jump_label.h>

extern bool lockdep_is_cpuset_held(void);

#ifdef CONFIG_CPUSETS

/*
@@ -77,7 +79,6 @@ extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -213,11 +214,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}

static inline bool cpuset_cpu_is_isolated(int cpu)
{
return false;
}

static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
1 change: 1 addition & 0 deletions include/linux/kthread.h
@@ -100,6 +100,7 @@ void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
void kthread_complete_and_exit(struct completion *, long) __noreturn;
int kthreads_update_housekeeping(void);

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
4 changes: 4 additions & 0 deletions include/linux/memcontrol.h
@@ -1037,6 +1037,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return id;
}

void mem_cgroup_flush_workqueue(void);

extern int mem_cgroup_init(void);
#else /* CONFIG_MEMCG */

@@ -1436,6 +1438,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return 0;
}

static inline void mem_cgroup_flush_workqueue(void) { }

static inline int mem_cgroup_init(void) { return 0; }
#endif /* CONFIG_MEMCG */

2 changes: 1 addition & 1 deletion include/linux/mmu_context.h
@@ -24,7 +24,7 @@ static inline void leave_mm(void) { }
#ifndef task_cpu_possible_mask
# define task_cpu_possible_mask(p) cpu_possible_mask
# define task_cpu_possible(cpu, p) true
# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_TICK)
# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_DOMAIN)
#else
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif