Skip to content

Conversation

@jianminzhao
Copy link
Contributor

There is a race between _housekeeper = nullptr in CollectionImpl::stopHousekeeping and enqueue in Housekeeper::doExpirationAsync. This PR attempts to fix it.

@cbl-bot
Copy link

cbl-bot commented Dec 17, 2025

Code Coverage Results:

Type Percentage
branches 65.92
functions 78.19
instantiations 71.69
lines 77.18
regions 73.57

There is a race between _housekeeper = nullptr in CollectionImpl::stopHousekeeping and enqueue in Housekeeper::doExpirationAsync.
This PR attempts to fix it.
@jianminzhao
Copy link
Contributor Author

This test stands at top in terms of frequency of random failures in GH or Jenkins. I have never seen it happening on my working Mac. However, I can repro and see the new code being exercised by adding a sleep of 30ms as the first statement in Housekeeper::doExpirationAsync. (not 100% but after several runs)

Copy link
Collaborator

@snej snej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too complicated.

The problem is that the timer can fire concurrently with being stopped -- its callback (the call to doExpirationAsync) could be running after _expiryTimer.stop() returns, and so the Housekeeper could be freed.

I think this can be fixed by (a) making the timer a unique_ptr, and (b) deleting it, not just stopping it, in Housekeeper::_stop. The Timer's callback is guaranteed not to be running after its destructor returns.

@jianminzhao
Copy link
Contributor Author

Make sense. Do you think we should make the housekeeper re-startable after stop(), or simply document it as not re-startable?

void _start();
void _stop();

bool _isStopped() const { return !_expiryTimer; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is needed to deal with another race between A. releasing of Database point, and B. using of BackgroundDB in actor's methods of Housekeeper. We bail out in the actor's methods if _isStopped()==true.
Another approach is to retain the database at the time to enqueue. Maybe this is overkill. What do you think?

@jianminzhao jianminzhao requested a review from snej December 18, 2025 17:20
@jianminzhao jianminzhao merged commit 883c5cc into master Dec 18, 2025
9 checks passed
@jianminzhao jianminzhao deleted the cbl-7699 branch December 18, 2025 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants