
Conversation

ChuanFF
Contributor

@ChuanFF ChuanFF commented Jun 4, 2025

There is a spelling error in the k8s discovery module. post_List should be post_list.

Description

There is a spelling error in the k8s service discovery module code. As shown below, the condition in apisix/discovery/kubernetes/informer_factory.lua should check informer.post_list, not informer.post_List. Because of this typo, the post_list function is never executed.

    informer.fetch_state = "list finished"
    if informer.post_List then
        informer:post_list()
    end

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

Spelling error. post_List should be post_list
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Jun 4, 2025
@ChuanFF ChuanFF changed the title fix: typos in k8s service discovery code fix: typos in k8s discovery code Jun 4, 2025
@Baoyuantop
Contributor

Hi @ChuanFF, you need to add a test for this change.

@ChuanFF
Contributor Author

ChuanFF commented Jun 5, 2025

@Baoyuantop it's just a typo; is a test really necessary?

@Baoyuantop
Contributor

The current bug causes the post_list function to not be executed.

We need to verify this.

@ChuanFF
Contributor Author

ChuanFF commented Jun 6, 2025

There are two typos in the following code:
informer.pre_List should be informer.pre_list,
informer.post_List should be informer.post_list.

    if informer.pre_List then
        informer:pre_list()
    end
    ok, reason, message = list(httpc, apiserver, informer)
    if not ok then
        informer.fetch_state = "list failed"
        core.log.error("list failed, kind: ", informer.kind,
                       ", reason: ", reason, ", message : ", message)
        return false
    end
    informer.fetch_state = "list finished"
    if informer.post_List then
        informer:post_list()
    end

Function Behavior:

  • pre_list calls shared_dict:flush_all() (docs), which expires all keys in the shared dictionary but doesn't physically delete them.
  • post_list calls shared_dict:flush_expired() (docs), which physically deletes expired keys from the shared dictionary.
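
The interaction between the two calls can be illustrated with a toy model. This is a Python sketch for illustration only; ToySharedDict and its fields are hypothetical stand-ins, since the real ngx.shared.DICT lives in Nginx shared memory and is implemented in C:

```python
import time

class ToySharedDict:
    """A minimal model of ngx.shared.DICT expiry semantics (illustration only)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at); expires_at 0 means "never"

    def set(self, key, value, ttl=0):
        self._store[key] = (value, time.time() + ttl if ttl else 0)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at and expires_at <= time.time():
            return None  # expired keys are invisible to readers
        return value

    def flush_all(self):
        # Like pre_list: mark every key expired; nothing is freed yet.
        now = time.time()
        self._store = {k: (v, now) for k, (v, _) in self._store.items()}

    def flush_expired(self):
        # Like post_list: physically delete the expired keys.
        now = time.time()
        self._store = {k: (v, e) for k, (v, e) in self._store.items()
                       if not e or e > now}

d = ToySharedDict()
d.set("endpoint/a", "10.0.0.1:80")
d.flush_all()
assert d.get("endpoint/a") is None   # invisible as soon as flush_all runs
assert "endpoint/a" in d._store      # but the slot is still occupied
d.flush_expired()
assert "endpoint/a" not in d._store  # now it is physically gone
```

This also makes the window visible: between flush_all in pre_list and the re-population during list, every lookup returns nothing even though flush_expired has not yet run.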

Issue After Fixing Typos:

When the typos are corrected, a new problem emerges:
The flush_all function seems to actually remove 3 key-value pairs of data from the shared_dict (code reference). This creates a critical window between the pre_list call and the moment new endpoints are populated. During this interval:

  1. some endpoint data in the shared dictionary will be missing
  2. requests may fail due to unavailable endpoint information

Therefore, this typo bug does not seem to be directly fixable. I'm also not sure whether the behavior of shared_dict:flush_all() deleting some data is itself a bug.

@Baoyuantop
Contributor

Hi @ChuanFF, I think we can optimize this by modifying the pre_list and post_list methods:

  • pre_list only records the existing data keys and does not clean up any data.
  • post_list compares the old and new data and removes old data that no longer exists.

@ChuanFF
Contributor Author

ChuanFF commented Jun 23, 2025

@Baoyuantop if we want to remove old data that no longer exists, the problem is how to record the existing data keys. ngx.shared.DICT.get_keys does not seem suitable (the official documentation carries a caution: Avoid calling this method on dictionaries with a very large number of keys as it may lock the dictionary for significant amount of time and block Nginx worker processes trying to access the dictionary.)

I think we can prepare two tables, data_history and data_existing: record data into data_existing during list and watch, assign data_existing to data_history in pre_list (resetting data_existing to {}), then compare data_history with data_existing in post_list and remove old data that no longer exists.

Test cases seem difficult to write: dirty data is generated only after the watch ends and before the next list call, and I do not know how to verify that the old data no longer exists. Do you have any suggestions?

Or can we ignore the dirty data? After all, the conditions for generating dirty data are rarely met, and it seems to have little practical effect.

@Baoyuantop
Contributor

A Lua table cannot share data between multiple workers.

@ChuanFF
Contributor Author

ChuanFF commented Jul 2, 2025

The Lua table is just used to record keys so that dirty data in the shared_dict can be cleared.

@Baoyuantop Baoyuantop moved this from 👀 In review to 📋 Backlog in ⚡️ Apache APISIX Roadmap Jul 29, 2025

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 31, 2025
@Baoyuantop Baoyuantop removed the stale label Sep 1, 2025
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Sep 1, 2025
@Baoyuantop
Contributor

Hi @ChuanFF, before continuing to complete the code, could you please describe your current modification plan in the description? This will help other maintainers review this PR.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 4, 2025
@ChuanFF
Contributor Author

ChuanFF commented Sep 4, 2025

@Baoyuantop Sorry, I've been trying to fix this issue in my own repository for the past couple of days and didn't realize the PR would sync here as well. I have now completed the code changes. The approach is:

In the pre_list function, I record the existing keys from the shared_dict into a temporary table handle.existing_keys. Then, when APISIX lists Kubernetes endpoints/endpointSlices, I record the keys written to the shared_dict into another temporary table handle.current_keys_hash. Essentially, handle.existing_keys represents the old list of keys, while handle.current_keys_hash represents the new list of keys. Finally, in the post_list function, I compare the old and new data to identify and remove any dirty data.
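
The old-vs-new key comparison described above can be sketched roughly as follows. This is a Python sketch of the idea only; the actual change is Lua in informer_factory.lua. The field names existing_keys and current_keys_hash come from this comment, while on_item_written and delete_key are hypothetical helpers standing in for the informer's write path and shared_dict deletion:

```python
def pre_list(handle, current_shared_keys):
    # Snapshot the keys already in the shared dict before re-listing.
    handle["existing_keys"] = set(current_shared_keys)
    handle["current_keys_hash"] = set()

def on_item_written(handle, key):
    # Record every key written while listing endpoints/endpointSlices.
    handle["current_keys_hash"].add(key)

def post_list(handle, delete_key):
    # Keys seen before the list but not re-written during it are stale.
    stale = handle["existing_keys"] - handle["current_keys_hash"]
    for key in sorted(stale):
        delete_key(key)
    handle["existing_keys"] = set()
    handle["current_keys_hash"] = set()
    return stale

# Usage sketch, with a plain dict standing in for the shared dict:
store = {"ep/default/a": "10.0.0.1:80", "ep/default/b": "10.0.0.2:80"}
handle = {}
pre_list(handle, store.keys())
store["ep/default/a"] = "10.0.0.1:80"  # only "a" still exists in the cluster
on_item_written(handle, "ep/default/a")
removed = post_list(handle, store.pop)
# removed == {"ep/default/b"}; store now holds only "ep/default/a"
```

Unlike the flush_all approach, existing data stays readable for the whole list phase; only keys that were not re-observed are deleted at the end.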

@ChuanFF
Contributor Author

ChuanFF commented Sep 4, 2025

Additionally, the test case requires waiting for the list query to finish, so a 2-second sleep was added in the location /t block. I plan to submit an additional PR that improves the function _M.status_ready() in apisix/init.lua to check whether APISIX's Kubernetes service discovery is ready. Once that is implemented, the 2-second sleep will be replaced with a wait for the readiness check.

@ChuanFF
Contributor Author

ChuanFF commented Sep 12, 2025

Hi @ChuanFF, before continuing to complete the code, could you please describe your current modification plan in the description? This will help other maintainers review this PR.

@Baoyuantop The modification has been completed. Can we start the review?

@Baoyuantop Baoyuantop self-requested a review September 12, 2025 08:31
@Baoyuantop Baoyuantop moved this from 📋 Backlog to 🏗 In progress in ⚡️ Apache APISIX Roadmap Sep 12, 2025
@Baoyuantop Baoyuantop moved this from 🏗 In progress to 👀 In review in ⚡️ Apache APISIX Roadmap Sep 12, 2025
Labels
bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: 👀 In review