
(feat) initial types and interfaces for pluggable data layer #1154


Merged: 6 commits into kubernetes-sigs:main from datalayer_types on Jul 17, 2025

Conversation

@elevran (Contributor) commented Jul 13, 2025

This PR includes the types and interfaces needed to support a pluggable data source.
It is the first of several PRs aimed at refactoring the current pkg/epp/backend along the design described in #1023.

Specifically:

  • Endpoint replaces PodMetrics and includes PodInfo, Metrics, and an AttributeMap (an extensible set of attributes to be used by pluggable data sources).
  • DataSource encapsulates the scraping loop, with Extractors collecting specific items from the raw object being scraped (e.g., /metrics). A concrete implementation matching the current backend/metrics will be added under datalayer/metrics (see the sketch after this list).
  • The Endpoint interface currently includes management of the refresh loop. This will be removed and placed under the control of the DataSource in a later PR. Similarly, MetricClient and Factory might become redundant.
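
For orientation, a rough sketch of how these pieces relate (illustrative only; the exact method sets are defined in the PR, but OutputType, ExpectedType, and Extract all come up in the review threads below):

// Illustrative sketch; exact signatures live in the datalayer package.
type DataSource interface {
	// Name identifies the data source (e.g., the metrics scraper).
	Name() string
	// OutputType is the type of the raw object produced by each collection cycle.
	OutputType() reflect.Type
	// AddExtractor registers an extractor whose ExpectedType must match OutputType.
	AddExtractor(ex Extractor) error
}

// Extractor collects specific items from the raw scraped object and stores
// them on an endpoint (e.g., in its AttributeMap).
type Extractor interface {
	Name() string
	ExpectedType() reflect.Type
	Extract(data interface{}, ep Endpoint)
}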

Next PRs

  • find/replace to migrate code to the new types (e.g., backend.Pod => datalayer.PodInfo across the datastore, tests, etc.). This will also introduce mocks for tests.
  • implement the base metrics collector and extractor
  • move the extractor to use backend/metrics.MetricsSpec, etc.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from kfswain and robscott July 13, 2025 13:11
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 13, 2025
@k8s-ci-robot (Contributor):

Hi @elevran. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 13, 2025
netlify bot commented Jul 13, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: a3f98c8
Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6878ef7ed5571e00081ab019
Deploy Preview: https://deploy-preview-1154--gateway-api-inference-extension.netlify.app

Comment on lines 48 to 50
// TODO: the following is useful for a data source that operates on
// endpoints and might not be relevant for "global/system" collectors which
// might not need the concept of an endpoint. This can be split, if needed,
// to a separate interface in the future.

// AddEndpoint adds an endpoint to collect from.
AddEndpoint(ep Endpoint) error

// RemoveEndpoint removes an endpoint from collection.
RemoveEndpoint(ep Endpoint) error
Contributor:

As you note yourself in the TODO, it might make sense to have a DataSource interface without these functions, and an additional EndpointDataSource interface that embeds DataSource and defines them.
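
For illustration, the suggested split might look like this (sketch only; EndpointDataSource is the hypothetical name from the comment above):

// Hypothetical split: a minimal DataSource plus an endpoint-aware variant.
type EndpointDataSource interface {
	DataSource // embeds the endpoint-agnostic interface

	// AddEndpoint adds an endpoint to collect from.
	AddEndpoint(ep Endpoint) error

	// RemoveEndpoint removes an endpoint from collection.
	RemoveEndpoint(ep Endpoint) error
}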

Contributor Author:

Yes, that's a possibility.
As part of #1023 we agreed to push the non-endpoint data source out until we have concrete requirements, so I think we can keep it as is for now (i.e., we don't have enough to go on at this time to rationalize what is and isn't shared between the per-endpoint and non-endpoint data sources).

Contributor Author:

Leaving the note in the code for now to ensure feedback from @kfswain and @ahg-g (will remove once approved, before merging).

Collaborator:

I'm fine with an explicit, tight scope for now, expanding only as needed.

The future will bring more information about how we want to expand our interface, so if we can postpone decisions and make more informed ones later, that is fantastic.

Comment on lines 64 to 59
// ExpectedType defines the type expected by the extractor. It must match
// the DataSource.OutputType() the extractor registers for.
ExpectedType() reflect.Type
Contributor:

Is this also used only internally?

Contributor Author:

I don't think so - it is needed by the DataSource to validate that the extractor can process its output.
I'd prefer to validate that a data source and an extractor are compatible at registration time, so at least one of them needs to expose the type as part of its interface.
But if you have an idea on how to implement the type checks differently, I'd be happy to remove it from here as well.
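
A registration-time check along these lines might look like the following (sketch; the baseDataSource receiver and extractors field are hypothetical, and "fmt" is assumed imported):

// AddExtractor validates compatibility once, at registration time,
// instead of failing on every collection cycle.
func (ds *baseDataSource) AddExtractor(ex Extractor) error {
	if ex.ExpectedType() != ds.OutputType() {
		return fmt.Errorf("extractor %s expects %v, but data source %s outputs %v",
			ex.Name(), ex.ExpectedType(), ds.Name(), ds.OutputType())
	}
	ds.extractors = append(ds.extractors, ex)
	return nil
}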

Collaborator:

I'd prefer to validate that a data source and an extractor are compatible at registration time, so at least one of them needs to expose the type as part of its interface.

Agreed, we should have some 'pre-flight checks' to make sure everything is buttoned up before marking EPP ready.

I see that this function exists below, and have commented there


var (
	// DefaultDataSources is the system default data source registry.
	DefaultDataSources = DataSourceRegistry{}
)
Contributor:

I thought this was initialized at bootstrap and remains static. In that case we don't need locking, and it could be implemented the same way as the plugins registry (just as a map[string]DataSource).

Contributor Author:

We need to guarantee that concurrent additions are either safe or not expected/allowed.
I'm fine with assuming that all data sources are added before any reads happen, and from the same (e.g., main) goroutine.

Collaborator:

I'm fine with assuming that all data sources are added before any reads happen, and from the same (e.g., main) goroutine.

Yeah, I think that's a safe assumption. Personally, I'm cautious about allowing any dynamic reconfiguration of our pluggable systems; that would allow for some pretty wild and hard-to-defend behavior, imo.

Contributor Author:

Personally, I'm cautious about allowing any dynamic reconfiguration of our pluggable systems; that would allow for some pretty wild and hard-to-defend behavior, imo.

Does it make sense to make it explicit (e.g., introduce an explicit "freeze data sources" call so that they can't be changed afterwards)?
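
Such a freeze could be a flag checked on registration; a hypothetical sketch (not what was ultimately merged; "errors" and "sync" are assumed imported):

type DataSourceRegistry struct {
	mu      sync.Mutex
	frozen  bool
	sources map[string]DataSource
}

// Freeze disallows any further registrations.
func (r *DataSourceRegistry) Freeze() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.frozen = true
}

// Register adds a data source unless the registry has been frozen.
func (r *DataSourceRegistry) Register(src DataSource) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.frozen {
		return errors.New("data source registry is frozen")
	}
	if r.sources == nil {
		r.sources = map[string]DataSource{}
	}
	r.sources[src.Name()] = src
	return nil
}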

Contributor Author:

Changed here to use sync.Map as well.
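
With sync.Map, registration and lookup need no explicit locking; a sketch with illustrative method names ("fmt" and "sync" assumed imported):

type DataSourceRegistry struct {
	sources sync.Map // data source name -> DataSource
}

// Register stores a data source, rejecting duplicate names.
func (r *DataSourceRegistry) Register(src DataSource) error {
	if _, loaded := r.sources.LoadOrStore(src.Name(), src); loaded {
		return fmt.Errorf("data source %s already registered", src.Name())
	}
	return nil
}

// GetNamedSource returns the data source registered under name, if any.
func (r *DataSourceRegistry) GetNamedSource(name string) (DataSource, bool) {
	v, ok := r.sources.Load(name)
	if !ok {
		return nil, false
	}
	return v.(DataSource), true
}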


// Addressable supports getting an IP address and a namespaced name.
type Addressable interface {
Contributor:

Why is Addressable in the metrics file?

Contributor Author (@elevran, Jul 13, 2025):

I wanted to keep it closest to where it's used (the metrics client).
It seems too small to have its own file, but I'm open to suggestions on a better-fitting location.
Another option is to move it next to PodInfo (which implements Addressable).
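
For reference, an interface matching the doc comment above might look like this (method names are illustrative, not necessarily those in the PR; NamespacedName is from k8s.io/apimachinery/pkg/types):

// Addressable supports getting an IP address and a namespaced name.
type Addressable interface {
	GetIPAddress() string
	GetNamespacedName() types.NamespacedName
}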

@nirrozenbaum (Contributor):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 13, 2025
@kfswain (Collaborator) left a comment:

Left some comments!

I have some higher-level questions around the extractor model and how it's intended to be used. If our v1 of the pluggable data layer is scoped specifically to endpoints, I'm also wondering why we aren't feeding into the datastore representation of the endpoints.

Nothing hard-blocking, but probably better to discuss earlier rather than later.

Thanks Etai!



// DataSource is an interface required from all datalayer data collection
// sources.
type DataSource interface {
Collaborator:

This implies that each data source will have its own list of endpoints that all need to be kept up to date, rather than feeding into a centralized list of endpoints, just with its custom data included.

Out of curiosity, why did we choose this path?

Contributor Author (@elevran, Jul 15, 2025):

There are two possible design options. I'm fine with either, so if there's a strong preference one way or the other, I'm happy to change it.

  1. Data sources keep references to endpoints (e.g., in the case of /metrics, the source needs the IP address and a way to set metrics on the endpoint itself). This was the rationale I had in mind for Add/RemoveEndpoint.
  2. Endpoints keep references (directly or indirectly) to data sources (see the bullets below).

I was thinking of the first option as more flexible (e.g., a data source that records the node name for each endpoint does not need to be continuously called for that endpoint... the data source would not need to create a goroutine for the endpoint at all), but engineering-wise both work.

If we prefer a model where there is explicitly one goroutine per endpoint (possible downsides: head-of-line blocking by "slow/failing" data sources impacts collection by others, all data sources are applicable to all endpoints, etc.), the design would change as follows:

  • the data store owns the goroutines and calls start/stop when endpoints are added and removed. There is a "collection context/state" stored in the datastore per endpoint, and there should not be start/stop RefreshLoop methods defined on endpoints - they're "pure data structures".
  • on every "tick", the goroutine calls all data sources via CollectFor(ep). Internally a data source can still call multiple extractors, so the main change is the removal of the Start/Stop methods and the addition of CollectFor to the data source.

WDYT? The second option (a goroutine per endpoint calling all data sources, passing each the endpoint) is possibly closer to how it is done in GIE today, so we can continue along that path.

Contributor Author (@elevran, Jul 15, 2025):

@kfswain @nirrozenbaum this is a schematic snippet of what the second option may look like:

type Collector interface {
	Name() string
	CollectFor(Endpoint) error
	AddExtractor(Extractor)
}

type Extractor interface {
	Name() string
	ExpectedType() reflect.Type
	Extract(interface{}, Endpoint)
}

// When the data store is notified of a new Pod, it starts a goroutine for it
// (with a scraping context that holds a reference to the Endpoint,
// cancellation contexts, etc.).
func (dlc *CollectionState) CollectionCycle() {
	ep := dlc.Endpoint()
	for _, c := range datalayer.Collectors() { // or store collectors on the collection state directly
		if err := c.CollectFor(ep); err != nil {
			// handle failures - e.g., exponential backoff, disabling collectors after repeated errors...
		}
		// on success, the collector has called its extractors with ep for updating
	}
}

Let me know if this seems more "natural/aligned".

Contributor Author:

Changed so that the DataSource is no longer informed of, and no longer tracks, endpoints.

Collaborator (@kfswain, Jul 17, 2025):

I'm with you here. The second option, from my perspective, separates concerns more cleanly:

  • the datastore focuses on the actual collection mechanism. It does potentially create head-of-line blocking, agreed, but this seems acceptable with proper timeouts/backoff.
  • the Pod controller remains the single source of truth w.r.t. endpoints and just needs to update the datastore's endpoint list
  • custom data sources just focus on what data they care about and how to get it




@nirrozenbaum (Contributor):

This LGTM.
I left just a few nits in case you want to address them.
Will leave the final stamp to @kfswain.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2025
@elevran (Contributor Author) commented Jul 16, 2025

I'm leaning towards the second option outlined in the discussion above, where Endpoints are not pushed into the DataSource but instead managed exclusively in the data store, which calls collectors with the endpoint as a parameter (i.e., the data store manages the collection goroutines).
If acceptable, I'd like to push a final version later today, integrating all feedback.
FYI: @kfswain @nirrozenbaum

elevran added 3 commits July 17, 2025 14:42
- DataSource is no longer tracking Endpoints.
- Removed unused interfaces.
- Changes to exported scope.
- Move Addressable to podinfo.go.

Signed-off-by: Etai Lev Ran <[email protected]>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 17, 2025
@elevran elevran requested review from nirrozenbaum and kfswain July 17, 2025 12:39
Signed-off-by: Etai Lev Ran <[email protected]>
@kfswain (Collaborator) commented Jul 17, 2025

Sorry for the delay @elevran. I checked this PR a few days ago and it didn't look like you had responded to all comments. Either I looked just before you commented, or I'm going blind :)

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2025
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elevran, kfswain, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kfswain,nirrozenbaum]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit e9c11d8 into kubernetes-sigs:main Jul 17, 2025
9 checks passed
@elevran elevran deleted the datalayer_types branch July 20, 2025 14:11