
Conversation

nnikitos95

Motivation

The current implementation uses PromHTTP's default error handling, which may not suit all deployment scenarios. By allowing customization, users can:

  • Opt for more lenient error handling (e.g., ContinueOnError) in high-availability environments.
  • Ensure better alignment with their operational requirements.

Default Behavior

  • Maintains backward compatibility by retaining the current default error handling behavior unless explicitly configured.
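
For context, the knob in question is promhttp's HandlerOpts.ErrorHandling. Below is a minimal sketch of the two modes discussed in this thread, assuming a plain registry and the exporter's usual /metrics endpoint; the wiring is illustrative, not this PR's actual code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// ... register the exporter's collectors on reg ...

	// promhttp.HTTPErrorOnError is the library default: any error while
	// gathering metrics fails the whole scrape with an HTTP 500.
	// promhttp.ContinueOnError instead serves whatever was gathered
	// successfully and reports the failures via ErrorLog.
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.ContinueOnError,
		ErrorLog:      log.Default(),
	})

	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9255", nil))
}
```

With HTTPErrorOnError (the default, which the exporter keeps unless explicitly configured otherwise), a single failing collector turns the whole scrape into a 500; with ContinueOnError the scrape succeeds with whatever could be gathered.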

@nnikitos95 force-pushed the feat/add_promhttp_error_handling_flag branch from b5a9fc6 to 9bfc60d on February 10, 2025 at 12:44
@nnikitos95
Author

cc: @SuperQ @kgeckhart

@SuperQ
Contributor

SuperQ commented Mar 10, 2025

I'm not sure we should do this. If there are errors, we don't want to silently ignore them. They should be fixed.

@nnikitos95
Author

First of all, promlib already has this option, and it would be convenient to have it here as well. It lets the user decide how errors should be handled.

The Google API produces many errors that can't be fixed easily, since their support is not known for being fast or customer-focused. We have seen plenty of cases where they deployed descriptors in a BETA stage and the exporter started to fail. So this setting is a good way to keep collecting the metrics that can still be collected.

The problem we are trying to solve is similar to this

Also, this exporter has metrics that identify such errors, so users will still notice them even when the provided setting is ContinueOnError.

As I mentioned, the default behaviour is still the same.

P.S. We are currently running a forked version with this setting and everything works fine: we haven't lost any metrics we expected, the exporter returns its own metrics in any case, and we are notified if any problems happen.
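
For reference, promhttp defines three error-handling modes, so a string-valued setting maps onto them directly. A hedged sketch of such a mapping follows; the accepted values and helper name are hypothetical, not taken from this PR.

```go
package main

import "github.com/prometheus/client_golang/prometheus/promhttp"

// toErrorHandling translates a user-supplied value into one of promhttp's
// error-handling modes. The accepted values are hypothetical; the PR may
// name them differently.
func toErrorHandling(v string) promhttp.HandlerErrorHandling {
	switch v {
	case "continue":
		// Serve whatever metrics were gathered successfully; log the rest.
		return promhttp.ContinueOnError
	case "panic":
		// Abort the process on any gathering error.
		return promhttp.PanicOnError
	default:
		// Library default: fail the whole scrape with an HTTP 500.
		return promhttp.HTTPErrorOnError
	}
}
```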

@SuperQ
Contributor

SuperQ commented Mar 10, 2025

The problem is that the linked error is an actual problem. By ignoring errors like that you're not going to get the data you expect.

This is a correctness problem that I don't think we want to paper over.

we haven't lost any metrics we expected

Are you sure about that? Because the linked issue is very likely to be a real problem.

@SuperQ
Contributor

SuperQ commented Mar 10, 2025

So this setting is a good way to keep collecting the metrics that can still be collected.

This is a very bad idea. Partial data is very difficult to track and debug. This is why the default behavior is the way it is. Likewise, when Prometheus polls targets, it treats any non-200 status code or timeout as "no data".

This violates the "fail fast" principle and is not something we are likely to want to put in an official exporter without some serious discussion.

@nnikitos95
Author

Are you sure about that? Because the linked issue is very likely to be a real problem.

Yes. With the fail-fast approach we lose all the metrics in any case: /metrics returns 500, or the metrics become outdated because we haven't collected them in time and haven't reacted via the alerts that rely on them.

Obviously, we can split the type prefixes of broken metrics across different instances of the stackdriver exporter so they don't affect the reliable prefixes. But that takes time in any case, even if we use fail-fast and can react quickly. And it usually happens at night, so we can still lose data.

Likewise, when Prometheus polls targets, it treats any non-200 status code or timeout as "no data".

I completely agree with that. But as I said above, for us it's not okay to have "no data" when we could have "some non-broken data".

By ignoring errors like that you're not going to get the data you expect.

Usually those metrics are broken on Google's side and are in ALPHA/BETA stages that can't be filtered out, so we can't do anything about them.

@eenchev

eenchev commented Sep 11, 2025

@SuperQ, I completely agree with your reasoning. However, in some configurations, teams add an ALPHA/BETA metric that they know could be problematic, and they don't want the whole stackdriver exporter to fail because of its behavior. Of course, failing silently (with only a log message) should not be the default, but an opt-in flag, as this PR implements it.

This PR will not change anything for users who already use the exporter as is; it just enables the configuration for whoever wants to accept the risk. Issues such as this would not be raised.
