metrics: handle varying ncpu within a query. #2896
base: main
Conversation
The metrics dashboards assume that queries return a consistent number of series over time. For example, if a query for cpu utilization gets four timeseries at one timepoint, it should have four timeseries at all timepoints. However, this assumption isn't met when the cardinality of the given resource changes over time, such as when the user scales the number of vcpus for an instance. This assumption leads to two bugs:

* If the number of vcpus changes within a query, we normalize by the largest observed vcpu count. In other words, increasing the number of vcpus makes utilization appear lower at timepoints prior to the scaling event.
* If a user scales up the number of vcpus such that the 0th vcpu timeseries has fewer timepoints than subsequent timeseries, we truncate series after the 0th.

This patch attempts to fix both bugs. First, we count the number of non-null values per timeseries to normalize cpu use. Second, we consider all timestamps from all returned series so that we don't accidentally truncate the results.
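A minimal sketch of the counting idea: a `countValues` helper does appear in the hunks below, but this body and the `AlignedSeries` shape are illustrative assumptions, not the PR's actual code.

```ts
// Hypothetical sketch, not the PR's implementation. Assumes each series has
// been aligned to the merged timestamp axis, with null where a series has no
// point (e.g. before that vcpu existed).
type AlignedSeries = { values: (number | null)[] }

// Count how many series report a real value at each timestamp, so utilization
// can be normalized by the vcpu count actually present at that moment rather
// than the largest count observed anywhere in the query window.
function countValues(series: AlignedSeries[], length: number): number[] {
  const counts = new Array<number>(length).fill(0)
  for (const s of series) {
    s.values.forEach((v, i) => {
      if (v !== null) counts[i] += 1
    })
  }
  return counts
}
```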
```ts
const timestamps = R.pipe(
  timeseriesData,
  R.map((series) => series.points.timestamps.map((t) => new Date(t).getTime())),
  R.flat(),
```
Thank you for doing this. This is a gnarly bug and the logic is gnarly so I will have to spend a little time looking through it.
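For context, the quoted hunk above is cut off mid-pipe. A sketch of how the merged timestamp axis might be completed, assuming remeda's `unique` and `sortBy`; the actual tail of the pipe isn't shown in this diff.

```ts
import * as R from 'remeda'

// Placeholder shape for the sketch; the real type comes from the OxQL query.
declare const timeseriesData: { points: { timestamps: string[] } }[]

// Sketch only: the dedupe and sort steps are assumptions about the truncated hunk.
const timestamps = R.pipe(
  timeseriesData,
  R.map((series) => series.points.timestamps.map((t) => new Date(t).getTime())),
  R.flat(), // merge every series' timestamps into one array
  R.unique(), // collapse timestamps shared by multiple series
  R.sortBy((t) => t) // charts expect a monotonically increasing axis
)
```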
```diff
  const chartData = timestamps
    .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
    // Drop the first datapoint, which — for delta metric types — is the cumulative sum of all previous
    // datapoints (like CPU utilization). We've accounted for this by adjusting the start time earlier;
    // We could use a more elegant approach to this down the road
    .slice(1)

- return { chartData, timeseriesCount }
+ return { chartData, valueCounts }
```
Because the elements of `valueCounts` should match up one to one with the elements of `chartData`, it might be clearer to represent that relationship more directly by either making the count a field on the `chartData` element or even pre-normalizing the summed values data here by dividing by the count, so you don't have to return the count at all. I haven't worked through it enough to articulate it properly, just musing.
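One way the pre-normalization suggested above could look. The names come from the hunks in this PR; the division and the zero-count guard are assumptions, not the PR's code.

```ts
declare const timestamps: number[]
declare const summedValues: number[]
declare const valueCounts: number[]

// Sketch: divide each summed value by the number of series that actually
// reported at that timestamp, so the raw counts never leave this function.
const chartData = timestamps
  .map((timestamp, idx) => ({
    timestamp,
    // Guard against timestamps where no series has a point.
    value: valueCounts[idx] > 0 ? summedValues[idx] / valueCounts[idx] : null,
  }))
  .slice(1)
// ...and then `return { chartData }` with no valueCounts in the result at all
```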
```ts
const summedValues = sumValues(timeseriesData, timestamps.length)
const valueCounts = countValues(timeseriesData, timestamps.length)
const chartData = timestamps
  .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
  // Drop the first datapoint, which — for delta metric types — is the cumulative sum of all previous
  // datapoints (like CPU utilization). We've accounted for this by adjusting the start time earlier;
  // We could use a more elegant approach to this down the road
  .slice(1)
```
does `valueCounts` also need to get its first value dropped?
Oh, it's worse than that. This needs to be done per-timeseries prior to aggregation, now that we realize they can have different first timestamps.
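A sketch of what dropping the first point per-timeseries before aggregation could look like. The series shape and the idea of trimming both timestamps and values together are assumptions for illustration, not the PR's code.

```ts
declare const timeseriesData: {
  points: { timestamps: string[]; values: (number | null)[] }
}[]
declare const timestamps: number[]
declare function sumValues(data: typeof timeseriesData, len: number): number[]
declare function countValues(data: typeof timeseriesData, len: number): number[]

// Sketch only: drop each series' first (cumulative) point before summing,
// since series created at different times have different first timestamps.
const trimmed = timeseriesData.map((series) => ({
  ...series,
  points: {
    ...series.points,
    timestamps: series.points.timestamps.slice(1),
    values: series.points.values.slice(1),
  },
}))
const summedValues = sumValues(trimmed, timestamps.length)
const valueCounts = countValues(trimmed, timestamps.length)
```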
```ts
const summedValues = sumValues(timeseriesData, timestamps.length)
const valueCounts = countValues(timeseriesData, timestamps.length)
const chartData = timestamps
  .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
```
if you add `valueCount` here, maybe instead of using `idx`, `summedValues` and `valueCounts` could be objects indexed by timestamp so we don't have to worry about the arrays lining up index-wise
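A sketch of that timestamp-keyed idea, assuming a simplified series shape: accumulating sums and counts into records keyed by epoch-millisecond timestamps means the two can never drift out of alignment the way parallel arrays can.

```ts
declare const timeseriesData: {
  points: { timestamps: string[]; values: (number | null)[] }
}[]
declare const timestamps: number[]

// Sketch only: accumulate sums and counts into records keyed by timestamp
// instead of parallel arrays. The series shape here is an assumption.
const summedByTs: Record<number, number> = {}
const countByTs: Record<number, number> = {}
for (const series of timeseriesData) {
  series.points.timestamps.forEach((t, i) => {
    const ts = new Date(t).getTime()
    const v = series.points.values[i]
    if (v != null) {
      summedByTs[ts] = (summedByTs[ts] ?? 0) + v
      countByTs[ts] = (countByTs[ts] ?? 0) + 1
    }
  })
}

// Each chart point looks up its own timestamp; no shared index to get wrong.
const chartData = timestamps.map((timestamp) => ({
  timestamp,
  value: summedByTs[timestamp] ?? null,
  valueCount: countByTs[timestamp] ?? 0,
}))
```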