
Conversation


@jmcarp jmcarp commented Sep 4, 2025

The metrics dashboards assume that queries return a consistent number of series
over time. For example, if a query for cpu utilization gets four timeseries at
one timepoint, it should have four timeseries at all timepoints. However, this
assumption isn't met when the cardinality of the given resource changes over
time, such as when the user scales the number of vcpus for an instance. This
assumption leads to two bugs:

  • If the number of vcpus changes within a query, we normalize by the largest
    observed vcpu count. In other words, increasing the number of vcpus makes
    utilization appear lower at timepoints prior to the scaling event.
  • If a user scales up the number of vcpus such that the 0th vcpu timeseries has
    fewer timepoints than subsequent timeseries, we truncate series after the
    0th.

This patch attempts to fix both bugs. First, we count the number of non-null
values per timeseries to normalize cpu use. Second, we consider all timestamps
from all returned series so that we don't accidentally truncate the results.
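The two fixes can be sketched roughly as follows. This is a simplified illustration with a hypothetical `Series` shape and function names, not the actual console code: it takes the union of timestamps across all series (rather than truncating to one series' timestamps), and normalizes each point by the number of series that actually reported a non-null value there (rather than by the largest observed series count).

```typescript
// Hypothetical series shape: parallel arrays of timestamps and values.
type Series = { timestamps: number[]; values: (number | null)[] }

// Fix for the truncation bug: union of timestamps across all series,
// so a short 0th series can't clip the rest of the chart.
function unionTimestamps(series: Series[]): number[] {
  const all = new Set(series.flatMap((s) => s.timestamps))
  return [...all].sort((a, b) => a - b)
}

// Fix for the normalization bug: at each timestamp, sum the values and
// count how many series reported a non-null value, then divide by that
// count instead of by the maximum number of series ever observed.
function normalizedUtilization(
  series: Series[]
): { timestamp: number; value: number }[] {
  return unionTimestamps(series).map((timestamp) => {
    let sum = 0
    let count = 0
    for (const s of series) {
      const idx = s.timestamps.indexOf(timestamp)
      if (idx !== -1 && s.values[idx] !== null) {
        sum += s.values[idx]!
        count++
      }
    }
    return { timestamp, value: count > 0 ? sum / count : 0 }
  })
}
```

With this approach, a vcpu added mid-query only affects the average at timestamps where it actually reported data, and earlier points keep their original utilization.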



const timestamps = R.pipe(
  timeseriesData,
  R.map((series) => series.points.timestamps.map((t) => new Date(t).getTime())),
  R.flat(),
Collaborator

@david-crespo david-crespo Sep 4, 2025


Not that perf matters a ton here, but R.flatMap "is identical to a map followed by a flat of depth 1 (flat(map(data, ...args))), but slightly more efficient than calling those two methods separately"

https://remedajs.com/docs/#flatMap
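The suggested equivalence also holds for the native array methods, which Remeda's `flatMap` mirrors. A tiny illustration with made-up data:

```typescript
// A map followed by flat(1) produces the same array as flatMap,
// but flatMap does it in a single pass.
const series = [{ timestamps: [1, 2] }, { timestamps: [3] }]

const viaMapFlat = series.map((s) => s.timestamps).flat()
const viaFlatMap = series.flatMap((s) => s.timestamps)
```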

@david-crespo
Collaborator

Thank you for doing this. This is a gnarly bug and the logic is gnarly so I will have to spend a little time looking through it.

const chartData = timestamps
  .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
  // Drop the first datapoint, which — for delta metric types — is the cumulative sum of all previous
  // datapoints (like CPU utilization). We've accounted for this by adjusting the start time earlier;
  // We could use a more elegant approach to this down the road
  .slice(1)

return { chartData, timeseriesCount }
return { chartData, valueCounts }
Collaborator

@david-crespo david-crespo Sep 4, 2025


Because the elements of valueCounts should match up one to one with the elements of chartData, it might be clearer to represent that relationship more directly by either making the count a field on the chartData element or even pre-normalizing the summed values data here by dividing by the count, so you don't have to return the count at all. I haven't worked through it enough to articulate it properly, just musing.
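The pre-normalizing variant suggested here might look something like the following. This is a rough sketch with hypothetical parameters, assuming `summedValues` and `valueCounts` are index-aligned with `timestamps`; dividing inside the map means only `{ timestamp, value }` pairs leave the function and the counts never need to be returned:

```typescript
// Normalize each summed value by its non-null count before building
// chartData, so the count arrays stay internal to this function.
function buildChartData(
  timestamps: number[],
  summedValues: number[],
  valueCounts: number[]
): { timestamp: number; value: number }[] {
  return timestamps.map((timestamp, idx) => ({
    timestamp,
    // Guard against a timestamp where no series reported a value.
    value: valueCounts[idx] > 0 ? summedValues[idx] / valueCounts[idx] : 0,
  }))
}
```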

const summedValues = sumValues(timeseriesData, timestamps.length)
const valueCounts = countValues(timeseriesData, timestamps.length)
const chartData = timestamps
  .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
  // Drop the first datapoint, which — for delta metric types — is the cumulative sum of all previous
  // datapoints (like CPU utilization). We've accounted for this by adjusting the start time earlier;
  // We could use a more elegant approach to this down the road
  .slice(1)
Collaborator


does valueCounts also need to get its first value dropped?

Collaborator


Oh, it's worse than that. This needs to be done per-timeseries prior to aggregation now that we realize they can have different first timestamps.
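A per-series version of that drop might look like this sketch (hypothetical `Series` shape and helper name): each series loses its own first point before any summing or counting happens, so a series that starts later doesn't get a real point discarded on its behalf.

```typescript
// Hypothetical series shape: parallel arrays of timestamps and values.
type Series = { timestamps: number[]; values: number[] }

// Drop the first (cumulative) point of each series independently,
// before aggregation, since series may start at different timestamps.
function dropFirstPointPerSeries(series: Series[]): Series[] {
  return series.map((s) => ({
    timestamps: s.timestamps.slice(1),
    values: s.values.slice(1),
  }))
}
```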

const summedValues = sumValues(timeseriesData, timestamps.length)
const valueCounts = countValues(timeseriesData, timestamps.length)
const chartData = timestamps
  .map((timestamp, idx) => ({ timestamp, value: summedValues[idx] }))
Collaborator


if you add valueCount here, maybe instead of using idx, summedValues and valuesCounts could be objects indexed by timestamp so we don't have to worry about the arrays lining up index-wise
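The timestamp-keyed idea could be sketched like this (hypothetical names, not the existing `sumValues`/`countValues` helpers): accumulate sums and counts in `Map`s keyed by timestamp, so nothing depends on separate arrays lining up index-wise.

```typescript
// Hypothetical series shape: parallel arrays of timestamps and values.
type Series = { timestamps: number[]; values: (number | null)[] }

// Accumulate per-timestamp sums and non-null counts in Maps keyed by
// timestamp, avoiding any reliance on positional alignment.
function sumAndCountByTimestamp(series: Series[]): {
  sums: Map<number, number>
  counts: Map<number, number>
} {
  const sums = new Map<number, number>()
  const counts = new Map<number, number>()
  for (const s of series) {
    s.timestamps.forEach((t, i) => {
      const v = s.values[i]
      if (v === null) return
      sums.set(t, (sums.get(t) ?? 0) + v)
      counts.set(t, (counts.get(t) ?? 0) + 1)
    })
  }
  return { sums, counts }
}
```

Building `chartData` then becomes a lookup per timestamp, and a series with missing timepoints simply contributes nothing at those keys.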
