@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR depends on PR370 and PR371; I will update the go.mod file and rebase once they are merged.

Description of the issue

Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.

Description of changes

This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. The changes are listed below:

  1. Introduce a new field, accelerated_compute_gpu_metrics_collection_interval, that lets customers specify the metrics collection interval in seconds; the default value is 60.
  2. If a customer sets it to a value less than 60, the following changes take effect:
    2.1 the batch period of the batch processor increases from 5s to 60s;
    2.2 a groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
    2.3 the awscontainerinsights receiver uses the configured value as the GPU sampling frequency (PR 370);
    2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the emf exporter (PR 371);

We also tried supplying keys to the groupbyattrs processor so that it compacts only GPU metrics, but that showed hardly any improvement for CPU and memory metrics.
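To illustrate point 1 above, a customer-facing agent configuration might look like the sketch below. The field name comes from this PR, but its placement under `logs.metrics_collected.kubernetes`, next to the existing `accelerated_compute_metrics` flag, is an assumption rather than something confirmed by the diff:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "enhanced_container_insights": true,
        "accelerated_compute_metrics": true,
        "accelerated_compute_gpu_metrics_collection_interval": 1
      }
    }
  }
}
```

With a value of 1, the receiver would sample GPU metrics every second while the batch processor flushes once per minute, which is what produces the 60-datapoint histograms shown in the test logs below.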

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Deployed this PR along with PR 370 and PR 371 to a personal EKS cluster.
  2. Spun up an ML job, then checked CloudWatch logs and metrics and confirmed that:
    2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call;
    2.2 GPU metrics were in CloudWatch histogram format.
  • log sample
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}
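The per-metric histogram fields in the log sample above (Values/Counts/Max/Min/Count/Sum) follow the CloudWatch EMF distribution shape. The sketch below shows one way a minute of per-second samples could be compacted into that shape; it is illustrative only, not the emf exporter code from PR 371. Note that in the real logs above, Sum need not equal the dot product of Values and Counts (the exporter may bucket values), whereas this sketch keeps exact values.

```go
package main

import "fmt"

// Distribution mirrors the Values/Counts histogram shape in the EMF log sample.
type Distribution struct {
	Values []float64
	Counts []float64
	Max    float64
	Min    float64
	Count  int
	Sum    float64
}

// compact folds raw per-second samples into a value->count distribution,
// preserving the first-seen order of distinct values.
func compact(samples []float64) Distribution {
	d := Distribution{}
	index := map[float64]int{} // sample value -> position in d.Values
	for i, s := range samples {
		if i == 0 || s > d.Max {
			d.Max = s
		}
		if i == 0 || s < d.Min {
			d.Min = s
		}
		d.Sum += s
		d.Count++
		if j, ok := index[s]; ok {
			d.Counts[j]++
		} else {
			index[s] = len(d.Values)
			d.Values = append(d.Values, s)
			d.Counts = append(d.Counts, 1)
		}
	}
	return d
}

func main() {
	// e.g. a few seconds of GPU temperature samples
	samples := []float64{42, 42, 43, 43, 43, 44}
	fmt.Printf("%+v\n", compact(samples))
}
```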
  • metrics graph (screenshot not included)

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh. - done
  2. Run make lint. - done

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 4 times, most recently from a40f2b7 to 69ba416 on October 16, 2025 13:51
@yanhaoluo666 added the ready for testing label on Oct 17, 2025
@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from 6c0f9d7 to acbbe17 on October 20, 2025 11:39
@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from b9ed82e to 7b33d8f on October 31, 2025 15:42
@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 2 times, most recently from 49bf478 to bfc4009 on October 31, 2025 17:43
@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from bfc4009 to 8cb0d6e on October 31, 2025 17:50