Skip to content

Conversation

@dhable
Copy link
Contributor

@dhable dhable commented Dec 12, 2025

There were two issues with the log rotation script:

  1. Logs could be lost since copying and then truncating the file might not finish before logs arrive.

  2. The otel collector application will keeps the file handle and offset cached. After truncating, it will write starting at the last offset leaving the unallocated garbage in the beginning of the file. This garbage uses space.

This commit moves the file instead of copying. That allows the collector to continue writing to the rolled file until a SIGHUP is sent. This causes a config refresh, which also opens a new log file. After, the rolled file and the new log file have correct sizes.

--
ADDITIONAL NOTES:

Claude's code review is not accurate here.

  • The alpine image is based on busybox and fuser is a command implemented by busybox. This can be verified by just running the collector and watching the log rotate behavior.
  • The mv command updates the name of the file in the file system but doesn't change the inode number. A process only uses the file path the first time the file is open to resolve it into a inode number. Moving the file changes the name but doesn't change the inode number so the process will continue to write to that file.

There were two issues with the log rotation script:

1. Logs could be lost since copying and then truncating the file might
   not finish before logs arrive.

2. The otel collector application will keeps the file handle and offset
   cached. After truncating, it will write starting at the last offset
   leaving the unallocated garbage in the beginning of the file. This
   garbage uses space.

This commit moves the file instead of copying. That allows the collector
to continue writing to the rolled file until a SIGHUP is sent. This causes
a config refresh, which also opens a new log file. After, the rolled file
and the new log file have correct sizes.
@changeset-bot
Copy link

changeset-bot bot commented Dec 12, 2025

⚠️ No Changeset found

Latest commit: c7d7264

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link

vercel bot commented Dec 12, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
hyperdx-v2-oss-app Ready Ready Preview, Comment Dec 12, 2025 10:44pm

@dhable
Copy link
Contributor Author

dhable commented Dec 12, 2025

We don't have a changeset for these images so I just didn't add one.

@claude
Copy link

claude bot commented Dec 12, 2025

Code Review

Critical Issues

  • Race condition in SIGHUP signal (log-rotator.sh:42) → fuser -k -HUP sends signal to processes using .1 file, but collector may have already created new log file and closed old handle. Move SIGHUP before mv or target the collector PID directly
  • ⚠️ Error handling missing → Add checks: (1) verify mv succeeded before sending SIGHUP, (2) verify fuser command exists and succeeds
  • ⚠️ Potential log loss window → Between mv and SIGHUP completion, new log writes go to old file handle (moved file). Document this behavior or add verification that collector reopened file

Additional Concerns

  • 🤔 fuser dependency → While busybox provides fuser, the alpine base image installation is implicit. Consider explicit verification or alternative signal mechanism (e.g., pkill -HUP targeting process name)
  • 📝 Incomplete rotation cleanup → Line 46 removes . after rotation, but if rotation fails mid-process, could accumulate extra files. Consider cleanup in error paths

Recommendation

Test the SIGHUP timing thoroughly - the current implementation may work if collector handles SIGHUP quickly, but the race condition between file move and signal delivery could cause issues under load.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 12, 2025

E2E Test Results

All tests passed • 46 passed • 3 skipped • 262s

Status Count
✅ Passed 46
❌ Failed 0
⚠️ Flaky 0
⏭️ Skipped 3

View full report →

@dhable dhable force-pushed the dan/hdx-2747-otel-collector-docker-image-logrotatorsh-does-not-truncate branch from c7587a6 to 41f1bf3 Compare December 12, 2025 22:23
@dhable dhable requested review from a team and brandon-pereira and removed request for a team December 12, 2025 22:42
Copy link
Member

@brandon-pereira brandon-pereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants