Feat(dhcp): Send DHCPRELEASE on container teardown #1305
base: main
Conversation
Reviewer's Guide

This PR enhances netavark’s DHCP proxy to actively send a DHCPRELEASE when a container’s network is torn down by adding a release_lease method to the DHCP service, integrating it into the teardown handler in a best-effort manner, and updating the teardown test to assert on proxy logs.

Sequence diagram for DHCPRELEASE on container teardown

```mermaid
sequenceDiagram
    participant Container
    participant NetavarkProxy
    participant DhcpV4Service
    participant DHCPServer
    Container->>NetavarkProxy: Request network teardown
    NetavarkProxy->>DhcpV4Service: Create service, retrieve lease
    DhcpV4Service->>DHCPServer: Send DHCPRELEASE
    DhcpV4Service-->>NetavarkProxy: Return result (success/failure)
    NetavarkProxy-->>Container: Confirm teardown
```
Class diagram for updated DhcpV4Service and teardown handler

```mermaid
classDiagram
    class DhcpV4Service {
        +release_lease(lease: MozimV4Lease): Result<(), DhcpServiceError>
        ...
    }
    class NetavarkProxyService {
        +teardown(request: NetworkConfig): Response<Lease>
        ...
    }
    NetavarkProxyService --> DhcpV4Service : uses
    DhcpV4Service --> MozimV4Lease : parameter
    MozimV4Lease <.. DhcpV4Service : used in release_lease
```
Force-pushed 42f8352 to c048a4e
Ephemeral COPR build failed. @containers/packit-build please check.
Force-pushed c048a4e to fa1d7d2
Hey @Rishikpulhani - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/commands/dhcp_proxy.rs:175` </location>
<code_context>
+ // We use a clone of 'nc' for the new service, and the original 'nc' for logging.
+ match DhcpV4Service::new(nc.clone(), self.dora_timeout) {
+ Ok(mut service) => {
+ if let Err(e) = service.release_lease(&mozim_lease) {
+ // Log the error but continue. Use the correct variable for the MAC address.
+ warn!(
+ "Failed to send DHCPRELEASE for {}: {}",
+ &nc.container_mac_addr, e
+ );
+ } else {
+ debug!(
+ "Successfully sent DHCPRELEASE for {}",
</code_context>
<issue_to_address>
Check if release_lease could block or introduce latency in teardown.
If release_lease performs network I/O, consider adding a timeout or running it asynchronously to avoid blocking teardown longer than necessary.
Suggested implementation:
```rust
match DhcpV4Service::new(nc.clone(), self.dora_timeout) {
Ok(mut service) => {
// Run release_lease asynchronously with a timeout to avoid blocking teardown.
let release_timeout = std::time::Duration::from_secs(3);
match tokio::time::timeout(release_timeout, service.release_lease(&mozim_lease)).await {
Ok(Ok(_)) => {
debug!(
"Successfully sent DHCPRELEASE for {}",
&nc.container_mac_addr
);
}
Ok(Err(e)) => {
warn!(
"Failed to send DHCPRELEASE for {}: {}",
&nc.container_mac_addr, e
);
}
Err(_) => {
warn!(
"DHCPRELEASE for {} timed out after {:?}",
&nc.container_mac_addr, release_timeout
);
}
}
}
Err(e) => {
warn!("Failed to create DHCP service for release: {e}");
}
}
```
- The containing function must be marked as `async` if it isn't already, and its callers must be updated accordingly.
- The `release_lease` method must be async. If it is not, you will need to refactor it to be async.
- Ensure `tokio` is available in your dependencies.
- If you want the timeout duration to be configurable, consider making it a struct field or a constant.
</issue_to_address>
### Comment 2
<location> `test-dhcp/003-teardown.bats:45` </location>
<code_context>
+ # broadcast DHCPRELEASE packet to the dnsmasq listener. We check the
+ # proxy's own log for the success message, as this directly confirms
+ # the code path was executed successfully.
+ run_helper grep "Successfully sent DHCPRELEASE for ${CONTAINER_MAC}" "$TMP_TESTDIR/proxy.log"
run_helper cat "$TMP_TESTDIR/nv-proxy.lease"
# Check that the length of the lease file is now zero
</code_context>
<issue_to_address>
Consider adding a negative test to verify log output when DHCPRELEASE fails.
Testing a failure scenario, such as an invalid lease or mozim client error, would confirm that warning logs are generated and teardown proceeds as expected.
Suggested implementation:
```shell
run_helper grep "Successfully sent DHCPRELEASE for ${CONTAINER_MAC}" "$TMP_TESTDIR/proxy.log"
run_helper cat "$TMP_TESTDIR/nv-proxy.lease"
# Check that the length of the lease file is now zero
run_helper jq ". | length" <<<"$output"
# Negative test: simulate DHCPRELEASE failure by using an invalid MAC address
INVALID_MAC="00:00:00:00:00:00"
# Prepare a config with the invalid MAC
read -r -d '\0' invalid_input_config <<EOF
{
"container_mac": "$INVALID_MAC",
"other_config": "value"
}
EOF
# Run teardown with invalid MAC
run_teardown "$invalid_input_config"
# Check the proxy log for a warning or error about DHCPRELEASE failure
run_helper grep "Failed to send DHCPRELEASE for ${INVALID_MAC}" "$TMP_TESTDIR/proxy.log"
# Confirm teardown still proceeds and lease file is handled
run_helper cat "$TMP_TESTDIR/nv-proxy.lease"
```
- You may need to adjust the invalid_input_config structure to match your actual teardown input requirements.
- Ensure your proxy implementation logs a message like "Failed to send DHCPRELEASE for ..." on failure; if the log message differs, update the grep pattern accordingly.
- If teardown expects more fields in the config, add them to invalid_input_config.
</issue_to_address>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rishikpulhani, sourcery-ai[bot]. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks for the PR
test-dhcp/003-teardown.bats (Outdated)

```bash
# Redefine start_proxy locally to enable debug logging for this test only.
# This version will be used instead of the one from helpers.bash.
function start_proxy() {
    RUST_LOG=debug ip netns exec "$NS_NAME" $NETAVARK dhcp-proxy --dir "$TMP_TESTDIR" --uds "$TMP_TESTDIR" &>"$TMP_TESTDIR/proxy.log" &
    PROXY_PID=$!
}
```
I see no reason why we should not have debug logging turned on for all tests. I'd rather not duplicate this here.
test-dhcp/003-teardown.bats (Outdated)

```bash
# broadcast DHCPRELEASE packet to the dnsmasq listener. We check the
# proxy's own log for the success message, as this directly confirms
# the code path was executed successfully.
run_helper grep "Successfully sent DHCPRELEASE for ${CONTAINER_MAC}" "$TMP_TESTDIR/proxy.log"
```
This approach of checking the proxy's own log was chosen because the virtual bridge in the test environment does not reliably forward the broadcast DHCPRELEASE packet to the dnsmasq listener.
Can you expand on this? I think we also have to check dnsmasq logs, because our logs could just lie and say we send the event without ever doing so.
If we are able to request a lease we should be able to send the release event as well or am I missing something?
Hi @Luap99, you're right to question this. My initial implementation did exactly what you suggested – I was checking the dnsmasq log for the DHCPRELEASE message.
However, the test was consistently failing. The dnsmasq.log showed a perfect DHCPDISCOVER/OFFER/REQUEST/ACK sequence, but the DHCPRELEASE message never appeared.
Initially I thought that the test's cleanup logic was terminating the dnsmasq process before the kernel's virtual network had time to deliver the in-flight DHCPRELEASE packet from the proxy.
To debug this, I instrumented the netavark-dhcp-proxy with detailed logging. The proxy's debug log definitively shows that the release_lease function is executed correctly and that the mozim client successfully sends the release packet to the network socket.
I concluded that this is specific to the test environment, so I added a check that the packet was emitted rather than checking the dnsmasq logs.
Is this an acceptable testing strategy given the circumstances, or is there a better way to resolve this race condition in the BATS environment that I might be missing?
PS: I am actually facing the same issue with my new implementation
Hey @Luap99,
I went through the codebase of both Netavark and Mozim, as well as the logs emitted during tests.
In the Mozim library, I used the following `release` method on `DhcpV4ClientAsync`:
https://docs.rs/mozim/latest/mozim/struct.DhcpV4ClientAsync.html#method.release
This internally calls the `release` method on `DhcpV4Client`:
https://docs.rs/mozim/latest/src/mozim/dhcpv4/client.rs.html#554
When reviewing the proxy logs, I found this Release message being sent:
[DEBUG mozim::dhcpv4::msg] DHCP message Message { opcode: BootRequest, htype: Eth, hlen: 6, hops: 0, xid: 1505379826, secs: 0, flags: Flags { broadcast: false }, ciaddr: 10.220.46.58, yiaddr: 0.0.0.0, siaddr: 0.0.0.0, giaddr: 0.0.0.0, chaddr: [106, 108, 253, 24, 156, 237, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sname: Some([102, 111, 111, 98, 97, 114]), fname: None, magic: [99, 130, 83, 99], opts: DhcpOptions({ServerIdentifier: ServerIdentifier(10.220.46.1), MessageType: MessageType(Release), Hostname: Hostname("foobar"), ClientIdentifier: ClientIdentifier([0, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102])}) }
[DEBUG mozim::socket] Sending raw ethernet package: [102, 98, 123, 131, 209, 27, 106, 108, 253, 24, 156, 237, 8, 0, 69, 0, 1, 97, 0, 0, 64, 0, 128, 17, 135, 153, 10, 220, 46, 58, 10, 220, 46, 1, 0, 68, 0, 67, 1, 77, 198, 56, 1, 1, 6, 0, 89, 186, 69, 242, 0, 0, 0, 0, 10, 220, 46, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 106, 108, 253, 24, 156, 237, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 102, 111, 111, 98, 97, 114, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 99, 130, 83, 99, 54, 4, 10, 220, 46, 1, 53, 1, 7, 12, 6, 102, 111, 111, 98, 97, 114, 61, 65, 0, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 255]
[DEBUG mozim::socket] Raw socket sent: 367 bytes
However, for some reason, this message never reaches the DHCP server.
It’s not due to the server being stopped; I even added delays in 003-teardown.bats before the log check:

```bash
sleep 10
assert `grep -c "Successfully sent DHCPRELEASE for ${CONTAINER_MAC}" "$TMP_TESTDIR/dnsmasq.log"` == 1
```
This had no effect. It seems that although the Mozim client is sending the packet, the DHCP server does not receive it.
My suspicion is that the DHCP server may be dropping these packets.
Any help in identifying what might be going wrong here would be greatly appreciated. Thanks!
Force-pushed e149268 to fbd03ff
Force-pushed fbd03ff to 791c723
@Luap99

Investigation of the Log Behavior

My initial observation was that while the DHCPDISCOVER/OFFER/REQUEST/ACK (DORA) messages were visible in the dnsmasq log, the DHCPRELEASE message was not. My working hypothesis is that this is due to the nature of the packets and the virtual test network's architecture. The initial DORA messages are initiated via broadcast, which are flooded across the br0 bridge and correctly picked up by the dnsmasq listener. However, the DHCPRELEASE message is a unicast packet, sent directly from the client's IP to the server's IP. It appears that on a Linux virtual bridge, this directed traffic between two endpoints may not be visible to a listener on the bridge device (br0) itself.

Furthermore, my review of the mozim client indicates that it sends the DHCPRELEASE packet in a "fire-and-forget" manner. It considers the operation successful upon sending the packet and does not wait for an acknowledgment, which is compliant with the DHCP standard for this message type. This behavior supports the conclusion that our primary testing goal should be to verify that the client successfully sends the DHCPRELEASE packet. Verifying receipt by the dnsmasq server is not a reliable metric in this specific test environment. Since both the client and server operate on the same virtual bridge, it is plausible that the unicast DHCPRELEASE message is not delivered to the server's listener. This is a quirk of the test setup and would not occur in a real-world scenario where the client and server reside on different machines.

Verification with tcpdump

To definitively prove the packet was being sent by my code, I captured the traffic directly from the br0 bridge inside the test's temporary network namespace.

$ sudo ip netns

This tcpdump output provides definitive proof that the code is correctly generating and sending the unicast DHCPRELEASE packet onto the virtual network. I've detailed my testing methodology and findings above and would be grateful for your review of this approach. Please let me know if you have any feedback or suggestions. Thanks!
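As background for the unicast behavior discussed above: RFC 2131 specifies that DHCPRELEASE is unicast to the address in the ServerIdentifier option, and the only defined bit in the DHCP header's 16-bit flags field is the broadcast flag (0x8000), which controls how the server addresses its replies, not how the client sends. The proxy log earlier in this thread shows `flags: Flags { broadcast: false }`, i.e. flags = 0x0000. A minimal sketch of the flag check (constant and helper are illustrative, not from netavark or mozim):

```rust
// RFC 2131 defines a single bit in the 16-bit flags field: the broadcast
// flag, the most significant bit. All other bits must be zero.
const BROADCAST_FLAG: u16 = 0x8000;

/// Returns true when a DHCP message asks the server to broadcast its replies.
fn is_broadcast(flags: u16) -> bool {
    flags & BROADCAST_FLAG != 0
}

fn main() {
    // `broadcast: false` in the mozim log corresponds to flags = 0x0000.
    assert!(!is_broadcast(0x0000));
    assert!(is_broadcast(0x8000));
    println!("broadcast requested: {}", is_broadcast(0x0000));
}
```

Note that even with the broadcast flag set, the DHCPRELEASE itself would still be unicast to the server; the flag only affects server-to-client traffic.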
@Rishikpulhani Ok I think the packet should be able to get received generally, but maybe there is another bug somewhere. I don't really have much time to debug it deeply, but I think if we send it and your tcpdump confirms that, it is good enough for me now.
```rust
log::error!("{err}");
continue;
pub async fn process_client_stream(service_arc: Arc<Mutex<DhcpV4Service>>) {
loop {
```
could this not just stay as a loop like this

```rust
while let Some(lease) = service_arc.lock().await.client.next().await {
    ...
}
```

The extra scopes to reduce lock holding are interesting but not actually doing anything, since this function is the only one holding the lock anyway.
Sorry @Luap99, but this was done to prevent a potential deadlock scenario.
If we use:

```rust
while let Some(lease) = service_arc.lock().await.client.next().await {
    ...
}
```

we end up holding the lock on service_arc for the entire loop. Later, inside the Ok arm of the match on lease, when we need to access the client inside DhcpV4Service, we again need to acquire the lock on service_arc:

```rust
match lease {
    Ok(lease) => {
        let mut client = service_arc.lock().await;
        log::info!(
            "got new lease for mac {}: {:?}",
            &client.network_config.container_mac_addr,
            &lease
        );
        // rest of the code
    }
```

This would result in a deadlock.
To avoid this, I restructured it into a loop with equivalent stopping conditions using an if/else and break. This way, the lock is released after acquiring the lease and then reacquired only when needed to access the client.
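The guard-scoping pattern described above can be sketched in a self-contained way. This is not netavark's code: `Service` is a hypothetical stand-in for `DhcpV4Service`, and `std::sync::Mutex` plays the role of `tokio::sync::Mutex` (in both, locking again while an earlier guard from the same task/thread is still alive blocks forever):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for DhcpV4Service: a queue of incoming leases and a log.
struct Service {
    pending: Vec<String>,
    log: Vec<String>,
}

fn process_stream(service: Arc<Mutex<Service>>) {
    loop {
        // Scope the first lock so its guard is dropped before the body runs.
        let lease = {
            let mut s = service.lock().unwrap();
            s.pending.pop()
        }; // <- guard dropped here

        match lease {
            Some(lease) => {
                // Safe to lock again: the previous guard is gone. With
                // `while let Some(l) = service.lock().unwrap().pending.pop()`
                // the scrutinee's guard would live across this body, and
                // this second lock() would deadlock.
                let mut s = service.lock().unwrap();
                s.log.push(format!("got new lease: {lease}"));
            }
            None => break,
        }
    }
}

fn main() {
    let service = Arc::new(Mutex::new(Service {
        pending: vec!["10.0.0.2".into(), "10.0.0.3".into()],
        log: Vec::new(),
    }));
    process_stream(Arc::clone(&service));
    assert_eq!(service.lock().unwrap().log.len(), 2);
    println!("{:?}", service.lock().unwrap().log);
}
```

The key detail is that `while let` desugars to a `match`, and the scrutinee's temporaries (including a `MutexGuard`) live until the end of each loop iteration, which is why the explicit inner scope is needed.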
test-dhcp/003-teardown.bats (Outdated)

```bash
function start_proxy() {
    RUST_LOG=debug ip netns exec "$NS_NAME" $NETAVARK dhcp-proxy --dir "$TMP_TESTDIR" --uds "$TMP_TESTDIR" &>"$TMP_TESTDIR/proxy.log" &
    PROXY_PID=$!
}
```
This can be removed now
I have removed it; however, this means there will no longer be any tests related to DHCPRELEASE in the testing suite.
Force-pushed 519fa3f to 5b2eb34
I’ve incorporated the changes you suggested.
feat(dhcp): Send DHCPRELEASE on container teardown

When a container using DHCP is torn down, its lease is left active on the DHCP server until it expires. This can be problematic in environments with small IP pools or long lease times. In setups using Dynamic DNS (DDNS), it can also lead to stale DNS records.

This change introduces the capability for netavark to send a DHCPRELEASE message to the DHCP server when a container's network is torn down. This is implemented by:
- Adding a `release_lease` method to the `DhcpV4Service` in `dhcp_service.rs`, which wraps the `release` function from the underlying mozim client.
- Wrapping `DhcpV4Service` inside an `Arc<>`, sharing it with the `process_client_stream` function, and also storing it in the task map.
- Using it in the `process_client_stream` task and in the teardown function to send the DHCPRELEASE message.
Force-pushed 5b2eb34 to d457992
When a container using DHCP is torn down, its lease is left active on the DHCP server until it expires. This can be problematic in environments with small IP pools or long lease times. In setups using Dynamic DNS (DDNS), it can also lead to stale DNS records.
This change introduces the capability for netavark to send a DHCPRELEASE message to the DHCP server when a container's network is torn down.
Implementation
- Added a `release_lease` method to the `DhcpV4Service` in `dhcp_service.rs`, which wraps the `release` function from the underlying mozim client.
- Updated the `teardown` gRPC handler in `dhcp_proxy.rs` to create a temporary `DhcpV4Service` instance, retrieve the lease from the cache, and call the new `release_lease` method in a "best-effort" manner.

Testing
This feature was verified using the BATS suite in the test-dhcp/ directory.
fixes #1271
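The "best-effort" shape of the teardown path can be sketched as follows. This is an illustrative stand-alone sketch, not the PR's actual code: `DhcpV4Service`, `DhcpError`, and the log vector are simplified stand-ins for the real types in `dhcp_proxy.rs`:

```rust
// Simulated error and service types; the real service wraps the mozim client.
#[derive(Debug)]
struct DhcpError(String);

struct DhcpV4Service {
    fail_release: bool, // simulate an unreachable DHCP server
}

impl DhcpV4Service {
    fn release_lease(&mut self, mac: &str) -> Result<(), DhcpError> {
        if self.fail_release {
            Err(DhcpError(format!("no route to server for {mac}")))
        } else {
            Ok(())
        }
    }
}

// Teardown never fails because of a release error: the error is logged and
// the rest of the teardown (cache cleanup, interface removal) proceeds.
fn teardown(mac: &str, mut service: DhcpV4Service, log: &mut Vec<String>) -> Result<(), DhcpError> {
    match service.release_lease(mac) {
        Ok(()) => log.push(format!("Successfully sent DHCPRELEASE for {mac}")),
        Err(e) => log.push(format!("Failed to send DHCPRELEASE for {mac}: {e:?}")),
    }
    // ...remove lease from cache, tear down interfaces, etc.
    Ok(())
}

fn main() {
    let mut log = Vec::new();
    let res = teardown("aa:bb:cc:dd:ee:ff", DhcpV4Service { fail_release: true }, &mut log);
    assert!(res.is_ok()); // teardown still succeeds
    println!("{}", log[0]);
}
```

The point of the sketch is the error handling: the release result is matched and logged but never propagated, so a dead or unreachable DHCP server can never block container teardown.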
Help Needed: Blocked on Testing

I am opening this as a Draft PR because I am currently blocked on testing due to persistent environmental issues in my WSL Fedora setup. I've attempted several testing architectures (macvlan, unmanaged bridge with a containerized DHCP server, etc.) but have been consistently blocked by WSL-specific networking and systemd user session problems.

The code is ready for a logical review, and I would greatly appreciate it if a maintainer with a stable test environment could help verify the functionality or suggest a reliable testing setup for WSL.
Summary by Sourcery

Enable netavark to send DHCPRELEASE when tearing down container networks to prevent stale leases and DNS records.