-
Notifications
You must be signed in to change notification settings - Fork 185
Enhanced connection health monitoring. #895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enhanced connection health monitoring. #895
Conversation
Hello @acul71 please can I get a review on this? |
@bomanaps : Nice progress, Mercy. I would encourage reviews by @lla-dane and @Sahilgill24 along with @acul71, who all are working in transport protocols and whose work will be positively impacted by your PR. |
Hello @acul71 @lla-dane & @Sahilgill24 please can I get a review on this? |
@bomanaps : Also, adding @sukhman-sukh to this thread. Would like him to share his reviews as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #895 Review: Enhanced Connection Health Monitoring
Executive Summary
PR #895 introduces comprehensive connection health monitoring for py-libp2p with excellent implementation quality. However, two key issues need addressing before merge.
Recommendation: ✅ APPROVE with REQUEST for CHANGES
Core Implementation Assessment
✅ Excellent Technical Implementation
- Well-architected: Clean
ConnectionHealth
andHealthConfig
dataclasses - Feature-complete: Latency tracking, bandwidth monitoring, health-based load balancing
- Production-ready: Proper error handling, validation, and performance optimization
Key Issues Requiring Changes
🚨 1. API Accessibility Gap (Priority: High)
Problem: Health monitoring requires new_swarm()
but examples use new_host()
API.
# Current: Examples use this (can't enable health monitoring)
host = new_host(key_pair=key_pair)
# Current: Health monitoring requires this (inconsistent with examples)
config = ConnectionConfig(enable_health_monitoring=True)
swarm = new_swarm(connection_config=config)
# Missing: This should work but doesn't
host = new_host(connection_config=config) # ❌ No such parameter
Impact: Developers cannot easily add health monitoring to existing host-based code like echo.py
.
Solution: Add connection_config
parameter to new_host()
function.
Detailed API analysis and implementation options provided separately.
⚠️ 2. Integration Testing Gap (Priority: Medium)
Problem: Tests use mocked connections, not real network behavior.
Impact: No validation of actual trio networking with health monitoring.
Solution: Add real-world demonstration to test suite or examples directory.
Example integration demo available showing the system works with real connections.
Recommendations
- Extend
new_host()
API - Addconnection_config
parameter for API consistency - Add integration demo - Include real-world networking validation
This is a demo I wrote to check the module you can do something like this: #899
Conclusion
Excellent core implementation that significantly enhances py-libp2p's reliability. The health monitoring system is well-designed and performs excellently at scale. However, the API accessibility gap prevents seamless integration into existing code patterns.
Final Recommendation: ✅ APPROVE with REQUEST for CHANGES
Address the API consistency issue to ensure health monitoring is accessible through the standard new_host()
interface used throughout the ecosystem.
libp2p/network/swarm.py
Outdated
|
||
# Start health monitoring if enabled | ||
if self.connection_config.enable_health_monitoring: | ||
self._health_monitoring_task = nursery.start_soon( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type of nursery.start_soon
is None
, so the check below doesn't really do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pacrob Thanks for catching that! You're absolutely right - nursery.start_soon()
returns None
, so the check was useless. I've fixed it by replacing _health_monitoring_task
with a boolean flag _health_monitoring_active
that properly tracks the health monitoring state. The health monitoring task cancellation is now handled by the nursery cleanup automatically.
async def _ping_connection(self, conn: INetConn) -> bool: | ||
"""Ping a connection to check health.""" | ||
try: | ||
# Use a simple stream creation test as ping | ||
stream = await conn.new_stream() | ||
await stream.close() | ||
return True | ||
except Exception: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define a ping_timeout
in Swarm's ConnectionConfig
, but it doesn't seem to be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pacrob Good catch! You're right - the ping_timeout
was defined in the config but not actually used. I've fixed it by wrapping the ping operation with trio.fail_after(self.connection_config.ping_timeout)
so it now properly respects the configured timeout value.
- Fix issue identified by @pacrob: nursery.start_soon() returns None - Replace _health_monitoring_task (None) with _health_monitoring_active (bool) - Update class definition and all references to use boolean flag - Health monitoring task cancellation handled by nursery cleanup - Resolves: 'The return type of nursery.start_soon is None, so the check below doesn't really do anything'
- Add tests for health monitoring task tracking (boolean flag) - Add tests for ping connection timeout functionality - Add tests for health monitoring lifecycle and cancellation - Add tests for disabled health monitoring - Add tests for different ping timeout configurations - Add tests for edge cases and configuration validation - All 9 tests pass, proving both fixes work correctly: 1. Health monitoring task tracking (boolean flag) 2. Ping connection timeout usage in _ping_connection method Resolves maintainer feedback from @pacrob on both issues.
- Change from hardcoded string to logging.getLogger(__name__) - Matches pattern used in identify.py and other modules - Follows Python logging best practices - More maintainable if module is moved or renamed
Hi @bomanaps how u doing? |
What was wrong?
Issue ##717 (comment)
How was it fixed?
This PR implements a comprehensive connection health monitoring system for Python libp2p, matching the capabilities of the Go and JS reference implementations.
Summary of approach.
ConnectionHealth
dataclass to track latency, success rates, stream counts, bandwidth, stability, and health scores per connection.To-Do
To-Do
Cute Animal Picture