@bmeagherix
Contributor

Add a suite of failover tests for NVMe-oF(TCP). Only a subset of them runs by default.

test_failover: Basic failover tests across multiple dimensions

  • Target implementation: kernel vs SPDK
  • Failover mode: ANA vs IP takeover
  • Failure type: orderly vs crash
  • I/O state: active vs idle
  • Namespace count: 1 vs 3

test_failover_scale: Large-scale failover tests

  • Target implementation: kernel vs SPDK
  • Failover mode: ANA vs IP takeover
  • Failure type: orderly vs crash
  • Scale: 51 subsystems (50 single-namespace + 1 with 20 namespaces) = 70 total namespaces
  • Verifies unique data patterns per namespace survive failover
  • Measures failover timing (must complete within MAX_FAILOVER_TIME seconds)
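
One way to give every namespace a distinguishable pattern is to derive it from the subsystem and namespace identifiers. This is only a sketch of the idea: `make_pattern`, the seed format and the 4096-byte size are illustrative assumptions, not the helpers used by the tests in this PR.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed size, for illustration only


def make_pattern(subsys_index: int, nsid: int, size: int = BLOCK_SIZE) -> bytes:
    """Return a deterministic byte pattern unique to (subsys_index, nsid)."""
    seed = f"subsys{subsys_index}-ns{nsid}".encode()
    out = bytearray()
    counter = 0
    # Chain SHA-256 digests until enough bytes have been produced.
    while len(out) < size:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:size])


# Scale layout from the description: 50 single-namespace subsystems
# plus one subsystem with 20 namespaces.
layout = [(s, 1) for s in range(50)] + [(50, n) for n in range(1, 21)]
patterns = {key: make_pattern(*key) for key in layout}
```

Because the pattern is a pure function of the identifiers, the verifier on the new active node can regenerate the expected bytes after failover instead of carrying them across.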

test_failover_and_failback: Failover + failback cycle tests

  • Target implementation: kernel vs SPDK
  • Failover mode: ANA vs IP takeover
  • Pattern: crash failover -> orderly failback
  • Verifies data integrity survives complete cycle back to original node
  • Measures failover and failback timing (must complete within MAX_FAILOVER_TIME seconds)
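
The timing checks can be pictured roughly as follows. This is a sketch under stated assumptions: `timed`, `check_within_limit` and `fake_failover` are hypothetical names, and the value 60 is taken from the "60s limit" mentioned in the commit notes.

```python
import time

MAX_FAILOVER_TIME = 60  # seconds; the 60s limit from the commit notes


def timed(operation) -> float:
    """Run operation() and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    operation()
    return time.monotonic() - start


def check_within_limit(label: str, elapsed: float,
                       limit: float = MAX_FAILOVER_TIME) -> None:
    """Fail with a readable message if the operation exceeded the limit."""
    assert elapsed <= limit, f"{label} took {elapsed:.1f}s (limit {limit}s)"


def fake_failover() -> None:
    """Stand-in for the real failover trigger (hypothetical)."""
    time.sleep(0.01)


elapsed = timed(fake_failover)
check_within_limit("failover", elapsed)
```

`time.monotonic()` is used rather than `time.time()` so that a clock adjustment during the failover cannot distort the measurement.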

Full Matrix (with RUN_FULL_MATRIX=1 - 44 tests total):

  • TestFailover: 32 tests (2×2×2×2×2)
  • TestFailoverScale: 8 tests (2×2×2)
  • TestFailback: 4 tests (2×2)

Quick Subset (default - 10 tests total):

TestFailover (4 tests):

  • Parameters: namespace_count=1, failure_type='orderly', io_active=True
  • Combinations:
    • kernel + ANA
    • kernel + ip_takeover
    • spdk + ANA
    • spdk + ip_takeover

TestFailoverScale (2 tests):

  • Specific combinations only:
    • kernel + ANA + orderly
    • spdk + ip_takeover + crash

TestFailback (4 tests - no filtering):

  • All combinations run:
    • kernel + ANA
    • kernel + ip_takeover
    • spdk + ANA
    • spdk + ip_takeover
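
The test counts above can be reproduced with a small sketch of the selection logic. The function names and the way `RUN_FULL_MATRIX` is parsed are assumptions for illustration. The PR's actual parametrization may be structured differently.

```python
import os
from itertools import product

TARGETS = ["kernel", "spdk"]
MODES = ["ANA", "ip_takeover"]
FAILURE_TYPES = ["orderly", "crash"]
IO_ACTIVE = [True, False]
NAMESPACE_COUNTS = [1, 3]

# Quick-subset combinations for the scale tests, as listed in the description.
QUICK_SCALE = {("kernel", "ANA", "orderly"), ("spdk", "ip_takeover", "crash")}


def select_failover(full: bool) -> list:
    combos = product(TARGETS, MODES, FAILURE_TYPES, IO_ACTIVE, NAMESPACE_COUNTS)
    if full:
        return list(combos)
    # Quick subset: namespace_count=1, failure_type='orderly', io_active=True
    return [c for c in combos if c[2] == "orderly" and c[3] is True and c[4] == 1]


def select_scale(full: bool) -> list:
    combos = list(product(TARGETS, MODES, FAILURE_TYPES))
    return combos if full else [c for c in combos if c in QUICK_SCALE]


def select_failback(full: bool) -> list:
    # No filtering: all four combinations run in both modes.
    return list(product(TARGETS, MODES))


def total_tests(full: bool) -> int:
    return len(select_failover(full)) + len(select_scale(full)) + len(select_failback(full))


# Assumed parsing of the environment variable named in the description.
run_full = os.environ.get("RUN_FULL_MATRIX") == "1"
print(total_tests(run_full))
```

With `full=False` the three selectors yield 4, 2 and 4 combinations (10 tests). With `full=True` they yield 32, 8 and 4 (44 tests).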

@bugclerk changed the title from "NAS-135198" to "NAS-135198 / 26.04 / NAS-135198" on Nov 21, 2025

@bmeagherix changed the title from "NAS-135198 / 26.04 / NAS-135198" to "NAS-135198 / 26.04 / HA CI tests for NVMe-oF(TCP)" on Nov 21, 2025

- Add TestFailoverScale class with 8 parametric variations
- Use ThreadPoolExecutor for parallel connection/verification
- Add MAX_FAILOVER_TIME check (60s limit)
- Set 15-minute timeout for test (ZVOL overhead)
- Update docstring to document both test suites
- Add TestFailback crash->orderly failback cycle tests (4 tests)
- Change TestFailover fixtures to class scope to fix backend switching
- Add restore_original_master fixture to restore HA state after tests
- Add MAX_FAILOVER_TIME checks for both failover and failback operations
- Add flush before crash failover to reach stable storage
- Increase namespace verification retries from 5 to 60 to handle namespaces that initialize gradually after a crash
- Fix flush method: send_flush() -> flush_namespace()
- Add read retry loop for namespaces not ready after failover
- Increase teardown sleep from 5s to 15s for cleanup
- Add fixture lifecycle logging to diagnose teardown issues
- Verify service state and port release after stop
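
The parallel verification and read-retry items above could look roughly like this. It is a minimal sketch, not the code from these commits: `read_with_retry`, `verify_all`, the `OSError` error type and the worker count are all assumptions. Only the use of `ThreadPoolExecutor` and the retry count of 60 come from the commit notes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_with_retry(read_fn, retries: int = 60, delay: float = 1.0):
    """Call read_fn() until it succeeds or the retries are exhausted."""
    last_error = None
    for _ in range(retries):
        try:
            return read_fn()
        except OSError as exc:  # assumed error for a namespace that is not ready yet
            last_error = exc
            time.sleep(delay)
    raise TimeoutError(f"not ready after {retries} attempts") from last_error


def verify_all(readers: dict, expected: dict, max_workers: int = 8,
               **retry_kwargs) -> list:
    """Verify namespaces in parallel; return the keys whose data mismatched."""
    def check(key):
        data = read_with_retry(readers[key], **retry_kwargs)
        return key, data == expected[key]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results follow the key order.
        return [key for key, ok in pool.map(check, readers) if not ok]
```

Here `readers` maps a namespace key to a zero-argument callable that returns the bytes read from that namespace, and `expected` maps the same keys to the bytes written before failover. An empty return value means every namespace matched.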