Commit 9cd00ec

Making some more changes and additions
Signed-off-by: ytimocin <[email protected]>
1 parent 5d79e44 commit 9cd00ec

2 files changed: +115 -234 lines changed

architecture/2025-03-upgrade-design-doc.md

Lines changed: 115 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -168,7 +168,7 @@ Upgrade complete! Radius has been successfully upgraded to v0.44.0 with your cus
168168

169169
**Exceptions:**
170170

171-
1. If custom configuration parameters are invalid (or should it be ignored?)
171+
1. If custom configuration parameters are invalid
172172

173173
#### Scenario 3: Handling upgrade failure and recovery
174174

@@ -361,28 +361,47 @@ This interface will be implemented (or existing will be improved) to handle vers
361361
362362
**Upgrade Lock Mechanism:**
363363
364-
To prevent concurrent modifications during upgrades, we'll implement a lock system that's acquired early in the process:
364+
- To prevent concurrent data-modifying operations during `rad upgrade kubernetes`, we'll rely exclusively on datastore locks (no Kubernetes Leases).
365365
366366
```go
367+
// UpgradeLock is implemented per datastore (Postgres, etcd) to serialize upgrades.
367368
type UpgradeLock interface {
368-
// Acquires the upgrade lock, preventing other commands from modifying data
369+
// AcquireLock blocks until it obtains an exclusive lock or the context deadline is exceeded.
369370
AcquireLock(ctx context.Context) error
370371
371-
// Releases the upgrade lock, allowing other commands to proceed
372+
// ReleaseLock frees the lock immediately so others can proceed.
372373
ReleaseLock(ctx context.Context) error
373374
374-
// Checks if an upgrade is currently in progress
375+
// IsUpgradeInProgress returns true if a valid (non‑stale) lock is held by another process.
375376
IsUpgradeInProgress(ctx context.Context) (bool, error)
376377
}
377378
```
378379
380+
**Timeouts:** callers must supply a context with a finite deadline (e.g. 2 min) to avoid blocking forever.
381+
**Stale-lock detection:** each lock carries a TTL/heartbeat; expired locks are cleaned up automatically before `AcquireLock` attempts to take the lock.
382+
**Force cleanup:** the `--force` flag allows manual removal of stale or orphaned locks.
383+
384+
Usage in CLI commands:
385+
386+
```go
387+
ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
388+
defer cancel()
389+
390+
// attempt to acquire (with timeout + stale‐lock cleanup)
391+
// AcquireLock must first scan for & remove any expired/orphaned locks before locking.
392+
if err := upgradeLock.AcquireLock(ctx); err != nil {
393+
return fmt.Errorf("cannot start upgrade: %w", err)
394+
}
395+
// Release with a fresh context: ctx may already have expired when the deferred call runs.
defer func() {
396+
_ = upgradeLock.ReleaseLock(context.Background())
397+
}()
398+
```
399+
379400
We can utilize data-store-level lock mechanisms for implementing the distributed locking mechanism:
380401
381402
- PostgreSQL: <https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS>
382403
- etcd: <https://etcd.io/docs/v3.5/tutorials/how-to-create-locks/>
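The timeout and stale-lock behavior described above can be illustrated with a minimal in-memory sketch. It is not the Postgres or etcd implementation: `lockStore`, `memoryLock`, and `demo` are hypothetical names, the shared datastore record is replaced by a mutex-guarded struct, and heartbeat renewal is left out, so a lock simply becomes stale once its TTL lapses.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// lockStore stands in for the shared datastore record that holds the lock (hypothetical).
type lockStore struct {
	mu      sync.Mutex
	holder  string    // empty when nobody holds the lock
	expires time.Time // from this instant on the lock counts as stale
}

// memoryLock is one process's handle on the shared lock.
type memoryLock struct {
	store *lockStore
	owner string
	ttl   time.Duration
}

// tryAcquire takes the lock if it is free or stale.
func (l *memoryLock) tryAcquire(now time.Time) bool {
	l.store.mu.Lock()
	defer l.store.mu.Unlock()
	if l.store.holder != "" && now.Before(l.store.expires) {
		return false // a valid (non-stale) lock is held
	}
	// Free, or stale and therefore reclaimable.
	l.store.holder, l.store.expires = l.owner, now.Add(l.ttl)
	return true
}

// AcquireLock polls until the lock is obtained or ctx is done.
func (l *memoryLock) AcquireLock(ctx context.Context) error {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		if l.tryAcquire(time.Now()) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("acquire upgrade lock: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// ReleaseLock frees the lock if this process holds it.
func (l *memoryLock) ReleaseLock(ctx context.Context) error {
	l.store.mu.Lock()
	defer l.store.mu.Unlock()
	if l.store.holder != l.owner {
		return errors.New("release upgrade lock: not held by this process")
	}
	l.store.holder = ""
	return nil
}

// IsUpgradeInProgress reports whether another process holds a non-stale lock.
func (l *memoryLock) IsUpgradeInProgress(ctx context.Context) (bool, error) {
	l.store.mu.Lock()
	defer l.store.mu.Unlock()
	return l.store.holder != "" && l.store.holder != l.owner && time.Now().Before(l.store.expires), nil
}

// demo walks through contention, timeout, and stale-lock takeover.
func demo() error {
	store, ctx := &lockStore{}, context.Background()
	a := &memoryLock{store: store, owner: "cli-a", ttl: 300 * time.Millisecond}
	b := &memoryLock{store: store, owner: "cli-b", ttl: 300 * time.Millisecond}
	if err := a.AcquireLock(ctx); err != nil {
		return err
	}
	if busy, _ := b.IsUpgradeInProgress(ctx); !busy {
		return errors.New("b should see an upgrade in progress")
	}
	short, cancelShort := context.WithTimeout(ctx, 30*time.Millisecond)
	defer cancelShort()
	if err := b.AcquireLock(short); !errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("expected deadline exceeded, got %v", err)
	}
	// a never releases (simulated crash); b takes over once the TTL lapses.
	long, cancelLong := context.WithTimeout(ctx, 2*time.Second)
	defer cancelLong()
	if err := b.AcquireLock(long); err != nil {
		return err
	}
	return b.ReleaseLock(ctx)
}

func main() {
	fmt.Println("demo error:", demo())
}
```

In a real datastore the check-and-set inside `tryAcquire` would need to be a single atomic operation (for example a transaction or compare-and-swap), which the mutex only imitates here.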
383404
384-
We can utilize Kubernetes Lease objects (coordination.k8s.io/v1) for implementing the distributed locking mechanism (open to discussion and suggestions). Leases are purpose-built for this use case, providing built-in lease duration and automatic expiration capabilities. For more information, see: <https://kubernetes.io/docs/concepts/architecture/leases/>.
385-
386405
Other CLI commands (`rad deploy app.bicep`, `rad delete app my-app` or other data-changing commands) that modify data will check for this lock before proceeding:
387406
388407
```go
@@ -533,7 +552,7 @@ The implementation will primarily focus on the following components:
533552
534553
1. **Upgrade Command**: The `rad upgrade kubernetes` command implementation in the CLI codebase
535554
2. **Version Validation**: Logic to verify compatibility between versions
536-
3. **Lock Mechanism**: Kubernetes Lease-based distributed locking system
555+
3. **Lock Mechanism**: Data-store-level distributed locking system
537556
4. **Backup/Restore**: User data protection system using ConfigMaps/PVs
538557
5. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
539558
6. **Health Verification**: Component readiness and health check mechanisms
@@ -592,6 +611,8 @@ behaviors or APIs.
592611
593612
The following outlines the key implementation steps required to deliver the Radius Control Plane upgrade feature. Each step includes necessary unit and functional tests to ensure reliability and correctness, along with dependency information.
594613
614+
### Version 1: Simple `rad upgrade kubernetes` command with incremental upgrade
615+
595616
1. **Radius Helm Client Updates**
596617
597618
- Implement the upgrade functionality in the Radius Helm client: [helmclient.go](https://github.com/radius-project/radius/blob/main/pkg/cli/helm/helmclient.go).
@@ -643,6 +664,92 @@ The following outlines the key implementation steps required to deliver the Radi
643664
- Add necessary unit and functional tests to validate command behavior.
644665
- This task depends on all previous tasks (1-6) and should be implemented last.
645666
667+
### Version 2: Data Store Migrations and Rollbacks
668+
669+
1. Pick & embed a migration tool
670+
671+
- Add `migrations/` dir and versioned SQL (or Go) files
672+
- Vendor or import the tool (e.g. golang-migrate) so we have a single binary
673+
674+
2. Define migration tracking schema
675+
676+
- For Postgres: create a `schema_migrations` table (tool-standard)
677+
- For etcd: track applied migrations via a reserved key prefix
678+
679+
3. Wiring in the CLI/server
680+
681+
- On install/upgrade: run `migrate up` before Helm chart upgrade
682+
- On rollback: run `migrate down` (or tool-provided rollback) if upgrade fails
683+
- (Optional) Expose `rad migrate status|up|down` commands for operators
684+
685+
4. Schema evolution helpers
686+
687+
- Provide utilities for common ops (add column, rename, rekey in etcd)
688+
- Write examples/migrations: e.g. move key-value → Postgres row
689+
690+
5. Testing migrations
691+
692+
- Unit tests for each migration file (idempotent, up/down)
693+
- Integration tests: start with old schema + data → apply migrations → verify shape
694+
695+
6. Documentation & patterns
696+
697+
- Doc: “How to write a new migration”
698+
- Versioning rules (major/minor jumps, compatibility guarantees)
699+
- Rollback advice: when to write reversible vs. irreversible migrations
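The "migrate up before the Helm upgrade, migrate down on failure" wiring from step 3 can be sketched as follows. This is an illustration under stated assumptions, not the golang-migrate API: `migration`, `upgradeWithMigrations`, `exampleMigrations`, and `errHelmFailed` are hypothetical names, the data store is a plain map, and the Helm upgrade is a caller-supplied function.

```go
package main

import (
	"errors"
	"fmt"
)

// errHelmFailed simulates a failing Helm chart upgrade.
var errHelmFailed = errors.New("simulated helm failure")

// migration is a hypothetical versioned step with forward and reverse actions.
type migration struct {
	version int
	up      func(state map[string]string) error
	down    func(state map[string]string) error
}

// upgradeWithMigrations applies every migration newer than *applied (migs must be
// sorted by version), then runs helmUpgrade. On any failure it reverts the
// migrations applied during this run, newest first.
func upgradeWithMigrations(state map[string]string, applied *int, migs []migration, helmUpgrade func() error) error {
	start := *applied
	var done []migration
	rollback := func(cause error) error {
		for i := len(done) - 1; i >= 0; i-- {
			if err := done[i].down(state); err != nil {
				return errors.Join(cause, fmt.Errorf("revert migration %d: %w", done[i].version, err))
			}
			if i > 0 {
				*applied = done[i-1].version
			} else {
				*applied = start
			}
		}
		return cause
	}
	for _, m := range migs {
		if m.version <= *applied {
			continue // already applied in an earlier run
		}
		if err := m.up(state); err != nil {
			return rollback(fmt.Errorf("migration %d: %w", m.version, err))
		}
		*applied = m.version
		done = append(done, m)
	}
	if err := helmUpgrade(); err != nil {
		return rollback(fmt.Errorf("helm upgrade: %w", err))
	}
	return nil
}

// exampleMigrations renames a key (version 1) and adds a schema marker (version 2).
func exampleMigrations() []migration {
	return []migration{
		{
			version: 1,
			up:      func(s map[string]string) error { s["environment"] = s["env"]; delete(s, "env"); return nil },
			down:    func(s map[string]string) error { s["env"] = s["environment"]; delete(s, "environment"); return nil },
		},
		{
			version: 2,
			up:      func(s map[string]string) error { s["schema"] = "v2"; return nil },
			down:    func(s map[string]string) error { delete(s, "schema"); return nil },
		},
	}
}

func main() {
	state, applied := map[string]string{"env": "prod"}, 0
	err := upgradeWithMigrations(state, &applied, exampleMigrations(), func() error { return errHelmFailed })
	fmt.Println("error:", err, "applied:", applied, "state:", state)
}
```

The sketch only reverts migrations whose `up` completed; a migration that fails halfway would need its own cleanup, which is one reason the list above calls for idempotent, reversible migration files.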
700+
701+
### Version 3: Rollback to the most recent successful version of Radius
702+
703+
1. **Version History Tracking**
704+
705+
- Extend the backup system to record the last successful control-plane version (e.g. in a reserved etcd key or Postgres table).
706+
- Ensure every successful `rad upgrade kubernetes` run writes an entry with timestamp and version.
707+
708+
2. **`rad rollback` CLI Command**
709+
710+
- Introduce `rad rollback kubernetes` that reads the recorded “last known good” version and invokes the same upgrade path in reverse.
711+
- Integrate rollback into the existing lock and backup/restore interfaces.
712+
713+
3. **Stateful Rollback Validation**
714+
715+
- Implement post-rollback health checks (component health, data-integrity assertions).
716+
- Fail early if rollback target is stale or schema mismatches prevent safe restoration.
717+
718+
4. **End-to-End Test Matrix**
719+
- Add scenarios: v0.43 → v0.44 upgrade → failure → `rad rollback` → verify control plane matches pre-upgrade state.
720+
- Test edge cases where no previous version is recorded.
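The version-history idea above can be sketched roughly as follows, assuming that only successful upgrades are recorded so that the newest record is the "last known good" version. `versionHistory`, `upgradeRecord`, `errNoHistory`, and `errSimulated` are hypothetical names; persistence to etcd or Postgres and the staleness check are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	// errNoHistory covers the edge case where no previous version is recorded.
	errNoHistory = errors.New("no successful upgrade recorded")
	// errSimulated stands in for a real upgrade failure.
	errSimulated = errors.New("simulated upgrade failure")
)

// upgradeRecord is one entry written after a successful upgrade.
type upgradeRecord struct {
	Version   string
	Timestamp time.Time
}

// versionHistory keeps records in memory; a real store would persist them.
type versionHistory struct {
	records []upgradeRecord
	now     func() time.Time // injectable clock
}

func newVersionHistory() *versionHistory {
	return &versionHistory{now: time.Now}
}

// RecordSuccess appends an entry for a version that was installed successfully.
func (h *versionHistory) RecordSuccess(version string) {
	h.records = append(h.records, upgradeRecord{Version: version, Timestamp: h.now()})
}

// LastKnownGood returns the most recent successful version, or errNoHistory.
func (h *versionHistory) LastKnownGood() (upgradeRecord, error) {
	if len(h.records) == 0 {
		return upgradeRecord{}, errNoHistory
	}
	return h.records[len(h.records)-1], nil
}

// upgrade runs apply and records target only on success. On failure the
// returned error names the version a rollback would restore.
func (h *versionHistory) upgrade(target string, apply func() error) error {
	if err := apply(); err != nil {
		good, histErr := h.LastKnownGood()
		if histErr != nil {
			return errors.Join(err, histErr)
		}
		return fmt.Errorf("upgrade to %s failed, rollback target is %s: %w", target, good.Version, err)
	}
	h.RecordSuccess(target)
	return nil
}

func main() {
	h := newVersionHistory()
	h.RecordSuccess("v0.43.0")
	err := h.upgrade("v0.44.0", func() error { return errSimulated })
	good, _ := h.LastKnownGood()
	fmt.Println("error:", err)
	fmt.Println("last known good:", good.Version)
}
```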
721+
722+
### Version 4: Skip versions during `rad upgrade kubernetes`
723+
724+
1. **Skip-Aware Pre-flight Checks**
725+
726+
- Enhance `VersionValidator` to detect multi-version jumps and verify compatibility (e.g. migrations available).
727+
- Warn or block skips if there are known incompatible intermediate releases.
728+
729+
2. **Migration Plan Bundles**
730+
731+
- Generate a composite plan when skipping (e.g. v0.42 → v0.45):
732+
  - List required data migrations in sequence
733+
  - Group Helm chart upgrades and backup points for each intermediate step
734+
735+
3. **User Confirmation & Dry-Run**
736+
737+
- Prompt the user with a clear confirmation message, e.g. “You're jumping from A→D. We'll run migrations for B and C in turn. Proceed?”
738+
- Offer a `--dry-run` mode that prints the full step list without making changes.
739+
740+
4. **Automated Integration Tests**
741+
742+
- Cover a variety of version skip paths in CI (adjacent vs. multi‑minor).
743+
- Fail if any migration or Helm chart upgrade in the skip path is missing.
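A composite plan for a version skip could be built along these lines. This is a sketch under assumptions: `step` and `planUpgrade` are hypothetical names, releases are given as an ordered list of version strings, and the `available` map stands in for whatever registry tracks which adjacent upgrades ship a migration.

```go
package main

import "fmt"

// step describes an upgrade between two adjacent releases.
type step struct {
	From, To string
}

// indexOf returns the position of x in xs, or -1 if it is absent.
func indexOf(xs []string, x string) int {
	for i, v := range xs {
		if v == x {
			return i
		}
	}
	return -1
}

// planUpgrade lists the adjacent steps needed to go from current to target.
// It fails if either version is unknown, if target is not newer than current,
// or if any intermediate step has no migration available.
func planUpgrade(releases []string, available map[step]bool, current, target string) ([]step, error) {
	from, to := indexOf(releases, current), indexOf(releases, target)
	if from < 0 || to < 0 {
		return nil, fmt.Errorf("unknown version in %s -> %s", current, target)
	}
	if to <= from {
		return nil, fmt.Errorf("target %s is not newer than %s", target, current)
	}
	var plan []step
	for i := from; i < to; i++ {
		s := step{From: releases[i], To: releases[i+1]}
		if !available[s] {
			return nil, fmt.Errorf("missing migration for %s -> %s", s.From, s.To)
		}
		plan = append(plan, s)
	}
	return plan, nil
}

func main() {
	releases := []string{"v0.42", "v0.43", "v0.44", "v0.45"}
	available := map[step]bool{
		{From: "v0.42", To: "v0.43"}: true,
		{From: "v0.43", To: "v0.44"}: true,
		{From: "v0.44", To: "v0.45"}: true,
	}
	plan, err := planUpgrade(releases, available, "v0.42", "v0.45")
	if err != nil {
		fmt.Println("cannot plan upgrade:", err)
		return
	}
	// Dry-run style output: print the steps without executing them.
	for i, s := range plan {
		fmt.Printf("%d. migrate and upgrade %s -> %s\n", i+1, s.From, s.To)
	}
}
```

Printing the plan before executing it gives the confirmation prompt and the CI check the same source of truth: a missing step surfaces as an error before any change is made.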
744+
745+
### Version 5: Support for Air-Gapped Environments
746+
747+
This can be discussed later.
748+
749+
### Version 6: Upgrading Radius on other platforms like `rad upgrade aci`
750+
751+
This can be discussed later.
752+
646753
### Out of Scope for Implementation
647754
648755
- **Dry-run functionality**: After team discussion, the dry-run feature was explicitly excluded from this implementation.
