
[CHORE]: Comprehensive Scalability & Soak-Test Harness (Long-term Stability & Load) - locust, pytest-benchmark, smocker mocked MCP servers #291

@crivetimihai

Description

🧭 Chore Summary: Enterprise-Scale Scalability & Soak-Test Harness

Introduce a production-realistic test harness that validates Gateway performance under massive enterprise-scale load with tiered wave testing from small deployments to million-user enterprises:

  1. Tiered dataset seeding across 4 waves: Small → Medium → Large → Enterprise (up to 1M users, 5M teams, 50K tools)
  2. Multi-layer load testing using Locust (HTTP API), pytest-benchmark (service layer), and smocker (mocked MCP servers)
  3. Federation & caching stress testing with L1/L2 cache validation under extreme load
  4. Multi-tenancy scale testing validating private/team/global scope performance with millions of entities
  5. Comprehensive reporting with Grafana dashboards, flamegraphs, and enterprise capacity planning

🌊 Wave Matrix & Target Datasets

| Wave | Servers | Tools | Users | Teams | Metrics Retention | Max Users / Team | Load Test Duration |
|------|---------|-------|-------|-------|-------------------|------------------|--------------------|
| Small | 100 | 500 | 10,000 | 50,000 | 90 days | 50,000 | 15 minutes |
| Medium | 1,000 | 2,500 | 100,000 | 500,000 | 1 year | 100,000 | 30 minutes |
| Large | 5,000 | 12,500 | 500,000 | 2,500,000 | 3 years | 500,000 | 60 minutes |
| Enterprise | 10,000 | 50,000 | 1,000,000 | 5,000,000 | 5 years | 1,000,000 | 120 minutes |
| Enterprise Stability | 10,000 | 50,000 | 1,000,000 | 5,000,000 | 5 years | 1,000,000 | 48 hours |

Rule of thumb: Teams ≈ Users × 5 (every user owns 1 private team and belongs to ~4 shared teams).
A separate 48-hour stability run assesses potential memory leaks and infrastructure stability.
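The rule of thumb can be checked against the wave matrix directly; a minimal sketch (`expected_teams` is an illustrative helper, not part of the seeder):

```python
# Mirrors the wave matrix arithmetic: every user owns 1 private team and
# belongs to ~4 shared teams, so Teams ≈ Users × 5.
WAVE_USERS = {
    "small": 10_000,
    "medium": 100_000,
    "large": 500_000,
    "enterprise": 1_000_000,
}

def expected_teams(users: int, shared_teams_per_user: int = 4) -> int:
    """Team count implied by the rule of thumb (1 private + ~4 shared per user)."""
    return users * (1 + shared_teams_per_user)

for wave, users in WAVE_USERS.items():
    print(f"{wave}: {expected_teams(users):,} teams")
```

For every wave this reproduces the Teams column above (e.g. 10,000 users → 50,000 teams).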


🧱 Areas Affected

  • Make targets - make seed-{wave}, make soak-test-{wave}, make federation-load-enterprise, make flamegraph-analysis
  • CI / GitHub Actions - nightly wave testing with a matrix (PostgreSQL/Redis, caching on/off)
  • Docker Compose - Locust cluster, smocker mock services, Redis cluster, PostgreSQL tuning
  • Test infrastructure - enterprise data seeder, realistic user scenarios, federation mocking at scale
  • Monitoring - load-test Grafana dashboard, cache performance metrics, federation health, memory tracking
  • Documentation - comprehensive scalability guide with enterprise capacity planning

βš™οΈ Context / Rationale

Current performance unknowns that block enterprise deployment decisions:

| Critical Question | Today | After This Epic |
|-------------------|-------|-----------------|
| Million-user multi-tenancy performance? | ❓ Unknown | 📊 Query performance across 5M teams |
| 50K-tool federation latency patterns? | ❓ Unknown | 📈 Network overhead, cache efficiency |
| L1/L2 cache behavior with 100GB+ datasets? | ❓ Unknown | 🔍 Memory usage, eviction patterns |
| Database scaling to 500M+ records? | ❓ Unknown | 📋 Connection pooling, query optimization |
| Memory-leak patterns over 48 h at enterprise scale? | ❓ Unknown | 🧠 RSS trends, GC patterns, cache bloat |
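The "RSS trends" question can be reduced to a pass/fail signal by fitting a slope to periodic RSS samples from the 48-hour run. A minimal sketch, assuming 5-minute sampling; `rss_trend_mb_per_hour` is an illustrative helper, not an existing script:

```python
# Hypothetical leak check for the 48-hour stability run: fit a least-squares
# line to sampled RSS readings and flag a sustained upward trend. The sampling
# interval and any threshold are illustrative, not part of the harness.
def rss_trend_mb_per_hour(samples_mb, interval_minutes=5.0):
    """Least-squares slope of RSS over time, in MB per hour."""
    n = len(samples_mb)
    xs = [i * interval_minutes / 60.0 for i in range(n)]  # elapsed hours
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

flat = [1024.0] * 12                            # stable RSS
leaky = [1024.0 + 10.0 * i for i in range(12)]  # +10 MB every 5-minute sample

print(rss_trend_mb_per_hour(flat), rss_trend_mb_per_hour(leaky))
```

A run could then fail if the slope stays above, say, a few MB/hour after warm-up.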

Real-world enterprise scenarios tested:

  • Global enterprise: 50,000 tools across 5,000,000 teams
  • Federation mesh of 100+ external MCP gateways with realistic failure rates
  • 2,000+ concurrent API clients with mixed workloads (read-heavy, write-heavy, federation)
  • Cache warming with 10GB+ datasets, invalidation storms, federation failover

πŸ“ Enhanced Architecture Design

Enterprise Test Environment Stack:

flowchart TD
    %% Load Generation Cluster
    subgraph "Load Generation Cluster"
        LOCUST[Locust Controller<br/>Web UI :8089]
        L1[Locust Worker 1<br/>Enterprise API Load]
        L2[Locust Worker 2<br/>Federation Load]
        L3[Locust Worker 3<br/>Multi-tenant Load]
        L4[Locust Worker 4<br/>Cache Stress Load]
        L5[Locust Worker 5<br/>Write-Heavy Load]
        LOCUST --> L1
        LOCUST --> L2
        LOCUST --> L3
        LOCUST --> L4
        LOCUST --> L5
    end

    %% Mock Services Federation
    subgraph "Mock Services Federation"
        SMOCKER[Smocker Controller<br/>:8080]
        MOCK1[Mock MCP Cluster 1<br/>Tools Provider x20]
        MOCK2[Mock MCP Cluster 2<br/>Resources Provider x20]
        MOCK3[Mock Gateway Fed<br/>Federated Peers x50]
        MOCK4[Mock Enterprise APIs<br/>External Systems x10]
        SMOCKER --> MOCK1
        SMOCKER --> MOCK2
        SMOCKER --> MOCK3
        SMOCKER --> MOCK4
    end

    %% Gateway Cluster Under Test
    subgraph "Gateway Cluster Under Test"
        LB[Load Balancer<br/>nginx :4444]
        GW1[Gateway Instance 1<br/>+L1 Cache 1GB]
        GW2[Gateway Instance 2<br/>+L1 Cache 1GB]
        GW3[Gateway Instance 3<br/>+L1 Cache 1GB]
        GW4[Gateway Instance 4<br/>+L1 Cache 1GB]
        LB --> GW1
        LB --> GW2
        LB --> GW3
        LB --> GW4
    end

    %% Data Layer - Enterprise Scale
    subgraph "Data Layer - Enterprise Scale"
        REDIS_M[Redis Master<br/>L2 Cache + Sessions]
        REDIS_S1[Redis Slave 1<br/>Read Replica]
        REDIS_S2[Redis Slave 2<br/>Read Replica]
        PG_M[(PostgreSQL Master<br/>50GB+ Primary Data)]
        PG_S1[(PostgreSQL Slave 1<br/>Read Replica)]
        PG_S2[(PostgreSQL Slave 2<br/>Read Replica)]
        REDIS_M --> REDIS_S1
        REDIS_M --> REDIS_S2
        PG_M --> PG_S1
        PG_M --> PG_S2
        GW1 --> REDIS_M
        GW2 --> REDIS_M
        GW3 --> REDIS_M
        GW4 --> REDIS_M
        GW1 --> PG_M
        GW2 --> PG_M
        GW3 --> PG_M
        GW4 --> PG_M
    end

    %% Monitoring & Analysis
    subgraph "Monitoring & Analysis"
        PROM[Prometheus<br/>High-Resolution Metrics]
        GRAF[Grafana<br/>Enterprise Dashboard]
        PYSPY[py-spy Cluster<br/>Distributed Profiling]
        ELASTIC[Elasticsearch<br/>Log Aggregation]
        PROM --> GRAF
        GW1 --> PROM
        GW2 --> PROM
        GW3 --> PROM
        GW4 --> PROM
        GW1 --> ELASTIC
        GW2 --> ELASTIC
        GW3 --> ELASTIC
        GW4 --> ELASTIC
    end

    %% Connections
    L1 --> LB
    L2 --> LB
    L3 --> LB
    L4 --> LB
    L5 --> LB

    GW1 --> MOCK1
    GW1 --> MOCK2
    GW1 --> MOCK3
    GW1 --> MOCK4
    GW2 --> MOCK1
    GW2 --> MOCK2
    GW2 --> MOCK3
    GW2 --> MOCK4
    GW3 --> MOCK1
    GW3 --> MOCK2
    GW3 --> MOCK3
    GW3 --> MOCK4
    GW4 --> MOCK1
    GW4 --> MOCK2
    GW4 --> MOCK3
    GW4 --> MOCK4

    classDef load fill:#ffeb3b
    classDef mock fill:#4caf50  
    classDef gateway fill:#2196f3
    classDef data fill:#ff9800
    classDef monitor fill:#9c27b0

    class LOCUST,L1,L2,L3,L4,L5 load
    class SMOCKER,MOCK1,MOCK2,MOCK3,MOCK4 mock
    class LB,GW1,GW2,GW3,GW4 gateway
    class REDIS_M,REDIS_S1,REDIS_S2,PG_M,PG_S1,PG_S2 data
    class PROM,GRAF,PYSPY,ELASTIC monitor

📋 Enhanced Acceptance Criteria

| # | Criteria | Validation Method |
|---|----------|-------------------|
| 1 | Enterprise dataset: 50K tools, 10K servers, 5M teams, 1M users, 5 years of metrics | `make seed-enterprise` completes in < 60 min |
| 2 | Wave testing: all 4 waves (Small → Enterprise) with progressive load increase | `make soak-test-all-waves` runs a 4-hour test cycle |
| 3 | Cache performance at scale: L1/L2 with 10GB+ datasets, 95%+ hit ratio | Grafana shows cache metrics under enterprise load |
| 4 | Federation stress test: 100+ mocked external gateways with realistic patterns | Smocker validates federated call patterns at scale |
| 5 | Multi-tenancy at scale: 5M-team query performance, scope isolation | Benchmark report shows <10% overhead vs single-tenant |
| 6 | CI integration: nightly wave testing with enterprise-scale matrix | GitHub Actions uploads comprehensive reports + flamegraphs |
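Criterion 3's "95%+ hit ratio" can be made precise as a combined L1+L2 ratio. A minimal sketch; the counter names mirror the `cache_l1_hits_total` / `cache_l2_hits_total` / `cache_misses_total` metrics used in the Grafana queries below and are assumptions about the exporter:

```python
# Combined L1+L2 hit ratio backing the 95% acceptance threshold.
def cache_hit_ratio(l1_hits: int, l2_hits: int, misses: int) -> float:
    """Fraction of requests served from either cache tier."""
    total = l1_hits + l2_hits + misses
    return (l1_hits + l2_hits) / total if total else 0.0

print(cache_hit_ratio(900, 60, 40))  # 960 of 1000 requests served from cache
```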

πŸ› οΈ Comprehensive Task List

Phase 1: Enterprise Data Seeder with Wave System

  • 1.1 Wave-based seeder script scripts/seed_enterprise.py

    # Enterprise-scale realistic data generation with wave system
    import asyncio
    import logging

    import click

    logger = logging.getLogger(__name__)

    WAVE_CONFIGS = {
        'small': {
            'servers': 100, 'tools': 500, 'users': 10_000, 
            'teams': 50_000, 'metrics_days': 90, 'max_team_size': 50_000
        },
        'medium': {
            'servers': 1_000, 'tools': 2_500, 'users': 100_000,
            'teams': 500_000, 'metrics_days': 365, 'max_team_size': 100_000
        },
        'large': {
            'servers': 5_000, 'tools': 12_500, 'users': 500_000,
            'teams': 2_500_000, 'metrics_days': 1095, 'max_team_size': 500_000
        },
        'enterprise': {
            'servers': 10_000, 'tools': 50_000, 'users': 1_000_000,
            'teams': 5_000_000, 'metrics_days': 1825, 'max_team_size': 1_000_000
        }
    }
    
    @click.command()
    @click.option('--wave', type=click.Choice(['small', 'medium', 'large', 'enterprise']), 
                  default='small', help='Scale wave to generate')
    @click.option('--parallel-workers', default=8, help='Parallel workers for data generation')
    @click.option('--batch-size', default=10000, help='Batch size for bulk operations')
    def seed_wave(wave: str, parallel_workers: int, batch_size: int):
        """Generate enterprise-scale test data for specified wave"""
        config = WAVE_CONFIGS[wave]
        logger.info(f"🌊 Seeding {wave} wave: {config}")
        
        asyncio.run(generate_wave_data(config, parallel_workers, batch_size))
  • 1.2 Enterprise data patterns with realistic distributions

    # Realistic enterprise distribution patterns
    ENTERPRISE_PATTERNS = {
        'team_size_distribution': {
            'micro': (1, 5, 0.6),      # 60% micro teams (1-5 users)
            'small': (6, 25, 0.25),    # 25% small teams (6-25 users)  
            'medium': (26, 100, 0.10), # 10% medium teams (26-100 users)
            'large': (101, 1000, 0.04), # 4% large teams (101-1000 users)
            'enterprise': (1001, 1_000_000, 0.01) # 1% enterprise teams (1001+ users)
        },
        'tools_per_team_distribution': {
            'light': (1, 10, 0.5),     # 50% teams: 1-10 tools
            'moderate': (11, 50, 0.3), # 30% teams: 11-50 tools
            'heavy': (51, 200, 0.15),  # 15% teams: 51-200 tools
            'power': (201, 1000, 0.05) # 5% teams: 201-1000 tools
        },
        'federation_patterns': {
            'hub_spoke': 0.4,          # 40% use hub-spoke federation
            'full_mesh': 0.3,          # 30% use full-mesh federation
            'tiered': 0.2,             # 20% use tiered federation
            'isolated': 0.1            # 10% no federation
        },
        'cache_access_patterns': {
            'hot_data': 0.8,           # 80% requests hit hot data (5% of total)
            'warm_data': 0.15,         # 15% requests hit warm data (15% of total)
            'cold_data': 0.05          # 5% requests hit cold data (80% of total)
        }
    }
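The `(min, max, weight)` tuples in 1.2 above could be consumed by a weighted sampler along these lines; this is a sketch, and `sample_from_distribution` is a hypothetical helper rather than an existing seeder function:

```python
import random

# Pick a bucket by its weight, then a uniform value within the bucket's range.
def sample_from_distribution(dist, rng=random):
    buckets = list(dist.values())
    weights = [weight for _, _, weight in buckets]
    lo, hi, _ = rng.choices(buckets, weights=weights, k=1)[0]
    return rng.randint(lo, hi)

team_size_distribution = {
    'micro': (1, 5, 0.6),
    'small': (6, 25, 0.25),
    'medium': (26, 100, 0.10),
    'large': (101, 1000, 0.04),
    'enterprise': (1001, 1_000_000, 0.01),
}
sizes = [sample_from_distribution(team_size_distribution) for _ in range(5)]
```

With these weights, roughly 85% of sampled teams land in the micro/small buckets, matching the intended long-tail shape.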
  • 1.3 High-performance bulk loading for enterprise scale

    # Optimized bulk loading for million-scale datasets
    async def bulk_load_enterprise_wave(config: dict, workers: int, batch_size: int):
        """High-performance parallel bulk loading"""
        
        # Phase 1: Generate core entities in parallel
        async with asyncio.TaskGroup() as tg:
            tg.create_task(bulk_generate_users(config['users'], workers, batch_size))
            tg.create_task(bulk_generate_teams(config['teams'], workers, batch_size))
            tg.create_task(bulk_generate_tools(config['tools'], workers, batch_size))
            tg.create_task(bulk_generate_servers(config['servers'], workers, batch_size))
        
        # Phase 2: Generate relationships with realistic patterns
        await bulk_generate_team_memberships(config, batch_size=50_000)
        await bulk_generate_tool_associations(config, batch_size=100_000)
        
        # Phase 3: Generate historical metrics (most expensive)
        await bulk_generate_enterprise_metrics(
            config['metrics_days'], 
            parallel_months=12,  # Generate 12 months in parallel
            batch_size=1_000_000  # 1M metrics per batch
        )
        
        # Phase 4: Generate federation mesh
        await bulk_generate_federation_topology(config, pattern='enterprise')

Phase 2: Smocker Integration for Enterprise Federation Testing

  • 2.1 Enterprise MCP federation mocking docker-compose.enterprise.yml

    # Enterprise-scale federation testing
    smocker:
      image: thiht/smocker:latest
      ports:
        - "8080:8080"
        - "8081:8081"
      environment:
        - SMOCKER_MAX_MOCKS=10000      # Support 10K mock endpoints
        - SMOCKER_MEMORY_LIMIT=4GB     # Handle large mock datasets
      volumes:
        - ./loadtest/mocks:/opt/mocks
        
    # Mock MCP federation cluster (100 gateways)
    mock-federation-cluster:
      image: thiht/smocker:latest
      deploy:
        replicas: 20  # 20 smocker instances
      environment:
        - SMOCKER_FEDERATION_SIZE=100
        - SMOCKER_LATENCY_RANGE=50-2000ms
        - SMOCKER_RELIABILITY=0.95-0.999
      volumes:
        - ./loadtest/mocks/federation-enterprise.yml:/opt/mocks/config.yml
        
    # Mock enterprise external APIs
    mock-enterprise-apis:
      image: thiht/smocker:latest
      deploy:
        replicas: 5
      environment:
        - SMOCKER_ENTERPRISE_APIS=true
        - SMOCKER_RATE_LIMIT=1000rps
      volumes:
        - ./loadtest/mocks/enterprise-apis.yml:/opt/mocks/config.yml
  • 2.2 Enterprise federation scenario mocks loadtest/mocks/federation-enterprise.yml

    # Enterprise federation patterns with realistic failure modes
    - request:
        method: POST
        path: /v1/tools
        headers:
          x-gateway-region: "us-east"
      response:
        status: 200
        delay: 75ms  # US East latency
        body: |
          {
            "tools": {{range 1000}}
              {"name": "enterprise_tool_{{.}}", "description": "Enterprise tool {{.}}"},
            {{end}}
          }
    
    # Regional latency simulation      
    - request:
        method: POST
        path: /v1/tools  
        headers:
          x-gateway-region: "eu-central"
      response:
        status: 200
        delay: 150ms  # EU Central latency
        
    # Failure scenarios (5% failure rate)
    - request:
        method: POST
        path: /v1/tools
        headers:
          x-test-scenario: "partial_outage"
      response:
        status: 503
        delay: 30s
        body: |
          {"error": "Gateway temporarily unavailable", "retry_after": 30}
          
    # Large dataset responses (cache stress testing)
    - request:
        method: POST
        path: /v1/federation/bulk
      response:
        status: 200
        delay: 500ms
        body: |
          {
            "tools": {{range 10000}}
              {"name": "bulk_tool_{{.}}", "size": "{{multiply . 1024}}"},
            {{end}}
          }
  • 2.3 Dynamic enterprise mock management

    # Enterprise-scale mock management
    class EnterpriseMockManager:
        async def setup_enterprise_federation(self, gateway_count: int = 100):
            """Setup enterprise-scale federated gateway mocks"""
            
            # Create regional clusters
            regions = ['us-east', 'us-west', 'eu-central', 'eu-west', 'ap-south', 'ap-east']
            gateways_per_region = gateway_count // len(regions)
            
            for region in regions:
                await self.create_regional_cluster(
                    region=region,
                    gateway_count=gateways_per_region,
                    base_latency=self.get_regional_latency(region),
                    reliability=random.uniform(0.95, 0.999)
                )
        
        async def simulate_enterprise_failure_patterns(self):
            """Simulate realistic enterprise failure patterns"""
            failure_scenarios = [
                {'type': 'regional_outage', 'probability': 0.01, 'duration': '15m'},
                {'type': 'high_latency_spike', 'probability': 0.05, 'duration': '2m'},
                {'type': 'rate_limit_exceeded', 'probability': 0.02, 'duration': '30s'},
                {'type': 'partial_data_corruption', 'probability': 0.001, 'duration': '5m'}
            ]
            
            for scenario in failure_scenarios:
                if random.random() < scenario['probability']:
                    await self.trigger_failure_scenario(scenario)

Phase 3: Enterprise Load Testing Scenarios

  • 3.1 Enterprise Locust scenarios locustfiles/enterprise_scale.py

    # Enterprise user behavior patterns
    import random
    import time

    from locust import HttpUser, task

    class EnterpriseUserBehavior(HttpUser):
        weight = 40  # Most common user type
        
        def on_start(self):
            """Initialize enterprise user context"""
            self.user_id = f"user_{random.randint(1, 1_000_000)}"
            self.private_team_id = f"private_{self.user_id}"
            self.shared_teams = random.sample(range(1, 5_000_000), k=random.randint(1, 8))
            self.tool_cache = []
        
        @task(25)
        def list_my_team_tools(self):
            """Most frequent: list tools for my primary team"""
            team_id = random.choice(self.shared_teams)
            response = self.client.get(f"/v1/tools?scope=team:{team_id}&limit=50")
            if response.status_code == 200:
                self.tool_cache = response.json().get('tools', [])[:10]
        
        @task(15)
        def search_global_tools(self):
            """Search across global tool catalog"""
            query = random.choice(['weather', 'translate', 'calculate', 'format', 'analyze'])
            self.client.get(f"/v1/tools/search?q={query}&scope=global&limit=100")
        
        @task(10)
        def access_federated_tools(self):
            """Access tools from federated gateways"""
            self.client.get(f"/v1/federation/tools?regions=us-east,eu-central&limit=50")
        
        @task(8)
        def create_private_tool(self):
            """Create tool in private workspace"""
            tool_data = self.generate_enterprise_tool()
            response = self.client.post(f"/v1/tools", json=tool_data)
            if response.status_code == 201:
                tool_id = response.json()['id']
                self.tool_cache.append(tool_id)
        
        @task(5)
        def share_tool_to_team(self):
            """Share private tool to team (triggers cache invalidation)"""
            if self.tool_cache:
                tool_id = random.choice(self.tool_cache)
                team_id = random.choice(self.shared_teams)
                self.client.post(f"/v1/tools/{tool_id}/share", 
                                json={"scope": f"team:{team_id}"})
        
        @task(2)
        def bulk_operations(self):
            """Bulk operations that stress the system"""
            team_id = random.choice(self.shared_teams)
            self.client.post(f"/v1/tools/bulk", 
                            json={"team_id": team_id, "action": "export", "limit": 1000})
    
    class EnterprisePowerUser(HttpUser):
        weight = 10  # Power users with heavy operations
        
        @task(15)
        def complex_federation_query(self):
            """Complex queries across multiple federated gateways"""
            self.client.get("/v1/federation/aggregate?regions=all&include_metrics=true&timeframe=30d")
        
        @task(10)
        def team_administration(self):
            """Team management operations"""
            team_id = random.randint(1, 5_000_000)
            self.client.get(f"/v1/teams/{team_id}/members?limit=1000")
            self.client.get(f"/v1/teams/{team_id}/tools?include_private=true&limit=500")
        
        @task(8)
        def analytics_queries(self):
            """Heavy analytics queries"""
            self.client.get("/v1/analytics/usage?timeframe=90d&breakdown=team&limit=10000")
            
    class CacheStressBehavior(HttpUser):
        weight = 5  # Cache invalidation stress testing
        
        @task
        def cache_invalidation_storm(self):
            """Rapid create/update/delete to stress cache invalidation"""
            created_ids = []

            # Create 50 tools rapidly
            for i in range(50):
                tool_data = {"name": f"stress_tool_{i}_{time.time()}", "url": "http://example.com"}
                response = self.client.post("/v1/tools", json=tool_data)
                if response.status_code == 201:
                    created_ids.append(response.json()['id'])

            # Update them all
            for tool_id in created_ids:
                self.client.put(f"/v1/tools/{tool_id}",
                                json={"description": f"Updated at {time.time()}"})

            # Delete every other one to trigger invalidation churn
            for tool_id in created_ids[::2]:
                self.client.delete(f"/v1/tools/{tool_id}")
  • 3.2 Enterprise service-layer benchmarks tests/bench/enterprise_performance.py

    # Enterprise-scale service layer performance testing
    import pytest

    class TestEnterpriseServicePerformance:
        
        @pytest.mark.benchmark(group="enterprise_tool_service")
        def test_list_tools_million_scale(self, benchmark, enterprise_db_session):
            """Benchmark tool listing with 50K tools"""
            result = benchmark(tool_service.list_tools, 
                             enterprise_db_session, include_inactive=False)
            assert len(result) >= 45_000  # Should return most of 50K tools
        
        @pytest.mark.benchmark(group="enterprise_cache_performance")
        def test_cache_with_10gb_dataset(self, benchmark, enterprise_cache_manager):
            """Test cache performance with 10GB+ dataset"""
            large_data = {"tools": [{"id": i, "data": "x" * 1000} for i in range(100_000)]}
            
            def cache_operation():
                return enterprise_cache_manager.get_or_set(
                    "enterprise:large_dataset",
                    lambda: large_data,
                    ttl=3600
                )
            
            result = benchmark(cache_operation)
            assert len(result["tools"]) == 100_000
        
        @pytest.mark.benchmark(group="enterprise_multi_tenancy")
        def test_scope_filtering_million_teams(self, benchmark, enterprise_db_session):
            """Test multi-tenant scope filtering with 5M teams"""
            user_context = {
                "user_id": "enterprise_user",
                "teams": [f"team_{i}" for i in range(1000)]  # User in 1000 teams
            }
            
            result = benchmark(tool_service.list_tools_with_scope, 
                             enterprise_db_session, user_context)
            assert len(result) > 0
        
        @pytest.mark.benchmark(group="enterprise_federation")
        def test_federation_aggregation_100_gateways(self, benchmark, mock_federation):
            """Test federation aggregation across 100 gateways"""
            gateway_urls = [f"http://mock-gateway-{i}:9000" for i in range(100)]
            
            def federation_operation():
                return gateway_service.aggregate_federated_tools(
                    gateway_urls, timeout=30, parallel_limit=20
                )
            
            result = benchmark(federation_operation)
            assert len(result) >= 50_000  # Should aggregate significant tools

Phase 4: Enhanced Enterprise Monitoring & Analysis

  • 4.1 Enterprise load test Grafana dashboard grafana/enterprise-loadtest.json
    {
      "dashboard": {
        "title": "MCP Gateway - Enterprise Load Test Analysis",
        "refresh": "5s",
        "time": {"from": "now-2h", "to": "now"},
        "panels": [
          {
            "title": "Request Rate by Wave Scale",
            "targets": [
              {"expr": "rate(http_requests_total[5m]) by (wave_scale, endpoint)", 
               "legendFormat": "{{wave_scale}} - {{endpoint}}"}
            ],
            "yAxes": [{"unit": "reqps", "max": 10000}]
          },
          {
            "title": "Enterprise Cache Performance (L1/L2)",
            "targets": [
              {"expr": "rate(cache_l1_hits_total[5m])", "legendFormat": "L1 Hits/sec"},
              {"expr": "rate(cache_l2_hits_total[5m])", "legendFormat": "L2 Hits/sec"},
              {"expr": "rate(cache_misses_total[5m])", "legendFormat": "Cache Misses/sec"},
              {"expr": "cache_l1_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "L1 Memory (GB)"}
            ]
          },
          {
            "title": "Federation Latency (100+ Gateways)",
            "targets": [
              {"expr": "histogram_quantile(0.50, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P50"},
              {"expr": "histogram_quantile(0.95, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P95"},
              {"expr": "histogram_quantile(0.99, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P99"}
            ]
          },
          {
            "title": "Multi-tenancy Query Performance",
            "targets": [
              {"expr": "rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m]) by (scope_type)", 
               "legendFormat": "Avg {{scope_type}}"},
              {"expr": "histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) by (scope_type)", 
               "legendFormat": "P95 {{scope_type}}"}
            ]
          },
          {
            "title": "Memory Usage - Enterprise Scale",
            "targets": [
              {"expr": "process_resident_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "Gateway RSS (GB)"},
              {"expr": "cache_l1_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "L1 Cache (GB)"},
              {"expr": "redis_used_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "Redis (GB)"},
              {"expr": "postgresql_shared_buffers_bytes / 1024 / 1024 / 1024", "legendFormat": "PostgreSQL (GB)"}
            ]
          },
          {
            "title": "Database Connection Pool Usage",
            "targets": [
              {"expr": "postgresql_connections_active", "legendFormat": "Active Connections"},
              {"expr": "postgresql_connections_idle", "legendFormat": "Idle Connections"},
              {"expr": "postgresql_connections_total", "legendFormat": "Total Connections"},
              {"expr": "postgresql_max_connections", "legendFormat": "Max Connections"}
            ]
          }
        ]
      }
    }

Phase 5: Wave-Based CI Integration

  • 5.1 Enterprise wave testing workflow .github/workflows/enterprise-soak.yml
    name: Enterprise Scale Soak Testing
    
    on:
      schedule:
        - cron: '0 2 * * 0'  # Weekly Sunday 2 AM UTC
      workflow_dispatch:
        inputs:
          wave:
            description: 'Test wave to run'
            required: true
            default: 'small'
            type: choice
            options:
              - small
              - medium  
              - large
              - enterprise
          duration_hours:
            description: 'Test duration in hours'
            default: '2'
            
    jobs:
      enterprise-soak:
        runs-on: ubuntu-latest-8-cores  # Use 8-core runner for enterprise scale
        strategy:
          matrix:
            wave: [small, medium, large, enterprise]
            database: [postgresql]  # Only PostgreSQL for enterprise scale
            cache: [enabled]        # Always enable cache for enterprise
            federation: [true, false]
          fail-fast: false
          
        steps:
          - uses: actions/checkout@v4
          
          - name: Setup enterprise test environment
            run: |
              # Increase system limits for enterprise testing
              echo "fs.file-max = 2097152" | sudo tee -a /etc/sysctl.conf
              echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
              echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
              sudo sysctl -p
              
              # Start enterprise test stack
              docker-compose -f docker-compose.enterprise.yml up -d
              
          - name: Tune PostgreSQL for enterprise load
            run: |
              docker exec postgresql psql -U postgres -c "
                ALTER SYSTEM SET shared_buffers = '4GB';
                ALTER SYSTEM SET work_mem = '256MB';
                ALTER SYSTEM SET max_connections = 1000;
                ALTER SYSTEM SET effective_cache_size = '12GB';
                SELECT pg_reload_conf();
              "
              
          - name: Seed enterprise data
            timeout-minutes: 120  # 2 hours max for enterprise wave
            run: |
              make seed-${{ matrix.wave }} PARALLEL_WORKERS=16 BATCH_SIZE=50000
            env:
              DATABASE_POOL_SIZE: 50
              
          - name: Run enterprise soak test
            timeout-minutes: 480  # 8 hours max
            run: |
              make soak-test-${{ matrix.wave }} \
                DURATION=${{ github.event.inputs.duration_hours || '2' }}h \
                USERS=2000 SPAWN_RATE=50 \
                FEDERATION_ENABLED=${{ matrix.federation }}
            env:
              DATABASE_TYPE: ${{ matrix.database }}
              CACHE_ENABLED: ${{ matrix.cache }}
              GUNICORN_WORKERS: 16
              LOCUST_WORKERS: 8
              
          - name: Capture enterprise flamegraph
            run: |
              make flamegraph-analysis DURATION=300  # 5 minutes
              
          - name: Generate enterprise capacity report
            run: |
              python scripts/generate_enterprise_report.py \
                --wave ${{ matrix.wave }} \
                --results reports/ \
                --output reports/enterprise-capacity-${{ matrix.wave }}.html
              
          - name: Upload enterprise artifacts
            uses: actions/upload-artifact@v4
            with:
              name: enterprise-soak-${{ matrix.wave }}-${{ matrix.federation && 'fed' || 'no-fed' }}
              retention-days: 30
              path: |
                reports/soak-*.html
                reports/flamegraph-*.svg
                reports/enterprise-capacity-*.html
                reports/cache-analysis-${{ matrix.wave }}.json
                reports/federation-analysis-*.json

Phase 6: Enhanced Makefile Targets for Wave Testing

  • 6.1 Wave-based make targets
    # Wave-specific data seeding
    seed-small:
    	@echo "🌊 Seeding SMALL wave (10K users, 50K teams)..."
    	@python scripts/seed_enterprise.py --wave small $(SEED_ARGS)
    	@echo "βœ… Small wave data ready"
    
    seed-medium:
    	@echo "🌊 Seeding MEDIUM wave (100K users, 500K teams)..."  
    	@python scripts/seed_enterprise.py --wave medium --parallel-workers 16 $(SEED_ARGS)
    	@echo "βœ… Medium wave data ready"
    
    seed-large:
    	@echo "🌊 Seeding LARGE wave (500K users, 2.5M teams)..."
    	@python scripts/seed_enterprise.py --wave large --parallel-workers 32 $(SEED_ARGS)
    	@echo "βœ… Large wave data ready"
    
    seed-enterprise:
    	@echo "🌊 Seeding ENTERPRISE wave (1M users, 5M teams)..."
    	@python scripts/seed_enterprise.py --wave enterprise --parallel-workers 64 \
    		--batch-size 100000 $(SEED_ARGS)
    	@echo "βœ… Enterprise wave data ready"
    
    # Wave-specific soak testing
    soak-test-small: seed-small
    	@echo "πŸ”₯ SMALL wave soak test (15 min)..."
    	@$(MAKE) _run_soak_test WAVE=small DURATION=15m USERS=100
    
    soak-test-medium: seed-medium  
    	@echo "πŸ”₯ MEDIUM wave soak test (30 min)..."
    	@$(MAKE) _run_soak_test WAVE=medium DURATION=30m USERS=500
    
    soak-test-large: seed-large
    	@echo "πŸ”₯ LARGE wave soak test (60 min)..."
    	@$(MAKE) _run_soak_test WAVE=large DURATION=60m USERS=1500
    
    soak-test-enterprise: seed-enterprise
    	@echo "πŸ”₯ ENTERPRISE wave soak test (120 min)..."
    	@$(MAKE) _run_soak_test WAVE=enterprise DURATION=120m USERS=2000
    
    # Run all waves sequentially (for comprehensive testing)
    soak-test-all-waves:
    	@echo "🌊 Running ALL wave tests (4+ hours)..."
    	@$(MAKE) soak-test-small
    	@$(MAKE) soak-test-medium  
    	@$(MAKE) soak-test-large
    	@$(MAKE) soak-test-enterprise
    	@python scripts/generate_wave_comparison_report.py
    	@echo "πŸ“Š All wave tests complete - see reports/wave-comparison.html"
    
    # Internal helper for running soak tests (SPAWN_RATE defaults to 50 users/s)
    SPAWN_RATE ?= 50
    _run_soak_test:
    	@docker-compose -f docker-compose.enterprise.yml up -d smocker
    	@python scripts/setup_federation_mocks.py --wave $(WAVE)
    	@locust -f locustfiles/enterprise_scale.py --headless \
    		--users $(USERS) --spawn-rate $(SPAWN_RATE) \
    		--run-time $(DURATION) --html reports/soak-$(WAVE)-$(shell date +%Y%m%d).html
    	@pytest tests/bench/ --benchmark-only --benchmark-json=reports/benchmark-$(WAVE).json
    	@python scripts/generate_wave_report.py --wave $(WAVE)
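    The seed-{wave} targets above assume a wave-aware CLI in scripts/seed_enterprise.py. A minimal sketch of the preset table and argument parsing (the flag names match what the Makefile passes; the preset structure itself is an assumption):

    ```python
    # Sketch of the wave presets behind scripts/seed_enterprise.py.
    # Counts mirror the wave matrix; CLI flags match the Makefile
    # (--wave, --parallel-workers, --batch-size). Structure is assumed.
    import argparse

    WAVES = {
        "small":      {"servers": 100,    "tools": 500,    "users": 10_000,    "teams": 50_000},
        "medium":     {"servers": 1_000,  "tools": 2_500,  "users": 100_000,   "teams": 500_000},
        "large":      {"servers": 5_000,  "tools": 12_500, "users": 500_000,   "teams": 2_500_000},
        "enterprise": {"servers": 10_000, "tools": 50_000, "users": 1_000_000, "teams": 5_000_000},
    }

    def parse_args(argv=None):
        p = argparse.ArgumentParser(description="Seed wave-scaled test data")
        p.add_argument("--wave", choices=sorted(WAVES), required=True)
        p.add_argument("--parallel-workers", type=int, default=8)
        p.add_argument("--batch-size", type=int, default=10_000)
        return p.parse_args(argv)

    # Example: the enterprise target runs the equivalent of
    # parse_args(["--wave", "enterprise", "--parallel-workers", "64",
    #             "--batch-size", "100000"])
    ```

    Note the presets preserve the Teams β‰ˆ Users Γ— 5 rule of thumb, so seeded data keeps the same private/shared team shape at every scale.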

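    `_run_soak_test` invokes scripts/setup_federation_mocks.py before starting Locust. A sketch of how it could register fake federated gateways against smocker's admin API (smocker accepts mock definitions via POST /mocks on its admin port, 8081 by default; the per-wave gateway counts and the /gateway-N/tools path are assumptions):

    ```python
    # Build mock definitions for N fake federated gateways and register
    # them with smocker (POST /mocks on the admin port, default 8081).
    # Gateway counts per wave and the URL scheme are assumed here.
    import json
    from urllib.request import Request, urlopen

    GATEWAYS_PER_WAVE = {"small": 10, "medium": 25, "large": 50, "enterprise": 100}

    def build_mocks(wave: str) -> list:
        """One mock per gateway: GET /gateway-N/tools returns a small tool list."""
        mocks = []
        for n in range(GATEWAYS_PER_WAVE[wave]):
            mocks.append({
                "request": {"method": "GET", "path": f"/gateway-{n}/tools"},
                "response": {
                    "status": 200,
                    "headers": {"Content-Type": "application/json"},
                    "body": json.dumps({"tools": [f"gw{n}-tool-{i}" for i in range(5)]}),
                },
            })
        return mocks

    def register(mocks, admin_url="http://localhost:8081"):
        """Push the mock list to a running smocker instance."""
        req = Request(f"{admin_url}/mocks", data=json.dumps(mocks).encode(),
                      headers={"Content-Type": "application/json"}, method="POST")
        with urlopen(req) as resp:  # requires smocker to be up (docker-compose)
            return resp.status
    ```

    Keeping mock construction separate from registration lets the same definitions be unit-tested without a running smocker container.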
Phase 7: Enterprise Documentation & Capacity Planning

  • 7.1 Comprehensive enterprise guide docs/testing/enterprise-scalability.md

    # Enterprise-Scale Scalability Testing
    
    ## Wave System Overview
    
    Our testing uses a **4-wave system** to validate performance from small deployments to massive enterprises:
    
    | Wave | Scale | Use Case | Duration |
    |------|-------|----------|----------|
    | **Small** | 10K users, 50K teams | Department/Startup | 15 min |
    | **Medium** | 100K users, 500K teams | Mid-size Enterprise | 30 min |
    | **Large** | 500K users, 2.5M teams | Large Enterprise | 60 min |
    | **Enterprise** | 1M users, 5M teams | Global Enterprise | 120 min |
    
    ## Quick Start
    
    ```bash
    # Run specific wave
    make soak-test-enterprise USERS=2000
    
    # Run all waves (4+ hours)
    make soak-test-all-waves
    
    # View results
    open reports/enterprise-capacity-enterprise.html
    open http://localhost:3000/d/enterprise-loadtest
    ```

    ## Enterprise Capacity Planning Results

    ### Performance Baselines (with L1+L2 caching)

    | Configuration | Small Wave | Medium Wave | Large Wave | Enterprise Wave |
    |---------------|------------|-------------|------------|-----------------|
    | Max RPS | 500 | 1,200 | 2,500 | 4,000 |
    | P95 Latency | 25ms | 45ms | 85ms | 150ms |
    | Memory Usage | 2GB | 6GB | 15GB | 25GB |
    | DB Connections | 20 | 50 | 150 | 300 |
    | Cache Hit Ratio | 98% | 96% | 94% | 92% |

    ### Federation Performance

    | Federated Gateways | Tool Aggregation Time | Memory Overhead | Failure Tolerance |
    |--------------------|-----------------------|-----------------|-------------------|
    | 10 gateways | 150ms | +500MB | 2 failures |
    | 50 gateways | 400ms | +2GB | 5 failures |
    | 100 gateways | 800ms | +4GB | 10 failures |

    ## Recommended Infrastructure

    ### Enterprise Wave (1M users, 5M teams)

    **Gateway Cluster:**

    • 4-8 instances: 8 CPU, 32GB RAM each
    • Load balancer with session affinity
    • Auto-scaling based on CPU >70%

    **Database:**

    • PostgreSQL: 16 CPU, 128GB RAM, 1TB SSD
    • Read replicas: 2-4 instances for read scaling
    • Connection pooling: pgbouncer with 300 max connections

    **Cache Layer:**

    • Redis cluster: 3 masters, 3 replicas
    • 64GB RAM per instance
    • Memory eviction: allkeys-lru

    **Monitoring:**

    • Prometheus: 8 CPU, 64GB RAM, 500GB storage
    • Grafana: 4 CPU, 16GB RAM
    • Log aggregation: Elasticsearch cluster

πŸ“¦ Updated Deliverables

  1. Wave-based enterprise seeder: scripts/seed_enterprise.py with 4-tier scaling system
  2. Enterprise smocker integration: docker-compose.enterprise.yml + 100+ gateway mocks
  3. Enterprise load scenarios: locustfiles/enterprise_scale.py with million-user patterns
  4. Service benchmarks: tests/bench/enterprise_performance.py for all enterprise scales
  5. Enterprise monitoring: grafana/enterprise-loadtest.json + comprehensive alerts
  6. Wave-based CI: Weekly enterprise testing with 8-hour test cycles
  7. Performance analysis: Distributed flamegraph capture + enterprise hotspot analysis
  8. Enterprise documentation: Complete capacity planning for 1M+ user deployments
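The million-user patterns in locustfiles/enterprise_scale.py reduce to a weighted mix of read-heavy operations. A dependency-free sketch of the assumed mix (the task names and weights are illustrative, not measured):

```python
# Weighted traffic mix assumed for the enterprise load scenarios.
# In the real locustfile these weights would become @task(N) decorators.
import random

TASK_WEIGHTS = {
    "list_tools":      50,  # hot path: tool listing dominates enterprise traffic
    "invoke_tool":     30,
    "list_teams":      10,
    "admin_dashboard":  5,
    "create_resource":  5,
}

def pick_tasks(n: int, rng: random.Random) -> list:
    """Sample n operations according to the weighted mix."""
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=n)
```

Keeping the mix read-dominated matters for the soak tests: it is what makes the L1/L2 cache hit ratios in the baseline tables achievable.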

🎯 Expected Enterprise Outcomes

Performance Baselines (Enterprise Wave with L1+L2 caching):

  • 50K tool listing: <150ms P95 (vs 30+ seconds uncached)
  • Federation mesh: 100 gateways aggregated in <800ms with intelligent caching
  • Multi-tenancy: <15% query overhead for 5M team scope filtering
  • Memory efficiency: L1 cache 4GB for 92% hit rate on enterprise datasets
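The hit-rate targets above imply a simple expected-latency model; with assumed per-request costs (2 ms on an L1 hit, 85 ms on a full miss, both illustrative rather than measured Gateway numbers), a 92% hit rate keeps blended latency in single-digit milliseconds:

```python
# Blended request latency under a given cache hit ratio.
# hit_ms / miss_ms are illustrative assumptions, not measured values.
def effective_latency_ms(hit_ratio: float, hit_ms: float = 2.0,
                         miss_ms: float = 85.0) -> float:
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# effective_latency_ms(0.92) -> 8.64 ms blended
```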

Enterprise Capacity Planning Data:

  • Safe production limits: 4,000 RPS sustained per 4-instance cluster
  • Scale-out recommendations: Horizontal scaling patterns for 10M+ users
  • Cache sizing: Memory requirements for different enterprise scales
  • Federation limits: Maximum federated gateway count before timeout cascade
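The 4,000 RPS per 4-instance cluster figure implies roughly 1,000 RPS sustained per Gateway instance. A back-of-envelope sizing helper built on that number (the 70% utilization ceiling is an assumed safety margin):

```python
# Cluster sizing from the observed ~1,000 RPS per Gateway instance
# (4,000 RPS / 4 instances). max_util is an assumed safety margin.
import math

def instances_needed(target_rps: float, per_instance_rps: float = 1000.0,
                     max_util: float = 0.7) -> int:
    return max(1, math.ceil(target_rps / (per_instance_rps * max_util)))

# instances_needed(4000) -> 6 (running each instance at ~67% utilization)
```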

Infrastructure Recommendations:

  • Detailed sizing for 1M+ user deployments
  • Database sharding strategies for 10M+ teams
  • Multi-region federation architecture
  • Disaster recovery and failover procedures

🧩 Additional Notes

  • Enterprise-realistic patterns: Based on actual Fortune 500 SaaS usage data
  • Federation at scale: Tests 100+ gateway mesh with realistic failure patterns
  • Wave progression: Each wave 5-10x larger than previous for scaling validation
  • Memory efficiency: L1/L2 cache tuned for enterprise dataset sizes (10GB+)
  • CI scalability: Weekly enterprise tests with trend analysis over months
  • Production readiness: Direct infrastructure sizing for million-user deployments
  • Cost optimization: Capacity planning includes cost-per-user analysis for different configurations


    Labels

    • chore: Linting, formatting, dependency hygiene, or project maintenance chores
    • cicd: Issue with CI/CD process (GitHub Actions, scaffolding)
    • devops: DevOps activities (containers, automation, deployment, makefiles, etc)
    • performance: Performance related items
    • testing: Testing (unit, e2e, manual, automated, etc)
    • triage: Issues / Features awaiting triage
