
[CHORE]: Comprehensive Scalability & Soak-Test Harness (Long-term Stability & Load) - locust, pytest-benchmark, smocker mocked MCP servers #291

@crivetimihai

Description

🧭 Chore Summary: Enterprise-Scale Scalability & Soak-Test Harness

Introduce a production-realistic test harness that validates Gateway performance under massive enterprise-scale load with tiered wave testing from small deployments to million-user enterprises:

  1. Tiered dataset seeding across 4 waves: Small → Medium → Large → Enterprise (up to 1M users, 5M teams, 50K tools)
  2. Multi-layer load testing using Locust (HTTP API), pytest-benchmark (service layer), and smocker (mocked MCP servers)
  3. Federation & caching stress testing with L1/L2 cache validation under extreme load
  4. Multi-tenancy scale testing validating private/team/global scope performance with millions of entities
  5. Comprehensive reporting with Grafana dashboards, flamegraphs, and enterprise capacity planning

🌊 Wave Matrix & Target Datasets

| Wave | Servers | Tools | Users | Teams | Metrics Retention | Max Users / Team | Load Test Duration |
|------|---------|-------|-------|-------|-------------------|------------------|--------------------|
| Small | 100 | 500 | 10,000 | 50,000 | 90 days | 50,000 | 15 minutes |
| Medium | 1,000 | 2,500 | 100,000 | 500,000 | 1 year | 100,000 | 30 minutes |
| Large | 5,000 | 12,500 | 500,000 | 2,500,000 | 3 years | 500,000 | 60 minutes |
| Enterprise | 10,000 | 50,000 | 1,000,000 | 5,000,000 | 5 years | 1,000,000 | 120 minutes |
| Enterprise Stability | 10,000 | 50,000 | 1,000,000 | 5,000,000 | 5 years | 1,000,000 | 48 hours |

Rule of thumb: Teams ≈ Users × 5 (every user owns 1 private team and belongs to ~4 shared teams).
A separate 48-hour stability run assesses potential memory leaks and infrastructure stability.
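The rule of thumb can be checked against the wave matrix directly; a minimal sketch (`expected_teams` is an illustrative helper, not part of the seeder):

```python
# Mirrors the wave matrix arithmetic: every user owns 1 private team and
# belongs to ~4 shared teams, so Teams ≈ Users × 5.
WAVE_USERS = {
    "small": 10_000,
    "medium": 100_000,
    "large": 500_000,
    "enterprise": 1_000_000,
}

def expected_teams(users: int, shared_teams_per_user: int = 4) -> int:
    """Team count implied by the rule of thumb (1 private + ~4 shared per user)."""
    return users * (1 + shared_teams_per_user)

for wave, users in WAVE_USERS.items():
    print(f"{wave}: {expected_teams(users):,} teams")
```

For every wave this reproduces the Teams column above (e.g. 10,000 users → 50,000 teams).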


🧱 Areas Affected

  • Make targets - make seed-{wave}, make soak-test-{wave}, make federation-load-enterprise, make flamegraph-analysis
  • CI / GitHub Actions - nightly wave testing with a matrix (PostgreSQL/Redis, caching on/off)
  • Docker Compose - Locust cluster, smocker mock services, Redis cluster, PostgreSQL tuning
  • Test infrastructure - enterprise data seeder, realistic user scenarios, federation mocking at scale
  • Monitoring - load-test Grafana dashboard, cache performance metrics, federation health, memory tracking
  • Documentation - comprehensive scalability guide with enterprise capacity planning

βš™οΈ Context / Rationale

Current performance unknowns that block enterprise deployment decisions:

| Critical Question | Today | After This Epic |
|-------------------|-------|-----------------|
| Million-user multi-tenancy performance? | ❓ Unknown | 📊 Query performance across 5M teams |
| 50K-tool federation latency patterns? | ❓ Unknown | 📈 Network overhead, cache efficiency |
| L1/L2 cache behavior with 100GB+ datasets? | ❓ Unknown | 🔍 Memory usage, eviction patterns |
| Database scaling to 500M+ records? | ❓ Unknown | 📋 Connection pooling, query optimization |
| Memory-leak patterns over 48 h at enterprise scale? | ❓ Unknown | 🧠 RSS trends, GC patterns, cache bloat |
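The "RSS trends" question can be reduced to a pass/fail signal by fitting a slope to periodic RSS samples from the 48-hour run. A minimal sketch, assuming 5-minute sampling; `rss_trend_mb_per_hour` is an illustrative helper, not an existing script:

```python
# Hypothetical leak check for the 48-hour stability run: fit a least-squares
# line to sampled RSS readings and flag a sustained upward trend. The sampling
# interval and any threshold are illustrative, not part of the harness.
def rss_trend_mb_per_hour(samples_mb, interval_minutes=5.0):
    """Least-squares slope of RSS over time, in MB per hour."""
    n = len(samples_mb)
    xs = [i * interval_minutes / 60.0 for i in range(n)]  # elapsed hours
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

flat = [1024.0] * 12                            # stable RSS
leaky = [1024.0 + 10.0 * i for i in range(12)]  # +10 MB every 5-minute sample

print(rss_trend_mb_per_hour(flat), rss_trend_mb_per_hour(leaky))
```

A run could then fail if the slope stays above, say, a few MB/hour after warm-up.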

Real-world enterprise scenarios tested:

  • Global enterprise: 50,000 tools across 5,000,000 teams
  • Federation mesh of 100+ external MCP gateways with realistic failure rates
  • 2,000+ concurrent API clients with mixed workloads (read-heavy, write-heavy, federation)
  • Cache warming with 10GB+ datasets, invalidation storms, federation failover

πŸ“ Enhanced Architecture Design

Enterprise Test Environment Stack:

flowchart TD
    %% Load Generation Cluster
    subgraph "Load Generation Cluster"
        LOCUST[Locust Controller<br/>Web UI :8089]
        L1[Locust Worker 1<br/>Enterprise API Load]
        L2[Locust Worker 2<br/>Federation Load]
        L3[Locust Worker 3<br/>Multi-tenant Load]
        L4[Locust Worker 4<br/>Cache Stress Load]
        L5[Locust Worker 5<br/>Write-Heavy Load]
        LOCUST --> L1
        LOCUST --> L2
        LOCUST --> L3
        LOCUST --> L4
        LOCUST --> L5
    end

    %% Mock Services Federation
    subgraph "Mock Services Federation"
        SMOCKER[Smocker Controller<br/>:8080]
        MOCK1[Mock MCP Cluster 1<br/>Tools Provider x20]
        MOCK2[Mock MCP Cluster 2<br/>Resources Provider x20]
        MOCK3[Mock Gateway Fed<br/>Federated Peers x50]
        MOCK4[Mock Enterprise APIs<br/>External Systems x10]
        SMOCKER --> MOCK1
        SMOCKER --> MOCK2
        SMOCKER --> MOCK3
        SMOCKER --> MOCK4
    end

    %% Gateway Cluster Under Test
    subgraph "Gateway Cluster Under Test"
        LB[Load Balancer<br/>nginx :4444]
        GW1[Gateway Instance 1<br/>+L1 Cache 1GB]
        GW2[Gateway Instance 2<br/>+L1 Cache 1GB]
        GW3[Gateway Instance 3<br/>+L1 Cache 1GB]
        GW4[Gateway Instance 4<br/>+L1 Cache 1GB]
        LB --> GW1
        LB --> GW2
        LB --> GW3
        LB --> GW4
    end

    %% Data Layer - Enterprise Scale
    subgraph "Data Layer - Enterprise Scale"
        REDIS_M[Redis Master<br/>L2 Cache + Sessions]
        REDIS_S1[Redis Slave 1<br/>Read Replica]
        REDIS_S2[Redis Slave 2<br/>Read Replica]
        PG_M[(PostgreSQL Master<br/>50GB+ Primary Data)]
        PG_S1[(PostgreSQL Slave 1<br/>Read Replica)]
        PG_S2[(PostgreSQL Slave 2<br/>Read Replica)]
        REDIS_M --> REDIS_S1
        REDIS_M --> REDIS_S2
        PG_M --> PG_S1
        PG_M --> PG_S2
        GW1 --> REDIS_M
        GW2 --> REDIS_M
        GW3 --> REDIS_M
        GW4 --> REDIS_M
        GW1 --> PG_M
        GW2 --> PG_M
        GW3 --> PG_M
        GW4 --> PG_M
    end

    %% Monitoring & Analysis
    subgraph "Monitoring & Analysis"
        PROM[Prometheus<br/>High-Resolution Metrics]
        GRAF[Grafana<br/>Enterprise Dashboard]
        PYSPY[py-spy Cluster<br/>Distributed Profiling]
        ELASTIC[Elasticsearch<br/>Log Aggregation]
        PROM --> GRAF
        GW1 --> PROM
        GW2 --> PROM
        GW3 --> PROM
        GW4 --> PROM
        GW1 --> ELASTIC
        GW2 --> ELASTIC
        GW3 --> ELASTIC
        GW4 --> ELASTIC
    end

    %% Connections
    L1 --> LB
    L2 --> LB
    L3 --> LB
    L4 --> LB
    L5 --> LB

    GW1 --> MOCK1
    GW1 --> MOCK2
    GW1 --> MOCK3
    GW1 --> MOCK4
    GW2 --> MOCK1
    GW2 --> MOCK2
    GW2 --> MOCK3
    GW2 --> MOCK4
    GW3 --> MOCK1
    GW3 --> MOCK2
    GW3 --> MOCK3
    GW3 --> MOCK4
    GW4 --> MOCK1
    GW4 --> MOCK2
    GW4 --> MOCK3
    GW4 --> MOCK4

    classDef load fill:#ffeb3b
    classDef mock fill:#4caf50  
    classDef gateway fill:#2196f3
    classDef data fill:#ff9800
    classDef monitor fill:#9c27b0

    class LOCUST,L1,L2,L3,L4,L5 load
    class SMOCKER,MOCK1,MOCK2,MOCK3,MOCK4 mock
    class LB,GW1,GW2,GW3,GW4 gateway
    class REDIS_M,REDIS_S1,REDIS_S2,PG_M,PG_S1,PG_S2 data
    class PROM,GRAF,PYSPY,ELASTIC monitor

📋 Enhanced Acceptance Criteria

| # | Criteria | Validation Method |
|---|----------|-------------------|
| 1 | Enterprise dataset: 50K tools, 10K servers, 5M teams, 1M users, 5 years of metrics | `make seed-enterprise` completes in < 60 min |
| 2 | Wave testing: all 4 waves (Small → Enterprise) with progressive load increase | `make soak-test-all-waves` runs a 4-hour test cycle |
| 3 | Cache performance at scale: L1/L2 with 10GB+ datasets, 95%+ hit ratio | Grafana shows cache metrics under enterprise load |
| 4 | Federation stress test: 100+ mocked external gateways with realistic patterns | Smocker validates federated call patterns at scale |
| 5 | Multi-tenancy at scale: 5M-team query performance, scope isolation | Benchmark report shows <10% overhead vs single-tenant |
| 6 | CI integration: nightly wave testing with enterprise-scale matrix | GitHub Actions uploads comprehensive reports + flamegraphs |
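Criterion 3's "95%+ hit ratio" can be made precise as a combined L1+L2 ratio. A minimal sketch; the counter names mirror the `cache_l1_hits_total` / `cache_l2_hits_total` / `cache_misses_total` metrics used in the Grafana queries below and are assumptions about the exporter:

```python
# Combined L1+L2 hit ratio backing the 95% acceptance threshold.
def cache_hit_ratio(l1_hits: int, l2_hits: int, misses: int) -> float:
    """Fraction of requests served from either cache tier."""
    total = l1_hits + l2_hits + misses
    return (l1_hits + l2_hits) / total if total else 0.0

print(cache_hit_ratio(900, 60, 40))  # 960 of 1000 requests served from cache
```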

πŸ› οΈ Comprehensive Task List

Phase 1: Enterprise Data Seeder with Wave System

  • 1.1 Wave-based seeder script scripts/seed_enterprise.py

    # Enterprise-scale realistic data generation with wave system
    import asyncio
    import logging

    import click

    logger = logging.getLogger(__name__)

    WAVE_CONFIGS = {
        'small': {
            'servers': 100, 'tools': 500, 'users': 10_000, 
            'teams': 50_000, 'metrics_days': 90, 'max_team_size': 50_000
        },
        'medium': {
            'servers': 1_000, 'tools': 2_500, 'users': 100_000,
            'teams': 500_000, 'metrics_days': 365, 'max_team_size': 100_000
        },
        'large': {
            'servers': 5_000, 'tools': 12_500, 'users': 500_000,
            'teams': 2_500_000, 'metrics_days': 1095, 'max_team_size': 500_000
        },
        'enterprise': {
            'servers': 10_000, 'tools': 50_000, 'users': 1_000_000,
            'teams': 5_000_000, 'metrics_days': 1825, 'max_team_size': 1_000_000
        }
    }
    
    @click.command()
    @click.option('--wave', type=click.Choice(['small', 'medium', 'large', 'enterprise']), 
                  default='small', help='Scale wave to generate')
    @click.option('--parallel-workers', default=8, help='Parallel workers for data generation')
    @click.option('--batch-size', default=10000, help='Batch size for bulk operations')
    def seed_wave(wave: str, parallel_workers: int, batch_size: int):
        """Generate enterprise-scale test data for specified wave"""
        config = WAVE_CONFIGS[wave]
        logger.info(f"🌊 Seeding {wave} wave: {config}")
        
        asyncio.run(generate_wave_data(config, parallel_workers, batch_size))
  • 1.2 Enterprise data patterns with realistic distributions

    # Realistic enterprise distribution patterns
    ENTERPRISE_PATTERNS = {
        'team_size_distribution': {
            'micro': (1, 5, 0.6),      # 60% micro teams (1-5 users)
            'small': (6, 25, 0.25),    # 25% small teams (6-25 users)  
            'medium': (26, 100, 0.10), # 10% medium teams (26-100 users)
            'large': (101, 1000, 0.04), # 4% large teams (101-1000 users)
            'enterprise': (1001, 1_000_000, 0.01) # 1% enterprise teams (1001+ users)
        },
        'tools_per_team_distribution': {
            'light': (1, 10, 0.5),     # 50% teams: 1-10 tools
            'moderate': (11, 50, 0.3), # 30% teams: 11-50 tools
            'heavy': (51, 200, 0.15),  # 15% teams: 51-200 tools
            'power': (201, 1000, 0.05) # 5% teams: 201-1000 tools
        },
        'federation_patterns': {
            'hub_spoke': 0.4,          # 40% use hub-spoke federation
            'full_mesh': 0.3,          # 30% use full-mesh federation
            'tiered': 0.2,             # 20% use tiered federation
            'isolated': 0.1            # 10% no federation
        },
        'cache_access_patterns': {
            'hot_data': 0.8,           # 80% requests hit hot data (5% of total)
            'warm_data': 0.15,         # 15% requests hit warm data (15% of total)
            'cold_data': 0.05          # 5% requests hit cold data (80% of total)
        }
    }
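The `(min, max, weight)` tuples in 1.2 above could be consumed by a weighted sampler along these lines; this is a sketch, and `sample_from_distribution` is a hypothetical helper rather than an existing seeder function:

```python
import random

# Pick a bucket by its weight, then a uniform value within the bucket's range.
def sample_from_distribution(dist, rng=random):
    buckets = list(dist.values())
    weights = [weight for _, _, weight in buckets]
    lo, hi, _ = rng.choices(buckets, weights=weights, k=1)[0]
    return rng.randint(lo, hi)

team_size_distribution = {
    'micro': (1, 5, 0.6),
    'small': (6, 25, 0.25),
    'medium': (26, 100, 0.10),
    'large': (101, 1000, 0.04),
    'enterprise': (1001, 1_000_000, 0.01),
}
sizes = [sample_from_distribution(team_size_distribution) for _ in range(5)]
```

With these weights, roughly 85% of sampled teams land in the micro/small buckets, matching the intended long-tail shape.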
  • 1.3 High-performance bulk loading for enterprise scale

    # Optimized bulk loading for million-scale datasets
    async def bulk_load_enterprise_wave(config: dict, workers: int, batch_size: int):
        """High-performance parallel bulk loading"""
        
        # Phase 1: Generate core entities in parallel
        async with asyncio.TaskGroup() as tg:
            tg.create_task(bulk_generate_users(config['users'], workers, batch_size))
            tg.create_task(bulk_generate_teams(config['teams'], workers, batch_size))
            tg.create_task(bulk_generate_tools(config['tools'], workers, batch_size))
            tg.create_task(bulk_generate_servers(config['servers'], workers, batch_size))
        
        # Phase 2: Generate relationships with realistic patterns
        await bulk_generate_team_memberships(config, batch_size=50_000)
        await bulk_generate_tool_associations(config, batch_size=100_000)
        
        # Phase 3: Generate historical metrics (most expensive)
        await bulk_generate_enterprise_metrics(
            config['metrics_days'], 
            parallel_months=12,  # Generate 12 months in parallel
            batch_size=1_000_000  # 1M metrics per batch
        )
        
        # Phase 4: Generate federation mesh
        await bulk_generate_federation_topology(config, pattern='enterprise')

Phase 2: Smocker Integration for Enterprise Federation Testing

  • 2.1 Enterprise MCP federation mocking docker-compose.enterprise.yml

    # Enterprise-scale federation testing
    smocker:
      image: thiht/smocker:latest
      ports:
        - "8080:8080"
        - "8081:8081"
      environment:
        - SMOCKER_MAX_MOCKS=10000      # Support 10K mock endpoints
        - SMOCKER_MEMORY_LIMIT=4GB     # Handle large mock datasets
      volumes:
        - ./loadtest/mocks:/opt/mocks
        
    # Mock MCP federation cluster (100 gateways)
    mock-federation-cluster:
      image: thiht/smocker:latest
      deploy:
        replicas: 20  # 20 smocker instances
      environment:
        - SMOCKER_FEDERATION_SIZE=100
        - SMOCKER_LATENCY_RANGE=50-2000ms
        - SMOCKER_RELIABILITY=0.95-0.999
      volumes:
        - ./loadtest/mocks/federation-enterprise.yml:/opt/mocks/config.yml
        
    # Mock enterprise external APIs
    mock-enterprise-apis:
      image: thiht/smocker:latest
      deploy:
        replicas: 5
      environment:
        - SMOCKER_ENTERPRISE_APIS=true
        - SMOCKER_RATE_LIMIT=1000rps
      volumes:
        - ./loadtest/mocks/enterprise-apis.yml:/opt/mocks/config.yml
  • 2.2 Enterprise federation scenario mocks loadtest/mocks/federation-enterprise.yml

    # Enterprise federation patterns with realistic failure modes
    - request:
        method: POST
        path: /v1/tools
        headers:
          x-gateway-region: "us-east"
      response:
        status: 200
        delay: 75ms  # US East latency
        body: |
          {
            "tools": {{range 1000}}
              {"name": "enterprise_tool_{{.}}", "description": "Enterprise tool {{.}}"},
            {{end}}
          }
    
    # Regional latency simulation      
    - request:
        method: POST
        path: /v1/tools  
        headers:
          x-gateway-region: "eu-central"
      response:
        status: 200
        delay: 150ms  # EU Central latency
        
    # Failure scenarios (5% failure rate)
    - request:
        method: POST
        path: /v1/tools
        headers:
          x-test-scenario: "partial_outage"
      response:
        status: 503
        delay: 30s
        body: |
          {"error": "Gateway temporarily unavailable", "retry_after": 30}
          
    # Large dataset responses (cache stress testing)
    - request:
        method: POST
        path: /v1/federation/bulk
      response:
        status: 200
        delay: 500ms
        body: |
          {
            "tools": {{range 10000}}
              {"name": "bulk_tool_{{.}}", "size": "{{multiply . 1024}}"},
            {{end}}
          }
  • 2.3 Dynamic enterprise mock management

    # Enterprise-scale mock management
    class EnterpriseMockManager:
        async def setup_enterprise_federation(self, gateway_count: int = 100):
            """Setup enterprise-scale federated gateway mocks"""
            
            # Create regional clusters
            regions = ['us-east', 'us-west', 'eu-central', 'eu-west', 'ap-south', 'ap-east']
            gateways_per_region = gateway_count // len(regions)
            
            for region in regions:
                await self.create_regional_cluster(
                    region=region,
                    gateway_count=gateways_per_region,
                    base_latency=self.get_regional_latency(region),
                    reliability=random.uniform(0.95, 0.999)
                )
        
        async def simulate_enterprise_failure_patterns(self):
            """Simulate realistic enterprise failure patterns"""
            failure_scenarios = [
                {'type': 'regional_outage', 'probability': 0.01, 'duration': '15m'},
                {'type': 'high_latency_spike', 'probability': 0.05, 'duration': '2m'},
                {'type': 'rate_limit_exceeded', 'probability': 0.02, 'duration': '30s'},
                {'type': 'partial_data_corruption', 'probability': 0.001, 'duration': '5m'}
            ]
            
            for scenario in failure_scenarios:
                if random.random() < scenario['probability']:
                    await self.trigger_failure_scenario(scenario)

Phase 3: Enterprise Load Testing Scenarios

  • 3.1 Enterprise Locust scenarios locustfiles/enterprise_scale.py

    # Enterprise user behavior patterns
    import random
    import time

    from locust import HttpUser, task

    class EnterpriseUserBehavior(HttpUser):
        weight = 40  # Most common user type
        
        def on_start(self):
            """Initialize enterprise user context"""
            self.user_id = f"user_{random.randint(1, 1_000_000)}"
            self.private_team_id = f"private_{self.user_id}"
            self.shared_teams = random.sample(range(1, 5_000_000), k=random.randint(1, 8))
            self.tool_cache = []
        
        @task(25)
        def list_my_team_tools(self):
            """Most frequent: list tools for my primary team"""
            team_id = random.choice(self.shared_teams)
            response = self.client.get(f"/v1/tools?scope=team:{team_id}&limit=50")
            if response.status_code == 200:
                self.tool_cache = response.json().get('tools', [])[:10]
        
        @task(15)
        def search_global_tools(self):
            """Search across global tool catalog"""
            query = random.choice(['weather', 'translate', 'calculate', 'format', 'analyze'])
            self.client.get(f"/v1/tools/search?q={query}&scope=global&limit=100")
        
        @task(10)
        def access_federated_tools(self):
            """Access tools from federated gateways"""
            self.client.get(f"/v1/federation/tools?regions=us-east,eu-central&limit=50")
        
        @task(8)
        def create_private_tool(self):
            """Create tool in private workspace"""
            tool_data = self.generate_enterprise_tool()
            response = self.client.post(f"/v1/tools", json=tool_data)
            if response.status_code == 201:
                tool_id = response.json()['id']
                self.tool_cache.append(tool_id)
        
        @task(5)
        def share_tool_to_team(self):
            """Share private tool to team (triggers cache invalidation)"""
            if self.tool_cache:
                tool_id = random.choice(self.tool_cache)
                team_id = random.choice(self.shared_teams)
                self.client.post(f"/v1/tools/{tool_id}/share", 
                                json={"scope": f"team:{team_id}"})
        
        @task(2)
        def bulk_operations(self):
            """Bulk operations that stress the system"""
            team_id = random.choice(self.shared_teams)
            self.client.post(f"/v1/tools/bulk", 
                            json={"team_id": team_id, "action": "export", "limit": 1000})
    
    class EnterprisePowerUser(HttpUser):
        weight = 10  # Power users with heavy operations
        
        @task(15)
        def complex_federation_query(self):
            """Complex queries across multiple federated gateways"""
            self.client.get("/v1/federation/aggregate?regions=all&include_metrics=true&timeframe=30d")
        
        @task(10)
        def team_administration(self):
            """Team management operations"""
            team_id = random.randint(1, 5_000_000)
            self.client.get(f"/v1/teams/{team_id}/members?limit=1000")
            self.client.get(f"/v1/teams/{team_id}/tools?include_private=true&limit=500")
        
        @task(8)
        def analytics_queries(self):
            """Heavy analytics queries"""
            self.client.get("/v1/analytics/usage?timeframe=90d&breakdown=team&limit=10000")
            
    class CacheStressBehavior(HttpUser):
        weight = 5  # Cache invalidation stress testing
        
        @task
        def cache_invalidation_storm(self):
            """Rapid create/update/delete to stress cache invalidation"""
            created_ids = []

            # Create 50 tools rapidly
            for i in range(50):
                tool_data = {"name": f"stress_tool_{i}_{time.time()}", "url": "http://example.com"}
                response = self.client.post("/v1/tools", json=tool_data)
                if response.status_code == 201:
                    created_ids.append(response.json()['id'])

            # Update them all
            for tool_id in created_ids:
                self.client.put(f"/v1/tools/{tool_id}",
                                json={"description": f"Updated at {time.time()}"})

            # Delete every other one to trigger invalidation churn
            for tool_id in created_ids[::2]:
                self.client.delete(f"/v1/tools/{tool_id}")
  • 3.2 Enterprise service-layer benchmarks tests/bench/enterprise_performance.py

    # Enterprise-scale service layer performance testing
    import pytest

    class TestEnterpriseServicePerformance:
        
        @pytest.mark.benchmark(group="enterprise_tool_service")
        def test_list_tools_million_scale(self, benchmark, enterprise_db_session):
            """Benchmark tool listing with 50K tools"""
            result = benchmark(tool_service.list_tools, 
                             enterprise_db_session, include_inactive=False)
            assert len(result) >= 45_000  # Should return most of 50K tools
        
        @pytest.mark.benchmark(group="enterprise_cache_performance")
        def test_cache_with_10gb_dataset(self, benchmark, enterprise_cache_manager):
            """Test cache performance with 10GB+ dataset"""
            large_data = {"tools": [{"id": i, "data": "x" * 1000} for i in range(100_000)]}
            
            def cache_operation():
                return enterprise_cache_manager.get_or_set(
                    "enterprise:large_dataset",
                    lambda: large_data,
                    ttl=3600
                )
            
            result = benchmark(cache_operation)
            assert len(result["tools"]) == 100_000
        
        @pytest.mark.benchmark(group="enterprise_multi_tenancy")
        def test_scope_filtering_million_teams(self, benchmark, enterprise_db_session):
            """Test multi-tenant scope filtering with 5M teams"""
            user_context = {
                "user_id": "enterprise_user",
                "teams": [f"team_{i}" for i in range(1000)]  # User in 1000 teams
            }
            
            result = benchmark(tool_service.list_tools_with_scope, 
                             enterprise_db_session, user_context)
            assert len(result) > 0
        
        @pytest.mark.benchmark(group="enterprise_federation")
        def test_federation_aggregation_100_gateways(self, benchmark, mock_federation):
            """Test federation aggregation across 100 gateways"""
            gateway_urls = [f"http://mock-gateway-{i}:9000" for i in range(100)]
            
            def federation_operation():
                return gateway_service.aggregate_federated_tools(
                    gateway_urls, timeout=30, parallel_limit=20
                )
            
            result = benchmark(federation_operation)
            assert len(result) >= 50_000  # Should aggregate significant tools

Phase 4: Enhanced Enterprise Monitoring & Analysis

  • 4.1 Enterprise load test Grafana dashboard grafana/enterprise-loadtest.json
    {
      "dashboard": {
        "title": "MCP Gateway - Enterprise Load Test Analysis",
        "refresh": "5s",
        "time": {"from": "now-2h", "to": "now"},
        "panels": [
          {
            "title": "Request Rate by Wave Scale",
            "targets": [
              {"expr": "rate(http_requests_total[5m]) by (wave_scale, endpoint)", 
               "legendFormat": "{{wave_scale}} - {{endpoint}}"}
            ],
            "yAxes": [{"unit": "reqps", "max": 10000}]
          },
          {
            "title": "Enterprise Cache Performance (L1/L2)",
            "targets": [
              {"expr": "rate(cache_l1_hits_total[5m])", "legendFormat": "L1 Hits/sec"},
              {"expr": "rate(cache_l2_hits_total[5m])", "legendFormat": "L2 Hits/sec"},
              {"expr": "rate(cache_misses_total[5m])", "legendFormat": "Cache Misses/sec"},
              {"expr": "cache_l1_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "L1 Memory (GB)"}
            ]
          },
          {
            "title": "Federation Latency (100+ Gateways)",
            "targets": [
              {"expr": "histogram_quantile(0.50, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P50"},
              {"expr": "histogram_quantile(0.95, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P95"},
              {"expr": "histogram_quantile(0.99, rate(federation_request_duration_seconds_bucket[5m]))", 
               "legendFormat": "P99"}
            ]
          },
          {
            "title": "Multi-tenancy Query Performance",
            "targets": [
              {"expr": "rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m]) by (scope_type)", 
               "legendFormat": "Avg {{scope_type}}"},
              {"expr": "histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) by (scope_type)", 
               "legendFormat": "P95 {{scope_type}}"}
            ]
          },
          {
            "title": "Memory Usage - Enterprise Scale",
            "targets": [
              {"expr": "process_resident_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "Gateway RSS (GB)"},
              {"expr": "cache_l1_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "L1 Cache (GB)"},
              {"expr": "redis_used_memory_bytes / 1024 / 1024 / 1024", "legendFormat": "Redis (GB)"},
              {"expr": "postgresql_shared_buffers_bytes / 1024 / 1024 / 1024", "legendFormat": "PostgreSQL (GB)"}
            ]
          },
          {
            "title": "Database Connection Pool Usage",
            "targets": [
              {"expr": "postgresql_connections_active", "legendFormat": "Active Connections"},
              {"expr": "postgresql_connections_idle", "legendFormat": "Idle Connections"},
              {"expr": "postgresql_connections_total", "legendFormat": "Total Connections"},
              {"expr": "postgresql_max_connections", "legendFormat": "Max Connections"}
            ]
          }
        ]
      }
    }

Phase 5: Wave-Based CI Integration

  • 5.1 Enterprise wave testing workflow .github/workflows/enterprise-soak.yml
    name: Enterprise Scale Soak Testing
    
    on:
      schedule:
        - cron: '0 2 * * 0'  # Weekly Sunday 2 AM UTC
      workflow_dispatch:
        inputs:
          wave:
            description: 'Test wave to run'
            required: true
            default: 'small'
            type: choice
            options:
              - small
              - medium  
              - large
              - enterprise
          duration_hours:
            description: 'Test duration in hours'
            default: '2'
            
    jobs:
      enterprise-soak:
        runs-on: ubuntu-latest-8-cores  # Use 8-core runner for enterprise scale
        strategy:
          matrix:
            wave: [small, medium, large, enterprise]
            database: [postgresql]  # Only PostgreSQL for enterprise scale
            cache: [enabled]        # Always enable cache for enterprise
            federation: [true, false]
          fail-fast: false
          
        steps:
          - uses: actions/checkout@v4
          
          - name: Setup enterprise test environment
            run: |
              # Increase system limits for enterprise testing
              echo "fs.file-max = 2097152" | sudo tee -a /etc/sysctl.conf
              echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
              echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
              sudo sysctl -p
              
              # Start enterprise test stack
              docker-compose -f docker-compose.enterprise.yml up -d
              
          - name: Tune PostgreSQL for enterprise load
            run: |
              docker exec postgresql psql -U postgres -c "
                ALTER SYSTEM SET shared_buffers = '4GB';
                ALTER SYSTEM SET work_mem = '256MB';
                ALTER SYSTEM SET max_connections = 1000;
                ALTER SYSTEM SET effective_cache_size = '12GB';
                SELECT pg_reload_conf();
              "
              
          - name: Seed enterprise data
            timeout-minutes: 120  # 2 hours max for enterprise wave
            run: |
              make seed-${{ matrix.wave }} PARALLEL_WORKERS=16 BATCH_SIZE=50000
            env:
              DATABASE_POOL_SIZE: 50
              
          - name: Run enterprise soak test
            timeout-minutes: 480  # 8 hours max
            run: |
              make soak-test-${{ matrix.wave }} \
                DURATION=${{ github.event.inputs.duration_hours || '2' }}h \
                USERS=2000 SPAWN_RATE=50 \
                FEDERATION_ENABLED=${{ matrix.federation }}
            env:
              DATABASE_TYPE: ${{ matrix.database }}
              CACHE_ENABLED: ${{ matrix.cache }}
              GUNICORN_WORKERS: 16
              LOCUST_WORKERS: 8
              
          - name: Capture enterprise flamegraph
            run: |
              make flamegraph-analysis DURATION=300  # 5 minutes
              
          - name: Generate enterprise capacity report
            run: |
              python scripts/generate_enterprise_report.py \
                --wave ${{ matrix.wave }} \
                --results reports/ \
                --output reports/enterprise-capacity-${{ matrix.wave }}.html
              
          - name: Upload enterprise artifacts
            uses: actions/upload-artifact@v4
            with:
              name: enterprise-soak-${{ matrix.wave }}-${{ matrix.federation && 'fed' || 'no-fed' }}
              retention-days: 30
              path: |
                reports/soak-*.html
                reports/flamegraph-*.svg
                reports/enterprise-capacity-*.html
                reports/cache-analysis-${{ matrix.wave }}.json
                reports/federation-analysis-*.json

Phase 6: Enhanced Makefile Targets for Wave Testing

  • 6.1 Wave-based make targets
    # Wave-specific data seeding
    seed-small:
    	@echo "🌊 Seeding SMALL wave (10K users, 50K teams)..."
    	@python scripts/seed_enterprise.py --wave small $(SEED_ARGS)
    	@echo "βœ… Small wave data ready"
    
    seed-medium:
    	@echo "🌊 Seeding MEDIUM wave (100K users, 500K teams)..."  
    	@python scripts/seed_enterprise.py --wave medium --parallel-workers 16 $(SEED_ARGS)
    	@echo "βœ… Medium wave data ready"
    
    seed-large:
    	@echo "🌊 Seeding LARGE wave (500K users, 2.5M teams)..."
    	@python scripts/seed_enterprise.py --wave large --parallel-workers 32 $(SEED_ARGS)
    	@echo "βœ… Large wave data ready"
    
    seed-enterprise:
    	@echo "🌊 Seeding ENTERPRISE wave (1M users, 5M teams)..."
    	@python scripts/seed_enterprise.py --wave enterprise --parallel-workers 64 \
    		--batch-size 100000 $(SEED_ARGS)
    	@echo "βœ… Enterprise wave data ready"
    
    # Wave-specific soak testing
    soak-test-small: seed-small
    	@echo "πŸ”₯ SMALL wave soak test (15 min)..."
    	@$(MAKE) _run_soak_test WAVE=small DURATION=15m USERS=100
    
    soak-test-medium: seed-medium  
    	@echo "πŸ”₯ MEDIUM wave soak test (30 min)..."
    	@$(MAKE) _run_soak_test WAVE=medium DURATION=30m USERS=500
    
    soak-test-large: seed-large
    	@echo "πŸ”₯ LARGE wave soak test (60 min)..."
    	@$(MAKE) _run_soak_test WAVE=large DURATION=60m USERS=1500
    
    soak-test-enterprise: seed-enterprise
    	@echo "πŸ”₯ ENTERPRISE wave soak test (120 min)..."
    	@$(MAKE) _run_soak_test WAVE=enterprise DURATION=120m USERS=2000
    
    # Run all waves sequentially (for comprehensive testing)
    soak-test-all-waves:
    	@echo "🌊 Running ALL wave tests (4+ hours)..."
    	@$(MAKE) soak-test-small
    	@$(MAKE) soak-test-medium  
    	@$(MAKE) soak-test-large
    	@$(MAKE) soak-test-enterprise
    	@python scripts/generate_wave_comparison_report.py
    	@echo "πŸ“Š All wave tests complete - see reports/wave-comparison.html"
    
    # Internal helper for running soak tests (SPAWN_RATE defaults to 50 users/s)
    SPAWN_RATE ?= 50
    _run_soak_test:
    	@docker-compose -f docker-compose.enterprise.yml up -d smocker
    	@python scripts/setup_federation_mocks.py --wave $(WAVE)
    	@locust -f locustfiles/enterprise_scale.py --headless \
    		--users $(USERS) --spawn-rate $(SPAWN_RATE) \
    		--run-time $(DURATION) --html reports/soak-$(WAVE)-$(shell date +%Y%m%d).html
    	@pytest tests/bench/ --benchmark-only --benchmark-json=reports/benchmark-$(WAVE).json
    	@python scripts/generate_wave_report.py --wave $(WAVE)
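    The seed-{wave} targets above assume a wave-aware CLI in scripts/seed_enterprise.py. A minimal sketch of the preset table and argument parsing (the flag names match what the Makefile passes; the preset structure itself is an assumption):

    ```python
    # Sketch of the wave presets behind scripts/seed_enterprise.py.
    # Counts mirror the wave matrix; CLI flags match the Makefile
    # (--wave, --parallel-workers, --batch-size). Structure is assumed.
    import argparse

    WAVES = {
        "small":      {"servers": 100,    "tools": 500,    "users": 10_000,    "teams": 50_000},
        "medium":     {"servers": 1_000,  "tools": 2_500,  "users": 100_000,   "teams": 500_000},
        "large":      {"servers": 5_000,  "tools": 12_500, "users": 500_000,   "teams": 2_500_000},
        "enterprise": {"servers": 10_000, "tools": 50_000, "users": 1_000_000, "teams": 5_000_000},
    }

    def parse_args(argv=None):
        p = argparse.ArgumentParser(description="Seed wave-scaled test data")
        p.add_argument("--wave", choices=sorted(WAVES), required=True)
        p.add_argument("--parallel-workers", type=int, default=8)
        p.add_argument("--batch-size", type=int, default=10_000)
        return p.parse_args(argv)

    # Example: the enterprise target runs the equivalent of
    # parse_args(["--wave", "enterprise", "--parallel-workers", "64",
    #             "--batch-size", "100000"])
    ```

    Note the presets preserve the Teams β‰ˆ Users Γ— 5 rule of thumb, so seeded data keeps the same private/shared team shape at every scale.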

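    `_run_soak_test` invokes scripts/setup_federation_mocks.py before starting Locust. A sketch of how it could register fake federated gateways against smocker's admin API (smocker accepts mock definitions via POST /mocks on its admin port, 8081 by default; the per-wave gateway counts and the /gateway-N/tools path are assumptions):

    ```python
    # Build mock definitions for N fake federated gateways and register
    # them with smocker (POST /mocks on the admin port, default 8081).
    # Gateway counts per wave and the URL scheme are assumed here.
    import json
    from urllib.request import Request, urlopen

    GATEWAYS_PER_WAVE = {"small": 10, "medium": 25, "large": 50, "enterprise": 100}

    def build_mocks(wave: str) -> list:
        """One mock per gateway: GET /gateway-N/tools returns a small tool list."""
        mocks = []
        for n in range(GATEWAYS_PER_WAVE[wave]):
            mocks.append({
                "request": {"method": "GET", "path": f"/gateway-{n}/tools"},
                "response": {
                    "status": 200,
                    "headers": {"Content-Type": "application/json"},
                    "body": json.dumps({"tools": [f"gw{n}-tool-{i}" for i in range(5)]}),
                },
            })
        return mocks

    def register(mocks, admin_url="http://localhost:8081"):
        """Push the mock list to a running smocker instance."""
        req = Request(f"{admin_url}/mocks", data=json.dumps(mocks).encode(),
                      headers={"Content-Type": "application/json"}, method="POST")
        with urlopen(req) as resp:  # requires smocker to be up (docker-compose)
            return resp.status
    ```

    Keeping mock construction separate from registration lets the same definitions be unit-tested without a running smocker container.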
Phase 7: Enterprise Documentation & Capacity Planning

  • 7.1 Comprehensive enterprise guide docs/testing/enterprise-scalability.md

    # Enterprise-Scale Scalability Testing
    
    ## Wave System Overview
    
    Our testing uses a **4-wave system** to validate performance from small deployments to massive enterprises:
    
    | Wave | Scale | Use Case | Duration |
    |------|-------|----------|----------|
    | **Small** | 10K users, 50K teams | Department/Startup | 15 min |
    | **Medium** | 100K users, 500K teams | Mid-size Enterprise | 30 min |
    | **Large** | 500K users, 2.5M teams | Large Enterprise | 60 min |
    | **Enterprise** | 1M users, 5M teams | Global Enterprise | 120 min |
    
    ## Quick Start
    
    ```bash
    # Run specific wave
    make soak-test-enterprise USERS=2000
    
    # Run all waves (4+ hours)
    make soak-test-all-waves
    
    # View results
    open reports/enterprise-capacity-enterprise.html
    open http://localhost:3000/d/enterprise-loadtest
    ```

    ## Enterprise Capacity Planning Results

    ### Performance Baselines (with L1+L2 caching)

    | Configuration | Small Wave | Medium Wave | Large Wave | Enterprise Wave |
    |---------------|------------|-------------|------------|-----------------|
    | Max RPS | 500 | 1,200 | 2,500 | 4,000 |
    | P95 Latency | 25ms | 45ms | 85ms | 150ms |
    | Memory Usage | 2GB | 6GB | 15GB | 25GB |
    | DB Connections | 20 | 50 | 150 | 300 |
    | Cache Hit Ratio | 98% | 96% | 94% | 92% |

    ### Federation Performance

    | Federated Gateways | Tool Aggregation Time | Memory Overhead | Failure Tolerance |
    |--------------------|-----------------------|-----------------|-------------------|
    | 10 gateways | 150ms | +500MB | 2 failures |
    | 50 gateways | 400ms | +2GB | 5 failures |
    | 100 gateways | 800ms | +4GB | 10 failures |

    ## Recommended Infrastructure

    ### Enterprise Wave (1M users, 5M teams)

    **Gateway Cluster:**

    • 4-8 instances: 8 CPU, 32GB RAM each
    • Load balancer with session affinity
    • Auto-scaling based on CPU >70%

    **Database:**

    • PostgreSQL: 16 CPU, 128GB RAM, 1TB SSD
    • Read replicas: 2-4 instances for read scaling
    • Connection pooling: pgbouncer with 300 max connections

    **Cache Layer:**

    • Redis cluster: 3 masters, 3 replicas
    • 64GB RAM per instance
    • Memory eviction: allkeys-lru

    **Monitoring:**

    • Prometheus: 8 CPU, 64GB RAM, 500GB storage
    • Grafana: 4 CPU, 16GB RAM
    • Log aggregation: Elasticsearch cluster

πŸ“¦ Updated Deliverables

  1. Wave-based enterprise seeder: scripts/seed_enterprise.py with 4-tier scaling system
  2. Enterprise smocker integration: docker-compose.enterprise.yml + 100+ gateway mocks
  3. Enterprise load scenarios: locustfiles/enterprise_scale.py with million-user patterns
  4. Service benchmarks: tests/bench/enterprise_performance.py for all enterprise scales
  5. Enterprise monitoring: grafana/enterprise-loadtest.json + comprehensive alerts
  6. Wave-based CI: Weekly enterprise testing with 8-hour test cycles
  7. Performance analysis: Distributed flamegraph capture + enterprise hotspot analysis
  8. Enterprise documentation: Complete capacity planning for 1M+ user deployments
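The million-user patterns in locustfiles/enterprise_scale.py reduce to a weighted mix of read-heavy operations. A dependency-free sketch of the assumed mix (the task names and weights are illustrative, not measured):

```python
# Weighted traffic mix assumed for the enterprise load scenarios.
# In the real locustfile these weights would become @task(N) decorators.
import random

TASK_WEIGHTS = {
    "list_tools":      50,  # hot path: tool listing dominates enterprise traffic
    "invoke_tool":     30,
    "list_teams":      10,
    "admin_dashboard":  5,
    "create_resource":  5,
}

def pick_tasks(n: int, rng: random.Random) -> list:
    """Sample n operations according to the weighted mix."""
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=n)
```

Keeping the mix read-dominated matters for the soak tests: it is what makes the L1/L2 cache hit ratios in the baseline tables achievable.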

🎯 Expected Enterprise Outcomes

Performance Baselines (Enterprise Wave with L1+L2 caching):

  • 50K tool listing: <150ms P95 (vs 30+ seconds uncached)
  • Federation mesh: 100 gateways aggregated in <800ms with intelligent caching
  • Multi-tenancy: <15% query overhead for 5M team scope filtering
  • Memory efficiency: L1 cache 4GB for 92% hit rate on enterprise datasets
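The hit-rate targets above imply a simple expected-latency model; with assumed per-request costs (2 ms on an L1 hit, 85 ms on a full miss, both illustrative rather than measured Gateway numbers), a 92% hit rate keeps blended latency in single-digit milliseconds:

```python
# Blended request latency under a given cache hit ratio.
# hit_ms / miss_ms are illustrative assumptions, not measured values.
def effective_latency_ms(hit_ratio: float, hit_ms: float = 2.0,
                         miss_ms: float = 85.0) -> float:
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# effective_latency_ms(0.92) -> 8.64 ms blended
```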

Enterprise Capacity Planning Data:

  • Safe production limits: 4,000 RPS sustained per 4-instance cluster
  • Scale-out recommendations: Horizontal scaling patterns for 10M+ users
  • Cache sizing: Memory requirements for different enterprise scales
  • Federation limits: Maximum federated gateway count before timeout cascade
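The 4,000 RPS per 4-instance cluster figure implies roughly 1,000 RPS sustained per Gateway instance. A back-of-envelope sizing helper built on that number (the 70% utilization ceiling is an assumed safety margin):

```python
# Cluster sizing from the observed ~1,000 RPS per Gateway instance
# (4,000 RPS / 4 instances). max_util is an assumed safety margin.
import math

def instances_needed(target_rps: float, per_instance_rps: float = 1000.0,
                     max_util: float = 0.7) -> int:
    return max(1, math.ceil(target_rps / (per_instance_rps * max_util)))

# instances_needed(4000) -> 6 (running each instance at ~67% utilization)
```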

Infrastructure Recommendations:

  • Detailed sizing for 1M+ user deployments
  • Database sharding strategies for 10M+ teams
  • Multi-region federation architecture
  • Disaster recovery and failover procedures

🧩 Additional Notes

  • Enterprise-realistic patterns: Based on actual Fortune 500 SaaS usage data
  • Federation at scale: Tests 100+ gateway mesh with realistic failure patterns
  • Wave progression: Each wave 5-10x larger than previous for scaling validation
  • Memory efficiency: L1/L2 cache tuned for enterprise dataset sizes (10GB+)
  • CI scalability: Weekly enterprise tests with trend analysis over months
  • Production readiness: Direct infrastructure sizing for million-user deployments
  • Cost optimization: Capacity planning includes cost-per-user analysis for different configurations


    Labels

    • chore: Linting, formatting, dependency hygiene, or project maintenance chores
    • cicd: Issue with CI/CD process (GitHub Actions, scaffolding)
    • devops: DevOps activities (containers, automation, deployment, makefiles, etc)
    • performance: Performance related items
    • testing: Testing (unit, e2e, manual, automated, etc)
    • triage: Issues / Features awaiting triage
