diff --git a/README.md b/README.md
index 55387ab..54d75bd 100644
--- a/README.md
+++ b/README.md
@@ -7,9 +7,9 @@ This directory contains GitOps application manifests that are deployed as part o
 ```
 applications/
 ├── base/                      # Base application configurations
-│   ├── genestack-sources/     # Genestack GitOps repository sources
 │   ├── managed-services/      # Rackspace-managed services
 │   └── services/              # Core cluster services
+│       └── observability/     # Observability stack components
 └── policies/                  # Security and network policies
     ├── network-policies/      # Kubernetes network policies
     ├── pod-security-policies/ # Pod security standards
@@ -18,173 +18,52 @@ applications/
 
 ## Available Applications
 
-### Quick Reference Table
-
-| Application | Category | Namespace | Purpose |
-|-------------|----------|-----------|---------|
-| **cert-manager** | Core Service | `cert-manager` | Automated certificate management |
-| **gateway-api** | Core Service | `gateway-api` | Kubernetes Gateway API implementation |
-| **ingress-nginx** | Core Service | `ingress-nginx` | NGINX-based ingress controller |
-| **keycloak** | Core Service | `keycloak` | Identity and access management |
-| **kube-prometheus-stack** | Core Service | `observability` | Complete monitoring and alerting stack |
-| **metallb** | Core Service | `metallb-system` | Bare metal load balancer |
-| **olm** | Core Service | `olm` | Operator Lifecycle Manager |
-| **opentelemetry-kube-stack** | Core Service | `observability` | Complete OpenTelemetry observability stack |
-| **sealed-secrets** | Core Service | `sealed-secrets` | Encrypted secrets management |
-| **velero** | Core Service | `velero` | Cluster backup and disaster recovery |
-| **alert-proxy** | Managed Service | `rackspace` | Rackspace alert aggregation |
-| **genestack-repo** | Source Repository | `flux-system` | Genestack OpenStack deployment |
-| **openstack-helm** | Source Repository | `flux-system` | OpenStack Helm charts |
-| **network-policies** | Security Policy | Various | Kubernetes network segmentation |
-| **pod-security-policies** | Security Policy | Various | Pod security standards enforcement |
-| **rbac** | Security Policy | Various | Role-based access control |
-
-### Core Services (`base/services/`)
-
-#### **cert-manager**
-- **Purpose**: Automated certificate management for Kubernetes
-- **Source**: Jetstack Helm repository (`https://charts.jetstack.io`)
-- **Namespace**: `cert-manager`
-- **Features**:
-  - Let's Encrypt integration
-  - Automatic certificate renewal
-  - TLS certificate provisioning for ingress
-
-#### **gateway-api**
-- **Purpose**: Kubernetes Gateway API implementation
-- **Namespace**: `gateway-api`
-- **Features**:
-  - Next-generation ingress and traffic management
-  - Advanced routing capabilities
-  - Service mesh integration ready
-
-#### **ingress-nginx**
-- **Purpose**: NGINX-based ingress controller
-- **Namespace**: `ingress-nginx`
-- **Features**:
-  - HTTP/HTTPS load balancing
-  - SSL termination
-  - Path-based and host-based routing
-
-#### **keycloak**
-- **Purpose**: Identity and access management
-- **Namespace**: `keycloak`
-- **Features**:
-  - Single sign-on (SSO)
-  - OAuth 2.0 and OpenID Connect
-  - Multi-realm support
-  - LDAP/Active Directory integration
-
-#### **kube-prometheus-stack**
-- **Purpose**: Complete monitoring and alerting stack
-- **Namespace**: `observability`
-- **Components**:
-  - Prometheus for metrics collection
-  - Grafana for visualization
-  - Alertmanager for alert handling
-  - Node Exporter for node metrics
-- **Features**:
-  - 
Pre-configured dashboards - - Alert rules for common scenarios - - ServiceMonitor auto-discovery - -#### **metallb** -- **Purpose**: Bare metal load balancer for Kubernetes -- **Namespace**: `metallb-system` -- **Features**: - - Layer 2 and BGP load balancing - - IP address pool management - - Service type LoadBalancer support - -#### **olm** -- **Purpose**: Operator Lifecycle Manager -- **Namespace**: `olm` -- **Features**: - - Operator installation and management - - Dependency resolution - - Automatic updates - -#### **opentelemetry-kube-stack** -- **Purpose**: Complete OpenTelemetry observability stack for Kubernetes -- **Source**: OpenTelemetry Kube Stack Helm repository (`https://charts.opentelemetry.io`) -- **Namespace**: `observability` -- **Features**: - - OpenTelemetry Operator for auto-instrumentation and collector management - - Pre-configured OpenTelemetry Collector for metrics, traces, and logs - - Automatic service discovery and monitoring - - Multi-language auto-instrumentation support (Java, Node.js, Python, .NET, Go) - - Integration with Prometheus and Jaeger for complete observability - - Custom resource definitions for OpenTelemetry configuration - -#### **sealed-secrets** -- **Purpose**: Encrypted secrets management -- **Namespace**: `sealed-secrets` -- **Features**: - - GitOps-friendly secret encryption - - Public/private key encryption - - Automatic secret decryption in cluster - -#### **velero** -- **Purpose**: Cluster backup and disaster recovery -- **Namespace**: `velero` -- **Features**: - - Backup and restore Kubernetes resources - - Persistent volume snapshots - - Scheduled backups - - Cross-cluster migration - -### Managed Services (`base/managed-services/`) - -#### **alert-proxy** -- **Purpose**: Rackspace alert aggregation and forwarding -- **Namespace**: `rackspace` -- **Features**: - - Alert collection from monitoring systems - - Integration with Rackspace support systems - - Alert routing and escalation - -### Source Repositories (`base/genestack-sources/`) - -#### **genestack-repo** -- **Purpose**: GitOps source for Genestack OpenStack deployment -- **Source**: `https://github.com/rackerlabs/genestack.git` -- **Version**: `release-2025.2.6` -- **Features**: - - OpenStack deployment automation - - Helm chart aggregation - - GitOps workflow integration - -#### **openstack-helm** -- **Purpose**: OpenStack Helm charts repository -- **Features**: - - Production-ready OpenStack charts - - Multi-node deployment support - - HA configuration templates - -### Security Policies (`policies/`) - -#### **network-policies** -- **Purpose**: Kubernetes network segmentation -- **Status**: Template directory (placeholder.txt) -- **Planned Features**: - - Namespace isolation - - Ingress/egress traffic control - - Zero-trust networking - -#### **pod-security-policies** -- **Purpose**: Pod security standards enforcement -- **Status**: Template directory (placeholder.txt) -- **Planned Features**: - - Security context enforcement - - Privilege escalation prevention - - Container security standards - -#### **rbac** -- **Purpose**: Role-based access control -- **Features**: - - Service account management - - Role and ClusterRole definitions - - Principle of least privilege +### Core Services + +| Service | Namespace | Purpose | Documentation | +|---------|-----------|---------|---------------| +| **[cert-manager](applications/base/services/cert-manager/)** | `cert-manager` | Automated TLS certificate management | [README](applications/base/services/cert-manager/README.md) | +| 
**[external-snapshotter](applications/base/services/external-snapshotter/)** | `kube-system` | Volume snapshot management | [README](applications/base/services/external-snapshotter/README.md) | +| **[gateway-api](applications/base/services/gateway-api/)** | `gateway-system` | Next-generation ingress API | [README](applications/base/services/gateway-api/README.md) | +| **[harbor](applications/base/services/harbor/)** | `harbor` | Container registry with security scanning | [README](applications/base/services/harbor/README.md) | +| **[headlamp](applications/base/services/headlamp/)** | `headlamp` | Modern Kubernetes dashboard | [README](applications/base/services/headlamp/README.md) | +| **[keycloak](applications/base/services/keycloak/)** | `keycloak` | Identity and access management | [README](applications/base/services/keycloak/README.md) | +| **[kyverno](applications/base/services/kyverno/)** | `kyverno` | Kubernetes-native policy engine | [README](applications/base/services/kyverno/README.md) | +| **[longhorn](applications/base/services/longhorn/)** | `longhorn-system` | Distributed block storage | [README](applications/base/services/longhorn/README.md) | +| **[metallb](applications/base/services/metallb/)** | `metallb-system` | Load balancer for bare-metal clusters | [README](applications/base/services/metallb/README.md) | +| **[olm](applications/base/services/olm/)** | `olm` | Operator Lifecycle Manager | [README](applications/base/services/olm/README.md) | +| **[openstack-ccm](applications/base/services/openstack-ccm/)** | `kube-system` | OpenStack Cloud Controller Manager | [README](applications/base/services/openstack-ccm/README.md) | +| **[openstack-csi](applications/base/services/openstack-csi/)** | `kube-system` | OpenStack Cinder CSI driver | [README](applications/base/services/openstack-csi/README.md) | +| **[postgres-operator](applications/base/services/postgres-operator/)** | `postgres-operator` | PostgreSQL cluster management | [README](applications/base/services/postgres-operator/README.md) | +| **[rbac-manager](applications/base/services/rbac-manager/)** | `rbac-manager` | RBAC management automation | [README](applications/base/services/rbac-manager/README.md) | +| **[sealed-secrets](applications/base/services/sealed-secrets/)** | `kube-system` | GitOps-friendly secret management | [README](applications/base/services/sealed-secrets/README.md) | +| **[velero](applications/base/services/velero/)** | `velero` | Backup and disaster recovery | [README](applications/base/services/velero/README.md) | +| **[vsphere-csi](applications/base/services/vsphere-csi/)** | `vmware-system-csi` | vSphere storage integration | [README](applications/base/services/vsphere-csi/README.md) | +| **[weave-gitops](applications/base/services/weave-gitops/)** | `flux-system` | GitOps dashboard for Flux | [README](applications/base/services/weave-gitops/README.md) | + +### Observability Stack + +| Component | Namespace | Purpose | Documentation | +|-----------|-----------|---------|---------------| +| **[observability](applications/base/services/observability/)** | `observability` | Complete observability stack | [README](applications/base/services/observability/README.md) | +| **[kube-prometheus-stack](applications/base/services/observability/kube-prometheus-stack/)** | `observability` | Prometheus, Grafana, Alertmanager | [README](applications/base/services/observability/kube-prometheus-stack/README.md) | +| **[loki](applications/base/services/observability/loki/)** | `observability` | Log 
aggregation and storage | [README](applications/base/services/observability/loki/README.md) | +| **[tempo](applications/base/services/observability/tempo/)** | `observability` | Distributed tracing backend | [README](applications/base/services/observability/tempo/README.md) | +| **[opentelemetry-kube-stack](applications/base/services/observability/opentelemetry-kube-stack/)** | `observability` | OpenTelemetry collection framework | [README](applications/base/services/observability/opentelemetry-kube-stack/README.md) | + +### Managed Services + +| Service | Namespace | Purpose | Documentation | +|---------|-----------|---------|---------------| +| **[alert-proxy](applications/base/managed-services/alert-proxy/)** | `rackspace` | Rackspace alert aggregation | [README](applications/base/managed-services/alert-proxy/README.md) | + +### Security Policies + +| Policy | Scope | Purpose | +|--------|-------|---------| +| **[network-policies](applications/policies/network-policies/)** | Various | Kubernetes network segmentation | +| **[pod-security-policies](applications/policies/pod-security-policies/)** | Various | Pod security standards enforcement | +| **[rbac](applications/policies/rbac/)** | Various | Role-based access control | ## Deployment Architecture @@ -203,14 +82,22 @@ All applications follow these patterns: - **Remediation**: 3-retry policy with last-failure remediation ### Namespace Organization -- `cert-manager`: Certificate management -- `ingress-nginx`: Ingress controllers -- `observability`: Monitoring and alerting -- `metallb-system`: Load balancing -- `velero`: Backup and recovery +- `cert-manager`: TLS certificate management +- `gateway-system`: Gateway API controllers +- `harbor`: Container registry and security scanning +- `headlamp`: Kubernetes dashboard - `keycloak`: Identity and access management +- `kyverno`: Policy engine and governance +- `longhorn-system`: Distributed storage +- `metallb-system`: Load balancing for bare-metal +- `observability`: Complete monitoring, logging, and tracing stack +- `olm`: Operator lifecycle management +- `postgres-operator`: PostgreSQL database management +- `rbac-manager`: RBAC automation +- `velero`: Backup and disaster recovery +- `vmware-system-csi`: vSphere storage integration +- `flux-system`: GitOps controllers and dashboards - `rackspace`: Managed services -- `gateway-api`: Next-gen traffic management ## Usage @@ -253,18 +140,59 @@ Applications can be customized through: ## Monitoring and Observability -The kube-prometheus-stack provides comprehensive monitoring: +The observability stack provides comprehensive monitoring, logging, and tracing: -- **Metrics**: Application and infrastructure metrics via Prometheus -- **Dashboards**: Pre-configured Grafana dashboards -- **Alerts**: Production-ready alerting rules -- **Logs**: Integration with cluster logging stack +### Metrics and Monitoring +- **[Kube-Prometheus-Stack](applications/base/services/observability/kube-prometheus-stack/)**: Prometheus, Grafana, and Alertmanager +- **Metrics Collection**: Application and infrastructure metrics +- **Dashboards**: Pre-configured Grafana dashboards for Kubernetes and applications +- **Alerting**: Production-ready alerting rules with notification routing + +### Logging +- **[Loki](applications/base/services/observability/loki/)**: Cost-effective log aggregation and storage +- **Log Collection**: Kubernetes and application logs via OpenTelemetry +- **Log Querying**: LogQL for powerful log filtering and analysis +- **Retention**: Configurable 
log retention policies + +### Tracing +- **[Tempo](applications/base/services/observability/tempo/)**: Distributed tracing backend +- **Trace Collection**: OpenTelemetry-based trace ingestion +- **Trace Analysis**: TraceQL for trace querying and analysis +- **Integration**: Unified view with metrics and logs in Grafana + +### Data Collection +- **[OpenTelemetry](applications/base/services/observability/opentelemetry-kube-stack/)**: Unified observability framework +- **Auto-instrumentation**: Automatic telemetry collection for applications +- **Data Processing**: Transformation, filtering, and enrichment pipelines +- **Multi-backend Export**: Support for multiple observability backends ## Support and Maintenance -- **Updates**: Managed through GitOps workflow -- **Backup**: Velero provides application backup/restore -- **Security**: Regular security updates via Flux automation -- **Monitoring**: Health checks via Prometheus/Grafana +- **Updates**: Managed through GitOps workflow with Flux CD +- **Backup**: [Velero](applications/base/services/velero/) provides application and persistent volume backup/restore +- **Security**: Regular security updates via Flux automation and [Kyverno](applications/base/services/kyverno/) policies +- **Monitoring**: Health checks via [Prometheus/Grafana](applications/base/services/observability/) +- **Storage**: [Longhorn](applications/base/services/longhorn/) for distributed block storage or [vSphere CSI](applications/base/services/vsphere-csi/)/[OpenStack CSI](applications/base/services/openstack-csi/) for cloud storage +- **Secrets Management**: [Sealed Secrets](applications/base/services/sealed-secrets/) for GitOps-friendly secret encryption +- **Identity Management**: [Keycloak](applications/base/services/keycloak/) for OIDC authentication and authorization + +## Documentation + +For detailed configuration and troubleshooting information, see the individual service documentation: + +- **Service Templates**: [docs/templates/](docs/templates/) - Templates for creating new service documentation +- **Configuration Guides**: Each service directory contains comprehensive README files with: + - Configuration options and examples + - Cluster-specific override guidance + - Verification and troubleshooting steps + - References to upstream documentation + +## Getting Started + +1. **Review Service Documentation**: Check individual service README files for configuration requirements +2. **Customize Overrides**: Create cluster-specific configuration overrides as needed +3. **Deploy via GitOps**: Commit changes to trigger Flux reconciliation +4. **Monitor Deployment**: Use [Weave GitOps](applications/base/services/weave-gitops/) or [Headlamp](applications/base/services/headlamp/) dashboards to monitor deployment status +5. **Verify Services**: Follow verification steps in each service's documentation For application-specific documentation, see individual application directories and their respective upstream documentation. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..6938111 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,112 @@ +# OpenCenter Service Configuration Guides + +This directory contains comprehensive configuration guides for all services available in the openCenter platform. Each guide provides detailed configuration examples, common pitfalls, troubleshooting steps, and best practices. 
+ +## Available Configuration Guides + +### Core Infrastructure Services + +| Service | Guide | Description | +|---------|-------|-------------| +| **Cert-manager** | [cert-manager-config-guide.md](cert-manager-config-guide.md) | TLS certificate management and automation | +| **Harbor** | [harbor-config-guide.md](harbor-config-guide.md) | Container registry with security scanning | +| **Keycloak** | [keycloak-config-guide.md](keycloak-config-guide.md) | Identity and access management | +| **Kyverno** | [kyverno-config-guide.md](kyverno-config-guide.md) | Kubernetes-native policy engine | +| **Longhorn** | [longhorn-config-guide.md](longhorn-config-guide.md) | Distributed block storage system | +| **MetalLB** | [metallb-config-guide.md](metallb-config-guide.md) | Load balancer for bare-metal clusters | +| **Sealed Secrets** | [sealed-secrets-config-guide.md](sealed-secrets-config-guide.md) | GitOps-friendly secret encryption | +| **Velero** | [velero-config-guide.md](velero-config-guide.md) | Backup and disaster recovery | + +### Observability Stack + +| Component | Guide | Description | +|-----------|-------|-------------| +| **Kube-Prometheus-Stack** | [kube-prometheus-stack-config-guide.md](kube-prometheus-stack-config-guide.md) | Complete monitoring with Prometheus, Grafana, Alertmanager | +| **Loki** | [loki-config-guide.md](loki-config-guide.md) | Log aggregation and storage system | +| **Tempo** | [tempo-config-guide.md](tempo-config-guide.md) | Distributed tracing backend | +| **OpenTelemetry** | [opentelemetry-kube-stack-config-guide.md](opentelemetry-kube-stack-config-guide.md) | Unified observability data collection | + +## Guide Structure + +Each configuration guide follows a consistent structure: + +### 1. Overview +Brief description of the service and its role in the Kubernetes cluster. + +### 2. Key Configuration Choices +Detailed examples of important configuration options with explanations of why specific choices were made. + +### 3. Common Pitfalls +Description of frequently encountered issues, their causes, and step-by-step solutions with verification commands. + +### 4. Required Secrets +Documentation of all secrets required by the service, including field descriptions and examples. + +### 5. Verification +Commands to verify the service is running correctly and functioning as expected. + +### 6. Usage Examples +Practical examples of common use cases and configuration patterns. + +## Templates + +### Service Documentation Templates + +| Template | Purpose | Location | +|----------|---------|----------| +| **Service README Template** | Base template for service README files | [templates/service-readme-template.md](templates/service-readme-template.md) | +| **Configuration Guide Template** | Template for detailed configuration guides | [templates/service-config-guide-template.md](templates/service-config-guide-template.md) | +| **Service Standards Template** | Template for service standards documentation | [templates/service-standards-template.md](templates/service-standards-template.md) | + +## Getting Started + +1. **Choose Your Service**: Select the service you want to configure from the tables above +2. **Read the Configuration Guide**: Follow the detailed configuration examples and explanations +3. **Implement Configuration**: Apply the configurations to your cluster with appropriate customizations +4. **Verify Deployment**: Use the verification steps to ensure the service is working correctly +5. 
**Troubleshoot Issues**: Refer to the common pitfalls section if you encounter problems + +## Best Practices + +### Configuration Management +- Use GitOps principles for all configuration changes +- Store sensitive data in encrypted secrets (Sealed Secrets or SOPS) +- Implement proper resource limits and requests +- Follow security best practices for each service + +### Monitoring and Observability +- Enable monitoring for all services using the observability stack +- Set up appropriate alerts for service health and performance +- Implement proper logging and tracing for troubleshooting + +### Security +- Follow the principle of least privilege for RBAC +- Use network policies to restrict traffic between services +- Regularly update services and scan for vulnerabilities +- Implement proper backup and disaster recovery procedures + +### Maintenance +- Regularly review and update configurations +- Test backup and restore procedures +- Monitor resource usage and scale as needed +- Keep documentation up to date with configuration changes + +## Contributing + +When adding new services or updating existing ones: + +1. Use the appropriate template from the `templates/` directory +2. Follow the established structure and formatting +3. Include comprehensive examples and troubleshooting information +4. Test all configuration examples before documenting them +5. Update this README to include the new service + +## Support + +For service-specific issues: +1. Check the relevant configuration guide for troubleshooting steps +2. Review the service's upstream documentation +3. Check the service logs and Kubernetes events +4. Consult the observability dashboards for metrics and alerts + +For platform-wide issues, refer to the main [README](../README.md) and service standards documentation. \ No newline at end of file diff --git a/docs/cert-manager-config-guide.md b/docs/cert-manager-config-guide.md new file mode 100644 index 0000000..2317001 --- /dev/null +++ b/docs/cert-manager-config-guide.md @@ -0,0 +1,194 @@ +# Cert-manager Configuration Guide + +## Overview +Cert-manager automates the management and issuance of TLS certificates from various issuing sources. It ensures certificates are valid and up-to-date, and attempts to renew certificates at a configured time before expiry. 
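+
+Before wiring up the ACME issuers below, it can help to smoke-test the installation with a self-signed issuer, which needs no external dependencies. A minimal sketch using only upstream cert-manager APIs (the issuer name is illustrative, not a platform default):
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: selfsigned-test   # illustrative name, not a platform default
+spec:
+  selfSigned: {}          # issues untrusted certificates; suitable for smoke tests only
+```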
+
+## Key Configuration Choices
+
+### Certificate Issuers
+```yaml
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-prod
+spec:
+  acme:
+    server: https://acme-v02.api.letsencrypt.org/directory
+    email: admin@example.com
+    privateKeySecretRef:
+      name: letsencrypt-prod
+    solvers:
+    - http01:
+        ingress:
+          class: nginx
+```
+**Why**:
+- ClusterIssuer allows certificate issuance across all namespaces
+- Let's Encrypt provides free, automated certificates
+- HTTP01 challenge works with most ingress controllers
+
+### DNS Challenge Configuration
+```yaml
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-dns
+spec:
+  acme:
+    server: https://acme-v02.api.letsencrypt.org/directory
+    email: admin@example.com
+    privateKeySecretRef:
+      name: letsencrypt-dns
+    solvers:
+    - dns01:
+        cloudflare:
+          email: admin@example.com
+          apiTokenSecretRef:
+            name: cloudflare-api-token
+            key: api-token
+```
+**Why**: DNS01 challenges enable wildcard certificates and work behind firewalls
+
+### Certificate Resource
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: example-tls
+  namespace: default
+spec:
+  secretName: example-tls
+  issuerRef:
+    name: letsencrypt-prod
+    kind: ClusterIssuer
+  dnsNames:
+  - example.com
+  - www.example.com
+```
+**Why**: Explicit certificate management provides fine-grained control over certificate lifecycle
+
+## Common Pitfalls
+
+### Certificate Stuck in Pending State
+**Problem**: Certificate remains in pending state and is never issued
+
+**Solution**: Check the CertificateRequest and Order resources for detailed error messages
+
+**Verification**:
+```bash
+kubectl describe certificate <certificate-name> -n <namespace>
+kubectl get certificaterequest -n <namespace>
+kubectl describe order <order-name> -n <namespace>
+```
+
+### HTTP01 Challenge Failures
+**Problem**: ACME HTTP01 challenges fail due to ingress misconfiguration
+
+**Solution**: Ensure ingress controller can route /.well-known/acme-challenge/ paths to cert-manager solver pods
+
+### Rate Limiting Issues
+**Problem**: Let's Encrypt rate limits prevent certificate issuance
+
+**Solution**: Use staging environment for testing, implement proper retry logic
+
+```bash
+# Check rate limit status
+kubectl logs -n cert-manager deployment/cert-manager | grep "rate limit"
+```
+
+## Required Secrets
+
+### DNS Provider API Tokens
+For DNS01 challenges, API tokens for your DNS provider are required
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: cloudflare-api-token
+  namespace: cert-manager
+type: Opaque
+stringData:
+  api-token: your-cloudflare-api-token
+```
+
+**Key Fields**:
+- `api-token`: Cloudflare API token with Zone:Read and DNS:Edit permissions (required)
+
+### ACME Account Private Key
+Automatically generated but can be pre-created for account portability
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: letsencrypt-prod
+  namespace: cert-manager
+type: Opaque
+data:
+  tls.key: <base64-encoded-private-key>
+```
+
+**Key Fields**:
+- `tls.key`: ACME account private key (automatically generated if not provided)
+
+## Verification
+```bash
+# Check cert-manager pods are running
+kubectl get pods -n cert-manager
+
+# Verify ClusterIssuer is ready
+kubectl get clusterissuer
+
+# Check certificate status
+kubectl get certificates -A
+
+# View certificate details
+kubectl describe certificate <certificate-name> -n <namespace>
+```
+
+## Usage Examples
+
+### Automatic Certificate with Ingress Annotations
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: example-ingress
+  annotations:
+    
cert-manager.io/cluster-issuer: letsencrypt-prod +spec: + tls: + - hosts: + - example.com + secretName: example-tls + rules: + - host: example.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: example-service + port: + number: 80 +``` + +### Wildcard Certificate +```yaml +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: wildcard-example-com +spec: + secretName: wildcard-example-com-tls + issuerRef: + name: letsencrypt-dns + kind: ClusterIssuer + dnsNames: + - "*.example.com" + - example.com +``` + +Certificate renewal is automatic and occurs when certificates are within 30 days of expiry. Monitor certificate expiry dates and renewal events through Prometheus metrics and Kubernetes events. \ No newline at end of file diff --git a/docs/harbor-config-guide.md b/docs/harbor-config-guide.md new file mode 100644 index 0000000..6034095 --- /dev/null +++ b/docs/harbor-config-guide.md @@ -0,0 +1,175 @@ +# Harbor Configuration Guide + +## Overview +Harbor is an open-source container registry that secures artifacts with policies and role-based access control, ensures images are scanned and free from vulnerabilities, and signs images as trusted. + +## Key Configuration Choices + +### Database Configuration +```yaml +database: + type: external + external: + host: postgres-cluster + port: 5432 + username: harbor + password: + coreDatabase: registry + notaryServerDatabase: notaryserver + notarySignerDatabase: notarysigner +``` +**Why**: +- External database provides better scalability and backup options +- Separate databases for different components improve isolation +- PostgreSQL offers better performance than internal database + +### Storage Backend Configuration +```yaml +persistence: + enabled: true + resourcePolicy: "keep" + persistentVolumeClaim: + registry: + storageClass: "longhorn" + size: 100Gi + chartmuseum: + storageClass: "longhorn" + size: 10Gi +``` +**Why**: Persistent storage ensures registry data survives pod restarts and provides reliable artifact storage + +### Ingress and TLS Configuration +```yaml +expose: + type: ingress + tls: + enabled: true + certSource: secret + secret: + secretName: harbor-tls + ingress: + hosts: + core: harbor.example.com + className: nginx + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "true" + nginx.ingress.kubernetes.io/proxy-body-size: "0" +``` +**Why**: Ingress provides external access with proper TLS termination and large file upload support + +## Common Pitfalls + +### Image Push/Pull Failures +**Problem**: Docker push/pull operations fail with authentication or network errors + +**Solution**: Verify Harbor is accessible, credentials are correct, and proxy settings allow large uploads + +**Verification**: +```bash +# Test Harbor connectivity +curl -k https://harbor.example.com/api/v2.0/systeminfo + +# Test Docker login +docker login harbor.example.com + +# Check Harbor core logs +kubectl logs -n harbor deployment/harbor-core +``` + +### Storage Space Issues +**Problem**: Registry runs out of storage space causing push failures + +**Solution**: Monitor storage usage, implement garbage collection policies, and expand storage as needed + +### Vulnerability Scanning Not Working +**Problem**: Trivy scanner fails to update vulnerability database or scan images + +**Solution**: Ensure internet connectivity for vulnerability database updates and check scanner configuration + +```bash +# Check Trivy scanner logs +kubectl logs -n harbor deployment/harbor-trivy + +# Manually trigger vulnerability 
database update +kubectl exec -n harbor deployment/harbor-trivy -- trivy image --download-db-only +``` + +## Required Secrets + +### Database Credentials +Harbor requires database credentials for PostgreSQL connection + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: harbor-database + namespace: harbor +type: Opaque +stringData: + password: your-database-password +``` + +**Key Fields**: +- `password`: PostgreSQL password for Harbor database user (required) + +### Harbor Admin Credentials +Initial admin user credentials for Harbor + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: harbor-admin + namespace: harbor +type: Opaque +stringData: + password: your-admin-password +``` + +**Key Fields**: +- `password`: Harbor admin user password (required) + +## Verification +```bash +# Check Harbor pods are running +kubectl get pods -n harbor + +# Verify Harbor services +kubectl get svc -n harbor + +# Check Harbor ingress +kubectl get ingress -n harbor + +# Test Harbor API +curl -k https://harbor.example.com/api/v2.0/systeminfo +``` + +## Usage Examples + +### Push Image to Harbor +```bash +# Tag image for Harbor +docker tag myapp:latest harbor.example.com/library/myapp:latest + +# Login to Harbor +docker login harbor.example.com + +# Push image +docker push harbor.example.com/library/myapp:latest +``` + +### Create Harbor Project via API +```bash +# Create new project +curl -X POST "https://harbor.example.com/api/v2.0/projects" \ + -H "Content-Type: application/json" \ + -u "admin:password" \ + -d '{ + "project_name": "myproject", + "public": false, + "storage_limit": -1 + }' +``` + +Harbor provides comprehensive container registry capabilities with security scanning, content trust, and role-based access control. Regular maintenance includes garbage collection, vulnerability database updates, and monitoring storage usage. \ No newline at end of file diff --git a/docs/keycloak-config-guide.md b/docs/keycloak-config-guide.md new file mode 100644 index 0000000..3d82d77 --- /dev/null +++ b/docs/keycloak-config-guide.md @@ -0,0 +1,192 @@ +# Keycloak Configuration Guide + +## Overview +Keycloak is an open-source identity and access management solution that provides authentication, authorization, and single sign-on capabilities for applications and services. 
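+
+As a quick smoke test of a realm, a `KeycloakRealmImport` can seed a throwaway user to log in with. A minimal sketch (the realm and user names are illustrative, not platform defaults):
+
+```yaml
+apiVersion: k8s.keycloak.org/v2alpha1
+kind: KeycloakRealmImport
+metadata:
+  name: demo-realm
+spec:
+  keycloakCRName: keycloak
+  realm:
+    realm: demo
+    enabled: true
+    users:
+      - username: testuser          # throwaway account for login testing only
+        enabled: true
+        credentials:
+          - type: password
+            value: change-me        # rotate or delete after verifying the login flow
+```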
+ +## Key Configuration Choices + +### Database Configuration +```yaml +apiVersion: k8s.keycloak.org/v2alpha1 +kind: Keycloak +metadata: + name: keycloak +spec: + instances: 2 + db: + vendor: postgres + host: postgres-cluster + usernameSecret: + name: keycloak-db-secret + key: username + passwordSecret: + name: keycloak-db-secret + key: password +``` +**Why**: +- External PostgreSQL provides better performance and scalability +- Multiple instances ensure high availability +- Separate database credentials improve security + +### Hostname and TLS Configuration +```yaml +spec: + hostname: + hostname: auth.example.com + strict: true + strictBackchannel: true + http: + tlsSecret: keycloak-tls +``` +**Why**: Proper hostname configuration ensures correct redirect URIs and prevents security issues + +### Realm and Client Configuration +```yaml +apiVersion: k8s.keycloak.org/v2alpha1 +kind: KeycloakRealmImport +metadata: + name: opencenter-realm +spec: + keycloakCRName: keycloak + realm: + realm: opencenter + enabled: true + clients: + - clientId: headlamp + enabled: true + protocol: openid-connect + publicClient: false + redirectUris: + - "https://headlamp.example.com/oidc-callback" + webOrigins: + - "https://headlamp.example.com" +``` +**Why**: Realm imports provide declarative configuration management for clients and users + +## Common Pitfalls + +### Database Connection Issues +**Problem**: Keycloak fails to start due to database connectivity problems + +**Solution**: Verify PostgreSQL is running, credentials are correct, and network policies allow connection + +**Verification**: +```bash +# Check Keycloak pod logs +kubectl logs -n keycloak deployment/keycloak + +# Test database connectivity +kubectl exec -n keycloak deployment/keycloak -- pg_isready -h postgres-cluster + +# Verify database secret +kubectl get secret -n keycloak keycloak-db-secret -o yaml +``` + +### OIDC Client Configuration Errors +**Problem**: Applications fail to authenticate with "invalid redirect URI" errors + +**Solution**: Ensure redirect URIs in client configuration exactly match the application's callback URLs + +### Theme and Customization Issues +**Problem**: Custom themes not loading or displaying incorrectly + +**Solution**: Verify theme files are properly mounted and CSS/JavaScript resources are accessible + +```bash +# Check theme files in Keycloak pod +kubectl exec -n keycloak deployment/keycloak -- ls -la /opt/keycloak/themes/ + +# Verify theme configuration +kubectl logs -n keycloak deployment/keycloak | grep -i theme +``` + +## Required Secrets + +### Database Credentials +Keycloak requires database credentials for PostgreSQL connection + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: keycloak-db-secret + namespace: keycloak +type: Opaque +stringData: + username: keycloak + password: your-database-password +``` + +**Key Fields**: +- `username`: PostgreSQL username for Keycloak (required) +- `password`: PostgreSQL password for Keycloak user (required) + +### Admin Credentials +Initial admin user credentials for Keycloak + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: keycloak-admin + namespace: keycloak +type: Opaque +stringData: + username: admin + password: your-admin-password +``` + +**Key Fields**: +- `username`: Keycloak admin username (required) +- `password`: Keycloak admin password (required) + +## Verification +```bash +# Check Keycloak pods are running +kubectl get pods -n keycloak + +# Verify Keycloak custom resource status +kubectl get keycloak -n keycloak + +# Check Keycloak 
service +kubectl get svc -n keycloak + +# Test Keycloak admin console +curl -k https://auth.example.com/admin/ +``` + +## Usage Examples + +### Create OIDC Client for Application +```bash +# Access Keycloak admin console +# Navigate to Clients -> Create Client +# Configure client settings: +# - Client ID: myapp +# - Client Protocol: openid-connect +# - Access Type: confidential +# - Valid Redirect URIs: https://myapp.example.com/callback +``` + +### Configure User Federation +```yaml +apiVersion: k8s.keycloak.org/v2alpha1 +kind: KeycloakRealmImport +metadata: + name: ldap-federation +spec: + keycloakCRName: keycloak + realm: + realm: opencenter + components: + org.keycloak.storage.UserStorageProvider: + - name: "ldap" + providerId: "ldap" + config: + connectionUrl: ["ldap://ldap.example.com:389"] + usersDn: ["ou=users,dc=example,dc=com"] + bindDn: ["cn=admin,dc=example,dc=com"] + bindCredential: ["admin-password"] +``` + +Keycloak provides comprehensive identity management with support for multiple authentication protocols, user federation, and extensive customization options. Regular maintenance includes monitoring user sessions, updating security policies, and backing up realm configurations. \ No newline at end of file diff --git a/docs/kube-prometheus-stack-config-guide.md b/docs/kube-prometheus-stack-config-guide.md new file mode 100644 index 0000000..72c224d --- /dev/null +++ b/docs/kube-prometheus-stack-config-guide.md @@ -0,0 +1,241 @@ +# Kube-Prometheus-Stack Configuration Guide + +## Overview +Kube-Prometheus-Stack provides a complete monitoring solution with Prometheus, Grafana, Alertmanager, and related components for Kubernetes cluster and application monitoring. + +## Key Configuration Choices + +### Prometheus Configuration +```yaml +prometheus: + prometheusSpec: + retention: 30d + retentionSize: 50GB + storageSpec: + volumeClaimTemplate: + spec: + storageClassName: longhorn + resources: + requests: + storage: 100Gi + resources: + requests: + memory: 2Gi + cpu: 1000m + limits: + memory: 4Gi + cpu: 2000m +``` +**Why**: +- Persistent storage ensures metrics survive pod restarts +- Retention policies manage storage usage and costs +- Resource limits prevent memory issues in large clusters + +### Grafana Configuration +```yaml +grafana: + persistence: + enabled: true + storageClassName: longhorn + size: 10Gi + adminPassword: + grafana.ini: + server: + root_url: https://grafana.example.com + auth.generic_oauth: + enabled: true + name: Keycloak + client_id: grafana + client_secret: + auth_url: https://auth.example.com/realms/opencenter/protocol/openid-connect/auth + token_url: https://auth.example.com/realms/opencenter/protocol/openid-connect/token +``` +**Why**: OIDC integration provides centralized authentication and persistent storage preserves dashboards and settings + +### Alertmanager Configuration +```yaml +alertmanager: + alertmanagerSpec: + storage: + volumeClaimTemplate: + spec: + storageClassName: longhorn + resources: + requests: + storage: 10Gi + config: + global: + smtp_smarthost: 'smtp.example.com:587' + smtp_from: 'alerts@example.com' + route: + group_by: ['alertname', 'cluster'] + group_wait: 10s + group_interval: 10s + repeat_interval: 1h + receiver: 'web.hook' + receivers: + - name: 'web.hook' + email_configs: + - to: 'admin@example.com' + subject: 'Alert: {{ .GroupLabels.alertname }}' +``` +**Why**: Persistent storage maintains alert state and SMTP configuration enables email notifications + +## Common Pitfalls + +### High Memory Usage +**Problem**: Prometheus 
consumes excessive memory causing OOM kills + +**Solution**: Tune retention settings, increase memory limits, or implement recording rules to reduce cardinality + +**Verification**: +```bash +# Check Prometheus memory usage +kubectl top pod -n observability -l app.kubernetes.io/name=prometheus + +# Review Prometheus metrics +kubectl port-forward -n observability svc/prometheus-operated 9090:9090 +# Access http://localhost:9090/metrics +``` + +### Missing Metrics +**Problem**: Expected metrics are not appearing in Prometheus + +**Solution**: Verify ServiceMonitor selectors match service labels and check scrape configuration + +### Grafana Dashboard Issues +**Problem**: Dashboards show no data or incorrect visualizations + +**Solution**: Verify data source configuration and check Prometheus query syntax + +```bash +# Check Grafana logs +kubectl logs -n observability deployment/grafana + +# Verify data source connectivity +kubectl exec -n observability deployment/grafana -- curl -s http://prometheus-operated:9090/api/v1/query?query=up +``` + +## Required Secrets + +### Grafana Admin Password +Grafana requires an admin password for initial setup + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: grafana-admin + namespace: observability +type: Opaque +stringData: + admin-password: your-secure-password +``` + +**Key Fields**: +- `admin-password`: Grafana admin user password (required) + +### OIDC Client Secret +For Grafana OIDC authentication + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: grafana-oidc + namespace: observability +type: Opaque +stringData: + client-secret: your-oidc-client-secret +``` + +**Key Fields**: +- `client-secret`: OIDC client secret for Grafana authentication (required for OIDC) + +## Verification +```bash +# Check all monitoring pods +kubectl get pods -n observability + +# Verify Prometheus targets +kubectl port-forward -n observability svc/prometheus-operated 9090:9090 +# Access http://localhost:9090/targets + +# Check Grafana access +kubectl port-forward -n observability svc/grafana 3000:80 +# Access http://localhost:3000 + +# Verify Alertmanager +kubectl port-forward -n observability svc/alertmanager-operated 9093:9093 +``` + +## Usage Examples + +### Custom ServiceMonitor +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: myapp-metrics + namespace: observability +spec: + selector: + matchLabels: + app: myapp + endpoints: + - port: metrics + interval: 30s + path: /metrics +``` + +### Custom PrometheusRule +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: myapp-alerts + namespace: observability +spec: + groups: + - name: myapp.rules + rules: + - alert: MyAppDown + expr: up{job="myapp"} == 0 + for: 5m + labels: + severity: critical + annotations: + summary: "MyApp is down" + description: "MyApp has been down for more than 5 minutes" +``` + +### Grafana Dashboard ConfigMap +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: myapp-dashboard + namespace: observability + labels: + grafana_dashboard: "1" +data: + myapp-dashboard.json: | + { + "dashboard": { + "title": "MyApp Dashboard", + "panels": [ + { + "title": "Request Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total[5m])" + } + ] + } + ] + } + } +``` + +The Kube-Prometheus-Stack provides comprehensive monitoring capabilities. Start with default configurations and gradually customize based on your specific monitoring requirements and resource constraints. 
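+
+One knob mentioned under the pitfalls above deserves an example: recording rules precompute expensive queries so dashboards and alerts stay cheap as cardinality grows. A hedged sketch (rule and label names are illustrative):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: recording-rules
+  namespace: observability
+spec:
+  groups:
+    - name: namespace.rules
+      rules:
+        - record: namespace:http_requests:rate5m   # precomputed series, cheap to query
+          expr: sum by (namespace) (rate(http_requests_total[5m]))
+```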
\ No newline at end of file
diff --git a/docs/kyverno-config-guide.md b/docs/kyverno-config-guide.md
new file mode 100644
index 0000000..84fec9d
--- /dev/null
+++ b/docs/kyverno-config-guide.md
@@ -0,0 +1,238 @@
+# Kyverno Configuration Guide
+
+## Overview
+Kyverno is a Kubernetes-native policy engine that validates, mutates, and generates configurations using policies defined as Kubernetes resources.
+
+## Key Configuration Choices
+
+### Policy Validation Configuration
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: require-labels
+spec:
+  validationFailureAction: enforce
+  background: true
+  rules:
+  - name: check-labels
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    validate:
+      message: "Required labels are missing"
+      pattern:
+        metadata:
+          labels:
+            app.kubernetes.io/name: "?*"
+            app.kubernetes.io/version: "?*"
+```
+**Why**:
+- Validation policies enforce compliance and best practices
+- Background scanning evaluates existing resources
+- Pattern matching provides flexible validation rules
+
+### Resource Mutation Configuration
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: add-security-context
+spec:
+  rules:
+  - name: add-security-context
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    mutate:
+      patchStrategicMerge:
+        spec:
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1000
+```
+**Why**: Mutation policies automatically apply security configurations and reduce manual configuration overhead
+
+### Resource Generation Configuration
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: generate-network-policy
+spec:
+  rules:
+  - name: generate-netpol
+    match:
+      any:
+      - resources:
+          kinds:
+          - Namespace
+    generate:
+      apiVersion: networking.k8s.io/v1
+      kind: NetworkPolicy
+      name: default-deny
+      namespace: "{{request.object.metadata.name}}"
+      data:
+        spec:
+          podSelector: {}
+          policyTypes:
+          - Ingress
+          - Egress
+```
+**Why**: Generation policies automatically create supporting resources and ensure consistent configurations
+
+## Common Pitfalls
+
+### Policy Conflicts and Ordering
+**Problem**: Multiple policies conflict or produce unexpected results due to execution order
+
+**Solution**: Use policy priorities and careful rule design to avoid conflicts
+
+**Verification**:
+```bash
+# Check policy reports for conflicts
+kubectl get policyreport -A
+
+# Review policy execution order
+kubectl describe clusterpolicy <policy-name>
+
+# Check admission controller logs
+kubectl logs -n kyverno -l app.kubernetes.io/component=admission-controller
+```
+
+### Background Scanning Performance
+**Problem**: Background scanning consumes excessive resources or causes performance issues
+
+**Solution**: Tune background scanning settings and use resource filters to limit scope
+
+### Webhook Failures
+**Problem**: Admission webhook failures block resource creation
+
+**Solution**: Configure failure policies and ensure webhook availability
+
+```bash
+# Check webhook configuration
+kubectl get validatingwebhookconfigurations | grep kyverno
+
+# Verify webhook endpoints
+kubectl get endpoints -n kyverno
+
+# Test webhook connectivity
+kubectl logs -n kyverno -l app.kubernetes.io/component=admission-controller
+```
+
+## Required Secrets
+
+### Webhook TLS Certificates
+Kyverno automatically manages webhook certificates
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: kyverno-svc.kyverno.svc.kyverno-tls-pair
+  namespace: kyverno
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-certificate>
+  tls.key: <base64-encoded-private-key>
+```
+
+**Key Fields**:
+- `tls.crt`: TLS certificate for webhook server (automatically generated)
+- `tls.key`: TLS private key for webhook server (automatically generated)
+
+### Image Registry Credentials
+For image verification policies, registry credentials may be required
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: registry-creds
+  namespace: kyverno
+type: kubernetes.io/dockerconfigjson
+data:
+  .dockerconfigjson: <base64-encoded-docker-config>
+```
+
+**Key Fields**:
+- `.dockerconfigjson`: Docker registry credentials (required for private registries)
+
+## Verification
+```bash
+# Check Kyverno pods are running
+kubectl get pods -n kyverno
+
+# Verify cluster policies
+kubectl get clusterpolicy
+
+# Check policy reports
+kubectl get policyreport -A
+
+# View policy violations
+kubectl describe policyreport <report-name> -n <namespace>
+```
+
+## Usage Examples
+
+### Pod Security Standards Policy
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: pod-security-standards
+spec:
+  validationFailureAction: enforce
+  rules:
+  - name: check-security-context
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    validate:
+      message: "Containers must run as non-root"
+      pattern:
+        spec:
+          securityContext:
+            runAsNonRoot: true
+          containers:
+          - securityContext:
+              allowPrivilegeEscalation: false
+              capabilities:
+                drop:
+                - ALL
+```
+
+### Image Verification Policy
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: verify-images
+spec:
+  validationFailureAction: enforce
+  rules:
+  - name: verify-signature
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    verifyImages:
+    - imageReferences:
+      - "registry.example.com/*"
+      attestors:
+      - entries:
+        - keys:
+            publicKeys: |-
+              -----BEGIN PUBLIC KEY-----
+              <public-key-data>
+              -----END PUBLIC KEY-----
+```
+
+Kyverno provides powerful policy management capabilities for Kubernetes. Start with simple validation policies and gradually implement more complex mutation and generation rules as needed.
\ No newline at end of file
diff --git a/docs/loki-config-guide.md b/docs/loki-config-guide.md
new file mode 100644
index 0000000..b054326
--- /dev/null
+++ b/docs/loki-config-guide.md
@@ -0,0 +1,224 @@
+# Loki Configuration Guide
+
+## Overview
+Loki is a horizontally-scalable, highly-available log aggregation system designed to store and query logs from all your applications and infrastructure.
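+
+For a small or test cluster, the upstream `grafana/loki` Helm chart can be collapsed to a single binary before adopting the scalable layout shown below; a minimal values sketch (assuming chart 6.x deployment-mode semantics, with illustrative sizing):
+
+```yaml
+deploymentMode: SingleBinary   # one process instead of separate read/write/backend targets
+loki:
+  commonConfig:
+    replication_factor: 1      # no replication in single-binary mode
+  storage:
+    type: filesystem           # local disk instead of object storage; for testing only
+singleBinary:
+  replicas: 1
+```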
+ +## Key Configuration Choices + +### Storage Configuration +```yaml +loki: + storage: + type: s3 + bucketNames: + chunks: loki-chunks + ruler: loki-ruler + admin: loki-admin + s3: + endpoint: s3.amazonaws.com + region: us-east-1 + accessKeyId: + secretAccessKey: + s3ForcePathStyle: false +``` +**Why**: +- Object storage provides cost-effective, scalable log storage +- Separate buckets for different data types improve organization +- S3-compatible storage offers flexibility across cloud providers + +### Retention and Limits Configuration +```yaml +loki: + limits_config: + retention_period: 30d + ingestion_rate_mb: 10 + ingestion_burst_size_mb: 20 + max_query_parallelism: 32 + max_streams_per_user: 10000 + max_line_size: 256KB + compactor: + retention_enabled: true + retention_delete_delay: 2h + retention_delete_worker_count: 150 +``` +**Why**: Retention policies manage storage costs and query limits prevent resource exhaustion + +### Multi-tenancy Configuration +```yaml +loki: + auth_enabled: true + server: + http_listen_port: 3100 + grpc_listen_port: 9095 + distributor: + ring: + kvstore: + store: memberlist + memberlist: + join_members: + - loki-memberlist +``` +**Why**: Multi-tenancy provides isolation between different teams or applications + +## Common Pitfalls + +### High Cardinality Labels +**Problem**: Too many unique label combinations cause performance issues and high storage costs + +**Solution**: Use structured logging and limit labels to low-cardinality values like service, environment, and level + +**Verification**: +```bash +# Check label cardinality +kubectl exec -n observability deployment/loki-querier -- \ + wget -qO- 'http://localhost:3100/loki/api/v1/label' + +# Monitor ingestion rate +kubectl logs -n observability deployment/loki-distributor | grep "ingestion rate" +``` + +### Query Performance Issues +**Problem**: LogQL queries are slow or time out + +**Solution**: Use proper time ranges, label filters, and avoid regex operations on large datasets + +### Storage Backend Issues +**Problem**: Loki cannot write to or read from object storage + +**Solution**: Verify storage credentials, bucket permissions, and network connectivity + +```bash +# Check Loki ingester logs +kubectl logs -n observability deployment/loki-ingester + +# Verify storage configuration +kubectl exec -n observability deployment/loki-querier -- \ + wget -qO- 'http://localhost:3100/ready' + +# Test object storage connectivity +kubectl exec -n observability deployment/loki-ingester -- \ + aws s3 ls s3://loki-chunks/ --region us-east-1 +``` + +## Required Secrets + +### Object Storage Credentials +Loki requires credentials for accessing object storage + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: loki-storage + namespace: observability +type: Opaque +stringData: + access-key-id: your-access-key + secret-access-key: your-secret-key +``` + +**Key Fields**: +- `access-key-id`: S3 access key ID (required) +- `secret-access-key`: S3 secret access key (required) + +### Gateway Authentication +For multi-tenant deployments, authentication credentials may be required + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: loki-gateway-auth + namespace: observability +type: Opaque +stringData: + htpasswd: | + user1:$2y$10$... + user2:$2y$10$... 
+```
+
+**Key Fields**:
+- `htpasswd`: HTTP basic auth credentials file (required for gateway auth)
+
+## Verification
+```bash
+# Check Loki pods are running
+kubectl get pods -n observability -l app.kubernetes.io/name=loki
+
+# Verify Loki services
+kubectl get svc -n observability -l app.kubernetes.io/name=loki
+
+# Test Loki API
+kubectl port-forward -n observability svc/loki 3100:3100
+curl http://localhost:3100/ready
+
+# Query logs
+curl -G -s "http://localhost:3100/loki/api/v1/query" \
+  --data-urlencode 'query={job="kubernetes-pods"}' \
+  --data-urlencode 'limit=10'
+```
+
+## Usage Examples
+
+### Query Logs with LogQL
+```bash
+# Query logs from specific namespace
+{namespace="myapp"}
+
+# Filter by log level
+{namespace="myapp"} |= "ERROR"
+
+# Rate query for error logs
+rate({namespace="myapp"} |= "ERROR" [5m])
+
+# Extract and count HTTP status codes
+sum by (status) (count_over_time({job="nginx"} | json | __error__ = "" [5m]))
+```
+
+### Configure Log Shipping with OpenTelemetry
+```yaml
+apiVersion: opentelemetry.io/v1alpha1
+kind: OpenTelemetryCollector
+metadata:
+  name: otel-collector
+spec:
+  config: |
+    receivers:
+      filelog:
+        include:
+          - /var/log/pods/*/*/*.log
+        operators:
+          - type: json_parser
+            id: parser-docker
+            output: extract_metadata_from_filepath
+    exporters:
+      loki:
+        endpoint: http://loki:3100/loki/api/v1/push
+        tenant_id: "tenant1"
+    service:
+      pipelines:
+        logs:
+          receivers: [filelog]
+          exporters: [loki]
+```
+
+### Create Grafana Data Source
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: loki-datasource
+  namespace: observability
+data:
+  datasource.yaml: |
+    apiVersion: 1
+    datasources:
+      - name: Loki
+        type: loki
+        access: proxy
+        url: http://loki:3100
+        isDefault: false
+        editable: true
+```
+
+Loki provides cost-effective log aggregation with powerful querying capabilities. Focus on proper label design and retention policies to optimize performance and storage costs.
\ No newline at end of file
diff --git a/docs/longhorn-config-guide.md b/docs/longhorn-config-guide.md
new file mode 100644
index 0000000..ed32a96
--- /dev/null
+++ b/docs/longhorn-config-guide.md
@@ -0,0 +1,242 @@
+# Longhorn Configuration Guide
+
+## Overview
+Longhorn is a distributed block storage system for Kubernetes that provides persistent storage with built-in backup, snapshot, and disaster recovery capabilities.
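+
+Since the platform also ships the external-snapshotter service, the snapshot capability mentioned above can be exercised through the standard CSI snapshot APIs; a minimal sketch (the class and PVC names are illustrative):
+
+```yaml
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshotClass
+metadata:
+  name: longhorn-snapshot        # illustrative name
+driver: driver.longhorn.io       # Longhorn's CSI driver name
+deletionPolicy: Delete
+---
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshot
+metadata:
+  name: data-snapshot
+spec:
+  volumeSnapshotClassName: longhorn-snapshot
+  source:
+    persistentVolumeClaimName: my-data-pvc   # an existing Longhorn-backed PVC
+```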
+ +## Key Configuration Choices + +### Storage Class Configuration +```yaml +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: longhorn +provisioner: driver.longhorn.io +allowVolumeExpansion: true +reclaimPolicy: Delete +volumeBindingMode: Immediate +parameters: + numberOfReplicas: "3" + staleReplicaTimeout: "2880" + fromBackup: "" + fsType: "ext4" + dataLocality: "disabled" +``` +**Why**: +- Multiple replicas provide data redundancy and high availability +- Volume expansion allows growing storage without downtime +- Configurable parameters optimize performance for different workloads + +### Backup Target Configuration +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: longhorn-backup-target + namespace: longhorn-system +type: Opaque +stringData: + AWS_ACCESS_KEY_ID: your-access-key + AWS_SECRET_ACCESS_KEY: your-secret-key + AWS_ENDPOINTS: https://s3.amazonaws.com +--- +apiVersion: longhorn.io/v1beta1 +kind: Setting +metadata: + name: backup-target + namespace: longhorn-system +spec: + value: s3://longhorn-backups@us-east-1/ +``` +**Why**: S3-compatible backup storage enables disaster recovery and cross-cluster data migration + +### Node and Disk Configuration +```yaml +apiVersion: longhorn.io/v1beta1 +kind: Setting +metadata: + name: default-data-path + namespace: longhorn-system +spec: + value: /var/lib/longhorn/ +--- +apiVersion: longhorn.io/v1beta1 +kind: Setting +metadata: + name: replica-soft-anti-affinity + namespace: longhorn-system +spec: + value: "true" +``` +**Why**: Proper data path configuration and anti-affinity rules ensure optimal storage distribution + +## Common Pitfalls + +### Volume Attachment Issues +**Problem**: Pods cannot start due to volume attachment failures + +**Solution**: Check node connectivity, iSCSI configuration, and Longhorn engine status + +**Verification**: +```bash +# Check volume status +kubectl get volumes -n longhorn-system + +# Check engine status +kubectl get engines -n longhorn-system + +# Verify node connectivity +kubectl get nodes -n longhorn-system -o wide +``` + +### Replica Scheduling Failures +**Problem**: Volumes become degraded due to replica scheduling issues + +**Solution**: Ensure sufficient storage space on nodes and check node taints/tolerations + +### Backup and Restore Issues +**Problem**: Backup operations fail or restore doesn't work + +**Solution**: Verify backup target configuration, credentials, and network connectivity + +```bash +# Check backup target settings +kubectl get setting -n longhorn-system backup-target + +# List available backups +kubectl get backups -n longhorn-system + +# Check backup job logs +kubectl logs -n longhorn-system -l app=longhorn-manager | grep backup +``` + +## Required Secrets + +### Backup Storage Credentials +For S3-compatible backup storage + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: longhorn-backup-target + namespace: longhorn-system +type: Opaque +stringData: + AWS_ACCESS_KEY_ID: your-access-key + AWS_SECRET_ACCESS_KEY: your-secret-key + AWS_ENDPOINTS: https://s3.amazonaws.com +``` + +**Key Fields**: +- `AWS_ACCESS_KEY_ID`: S3 access key ID (required for S3 backups) +- `AWS_SECRET_ACCESS_KEY`: S3 secret access key (required for S3 backups) +- `AWS_ENDPOINTS`: S3 endpoint URL (optional, defaults to AWS) + +### Registry Credentials +For private container registries + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: longhorn-registry-secret + namespace: longhorn-system +type: kubernetes.io/dockerconfigjson +data: + .dockerconfigjson: +``` + 
+**Key Fields**:
+- `.dockerconfigjson`: Docker registry credentials (required for private registries)
+
+## Verification
+```bash
+# Check Longhorn pods
+kubectl get pods -n longhorn-system
+
+# Verify Longhorn UI access
+kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
+
+# Check storage class
+kubectl get storageclass longhorn
+
+# Test volume creation
+kubectl apply -f - <<EOF
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: longhorn-test-pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  storageClassName: longhorn
+  resources:
+    requests:
+      storage: 1Gi
+EOF
+```
\ No newline at end of file
diff --git a/docs/metallb-config-guide.md b/docs/metallb-config-guide.md
new file mode 100644
--- /dev/null
+++ b/docs/metallb-config-guide.md
+# MetalLB Configuration Guide
+
+## Overview
+MetalLB is a load-balancer implementation for bare-metal Kubernetes clusters. It assigns external IP addresses to LoadBalancer services and announces them to the surrounding network using either Layer 2 (ARP/NDP) or BGP.
+
+## Common Pitfalls
+
+### External IPs Not Reachable
+**Problem**: Services get external IPs but are not accessible from outside the cluster
+
+**Solution**: Ensure network routing is configured correctly and L2Advertisement or BGPAdvertisement is properly set up
+
+### Speaker Pods Not Running
+**Problem**: MetalLB speaker pods fail to start or crash repeatedly
+
+**Solution**: Check node network configuration, security contexts, and host network access
+
+```bash
+# Check speaker pod logs
+kubectl logs -n metallb-system -l app.kubernetes.io/component=speaker
+
+# Verify speaker daemonset
+kubectl get daemonset -n metallb-system speaker
+
+# Check node network interfaces
+kubectl exec -n metallb-system <speaker-pod> -- ip addr show
+```
+
+## Required Secrets
+
+### BGP Router Passwords
+For BGP mode, router authentication passwords may be required
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: bgp-auth
+  namespace: metallb-system
+type: Opaque
+stringData:
+  password: your-bgp-password
+```
+
+**Key Fields**:
+- `password`: BGP peer authentication password (optional)
+
+### TLS Certificates
+For webhook validation, TLS certificates are automatically managed
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: webhook-server-cert
+  namespace: metallb-system
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-cert>
+  tls.key: <base64-encoded-key>
+```
+
+**Key Fields**:
+- `tls.crt`: TLS certificate for webhook server (automatically generated)
+- `tls.key`: TLS private key for webhook server (automatically generated)
+
+## Verification
+```bash
+# Check MetalLB pods are running
+kubectl get pods -n metallb-system
+
+# Verify IP address pools
+kubectl get ipaddresspool -n metallb-system
+
+# Check L2 advertisements
+kubectl get l2advertisement -n metallb-system
+
+# Test LoadBalancer service
+kubectl get svc --field-selector spec.type=LoadBalancer
+```
+
+## Usage Examples
+
+### Create LoadBalancer Service
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: nginx-lb
+spec:
+  type: LoadBalancer
+  loadBalancerIP: 192.168.1.150  # Optional: request a specific IP from the pool
+  ports:
+    - port: 80
+      targetPort: 80
+  selector:
+    app: nginx
+```
+
+### Configure Multiple IP Pools
+```yaml
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  name: production-pool
+  namespace: metallb-system
+spec:
+  addresses:
+    - 10.0.1.100-10.0.1.200
+---
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  name: development-pool
+  namespace: metallb-system
+spec:
+  addresses:
+    - 10.0.2.100-10.0.2.200
+```
+
+MetalLB provides essential LoadBalancer functionality for bare-metal and on-premises Kubernetes clusters. Choose Layer 2 mode for simple deployments or BGP mode for integration with network infrastructure and better scalability.
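+
+For reference, the BGP path mentioned above needs two resources beyond the IP pool: a peer definition and a BGP advertisement. A minimal sketch, assuming an upstream router at 10.0.0.1 and illustrative private AS numbers:
+
+```yaml
+apiVersion: metallb.io/v1beta2
+kind: BGPPeer
+metadata:
+  name: upstream-router          # illustrative name
+  namespace: metallb-system
+spec:
+  myASN: 64500                   # AS number MetalLB speaks as (illustrative)
+  peerASN: 64501                 # AS number of the upstream router (illustrative)
+  peerAddress: 10.0.0.1          # upstream router IP (illustrative)
+---
+apiVersion: metallb.io/v1beta1
+kind: BGPAdvertisement
+metadata:
+  name: production-advert
+  namespace: metallb-system
+spec:
+  ipAddressPools:
+    - production-pool            # announces the pool defined above via BGP
+```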
\ No newline at end of file diff --git a/docs/opentelemetry-kube-stack-config-guide.md b/docs/opentelemetry-kube-stack-config-guide.md new file mode 100644 index 0000000..5d85033 --- /dev/null +++ b/docs/opentelemetry-kube-stack-config-guide.md @@ -0,0 +1,306 @@ +# OpenTelemetry Kube Stack Configuration Guide + +## Overview +OpenTelemetry Kube Stack provides a complete observability framework for collecting, processing, and exporting telemetry data (metrics, logs, and traces) from Kubernetes workloads and infrastructure. + +## Key Configuration Choices + +### Collector Configuration +```yaml +apiVersion: opentelemetry.io/v1alpha1 +kind: OpenTelemetryCollector +metadata: + name: otel-collector +spec: + mode: daemonset + config: | + receivers: + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 + k8s_cluster: + auth_type: serviceAccount + kubeletstats: + collection_interval: 20s + auth_type: serviceAccount + endpoint: ${env:K8S_NODE_NAME}:10250 + insecure_skip_verify: true + processors: + batch: + timeout: 1s + send_batch_size: 1024 + resource: + attributes: + - key: cluster.name + value: my-cluster + action: upsert + exporters: + otlp/tempo: + endpoint: http://tempo-distributor:4317 + tls: + insecure: true + prometheus: + endpoint: "0.0.0.0:8889" + loki: + endpoint: http://loki-distributor:3100/loki/api/v1/push + service: + pipelines: + traces: + receivers: [otlp] + processors: [batch, resource] + exporters: [otlp/tempo] + metrics: + receivers: [otlp, k8s_cluster, kubeletstats] + processors: [batch, resource] + exporters: [prometheus] + logs: + receivers: [otlp] + processors: [batch, resource] + exporters: [loki] +``` +**Why**: +- DaemonSet mode ensures telemetry collection from all nodes +- Multiple receivers support different telemetry sources +- Processors enable data transformation and enrichment +- Multiple exporters support different backend systems + +### Auto-Instrumentation Configuration +```yaml +apiVersion: opentelemetry.io/v1alpha1 +kind: Instrumentation +metadata: + name: default-instrumentation +spec: + exporter: + endpoint: http://otel-collector:4317 + propagators: + - tracecontext + - baggage + - b3 + sampler: + type: parentbased_traceidratio + argument: "0.1" # 10% sampling + java: + image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest + nodejs: + image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest + python: + image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest +``` +**Why**: Auto-instrumentation reduces manual instrumentation effort and ensures consistent telemetry collection + +### Target Allocator Configuration +```yaml +apiVersion: opentelemetry.io/v1alpha1 +kind: OpenTelemetryCollector +metadata: + name: otel-collector-statefulset +spec: + mode: statefulset + replicas: 3 + targetAllocator: + enabled: true + serviceAccount: opentelemetry-targetallocator-sa + prometheusCR: + enabled: true + config: | + receivers: + prometheus: + config: + scrape_configs: + - job_name: 'otel-collector' + scrape_interval: 10s + static_configs: + - targets: ['0.0.0.0:8888'] +``` +**Why**: Target allocator distributes Prometheus scraping targets across multiple collector instances + +## Common Pitfalls + +### High Resource Usage +**Problem**: OpenTelemetry collectors consume excessive CPU or memory + +**Solution**: Tune batch processors, implement sampling, and scale collectors horizontally + +**Verification**: +```bash +# Check collector resource usage 
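+# (namespace and labels assume the kube stack defaults; if usage stays high,
+#  adding a memory_limiter processor first in each pipeline usually helps)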
+kubectl top pod -n observability -l app.kubernetes.io/name=opentelemetry-collector
+
+# Monitor collector metrics
+kubectl port-forward -n observability svc/otel-collector 8888:8888
+curl http://localhost:8888/metrics | grep otelcol_processor
+```
+
+### Data Export Failures
+**Problem**: Telemetry data is not reaching backend systems
+
+**Solution**: Verify exporter configuration, network connectivity, and backend availability
+
+### Auto-Instrumentation Not Working
+**Problem**: Applications are not automatically instrumented
+
+**Solution**: Check instrumentation resource configuration and pod annotations
+
+```bash
+# Check instrumentation status
+kubectl describe instrumentation default-instrumentation
+
+# Verify pod annotations
+kubectl get pod <pod-name> -o yaml | grep -A 5 -B 5 instrumentation
+
+# Check operator logs
+kubectl logs -n opentelemetry-operator-system deployment/opentelemetry-operator-controller-manager
+```
+
+## Required Secrets
+
+### Backend Credentials
+For authenticated backends, credentials may be required
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: otel-backend-creds
+  namespace: observability
+type: Opaque
+stringData:
+  api-key: your-backend-api-key
+  endpoint: https://api.backend.com
+```
+
+**Key Fields**:
+- `api-key`: Backend API key for authentication (if required)
+- `endpoint`: Backend endpoint URL (if required)
+
+### TLS Certificates
+For secure communication with backends
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: otel-tls-certs
+  namespace: observability
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-cert>
+  tls.key: <base64-encoded-key>
+  ca.crt: <base64-encoded-ca-cert>
+```
+
+**Key Fields**:
+- `tls.crt`: Client certificate for mTLS (if required)
+- `tls.key`: Client private key for mTLS (if required)
+- `ca.crt`: CA certificate for backend verification (if required)
+
+## Verification
+```bash
+# Check OpenTelemetry operator
+kubectl get pods -n opentelemetry-operator-system
+
+# Verify collector instances
+kubectl get opentelemetrycollector -n observability
+
+# Check instrumentation resources
+kubectl get instrumentation -n observability
+
+# Test collector endpoints
+kubectl port-forward -n observability svc/otel-collector 4317:4317
+# Send a test trace over OTLP (e.g., with the telemetrygen test tool)
+```
+
+## Usage Examples
+
+### Enable Auto-Instrumentation for Application
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: myapp
+spec:
+  template:
+    metadata:
+      annotations:
+        instrumentation.opentelemetry.io/inject-java: "true"
+        instrumentation.opentelemetry.io/container-names: "myapp"
+    spec:
+      containers:
+        - name: myapp
+          image: myapp:latest
+          env:
+            - name: OTEL_SERVICE_NAME
+              value: myapp
+            - name: OTEL_SERVICE_VERSION
+              value: "1.0.0"
+```
+
+### Custom Processor Configuration
+```yaml
+processors:
+  attributes:
+    actions:
+      - key: environment
+        value: production
+        action: upsert
+      - key: sensitive_data
+        action: delete
+  filter:
+    traces:
+      span:
+        - 'attributes["http.url"] == "/health"'
+  transform:
+    trace_statements:
+      - context: span
+        statements:
+          - set(name, "custom_span_name") where attributes["http.method"] == "GET"
+```
+
+### Multi-Pipeline Configuration
+```yaml
+service:
+  pipelines:
+    traces/frontend:
+      receivers: [otlp]
+      processors: [batch, attributes/frontend]
+      exporters: [otlp/tempo]
+    traces/backend:
+      receivers: [jaeger]
+      processors: [batch, attributes/backend]
+      exporters: [otlp/tempo]
+    metrics/infrastructure:
+      receivers: [kubeletstats, k8s_cluster]
+      processors: [batch, resource]
+      exporters: [prometheus]
+    metrics/applications:
+      receivers: [otlp]
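+      # the type/name convention above lets one processor type run with several
+      # configs; filter/applications is assumed here to drop non-application
+      # metrics so the app pipeline stays lean while infra metrics flow separately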
+      processors: [batch, filter/applications]
+      exporters: [prometheus]
+```
+
+### Sampling Configuration
+```yaml
+processors:
+  probabilistic_sampler:
+    sampling_percentage: 10  # 10% sampling
+  tail_sampling:
+    decision_wait: 10s
+    num_traces: 100
+    expected_new_traces_per_sec: 10
+    policies:
+      - name: errors
+        type: status_code
+        status_code: {status_codes: [ERROR]}
+      - name: slow
+        type: latency
+        latency: {threshold_ms: 1000}
+      - name: random
+        type: probabilistic
+        probabilistic: {sampling_percentage: 1}
+```
+
+OpenTelemetry provides comprehensive observability data collection and processing. Start with basic configurations and gradually add more sophisticated processing and routing as your observability needs grow.
\ No newline at end of file
diff --git a/docs/sealed-secrets-config-guide.md b/docs/sealed-secrets-config-guide.md
new file mode 100644
index 0000000..36af70d
--- /dev/null
+++ b/docs/sealed-secrets-config-guide.md
@@ -0,0 +1,219 @@
+# Sealed Secrets Configuration Guide
+
+## Overview
+Sealed Secrets provides a way to encrypt secrets into SealedSecret resources, which can be safely stored in Git repositories and automatically decrypted by the controller running in the cluster.
+
+## Key Configuration Choices
+
+### Controller Configuration
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: sealed-secrets-controller
+  namespace: kube-system
+spec:
+  template:
+    spec:
+      containers:
+        - name: sealed-secrets-controller
+          image: quay.io/bitnami/sealed-secrets-controller:latest
+          command:
+            - controller
+          args:
+            - --update-status
+            - --key-renew-period=720h  # 30 days
+          env:
+            - name: SEALED_SECRETS_UPDATE_STATUS
+              value: "true"
+```
+**Why**:
+- Update status provides feedback on SealedSecret processing
+- Periodic key renewal ensures cryptographic freshness; old keys are retained so existing SealedSecrets remain decryptable
+
+### Key Management Configuration
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: sealed-secrets-key
+  namespace: kube-system
+  labels:
+    sealedsecrets.bitnami.com/sealed-secrets-key: active
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-cert>
+  tls.key: <base64-encoded-key>
+```
+**Why**: Pre-created keys enable key backup and disaster recovery scenarios
+
+### Scope Configuration
+```yaml
+apiVersion: bitnami.com/v1alpha1
+kind: SealedSecret
+metadata:
+  name: mysecret
+  namespace: myapp
+spec:
+  encryptedData:
+    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
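+    # ciphertext produced by kubeseal with the controller's public certificate;
+    # it is safe to commit to Git because only the in-cluster controller holds
+    # the private key needed to decrypt it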
+  template:
+    metadata:
+      name: mysecret
+      namespace: myapp
+    type: Opaque
+```
+**Why**: Template metadata ensures proper secret creation with correct namespace and type
+
+## Common Pitfalls
+
+### Key Loss and Recovery
+**Problem**: Sealed secrets cannot be decrypted after controller restart or key loss
+
+**Solution**: Implement proper key backup and recovery procedures
+
+**Verification**:
+```bash
+# Backup current keys
+kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > sealed-secrets-keys.yaml
+
+# Verify key is active
+kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key=active
+
+# Check controller logs for key issues
+kubectl logs -n kube-system -l name=sealed-secrets-controller
+```
+
+### Encryption Scope Issues
+**Problem**: SealedSecrets encrypted for the wrong scope cannot be decrypted in the target namespace
+
+**Solution**: Use the correct kubeseal scope flags when encrypting secrets
+
+### Certificate Fetch Failures
+**Problem**: kubeseal cannot fetch the public certificate from the controller
+
+**Solution**: Ensure the controller is accessible and the certificate endpoint is working
+
+```bash
+# Test certificate fetch
+kubeseal --fetch-cert --controller-name=sealed-secrets-controller --controller-namespace=kube-system
+
+# Verify controller service
+kubectl get svc -n kube-system sealed-secrets-controller
+
+# Check controller readiness
+kubectl get pods -n kube-system -l name=sealed-secrets-controller
+```
+
+## Required Secrets
+
+### TLS Certificate and Private Key
+The controller automatically generates these, but they can be pre-created for backup purposes
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: sealed-secrets-key
+  namespace: kube-system
+  labels:
+    sealedsecrets.bitnami.com/sealed-secrets-key: active
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-cert>
+  tls.key: <base64-encoded-key>
+```
+
+**Key Fields**:
+- `tls.crt`: Public certificate for encryption (automatically generated)
+- `tls.key`: Private key for decryption (automatically generated)
+
+### Backup Keys
+For disaster recovery, old keys should be preserved
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: sealed-secrets-key-backup
+  namespace: kube-system
+  labels:
+    sealedsecrets.bitnami.com/sealed-secrets-key: ""
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded-old-cert>
+  tls.key: <base64-encoded-old-key>
+```
+
+**Key Fields**:
+- `tls.crt`: Old public certificate (for reference)
+- `tls.key`: Old private key (for decrypting old secrets)
+
+## Verification
+```bash
+# Check controller status
+kubectl get pods -n kube-system -l name=sealed-secrets-controller
+
+# Verify service is accessible
+kubectl get svc -n kube-system sealed-secrets-controller
+
+# Test certificate fetch
+kubeseal --fetch-cert > public.pem
+
+# List sealed secrets
+kubectl get sealedsecrets -A
+```
+
+## Usage Examples
+
+### Create SealedSecret from Command Line
+```bash
+# Create a secret manifest and encrypt it
+echo -n mypassword | kubectl create secret generic mysecret --dry-run=client --from-file=password=/dev/stdin -o yaml | kubeseal -o yaml > mysealedsecret.yaml
+
+# Apply the sealed secret
+kubectl apply -f mysealedsecret.yaml
+
+# Verify the secret was created
+kubectl get secret mysecret
+```
+
+### Encrypt Existing Secret
+```bash
+# Export the existing secret
+kubectl get secret mysecret -o yaml > mysecret.yaml
+
+# Encrypt the exported secret manifest with kubeseal
+cat mysecret.yaml | kubeseal -o yaml > mysealedsecret.yaml
+
+# Apply sealed secret
+kubectl apply -f mysealedsecret.yaml
+```
+
+### Namespace-scoped Encryption
+```bash
+# Encrypt for
specific namespace +echo -n mypassword | kubectl create secret generic mysecret --dry-run=client --from-file=password=/dev/stdin -o yaml | kubeseal --scope namespace-wide -o yaml > mysealedsecret.yaml +``` + +### Cluster-wide Encryption +```bash +# Encrypt for any namespace +echo -n mypassword | kubectl create secret generic mysecret --dry-run=client --from-file=password=/dev/stdin -o yaml | kubeseal --scope cluster-wide -o yaml > mysealedsecret.yaml +``` + +### Key Rotation and Backup +```bash +# Backup current keys before rotation +kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > sealed-secrets-backup-$(date +%Y%m%d).yaml + +# Force key rotation (restart controller) +kubectl delete pod -n kube-system -l name=sealed-secrets-controller + +# Verify new key is generated +kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key=active +``` + +Sealed Secrets enables GitOps-friendly secret management by allowing encrypted secrets to be stored in version control. Implement proper key backup and rotation procedures to ensure long-term secret accessibility. \ No newline at end of file diff --git a/docs/templates/service-config-guide-template.md b/docs/templates/service-config-guide-template.md new file mode 100644 index 0000000..92630c9 --- /dev/null +++ b/docs/templates/service-config-guide-template.md @@ -0,0 +1,109 @@ +# [Service Name] Configuration Guide + +## Overview +[Brief description of what the service provides and its role in the Kubernetes cluster] + +## Key Configuration Choices + +### [Configuration Section 1] +```yaml +[example configuration block] +``` +**Why**: +- [Explanation of configuration choice 1] +- [Explanation of configuration choice 2] +- [Additional context or reasoning] + +### [Configuration Section 2] +```yaml +[example configuration block] +``` +**Why**: [Explanation of why this configuration is needed and what it accomplishes] + +### [Configuration Section 3] +```yaml +[example configuration block] +``` +**Why**: [Detailed explanation of the configuration choices and their implications] + +## Common Pitfalls + +### [Problem Category 1] +**Problem**: [Description of the issue users commonly encounter] + +**Solution**: [Step-by-step solution to resolve the issue] + +**Verification**: +```bash +[command to verify the fix] +``` + +### [Problem Category 2] +**Problem**: [Description of another common issue] + +**Solution**: [Explanation of how to resolve this issue, including any configuration changes needed] + +### [Problem Category 3] +**Problem**: [Description of configuration-related problems] + +**Solution**: [Solution with code examples if applicable] + +```bash +[example commands or configuration] +``` + +## Required Secrets + +### [secret-name-1] +[Description of what this secret contains and its purpose] + +```yaml +stringData: + # [Description of key-value pairs] + [KEY_NAME]: [example-value] + [ANOTHER_KEY]: [example-value] +``` + +**Key Fields**: +- `[KEY_NAME]`: [Description of what this field does] (required/optional) +- `[ANOTHER_KEY]`: [Description of this field] (required/optional) + +### [secret-name-2] +[Description of second secret if applicable] + +```ini +[example configuration format] +[key] = "value" +``` + +**Key Fields**: +- `[field]`: [Description and requirements] + +## Verification +```bash +# [Description of verification step 1] +[command to check service status] + +# [Description of verification step 2] +[command to verify configuration] + +# [Description of test step] 
+[command to test functionality] + +# [Description of troubleshooting step] +[command to check logs or status] +``` + +## Usage Examples + +### [Use Case 1] +```bash +[example command or configuration for common use case] +``` + +### [Use Case 2] +```bash +[example command or configuration for another use case] +``` + +[Additional notes about usage patterns, limitations, or future considerations] \ No newline at end of file diff --git a/docs/templates/service-readme-template.md b/docs/templates/service-readme-template.md new file mode 100644 index 0000000..47751ee --- /dev/null +++ b/docs/templates/service-readme-template.md @@ -0,0 +1,137 @@ +# [Service Name] – Base Configuration + +This directory contains the **base manifests** for deploying [Service Name](https://[service-url]), [brief description of what the service does]. +It is designed to be **consumed by cluster repositories** as a remote base, allowing each cluster to apply **custom overrides** as needed. + +**About [Service Name]:** + +- [Key feature 1 with brief explanation] +- [Key feature 2 with brief explanation] +- [Key feature 3 with brief explanation] +- [Integration capability or compatibility note] +- [Advanced feature or use case] +- [Operational benefit or automation capability] +- [Security or governance benefit] +- [Common use case or deployment scenario] +- [Additional operational or architectural benefit] + +## Configuration + +### Base Components + +- **[Component 1]:** [Description of what this component does] +- **[Component 2]:** [Description of what this component does] +- **[Component 3]:** [Description of what this component does] + +### Custom Resources + +- **[CRD 1]:** [Description of the custom resource and its purpose] +- **[CRD 2]:** [Description of the custom resource and its purpose] + +### Storage/Persistence + +- **[Storage Type]:** [Description of storage requirements or capabilities] +- **[Backup/Recovery]:** [Description of backup and recovery capabilities] + +## Cluster-Specific Overrides + +Each cluster repository should provide the following overrides: + +### Required Overrides + +- **[Override 1]:** [Description and example] +- **[Override 2]:** [Description and example] +- **[Override 3]:** [Description and example] + +### Optional Overrides + +- **[Optional Override 1]:** [Description and when to use] +- **[Optional Override 2]:** [Description and when to use] +- **[Optional Override 3]:** [Description and when to use] + +## Dependencies + +- **[Dependency 1]:** [Version requirements and purpose] +- **[Dependency 2]:** [Version requirements and purpose] +- **[Dependency 3]:** [Version requirements and purpose] + +## Usage Examples + +### Basic Deployment + +```yaml +# Example kustomization.yaml for consuming this base +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - https://github.com/[org]/[repo]//applications/base/services/[service-name]?ref=[version] + +patchesStrategicMerge: + - [service-name]-values.yaml +``` + +### Configuration Override Example + +```yaml +# [service-name]-values.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: [service-name]-config +data: + [key]: [value] + [key2]: [value2] +``` + +## Verification + +After deployment, verify the service is running correctly: + +```bash +# Check pod status +kubectl get pods -n [namespace] -l app.kubernetes.io/name=[service-name] + +# Check service status +kubectl get svc -n [namespace] -l app.kubernetes.io/name=[service-name] + +# Check custom resources (if applicable) +kubectl get 
[crd-name] -A + +# Check logs +kubectl logs -n [namespace] -l app.kubernetes.io/name=[service-name] +``` + +## Troubleshooting + +### Common Issues + +**Issue 1: [Common problem description]** +- **Symptoms:** [What you'll see] +- **Cause:** [Why it happens] +- **Solution:** [How to fix it] + +**Issue 2: [Another common problem]** +- **Symptoms:** [What you'll see] +- **Cause:** [Why it happens] +- **Solution:** [How to fix it] + +### Useful Commands + +```bash +# Debug command 1 +kubectl [command] [options] + +# Debug command 2 +kubectl [command] [options] + +# Check configuration +kubectl describe [resource] [name] -n [namespace] +``` + +## References + +- **Upstream Documentation:** [Link to official documentation] +- **Helm Chart:** [Link to Helm chart repository] +- **GitHub Repository:** [Link to source code] +- **Configuration Guide:** [Link to detailed configuration documentation] \ No newline at end of file diff --git a/docs/templates/service-standards-template.md b/docs/templates/service-standards-template.md new file mode 100644 index 0000000..e2b9f2e --- /dev/null +++ b/docs/templates/service-standards-template.md @@ -0,0 +1,303 @@ +--- +id: [service-name]-standards +title: [Service Name] Standards & Lifecycle +sidebar_label: [Service Name] Standards +description: Standards for [service description] in the openCenter platform. +tags: [developer, operators, service, standards, lifecycle, [service-name]] +audience: [Developer, Operations] +--- + +# [Service Name] Standards & Lifecycle + +> **Purpose.** [Brief description of the service's purpose and role in the platform] +> +> **Scope.** [Define what this service covers and its boundaries] + +--- + +## 0) Service Intake Workflow + +1. **Intake Request:** [Service-specific intake requirements] +2. **Architecture Review:** [Architecture considerations specific to this service] +3. **Prototype in Dev:** [Development requirements and deliverables] +4. **Operational Review:** [Operational validation requirements] +5. 
**Stage Decision:** [Decision criteria and approval process] + +--- + +## 1) Service Requirements (Authoritative Checklist) + +### 1.1 Functional & Delivery + +* [ ] **[Requirement 1]:** [Description and acceptance criteria] +* [ ] **[Requirement 2]:** [Description and acceptance criteria] +* [ ] **[Requirement 3]:** [Description and acceptance criteria] + +### 1.2 Security & Compliance (Minimum) + +* [ ] **[Security Requirement 1]:** [Description and compliance mapping] +* [ ] **[Security Requirement 2]:** [Description and compliance mapping] +* [ ] **[Security Requirement 3]:** [Description and compliance mapping] + +### 1.3 Observability (Minimum) + +* [ ] **[Observability Requirement 1]:** [Metrics, logs, traces requirements] +* [ ] **[Observability Requirement 2]:** [Dashboard and alerting requirements] +* [ ] **[Observability Requirement 3]:** [SLI/SLO requirements] + +### 1.4 Operations & Support + +* [ ] **[Operations Requirement 1]:** [Documentation requirements] +* [ ] **[Operations Requirement 2]:** [Runbook requirements] +* [ ] **[Operations Requirement 3]:** [Support model requirements] + +### 1.5 Implementation Guidance + +- **[Guidance Area 1]:** [Specific implementation guidance] +- **[Guidance Area 2]:** [Best practices and recommendations] +- **[Guidance Area 3]:** [Common patterns and approaches] + +#### Sample Configuration + +```yaml +[example configuration relevant to this service] +``` + +#### Common Pitfalls + +- [Common mistake 1 and how to avoid it] +- [Common mistake 2 and how to avoid it] +- [Common mistake 3 and how to avoid it] + +--- + +## 2) Project Risk Assessment + +Create `RISK.md` capturing the following for [Service Name]: + +| Factor | Score | Notes | +| ---------------------- | ----- | ------------------------------- | +| **[Risk Factor 1]** | [1-5] | [Service-specific considerations] | +| **[Risk Factor 2]** | [1-5] | [Service-specific considerations] | +| **[Risk Factor 3]** | [1-5] | [Service-specific considerations] | + +**Risk Score:** [Calculated score and tier] + +**Additional Considerations:** +- [Service-specific risk factors] +- [Mitigation strategies] +- [Compensating controls] + +--- + +## 3) Architecture & Configuration + +### 3.1 Service Architecture + +``` +[Architecture diagram or description] +``` + +### 3.2 Configuration Principles + +- **[Principle 1]:** [Description and rationale] +- **[Principle 2]:** [Description and rationale] +- **[Principle 3]:** [Description and rationale] + +### 3.3 Example Configuration + +```yaml +[Service-specific configuration example] +``` + +### 3.4 Dependencies + +- **[Dependency 1]:** [Description and requirements] +- **[Dependency 2]:** [Description and requirements] +- **[Dependency 3]:** [Description and requirements] + +--- + +## 4) Deployment & Scheduling + +### 4.1 Node Selection Strategy + +```yaml +[Service-specific tolerations and node selectors] +``` + +### 4.2 Resource Requirements + +- **CPU:** [Requirements and limits] +- **Memory:** [Requirements and limits] +- **Storage:** [Requirements and characteristics] + +### 4.3 Scaling Considerations + +- **Horizontal Scaling:** [HPA configuration and limits] +- **Vertical Scaling:** [VPA considerations] +- **Storage Scaling:** [Volume expansion capabilities] + +--- + +## 5) Service Labels & Metadata + +### 5.1 Required Labels + +```yaml +metadata: + labels: + app.kubernetes.io/name: [service-name] + app.kubernetes.io/instance: [instance-name] + app.kubernetes.io/version: [version] + app.kubernetes.io/component: [component] + 
app.kubernetes.io/part-of: [system] + app.kubernetes.io/managed-by: fluxcd + opencenter.io/owner: [team-email] + opencenter.io/tier: [platform|shared|tenant] + opencenter.io/data-sensitivity: [classification] + opencenter.io/rto: [recovery-time] + opencenter.io/rpo: [recovery-point] + opencenter.io/sla: [availability-target] + opencenter.io/backup-profile: [backup-schedule] +``` + +### 5.2 Service-Specific Labels + +- **[Custom Label 1]:** [Purpose and values] +- **[Custom Label 2]:** [Purpose and values] +- **[Custom Label 3]:** [Purpose and values] + +--- + +## 6) Production Requirements + +### 6.1 Documentation Bundle + +- **README.md:** [Service overview and quick start] +- **OPERATIONS.md:** [Operational procedures] +- **TROUBLESHOOTING.md:** [Common issues and solutions] +- **SLO.md:** [Service level objectives and indicators] + +### 6.2 Observability Requirements + +- **Metrics:** [Required metrics and collection] +- **Dashboards:** [Grafana dashboard requirements] +- **Alerts:** [Alert rules and thresholds] +- **Logs:** [Logging requirements and retention] + +### 6.3 Backup & Recovery + +- **Backup Strategy:** [What needs to be backed up] +- **Recovery Procedures:** [Step-by-step recovery process] +- **Testing:** [Backup/recovery testing schedule] + +--- + +## 7) Preview Service Considerations + +- **Preview Scope:** [What functionality is available in preview] +- **Limitations:** [Known limitations and workarounds] +- **Success Criteria:** [Metrics for graduation to production] +- **Timeline:** [Expected preview duration and milestones] + +--- + +## 8) Service Lifecycle Gates + +| Stage | Entry Criteria | Exit Criteria | +| ---------- | --------------------------------- | --------------------------------- | +| Incubating | [Service-specific entry criteria] | [Service-specific exit criteria] | +| Preview | [Service-specific entry criteria] | [Service-specific exit criteria] | +| Production | [Service-specific entry criteria] | [Service-specific exit criteria] | +| Deprecated | [Service-specific entry criteria] | [Service-specific exit criteria] | +| Retired | [Service-specific entry criteria] | [Service-specific exit criteria] | + +### Stage-Specific Deliverables + +- **Incubating:** [Required deliverables] +- **Preview:** [Required deliverables] +- **Production:** [Required deliverables] +- **Deprecated:** [Required deliverables] +- **Retired:** [Required deliverables] + +--- + +## 9) Testing & Validation + +### 9.1 Test Strategy + +- **Unit Tests:** [Coverage requirements and scope] +- **Integration Tests:** [Test scenarios and dependencies] +- **End-to-End Tests:** [User journey validation] +- **Performance Tests:** [Load and stress testing] + +### 9.2 Validation Pipeline + +```yaml +[CI/CD pipeline configuration specific to this service] +``` + +### 9.3 Quality Gates + +- **Code Quality:** [Linting, formatting, complexity thresholds] +- **Security:** [SAST, dependency scanning, image scanning] +- **Performance:** [Latency, throughput, resource usage thresholds] + +--- + +## 10) Operational Procedures + +### 10.1 Deployment Procedures + +1. **Pre-deployment:** [Checklist and validation steps] +2. **Deployment:** [Step-by-step deployment process] +3. 
**Post-deployment:** [Verification and rollback procedures]
+
+### 10.2 Maintenance Procedures
+
+- **Regular Maintenance:** [Scheduled maintenance tasks]
+- **Updates:** [Update procedures and testing]
+- **Monitoring:** [Ongoing monitoring requirements]
+
+### 10.3 Incident Response
+
+- **Escalation Path:** [Who to contact and when]
+- **Common Incidents:** [Typical issues and responses]
+- **Recovery Procedures:** [Step-by-step recovery actions]
+
+---
+
+## 11) Compliance & Security
+
+### 11.1 Compliance Mapping
+
+| Control | Requirement | Implementation | Evidence |
+| ------- | ----------- | -------------- | -------- |
+| [Control ID] | [Requirement description] | [How it's implemented] | [Evidence location] |
+
+### 11.2 Security Controls
+
+- **Authentication:** [How users/services authenticate]
+- **Authorization:** [RBAC and access controls]
+- **Encryption:** [Data in transit and at rest]
+- **Audit:** [Logging and audit trail requirements]
+
+---
+
+## 12) Appendices
+
+### 12.1 Configuration Examples
+
+[Additional configuration examples and templates]
+
+### 12.2 Troubleshooting Guide
+
+[Detailed troubleshooting procedures and common solutions]
+
+### 12.3 Reference Documentation
+
+- [Link to upstream documentation]
+- [Link to related ADRs]
+- [Link to runbooks]
+- [Link to dashboards]
\ No newline at end of file
diff --git a/docs/tempo-config-guide.md b/docs/tempo-config-guide.md
new file mode 100644
index 0000000..18bc96a
--- /dev/null
+++ b/docs/tempo-config-guide.md
@@ -0,0 +1,251 @@
+# Tempo Configuration Guide
+
+## Overview
+Tempo is a distributed tracing backend that provides cost-effective trace storage and querying capabilities, designed to work seamlessly with Grafana and OpenTelemetry.
+
+## Key Configuration Choices
+
+### Storage Configuration
+```yaml
+tempo:
+  storage:
+    trace:
+      backend: s3
+      s3:
+        endpoint: s3.amazonaws.com
+        bucket: tempo-traces
+        region: us-east-1
+        access_key: <access-key>
+        secret_key: <secret-key>
+        insecure: false
+  retention: 30d
+```
+**Why**:
+- Object storage provides cost-effective, scalable trace storage
+- S3-compatible storage offers flexibility across cloud providers
+- Retention policies manage storage costs and compliance requirements
+
+### Distributor Configuration
+```yaml
+tempo:
+  distributor:
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+          http:
+            endpoint: 0.0.0.0:4318
+      jaeger:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:14250
+          thrift_http:
+            endpoint: 0.0.0.0:14268
+```
+**Why**: Multiple receiver protocols support different tracing clients and migration scenarios
+
+### Query Configuration
+```yaml
+tempo:
+  query_frontend:
+    search:
+      duration_slo: 5s
+      throughput_bytes_slo: 1.073741824e+09  # 1GB
+    trace_by_id:
+      duration_slo: 5s
+  querier:
+    max_concurrent_queries: 20
+    search:
+      external_hedge_requests_at: 8s
+      external_hedge_requests_up_to: 2
+```
+**Why**: Query limits and SLOs prevent resource exhaustion and ensure consistent performance
+
+## Common Pitfalls
+
+### High Ingestion Rate Issues
+**Problem**: Tempo cannot keep up with high trace ingestion rates
+
+**Solution**: Scale distributors and ingesters, tune batch sizes, and implement sampling
+
+**Verification**:
+```bash
+# Check distributor metrics
+kubectl port-forward -n observability svc/tempo-distributor 3200:3200
+curl http://localhost:3200/metrics | grep tempo_distributor
+
+# Monitor ingestion rate
+kubectl logs -n observability deployment/tempo-distributor | grep "ingestion rate"
+```
+
+### Query Performance Problems
+**Problem**: TraceQL queries are slow or time out
+
+**Solution**: Use proper time ranges, limit search scope, and optimize query patterns
+
+### Storage Backend Issues
+**Problem**: Tempo cannot write traces to or read from object storage
+
+**Solution**: Verify storage credentials, bucket permissions, and network connectivity
+
+```bash
+# Check ingester logs
+kubectl logs -n observability deployment/tempo-ingester
+
+# Verify storage configuration
+kubectl exec -n observability deployment/tempo-querier -- \
+  wget -qO- 'http://localhost:3200/ready'
+
+# Test object storage connectivity (requires the aws CLI inside the image)
+kubectl exec -n observability deployment/tempo-ingester -- \
+  aws s3 ls s3://tempo-traces/ --region us-east-1
+```
+
+## Required Secrets
+
+### Object Storage Credentials
+Tempo requires credentials for accessing trace storage
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: tempo-storage
+  namespace: observability
+type: Opaque
+stringData:
+  access-key: your-access-key
+  secret-key: your-secret-key
+```
+
+**Key Fields**:
+- `access-key`: S3 access key ID (required)
+- `secret-key`: S3 secret access key (required)
+
+### Gateway Authentication
+For multi-tenant deployments, authentication credentials may be required
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: tempo-gateway-auth
+  namespace: observability
+type: Opaque
+stringData:
+  htpasswd: |
+    tenant1:$2y$10$...
+    tenant2:$2y$10$...
+```
+
+**Key Fields**:
+- `htpasswd`: HTTP basic auth credentials file (required for gateway auth)
+
+## Verification
+```bash
+# Check Tempo pods are running
+kubectl get pods -n observability -l app.kubernetes.io/name=tempo
+
+# Verify Tempo services
+kubectl get svc -n observability -l app.kubernetes.io/name=tempo
+
+# Test Tempo API
+kubectl port-forward -n observability svc/tempo-query-frontend 3200:3200
+curl http://localhost:3200/ready
+
+# Query traces
+curl -G -s "http://localhost:3200/api/search" \
+  --data-urlencode 'q={resource.service.name="myservice"}' \
+  --data-urlencode 'limit=10'
+```
+
+## Usage Examples
+
+### Query Traces with TraceQL
+```bash
+# Find traces by service name (a resource-scoped attribute)
+{resource.service.name="myservice"}
+
+# Filter by duration
+{resource.service.name="myservice" && duration > 100ms}
+
+# Search by span attributes
+{span.http.status_code=500}
+
+# Complex query with multiple conditions
+{resource.service.name="frontend" && span.http.method="POST" && duration > 1s}
+```
+
+### Configure OpenTelemetry to Send Traces
+```yaml
+apiVersion: opentelemetry.io/v1alpha1
+kind: OpenTelemetryCollector
+metadata:
+  name: otel-collector
+spec:
+  config: |
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+          http:
+            endpoint: 0.0.0.0:4318
+    exporters:
+      otlp/tempo:
+        endpoint: http://tempo-distributor:4317
+        tls:
+          insecure: true
+    service:
+      pipelines:
+        traces:
+          receivers: [otlp]
+          exporters: [otlp/tempo]
+```
+
+### Configure Grafana Data Source
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: tempo-datasource
+  namespace: observability
+data:
+  datasource.yaml: |
+    apiVersion: 1
+    datasources:
+      - name: Tempo
+        type: tempo
+        access: proxy
+        url: http://tempo-query-frontend:3200
+        isDefault: false
+        editable: true
+        jsonData:
+          tracesToLogs:
+            datasourceUid: loki
+            tags: ['job', 'instance', 'pod', 'namespace']
+          tracesToMetrics:
+            datasourceUid: prometheus
+            tags: [{ key: 'service.name', value: 'service' }]
+          serviceMap:
+            datasourceUid: prometheus
+```
+
+### Set Up Trace Sampling
+```yaml
+tempo:
+  distributor:
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+    # Sampling
configuration + log_received_traces: true + global_overrides: + max_traces_per_user: 10000 + max_bytes_per_trace: 5000000 # 5MB +``` + +Tempo provides efficient distributed tracing storage and querying. Focus on proper sampling strategies and retention policies to balance observability needs with storage costs. \ No newline at end of file diff --git a/docs/velero-config-guide.md b/docs/velero-config-guide.md new file mode 100644 index 0000000..8615701 --- /dev/null +++ b/docs/velero-config-guide.md @@ -0,0 +1,189 @@ +# Velero Configuration Guide + +## Overview +Velero provides backup and disaster recovery capabilities for Kubernetes clusters, supporting both resource backups and persistent volume snapshots. + +## Key Configuration Choices + +### Backup Storage Location Configuration +```yaml +backupStorageLocations: +- name: default + provider: aws + bucket: velero-backups + config: + region: us-east-1 + s3ForcePathStyle: "false" + credential: + name: cloud-credentials + key: cloud +``` +**Why**: +- S3-compatible storage provides reliable, scalable backup storage +- Multiple storage locations enable cross-region backup strategies +- Credentials separation improves security + +### Volume Snapshot Location Configuration +```yaml +volumeSnapshotLocations: +- name: default + provider: csi + config: + # CSI driver handles snapshot configuration +``` +**Why**: CSI snapshots provide native Kubernetes volume snapshot capabilities with better integration + +### CSI Snapshot Integration +```yaml +configuration: + features: EnableCSI + defaultSnapshotMoveData: false + defaultVolumesToFsBackup: false + volumeSnapshotLocation: [] +``` +**Why**: CSI integration provides more reliable and efficient volume backups compared to file-level backups + +## Common Pitfalls + +### Backup Storage Authentication Issues +**Problem**: Velero cannot access backup storage due to authentication failures + +**Solution**: Verify cloud credentials are correctly configured and have appropriate permissions + +**Verification**: +```bash +# Check Velero deployment logs +kubectl logs -n velero deployment/velero + +# Verify backup storage location +kubectl get backupstoragelocation -n velero + +# Test backup storage access +velero backup-location get +``` + +### Volume Snapshot Failures +**Problem**: Volume snapshots fail during backup operations + +**Solution**: Ensure CSI driver supports snapshots and VolumeSnapshotClass is properly configured + +### Node Agent Issues +**Problem**: File-level backups fail due to node agent problems + +**Solution**: Check node agent daemonset status and ensure proper node access + +```bash +# Check node agent pods +kubectl get pods -n velero -l name=node-agent + +# Check node agent logs +kubectl logs -n velero -l name=node-agent + +# Verify node agent configuration +kubectl describe daemonset -n velero node-agent +``` + +## Required Secrets + +### Cloud Storage Credentials +Velero requires credentials for accessing backup storage + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: cloud-credentials + namespace: velero +type: Opaque +stringData: + cloud: | + [default] + aws_access_key_id=your-access-key + aws_secret_access_key=your-secret-key +``` + +**Key Fields**: +- `cloud`: Cloud provider credentials file (required) +- Format varies by provider (AWS, Azure, GCP, etc.) 
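+
+One common way to produce this secret is from a local credentials file. A minimal sketch, assuming the file is named `credentials-velero` (any name works; the secret key `cloud` must match `credential.key` in the backup storage location above):
+
+```bash
+# Write the provider credentials in the INI format shown above
+cat > credentials-velero <<EOF
+[default]
+aws_access_key_id=your-access-key
+aws_secret_access_key=your-secret-key
+EOF
+
+# Create the secret that Velero mounts for backup storage access
+kubectl create secret generic cloud-credentials \
+  --namespace velero \
+  --from-file=cloud=./credentials-velero
+```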
+ +### CSI Snapshot Credentials +For CSI snapshots, additional credentials may be required + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: csi-credentials + namespace: velero +type: Opaque +stringData: + username: csi-user + password: csi-password +``` + +**Key Fields**: +- `username`: CSI storage system username (if required) +- `password`: CSI storage system password (if required) + +## Verification +```bash +# Check Velero installation +velero version + +# Verify backup storage location +velero backup-location get + +# Check volume snapshot location +velero snapshot-location get + +# List existing backups +velero backup get +``` + +## Usage Examples + +### Create On-Demand Backup +```bash +# Backup entire cluster +velero backup create full-backup + +# Backup specific namespace +velero backup create app-backup --include-namespaces myapp + +# Backup with TTL +velero backup create temp-backup --ttl 24h +``` + +### Schedule Regular Backups +```yaml +apiVersion: velero.io/v1 +kind: Schedule +metadata: + name: daily-backup + namespace: velero +spec: + schedule: "0 2 * * *" # Daily at 2 AM + template: + includedNamespaces: + - production + - staging + storageLocation: default + ttl: 720h # 30 days +``` + +### Restore from Backup +```bash +# List available backups +velero backup get + +# Restore entire backup +velero restore create --from-backup full-backup + +# Restore specific namespace +velero restore create --from-backup app-backup --include-namespaces myapp + +# Restore to different namespace +velero restore create --from-backup app-backup --namespace-mappings myapp:myapp-restored +``` + +Velero provides comprehensive backup and disaster recovery capabilities. Regular testing of backup and restore procedures is essential to ensure data protection and recovery readiness. \ No newline at end of file