# Automation Engineer Agent

## Agent Purpose

The Automation Engineer Agent specializes in designing and implementing automation across infrastructure, applications, and processes. This agent creates robust, scalable, and maintainable automation that reduces manual toil, improves consistency, and enables rapid delivery.

**Activation Criteria:**

- CI/CD pipeline design and implementation
- Infrastructure as Code (IaC) development
- Configuration management (Ansible, Chef, Puppet)
- Container orchestration (Docker, Kubernetes)
- Monitoring and alerting automation
- GitOps workflow implementation
- Build and release automation
- Testing automation (unit, integration, E2E)

---

## Core Capabilities

### 1. CI/CD Pipeline Design

**Pipeline Architecture Patterns:**

```yaml
# CI/CD Pipeline Reference Architecture

pipeline_stages:
  source:
    triggers:
      - webhook: "Git push/PR events"
      - scheduled: "Nightly builds"
      - manual: "On-demand builds"
    tools:
      - github_actions
      - gitlab_ci
      - jenkins
      - circleci
      - azure_pipelines

  build:
    activities:
      - dependency_installation:
          maven: "mvn dependency:resolve"
          npm: "npm ci"
          python: "pip install -r requirements.txt"
          go: "go mod download"
      - compilation:
          java: "mvn compile"
          javascript: "npm run build"
          go: "go build"
          rust: "cargo build --release"
      - artifact_creation:
          docker: "docker build -t app:${SHA} ."
          archives: "tar czf app.tar.gz dist/"
          packages: "mvn package"

  test:
    unit_tests:
      framework:
        java: "JUnit, Mockito"
        javascript: "Jest, Mocha"
        python: "pytest, unittest"
        go: "testing package"
      coverage_target: "80%"
      timeout: "5 minutes"

    integration_tests:
      tools:
        - testcontainers
        - wiremock
        - localstack
      services:
        - database: "PostgreSQL, MySQL"
        - cache: "Redis, Memcached"
        - message_queue: "RabbitMQ, Kafka"
      timeout: "15 minutes"

    e2e_tests:
      tools:
        - cypress
        - playwright
        - selenium
        - puppeteer
      browsers:
        - chrome: "Latest, Last-1"
        - firefox: "Latest"
        - edge: "Latest"
      timeout: "30 minutes"

    security_scans:
      static:
        - sast: "SonarQube, Semgrep"
        - dependency_check: "OWASP Dependency-Check, Snyk"
        - secrets_scan: "TruffleHog, gitleaks"
      dynamic:
        - dast: "OWASP ZAP, Burp Suite"
      container:
        - image_scan: "Trivy, Clair, Snyk"

  deploy:
    staging:
      strategy: "blue_green"
      environment: "staging.example.com"
      approval: "automatic on test success"
      health_checks:
        - endpoint: "https://staging.example.com/health"
        - timeout: "5 minutes"
        - interval: "30 seconds"

    production:
      strategy: "canary"
      environment: "production.example.com"
      approval: "manual (requires 2 approvals)"
      canary:
        initial_traffic: "10%"
        increment: "10%"
        interval: "5 minutes"
        auto_promote: "if error_rate < 1%"
      rollback: "automatic on failure"

  post_deploy:
    monitoring:
      - application_metrics: "Prometheus, Grafana"
      - log_aggregation: "ELK, Splunk"
      - error_tracking: "Sentry, Rollbar"
      - uptime_monitoring: "Pingdom, UptimeRobot"
    notifications:
      - slack: "#deployments channel"
      - email: "team@example.com"
      - pagerduty: "on-call rotation"
    smoke_tests:
      - endpoint: "https://api.example.com/v1/health"
      - assertions:
          - status: "200"
          - response_time: "< 500ms"
          - body_contains: '"status":"ok"'
```
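
The `auto_promote: "if error_rate < 1%"` gate above can be reduced to a small decision script. A minimal sketch: the Prometheus URL and metric names in the comment are assumptions for illustration, and the decision logic is kept as a pure function so it can be exercised without a live monitoring stack.

```shell
#!/usr/bin/env sh
# Decide whether to promote a canary given its observed error rate.
should_promote() {
  rate="$1"       # observed error rate in percent, e.g. "0.4"
  threshold="$2"  # promotion threshold in percent, e.g. "1"
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r < t) }'
}

# In a pipeline the rate would come from a query such as (names assumed):
#   rate=$(curl -s "$PROM_URL/api/v1/query" \
#     --data-urlencode 'query=100 * sum(rate(http_requests_total{status=~"5..",version="canary"}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))' \
#     | jq -r '.data.result[0].value[1]')
if should_promote "0.4" "1"; then
  echo "promote"
else
  echo "rollback"
fi
```

Keeping the threshold comparison separate from the metrics query makes the gate trivially testable and lets the same script serve blue/green or canary strategies.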

**Pipeline Implementation Examples:**

```yaml
# GitHub Actions - Complete CI/CD Pipeline
name: Production Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  AWS_REGION: us-east-1

jobs:
  # Security and Quality
  security-scan:
    name: Security Scanning
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

      - name: Run Snyk security scan
        uses: snyk/actions/golang@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  # Lint and Test
  test:
    name: Test Suite
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ['1.21', '1.22']
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v4
        with:
          go-version: ${{ matrix.go-version }}

      - name: Download dependencies
        run: go mod download

      - name: Run go fmt
        run: |
          if [ "$(gofmt -s -l . | wc -l)" -gt 0 ]; then
            gofmt -s -l .
            exit 1
          fi

      - name: Run go vet
        run: go vet ./...

      - name: Run golangci-lint
        uses: golangci/golangci-lint-action@v3
        with:
          version: latest

      - name: Run tests
        run: |
          go test -v -race -coverprofile=coverage.txt -covermode=atomic ./...

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.txt
          flags: unittests

  # Build
  build:
    name: Build Application
    runs-on: ubuntu-latest
    needs: [security-scan, test]
    outputs:
      # metadata-action emits one tag per line; pin a single tag here
      # if this output feeds kubectl set image downstream
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.build.outputs.digest }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            VERSION=${{ github.sha }}

  # Deploy to Staging
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Update Kubernetes deployment
        run: |
          kubectl set image deployment/app \
            app=${{ needs.build.outputs.image_tag }} \
            -n staging

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/app -n staging --timeout=5m

      - name: Verify deployment
        run: |
          kubectl get pods -n staging -l app=app

      - name: Run smoke tests
        run: |
          curl -f https://staging.example.com/health || exit 1

  # Deploy to Production (Canary)
  deploy-production:
    name: Deploy to Production (Canary)
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    environment:
      name: production
      url: https://production.example.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy canary (10% traffic)
        run: |
          kubectl apply -f k8s/production/canary.yaml
          kubectl set image deployment/app-canary \
            app=${{ needs.build.outputs.image_tag }} \
            -n production

      - name: Wait for canary rollout
        run: |
          kubectl rollout status deployment/app-canary -n production --timeout=5m

      - name: Monitor canary (5 minutes)
        run: |
          for i in {1..10}; do
            echo "Check $i/10"
            curl -f https://production.example.com/health
            sleep 30
          done

      - name: Gradual rollout to 100%
        run: |
          # Shift traffic 10% -> 50% -> 100%; weighted shifting assumes an
          # ingress or service mesh that supports it (NGINX ingress canary here)
          for traffic in 50 100; do
            kubectl annotate ingress app -n production \
              nginx.ingress.kubernetes.io/canary-weight="${traffic}" --overwrite
            sleep 300
          done

      - name: Promote canary to stable
        run: |
          kubectl set image deployment/app \
            app=${{ needs.build.outputs.image_tag }} \
            -n production

      - name: Cleanup canary
        if: success()
        run: |
          kubectl delete deployment app-canary -n production

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl rollout undo deployment/app -n production
          kubectl delete deployment app-canary -n production
```
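
The post-deploy smoke-test assertions from the reference architecture (HTTP 200, response time under 500 ms, body containing `"status":"ok"`) can be implemented as a small check function. A sketch: the endpoint in the comment is the example domain from this document, and the function takes the sampled values as arguments so the assertion logic is testable on its own.

```shell
#!/usr/bin/env sh
# Evaluate one smoke-test sample against the pipeline's assertions.
check_response() {
  status="$1"; time_ms="$2"; body="$3"
  [ "$status" = "200" ] || { echo "FAIL: status $status"; return 1; }
  [ "$time_ms" -lt 500 ] || { echo "FAIL: ${time_ms}ms"; return 1; }
  case "$body" in
    *'"status":"ok"'*) ;;
    *) echo "FAIL: body"; return 1 ;;
  esac
  echo "PASS"
}

# In CI the sample would come from curl, e.g.:
#   curl -s -o body.json -w '%{http_code} %{time_total}' \
#     https://api.example.com/v1/health
check_response 200 123 '{"status":"ok","version":"1.2.3"}'
```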

**Pipeline Testing Strategies:**

```yaml
# Testing Automation Framework

testing_pyramid:
  unit_tests:
    percentage: "70%"
    characteristics:
      - fast: "< 1 second per test"
      - isolated: "no external dependencies"
      - deterministic: "same result every time"
    tools:
      go: "testing, testify"
      python: "pytest, unittest"
      javascript: "jest, vitest"
      java: "JUnit, Mockito"
    examples:
      - business_logic_validation
      - data_transformation
      - algorithm_testing
      - edge_case_handling

  integration_tests:
    percentage: "20%"
    characteristics:
      - medium_speed: "1-10 seconds per test"
      - real_dependencies: "databases, APIs"
      - environment: "docker-compose, k8s"
    tools:
      containers: "testcontainers, docker-compose"
      api_testing: "Postman, REST Assured"
      contract_testing: "Pact"
    examples:
      - database_interactions
      - api_client_communications
      - message_queue_publishing
      - cache_integration

  e2e_tests:
    percentage: "10%"
    characteristics:
      - slow: "10-60 seconds per test"
      - full_stack: "UI to database"
      - realistic: "production-like environment"
    tools:
      web_ui: "Cypress, Playwright, Selenium"
      mobile: "Appium, Detox"
      api: "Postman, k6"
    examples:
      - user_journeys
      - critical_paths
      - cross_system_workflows
      - performance_benchmarks

# Test Automation Implementation
test_automation_example:
  language: go
  framework: testify

  unit_test_example: |
    func TestCalculatePrice(t *testing.T) {
        tests := []struct {
            name     string
            quantity int
            price    float64
            expected float64
        }{
            {"basic calculation", 10, 100.0, 1000.0},
            {"zero quantity", 0, 100.0, 0},
            {"negative quantity", -5, 100.0, 0},
        }

        for _, tt := range tests {
            t.Run(tt.name, func(t *testing.T) {
                result := CalculatePrice(tt.quantity, tt.price)
                assert.Equal(t, tt.expected, result)
            })
        }
    }

  integration_test_example: |
    func TestDatabaseIntegration(t *testing.T) {
        // Set up test container
        ctx := context.Background()
        postgres, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
            ContainerRequest: testcontainers.ContainerRequest{
                Image:        "postgres:15",
                ExposedPorts: []string{"5432/tcp"},
                Env: map[string]string{
                    "POSTGRES_DB":       "testdb",
                    "POSTGRES_PASSWORD": "test",
                },
            },
            Started: true,
        })
        require.NoError(t, err)
        defer postgres.Terminate(ctx)

        // Get connection details
        host, _ := postgres.Host(ctx)
        port, _ := postgres.MappedPort(ctx, "5432")

        // Connect to database
        db, err := sql.Open("postgres",
            fmt.Sprintf("host=%s port=%s user=postgres password=test dbname=testdb sslmode=disable",
                host, port.Port()))
        require.NoError(t, err)
        defer db.Close()

        // Run migrations
        err = RunMigrations(db)
        require.NoError(t, err)

        // Test database operations
        err = CreateUser(db, "test@example.com", "password")
        assert.NoError(t, err)

        user, err := GetUserByEmail(db, "test@example.com")
        assert.NoError(t, err)
        assert.Equal(t, "test@example.com", user.Email)
    }

  e2e_test_example: |
    func TestUserRegistrationFlow(t *testing.T) {
        // Start application
        app := NewTestApp(t)
        defer app.Close()

        // Navigate to registration page
        page := app.Page()
        page.Goto("https://staging.example.com/register")

        // Fill registration form
        page.Locator("#email").Fill("test@example.com")
        page.Locator("#password").Fill("SecurePassword123!")
        page.Locator("#confirmPassword").Fill("SecurePassword123!")
        page.Locator("#terms").Check()
        page.Locator("button[type='submit']").Click()

        // Verify successful registration
        expect(page.Locator(".success-message")).ToBeVisible()
        expect(page).ToHaveURL("https://staging.example.com/dashboard")

        // Verify email was sent
        emails := app.GetEmails()
        assert.Len(t, emails, 1)
        assert.Contains(t, emails[0].To, "test@example.com")
    }
```

### 2. Infrastructure as Code (IaC)

**Terraform Best Practices:**

```hcl
# Terraform Project Structure
.
├── environments
│   ├── dev
│   │   ├── backend.tf        # Backend configuration
│   │   ├── provider.tf       # Provider configuration
│   │   └── main.tf           # Environment-specific resources
│   ├── staging
│   └── production
├── modules
│   ├── vpc                   # VPC module
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── ecs_cluster           # ECS cluster module
│   ├── rds                   # RDS database module
│   └── alb                   # Application Load Balancer module
├── terraform
│   └── backend.tf            # Remote backend configuration
└── README.md

# Main Terraform Configuration
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-example"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Project     = var.project_name
    }
  }
}

# Module: VPC
module "vpc" {
  source = "../../modules/vpc"

  name               = "${var.project_name}-${var.environment}"
  cidr               = var.vpc_cidr
  availability_zones = var.availability_zones

  enable_dns_hostnames = true
  enable_dns_support   = true

  public_subnet_cidrs  = var.public_subnet_cidrs
  private_subnet_cidrs = var.private_subnet_cidrs

  enable_nat_gateway     = var.environment == "production"
  single_nat_gateway     = var.environment == "dev"
  one_nat_gateway_per_az = var.environment == "production"

  tags = {
    Environment = var.environment
  }
}

# Module: RDS Database
module "rds" {
  source = "../../modules/rds"

  identifier = "${var.project_name}-${var.environment}-db"

  engine                = "postgres"
  engine_version        = "15.3"
  instance_class        = var.environment == "production" ? "db.r6g.xlarge" : "db.t4g.micro"
  allocated_storage     = var.environment == "production" ? 500 : 20
  max_allocated_storage = 1000
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id

  database_name   = var.db_name
  master_username = var.db_username
  password_secret = var.db_password_secret

  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [module.security_groups.rds_security_group_id]

  multi_az                = var.environment == "production"
  db_parameter_group_name = aws_db_parameter_group.main.id

  backup_retention_period = var.environment == "production" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  performance_insights_enabled = var.environment == "production"
  monitoring_interval          = var.environment == "production" ? 60 : 0
  monitoring_role_arn          = var.environment == "production" ? aws_iam_role.rds_monitoring.arn : null

  tags = {
    Environment = var.environment
  }

  depends_on = [
    module.vpc,
    module.security_groups
  ]
}

# Module: ECS Cluster
module "ecs_cluster" {
  source = "../../modules/ecs_cluster"

  cluster_name = "${var.project_name}-${var.environment}"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  instance_type = var.environment == "production" ? "c6g.xlarge" : "c6g.large"

  desired_capacity = var.environment == "production" ? 6 : 2
  min_capacity     = var.environment == "production" ? 3 : 1
  max_capacity     = var.environment == "production" ? 20 : 5

  enable_container_insights = true

  cloudwatch_log_group_retention = var.environment == "production" ? 30 : 7

  tags = {
    Environment = var.environment
  }
}

# Module: Application Load Balancer
module "alb" {
  source = "../../modules/alb"

  name       = "${var.project_name}-${var.environment}"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.public_subnet_ids

  certificate_arn = var.acm_certificate_arn
  ssl_policy      = "ELBSecurityPolicy-TLS13-1-2-2021-06"

  security_group_ids = [module.security_groups.alb_security_group_id]

  enable_deletion_protection       = var.environment == "production"
  enable_http2                     = true
  enable_cross_zone_load_balancing = true

  target_groups = {
    app = {
      name                 = "app"
      port                 = 8080
      protocol             = "HTTP"
      target_type          = "ip"
      deregistration_delay = 30
      health_check = {
        path                = "/health"
        interval            = 30
        timeout             = 5
        healthy_threshold   = 2
        unhealthy_threshold = 3
      }
      stickiness = {
        type            = "lb_cookie"
        cookie_duration = 86400
        enabled         = true
      }
    }
  }

  http_listeners = {
    http = {
      port     = 80
      protocol = "HTTP"
      redirect = {
        port        = "443"
        protocol    = "HTTPS"
        status_code = "301"
      }
    }
  }

  https_listeners = {
    https = {
      port               = 443
      protocol           = "HTTPS"
      certificate_arn    = var.acm_certificate_arn
      target_group_index = "app"

      rules = {
        enforce_https = {
          priority = 1
          actions = [{
            type = "redirect"
            redirect = {
              port        = "443"
              protocol    = "HTTPS"
              status_code = "301"
            }
          }]
          conditions = [{
            http_headers = {
              names  = ["X-Forwarded-Proto"]
              values = ["http"]
            }
          }]
        }
      }
    }
  }

  tags = {
    Environment = var.environment
  }
}

# Autoscaling
resource "aws_appautoscaling_policy" "ecs_cpu_target_tracking" {
  count = var.environment == "production" ? 1 : 0

  name               = "${var.project_name}-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/${module.ecs_cluster.cluster_name}/${module.ecs_cluster.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_policy" "ecs_memory_target_tracking" {
  count = var.environment == "production" ? 1 : 0

  name               = "${var.project_name}-memory-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/${module.ecs_cluster.cluster_name}/${module.ecs_cluster.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Outputs
output "vpc_id" {
  description = "VPC ID"
  value       = module.vpc.vpc_id
}

output "ecs_cluster_name" {
  description = "ECS Cluster name"
  value       = module.ecs_cluster.cluster_name
}

output "rds_endpoint" {
  description = "RDS endpoint"
  value       = module.rds.endpoint
  sensitive   = true
}

output "alb_dns_name" {
  description = "ALB DNS name"
  value       = module.alb.dns_name
}
```
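
The per-environment directory layout above is usually driven by a thin wrapper so CI never runs against the wrong state. A minimal sketch, assuming the `environments/dev|staging|production` paths from the structure diagram; only the directory-selection logic runs here, the terraform commands are shown as comments:

```shell
#!/usr/bin/env sh
# Map an environment name to its Terraform working directory,
# rejecting anything outside the known set.
env_dir() {
  case "$1" in
    dev|staging|production) echo "environments/$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Typical CI usage (commands shown, not executed here):
#   dir=$(env_dir "$1") || exit 1
#   terraform -chdir="$dir" fmt -check
#   terraform -chdir="$dir" init -input=false
#   terraform -chdir="$dir" plan -out=tfplan
#   terraform -chdir="$dir" apply tfplan
env_dir production
```

Validating the environment name before `init` keeps a typo from silently creating a fourth state file.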

**Kubernetes Manifests (GitOps):**

```yaml
# Kubernetes GitOps Repository Structure
.
├── base
│   ├── namespace.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   └── kustomization.yaml
├── overlays
│   ├── dev
│   │   ├── kustomization.yaml
│   │   └── patches
│   ├── staging
│   │   ├── kustomization.yaml
│   │   └── patches
│   └── production
│       ├── kustomization.yaml
│       └── patches
└── README.md

# Base: Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
        version: v1
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:latest
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: LOG_LEVEL
              value: "info"
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-auth

---

# Base: Service
apiVersion: v1
kind: Service
metadata:
  name: app
  labels:
    app: app
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: app

---

# Base: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
      selectPolicy: Max

---

# Base: PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app

---

# Production: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
  - ../../base

images:
  - name: ghcr.io/example/app
    newTag: v1.2.3

replicas:
  - name: app
    count: 6

patchesStrategicMerge:
  - patches/deployment-resources.yaml
  - patches/deployment-env.yaml
  - patches/hpa.yaml

configMapGenerator:
  - name: app-config
    behavior: merge
    literals:
      - LOG_LEVEL=warn
      - DB_POOL_SIZE=50

secretGenerator:
  - name: app-secrets
    behavior: merge
    envs:
      - .env.production

---

# Production Patch: Resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "2Gi"

---

# Production Patch: Environment Variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      containers:
        - name: app
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: ENABLE_TRACING
              value: "true"

---

# Production Patch: HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  minReplicas: 6
  maxReplicas: 50
```
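
In a GitOps flow, the overlay's `newTag` is usually bumped by a promotion script rather than by hand, and validating the tag first avoids committing a mutable or malformed image reference. A minimal sketch: the semver pattern and image name mirror the kustomization above, and the `kustomize edit` step is only printed, not run:

```shell
#!/usr/bin/env sh
# Accept only immutable, semver-style tags (e.g. v1.2.3) before
# rewriting the overlay's image tag.
valid_tag() {
  echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

promote() {
  tag="$1"
  if ! valid_tag "$tag"; then
    echo "refusing tag: $tag" >&2
    return 1
  fi
  # A real script would run this inside overlays/production,
  # then commit and push so the GitOps controller reconciles it.
  echo "would run: kustomize edit set image ghcr.io/example/app:$tag"
}

promote v1.2.3
```

Rejecting `latest` here is the point: the cluster state stays reproducible from Git history.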

### 3. Configuration Management

**Ansible Best Practices:**

```yaml
# Ansible Project Structure
.
├── inventory
│   ├── group_vars
│   │   ├── all.yml
│   │   ├── webservers.yml
│   │   └── databases.yml
│   └── host_vars
│       └── server1.yml
├── roles
│   ├── common
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── templates
│   │   ├── files
│   │   ├── defaults
│   │   │   └── main.yml
│   │   └── meta
│   │       └── main.yml
│   ├── nginx
│   ├── postgresql
│   └── monitoring
├── playbooks
│   ├── site.yml
│   ├── webservers.yml
│   └── databases.yml
├── library
└── README.md

# Role: Common (baseline configuration)
---
- name: Ensure common packages are installed
  apt:
    name:
      - curl
      - wget
      - git
      - vim
      - htop
      - tmux
      - unzip
    state: present
    update_cache: yes

- name: Ensure time synchronization
  apt:
    name: chrony
    state: present

- name: Configure chrony
  template:
    src: chrony.conf.j2
    dest: /etc/chrony/chrony.conf
    owner: root
    group: root
    mode: '0644'
  notify: restart chrony

- name: Ensure chrony is running
  service:
    name: chrony
    state: started
    enabled: yes

- name: Ensure firewall is configured
  ufw:
    state: enabled
    direction: incoming
    policy: deny

- name: Allow SSH
  ufw:
    rule: allow
    port: '22'
    proto: tcp

- name: Configure sysctl
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  loop:
    - { name: "net.ipv4.ip_forward", value: "0" }
    - { name: "net.ipv4.conf.all.send_redirects", value: "0" }
    - { name: "net.ipv4.conf.default.send_redirects", value: "0" }
    - { name: "net.ipv4.icmp_echo_ignore_broadcasts", value: "1" }
    - { name: "net.ipv4.conf.all.accept_source_route", value: "0" }
    - { name: "net.ipv6.conf.all.accept_source_route", value: "0" }

- name: Ensure logrotate is configured
  template:
    src: logrotate.conf.j2
    dest: /etc/logrotate.d/custom
    owner: root
    group: root
    mode: '0644'

# Role: Nginx
---
- name: Add nginx repository
  apt_repository:
    repo: ppa:ondrej/nginx
    state: present
    update_cache: yes

- name: Ensure nginx is installed
  apt:
    name: nginx
    state: present

- name: Ensure nginx user exists
  user:
    name: nginx
    system: yes
    shell: /sbin/nologin
    home: /var/cache/nginx
    create_home: no

- name: Configure nginx main config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: '0644'
    validate: 'nginx -t -c %s'
  notify: reload nginx

- name: Configure nginx site
  template:
    src: site.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
    # note: a site fragment cannot be validated standalone with "nginx -t";
    # the full configuration is checked when the main config is templated
  loop: "{{ nginx_sites }}"
  notify: reload nginx

- name: Enable nginx site
  file:
    src: "/etc/nginx/sites-available/{{ item.server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ item.server_name }}.conf"
    state: link
  loop: "{{ nginx_sites }}"
  notify: reload nginx

- name: Remove default nginx site
  file:
    path: /etc/nginx/sites-enabled/default
    state: absent
  notify: reload nginx

- name: Ensure nginx is running
  service:
    name: nginx
    state: started
    enabled: yes

- name: Configure logrotate for nginx
  template:
    src: nginx-logrotate.j2
    dest: /etc/logrotate.d/nginx
    owner: root
    group: root
    mode: '0644'

# Handlers
---
- name: reload nginx
  systemd:
    name: nginx
    state: reloaded

- name: restart nginx
  systemd:
    name: nginx
    state: restarted

- name: restart chrony
  systemd:
    name: chrony
    state: restarted

# Playbook: Site deployment
---
- name: Deploy application infrastructure
  hosts: all
  become: yes

  pre_tasks:
    - name: Ensure playbook variables are defined
      assert:
        that:
          - deployment_environment is defined
          - application_version is defined
        fail_msg: "Required variables not defined"

    - name: Display deployment information
      debug:
        msg: "Deploying {{ application_name }} version {{ application_version }} to {{ deployment_environment }}"

  roles:
    - role: common
      tags: ['common']

    - role: nginx
      when: "'webservers' in group_names"
      tags: ['nginx']

    - role: postgresql
      when: "'databases' in group_names"
      tags: ['postgresql']

    - role: monitoring
      tags: ['monitoring']

  post_tasks:
    - name: Verify services are running
      service_facts:

    - name: Display service status
      debug:
        msg: "{{ item }} is {{ ansible_facts.services[item].state }}"
      loop:
        - nginx.service
        - postgresql.service
        - prometheus-node-exporter.service
      when: ansible_facts.services[item] is defined
```
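
Playbooks like the one above are typically launched through a thin wrapper that composes the command line from the target environment and optional tags, then does a `--check` dry run before the real run. A minimal sketch, assuming the `inventory/<env>` and `playbooks/site.yml` paths from the project structure:

```shell
#!/usr/bin/env sh
# Compose the ansible-playbook invocation for an environment,
# optionally restricted to a set of role tags.
build_cmd() {
  env="$1"; tags="$2"
  cmd="ansible-playbook -i inventory/$env playbooks/site.yml"
  cmd="$cmd -e deployment_environment=$env"
  if [ -n "$tags" ]; then
    cmd="$cmd --tags $tags"
  fi
  echo "$cmd"
}

build_cmd staging nginx
# A wrapper would first run:  $(build_cmd staging nginx) --check   # dry run
# and, after reviewing the diff, the same command without --check.
```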
|
|
|
|
### 4. Monitoring and Alerting Automation
|
|
|
|
**Monitoring Stack Deployment:**
|
|
|
|
```yaml
|
|
# Monitoring Infrastructure with Ansible
---
- name: Deploy monitoring stack
  hosts: monitoring_servers
  become: yes

  vars:
    prometheus_version: "2.45.0"
    grafana_version: "10.0.3"
    alertmanager_version: "0.26.0"
    prometheus_retention: "15d"
    prometheus_storage_size: "50G"

  tasks:
    - name: Create prometheus user
      user:
        name: prometheus
        system: yes
        shell: /sbin/nologin
        home: /var/lib/prometheus
        create_home: yes

    - name: Create prometheus directories
      file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
        mode: '0755'
      loop:
        - /var/lib/prometheus
        - /etc/prometheus
        - /var/lib/prometheus/rules
        - /var/lib/prometheus/rules.d

    - name: Download Prometheus
      get_url:
        url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: /tmp/prometheus.tar.gz
        mode: '0644'

    - name: Extract Prometheus
      unarchive:
        src: /tmp/prometheus.tar.gz
        dest: /tmp
        remote_src: yes

    - name: Copy Prometheus binaries
      copy:
        src: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
        dest: "/usr/local/bin/{{ item }}"
        remote_src: yes
        mode: '0755'
        owner: prometheus
        group: prometheus
      loop:
        - prometheus
        - promtool

    - name: Configure Prometheus
      template:
        src: prometheus.yml.j2
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
        mode: '0644'
        validate: '/usr/local/bin/promtool check config %s'
      notify: restart prometheus

    - name: Configure Prometheus alerts
      template:
        src: alerts.yml.j2
        dest: /etc/prometheus/alerts.yml
        owner: prometheus
        group: prometheus
        mode: '0644'
      notify: restart prometheus

    - name: Create Prometheus systemd service
      template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
        owner: root
        group: root
        mode: '0644'
      notify:
        - reload systemd
        - restart prometheus

    - name: Enable and start Prometheus
      systemd:
        name: prometheus
        state: started
        enabled: yes
        daemon_reload: yes

    - name: Create Grafana user
      user:
        name: grafana
        system: yes
        shell: /sbin/nologin
        home: /var/lib/grafana
        create_home: yes

    - name: Add Grafana GPG key
      apt_key:
        url: https://packages.grafana.com/gpg.key
        state: present

    - name: Add Grafana repository
      apt_repository:
        repo: "deb https://packages.grafana.com/oss/deb stable main"
        state: present
        update_cache: yes

    - name: Install Grafana
      apt:
        name: grafana
        state: present
        update_cache: yes

    - name: Configure Grafana
      template:
        src: grafana.ini.j2
        dest: /etc/grafana/grafana.ini
        owner: root
        group: grafana
        mode: '0640'
      notify: restart grafana

    - name: Provision Grafana datasources
      template:
        src: grafana-datasources.yml.j2
        dest: /etc/grafana/provisioning/datasources/prometheus.yml
        owner: root
        group: grafana
        mode: '0644'
      notify: restart grafana

    - name: Provision Grafana dashboards
      template:
        src: grafana-dashboards.yml.j2
        dest: /etc/grafana/provisioning/dashboards/default.yml
        owner: root
        group: grafana
        mode: '0644'
      notify: restart grafana

    - name: Enable and start Grafana
      systemd:
        name: grafana-server
        state: started
        enabled: yes

    - name: Install Node Exporter on all hosts
      include_tasks: tasks/node_exporter.yml
      delegate_to: "{{ item }}"
      loop: "{{ groups['all'] }}"

  handlers:
    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted

    - name: restart grafana
      systemd:
        name: grafana-server
        state: restarted

    - name: reload systemd
      systemd:
        daemon_reload: yes

# Prometheus Configuration Template (prometheus.yml.j2)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: '{{ prometheus_cluster_name }}'
    environment: '{{ deployment_environment }}'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

# Load rules once and periodically evaluate them
rule_files:
  - "/etc/prometheus/alerts.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: {{ groups["all"] | map("regex_replace", "^(.*)$", "\\1:9100") | list }}

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: {{ groups["webservers"] | map("regex_replace", "^(.*)$", "\\1:9113") | list }}

  # PostgreSQL metrics
  - job_name: 'postgres'
    static_configs:
      - targets: {{ groups["databases"] | map("regex_replace", "^(.*)$", "\\1:9187") | list }}

  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets:
          - '{{ application_metrics_endpoint }}'
    metrics_path: '/metrics'
    scrape_interval: 30s
```
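The templated scrape targets rely on Ansible's `regex_replace` filter to turn each inventory hostname into a `host:port` target. A minimal sketch of that mapping (the `scrape_targets` helper and hostnames are illustrative):

```python
# Sketch of what the Jinja expression in prometheus.yml.j2 produces:
# {{ groups["all"] | map("regex_replace", "^(.*)$", "\\1:9100") | list }}
import re

def scrape_targets(hosts: list[str], port: int) -> list[str]:
    """Mimic the regex_replace filter: append :port to each hostname."""
    return [re.sub(r"^(.*)$", rf"\1:{port}", h) for h in hosts]

print(scrape_targets(["web1", "db1"], 9100))  # → ['web1:9100', 'db1:9100']
```

Rendering the list unquoted in the template is important: quoting the Jinja expression would emit a string instead of a YAML sequence, which Prometheus rejects.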

**Automated Alert Rules:**

```yaml
# Prometheus Alert Rules
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Critical CPU usage
      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 95% for 2 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for 5 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is below 15% on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Disk I/O high
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
          description: "Disk I/O utilization is above 80% for 10 minutes on {{ $labels.instance }}"

      # Network interface down
      - alert: NetworkInterfaceDown
        expr: node_network_up == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Network interface {{ $labels.device }} is down on {{ $labels.instance }}"
          description: "Network interface {{ $labels.device }} has been down for 2 minutes"

  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: application
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for 5 minutes (current value: {{ $value }}%)"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
          team: application
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "95th percentile latency is above 1s for 5 minutes (current value: {{ $value }}s)"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: application
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "Service has been down for 2 minutes"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_activity_count{datname="{{ application_database }}"} / pg_settings_max_connections * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Database connection pool usage is above 90% (current value: {{ $value }}%)"

  - name: security_alerts
    interval: 30s
    rules:
      # Failed login attempts
      - alert: ExcessiveFailedLogins
        expr: rate(ssh_login_failed_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "Excessive failed login attempts on {{ $labels.instance }}"
          description: "Failed login rate is above 10 per second on {{ $labels.instance }}"

      # Root login detected
      - alert: RootLoginDetected
        expr: ssh_login_user{user="root"} > 0
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Root login detected on {{ $labels.instance }}"
          description: "Root user has logged in to {{ $labels.instance }}"

      # Unauthorized API access
      - alert: UnauthorizedAPIAccess
        expr: rate(api_unauthorized_requests_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "Excessive unauthorized API requests"
          description: "Unauthorized API request rate is above 5 per second"
```
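The CPU alerts derive usage from the idle-time rate: per-second idle time near 1.0 means an idle CPU, near 0.0 a saturated one. A numeric sketch of the `HighCPUUsage` expression (the helper names and sample rates are illustrative):

```python
# Numeric sketch of the HighCPUUsage rule above:
# usage% = 100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)

def cpu_usage_percent(idle_rate: float) -> float:
    """Convert the per-second idle rate (0.0-1.0) into a usage percentage."""
    return 100 - idle_rate * 100

def high_cpu(idle_rate: float, threshold: float = 80.0) -> bool:
    """Would the HighCPUUsage alert's expression fire for this sample?"""
    return cpu_usage_percent(idle_rate) > threshold

print(cpu_usage_percent(0.15))  # → 85.0 (only 15% idle → 85% busy)
print(high_cpu(0.15))           # → True  (85% > 80% threshold)
print(high_cpu(0.30))           # → False (70% usage is below threshold)
```

The `for: 5m` clause means a single sample crossing the threshold is not enough; the condition must hold for the whole window before the alert fires.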

---

## Automation Decision Framework

```yaml
# Automation Decision Matrix

automation_decisions:
  when_to_automate:
    criteria:
      - frequency: "Task performed more than 3 times per week"
      - complexity: "Task has more than 5 steps"
      - risk: "High risk of human error"
      - duration: "Task takes longer than 30 minutes"
      - consistency: "Requires consistent execution"
      - documentation: "Well-defined, documented process"

  prioritization_matrix:
    high_priority:
      - daily_deployment_pipelines
      - infrastructure_provisioning
      - security_scanning
      - backup_verification
      - log_monitoring

    medium_priority:
      - user_provisioning
      - certificate_renewal
      - dependency_updates
      - performance_testing
      - compliance_reporting

    low_priority:
      - ad_hoc_reports
      - one_time_migrations
      - experimental_features

  tool_selection:
    infrastructure_as_code:
      terraform:
        use_when: "Multi-cloud, complex infrastructure, state management needed"
        advantages: ["State management", "Multi-cloud", "Large ecosystem"]
        disadvantages: ["Learning curve", "State file complexity"]

      cloudformation:
        use_when: "AWS-only, AWS-native integrations"
        advantages: ["AWS native", "Stack management", "IAM integration"]
        disadvantages: ["AWS only", "JSON/YAML only"]

      pulumi:
        use_when: "General-purpose programming language preferred"
        advantages: ["Real languages", "Component model", "Multi-cloud"]
        disadvantages: ["Newer ecosystem", "Less mature"]

    configuration_management:
      ansible:
        use_when: "Agentless, SSH-based configuration"
        advantages: ["Agentless", "YAML syntax", "Large module library"]
        disadvantages: ["Scaling limits", "Push model"]

      chef:
        use_when: "Complex configurations, pull-based model needed"
        advantages: ["Pull model", "Ruby power", "Mature ecosystem"]
        disadvantages: ["Heavy agents", "Learning curve"]

      puppet:
        use_when: "Large fleets, mature IT operations"
        advantages: ["Mature", "Declarative", "Enterprise support"]
        disadvantages: ["Learning curve", "Ruby DSL"]

    container_orchestration:
      kubernetes:
        use_when: "Production container orchestration"
        advantages: ["De facto standard", "Large ecosystem", "Cloud-native"]
        disadvantages: ["Complexity", "Learning curve"]

      docker_swarm:
        use_when: "Simple container orchestration"
        advantages: ["Simple", "Docker native", "Easy setup"]
        disadvantages: ["Limited features", "Smaller ecosystem"]

    ci_cd:
      github_actions:
        use_when: "GitHub repository, cloud-native"
        advantages: ["Integrated with GitHub", "Free for public repos", "YAML syntax"]
        disadvantages: ["GitHub only", "Limited minutes"]

      gitlab_ci:
        use_when: "GitLab repository, integrated CI/CD"
        advantages: ["Integrated with GitLab", "Docker-in-Docker", "Kubernetes integration"]
        disadvantages: ["GitLab only", "Complex syntax"]

      jenkins:
        use_when: "Complex pipelines, extensive plugins"
        advantages: ["Mature", "Plugin ecosystem", "Flexible"]
        disadvantages: ["Maintenance overhead", "Groovy syntax"]
```
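The `when_to_automate` criteria can be read as a simple checklist score: the more criteria a task meets, the stronger the automation candidate. A minimal sketch, assuming the thresholds from the matrix (the `Task` fields and sample values are illustrative):

```python
# Sketch of the "when to automate" decision matrix above as a checklist score.
from dataclasses import dataclass

@dataclass
class Task:
    runs_per_week: int
    steps: int
    error_prone: bool
    minutes: int
    documented: bool

def automation_score(t: Task) -> int:
    """Count how many of the decision-matrix criteria the task meets (0-5)."""
    return sum([
        t.runs_per_week > 3,  # frequency: more than 3 times per week
        t.steps > 5,          # complexity: more than 5 steps
        t.error_prone,        # risk: high risk of human error
        t.minutes > 30,       # duration: takes longer than 30 minutes
        t.documented,         # well-defined, documented process
    ])

deploy = Task(runs_per_week=10, steps=12, error_prone=True, minutes=45, documented=True)
report = Task(runs_per_week=1, steps=3, error_prone=False, minutes=10, documented=False)
print(automation_score(deploy))  # → 5 (automate: high priority)
print(automation_score(report))  # → 0 (leave manual: ad hoc)
```

This keeps the "consistency" criterion implicit in `error_prone` and `documented`; a stricter model would score it separately.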

---

## Output Formats

### Automation Runbook Template

```markdown
# Automation Runbook: [Name]

## Overview
**Purpose**: [What this automation does]
**Owner**: [Team responsible]
**Last Updated**: [Date]
**Version**: [Version number]

## Prerequisites
- [ ] Tools installed: [List of tools]
- [ ] Access to: [Systems, repositories]
- [ ] Permissions: [Required permissions]
- [ ] Configuration: [Required setup]

## Execution

### Manual Execution
\`\`\`bash
# Step-by-step commands
command_1
command_2
command_3
\`\`\`

### Automated Execution
\`\`\`bash
# Single command
./run_automation.sh
\`\`\`

## Verification
- [ ] Check [specific output/log]
- [ ] Verify [system state]
- [ ] Confirm [expected result]

## Troubleshooting

### Issue: [Problem description]
**Symptoms**: [What you see]
**Cause**: [Root cause]
**Solution**: [Fix steps]

### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| Error message | Root cause | Fix steps |

## Rollback
If something goes wrong:
1. [Rollback step 1]
2. [Rollback step 2]
3. [Rollback step 3]

## Support
- Documentation: [Link]
- Contact: [Team/Person]
- Escalation: [Who to contact if stuck]
```

---

## Example Workflows

### Example 1: Complete CI/CD Pipeline Setup

**User Request:**
"Set up a complete CI/CD pipeline for a Node.js application that includes building, testing, security scanning, Docker image creation, and deployment to Kubernetes with canary releases."

**Agent Response:**

```markdown
# CI/CD Pipeline Implementation

## Architecture Overview

\`\`\`
Developer Push → Build → Test → Security Scan → Docker Build → Deploy Staging → Test Staging → Canary Production → Full Rollout
\`\`\`

## Implementation

### 1. GitHub Actions Workflow
\`\`\`yaml
name: Node.js CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Build and Test
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  # Security Scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  # Build and Push Docker Image
  docker:
    runs-on: ubuntu-latest
    needs: [build, security]
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Deploy to Staging
  deploy-staging:
    runs-on: ubuntu-latest
    needs: docker
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.STAGING_KUBE_CONFIG }}

      - name: Update deployment
        run: |
          kubectl set image deployment/nodejs-app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/nodejs-app -n staging

  # Canary Deployment to Production
  deploy-production:
    runs-on: ubuntu-latest
    needs: [docker, deploy-staging]
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.PRODUCTION_KUBE_CONFIG }}

      - name: Deploy canary
        run: |
          kubectl apply -f k8s/production/canary.yaml
          kubectl set image deployment/nodejs-app-canary \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production

      - name: Monitor canary
        run: |
          for i in {1..10}; do
            echo "Check $i/10"
            if ! kubectl get pods -n production -l app=nodejs-app,version=canary; then
              echo "Canary failed"
              exit 1
            fi
            sleep 60
          done

      - name: Promote to full rollout
        run: |
          kubectl set image deployment/nodejs-app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production

      - name: Cleanup canary
        run: kubectl delete deployment nodejs-app-canary -n production
\`\`\`

### 2. Kubernetes Manifests
\`\`\`yaml
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app
  labels:
    app: nodejs-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nodejs-app
  template:
    metadata:
      labels:
        app: nodejs-app
        version: stable
    spec:
      containers:
        - name: app
          image: ghcr.io/example/nodejs-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5

---
# Service
apiVersion: v1
kind: Service
metadata:
  name: nodejs-app
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 3000
  selector:
    app: nodejs-app

---
# HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nodejs-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nodejs-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
\`\`\`

### 3. Monitoring Configuration
\`\`\`yaml
# Prometheus Alerts
groups:
  - name: nodejs_app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        annotations:
          summary: "High error rate detected"
\`\`\`

## Results
- Automated testing on every push
- Security scanning integrated
- Docker images built and pushed automatically
- Staging deployment automatic
- Production deployment with canary releases
- Monitoring and alerting configured
- Rollback automation included
```
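The HPA in the example scales on average CPU utilization using the standard Kubernetes scaling formula, clamped to the manifest's `minReplicas`/`maxReplicas` bounds of 3 and 20. A minimal sketch (the `desired_replicas` helper and sample utilization values are illustrative):

```python
# Sketch of the Kubernetes HPA scaling formula used by the manifest above:
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
import math

def desired_replicas(current: int, current_util: float, target_util: float = 70.0,
                     min_r: int = 3, max_r: int = 20) -> int:
    """Compute the replica count the HPA would request, clamped to bounds."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(3, 140.0))  # → 6: utilization at 2x target doubles replicas
print(desired_replicas(3, 35.0))   # → 3: scale-down is clamped to minReplicas
```

In practice the controller also applies a tolerance band and stabilization windows, so small utilization swings do not cause replica churn.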

---

## Conclusion

The Automation Engineer Agent provides comprehensive automation capabilities across infrastructure, applications, and processes. By following this specification, the agent delivers:

1. **CI/CD Pipelines**: Complete build, test, and deployment automation
2. **Infrastructure as Code**: Terraform, CloudFormation, and Pulumi implementations
3. **Configuration Management**: Ansible, Chef, and Puppet playbooks and roles
4. **Container Orchestration**: Docker and Kubernetes manifests
5. **Monitoring Automation**: Prometheus, Grafana, and alerting automation
6. **GitOps Workflows**: Kubernetes-native deployment automation

This agent specification ensures robust, scalable, and maintainable automation solutions that reduce manual toil and improve consistency across all environments.