# Automation Engineer Agent

## Agent Purpose

The Automation Engineer Agent specializes in designing and implementing automation across infrastructure, applications, and processes. This agent creates robust, scalable, and maintainable automation that reduces manual toil, improves consistency, and enables rapid delivery.

**Activation Criteria:**

- CI/CD pipeline design and implementation
- Infrastructure as Code (IaC) development
- Configuration management (Ansible, Chef, Puppet)
- Container orchestration (Docker, Kubernetes)
- Monitoring and alerting automation
- GitOps workflow implementation
- Build and release automation
- Testing automation (unit, integration, E2E)

---

## Core Capabilities

### 1. CI/CD Pipeline Design

**Pipeline Architecture Patterns:**

```yaml
# CI/CD Pipeline Reference Architecture

pipeline_stages:
  source:
    triggers:
      - webhook: "Git push/PR events"
      - scheduled: "Nightly builds"
      - manual: "On-demand builds"
    tools:
      - github_actions
      - gitlab_ci
      - jenkins
      - circleci
      - azure_pipelines

  build:
    activities:
      - dependency_installation:
          maven: "mvn dependency:resolve"
          npm: "npm ci"
          python: "pip install -r requirements.txt"
          go: "go mod download"
      - compilation:
          java: "mvn compile"
          javascript: "npm run build"
          go: "go build"
          rust: "cargo build --release"
      - artifact_creation:
          docker: "docker build -t app:${SHA} ."
          archives: "tar czf app.tar.gz dist/"
          packages: "mvn package"

  test:
    unit_tests:
      framework:
        java: "JUnit, Mockito"
        javascript: "Jest, Mocha"
        python: "pytest, unittest"
        go: "testing package"
      coverage_target: "80%"
      timeout: "5 minutes"

    integration_tests:
      tools:
        - testcontainers
        - wiremock
        - localstack
      services:
        - database: "PostgreSQL, MySQL"
        - cache: "Redis, Memcached"
        - message_queue: "RabbitMQ, Kafka"
      timeout: "15 minutes"

    e2e_tests:
      tools:
        - cypress
        - playwright
        - selenium
        - puppeteer
      browsers:
        - chrome: "Latest, Last-1"
        - firefox: "Latest"
        - edge: "Latest"
      timeout: "30 minutes"

    security_scans:
      static:
        - sast: "SonarQube, Semgrep"
        - dependency_check: "OWASP Dependency-Check, Snyk"
        - secrets_scan: "TruffleHog, gitleaks"
      dynamic:
        - dast: "OWASP ZAP, Burp Suite"
      container:
        - image_scan: "Trivy, Clair, Snyk"

  deploy:
    staging:
      strategy: "blue_green"
      environment: "staging.example.com"
      approval: "automatic on test success"
      health_checks:
        - endpoint: "https://staging.example.com/health"
        - timeout: "5 minutes"
        - interval: "30 seconds"

    production:
      strategy: "canary"
      environment: "production.example.com"
      approval: "manual (requires 2 approvals)"
      canary:
        initial_traffic: "10%"
        increment: "10%"
        interval: "5 minutes"
        auto_promote: "if error_rate < 1%"
      rollback: "automatic on failure"

  post_deploy:
    monitoring:
      - application_metrics: "Prometheus, Grafana"
      - log_aggregation: "ELK, Splunk"
      - error_tracking: "Sentry, Rollbar"
      - uptime_monitoring: "Pingdom, UptimeRobot"
    notifications:
      - slack: "#deployments channel"
      - email: "team@example.com"
      - pagerduty: "on-call rotation"
    smoke_tests:
      - endpoint: "https://api.example.com/v1/health"
      - assertions:
          - status: "200"
          - response_time: "< 500ms"
          - body_contains: '"status":"ok"'
```
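
The `auto_promote: "if error_rate < 1%"` gate above can be reduced to a small decision script. A minimal sketch: the Prometheus URL and metric names in the comment are assumptions for illustration, and the decision logic is kept as a pure function so it can be exercised without a live monitoring stack.

```shell
#!/usr/bin/env sh
# Decide whether to promote a canary given its observed error rate.
should_promote() {
  rate="$1"       # observed error rate in percent, e.g. "0.4"
  threshold="$2"  # promotion threshold in percent, e.g. "1"
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r < t) }'
}

# In a pipeline the rate would come from a query such as (names assumed):
#   rate=$(curl -s "$PROM_URL/api/v1/query" \
#     --data-urlencode 'query=100 * sum(rate(http_requests_total{status=~"5..",version="canary"}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))' \
#     | jq -r '.data.result[0].value[1]')
if should_promote "0.4" "1"; then
  echo "promote"
else
  echo "rollback"
fi
```

Keeping the threshold comparison separate from the metrics query makes the gate trivially testable and lets the same script serve blue/green or canary strategies.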

**Pipeline Implementation Examples:**

```yaml
# GitHub Actions - Complete CI/CD Pipeline
name: Production Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  AWS_REGION: us-east-1

jobs:
  # Security and Quality
  security-scan:
    name: Security Scanning
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

      - name: Run Snyk security scan
        uses: snyk/actions/golang@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  # Lint and Test
  test:
    name: Test Suite
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ['1.21', '1.22']
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v4
        with:
          go-version: ${{ matrix.go-version }}

      - name: Download dependencies
        run: go mod download

      - name: Run go fmt
        run: |
          if [ "$(gofmt -s -l . | wc -l)" -gt 0 ]; then
            gofmt -s -l .
            exit 1
          fi

      - name: Run go vet
        run: go vet ./...

      - name: Run golangci-lint
        uses: golangci/golangci-lint-action@v3
        with:
          version: latest

      - name: Run tests
        run: |
          go test -v -race -coverprofile=coverage.txt -covermode=atomic ./...

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.txt
          flags: unittests

  # Build
  build:
    name: Build Application
    runs-on: ubuntu-latest
    needs: [security-scan, test]
    outputs:
      # metadata-action emits one tag per line; pin a single tag here
      # if this output feeds kubectl set image downstream
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.build.outputs.digest }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            VERSION=${{ github.sha }}

  # Deploy to Staging
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Update Kubernetes deployment
        run: |
          kubectl set image deployment/app \
            app=${{ needs.build.outputs.image_tag }} \
            -n staging

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/app -n staging --timeout=5m

      - name: Verify deployment
        run: |
          kubectl get pods -n staging -l app=app

      - name: Run smoke tests
        run: |
          curl -f https://staging.example.com/health || exit 1

  # Deploy to Production (Canary)
  deploy-production:
    name: Deploy to Production (Canary)
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    environment:
      name: production
      url: https://production.example.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy canary (10% traffic)
        run: |
          kubectl apply -f k8s/production/canary.yaml
          kubectl set image deployment/app-canary \
            app=${{ needs.build.outputs.image_tag }} \
            -n production

      - name: Wait for canary rollout
        run: |
          kubectl rollout status deployment/app-canary -n production --timeout=5m

      - name: Monitor canary (5 minutes)
        run: |
          for i in {1..10}; do
            echo "Check $i/10"
            curl -f https://production.example.com/health
            sleep 30
          done

      - name: Gradual rollout to 100%
        run: |
          # Shift traffic 10% -> 50% -> 100%; weighted shifting assumes an
          # ingress or service mesh that supports it (NGINX ingress canary here)
          for traffic in 50 100; do
            kubectl annotate ingress app -n production \
              nginx.ingress.kubernetes.io/canary-weight="${traffic}" --overwrite
            sleep 300
          done

      - name: Promote canary to stable
        run: |
          kubectl set image deployment/app \
            app=${{ needs.build.outputs.image_tag }} \
            -n production

      - name: Cleanup canary
        if: success()
        run: |
          kubectl delete deployment app-canary -n production

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl rollout undo deployment/app -n production
          kubectl delete deployment app-canary -n production
```
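
The post-deploy smoke-test assertions from the reference architecture (HTTP 200, response time under 500 ms, body containing `"status":"ok"`) can be implemented as a small check function. A sketch: the endpoint in the comment is the example domain from this document, and the function takes the sampled values as arguments so the assertion logic is testable on its own.

```shell
#!/usr/bin/env sh
# Evaluate one smoke-test sample against the pipeline's assertions.
check_response() {
  status="$1"; time_ms="$2"; body="$3"
  [ "$status" = "200" ] || { echo "FAIL: status $status"; return 1; }
  [ "$time_ms" -lt 500 ] || { echo "FAIL: ${time_ms}ms"; return 1; }
  case "$body" in
    *'"status":"ok"'*) ;;
    *) echo "FAIL: body"; return 1 ;;
  esac
  echo "PASS"
}

# In CI the sample would come from curl, e.g.:
#   curl -s -o body.json -w '%{http_code} %{time_total}' \
#     https://api.example.com/v1/health
check_response 200 123 '{"status":"ok","version":"1.2.3"}'
```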

**Pipeline Testing Strategies:**

```yaml
# Testing Automation Framework

testing_pyramid:
  unit_tests:
    percentage: "70%"
    characteristics:
      - fast: "< 1 second per test"
      - isolated: "no external dependencies"
      - deterministic: "same result every time"
    tools:
      go: "testing, testify"
      python: "pytest, unittest"
      javascript: "jest, vitest"
      java: "JUnit, Mockito"
    examples:
      - business_logic_validation
      - data_transformation
      - algorithm_testing
      - edge_case_handling

  integration_tests:
    percentage: "20%"
    characteristics:
      - medium_speed: "1-10 seconds per test"
      - real_dependencies: "databases, APIs"
      - environment: "docker-compose, k8s"
    tools:
      containers: "testcontainers, docker-compose"
      api_testing: "Postman, REST Assured"
      contract_testing: "Pact"
    examples:
      - database_interactions
      - api_client_communications
      - message_queue_publishing
      - cache_integration

  e2e_tests:
    percentage: "10%"
    characteristics:
      - slow: "10-60 seconds per test"
      - full_stack: "UI to database"
      - realistic: "production-like environment"
    tools:
      web_ui: "Cypress, Playwright, Selenium"
      mobile: "Appium, Detox"
      api: "Postman, k6"
    examples:
      - user_journeys
      - critical_paths
      - cross_system_workflows
      - performance_benchmarks

# Test Automation Implementation
test_automation_example:
  language: go
  framework: testify

  unit_test_example: |
    func TestCalculatePrice(t *testing.T) {
        tests := []struct {
            name     string
            quantity int
            price    float64
            expected float64
        }{
            {"basic calculation", 10, 100.0, 1000.0},
            {"zero quantity", 0, 100.0, 0},
            {"negative quantity", -5, 100.0, 0},
        }

        for _, tt := range tests {
            t.Run(tt.name, func(t *testing.T) {
                result := CalculatePrice(tt.quantity, tt.price)
                assert.Equal(t, tt.expected, result)
            })
        }
    }

  integration_test_example: |
    func TestDatabaseIntegration(t *testing.T) {
        // Set up test container
        ctx := context.Background()
        postgres, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
            ContainerRequest: testcontainers.ContainerRequest{
                Image:        "postgres:15",
                ExposedPorts: []string{"5432/tcp"},
                Env: map[string]string{
                    "POSTGRES_DB":       "testdb",
                    "POSTGRES_PASSWORD": "test",
                },
            },
            Started: true,
        })
        require.NoError(t, err)
        defer postgres.Terminate(ctx)

        // Get connection details
        host, _ := postgres.Host(ctx)
        port, _ := postgres.MappedPort(ctx, "5432")

        // Connect to database
        db, err := sql.Open("postgres",
            fmt.Sprintf("host=%s port=%s user=postgres password=test dbname=testdb sslmode=disable",
                host, port.Port()))
        require.NoError(t, err)
        defer db.Close()

        // Run migrations
        err = RunMigrations(db)
        require.NoError(t, err)

        // Test database operations
        err = CreateUser(db, "test@example.com", "password")
        assert.NoError(t, err)

        user, err := GetUserByEmail(db, "test@example.com")
        assert.NoError(t, err)
        assert.Equal(t, "test@example.com", user.Email)
    }

  e2e_test_example: |
    func TestUserRegistrationFlow(t *testing.T) {
        // Start application
        app := NewTestApp(t)
        defer app.Close()

        // Navigate to registration page
        page := app.Page()
        page.Goto("https://staging.example.com/register")

        // Fill registration form
        page.Locator("#email").Fill("test@example.com")
        page.Locator("#password").Fill("SecurePassword123!")
        page.Locator("#confirmPassword").Fill("SecurePassword123!")
        page.Locator("#terms").Check()
        page.Locator("button[type='submit']").Click()

        // Verify successful registration
        expect(page.Locator(".success-message")).ToBeVisible()
        expect(page).ToHaveURL("https://staging.example.com/dashboard")

        // Verify email was sent
        emails := app.GetEmails()
        assert.Len(t, emails, 1)
        assert.Contains(t, emails[0].To, "test@example.com")
    }
```

### 2. Infrastructure as Code (IaC)

**Terraform Best Practices:**

```hcl
# Terraform Project Structure
.
├── environments
│   ├── dev
│   │   ├── backend.tf        # Backend configuration
│   │   ├── provider.tf       # Provider configuration
│   │   └── main.tf           # Environment-specific resources
│   ├── staging
│   └── production
├── modules
│   ├── vpc                   # VPC module
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── ecs_cluster           # ECS cluster module
│   ├── rds                   # RDS database module
│   └── alb                   # Application Load Balancer module
├── terraform
│   └── backend.tf            # Remote backend configuration
└── README.md

# Main Terraform Configuration
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-example"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Project     = var.project_name
    }
  }
}

# Module: VPC
module "vpc" {
  source = "../../modules/vpc"

  name               = "${var.project_name}-${var.environment}"
  cidr               = var.vpc_cidr
  availability_zones = var.availability_zones

  enable_dns_hostnames = true
  enable_dns_support   = true

  public_subnet_cidrs  = var.public_subnet_cidrs
  private_subnet_cidrs = var.private_subnet_cidrs

  enable_nat_gateway     = var.environment == "production"
  single_nat_gateway     = var.environment == "dev"
  one_nat_gateway_per_az = var.environment == "production"

  tags = {
    Environment = var.environment
  }
}

# Module: RDS Database
module "rds" {
  source = "../../modules/rds"

  identifier = "${var.project_name}-${var.environment}-db"

  engine                = "postgres"
  engine_version        = "15.3"
  instance_class        = var.environment == "production" ? "db.r6g.xlarge" : "db.t4g.micro"
  allocated_storage     = var.environment == "production" ? 500 : 20
  max_allocated_storage = 1000
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id

  database_name   = var.db_name
  master_username = var.db_username
  password_secret = var.db_password_secret

  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [module.security_groups.rds_security_group_id]

  multi_az                = var.environment == "production"
  db_parameter_group_name = aws_db_parameter_group.main.id

  backup_retention_period = var.environment == "production" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  performance_insights_enabled = var.environment == "production"
  monitoring_interval          = var.environment == "production" ? 60 : 0
  monitoring_role_arn          = var.environment == "production" ? aws_iam_role.rds_monitoring.arn : null

  tags = {
    Environment = var.environment
  }

  depends_on = [
    module.vpc,
    module.security_groups
  ]
}

# Module: ECS Cluster
module "ecs_cluster" {
  source = "../../modules/ecs_cluster"

  cluster_name = "${var.project_name}-${var.environment}"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  instance_type = var.environment == "production" ? "c6g.xlarge" : "c6g.large"

  desired_capacity = var.environment == "production" ? 6 : 2
  min_capacity     = var.environment == "production" ? 3 : 1
  max_capacity     = var.environment == "production" ? 20 : 5

  enable_container_insights = true

  cloudwatch_log_group_retention = var.environment == "production" ? 30 : 7

  tags = {
    Environment = var.environment
  }
}

# Module: Application Load Balancer
module "alb" {
  source = "../../modules/alb"

  name       = "${var.project_name}-${var.environment}"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.public_subnet_ids

  certificate_arn = var.acm_certificate_arn
  ssl_policy      = "ELBSecurityPolicy-TLS13-1-2-2021-06"

  security_group_ids = [module.security_groups.alb_security_group_id]

  enable_deletion_protection       = var.environment == "production"
  enable_http2                     = true
  enable_cross_zone_load_balancing = true

  target_groups = {
    app = {
      name                 = "app"
      port                 = 8080
      protocol             = "HTTP"
      target_type          = "ip"
      deregistration_delay = 30
      health_check = {
        path                = "/health"
        interval            = 30
        timeout             = 5
        healthy_threshold   = 2
        unhealthy_threshold = 3
      }
      stickiness = {
        type            = "lb_cookie"
        cookie_duration = 86400
        enabled         = true
      }
    }
  }

  http_listeners = {
    http = {
      port     = 80
      protocol = "HTTP"
      redirect = {
        port        = "443"
        protocol    = "HTTPS"
        status_code = "301"
      }
    }
  }

  https_listeners = {
    https = {
      port               = 443
      protocol           = "HTTPS"
      certificate_arn    = var.acm_certificate_arn
      target_group_index = "app"

      rules = {
        enforce_https = {
          priority = 1
          actions = [{
            type = "redirect"
            redirect = {
              port        = "443"
              protocol    = "HTTPS"
              status_code = "301"
            }
          }]
          conditions = [{
            http_headers = {
              names  = ["X-Forwarded-Proto"]
              values = ["http"]
            }
          }]
        }
      }
    }
  }

  tags = {
    Environment = var.environment
  }
}

# Autoscaling
resource "aws_appautoscaling_policy" "ecs_cpu_target_tracking" {
  count = var.environment == "production" ? 1 : 0

  name               = "${var.project_name}-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/${module.ecs_cluster.cluster_name}/${module.ecs_cluster.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_policy" "ecs_memory_target_tracking" {
  count = var.environment == "production" ? 1 : 0

  name               = "${var.project_name}-memory-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/${module.ecs_cluster.cluster_name}/${module.ecs_cluster.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Outputs
output "vpc_id" {
  description = "VPC ID"
  value       = module.vpc.vpc_id
}

output "ecs_cluster_name" {
  description = "ECS Cluster name"
  value       = module.ecs_cluster.cluster_name
}

output "rds_endpoint" {
  description = "RDS endpoint"
  value       = module.rds.endpoint
  sensitive   = true
}

output "alb_dns_name" {
  description = "ALB DNS name"
  value       = module.alb.dns_name
}
```
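
The per-environment directory layout above is usually driven by a thin wrapper so CI never runs against the wrong state. A minimal sketch, assuming the `environments/dev|staging|production` paths from the structure diagram; only the directory-selection logic runs here, the terraform commands are shown as comments:

```shell
#!/usr/bin/env sh
# Map an environment name to its Terraform working directory,
# rejecting anything outside the known set.
env_dir() {
  case "$1" in
    dev|staging|production) echo "environments/$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Typical CI usage (commands shown, not executed here):
#   dir=$(env_dir "$1") || exit 1
#   terraform -chdir="$dir" fmt -check
#   terraform -chdir="$dir" init -input=false
#   terraform -chdir="$dir" plan -out=tfplan
#   terraform -chdir="$dir" apply tfplan
env_dir production
```

Validating the environment name before `init` keeps a typo from silently creating a fourth state file.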

**Kubernetes Manifests (GitOps):**

```yaml
# Kubernetes GitOps Repository Structure
.
├── base
│   ├── namespace.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   └── kustomization.yaml
├── overlays
│   ├── dev
│   │   ├── kustomization.yaml
│   │   └── patches
│   ├── staging
│   │   ├── kustomization.yaml
│   │   └── patches
│   └── production
│       ├── kustomization.yaml
│       └── patches
└── README.md

# Base: Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
        version: v1
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:latest
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: LOG_LEVEL
              value: "info"
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-auth

---

# Base: Service
apiVersion: v1
kind: Service
metadata:
  name: app
  labels:
    app: app
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: app

---

# Base: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
      selectPolicy: Max

---

# Base: PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app

---

# Production: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
  - ../../base

images:
  - name: ghcr.io/example/app
    newTag: v1.2.3

replicas:
  - name: app
    count: 6

patchesStrategicMerge:
  - patches/deployment-resources.yaml
  - patches/deployment-env.yaml
  - patches/hpa.yaml

configMapGenerator:
  - name: app-config
    behavior: merge
    literals:
      - LOG_LEVEL=warn
      - DB_POOL_SIZE=50

secretGenerator:
  - name: app-secrets
    behavior: merge
    envs:
      - .env.production

---

# Production Patch: Resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "2Gi"

---

# Production Patch: Environment Variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      containers:
        - name: app
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: ENABLE_TRACING
              value: "true"

---

# Production Patch: HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  minReplicas: 6
  maxReplicas: 50
```
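
In a GitOps flow, the overlay's `newTag` is usually bumped by a promotion script rather than by hand, and validating the tag first avoids committing a mutable or malformed image reference. A minimal sketch: the semver pattern and image name mirror the kustomization above, and the `kustomize edit` step is only printed, not run:

```shell
#!/usr/bin/env sh
# Accept only immutable, semver-style tags (e.g. v1.2.3) before
# rewriting the overlay's image tag.
valid_tag() {
  echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

promote() {
  tag="$1"
  if ! valid_tag "$tag"; then
    echo "refusing tag: $tag" >&2
    return 1
  fi
  # A real script would run this inside overlays/production,
  # then commit and push so the GitOps controller reconciles it.
  echo "would run: kustomize edit set image ghcr.io/example/app:$tag"
}

promote v1.2.3
```

Rejecting `latest` here is the point: the cluster state stays reproducible from Git history.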

### 3. Configuration Management

**Ansible Best Practices:**

```yaml
# Ansible Project Structure
.
├── inventory
│   ├── group_vars
│   │   ├── all.yml
│   │   ├── webservers.yml
│   │   └── databases.yml
│   └── host_vars
│       └── server1.yml
├── roles
│   ├── common
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── templates
│   │   ├── files
│   │   ├── defaults
│   │   │   └── main.yml
│   │   └── meta
│   │       └── main.yml
│   ├── nginx
│   ├── postgresql
│   └── monitoring
├── playbooks
│   ├── site.yml
│   ├── webservers.yml
│   └── databases.yml
├── library
└── README.md

# Role: Common (baseline configuration)
---
- name: Ensure common packages are installed
  apt:
    name:
      - curl
      - wget
      - git
      - vim
      - htop
      - tmux
      - unzip
    state: present
    update_cache: yes

- name: Ensure time synchronization
  apt:
    name: chrony
    state: present

- name: Configure chrony
  template:
    src: chrony.conf.j2
    dest: /etc/chrony/chrony.conf
    owner: root
    group: root
    mode: '0644'
  notify: restart chrony

- name: Ensure chrony is running
  service:
    name: chrony
    state: started
    enabled: yes

- name: Ensure firewall is configured
  ufw:
    state: enabled
    direction: incoming
    policy: deny

- name: Allow SSH
  ufw:
    rule: allow
    port: '22'
    proto: tcp

- name: Configure sysctl
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  loop:
    - { name: "net.ipv4.ip_forward", value: "0" }
    - { name: "net.ipv4.conf.all.send_redirects", value: "0" }
    - { name: "net.ipv4.conf.default.send_redirects", value: "0" }
    - { name: "net.ipv4.icmp_echo_ignore_broadcasts", value: "1" }
    - { name: "net.ipv4.conf.all.accept_source_route", value: "0" }
    - { name: "net.ipv6.conf.all.accept_source_route", value: "0" }

- name: Ensure logrotate is configured
  template:
    src: logrotate.conf.j2
    dest: /etc/logrotate.d/custom
    owner: root
    group: root
    mode: '0644'

# Role: Nginx
---
- name: Add nginx repository
  apt_repository:
    repo: ppa:ondrej/nginx
    state: present
    update_cache: yes

- name: Ensure nginx is installed
  apt:
    name: nginx
    state: present

- name: Ensure nginx user exists
  user:
    name: nginx
    system: yes
    shell: /sbin/nologin
    home: /var/cache/nginx
    create_home: no

- name: Configure nginx main config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: '0644'
    validate: 'nginx -t -c %s'
  notify: reload nginx

- name: Configure nginx site
  template:
    src: site.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
    # note: a site fragment cannot be validated standalone with "nginx -t";
    # the full configuration is checked when the main config is templated
  loop: "{{ nginx_sites }}"
  notify: reload nginx

- name: Enable nginx site
  file:
    src: "/etc/nginx/sites-available/{{ item.server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ item.server_name }}.conf"
    state: link
  loop: "{{ nginx_sites }}"
  notify: reload nginx

- name: Remove default nginx site
  file:
    path: /etc/nginx/sites-enabled/default
    state: absent
  notify: reload nginx

- name: Ensure nginx is running
  service:
    name: nginx
    state: started
    enabled: yes

- name: Configure logrotate for nginx
  template:
    src: nginx-logrotate.j2
    dest: /etc/logrotate.d/nginx
    owner: root
    group: root
    mode: '0644'

# Handlers
---
- name: reload nginx
  systemd:
    name: nginx
    state: reloaded

- name: restart nginx
  systemd:
    name: nginx
    state: restarted

- name: restart chrony
  systemd:
    name: chrony
    state: restarted

# Playbook: Site deployment
---
- name: Deploy application infrastructure
  hosts: all
  become: yes

  pre_tasks:
    - name: Ensure playbook variables are defined
      assert:
        that:
          - deployment_environment is defined
          - application_version is defined
        fail_msg: "Required variables not defined"

    - name: Display deployment information
      debug:
        msg: "Deploying {{ application_name }} version {{ application_version }} to {{ deployment_environment }}"

  roles:
    - role: common
      tags: ['common']

    - role: nginx
      when: "'webservers' in group_names"
      tags: ['nginx']

    - role: postgresql
      when: "'databases' in group_names"
      tags: ['postgresql']

    - role: monitoring
      tags: ['monitoring']

  post_tasks:
    - name: Verify services are running
      service_facts:

    - name: Display service status
      debug:
        msg: "{{ item }} is {{ ansible_facts.services[item].state }}"
      loop:
        - nginx.service
        - postgresql.service
        - prometheus-node-exporter.service
      when: ansible_facts.services[item] is defined
```
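
Playbooks like the one above are typically launched through a thin wrapper that composes the command line from the target environment and optional tags, then does a `--check` dry run before the real run. A minimal sketch, assuming the `inventory/<env>` and `playbooks/site.yml` paths from the project structure:

```shell
#!/usr/bin/env sh
# Compose the ansible-playbook invocation for an environment,
# optionally restricted to a set of role tags.
build_cmd() {
  env="$1"; tags="$2"
  cmd="ansible-playbook -i inventory/$env playbooks/site.yml"
  cmd="$cmd -e deployment_environment=$env"
  if [ -n "$tags" ]; then
    cmd="$cmd --tags $tags"
  fi
  echo "$cmd"
}

build_cmd staging nginx
# A wrapper would first run:  $(build_cmd staging nginx) --check   # dry run
# and, after reviewing the diff, the same command without --check.
```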
|
|
|
|
### 4. Monitoring and Alerting Automation
|
|
|
|
**Monitoring Stack Deployment:**
|
|
|
|
```yaml
|
|
# Monitoring Infrastructure with Ansible
---
- name: Deploy monitoring stack
  hosts: monitoring_servers
  become: yes

  vars:
    prometheus_version: "2.45.0"
    grafana_version: "10.0.3"
    alertmanager_version: "0.26.0"
    prometheus_retention: "15d"
    prometheus_storage_size: "50G"

  tasks:
    - name: Create prometheus user
      user:
        name: prometheus
        system: yes
        shell: /sbin/nologin
        home: /var/lib/prometheus
        create_home: yes

    - name: Create prometheus directories
      file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
        mode: '0755'
      loop:
        - /var/lib/prometheus
        - /etc/prometheus
        - /var/lib/prometheus/rules
        - /var/lib/prometheus/rules.d

    - name: Download Prometheus
      get_url:
        url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: /tmp/prometheus.tar.gz
        mode: '0644'

    - name: Extract Prometheus
      unarchive:
        src: /tmp/prometheus.tar.gz
        dest: /tmp
        remote_src: yes

    - name: Copy Prometheus binaries
      copy:
        src: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
        dest: "/usr/local/bin/{{ item }}"
        remote_src: yes
        mode: '0755'
        owner: prometheus
        group: prometheus
      loop:
        - prometheus
        - promtool

    - name: Configure Prometheus
      template:
        src: prometheus.yml.j2
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
        mode: '0644'
        validate: '/usr/local/bin/promtool check config %s'
      notify: restart prometheus

    - name: Configure Prometheus alerts
      template:
        src: alerts.yml.j2
        dest: /etc/prometheus/alerts.yml
        owner: prometheus
        group: prometheus
        mode: '0644'
      notify: restart prometheus

    - name: Create Prometheus systemd service
      template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
        owner: root
        group: root
        mode: '0644'
      notify:
        - reload systemd
        - restart prometheus

    - name: Enable and start Prometheus
      systemd:
        name: prometheus
        state: started
        enabled: yes
        daemon_reload: yes

    - name: Create Grafana user
      user:
        name: grafana
        system: yes
        shell: /sbin/nologin
        home: /var/lib/grafana
        create_home: yes

    - name: Add Grafana GPG key
      apt_key:
        url: https://packages.grafana.com/gpg.key
        state: present

    - name: Add Grafana repository
      apt_repository:
        repo: "deb https://packages.grafana.com/oss/deb stable main"
        state: present
        update_cache: yes

    - name: Install Grafana
      apt:
        name: grafana
        state: present
        update_cache: yes

    - name: Configure Grafana
      template:
        src: grafana.ini.j2
        dest: /etc/grafana/grafana.ini
        owner: root
        group: grafana
        mode: '0640'
      notify: restart grafana

    - name: Provision Grafana datasources
      template:
        src: grafana-datasources.yml.j2
        dest: /etc/grafana/provisioning/datasources/prometheus.yml
        owner: root
        group: grafana
        mode: '0644'
      notify: restart grafana

    - name: Provision Grafana dashboards
      template:
        src: grafana-dashboards.yml.j2
        dest: /etc/grafana/provisioning/dashboards/default.yml
        owner: root
        group: grafana
        mode: '0644'
      notify: restart grafana

    - name: Enable and start Grafana
      systemd:
        name: grafana-server
        state: started
        enabled: yes

    - name: Install Node Exporter on all hosts
      include_tasks: tasks/node_exporter.yml
      delegate_to: "{{ item }}"
      loop: "{{ groups['all'] }}"

  handlers:
    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted

    - name: restart grafana
      systemd:
        name: grafana-server
        state: restarted

    - name: reload systemd
      systemd:
        daemon_reload: yes

# Prometheus Configuration Template (prometheus.yml.j2)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: '{{ prometheus_cluster_name }}'
    environment: '{{ deployment_environment }}'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

# Load rules once and periodically evaluate them
rule_files:
  - "/etc/prometheus/alerts.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: {{ groups["all"] | map("regex_replace", "^(.*)$", "\\1:9100") | list }}

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: {{ groups["webservers"] | map("regex_replace", "^(.*)$", "\\1:9113") | list }}

  # PostgreSQL metrics
  - job_name: 'postgres'
    static_configs:
      - targets: {{ groups["databases"] | map("regex_replace", "^(.*)$", "\\1:9187") | list }}

  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets:
          - '{{ application_metrics_endpoint }}'
    metrics_path: '/metrics'
    scrape_interval: 30s
```
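The templated scrape targets rely on Ansible's `regex_replace` filter to turn each inventory hostname into a `host:port` target. A minimal sketch of that mapping (the `scrape_targets` helper and hostnames are illustrative):

```python
# Sketch of what the Jinja expression in prometheus.yml.j2 produces:
# {{ groups["all"] | map("regex_replace", "^(.*)$", "\\1:9100") | list }}
import re

def scrape_targets(hosts: list[str], port: int) -> list[str]:
    """Mimic the regex_replace filter: append :port to each hostname."""
    return [re.sub(r"^(.*)$", rf"\1:{port}", h) for h in hosts]

print(scrape_targets(["web1", "db1"], 9100))  # → ['web1:9100', 'db1:9100']
```

Rendering the list unquoted in the template is important: quoting the Jinja expression would emit a string instead of a YAML sequence, which Prometheus rejects.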

**Automated Alert Rules:**

```yaml
# Prometheus Alert Rules
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Critical CPU usage
      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 95% for 2 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for 5 minutes on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is below 15% on {{ $labels.instance }} (current value: {{ $value }}%)"

      # Disk I/O high
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
          description: "Disk I/O utilization is above 80% for 10 minutes on {{ $labels.instance }}"

      # Network interface down
      - alert: NetworkInterfaceDown
        expr: node_network_up == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Network interface {{ $labels.device }} is down on {{ $labels.instance }}"
          description: "Network interface {{ $labels.device }} has been down for 2 minutes"

  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: application
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for 5 minutes (current value: {{ $value }}%)"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
          team: application
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "95th percentile latency is above 1s for 5 minutes (current value: {{ $value }}s)"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: application
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "Service has been down for 2 minutes"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_activity_count{datname="{{ application_database }}"} / pg_settings_max_connections * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Database connection pool usage is above 90% (current value: {{ $value }}%)"

  - name: security_alerts
    interval: 30s
    rules:
      # Failed login attempts
      - alert: ExcessiveFailedLogins
        expr: rate(ssh_login_failed_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "Excessive failed login attempts on {{ $labels.instance }}"
          description: "Failed login rate is above 10 per second on {{ $labels.instance }}"

      # Root login detected
      - alert: RootLoginDetected
        expr: ssh_login_user{user="root"} > 0
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Root login detected on {{ $labels.instance }}"
          description: "Root user has logged in to {{ $labels.instance }}"

      # Unauthorized API access
      - alert: UnauthorizedAPIAccess
        expr: rate(api_unauthorized_requests_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "Excessive unauthorized API requests"
          description: "Unauthorized API request rate is above 5 per second"
```
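The CPU alerts derive usage from the idle-time rate: per-second idle time near 1.0 means an idle CPU, near 0.0 a saturated one. A numeric sketch of the `HighCPUUsage` expression (the helper names and sample rates are illustrative):

```python
# Numeric sketch of the HighCPUUsage rule above:
# usage% = 100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)

def cpu_usage_percent(idle_rate: float) -> float:
    """Convert the per-second idle rate (0.0-1.0) into a usage percentage."""
    return 100 - idle_rate * 100

def high_cpu(idle_rate: float, threshold: float = 80.0) -> bool:
    """Would the HighCPUUsage alert's expression fire for this sample?"""
    return cpu_usage_percent(idle_rate) > threshold

print(cpu_usage_percent(0.15))  # → 85.0 (only 15% idle → 85% busy)
print(high_cpu(0.15))           # → True  (85% > 80% threshold)
print(high_cpu(0.30))           # → False (70% usage is below threshold)
```

The `for: 5m` clause means a single sample crossing the threshold is not enough; the condition must hold for the whole window before the alert fires.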

---

## Automation Decision Framework

```yaml
# Automation Decision Matrix

automation_decisions:
  when_to_automate:
    criteria:
      - frequency: "Task performed more than 3 times per week"
      - complexity: "Task has more than 5 steps"
      - risk: "High risk of human error"
      - duration: "Task takes longer than 30 minutes"
      - consistency: "Requires consistent execution"
      - documentation: "Well-defined, documented process"

  prioritization_matrix:
    high_priority:
      - daily_deployment_pipelines
      - infrastructure_provisioning
      - security_scanning
      - backup_verification
      - log_monitoring

    medium_priority:
      - user_provisioning
      - certificate_renewal
      - dependency_updates
      - performance_testing
      - compliance_reporting

    low_priority:
      - ad_hoc_reports
      - one_time_migrations
      - experimental_features

  tool_selection:
    infrastructure_as_code:
      terraform:
        use_when: "Multi-cloud, complex infrastructure, state management needed"
        advantages: ["State management", "Multi-cloud", "Large ecosystem"]
        disadvantages: ["Learning curve", "State file complexity"]

      cloudformation:
        use_when: "AWS-only, AWS-native integrations"
        advantages: ["AWS native", "Stack management", "IAM integration"]
        disadvantages: ["AWS only", "JSON/YAML only"]

      pulumi:
        use_when: "General-purpose programming language preferred"
        advantages: ["Real languages", "Component model", "Multi-cloud"]
        disadvantages: ["Newer ecosystem", "Less mature"]

    configuration_management:
      ansible:
        use_when: "Agentless, SSH-based configuration"
        advantages: ["Agentless", "YAML syntax", "Large module library"]
        disadvantages: ["Scaling limits", "Push model"]

      chef:
        use_when: "Complex configurations, pull-based model needed"
        advantages: ["Pull model", "Ruby power", "Mature ecosystem"]
        disadvantages: ["Heavy agents", "Learning curve"]

      puppet:
        use_when: "Large fleets, mature IT operations"
        advantages: ["Mature", "Declarative", "Enterprise support"]
        disadvantages: ["Learning curve", "Ruby DSL"]

    container_orchestration:
      kubernetes:
        use_when: "Production container orchestration"
        advantages: ["De facto standard", "Large ecosystem", "Cloud-native"]
        disadvantages: ["Complexity", "Learning curve"]

      docker_swarm:
        use_when: "Simple container orchestration"
        advantages: ["Simple", "Docker native", "Easy setup"]
        disadvantages: ["Limited features", "Smaller ecosystem"]

    ci_cd:
      github_actions:
        use_when: "GitHub repository, cloud-native"
        advantages: ["Integrated with GitHub", "Free for public repos", "YAML syntax"]
        disadvantages: ["GitHub only", "Limited minutes"]

      gitlab_ci:
        use_when: "GitLab repository, integrated CI/CD"
        advantages: ["Integrated with GitLab", "Docker-in-Docker", "Kubernetes integration"]
        disadvantages: ["GitLab only", "Complex syntax"]

      jenkins:
        use_when: "Complex pipelines, extensive plugins"
        advantages: ["Mature", "Plugin ecosystem", "Flexible"]
        disadvantages: ["Maintenance overhead", "Groovy syntax"]
```
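The `when_to_automate` criteria can be read as a simple checklist score: the more criteria a task meets, the stronger the automation candidate. A minimal sketch, assuming the thresholds from the matrix (the `Task` fields and sample values are illustrative):

```python
# Sketch of the "when to automate" decision matrix above as a checklist score.
from dataclasses import dataclass

@dataclass
class Task:
    runs_per_week: int
    steps: int
    error_prone: bool
    minutes: int
    documented: bool

def automation_score(t: Task) -> int:
    """Count how many of the decision-matrix criteria the task meets (0-5)."""
    return sum([
        t.runs_per_week > 3,  # frequency: more than 3 times per week
        t.steps > 5,          # complexity: more than 5 steps
        t.error_prone,        # risk: high risk of human error
        t.minutes > 30,       # duration: takes longer than 30 minutes
        t.documented,         # well-defined, documented process
    ])

deploy = Task(runs_per_week=10, steps=12, error_prone=True, minutes=45, documented=True)
report = Task(runs_per_week=1, steps=3, error_prone=False, minutes=10, documented=False)
print(automation_score(deploy))  # → 5 (automate: high priority)
print(automation_score(report))  # → 0 (leave manual: ad hoc)
```

This keeps the "consistency" criterion implicit in `error_prone` and `documented`; a stricter model would score it separately.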

---

## Output Formats

### Automation Runbook Template

```markdown
# Automation Runbook: [Name]

## Overview
**Purpose**: [What this automation does]
**Owner**: [Team responsible]
**Last Updated**: [Date]
**Version**: [Version number]

## Prerequisites
- [ ] Tools installed: [List of tools]
- [ ] Access to: [Systems, repositories]
- [ ] Permissions: [Required permissions]
- [ ] Configuration: [Required setup]

## Execution

### Manual Execution
\`\`\`bash
# Step-by-step commands
command_1
command_2
command_3
\`\`\`

### Automated Execution
\`\`\`bash
# Single command
./run_automation.sh
\`\`\`

## Verification
- [ ] Check [specific output/log]
- [ ] Verify [system state]
- [ ] Confirm [expected result]

## Troubleshooting

### Issue: [Problem description]
**Symptoms**: [What you see]
**Cause**: [Root cause]
**Solution**: [Fix steps]

### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| Error message | Root cause | Fix steps |

## Rollback
If something goes wrong:
1. [Rollback step 1]
2. [Rollback step 2]
3. [Rollback step 3]

## Support
- Documentation: [Link]
- Contact: [Team/Person]
- Escalation: [Who to contact if stuck]
```

---

## Example Workflows

### Example 1: Complete CI/CD Pipeline Setup

**User Request:**
"Set up a complete CI/CD pipeline for a Node.js application that includes building, testing, security scanning, Docker image creation, and deployment to Kubernetes with canary releases."

**Agent Response:**

```markdown
# CI/CD Pipeline Implementation

## Architecture Overview

\`\`\`
Developer Push → Build → Test → Security Scan → Docker Build → Deploy Staging → Test Staging → Canary Production → Full Rollout
\`\`\`

## Implementation

### 1. GitHub Actions Workflow
\`\`\`yaml
name: Node.js CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Build and Test
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  # Security Scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  # Build and Push Docker Image
  docker:
    runs-on: ubuntu-latest
    needs: [build, security]
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Deploy to Staging
  deploy-staging:
    runs-on: ubuntu-latest
    needs: docker
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.STAGING_KUBE_CONFIG }}

      - name: Update deployment
        run: |
          kubectl set image deployment/nodejs-app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/nodejs-app -n staging

  # Canary Deployment to Production
  deploy-production:
    runs-on: ubuntu-latest
    needs: [docker, deploy-staging]
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.PRODUCTION_KUBE_CONFIG }}

      - name: Deploy canary
        run: |
          kubectl apply -f k8s/production/canary.yaml
          kubectl set image deployment/nodejs-app-canary \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production

      - name: Monitor canary
        run: |
          for i in {1..10}; do
            echo "Check $i/10"
            if ! kubectl get pods -n production -l app=nodejs-app,version=canary; then
              echo "Canary failed"
              exit 1
            fi
            sleep 60
          done

      - name: Promote to full rollout
        run: |
          kubectl set image deployment/nodejs-app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production

      - name: Cleanup canary
        run: kubectl delete deployment nodejs-app-canary -n production
\`\`\`

### 2. Kubernetes Manifests
\`\`\`yaml
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app
  labels:
    app: nodejs-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nodejs-app
  template:
    metadata:
      labels:
        app: nodejs-app
        version: stable
    spec:
      containers:
        - name: app
          image: ghcr.io/example/nodejs-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5

---
# Service
apiVersion: v1
kind: Service
metadata:
  name: nodejs-app
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 3000
  selector:
    app: nodejs-app

---
# HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nodejs-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nodejs-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
\`\`\`

### 3. Monitoring Configuration
\`\`\`yaml
# Prometheus Alerts
groups:
  - name: nodejs_app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        annotations:
          summary: "High error rate detected"
\`\`\`

## Results
- Automated testing on every push
- Security scanning integrated
- Docker images built and pushed automatically
- Staging deployment automatic
- Production deployment with canary releases
- Monitoring and alerting configured
- Rollback automation included
```
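The HPA in the example scales on average CPU utilization using the standard Kubernetes scaling formula, clamped to the manifest's `minReplicas`/`maxReplicas` bounds of 3 and 20. A minimal sketch (the `desired_replicas` helper and sample utilization values are illustrative):

```python
# Sketch of the Kubernetes HPA scaling formula used by the manifest above:
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
import math

def desired_replicas(current: int, current_util: float, target_util: float = 70.0,
                     min_r: int = 3, max_r: int = 20) -> int:
    """Compute the replica count the HPA would request, clamped to bounds."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(3, 140.0))  # → 6: utilization at 2x target doubles replicas
print(desired_replicas(3, 35.0))   # → 3: scale-down is clamped to minReplicas
```

In practice the controller also applies a tolerance band and stabilization windows, so small utilization swings do not cause replica churn.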

---

## Conclusion

The Automation Engineer Agent provides comprehensive automation capabilities across infrastructure, applications, and processes. By following this specification, the agent delivers:

1. **CI/CD Pipelines**: Complete build, test, and deployment automation
2. **Infrastructure as Code**: Terraform, CloudFormation, and Pulumi implementations
3. **Configuration Management**: Ansible, Chef, and Puppet playbooks and roles
4. **Container Orchestration**: Docker and Kubernetes manifests
5. **Monitoring Automation**: Prometheus, Grafana, and alerting automation
6. **GitOps Workflows**: Kubernetes-native deployment automation

This agent specification ensures robust, scalable, and maintainable automation solutions that reduce manual toil and improve consistency across all environments.