DevOps Fundamentals: CI/CD, Docker, Kubernetes, Automation, Monitoring & Infrastructure as Code
This article introduces DevOps fundamentals, including CI/CD, Docker, Kubernetes, automation, monitoring, and Infrastructure as Code, with practical examples.
In a Nutshell
DevOps is a culture and methodology that brings together software development (Dev) and IT operations (Ops) to automate and accelerate the software delivery chain.
Compact Technical Description
DevOps is an approach that bridges the gap between development and operations through automation, collaboration, and continuous improvement.
Core Components:
Continuous Integration/Continuous Deployment (CI/CD)
- Version Control: Git, GitHub, GitLab, Bitbucket
- Build Automation: Jenkins, GitHub Actions, GitLab CI
- Testing: Unit Tests, Integration Tests, E2E Tests
- Deployment: Automated Rollouts, Blue/Green, Canary
Containerization
- Docker: Container platform for application isolation
- Docker Compose: Multi-container applications
- Container Registry: Docker Hub, Harbor, AWS ECR
- Image Optimization: Multi-stage Builds, Layer Caching
Orchestration
- Kubernetes: Container orchestration platform
- Resources: Pods, Deployments, Services, Ingress
- Configuration: ConfigMaps, Secrets, Helm Charts
- Scaling: Horizontal Pod Autoscaling, Cluster Autoscaling
Infrastructure as Code (IaC)
- Terraform: Multi-cloud infrastructure provisioning
- Ansible: Configuration management
- CloudFormation: AWS-native IaC
- Pulumi: Programmable infrastructure
Monitoring & Observability
- Metrics: Prometheus, Grafana, InfluxDB
- Logging: ELK Stack, Fluentd, Loki
- Tracing: Jaeger, Zipkin, OpenTelemetry
- APM: Application Performance Monitoring
Exam-Relevant Key Points
- DevOps: Culture and methodology for software development and operations
- CI/CD: Continuous Integration and Continuous Deployment
- Docker: Container platform for application isolation
- Kubernetes: Container orchestration platform
- Infrastructure as Code: Automated infrastructure management
- Monitoring: Observing the health and performance of systems and applications
- Automation: Replacing recurring manual tasks with scripts and pipelines
- GitOps: Git-based operations workflows
- Relevant for IHK (German Chamber of Industry and Commerce) exams: Modern DevOps practices and tools
Core Components
- Version Control: Git workflows, branching strategies
- CI/CD Pipeline: Build, Test, Deploy, Monitor
- Containerization: Docker, container images, registry
- Orchestration: Kubernetes, services, scaling
- IaC: Terraform, Ansible, configuration management
- Monitoring: Metrics, logging, tracing
- Security: Scanning, compliance, secret management
- Collaboration: Team workflows, communication
Practical Examples
1. CI/CD Pipeline with GitHub Actions
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline
on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  release:
    types: [ published ]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  NODE_VERSION: '18'
  PYTHON_VERSION: '3.11'
jobs:
  # Code Quality and Security
  quality:
    name: Code Quality & Security
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'
      - name: Install dependencies
        run: |
          npm ci
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Run ESLint
        run: npm run lint
      - name: Run Prettier check
        run: npm run format:check
      - name: Run Python linting
        run: |
          flake8 src/
          black --check src/
          isort --check-only src/
      - name: Run security scan
        run: |
          npm audit --audit-level moderate
          safety check
      - name: Run SonarCloud scan
        uses: SonarSource/sonarcloud-github-action@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
  # Testing
  test:
    name: Test Suite
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [16, 18, 20]
        # Quoted so that YAML does not parse the versions as floats
        python-version: ['3.9', '3.11', '3.12']
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Install dependencies
        run: |
          npm ci
          pip install -r requirements.txt
          pip install -r requirements-test.txt
      - name: Run unit tests
        run: |
          npm run test:unit
          pytest tests/unit/ -v --cov=src --cov-report=xml
      - name: Run integration tests
        run: |
          npm run test:integration
          pytest tests/integration/ -v
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          flags: unittests
          name: codecov-umbrella
  # Build and Test Docker Image
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    needs: [quality, test]
    # The job pushes to GHCR and uploads SARIF results, so GITHUB_TOKEN needs these scopes
    permissions:
      contents: read
      packages: write
      security-events: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # The full commit SHA is added as a tag because the scan and deploy steps reference the image by github.sha
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=sha,prefix={{branch}}-
            type=raw,value=${{ github.sha }}
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          # Build the runtime stage explicitly; without a target, the last stage of the Dockerfile is built
          target: runtime
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: Run container security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
      - name: Upload Trivy scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
  # Deploy to Staging
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/develop'
    environment: staging
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG_STAGING }}" | base64 -d > kubeconfig
      - name: Deploy to Kubernetes
        run: |
          # Environment variables do not persist between steps, so KUBECONFIG is exported in each step
          export KUBECONFIG=kubeconfig
          helm upgrade --install app-staging ./helm/app \
            --namespace staging \
            --create-namespace \
            --set image.tag=${{ github.sha }} \
            --set environment=staging \
            --values helm/values-staging.yaml
      - name: Run smoke tests
        run: |
          export KUBECONFIG=kubeconfig
          kubectl wait --for=condition=ready pod -l app=app-staging -n staging --timeout=300s
          npm run test:smoke -- --env=staging
      - name: Run integration tests against staging
        run: |
          npm run test:integration -- --env=staging
  # Deploy to Production
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: build
    if: github.event_name == 'release'
    environment: production
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG_PRODUCTION }}" | base64 -d > kubeconfig
      - name: Deploy to Kubernetes (Blue/Green)
        run: |
          export KUBECONFIG=kubeconfig
          # Deploy to green environment
          helm upgrade --install app-green ./helm/app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --set environment=production \
            --set deployment.color=green \
            --values helm/values-production.yaml
          # Wait for green deployment to be ready
          kubectl wait --for=condition=ready pod -l app=app-green,color=green -n production --timeout=600s
          # Run health checks
          npm run test:health -- --env=production-green
          # Switch traffic to green
          kubectl patch service app-production -n production -p '{"spec":{"selector":{"color":"green"}}}'
          # Wait for traffic switch
          sleep 30
          # Run final tests
          npm run test:smoke -- --env=production
      - name: Cleanup blue environment
        run: |
          export KUBECONFIG=kubeconfig
          helm uninstall app-blue -n production || true
          kubectl delete deployment app-blue -n production || true
      - name: Notify deployment
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          channel: '#deployments'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}
        if: always()
  # Performance Testing
  performance:
    name: Performance Testing
    runs-on: ubuntu-latest
    needs: deploy-staging
    if: github.ref == 'refs/heads/develop'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup k6
        run: |
          sudo gpg -k
          sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
          echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
          sudo apt-get update
          sudo apt-get install -y k6
      - name: Run performance tests
        run: |
          k6 run --out json=performance-results.json tests/performance/load-test.js
      - name: Upload performance results
        uses: actions/upload-artifact@v4
        with:
          name: performance-results
          path: performance-results.json
      - name: Analyze performance
        run: |
          npm run analyze:performance -- performance-results.json
  # Documentation
  docs:
    name: Build Documentation
    runs-on: ubuntu-latest
    needs: test
    # Publishing to GitHub Pages pushes to the repository, which needs write access
    permissions:
      contents: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Build documentation
        run: |
          npm run docs:build
          npm run docs:generate-api
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        if: github.ref == 'refs/heads/main'
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs/build
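The test job in the workflow above declares a two-axis matrix, and GitHub Actions runs one job per combination of the axes. As a rough sketch of that expansion (the helper name expand_matrix is made up for illustration and is not part of any GitHub tooling):

```python
from itertools import product

# Matrix axes copied from the test job in the workflow above.
node_versions = ["16", "18", "20"]
python_versions = ["3.9", "3.11", "3.12"]


def expand_matrix(**axes):
    """Return one dict per combination of the given matrix axes."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


jobs = expand_matrix(node=node_versions, python=python_versions)
print(len(jobs))  # 3 x 3 combinations
for job in jobs:
    print(f"node {job['node']} / python {job['python']}")
```

Each added axis value multiplies the number of jobs, so wide matrices increase CI time and cost quickly.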
# Workflow for dependency updates (a separate workflow file, for example .github/workflows/dependency-updates.yml; the file name is illustrative)
name: Dependency Updates
on:
  schedule:
    - cron: '0 2 * * 1' # Every Monday at 2 AM (UTC)
  workflow_dispatch:
# env is defined per workflow file, so the versions are declared again here
env:
  NODE_VERSION: '18'
  PYTHON_VERSION: '3.11'
# Creating a branch and a pull request needs write access for GITHUB_TOKEN
permissions:
  contents: write
  pull-requests: write
jobs:
  update-dependencies:
    name: Update Dependencies
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'
      - name: Update Node.js dependencies
        run: |
          npm update
          npm audit fix
      - name: Update Python dependencies
        run: |
          # pip-compile is provided by the pip-tools package
          pip install pip-tools
          pip-compile requirements.in
          pip-compile requirements-dev.in
      - name: Run tests
        run: |
          npm ci
          npm run test
          pip install -r requirements.txt
          pytest tests/
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: 'chore: update dependencies'
          title: 'chore: update dependencies'
          body: |
            Automated dependency update
            - Updated Node.js dependencies
            - Updated Python dependencies
            Please review the changes and ensure all tests pass.
          branch: chore/update-dependencies
          delete-branch: true
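The production job in the CI/CD workflow above switches traffic by patching the Service selector to the color "green". The pipeline as written always deploys to green. In a full blue/green rotation the target would instead be whichever color is currently idle. A minimal sketch of that selection logic follows; the function names are illustrative, and only the patch payload mirrors the one passed to kubectl patch in the workflow:

```python
import json

COLORS = ("blue", "green")


def idle_color(active: str) -> str:
    """Return the color that is not currently serving traffic."""
    if active not in COLORS:
        raise ValueError(f"unknown color: {active!r}")
    return "green" if active == "blue" else "blue"


def selector_patch(color: str) -> str:
    """Build the JSON patch that points the Service selector at `color`."""
    return json.dumps({"spec": {"selector": {"color": color}}}, separators=(",", ":"))


target = idle_color("blue")
patch = selector_patch(target)
print(target)
print(patch)
```

Because the old color keeps running until cleanup, patching the selector back to it is the rollback path.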
2. Docker Multi-Stage Build with Best Practices
# Multi-stage Dockerfile for production-ready application
# Stage 1: Build stage
FROM node:18-alpine AS builder
# Set build arguments
ARG NODE_ENV=production
ARG APP_VERSION=1.0.0
# Set environment variables
ENV NODE_ENV=$NODE_ENV
ENV APP_VERSION=$APP_VERSION
# Install build dependencies
RUN apk add --no-cache \
    python3 \
    make \
    g++ \
    git
# Create app directory
WORKDIR /app
# Copy package files
COPY package*.json ./
COPY requirements.txt ./
# Install all Node.js dependencies; devDependencies are needed for the build and test steps
RUN npm ci --include=dev
# Install Python dependencies into a virtual environment (Alpine's python3 package does not provide a pip command)
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir -r requirements.txt
# Copy source code
COPY . .
# Run build and tests
RUN npm run build
RUN npm run test
# Drop devDependencies so that only production modules are copied into the runtime image
RUN npm prune --omit=dev && npm cache clean --force
# Stage 2: Runtime stage
FROM python:3.11-slim AS runtime
# Set runtime arguments
ARG APP_USER=appuser
ARG APP_UID=1001
ARG APP_GID=1001
# Set environment variables
ENV NODE_ENV=production
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV APP_PORT=3000
# Install runtime dependencies; Node.js and npm are required because the default command is "npm start"
# (the Debian packages may provide a different Node.js version than the builder image)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    ca-certificates \
    nodejs \
    npm \
    && rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -g $APP_GID $APP_USER && \
    useradd -m -u $APP_UID -g $APP_GID -s /bin/bash $APP_USER
# Create app directory
WORKDIR /app
# Copy built application from builder stage
COPY --from=builder --chown=$APP_USER:$APP_GID /app/dist ./dist
COPY --from=builder --chown=$APP_USER:$APP_GID /app/node_modules ./node_modules
# package.json is needed at runtime so that "npm start" can find its script
COPY --from=builder --chown=$APP_USER:$APP_GID /app/package*.json ./
COPY --from=builder --chown=$APP_USER:$APP_GID /app/requirements.txt ./
# Install Python production dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy configuration files
COPY --chown=$APP_USER:$APP_GID config/ ./config/
COPY --chown=$APP_USER:$APP_GID scripts/ ./scripts/
# Set permissions
RUN chmod +x scripts/*.sh
# Switch to non-root user
USER $APP_USER
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:$APP_PORT/health || exit 1
# Expose port
EXPOSE $APP_PORT
# Set entrypoint
ENTRYPOINT ["./scripts/entrypoint.sh"]
# Default command
CMD ["npm", "start"]
# Stage 3: Development stage
FROM runtime AS development
# ARG values are scoped to a single stage, so the user name is declared again
ARG APP_USER=appuser
# Override environment for development
ENV NODE_ENV=development
# Switch to root to install packages and tools
USER root
# Install development dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    vim \
    && rm -rf /var/lib/apt/lists/*
# Install Node.js development dependencies
RUN npm install
# Install development tools
RUN pip install --no-cache-dir pytest pytest-cov black flake8
# Switch back to app user
USER $APP_USER
# Override command for development
CMD ["npm", "run", "dev"]
# Stage 4: Testing stage
FROM builder AS testing
# Install test dependencies (the builder stage prunes devDependencies after its build)
RUN npm ci --include=dev
RUN pip install --no-cache-dir pytest pytest-cov safety
# Run comprehensive tests
RUN npm run test:coverage
RUN pytest tests/ --cov=src --cov-report=xml
# Security scanning
RUN npm audit --audit-level high
RUN safety check
# Stage 5: Security scanning stage
FROM builder AS security
# Install security scanning tools
RUN npm install -g audit-ci
RUN pip install --no-cache-dir safety bandit
# Run security scans
RUN audit-ci --moderate
RUN safety check --json > safety-report.json
RUN bandit -r src/ -f json -o bandit-report.json
# Export security reports from a separate stage, because a stage cannot copy from itself
FROM scratch AS reports
COPY --from=security /app/safety-report.json /reports/
COPY --from=security /app/bandit-report.json /reports/
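The HEALTHCHECK instruction in the Dockerfile above calls curl -f against /health, which fails for any HTTP status of 400 or higher. The application behind that endpoint is not shown in this article, so the following is only a sketch of what such an endpoint could look like, using Python's standard library. The handler class, response body, and helper function are assumptions; the demo binds to a free local port rather than APP_PORT so that it can run anywhere.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 and everything else with 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def check_health(port: int) -> int:
    """Request /health the way the container health check would and return the status code."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=5) as response:
        return response.status


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = check_health(server.server_port)
print(status)
server.shutdown()
server.server_close()
```

A real health endpoint would usually also verify critical dependencies, such as the database connection, before reporting success.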
3. Kubernetes Deployment with Helm and GitOps
# helm/app/Chart.yaml
apiVersion: v2
name: app
description: A Helm chart for deploying the application
type: application
version: 1.0.0
appVersion: "1.0.0"
home: https://github.com/organization/app
sources:
  - https://github.com/organization/app
maintainers:
  - name: DevOps Team
    email: devops@organization.com
keywords:
  - web
  - application
  - devops
annotations:
  category: WebApplication
# helm/app/values.yaml
# Default values for the application
replicaCount: 3
image:
  repository: ghcr.io/organization/app
  pullPolicy: IfNotPresent
  tag: "latest"
nameOverride: ""
fullnameOverride: ""
serviceAccount:
  create: true
  annotations: {}
  name: ""
podAnnotations: {}
podSecurityContext:
  fsGroup: 1001
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1001
  runAsGroup: 1001
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
service:
  type: ClusterIP
  port: 80
  targetPort: 3000
ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # ingress-nginx expresses rate limits as requests per minute (limit-rpm) or per second (limit-rps)
    nginx.ingress.kubernetes.io/limit-rpm: "100"
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: app-tls
      hosts:
        - app.example.com
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - app
          topologyKey: kubernetes.io/hostname
config:
  environment: production
  logLevel: info
  database:
    host: postgres.example.com
    port: 5432
    name: app_prod
  redis:
    host: redis.example.com
    port: 6379
  monitoring:
    enabled: true
    port: 9090
secrets:
  databasePassword: ""
  jwtSecret: ""
  apiKeys: ""
# helm/app/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "app.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "app.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      initContainers:
        - name: wait-for-db
          image: postgres:15-alpine
          command:
            - sh
            - -c
            - |
              until pg_isready -h {{ .Values.config.database.host }} -p {{ .Values.config.database.port }}; do
                echo "Waiting for database..."
                sleep 2
              done
        - name: migrate-db
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command:
            - npm
            - run
            - migrate
          envFrom:
            - configMapRef:
                name: {{ include "app.fullname" . }}
            - secretRef:
                name: {{ include "app.fullname" . }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
            - name: metrics
              containerPort: {{ .Values.config.monitoring.port }}
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          envFrom:
            - configMapRef:
                name: {{ include "app.fullname" . }}
            - secretRef:
                name: {{ include "app.fullname" . }}
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: config
              mountPath: /app/config
              readOnly: true
        - name: log-shipper
          image: fluent/fluent-bit:2.0
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: tmp
          emptyDir: {}
        - name: config
          configMap:
            name: {{ include "app.fullname" . }}
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
# helm/app/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
{{- end }}
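The Kubernetes documentation describes the core of the HorizontalPodAutoscaler decision as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), bounded by minReplicas and maxReplicas. The sketch below applies that formula to the CPU target of 70% and the replica bounds of 3 and 10 from values.yaml. It is a simplification: the real controller also applies a tolerance band and stabilization windows, and with several metrics it uses the largest result.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas):
    """Apply the basic HPA scaling formula and clamp the result to the configured bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Values taken from the chart defaults: CPU target 70 %, 3 to 10 replicas.
scale_up = desired_replicas(3, 90, 70, 3, 10)      # ceil(3.86) = 4
scale_down = desired_replicas(3, 35, 70, 3, 10)    # ceil(1.5) = 2, raised to the minimum of 3
capped = desired_replicas(10, 140, 70, 3, 10)      # 20, capped at the maximum of 10
print(scale_up, scale_down, capped)
```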
# helm/app/templates/monitoring.yaml
{{- if .Values.config.monitoring.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "app.fullname" . }}-metrics
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  type: ClusterIP
  ports:
    - port: {{ .Values.config.monitoring.port }}
      targetPort: metrics
      protocol: TCP
      name: metrics
  selector:
    {{- include "app.selectorLabels" . | nindent 4 }}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "app.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
{{- end }}
# GitOps Application Manifest (ArgoCD)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/organization/app-helm
    targetRevision: HEAD
    path: helm/app
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
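The retry block in the ArgoCD manifest above configures an exponential backoff: a base duration of 5s, a factor of 2, a cap of 3m, and at most 5 attempts. Assuming the usual reading of these fields, where each retry waits the previous delay multiplied by the factor until the cap is reached, the resulting wait times can be sketched as follows (the function name is illustrative):

```python
def backoff_delays(base_seconds, factor, max_seconds, limit):
    """Return the wait time before each retry: base * factor**attempt, capped at max_seconds."""
    return [min(base_seconds * factor ** attempt, max_seconds) for attempt in range(limit)]


# Values from the manifest: duration 5s, factor 2, maxDuration 3m (180 s), limit 5.
delays = backoff_delays(5, 2, 180, 5)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155 seconds of waiting in total
```

With these settings the cap of three minutes is never reached; it would only matter with a higher retry limit.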
4. Terraform Infrastructure as Code
# terraform/main.tf
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = var.environment
Project = var.project_name
ManagedBy = "terraform"
}
}
}
# Terraform backend configuration
# Backend blocks cannot reference variables, so the names are written out for the default project name "my-app"
terraform {
backend "s3" {
bucket = "terraform-state-my-app"
key = "infrastructure/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks-my-app"
}
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.0"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.0"
}
random = {
source = "hashicorp/random"
version = "~> 3.0"
}
null = {
source = "hashicorp/null"
version = "~> 3.0"
}
}
}
# terraform/variables.tf
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "environment" {
description = "Environment name"
type = string
default = "production"
}
variable "project_name" {
description = "Project name"
type = string
default = "my-app"
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "availability_zones" {
description = "List of availability zones"
type = list(string)
default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
variable "cluster_name" {
description = "EKS cluster name"
type = string
default = "my-app-cluster"
}
variable "cluster_version" {
description = "EKS cluster version"
type = string
default = "1.28"
}
variable "node_groups" {
description = "EKS node groups configuration"
type = map(object({
instance_type = string
min_size = number
max_size = number
desired_size = number
disk_size = number
}))
default = {
general = {
instance_type = "t3.medium"
min_size = 3
max_size = 10
desired_size = 3
disk_size = 50
}
compute = {
instance_type = "c5.large"
min_size = 2
max_size = 5
desired_size = 2
disk_size = 100
}
}
}
# terraform/vpc.tf
# VPC Configuration
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.project_name}-vpc"
}
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.project_name}-igw"
}
}
# Public Subnets
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.project_name}-public-${count.index}"
Type = "Public"
}
}
# Private Subnets
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index + 3)
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.project_name}-private-${count.index}"
Type = "Private"
}
}
# Database Subnets
resource "aws_subnet" "database" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index + 6)
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.project_name}-database-${count.index}"
Type = "Database"
}
}
# Route Tables
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "${var.project_name}-public-rt"
}
}
resource "aws_route_table_association" "public" {
count = length(aws_subnet.public)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
# EKS Cluster
resource "aws_eks_cluster" "main" {
name = var.cluster_name
role_arn = aws_iam_role.eks_cluster.arn
version = var.cluster_version
vpc_config {
subnet_ids = concat(
aws_subnet.public[*].id,
aws_subnet.private[*].id
)
endpoint_public_access = true
endpoint_private_access = true
public_access_cidrs = ["0.0.0.0/0"]
}
depends_on = [
aws_iam_role_policy_attachment.eks_cluster_policy,
]
tags = {
Name = var.cluster_name
}
}
# EKS Node Groups
resource "aws_eks_node_group" "main" {
for_each = var.node_groups
cluster_name = aws_eks_cluster.main.name
node_group_name = each.key
node_role_arn = aws_iam_role.eks_node.arn
subnet_ids = aws_subnet.private[*].id
scaling_config {
desired_size = each.value.desired_size
max_size = each.value.max_size
min_size = each.value.min_size
}
instance_types = [each.value.instance_type]
disk_size = each.value.disk_size
# Note: aws_key_pair.main is not defined in this example and must be declared separately
remote_access {
ec2_ssh_key = aws_key_pair.main.key_name
source_security_group_ids = [aws_security_group.eks_nodes.id]
}
depends_on = [
aws_iam_role_policy_attachment.eks_worker_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
aws_iam_role_policy_attachment.eks_container_registry_policy,
]
tags = {
Name = "${var.cluster_name}-${each.key}"
Type = each.key
}
}
# IAM Roles
resource "aws_iam_role" "eks_cluster" {
name = "${var.project_name}-eks-cluster-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "eks.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.eks_cluster.name
}
resource "aws_iam_role" "eks_node" {
name = "${var.project_name}-eks-node-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy_attachment" "eks_worker_node_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.eks_node.name
}
resource "aws_iam_role_policy_attachment" "eks_cni_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.eks_node.name
}
resource "aws_iam_role_policy_attachment" "eks_container_registry_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.eks_node.name
}
# Security Groups
resource "aws_security_group" "eks_cluster" {
name = "${var.project_name}-eks-cluster-sg"
description = "Security group for EKS cluster"
vpc_id = aws_vpc.main.id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-eks-cluster-sg"
}
}
resource "aws_security_group" "eks_nodes" {
name = "${var.project_name}-eks-nodes-sg"
description = "Security group for EKS nodes"
vpc_id = aws_vpc.main.id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-eks-nodes-sg"
}
}
# RDS Database
resource "aws_db_subnet_group" "main" {
name = "${var.project_name}-db-subnet-group"
subnet_ids = aws_subnet.database[*].id
tags = {
Name = "${var.project_name}-db-subnet-group"
}
}
resource "aws_security_group" "rds" {
name = "${var.project_name}-rds-sg"
description = "Security group for RDS database"
vpc_id = aws_vpc.main.id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.eks_nodes.id]
}
tags = {
Name = "${var.project_name}-rds-sg"
}
}
resource "aws_db_instance" "postgres" {
identifier = "${var.project_name}-postgres"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.t3.medium"
allocated_storage = 100
max_allocated_storage = 1000
storage_type = "gp2"
storage_encrypted = true
db_name = "app"
username = "app_user"
password = random_password.db_password.result
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
skip_final_snapshot = false
final_snapshot_identifier = "${var.project_name}-postgres-final-snapshot"
deletion_protection = true
tags = {
Name = "${var.project_name}-postgres"
}
}
# Redis ElastiCache
resource "aws_elasticache_subnet_group" "main" {
name = "${var.project_name}-cache-subnet-group"
subnet_ids = aws_subnet.private[*].id
tags = {
Name = "${var.project_name}-cache-subnet-group"
}
}
resource "aws_security_group" "redis" {
name = "${var.project_name}-redis-sg"
description = "Security group for Redis"
vpc_id = aws_vpc.main.id
ingress {
from_port = 6379
to_port = 6379
protocol = "tcp"
security_groups = [aws_security_group.eks_nodes.id]
}
tags = {
Name = "${var.project_name}-redis-sg"
}
}
resource "aws_elasticache_replication_group" "redis" {
replication_group_id = "${var.project_name}-redis"
description = "Redis cluster for ${var.project_name}"
node_type = "cache.t3.micro"
port = 6379
parameter_group_name = "default.redis7"
num_cache_clusters = 2
automatic_failover_enabled = true
multi_az_enabled = true
subnet_group_name = aws_elasticache_subnet_group.main.name
security_group_ids = [aws_security_group.redis.id]
at_rest_encryption_enabled = true
transit_encryption_enabled = true
auth_token = random_password.redis_auth_token.result
snapshot_retention_limit = 7
snapshot_window = "05:00-06:00"
maintenance_window = "sun:06:00-sun:07:00"
tags = {
Name = "${var.project_name}-redis"
}
}
# S3 Buckets
resource "aws_s3_bucket" "app_storage" {
bucket = "${var.project_name}-storage-${random_string.bucket_suffix.result}"
tags = {
Name = "${var.project_name}-storage"
}
}
resource "aws_s3_bucket_versioning" "app_storage" {
bucket = aws_s3_bucket.app_storage.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "app_storage" {
bucket = aws_s3_bucket.app_storage.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_s3_bucket_public_access_block" "app_storage" {
bucket = aws_s3_bucket.app_storage.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# Random resources
resource "random_password" "db_password" {
length = 32
special = true
override_special = "!#$%&*()-_=+[]{}<>:?"
}
# ElastiCache AUTH tokens only accept a small set of special characters
resource "random_password" "redis_auth_token" {
length = 64
special = true
override_special = "!&#$^<>-"
}
resource "random_string" "bucket_suffix" {
length = 8
special = false
upper = false
}
# Outputs
output "cluster_name" {
  description = "EKS cluster name"
  value       = aws_eks_cluster.main.name
}

output "cluster_endpoint" {
  description = "EKS cluster endpoint"
  value       = aws_eks_cluster.main.endpoint
}

output "cluster_certificate_authority_data" {
  description = "EKS cluster certificate authority data"
  value       = aws_eks_cluster.main.certificate_authority[0].data
}

output "database_endpoint" {
  description = "RDS database endpoint"
  value       = aws_db_instance.postgres.endpoint
  sensitive   = true
}

output "redis_endpoint" {
  description = "Redis endpoint"
  value       = aws_elasticache_replication_group.redis.primary_endpoint_address
  sensitive   = true
}

output "storage_bucket" {
  description = "S3 storage bucket name"
  value       = aws_s3_bucket.app_storage.bucket
}
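The outputs above are typically consumed by deployment scripts. The following is a minimal sketch, assuming the outputs were exported with `terraform output -json`, which emits one object per output with `sensitive`, `type` and `value` keys. The function name `summarize_outputs` and the sample values are hypothetical and only serve as illustration.

```python
import json


def summarize_outputs(raw_json: str) -> dict:
    """Map each Terraform output name to its value, masking sensitive outputs."""
    outputs = json.loads(raw_json)
    summary = {}
    for name, meta in outputs.items():
        if meta.get("sensitive", False):
            # Never echo sensitive values (endpoints, tokens) into logs.
            summary[name] = "<sensitive>"
        else:
            summary[name] = meta["value"]
    return summary


# Hypothetical sample shaped like the result of `terraform output -json`.
SAMPLE = json.dumps({
    "cluster_name": {"sensitive": False, "type": "string", "value": "demo-eks"},
    "redis_endpoint": {"sensitive": True, "type": "string", "value": "redis.example.internal"},
})

if __name__ == "__main__":
    print(summarize_outputs(SAMPLE))
```

Masking by the `sensitive` flag keeps the script consistent with the `sensitive = true` markers in the output definitions.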
5. Monitoring with Prometheus and Grafana
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
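The `__address__` relabel rule in the pod and service jobs is the least obvious part of this configuration. The sketch below mimics it in Python under two assumptions taken from the Prometheus relabeling semantics: the values of `source_labels` are joined with `;`, and the regex must match the whole joined string. The helper name `rewrite_address` is hypothetical.

```python
import re

# Same pattern as in the relabel rule: host, optional ":port", then ";<annotated port>".
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after applying the port annotation, if any."""
    joined = f"{address};{port_annotation}"  # source_labels joined with ';'
    match = ADDRESS_RULE.fullmatch(joined)   # the regex is anchored at both ends
    if match is None:
        # A non-matching `replace` rule leaves the target label untouched.
        return address
    return f"{match.group(1)}:{match.group(2)}"  # replacement: $1:$2


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9102"))
    print(rewrite_address("10.0.0.5:8080", ""))
```

With a port annotation the discovered port is swapped for the annotated one; without it the address stays as discovered.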
# monitoring/alert_rules.yml
groups:
  - name: kubernetes-apps
    rules:
      - alert: KubernetesPodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping."
      - alert: KubernetesPodNotReady
        expr: kube_pod_status_ready{condition="true"} == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready."
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
          description: "Node {{ $labels.node }} has been not ready for more than 10 minutes."

  - name: application
    rules:
      - alert: HighErrorRate
        # Aggregate per job before dividing; a plain series-by-series division
        # only pairs 5xx series with themselves and always yields 1.
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}."
      - alert: HighResponseTime
        # Sum the buckets per job so the quantile covers all instances of a job.
        expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s for {{ $labels.job }}."
      - alert: LowThroughput
        # Compare the total request rate of a job, not each individual series.
        expr: sum by (job) (rate(http_requests_total[5m])) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Request rate is {{ $value }} requests/second for {{ $labels.job }}."

  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}."
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk space is {{ $value }}% available on {{ $labels.device }}."
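The HighErrorRate rule is meant to compare the rate of 5xx responses with the total request rate of a job. The arithmetic behind such a ratio, with the rates summed per job before dividing, can be sketched as follows. The function name and the sample rates are hypothetical; real values would come from `rate(http_requests_total[5m])`.

```python
from collections import defaultdict


def error_ratio_by_job(rates):
    """rates: iterable of (labels, per-second rate). Returns {job: share of 5xx requests}."""
    total = defaultdict(float)
    errors = defaultdict(float)
    for labels, rate in rates:
        job = labels["job"]
        total[job] += rate
        status = labels["status"]
        if len(status) == 3 and status.startswith("5"):  # same idea as status=~"5.."
            errors[job] += rate
    return {job: errors[job] / total[job] for job in total if total[job] > 0}


# Hypothetical per-series request rates for one job.
SAMPLE_RATES = [
    ({"job": "app", "status": "200"}, 80.0),
    ({"job": "app", "status": "500"}, 15.0),
    ({"job": "app", "status": "503"}, 5.0),
]

if __name__ == "__main__":
    ratios = error_ratio_by_job(SAMPLE_RATES)
    firing = {job: ratio > 0.1 for job, ratio in ratios.items()}  # threshold from the rule
    print(ratios, firing)
```

Summing first matters because each status code is a separate time series; only the aggregated values describe the job as a whole.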
# grafana/dashboards/app-dashboard.json
{
  "dashboard": {
    "id": null,
    "title": "Application Dashboard",
    "tags": ["app", "production"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{status}}"
          }
        ],
        "yAxes": [
          {
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "99th percentile"
          }
        ],
        "yAxes": [
          {
            "label": "Seconds"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "yAxes": [
          {
            "label": "Error ratio (0-1)",
            "max": 1,
            "min": 0
          }
        ]
      },
      {
        "id": 4,
        "title": "Application Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=\"app\"}",
            "legendFormat": "Application Status"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {
                "options": {
                  "0": {
                    "text": "DOWN",
                    "color": "red"
                  },
                  "1": {
                    "text": "UP",
                    "color": "green"
                  }
                },
                "type": "value"
              }
            ]
          }
        }
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
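The Response Time panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counters. The following is a simplified sketch of that estimation for classic histograms as I understand it (linear interpolation inside the bucket that contains the target rank). It is not the Prometheus source code, and the bucket values are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Assumes 0 < q < 1 and at least two buckets, the last one being (math.inf, total).
    """
    if not 0 < q < 1:
        raise ValueError("q must be between 0 and 1 (exclusive)")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Interpolate linearly between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets (seconds, cumulative observations).
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.50, 0.95, 0.99):
        print(q, histogram_quantile(q, SAMPLE_BUCKETS))
```

The result is only as precise as the bucket boundaries, which is why bucket layout should match the latency range that matters for alerting.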
DevOps Pipeline Architecture
CI/CD Pipeline Stages
graph TD
A[Code Commit] --> B[Build Stage]
B --> C[Test Stage]
C --> D[Security Scan]
D --> E[Package Stage]
E --> F[Deploy Staging]
F --> G[Integration Tests]
G --> H[Approve Production]
H --> I[Deploy Production]
I --> J[Monitoring]
J --> K[Rollback if needed]
A1[Git Push] --> A
B1[Docker Build] --> B
C1[Unit Tests] --> C
C2[Integration Tests] --> C
D1[Vulnerability Scan] --> D
E1[Image Registry] --> E
F1[Kubernetes Deploy] --> F
G1[E2E Tests] --> G
H1[Manual Approval] --> H
I1[Blue/Green Deploy] --> I
J1[Prometheus/Grafana] --> J
K1[Automated Rollback] --> K
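The control flow of the pipeline above can be sketched in a few lines: stages run in order, the first failure stops the pipeline, and a rollback is triggered only if a deployment has already started. The `Stage` class, the stage names and the `rollback` callback are hypothetical; a real pipeline would be defined in the CI system itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True on success
    deploys: bool = False    # True if the stage changes a running environment


def run_pipeline(stages: List[Stage], rollback: Callable[[], None]) -> List[str]:
    """Run stages in order, stop at the first failure, roll back if something was deployed."""
    log = []
    deployed = False
    for stage in stages:
        # A failing deploy stage may leave a partial rollout, so count it as deployed.
        deployed = deployed or stage.deploys
        if stage.run():
            log.append(f"{stage.name}: ok")
            continue
        log.append(f"{stage.name}: failed")
        if deployed:
            rollback()
            log.append("rollback: done")
        break
    return log


if __name__ == "__main__":
    demo = [
        Stage("build", lambda: True),
        Stage("test", lambda: True),
        Stage("deploy-staging", lambda: True, deploys=True),
        Stage("integration-tests", lambda: False),
        Stage("deploy-production", lambda: True, deploys=True),
    ]
    print(run_pipeline(demo, rollback=lambda: None))
```

Failing fast keeps later, more expensive stages from running on a change that is already known to be broken.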
Containerization Comparison
Container Runtimes
| Runtime | Language | Security | Performance | Use Case |
|---|---|---|---|---|
| Docker | Go | Medium | Good | General Purpose |
| containerd | Go | High | Very Good | Production |
| CRI-O | Go | High | Good | Kubernetes |
| Podman | Go | High | Good | Daemonless |
Orchestration Platforms
| Platform | Complexity | Scalability | Cloud-Native | Use Case |
|---|---|---|---|---|
| Kubernetes | High | Very High | Yes | Enterprise |
| Docker Swarm | Low | Medium | Partial | Small/Medium |
| OpenShift | High | Very High | Yes | Enterprise |
| Nomad | Medium | High | Yes | Multi-Cloud |
Infrastructure as Code Tools
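Terraform vs. CloudFormation vs. Pulumi vs. Ansible
| Tool | Language | Multi-Cloud | State Management | Use Case |
|---|---|---|---|---|
| Terraform | HCL | Yes | State file (local or remote backend) | Multi-cloud provisioning |
| CloudFormation | YAML/JSON | No | Managed by AWS (stacks) | AWS-only provisioning |
| Pulumi | General-purpose languages (e.g., TypeScript, Python, Go) | Yes | Pulumi Cloud or self-managed backend | Programmable infrastructure |
| Ansible | YAML | Yes | No state file | Configuration management |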
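The State Management column matters because declarative tools compare the desired configuration with the recorded state to decide what to change. The sketch below shows that comparison in its simplest form. It is an illustration of the idea, not how any of the listed tools is implemented, and the resource names are hypothetical.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and recorded resources and list the required actions."""
    create = sorted(name for name in desired if name not in actual)
    delete = sorted(name for name in actual if name not in desired)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical desired configuration and recorded state.
DESIRED = {
    "redis": {"node_type": "cache.t3.micro", "nodes": 2},
    "bucket": {"versioning": True},
}
ACTUAL = {
    "redis": {"node_type": "cache.t3.micro", "nodes": 1},
    "old_queue": {"fifo": False},
}

if __name__ == "__main__":
    print(plan(DESIRED, ACTUAL))
```

A tool without state, such as Ansible in the table, has to inspect the target system on every run to reach the same decision.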
IaC Best Practices
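- Modularity: Split infrastructure into small, reusable modules
- Versioning: Keep all infrastructure code in Git and review changes like application code
- Testing: Validate changes automatically before applying them (e.g., with `terraform validate` and `terraform plan`)
- Documentation: Generate documentation from the code so it stays up to date
- Security: Scan configurations for misconfigurations and compliance violations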
Monitoring and Observability
Observability Pillars
| Pillar | Tools | Data Type | Use Case |
|---|---|---|---|
| Metrics | Prometheus, InfluxDB | Numerical data | Performance |
| Logs | ELK Stack, Loki | Textual data | Troubleshooting |
| Traces | Jaeger, Zipkin | Request flows | Distributed Systems |
| Events | CloudWatch, EventBridge | State changes | Audit Trail |
Alerting Strategies
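- Threshold-based: Alert when a metric crosses a fixed limit (e.g., CPU usage above 80%)
- Anomaly Detection: Alert when a metric deviates from its usual pattern
- Predictive: Alert before a problem occurs by extrapolating trends (e.g., a disk that is about to fill up)
- Business Metrics: Alert on business-relevant indicators (e.g., orders per minute), not only on technical metrics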
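The difference between the first two strategies can be illustrated with a minimal sketch. It assumes a simple z-score check as the anomaly detector, which is one of many possible techniques. The function names, limits and sample values are hypothetical.

```python
from statistics import mean, pstdev


def threshold_alert(value: float, limit: float) -> bool:
    """Threshold-based: fire when the value exceeds a fixed limit."""
    return value > limit


def anomaly_alert(history: list, value: float, z_limit: float = 3.0) -> bool:
    """Anomaly detection: fire when the value is far from the recent average."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Hypothetical CPU usage history in percent.
CPU_HISTORY = [40, 42, 41, 39, 43, 40, 41, 42]

if __name__ == "__main__":
    # 60% is below a static 80% limit but unusual compared with the history.
    print(threshold_alert(60, limit=80), anomaly_alert(CPU_HISTORY, 60))
```

A static limit is easy to reason about, while the anomaly check catches unusual behavior that stays below the limit at the cost of needing representative history.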
Advantages and Disadvantages
Benefits of DevOps
- Faster Delivery: Accelerated software development
- Higher Quality: Automated testing and quality assurance
- Better Collaboration: Integration of Dev and Ops
- Scalability: Automated infrastructure scaling
- Reliability: Consistent and repeatable deployments
Disadvantages
- Complexity: Pipelines, containers and orchestration add considerable complexity at the start
- Costs: Investment in tools and training
- Cultural Change: Requires organizational changes
- Learning Curve: Steep learning curve for teams
- Tool Overload: Many different tools must be selected, integrated and maintained
Common Exam Questions
- What is the difference between CI and CD? CI (Continuous Integration) automatically builds and tests every code change that is merged into the shared repository. CD stands for either Continuous Delivery, where every change that passes the pipeline is ready for release and the production deployment is triggered manually, or Continuous Deployment, where every such change is deployed to production automatically.
- Explain containerization with Docker. Docker packages an application together with all its dependencies into an image and runs it as an isolated container, ensuring consistent environments across different systems.
- When do you use Kubernetes vs. Docker Swarm? Kubernetes suits complex, highly scalable applications in enterprise environments; Docker Swarm suits simpler setups and small to medium-sized deployments.
- What is Infrastructure as Code? Infrastructure as Code is the practice of defining and managing infrastructure through code, enabling automation and versioning.
Important Sources
- https://docs.docker.com/
- https://kubernetes.io/docs/
- https://www.terraform.io/docs/
- https://prometheus.io/docs/