Hands-on Labs
Introduction
Build a Prometheus server
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Namespace metadata: name: monitoring --- apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] --- apiVersion: v1 kind: Service metadata: name: prometheus labels: app: prometheus namespace: monitoring spec: ports: - port: 9090 clusterIP: None selector: app: prometheus --- apiVersion: v1 kind: Service metadata: name: prometheus-external labels: app: prometheus namespace: monitoring spec: type: LoadBalancer ports: - port: 80 targetPort: 9090 selector: statefulset.kubernetes.io/pod-name: prometheus-0 --- apiVersion: apps/v1 kind: StatefulSet metadata: name: prometheus labels: app: prometheus namespace: monitoring spec: selector: matchLabels: app: prometheus serviceName: prometheus template: metadata: labels: app: prometheus spec: securityContext: fsGroup: 2000 containers: - name: prometheus image: quay.io/prometheus/prometheus args: - --config.file=/etc/prometheus/prometheus.yaml - --storage.tsdb.path=/data ports: - containerPort: 9090 volumeMounts: - name: prometheus-config mountPath: /etc/prometheus - name: prometheus-data mountPath: /data volumes: - name: prometheus-config configMap: name: prometheus-config volumeClaimTemplates: - metadata: name: prometheus-data spec: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi EOF
Check that the Pod has been created
kubectl -n monitoring get pod prometheus-0
Check the Prometheus server endpoint
kubectl -n monitoring get svc prometheus-external \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL from the previous step in a web browser
In the top menu, click Status -> Targets
In the top menu, click Status -> Command-Line Flags
In the top menu, click Status -> Configuration
In the top menu, click Graph
Open a new browser tab and go to the Prometheus server's /metrics path - the URL can be found with the command below
kubectl -n monitoring get svc prometheus-external \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"/metrics"}{"\n"}'
Enter and run the following queries in the expression browser
Total number of samples ingested by Prometheus
prometheus_tsdb_head_samples_appended_total
Per-second sample ingestion rate over the last 1 minute
rate(prometheus_tsdb_head_samples_appended_total[1m])
Status of the prometheus job
up{job="prometheus"}
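The same expressions can also be issued against the Prometheus HTTP API instead of the expression browser; a minimal sketch using the external Service looked up earlier (assumes jq is installed on the workstation):
curl -sG http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/query --data-urlencode 'query=up{job="prometheus"}' | jq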
Expose metrics
Create an NGINX server
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: nginx data: nginx.conf: | user nginx; worker_processes 1; events { worker_connections 1024; } http { server { listen 80; server_name localhost; rewrite ^/(.*)/$ /$1 permanent; location / { root /usr/share/nginx/html; index index.html index.htm; } location /metrics { default_type "text/plain"; alias /usr/share/nginx/html/metrics.txt; } } } metrics.txt: | requests_total 1234 --- apiVersion: v1 kind: Pod metadata: name: nginx spec: containers: - image: nginx name: nginx ports: - containerPort: 80 volumeMounts: - name: nginx-conf mountPath: /etc/nginx - name: metrics mountPath: /usr/share/nginx/html volumes: - name: nginx-conf configMap: name: nginx items: - key: nginx.conf path: nginx.conf - name: metrics configMap: name: nginx items: - key: metrics.txt path: metrics.txt EOF
Call the /metrics path on the NGINX web server created above
kubectl exec -it nginx -- curl localhost/metrics
In the Prometheus dashboard, go to Status -> Targets in the top menu and check whether the Pod created above has been added
Check the NGINX Pod's IP address
kubectl get pod nginx \ --output=custom-columns="NAME:.metadata.name,IP:.status.podIP"
Update the Prometheus configuration file
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] - job_name: 'nginx' static_configs: - targets: ['$(kubectl get pod nginx -o=jsonpath="{.status.podIP}")'] EOF
Check that the Prometheus configuration file has been updated
kubectl -n monitoring get cm prometheus-config -o yaml | yq e '.data' -
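Note that an updated ConfigMap is not projected into the Pod immediately; the kubelet syncs the mounted volume periodically. To see what the Prometheus container currently has on disk, the mounted file can be read directly (an optional check, not part of the original steps):
kubectl -n monitoring exec prometheus-0 -c prometheus -- cat /etc/prometheus/prometheus.yaml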
In the Prometheus dashboard, go to Status -> Configuration in the top menu and check whether the configuration change has been applied
Reload the Prometheus configuration file (this request is expected to fail for now, because the lifecycle API has not been enabled yet; the following steps enable it)
curl -X POST http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
In the Prometheus dashboard, go to Status -> Command-Line Flags in the top menu and review the runtime flags
Check the arguments specified on the Prometheus container
kubectl -n monitoring get sts prometheus \ --output=custom-columns="NAME:.metadata.name,ARGS:.spec.template.spec.containers[0].args"
Enable the lifecycle API
kubectl -n monitoring patch sts prometheus --type=json \ -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--web.enable-lifecycle"}]'
Check that the Prometheus Pod has been recreated
kubectl -n monitoring get pod prometheus-0
In the Prometheus dashboard, go to Status -> Command-Line Flags in the top menu and review the runtime flags
In the Prometheus dashboard, go to Status -> Configuration in the top menu and check that the configuration change has now been applied
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Pod created above has been added
Run the following query in the expression browser to check that the metric exposed by the NGINX server is being collected
requests_total
Review the source code written with the Prometheus Python client - https://github.com/youngwjung/prometheus-python-client/blob/main/app.py
Create a Pod
kubectl run prom-py --image=youngwjung/prometheus-python-client
Call the application
kubectl exec prom-py -- curl -s localhost:8000
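If the sample application serves metrics with the Python client's built-in HTTP server (an assumption about the linked code, not verified here), the registered metrics should also be visible at the /metrics path on the same port:
kubectl exec prom-py -- curl -s localhost:8000/metrics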
Delete the resources
{ kubectl delete cm nginx kubectl delete pod nginx prom-py }
Exporters
What is an exporter? - https://prometheus.io/docs/introduction/glossary/#exporter
Available exporters - https://prometheus.io/docs/instrumenting/exporters
Review the source code of a Python Flask web application with the exporter applied - https://github.com/youngwjung/prometheus-flask-exporter/blob/main/app.py. Only the following two lines were added to the original source code:
from prometheus_flask_exporter import PrometheusMetrics
metrics = PrometheusMetrics(app)
Create the demo application
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: labels: app: flask name: flask spec: replicas: 1 selector: matchLabels: app: flask template: metadata: labels: app: flask spec: containers: - name: flask image: youngwjung/prometheus-flask-exporter --- apiVersion: v1 kind: Service metadata: name: flask labels: app: flask spec: ports: - port: 80 selector: app: flask EOF
Check the metrics exposed by the exporter
kubectl exec -it deploy/flask -- curl -s localhost/metrics
Create a Pod that generates load against the Flask application
kubectl run load-generator --image=busybox \ -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://flask; done"
Check that HTTP-related metrics are now being produced
kubectl exec -it deploy/flask -- curl -s localhost/metrics
Change the Prometheus configuration to collect the metrics produced by the Flask application
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] - job_name: 'flask' static_configs: - targets: ['flask.default'] EOF
Reload the Prometheus configuration file
curl -X POST http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Service created above has been added
Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected
flask_http_request_total
Check the average per-second request rate over the last 5 minutes
rate(flask_http_request_total[5m])
Select Graph to display the metric as a line graph
Delete the demo application
{ kubectl delete svc flask kubectl delete deploy flask kubectl delete pod load-generator }
Config Reloader
Add a container that detects changes to the Prometheus configuration file and reloads it
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: StatefulSet metadata: name: prometheus labels: app: prometheus namespace: monitoring spec: selector: matchLabels: app: prometheus serviceName: prometheus template: metadata: labels: app: prometheus spec: securityContext: fsGroup: 2000 containers: - name: prometheus image: quay.io/prometheus/prometheus args: - --config.file=/etc/prometheus/prometheus.yaml - --storage.tsdb.path=/data - --web.enable-lifecycle ports: - containerPort: 9090 volumeMounts: - name: prometheus-config mountPath: /etc/prometheus - name: prometheus-data mountPath: /data - name: config-reloader image: quay.io/prometheus-operator/prometheus-config-reloader:v0.61.1 args: - --reload-url=http://127.0.0.1:9090/-/reload - --config-file=/etc/prometheus/prometheus.yaml volumeMounts: - name: prometheus-config mountPath: /etc/prometheus volumes: - name: prometheus-config configMap: name: prometheus-config volumeClaimTemplates: - metadata: name: prometheus-data spec: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi EOF
Check that the Prometheus Pod has been recreated
kubectl -n monitoring get pod prometheus-0
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] EOF
Check the Prometheus logs
kubectl -n monitoring logs prometheus-0 -c prometheus --tail 20 -f
In the Prometheus dashboard, go to Status -> Configuration in the top menu and check that the configuration change has been applied
After the Prometheus configuration file is changed, it can take up to 2-3 minutes for the change to be detected and applied to the server. Whenever a lab step changes the Prometheus configuration, wait about 3-4 minutes before moving on to the next step.
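To see when a change is actually picked up, you can follow the config-reloader sidecar's logs while waiting; it should log each reload it triggers:
kubectl -n monitoring logs prometheus-0 -c config-reloader --tail 20 -f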
Service Discovery
Create the demo application
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: labels: app: flask name: flask spec: replicas: 1 selector: matchLabels: app: flask template: metadata: labels: app: flask spec: containers: - name: flask image: youngwjung/prometheus-flask-exporter --- apiVersion: v1 kind: Service metadata: name: flask labels: app: flask spec: ports: - port: 80 selector: app: flask --- apiVersion: v1 kind: Pod metadata: name: load-generator spec: containers: - name: load-generator image: busybox args: - /bin/sh - -c - while sleep 0.01; do wget -q -O- http://flask; done EOF
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-services' kubernetes_sd_configs: - role: service namespaces: names: - default EOF
List the Services
kubectl get svc -n default
In the Prometheus dashboard, go to Status -> Targets in the top menu and check whether the Services listed above have been added
Check the Prometheus server logs
kubectl -n monitoring logs prometheus-0 -c prometheus --tail 10
Set up permissions
cat <<EOF | kubectl apply -f - apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: prometheus rules: - apiGroups: [""] resources: - nodes - services - endpoints - pods verbs: ["get", "list", "watch"] - apiGroups: - extensions - networking.k8s.io resources: - ingresses verbs: ["get", "list", "watch"] - apiGroups: - discovery.k8s.io resources: - endpointslices verbs: ["get", "list", "watch"] --- apiVersion: v1 kind: ServiceAccount metadata: name: prometheus namespace: monitoring --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: prometheus subjects: - kind: ServiceAccount name: prometheus namespace: monitoring EOF
Apply the permissions
kubectl -n monitoring patch sts prometheus --type=json \ -p='[{"op": "replace", "path": "/spec/template/spec/serviceAccountName", "value": "prometheus"}]'
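To confirm the patch took effect, read the field back from the StatefulSet (a quick verification step):
kubectl -n monitoring get sts prometheus -o=jsonpath='{.spec.template.spec.serviceAccountName}{"\n"}'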
Check the Prometheus server logs
kubectl -n monitoring logs prometheus-0 -c prometheus --tail 10
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Services listed above have been added
Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected
flask_http_request_total
Create a Service
kubectl create service clusterip demo --tcp=80:80
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Service created above has been added
Delete the Service
kubectl delete svc demo
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the deleted Service has disappeared from the list
Review the metadata available through service discovery - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
In the Prometheus dashboard, go to Status -> Service Discovery in the top menu and review the labels discovered for each target
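The discovered targets and their pre-relabeling metadata can also be inspected through the HTTP API; a minimal sketch using the external Service address (assumes jq is installed):
curl -s http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/targets | jq '.data.activeTargets[].discoveredLabels'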
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-services' kubernetes_sd_configs: - role: service namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_name] regex: kubernetes action: drop EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the kubernetes Service has disappeared from the list
In the Prometheus dashboard, go to Status -> Service Discovery in the top menu and review the targets
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-services' kubernetes_sd_configs: - role: service namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_name] regex: kubernetes action: drop - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod EOF
List all Pods
kubectl get pod -A
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Pods in the cluster have been added
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-services' kubernetes_sd_configs: - role: service namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_name] regex: kubernetes action: drop - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_name] regex: kubernetes action: drop - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service EOF
List the Endpoints
kubectl get ep
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Endpoints have been added
Scale the Flask application to 3 Pods
kubectl scale deployment flask --replicas=3
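The Pod IPs that the endpoints role will discover for the scaled Deployment can be listed directly (an optional check):
kubectl get endpoints flask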
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the new Endpoints have been added
Run the following query in the expression browser to check that the metrics exposed by the Flask servers are being collected
flask_http_request_total
Delete the demo application
{ kubectl delete svc flask kubectl delete deploy flask kubectl delete pod load-generator }
Relabeling
Create the demo application
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: labels: app: flask name: flask spec: replicas: 1 selector: matchLabels: app: flask template: metadata: labels: app: flask spec: containers: - name: flask image: youngwjung/prometheus-flask-exporter --- apiVersion: v1 kind: Service metadata: name: flask labels: app: flask spec: ports: - port: 80 selector: app: flask --- apiVersion: v1 kind: Pod metadata: name: load-generator spec: containers: - name: load-generator image: busybox args: - /bin/sh - -c - while sleep 0.01; do wget -q -O- http://flask; done EOF
Change the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check whether any Endpoints have been added
In the Prometheus dashboard, go to Status -> Service Discovery in the top menu and review the targets
Add an annotation to the Service
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: name: flask labels: app: flask annotations: prometheus.io/scrape: "true" spec: ports: - port: 80 selector: app: flask EOF
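Instead of re-applying the whole manifest, the same annotation could also be set with kubectl annotate (an equivalent alternative):
kubectl annotate svc flask prometheus.io/scrape=true --overwrite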
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Endpoints have been added
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check the Endpoint's metrics path
Add annotations to the Service
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: name: flask labels: app: flask annotations: prometheus.io/scrape: "true" prometheus.io/path: "/status" prometheus.io/port: "8080" spec: ports: - port: 80 selector: app: flask EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check the Endpoint's metrics path
Change the annotations on the Service
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: name: flask labels: app: flask annotations: prometheus.io/scrape: "true" prometheus.io/path: "/metrics" prometheus.io/port: "80" spec: ports: - port: 80 selector: app: flask EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check the Endpoints' metrics paths
Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected
flask_http_request_total
Check the average per-second request rate over the last 5 minutes
rate(flask_http_request_total[5m])
Check the sum of the average per-second request rates over the last 5 minutes
sum(rate(flask_http_request_total[5m]))
Deploy a new application
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: labels: app: flask-two name: flask-two spec: replicas: 2 selector: matchLabels: app: flask-two template: metadata: labels: app: flask-two spec: containers: - name: flask image: youngwjung/prometheus-flask-exporter --- apiVersion: v1 kind: Service metadata: name: flask-two labels: app: flask-two annotations: prometheus.io/scrape: "true" spec: ports: - port: 80 selector: app: flask-two --- apiVersion: v1 kind: Pod metadata: name: load-generator-two labels: app: load-generator spec: containers: - name: load-generator image: busybox args: - /bin/sh - -c - while sleep 0.1; do wget -q -O- http://flask-two; done EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Endpoints have been added
Run the following query in the expression browser to check that the metrics exposed by the Flask servers are being collected
flask_http_request_total
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the label has been added
Run the following query in the expression browser to check the average per-second request rate over the last 5 minutes for each service
sum by (service)(rate(flask_http_request_total[5m]))
Review the metadata available through service discovery - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the labels have been added
Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected
flask_http_request_total
Check the labels assigned to the Pods
kubectl get pod --show-labels
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node - action: labelmap regex: __meta_kubernetes_pod_label_(.+) replacement: $1 EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Pod labels have been added
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node - action: labelmap regex: __meta_kubernetes_pod_label_(.+) replacement: $1 - action: labeldrop regex: pod_template_hash EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the unnecessary label has been removed
Check the metrics exposed by the Flask application
kubectl exec -it deploy/flask -- curl -s localhost/metrics
In the expression browser, check the metrics that start with python_
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node - action: labelmap regex: __meta_kubernetes_pod_label_(.+) replacement: $1 - action: labeldrop regex: pod_template_hash metric_relabel_configs: - source_labels: [__name__] regex: python_(.+) action: drop EOF
In the expression browser, check the metrics that start with python_
In the expression browser, check the metrics that start with process_
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node - action: labelmap regex: __meta_kubernetes_pod_label_(.+) replacement: $1 - action: labeldrop regex: pod_template_hash metric_relabel_configs: - source_labels: [__name__] regex: flask_(.+) action: keep EOF
In the expression browser, check the metrics that start with process_
In the expression browser, check the metrics that start with flask_
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'kubernetes-endpoints' kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_service_name] action: replace target_label: service - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod - source_labels: [__meta_kubernetes_pod_container_name] action: replace target_label: container - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node - action: labelmap regex: __meta_kubernetes_pod_label_(.+) replacement: $1 - action: labeldrop regex: pod_template_hash metric_relabel_configs: - source_labels: [__name__] regex: flask_(.+) action: keep - source_labels: [__name__] action: replace regex: flask_(.+) replacement: $1 target_label: __name__ EOF
Check that the metric names have changed
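For example, with the rename above the flask_ prefix should be stripped, so the request counter can be queried under its new name from the command line (a minimal sketch, assuming the relabeling applied as written and jq is installed):
curl -sG http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/query --data-urlencode 'query=http_request_total' | jq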
Delete the demo applications
{ kubectl delete svc flask flask-two kubectl delete deploy flask flask-two kubectl delete pod load-generator load-generator-two }
Node Exporter
Node Exporter installation guide - https://prometheus.io/docs/guides/node-exporter
Node Exporter GitHub - https://github.com/prometheus/node_exporter
Install Node Exporter
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: DaemonSet metadata: name: node-exporter labels: app: node-exporter namespace: monitoring spec: selector: matchLabels: app: node-exporter template: metadata: labels: app: node-exporter annotations: prometheus.io/scrape: "true" prometheus.io/path: "/metrics" prometheus.io/port: "9100" spec: hostNetwork: true hostPID: true containers: - name: node-exporter image: quay.io/prometheus/node-exporter args: - --path.procfs=/host/proc - --path.sysfs=/host/sys - --path.rootfs=/host/root - --web.listen-address=0.0.0.0:9100 ports: - name: metrics containerPort: 9100 protocol: TCP volumeMounts: - name: proc mountPath: /host/proc readOnly: true - name: sys mountPath: /host/sys readOnly: true - name: root mountPath: /host/root mountPropagation: HostToContainer readOnly: true volumes: - name: proc hostPath: path: /proc - name: sys hostPath: path: /sys - name: root hostPath: path: / EOF
Check that Node Exporter is running
kubectl -n monitoring get pod -l app=node-exporter
Check the metrics exposed by Node Exporter
kubectl run nginx --image=nginx -it --rm --restart=Never \ -- curl -s $(kubectl -n monitoring get pod -l app=node-exporter -o=jsonpath="{.items[0].status.podIP}"):9100/metrics
Check Node Exporter's command-line options
kubectl -n monitoring exec ds/node-exporter -- node_exporter -h
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'node-exporter' kubernetes_sd_configs: - role: pod namespaces: names: - monitoring relabel_configs: - source_labels: [__meta_kubernetes_pod_label_app] regex: node-exporter action: keep - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: instance EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the Node Exporter targets have been added
Run the following query in the expression browser to check the size of the filesystems mounted on each node
node_filesystem_size_bytes
Check the root volume size for each node
sum by (instance) (node_filesystem_size_bytes{mountpoint="/"})
Check the root volume usage for each node
1 - node_filesystem_avail_bytes{job="node-exporter",mountpoint="/"} / node_filesystem_size_bytes{job="node-exporter",mountpoint="/"}
Install the Session Manager plugin
{ curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm" -o "session-manager-plugin.rpm" sudo yum install -y session-manager-plugin.rpm }
Connect to one of the nodes with Session Manager
aws ssm start-session --target \ $(kubectl get node -o jsonpath='{.items[0].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
Check the disk usage
df -h
Exit Session Manager
exit
Change the Prometheus configuration so that only filesystem-related metrics are collected
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: 'node-exporter' kubernetes_sd_configs: - role: pod namespaces: names: - monitoring relabel_configs: - source_labels: [__meta_kubernetes_pod_label_app] regex: node-exporter action: keep - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] regex: true action: keep - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] regex: (.+) action: replace target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) action: replace replacement: $1:$2 target_label: __address__ - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: instance metric_relabel_configs: - source_labels: [__name__] regex: node_filesystem_(.+) action: keep EOF
List all metrics stored in Prometheus
curl -s http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/label/__name__/values | jq
Run the following query in the expression browser to list the metrics collected in the last 1 minute
group by(__name__) ({__name__!=""})
Delete Node Exporter
kubectl -n monitoring delete ds node-exporter
Kubernetes system component metrics
Review the Kubernetes components that expose metrics - https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/
Create a Pod for making HTTP requests
cat <<EOF | kubectl apply -f - kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: metrics-access rules: - nonResourceURLs: - "/metrics" verbs: - get - apiGroups: [""] resources: ["nodes/metrics"] verbs: ["get"] --- apiVersion: v1 kind: ServiceAccount metadata: name: metrics-access --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: metrics-access subjects: - kind: ServiceAccount name: metrics-access namespace: default roleRef: kind: ClusterRole name: metrics-access apiGroup: rbac.authorization.k8s.io --- apiVersion: v1 kind: Pod metadata: name: curl spec: serviceAccountName: metrics-access containers: - image: curlimages/curl name: curl command: ["sleep", "3600"] env: - name: HOST_IP valueFrom: fieldRef: fieldPath: status.hostIP EOF
Check the metrics exposed by the API server
kubectl exec -it curl -- \ sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \ --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \ https://kubernetes/metrics'
Check the metrics exposed by the kubelet - /metrics
kubectl exec -it curl -- \ sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \ --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \ https://$HOST_IP:10250/metrics'
Check the metrics exposed by the kubelet - /metrics/cadvisor
kubectl exec -it curl -- \ sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \ --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \ https://$HOST_IP:10250/metrics/cadvisor'
Check the metrics exposed by the kubelet - /metrics/resource
kubectl exec -it curl -- \ sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \ --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \ https://$HOST_IP:10250/metrics/resource'
Check the metrics exposed by the kubelet - /metrics/probes
kubectl exec -it curl -- \ sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \ --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \ https://$HOST_IP:10250/metrics/probes'
Check the metrics exposed by CoreDNS
kubectl exec -it curl -- \ curl $(kubectl get pod -l k8s-app=kube-dns -A -o=jsonpath='{.items[0].status.podIP}'):9153/metrics
Grant the Prometheus server access to these metrics
cat <<EOF | kubectl apply -f - apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: prometheus rules: - apiGroups: [""] resources: - nodes - services - endpoints - pods verbs: ["get", "list", "watch"] - apiGroups: - extensions - networking.k8s.io resources: - ingresses verbs: ["get", "list", "watch"] - apiGroups: - discovery.k8s.io resources: - endpointslices verbs: ["get", "list", "watch"] - nonResourceURLs: ["/metrics"] verbs: ["get"] - apiGroups: [""] resources: ["nodes/metrics"] verbs: ["get"] EOF
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: kube-apiserver scheme: https authorization: type: Bearer credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt kubernetes_sd_configs: - role: endpoints namespaces: names: - default relabel_configs: - source_labels: [__meta_kubernetes_service_label_component] regex: apiserver action: keep - source_labels: [__meta_kubernetes_service_label_provider] regex: kubernetes action: keep - source_labels: [__meta_kubernetes_endpoint_port_name] regex: https action: keep - source_labels: [__meta_kubernetes_service_name] regex: (.*) action: replace target_label: service metric_relabel_configs: - source_labels: [__name__] regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds) action: drop - source_labels: [__name__] regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds) action: drop - source_labels: [__name__] regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs) action: drop - source_labels: [__name__] regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout) action: drop - source_labels: [__name__] regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total) action: drop - source_labels: [__name__] regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary) action: drop - source_labels: [__name__] regex: transformation_(transformation_latencies_microseconds|failures_total) action: drop - source_labels: [__name__] regex: 
(admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_wo
rk_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count) action: drop - source_labels: [__name__] regex: etcd_(debugging|disk|server).* action: drop - source_labels: [__name__] regex: apiserver_admission_controller_admission_latencies_seconds_.* action: drop - source_labels: [__name__] regex: apiserver_admission_step_admission_latencies_seconds_.* action: drop - source_labels: [__name__, le] regex: apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50) action: drop EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that kube-apiserver has been added
List the metrics collected in the last 1 minute
group by(__name__) ({__name__!=""})
Check the request count per Kubernetes resource type
sum by(resource) (apiserver_request_total)
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: kubelet-cadvisor metrics_path: /metrics/cadvisor scheme: https authorization: type: Bearer credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token tls_config: insecure_skip_verify: true kubernetes_sd_configs: - role: node relabel_configs: - source_labels: [__metrics_path__] regex: (.*) action: replace target_label: metrics_path metric_relabel_configs: - source_labels: [__name__] regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s) action: drop - source_labels: [__name__, pod, namespace] regex: (container_fs_.*|container_spec_.*|container_blkio_device_usage_total|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);; action: drop EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that kubelet-cadvisor has been added
List the metrics collected in the last 1 minute
group by(__name__) ({__name__!=""})
Check the CPU usage time per Pod
sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod)
Delete the resources
{ kubectl delete clusterrole metrics-access kubectl delete clusterrolebinding metrics-access kubectl delete sa metrics-access kubectl delete pod curl }
kube-state-metrics
Install kube-state-metrics
{ git clone https://github.com/kubernetes/kube-state-metrics.git kubectl apply -f kube-state-metrics/examples/standard }
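The standard manifests install kube-state-metrics into the kube-system namespace; it may help to verify the Deployment is running before moving on:
kubectl -n kube-system get deploy kube-state-metrics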
Check the metrics exposed by kube-state-metrics
kubectl run nginx --image=nginx -it --rm --restart=Never \ -- curl $(kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics -o=jsonpath="{.items[0].status.podIP}"):8080/metrics
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 5s scrape_configs: - job_name: kube-state-metrics scrape_interval: 30s metrics_path: /metrics kubernetes_sd_configs: - role: endpoints namespaces: names: - kube-system relabel_configs: - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component] regex: exporter action: keep - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name] regex: kube-state-metrics action: keep - source_labels: [__meta_kubernetes_endpoint_port_name] regex: http-metrics action: keep EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that kube-state-metrics has been added
List the metrics collected in the last 1 minute
group by(__name__) ({__name__!=""})
Check Node status
kube_node_status_condition
Check Pod status
kube_pod_status_phase
Create a Pod
kubectl run nginx --image=nginx:notexist
Check the Pod status
kubectl get pod -l run=nginx
List the Pods that are not in the Running phase
kube_pod_status_phase{phase !="Running"} == 1
Delete the resources
{ kubectl delete pod nginx kubectl delete -f kube-state-metrics/examples/standard }
Alerting
Install kube-state-metrics
{ git clone https://github.com/kubernetes/kube-state-metrics.git kubectl apply -f kube-state-metrics/examples/standard }
Change the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: prometheus.yaml: | global: scrape_interval: 10s evaluation_interval: 10s scrape_configs: - job_name: kubelet scheme: https authorization: type: Bearer credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token tls_config: insecure_skip_verify: true kubernetes_sd_configs: - role: node metric_relabel_configs: - source_labels: [__name__] regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds) action: drop - source_labels: [__name__] regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds) action: drop - source_labels: [__name__] regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs) action: drop - source_labels: [__name__] regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout) action: drop - source_labels: [__name__] regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total) action: drop - source_labels: [__name__] regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary) action: drop - source_labels: [__name__] regex: transformation_(transformation_latencies_microseconds|failures_total) action: drop - source_labels: [__name__] regex: 
(admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_wo
rk_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count) action: drop - job_name: kube-state-metrics kubernetes_sd_configs: - role: endpoints namespaces: names: - kube-system relabel_configs: - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component] regex: exporter action: keep - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name] regex: kube-state-metrics action: keep - source_labels: [__meta_kubernetes_endpoint_port_name] regex: http-metrics action: keep EOF
In the Prometheus dashboard, go to Status -> Targets in the top menu and check that the jobs defined above have been added
Add alerting rules
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: alerts.yaml: | groups: - name: kubernetes-apps rules: - alert: KubePodCrashLooping annotations: description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").' runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping summary: Pod is crash looping. expr: | max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1 for: 15m labels: severity: warning team: dev - name: kubernetes-system-kubelet rules: - alert: KubeNodeNotReady annotations: description: '{{ $labels.node }} has been unready for more than 1 minutes.' runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready summary: Node is not ready. expr: | kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 for: 1m labels: severity: warning - name: kubernetes-storage rules: - alert: KubePersistentVolumeFillingUp annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup summary: PersistentVolume is filling up. expr: | ( kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} ) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0 for: 1m labels: severity: critical - alert: KubePersistentVolumeAlmostFillingUp annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup summary: PersistentVolume is almost filling up. 
expr: | ( kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} ) < 0.20 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0 for: 1m labels: severity: warning prometheus.yaml: | global: scrape_interval: 10s evaluation_interval: 10s rule_files: - alerts.yaml scrape_configs: - job_name: kubelet scheme: https authorization: type: Bearer credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token tls_config: insecure_skip_verify: true kubernetes_sd_configs: - role: node metric_relabel_configs: - source_labels: [__name__] regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds) action: drop - source_labels: [__name__] regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds) action: drop - source_labels: [__name__] regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs) action: drop - source_labels: [__name__] regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout) action: drop - source_labels: [__name__] regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total) action: drop - source_labels: [__name__] regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary) action: drop - source_labels: [__name__] regex: transformation_(transformation_latencies_microseconds|failures_total) action: drop - source_labels: [__name__] regex: 
(admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_wo
rk_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count) action: drop - job_name: kube-state-metrics kubernetes_sd_configs: - role: endpoints namespaces: names: - kube-system relabel_configs: - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component] regex: exporter action: keep - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name] regex: kube-state-metrics action: keep - source_labels: [__meta_kubernetes_endpoint_port_name] regex: http-metrics action: keep EOF
From the menu at the top of the Prometheus dashboard, go to Status -> Rules and confirm that the alert rules have been added
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
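The rule and alert state can also be checked from the command line through the Prometheus HTTP API. This is an optional sketch that assumes the prometheus-external LoadBalancer hostname from the earlier step and that jq is installed:
export PROMETHEUS_URL=$(kubectl -n monitoring get svc prometheus-external \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# Loaded rule groups
curl -s http://$PROMETHEUS_URL/api/v1/rules | jq '.data.groups[].name'
# Currently pending or firing alerts
curl -s http://$PROMETHEUS_URL/api/v1/alerts | jq '.data.alerts[].labels.alertname'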
Create a Pod
kubectl run busybox --image=busybox
Check the Pod status
kubectl get pod -l run=busybox
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
If the Cluster Autoscaler is enabled, disable it
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0
Check the number of nodes configured for the node group
{ export DESIRED_SIZE=$(aws eks describe-nodegroup \ --cluster-name mycluster \ --nodegroup-name nodegroup \ --query nodegroup.scalingConfig.desiredSize) echo $DESIRED_SIZE }
Scale the node group to add one node
aws eks update-nodegroup-config \ --cluster-name mycluster \ --nodegroup-name nodegroup \ --scaling-config desiredSize=$(($DESIRED_SIZE+1))
Confirm that the node has been added
kubectl get node
Wait until the newly added node's status becomes Ready
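If you prefer to script the wait rather than re-run kubectl get node, a minimal sketch (assuming a 10-minute timeout is acceptable) is:
kubectl wait --for=condition=Ready node --all --timeout=10m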
Connect to the newly created node with Session Manager
aws ssm start-session --target \ $(kubectl get node --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
Stop the container runtime
{ sudo systemctl stop containerd }
End the Session Manager session
exit
Check the node status
kubectl get node
Check the detailed status of the node on which the container runtime was stopped above
kubectl describe node $(kubectl get node --sort-by='.metadata.creationTimestamp' -o=jsonpath='{.items[-1].metadata.name}')
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Deploy the demo application
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: name: nginx labels: app: nginx spec: ports: - port: 80 clusterIP: None selector: app: nginx --- apiVersion: apps/v1 kind: StatefulSet metadata: name: nginx spec: serviceName: nginx replicas: 1 selector: matchLabels: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx volumeMounts: - mountPath: /data name: data volumeClaimTemplates: - metadata: name: data spec: accessModes: ["ReadWriteOnce"] resources: requests: storage: 1Gi EOF
Confirm that the Pod was created
kubectl get pod -l app=nginx
Enter the following query in the expression browser to check the available disk space per PV
sum (kubelet_volume_stats_available_bytes) by (persistentvolumeclaim)
Create a 999 MB file on the PV
kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
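Optionally, confirm from inside the Pod that the volume is nearly full; /data is the mount path defined in the manifest above:
kubectl exec -it nginx-0 -- df -h /data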
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Open https://webhook.site and note the generated webhook URL - do not close the web page
Set the URL generated above as an environment variable
export WEBHOOK_URL=<생성한 Webhook URL>
Create the Alertmanager
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: alertmanager-config labels: app: alertmanager namespace: monitoring data: alertmanager.yaml: | route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: infra routes: - matchers: - team=dev routes: - matchers: - severity=warning receiver: dev active_time_intervals: - daytime mute_time_intervals: - weekends - matchers: - severity=critical receiver: urgent receivers: - name: infra slack_configs: - api_url: $WEBHOOK_URL channel: #infra send_resolved: true - name: dev slack_configs: - api_url: $WEBHOOK_URL channel: #dev send_resolved: true - name: urgent slack_configs: - api_url: $WEBHOOK_URL channel: #urgent send_resolved: true time_intervals: - name: daytime time_intervals: - times: - start_time: '07:00' end_time: '23:00' - name: weekends time_intervals: - weekdays: ['saturday', 'sunday'] inhibit_rules: - source_matchers: - alertname=KubePersistentVolumeFillingUp target_matchers: - alertname=KubePersistentVolumeAlmostFillingUp equal: - persistentvolumeclaim --- apiVersion: v1 kind: Service metadata: name: alertmanager labels: app: alertmanager namespace: monitoring spec: type: LoadBalancer ports: - port: 80 targetPort: 9093 selector: app: alertmanager --- apiVersion: apps/v1 kind: Deployment metadata: name: alertmanager labels: app: alertmanager namespace: monitoring spec: selector: matchLabels: app: alertmanager template: metadata: labels: app: alertmanager spec: securityContext: fsGroup: 2000 containers: - name: alertmanager image: prom/alertmanager args: - --config.file=/etc/alertmanager/alertmanager.yaml ports: - containerPort: 9093 volumeMounts: - name: alertmanager-config mountPath: /etc/alertmanager volumes: - name: alertmanager-config configMap: name: alertmanager-config EOF
Confirm that the Pod has been created
kubectl -n monitoring get pod -l app=alertmanager
Check the Alertmanager server endpoint
kubectl -n monitoring get svc alertmanager \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser
Add Alertmanager to the Prometheus configuration
cat <<'EOF' | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config labels: app: prometheus namespace: monitoring data: alerts.yaml: | groups: - name: kubernetes-apps rules: - alert: KubePodCrashLooping annotations: description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").' runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping summary: Pod is crash looping. expr: | max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1 for: 15m labels: severity: warning team: dev - name: kubernetes-system-kubelet rules: - alert: KubeNodeNotReady annotations: description: '{{ $labels.node }} has been unready for more than 1 minutes.' runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready summary: Node is not ready. expr: | kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 for: 1m labels: severity: warning - name: kubernetes-storage rules: - alert: KubePersistentVolumeFillingUp annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup summary: PersistentVolume is filling up. expr: | ( kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} ) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0 for: 1m labels: severity: critical - alert: KubePersistentVolumeAlmostFillingUp annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup summary: PersistentVolume is almost filling up. 
expr: | ( kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} ) < 0.20 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0 for: 1m labels: severity: warning prometheus.yaml: | global: scrape_interval: 10s evaluation_interval: 10s rule_files: - alerts.yaml alerting: alertmanagers: - static_configs: - targets: ['alertmanager.monitoring.svc'] scrape_configs: - job_name: kubelet scheme: https authorization: type: Bearer credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token tls_config: insecure_skip_verify: true kubernetes_sd_configs: - role: node metric_relabel_configs: - source_labels: [__name__] regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds) action: drop - source_labels: [__name__] regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds) action: drop - source_labels: [__name__] regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs) action: drop - source_labels: [__name__] regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout) action: drop - source_labels: [__name__] regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total) action: drop - source_labels: [__name__] regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary) action: drop - source_labels: [__name__] regex: transformation_(transformation_latencies_microseconds|failures_total) action: drop - source_labels: [__name__] regex: 
(admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_wo
rk_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count) action: drop - job_name: kube-state-metrics kubernetes_sd_configs: - role: endpoints namespaces: names: - kube-system relabel_configs: - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component] regex: exporter action: keep - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name] regex: kube-state-metrics action: keep - source_labels: [__meta_kubernetes_endpoint_port_name] regex: http-metrics action: keep EOF
Go to the Alertmanager dashboard and confirm that the alert has fired
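The firing alerts can also be listed through the Alertmanager v2 API. This optional sketch assumes the alertmanager LoadBalancer hostname from the step above and that jq is installed:
export ALERTMANAGER_URL=$(kubectl -n monitoring get svc alertmanager \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl -s http://$ALERTMANAGER_URL/api/v2/alerts | jq '.[].labels.alertname'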
Switch to the browser tab where https://webhook.site is open and check whether a new message has arrived
In the message body, check the URL given in title_link
Delete the file created on the PV
kubectl exec -it nginx-0 -- rm /data/bigfile
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Go to the Alertmanager dashboard and confirm that the alert has cleared
Switch to the browser tab where https://webhook.site is open and check whether a new message has arrived - the notification is sent once the group_interval value has elapsed since the previous alert fired
If no new message arrives, check the Alertmanager logs
kubectl -n monitoring logs deploy/alertmanager
Update the Alertmanager configuration
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: alertmanager-config labels: app: alertmanager namespace: monitoring data: alertmanager.yaml: | route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: infra routes: - matchers: - team=dev routes: - matchers: - severity=warning receiver: dev active_time_intervals: - daytime mute_time_intervals: - weekends - matchers: - severity=critical receiver: urgent receivers: - name: infra slack_configs: - api_url: $WEBHOOK_URL channel: #infra send_resolved: true - name: dev slack_configs: - api_url: $WEBHOOK_URL channel: #dev send_resolved: true - name: urgent slack_configs: - api_url: $WEBHOOK_URL channel: #urgent send_resolved: true actions: - type: button text: 'Query :mag:' url: '{{ (index .Alerts 0).GeneratorURL }}' time_intervals: - name: daytime time_intervals: - times: - start_time: '07:00' end_time: '23:00' - name: weekends time_intervals: - weekdays: ['saturday', 'sunday'] inhibit_rules: - source_matchers: - alertname=KubePersistentVolumeFillingUp target_matchers: - alertname=KubePersistentVolumeAlmostFillingUp equal: - persistentvolumeclaim --- apiVersion: apps/v1 kind: Deployment metadata: name: alertmanager labels: app: alertmanager namespace: monitoring spec: selector: matchLabels: app: alertmanager template: metadata: labels: app: alertmanager spec: securityContext: fsGroup: 2000 containers: - name: alertmanager image: prom/alertmanager args: - --config.file=/etc/alertmanager/alertmanager.yaml - --web.external-url=http://$(kubectl -n monitoring get svc alertmanager -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}') ports: - containerPort: 9093 volumeMounts: - name: alertmanager-config mountPath: /etc/alertmanager volumes: - name: alertmanager-config configMap: name: alertmanager-config EOF
From the menu at the top of the Alertmanager dashboard, go to Status and check whether the configuration file has been updated
Reload the Alertmanager configuration file - perform this step only if the configuration was not updated in the previous step
curl -X POST http://$(kubectl -n monitoring get svc alertmanager -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
Update the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: StatefulSet metadata: name: prometheus labels: app: prometheus namespace: monitoring spec: selector: matchLabels: app: prometheus serviceName: prometheus template: metadata: labels: app: prometheus spec: serviceAccountName: prometheus securityContext: fsGroup: 2000 containers: - name: prometheus image: quay.io/prometheus/prometheus args: - --config.file=/etc/prometheus/prometheus.yaml - --storage.tsdb.path=/data - --web.enable-lifecycle - --web.external-url=http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}') ports: - containerPort: 9090 volumeMounts: - name: prometheus-config mountPath: /etc/prometheus - name: prometheus-data mountPath: /data - name: config-reloader image: quay.io/prometheus-operator/prometheus-config-reloader:v0.61.1 args: - --reload-url=http://127.0.0.1:9090/-/reload - --config-file=/etc/prometheus/prometheus.yaml volumeMounts: - name: prometheus-config mountPath: /etc/prometheus volumes: - name: prometheus-config configMap: name: prometheus-config volumeClaimTemplates: - metadata: name: prometheus-data spec: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi EOF
From the menu at the top of the Prometheus dashboard, click Status -> Command-Line Flags and check whether the updated settings have been applied
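Because the StatefulSet now starts Prometheus with --web.enable-lifecycle, the configuration can also be reloaded manually if the config-reloader sidecar has not picked up a change yet; this is only a fallback sketch:
curl -X POST http://$(kubectl -n monitoring get svc prometheus-external \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload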
Create a 999 MB file on the PV
kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Go to the Alertmanager dashboard and confirm that the alert has fired
Switch to the browser tab where https://webhook.site is open and check whether a new message has arrived - open the URLs given in title_link and actions and see what each one shows
Delete the resources
{ kubectl delete ns monitoring kubectl delete sts nginx kubectl delete svc nginx kubectl delete pvc -l app=nginx kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1 kubectl delete pod busybox kubectl delete -f kube-state-metrics/examples/standard rm -rf kube-state-metrics }
Delete the node that is in the NotReady state
aws ec2 terminate-instances --instance-ids \ $(kubectl get node --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
Restore the node count to its original value
aws eks update-nodegroup-config \ --cluster-name mycluster \ --nodegroup-name nodegroup \ --scaling-config desiredSize=$DESIRED_SIZE
Prometheus Operator
List of Custom Resources - https://prometheus-operator.dev/docs/operator/design/
Install the Prometheus Operator
kubectl create -f \ https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
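Optionally, wait for the operator Deployment (created in the default namespace by bundle.yaml) to become available before creating the custom resources below; a minimal sketch:
kubectl wait --for=condition=Available deploy/prometheus-operator --timeout=5m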
Review the prometheuses.monitoring.coreos.com/v1 object - https://prometheus-operator.dev/docs/operator/api/#prometheus
Install Prometheus
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Namespace metadata: name: monitoring --- apiVersion: v1 kind: ServiceAccount metadata: name: prometheus namespace: monitoring --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: prometheus rules: - apiGroups: [""] resources: - nodes - nodes/metrics - services - endpoints - pods verbs: ["get", "list", "watch"] - apiGroups: [""] resources: - configmaps verbs: ["get"] - apiGroups: - networking.k8s.io resources: - ingresses verbs: ["get", "list", "watch"] - nonResourceURLs: ["/metrics"] verbs: ["get"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: prometheus subjects: - kind: ServiceAccount name: prometheus namespace: monitoring --- apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: k8s namespace: monitoring spec: serviceAccountName: prometheus serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} podMonitorSelector: {} EOF
Check the Prometheus Operator logs
kubectl logs deploy/prometheus-operator
Check the StatefulSet that was created
kubectl -n monitoring get sts
Check the detailed spec of the created StatefulSet
kubectl -n monitoring get sts prometheus-k8s -o yaml
Create a Service
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: labels: app.kubernetes.io/instance: k8s app.kubernetes.io/name: prometheus name: prometheus-k8s namespace: monitoring spec: ports: - name: web port: 80 targetPort: web - name: reloader-web port: 8080 targetPort: reloader-web selector: app.kubernetes.io/instance: k8s app.kubernetes.io/name: prometheus type: LoadBalancer EOF
Check the Prometheus server endpoint
kubectl -n monitoring get svc prometheus-k8s \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser
From the menu at the top, click Status -> Configuration
Check the details of the created Prometheus object
kubectl -n monitoring get prom k8s -o yaml
Check the Secrets that were created
kubectl -n monitoring get secret
Check the details of the Secret that stores the Prometheus configuration file
kubectl -n monitoring get secret prometheus-k8s -o yaml
Decode the Base64-encoded Prometheus configuration file
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
Set a new scrape_interval value
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: k8s namespace: monitoring spec: serviceAccountName: prometheus serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} podMonitorSelector: {} scrapeInterval: 10s EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
Check the Prometheus logs
kubectl -n monitoring logs prometheus-k8s-0
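If kubectl asks for a container name (the Pod also runs a config-reloader sidecar), point it at the prometheus container explicitly:
kubectl -n monitoring logs prometheus-k8s-0 -c prometheus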
From the menu at the top of the Prometheus dashboard, go to Status -> Configuration and check whether the configuration change has been applied
Create a ServiceMonitor - https://prometheus-operator.dev/docs/operator/api/#servicemonitor
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: kube-apiserver namespace: monitoring spec: selector: matchLabels: component: apiserver provider: kubernetes namespaceSelector: matchNames: - default endpoints: - interval: 30s port: https scheme: https bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token tlsConfig: caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt serverName: kubernetes metricRelabelings: - sourceLabels: - __name__ regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds) action: drop - sourceLabels: - __name__ regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds) action: drop - sourceLabels: - __name__ regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs|longrunning_gauge|registered_watchers) action: drop - sourceLabels: - __name__ regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout) action: drop - sourceLabels: - __name__ regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total) action: drop - sourceLabels: - __name__ regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary) action: drop - sourceLabels: - __name__ regex: transformation_(transformation_latencies_microseconds|failures_total) action: drop - sourceLabels: - __name__ regex: 
(admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_wo
rk_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count) action: drop - sourceLabels: - __name__ regex: etcd_(debugging|disk|server).* action: drop - sourceLabels: - __name__ regex: apiserver_admission_controller_admission_latencies_seconds_.* action: drop - sourceLabels: - __name__ regex: apiserver_admission_step_admission_latencies_seconds_.* action: drop - sourceLabels: - __name__ - le regex: apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50) action: drop EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that kube-apiserver has been added
Check the list of metric names that have been collected recently
group by(__name__) ({__name__!=""})
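A roughly comparable list of metric names can also be pulled from the HTTP API. This optional sketch assumes the prometheus-k8s LoadBalancer hostname from the earlier step and that jq is installed; note it returns every metric name Prometheus knows about, not only recently scraped ones:
curl -s http://$(kubectl -n monitoring get svc prometheus-k8s \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/label/__name__/values | jq '.data[]'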
Check the number of API requests per Kubernetes resource type
sum by(resource) (apiserver_request_total)
Deploy the demo application
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: nginx data: nginx.conf: | user nginx; worker_processes 1; events { worker_connections 1024; } http { server { listen 80; server_name localhost; rewrite ^/(.*)/$ /$1 permanent; location / { root /usr/share/nginx/html; index index.html index.htm; } location /metrics { stub_status on; access_log off; allow all; } } } --- apiVersion: v1 kind: Pod metadata: name: nginx labels: app: nginx spec: containers: - name: nginx image: nginx ports: - name: http containerPort: 80 volumeMounts: - name: nginx-conf mountPath: /etc/nginx - name: nginx-exporter image: nginx/nginx-prometheus-exporter:0.10.0 ports: - name: http-metric containerPort: 9113 args: - "-nginx.scrape-uri=http://localhost/metrics" volumes: - name: nginx-conf configMap: name: nginx items: - key: nginx.conf path: nginx.conf EOF
Confirm that the Pod was created
kubectl get pod -l app=nginx
Check the metrics exposed by the NGINX exporter
kubectl exec -it nginx -c nginx -- curl localhost:9113/metrics
Create a PodMonitor - https://prometheus-operator.dev/docs/operator/api/#podmonitor
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: PodMonitor metadata: name: nginx namespace: monitoring spec: namespaceSelector: matchNames: - default selector: matchLabels: app: nginx podMetricsEndpoints: - port: http-metric EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that nginx has been added
Modify the PodMonitor
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: PodMonitor metadata: name: nginx namespace: monitoring spec: namespaceSelector: matchNames: - default selector: matchLabels: app: nginx podMetricsEndpoints: - port: http-metric relabelings: - regex: container action: labeldrop - regex: endpoint action: labeldrop jobLabel: app EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check whether the labels on the nginx target have changed
Run the following query in the expression browser to confirm that the metrics exposed by the NGINX exporter are being collected
nginx_http_requests_total
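For example, the per-second request rate over the last minute - the same expression the alert rule created later in this lab evaluates - can be checked with:
rate(nginx_http_requests_total[1m])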
Create a scrape configuration file
cat > additional-scrape-job.yaml <<EOF - job_name: prometheus static_configs: - targets: [localhost:9090] EOF
Create a Secret
kubectl -n monitoring create secret generic additional-scrape-configs \ --from-file=additional-scrape-job.yaml
Apply the manually created scrape configuration file
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: k8s namespace: monitoring spec: serviceAccountName: prometheus serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} podMonitorSelector: {} scrapeInterval: 10s additionalScrapeConfigs: name: additional-scrape-configs key: additional-scrape-job.yaml EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that prometheus has been added
Run the following query in the expression browser to confirm that the metrics exposed by Prometheus are being collected
{job="prometheus"}
Install Alertmanager - https://prometheus-operator.dev/docs/operator/api/#alertmanager
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Alertmanager metadata: name: k8s namespace: monitoring spec: {} EOF
Check the StatefulSet that was created
kubectl -n monitoring get sts
Check the detailed spec of the created StatefulSet
kubectl -n monitoring get sts alertmanager-k8s -o yaml
Create a Service
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: labels: app.kubernetes.io/instance: k8s app.kubernetes.io/name: alertmanager name: alertmanager-k8s namespace: monitoring spec: ports: - name: web port: 80 targetPort: web - name: reloader-web port: 8080 targetPort: reloader-web selector: app.kubernetes.io/instance: k8s app.kubernetes.io/name: alertmanager type: LoadBalancer EOF
Check the Alertmanager server endpoint
kubectl -n monitoring get svc alertmanager-k8s \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser
From the menu at the top, click Status
Open https://webhook.site and note the generated webhook URL - do not close the web page
Create a Secret with the webhook URL identified above
kubectl -n monitoring create secret generic slack-config \ --from-literal=api-url=<WEBHOOK_URL>
Create an AlertmanagerConfig
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1alpha1 kind: AlertmanagerConfig metadata: name: alertmanager-k8s labels: alertmanagerConfig: default namespace: monitoring spec: route: groupBy: ['alertname'] groupWait: 30s groupInterval: 5m repeatInterval: 3h receiver: infra receivers: - name: infra slackConfigs: - apiURL: name: slack-config key: api-url channel: infra sendResolved: true EOF
Apply the AlertmanagerConfig to the Alertmanager
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Alertmanager metadata: name: k8s namespace: monitoring spec: alertmanagerConfigSelector: matchLabels: alertmanagerConfig: default EOF
Check whether the Alertmanager configuration file has been updated
kubectl -n monitoring get secret alertmanager-k8s-generated \ -o jsonpath="{.data['alertmanager\.yaml\.gz']}" | base64 -d | gunzip
Apply the AlertmanagerConfig created above as the global configuration - https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md#specify-global-alertmanager-config
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Alertmanager metadata: name: k8s namespace: monitoring spec: alertmanagerConfiguration: name: alertmanager-k8s EOF
Check whether the Alertmanager configuration file has been updated
kubectl -n monitoring get secret alertmanager-k8s-generated \ -o jsonpath="{.data['alertmanager\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of Alertmanager, click Status and check whether the configuration file has been updated
Add Alertmanager to the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: k8s namespace: monitoring spec: serviceAccountName: prometheus serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} podMonitorSelector: {} scrapeInterval: 10s additionalScrapeConfigs: name: additional-scrape-configs key: additional-scrape-job.yaml alerting: alertmanagers: - namespace: monitoring name: alertmanager-operated port: web EOF
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Configuration and check whether the configuration change has been applied
Create an alert rule
cat <<'EOF' | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: nginx-alert namespace: monitoring labels: app: nginx spec: groups: - name: nginx rules: - alert: TooManyRequest annotations: description: '{{ $labels.pod }} is demanding.' summary: RPS is higher than 10. expr: rate(nginx_http_requests_total[1m]) > 10 for: 1m labels: team: frontend EOF
Add the rule created above to the Prometheus configuration
cat <<EOF | kubectl apply -f - apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: k8s namespace: monitoring spec: serviceAccountName: prometheus serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} podMonitorSelector: {} scrapeInterval: 10s additionalScrapeConfigs: name: additional-scrape-configs key: additional-scrape-job.yaml alerting: alertmanagers: - namespace: monitoring name: alertmanager-operated port: web ruleSelector: matchLabels: app: nginx EOF
Check whether the rule file has been created
kubectl -n monitoring get cm prometheus-k8s-rulefiles-0 -o yaml
Check whether the Prometheus configuration file has been updated
kubectl -n monitoring get secret prometheus-k8s \ -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
From the menu at the top of the Prometheus dashboard, go to Status -> Rules and confirm that the alert rules have been added
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Create a Pod that generates load against NGINX
kubectl run load-generator --image=busybox \ -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://$(kubectl get pod nginx -o=jsonpath='{.status.podIP}'); done"
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
Go to the Alertmanager dashboard and confirm that the alert has fired
Switch to the browser tab where https://webhook.site was opened earlier and check whether a new message has arrived
Delete the resources
{ kubectl delete ns monitoring kubectl delete pod nginx load-generator kubectl delete cm nginx kubectl delete clusterrolebinding prometheus kubectl delete clusterrole prometheus kubectl delete -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml rm additional-scrape-job.yaml }
kube-prometheus
Review the kube-prometheus-stack Helm chart - https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
Add the Helm repository
{ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update }
Install the chart
{ kubectl create ns monitoring helm -n monitoring install prometheus prometheus-community/kube-prometheus-stack \ --set fullnameOverride=kube-prometheus \ --set prometheus.service.type=LoadBalancer \ --set prometheus.service.port=80 \ --set alertmanager.service.type=LoadBalancer \ --set alertmanager.service.port=80 \ --set alertmanager.serviceMonitor.selfMonitor=false \ --set grafana.service.type=LoadBalancer \ --set grafana.adminPassword=asdf1234 \ --set grafana.serviceMonitor.enabled=false \ --set defaultRules.create=false \ --set kubeApiServer.enabled=false \ --set kubelet.enabled=false \ --set kubeControllerManager.enabled=false \ --set coreDns.enabled=false \ --set kubeEtcd.enabled=false \ --set kubeScheduler.enabled=false \ --set kubeProxy.enabled=false \ --set kubeStateMetrics.enabled=false \ --set nodeExporter.enabled=false }
Check the objects that were created
kubectl get all -n monitoring
Check the ServiceMonitors that were created
kubectl get servicemonitors.monitoring.coreos.com -A
Check the Prometheus server endpoint
kubectl -n monitoring get svc kube-prometheus-prometheus
OR
kubectl -n monitoring get svc kube-prometheus-prometheus \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check the list of jobs
Check the Alertmanager server endpoint
kubectl -n monitoring get svc kube-prometheus-alertmanager \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser
Check the Grafana server endpoint
kubectl -n monitoring get svc prometheus-grafana \ -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
Open the URL identified above in a web browser - username: admin, password: asdf1234
In the dashboard list, open Prometheus / Overview
In the dashboard list, open Kubernetes / Compute Resources / Pod
Enable kubelet metrics collection
helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \ --reuse-values \ --set kubelet.enabled=true
Check the list of ServiceMonitors
kubectl get servicemonitors.monitoring.coreos.com -A
Check the details of the newly created ServiceMonitor
kubectl get servicemonitors.monitoring.coreos.com \ kube-prometheus-kubelet -n monitoring -o yaml
List the Services in the kube-system namespace
kubectl get svc -n kube-system
List the Endpoints in the kube-system namespace
kubectl get ep -n kube-system
Check the node IP addresses
kubectl get node \ -o=custom-columns='NodeName:.metadata.name,InternalIP:status.addresses[?(@.type=="InternalIP")].address,ExternalIP:status.addresses[?(@.type=="ExternalIP")].address'
From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check the list of jobs
In the Grafana dashboard list, open Kubernetes / Kubelet
From the menu at the top of the Prometheus dashboard, go to Alerts and check the list of alerts
Add the kubelet rules
helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \ --reuse-values \ --set defaultRules.create=true \ --set defaultRules.rules.kubelet=true \ --set defaultRules.rules.alertmanager=false \ --set defaultRules.rules.etcd=false \ --set defaultRules.rules.configReloaders=false \ --set defaultRules.rules.general=false \ --set defaultRules.rules.k8s=false \ --set defaultRules.rules.kubeApiserverAvailability=false \ --set defaultRules.rules.kubeApiserverBurnrate=false \ --set defaultRules.rules.kubeApiserverHistogram=false \ --set defaultRules.rules.kubeApiserverSlos=false \ --set defaultRules.rules.kubeProxy=false \ --set defaultRules.rules.kubePrometheusGeneral=false \ --set defaultRules.rules.kubePrometheusNodeRecording=false \ --set defaultRules.rules.kubernetesApps=false \ --set defaultRules.rules.kubernetesResources=false \ --set defaultRules.rules.kubernetesStorage=false \ --set defaultRules.rules.kubernetesSystem=false \ --set defaultRules.rules.kubeScheduler=false \ --set defaultRules.rules.kubeStateMetrics=false \ --set defaultRules.rules.node=false \ --set defaultRules.rules.nodeExporterAlerting=false \ --set defaultRules.rules.nodeExporterRecording=false \ --set defaultRules.rules.prometheus=false \ --set defaultRules.rules.prometheusOperator=false
From the menu at the top of the Prometheus dashboard, go to Status -> Rules and check whether the new rules have been added
Enable the kubernetesStorage rules
helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \ --reuse-values \ --set defaultRules.rules.kubernetesStorage=true
From the menu at the top of the Prometheus dashboard, go to Status -> Rules and check whether the new rules have been added
From the menu at the top of the Prometheus dashboard, go to Alerts and check whether new alerts have been added
Deploy the demo application
cat <<EOF | kubectl apply -f - apiVersion: v1 kind: Service metadata: name: nginx labels: app: nginx spec: ports: - port: 80 clusterIP: None selector: app: nginx --- apiVersion: apps/v1 kind: StatefulSet metadata: name: nginx spec: serviceName: nginx replicas: 1 selector: matchLabels: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx volumeMounts: - mountPath: /data name: data volumeClaimTemplates: - metadata: name: data spec: accessModes: ["ReadWriteOnce"] resources: requests: storage: 1Gi EOF
Confirm that the Pod was created
kubectl get pod -l app=nginx
Enter the following query in the expression browser to check the available disk space per PV
sum (kubelet_volume_stats_available_bytes) by (persistentvolumeclaim)
In the Grafana dashboard list, open Kubernetes / Persistent Volumes
Create a 999 MB file on the PV
kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
In the Grafana dashboard list, open Kubernetes / Persistent Volumes
Go to the Alertmanager dashboard and confirm that the alert has fired
Open https://webhook.site and note the generated webhook URL - do not close the web page
Set the webhook URL identified above as an environment variable
export WEBHOOK_URL=<Webhook URL>
Add a receiver configuration to Alertmanager
helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \ --reuse-values \ --set alertmanager.config.route.receiver=infra \ --set alertmanager.config.route.routes=null \ --set alertmanager.config.receivers[0].name=infra \ --set alertmanager.config.receivers[0].slack_configs[0].api_url=$WEBHOOK_URL \ --set alertmanager.config.receivers[0].slack_configs[0].channel="#infra" \ --set alertmanager.config.receivers[0].slack_configs[0].send_resolved="true"
Switch to the browser tab where https://webhook.site is open and check whether a new message has arrived
Delete the resources
{ kubectl delete svc nginx kubectl delete sts nginx kubectl delete pvc -l app=nginx helm -n monitoring uninstall prometheus kubectl delete crd alertmanagerconfigs.monitoring.coreos.com kubectl delete crd alertmanagers.monitoring.coreos.com kubectl delete crd podmonitors.monitoring.coreos.com kubectl delete crd probes.monitoring.coreos.com kubectl delete crd prometheuses.monitoring.coreos.com kubectl delete crd prometheusrules.monitoring.coreos.com kubectl delete crd servicemonitors.monitoring.coreos.com kubectl delete crd thanosrulers.monitoring.coreos.com kubectl delete ns monitoring }