# Hands-on Lab

### Introduction

1. Deploy the Prometheus server

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Namespace
   metadata:
     name: monitoring
   ---
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: prometheus
     labels:
       app: prometheus
     namespace: monitoring
   spec:
     ports:
     - port: 9090
     clusterIP: None
     selector:
       app: prometheus
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: prometheus-external
     labels:
       app: prometheus
     namespace: monitoring
   spec:
     type: LoadBalancer
     ports:
     - port: 80
       targetPort: 9090
     selector:
       statefulset.kubernetes.io/pod-name: prometheus-0
   ---
   apiVersion: apps/v1
   kind: StatefulSet
   metadata:
     name: prometheus
     labels:
       app: prometheus
     namespace: monitoring
   spec:
     selector:
       matchLabels:
         app: prometheus
     serviceName: prometheus
     template:
       metadata:
         labels:
           app: prometheus
       spec:
         securityContext:
           fsGroup: 2000
         containers:
         - name: prometheus
           image: quay.io/prometheus/prometheus
           args:
           - --config.file=/etc/prometheus/prometheus.yaml
           - --storage.tsdb.path=/data
           ports:
           - containerPort: 9090
           volumeMounts:
           - name: prometheus-config
             mountPath: /etc/prometheus
           - name: prometheus-data
             mountPath: /data
         volumes:
         - name: prometheus-config
           configMap:
             name: prometheus-config
     volumeClaimTemplates:
     - metadata:
         name: prometheus-data
       spec:
         accessModes:
         - ReadWriteOnce
         resources:
           requests:
             storage: 10Gi
   EOF
   ```
2. Verify that the Pod has been created

   ```
   kubectl -n monitoring get pod prometheus-0
   ```
3. Look up the Prometheus server endpoint

   ```
   kubectl -n monitoring get svc prometheus-external \
   -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
   ```
4. Open the URL from the previous step in a web browser
5. In the top menu, click Status -> Targets
6. In the top menu, click Status -> Command-Line Flags
7. In the top menu, click Status -> Configuration
8. In the top menu, click Graph
9. Open a new browser tab and go to the */metrics* path of the Prometheus server. The following command prints the URL:

   ```
   kubectl -n monitoring get svc prometheus-external \
   -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"/metrics"}{"\n"}'
   ```
10. Enter and run the following queries in the expression browser
    1. Total number of samples ingested by Prometheus

       ```
       prometheus_tsdb_head_samples_appended_total
       ```
    2. Samples ingested per second over the last minute

       ```
       rate(prometheus_tsdb_head_samples_appended_total[1m])
       ```
    3. State of the prometheus job

       ```
       up{job="prometheus"}
       ```
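
To make the difference between the raw counter and `rate()` in the queries above concrete, here is a minimal Python sketch of the idea: add up the increases between consecutive samples, treat a drop as a counter reset, and divide by the elapsed time. It is a simplification, not how Prometheus implements it: the real PromQL `rate()` also extrapolates to the edges of the range window. `simple_rate` and the sample values are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter over the given samples.

    Hypothetical helper: unlike PromQL's rate(), it does not extrapolate
    to the boundaries of the range window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter only goes up. A lower value means the process restarted
        # and the counter began again from zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up samples scraped every 5 seconds, with a reset before the last one.
window = [(0.0, 100.0), (5.0, 150.0), (10.0, 200.0), (15.0, 20.0)]
print(simple_rate(window))
```

The raw counter query returns the latest value as-is, while the rate form turns the same series into a per-second figure that stays meaningful across restarts.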

### Expose metrics

1. Create an NGINX server

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: nginx
   data:
     nginx.conf: |
       user nginx;
       worker_processes  1;
       events {
           worker_connections  1024;
       }
       http {
           server {
               listen       80;
               server_name  localhost;
               rewrite ^/(.*)/$ /$1 permanent;
               
               location / {
                   root   /usr/share/nginx/html;
                   index  index.html index.htm;
               }
               location /metrics {
                   default_type "text/plain";
                   alias   /usr/share/nginx/html/metrics.txt;
               }
           }
       }
     metrics.txt: |
       requests_total 1234
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: nginx
   spec:
     containers:
     - image: nginx
       name: nginx
       ports:
       - containerPort: 80
       volumeMounts:
       - name: nginx-conf
         mountPath: /etc/nginx
       - name: metrics
         mountPath: /usr/share/nginx/html
     volumes:
     - name: nginx-conf
       configMap:
         name: nginx
         items:
         - key: nginx.conf
           path: nginx.conf
     - name: metrics
       configMap:
         name: nginx
         items:
         - key: metrics.txt
           path: metrics.txt
   EOF
   ```
2. Request the */metrics* path of the NGINX web server

   ```
   kubectl exec -it nginx -- curl localhost/metrics
   ```
3. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Pod created above has been added
4. Check the IP address of the NGINX Pod

   ```
   kubectl get pod nginx \
   --output=custom-columns="NAME:.metadata.name,IP:.status.podIP"
   ```
5. Update the Prometheus configuration file

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']
         - job_name: 'nginx'
           static_configs:
             - targets: ['$(kubectl get pod nginx -o=jsonpath="{.status.podIP}")']
   EOF
   ```
6. Verify that the Prometheus configuration file was updated

   ```
   kubectl -n monitoring get cm prometheus-config -o yaml | yq e '.data' -
   ```
7. In the top menu of the Prometheus dashboard, go to Status -> Configuration and check whether the change has been applied
8. Reload the Prometheus configuration file

   ```
   curl -X POST http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
   ```
9. In the top menu of the Prometheus dashboard, go to Status -> Command-Line Flags and review the startup flags
10. Check the startup flags specified on the Prometheus container

    ```
    kubectl -n monitoring get sts prometheus \
    --output=custom-columns="NAME:.metadata.name,ARGS:.spec.template.spec.containers[0].args"
    ```
11. Enable the lifecycle API

    ```
    kubectl -n monitoring patch sts prometheus --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--web.enable-lifecycle"}]'
    ```
12. Verify that the Prometheus Pod has been recreated

    ```
    kubectl -n monitoring get pod prometheus-0
    ```
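13. In the top menu of the Prometheus dashboard, go to Status -> Command-Line Flags and review the startup flags
14. In the top menu of the Prometheus dashboard, go to Status -> Configuration and check whether the change has been applied
15. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Pod created above has been added
16. Run the following query in the expression browser to check that the metric exposed by the NGINX server is being collected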

    ```
    requests_total
    ```
17. Review the source code written with the Prometheus Python client - <https://github.com/youngwjung/prometheus-python-client/blob/main/app.py>
18. Create the Pod

    ```
    kubectl run prom-py --image=youngwjung/prometheus-python-client
    ```
19. Call the application

    ```
    kubectl exec prom-py -- curl -s localhost:8000
    ```
20. Delete the resources

    ```shell
    {
        kubectl delete cm nginx
        kubectl delete pod nginx prom-py
    }
    ```
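
The `metrics.txt` file served by NGINX above is about the smallest valid example of the Prometheus text exposition format: a single `name value` line. As a rough sketch of what a client library adds on top of that, the following Python renders one counter with its `# HELP` and `# TYPE` comment lines and optional labels. `render_counter` is a hypothetical helper written for this guide, not part of the `prometheus_client` package, and the help text is made up.

```python
from typing import Dict, Optional


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, value: float, help_text: str,
                   labels: Optional[Dict[str, str]] = None) -> str:
    """Render one counter sample in the Prometheus text exposition format."""
    label_part = ""
    if labels:
        pairs = [f'{key}="{escape_label_value(val)}"'
                 for key, val in sorted(labels.items())]
        label_part = "{" + ",".join(pairs) + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_part} {value}\n"
    )


# Same metric as metrics.txt, with the metadata lines added.
print(render_counter("requests_total", 1234, "Total requests served."), end="")
```

Prometheus accepts the bare `requests_total 1234` line as well. The metadata lines mainly give the metric a type and a description.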

### Exporters

1. What is an exporter? - <https://prometheus.io/docs/introduction/glossary/#exporter>
2. Types of exporters - <https://prometheus.io/docs/instrumenting/exporters/>
3. Review the source code of a Python Flask web application with an exporter applied - <https://github.com/youngwjung/prometheus-flask-exporter/blob/main/app.py>

   Only the following two lines were added to the existing source code:

   ```
   from prometheus_flask_exporter import PrometheusMetrics
   metrics = PrometheusMetrics(app)
   ```
4. Create the demo application

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     labels:
       app: flask
     name: flask
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: flask
     template:
       metadata:
         labels:
           app: flask
       spec:
         containers:
         - name: flask
           image: youngwjung/prometheus-flask-exporter
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: flask
     labels:
       app: flask
   spec:
     ports:
     - port: 80
     selector:
       app: flask
   EOF
   ```
5. Check the metrics exposed by the exporter

   ```
   kubectl exec -it deploy/flask -- curl -s localhost/metrics
   ```
6. Create a Pod that generates load against the Flask application

   ```
   kubectl run load-generator --image=busybox \
   -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://flask; done"
   ```
7. Check that HTTP-related metrics are now reported

   ```
   kubectl exec -it deploy/flask -- curl -s localhost/metrics
   ```
8. Change the Prometheus configuration to scrape the metrics exposed by the Flask application

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']
         - job_name: 'flask'
           static_configs:
             - targets: ['flask.default']
   EOF
   ```
9. Reload the Prometheus configuration file

   ```
   curl -X POST http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
   ```
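10. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Service created above has been added
11. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected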

    ```
    flask_http_request_total
    ```
12. Check the average requests per second over the last 5 minutes

    ```
    rate(flask_http_request_total[5m])
    ```
13. Select Graph to display the metric as a line graph
14. Delete the demo application

    ```
    {
        kubectl delete svc flask
        kubectl delete deploy flask
        kubectl delete pod load-generator
    }
    ```
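
The two-line change in the Flask application works because the exporter wraps request handling and counts every request for you. The sketch below illustrates that idea in plain Python with a decorator and a counter keyed by label values. It is an assumption-laden illustration, not the implementation of `prometheus_flask_exporter`: `instrument`, `app`, and the `method`/`status` label names are hypothetical.

```python
from collections import Counter
from typing import Callable

# Counter keyed by (method, status). It stands in for a labelled metric such
# as flask_http_request_total; the label names are assumptions of this sketch.
http_requests: Counter = Counter()


def instrument(handler: Callable[[str, str], int]) -> Callable[[str, str], int]:
    """Wrap a handler so every call is counted by method and status code."""
    def wrapper(method: str, path: str) -> int:
        status = handler(method, path)
        http_requests[(method, str(status))] += 1
        return status
    return wrapper


@instrument
def app(method: str, path: str) -> int:
    # Toy handler: only "/" exists, everything else is a 404.
    return 200 if path == "/" else 404


# Simulate a little traffic, similar to what the load generator produces.
for _ in range(3):
    app("GET", "/")
app("GET", "/missing")

for (method, status), count in sorted(http_requests.items()):
    print(f'http_request_total{{method="{method}",status="{status}"}} {count}')
```

Each distinct label combination becomes its own time series, which is why the query results in the steps above show one line per combination.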

### Config Reloader

1. Add a container that detects changes to the Prometheus configuration file and reloads it

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: apps/v1
   kind: StatefulSet
   metadata:
     name: prometheus
     labels:
       app: prometheus
     namespace: monitoring
   spec:
     selector:
       matchLabels:
         app: prometheus
     serviceName: prometheus
     template:
       metadata:
         labels:
           app: prometheus
       spec:
         securityContext:
           fsGroup: 2000
         containers:
         - name: prometheus
           image: quay.io/prometheus/prometheus
           args:
           - --config.file=/etc/prometheus/prometheus.yaml
           - --storage.tsdb.path=/data
           - --web.enable-lifecycle
           ports:
           - containerPort: 9090
           volumeMounts:
           - name: prometheus-config
             mountPath: /etc/prometheus
           - name: prometheus-data
             mountPath: /data
         - name: config-reloader
           image: quay.io/prometheus-operator/prometheus-config-reloader:v0.61.1
           args:
           - --reload-url=http://127.0.0.1:9090/-/reload
           - --config-file=/etc/prometheus/prometheus.yaml
           volumeMounts:
           - name: prometheus-config
             mountPath: /etc/prometheus
         volumes:
         - name: prometheus-config
           configMap:
             name: prometheus-config
     volumeClaimTemplates:
     - metadata:
         name: prometheus-data
       spec:
         accessModes:
         - ReadWriteOnce
         resources:
           requests:
             storage: 10Gi
   EOF
   ```
2. Verify that the Prometheus Pod has been recreated

   ```
   kubectl -n monitoring get pod prometheus-0
   ```
3. Change the Prometheus configuration

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']
   EOF
   ```
4. Check the Prometheus logs

   ```
   kubectl -n monitoring logs prometheus-0 -c prometheus --tail 20 -f
   ```
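5. In the top menu of the Prometheus dashboard, go to Status -> Configuration and check whether the change has been applied
6. After the Prometheus configuration file is changed, it can take up to 2-3 minutes for the change to be detected and applied to the server. For the rest of this lab, whenever a step changes the Prometheus configuration, wait 3-4 minutes before moving on to the next step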
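
The behaviour of the config-reloader sidecar added above can be sketched in a few lines of Python: remember a digest of the configuration file, and when the digest changes, send a POST to the reload URL. This is only an illustration of the idea under stated assumptions, not the actual `prometheus-config-reloader` implementation. The function names are hypothetical, and the URL matches the `--reload-url` argument used in this lab.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def post_reload(url: str = "http://127.0.0.1:9090/-/reload") -> None:
    """Ask Prometheus to reload its config (requires --web.enable-lifecycle)."""
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request, timeout=5):
        pass


def check_and_reload(path: Path, last_digest: Optional[str],
                     reload: Callable[[], None] = post_reload) -> str:
    """Call reload() if the file changed since last_digest; return the new digest."""
    digest = file_digest(path)
    if last_digest is not None and digest != last_digest:
        reload()
    return digest


# Hypothetical usage inside the sidecar (polling loop, not executed here):
#   digest = None
#   while True:
#       digest = check_and_reload(Path("/etc/prometheus/prometheus.yaml"), digest)
#       time.sleep(10)
```

The delay noted in step 6 is in addition to whatever the reloader does: the updated ConfigMap first has to reach the mounted volume before any watcher can see a change.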

### Service Discovery

1. Create the demo application

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     labels:
       app: flask
     name: flask
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: flask
     template:
       metadata:
         labels:
           app: flask
       spec:
         containers:
         - name: flask
           image: youngwjung/prometheus-flask-exporter
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: flask
     labels:
       app: flask
   spec:
     ports:
     - port: 80
     selector:
       app: flask
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: load-generator
   spec:
     containers:
     - name: load-generator
       image: busybox
       args:
       - /bin/sh
       - -c
       - while sleep 0.01; do wget -q -O- http://flask; done
   EOF
   ```
2. Change the Prometheus configuration

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
       - job_name: 'kubernetes-services'
         kubernetes_sd_configs:
         - role: service
           namespaces:
             names:
             - default
   EOF
   ```
3. List the Services

   ```
   kubectl get svc -n default
   ```
4. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Services listed above have been added
5. Check the Prometheus server logs

   ```
   kubectl -n monitoring logs prometheus-0 -c prometheus --tail 10
   ```
6. Configure permissions

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus
   rules:
     - apiGroups: [""]
       resources:
         - nodes
         - services
         - endpoints
         - pods
       verbs: ["get", "list", "watch"]
     - apiGroups:
         - extensions
         - networking.k8s.io
       resources:
         - ingresses
       verbs: ["get", "list", "watch"]
     - apiGroups:
         - discovery.k8s.io
       resources:
         - endpointslices
       verbs: ["get", "list", "watch"]
   ---
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: prometheus
     namespace: monitoring
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: prometheus
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: ClusterRole
     name: prometheus
   subjects:
     - kind: ServiceAccount
       name: prometheus
       namespace: monitoring
   EOF
   ```
7. Apply the permissions

   ```
   kubectl -n monitoring patch sts prometheus --type=json \
   -p='[{"op": "add", "path": "/spec/template/spec/serviceAccountName", "value": "prometheus"}]'
   ```
8. Check the Prometheus server logs

   ```
   kubectl -n monitoring logs prometheus-0 -c prometheus --tail 10
   ```
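9. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Services listed above have been added
10. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected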

    ```
    flask_http_request_total
    ```
11. Create a Service

    ```
    kubectl create service clusterip demo --tcp=80:80
    ```
12. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Service created above has been added
13. Delete the Service

    ```
    kubectl delete svc demo
    ```
14. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Service deleted above has disappeared from the list
15. Review the metadata available through service discovery - <https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config>
16. In the top menu of the Prometheus dashboard, go to Status -> Service Discovery and review the labels discovered for each target
17. Change the Prometheus configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-services'
          kubernetes_sd_configs:
          - role: service
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: kubernetes
            action: drop
    EOF
    ```
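18. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the *kubernetes* Service has disappeared from the list
19. In the top menu of the Prometheus dashboard, go to Status -> Service Discovery and review the targets
20. Change the Prometheus configuration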

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-services'
          kubernetes_sd_configs:
          - role: service
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: kubernetes
            action: drop
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
    EOF
    ```
21. List all Pods

    ```
    kubectl get pod -A
    ```
22. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Pods in the cluster have been added
23. Change the Prometheus configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-services'
          kubernetes_sd_configs:
          - role: service
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: kubernetes
            action: drop
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: kubernetes
            action: drop
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
    EOF
    ```
24. List the Endpoints

    ```
    kubectl get ep
    ```
25. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Endpoints have been added
26. Scale the Flask application to 3 Pods

    ```
    kubectl scale deployment flask --replicas=3
    ```
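27. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Endpoints have been added
28. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected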

    ```
    flask_http_request_total
    ```
29. Delete the demo application

    ```
    {
        kubectl delete svc flask
        kubectl delete deploy flask
        kubectl delete pod load-generator
    }
    ```
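
The `relabel_configs` entries used in this section (`drop` on the *kubernetes* Service, `replace` to copy the Pod name into a `pod` label) can be illustrated with a small Python model. This is a minimal sketch, not the Prometheus implementation: `relabel` and `apply_rules` are hypothetical helpers that only cover `drop`, `keep`, and simple `replace` rules. It does mirror three documented defaults: source label values are joined with `;`, the regex must match the whole joined value, and the default `regex`/`replacement` are `(.*)` and `$1`.

```python
import re
from typing import Dict, List, Optional

Labels = Dict[str, str]


def relabel(labels: Labels, rule: dict) -> Optional[Labels]:
    """Apply one relabel rule; return None when the target is dropped."""
    separator = rule.get("separator", ";")
    value = separator.join(labels.get(n, "") for n in rule.get("source_labels", []))
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(rule.get("regex", "(.*)"), value)
    action = rule.get("action", "replace")
    if action == "drop":
        return None if match else labels
    if action == "keep":
        return labels if match else None
    if action == "replace":
        if not match:
            return labels
        # Translate Prometheus-style $1 into Python's \1 (simple cases only).
        replacement = re.sub(r"\$(\d+)", r"\\\1", rule.get("replacement", "$1"))
        return {**labels, rule["target_label"]: match.expand(replacement)}
    raise ValueError(f"unsupported action: {action}")


def apply_rules(labels: Labels, rules: List[dict]) -> Optional[Labels]:
    """Apply rules in order, stopping as soon as the target is dropped."""
    current: Optional[Labels] = labels
    for rule in rules:
        current = relabel(current, rule)
        if current is None:
            return None
    return current


service_rules = [{"source_labels": ["__meta_kubernetes_service_name"],
                  "regex": "kubernetes", "action": "drop"}]
pod_rules = [{"source_labels": ["__meta_kubernetes_pod_name"],
              "action": "replace", "target_label": "pod"}]

print(apply_rules({"__meta_kubernetes_service_name": "kubernetes"}, service_rules))
print(apply_rules({"__meta_kubernetes_service_name": "flask"}, service_rules))
print(apply_rules({"__meta_kubernetes_pod_name": "flask-abc"}, pod_rules))
```

Because rules run in order, a `drop` or `keep` placed early removes a target before later rules spend any work on it.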

### Relabeling

1. Create the demo application

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     labels:
       app: flask
     name: flask
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: flask
     template:
       metadata:
         labels:
           app: flask
       spec:
         containers:
         - name: flask
           image: youngwjung/prometheus-flask-exporter
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: flask
     labels:
       app: flask
   spec:
     ports:
     - port: 80
     selector:
       app: flask
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: load-generator
   spec:
     containers:
     - name: load-generator
       image: busybox
       args:
       - /bin/sh
       - -c
       - while sleep 0.01; do wget -q -O- http://flask; done
   EOF
   ```
2. Change the Prometheus configuration

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
       - job_name: 'kubernetes-endpoints'
         kubernetes_sd_configs:
         - role: endpoints
           namespaces:
             names:
             - default
         relabel_configs:
         - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
           regex: true
           action: keep
   EOF
   ```
3. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Endpoints have been added
4. In the top menu of the Prometheus dashboard, go to Status -> Service Discovery and review the targets
5. Add an annotation to the Service

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Service
   metadata:
     name: flask
     labels:
       app: flask
     annotations:
       prometheus.io/scrape: "true"
   spec:
     ports:
     - port: 80
     selector:
       app: flask
   EOF
   ```
6. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Endpoints have been added
7. Change the Prometheus configuration

   ```
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s

       scrape_configs:
       - job_name: 'kubernetes-endpoints'
         kubernetes_sd_configs:
         - role: endpoints
           namespaces:
             names:
             - default
         relabel_configs:
         - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
           regex: true
           action: keep
         - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
           regex: (.+)
           action: replace
           target_label: __metrics_path__
         - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
           regex: ([^:]+)(?::\d+)?;(\d+)
           action: replace
           replacement: $1:$2
           target_label: __address__
   EOF
   ```
8. In the top menu of the Prometheus dashboard, go to Status -> Targets and check the Endpoint path
9. Add annotations to the Service

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Service
   metadata:
     name: flask
     labels:
       app: flask
     annotations:
       prometheus.io/scrape: "true"
       prometheus.io/path: "/status"
       prometheus.io/port: "8080"
   spec:
     ports:
     - port: 80
     selector:
       app: flask
   EOF
   ```
10. In the top menu of the Prometheus dashboard, go to Status -> Targets and check the Endpoint path
11. Change the annotations on the Service

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: flask
      labels:
        app: flask
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "80"
    spec:
      ports:
      - port: 80
      selector:
        app: flask
    EOF
    ```
12. In the top menu of the Prometheus dashboard, go to Status -> Targets and check the Endpoint paths
13. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected

    ```
    flask_http_request_total
    ```
14. Check the average requests per second over the last 5 minutes

    ```
    rate(flask_http_request_total[5m])
    ```
15. Check the sum of the average requests per second over the last 5 minutes

    ```
    sum(rate(flask_http_request_total[5m]))
    ```
16. Deploy a new application

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: flask-two
      name: flask-two
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: flask-two
      template:
        metadata:
          labels:
            app: flask-two
        spec:
          containers:
          - name: flask
            image: youngwjung/prometheus-flask-exporter
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: flask-two
      labels:
        app: flask-two
      annotations:
        prometheus.io/scrape: "true"
    spec:
      ports:
      - port: 80
      selector:
        app: flask-two
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: load-generator-two
      labels:
        app: load-generator
    spec:
      containers:
      - name: load-generator
        image: busybox
        args:
        - /bin/sh
        - -c
        - while sleep 0.1; do wget -q -O- http://flask-two; done
    EOF
    ```
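17. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the Endpoints have been added
18. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected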

    ```
    flask_http_request_total
    ```
19. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
    EOF
    ```
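20. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the label has been added
21. Run the following query in the expression browser to check the average requests per second over the last 5 minutes for each service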

    ```
    sum by (service)(rate(flask_http_request_total[5m]))
    ```
22. Review the metadata available through service discovery - <https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config>
23. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
    EOF
    ```
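24. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the labels have been added
25. Run the following query in the expression browser to check that the metrics exposed by the Flask server are being collected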

    ```
    flask_http_request_total
    ```
26. Check the labels assigned to the Pods

    ```
    kubectl get pod --show-labels
    ```
27. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: $1
    EOF
    ```
28. In the top menu of the Prometheus dashboard, go to Status -> Targets and check whether the labels have been added
29. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: $1
          - action: labeldrop
            regex: pod_template_hash
    EOF
    ```
30. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that the unnecessary label was removed
31. Check the metrics exposed by the Flask application

    ```
    kubectl exec -it deploy/flask -- curl -s localhost/metrics
    ```
32. In the Expression browser, check the metrics that start with `python_`
33. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: $1
          - action: labeldrop
            regex: pod_template_hash
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: python_(.+)
            action: drop
    EOF
    ```
34. In the Expression browser, check the metrics that start with `python_`
35. In the Expression browser, check the metrics that start with `process_`
36. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: $1
          - action: labeldrop
            regex: pod_template_hash
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: flask_(.+)
            action: keep
    EOF
    ```
37. In the Expression browser, check the metrics that start with `process_`
38. In the Expression browser, check the metrics that start with `flask_`
39. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s

        scrape_configs:
        - job_name: 'kubernetes-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: $1
          - action: labeldrop
            regex: pod_template_hash
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: flask_(.+)
            action: keep
          - source_labels: [__name__]
            action: replace
            regex: flask_(.+)
            replacement: $1
            target_label: __name__
    EOF
    ```
40. Confirm that the metric names were changed
41. Delete the demo applications

    ```
    {
        kubectl delete svc flask flask-two
        kubectl delete deploy flask flask-two
        kubectl delete pod load-generator load-generator-two
    }
    ```
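
The `__address__` rule that appears in every `relabel_configs` block above can be traced by hand. The sketch below is a minimal Python approximation, assuming Prometheus's default `;` separator and its fully anchored regex matching; the function name and the sample addresses are hypothetical.

```python
import re

# Same pattern as the rule on [__address__, ..._annotation_prometheus_io_port].
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Approximate the `replace` action that rewrites __address__."""
    # Source label values are joined with ';' before matching.
    joined = f"{address};{port_annotation}"
    # The regex must match the whole joined string, hence fullmatch.
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # No match: the replace action leaves the target label unchanged.
        return address
    # replacement: $1:$2 -> host from the address, port from the annotation.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.1.23:8080", "80"))
    print(rewrite_address("10.0.1.23", "9100"))
    print(rewrite_address("10.0.1.23:8080", ""))
```

When the port annotation is missing, the joined string ends in `;` with nothing after it, so the pattern does not match and the discovered address is kept as is.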

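The last `metric_relabel_configs` in this section first keeps only `flask_` metrics and then strips the prefix from `__name__`. A minimal Python sketch of that two-rule chain follows; the metric names in the sample list are illustrative and are not taken from the demo application's actual output.

```python
import re

FLASK_RULE = re.compile(r"flask_(.+)")


def apply_metric_relabel(metric_names):
    """Approximate `keep` on flask_(.+) followed by `replace` into __name__."""
    result = []
    for name in metric_names:
        match = FLASK_RULE.fullmatch(name)
        if match is None:
            # action: keep discards every sample whose name does not match.
            continue
        # action: replace writes $1 (the part after the prefix) to __name__.
        result.append(match.group(1))
    return result


if __name__ == "__main__":
    scraped = [
        "flask_http_request_total",     # illustrative name
        "process_cpu_seconds_total",    # illustrative name
        "python_gc_collections_total",  # illustrative name
        "flask_exporter_info",          # illustrative name
    ]
    print(apply_metric_relabel(scraped))
```

Rule order matters here: if the rename ran first, the later `keep` on `flask_(.+)` would no longer match anything and every sample would be dropped.
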
### Node Exporter

1. Node Exporter installation guide - [https://prometheus.io/docs/guides/node-exporter](https://prometheus.io/docs/guides/node-exporter/)
2. Node Exporter GitHub - <https://github.com/prometheus/node_exporter>
3. Install Node Exporter

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: apps/v1
   kind: DaemonSet
   metadata:
     name: node-exporter
     labels:     
       app: node-exporter
     namespace: monitoring
   spec:
     selector:
       matchLabels:
         app: node-exporter
     template:
       metadata:
         labels:         
           app: node-exporter
         annotations:
           prometheus.io/scrape: "true"
           prometheus.io/path: "/metrics"
           prometheus.io/port: "9100"
       spec:
         hostNetwork: true
         hostPID: true
         containers:
         - name: node-exporter
           image: quay.io/prometheus/node-exporter
           args:
           - --path.procfs=/host/proc
           - --path.sysfs=/host/sys
           - --path.rootfs=/host/root
           - --web.listen-address=0.0.0.0:9100
           ports:
           - name: metrics
             containerPort: 9100
             protocol: TCP
           volumeMounts:
           - name: proc
             mountPath: /host/proc
             readOnly: true
           - name: sys
             mountPath: /host/sys
             readOnly: true
           - name: root
             mountPath: /host/root
             mountPropagation: HostToContainer
             readOnly: true
         volumes:
         - name: proc
           hostPath:
             path: /proc
         - name: sys
           hostPath:
             path: /sys
         - name: root
           hostPath:
             path: /
   EOF
   ```
4. Confirm that Node Exporter is running

   ```
   kubectl -n monitoring get pod -l app=node-exporter
   ```
5. Check the metrics exposed by Node Exporter

   ```
   kubectl run nginx --image=nginx -it --rm --restart=Never \
   -- curl -s $(kubectl -n monitoring get pod -l app=node-exporter -o=jsonpath="{.items[0].status.podIP}"):9100/metrics
   ```
6. Check the Node Exporter command-line options

   ```
   kubectl -n monitoring exec ds/node-exporter -- node_exporter -h
   ```
7. Change the Prometheus configuration

   ```
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s
       scrape_configs:
       - job_name: 'node-exporter'
         kubernetes_sd_configs:
         - role: pod
           namespaces:
             names:
             - monitoring
         relabel_configs:
         - source_labels: [__meta_kubernetes_pod_label_app]
           regex: node-exporter
           action: keep
         - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
           regex: true
           action: keep
         - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
           regex: (.+)
           action: replace
           target_label: __metrics_path__
         - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
           regex: ([^:]+)(?::\d+)?;(\d+)
           action: replace
           replacement: $1:$2
           target_label: __address__
         - source_labels: [__meta_kubernetes_pod_node_name]
           action: replace
           target_label: instance
   EOF
   ```
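8. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that the Node Exporter targets were added
9. Run the following query in the Expression browser to check the size of the filesystems mounted on the nodes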

   ```
   node_filesystem_size_bytes
   ```
10. Check the root volume size per node

    ```
    sum by (instance) (node_filesystem_size_bytes{mountpoint="/"})
    ```
11. Check the root volume usage ratio per node

    ```
    1 - node_filesystem_avail_bytes{job="node-exporter",mountpoint="/"} / node_filesystem_size_bytes{job="node-exporter",mountpoint="/"}
    ```
12. Install the Session Manager plugin

    ```
    {
        curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm" -o "session-manager-plugin.rpm"
        sudo yum install -y session-manager-plugin.rpm
    }
    ```
13. Connect to one of the nodes with Session Manager

    ```
    aws ssm start-session --target \
    $(kubectl get node -o jsonpath='{.items[0].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
    ```
14. Check disk usage

    ```
    df -h
    ```
15. Exit the Session Manager session

    ```
    exit
    ```
16. Change the Prometheus configuration so that only filesystem-related metrics are collected

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s
        scrape_configs:
        - job_name: 'node-exporter'
          kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
              - monitoring
          relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: node-exporter
            action: keep
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            regex: true
            action: keep
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            regex: (.+)
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            regex: ([^:]+)(?::\d+)?;(\d+)
            action: replace
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: instance
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: node_filesystem_(.+)
            action: keep
    EOF
    ```
17. List all metric names stored in Prometheus

    ```
    curl -s http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/api/v1/label/__name__/values | jq
    ```
18. Run the following query in the Expression browser to list the metrics that were collected recently

    ```
    group by(__name__) ({__name__!=""})
    ```
19. Delete Node Exporter

    ```
    kubectl -n monitoring delete ds node-exporter
    ```
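
The usage expression in step 11 is plain arithmetic on two gauges that share the same labels: one minus available bytes divided by total bytes. The sketch below reproduces it in Python; the node names and byte counts are made-up sample values, not output from a real cluster.

```python
GIB = 2**30

# Hypothetical samples: instance -> (node_filesystem_avail_bytes, node_filesystem_size_bytes)
root_volumes = {
    "ip-10-0-1-10.example.internal": (15 * GIB, 20 * GIB),
    "ip-10-0-2-20.example.internal": (5 * GIB, 20 * GIB),
}


def usage_ratio(avail_bytes: int, size_bytes: int) -> float:
    """Mirror `1 - avail / size` from the PromQL expression."""
    return 1 - avail_bytes / size_bytes


def usage_by_instance(volumes):
    """Compute the ratio per instance, like the one-to-one vector match does."""
    return {
        instance: usage_ratio(avail, size)
        for instance, (avail, size) in volumes.items()
    }


if __name__ == "__main__":
    for instance, ratio in usage_by_instance(root_volumes).items():
        print(f"{instance}: {ratio:.0%} used")
```

The result is a ratio between 0 and 1, which can be compared with the `Use%` column that `df -h` prints on the node in step 14.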

### Kubernetes system component metrics

1. Review the Kubernetes components that expose metrics - <https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/>
2. Create a Pod for sending HTTP requests

   ```
   cat <<EOF | kubectl apply -f -
   kind: ClusterRole
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: metrics-access
   rules:
     - nonResourceURLs:
       - "/metrics"
       verbs:
       - get
     - apiGroups: [""]
       resources: ["nodes/metrics"]
       verbs: ["get"]
   ---
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: metrics-access
   ---
   kind: ClusterRoleBinding
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: metrics-access
   subjects:
   - kind: ServiceAccount
     name: metrics-access
     namespace: default
   roleRef:
     kind: ClusterRole
     name: metrics-access
     apiGroup: rbac.authorization.k8s.io
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: curl
   spec:
     serviceAccountName: metrics-access
     containers:
     - image: curlimages/curl
       name: curl
       command: ["sleep", "3600"]
       env:
       - name: HOST_IP
         valueFrom:
           fieldRef:
             fieldPath: status.hostIP
   EOF
   ```
3. Check the metrics exposed by the API server

   ```
   kubectl exec -it curl -- \
   sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
   --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
   https://kubernetes/metrics'
   ```
4. Check the metrics exposed by the kubelet - */metrics*

   ```
   kubectl exec -it curl -- \
   sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
   --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
   https://$HOST_IP:10250/metrics'
   ```
5. Check the metrics exposed by the kubelet - */metrics/cadvisor*

   ```
   kubectl exec -it curl -- \
   sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
   --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
   https://$HOST_IP:10250/metrics/cadvisor'
   ```
6. Check the metrics exposed by the kubelet - */metrics/resource*

   ```
   kubectl exec -it curl -- \
   sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
   --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
   https://$HOST_IP:10250/metrics/resource'
   ```
7. Check the metrics exposed by the kubelet - */metrics/probes*

   ```
   kubectl exec -it curl -- \
   sh -c 'curl -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
   --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
   https://$HOST_IP:10250/metrics/probes'
   ```
8. Check the metrics exposed by CoreDNS

   ```
   kubectl exec -it curl -- \
   curl $(kubectl get pod -l k8s-app=kube-dns -A -o=jsonpath='{.items[0].status.podIP}'):9153/metrics
   ```
9. Grant the Prometheus server permission to access the metrics

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus
   rules:
   - apiGroups: [""]
     resources:
     - nodes
     - services
     - endpoints
     - pods
     verbs: ["get", "list", "watch"]
   - apiGroups:
     - extensions
     - networking.k8s.io
     resources:
     - ingresses
     verbs: ["get", "list", "watch"]
   - apiGroups:
     - discovery.k8s.io
     resources:
     - endpointslices
     verbs: ["get", "list", "watch"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
   - apiGroups: [""]
     resources: ["nodes/metrics"]
     verbs: ["get"]
   EOF
   ```
10. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s
        scrape_configs:
        - job_name: kube-apiserver
          scheme: https
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - default
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_component]
            regex: apiserver
            action: keep
          - source_labels: [__meta_kubernetes_service_label_provider]
            regex: kubernetes
            action: keep
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: https
            action: keep
          - source_labels: [__meta_kubernetes_service_name]
            regex: (.*)
            action: replace
            target_label: service
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
            action: drop
          - source_labels: [__name__]
            regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
            action: drop
          - source_labels: [__name__]
            regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
            action: drop
          - source_labels: [__name__]
            regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
            action: drop
          - source_labels: [__name__]
            regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
            action: drop
          - source_labels: [__name__]
            regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
            action: drop
          - source_labels: [__name__]
            regex: transformation_(transformation_latencies_microseconds|failures_total)
            action: drop
          - source_labels: [__name__]
            regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
            action: drop
          - source_labels: [__name__]
            regex: etcd_(debugging|disk|server).*
            action: drop
          - source_labels: [__name__]
            regex: apiserver_admission_controller_admission_latencies_seconds_.*
            action: drop
          - source_labels: [__name__]
            regex: apiserver_admission_step_admission_latencies_seconds_.*
            action: drop
          - source_labels: [__name__, le]
            regex: apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)
            action: drop
    EOF
    ```
11. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that kube-apiserver was added
12. List the metrics that were collected recently

    ```
    group by(__name__) ({__name__!=""})
    ```
13. Check the number of requests per Kubernetes resource

    ```
    sum by(resource) (apiserver_request_total)
    ```
14. Change the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      prometheus.yaml: |
        global:
          scrape_interval: 5s
        scrape_configs:
        - job_name: kubelet-cadvisor
          metrics_path: /metrics/cadvisor
          scheme: https
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            insecure_skip_verify: true
          kubernetes_sd_configs:
          - role: node
          relabel_configs:
          - source_labels: [__metrics_path__]
            regex: (.*)
            action: replace
            target_label: metrics_path
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
            action: drop
          - source_labels: [__name__, pod, namespace]
            regex: (container_fs_.*|container_spec_.*|container_blkio_device_usage_total|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
            action: drop
    EOF
    ```
15. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that kubelet-cadvisor was added
16. List the metrics that were collected recently

    ```
    group by(__name__) ({__name__!=""})
    ```
17. Check CPU usage per Pod

    ```
    sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod)
    ```
18. Delete the resources

    ```
    {
        kubectl delete clusterrole metrics-access
        kubectl delete clusterrolebinding metrics-access
        kubectl delete sa metrics-access
        kubectl delete pod curl
    }
    ```
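
The kubelet-cadvisor job above contains a drop rule whose `source_labels` are `[__name__, pod, namespace]` and whose regex ends in `;;`. Because the label values are joined with `;` before matching, that trailing `;;` only matches when both `pod` and `namespace` are empty. The sketch below illustrates this in Python with a shortened alternation; the sample series are hypothetical.

```python
import re

# Shortened version of the rule's regex, for illustration only.
DROP_RULE = re.compile(r"(container_fs_.*|container_spec_.*|container_last_seen);;")

SOURCE_LABELS = ("__name__", "pod", "namespace")


def is_dropped(series: dict) -> bool:
    """Return True when the `drop` action would discard this series."""
    # A missing label is treated as an empty string.
    joined = ";".join(series.get(label, "") for label in SOURCE_LABELS)
    # The regex must match the entire joined value.
    return DROP_RULE.fullmatch(joined) is not None


if __name__ == "__main__":
    node_level = {"__name__": "container_fs_reads_total"}
    pod_level = {
        "__name__": "container_fs_reads_total",
        "pod": "flask-abc",       # hypothetical Pod name
        "namespace": "default",
    }
    cpu_node_level = {"__name__": "container_cpu_usage_seconds_total"}
    for series in (node_level, pod_level, cpu_node_level):
        print(series["__name__"], series.get("pod", "-"), is_dropped(series))
```

The effect is that the listed metrics are discarded only for series that are not attached to a Pod, while the same metrics for Pod containers are kept.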

### kube-state-metrics

1. Review the official documentation - <https://github.com/kubernetes/kube-state-metrics>
2. Review the manifests - <https://github.com/kubernetes/kube-state-metrics/tree/master/examples/standard>
3. Install kube-state-metrics

   ```
   {
       git clone https://github.com/kubernetes/kube-state-metrics.git
       kubectl apply -f kube-state-metrics/examples/standard
   }
   ```
4. Check the metrics exposed by kube-state-metrics

   ```
   kubectl run nginx --image=nginx -it --rm --restart=Never \
   -- curl $(kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics -o=jsonpath="{.items[0].status.podIP}"):8080/metrics
   ```
5. Change the Prometheus configuration

   ```
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 5s
       scrape_configs:
       - job_name: kube-state-metrics
         scrape_interval: 30s
         metrics_path: /metrics
         kubernetes_sd_configs:
         - role: endpoints
           namespaces:
             names:
             - kube-system
         relabel_configs:
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
           regex: exporter
           action: keep
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
           regex: kube-state-metrics
           action: keep
         - source_labels: [__meta_kubernetes_endpoint_port_name]
           regex: http-metrics
           action: keep
   EOF
   ```
6. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that kube-state-metrics was added
7. List the metrics that were collected recently

   ```
   group by(__name__) ({__name__!=""})
   ```
8. Check the Node status

   ```
   kube_node_status_condition
   ```
9. Check the Pod status

   ```
   kube_pod_status_phase
   ```
10. Create a Pod

    ```
    kubectl run nginx --image=nginx:notexist
    ```
11. Check the Pod status

    ```
    kubectl get pod -l run=nginx
    ```
12. List the Pods that are not running

    ```
    kube_pod_status_phase{phase !="Running"} == 1
    ```
13. Delete the resources

    ```
    {
        kubectl delete pod nginx
        kubectl delete -f kube-state-metrics/examples/standard
    }
    ```
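
The query in step 12 relies on how `kube_pod_status_phase` is shaped: kube-state-metrics exposes one series per Pod and phase, with value 1 for the current phase and 0 for the others. The sketch below applies the same filter (`phase != "Running"` and value `== 1`) to a hand-written sample; the Pod names and values are assumptions for illustration, not real scrape output.

```python
# Hypothetical samples of kube_pod_status_phase: (labels, value)
samples = [
    ({"pod": "nginx", "phase": "Pending"}, 1),
    ({"pod": "nginx", "phase": "Running"}, 0),
    ({"pod": "prometheus-0", "phase": "Pending"}, 0),
    ({"pod": "prometheus-0", "phase": "Running"}, 1),
]


def pods_not_running(series):
    """Mirror `kube_pod_status_phase{phase!="Running"} == 1`."""
    return [
        (labels["pod"], labels["phase"])
        for labels, value in series
        # The label matcher removes the Running series; `== 1` keeps only
        # the phase the Pod is currently in.
        if labels["phase"] != "Running" and value == 1
    ]


if __name__ == "__main__":
    print(pods_not_running(samples))
```

Without the `== 1` comparison, the label matcher alone would also return the zero-valued `Pending` series of healthy Pods.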

### Alerting

1. Install kube-state-metrics

   ```
   {
       git clone https://github.com/kubernetes/kube-state-metrics.git
       kubectl apply -f kube-state-metrics/examples/standard
   }
   ```
2. Change the Prometheus configuration

   ```
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     prometheus.yaml: |
       global:
         scrape_interval: 10s
         evaluation_interval: 10s 
       scrape_configs:
       - job_name: kubelet
         scheme: https
         authorization:
           type: Bearer
           credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
         tls_config:
           insecure_skip_verify: true
         kubernetes_sd_configs:
         - role: node
         metric_relabel_configs:
         - source_labels: [__name__]
           regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
           action: drop
         - source_labels: [__name__]
           regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
           action: drop
         - source_labels: [__name__]
           regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
           action: drop
         - source_labels: [__name__]
           regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
           action: drop
         - source_labels: [__name__]
           regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
           action: drop
         - source_labels: [__name__]
           regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
           action: drop
         - source_labels: [__name__]
           regex: transformation_(transformation_latencies_microseconds|failures_total)
           action: drop
         - source_labels: [__name__]
           regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
           action: drop
       - job_name: kube-state-metrics
         kubernetes_sd_configs:
         - role: endpoints
           namespaces:
             names:
             - kube-system
         relabel_configs:
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
           regex: exporter
           action: keep
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
           regex: kube-state-metrics
           action: keep
         - source_labels: [__meta_kubernetes_endpoint_port_name]
           regex: http-metrics
           action: keep
   EOF
   ```
3. In the menu at the top of the Prometheus dashboard, go to Status -> Targets and confirm that the jobs defined above have been added
4. Add alerting rules

   ```
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: prometheus-config
     labels:
       app: prometheus
     namespace: monitoring
   data:
     alerts.yaml: |
       groups:
       - name: kubernetes-apps
         rules:
         - alert: KubePodCrashLooping
           annotations:
             description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
               }}) is in waiting state (reason: "CrashLoopBackOff").'
             runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping
             summary: Pod is crash looping.
           expr: |
             max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
           for: 15m
           labels:
             severity: warning
             team: dev
       - name: kubernetes-system-kubelet
         rules:
         - alert: KubeNodeNotReady
           annotations:
             description: '{{ $labels.node }} has been unready for more than 1 minute.'
             runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready
             summary: Node is not ready.
           expr: |
             kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
           for: 1m
           labels:
             severity: warning
       - name: kubernetes-storage
         rules:
         - alert: KubePersistentVolumeFillingUp
           annotations:
             description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
               }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
               }} free.
             runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
             summary: PersistentVolume is filling up.
           expr: |
             (
               kubelet_volume_stats_available_bytes{job="kubelet"}
                 /
               kubelet_volume_stats_capacity_bytes{job="kubelet"}
             ) < 0.03
             and
             kubelet_volume_stats_used_bytes{job="kubelet"} > 0
           for: 1m
           labels:
             severity: critical
         - alert: KubePersistentVolumeAlmostFillingUp
           annotations:
             description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
               }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
               }} free.
             runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
             summary: PersistentVolume is almost filling up.
           expr: |
             (
               kubelet_volume_stats_available_bytes{job="kubelet"}
                 /
               kubelet_volume_stats_capacity_bytes{job="kubelet"}
             ) < 0.20
             and
             kubelet_volume_stats_used_bytes{job="kubelet"} > 0
           for: 1m
           labels:
             severity: warning
     prometheus.yaml: |
       global:
         scrape_interval: 10s
         evaluation_interval: 10s
       rule_files:
       - alerts.yaml
       scrape_configs:
       - job_name: kubelet
         scheme: https
         authorization:
           type: Bearer
           credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
         tls_config:
           insecure_skip_verify: true
         kubernetes_sd_configs:
         - role: node
         metric_relabel_configs:
         - source_labels: [__name__]
           regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
           action: drop
         - source_labels: [__name__]
           regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
           action: drop
         - source_labels: [__name__]
           regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
           action: drop
         - source_labels: [__name__]
           regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
           action: drop
         - source_labels: [__name__]
           regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
           action: drop
         - source_labels: [__name__]
           regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
           action: drop
         - source_labels: [__name__]
           regex: transformation_(transformation_latencies_microseconds|failures_total)
           action: drop
         - source_labels: [__name__]
           regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
           action: drop
       - job_name: kube-state-metrics
         kubernetes_sd_configs:
         - role: endpoints
           namespaces:
             names:
             - kube-system
         relabel_configs:
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
           regex: exporter
           action: keep
         - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
           regex: kube-state-metrics
           action: keep
         - source_labels: [__meta_kubernetes_endpoint_port_name]
           regex: http-metrics
           action: keep
   EOF
   ```
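5. In the menu at the top of the Prometheus dashboard, go to Status -> Rules and confirm that the alert rules have been added
6. In the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
7. Create a Pod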

   ```
   kubectl run busybox --image=busybox
   ```
8. Check the Pod status

   ```
   kubectl get pod -l run=busybox
   ```
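
   The busybox image's default command is a shell, which exits right away when no terminal is attached, so this Pod is expected to restart repeatedly and end up in `CrashLoopBackOff`, the state the `KubePodCrashLooping` rule matches. Because that rule uses `for: 15m`, the alert should stay pending for about 15 minutes before it fires. A minimal sketch for reading the waiting reason directly, callable as `pod_waiting_reason busybox`; the helper name is hypothetical, and the output is empty whenever the container is not currently in a waiting state:

   ```shell
   # Hypothetical helper (not part of the lab): print why the first container
   # of the given Pod is waiting, for example CrashLoopBackOff.
   pod_waiting_reason() {
       kubectl get pod "$1" \
           -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}{"\n"}'
   }
   ```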
9. In the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
10. If Cluster Autoscaler is enabled, disable it

    ```
    kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0
    ```
11. Check the number of nodes configured for the node group

    ```
    {
        export DESIRED_SIZE=$(aws eks describe-nodegroup \
        --cluster-name mycluster \
        --nodegroup-name nodegroup \
        --query nodegroup.scalingConfig.desiredSize)
        echo $DESIRED_SIZE
    }
    ```
12. Scale the node group to add one node

    ```
    aws eks update-nodegroup-config \
    --cluster-name mycluster \
    --nodegroup-name nodegroup \
    --scaling-config desiredSize=$(($DESIRED_SIZE+1))
    ```
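
    The command above assumes the node group's `maxSize` leaves room for one more node; EKS is expected to reject a `desiredSize` larger than `maxSize`. A minimal sketch of that check, assuming the current and maximum sizes have already been read from `nodegroup.scalingConfig`; `next_desired_size` is a hypothetical helper name, not part of the AWS CLI:

    ```shell
    # Hypothetical helper: print current+1, or fail when that would exceed maxSize.
    next_desired_size() {
        current=$1
        max=$2
        next=$((current + 1))
        if [ "$next" -gt "$max" ]; then
            echo "desiredSize $next would exceed maxSize $max" >&2
            return 1
        fi
        echo "$next"
    }

    next_desired_size 2 3   # prints 3
    ```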
13. Confirm that the node has been added

    ```
    kubectl get node
    ```
14. Wait until the newly added node becomes Ready
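
    The wait can also be scripted instead of polling `kubectl get node` by hand. A minimal sketch, assuming the newest node by creation timestamp is the one just added; `wait_for_newest_node` is a hypothetical helper name, and the default timeout of 300 seconds is an arbitrary choice. It can be called as `wait_for_newest_node` or `wait_for_newest_node 600s`:

    ```shell
    # Hypothetical helper: block until the most recently created node reports Ready.
    # The optional first argument is the timeout passed to kubectl wait.
    wait_for_newest_node() {
        newest=$(kubectl get node --sort-by=.metadata.creationTimestamp \
            -o jsonpath='{.items[-1].metadata.name}')
        kubectl wait --for=condition=Ready "node/$newest" --timeout="${1:-300s}"
    }
    ```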
15. Connect to the newly created node with Session Manager

    ```
    aws ssm start-session --target \
    $(kubectl get node --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
    ```
16. Stop the container runtime

    ```
    {
        sudo systemctl stop containerd
    }
    ```
17. End the Session Manager session

    ```
    exit
    ```
18. Check the node status

    ```
    kubectl get node
    ```
19. Check the detailed status of the node whose container runtime was stopped above

    ```
    kubectl describe node $(kubectl get node --sort-by='.metadata.creationTimestamp' -o=jsonpath='{.items[-1].metadata.name}')
    ```
20. In the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
21. Deploy the demo application

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: nginx
    spec:
      serviceName: nginx
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
            volumeMounts:
            - mountPath: /data
              name: data
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
    EOF
    ```
22. Confirm that the Pod has been created

    ```
    kubectl get pod -l app=nginx
    ```
23. Enter the following query in the expression browser to check the available disk space per PV

    ```
    sum (kubelet_volume_stats_available_bytes) by (persistentvolumeclaim)
    ```
24. Create a 999 MiB file on the PV

    ```
    kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
    ```
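
    The file size is chosen so that the free-space ratio drops below both thresholds in the `kubernetes-storage` rules (0.03 for critical, 0.20 for warning). A rough worked calculation, assuming the volume offers its nominal 1 GiB (1024 MiB); the real filesystem capacity is usually slightly smaller, so the actual ratio may differ and the write may stop early:

    ```shell
    capacity_mib=1024   # assumption: nominal size of the 1Gi claim
    written_mib=999     # size of the file written by dd above

    # Free ratio corresponding to available_bytes / capacity_bytes in the alert expressions.
    free_ratio=$(awk -v c="$capacity_mib" -v w="$written_mib" \
        'BEGIN { printf "%.4f", (c - w) / c }')
    echo "free ratio: $free_ratio"

    # Compare against the thresholds used by the two alert rules.
    awk -v r="$free_ratio" 'BEGIN {
        if (r < 0.03) print "below 0.03: KubePersistentVolumeFillingUp condition met"
        if (r < 0.20) print "below 0.20: KubePersistentVolumeAlmostFillingUp condition met"
    }'
    ```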
25. In the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
26. Open [https://webhook.site](https://webhook.site/) and note the generated Webhook URL. Do not close the web page
27. Set the URL generated above as an environment variable

    ```
    export WEBHOOK_URL=<generated Webhook URL>
    ```
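
    The Alertmanager manifest is applied through an unquoted heredoc (`<<EOF`), so the shell substitutes `$WEBHOOK_URL` before kubectl receives it; an unset variable would silently yield an empty `api_url`. A minimal guard, assuming the URL comes from webhook.site as in this lab; `check_webhook_url` is a hypothetical helper name:

    ```shell
    # Hypothetical helper: succeed only when the argument looks like a webhook.site URL.
    check_webhook_url() {
        case "$1" in
            https://webhook.site/?*) return 0 ;;
            *) echo "WEBHOOK_URL is empty or not a webhook.site URL" >&2; return 1 ;;
        esac
    }

    if check_webhook_url "${WEBHOOK_URL:-}"; then
        echo "WEBHOOK_URL looks usable"
    fi
    ```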
28. Create Alertmanager

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alertmanager-config
      labels:
        app: alertmanager
      namespace: monitoring
    data:
      alertmanager.yaml: |
        route:
          group_by: ['alertname']
        
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 3h
        
          receiver: infra
          routes:
          - matchers:
            - team=dev
            routes:
            - matchers:
              - severity=warning
              receiver: dev
              active_time_intervals:
              - daytime
              mute_time_intervals:
              - weekends
          - matchers:
            - severity=critical
            receiver: urgent
        receivers:
        - name: infra
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#infra'
            send_resolved: true
        - name: dev
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#dev'
            send_resolved: true
        - name: urgent
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#urgent'
            send_resolved: true
        time_intervals:
        - name: daytime
          time_intervals:
          - times:
            - start_time: '07:00'
              end_time: '23:00'
        - name: weekends
          time_intervals:
          - weekdays: ['saturday', 'sunday']
        inhibit_rules:
        - source_matchers:
          - alertname=KubePersistentVolumeFillingUp
          target_matchers:
          - alertname=KubePersistentVolumeAlmostFillingUp
          equal:
          - persistentvolumeclaim
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
      namespace: monitoring
    spec:
      type: LoadBalancer
      ports:
      - port: 80
        targetPort: 9093
      selector:
        app: alertmanager
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: alertmanager
      template:
        metadata:
          labels:
            app: alertmanager
        spec:
          securityContext:
            fsGroup: 2000
          containers:
          - name: alertmanager
            image: prom/alertmanager
            args:
            - --config.file=/etc/alertmanager/alertmanager.yaml
            ports:
            - containerPort: 9093
            volumeMounts:
            - name: alertmanager-config
              mountPath: /etc/alertmanager
          volumes:
          - name: alertmanager-config
            configMap:
              name: alertmanager-config
    EOF
    ```
29. Confirm that the Pod has been created

    ```
    kubectl -n monitoring get pod -l app=alertmanager
    ```
30. Check the Alertmanager server endpoint

    ```
    kubectl -n monitoring get svc alertmanager \
    -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
    ```
31. Open the URL found above in a web browser
32. Add Alertmanager to the Prometheus configuration

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      labels:
        app: prometheus
      namespace: monitoring
    data:
      alerts.yaml: |
        groups:
        - name: kubernetes-apps
          rules:
          - alert: KubePodCrashLooping
            annotations:
              description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
                }}) is in waiting state (reason: "CrashLoopBackOff").'
              runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping
              summary: Pod is crash looping.
            expr: |
              max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
            for: 15m
            labels:
              severity: warning
              team: dev
        - name: kubernetes-system-kubelet
          rules:
          - alert: KubeNodeNotReady
            annotations:
              description: '{{ $labels.node }} has been unready for more than 1 minute.'
              runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready
              summary: Node is not ready.
            expr: |
              kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
            for: 1m
            labels:
              severity: warning
        - name: kubernetes-storage
          rules:
          - alert: KubePersistentVolumeFillingUp
            annotations:
              description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
                }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
                }} free.
              runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
              summary: PersistentVolume is filling up.
            expr: |
              (
                kubelet_volume_stats_available_bytes{job="kubelet"}
                  /
                kubelet_volume_stats_capacity_bytes{job="kubelet"}
              ) < 0.03
              and
              kubelet_volume_stats_used_bytes{job="kubelet"} > 0
            for: 1m
            labels:
              severity: critical
          - alert: KubePersistentVolumeAlmostFillingUp
            annotations:
              description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
                }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
                }} free.
              runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
              summary: PersistentVolume is almost filling up.
            expr: |
              (
                kubelet_volume_stats_available_bytes{job="kubelet"}
                  /
                kubelet_volume_stats_capacity_bytes{job="kubelet"}
              ) < 0.20
              and
              kubelet_volume_stats_used_bytes{job="kubelet"} > 0
            for: 1m
            labels:
              severity: warning
      prometheus.yaml: |
        global:
          scrape_interval: 10s
          evaluation_interval: 10s
        rule_files:
        - alerts.yaml
        alerting:
          alertmanagers:
          - static_configs:
            - targets: ['alertmanager.monitoring.svc']
        scrape_configs:
        - job_name: kubelet
          scheme: https
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            insecure_skip_verify: true
          kubernetes_sd_configs:
          - role: node
          metric_relabel_configs:
          - source_labels: [__name__]
            regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
            action: drop
          - source_labels: [__name__]
            regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
            action: drop
          - source_labels: [__name__]
            regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)
            action: drop
          - source_labels: [__name__]
            regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
            action: drop
          - source_labels: [__name__]
            regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
            action: drop
          - source_labels: [__name__]
            regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
            action: drop
          - source_labels: [__name__]
            regex: transformation_(transformation_latencies_microseconds|failures_total)
            action: drop
          - source_labels: [__name__]
            regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
            action: drop
        - job_name: kube-state-metrics
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - kube-system
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
            regex: exporter
            action: keep
          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
            regex: kube-state-metrics
            action: keep
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: http-metrics
            action: keep
    EOF
    ```
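33. Go to the Alertmanager dashboard and confirm that alerts have fired
34. Switch to the browser tab with <https://webhook.site> open and confirm that a new message has been received
35. Check the URL given in title\_link in the message body
36. Delete the file created on the PV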

    ```
    kubectl exec -it nginx-0 -- rm /data/bigfile
    ```
37. Prometheus 대시보드 상단에 있는 메뉴에서 Alerts로 이동 Alert 상태 확인
38. Alertmanager 대시보드로 이동해서 Alert이 없어졌는지 확인
39. <https://webhook.site> 웹페이지가 열린 브라우저로 이동해서 새로운 메세지가 수신되었는지 확인 - *이전 Alert 발생한 시점에서 group\_interval에 명시한 값만큼 지난 이후에 발송*
40. If no new message arrives, check the Alertmanager logs

    ```
    kubectl -n monitoring logs deploy/alertmanager
    ```
41. Update the Alertmanager configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alertmanager-config
      labels:
        app: alertmanager
      namespace: monitoring
    data:
      alertmanager.yaml: |
        route:
          group_by: ['alertname']
        
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 3h
        
          receiver: infra
          routes:
          - matchers:
            - team=dev
            routes:
            - matchers:
              - severity=warning
              receiver: dev
              active_time_intervals:
              - daytime
              mute_time_intervals:
              - weekends
          - matchers:
            - severity=critical
            receiver: urgent
        receivers:
        - name: infra
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#infra'
            send_resolved: true
        - name: dev
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#dev'
            send_resolved: true
        - name: urgent
          slack_configs:
          - api_url: $WEBHOOK_URL
            channel: '#urgent'
            send_resolved: true
            actions:
            - type: button
              text: 'Query :mag:'
              url: '{{ (index .Alerts 0).GeneratorURL }}'
        time_intervals:
        - name: daytime
          time_intervals:
          - times:
            - start_time: '07:00'
              end_time: '23:00'
        - name: weekends
          time_intervals:
          - weekdays: ['saturday', 'sunday']
        inhibit_rules:
        - source_matchers:
          - alertname=KubePersistentVolumeFillingUp
          target_matchers:
          - alertname=KubePersistentVolumeAlmostFillingUp
          equal:
          - persistentvolumeclaim
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: alertmanager
      template:
        metadata:
          labels:
            app: alertmanager
        spec:
          securityContext:
            fsGroup: 2000
          containers:
          - name: alertmanager
            image: prom/alertmanager
            args:
            - --config.file=/etc/alertmanager/alertmanager.yaml
            - --web.external-url=http://$(kubectl -n monitoring get svc alertmanager -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')
            ports:
            - containerPort: 9093
            volumeMounts:
            - name: alertmanager-config
              mountPath: /etc/alertmanager
          volumes:
          - name: alertmanager-config
            configMap:
              name: alertmanager-config
    EOF
    ```
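42. From the menu at the top of the Alertmanager dashboard, go to Status and check whether the configuration file has been updated
43. Reload the Alertmanager configuration file - only needed if the configuration file was not updated in the previous step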

    ```
    curl -X POST http://$(kubectl -n monitoring get svc alertmanager -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')/-/reload
    ```
44. Update the Prometheus configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus
      labels:
        app: prometheus
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: prometheus
      serviceName: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          serviceAccountName: prometheus
          securityContext:
            fsGroup: 2000
          containers:
          - name: prometheus
            image: quay.io/prometheus/prometheus
            args:
            - --config.file=/etc/prometheus/prometheus.yaml
            - --storage.tsdb.path=/data
            - --web.enable-lifecycle
            - --web.external-url=http://$(kubectl -n monitoring get svc prometheus-external -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')
            ports:
            - containerPort: 9090
            volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-data
              mountPath: /data
          - name: config-reloader
            image: quay.io/prometheus-operator/prometheus-config-reloader:v0.61.1
            args:
            - --reload-url=http://127.0.0.1:9090/-/reload
            - --config-file=/etc/prometheus/prometheus.yaml
            volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
          volumes:
          - name: prometheus-config
            configMap:
              name: prometheus-config
      volumeClaimTemplates:
      - metadata:
          name: prometheus-data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
    EOF
    ```
45. From the menu at the top of the Prometheus dashboard, click Status -> Command-Line Flags and check whether the settings have been updated
46. Create a 999 MB file on the PV

    ```
    kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
    ```
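47. From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert state
48. Go to the Alertmanager dashboard and check whether the alert has fired
49. Switch to the browser tab where <https://webhook.site> is open and check whether a new message has arrived - open the URLs given in title\_link and actions and see what they display
50. Delete the resources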

    ```
    {
        kubectl delete ns monitoring 
        kubectl delete sts nginx
        kubectl delete svc nginx
        kubectl delete pvc -l app=nginx
        kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1
        kubectl delete pod busybox
        kubectl delete -f kube-state-metrics/examples/standard
        rm -rf kube-state-metrics
    }
    ```
51. Delete the node in the NotReady state

    ```
    aws ec2 terminate-instances --instance-ids \
    $(kubectl get node --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].spec.providerID}{"\n"}' | grep -oE "i-[a-z0-9]+")
    ```
52. Restore the original node count

    ```
    aws eks update-nodegroup-config \
    --cluster-name mycluster \
    --nodegroup-name nodegroup \
    --scaling-config desiredSize=$DESIRED_SIZE
    ```
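
The node-deletion step above relies on `grep -oE` to pull the EC2 instance ID out of the node's `providerID`. The following is a minimal sketch of that extraction that runs without a cluster. The `providerID` value is a made-up example in the `aws:///<zone>/<instance-id>` form that EKS nodes typically report, and the variable names are assumptions for illustration.

```shell
# Hypothetical providerID; a real one comes from
# `kubectl get node -o jsonpath='{.items[-1].spec.providerID}'`.
provider_id='aws:///ap-northeast-2a/i-0123456789abcdef0'

# Same pattern as the step above: keep only the "i-..." part of the string.
instance_id=$(printf '%s\n' "$provider_id" | grep -oE 'i-[a-z0-9]+')

echo "$instance_id"
```

Checking the extracted ID this way before passing it to `aws ec2 terminate-instances` helps avoid terminating the wrong instance.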

### Prometheus Operator

1. How it works - <https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md#prometheus-operator>
2. List of custom resources - <https://prometheus-operator.dev/docs/operator/design/>
3. Install the Prometheus Operator

   ```
   kubectl create -f \
   https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
   ```
4. Review the prometheuses.monitoring.coreos.com/v1 object - <https://prometheus-operator.dev/docs/operator/api/#prometheus>
5. Install Prometheus

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Namespace
   metadata:
     name: monitoring
   ---
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: prometheus
     namespace: monitoring
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus
   rules:
   - apiGroups: [""]
     resources:
     - nodes
     - nodes/metrics
     - services
     - endpoints
     - pods
     verbs: ["get", "list", "watch"]
   - apiGroups: [""]
     resources:
     - configmaps
     verbs: ["get"]
   - apiGroups:
     - networking.k8s.io
     resources:
     - ingresses
     verbs: ["get", "list", "watch"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: prometheus
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: ClusterRole
     name: prometheus
   subjects:
   - kind: ServiceAccount
     name: prometheus
     namespace: monitoring
   ---
   apiVersion: monitoring.coreos.com/v1
   kind: Prometheus
   metadata:
     name: k8s
     namespace: monitoring
   spec:
     serviceAccountName: prometheus
     serviceMonitorNamespaceSelector: {}
     serviceMonitorSelector: {}
     podMonitorSelector: {}
   EOF
   ```
6. Check the Prometheus Operator logs

   ```
   kubectl logs deploy/prometheus-operator
   ```
7. Check the StatefulSet that was created

   ```
   kubectl -n monitoring get sts
   ```
8. Inspect the detailed spec of the StatefulSet that was created

   ```
   kubectl -n monitoring get sts prometheus-k8s -o yaml
   ```
9. Create a Service

   ```
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Service
   metadata:
     labels:
       app.kubernetes.io/instance: k8s
       app.kubernetes.io/name: prometheus
     name: prometheus-k8s
     namespace: monitoring
   spec:
     ports:
     - name: web
       port: 80
       targetPort: web
     - name: reloader-web
       port: 8080
       targetPort: reloader-web
     selector:
       app.kubernetes.io/instance: k8s
       app.kubernetes.io/name: prometheus
     type: LoadBalancer
   EOF
   ```
10. Get the Prometheus server endpoint

    ```
    kubectl -n monitoring get svc prometheus-k8s \
    -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
    ```
11. Open the URL from the previous step in a web browser
12. From the menu at the top, click Status -> Configuration
13. Inspect the Prometheus object that was created

    ```
    kubectl -n monitoring get prom k8s -o yaml
    ```
14. Check the Secrets that were created

    ```
    kubectl -n monitoring get secret
    ```
15. Inspect the Secret that stores the Prometheus configuration file

    ```
    kubectl -n monitoring get secret prometheus-k8s -o yaml
    ```
16. Decode the Base64-encoded Prometheus configuration file

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
17. Set a new scrape\_interval value

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector: {}
      serviceMonitorSelector: {}
      podMonitorSelector: {}
      scrapeInterval: 10s
    EOF
    ```
18. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
19. Check the Prometheus logs

    ```
    kubectl -n monitoring logs prometheus-k8s-0
    ```
20. From the menu at the top of the Prometheus dashboard, go to Status -> Configuration and check whether the change has been applied
21. Create a ServiceMonitor - <https://prometheus-operator.dev/docs/operator/api/#servicemonitor>

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: kube-apiserver
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          component: apiserver
          provider: kubernetes
      namespaceSelector:
        matchNames:
        - default
      endpoints:
      - interval: 30s
        port: https
        scheme: https
        bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          serverName: kubernetes
        metricRelabelings:
        - sourceLabels:
          - __name__
          regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
          action: drop
        - sourceLabels:
          - __name__
          regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
          action: drop
        - sourceLabels:
          - __name__
          regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs|longrunning_gauge|registered_watchers)
          action: drop
        - sourceLabels:
          - __name__
          regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
          action: drop
        - sourceLabels:
          - __name__
          regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
          action: drop
        - sourceLabels:
          - __name__
          regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
          action: drop
        - sourceLabels:
          - __name__
          regex: transformation_(transformation_latencies_microseconds|failures_total)
          action: drop
        - sourceLabels:
          - __name__
          regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
          action: drop
        - sourceLabels:
          - __name__
          regex: etcd_(debugging|disk|server).*
          action: drop
        - sourceLabels:
          - __name__
          regex: apiserver_admission_controller_admission_latencies_seconds_.*
          action: drop
        - sourceLabels:
          - __name__
          regex: apiserver_admission_step_admission_latencies_seconds_.*
          action: drop
        - sourceLabels:
          - __name__
          - le
          regex: apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)
          action: drop
    EOF
    ```
22. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
23. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check whether kube-apiserver has been added
24. List the metrics collected within the last minute

    ```
    group by(__name__) ({__name__!=""})
    ```
25. Check the number of requests per Kubernetes resource

    ```
    sum by(resource) (apiserver_request_total)
    ```
26. Deploy the demo application

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx
    data:
      nginx.conf: |
        user nginx;
        worker_processes  1;
        events {
            worker_connections  1024;
        }
        http {
            server {
                listen       80;
                server_name  localhost;
                rewrite ^/(.*)/$ /$1 permanent;
                
                location / {
                    root   /usr/share/nginx/html;
                    index  index.html index.htm;
                }
                location /metrics {
                    stub_status on;
                    access_log off;
                    allow all;
                }
            }
        }
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - name: http
          containerPort: 80
        volumeMounts:
        - name: nginx-conf
          mountPath: /etc/nginx
      - name: nginx-exporter
        image: nginx/nginx-prometheus-exporter:0.10.0
        ports:
        - name: http-metric
          containerPort: 9113
        args:
        - "-nginx.scrape-uri=http://localhost/metrics"
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx
          items:
          - key: nginx.conf
            path: nginx.conf
    EOF
    ```
27. Check that the Pod has been created

    ```
    kubectl get pod -l app=nginx
    ```
28. Check the metrics exposed by the NGINX exporter

    ```
    kubectl exec -it nginx -c nginx -- curl localhost:9113/metrics
    ```
29. Create a PodMonitor - <https://prometheus-operator.dev/docs/operator/api/#podmonitor>

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: nginx
      namespace: monitoring
    spec:
      namespaceSelector:
        matchNames:
        - default
      selector:
        matchLabels:
          app: nginx
      podMetricsEndpoints:
      - port: http-metric
    EOF
    ```
30. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
31. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check whether nginx has been added
32. Modify the PodMonitor

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: nginx
      namespace: monitoring
    spec:
      namespaceSelector:
        matchNames:
        - default
      selector:
        matchLabels:
          app: nginx
      podMetricsEndpoints:
      - port: http-metric
        relabelings:
        - regex: container
          action: labeldrop
        - regex: endpoint
          action: labeldrop
      jobLabel: app
    EOF
    ```
33. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
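34. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check whether the labels on nginx have changed
35. Run the following query in the expression browser to check that the metrics exposed by the NGINX exporter are being collected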

    ```
    nginx_http_requests_total
    ```
36. Create a scrape configuration file

    ```
    cat > additional-scrape-job.yaml <<EOF
    - job_name: prometheus
      static_configs:
      - targets: [localhost:9090]
    EOF
    ```
37. Create a Secret

    ```
    kubectl -n monitoring create secret generic additional-scrape-configs \
    --from-file=additional-scrape-job.yaml
    ```
38. Apply the manually created scrape configuration file

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector: {}
      serviceMonitorSelector: {}
      podMonitorSelector: {}
      scrapeInterval: 10s
      additionalScrapeConfigs:
        name: additional-scrape-configs
        key: additional-scrape-job.yaml
    EOF
    ```
39. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
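40. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and check whether prometheus has been added
41. Run the following query in the expression browser to check that the metrics exposed by Prometheus are being collected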

    ```
    {job="prometheus"}
    ```
42. Install Alertmanager - <https://prometheus-operator.dev/docs/operator/api/#alertmanager>

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: k8s
      namespace: monitoring
    spec: {}
    EOF
    ```
43. Check the StatefulSet that was created

    ```
    kubectl -n monitoring get sts
    ```
44. Inspect the detailed spec of the StatefulSet that was created

    ```
    kubectl -n monitoring get sts alertmanager-k8s -o yaml
    ```
45. Create a Service

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/instance: k8s
        app.kubernetes.io/name: alertmanager
      name: alertmanager-k8s
      namespace: monitoring
    spec:
      ports:
      - name: web
        port: 80
        targetPort: web
      - name: reloader-web
        port: 8080
        targetPort: reloader-web
      selector:
        app.kubernetes.io/instance: k8s
        app.kubernetes.io/name: alertmanager
      type: LoadBalancer
    EOF
    ```
46. Get the Alertmanager server endpoint

    ```
    kubectl -n monitoring get svc alertmanager-k8s \
    -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
    ```
47. Open the URL from the previous step in a web browser
48. From the menu at the top, click Status
49. Open [https://webhook.site](https://webhook.site/) and note the generated webhook URL - do not close the page
50. Create a Secret from the webhook URL noted above

    ```
    kubectl -n monitoring create secret generic slack-config \
    --from-literal=api-url=<WEBHOOK_URL>
    ```
51. Create an AlertmanagerConfig

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: alertmanager-k8s
      labels:
        alertmanagerConfig: default
      namespace: monitoring
    spec:
      route:
        groupBy: ['alertname']
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 3h
        receiver: infra
      receivers:
      - name: infra
        slackConfigs:
        - apiURL:
            name: slack-config
            key: api-url
          channel: infra
          sendResolved: true
    EOF
    ```
52. Apply the AlertmanagerConfig to Alertmanager

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      alertmanagerConfigSelector:
        matchLabels:
          alertmanagerConfig: default
    EOF
    ```
53. Check whether the Alertmanager configuration file has been updated

    ```
    kubectl -n monitoring get secret alertmanager-k8s-generated \
    -o jsonpath="{.data['alertmanager\.yaml\.gz']}" | base64 -d | gunzip
    ```
54. Apply the AlertmanagerConfig created above as the global configuration - <https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md#specify-global-alertmanager-config>

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      alertmanagerConfiguration:
        name: alertmanager-k8s
    EOF
    ```
55. Check whether the Alertmanager configuration file has been updated

    ```
    kubectl -n monitoring get secret alertmanager-k8s-generated \
    -o jsonpath="{.data['alertmanager\.yaml\.gz']}" | base64 -d | gunzip
    ```
56. From the menu at the top of the Alertmanager dashboard, click Status and check whether the configuration file has been updated
57. Add Alertmanager to the Prometheus configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector: {}
      serviceMonitorSelector: {}
      podMonitorSelector: {}
      scrapeInterval: 10s
      additionalScrapeConfigs:
        name: additional-scrape-configs
        key: additional-scrape-job.yaml
      alerting:
        alertmanagers:
        - namespace: monitoring
          name: alertmanager-operated
          port: web
    EOF
    ```
58. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
59. From the menu at the top of the Prometheus dashboard, go to Status -> Configuration and check whether the change has been applied
60. Create an alert rule

    ```
    cat <<'EOF' | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: nginx-alert
      namespace: monitoring
      labels:
        app: nginx
    spec:
      groups:
      - name: nginx
        rules:
        - alert: TooManyRequest
          annotations:
            description: '{{ $labels.pod }} is demanding.'
            summary: RPS is higher than 10.
          expr: rate(nginx_http_requests_total[1m]) > 10
          for: 1m
          labels:
            team: frontend
    EOF
    ```
61. Add the rule created above to the Prometheus configuration

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector: {}
      serviceMonitorSelector: {}
      podMonitorSelector: {}
      scrapeInterval: 10s
      additionalScrapeConfigs:
        name: additional-scrape-configs
        key: additional-scrape-job.yaml
      alerting:
        alertmanagers:
        - namespace: monitoring
          name: alertmanager-operated
          port: web
      ruleSelector:
        matchLabels:
          app: nginx
    EOF
    ```
62. Check that the rule file has been created

    ```
    kubectl -n monitoring get cm prometheus-k8s-rulefiles-0 -o yaml
    ```
63. Check whether the Prometheus configuration file has been updated

    ```
    kubectl -n monitoring get secret prometheus-k8s \
    -o jsonpath="{.data['prometheus\.yaml\.gz']}" | base64 -d | gunzip
    ```
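64. From the menu at the top of the Prometheus dashboard, go to Status -> Rules and check whether the alert rules have been added
65. From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert state
66. Create a Pod that generates load against NGINX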

    ```
    kubectl run load-generator --image=busybox \
    -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://$(kubectl get pod nginx -o=jsonpath='{.status.podIP}'); done"
    ```
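67. From the menu at the top of the Prometheus dashboard, go to Alerts and check the alert state
68. Go to the Alertmanager dashboard and check whether the alert has fired
69. Switch to the browser tab with the <https://webhook.site> page opened in step 49 and check whether a new message has arrived
70. Delete the resources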

    ```
    {
        kubectl delete ns monitoring
        kubectl delete pod nginx load-generator
        kubectl delete cm nginx
        kubectl delete clusterrolebinding prometheus
        kubectl delete clusterrole prometheus
        kubectl delete -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
        rm additional-scrape-job.yaml
    }
    ```
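
Several steps in this section read the generated configuration with `base64 -d | gunzip`, because the Secret holds the file gzip-compressed under a `.gz` key and, like all Secret data, Base64-encoded. The round trip can be sketched locally without a cluster. The sample configuration text and the variable names are assumptions for illustration, not the operator's actual output.

```shell
# Hypothetical stand-in for the generated Prometheus configuration.
sample_config='global:
  scrape_interval: 10s'

# Encode it the way the Secret's data field holds it: gzip first, then Base64.
# tr removes the line wrapping that some base64 implementations add.
encoded=$(printf '%s\n' "$sample_config" | gzip -c | base64 | tr -d '\n')

# Decode with the same pipeline the steps above apply to the kubectl output.
decoded=$(printf '%s' "$encoded" | base64 -d | gunzip)

echo "$decoded"
```

If `gunzip` reports that the input is not in gzip format, the jsonpath key most likely did not match and the pipeline received empty input.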

### kube-prometheus

1. Review the official documentation - <https://github.com/prometheus-operator/kube-prometheus>
2. Review the kube-prometheus-stack Helm chart - <https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack>
3. Review the values.yaml file - <https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml>
4. Add the repository

   ```
   {
       helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
       helm repo update
   }
   ```
5. Install the chart

   ```
   {
       kubectl create ns monitoring
       helm -n monitoring install prometheus prometheus-community/kube-prometheus-stack \
       --set fullnameOverride=kube-prometheus \
       --set prometheus.service.type=LoadBalancer \
       --set prometheus.service.port=80 \
       --set alertmanager.service.type=LoadBalancer \
       --set alertmanager.service.port=80 \
       --set alertmanager.serviceMonitor.selfMonitor=false \
       --set grafana.service.type=LoadBalancer \
       --set grafana.adminPassword=asdf1234 \
       --set grafana.serviceMonitor.enabled=false \
       --set defaultRules.create=false \
       --set kubeApiServer.enabled=false \
       --set kubelet.enabled=false \
       --set kubeControllerManager.enabled=false \
       --set coreDns.enabled=false \
       --set kubeEtcd.enabled=false \
       --set kubeScheduler.enabled=false \
       --set kubeProxy.enabled=false \
       --set kubeStateMetrics.enabled=false \
       --set nodeExporter.enabled=false
   }
   ```
6. Check the objects that were created

   ```
   kubectl get all -n monitoring
   ```
7. Check the ServiceMonitors that were created

   ```
   kubectl get servicemonitors.monitoring.coreos.com -A
   ```
8. Get the Prometheus server endpoint

   ```
   kubectl -n monitoring get svc kube-prometheus-prometheus
   ```

   OR

   ```
   kubectl -n monitoring get svc kube-prometheus-prometheus \
   -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
   ```
9. Open the URL from the previous step in a web browser
10. From the menu at the top of the Prometheus dashboard, go to Status -> Targets and review the list of jobs
11. Get the Alertmanager server endpoint

    ```
    kubectl -n monitoring get svc kube-prometheus-alertmanager \
    -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
    ```
12. Open the URL from the previous step in a web browser
13. Get the Grafana server endpoint

    ```
    kubectl -n monitoring get svc prometheus-grafana \
    -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'
    ```
14. Open the URL from the previous step in a web browser - username: admin, password: asdf1234
15. In the dashboard list, open **Prometheus / Overview**
16. In the dashboard list, open **Kubernetes / Compute Resources / Pod**
17. Enable kubelet metrics collection

    ```
    helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    --reuse-values \
    --set kubelet.enabled=true
    ```
18. Check the list of ServiceMonitors

    ```
    kubectl get servicemonitors.monitoring.coreos.com -A
    ```
19. Inspect the details of the newly created ServiceMonitor

    ```
    kubectl get servicemonitors.monitoring.coreos.com \
    kube-prometheus-kubelet -n monitoring -o yaml
    ```
20. List the Services in the kube-system namespace

    ```
    kubectl get svc -n kube-system
    ```
21. List the Endpoints in the kube-system namespace

    ```
    kubectl get ep -n kube-system 
    ```
22. Check the node IP addresses

    ```
    kubectl get node \
    -o=custom-columns='NodeName:.metadata.name,InternalIP:.status.addresses[?(@.type=="InternalIP")].address,ExternalIP:.status.addresses[?(@.type=="ExternalIP")].address'
    ```
23. In the menu at the top of the Prometheus dashboard, go to Status -> Targets and check the list of jobs
24. Find **Kubernetes / Kubelet** in the Grafana dashboard list
25. Review the dashboard templates - <https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack/templates/grafana/dashboards-1.14>
26. In the menu at the top of the Prometheus dashboard, go to Alerts and check the list of alerts
27. Add the kubelet rules

    ```
    helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    --reuse-values \
    --set defaultRules.create=true \
    --set defaultRules.rules.kubelet=true \
    --set defaultRules.rules.alertmanager=false \
    --set defaultRules.rules.etcd=false \
    --set defaultRules.rules.configReloaders=false \
    --set defaultRules.rules.general=false \
    --set defaultRules.rules.k8s=false \
    --set defaultRules.rules.kubeApiserverAvailability=false \
    --set defaultRules.rules.kubeApiserverBurnrate=false \
    --set defaultRules.rules.kubeApiserverHistogram=false \
    --set defaultRules.rules.kubeApiserverSlos=false \
    --set defaultRules.rules.kubeProxy=false \
    --set defaultRules.rules.kubePrometheusGeneral=false \
    --set defaultRules.rules.kubePrometheusNodeRecording=false \
    --set defaultRules.rules.kubernetesApps=false \
    --set defaultRules.rules.kubernetesResources=false \
    --set defaultRules.rules.kubernetesStorage=false \
    --set defaultRules.rules.kubernetesSystem=false \
    --set defaultRules.rules.kubeScheduler=false \
    --set defaultRules.rules.kubeStateMetrics=false \
    --set defaultRules.rules.node=false \
    --set defaultRules.rules.nodeExporterAlerting=false \
    --set defaultRules.rules.nodeExporterRecording=false \
    --set defaultRules.rules.prometheus=false \
    --set defaultRules.rules.prometheusOperator=false
    ```
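
    The long list of `=false` flags above could also be generated with a loop. This is only a sketch: `disable_rule_flags` is a hypothetical helper, and the group names passed to it are the ones used in the command above.

    ```shell
    # Hypothetical helper: print one "--set defaultRules.rules.<group>=false"
    # flag per argument, one flag per line.
    disable_rule_flags() {
        for group in "$@"; do
            printf '%s defaultRules.rules.%s=false\n' '--set' "$group"
        done
    }

    # Example call (needs cluster access). The command substitution is left
    # unquoted on purpose so that the shell splits it into separate arguments:
    #   helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    #     --reuse-values \
    #     --set defaultRules.create=true \
    #     --set defaultRules.rules.kubelet=true \
    #     $(disable_rule_flags alertmanager etcd configReloaders general k8s)
    ```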
28. In the menu at the top of the Prometheus dashboard, go to Status -> Rules and confirm that the new rules were added
29. Review the rules bundled with the Helm chart - <https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14>
30. Enable the kubernetesStorage rules

    ```
    helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    --reuse-values \
    --set defaultRules.rules.kubernetesStorage=true
    ```
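31. In the menu at the top of the Prometheus dashboard, go to Status -> Rules and confirm that the new rules were added
32. In the menu at the top of the Prometheus dashboard, go to Alerts and confirm that new alerts were added
33. Deploy a demo application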

    ```
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: nginx
    spec:
      serviceName: nginx
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
            volumeMounts:
            - mountPath: /data
              name: data
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
    EOF
    ```
34. Confirm that the Pod was created

    ```
    kubectl get pod -l app=nginx
    ```
35. Enter the following query in the expression browser to check the available disk space per PV (grouped by PVC name)

    ```
    sum (kubelet_volume_stats_available_bytes) by (persistentvolumeclaim)
    ```
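
    The same query can also be sent from a terminal through the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming `curl` is installed; `prom_query` and `PROM_HOST` are hypothetical names, with `PROM_HOST` standing for the hostname found in step 8.

    ```shell
    # Hypothetical helper: run an instant query against the Prometheus HTTP API.
    # Usage: prom_query <prometheus-base-url> <promql>
    prom_query() {
        # -G sends the URL-encoded data as a query string on a GET request.
        curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
    }

    # Example call (needs a reachable Prometheus server):
    #   prom_query "http://$PROM_HOST" \
    #     'sum (kubelet_volume_stats_available_bytes) by (persistentvolumeclaim)'
    ```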
36. Find **Kubernetes / Persistent Volumes** in the Grafana dashboard list
37. Create a 999 MiB file on the PV

    ```
    kubectl exec -it nginx-0 -- dd if=/dev/zero of=/data/bigfile bs=1M count=999
    ```
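
    Ignoring filesystem overhead, a 1Gi claim holds 1024 MiB, so the 999 MiB file leaves about 25 MiB free. The sketch below does that arithmetic with a hypothetical `pct_available` helper. The 3% threshold mentioned in the comment is an assumption about the chart's `KubePersistentVolumeFillingUp` rule; check the rule files linked in step 29 before relying on it.

    ```shell
    # Hypothetical helper: integer percentage of a volume that is still available.
    # Usage: pct_available <capacity_bytes> <available_bytes>
    pct_available() {
        echo $(( $2 * 100 / $1 ))
    }

    # 1 GiB capacity with 25 MiB left (1024 MiB - 999 MiB):
    pct_available 1073741824 26214400   # prints 2, below an assumed 3% threshold
    ```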
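38. In the menu at the top of the Prometheus dashboard, go to Alerts and check the alert status
39. Check **Kubernetes / Persistent Volumes** in the Grafana dashboard list
40. Go to the Alertmanager dashboard and confirm that the alert has fired
41. Open <https://webhook.site> and note the generated webhook URL - do not close the page
42. Store the webhook URL found above in an environment variable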

    ```
    export WEBHOOK_URL=<Webhook URL>
    ```
43. Add a receiver configuration to Alertmanager

    ```
    helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    --reuse-values \
    --set alertmanager.config.route.receiver=infra \
    --set alertmanager.config.route.routes=null \
    --set 'alertmanager.config.receivers[0].name=infra' \
    --set "alertmanager.config.receivers[0].slack_configs[0].api_url=$WEBHOOK_URL" \
    --set 'alertmanager.config.receivers[0].slack_configs[0].channel=#infra' \
    --set 'alertmanager.config.receivers[0].slack_configs[0].send_resolved=true'
    ```
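
    The same settings may be easier to read as a values file. The sketch below prints YAML that is meant to be roughly equivalent to the `--set` flags above; `alertmanager_values` and the file name `alertmanager-values.yaml` are hypothetical.

    ```shell
    # Hypothetical helper: print a Helm values file for the "infra" receiver.
    # Usage: alertmanager_values <webhook-url> > alertmanager-values.yaml
    alertmanager_values() {
        printf '%s\n' \
            'alertmanager:' \
            '  config:' \
            '    route:' \
            '      receiver: infra' \
            '      routes: null' \
            '    receivers:' \
            '    - name: infra' \
            '      slack_configs:' \
            "      - api_url: \"$1\"" \
            '        channel: "#infra"' \
            '        send_resolved: true'
    }

    # Example call (needs cluster access):
    #   alertmanager_values "$WEBHOOK_URL" > alertmanager-values.yaml
    #   helm -n monitoring upgrade prometheus prometheus-community/kube-prometheus-stack \
    #     --reuse-values -f alertmanager-values.yaml
    ```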
44. Switch to the browser where <https://webhook.site> is open and confirm that a new message was received
45. Delete the resources

    ```
    {
        kubectl delete svc nginx
        kubectl delete sts nginx
        kubectl delete pvc -l app=nginx
        helm -n monitoring uninstall prometheus
        kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
        kubectl delete crd alertmanagers.monitoring.coreos.com
        kubectl delete crd podmonitors.monitoring.coreos.com
        kubectl delete crd probes.monitoring.coreos.com
        kubectl delete crd prometheuses.monitoring.coreos.com
        kubectl delete crd prometheusrules.monitoring.coreos.com
        kubectl delete crd servicemonitors.monitoring.coreos.com
        kubectl delete crd thanosrulers.monitoring.coreos.com
        kubectl delete ns monitoring
    }
    ```
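
    The CRD names above could instead be selected by API group. A minimal sketch with a hypothetical filter named `prometheus_operator_crds`; note that it matches every CRD in the `monitoring.coreos.com` group, which may include CRDs not listed above, so review the output before deleting anything.

    ```shell
    # Hypothetical filter: keep only CRD names in the monitoring.coreos.com group.
    # Reads names in "kubectl get crd -o name" format on stdin.
    prometheus_operator_crds() {
        grep '\.monitoring\.coreos\.com$'
    }

    # Example call (needs cluster access):
    #   kubectl get crd -o name | prometheus_operator_crds | xargs kubectl delete
    ```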

