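In the cloud-native era, Kubernetes (K8s) has become the standard platform for containerized deployment in the enterprise. As clusters grow, however, component complexity, resource utilization, and service stability become harder to keep track of. Nightingale, an open-source full-stack observability tool developed in China, offers lightweight deployment, multi-modal data collection, and AI-assisted alert self-healing, and it fits well into K8s environments. This article walks through deploying a complete Nightingale monitoring system from scratch.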
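I. Pre-deployment preparation
1. Environment requirements
• A Kubernetes cluster (version 1.20 or later; 1.24+ recommended)
• kubectl installed and configured with access to the cluster
• A StorageClass (for example OpenEBS or NFS) for persistent storage
• Helm 3, which Step 4 uses to install Nightingale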
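The requirements above can be checked before going any further. The following is a minimal sketch, assuming `kubectl version -o json` prints pretty-printed JSON (one key per line); the function names `version_ge`, `server_version`, and `preflight` are made up for this example and are not part of any tool.

```shell
#!/bin/sh
# version_ge A B: succeed when major.minor version A is at least B.
# Accepts forms such as "1.24", "v1.24.3", or "1.24+".
version_ge() (
  a=${1#v}; b=${2#v}
  a_major=${a%%.*}; a_rest=${a#*.}; a_minor=${a_rest%%[!0-9]*}
  b_major=${b%%.*}; b_rest=${b#*.}; b_minor=${b_rest%%[!0-9]*}
  [ "$a_major" -gt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
)

# server_version: print the API server's major.minor version.
# Assumption: the JSON is pretty-printed, so gitVersion sits on its own line.
server_version() {
  kubectl version -o json |
    sed -n '/"serverVersion"/,/}/s/.*"gitVersion": *"v\{0,1\}\([0-9]*\.[0-9]*\).*/\1/p'
}

# preflight MIN_VERSION: check kubectl, the server version, and StorageClasses.
preflight() {
  command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found" >&2; return 1; }
  ver=$(server_version)
  [ -n "$ver" ] || { echo "cannot read the API server version" >&2; return 1; }
  version_ge "$ver" "$1" || { echo "server $ver is older than $1" >&2; return 1; }
  sc=$(kubectl get storageclass --no-headers 2>/dev/null | wc -l | tr -d ' ')
  [ "$sc" -gt 0 ] || { echo "no StorageClass found" >&2; return 1; }
  echo "ok: server $ver, $sc StorageClass(es)"
}

# Usage: preflight 1.20
```

Only the major and minor numbers are compared, which is enough for the 1.20 and 1.24 thresholds named in the requirements.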
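2. Create the namespace
kubectl create namespace nightingale-system
3. Install dependency components
• Prometheus: the core metrics collection component.
• kube-state-metrics: exposes metrics about K8s objects (Pods, Deployments, and so on).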
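II. Detailed deployment steps
Step 1: Configure RBAC permissions
Nightingale needs access to the K8s API server and to component metrics, so create a ClusterRole and a ServiceAccount:
1. Create the ClusterRole (rbac.yaml)
mkdir -p /data/k8s/yy
cat > /data/k8s/yy/rbac.yaml << \EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nightingale-role
rules:
  # Collect K8s component metrics. nodes/metrics and nodes/proxy are the
  # subresources the kubelet checks for requests to port 10250.
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "nodes/proxy", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  # Collect API server and controller manager metrics
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
EOF
2. Create the ServiceAccount and bind the role
This block is appended to the same file, so it begins with a --- document separator:
cat >> /data/k8s/yy/rbac.yaml << \EOF
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nightingale-sa
  namespace: nightingale-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nightingale-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nightingale-role
subjects:
  - kind: ServiceAccount
    name: nightingale-sa
    namespace: nightingale-system
EOF
Apply the configuration:
kubectl apply -f /data/k8s/yy/rbac.yaml -n nightingale-system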
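Whether the binding took effect can be checked with `kubectl auth can-i` combined with impersonation. This is a minimal sketch: the helper names `sa_can` and `check_nightingale_rbac` are hypothetical, and the user running it needs permission to impersonate ServiceAccounts (cluster-admin has it).

```shell
#!/bin/sh
# sa_can NAMESPACE SERVICEACCOUNT VERB TARGET
# Ask the API server whether the ServiceAccount may perform VERB on TARGET
# (a resource such as "nodes" or a non-resource URL such as "/metrics").
sa_can() {
  kubectl auth can-i "$3" "$4" \
    --as="system:serviceaccount:$1:$2"
}

# check_nightingale_rbac: run the checks relevant to the role created above.
check_nightingale_rbac() {
  for target in nodes pods services endpoints /metrics; do
    printf '%-10s ' "$target"
    # can-i exits non-zero on "no"; keep going so every target is reported
    sa_can nightingale-system nightingale-sa get "$target" || true
  done
}

# Usage: check_nightingale_rbac   (prints "yes" or "no" next to each target)
```

A "no" next to any target points at the ClusterRole rules, provided the ClusterRoleBinding names the right ServiceAccount and namespace.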
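Step 2: Deploy Categraf (the data collector)
Categraf is Nightingale's lightweight collector and supports data sources such as Prometheus and the K8s API.
1. Create the Categraf DaemonSet
cat > /data/k8s/yy/categraf-daemonset.yaml << \EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: categraf
  namespace: nightingale-system
spec:
  selector:
    matchLabels:
      app: categraf
  template:
    metadata:
      labels:
        app: categraf
    spec:
      serviceAccountName: nightingale-sa
      containers:
        - name: categraf
          image: flashcatcloud/categraf:latest
          args:
            - "-config"
            - "/etc/categraf/categraf.yaml"
          env:
            # Supplies the node name referenced as ${NODE_NAME} in the config
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config-volume
              mountPath: /etc/categraf
      volumes:
        - name: config-volume
          configMap:
            name: categraf-config
EOF
2. Configure the Categraf collection rules (categraf-config.yaml)
Categraf releases read TOML files from a configuration directory (config.toml plus one input.<plugin> directory per plugin), so treat the keys below as a schematic and map them to the files your Categraf version expects.
cat > /data/k8s/yy/categraf-config.yaml << \EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: categraf-config
  namespace: nightingale-system
data:
  categraf.yaml: |
    inputs:
      - name: kubernetes
        type: kubernetes
        interval: 10s
        # Collect kubelet metrics
        kubelet:
          host: "https://${NODE_NAME}:10250"
          insecure_skip_verify: true
    # Collect Prometheus exporter metrics (adjust the endpoint to your remote-write address)
    exporters:
      - name: prometheus
        type: prometheus
        endpoint: "http://n9e-prometheus.monitoring:9090/api/v1/write"
EOF
Apply the configuration, ConfigMap first so the DaemonSet Pods can mount it right away:
kubectl apply -f /data/k8s/yy/categraf-config.yaml -n nightingale-system
kubectl apply -f /data/k8s/yy/categraf-daemonset.yaml -n nightingale-system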
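After the manifests are applied, it can help to wait until the DaemonSet has a ready Pod on every node instead of polling by hand. A small sketch using `kubectl rollout status`; the function name `wait_for_categraf` and the 120s default timeout are arbitrary choices for illustration.

```shell
#!/bin/sh
# wait_for_categraf [TIMEOUT]: block until the categraf DaemonSet is rolled out,
# and print recent Pod logs to stderr if the rollout does not finish in time.
wait_for_categraf() {
  timeout=${1:-120s}
  if kubectl rollout status daemonset/categraf \
       -n nightingale-system --timeout="$timeout"; then
    echo "categraf is ready"
  else
    # The label selector matches the app=categraf label from the DaemonSet template
    kubectl logs -n nightingale-system -l app=categraf --tail=20 >&2
    return 1
  fi
}

# Usage: wait_for_categraf 180s
```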
Step 3: Deploy kube-state-metrics
kube-state-metrics exposes metrics about K8s objects, such as Pod phase and Deployment replica counts. It lists and watches every resource type it reports on, so the ServiceAccount its Pod runs as needs matching RBAC permissions.
Deployment YAML (kube-state-metrics.yaml):
cat > /data/k8s/yy/kube-state-metrics.yaml << \EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: nightingale-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          # v2.x images are published on registry.k8s.io
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
          ports:
            - containerPort: 8080
EOF
Expose it through a Service (appended to the same file, hence the leading --- separator):
cat >> /data/k8s/yy/kube-state-metrics.yaml << \EOF
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: nightingale-system
spec:
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: kube-state-metrics
EOF
Apply the configuration:
kubectl apply -f /data/k8s/yy/kube-state-metrics.yaml -n nightingale-system
Step 4: Deploy Nightingale Server
1. Pull the image and deploy
# Set up persistent storage (OpenEBS as an example)
kubectl apply -f https://openebs.github.io/charts/openebs-operator.yaml
kubectl patch storageclass openebs-hostpath -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Deploy Nightingale (with Helm)
helm repo add n9e https://n9e.github.io/charts
helm install nightingale n9e/nightingale \
  --namespace nightingale-system \
  --set persistence.storageClass=openebs-hostpath \
  --set persistence.size=10Gi
2. Expose the web interface
cat > /data/k8s/yy/nightingale-web-service.yaml << \EOF
apiVersion: v1
kind: Service
metadata:
  name: nightingale-web
  namespace: nightingale-system
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 17000
      nodePort: 30007  # customizable, within the cluster's NodePort range (30000-32767 by default)
  selector:
    app: nightingale  # must match the labels on the Nightingale Pods
EOF
Apply the configuration:
kubectl apply -f /data/k8s/yy/nightingale-web-service.yaml -n nightingale-system
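Step 5: Configure Prometheus to scrape metrics
1. Edit the Prometheus configuration
# Add the following jobs to the Prometheus configuration used by Nightingale
- job_name: "kubernetes"
  # The kubelet serves metrics over HTTPS and expects an authenticated request
  scheme: https
  tls_config:
    insecure_skip_verify: true
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  # Scrape kubelet metrics: keep the host part of the address and force port 10250
  relabel_configs:
    - source_labels: [__address__]
      regex: '([^:]+)(?::\d+)?'
      target_label: __address__
      replacement: "${1}:10250"
- job_name: "kube-state-metrics"
  static_configs:
    - targets: ["kube-state-metrics.nightingale-system:80"]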
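The relabel rule in the first job rewrites each discovered node address so that the scrape goes to the kubelet port. Prometheus matches the relabel regex against the whole label value, and `${1}` stands for the first capture group; when no regex is given, the default `(.*)` captures the entire address, port included. The sketch below only emulates that substitution with sed to show the difference. The helper names are hypothetical, and sed is a stand-in, not what Prometheus runs internally.

```shell
#!/bin/sh
# relabel_default ADDRESS: default regex (.*) with replacement ${1}:10250
relabel_default() {
  printf '%s\n' "$1" | sed -E 's/^(.*)$/\1:10250/'
}

# relabel_host_only ADDRESS: capture only the host part and drop any existing port
relabel_host_only() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?$/\1:10250/'
}

relabel_default   "10.0.0.5:10250"   # prints 10.0.0.5:10250:10250 (port duplicated)
relabel_host_only "10.0.0.5:10250"   # prints 10.0.0.5:10250
relabel_host_only "node-1"           # prints node-1:10250
```

With the node role, the discovered address already carries a port, which is why a rule that does not strip it ends up with an unusable target.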
2. Restart Prometheus
kubectl rollout restart deployment/nightingale -n nightingale-system
III. Verify the deployment
1. Check Pod status
kubectl get pods -n nightingale-system
# The categraf, kube-state-metrics, and nightingale Pods should all be in the Running state
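On a cluster with many nodes the Pod list gets long, so it can be easier to print only the Pods that are not healthy. A minimal sketch that filters the STATUS column of `kubectl get pods`; the function names `not_running` and `report` are made up for this example.

```shell
#!/bin/sh
# not_running NAMESPACE: list Pods whose STATUS column is neither Running nor Completed.
not_running() {
  kubectl get pods -n "$1" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# report NAMESPACE: print a one-line summary, or the Pods that need attention.
report() {
  bad=$(not_running "$1")
  if [ -z "$bad" ]; then
    echo "all Pods in $1 are Running"
  else
    echo "Pods needing attention in $1:"
    echo "$bad"
    return 1
  fi
}

# Usage: report nightingale-system
```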
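2. Access the web interface
Access it through the NodePort:
http://<K8s-node-IP>:30007
Default username/password: root/root.2020
3. Verify data collection
• Open the monitoring dashboards and check the K8s node resources (CPU, memory, storage).
• Check the kube-state-metrics metrics (for example kube_pod_status_phase).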
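The kube-state-metrics check can also be done from a terminal by reading the metrics endpoint through a port-forward. A sketch, assuming curl is installed locally and local port 8080 is free; `count_series` and `ksm_check` are hypothetical helper names.

```shell
#!/bin/sh
# count_series NAME: read Prometheus text format on stdin and count the
# sample lines for metric NAME (lines starting with # are comments).
count_series() {
  awk -v name="$1" '
    /^#/ { next }
    { split($0, parts, /[{ ]/); if (parts[1] == name) n++ }
    END { print n + 0 }
  '
}

# ksm_check: forward the Service port locally and count kube_pod_status_phase samples.
ksm_check() {
  kubectl port-forward -n nightingale-system svc/kube-state-metrics 8080:80 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the tunnel a moment to come up
  curl -s http://127.0.0.1:8080/metrics | count_series kube_pod_status_phase
  kill "$pf_pid" 2>/dev/null
}

# Usage: ksm_check   (a count of 0 means the series is missing)
```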
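IV. Common problems and solutions
Problem 1: Insufficient permissions (Forbidden errors)
Cause: the ServiceAccount is not bound correctly, or the RBAC role lacks the required permissions.
Solution:
1. Check that the ClusterRole grants access to /metrics.
2. Make sure the ClusterRoleBinding references the correct ServiceAccount.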
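A Forbidden response can be reproduced outside the collector by calling the kubelet with the ServiceAccount's own token and looking at the HTTP status code. A minimal sketch: `kubectl create token` is available in kubectl 1.24 and later (on older clusters the token has to be read from the ServiceAccount's Secret instead), the node IP is supplied by the caller, and the helper names are hypothetical.

```shell
#!/bin/sh
# kubelet_probe NODE_IP: request /metrics from the kubelet as nightingale-sa
# and print only the HTTP status code.
kubelet_probe() {
  token=$(kubectl create token nightingale-sa -n nightingale-system) || return 1
  # -k skips certificate verification, matching insecure_skip_verify in the guide
  curl -sk -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer $token" \
    "https://$1:10250/metrics"
}

# explain_status CODE: map the status code to the likely cause.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "token rejected: check kubelet webhook authentication" ;;
    403) echo "forbidden: grant get on nodes/metrics to the ServiceAccount" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage: explain_status "$(kubelet_probe 10.0.0.5)"
```

A 403 narrows the problem to the ClusterRole rules, while a 401 means the kubelet did not accept the token at all.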
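Problem 2: Categraf cannot collect kubelet metrics
Cause: the kubelet authenticates and authorizes every request on port 10250, so the collector must present its ServiceAccount token, and the bound role must allow get on nodes/metrics (and nodes/proxy for other kubelet paths).
Solution:
1. Grant the ServiceAccount get on nodes/metrics and nodes/proxy in the ClusterRole.
2. On AWS clusters that assign IAM roles to Pods through a pod identity webhook, also add the role annotation to the DaemonSet Pod template (the ARN is an example value):
annotations:
  "iam.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/kubelet"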
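Problem 3: Prometheus cannot fetch metrics
Cause: the Service is misconfigured or the port is not exposed.
Solution:
1. Make sure the kube-state-metrics Service ports are correct (Service port 80 forwarding to container port 8080).
2. Check that the targets in the Prometheus configuration are reachable.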
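A quick way to tell a selector mismatch from a network problem is to look at the Service's endpoints: an empty list means no ready Pod carries the labels the Service selects. A sketch using `kubectl get endpoints` with a jsonpath query; `service_endpoints` and `check_service` are hypothetical helper names.

```shell
#!/bin/sh
# service_endpoints NAMESPACE SERVICE: print the Pod IPs behind a Service.
service_endpoints() {
  kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# check_service NAMESPACE SERVICE: fail when the Service selects no ready Pod.
check_service() {
  ips=$(service_endpoints "$1" "$2")
  if [ -z "$ips" ]; then
    echo "$2 has no ready endpoints: compare the Service selector with the Pod labels" >&2
    return 1
  fi
  echo "$2 -> $ips"
}

# Usage: check_service nightingale-system kube-state-metrics
```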
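V. Extended configuration
1. Configure alert rules
# Example: fire an alert when CPU usage exceeds 80%
groups:
  - name: k8s-cpu-alarm
    rules:
      - alert: HighCPUUsage
        # Assumes machine:node_cpu_utilisation:sum is a recording rule holding a 0-1 utilisation ratio per node
        expr: avg by (node) (machine:node_cpu_utilisation:sum) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} CPU usage is too high!"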
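2. Import dashboard templates
Replace <nightingale-pod> with the name of the Nightingale Pod:
# Import the K8s cluster dashboard template
kubectl exec -it -n nightingale-system <nightingale-pod> \
  -- /bin/sh -c "curl -o /etc/nightingale/dashboards/k8s-dashboard.json \
  https://raw.githubusercontent.com/flashcatcloud/categraf/main/k8s/pod-dash.json"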
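3. Remote access (optional)
To reach the web interface from outside the internal network, use a tunnel such as cpolar or publish the service through an Nginx Ingress:
# Map the port with cpolar
cpolar --subdomain my-n9e http://localhost:30007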