一、问题背景
某项目反馈出现系统崩溃现象,服务器为VM虚拟化平台,疑似硬件出现了故障,需要查看相关日志定位分析。
二、系统启动分析
1、查看系统启动时间
systemd-analyze time 查看开机耗时
[root@WQ-MySQL ~]# systemd-analyze time
Startup finished in 1.742s (kernel) + 2.522s (initrd) + 23.152s (userspace) = 27.416s
lastb 查看上次启动时间
[
root pts/0 Sun Feb 25 16:23 - 16:23 (00:00)
btmp begins Sun Feb 25 16:23:50 2024
uptime -s 查看本次服务器什么时候启动的
[root@WQ-MySQL ~]# uptime
09:22:59 up 41 min, 3 users, load average: 1.14, 1.78, 1.74
[root@WQ-MySQL ~]# uptime -s
2024-02-26 08:41:09
last reboot 查看历史重启记录
2、日志目录结构
1. /var/log/messages:这个文件包含了系统各种服务的日志信息,可以通过tail命令来实时检查日志的变化,如`tail -f /var/log/messages`
2. /var/log/syslog:这个文件主要是记录系统的运行信息,比如内存映像文件的反转、系统守护进程的信息等等。
3. /var/log/auth.log:这个日志记录的是用户登录和身份验证信息,可以查看谁登录了系统,以及登录操作是否成功等等。
4. /var/log/dmesg:这个文件记录的是系统启动时的信息,比如硬件检测、驱动加载等等。
5. /var/log/kern.log:这个文件记录了内核的信息,包括内核错误、警告信息以及调试信息等等。
6. /var/log/cron.log:这个文件记录了定时任务的执行情况,可以查看定时任务是否执行成功等等。
7. /var/log/lastlog:这个文件记录了所有用户最后一次登录的时间和信息。
这些日志文件默认情况下都只能被root用户或管理员查看。使用cat、tail、less等命令可以查看日志内容。也可以使用logrotate来管理和清理日志。
3、分析Message日志
tail -1000 /var/log/messages | grep -i error
[root@WQ-MySQL ~]# tail -1000 /var/log/messages | grep -i error
Feb 26 08:41:21 WQ-MySQL journal: Error probing device: Error sending ATA command IDENTIFY PACKET DEVICE to '/dev/sr0': ATA command failed: error=0x01 count=0x02 status=0x50 (g-io-error-quark, 0)
Feb 26 08:41:22 WQ-MySQL journal: Error probing device: Error sending ATA command IDENTIFY PACKET DEVICE to '/dev/sr0': ATA command failed: error=0x01 count=0x02 status=0x50 (g-io-error-quark, 0)
Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.819564665+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n": exit status 1"
Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.829636035+08:00" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin"
Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.829643386+08:00" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n": exit status 1"
Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.596+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录
Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.612+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录
Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.613+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录
Feb 26 08:41:29 WQ-MySQL dockerd: time="2024-02-26T08:41:29.992556361+08:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint aa5db7879db865bbc76a2232ee8d1e49cc15188613264608c54f36349ceee78f 477daea1678e52e2908c451eb27763f9b251adc9fc75c7add048063ba112a703], retrying...."
Feb 26 08:41:30 WQ-MySQL dockerd: time="2024-02-26T08:41:30.411541110+08:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint aa5db7879db865bbc76a2232ee8d1e49cc15188613264608c54f36349ceee78f d1a799377a1f7e3728d0b824bb3c9620885bd6afdf23cb7c7545c2f3e93c6533], retrying...."
Feb 26 08:41:56 WQ-MySQL journal: Error looking up permission: GDBus.Error:org.freedesktop.portal.Error.NotFound: No entry for geolocation
Feb 26 08:41:56 WQ-MySQL kernel: [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.
Feb 26 08:44:38 WQ-MySQL YLMonitor: 2024/02/26 08:44:38 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0m
Feb 26 08:50:39 WQ-MySQL YLMonitor: 2024/02/26 08:50:39 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0m
Feb 26 08:56:40 WQ-MySQL YLMonitor: 2024/02/26 08:56:40 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0m
报错信息分析如下,无法明确定位:
1. ATA command failed: error=0x01 count=0x02 status=0x50,可能是硬件设备出现问题。
2. modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n",可能是缺少了aufs模块。
3. 无法打开 tty /dev/kvm: 没有那个文件或目录,可能是缺少了/dev/kvm文件或目录。
4. Error (Unable to complete atomic operation, key modified) deleting object,可能是删除对象时出现了问题。
5. GDBus.Error:org.freedesktop.portal.Error.NotFound: No entry for geolocation,可能是找不到geolocation的权限。
6. [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.,可能是显示内存不足。
7. connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused,可能是连接本地服务失败。
确定具体是什么硬件故障:
dmesg | grep -i sr0
[
[
[
可以看到系统检测到了一个SCSI CD-ROM设备,即/dev/sr0。
该设备是一种可读写的光驱设备,支持DVD-RAM、CD/RW等多种格式。
因此,可能是光驱硬件故障或者光驱驱动程序问题导致的。
4、分析Dmesg日志
dmesg | grep -E "error|Error|ERROR|fail|Fail|FAIL"
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
该日志显示了PCI设备的BAR 13分配IO大小失败的错误。
查看BAR 13相关报错:
dmesg | grep -E "BAR 13"
[
[
[
分析:可能是网卡故障,需要换一个IP试试
https://blog.csdn.net/Doudou_Mylove/article/details/118417103
4、分析Journal服务日志
journalctl -p err
-- Logs begin at 一 2024-02-26 08:41:10 CST, end at 一 2024-02-26 09:41:09 CST. --
2月 26 08:41:16 WQ-MySQL kernel: piix4_smbus 0000:00:07.3: SMBus Host Controller not enabled!
2月 26 08:41:21 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:26 WQ-MySQL rsyslogd[1935]: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/2027 ]
2月 26 08:41:26 WQ-MySQL rsyslogd[1935]: imjournal: ignoring invalid state file [v8.24.0-52.el7_8.2]
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted
2月 26 08:41:54 WQ-MySQL spice-vdagent[3517]: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
2月 26 08:41:54 WQ-MySQL spice-streaming-agent[3523]: Failed to open the streaming device "/dev/virtio-ports/org.spice-space.stream.0": 2 - No such file or directory
2月 26 08:41:56 WQ-MySQL kernel: [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.
该日志文件中包含了多条信息,其中大部分是关于权限问题或者服务启动失败的信息,与硬件故障无关。但是,最后一条信息可能与显卡驱动程序或者显卡硬件故障有关。该信息显示了一个drm错误,提示"Combined output size too large",可能是由于显存不足或者显卡驱动程序问题导致的。
三、小结
该故障大概率为网卡设备故障,其他报错无特别影响,具体结合日志分析,暂未发现系统层面的故障。