Linux服务器疑似硬件启动故障排查及dmesg/Message/Journal相关日志分析

艺帆风顺 发布于 2025-04-03 21 次阅读


一、问题背景

    某项目反馈出现系统崩溃现象,服务器为VM虚拟化平台,疑似硬件出现了故障,需要查看相关日志定位分析。

二、系统启动分析

1、查看系统启动时间

systemd-analyze time        查看开机耗时

    [root@WQ-MySQL ~]# systemd-analyze timeStartup finished in 1.742s (kernel) + 2.522s (initrd) + 23.152s (userspace) = 27.416s

    lastb        查看上次启动时间

      [root@WQ-MySQL ~]# lastbroot pts/0 Sun Feb 25 16:23 - 16:23 (00:00) 
      btmp begins Sun Feb 25 16:23:50 2024

       uptime -s        查看本次服务器什么时候启动的

        [root@WQ-MySQL ~]# uptime 09:22:59 up 41 min, 3 users, load average: 1.14, 1.78, 1.74[root@WQ-MySQL ~]# uptime -s2024-02-26 08:41:09

         last reboot   查看历史重启记录

        2、日志目录结构

        1. /var/log/messages:这个文件包含了系统各种服务的日志信息,可以通过tail命令来实时检查日志的变化,如`tail -f /var/log/messages`

        2. /var/log/syslog:这个文件主要是记录系统的运行信息,比如内存映像文件的反转、系统守护进程的信息等等。

        3. /var/log/auth.log:这个日志记录的是用户登录和身份验证信息,可以查看谁登录了系统,以及登录操作是否成功等等。

        4. /var/log/dmesg:这个文件记录的是系统启动时的信息,比如硬件检测、驱动加载等等。

        5. /var/log/kern.log:这个文件记录了内核的信息,包括内核错误、警告信息以及调试信息等等。

        6. /var/log/cron.log:这个文件记录了定时任务的执行情况,可以查看定时任务是否执行成功等等。

        7. /var/log/lastlog:这个文件记录了所有用户最后一次登录的时间和信息。

        这些日志文件默认情况下都只能被root用户或管理员查看。使用cat、tail、less等命令可以查看日志内容。也可以使用logrotate来管理和清理日志。

        3、分析Message日志

        tail -1000 /var/log/messages | grep -i error

          [root@WQ-MySQL ~]# tail -1000 /var/log/messages | grep -i errorFeb 26 08:41:21 WQ-MySQL journal: Error probing device: Error sending ATA command IDENTIFY PACKET DEVICE to '/dev/sr0': ATA command failed: error=0x01 count=0x02 status=0x50 (g-io-error-quark, 0)Feb 26 08:41:22 WQ-MySQL journal: Error probing device: Error sending ATA command IDENTIFY PACKET DEVICE to '/dev/sr0': ATA command failed: error=0x01 count=0x02 status=0x50 (g-io-error-quark, 0)Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.819564665+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n": exit status 1"Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.829636035+08:00" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin"Feb 26 08:41:26 WQ-MySQL containerd: time="2024-02-26T08:41:26.829643386+08:00" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n": exit status 1"Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.596+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.612+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录Feb 26 08:41:27 WQ-MySQL libvirtd: 2024-02-26 00:41:27.613+0000: 2476: error : virHostCPUGetTscInfo:1389 : 无法打开 tty /dev/kvm: 没有那个文件或目录Feb 26 08:41:29 WQ-MySQL dockerd: time="2024-02-26T08:41:29.992556361+08:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint aa5db7879db865bbc76a2232ee8d1e49cc15188613264608c54f36349ceee78f 477daea1678e52e2908c451eb27763f9b251adc9fc75c7add048063ba112a703], retrying...."Feb 26 08:41:30 WQ-MySQL dockerd: time="2024-02-26T08:41:30.411541110+08:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint aa5db7879db865bbc76a2232ee8d1e49cc15188613264608c54f36349ceee78f d1a799377a1f7e3728d0b824bb3c9620885bd6afdf23cb7c7545c2f3e93c6533], retrying...."Feb 26 08:41:56 WQ-MySQL journal: Error looking up permission: GDBus.Error:org.freedesktop.portal.Error.NotFound: No entry for geolocationFeb 26 08:41:56 WQ-MySQL kernel: [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.Feb 26 08:44:38 WQ-MySQL YLMonitor: 2024/02/26 08:44:38 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0mFeb 26 08:50:39 WQ-MySQL YLMonitor: 2024/02/26 08:50:39 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0mFeb 26 08:56:40 WQ-MySQL YLMonitor: 2024/02/26 08:56:40 #033[1;31m[E] [proxy.go:196] [2d4e60d28d7f52e0] [mysql-wuqcldpmysql] connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused#033[0m

          报错信息分析如下,无法明确定位:

            1. ATA command failed: error=0x01 count=0x02 status=0x50,可能是硬件设备出现问题。2. modprobe aufs failed: "modprobe: FATAL: Module aufs not found.n",可能是缺少了aufs模块。3. 无法打开 tty /dev/kvm: 没有那个文件或目录,可能是缺少了/dev/kvm文件或目录。4. Error (Unable to complete atomic operation, key modified) deleting object,可能是删除对象时出现了问题。5. GDBus.Error:org.freedesktop.portal.Error.NotFound: No entry for geolocation,可能是找不到geolocation的权限。6. [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.,可能是显示内存不足。7. connect to local service [127.0.0.1:13306] error: dial tcp 127.0.0.1:13306: connect: connection refused,可能是连接本地服务失败。

            确定具体是什么硬件故障:

            dmesg | grep -i sr0

              [root@WQ-MySQL ~]# dmesg | grep -i sr0[ 2.684158] sr 1:0:0:0: [sr0] scsi3-mmc drive: 1x/1x writer dvd-ram cd/rw xa/form2 cdda tray[ 2.684301] sr 1:0:0:0: Attached scsi CD-ROM sr0
              可以看到系统检测到了一个SCSI CD-ROM设备,即/dev/sr0。该设备是一种可读写的光驱设备,支持DVD-RAM、CD/RW等多种格式。因此,可能是光驱硬件故障或者光驱驱动程序问题导致的。

              4、分析Dmesg日志

              dmesg | grep -E "error|Error|ERROR|fail|Fail|FAIL"

                [root@WQ-MySQL ~]# dmesg | grep -E "error|Error|ERROR|fail|Fail|FAIL"[ 1.817207] pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000][ 1.817210] pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000][ 1.817213] pci 0000:00:15.5: BAR 13: failed to assign [io size 0x1000][ 1.817216] pci 0000:00:15.6: BAR 13: failed to assign [io size 0x1000][ 1.817219] pci 0000:00:15.7: BAR 13: failed to assign [io size 0x1000][ 1.817222] pci 0000:00:16.3: BAR 13: failed to assign [io size 0x1000][ 1.817224] pci 0000:00:16.4: BAR 13: failed to assign [io size 0x1000][ 1.817227] pci 0000:00:16.5: BAR 13: failed to assign [io size 0x1000][ 1.817230] pci 0000:00:16.6: BAR 13: failed to assign [io size 0x1000][ 1.817233] pci 0000:00:16.7: BAR 13: failed to assign [io size 0x1000][ 1.817236] pci 0000:00:17.3: BAR 13: failed to assign [io size 0x1000][ 1.817239] pci 0000:00:17.4: BAR 13: failed to assign [io size 0x1000][ 1.817241] pci 0000:00:17.5: BAR 13: failed to assign [io size 0x1000][ 1.817244] pci 0000:00:17.6: BAR 13: failed to assign [io size 0x1000][ 1.817247] pci 0000:00:17.7: BAR 13: failed to assign [io size 0x1000][ 1.817250] pci 0000:00:18.2: BAR 13: failed to assign [io size 0x1000][ 1.817253] pci 0000:00:18.3: BAR 13: failed to assign [io size 0x1000][ 1.817255] pci 0000:00:18.4: BAR 13: failed to assign [io size 0x1000][ 1.817258] pci 0000:00:18.5: BAR 13: failed to assign [io size 0x1000][ 1.817261] pci 0000:00:18.6: BAR 13: failed to assign [io size 0x1000][ 1.817264] pci 0000:00:18.7: BAR 13: failed to assign [io size 0x1000][ 1.817268] pci 0000:00:18.7: BAR 13: failed to assign [io size 0x1000][ 1.817272] pci 0000:00:18.6: BAR 13: failed to assign [io size 0x1000][ 1.817274] pci 0000:00:18.5: BAR 13: failed to assign [io size 0x1000][ 1.817277] pci 0000:00:18.4: BAR 13: failed to assign [io size 0x1000][ 1.817280] pci 0000:00:18.3: BAR 13: failed to assign [io size 0x1000][ 1.817283] pci 0000:00:18.2: BAR 13: failed to assign [io size 0x1000][ 1.817286] pci 0000:00:17.7: BAR 13: failed to assign [io size 0x1000][ 1.817289] pci 0000:00:17.6: BAR 13: failed to assign [io size 0x1000][ 1.817291] pci 0000:00:17.5: BAR 13: failed to assign [io size 0x1000][ 1.817294] pci 0000:00:17.4: BAR 13: failed to assign [io size 0x1000][ 1.817297] pci 0000:00:17.3: BAR 13: failed to assign [io size 0x1000][ 1.817300] pci 0000:00:16.7: BAR 13: failed to assign [io size 0x1000][ 1.817303] pci 0000:00:16.6: BAR 13: failed to assign [io size 0x1000][ 1.817305] pci 0000:00:16.5: BAR 13: failed to assign [io size 0x1000][ 1.817308] pci 0000:00:16.4: BAR 13: failed to assign [io size 0x1000][ 1.817311] pci 0000:00:16.3: BAR 13: failed to assign [io size 0x1000][ 1.817314] pci 0000:00:15.7: BAR 13: failed to assign [io size 0x1000][ 1.817317] pci 0000:00:15.6: BAR 13: failed to assign [io size 0x1000][ 1.817320] pci 0000:00:15.5: BAR 13: failed to assign [io size 0x1000][ 1.817323] pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000][ 1.817325] pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000][ 47.811380] [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.

                该日志显示了PCI设备的BAR 13分配IO大小失败的错误。

                查看BAR 13相关报错:

                dmesg | grep -E "BAR 13"

                  [root@WQ-MySQL ~]# dmesg | grep -E "BAR 13"[ 1.817205] pci 0000:00:15.3: BAR 13: no space for [io size 0x1000][ 1.817207] pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000]

                  分析:可能是网卡故障,需要换一个IP试试

                  https://blog.csdn.net/Doudou_Mylove/article/details/118417103

                  4、分析Journal服务日志

                  journalctl -p err

                    -- Logs begin at 一 2024-02-26 08:41:10 CST, end at 一 2024-02-26 09:41:09 CST. --2月 26 08:41:16 WQ-MySQL kernel: piix4_smbus 0000:00:07.3: SMBus Host Controller not enabled!2月 26 08:41:21 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:26 WQ-MySQL rsyslogd[1935]: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/2027 ]2月 26 08:41:26 WQ-MySQL rsyslogd[1935]: imjournal: ignoring invalid state file [v8.24.0-52.el7_8.2]2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:40 WQ-MySQL rtkit-daemon[1440]: Failed to make ourselves RT: Operation not permitted2月 26 08:41:54 WQ-MySQL spice-vdagent[3517]: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.02月 26 08:41:54 WQ-MySQL spice-streaming-agent[3523]: Failed to open the streaming device "/dev/virtio-ports/org.spice-space.stream.0": 2 - No such file or directory2月 26 08:41:56 WQ-MySQL kernel: [drm:vmw_kms_check_display_memory [vmwgfx]] *ERROR* Combined output size too large.

                        该日志文件中包含了多条信息,其中大部分是关于权限问题或者服务启动失败的信息,与硬件故障无关。但是,最后一条信息可能与显卡驱动程序或者显卡硬件故障有关。该信息显示了一个drm错误,提示"Combined output size too large",可能是由于显存不足或者显卡驱动程序问题导致的。

                    三、小结

                        该故障大概率为网卡设备故障,其他报错无特别影响,具体结合日志分析,暂未发现系统层面的故障。