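Problem description (CentOS 7):
Recently my self-hosted server kept becoming unreachable during working hours for no obvious reason. I went through /var/log/messages and the other relevant logs for those periods and found nothing abnormal. Below are the possible causes I investigated and the fixes I tried.
Article source: 玩技e族 - https://www.playezu.com/826851.html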
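Temporary workaround:
The machine has a remote power-on card, so I can power-cycle it remotely, and after a restart everything works. After a while, though, it becomes unreachable again. I tried reinstalling various services and runtime environments, scanning for viruses, checking the network, and reviewing scheduled tasks, all without effect. So I decided to write scripts to help track the problem down.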
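Possibility 1:
I suspected that excessive load was hanging system processes, or that the graphical desktop had frozen, so I made the following change.
Approach:
Disable the graphical interface (the commands below are for CentOS; for other systems, look up the equivalent method):
sudo systemctl set-default multi-user.target
sudo reboot
(In the end this was not the cause: after a while the host became unreachable again.)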
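Before rebooting it can help to confirm which boot target is actually configured. The helper below is only an illustrative sketch; `describe_target` is a hypothetical name, not part of systemd.

```shell
# Hypothetical helper: explain what a systemd default target means for the GUI.
describe_target() {
    case "$1" in
        graphical.target)  echo "graphical (GUI starts at boot)" ;;
        multi-user.target) echo "multi-user (text mode, no GUI)" ;;
        *)                 echo "other: $1" ;;
    esac
}

# Usage on the server:
#   describe_target "$(systemctl get-default)"
```

To drop the GUI immediately without a reboot, `sudo systemctl isolate multi-user.target` switches the running system to text mode.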
Scheduled task:
Observed results:
★[2024-05-13 12:30:03] Successful
----------------------------------------------------------------------------
Current system load: 2.58
---------------------------------------------------------
Current resource usage:
---------------------------------------------------------
CPU usage: 11.5%
Memory usage: 37.69%
Disk usage: 37%
Top 10 processes by load:
---------------------------------------------------------
24686 15.6 49.3 /www/server/mysql/bin/mysqld
23739 0.0 24.0 /www/server/redis/src/redis-server
3912 0.3 7.9 /www/server/bt-monitor/pyenv/bin/python3.7
1712 0.0 6.5 /usr/local/btmonitoragent/plugin/networktraffic/networktraffic
3294 0.2 6.5 /www/server/bt-monitor/pyenv/bin/python3
4285 0.2 3.2 /usr/local/btmonitoragent/BT-MonitorAgent
9962 0.4 2.8 php-fpm:
5588 0.3 2.5 php-fpm:
5609 0.4 2.5 php-fpm:
5612 0.4 2.5 php-fpm:
Checking connectivity to Baidu:
---------------------------------------------------------
PING baidu.com (110.242.68.66) 56(84) bytes of data.
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=1 ttl=50 time=18.3 ms
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=2 ttl=50 time=17.5 ms
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=3 ttl=50 time=18.4 ms
--- baidu.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 17.583/18.131/18.419/0.387 ms
Network log:
---------------------------------------------------------
[ 201.795264] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 201.795284] IPv6: ADDRCONF(NETDEV_CHANGE): veth301f04c: link becomes ready
[ 201.795319] docker0: port 1(veth301f04c) entered blocking state
[ 201.795322] docker0: port 1(veth301f04c) entered forwarding state
[ 201.795366] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 201.992254] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 201.992271] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 201.992292] IPv6: ADDRCONF(NETDEV_CHANGE): veth704bd4f: link becomes ready
[ 201.992327] docker0: port 2(veth704bd4f) entered blocking state
[ 201.992329] docker0: port 2(veth704bd4f) entered forwarding state
----------------------------------------------------------------------------
★[2024-05-13 13:00:03] Successful
----------------------------------------------------------------------------
Current system load: 2.11
---------------------------------------------------------
Current resource usage:
---------------------------------------------------------
CPU usage: 17.1%
Memory usage: 36.94%
Disk usage: 37%
Top 10 processes by load:
---------------------------------------------------------
24686 15.6 49.3 /www/server/mysql/bin/mysqld
23739 0.0 24.0 /www/server/redis/src/redis-server
3912 0.3 7.9 /www/server/bt-monitor/pyenv/bin/python3.7
21095 0.0 7.8 /usr/local/btmonitoragent/plugin/networktraffic/networktraffic
3294 0.2 6.5 /www/server/bt-monitor/pyenv/bin/python3
4285 0.2 3.2 /usr/local/btmonitoragent/BT-MonitorAgent
12986 0.4 3.2 php-fpm:
13017 0.4 3.2 php-fpm:
12982 0.4 3.0 php-fpm:
12996 0.4 3.0 php-fpm:
Checking connectivity to Baidu:
---------------------------------------------------------
PING baidu.com (39.156.66.10) 56(84) bytes of data.
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=1 ttl=55 time=7.38 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=2 ttl=55 time=7.29 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=3 ttl=55 time=7.19 ms
--- baidu.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 7.192/7.291/7.385/0.078 ms
Network log:
---------------------------------------------------------
(the same ten kernel messages as in the 12:30:03 record)
----------------------------------------------------------------------------
★[2024-05-13 13:30:04] Successful
----------------------------------------------------------------------------
Current system load: 3.92
---------------------------------------------------------
Current resource usage:
---------------------------------------------------------
CPU usage: 24.6%
Memory usage: 38.74%
Disk usage: 37%
Top 10 processes by load:
---------------------------------------------------------
24686 15.6 49.4 /www/server/mysql/bin/mysqld
23739 0.0 24.1 /www/server/redis/src/redis-server
8350 0.0 11.5 /usr/local/btmonitoragent/plugin/networktraffic/networktraffic
3912 0.3 7.9 /www/server/bt-monitor/pyenv/bin/python3.7
3294 0.2 6.5 /www/server/bt-monitor/pyenv/bin/python3
4285 0.2 3.2 /usr/local/btmonitoragent/BT-MonitorAgent
12986 0.3 3.2 php-fpm:
13017 0.4 3.2 php-fpm:
12987 0.3 3.1 php-fpm:
12996 0.4 3.1 php-fpm:
Checking connectivity to Baidu:
---------------------------------------------------------
PING baidu.com (39.156.66.10) 56(84) bytes of data.
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=1 ttl=55 time=7.20 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=2 ttl=55 time=7.24 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=3 ttl=55 time=7.13 ms
--- baidu.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 7.130/7.194/7.246/0.084 ms
Network log:
---------------------------------------------------------
(the same ten kernel messages as in the 12:30:03 record)
----------------------------------------------------------------------------
★[2024-05-13 14:00:03] Successful
----------------------------------------------------------------------------
Current system load: 2.96
---------------------------------------------------------
Current resource usage:
---------------------------------------------------------
CPU usage: 26%
Memory usage: 36.91%
Disk usage: 37%
Top 10 processes by load:
---------------------------------------------------------
24686 15.6 49.5 /www/server/mysql/bin/mysqld
23739 0.0 24.1 /www/server/redis/src/redis-server
3912 0.3 7.9 /www/server/bt-monitor/pyenv/bin/python3.7
3294 0.1 6.5 /www/server/bt-monitor/pyenv/bin/python3
27440 0.0 6.2 /usr/local/btmonitoragent/plugin/networktraffic/networktraffic
4285 0.2 3.2 /usr/local/btmonitoragent/BT-MonitorAgent
13017 0.4 3.1 php-fpm:
12996 0.4 3.0 php-fpm:
12987 0.4 2.9 php-fpm:
12997 0.3 2.9 php-fpm:
Checking connectivity to Baidu:
---------------------------------------------------------
PING baidu.com (110.242.68.66) 56(84) bytes of data.
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=1 ttl=50 time=18.3 ms
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=2 ttl=50 time=18.2 ms
64 bytes from 110.242.68.66 (110.242.68.66): icmp_seq=3 ttl=50 time=18.2 ms
--- baidu.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 18.202/18.239/18.307/0.048 ms
Network log:
---------------------------------------------------------
(the same ten kernel messages as in the 12:30:03 record)
----------------------------------------------------------------------------
★[2024-05-13 14:30:03] Successful
----------------------------------------------------------------------------
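The script behind these records is not shown, so the following is only a rough sketch of how a similar half-hourly snapshot could be collected. The function names, the choice to sort processes by memory share, and the `grep` pattern for kernel messages are all assumptions made for illustration. CPU usage is left out for brevity.

```shell
#!/bin/sh
# Hypothetical health snapshot; not the author's actual script.

# 1-minute load average, read from a /proc/loadavg-style file.
load_1min() {
    awk '{print $1}' "${1:-/proc/loadavg}"
}

# Memory usage in percent, computed from `free` output on stdin (used / total).
mem_percent() {
    awk '/^Mem:/ {printf "%.2f%%\n", $3 / $2 * 100}'
}

snapshot() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')]"
    echo "Current system load: $(load_1min)"
    echo "Memory usage: $(free | mem_percent)"
    echo "Disk usage: $(df -P / | awk 'NR==2 {print $5}')"
    echo "Top 10 processes (sorted by memory share, an assumption):"
    ps -eo pid,%cpu,%mem,cmd --sort=-%mem | head -n 11
    echo "Connectivity check:"
    ping -c 3 -W 2 baidu.com
    echo "Recent kernel network messages:"
    dmesg | grep -Ei 'link|docker0' | tail -n 10
}

# Collect a snapshot only when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    snapshot
fi
```

Saved under a path of your choice (for example the hypothetical /usr/local/bin/health_snapshot.sh), it could be scheduled with a crontab line such as `*/30 * * * * /usr/local/bin/health_snapshot.sh --run >> /var/log/health_snapshot.log 2>&1`.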
Network status:
Observed results:
★[2024-05-12 08:30:03] Successful
----------------------------------------------------------------------------
Website reachable; no action needed.
----------------------------------------------------------------------------
★[2024-05-12 09:00:02] Successful
----------------------------------------------------------------------------
Website unreachable; restarting related services and the network connection.
Checking the network connection and trying to restart the network...
Network restarted successfully.
Job for nginx.service failed because the control process exited with error code. See "systemctl status nginx.service" and "journalctl -xe" for details.
nginx is not running properly. Trying to restart the service.
nginx restarted successfully.
php-fpm-74 is running normally.
redis is running normally.
mysqld is running normally.
frpc is running normally.
----------------------------------------------------------------------------
★[2024-05-12 09:32:02] Successful
----------------------------------------------------------------------------
Website reachable; no action needed.
----------------------------------------------------------------------------
★[2024-05-12 10:00:03] Successful
----------------------------------------------------------------------------
Website reachable; no action needed.
----------------------------------------------------------------------------
★[2024-05-12 10:30:02] Successful
----------------------------------------------------------------------------
Website reachable; no action needed.
----------------------------------------------------------------------------
★[2024-05-12 11:00:02] Successful
----------------------------------------------------------------------------
Website unreachable; restarting related services and the network connection.
Checking the network connection and trying to restart the network...
Network restarted successfully.
Job for nginx.service failed because the control process exited with error code. See "systemctl status nginx.service" and "journalctl -xe" for details.
nginx is not running properly. Trying to restart the service.
nginx restarted successfully.
php-fpm-74 is running normally.
redis is running normally.
mysqld is running normally.
frpc is running normally.
----------------------------------------------------------------------------
★[2024-05-12 11:31:42] Successful
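The watchdog that wrote this log is not included in the post. The sketch below shows one way such a check could work, assuming the site is probed with curl and that the services in the log are systemd units with those exact names. `SITE_URL`, `site_is_up`, `ensure_service` and `watchdog` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; URL and unit names are assumptions.
SITE_URL="${SITE_URL:-http://127.0.0.1/}"
SERVICES="nginx php-fpm-74 redis mysqld frpc"

# Succeed if the URL answers with an HTTP status from 200 to 399 within 10 s.
site_is_up() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1") || return 1
    [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# Restart a unit only when systemd reports it as not active.
ensure_service() {
    if systemctl is-active --quiet "$1"; then
        echo "$1 is running."
    else
        echo "$1 is not running; restarting."
        systemctl restart "$1" && echo "$1 restarted."
    fi
}

watchdog() {
    if site_is_up "$SITE_URL"; then
        echo "Website reachable; no action needed."
        return 0
    fi
    echo "Website unreachable; restarting network and services."
    systemctl restart network
    for svc in $SERVICES; do
        ensure_service "$svc"
    done
}

# Act only when explicitly asked to (needs root for the restarts).
if [ "${1:-}" = "--run" ]; then
    watchdog
fi
```

A watchdog like this hides the symptom rather than explaining it, so it is worth logging the output of `systemctl status nginx.service` at the moment of failure as well.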
Final solution:
Possibility 2:
The two disks use different file systems (sda is XFS, sdb is ext4), which I suspected might be related, so I am running disk health checks and waiting for the results.
Detection method:
For CentOS 7, the following steps check and tune the disks: SMART tests with the smartctl tool, plus the common file system check tools.
1. Check disk health with smartctl
- Install smartmontools if it is not already installed:
yum install smartmontools
- View basic information about the disk:
smartctl -i /dev/sdb
- View all SMART information:
smartctl -a /dev/sdb
- Run a short self-test:
smartctl -t short /dev/sdb
You can check the test progress and result with:
smartctl -a /dev/sdb
- Run a long (extended) self-test:
smartctl -t long /dev/sdb
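When reading the SMART attribute table, a few raw values are the usual first indicators of bad sectors. The filter below is a small sketch that prints any of them that is non-zero. `smart_warnings` is a hypothetical helper, and the list of watched attributes is a common rule of thumb rather than an official threshold.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# watched attributes whose RAW_VALUE (10th column) is greater than zero.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}

# Usage on the server (as root):
#   smartctl -A /dev/sdb | smart_warnings
```

No output means none of the watched counters has moved. It does not replace a completed long self-test.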
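2. Check and repair file systems
Replace the partition names below with your own. A file system must be unmounted before it is checked, so a root file system has to be checked from rescue mode or a live system.
2.1 Check and repair an ext4 file system
- Unmount the file system:
umount /dev/sdb1
- Check and repair:
e2fsck -f /dev/sdb1
2.2 Check and repair an XFS file system
- Unmount the file system:
umount /dev/sda1
- Check and repair:
xfs_repair /dev/sda1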
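Because the two tools are not interchangeable, it is safer to confirm the file system type first and start with a read-only pass. The helper below is an illustrative sketch that only prints the matching no-modify command. `readonly_check_cmd` is a hypothetical name and the device in the usage line is a placeholder.

```shell
# Hypothetical helper: map a file system type to a check command that
# reports problems without changing anything on disk.
readonly_check_cmd() {
    case "$1" in
        ext2|ext3|ext4) echo "e2fsck -fn $2" ;;
        xfs)            echo "xfs_repair -n $2" ;;
        *)              echo "unsupported file system: $1" >&2; return 1 ;;
    esac
}

# Usage on the server (device name is a placeholder):
#   readonly_check_cmd "$(blkid -o value -s TYPE /dev/sdb1)" /dev/sdb1
```

Both `e2fsck -n` and `xfs_repair -n` only report what they would fix, which makes them a reasonable first step before a real repair.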
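3. Disk performance tuning
3.1 Tune the disk with hdparm
- Install hdparm:
yum install hdparm
- View disk information:
hdparm -I /dev/sda
- Enable DMA (only relevant to drives on the legacy IDE driver; SATA disks handled by libata already use DMA, and the command normally fails there with an ioctl error):
hdparm -d1 /dev/sda
- Run a read benchmark (cached and buffered disk reads):
hdparm -tT /dev/sda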
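3.2 Tune the system with tuned
- Install tuned:
yum install tuned
- Start the tuned service and enable it at boot:
systemctl start tuned
systemctl enable tuned
- List the available profiles:
tuned-adm list
- Apply a suitable profile (for example virtual-guest for a virtual machine; on a physical host, tuned-adm recommend suggests one):
tuned-adm profile virtual-guest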
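After switching profiles it is worth confirming which one is actually active. The one-line parser below is a sketch that assumes the usual `Current active profile: NAME` output of `tuned-adm active`. `active_profile` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the profile name from `tuned-adm active`
# output supplied on stdin.
active_profile() {
    sed -n 's/^Current active profile: //p'
}

# Usage on the server:
#   tuned-adm active | active_profile    # profile currently applied
#   tuned-adm recommend                  # profile tuned suggests for this host
```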
With these steps you can check the health of your disks and tune their performance, which helps keep the system stable. If a disk shows serious problems, back up your data promptly and consider replacing the disk.
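Stability is an important factor when choosing a file system. ext4 and XFS are two common Linux file systems; each has its strengths and suits different scenarios.
EXT4
Advantages:
- Maturity: ext4 is the successor to ext3 and builds on ext2 and ext3, with many years of production use and broad community support.
- Stability: ext4 is widely regarded as very stable and reliable, and has been proven in many production environments.
- Compatibility: ext4 is backward compatible with ext3 and ext2, so existing ext2/ext3 file systems can be upgraded to ext4 easily.
- Performance: for general-purpose servers and desktops, ext4 performs well, especially with small files and metadata operations.
Disadvantages:
- Scalability: ext4 does not scale as well as XFS, especially with very large file systems and very large files.
- Concurrent write performance: under highly concurrent writes, ext4 may perform worse than XFS.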
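XFS
Advantages:
- Performance: XFS handles large files and highly concurrent write workloads well, which suits large-scale data storage and processing.
- Scalability: XFS scales very well and supports very large files and file systems, making it a good choice for large storage applications that need to grow.
- Snapshots and backup: XFS has no built-in snapshot feature. Consistent snapshots are usually taken at the LVM layer (XFS can be frozen with xfs_freeze for this purpose), and xfsdump/xfsrestore provide file-system-level backup.
- Consistency: XFS is a journaling file system (as is ext4), which helps keep metadata consistent after a power loss or system crash.
Disadvantages:
- Recovery time: after an unclean shutdown, log replay at mount time is normally fast, but a full xfs_repair run can take a long time and a lot of memory on very large file systems.
- Fragmentation: XFS can become fragmented after running for a long time. This usually does not hurt performance much, but in some cases defragmentation (with xfs_fsr) is needed.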
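Recommendations
Stability and compatibility: if stability and compatibility are your main concern, especially for small files and general-purpose servers, ext4 is probably the better choice. It has proven stable and reliable through wide use and testing.
Performance and scalability: if you handle large files or highly concurrent writes, or need to manage very large file systems, XFS is probably the better fit. Its performance and scalability are stronger in those scenarios.
Conclusion
- Choose ext4 when you need a long-proven, stable file system, mainly for general-purpose servers, desktops, or small-file workloads.
- Choose XFS when you need high performance for large files and highly concurrent writes, or need to manage large data stores.
Pick the file system that matches your actual workload and requirements. Whichever you choose, test it thoroughly and make sure you have a proper backup and recovery plan.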
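Disk check
After about a day of checking, the report for sdb is below (the check on sda is still running). SMART reports the overall health as PASSED, and the reallocated, pending and offline-uncorrectable sector counts are all 0, so there is no sign of bad sectors. Note that the extended self-test recorded in this report was interrupted by a host reset with 90% remaining, so it has to be rerun to completion before the disk can be called fully verified.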
[root@localhost ~]# smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.118.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Re
Device Model: WDC WD4000FYYZ-01UL1B2
Serial Number: WD-WCC131982680
LU WWN Device Id: 0 000000 000000000
Firmware Version: M1.04Q.9
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
Local Time is: Tue May 21 17:27:43 2024 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity was never started. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 41) The self-test routine was interrupted by the host with a hard or soft reset.
Total time to complete Offline data collection: (39900) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 432) minutes.
SCT capabilities: (0x70b5) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0001   200   200   051    Pre-fail  Offline  -           0
  3 Spin_Up_Time            0x0001   192   142   021    Pre-fail  Offline  -           9391
  4 Start_Stop_Count        0x0000   100   100   000    Old_age   Offline  -           105
  5 Reallocated_Sector_Ct   0x0001   200   200   140    Pre-fail  Offline  -           0
  7 Seek_Error_Rate         0x0001   200   200   051    Pre-fail  Offline  -           0
  9 Power_On_Hours          0x0000   081   081   000    Old_age   Offline  -           13971
 10 Spin_Retry_Count        0x0001   100   100   051    Pre-fail  Offline  -           0
 11 Calibration_Retry_Count 0x0000   100   100   051    Old_age   Offline  -           0
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline  -           102
 16 Total_LBAs_Read         0x0000   003   197   000    Old_age   Offline  -           58652385959
184 End-to-End_Error        0x0001   100   100   097    Pre-fail  Offline  -           0
187 Reported_Uncorrect      0x0000   100   100   000    Old_age   Offline  -           0
188 Command_Timeout         0x0000   100   100   000    Old_age   Offline  -           0
190 Airflow_Temperature_Cel 0x0000   054   045   000    Old_age   Offline  -           46
192 Power-Off_Retract_Count 0x0000   200   200   000    Old_age   Offline  -           86
193 Load_Cycle_Count        0x0000   177   177   000    Old_age   Offline  -           69922
194 Temperature_Celsius     0x0000   106   097   000    Old_age   Offline  -           46
195 Hardware_ECC_Recovered  0x0000   200   200   000    Old_age   Offline  -           0
196 Reallocated_Event_Count 0x0000   200   200   000    Old_age   Offline  -           0
197 Current_Pending_Sector  0x0000   200   200   000    Old_age   Offline  -           0
198 Offline_Uncorrectable   0x0000   100   253   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline  -           0
200 Multi_Zone_Error_Rate   0x0001   100   253   051    Pre-fail  Offline  -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)  90%        13970            -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
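The self-test log near the end of such a report is what shows whether a long test really finished. The filter below is a minimal sketch that classifies the most recent entry (the line starting with `# 1`). `last_selftest_status` is a hypothetical helper, and it only recognises the few status strings listed in the code.

```shell
# Hypothetical helper: classify the newest entry of the SMART self-test log.
# Reads `smartctl -l selftest` (or `smartctl -a`) output on stdin.
last_selftest_status() {
    awk '/^# *1 / {
        if ($0 ~ /Completed without error/) print "completed"
        else if ($0 ~ /Interrupted/)        print "interrupted"
        else if ($0 ~ /in progress/)        print "running"
        else                                print "check manually"
        exit
    }'
}

# Usage on the server (as root):
#   smartctl -l selftest /dev/sdb | last_selftest_status
```

For the report above this prints `interrupted`, which is the cue to start `smartctl -t long /dev/sdb` again and avoid rebooting until it finishes.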