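From 502 to Troubleshooting: Case Studies of Common Nginx Failures
As an operations engineer, have you ever been jolted awake in the middle of the night by a 502 alert? Have you ever been worn down by an Nginx failure you could not explain? This article walks through real cases to show how Nginx troubleshooting is actually done, so that you can move from ops beginner to someone who diagnoses failures with confidence.
Introduction: The Nginx Pitfalls We Have All Hit
In operations work at an internet company, Nginx failures are among the most common and most frustrating problems. From simple configuration mistakes to complex performance bottlenecks, from an occasional 502 to sustained high latency, every failure has its own cause and its own fix.
With 8 years of operations experience, I have been through countless late-night incidents and distilled them into a troubleshooting method that works. In this article I use 10 real cases to show, step by step, how to quickly locate and resolve common Nginx failures.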
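Case 1: The Classic 502 Error - Upstream Service Unreachable
Symptoms
During a sales promotion, an e-commerce site suddenly began returning large numbers of 502 errors. Users could not place orders, and the business loss was severe.
Troubleshooting
Step 1: Check the Nginx error log
# Follow the most recent error log entries
tail -f /var/log/nginx/error.log
# A typical 502 error log entry
2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"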
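When the error log is busy, it can help to see which backend the failures point at. The following is a minimal sketch, assuming the standard error-log layout shown in the sample entry, where the backend appears as upstream: "http://host:port/...". The function name count_upstream_failures is hypothetical.

```shell
#!/usr/bin/env bash
# Read Nginx error-log lines on stdin and print "<backend> <count>",
# most frequent backend first. Only [error] lines that carry an
# upstream: "scheme://host:port/..." field are counted.
# (count_upstream_failures is an illustrative name, not an Nginx tool.)
count_upstream_failures() {
    sed -nE 's#.*\[error\].*upstream: "[a-z]+://([^/"]+).*#\1#p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'
}

# Usage (path taken from the article):
#   count_upstream_failures < /var/log/nginx/error.log
```

A backend that dominates this list is the first place to run the service and connectivity checks.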
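Step 2: Check the upstream service
# Check whether the backend service is running
netstat -tulpn | grep 8080
ps aux | grep java
# Test connectivity to the upstream service
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080
Step 3: Review the Nginx configuration
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}
server {
    listen 80;
    server_name shop.example.com;
    location /api/ {
        proxy_pass http://backend_servers;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
    }
}
Root cause
The investigation showed that the Java application on the primary server, 192.168.1.200, had crashed under excessive load, and that the backup server was misconfigured and did not take over the traffic in time.
Solution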
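# 1. Restart the application on the failed server
systemctl restart tomcat
# 2. Fix the backup server configuration
# Remove the backup parameter so that both servers handle requests
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
    server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}
# 3. Test and reload the Nginx configuration
nginx -t && nginx -s reload
Prevention
• Configure health checks
• Choose an appropriate load-balancing strategy
• Build thorough monitoring and alerting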
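Case 2: Service Outage Caused by an Expired SSL Certificate
Symptoms
Customers of a financial website reported that they could not access the site; the browser showed a "Your connection is not private" error.
Troubleshooting
Check the SSL certificate status
# Show the certificate validity dates
openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates
# Check the certificate that is actually being served
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates
# Show the Nginx SSL configuration
nginx -T | grep -A 10 -B 5 ssl_certificate
Example Nginx SSL configuration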
server {
    listen 443 ssl http2;
    server_name finance.example.com;
    ssl_certificate /etc/nginx/ssl/domain.crt;
    ssl_certificate_key /etc/nginx/ssl/domain.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    # HSTS
    add_header Strict-Transport-Security "max-age=31536000" always;
}
Solution
# 1. Issue a new SSL certificate (using Let's Encrypt as an example)
certbot --nginx -d finance.example.com
# 2. Or update the certificate paths in the Nginx configuration manually
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;
# 3. Test and reload the configuration
nginx -t && nginx -s reload
# 4. Verify the SSL certificate
curl -I https://finance.example.com
Automation
# Create the certificate renewal cron job (entries in /etc/cron.d need a user field)
cat > /etc/cron.d/certbot << 'EOF'
0 12 * * * root /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF
# Add a certificate monitoring script
cat > /usr/local/bin/ssl_check.sh << 'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30
EXPIRY_DATE=$(echo | openssl s_client -connect "$DOMAIN:443" 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
if [ "$DAYS_LEFT" -lt "$DAYS" ]; then
    echo "SSL certificate for $DOMAIN expires in $DAYS_LEFT days!"
    # Send an alert
fi
EOF
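Case 3: Performance Bottleneck Under High Concurrency
Symptoms
A video site responded slowly during the evening peak, and some users reported that videos failed to load.
Performance analysis tools
# Check Nginx connection status
curl http://localhost/nginx_status
# Check system load with htop
htop
# Count listening sockets and connections involving port 80
ss -tuln | wc -l
netstat -an | grep :80 | wc -l
Nginx status page configuration
server {
    listen 80;
    server_name localhost;
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Performance tuning configuration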
# Main context tuning
worker_processes auto;
worker_rlimit_nofile 65535;
events {
    use epoll;
    multi_accept on;
    # worker_connections is only valid inside the events block
    worker_connections 65535;
}
http {
    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_types text/plain text/css application/json application/javascript;
    # File cache tuning
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    # Connection tuning
    keepalive_timeout 65;
    keepalive_requests 100;
    # Buffer tuning
    client_body_buffer_size 128k;
    client_max_body_size 50m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
}
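As a rough capacity check for settings like these: worker_connections counts every connection a worker holds, including connections to proxied servers, so the theoretical ceiling is worker_processes times worker_connections, and roughly half of that when every client request is proxied upstream. The sketch below only does that arithmetic; the helper name max_clients is hypothetical, and real limits also depend on file descriptor limits and memory.

```shell
#!/usr/bin/env bash
# max_clients WORKERS CONNECTIONS [MODE]
# Theoretical connection ceiling. MODE "proxy" halves the result because
# each proxied client request also holds one upstream connection.
# (max_clients is an illustrative helper, not an Nginx command.)
max_clients() {
    local workers=$1 conns=$2 mode=${3:-direct}
    local total=$(( workers * conns ))
    if [ "$mode" = "proxy" ]; then
        total=$(( total / 2 ))
    fi
    echo "$total"
}

max_clients 4 65535          # direct serving: prints 262140
max_clients 4 65535 proxy    # reverse proxy:  prints 131070
```

The worker count of 4 is only an example; with worker_processes auto it equals the number of CPU cores.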
System-level tuning
# Tune kernel parameters
cat >> /etc/sysctl.conf << 'EOF'
# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
# File descriptor tuning
fs.file-max = 1000000
EOF
# Apply the settings
sysctl -p
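Case 4: Problems Caused by Incorrect Cache Configuration
Symptoms
After a news site published updated content, users still saw the old content, and the problem persisted even after they cleared the browser cache.
Cache configuration analysis
server {
    listen 80;
    server_name news.example.com;
    # Static asset caching
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Pragma public;
    }
    # Dynamic content
    location / {
        proxy_pass http://backend;
        # Incorrect cache configuration
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
Troubleshooting
# Inspect the cache directory
ls -la /var/cache/nginx/
# Show the cache configuration
nginx -T | grep -A 20 proxy_cache
# Check the cache status of a response
curl -I http://news.example.com/article/123 | grep X-Cache-Status
Correct cache configuration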
http {
    # Cache path
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;
    server {
        listen 80;
        server_name news.example.com;
        # Do not cache API endpoints
        location /api/ {
            proxy_pass http://backend;
            proxy_cache off;
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }
        # Cache news articles
        location /article/ {
            proxy_pass http://backend;
            proxy_cache my_cache;
            proxy_cache_valid 200 5m;
            proxy_cache_use_stale error timeout updating;
            add_header X-Cache-Status $upstream_cache_status;
        }
        # Long-lived caching for static assets
        location ~* \.(jpg|jpeg|png|gif|ico)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
        }
        location ~* \.(css|js)$ {
            expires 1d;
            add_header Cache-Control "public";
        }
    }
}
Cache management tools
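# Purge the cache for a specific URL (PURGE requires purge support, for example the third-party ngx_cache_purge module)
curl -X PURGE http://news.example.com/article/123
# Bulk-delete cache files older than 7 days (cache files are named after the key hash and have no extension)
find /var/cache/nginx -type f -mtime +7 -delete
# Cache statistics script
cat > /usr/local/bin/cache_stats.sh << 'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo "Cache directory size: $(du -sh "$CACHE_DIR")"
echo "Cache files count: $(find "$CACHE_DIR" -type f | wc -l)"
# Counts log lines containing HIT; requires $upstream_cache_status in the access log format
echo "Cache hits logged: $(grep -c HIT /var/log/nginx/access.log)"
EOF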
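Counting HIT lines gives a raw number rather than a rate. A minimal sketch of an actual hit ratio follows, assuming (this is not in the configuration above) that the access log format ends each line with $upstream_cache_status as its last field. The function name cache_hit_ratio is hypothetical.

```shell
#!/usr/bin/env bash
# Read access-log lines on stdin and print the share of cacheable
# requests answered from cache. ASSUMPTION: the last field of every
# line is $upstream_cache_status (HIT, MISS, EXPIRED, ...).
# Lines without a recognised status (for example "-") are ignored.
cache_hit_ratio() {
    awk '
        $NF ~ /^(HIT|MISS|EXPIRED|STALE|UPDATING|REVALIDATED|BYPASS)$/ {
            total++
            if ($NF == "HIT") hits++
        }
        END {
            if (total == 0) { print "n/a"; exit }
            printf "%.1f%%\n", hits * 100 / total
        }'
}

# Usage: cache_hit_ratio < /var/log/nginx/access.log
```

A ratio that stays near zero for /article/ traffic usually means responses are not being stored at all, which is worth checking against the X-Cache-Status header.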
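Case 5: Disk Space Exhausted by Broken Log Rotation
Symptoms
A server suddenly stopped responding. Investigation showed the disk was 100% full, mostly because the Nginx log files had grown too large.
Diagnosis
# Check disk space
df -h
# Find large files
du -h /var/log/nginx/ | sort -hr
# Check the log rotation configuration
cat /etc/logrotate.d/nginx
Fix and optimization
# Emergency measure: truncate the current logs
> /var/log/nginx/access.log
> /var/log/nginx/error.log
# Signal Nginx to reopen its log files (no restart needed)
nginx -s reopen
Improved log rotation configuration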
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 nginx nginx
    sharedscripts
    postrotate
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}
Logging configuration tuning
http {
    # Custom log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';
    # Conditional logging: skip 2xx and 3xx responses
    map $status $loggable {
        ~^[23] 0;
        default 1;
    }
    server {
        # Log only error responses
        access_log /var/log/nginx/access.log main if=$loggable;
        # Do not log static assets
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
            access_log off;
            expires 1y;
        }
    }
}
Monitoring script
# Disk space monitoring
cat > /usr/local/bin/disk_monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
    # Automatically clean up old rotated logs
    find /var/log/nginx -name "*.log.*" -mtime +7 -delete
    # Send an alert
fi
EOF
Case 6: Incorrect Load-Balancing Configuration
Symptoms
A service ran on several backend servers, but traffic was distributed unevenly: some servers were overloaded while others sat idle.
Load-balancing strategies compared
# Round robin (default)
upstream backend_round_robin {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
# Weighted round robin
upstream backend_weighted {
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=2;
    server 192.168.1.12:8080 weight=1;
}
# IP hash
upstream backend_ip_hash {
    ip_hash;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
# Least connections
upstream backend_least_conn {
    least_conn;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
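Before changing strategy, it helps to measure how requests are actually spread across backends. The sketch below is one way to do that, assuming (this is not part of the configurations above) that the access log format includes a field written as upstream=$upstream_addr. The function name upstream_distribution is hypothetical.

```shell
#!/usr/bin/env bash
# Read access-log lines on stdin and print "<backend> <requests> <share>"
# per backend, busiest first. ASSUMPTION: the log format contains a field
# "upstream=$upstream_addr". When Nginx retried a request, $upstream_addr
# lists several servers; only the first attempted server is counted.
upstream_distribution() {
    awk '
        match($0, /upstream=[^ ]+/) {
            addr = substr($0, RSTART + 9, RLENGTH - 9)  # skip "upstream="
            sub(/,$/, "", addr)                         # drop list separator
            count[addr]++
            total++
        }
        END {
            for (a in count)
                printf "%s %d %.1f%%\n", a, count[a], count[a] * 100 / total
        }' | sort -k2,2nr -k1,1
}

# Usage: upstream_distribution < /var/log/nginx/access.log
```

With plain round robin the shares should be close to equal; a heavy skew under ip_hash often points to many clients arriving through the same NAT or proxy address.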
Health check configuration
upstream backend_with_health {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
    # Keepalive connection pool
    keepalive 32;
}
server {
    location / {
        proxy_pass http://backend_with_health;
        # Failover to the next upstream
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;
        # Connection reuse
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
Monitoring script
# Backend server health check script
cat > /usr/local/bin/backend_health_check.sh << 'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080" "192.168.1.11:8080" "192.168.1.12:8080")
for server in "${SERVERS[@]}"; do
    if curl -sf "http://$server/health" > /dev/null; then
        echo "$server: OK"
    else
        echo "$server: FAILED"
        # Send an alert
    fi
done
EOF
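Case 7: Security Configuration Weaknesses
Symptoms
A website was hit by malicious scanning that exposed several security weaknesses, so the Nginx security configuration had to be hardened.
Security hardening configuration
# limit_req_zone is only valid in the http context, so the zones are declared outside the server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;
server {
    listen 80;
    server_name secure.example.com;
    # Hide version information (more_set_headers requires the headers-more module)
    server_tokens off;
    more_set_headers "Server: WebServer";
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;
    # Restrict request methods
    if ($request_method !~ ^(GET|HEAD|POST)$) {
        return 405;
    }
    # Deny access to hidden files (paths containing a dot-prefixed segment)
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }
    # Limit upload size
    client_max_body_size 10M;
    # Limit request rate
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
    }
    location /login {
        limit_req zone=login burst=5 nodelay;
        proxy_pass http://backend;
    }
}
Protection script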
# Example fail2ban configuration (<HOST> is fail2ban's placeholder for the client address)
cat > /etc/fail2ban/filter.d/nginx-4xx.conf << 'EOF'
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" (404|403|400) .*$
ignoreregex =
EOF
cat > /etc/fail2ban/jail.local << 'EOF'
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60
EOF
Case 8: Reverse Proxy Configuration Problems
Symptoms
With Nginx acting as a reverse proxy, the client's real IP address was lost, and the backend service could not obtain correct client information.
Analysis and fix
server {
    listen 80;
    server_name api.example.com;
    location / {
        proxy_pass http://backend;
        # Pass the client IP through correctly
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Redirect handling
        proxy_redirect off;
        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;
    }
}
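On the backend side, the X-Forwarded-For value built by $proxy_add_x_forwarded_for is a comma-separated list: whatever the client sent, followed by the address Nginx saw. A minimal sketch of reading the first entry is shown below; the function name client_ip_from_xff is hypothetical, and the first entry is client-supplied, so it should only be trusted when the edge proxy overwrites or validates the header.

```shell
#!/usr/bin/env bash
# Print the first (leftmost) address from an X-Forwarded-For value.
# CAUTION: the leftmost entry comes from the client and can be forged
# unless the outermost proxy replaces the header.
# (client_ip_from_xff is an illustrative helper name.)
client_ip_from_xff() {
    first=${1%%,*}                          # keep text before the first comma
    printf '%s\n' "$first" | tr -d ' \t'    # drop stray whitespace
}

# Usage: client_ip_from_xff "203.0.113.7, 10.0.0.1, 10.0.0.2"
```

For access control decisions, the X-Real-IP header set from $remote_addr by a trusted proxy is usually the safer input.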
WebSocket support
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}
server {
    listen 80;
    server_name ws.example.com;
    location /websocket {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        # WebSocket-specific setting: keep idle connections open for up to a day
        proxy_read_timeout 86400;
    }
}
Case 9: Conflicting URL Rewrite Rules
Symptoms
A site with complex URL rewrite rules ran into redirect loops and 404 errors.
Rewrite rule cleanup
server {
    listen 80;
    server_name example.com www.example.com;
    # Force redirection to the primary domain
    if ($host != 'example.com') {
        return 301 https://example.com$request_uri;
    }
    # SEO-friendly URL rewriting
    location / {
        try_files $uri $uri/ @rewrites;
    }
    location @rewrites {
        rewrite ^/product/([0-9]+)$ /product.php?id=$1 last;
        rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1 last;
        rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1 last;
        return 404;
    }
    # Prevent redirect loops: only pass existing PHP files to FastCGI
    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        include fastcgi_params;
    }
}
Debugging rewrite rules
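# Enable rewrite logging
error_log /var/log/nginx/rewrite.log notice;
rewrite_log on;
# Test a rewrite rule
location /test {
    # No flag on the rewrite, so the return directive below still runs
    rewrite ^/test/(.*)$ /debug?param=$1;
    return 200 "Rewrite test: $args\n";
}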
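Rewrite patterns can also be checked offline before reloading Nginx. The sketch below mirrors the three rules from the @rewrites location with sed and prints the rewritten URI, or 404 when nothing matches. It is only an approximation for quick checks: sed extended regular expressions are close to, but not the same as, the PCRE engine Nginx uses, and the function name preview_rewrite is hypothetical.

```shell
#!/usr/bin/env bash
# Print the target URI that the @rewrites rules would produce for a
# request path, or "404" if no rule matches. Each "t" ends processing
# as soon as one substitution succeeded, like the "last" flag.
# (preview_rewrite is an illustrative helper, not an Nginx tool.)
preview_rewrite() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^/product/([0-9]+)$#/product.php?id=\1#;t' \
        -e 's#^/category/([a-zA-Z0-9-]+)$#/category.php?name=\1#;t' \
        -e 's#^/user/([a-zA-Z0-9]+)$#/profile.php?username=\1#;t' \
        -e 's#.*#404#'
}

# Usage:
#   preview_rewrite /product/42
#   preview_rewrite /category/home-garden
```

Running a list of known URLs through such a preview makes overlapping or shadowed rules visible before they reach production.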
Case 10: Performance Monitoring and Tuning
Symptoms
A complete Nginx performance monitoring setup was needed so that performance problems could be detected and resolved promptly.
Monitoring script
# Nginx performance monitoring script
cat > /usr/local/bin/nginx_monitor.sh << 'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"
# Fetch the status page
STATUS=$(curl -s "$NGINX_STATUS_URL")
ACTIVE_CONN=$(echo "$STATUS" | grep "Active connections" | awk '{print $3}')
# The accepts/handled/requests counters are on line 3 of the stub_status output
ACCEPTS=$(echo "$STATUS" | awk 'NR==3 {print $1}')
HANDLED=$(echo "$STATUS" | awk 'NR==3 {print $2}')
REQUESTS=$(echo "$STATUS" | awk 'NR==3 {print $3}')
# Reading/Writing/Waiting are on line 4
READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')
# Write to the log
echo "$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >> "$LOG_FILE"
# Alerting logic
if [ "$ACTIVE_CONN" -gt 1000 ]; then
    echo "High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
EOF
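One derived metric worth tracking from the same status page is the gap between accepts and handled. The two counters are normally equal; a growing difference suggests connections are being dropped, for example because a resource limit such as worker_connections was reached. The sketch below parses stub_status text from stdin; the function name parse_stub_status and the dropped field are hypothetical names for illustration.

```shell
#!/usr/bin/env bash
# Parse stub_status output on stdin into one key=value line and add
# dropped = accepts - handled.
# (parse_stub_status and "dropped" are illustrative names.)
parse_stub_status() {
    awk '
        /^Active connections:/           { active = $3 }
        /^ *[0-9]+ +[0-9]+ +[0-9]+ *$/   { accepts = $1; handled = $2; requests = $3 }
        /^Reading:/                      { reading = $2; writing = $4; waiting = $6 }
        END {
            printf "active=%d accepts=%d handled=%d requests=%d reading=%d writing=%d waiting=%d dropped=%d\n",
                active, accepts, handled, requests, reading, writing, waiting, accepts - handled
        }'
}

# Usage: curl -s http://localhost/nginx_status | parse_stub_status
```

Because the counters are cumulative since startup, alerting on the change in dropped between two samples is more useful than alerting on the absolute value.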
Consolidated tuning configuration
# Final tuned configuration
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    use epoll;
    worker_connections 10240;
    multi_accept on;
    accept_mutex off;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    # Log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" $request_time $upstream_response_time';
    # Performance
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
    # Compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    # File cache
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    # Security
    server_tokens off;
    client_header_timeout 10;
    client_body_timeout 10;
    reset_timedout_connection on;
    send_timeout 10;
    # Rate limiting
    limit_req_zone $binary_remote_addr zone=global:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=addr:10m;
    include /etc/nginx/conf.d/*.conf;
}
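Troubleshooting Methodology Summary
1. A standardized troubleshooting process
1. Gather incident information: confirm the symptoms, the scope of impact, and when it started
2. Check the logs: error.log, access.log, and system logs
3. Check the configuration: syntax checks and logic checks
4. Verify network connectivity: port state and connectivity tests
5. Analyze performance metrics: CPU, memory, network, and disk
6. Determine the root cause: dig deeper until the real cause is found
7. Apply the fix: a temporary mitigation first, then a permanent solution
8. Verify the fix: functional tests and performance tests
9. Capture lessons learned: document the incident and improve the process
2. Commonly used troubleshooting tools
• Log analysis: tail, grep, awk, sed
• Network tools: curl, wget, telnet, netstat, ss
• Performance monitoring: htop, iotop, iftop, nginx-status
• System diagnostics: strace, lsof, tcpdump
3. Preventive measures
• Build thorough monitoring and alerting
• Back up configuration files regularly
• Adopt automated operations tooling
• Define standardized operating procedures
• Run failure drills regularly
Conclusion
Nginx troubleshooting is a core skill every operations engineer needs; it takes a solid theoretical foundation and extensive hands-on experience.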
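Original title: From 502 to Troubleshooting: Case Studies of Common Nginx Failures
Source: WeChat ID magedu-Linux, WeChat official account 马哥Linux运维. Please credit the source when reposting.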