Nginx常见故障案例总结-电子发烧友网

从502到排障：Nginx常见故障分析案例

作为一名运维工程师，你是否曾在深夜被502错误的报警电话惊醒？是否因为神秘的Nginx故障而焦头烂额？本文将通过真实案例，带你深入Nginx故障排查的精髓，让你从运维小白进阶为故障排查专家。

引言：那些年我们踩过的Nginx坑

在互联网公司的运维生涯中，Nginx故障可以说是最常见也最让人头疼的问题之一。从简单的配置错误到复杂的性能瓶颈，从偶发的502到持续的高延迟，每一个故障背后都有其独特的原因和解决方案。

作为拥有8年运维经验的工程师，我见证了无数次午夜故障处理，也总结出了一套行之有效的故障排查方法论。今天，我将通过10个真实案例，手把手教你如何快速定位和解决Nginx常见故障。

案例一：经典502错误 - 上游服务不可达

故障现象

某电商网站在促销活动期间突然出现大量502错误，用户无法正常下单，业务损失严重。

故障排查过程

第一步：查看Nginx错误日志

# 查看最新的错误日志
tail-f /var/log/nginx/error.log

# 典型502错误日志
2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused)whileconnecting to upstream, client: 192.168.1.100, server: shop.example.com, request:"POST /api/order HTTP/1.1", upstream:"http://192.168.1.200:8080/api/order", host:"shop.example.com"

第二步：检查上游服务状态

# 检查后端服务是否正常运行
netstat -tulpn | grep 8080
ps aux | grep java

# 测试上游服务连通性
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080

第三步：分析Nginx配置

upstreambackend_servers {
server192.168.1.200:8080weight=1max_fails=3fail_timeout=30s;
server192.168.1.201:8080weight=1max_fails=3fail_timeout=30sbackup;
}

server{
listen80;
server_nameshop.example.com;

location/api/ {
proxy_passhttp://backend_servers;
proxy_connect_timeout5s;
proxy_read_timeout60s;
proxy_send_timeout60s;
  }
}

根因分析

通过排查发现，主服务器192.168.1.200由于负载过高导致Java应用崩溃，而备份服务器配置有误未能及时接管流量。

解决方案

# 1. 重启故障服务器的应用
systemctl restart tomcat

# 2. 修复备份服务器配置
# 将backup参数移除，让两台服务器同时处理请求
upstream backend_servers {
  server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
  server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}

# 3. 重载Nginx配置
nginx -t && nginx -s reload

预防措施

• 配置健康检查机制

• 设置合理的负载均衡策略

• 建立完善的监控告警体系

案例二：SSL证书过期导致的服务中断

故障现象

某金融网站客户反馈无法访问，浏览器显示"您的连接不是私密连接"错误。

故障排查过程

检查SSL证书状态

# 查看证书到期时间
openssl x509 -in/etc/nginx/ssl/domain.crt -noout -dates

# 使用openssl检查在线证书
echo| openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates

# 查看Nginx SSL配置
nginx -T | grep -A 10 -B 5 ssl_certificate

Nginx SSL配置示例

server{
listen443ssl http2;
server_namefinance.example.com;

ssl_certificate/etc/nginx/ssl/domain.crt;
ssl_certificate_key/etc/nginx/ssl/domain.key;
ssl_protocolsTLSv1.2TLSv1.3;
ssl_ciphersECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphersoff;

# HSTS设置
add_headerStrict-Transport-Security"max-age=31536000"always;
}

解决方案

# 1. 生成新的SSL证书（以Let's Encrypt为例）
certbot --nginx -d finance.example.com

# 2. 手动更新证书配置
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;

# 3. 测试并重载配置
nginx -t && nginx -s reload

# 4. 验证SSL证书
curl -I https://finance.example.com

自动化解决方案

# 创建证书更新脚本
cat> /etc/cron.d/certbot << 'EOF'
0 12 * * * /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF

# 添加证书监控脚本
cat > /usr/local/bin/ssl_check.sh << 'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30

EXPIRY_DATE=$(echo | openssl s_client -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate |cut-d= -f2)
EXPIRY_EPOCH=$(date-d"$EXPIRY_DATE"+%s)
CURRENT_EPOCH=$(date+%s)
DAYS_LEFT=$(( ($EXPIRY_EPOCH-$CURRENT_EPOCH) /86400))

if[$DAYS_LEFT-lt$DAYS];then
echo"SSL certificate for$DOMAINexpires in$DAYS_LEFTdays!"
# 发送告警
fi
EOF

案例三：高并发下的性能瓶颈

故障现象

某视频网站在晚高峰期间响应缓慢，部分用户反馈视频加载失败。

性能分析工具

# 查看Nginx连接状态
curl http://localhost/nginx_status

# 使用htop查看系统负载
htop

# 检查网络连接数
ss -tuln |wc-l
netstat -an | grep :80 |wc-l

Nginx状态页配置

server{
listen80;
server_namelocalhost;

location/nginx_status {
stub_statuson;
access_logoff;
allow127.0.0.1;
denyall;
  }
}

性能优化配置

# 主配置优化
worker_processesauto;
worker_connections65535;
worker_rlimit_nofile65535;

events{
useepoll;
multi_accepton;
worker_connections65535;
}

http{
# 开启gzip压缩
gzipon;
gzip_varyon;
gzip_min_length1000;
gzip_typestext/plain text/css application/json application/javascript;

# 缓存优化
open_file_cachemax=100000inactive=20s;
open_file_cache_valid30s;
open_file_cache_min_uses2;
open_file_cache_errorson;

# 连接优化
keepalive_timeout65;
keepalive_requests100;

# 缓冲区优化
client_body_buffer_size128k;
client_max_body_size50m;
client_header_buffer_size1k;
large_client_header_buffers44k;
}

系统层面优化

# 优化系统参数
cat>> /etc/sysctl.conf << 'EOF'
# 网络优化
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000

# 文件描述符优化
fs.file-max = 1000000
EOF

# 应用配置
sysctl -p

案例四：缓存配置错误导致的问题

故障现象

某新闻网站更新内容后，用户仍然看到旧内容，清除浏览器缓存后问题依然存在。

缓存配置分析

server{
listen80;
server_namenews.example.com;

# 静态资源缓存
location~* .(jpg|jpeg|png|gif|ico|css|js)${
expires1y;
add_headerCache-Control"public, immutable";
add_headerPragma public;
  }

# 动态内容
location/ {
proxy_passhttp://backend;

# 错误的缓存配置
proxy_cache_valid20030210m;
proxy_cache_valid4041m;
add_headerX-Cache-Status$upstream_cache_status;
  }
}

问题排查

# 检查缓存目录
ls-la /var/cache/nginx/

# 查看缓存配置
nginx -T | grep -A 20 proxy_cache

# 测试缓存状态
curl -I http://news.example.com/article/123 | grep X-Cache-Status

正确的缓存配置

http{
# 缓存路径配置
proxy_cache_path/var/cache/nginx levels=1:2keys_zone=my_cache:10mmax_size=10ginactive=60muse_temp_path=off;

server{
listen80;
server_namenews.example.com;

# API接口不缓存
location/api/ {
proxy_passhttp://backend;
proxy_cacheoff;
add_headerCache-Control"no-cache, no-store, must-revalidate";
    }

# 新闻内容缓存
location/article/ {
proxy_passhttp://backend;
proxy_cachemy_cache;
proxy_cache_valid2005m;
proxy_cache_use_staleerrortimeout updating;
add_headerX-Cache-Status$upstream_cache_status;
    }

# 静态资源长期缓存
location~* .(jpg|jpeg|png|gif|ico)${
expires1y;
add_headerCache-Control"public, immutable";
    }

location~* .(css|js)${
expires1d;
add_headerCache-Control"public";
    }
  }
}

缓存管理工具

# 清除特定URL缓存
curl -X PURGE http://news.example.com/article/123

# 批量清除缓存
find /var/cache/nginx -typef -name"*.cache"-mtime +7 -delete

# 缓存统计脚本
cat> /usr/local/bin/cache_stats.sh << 'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo"Cache directory size: $(du -sh $CACHE_DIR)"
echo"Cache files count: $(find $CACHE_DIR -type f | wc -l)"
echo"Cache hit rate: $(grep -c HIT /var/log/nginx/access.log)"
EOF

案例五：日志轮转异常导致磁盘空间耗尽

故障现象

服务器突然无法响应，检查发现磁盘空间100%占用，主要是Nginx日志文件过大。

问题诊断

# 检查磁盘空间
df-h

# 找出大文件
du-h /var/log/nginx/ |sort-hr

# 检查日志轮转配置
cat/etc/logrotate.d/nginx

修复和优化

# 紧急处理：截断当前日志
> /var/log/nginx/access.log
> /var/log/nginx/error.log

# 重启nginx以重新打开日志文件
nginx -s reopen

优化的日志轮转配置

# /etc/logrotate.d/nginx
/var/log/nginx/*.log{
  daily
  missingok
  rotate 14
  compress
  delaycompress
  notifempty
  create 640 nginx nginx
  sharedscripts
  postrotate
if[ -f /var/run/nginx.pid ];then
kill-USR1 `cat/var/run/nginx.pid`
fi
  endscript
}

日志配置优化

http{
# 自定义日志格式
log_formatmain'$remote_addr-$remote_user[$time_local] "$request" '
'$status$body_bytes_sent"$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'rt=$request_timeuct="$upstream_connect_time" '
'uht="$upstream_header_time" urt="$upstream_response_time"';

# 条件日志记录
map$status$loggable{
    ~^[23] 0;
default1;
  }

server{
# 只记录错误请求
access_log/var/log/nginx/access.log main if=$loggable;

# 静态资源不记录日志
location~* .(jpg|jpeg|png|gif|ico|css|js)${
access_logoff;
expires1y;
    }
  }
}

监控脚本

# 磁盘空间监控
cat> /usr/local/bin/disk_monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')

if [ $USAGE -gt $THRESHOLD ]; then
echo"Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
# 自动清理老日志
    find /var/log/nginx -name "*.log.*" -mtime +7 -delete
# 发送告警
fi
EOF

案例六：负载均衡配置错误

故障现象

某服务采用多台后端服务器，但发现流量分配不均，部分服务器负载过高而其他服务器闲置。

负载均衡策略对比

# 轮询（默认）
upstreambackend_round_robin {
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}

# 加权轮询
upstreambackend_weighted {
server192.168.1.10:8080weight=3;
server192.168.1.11:8080weight=2;
server192.168.1.12:8080weight=1;
}

# IP哈希
upstreambackend_ip_hash {
  ip_hash;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}

# 最少连接
upstreambackend_least_conn {
  least_conn;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}

健康检查配置

upstreambackend_with_health {
server192.168.1.10:8080max_fails=3fail_timeout=30s;
server192.168.1.11:8080max_fails=3fail_timeout=30s;
server192.168.1.12:8080max_fails=3fail_timeout=30sbackup;

# keepalive连接池
keepalive32;
}

server{
location/ {
proxy_passhttp://backend_with_health;

# 健康检查相关
proxy_next_upstreamerrortimeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries2;
proxy_next_upstream_timeout5s;

# 连接复用
proxy_http_version1.1;
proxy_set_headerConnection"";
  }
}

监控脚本

# 后端服务器健康检查脚本
cat> /usr/local/bin/backend_health_check.sh << 'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080""192.168.1.11:8080""192.168.1.12:8080")

for server in"${SERVERS[@]}"; do
if curl -sf "http://$server/health" > /dev/null;then
echo"$server: OK"
else
echo"$server: FAILED"
# 发送告警
fi
done
EOF

案例七：安全配置漏洞

故障现象

网站被恶意扫描，发现存在多个安全漏洞，需要加强Nginx安全配置。

安全加固配置

server{
listen80;
server_namesecure.example.com;

# 隐藏版本信息
server_tokensoff;
more_set_headers"Server: WebServer";

# 安全头设置
add_headerX-Frame-Options"SAMEORIGIN"always;
add_headerX-XSS-Protection"1; mode=block"always;
add_headerX-Content-Type-Options"nosniff"always;
add_headerReferrer-Policy"no-referrer-when-downgrade"always;
add_headerContent-Security-Policy"default-src 'self' http: https: data: blob: 'unsafe-inline'"always;

# 限制请求方法
if($request_method!~ ^(GET|HEAD|POST)$) {
return405;
  }

# 防止目录遍历
location~ /.{
denyall;
access_logoff;
log_not_foundoff;
  }

# 限制文件上传大小
client_max_body_size10M;

# 限制请求频率
limit_req_zone$binary_remote_addrzone=api:10mrate=10r/s;
limit_req_zone$binary_remote_addrzone=login:10mrate=1r/s;

location/api/ {
limit_reqzone=api burst=20nodelay;
proxy_passhttp://backend;
  }

location/login {
limit_reqzone=login burst=5nodelay;
proxy_passhttp://backend;
  }
}

防护脚本

# fail2ban配置示例
cat> /etc/fail2ban/filter.d/nginx-4xx.conf << 'EOF'
[Definition]
failregex = ^ -.*"(GET|POST).*"(404|403|400) .*$
ignoreregex =
EOF

cat> /etc/fail2ban/jail.local << 'EOF'
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60
EOF

案例八：反向代理配置问题

故障现象

使用Nginx作为反向代理时，客户端真实IP丢失，后端服务无法获取正确的客户端信息。

问题分析和解决

server{
listen80;
server_nameapi.example.com;

location/ {
proxy_passhttp://backend;

# 正确传递客户端IP
proxy_set_headerHost$host;
proxy_set_headerX-Real-IP$remote_addr;
proxy_set_headerX-Forwarded-For$proxy_add_x_forwarded_for;
proxy_set_headerX-Forwarded-Proto$scheme;

# 处理重定向
proxy_redirectoff;

# 超时设置
proxy_connect_timeout30s;
proxy_send_timeout30s;
proxy_read_timeout30s;

# 缓冲设置
proxy_bufferingon;
proxy_buffer_size4k;
proxy_buffers84k;
proxy_busy_buffers_size8k;
  }
}

WebSocket支持

map$http_upgrade$connection_upgrade{
defaultupgrade;
  '' close;
}

server{
listen80;
server_namews.example.com;

location/websocket {
proxy_passhttp://backend;
proxy_http_version1.1;
proxy_set_headerUpgrade$http_upgrade;
proxy_set_headerConnection$connection_upgrade;
proxy_set_headerHost$host;
proxy_cache_bypass$http_upgrade;

# WebSocket特殊配置
proxy_read_timeout86400;
  }
}

案例九：URL重写规则冲突

故障现象

网站URL重写规则复杂，出现重定向循环和404错误。

重写规则优化

server{
listen80;
server_nameexample.com www.example.com;

# 强制跳转到主域名
if($host!='example.com') {
return301https://example.com$request_uri;
  }

# SEO友好的URL重写
location/ {
try_files$uri$uri/@rewrites;
  }

location@rewrites{
rewrite^/product/([0-9]+)$/product.php?id=$1last;
rewrite^/category/([a-zA-Z0-9-]+)$/category.php?name=$1last;
rewrite^/user/([a-zA-Z0-9]+)$/profile.php?username=$1last;
return404;
  }

# 防止重定向循环
location~ .php${
try_files$uri=404;
fastcgi_pass127.0.0.1:9000;
fastcgi_indexindex.php;
includefastcgi_params;
  }
}

调试重写规则

# 开启重写日志
error_log/var/log/nginx/rewrite.lognotice;
rewrite_logon;

# 测试重写规则
location/test {
rewrite^/test/(.*)$/debug?param=$1break;
return200"Rewrite test:$args
";
}

案例十：性能监控与调优

故障现象

需要建立完善的Nginx性能监控体系，及时发现和解决性能问题。

监控脚本

# Nginx性能监控脚本
cat> /usr/local/bin/nginx_monitor.sh << 'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"

# 获取状态信息
STATUS=$(curl -s $NGINX_STATUS_URL)
ACTIVE_CONN=$(echo"$STATUS" | grep "Active connections" | awk '{print $3}')
ACCEPTS=$(echo"$STATUS" | awk 'NR==2 {print $1}')
HANDLED=$(echo"$STATUS" | awk 'NR==2 {print $2}')
REQUESTS=$(echo"$STATUS" | awk 'NR==2 {print $3}')
READING=$(echo"$STATUS" | awk 'NR==3 {print $2}')
WRITING=$(echo"$STATUS" | awk 'NR==3 {print $4}')
WAITING=$(echo"$STATUS" | awk 'NR==3 {print $6}')

# 记录到日志
echo"$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >>$LOG_FILE

# 告警逻辑
if[$ACTIVE_CONN-gt 1000 ];then
echo"High connection count:$ACTIVE_CONN"| logger -t nginx_monitor
fi
EOF

综合调优配置

# 终极优化配置
worker_processesauto;
worker_cpu_affinityauto;
worker_rlimit_nofile100000;

error_log/var/log/nginx/error.logwarn;
pid/var/run/nginx.pid;

events{
useepoll;
worker_connections10240;
multi_accepton;
accept_mutexoff;
}

http{
include/etc/nginx/mime.types;
default_typeapplication/octet-stream;

# 日志格式
log_formatmain'$remote_addr-$remote_user[$time_local] "$request" '
'$status$body_bytes_sent"$http_referer" '
'"$http_user_agent"$request_time$upstream_response_time';

# 性能优化
sendfileon;
tcp_nopushon;
tcp_nodelayon;
keepalive_timeout65;
keepalive_requests1000;

# 压缩优化
gzipon;
gzip_varyon;
gzip_min_length1000;
gzip_comp_level6;
gzip_typestext/plain text/css application/json application/javascript text/xml application/xml;

# 缓存优化
open_file_cachemax=100000inactive=20s;
open_file_cache_valid30s;
open_file_cache_min_uses2;
open_file_cache_errorson;

# 安全优化
server_tokensoff;
client_header_timeout10;
client_body_timeout10;
reset_timedout_connectionon;
send_timeout10;

# 限流配置
limit_req_zone$binary_remote_addrzone=global:10mrate=100r/s;
limit_conn_zone$binary_remote_addrzone=addr:10m;

include/etc/nginx/conf.d/*.conf;
}

故障排查方法论总结

1. 标准化排查流程

1.收集故障信息：确认故障现象、影响范围、发生时间

2.查看日志文件：error.log、access.log、系统日志

3.检查配置文件：语法检查、逻辑检查

4.验证网络连通：端口状态、连通性测试

5.分析性能指标：CPU、内存、网络、磁盘

6.确定根本原因：深入分析，找出真正原因

7.实施解决方案：临时修复、永久解决

8.验证修复效果：功能测试、性能测试

9.总结经验教训：文档记录、流程优化

2. 常用排查工具

•日志分析：tail、grep、awk、sed

•网络工具：curl、wget、telnet、netstat、ss

•性能监控：htop、iotop、iftop、nginx-status

•系统诊断：strace、lsof、tcpdump

3. 预防性措施

• 建立完善的监控告警体系

• 定期进行配置文件备份

• 实施自动化运维工具

• 制定标准化操作流程

• 定期进行故障演练

结语

Nginx故障排查是运维工程师必备的核心技能，需要扎实的理论基础和丰富的经验

声明：本文内容及配图由入驻作者撰写或者入驻合作网站授权转载。文章观点仅代表作者本人，不代表电子发烧友网立场。文章及其配图仅供工程师学习之用，如有内容侵权或者其他违规问题，请联系本站处理。举报投诉

nginx

nginx

+关注

关注
0

文章
181

浏览量
12977

原文标题：从502到排障：Nginx常见故障分析案例

文章出处：【微信号：magedu-Linux，微信公众号：马哥Linux运维】欢迎添加关注！文章转载请注明出处。

搜索历史

Nginx常见故障案例总结

评论