前提
查看了应用日志,是凌晨就已经 OOM 了,但是直到早上八点多客服反馈登录不了才发现问题,当时在路上,于是让运维 dump 了堆栈和线程信息,重启了服务器。
jvm 配置:
-Xmx2000M
-Xms2000M
-Xmn500M
-XX:PermSize=250M
-XX:MaxPermSize=250M
-Xss256K
-XX:-UseGCOverheadLimit
-Djava.net.preferIPv4Stack=true
-Djava.net.preferIPv4Addresses
-XX:+DisableExplicitGC
-XX:SurvivorRatio=1
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled
-XX:LargePageSizeInBytes=128M
-XX:+UseFastAccessorMethods
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=60
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:+PrintClassHistogram
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-Xloggc:/data/apps/apache-tomcat-6.0.36/logs/gc.log
jvm.heap
2020-05-07 08:46:
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 2097152000 (2000.0MB)
NewSize = 524288000 (500.0MB)
MaxNewSize = 524288000 (500.0MB)
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 1
PermSize = 262144000 (250.0MB)
MaxPermSize = 262144000 (250.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 349569024 (333.375MB)
used = 137905504 (131.51693725585938MB)
free = 211663520 (201.85806274414062MB)
39.45014990801931% used
Eden Space:
capacity = 174850048 (166.75MB)
used = 129558040 (123.5561752319336MB)
free = 45292008 (43.193824768066406MB)
74.09665681075478% used
From Space:
capacity = 174718976 (166.625MB)
used = 8347464 (7.960762023925781MB)
free = 166371512 (158.66423797607422MB)
4.777651627262284% used
To Space:
capacity = 174718976 (166.625MB)
used = 0 (0.0MB)
free = 174718976 (166.625MB)
0.0% used
concurrent mark-sweep generation:
capacity = 1572864000 (1500.0MB)
used = 532712344 (508.0340805053711MB)
free = 1040151656 (991.9659194946289MB)
33.86893870035807% used
Perm Generation:
capacity = 262144000 (250.0MB)
used = 149215056 (142.3025665283203MB)
free = 112928944 (107.69743347167969MB)
56.921026611328124% used
2020-05-07 08:46:
using thread-local object allocation.
Parallel GC with 4 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 1566572544 (1494.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 174063616 (166.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 1966080 (1.875MB)
used = 628456 (0.5993423461914062MB)
free = 1337624 (1.2756576538085938MB)
31.964925130208332% used
From Space:
capacity = 196608 (0.1875MB)
used = 98304 (0.09375MB)
free = 98304 (0.09375MB)
50.0% used
To Space:
capacity = 196608 (0.1875MB)
used = 0 (0.0MB)
free = 196608 (0.1875MB)
0.0% used
PS Old Generation
capacity = 65273856 (62.25MB)
used = 4938552 (4.709770202636719MB)
free = 60335304 (57.54022979736328MB)
7.565895907850151% used
PS Perm Generation
capacity = 25296896 (24.125MB)
used = 25255776 (24.085784912109375MB)
free = 41120 (0.039215087890625MB)
99.8374504128886% used
应用日志
jstack 信息
2020-05-07 08:47 cpu_busy: 1
"scheduler-9" prio=10 tid=0x00007f962c6e6000 nid=0x3a0d waiting on condition [0x00007f962437f000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:193)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
"scheduler-8" prio=10 tid=0x00007f962cb95800 nid=0x3a0c waiting on condition [0x00007f96248d4000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:193)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
"scheduler-7" prio=10 tid=0x00007f962c7c6000 nid=0x3a0b waiting on condition [0x00007f962468b000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
2020-05-07 08:47 cpu_busy: 1
"VM Thread" prio=10 tid=0x00007f962c0fe800 nid=0x38e1 runnable
"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f962c010800 nid=0x38dc runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f962c012800 nid=0x38dd runnable
"Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x00007f962c014800 nid=0x38de runnable
"Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x00007f962c016000 nid=0x38df runnable
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x00007f962c0a5000 nid=0x38e0 runnable
"VM Periodic Task Thread" prio=10 tid=0x00007f962c258000 nid=0x39b4 waiting on condition
JNI global references: 1952
2020-05-07 08:47 cpu_busy: 1
"DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275)
"DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a00228> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-6" prio=10 tid=0x00007f1c8c241000 nid=0x6682 in Object.wait() [0x00007f1c3bffe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
2020-05-07 08:47 cpu_busy: 1
"Attach Listener" daemon prio=10 tid=0x00007f1c68001000 nid=0x30d waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"DestroyJavaVM" prio=10 tid=0x00007f1c8c008000 nid=0x665b waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275)
"DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
- locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
"DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
用 jhat -J-Xmx50m jvm.dump 查看所有类的实例数
这是我第一次碰到 OOM,我不理解的是为什么我看 jvm.heap 里面,堆内存使用率不是百分百,但请求依然是 OOM 呢?这些堆栈文件是运维用脚本生成的,必要的话我可以把脚本贴上来。
我会继续排查问题,希望哪位看到了也能帮我瞅瞅,非常感谢。
ps:
发现日志的打印时间顺序有问题,不知道这个会不会可以找到什么线索?
pps
哪位大佬如果有好的建议,赶紧提一下赶在故障前我准备下。
1
alexfarm 2020-05-07 12:46:47 +08:00
以前用过个工具 eclipse memory analyzer 去看 dump 的内容,稍微直观详细点,主要看看堆内存占用高的类
|
2
Lax 2020-05-07 12:47:57 +08:00
不太懂 java,不过看你输出里有个 PS Perm Generation 已经满了。
|
3
aircjm 2020-05-07 12:48:17 +08:00
这个应该是对象太大导致的吧?? 你可以用 eclipse 的那个分析工具分析下你的 dump 文件找下
|
5
hand515 2020-05-07 13:04:54 +08:00
你这里是两个进程? jmap 输出的第二段,明显 jvm 的内存配置都跟你贴的不一样,Perm Gen 才 24m
|
7
lovedebug 2020-05-07 13:17:17 +08:00
我跟楼主说一个自己碰到过的。线上爆发的及其容易忽略的导致 oom 的点,所有 debug 和 info 级别 log 中不要打印整个 List/Set/Map,这些在运行中实际也是会计算的。
|
8
lovedebug 2020-05-07 13:18:10 +08:00
另外楼主应该为这个服务设置 health check api,可以今早发现问题。
|
9
lovedebug 2020-05-07 13:19:35 +08:00
PPS:先升级机器内存起一台新机器线上泡泡看看,这台出问题的机器可以从集群中移除仔细分析
|
10
pmispig 2020-05-07 13:34:49 +08:00
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/logs/gc.hprof
加上这个 |
12
Solace202 OP @hand515 这是使用了两个 GC ( Mark-Sweep GC 和 Parallel GC ),jvm 里面配置是给第一个 GC 配的
![]( https://i.bmp.ovh/imgs/2020/05/56930920c1ac5a20.png) |
13
yamasa 2020-05-07 13:39:41 +08:00
jmap -histo <pid>会按 mem size 对堆内对象排序,先看看前几名分别是啥。但是这种方式看不出对象树,所以比较大可能还是需要用 jmap dump 一个 bin 或者 hprof 下来,用 eclipse mem analyzer 直接分析。这个 analyzer 可以对一个 dump 出具 mem leak 的报表,要仔细看看,找到了高内存占用的类看看对象树,是不是你们哪里的业务代码造成了内存泄漏。
|
14
Solace202 OP @hand515 贴个配置和 jmap 的对应![关系图]( https://i.bmp.ovh/imgs/2020/05/56930920c1ac5a20.png)
|
17
yamasa 2020-05-07 13:42:59 +08:00
话说你们的 GC 是 CMS,可以用这个工具看看 https://github.com/DarLiner/vjtools/tree/master/vjmap 。 如果是 G1 的话反而用不了。这是可以分代的 jmap,如果 jmap 给出的分布看不出来问题就上这个,一个一个代区来分析。
|
18
Solace202 OP @pmispig 已经让运维加了,我们有 GC 的日志,但是没有日期,这个能否定位问题呢?可以的话我让在 JVM 把 gc 的日期和时间也加上?
|
19
limbo0 2020-05-07 14:11:53 +08:00
除了堆内存, 也得考虑下堆外内存,是不是有大量的网络请求
|
20
coolcfan 2020-05-07 14:13:14 +08:00 via Android
如果是 PermGen 满,结合日志内容,可以关注下应用里的反射操作或者是否存在动态的类生成……
|
22
25OHd2qObJmJ6P10 2020-05-07 14:17:38 +08:00
从应用日志来看是 spring 注入一个类的时候报了堆空间不足,spring 注入属性会去加载相应的类到永久区,根据你的日志分析,你的 jdk 应该是 1.8 以下,而且你的 jvm 内存空间也显示你的永久区只有 24M,使用达到 99%,所以初步分析就可以增大永久区 -XX:PermSize=75M -XX:MaxPermSize=100M 尝试,然后再根据 jstat 和 jmap 分析内存变化,确定原因以及调整最合适的 jvm 参数
|
23
Solace202 OP @rayduan 我是有个地方不理解,两个 GC,为什么注入的类会放到下面这个 GC 的 Perm Generation 而不是上面的那个 Perm Generation ?
|
24
zwx327634 2020-05-07 14:31:50 +08:00
如果是 jdk1.7,升到 1.8 就没问题了,永久代变成元空间,放在堆里了。快速的话就是像楼上,增加永久代空间
|
26
1ffree 2020-05-07 15:05:50 +08:00
好奇应用运行期间 还有 autowire 是啥使用场景呢
|
27
pangleon 2020-05-07 15:07:44 +08:00
你这个提示的是 heap space 而不是 permgen space,后者是永久代满了,那么基本上可以定位到你们的程序创建了大量的临时对象
如果真是凌晨 OOM,凌晨也不是用户高峰那么基本可以确定是定时任务导致的。 你可以查询下有哪些定时任务在运行,他们是否创建了大量的 LIST,MAP 这种集合类对象或者拼接了大量字符串对象。 |
28
pmispig 2020-05-07 15:15:31 +08:00
@Solace202 -XX:HeapDumpPath=/logs/gc.hprof 看看能不能重现一次,拿到这个 dump 出来分析就比较好定位
|
29
liukzz 2020-05-07 15:16:37 +08:00
https://gceasy.io/gc-index.jsp 这个网站下个离线版的分析工具,把 gc.log 导进去分析下
|
31
gadsavesme 2020-05-07 15:45:01 +08:00 1
Perm Generation 满了,这个 jdk1.8 以下的时候 fullgc 的时候不会清理空间,满了就 oom 十分蛋疼。
|
32
bonke 2020-05-07 16:00:18 +08:00
hashMap 的 key 是重写了 hashCode 方法么 看下有没有重写 equals 方法
|
33
admol 2020-05-07 17:05:15 +08:00
推荐一下这个工具 https://console.perfma.com/
|
34
chibupang 2020-05-07 17:31:03 +08:00 via iPhone
我看 OOM 日志错误是在“work”反射调用的时候,是不是创建了大内存对象。
|
35
kingwang 2020-05-07 17:57:40 +08:00
应该是出报表的时候跪了,看一下这些报表里会不会象楼上们所说的产生大量集合对象
|
36
iFlicker 2020-05-07 17:59:47 +08:00 1
如果是 macOS 我推荐 Jprofiler ,平时 Android dump 堆内存查问题就是用这个看 比 as 强大
|
37
EmdeBoas 2020-05-07 18:04:08 +08:00
机器内存多大?
|
38
cheng6563 2020-05-07 18:15:46 +08:00 via Android
强烈推荐 Jprofiler 来观测 jvm 内存转存。轻松找到哪有问题
|
39
Huozy 2020-05-07 19:40:30 +08:00
用的 was 吗?看看 was 控制台最大内存和最小内存是多少,如果操作的数据多了 超出一定范围会报内存溢出的
|
41
clino 2020-05-07 20:49:20 +08:00 via Android
不一定内存哦,什么用户最大进程数都可能会出的
所以也要去看一下 syslog kernel log 之类的 |
42
luozic 2020-05-07 23:42:05 +08:00
jdk 6 ? 可以试试 jprofiler, 免费的有阿里的 https://github.com/alibaba/arthas/blob/master/README_CN.md ,还有个唯品会的工具,这两个工具也可以出图来分析。
并且现在新版本的 jdk 6 也支持 gc1 了 |
43
zinan 2020-05-08 01:14:36 +08:00
就是永久代满了吧, 1.8 以下永久代就是放在 heap 中管理的, 永久代在 full gc 中不会清理. 反射调用有可能会触发类加载的行为, 永久代放不下就会报 oom 了
|
44
heyhumor 2020-05-08 08:13:22 +08:00
V 站摸鱼的大佬还是不少啊[滑稽]
|