6 April 2015

VMware VM with High CPU usage becomes unresponsive

Today one of our producation VMs became unreposnsive with VMware graphs were showing CPU 100% usage and CPU high alerts in vCenter logs,

After diving deep into logs, below are my observations from VMware and Linux logs within period before the crash,


From ESXi side, 
we can see host has lost access to SAP01 datastore, but our VM is resided in DS01, So it should not affects this VM operation.

Lost access to volume 547f2eaa-1a00d590-44f0-001018b3dc14 
(SAP01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. info 7/17/2013 6:56:47 AM SAP01

Successfully restored access to volume 547f2eaa-1a00d590-44f0-001018b3dc14 (SAP01) following connectivity issues. info 7/17/2013 6:56:47 AM 10.8.225.211


From VM guest side, 
there is only 1 entry related to CPU status has been recorded:

Alarm 'Virtual machine CPU usage' changed from Green to Red info 7/16/2015 1:59:10 PM PRODvm


From VM guest vmware.log file, 
there is only an entry of high CPU has been recorded in 13:25hr,

2013-07-16T13:25:25.109Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'high cpu', deviceState=1, lazy=1, logging=0, quiesced=0, forceNative=0, tryNative=0, sibling=0 saveAllocMaps=0 cb=CDA3C80, cbData=3236B4E0
2015-07-16T13:25:25.127Z| vcpu-0| I120: Destroying virtual dev for scsi0:0 vscsi=8371
2015-07-16T13:25:25.127Z| vcpu-0| I120: VMMon_VSCSIStopVports: No such target on adapter



From within Linux guest, 
we can see below SAR report, that Linux CPU was totally idle before the crash:

# sar -f /var/log/sa/sa16
01:10:01 PM CPU %user %nice %system %iowait %steal %idle
01:20:02 PM all 0.12 0.00 0.27 0.06 0.00 99.55
01:30:01 PM all 0.12 0.00 0.28 0.06 0.00 99.54
01:40:01 PM all 0.20 0.00 0.32 0.10 0.00 99.38
01:50:01 PM all 0.10 0.00 0.27 0.04 0.00 99.58
Average: all 0.14 0.00 0.28 0.32 0.00 99.26

06:05:18 PM LINUX RESTART


06:10:01 PM CPU %user %nice %system %iowait %steal %idle
06:20:01 PM all 0.16 0.00 0.46 0.29 0.00 99.09
06:30:01 PM all 0.05 0.00 0.11 0.10 0.00 99.74
06:40:01 PM all 0.16 0.00 0.52 1.19 0.00 98.13


Therefore, Linux CPU was not busy as opposed to VMware that shows CPU used 100%.

Linux didn’t report any complain about high CPU I/O wait, so disks were accessible in a normal way to Linux,

And CPU run queue was normal as well:

# sar -q -f /var/log/sa/sa16
01:10:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
01:20:02 PM 0 147 0.00 0.00 0.00
01:30:01 PM 0 148 0.00 0.00 0.00
01:40:01 PM 0 159 0.00 0.00 0.00
01:50:01 PM 0 147 0.00 0.00 0.00
Average: 0 147 0.01 0.01 0.00

06:05:18 PM LINUX RESTART

06:10:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
06:20:01 PM 0 161 0.00 0.00 0.00
06:30:01 PM 0 153 0.00 0.00 0.00
06:40:01 PM 0 152 0.00 0.01 0.00
06:50:01 PM 0 151 0.01 0.01 0.00
07:00:01 PM 0 148 0.00 0.01 0.00



Memory usage was also fine from within the Linux:

# sar -r -f /var/log/sa/sa16
01:10:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
01:20:02 PM 1248036 806492 39.25 133084 439992 274044 4.39
01:30:01 PM 1248028 806500 39.25 133092 440012 274044 4.39
01:40:01 PM 1234796 819732 39.90 133108 440064 290796 4.65
01:50:01 PM 1247664 806864 39.27 133108 440080 274044 4.39
Average: 1250279 804249 39.15 132749 439035 270872 4.33

06:05:18 PM LINUX RESTART


06:10:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
06:20:01 PM 1766932 287596 14.00 33744 88596 289664 4.64
06:30:01 PM 1766764 287764 14.01 34080 94268 274408 4.39
06:40:01 PM 1634960 419568 20.42 107564 115004 274408 4.39
06:50:01 PM 1634232 420296 20.46 107920 120440 266008 4.26
07:00:01 PM 1633088 421440 20.51 109056 118656 261192 4.18


Swap usage was normal as well:

# sar -S -f /var/log/sa/sa16
01:10:01 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
01:20:02 PM 4194296 0 0.00 0 0.00
01:30:01 PM 4194296 0 0.00 0 0.00
01:40:01 PM 4194296 0 0.00 0 0.00
01:50:01 PM 4194296 0 0.00 0 0.00
Average: 4194296 0 0.00 0 0.00

06:05:18 PM LINUX RESTART


06:10:01 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
06:20:01 PM 4194296 0 0.00 0 0.00
06:30:01 PM 4194296 0 0.00 0 0.00
06:40:01 PM 4194296 0 0.00 0 0.00
06:50:01 PM 4194296 0 0.00 0 0.00


Linux disk read/write was normal before the crash, and after 1:50pm nothing more recorded:

# sar -b -f /var/log/sa/sa16
01:10:01 PM tps rtps wtps bread/s bwrtn/s
01:20:02 PM 1.73 0.00 1.73 0.00 16.30
01:30:01 PM 1.73 0.00 1.73 0.00 16.22
01:40:01 PM 3.44 0.00 3.44 0.03 37.70
01:50:01 PM 1.72 0.00 1.72 0.00 16.28
Average: 1.87 0.00 1.87 0.00 18.12

06:05:18 PM LINUX RESTART


06:10:01 PM tps rtps wtps bread/s bwrtn/s
06:20:01 PM 4.19 2.51 1.68 147.58 17.06
06:30:01 PM 0.92 0.34 0.59 38.68 5.66
06:40:01 PM 72.39 56.18 16.21 627.77 225.36
06:50:01 PM 3.35 0.01 3.34 0.11 47.39



Hence, based on above observation, CPU/Memory/Disk from within the OS were all reported normal behavior,


As VMware shows weird hike of CPU, I suppose it was something between ESXi and guest VM resource allocation and CPU shares or fight for some resources that send the VM to an unresponsive mode.




No comments:

Post a Comment