ID: KCS002615
VMS Management Interface Not Connected
Written: Sep 12, 2016

 

Product Categories
Server Management

Problem
SMweb is showing or reporting one of the following messages for the VMS or hosted VMs:
 
 VM Mgmt Interface Not Connected
 VM Mgmt Comm Error
 VMMgmt Communication Error
 
 

 
NOTE:
If you get an incident with the combination below, then close the incident pending an upgrade to VMS Version 2.0.6.0 per JIRA SM-25041
- CMIC Version is 12.07.01
- VMS Server is 2.0.0.5
- VMS Server has been recently rebooted
- Faults in MyWork show "VM Management not connected" for both the SWS and CMIC VM


Environment/Conditions/Configuration

VMS Version 1.03.x or later


Cause
1.  VMS memory leak (JIRA OSEDEV-563) caused by the VMS Server being up for more than 300 days.  In some cases this has occurred in less than 300 days.
2.  The vmname= entry in the CMIC Config xml file for a hosted VM is different from the VM names shown by the vm-list command on the VMS Server.
3.  The VMS Server chassis information is not correct for one of the hosted VMs in the CMIC Config xml file.
 

Solution
 
Workaround for Cause 1:
 
This problem is fixed in VMS version 02.00.06.00, which is intended to prevent the memory leak causing this issue.
 
If the system is not at this version, rebooting the VMS Server will resolve the issue.
 
 
NOTE:  If the system being worked on has multiple VMS Servers and their uptime is the same, reboot all of the VMS Servers.  Leaving the other VMS Servers unrebooted will generate additional incidents that could have been avoided by rebooting every VMS Server showing the same amount of uptime.

1.  Check the uptime on the VMS Servers by running the following command as root on the SWS:
     vmscmd -a uptime
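
     Illustrative output only (the hostnames and values below are hypothetical); if all servers show the same long uptime, plan to reboot all of them:
     vms1:   10:02am  up 312 days  4:17,  0 users,  load average: 0.41, 0.38, 0.36
     vms2:   10:02am  up 312 days  4:09,  0 users,  load average: 0.29, 0.31, 0.30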
2.  Create a maintenance window
3.  Try connecting to the VMS Server

How to connect to or reboot a VMS Server:
     -  Can you ping any IP on the VMS Server?
     -  If Yes then go to step 4
     -  If No then go to step 5

4.  Try connecting to the VMS Server using the following procedure:
      -  Connect to the VMS Server as root using KCS004526 and run the following command:
         ipmitool chassis power reset
      -  If you can't connect, go to step 5.
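
      As an optional sanity check, you can confirm the chassis power state before and after the reset (a standard ipmitool subcommand):
      ipmitool chassis power status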
 
NOTE:  It can take up to 45 minutes to reboot a VMS Server that has more than 4 drives due to fsck needing to be run on all drives.  If 45 minutes has passed and the VMS Server has not recovered, then a CSR must be dispatched to pull the power plugs and bring the VMS Server back up.
 
5.  Use the procedure in KAP1B2EF2 to reboot the VMS Server remotely.
 
6.  If steps 1-5 fail, then schedule an onsite visit to do the following:
     -  Pull the power plugs on the VMS Server that’s hung
     -  Wait about 20 seconds
     -  Plug the power plugs back in and power up the VMS Server
     -  Look for errors during power up
 
7.  Open SMweb again after the VMS Server has rebooted

8.  If you see any nodes showing "Chassis No Contact", and you have the latest gsctools package installed, then run the following command as root on the SWS

   /opt/teradata/gsctools/sbin/racreset
 
9.  If you don't have a gsctools version that includes the racreset script, either download and install the latest gsctools package or use KAP2BAE4A.
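
Once the servers are back up, a quick way to confirm the reboots took effect is to re-run the uptime check from the SWS and verify the counters have reset:
     vmscmd -a uptime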
 
 
 
Solution for Cause 2:
 
1.  Connect to the VMS Server as root.  Refer to KCS004526 if assistance is needed.
2.  Make sure the vmname= entry in the CMIC Config xml file matches the VM names in the output of the vm-list command.
3.  Refer to knowledge article KAP315BAB2 for how to change a vmname in the CMIC Config xml file.
 
In this example, the SWS was migrated from XEN to KVM, but the vmname was not changed from sws1 to sws1-kvm in the CMIC Config xml file:
 
Example from VMS vm-list
Name                         ID    Mem  VCPUs   Type  State
cmic95                        2  10240      7    HVM  Running
sws1-kvm                      3  10240      7    HVM  Running
sws1                             10240      7    XEN  Not under KVM domain management
 
From CMIC Config xml file
<Chassis idnum="62" vmname="sws1" vmoerole="SWS">   <== in this case sws1 should have been changed to sws1-kvm
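
A quick way to compare both sides in one pass is sketched below; the CMIC Config xml path is an assumption, so substitute the actual path on your system:
     # Pull the vmname= entries out of the CMIC Config xml (example path)
     grep -o 'vmname="[^"]*"' /datapart/cmic/config/cmicconfig.xml
     # List the VM names the VMS Server actually reports (run on the VMS Server)
     vm-list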
 
 
 
Solution for Cause 3:
 
1.  Connect to SMweb and take note of the chassis positions of the VMS Servers
2.  Make sure the entries in the CMIC Config xml for the hosted VM's have the correct chassis position for the VMS Server hosting that VM
 
If you look at /datapart/cmic/trace/mepluginhost.MEPlugin_VMS_R1000GZ-1.log on the SOV or Master CMIC, you will see which hosted VM it is complaining about:
MEPlugin_VMS_R1000GZ_1.1.1.12:CVMSCIMStateUpdateThread::ValidateVMConfig(): VM cmic1 (KVM) (oerole ) is unmanaged -- bad config?
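
For example, to pull the relevant lines directly from the log:
     grep unmanaged /datapart/cmic/trace/mepluginhost.MEPlugin_VMS_R1000GZ-1.log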
 
In the example below, SMweb was showing "Self Managed, Unmanaged VM(s) detected" due to the wrong chassis position (11) for the VMS Server; it should have been 12.
In this case you would simply change the <ChassisID> from 11 to 12 to resolve the problem.
 
 
<Chassis idnum="63" type="cmic" vmname="cmic1" vmoerole="CMIC">
 <MEPlugin name="MEPlugin_CMIC_KVM">
 <Settings>
  <VMS>
  <SystemID>1</SystemID>
  <CollectiveID>1</CollectiveID>
  <CabinetID>1</CabinetID>
  <ChassisID>11</ChassisID>  <=============  Should have been 12
  <IPv4 type="primary">39.80.8.12</IPv4>
  <IPv4 type="secondary">39.96.8.12</IPv4>
 

Special Considerations

Additional Information

 

On the VMS, in /var/log/messages you will find events similar to the following:

Oct 17 00:30:36 vms kernel: sfcbd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Oct 17 00:30:36 vms kernel: sfcbd cpuset=/ mems_allowed=0
Oct 17 00:30:36 vms kernel: Pid: 11288, comm: sfcbd Not tainted 2.6.32.43-0.4-xen #1
Oct 17 00:30:36 vms kernel: Call Trace:
Oct 17 00:30:36 vms kernel:  [<ffffffff80009b95>] dump_trace+0x65/0x180
Oct 17 00:30:36 vms kernel:  [<ffffffff80352768>] dump_stack+0x69/0x71
Oct 17 00:30:36 vms kernel:  [<ffffffff8009d760>] oom_kill_process+0xe0/0x220
Oct 17 00:30:36 vms kernel:  [<ffffffff8009df10>] __out_of_memory+0x50/0xa0
Oct 17 00:30:36 vms kernel:  [<ffffffff8009dfbe>] out_of_memory+0x5e/0xc0
Oct 17 00:30:36 vms kernel:  [<ffffffff800a13f8>] __alloc_pages_slowpath+0x478/0x540
Oct 17 00:30:36 vms kernel:  [<ffffffff800a15fa>] __alloc_pages_nodemask+0x13a/0x140
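
A quick way to check for these events on the VMS:
     grep oom-killer /var/log/messages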


NOTE:  If the oom-killer messages are not present in /var/log/messages on the VMS node, then this is possibly a network or plug-in issue.

Refer to JIRA SM-22972 for the current workaround.