hcimarkus: November 2019

Maintenance Mode for CVM

ncli host list (to get UUID from the host)

ncli host edit id=<insert uuid here> enable-maintenance-mode='false' | 'true'

Use 'true' to enter maintenance mode and 'false' to leave it.

For maintenance mode on AHV and ESXi, see:

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LIQoCAO

CVM
Placing the CVM in maintenance mode does not migrate VMs from the host. The hypervisor must be placed in maintenance mode to migrate VMs. See the relevant section for your hypervisor below.

First, get the CVM host ID:

nutanix@cvm$ ncli host ls

The host ID consists of the characters to the right of the double colons. In the example below, it is "11":

Id                        : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx::11
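
The rule above (everything to the right of the double colons) can be sketched as a small shell helper. The function name extract_host_id is hypothetical and not part of ncli; it only does string handling on a line you pass in:

```shell
# Extract the host ID from an "Id" line of `ncli host ls` output.
# extract_host_id is a hypothetical helper name, not an ncli feature.
extract_host_id() {
    # Remove the longest prefix ending in "::", leaving only the host ID.
    printf '%s\n' "${1##*::}"
}

# Example with the placeholder line shown above:
extract_host_id 'Id                        : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx::11'
```

The result can then be used as <host_id> in the ncli host edit commands.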

To place the CVM in maintenance mode:

nutanix@cvm$ ncli host edit id=<host_id> enable-maintenance-mode=true

To exit CVM maintenance mode, run the command below from a CVM that is not in maintenance mode:

nutanix@cvm$ ncli host edit id=<host_id> enable-maintenance-mode=false

ESXi
To place the ESXi host in maintenance mode:

root@esxi# esxcli system maintenanceMode set --enable true

To end ESXi host maintenance mode:

root@esxi# esxcli system maintenanceMode set --enable false

To verify the maintenance mode status of an ESXi host:

root@esxi# esxcli system maintenanceMode get
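
If you want to use the status in a script, a minimal sketch could look like this. It assumes that the get command prints the single word "Enabled" or "Disabled"; the function name esxi_in_maintenance is hypothetical:

```shell
# Return success (0) if the given status text reports maintenance mode as enabled.
# Assumption: `esxcli system maintenanceMode get` prints "Enabled" or "Disabled".
esxi_in_maintenance() {
    case "$1" in
        Enabled) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage on an ESXi host (not executed here):
#   if esxi_in_maintenance "$(esxcli system maintenanceMode get)"; then
#       echo "host is in maintenance mode"
#   fi
```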

AHV
To get the AHV hypervisor address:

nutanix@cvm$ acli host.list

To enter AHV host maintenance mode:

nutanix@cvm$ acli host.enter_maintenance_mode <hypervisor_address> wait=true

To exit AHV host maintenance mode:

nutanix@cvm$ acli host.exit_maintenance_mode <hypervisor_address>

Error – Foundation service running on one of the nodes

Under normal operation, the Foundation service is stopped on all cluster nodes. Only when you destroy a cluster is the Foundation service started, and it then keeps running until you create a new cluster or add the nodes to an existing cluster.

As far as I know, the only other component within AOS that uses the Foundation service is LCM, which leverages it for certain hardware update tasks such as a BIOS update. This is also the most common reason why a Foundation process is started or still running in a "normal" cluster: a previously failed LCM action.

To check whether and where Foundation is running, SSH into one of the CVMs and run the following command:

allssh 'genesis status | grep foundation'

In my output, it was running on the CVM with the .24 IP address (process IDs inside the brackets indicate that the process is up and running).

To stop the Foundation process, SSH to the affected CVM and run:
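
The "process IDs in the brackets" check can be sketched as a filter. The exact line format is an assumption here (something like "foundation: [1234, 1235]" when running and "foundation: []" when stopped), and the function name foundation_running is hypothetical:

```shell
# Succeed (exit 0) if any line on stdin shows the foundation service with
# at least one PID inside the brackets.
# Assumption: running looks like "foundation: [1234, 1235]", stopped like
# "foundation: []". These sample formats are not taken from real output.
foundation_running() {
    grep -Eq 'foundation: \[[0-9]'
}

# Usage on a CVM (not executed here):
#   if allssh 'genesis status | grep foundation' | foundation_running; then
#       echo "Foundation is running on at least one CVM"
#   fi
```

This only answers whether Foundation is running anywhere; to see on which CVM, read the full allssh output.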

genesis stop foundation

The output will directly show you that the service is now stopped.

Now run the LCM / Foundation upgrade again and the pre-checks will succeed.

Node stuck in Phoenix (no boot back to hypervisor)

If a node is stuck in Phoenix, you can get it back to the hypervisor, but first check that no tasks are still running from Phoenix. On a CVM, check this:

ecli task.list include_completed=false
host_upgrade_status
firmware_upgrade_status

If nothing is running, you can run this in Phoenix:

python /phoenix/reboot_to_host.py

The host will then boot back to the hypervisor. After that, you have to take the CVM out of maintenance mode:

cluster status (to see, if everything in the Cluster is ok)

ncli host list (to get UUID from the host)

ncli host edit id=<insert uuid here> enable-maintenance-mode='false'

Check the metadata ring:

nodetool -h localhost ring

If the host runs AHV, you also have to exit maintenance mode for AHV:

acli host.list
acli host.exit_maintenance_mode <hostname>

Alternatively, in newer versions, use the script provided by Nutanix. Again, first check the upgrade status etc., then run on a CVM:

python /home/nutanix/cluster/bin/lcm/lcm_node_recovery.py <IP of affected CVM/phoenix>

This works with ESXi and AHV; you may have to provide the vCenter address and credentials.

The script should also stop the Foundation service, but this does not work.

So after bringing the node back, you have to check:

allssh 'genesis status|grep foundation'

If Foundation is running on a node, stop it:

allssh genesis stop foundation

Then the node should be back and running in your cluster.

Another option is to run the script from another CVM:

python /home/nutanix/cluster/bin/lcm/lcm_node_recovery.py xx.xx.xx.xx

LCM >= 3.0: 
nutanix@CVM:~$ /home/nutanix/cluster/bin/lcm/lcm_node_recovery <CVM_IP>

xx.xx.xx.xx is the IP of the affected CVM/Phoenix.

This completes all tasks except stopping the Foundation service, so it is best to run the script and then perform:

allssh genesis stop foundation

and you are good to go again.
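
As a reminder of the order, here is a small sketch that only prints the commands from this section instead of running them. The function name print_recovery_steps, its second argument, and the IP 10.0.0.24 are hypothetical and only for illustration:

```shell
# Print (do not run) the recovery commands for a node stuck in Phoenix.
# $1: IP of the affected CVM/Phoenix
# $2: pass "3" to print the LCM >= 3.0 form of the recovery command
print_recovery_steps() {
    ip="$1"
    if [ "${2:-}" = "3" ]; then
        echo "/home/nutanix/cluster/bin/lcm/lcm_node_recovery $ip"
    else
        echo "python /home/nutanix/cluster/bin/lcm/lcm_node_recovery.py $ip"
    fi
    # The script does not stop Foundation, so check and stop it afterwards.
    echo "allssh 'genesis status | grep foundation'"
    echo "allssh genesis stop foundation"
}

# Example with a made-up IP address:
print_recovery_steps 10.0.0.24
```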

Tuesday, 12 November 2019