Tuesday 20 November 2018

Transferring the primary NSX Manager role to another manager causes a network outage in a cross-vCenter environment

Hi, the title says it all.

Recently I had a big outage when moving the primary role from one NSX Manager to another.

I used the official VMware method to migrate the primary role:

1. Delete the primary role from the primary NSX Manager.
2. Remove the controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX Managers.
4. Add the secondary NSX Managers.
5. Redeploy the controllers one at a time.

Following these steps can result in a network outage between the NSX Edges and the rest of the network. You may also get a synchronization error between the new primary and the old primary NSX Manager, similar to this:

"Cannot redeploy UDLR control VM when UDLR local egress is disabled"

The problem lies somewhere between the UDLR control VM and the controller cluster. The UDLR control VM stays in the old location while the controller cluster is in the new location. This causes connectivity issues between them and, as a result, the UDLR has no routes.

One source tells me this is a bug on VMware's side; another says it is by design: automatic redeployment of the UDLR control VM is not supported when local egress is disabled.

If that is the case, the sequence above should be modified to include a UDLR control VM redeployment step, I believe somewhere between assigning the new primary role and redeploying the controllers.

This is the state for now; I'll update this post as the situation progresses.

Update: I have a new, working sequence of tasks.

1. Delete the primary role from the primary NSX Manager.
2. Remove the controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX Managers.
4. Add the secondary NSX Managers.
5. Redeploy the controllers one at a time.
6. Redeploy the UDLR control VM in the new primary NSX site.
7. Remove the old UDLR control VM.

Deployment of the new UDLR control VM is very fast; connectivity should be restored right after the new control VM is powered on.
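
Since the order of the tasks is what matters here, the updated sequence could be kept as a small checklist that refuses to go out of order. This is only a rough sketch in Python: the step keys and the Runbook class are hypothetical names made up for illustration, nothing here talks to NSX or uses any NSX API, and each step still has to be done by hand.

```python
from dataclasses import dataclass, field

# Updated sequence from this post, in order. The step keys are hypothetical
# labels for illustration only; they are not NSX API calls.
STEPS = [
    ("delete_primary_role", "Delete the primary role from the primary NSX Manager"),
    ("remove_controllers", "Remove controllers one by one; force remove the last one"),
    ("assign_primary_role", "Assign the primary role to one of the NSX Managers"),
    ("add_secondaries", "Add the secondary NSX Managers"),
    ("redeploy_controllers", "Redeploy controllers one at a time"),
    ("redeploy_udlr_control_vm", "Redeploy the UDLR control VM in the new primary site"),
    ("remove_old_udlr_control_vm", "Remove the old UDLR control VM"),
]


@dataclass
class Runbook:
    """Tracks which steps are done and enforces the order listed in STEPS."""

    done: list = field(default_factory=list)

    def next_step(self):
        """Return the (key, description) of the next pending step, or None."""
        if len(self.done) < len(STEPS):
            return STEPS[len(self.done)]
        return None

    def complete(self, key):
        """Mark a step as done, refusing anything that is out of order."""
        pending = self.next_step()
        if pending is None:
            raise ValueError("all steps are already complete")
        if key != pending[0]:
            raise ValueError(f"out of order: expected {pending[0]!r}, got {key!r}")
        self.done.append(key)


if __name__ == "__main__":
    book = Runbook()
    for key, description in STEPS:
        # Print the step number, then record it as completed.
        print(f"{len(book.done) + 1}. {description}")
        book.complete(key)
```

The point of the check in complete() is simply that the old UDLR control VM cannot be marked as removed before the new one has been redeployed.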