Hi, the title says it all.
Recently I had a big outage when migrating the primary NSX manager role to another NSX manager.
I used the official VMware method to migrate the primary role:
1. Delete the primary role from the primary NSX manager.
2. Remove controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX managers.
4. Add secondary NSX managers.
5. Redeploy controllers one at a time.
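For reference, the same sequence can be driven through the NSX-V REST API. Treat this as a sketch only: NSX_OLD, NSX_NEW, the credentials, and controller-3 are placeholders, and the endpoints should be verified against the API guide for your NSX version.
# Sketch, not a tested script: endpoints are from the NSX-V universal sync API.
# 1. Remove the primary role from the current primary manager
curl -k -u 'admin:password' -X POST \
  "https://NSX_OLD/api/2.0/universalsync/configuration/role?action=set-as-standalone"
# 2. Remove controllers one by one; force removal for the last one
curl -k -u 'admin:password' -X DELETE \
  "https://NSX_OLD/api/2.0/vdn/controller/controller-3?forceRemoval=true"
# 3. Assign the primary role to the new manager
curl -k -u 'admin:password' -X POST \
  "https://NSX_NEW/api/2.0/universalsync/configuration/role?action=set-as-primary"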
Following the steps above can result in a network outage between the NSX Edges and the rest of the network. You may also get a synchronization error between the new primary and the old primary NSX manager, similar to this:
"Cannot redeploy UDLR control VM when UDLR local egress is disabled"
The problem lies somewhere between the UDLR control VM and the controller cluster. The UDLR control VM stays in the old location while the controller cluster is in the new location. This causes connectivity issues and, as a result, no routes on the UDLR.
One source tells me this is a bug on VMware's side, another that it is by design: automatically redeploying the UDLR control VM is not supported when local egress is disabled.
If that is the case, the sequence above should be modified to include a UDLR redeployment step, I believe somewhere between assigning the new primary role and redeploying the controllers.
This is the state for now; I'll be updating this post as the situation progresses.
Update: I have a new, working sequence of tasks.
1. Delete the primary role from the primary NSX manager.
2. Remove controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX managers.
4. Add secondary NSX managers.
5. Redeploy controllers one at a time.
6. Redeploy the UDLR control VM in the new primary NSX site.
7. Remove the old UDLR control VM.
Deployment of the new UDLR control VM is very fast; connectivity should be restored just after the new control VM is powered on.
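To confirm that routing is really back, you can check the UDLR state directly on an ESXi host with net-vdr. A quick sketch; the instance name default+edge-1 is only an example, take the real one from the instance listing:
# List the DLR/UDLR instances known to this host
net-vdr --instance -l
# Dump the route table of one instance (name taken from the listing above)
net-vdr --route -l default+edge-1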
Friday, 13 July 2018
How to determine physical location of bad VSAN components
Sometimes, as a VSAN admin, you must locate a bad VSAN component. This is especially true when you stumble upon the infamous "component metadata health" error.
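All the commands below are run from RVC. If you haven't used it before, RVC ships with the vCenter Server Appliance; a minimal session, assuming the same Datacenter/Cluster inventory path that appears in the prompts below:
# SSH to the vCenter Server Appliance, then start RVC
rvc administrator@vsphere.local@localhost
# Navigate to the VSAN cluster (path matches the RVC prompts below)
cd /localhost/Datacenter/computers/Cluster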
Using RVC you can get a nice text representation of the components:
vsan.health.health_summary .
+-----------------------------------+--------------------------------------+--------+---------------+
| Host                              | Component                            | Health | Notes         |
+-----------------------------------+--------------------------------------+--------+---------------+
| host7.something.com               | c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8 | Error  | Invalid state |
| host7.something.com               | 5648245b-b4cb-91b4-c786-0cc47a39c320 | Error  | Invalid state |
| host1.something.com               | d4e3325b-3436-4aa5-6707-0cc47aa4e64e | Error  | Invalid state |
| host1.something.com               | fca2325b-fc57-cd89-4bf4-0cc47aa3cf00 | Error  | Invalid state |
| host1.something.com               | 2c32355b-f810-6782-4c49-0cc47aa4e64e | Error  | Invalid state |
| host5.something.com               | 59f91d5b-9c8f-362d-88e5-0cc47aa3cf00 | Error  | Invalid state |
| host5.something.com               | 380e1b5b-842e-75e9-d305-0cc47aa3cf00 | Error  | Invalid state |
| host3.something.com               | fcb10d5b-2cef-cb46-19a5-0cc47a39bab8 | Error  | Invalid state |
+-----------------------------------+--------------------------------------+--------+---------------+
Having the component ID, let's find the disk ID:
/localhost/Datacenter/computers/Cluster>
vsan.cmmds_find . -u c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+
/localhost/Datacenter/computers/Cluster>
vsan.cmmds_find . -u 2c32355b-f810-6782-4c49-0cc47aa4e64e
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+
As you can see, the output is empty. Thanks to VMware support I was able to determine where those components are located, using the one-liner below on the affected host:
for i in $(vsish -e ls /vmkModules/lsom/disks/ | sed 's/.$//'); do echo; echo "Disk:" $i; localcli vsan storage list | grep $i -B 2 | grep Displ | sed 's/ / /'; echo " Components:"; for c in $(vsish -e ls /vmkModules/lsom/disks/"$i"/recoveredComponents/ 2>/dev/null | grep -v ^626); do vsish -e cat /vmkModules/lsom/disks/"$i"/recoveredComponents/"$c"info/ 2>/dev/null | grep -E "UUID|state" | grep -v diskUUID; done; done
Remember, it's a one-liner. If it gets wrapped, edit it accordingly.
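For readability, here is the same thing unwrapped into multi-line form (functionally identical; I've only added comments). Note that the blog formatting may have collapsed the whitespace inside the sed patterns, so double-check them against your own output before running:
# Walk every disk known to the LSOM layer on this host
for i in $(vsish -e ls /vmkModules/lsom/disks/ | sed 's/.$//'); do
  echo
  echo "Disk:" $i
  # Map the VSAN disk UUID to its NAA display name
  localcli vsan storage list | grep $i -B 2 | grep Displ | sed 's/ / /'
  echo " Components:"
  # Print UUID and state of every recovered component on this disk
  for c in $(vsish -e ls /vmkModules/lsom/disks/"$i"/recoveredComponents/ 2>/dev/null | grep -v ^626); do
    vsish -e cat /vmkModules/lsom/disks/"$i"/recoveredComponents/"$c"info/ 2>/dev/null | grep -E "UUID|state" | grep -v diskUUID
  done
done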
The result it gives looks like this:
Disk: 52fcf2cf-a2b3-765b-16da-6b1fbc17b623
   Display Name: naa.600605b00a63535021aa24b9dbc6fdae
 Components:
Disk: 52e53b16-d317-b32e-88c3-558c05fefec3
   Display Name: naa.600605b00a63535021aa24badbddc9de
 Components:
Disk: 52d3e0e5-e9c2-d6a5-9927-b10685a53dbf
   Display Name: naa.600605b00a63535021aa24c0dc3002a6
 Components:
Disk: 527923ad-74ac-80de-c24d-2204dacb91ee
 Components:
Disk: 52023556-e225-fa66-ac31-bbf91a968dee
   Display Name: naa.600605b00a63535021aa24c9dcc2a011
 Components:
UUID:5648245b-b4cb-91b4-c786-0cc47a39c320
state:10
Disk: 52545e5c-9acc-dd84-7351-fe500287cdb4
   Display Name: naa.600605b00a63535021aa24c5dc882959
 Components:
Disk: 52959ffa-298b-4183-c9ca-60265bbf1363
   Display Name: naa.600605b00a63535021aa24bedc154902
 Components:
Disk: 5202f5c4-7538-5964-fd55-975289da4d9b
   Display Name: naa.600605b00a63535021aa24c2dc4db53a
 Components:
Disk: 52772abe-e8b4-ec73-40fe-a75933126534
   Display Name: naa.600605b00a63535021aa24c7dca6d2ff
 Components:
Disk: 52eb25b2-c0b5-d629-ee44-0a3048d22701
   Display Name: naa.600605b00a63535021aa24c3dc6883f7
 Components:
UUID:c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8
state:10
Components with state:10 are the problematic ones, and you can also see the physical disk NAA ID. From here you can continue normal troubleshooting; in this case, removing the VSAN disk from the disk group.
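For completeness, removing the capacity disk from its disk group can also be done from the host shell with esxcli. A sketch, assuming the NAA ID found above and that you want the data evacuated first:
# Evacuate data off the disk, then remove it from its VSAN disk group
esxcli vsan storage remove --disk=naa.600605b00a63535021aa24c3dc6883f7 --evacuation-mode=evacuateAllData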