Tuesday 20 November 2018

Transferring the primary NSX Manager role to another manager causes a network outage in a cross-vCenter environment

Hi, the title says it all.

Recently I had a big outage when migrating the primary NSX Manager role to another NSX Manager.

I used the official VMware method to migrate the primary role:

1. Delete the primary role from the primary NSX Manager.
2. Remove the controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX Managers.
4. Add the secondary NSX Managers.
5. Redeploy the controllers one at a time.

Following these steps can result in a network outage between the NSX edges and the rest of the network. You may get a synchronization error between the new primary and the old primary NSX Manager, similar to this:

"Cannot redeploy UDLR control VM when UDLR local egress is disabled"

The problem lies somewhere between the UDLR control VM and the controller cluster. The UDLR control VM stays in the old location while the controller cluster is in the new one. This causes connectivity issues and, as a result, no routes on the UDLR.

One source tells me this is a bug on VMware's side, another that it is by design: automatic redeployment of the UDLR control VM is not supported when local egress is disabled.

If that is the case, the sequence above should be modified to include a UDLR redeployment step, I believe somewhere between assigning the new primary role and redeploying the controllers.


This is the state for now; I'll update this post as the situation progresses.

Update: I have a new, working sequence of tasks.

1. Delete the primary role from the primary NSX Manager.
2. Remove the controllers one by one. Use "force remove" on the last controller.
3. Assign the primary role to one of the NSX Managers.
4. Add the secondary NSX Managers.
5. Redeploy the controllers one at a time.
6. Redeploy the UDLR control VM in the new primary NSX site.
7. Remove the old UDLR control VM.

Deployment of the new UDLR control VM is very fast; connectivity should be restored as soon as the new UDLR control VM is powered on.


Friday 13 July 2018

How to determine physical location of bad VSAN components

Sometimes, as a VSAN admin, you must locate a bad VSAN component. This is especially true when you stumble upon the infamous "component metadata health" error.

Using RVC, you can get a nice text representation of the components:


vsan.health.health_summary .



+---------------------+--------------------------------------+--------+---------------+
| Host                | Component                            | Health | Notes         |
+---------------------+--------------------------------------+--------+---------------+
| host7.something.com | c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8 | Error  | Invalid state |
| host7.something.com | 5648245b-b4cb-91b4-c786-0cc47a39c320 | Error  | Invalid state |
| host1.something.com | d4e3325b-3436-4aa5-6707-0cc47aa4e64e | Error  | Invalid state |
| host1.something.com | fca2325b-fc57-cd89-4bf4-0cc47aa3cf00 | Error  | Invalid state |
| host1.something.com | 2c32355b-f810-6782-4c49-0cc47aa4e64e | Error  | Invalid state |
| host5.something.com | 59f91d5b-9c8f-362d-88e5-0cc47aa3cf00 | Error  | Invalid state |
| host5.something.com | 380e1b5b-842e-75e9-d305-0cc47aa3cf00 | Error  | Invalid state |
| host3.something.com | fcb10d5b-2cef-cb46-19a5-0cc47a39bab8 | Error  | Invalid state |
+---------------------+--------------------------------------+--------+---------------+
 

Having the component ID, let's find the disk ID:


/localhost/Datacenter/computers/Cluster> vsan.cmmds_find . -u c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+

/localhost/Datacenter/computers/Cluster> vsan.cmmds_find . -u 2c32355b-f810-6782-4c49-0cc47aa4e64e
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+

As you can see, the result is empty. Thanks to VMware support, I was able to determine where those components are located by running the one-liner below on the affected host:

for i in $(vsish -e ls /vmkModules/lsom/disks/ | sed 's/.$//'); do echo; echo "Disk:" $i; localcli vsan storage list | grep $i -B 2 | grep Displ | sed 's/   / /'; echo "  Components:"; for c in $(vsish -e ls /vmkModules/lsom/disks/"$i"/recoveredComponents/ 2>/dev/null | grep -v ^626); do vsish -e cat  /vmkModules/lsom/disks/"$i"/recoveredComponents/"$c"info/ 2>/dev/null | grep -E "UUID|state" | grep -v diskUUID; done; done

Remember, it's a one-liner. If it gets wrapped, join it back into a single line.

The result looks like this:


Disk: 52fcf2cf-a2b3-765b-16da-6b1fbc17b623
 Display Name: naa.600605b00a63535021aa24b9dbc6fdae
  Components:

Disk: 52e53b16-d317-b32e-88c3-558c05fefec3
 Display Name: naa.600605b00a63535021aa24badbddc9de
  Components:

Disk: 52d3e0e5-e9c2-d6a5-9927-b10685a53dbf
 Display Name: naa.600605b00a63535021aa24c0dc3002a6
  Components:

Disk: 527923ad-74ac-80de-c24d-2204dacb91ee
  Components:

Disk: 52023556-e225-fa66-ac31-bbf91a968dee
 Display Name: naa.600605b00a63535021aa24c9dcc2a011
  Components:
   UUID:5648245b-b4cb-91b4-c786-0cc47a39c320
   state:10

Disk: 52545e5c-9acc-dd84-7351-fe500287cdb4
 Display Name: naa.600605b00a63535021aa24c5dc882959
  Components:

Disk: 52959ffa-298b-4183-c9ca-60265bbf1363
 Display Name: naa.600605b00a63535021aa24bedc154902
  Components:

Disk: 5202f5c4-7538-5964-fd55-975289da4d9b
 Display Name: naa.600605b00a63535021aa24c2dc4db53a
  Components:

Disk: 52772abe-e8b4-ec73-40fe-a75933126534
 Display Name: naa.600605b00a63535021aa24c7dca6d2ff
  Components:

Disk: 52eb25b2-c0b5-d629-ee44-0a3048d22701
 Display Name: naa.600605b00a63535021aa24c3dc6883f7
  Components:
   UUID:c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8
   state:10

  

Components with state:10 are the problematic ones. You can also see the NAA ID of the physical disk. From here you can continue with normal troubleshooting, which in this case meant removing the VSAN disk from its disk group.
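
With many disks per host, the listing gets long. As a rough sketch (this is my own addition, not something VMware support provided), the one-liner's output could be piped through a small awk filter that keeps only the components in state 10, together with the disk they live on. The function name find_bad_components is made up for this example, and the filter assumes the output looks exactly like the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of vSAN): reads the one-liner's output on
# stdin and prints "<component UUID> <disk UUID> <NAA ID>" for every
# component reported with state:10.
find_bad_components() {
  awk '
    /^Disk:/                 { disk = $2; naa = "unknown"; next }  # new disk section
    /Display Name:/          { naa = $3; next }                    # physical disk NAA ID
    /^[ \t]*UUID:/           { sub(/^[ \t]*UUID:/, ""); uuid = $1; next }
    /^[ \t]*state:10[ \t]*$/ { print uuid, disk, naa }
  '
}

# Example with two disks copied from the output above
find_bad_components <<'EOF'
Disk: 52545e5c-9acc-dd84-7351-fe500287cdb4
 Display Name: naa.600605b00a63535021aa24c5dc882959
  Components:

Disk: 52eb25b2-c0b5-d629-ee44-0a3048d22701
 Display Name: naa.600605b00a63535021aa24c3dc6883f7
  Components:
   UUID:c30d1b5b-586b-8b74-b3c6-0cc47aa4b1b8
   state:10
EOF
```

On a host, the same function could be fed by appending "| find_bad_components" to the one-liner, assuming awk is available in the shell you run it from. A disk without a Display Name line is reported with "unknown" instead of an NAA ID.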


 

Tuesday 5 December 2017

PowerShell/PowerCLI: remove disabled NICs

Hi.

Today I had an interesting task: remove or modify disabled NICs in Windows guests. All I had was a list of VMs.

"Easy," I thought, "I'll take the MAC addresses of the disabled adapters, compare them against the adapters present in the VM, done."
As it turned out, it's not as straightforward as I assumed. When an adapter is disabled in the OS, its driver is unloaded and no information about the NIC's MAC or IP is provided. So no WMI.
There is the Get-NetAdapter cmdlet, which can return the MAC of a disabled NIC, but I had issues with PS remoting, and I didn't have access to the OS itself.

So I decided to turn a flaw into an advantage. VMware Tools collects information about the guest's NICs and presents it in $currentvm.ExtensionData.Guest.Net ($currentvm is of course a variable holding the VM object).

It holds all the nice details about the adapters that are active in the OS. So, if an adapter is disabled, it's not present here.

I then grabbed the list of all MACs from the Get-NetworkAdapter cmdlet.





Now it's just a matter of comparing the lists and adding some logic in case more than one active adapter is present in the OS. I ended up with this script:

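$vms = Get-Content .\vms.txt

foreach ($vm in $vms) {
    $currentvm = Get-VM $vm
    # MACs of adapters that are active in the guest OS (reported by VMware Tools).
    # Note: if VMware Tools is not running, this is empty and every adapter looks disabled.
    $workingmac = $currentvm.ExtensionData.Guest.Net.MacAddress
    # All adapters configured on the VM
    $currentvmadapters = $currentvm | Get-NetworkAdapter
    $macstoremove = $currentvmadapters.MacAddress

    # Drop every active MAC from the list; what remains belongs to disabled NICs
    if ($workingmac.Count -ge 1) {
        $workingmac | foreach {
            $removeme = $_
            $macstoremove = $macstoremove | sls -NotMatch "$removeme"
        }
    }

    # Disconnect the adapters that are disabled in the guest
    $macstoremove | foreach {
        $mac = $_
        $currentvmadapters | where { $_.MacAddress -like "$mac" } | Set-NetworkAdapter -StartConnected:$false -Connected:$false
    }
}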



I'm using the simple Select-String cmdlet to filter the MAC addresses currently in use out of the list of MACs to be modified.
I'm just disconnecting the NICs here, but of course the script could delete them instead.
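
The filtering step boils down to a set difference: all MACs configured on the VM minus the MACs the guest reports as active. Here is the same idea as a small shell sketch, for anyone who wants to try the logic outside PowerCLI. The function name disabled_macs and the sample addresses are invented for illustration. Unlike Select-String, it compares whole lines and ignores case.

```shell
#!/bin/sh
# Hypothetical helper: prints the MACs from list $1 that do not appear in
# list $2. Both arguments are newline-separated lists of MAC addresses.
disabled_macs() {
  all_macs=$1
  active_macs=$2
  if [ -z "$active_macs" ]; then
    # Nothing is reported as active, so every MAC is a candidate
    printf '%s\n' "$all_macs"
    return 0
  fi
  # -F fixed strings, -x whole-line match, -i ignore case, -v invert;
  # grep exits with 1 when nothing is left, which is not an error here
  printf '%s\n' "$all_macs" | grep -vixF -e "$active_macs" || [ $? -eq 1 ]
}

# Made-up sample data: three vNICs, one of them active in the guest
all='00:50:56:aa:bb:01
00:50:56:aa:bb:02
00:50:56:aa:bb:03'
active='00:50:56:AA:BB:02'

disabled_macs "$all" "$active"
```

With the sample data, the two addresses ending in 01 and 03 are printed. The empty-list branch mirrors a caveat of the script: when the guest reports no active adapters at all, every NIC ends up on the list.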


Update:

It turned out that the Remove-NetworkAdapter cmdlet works only on powered-off VMs. Bummer.

But thanks to the amazing LucD from the VMware forums, it can be done with a device change task for each device. Here's the thread.

So the final code may look like this:

$vms = Get-Content .\vms.txt

foreach ($vm in $vms) {
    $currentvm = Get-VM $vm
    # MACs of adapters that are active in the guest OS (reported by VMware Tools)
    $workingmac = $currentvm.ExtensionData.Guest.Net.MacAddress
    $vmos = $currentvm.ExtensionData.Config.GuestFullName
    $currentvmadapters = $currentvm | Get-NetworkAdapter
    $macstoremove = $currentvmadapters.MacAddress
    $nicstoremove = @()

    if ($vmos -notlike "*windows*") {
        # Log non-Windows VMs and leave them untouched
        $currentvm | select Name, @{n="OS"; e={$vmos}} | Out-File -Force -Append nonwindows_OS.txt
        Write-Host -ForegroundColor Yellow "$($currentvm.Name) is not a Windows VM..."
    }
    else {
        # Drop every active MAC from the list; what remains belongs to disabled NICs
        if ($workingmac.Count -ge 1) {
            $workingmac | foreach {
                $removeme = $_
                $macstoremove = $macstoremove | sls -NotMatch "$removeme"
            }
        }

        Write-Host -ForegroundColor Green "Processing $($currentvm.Name) VM..."

        $macstoremove | foreach {
            $mac = $_
            $nicstoremove += $currentvmadapters | where { $_.MacAddress -like "$mac" }
        }
    }

    # Skip the reconfigure task when there is nothing to remove
    if ($nicstoremove) {
        # Keep a record of what is about to be removed
        $nicstoremove | select Parent, MacAddress, Type, NetworkName | Export-Csv NicsToRemove.csv -Append -NoTypeInformation

        # Build one reconfigure spec with a "remove" operation per adapter
        $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
        foreach ($Adapter in $nicstoremove) {
            $devSpec = New-Object VMware.Vim.VirtualDeviceConfigSpec
            $devSpec.operation = "remove"
            $devSpec.device = $Adapter.ExtensionData
            $spec.deviceChange += $devSpec
        }
        $currentvm.ExtensionData.ReconfigVM_Task($spec) | Out-Null
    }
}















Sunday 17 September 2017

Zerto VPG creation failure (vim.fault.CannotCreateFile)

I just solved an annoying issue. I'm using Zerto to replicate VMs between sites. The target site uses vSAN as storage. Some time ago we had a host failure and replaced the host with a new one.

When creating a new VPG, Zerto threw a very generic error.




In the tasks I found additional information:

Cannot complete file creation operation.. Fault: Vim25Api.CannotCreateFile.

I started to dig. The first step was to look at the /var/log/vpxa.log file, where I found this:

2017-09-15T20:27:34.791Z info vpxa[2F85EB70] [Originator@6876 sub=Default opID=326fd094-3b] [VpxLRO] -- ERROR task-internal-108298 -- vpxa -- vpxapi.VpxaService.reserveName: vim.fault.CannotCreateFile:
--> Result:
--> (vim.fault.CannotCreateFile) {
--> faultCause = (vmodl.MethodFault) null,
--> file = "ds:///vmfs/volumes/vsan:a5c518a4ceaa4b9e-8cd24fc5c9c0cad3/44e38a59-5440-a5e4-a8e5-0cc47aa432a8/256_vsanDatastore_vm-393_history_10_134_9_132_log_volume.vmdk",
--> msg = ""
--> }
--> Args:
-->
--> Arg spec:
--> (vpxapi.VmLayoutSpec) {
--> vmLocation = (vpxapi.VmLayoutSpec.Location) null,
--> multipleConfigs = <unset>,
--> basename = "Z-VRA-host2.domain.com",
--> baseStorageProfile = <unset>,
--> disk = (vpxapi.VmLayoutSpec.Location) [
--> (vpxapi.VmLayoutSpec.Location) {
--> url = "ds:///vmfs/volumes/vsan:a5c518a4ceaa4b9e-8cd24fc5c9c0cad3/44e38a59-5440-a5e4-a8e5-0cc47aa432a8/256_vsanDatastore_vm-393_history_10_134_9_132_log_volume.vmdk",
--> key = 16001,
--> sourceUrl = <unset>,
--> urlType = "exactFilePath",
--> storageProfile = <unset>
--> }
--> ],
--> reserveDirOnly = <unset>
--> }
 


Not very helpful... The 'msg' line was empty, so I started chasing my tail trying to find something.
Then I looked into the /var/log/hostd.log file and found this:

2017-09-16T21:44:11.266Z info hostd[69401B70] [Originator@6876 sub=Solo.Vmomi opID=179ca1eb-c-cc73 user=vpxuser:VSPHERE.LOCAL\prod-Zerto-fcbdf1fb-3575-4d99-9fde-0be131222758] Result:
--> (vim.fault.FileAlreadyExists) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = (vmodl.LocalizableMessage) [
-->       (vmodl.LocalizableMessage) {
-->          key = "com.vmware.esx.hostctl.default",
-->          arg = (vmodl.KeyAnyValue) [
-->             (vmodl.KeyAnyValue) {
-->                key = "reason",
-->                value = "Failed to create directory 44e38a59-5440-a5e4-a8e5-0cc47aa432a8 (File Already Exists)"
-->             }
-->          ],
-->          message = <unset>
-->       }
-->    ],
-->    file = "44e38a59-5440-a5e4-a8e5-0cc47aa432a8"
-->    msg = ""
--> }


Ah, it can't create the directory. A quick look at the vSAN datastore contents showed that this directory already exists. But deleting it from the Web Client failed. So I went back to ESXi:


ls: ./44e38a59-5440-a5e4-a8e5-0cc47aa432a8: No such device or address


Hmm, something's wrong here, can I remove it by force? Of course:

 /usr/lib/vmware/osfs/bin/osfs-rmdir 44e38a59-5440-a5e4-a8e5-0cc47aa432a8 -f

Result:

Deleting directory 44e38a59-5440-a5e4-a8e5-0cc47aa432a8 in container id a5c518a4ceaa4b9e8cd24fc5c9c0cad3 backed by vsan (force=True)

 (yes, I used The Force)

I verified in the GUI that the directory was deleted (it was) and tried to create the VPG again... Voilà, it worked!

So, leftovers of the old host/VRA caused this issue. Interestingly, vSphere didn't give any indication that something was wrong. The health service reported everything as green.
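
Since the useful hint was buried in a multi-line fault dump, a small filter can save some scrolling next time. This is a sketch, assuming the fault is logged in the same layout as the hostd.log excerpt above; extract_fault_reasons is a name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads hostd.log text on stdin and prints the value
# that follows each key = "reason" line inside a fault dump.
extract_fault_reasons() {
  awk '
    /key = "reason"/ { want = 1; next }   # the value is on a following line
    want && /value = "/ {
      sub(/^.*value = "/, "")             # strip everything up to the opening quote
      sub(/"[^"]*$/, "")                  # strip the closing quote and what follows
      print
      want = 0
    }
  '
}

# Example using two lines from the log above
extract_fault_reasons <<'EOF'
-->                key = "reason",
-->                value = "Failed to create directory 44e38a59-5440-a5e4-a8e5-0cc47aa432a8 (File Already Exists)"
EOF
```

On a host it would be fed the real log, for example with "extract_fault_reasons < /var/log/hostd.log", assuming awk is available there.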

Oh, BTW, some time ago I passed the VCAP6 Design exam and now I'm a VCIX :P
 

Friday 30 September 2016

GNOME: disable screen power-off on the lock screen

Hi, a quick tip today. To keep the screen on while the lock screen is active, you must disable DPMS. Do it by typing in a console:

xset -dpms

Because this setting is only active during the current session, you must put it in a startup script. I keep my scripts in ~/.config.

I use gnome-session-properties to configure startup scripts.
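
gnome-session-properties stores its entries as .desktop files under ~/.config/autostart, so the same result can probably be scripted. This is a minimal sketch, assuming a desktop that honours XDG autostart entries; the function name and the file name disable-dpms.desktop are my own choice, not something GNOME requires.

```shell
#!/bin/sh
# Hypothetical helper: writes an XDG autostart entry that runs "xset -dpms"
# at login. The optional first argument overrides the target directory.
install_dpms_autostart() {
  dir="${1:-${XDG_CONFIG_HOME:-$HOME/.config}/autostart}"
  mkdir -p "$dir" || return 1
  cat > "$dir/disable-dpms.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=Disable DPMS
Exec=xset -dpms
EOF
}

# Usage (left commented out so that sourcing this file changes nothing):
# install_dpms_autostart
```

The entry takes effect at the next login; for the current session, xset -dpms still has to be run by hand.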

I even registered on the GNOME Bugzilla to suggest this workaround. Strange that all those Linux gurus didn't find this solution and that I, a Windows guy, had to do it ;)

Tuesday 1 March 2016

VCAP - DCA 5.5 thoughts

Today I took the VCAP - DCA 5.5 exam. And I passed :)

So now I'm officially:


A few thoughts about this exam; maybe someone will find them useful:

- Many people have already mentioned this: time is your worst enemy. I wasn't able to do a few questions because time ran out.
- Because of that, don't spend too much time on one question. Do whatever you can and move along.
- Pay attention to the question and to the servers you're supposed to work on.
- Expect typos in names and questions. Be intelligent.
- There is no option to "mark" a question for review at the end. So if you want to reread questions from the beginning of the exam, be ready to click "previous", "previous", "previous"... It takes time, which you don't have. So either remember the things you have to do or write them down.
- Yeah, it's slow... the connection, I mean. Don't freak out if you lose the connection for a short while (it happened to me).


Generally, if you have experience, you'll pass. Just watch that timer...

Wednesday 10 February 2016

Static teaming causes packet loss on HP BL460c Gen8 with Virtual Connect

A quickie today. We had problems with packet loss on Windows Server 2012 R2 servers. It turned out that Virtual Connect doesn't support the "Static" and "LACP" teaming modes, only "Switch independent".

Changing teaming mode resolved our issue.