Hello Dean Ferley,
Thanks for your question. You certainly don't want that happening again.
I would recommend starting with the built-in diagnostics in the portal:
- Navigate to the impacted VM in the Azure portal.
- Select Diagnose and solve problems > Common problems > VM restarted or stopped unexpectedly.
- On the VM restarted or stopped unexpectedly page, select My resource has been stopped unexpectedly from the Tell us more about the problem you are experiencing drop-down menu.
- Once you select My resource has been stopped unexpectedly, the diagnostics run on the impacted VM. After the diagnostics are completed, you can check the reboot RCA information from the diagnostics result.
The results may point to causes such as kernel panics, disk read errors, or guest OS faults.
These steps are documented here: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/windows/unexpected-vm-reboot-root-cause-analysis?source=recommendations
You should also review the Azure Activity Log for both VMSS instances around the time of the restart and check for any scaling events that might have triggered it.
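If you would rather script that check than scroll through the portal, here is a minimal sketch. It assumes you have exported the Activity Log to JSON (for example from `az monitor activity-log list`) and that each record carries `eventTimestamp`, `operationName.value` and `caller` fields. The keyword list and the `suspicious_events` helper are my own illustrative choices, not an Azure API, so adjust them to the operations you care about.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed keywords for operations worth a closer look; not an official list.
SUSPECT_KEYWORDS = ("restart", "deallocate", "poweroff", "redeploy", "reimage", "autoscale")

_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})?$")


def parse_ts(value):
    """Parse an ISO 8601 timestamp, trimming fractions longer than microseconds."""
    match = _TS.match(value)
    if not match:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, frac, tz = match.groups()
    frac = (frac or ".0")[:7]  # "." plus at most 6 digits
    tz = "+00:00" if tz in (None, "Z") else tz  # treat missing zone as UTC (assumption)
    return datetime.strptime(base + frac + tz, "%Y-%m-%dT%H:%M:%S.%f%z")


def suspicious_events(events, restart_time, window_minutes=30):
    """Return (time, operation, caller) for suspect operations near restart_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for event in events:
        operation = (event.get("operationName") or {}).get("value", "")
        when = parse_ts(event["eventTimestamp"])
        near_restart = abs(when - restart_time) <= window
        looks_suspect = any(key in operation.lower() for key in SUSPECT_KEYWORDS)
        if near_restart and looks_suspect:
            hits.append((when, operation, event.get("caller") or "unknown"))
    return sorted(hits)
```

Load the exported file with `json.load`, pass the resulting list together with a timezone-aware `datetime` for the approximate restart time, and look at who or what initiated each operation that comes back.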
To prevent it from happening again, ideally you should:
- Distribute your instances across Availability Zones (AZs) to reduce single points of failure.
Also, are you using Spot VMs? Spot instances can be evicted when Azure needs the capacity back.
I also recommend taking a look at:
https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/windows/understand-vm-reboot
If this helped, please mark it 'Accept Answer' and 'Upvote'.
Regards,
Abiola