Hello,
I have an odd one. While a node is live, without draining it or removing it from the cluster, we do the following:
1. Reboot it
2. Upon coming back up, sign in
3. Within a minute it'll bluescreen
4. Boot back up, sign in, everything is fine
The dump shows ntoskrnl.exe with DRIVER_IRQL_NOT_LESS_OR_EQUAL (0x000000D1).
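If anyone wants to sanity-check the dump, the pass I did was just the standard WinDbg one against Microsoft's public symbols:

.symfix; .reload
!analyze -v
k
lm t n

With 0xD1 the four bugcheck arguments give the referenced address, the IRQL, read vs. write, and the faulting instruction address; if a third-party module shows up under ntoskrnl in the stack, that's the real suspect, and lm t n pins down its driver file and timestamp.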
If you check the cluster operational log, you'll see it start a GUM process with GrantLock / Processing RequestLock entries. This happens over and over until it bluescreens. The subsequent reboot after the bluescreen shows GUM again, but it only does the "Executing locally" step.
Events below:
Preceding bluescreen (these repeated over and over, and were even being suppressed per the Application log):
[GUM] Node 2: Processing RequestLock 4:595
[GUM] Node 2: Processing GrantLock to 4 (sent by 5 gumid: 20121)
Post-bluescreen (note: these also showed up pre-bluescreen above, but rarely):
[GUM] Node 2: Executing locally gumId: 20121, updates: 1, first action: /dm/update
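Those [GUM] lines are in cluster.log format; to pull the same window again on a node I'd run roughly the following (<NodeName> is a placeholder):

# Regenerate cluster.log for the last 30 minutes on the affected node
Get-ClusterLog -Node <NodeName> -TimeSpan 30 -Destination C:\Temp
# Then pull the GUM entries out of it
Select-String -Path "C:\Temp\<NodeName>_cluster.log" -Pattern "\[GUM\]"

That gives the full GUM sequence with timestamps, rather than relying on the Event Viewer entries, some of which were being suppressed.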
Before the bluescreen, the following happens with the NIC in the Event Viewer. Keep in mind this NIC is part of a team: two of the four team members are down (standby, waiting to be plugged in if the others die) and two are live. The team is handled by the OS in Server Manager. We are using the latest Intel drivers, not the in-box drivers.
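Before the timeline below: since it's an OS-managed LBFO team, these two stock cmdlets are what I use to confirm team and per-member state while it flaps:

Get-NetLbfoTeam
Get-NetLbfoTeamMember

Nothing exotic, just confirming the two live members and two standby members look the way Server Manager says they do.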
Reboot - 9:51
Kernel-Power hardware notifications upon boot-up
Connectivity state in standby: Disconnected, Reason: NIC compliance - 9:54
Both adapters come online - 9:54
Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.
and
Intel® Ethernet 10G 2P X520 Adapter #2
Network link has been established at 10Gbps full duplex.
===============================
NICs report disconnected
Intel® Ethernet 10G 2P X520 Adapter
Network link is disconnected.
Intel® Ethernet 10G 4P X520/I350 rNDC #2
Network link is disconnected.
MsLbfoSys
Member Nic {30793b81-07bd-4afe-85f6-6dd873581384} Connected.
NIC disconnects again
Intel® Ethernet 10G 4P X520/I350 rNDC
Network link is disconnected.
NICs reconnect
Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.
MsLbfoSys
Member Nic {7947a925-563e-4bf8-b3c6-73c46ef2d4ed} Connected.
DNS Resolution and Domain Resolution fail - 9:55
iphlpsvc reports that the network is coming up - 9:55
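The timeline above was pieced together from the System log; for anyone who wants to reproduce the filter, something like this works (the Intel provider name varies by driver package, so <IntelProvider> is a placeholder you can find via Get-WinEvent -ListProvider *):

# System log entries from the boot window, limited to power/NIC/teaming sources
Get-WinEvent -FilterHashtable @{ LogName = 'System'; StartTime = (Get-Date).AddHours(-1) } |
    Where-Object { $_.ProviderName -match 'Kernel-Power|MsLbfoSys|<IntelProvider>' } |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, ProviderName, Message -Wrap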
At this point you can sign in to the server, and shortly thereafter it'll bluescreen. I have not yet tested it, but I believe it will also bluescreen without signing in (as was reported to me); I'm just relaying the most recent event. This doesn't happen every time, but it's roughly 50/50. I'll test in my lab this coming week to reproduce. Anything additional I should capture? As a note, this is reproducible across 10 similar physical servers, split into two clusters of 5 nodes each.
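One thing I'm planning for the lab run: Driver Verifier on the teaming and NIC drivers, so the next 0xD1 names the culprit directly instead of ntoskrnl.exe. The driver file names below are examples; confirm the actual binaries under Device Manager before flagging them:

# Flag the suspect drivers (standard checks only, to keep overhead down)
verifier /standard /driver mslbfoprovider.sys <IntelNicDriver>.sys
# After capturing a crash, turn it back off:
verifier /reset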
I see a hotfix for this 0xD1 issue for Server 2012, but this is 2016. My feeling is that the network coming up causes Windows or the cluster to grab the address space for the driver, and then the other one tries for it during the network recovery above, but the first never releases it. I'm assuming the cluster snags it first and Windows tries after, hence ntoskrnl.exe being blamed.
Any input would be great; this is an odd one and I'm hoping to track it down. I understand that delaying the startup of the SQL services might be a suggestion, but I've seen mixed reviews on doing that, and since this looks like cluster activity rather than SQL, I'm wondering if that's even an option here.
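If delayed start does end up being worth testing, for reference it's one line per service (MSSQLSERVER here assumes the default instance; named instances are MSSQL$<InstanceName>):

# Note the required space after start=
sc.exe config MSSQLSERVER start= delayed-auto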