SCOM 2012 R2 Agent Shows APM/App Pool Crashes in 2016 MG
We had built a new SCOM 2016 Management Group alongside the existing 2012 R2 UR11 one so that we could dual-home agents and test before switching over. Having seen posts reporting APM causing .Net Application Pools to crash with UR2, we decided to wait for UR3, but that did not fix the problem.
The System Center Operations Manager team blogged about the issue on 31st May.
On 6th June they blogged about workarounds that could be used until a fix was released, as UR3 still had problems.
- SCOM 2016 Agent can be replaced with SCOM 2012 R2 Agent, it’s forward-compatible with SCOM 2016 Server and APM feature will continue to work with the older bits
- SCOM 2016 Agent can be reinstalled with NOAPM=1 switch in msiexec.exe setup command line, APM feature will be excluded from setup
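As a rough illustration of the second workaround, here is a minimal sketch that builds the msiexec.exe command line with the NOAPM=1 switch. Only NOAPM=1 and msiexec.exe come from the Microsoft post; the function name, the MSI path, the /qn silent switch and the extra properties in the usage example are my assumptions, so check them against the agent setup documentation before using anything like this.

```python
def build_agent_install_command(msi_path, properties=None, no_apm=True):
    """Return the msiexec.exe argument list for a silent agent install.

    msi_path   -- path to the agent MSI (hypothetical, supplied by the caller)
    properties -- extra NAME=value setup properties (assumed, not from the post)
    no_apm     -- when True, append NOAPM=1 so the APM feature is excluded
    """
    args = ["msiexec.exe", "/i", msi_path, "/qn"]
    for name, value in (properties or {}).items():
        args.append(f"{name}={value}")
    if no_apm:
        args.append("NOAPM=1")
    return args


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = build_agent_install_command(
        "MOMAgent.msi",
        {"MANAGEMENT_GROUP": "MG2016", "MANAGEMENT_SERVER_DNS": "ms01.example.local"},
    )
    print(" ".join(cmd))
```

The command is returned as a list rather than executed, so it can be reviewed or handed to whatever deployment tool is available.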
To do the install with the recommended NOAPM switch would have meant manually installing the agent on hundreds of servers. The SCOM Team decided that a better option was to keep the existing 2012 R2 agent that was already on each server and initially dual-home it to both management groups. We could then upgrade the agent to 2016 when the fix was released. This work was carried out using Kevin Holman’s Agent Management MP (now SCOM Management), as it has a handy task for adding an existing agent to another Management Group. Pushing the agent out from the 2016 MG would have deployed the 2016 UR2 agent. This approach allowed us to have the 2012 R2 agent dual-homed to the old Production Management Group and the new 2016 one.
As we had decided to use the 2012 R2 agent and had not configured APM on any server, the SCOM Team thought they had followed the steps required to avoid the application pool crash issue while still being able to move forward with SCOM 2016 and get the benefits of scheduled maintenance mode and other features. We therefore did not expect SharePoint 2016 servers running on Windows 2012 R2 to start exhibiting this issue. The SharePoint Team removed the SCOM Agent, which stopped the problem for them but left the servers unmonitored.
The team started to analyse the situation. Two new registry keys had been added.
These are MULTI_SZ with the values being
We discovered that the registry keys contained the GUID for the SCOM APM component and turned on .Net profiling. We searched the web for any reports of the APM issue with the 2012 R2 agent. The issue was known to happen with SCOM 2016 agents, but we could not find any mention of this combination. We did find a forum post that mentioned the registry key for 2016 agents and recommended changing a rule in one of the APM Management Packs.
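To check a server for this condition, something like the following sketch could be used. It assumes the MULTI_SZ entries take the NAME=value form and that profiling is switched on through the standard CLR environment variables COR_ENABLE_PROFILING and COR_PROFILER; the post does not list the value names, so treat both as assumptions. The function names and the GUID in the example are placeholders.

```python
def parse_environment_entries(multi_sz):
    """Turn a list of 'NAME=value' strings (a MULTI_SZ value) into a dict."""
    settings = {}
    for entry in multi_sz:
        name, sep, value = entry.partition("=")
        if sep:  # skip entries that are not NAME=value pairs
            settings[name.strip().upper()] = value.strip()
    return settings


def profiling_enabled(multi_sz):
    """True if the entries switch on a CLR profiler (assumed variable names)."""
    settings = parse_environment_entries(multi_sz)
    return (
        settings.get("COR_ENABLE_PROFILING") == "1"
        and bool(settings.get("COR_PROFILER"))
    )


if __name__ == "__main__":
    # Placeholder GUID, not the real APM component GUID.
    sample = [
        "COR_ENABLE_PROFILING=1",
        "COR_PROFILER={00000000-0000-0000-0000-000000000000}",
    ]
    print(profiling_enabled(sample))
```

On a Windows server the list could come from Python's winreg module, whose QueryValueEx call returns a REG_MULTI_SZ value as a list of strings.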
On investigating this, we found that this rule was in the Operations Manager APM Infrastructure MP. This is a standard MP that is installed as part of the SCOM installation, but the MP in SCOM 2016 had a higher version number than the one in the 2012 R2 Management Group. We checked the differences between the two MPs and found that Microsoft had added an undocumented feature to the newer version.
The rule – Apply APM Agent Configuration
There is a new parameter in the 2016 MP that is not in the 2012 R2 MP – Enable RTIA Profiler.
This parameter is on by default, and the rule is targeted at every server that has the APM agent (the .Net Application Monitoring class). In practice this means every server, as the APM agent is automatically installed in a switched-off state when the SCOM agent is installed.
We tested this override and found that when we disabled it for a server then the registry keys would be removed and when we set it back to enabled, it would add these keys back again. This was the difference between the 2012 R2 Management Group and the new 2016 Management Group.
Next, after discussion with the SharePoint Team, we requested access to a SharePoint server that had displayed the errors, so that we could reproduce the problem on a server known to break and confirm that the fix worked.
We overrode the rule to enable it for the SharePoint server and could see that the two registry keys were created as soon as it received the MP. We left it for about 6 minutes and saw no issues. As soon as we ran an IISReset, the events for the App Pool crashing appeared in the Application log: Events 1325, 2016 and 1000. These kept repeating, which meant the web site using that application pool no longer worked.
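When repeating this kind of test, it helps to tally the repeating events rather than eyeball them. The sketch below counts the three event IDs named above in a list of exported log records. The record shape (a dict with an "event_id" key) and the function name are assumptions for illustration, not any particular export format.

```python
from collections import Counter

# Event IDs reported in the Application log during the crash loop.
CRASH_EVENT_IDS = {1325, 2016, 1000}


def tally_crash_events(records, watched_ids=CRASH_EVENT_IDS):
    """Count how often each watched event ID appears in the records."""
    return Counter(
        record["event_id"]
        for record in records
        if record["event_id"] in watched_ids
    )


if __name__ == "__main__":
    # Hypothetical records, as they might look after exporting the log.
    sample = [
        {"event_id": 1325},
        {"event_id": 1000},
        {"event_id": 1325},
        {"event_id": 7036},  # unrelated event, ignored
    ]
    print(tally_crash_events(sample))
```

Running the tally before and after changing the override gives a quick check that the counts have stopped growing.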
In the SCOM Console we could see the Application Pool alert increment as the pools kept crashing.
After we removed the override that enabled it, the events stopped and the alert stopped incrementing.
We also found that a number of servers had these events repeating in the Operations Manager event log.
These events were the agent trying to start and failing. We did not investigate these as the main issue was with the application pools.
As a second test we removed the agent and reinstalled it with the NOAPM switch. As this server no longer had the .Net Application Monitoring class, the rule did not apply to it. During all tests this server carried on with no problems regardless of how the rule was set, because the rule would never be sent to a server on which the target class did not exist.
As we were not using APM we put in the override to disable the parameter “EnableRTIA” (set to false) for the rule “Apply APM Agent Configuration”. This will ensure that the registry keys are not created. This will stop the application pool crashes due to the .Net Profiler being activated and the APM Agent error events across the estate.
Note that, as only one .Net profiler can run at a time, you may also want to disable this rule if you are using the .Net profiler with a different application. While searching, I found that this problem had occurred with a product called New Relic.
Note that this override had to be done in both the 2016 Test and 2016 Production Management Groups (if you have this setup) as dual homed agents will create the following warning alert – “APM .NET Server-Side monitoring Configuration Error or Conflict”.
Note that this does not stop application pools from crashing due to genuine problems.
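The behaviour we observed can be summarised as a small decision model. This is only a sketch of our findings, not how the MP is implemented: the function name and the returned labels are mine, and treating any disagreement between management groups as the trigger for the conflict alert is an assumption based on what we saw with dual-homed agents.

```python
def expected_profiler_state(has_apm_class, enable_rtia_by_group):
    """Predict what the Apply APM Agent Configuration rule does on a server.

    has_apm_class        -- False if the agent was installed with NOAPM
    enable_rtia_by_group -- effective EnableRTIA value per management group
    """
    if not has_apm_class:
        # The rule's target class does not exist, so it is never delivered.
        return "rule not applied"
    if not enable_rtia_by_group:
        raise ValueError("agent must report to at least one management group")
    values = set(enable_rtia_by_group.values())
    if len(values) > 1:
        # Assumed cause of the configuration error or conflict alert.
        return "conflict alert"
    return "registry keys created" if values == {True} else "registry keys removed"


if __name__ == "__main__":
    # Hypothetical management group names.
    print(expected_profiler_state(True, {"Test2016": False, "Prod2016": True}))
    print(expected_profiler_state(True, {"Test2016": False, "Prod2016": False}))
    print(expected_profiler_state(False, {"Prod2016": True}))
```

The model makes it easier to see why the override has to be applied consistently in every management group the agent reports to.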
For the SharePoint Farm we recommended installing the agent with the NOAPM switch to give double reassurance to the SharePoint Team that this would not cause problems again.
I hope that SCOM 2016 UR4 fixes this problem for good.
I would like the Product Group to consider adding an option to agent installation from the console, with a tickbox for whether or not to install the APM agent. That would have made it easier to change the agent from the console rather than having to install it manually (without a software deployment tool this is a lot of work).
Edit – 6 August 2017
I mentioned Kevin Holman’s SCOM Management MP, which helps to find servers with old Management Groups and remove them, and also has a task to add a Management Group to an agent. The MP also contains a task to run a command or PowerShell command from a central location, which I had thought about using. Kevin has just blogged about that task and how to use it to remove the APM agent from your servers.
This is not quite as good as the Product Group doing it as part of the agent install, as the agent has to be deployed before the task can run, but it is still pretty useful for changing the agent.
He has also listed all the APM MPs that you can remove from SCOM 2016 so that APM is never deployed.