I recently noticed that some of the Macs in my deployment have stopped running policies, even though they’ve been checking in regularly. The telltale sign is a machine with a recent check-in whose last inventory update is more than 24 hours old (we request an inventory update every 24 hours), as shown in the following screen shot.
I noticed one common set of symptoms on all of the affected Macs:
- Attempting to run sudo jamf policy resulted in “this policy trigger is already being run”
- Killing the jamf process or rebooting would allow me to run sudo jamf policy
- The policies would run as expected until a policy attempting to rotate a FileVault 2 recovery key ran, at which point it would get stuck
Some more digging revealed that this behavior would stop if the FileVault 2 Recovery Key Redirection Configuration Profile was not present on the Mac. Since I’ve been having so much trouble with defects in Configuration Profiles lately, I suspected this was related and opened a case with JAMF. My TAM Paul Nichols helped me figure out exactly what was going on.
It seems that when a FileVault 2 Recovery Key Redirection payload is present, a dropped network connection during a FileVault 2 Recovery Key Rotation leaves the FDERecoveryAgent process stuck in a seemingly infinite loop. You can test this yourself by deploying a FileVault 2 Recovery Key Redirection Configuration Profile to a Mac, disconnecting it from the network, and running the following command.
sudo fdesetup changerecovery -personal -verbose
On first run, everything appears normal. However, if you attempt to run it a second time (even with the network connection back online), the command will get stuck on the “escrowing recovery key” step. My suspicion is that the JAMF policy that rotates the FileVault 2 Recovery Key waits for the FDERecoveryAgent to finish, with no timeout. As a result, every subsequent check-in to the JSS returns “this policy trigger is already being run” and no policies run.
The good news is that the fix is actually quite simple once a computer is in this state. Unloading the FDERecoveryAgent seems to clear it right up. This can be done with the following command.
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.security.FDERecoveryAgent.plist
Great! All I have to do is write a script that can detect if FDERecoveryAgent has been running for a significant amount of time (say 5 minutes) and unload the process. So that’s what I did.
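A minimal sketch of that kind of check, not necessarily the exact script I deployed, might look like the following. It assumes the process can be found by name with pgrep and that ps reports elapsed time in the usual [[dd-]hh:]mm:ss format. The function names and the THRESHOLD_SECONDS variable are illustrative choices, not anything provided by Casper.

```shell
#!/bin/sh
# Sketch: unload FDERecoveryAgent if it has been running longer than a threshold.
# Must run as root; launchctl cannot unload a system Launch Daemon otherwise.

THRESHOLD_SECONDS="${THRESHOLD_SECONDS:-300}"   # 5 minutes
AGENT_NAME="FDERecoveryAgent"
AGENT_PLIST="/System/Library/LaunchDaemons/com.apple.security.FDERecoveryAgent.plist"

# Convert ps "etime" output ([[dd-]hh:]mm:ss) to whole seconds.
etime_to_seconds() {
    printf '%s\n' "$1" | awk '{
        gsub(/[ \t]/, "")                       # ps pads etime with spaces
        days = 0
        if (split($0, d, "-") == 2) { days = d[1]; rest = d[2] } else { rest = $0 }
        n = split(rest, t, ":")
        if (n == 3) { secs = t[1] * 3600 + t[2] * 60 + t[3] }
        else        { secs = t[1] * 60 + t[2] }
        print days * 86400 + secs
    }'
}

# Unload the agent only when it is running and older than the threshold.
unload_if_stuck() {
    pid=$(pgrep -x "$AGENT_NAME" | head -n 1)
    [ -n "$pid" ] || return 0                   # not running, nothing to do
    etime=$(ps -o etime= -p "$pid")
    [ -n "$etime" ] || return 0                 # process exited in the meantime
    elapsed=$(etime_to_seconds "$etime")
    if [ "$elapsed" -gt "$THRESHOLD_SECONDS" ]; then
        echo "$AGENT_NAME (pid $pid) has been running for ${elapsed}s, unloading"
        launchctl unload "$AGENT_PLIST"
    fi
}

unload_if_stuck
```

On a healthy Mac the agent is either not running or only a few seconds old, so the script exits without touching anything. The threshold can be overridden for testing, for example by running it with THRESHOLD_SECONDS=60 set in the environment.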
There’s only one small catch: since policies aren’t running, there’s no way to push the command out via Casper. I realized that I was going to need to get ahead of this problem if I wanted to solve it, so I decided to attack it from 3 sides.
In the FileVault 2 Recovery Key Rotation Policy. Since this is the policy that’s getting stuck, I added the above script to run before the rest of the policy in an effort to unload the FDERecoveryAgent if it’s been running for longer than 5 minutes. That way the policy should be able to do its thing without getting stuck.
Locally with a Launch Daemon. I also decided that it would be a good idea for the script to live locally on every Mac and run periodically to check. I did this with a simple Launch Daemon that calls the script every few hours.
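As a rough illustration of the Launch Daemon approach, the sketch below renders a plist that uses launchd's StartInterval key to run a script on a fixed interval. The label, the script path, and the 4 hour interval are all hypothetical placeholders, so substitute your own values.

```shell
#!/bin/sh
# Sketch: generate a Launch Daemon plist that runs a local script periodically.
# LABEL, SCRIPT_PATH and INTERVAL_SECONDS are hypothetical example values.

LABEL="com.example.fderecoveryagent-check"
SCRIPT_PATH="/usr/local/bin/unload_stuck_fderecoveryagent.sh"
INTERVAL_SECONDS=14400    # "every few hours"; 4 hours is an assumed value

# Print the plist to stdout so it can be reviewed before installing.
render_plist() {
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>${LABEL}</string>
    <key>ProgramArguments</key>
    <array>
        <string>${SCRIPT_PATH}</string>
    </array>
    <key>StartInterval</key>
    <integer>${INTERVAL_SECONDS}</integer>
</dict>
</plist>
EOF
}

# Write the plist into /Library/LaunchDaemons and load it (requires root).
# Not called automatically; invoke it once the output above looks right.
install_daemon() {
    plist="/Library/LaunchDaemons/${LABEL}.plist"
    render_plist > "$plist"
    chown root:wheel "$plist"
    chmod 644 "$plist"
    launchctl load "$plist"
}

render_plist
```

The script at SCRIPT_PATH needs to be executable and owned by root for launchd to run it. Both the plist and the script can be delivered in a package, with install_daemon (or its equivalent) run as a postinstall step.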
The two methods above will most likely prevent any additional Macs from getting stuck in the loop, but unfortunately neither will address Macs that are already stuck. Which brings me to my third strategy.
Run the script via Self Service. Fortunately, it seems that Self Service policies continue to work while the Mac is stuck in the loop. I set up a smart group in Casper that matches any machine that has checked in within the last 24 hours but has not updated inventory in over 48 hours, and scoped a policy to it that runs the above script. I sent an email to all of the affected users asking them to run the policy. There are still a few stragglers that I will need to chase down, but since implementing the above solutions I haven’t seen any new Macs become affected.
Have you run into this in your environment? Let me know in the comments.