My EndpointSecurity Client process is kicked by OS on Mac sleep/wake cycle

Question

suMac OP

Created 1d

Hi, I develop an ES client that applies a rule engine to evaluate ES events (mostly file-system events).

It is a bit non-standard in that it is deployed as a global daemon rather than as a System Extension.

On some Macs, I sometimes see "crash reports" for the ES process, all sharing

Termination Reason: Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

None of these happen during normal Mac usage; they all happen right at wake time, after the Mac has been asleep.

My guess is that some ES_AUTH events (with deadlines) arrive as the Mac goes to sleep, and somehow my high-priority dispatch_queue handling them is "put to sleep" mid-processing, so when the Mac wakes up, event handling continues long after the deadline has passed, and macOS decides to kick the process.

Questions:

What is the recommended behavior for an ES client across sleep/wake cycles? (We're not an antivirus, and we don't much mind clearing events or going "blind" during that time.)
Can I specify somewhere in the Info.plist of my bundle (this is built like an app) that my process shouldn't be put to sleep, or that the OS should sleep it only when it becomes idle, or in some other way tell the OS that it is "ready for sleep"?
If not, how do I observe the scenario so I can suspend my event handling IN TIME and resume on wake?

Thanks!

Answer 1

DTS Engineer OP

Apple

5h

It is a bit non-standard in that it is deployed as a global daemon rather than as a System Extension.

This is actually pretty common.

So, let me start here:

My guess is that some ES_AUTH events (with deadlines) arrive as the Mac goes to sleep, and somehow my high-priority dispatch_queue handling them is "put to sleep" mid-processing, so when the Mac wakes up, event handling continues long after the deadline has passed, and macOS decides to kick the process.

No, I don't think that's what's going on, at least not in the basic case. The system that "feeds" events into your ES client is a kernel extension feeding events into user space, and it will have kept your ES client active long enough that we "cleared" all syscall activity through your client before we allowed the system to sleep. There are lots of places where things can go wrong, but I don't think the core event delivery system is the issue.

That leads back to here:

Hi, I develop an ES client that applies a rule engine to evaluate ES events (mostly file-system events).

What does that engine actually "do"? In particular:

Is it running inside your daemon or in a helper process?
What, if ANY, system APIs (file system related or not) does it call in the process of evaluating its rules?
What logic have you put in place to ensure that you're processing requests quickly and, if necessary, short-circuiting your normal processing to ensure you complete every request?
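
To make the last question concrete, here is a minimal sketch of what "short-circuiting" could look like, assuming a rule engine that walks a list of rules per event. Everything in it is hypothetical: the `verdict_t`, `rule_fn`, and `clock_fn` types, the `evaluate_with_budget` function, and the choice of fallback verdict are illustrative names and policy, not part of the EndpointSecurity API. The clock is injected so the logic can be exercised without a real timer; in a real client it would presumably wrap something like `mach_absolute_time()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical verdicts; a real client would map these to its auth response. */
typedef enum { VERDICT_ALLOW, VERDICT_DENY } verdict_t;

/* Hypothetical rule: returns true when the rule matches (and denies) the path. */
typedef bool (*rule_fn)(const char *path);

/* Hypothetical clock: monotonic ticks in the same units as the soft deadline. */
typedef uint64_t (*clock_fn)(void);

typedef struct {
    verdict_t verdict;
    bool short_circuited;   /* true when we stopped early because time ran out */
    size_t rules_evaluated; /* how many rules actually ran */
} eval_result_t;

/* Evaluate rules in order, but give up and return `fallback` as soon as the
 * clock reaches `soft_deadline`. The first matching rule denies the event. */
eval_result_t evaluate_with_budget(const rule_fn *rules, size_t rule_count,
                                   const char *path, uint64_t soft_deadline,
                                   clock_fn now, verdict_t fallback)
{
    eval_result_t r = { VERDICT_ALLOW, false, 0 };
    for (size_t i = 0; i < rule_count; i++) {
        if (now() >= soft_deadline) {
            r.verdict = fallback;
            r.short_circuited = true;
            return r;
        }
        r.rules_evaluated++;
        if (rules[i](path)) {
            r.verdict = VERDICT_DENY;
            return r;
        }
    }
    return r;
}

/* Demo scaffolding only: a fake clock that advances 10 ticks per call,
 * and two toy rules. */
static uint64_t demo_ticks = 0;
static uint64_t demo_clock(void) { demo_ticks += 10; return demo_ticks; }
static bool rule_never_matches(const char *path) { (void)path; return false; }
static bool rule_denies_tmp(const char *path) { return strncmp(path, "/tmp/", 5) == 0; }
```

Note the limitation: the budget is only checked between rules, so a single rule that blocks inside a system call will still sail past the deadline. That case needs the answer to come from somewhere other than the blocked thread, which is why the question about which system APIs the rules call matters so much.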

I suspect what's actually going on here is that some part of the processing you're doing ends up calling into a system daemon which either isn't yet awake (because you're processing an event during very early wake) or already asleep (because you're very late in the sleep process).

One thing to be aware of about the termination here:

On some Macs, I sometimes see "crash reports" for the ES process, all sharing Termination Reason: Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

First off, if possible, please post the crash logs as I'd like to see what they actually show.

Next, I'm not sure how much you can "trust" the timing data of the log itself. I'd need to look at the code in more detail, but I suspect the termination logic works something like this:

1. The ES system decides you've missed a deadline.
2. The ES system "cuts" your daemon out of the approval loop, allowing "normal" system operation to proceed.
3. The ES system sends the termination request to kill your app.
4. The request is processed, killing your app.

The problem here is that if you're late enough in the sleep process, it's entirely possible (fairly likely even) that the system will actually go to sleep somewhere after #2 but before #3/#4. In that case, the log will show that you terminated after wake (because that's what happened) even though the termination decision (and the problem you need to fix) actually happened before sleep.

Unfortunately, this may also render the crash log itself fairly useless, as your daemon may have proceeded past the failure point without realizing anything went wrong. If you haven't looked at this already, this is where doing your own "manual" deadline tracking can be helpful: in production, you can use it to prevent catastrophic failure, and in debugging, you can use it to force a crash closer to the point where the "real" failure occurred.

Finally, you can also use a much shorter deadline in your own tracker. MANY ES failures actually start as significant slowdowns in event processing but only terminate the client when the right circumstances make the slowdown big enough. Those earlier slowdowns are exactly the same underlying failure, but are often easier to follow and debug because "less" is going on.
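
As a rough illustration of that kind of tracker, here is a minimal sketch, assuming all timestamps are in the same monotonic tick units. The `deadline_tracker_t` type and both functions are hypothetical names made up for this example. In a real client, my understanding is that the hard deadline would come from the `deadline` field of `es_message_t` and the timestamps from `mach_absolute_time()`, but the sketch takes plain integers so it does not depend on either.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: counts events that overran a self-imposed budget that
 * is much shorter than the deadline the system enforces. */
typedef struct {
    uint64_t soft_budget;   /* ticks allowed per event before it counts as "slow" */
    uint64_t events;        /* total events recorded */
    uint64_t slow_events;   /* events whose latency exceeded soft_budget */
    uint64_t worst_latency; /* largest latency seen so far */
} deadline_tracker_t;

/* Soft deadline for one event: received + soft_budget, clamped so it never
 * lands after the hard deadline (and never wraps around on overflow). */
uint64_t tracker_soft_deadline(const deadline_tracker_t *t,
                               uint64_t received, uint64_t hard_deadline)
{
    uint64_t soft = received + t->soft_budget;
    if (soft < received || soft > hard_deadline) {
        soft = hard_deadline;
    }
    return soft;
}

/* Record a finished event. Returns true when it overran the soft budget, so a
 * debug build can log or abort() right at the slow event. */
bool tracker_record(deadline_tracker_t *t, uint64_t received, uint64_t finished)
{
    uint64_t latency = finished >= received ? finished - received : 0;
    t->events++;
    if (latency > t->worst_latency) {
        t->worst_latency = latency;
    }
    if (latency > t->soft_budget) {
        t->slow_events++;
        return true;
    }
    return false;
}
```

A debug build could call `abort()` whenever `tracker_record` returns true, which puts the crash next to the slow event instead of wherever the system's own termination eventually lands. A production build could instead watch `slow_events` and `worst_latency` over time to catch the slowdowns before they become terminations.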

Can I specify somewhere in the info.plist of my bundle (this is built like an App)?

As a minor side note, good, that is an excellent choice. At this point, I'm basically convinced that ALL daemons should be inside an app bundle. At a minimum, it lets you make the component more "attractive" (custom icon, localized names, better "Get Info") and there's a good chance it will prevent problems (now and in the future).

That leads to here:

That my process shouldn't be put to sleep, or that the OS should sleep it only when it becomes idle, or in some other way tell the OS that it is "ready for sleep"?

If not -- How do I observe the scenario so I can suspend my event handling IN TIME and resume on wake?

Unfortunately, this is the wrong way to think about this (and most ES issues). ES clients are event processing "engines" and they fail when defects in that event processing logic prevent events from flowing properly. The trap here is focusing on the direct cause ("the system went to sleep") instead of trying to figure out the underlying defect ("what disrupted engine processing"). Focusing on specific causes can easily lead to an ongoing series of increasingly weird edge cases as the system finds new and interesting ways to trip over the underlying defect.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
