
Excessive lag #332

Open

edwardalee opened this issue Jan 19, 2024 · 17 comments

Comments

@edwardalee
Contributor

edwardalee commented Jan 19, 2024

Related discussion: #289.

At least on macOS, we get a lag (physical time minus logical time) that is always larger than what I would expect. @lsk567 and I looked at this yesterday with the following extremely simple LF program:

[image: screenshot of the simple LF program]

In this program, the runtime system will call lf_cond_timedwait() to wait for physical time to match each multiple of 1ms before invoking the reaction. lf_cond_timedwait() calls the POSIX function pthread_cond_timedwait() on macOS (and presumably on Linux as well). We inserted the following code at the point in the runtime right after it returns from lf_cond_timedwait().

            // Record the current physical time as a "user event" in the trace.
            instant_t now = lf_time_physical();
            tracepoint(env->trace, user_event, NULL, NULL, 0, 0, 0, &now, NULL, 0, false);

This inserts a "user event" into the trace file. We then converted the trace file to CSV using trace_to_csv and looked at the times at which things are occurring. The trace looks something like this:

[image: excerpt of the trace converted to CSV]

Here, the last column is physical time as reported by lf_time_physical_elapsed(). The above section of the trace covers the advance of logical time from 2 ms to 3 ms.

What we found was pretty interesting.

  1. pthread_cond_timedwait() consistently returns later than you ask it to. In the above trace, it is 291us late.
  2. The time taken between when pthread_cond_timedwait() returns and the reaction gets invoked is pretty small. In the above trace, it is 5us, although it is often as low as 1us. 1us is the smallest time increment we can measure with lf_time_physical() because the macOS clock has a resolution of 1us.
  3. lf_cond_timedwait() sometimes gets invoked many times in rapid succession, several dozen calls in a quick burst. Why?

The lag we are seeing is mostly due to pthread_cond_timedwait(), not to our runtime system. There are some obvious questions:

  1. Is pthread_cond_timedwait() using the same clock as lf_time_physical()? If not, this could account for what we are seeing.
  2. Could we learn at runtime what the offset is between pthread_cond_timedwait() and lf_time_physical() and correct for it by pre-subtracting the 200us from the argument given to pthread_cond_timedwait()?
@erlingrj
Collaborator

pthread_cond_timedwait might experience spurious wakeups. So that is probably why you are seeing dozens of calls. I have read somewhere that it is a lot less precise than nanosleep. I think we only need the condition variable if we have physical actions in the program. An optimization could be to use nanosleep in programs without physical actions.
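
For reference, a minimal sketch of the standard pattern for tolerating spurious wakeups; the variable names (new_event, ts) are illustrative and not the actual reactor-c identifiers:

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t event_q_changed = PTHREAD_COND_INITIALIZER;
    static bool new_event = false;  // hypothetical predicate set by the signaling thread

    // Wait until the absolute time 'ts' or until 'new_event' becomes true.
    // pthread_cond_timedwait() may return spuriously, so the wait must be
    // wrapped in a loop that re-checks the predicate and the return code.
    static int wait_until(const struct timespec* ts) {
        int result = 0;
        pthread_mutex_lock(&mutex);
        while (!new_event && result != ETIMEDOUT) {
            result = pthread_cond_timedwait(&event_q_changed, &mutex, ts);
        }
        pthread_mutex_unlock(&mutex);
        return result;
    }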

@erlingrj
Collaborator

Actually, the reason nanosleep was more accurate is that, for small sleeps (less than 2 ms), when the thread is scheduled with a real-time scheduler such as SCHED_RR, those small sleeps are turned into busy-waits instead.

My tests showed pretty similar performance for pthread_cond_timedwait and nanosleep, on the order of 60-90 usec of lag. A simple optimization would be to wake up 200 usec earlier and do a busy wait for the remainder; we can do this for both timedwait and nanosleep. I think this optimization should be optional, as it consumes more CPU to get more accurate timing.
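
A minimal sketch of that idea, assuming a 200 usec margin and a CLOCK_MONOTONIC time source (the helper names here are illustrative, not reactor-c's actual wait routine):

    #include <stdint.h>
    #include <time.h>

    #define EARLY_WAKEUP_MARGIN_NS 200000LL  // wake ~200 usec early (illustrative value)

    // Current CLOCK_MONOTONIC time in nanoseconds.
    static int64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    // Sleep until the absolute time 'wakeup_ns': block with nanosleep until
    // shortly before the deadline, then busy-wait the last stretch so the
    // wakeup lag is bounded by the spin loop rather than the scheduler.
    static void sleep_until_hybrid(int64_t wakeup_ns) {
        int64_t coarse_ns = wakeup_ns - EARLY_WAKEUP_MARGIN_NS - now_ns();
        if (coarse_ns > 0) {
            struct timespec req = {.tv_sec = coarse_ns / 1000000000LL,
                                   .tv_nsec = coarse_ns % 1000000000LL};
            nanosleep(&req, NULL);
        }
        while (now_ns() < wakeup_ns) {
            // Spin; trades CPU for lower lag, so this should be opt-in.
        }
    }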

So this optimization based on physical actions might not be worth it. (@lhstrh: FYI)

@lhstrh
Member

lhstrh commented Jan 29, 2024

Thanks for the tag. Where is the busy-wait happening? In our implementation or in nanosleep itself?

@erlingrj
Collaborator

I am going to do a more thorough comparison on my desktop before we draw any conclusions.

The busy-waiting was done inside nanosleep itself, but in newer versions of the Linux kernel this optimization has been removed. It is worth considering bringing such an optimization into reactor-c; Edward already proposed it: wake up a few hundred microseconds early and do the remainder as a busy-wait.

@erlingrj
Collaborator

This is a simple benchmark comparing the wakeup lag of nanosleep and pthread_cond_timedwait on my Ubuntu desktop. The program tries to sleep for 1 ms and I measure the "lag", i.e. how many additional nanoseconds we spend (of course there is some instrumentation overhead).
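
Roughly the kind of micro-benchmark used (a sketch, not the exact measurement code; the nanosleep call would be swapped for a 1 ms pthread_cond_timedwait in the second variant):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define SLEEP_NS   1000000LL   // 1 ms
    #define ITERATIONS 1000

    static int64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        struct timespec req = {.tv_sec = 0, .tv_nsec = SLEEP_NS};
        int64_t total_lag_ns = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            int64_t before = now_ns();
            nanosleep(&req, NULL);  // the cond_timedwait variant waits on a never-signaled condvar
            total_lag_ns += now_ns() - before - SLEEP_NS;
        }
        printf("average lag: %lld ns\n", (long long)(total_lag_ns / ITERATIONS));
        return 0;
    }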

[image: histogram of wakeup lag for nanosleep vs. pthread_cond_timedwait]

They have very similar performance, averaging around 85 usec.

When I increase the priority of the thread running the benchmark and schedule it with SCHED_FIFO, nanosleep improves while cond_timedwait is unchanged:
[image: histogram of wakeup lag with SCHED_FIFO scheduling]

I think this experiment shows that it is worth it to use nanosleep in programs without physical actions.

@soyerefsane
Collaborator

I ran the same experiment on a Raspberry Pi 4b and my results for the recorded offset are slightly different:
[image: histogram of wakeup lag on the Raspberry Pi 4b]
Another interesting thing I have observed while using GDB is that the pthread_cond_timedwait function gets called only once during the execution of the program.
Just like Edward, I inserted a tracepoint under the else branch of the lf_cond_timedwait section, which is reached when the wait times out; weirdly, no such user event appears until logical time 906000000.

@edwardalee
Contributor Author

When running under gdb, I would expect pthread_cond_timedwait() to trigger only once because the first time it triggers, it halts logical time advance but not physical time advance. In subsequent execution, the wait is skipped because physical time is already past logical time.

@soyerefsane
Collaborator

I have discovered some details regarding this issue on the Raspberry Pi 4b. First of all, the following is a histogram of lags on the Raspberry Pi over 7 runs of the program.
[image: histogram of lags on the Raspberry Pi 4b over 7 runs]

Important program behaviors:

  • While macOS uses CLOCK_MONOTONIC, the Raspberry Pi is actually using CLOCK_REALTIME.
  • macOS uses lf_POSIX_threads_support.c for the lf_cond_timedwait call, which calls pthread_cond_timedwait underneath. The Raspberry Pi uses lf_C11_threads_support.c, which calls cnd_timedwait as the underlying function. I'm honestly not sure how accurate cnd_timedwait is.
  • In one of the runs there is an interesting behavior where the startup code takes around 7.6 milliseconds to execute, so the very first event the tracer logs is at physical time 7.6 milliseconds. I'm trying to reproduce this non-deterministic issue to find the root cause.

@erlingrj
Collaborator

erlingrj commented Feb 7, 2024

What are the units of the values? Are they absolute values and the sign is given by the color of the bar? If the runtime tries to sleep for less than 10 usec, it just returns immediately. So a spurious wakeup of pthread_cond_timedwait at the right time can lead to a negative lag. You can try changing MIN_SLEEP_DURATION to 0 in reactor.h.

@soyerefsane
Collaborator

The y-axis is the frequency (the number of times we see such a lag) on a log scale, and the x-axis is in microseconds. The negative lags are really small in absolute value, maybe under 10 nanoseconds, so they were being put in the same bucket as small positive values in the histogram; I changed the color to signify the difference in sign. The values are technically not absolute values, but the histogram shows them at almost 0, I believe because they are really small compared to the positive lags.

The MIN_SLEEP_DURATION was also my theory, but I haven't tested it yet. But to reiterate one of my points from above, on the Raspberry Pi 4b the function that waits is cnd_timedwait under lf_C11_threads_support.c, not pthread_cond_timedwait. I'm not sure whether this is intended behavior or what the reason behind it is, but it's what's happening now.

At the same time, if the wait time is less than MIN_SLEEP_DURATION, the wait_until function just returns; it might be worth exploring whether it makes sense to busy-wait instead of just returning. Is there any significance behind the value 10 usec?

@erlingrj
Collaborator

erlingrj commented Feb 7, 2024

The MIN_SLEEP_DURATION was also my theory, but I haven't tested it yet. But to reiterate one of my points from above, on the Raspberry Pi 4b the function that waits is cnd_timedwait under lf_C11_threads_support.c, not pthread_cond_timedwait. I'm not sure whether this is intended behavior or what the reason behind it is, but it's what's happening now.

What is missing from my picture is this: what is the critical difference in behaviour between the macOS benchmark and the RPi benchmark that you want to attribute to cnd_timedwait vs. pthread_cond_timedwait?

At the same time, if the wait time is less than MIN_SLEEP_DURATION, the wait_until function just returns; it might be worth exploring whether it makes sense to busy-wait instead of just returning. Is there any significance behind the value 10 usec?

10usec was an unfounded guess at the overhead involved in doing a cond_timedwait. I could be convinced to remove it, if we feel that having negative lag is very bad.

@erlingrj
Collaborator

erlingrj commented Feb 7, 2024

I like Edward's proposal in the original post. We could learn, at runtime, the sleep overhead for the platform and correct for it. This is a standard control problem, and a PI controller would probably manage it nicely.
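
A minimal sketch of what such a controller might look like; the gains and names are illustrative, and nothing like this exists in reactor-c yet:

    #include <stdint.h>

    static int64_t overhead_estimate_ns = 0;  // subtracted from every requested sleep deadline
    static int64_t prev_error_ns = 0;

    // Incremental (velocity-form) PI update.  'error_ns' is the residual lag of
    // the last wakeup after the current estimate was already subtracted from the
    // requested deadline; the controller drives this residual toward zero.
    static void update_overhead_estimate(int64_t error_ns) {
        const double kp = 0.5, ki = 0.1;  // illustrative gains, not tuned
        overhead_estimate_ns += (int64_t)(kp * (error_ns - prev_error_ns) + ki * error_ns);
        prev_error_ns = error_ns;
        if (overhead_estimate_ns < 0) overhead_estimate_ns = 0;  // never sleep past the deadline on purpose
    }

    // Usage sketch: before sleeping, ask for an earlier wakeup,
    //   sleep_until(deadline_ns - overhead_estimate_ns);
    // and afterwards feed the measured residual lag back in:
    //   update_overhead_estimate(measured_wakeup_ns - deadline_ns);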

@edwardalee
Contributor Author

I'm pretty sure MIN_SLEEP_DURATION has nothing to do with the initial observation because the timer period is quite large.

@erlingrj
Collaborator

erlingrj commented Feb 7, 2024

MIN_SLEEP_DURATION can explain why we could have a small negative lag sometimes. The initial observation could be due to several things:

  1. Overhead with pthread_cond_timedwait; if I understand your numbers, they are on the order of 100-200 usec. I measured around 80 usec on my laptop.
  2. If macOS uses CLOCK_MONOTONIC, it could cause issues because pthread_cond_timedwait on macOS is relative to CLOCK_REALTIME. This means that we must translate the desired wakeup time from CLOCK_MONOTONIC to CLOCK_REALTIME. Currently that is done by adding a constant offset, estimated at startup, to the timestamps read from CLOCK_MONOTONIC. However, if an NTP client has stepped the clock since then, that offset is no longer valid and we get a constant bias. We could recalculate the offset every time we sleep, but it seems we would remain vulnerable to the clock being stepped at the wrong moment (after we recalculated the offset). For Linux it is not a problem to use CLOCK_MONOTONIC because pthread_cond_timedwait can be configured relative to CLOCK_MONOTONIC. I don't know what C11 uses, probably CLOCK_REALTIME, so we would have that problem with C11 on Linux as well.

In fact, point 2 here is why I am advocating just sticking with CLOCK_REALTIME and saying that our real-time performance is no better than the performance of the system clock. So it is up to the user to configure their system correctly (i.e. disable clock stepping for any NTP or PTP clients running).
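
For reference, this is how the condition variable can be bound to CLOCK_MONOTONIC on Linux (a sketch; to my knowledge macOS does not provide pthread_condattr_setclock, which is why the offset translation is needed there):

    #include <pthread.h>
    #include <time.h>

    static pthread_cond_t event_q_changed;

    // On Linux, make pthread_cond_timedwait() interpret its absolute timeout
    // against CLOCK_MONOTONIC so that NTP stepping of CLOCK_REALTIME cannot
    // bias the wakeup time.
    static void init_monotonic_cond(void) {
        pthread_condattr_t attr;
        pthread_condattr_init(&attr);
        pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
        pthread_cond_init(&event_q_changed, &attr);
        pthread_condattr_destroy(&attr);
    }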

@edwardalee
Contributor Author

Yes, all good points. I also agree on sticking to CLOCK_REALTIME. I'm just wondering whether this should be an overridable default.

@soyerefsane
Collaborator

What is missing from my picture is this: what is the critical difference in behaviour between the macOS benchmark and the RPi benchmark that you want to attribute to cnd_timedwait vs. pthread_cond_timedwait?

The only macOS measurement I have run was on Edward's laptop. I'll go ahead and run it multiple times on a Mac as well and share the histogram.

I could be convinced to remove it, if we feel that having negative lag is very bad.

I think how bad negative lag is depends on our expected behavior. I believe our current definition for physical actions is min_delay based, for example, and in that case I think the timing guarantee shouldn't allow any negative lag. I'm not sure what the best definition of timing accuracy would be here, since we also want relatively small lag.

If macOS uses CLOCK_MONOTONIC, it could cause issues because pthread_cond_timedwait on macOS is relative to CLOCK_REALTIME. This means that we must translate the desired wakeup time from CLOCK_MONOTONIC to CLOCK_REALTIME.

I'm not sure about macOS, but nanosleep uses CLOCK_MONOTONIC on Linux-based systems. So another potential implementation change would be to change the function being used instead of changing the base clock.
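
As a side note, on Linux clock_nanosleep() can also sleep until an absolute CLOCK_MONOTONIC time directly, which avoids recomputing a relative interval after every interruption; a sketch (clock_nanosleep is, to my knowledge, not available on macOS):

    #include <errno.h>
    #include <time.h>

    // Sleep until the absolute CLOCK_MONOTONIC time 'wakeup' (Linux).
    static void sleep_until_monotonic(const struct timespec* wakeup) {
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, wakeup, NULL) == EINTR) {
            // Interrupted by a signal; with TIMER_ABSTIME the absolute target
            // stays valid, so just retry.
        }
    }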

@erlingrj
Collaborator

To summarize: I have three proposals for reducing lag on macOS (and Linux), which can be implemented as three distinct PRs, all aiming to deliver better real-time performance for reactor-c:

  1. Run the program with higher priority. There is a draft PR that extends the platform API so that you can set core affinity, scheduling policy, and thread priority (see the sketch after this list): Extend platform API with functions for setting thread scheduling policy, core affinity and priority + remove C11 as platform #355
  2. Enable nanosleep-based sleeping when there are no physical actions in the program.
  3. Implement a PI controller that estimates the wakeup overhead and subtracts that from every sleep.
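
For proposal 1, this is roughly what the underlying calls look like on Linux (a sketch of standard POSIX/glibc calls, not the PR's actual platform API; pthread_setaffinity_np is Linux-specific and macOS needs a different mechanism; the priority value and core number are illustrative):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Give the calling thread a real-time scheduling policy (SCHED_FIFO) and
    // pin it to CPU core 1.  Requires appropriate privileges (e.g. CAP_SYS_NICE).
    static int make_thread_realtime(void) {
        struct sched_param param = {.sched_priority = 50};  // illustrative priority
        int ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
        if (ret != 0) return ret;

        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(1, &cpus);
        return pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
    }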
