Extend sessions_mutex scope to avoid rogue inserts in mp viewers list #3246
Conversation
pinging @tmatth @lionelnicolas since they opened the mentioned issues
@atoppi I'm not able to test currently, but concurrent watch + destroy definitely sounds like the type of condition we would have hit when we saw this.
@atoppi Thank you for this!
I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically; it was (almost) only happening in production. I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues over the next few days.
Sounds good 👍
No, it's not checking that number. It's just doing a sequence of (create mp / add viewer / destroy mp) with random timings, and it only exits if mp creation failed (like …). If I can find some time today I'll try to refine my script to add the viewer-count check, and maybe also put mp creation/destruction in a different thread than viewer join/leave (to try to hit that watch/destroy race condition).
On a side note, when we encounter this issue, it seems to happen a lot more frequently when a peer with a very bad network is involved (which could cause the …).
I'm not sure that mp create/destroy is strictly needed to hit the issue. As for bad-network peers, maybe a poor connection is closing the transport (e.g. websocket), leading to session teardown while a watch request is being handled?
Maybe I should try to:
(and try to execute 4. and 5. at the same time?) Or, instead of destroying the mp, just check the viewer count of the mp via the admin API?
I'd say just create mp once or reuse a static one, then in parallel threads:
After N iterations, check that mp viewers count is 0.
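For reference, here is a rough sketch of what such a loop could look like against a local Janus over the WebSocket API. This is not the script used in this thread; the URL, the mountpoint id, the task/iteration counts and the naive reply handling are assumptions made purely for illustration.

```python
import asyncio
import json
import random
import uuid

import websockets  # pip install websockets

JANUS_WS_URL = "ws://127.0.0.1:8188"  # assumption: default Janus WebSocket transport
MOUNTPOINT_ID = 1                     # assumption: a pre-provisioned static mountpoint
PARALLEL_TASKS = 15
ITERATIONS = 100


async def send(ws, payload):
    # every Janus request needs a unique transaction id
    payload["transaction"] = uuid.uuid4().hex
    await ws.send(json.dumps(payload))


async def one_iteration():
    async with websockets.connect(JANUS_WS_URL, subprotocols=["janus-protocol"]) as ws:
        # naive reply handling: assumes the next frame is the matching "success"
        await send(ws, {"janus": "create"})
        session_id = json.loads(await ws.recv())["data"]["id"]

        await send(ws, {"janus": "attach", "plugin": "janus.plugin.streaming",
                        "session_id": session_id})
        handle_id = json.loads(await ws.recv())["data"]["id"]

        # ask to watch the static mountpoint...
        await send(ws, {"janus": "message", "session_id": session_id,
                        "handle_id": handle_id,
                        "body": {"request": "watch", "id": MOUNTPOINT_ID}})

        # ...and tear the session down almost immediately, either politely
        # (destroy) or abruptly (just dropping the websocket on exit), so the
        # hangup path can race with the watch handler
        await asyncio.sleep(random.uniform(0, 0.2))
        if random.random() < 0.5:
            await send(ws, {"janus": "destroy", "session_id": session_id})
            await asyncio.sleep(0.1)


async def worker():
    for _ in range(ITERATIONS):
        try:
            await one_iteration()
        except Exception as exc:
            print("iteration failed:", exc)


async def main():
    await asyncio.gather(*(worker() for _ in range(PARALLEL_TASKS)))
    # after the run the mountpoint's viewer count should be back to 0;
    # reading it (e.g. via the plugin's list/info requests or the Admin API)
    # is left out of this sketch


if __name__ == "__main__":
    asyncio.run(main())
```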
Ok, thanks, I'll try to check that today.
I may have good news 😄

**Janus setup**

Compiled with …

**Stress test**

I've added this to my janus CLI client, just after sending the (...):

```python
rand_action = random.choice([1, 2, 3, 4])
if rand_action == 1:
    logger.warning("STRESS : killing websocket")
    await self.ws.close()
elif rand_action == 2:
    logger.warning("STRESS : detaching plugin")
    await self.send(
        {
            "janus": "detach",
            "session_id": self.data.get("session_id"),
            "handle_id": self.data.get("handle_id"),
            "transaction": self.transaction_create("instantiate-listener"),
        }
    )
    await asyncio.sleep(0.2)
    await self.ws.close()
elif rand_action == 3:
    logger.warning("STRESS : destroying session")
    await self.send(
        {
            "janus": "destroy",
            "session_id": self.data.get("session_id"),
            "transaction": self.transaction_create("instantiate-listener"),
        }
    )
    await asyncio.sleep(0.2)
    await self.ws.close()
else:
    logger.warning("STRESS : running normal behaviour (watch stream for 2 seconds)")
(...)
```

Then run 15 parallel jobs with 12000 iterations:
**Without applying this PR**

After 12000 iterations, …

**Using this PR**

After 12000 iterations, the …

**Notes**

I've not seen any …
To be sure, I wanted to try with 25 parallel jobs:
**Without applying this PR**

After 8000 iterations, …

**Using this PR**

After 8000 iterations, the …
I can't approve the PR, but LGTM 👍
I imagine that it's because, although you have dangling pointers in your test, if you're not actually dereferencing them at some point after they've been freed, asan won't complain.
@lionelnicolas excellent work! That proves the patch is working as intended and will fix at least the "already watching" issue. @tmatth you're right, the bare insert of a dangling ptr is not enough to trigger the sanitizers, you need to deref it.
Thanks for all the feedback! I'll merge then, and work on a port of the same fix to …
Done for …
This patch tries to address #3105 and #3108.
We think that both issues are caused by an unwanted insert in the mountpoint viewers list, following this flow:
- a `watch` request and a session destroy (handled in two different threads) are received; the `watch` is handled by `janus_streaming_handler`
- `janus_streaming_hangup_media_internal` (called by the session destroy) is handled before the `watch` request completes
- the `watch` handler completes and ends up adding a dangling session pointer in the viewers list

The patch basically extends the scope of `sessions_mutex` in order to avoid the concurrent handling of a `watch` request and a `janus_streaming_hangup_media_internal`, thus restoring the correct order of execution.
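To make the interleaving easier to picture, here is a toy Python model of the two lock scopes. It is only an illustration, not the actual C code in `janus_streaming.c`; the `viewers` list, the `Session` class and the timings are invented for the example.

```python
import threading
import time

sessions_mutex = threading.Lock()
viewers = []  # stands in for the mountpoint's viewers list


class Session:
    destroyed = False


def watch_handler(session, extended_scope):
    """Models the plugin thread handling a 'watch' request."""
    sessions_mutex.acquire()
    if not extended_scope:
        # old behaviour: the mutex only covers the initial checks and is
        # released before the (slow) rest of the handler
        alive = not session.destroyed
        sessions_mutex.release()
        if not alive:
            return
        time.sleep(0.05)          # slow part of the watch handling
        viewers.append(session)   # may insert an already-destroyed session!
    else:
        # patched behaviour: keep the mutex for the whole handler, so the
        # hangup below cannot interleave between the check and the insert
        try:
            if session.destroyed:
                return
            time.sleep(0.05)
            viewers.append(session)
        finally:
            sessions_mutex.release()


def hangup_media(session):
    """Models janus_streaming_hangup_media_internal, called by session destroy."""
    with sessions_mutex:
        if session in viewers:
            viewers.remove(session)
        session.destroyed = True


for extended in (False, True):
    viewers.clear()
    s = Session()
    watcher = threading.Thread(target=watch_handler, args=(s, extended))
    destroyer = threading.Thread(target=hangup_media, args=(s,))
    watcher.start()
    time.sleep(0.01)  # let the watch handling start first
    destroyer.start()
    watcher.join()
    destroyer.join()
    print("extended scope:" if extended else "narrow scope:",
          "dangling viewers left =", len(viewers))
```

Run as-is, this should leave one stale entry behind with the narrow scope and none with the extended scope, which mirrors the "dangling session pointer in the viewers list" described above.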