
Extend sessions_mutex scope to avoid rogue inserts in mp viewers list #3246

Merged: 1 commit merged into 0.x on Aug 4, 2023

Conversation

atoppi
Member

@atoppi atoppi commented Jul 31, 2023

This patch tries to address #3105 and #3108.
We think that both issues are caused by an unwanted insert in the mountpoint viewers list, following this flow:

  1. concurrent requests of plugin session destroy and watch (handled in two different threads) are received
  2. the check for a valid session is satisfied at the beginning of janus_streaming_handler
		janus_mutex_lock(&sessions_mutex);
		janus_streaming_session *session = janus_streaming_lookup_session(msg->handle);
		if(!session) {
			janus_mutex_unlock(&sessions_mutex);
			JANUS_LOG(LOG_ERR, "No session associated with this handle...\n");
			janus_streaming_message_free(msg);
			continue;
		}
  3. meanwhile janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes
  4. the watch handler completes and ends up adding a dangling session pointer in the viewers list
  5. from now on, bad things might happen

The patch basically extends the scope of sessions_mutex to prevent the concurrent handling of a watch request and a janus_streaming_hangup_media_internal, thus restoring the correct order of execution.
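
To make the idea concrete, here is a minimal standalone sketch (using pthreads and made-up fake_session / viewers structures, not the actual janus_streaming.c code) of what keeping the session check and the viewers-list insert under the same lock buys us: a concurrent hangup can only run before or after the whole watch handling, never in between.

#include <pthread.h>
#include <stdio.h>

typedef struct fake_session {
	int destroyed;			/* set by the hangup/destroy path */
} fake_session;

static pthread_mutex_t sessions_mutex = PTHREAD_MUTEX_INITIALIZER;
static fake_session *viewers[16];	/* stand-in for the mountpoint viewers list */
static int viewers_count = 0;

/* Watch handling: before the patch the lock was released right after the
 * session lookup, so a concurrent hangup could run in between and the
 * (now dangling) session pointer was still appended to the viewers list.
 * With the extended scope, check and insert happen under the same lock. */
static void handle_watch(fake_session *session) {
	pthread_mutex_lock(&sessions_mutex);
	if(session == NULL || session->destroyed) {
		pthread_mutex_unlock(&sessions_mutex);
		fprintf(stderr, "No valid session, dropping the watch request\n");
		return;
	}
	/* ... mountpoint/configuration checks and SDP offer generation ... */
	viewers[viewers_count++] = session;
	pthread_mutex_unlock(&sessions_mutex);
}

/* Hangup/destroy path: removes the session from the viewers list. */
static void hangup_media(fake_session *session) {
	pthread_mutex_lock(&sessions_mutex);
	for(int i = 0; i < viewers_count; i++) {
		if(viewers[i] == session)
			viewers[i] = viewers[--viewers_count];
	}
	session->destroyed = 1;
	pthread_mutex_unlock(&sessions_mutex);
}

int main(void) {
	fake_session s = { 0 };
	/* With both paths serialized on sessions_mutex, whichever runs second
	 * sees a consistent state: either the watch is rejected, or the viewer
	 * is inserted and then removed by the hangup. */
	handle_watch(&s);
	hangup_media(&s);
	printf("viewers after teardown: %d\n", viewers_count);
	return 0;
}

In the plugin itself the same effect is obtained by keeping sessions_mutex held for a larger portion of janus_streaming_handler, instead of releasing it right after the lookup shown above.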

@atoppi
Member Author

atoppi commented Jul 31, 2023

pinging @tmatth @lionelnicolas since they opened the mentioned issues

@tmatth
Contributor

tmatth commented Jul 31, 2023

pinging @tmatth @lionelnicolas since they opened the mentioned issues

@atoppi I'm not able to test currently but concurrent watch + destroy definitely sounds like the type of condition we would have hit when we saw this.

@lionelnicolas
Contributor

@atoppi Thank you for this!

janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.

I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically; it was (almost) only happening in production.

I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

@atoppi
Member Author

atoppi commented Aug 1, 2023

@atoppi Thank you for this!

janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.

I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically; it was (almost) only happening in production.

I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

Sounds good 👍
Do your scripts check the current number of viewers in order to assert a condition (e.g. it should be zero)?
The bare insert of unwanted sessions in the list is enough to trigger undefined behavior.

@lionelnicolas
Contributor

No, it's not checking that number. It just does a sequence of (create mp / add viewer / destroy mp) with random timings, and it only exits if mp creation fails (e.g. it already exists, or there's an API deadlock/timeout).

If I can find some time today I'll try to refine my script to add the viewer count check, and maybe also put mp creation/destruction in a different thread than viewer join/leave (to try to hit that watch/destroy race condition).

@lionelnicolas
Contributor

3. janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes

On a side note, when we encounter this issue it seems to happen a lot more frequently when a peer with a very bad network is involved (which could cause the watch request to take longer to complete?).

@atoppi
Member Author

atoppi commented Aug 1, 2023

I'm not sure that mp create/destroy is strictly needed to hit the issue.
A mp destroy will of course trigger the destruction of existing sessions, and those session teardowns, combined with a concurrent watch request, might lead to unwanted inserts in the viewers list.

As for bad-network peers, maybe a poor connection is closing the transport (e.g. the WebSocket), leading to a session teardown while a watch request is being handled?
In general, the time needed to handle a watch request is basically the time for mp/session/configuration checks plus JSEP offer generation.

@lionelnicolas
Contributor

lionelnicolas commented Aug 1, 2023

Maybe I should try to:

  1. [mp] create mp
  2. [viewer] attach viewer
  3. [viewer] send watch
  4. [viewer] close WS immediately after sending watch
  5. [mp] immediately destroy mp

(and try to execute 4. and 5. at the same time?)

Or, instead of destroying the mp, just check the viewer count of the mp via the admin API?

@atoppi
Member Author

atoppi commented Aug 1, 2023

I'd say just create mp once or reuse a static one, then in parallel threads:

  1. attach
  2. watch (don't wait response)
  3. quickly send a random command among close ws, detach handle, destroy session

After N iterations, check that the mp viewers count is 0.
Of course the test might stop earlier if libasan detects a corruption.

@lionelnicolas
Contributor

Ok, thanks, I'll try to check that today.

@lionelnicolas
Contributor

I may have good news 😄

Janus setup

Compiled with libasan and libubsan.
Running with mimalloc disabled.

Stress test

I've added this to my janus CLI client, just after sending the watch command:

(...)

            rand_action = random.choice([1, 2, 3, 4])

            if rand_action == 1:
                logger.warning("STRESS : killing websocket")
                await self.ws.close()

            elif rand_action == 2:
                logger.warning("STRESS : detaching plugin")
                await self.send(
                    {
                        "janus": "detach",
                        "session_id": self.data.get("session_id"),
                        "handle_id": self.data.get("handle_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            elif rand_action == 3:
                logger.warning("STRESS : destroying session")
                await self.send(
                    {
                        "janus": "destroy",
                        "session_id": self.data.get("session_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            else:
                logger.warning("STRESS : running normal behaviour (watch stream for 2 seconds)")

(...)

Then run 15 parallel jobs with 12000 iterations:

parallel -n0 --line-buffer --jobs 15 "./stress STREAM-TOKEN" ::: $(seq 12000)

Without applying this PR

After 12000 iterations, mountpoint_viewers was 6 after all viewers were torn down.

Using this PR

After 12000 iterations, mountpoint_viewers was 0.

Notes

I've not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

@lionelnicolas
Contributor

To be sure, I wanted to try with 25 parallel jobs:

parallel -n0 --line-buffer --jobs 25 "./stress STREAM-TOKEN" ::: $(seq 8000)

Without applying this PR

After 8000 iterations, mountpoint_viewers was 3 after all viewers were torn down.

Using this PR

After 8000 iterations, mountpoint_viewers was 0.

@lionelnicolas
Contributor

I can't approve the PR, but LGTM 👍

@tmatth
Contributor

tmatth commented Aug 3, 2023

Notes

I've not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

I imagine it's because, although you have dangling pointers in your test, asan won't complain if you're not actually dereferencing them at some point after they've been freed.

@atoppi
Member Author

atoppi commented Aug 3, 2023

@lionelnicolas excellent work! That proves the patch is working as intended and will fix at least the "already watching" issue.

@tmatth you're right, the bare insert of a dangling ptr is not enough to trigger the sanitizers, you need to deref it.
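
For illustration (a standalone example, not Janus code): AddressSanitizer only reports a heap-use-after-free when the freed memory is actually accessed, not when the stale pointer value is merely stored or copied around, which is exactly what the bare insert in the viewers list does.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
	int *p = malloc(sizeof(int));
	*p = 42;
	free(p);
	/* Storing or copying the dangling pointer is invisible to ASan:
	 * the pointer value itself is just a number. */
	int *copy = p;
	printf("stored dangling pointer %p, no report\n", (void *)copy);
	/* Only an actual access to the freed memory triggers a report, e.g.:
	 * printf("%d\n", *copy);   heap-use-after-free under -fsanitize=address */
	return 0;
}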

@lminiero
Member

lminiero commented Aug 4, 2023

Thanks for all the feedback! I'll merge then, and work on a port of the same fix to 1.x too.

lminiero merged commit 8162ce7 into 0.x on Aug 4, 2023
lminiero deleted the 0x-fix-bad-viewer-insert branch on August 4, 2023 09:00
lminiero added a commit that referenced this pull request Aug 4, 2023
@lminiero
Member

lminiero commented Aug 4, 2023

Done for 1.x in the PR above, please double check I didn't miss anything.

lminiero added a commit that referenced this pull request Aug 4, 2023

… (see #3246) (#3250)