
Extend sessions_mutex scope to avoid rogue inserts in mp viewers list #3246

Merged: 1 commit merged into 0.x on Aug 4, 2023

Conversation

atoppi
Member

@atoppi atoppi commented Jul 31, 2023

This patch tries to address #3105 and #3108.
We think that both issues are caused by an unwanted insert in the mountpoint viewers list, following this flow:

  1. concurrent requests of plugin session destroy and watch (handled in two different threads) are received
  2. the check for a valid session is satisfied at the beginning of janus_streaming_handler
		janus_mutex_lock(&sessions_mutex);
		janus_streaming_session *session = janus_streaming_lookup_session(msg->handle);
		if(!session) {
			janus_mutex_unlock(&sessions_mutex);
			JANUS_LOG(LOG_ERR, "No session associated with this handle...\n");
			janus_streaming_message_free(msg);
			continue;
		}
  3. meanwhile janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes
  4. the watch handler completes and ends up adding a dangling session pointer in the viewers list
  5. from now on, bad things might happen

The patch basically extends the scope of sessions_mutex to prevent the concurrent handling of a watch request and a janus_streaming_hangup_media_internal, thus restoring the correct order of execution.
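
To make the idea concrete, here is a minimal standalone sketch (using pthreads and made-up fake_session / viewers structures, not the actual janus_streaming.c code) of what keeping the session check and the viewers-list insert under the same lock buys us: a concurrent hangup can only run before or after the whole watch handling, never in between.

#include <pthread.h>
#include <stdio.h>

typedef struct fake_session {
	int destroyed;			/* set by the hangup/destroy path */
} fake_session;

static pthread_mutex_t sessions_mutex = PTHREAD_MUTEX_INITIALIZER;
static fake_session *viewers[16];	/* stand-in for the mountpoint viewers list */
static int viewers_count = 0;

/* Watch handling: before the patch the lock was released right after the
 * session lookup, so a concurrent hangup could run in between and the
 * (now dangling) session pointer was still appended to the viewers list.
 * With the extended scope, check and insert happen under the same lock. */
static void handle_watch(fake_session *session) {
	pthread_mutex_lock(&sessions_mutex);
	if(session == NULL || session->destroyed) {
		pthread_mutex_unlock(&sessions_mutex);
		fprintf(stderr, "No valid session, dropping the watch request\n");
		return;
	}
	/* ... mountpoint/configuration checks and SDP offer generation ... */
	viewers[viewers_count++] = session;
	pthread_mutex_unlock(&sessions_mutex);
}

/* Hangup/destroy path: removes the session from the viewers list. */
static void hangup_media(fake_session *session) {
	pthread_mutex_lock(&sessions_mutex);
	for(int i = 0; i < viewers_count; i++) {
		if(viewers[i] == session)
			viewers[i] = viewers[--viewers_count];
	}
	session->destroyed = 1;
	pthread_mutex_unlock(&sessions_mutex);
}

int main(void) {
	fake_session s = { 0 };
	/* With both paths serialized on sessions_mutex, whichever runs second
	 * sees a consistent state: either the watch is rejected, or the viewer
	 * is inserted and then removed by the hangup. */
	handle_watch(&s);
	hangup_media(&s);
	printf("viewers after teardown: %d\n", viewers_count);
	return 0;
}

In the plugin itself the same effect is obtained by keeping sessions_mutex held for a larger portion of janus_streaming_handler, instead of releasing it right after the lookup shown above.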

@atoppi
Member Author

atoppi commented Jul 31, 2023

pinging @tmatth @lionelnicolas since they opened the mentioned issues

@tmatth
Contributor

tmatth commented Jul 31, 2023

pinging @tmatth @lionelnicolas since they opened the mentioned issues

@atoppi I'm not able to test currently but concurrent watch + destroy definitely sounds like the type of condition we would have hit when we saw this.

@lionelnicolas
Contributor

@atoppi Thank you for this!

janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.

I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically; it was (almost) only happening in production.

I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

@atoppi
Member Author

atoppi commented Aug 1, 2023

@atoppi Thank you for this!

janus_streaming_hangup_media_internal definitely rings a bell from the investigations we did on that issue.

I just did a quick load test using this PR and it seems to work. We wrote a few scripts in the past to try to reproduce the issue addressed by this PR, but we were never able to reproduce it systematically; it was (almost) only happening in production.

I'm going to deploy this PR on all our dev/test/qa environments to see if we catch any issues in the next few days.

Sounds good 👍
Do your scripts check the current number of viewers in order to assert a condition (e.g. it should be zero)?
The bare insert of unwanted sessions in the list is enough to trigger undefined behavior.

@lionelnicolas
Contributor

No, it's not checking that number. It just does a sequence of (create mp / add viewer / destroy mp) with random timings, and it only exits if mp creation fails (e.g. it already exists, or there's an API deadlock/timeout).

If I can find some time today I'll try to refine my script to add the viewer count check, and maybe also put mp creation/destruction in a different thread than viewer join/leave (to try to hit that watch/destroy race condition).

@lionelnicolas
Contributor

3. janus_streaming_hangup_media_internal (called by session destroy) is handled before the watch request completes

On a side note, when we encounter this issue it seems to happen a lot more frequently when a peer with a very bad network is involved (which could cause the watch request to take longer to complete?).

@atoppi
Member Author

atoppi commented Aug 1, 2023

I'm not sure that mp create/destroy is strictly needed to hit the issue.
A mp destroy will of course trigger the destruction of existing sessions, and those session teardowns, combined with a concurrent watch request, might lead to unwanted inserts in the viewers list.

As for bad-network peers, maybe a poor connection is closing the transport (e.g. the WebSocket), leading to a session teardown while a watch request is being handled?
In general, the time needed to handle a watch request is basically the time for mp/session/configuration checks plus JSEP offer generation.

@lionelnicolas
Contributor

lionelnicolas commented Aug 1, 2023

Maybe I should try to:

  1. [mp] create mp
  2. [viewer] attach viewer
  3. [viewer] send watch
  4. [viewer] close WS immediately after sending watch
  5. [mp] immediately destroy mp

(and try to execute 4. and 5. at the same time?)

Or, instead of destroying the mp, just check the viewer count of the mp via the admin API?

@atoppi
Member Author

atoppi commented Aug 1, 2023

I'd say just create mp once or reuse a static one, then in parallel threads:

  1. attach
  2. watch (don't wait response)
  3. quickly send a random command among close ws, detach handle, destroy session

After N iterations, check that the mp viewers count is 0.
Of course the test might stop earlier if libasan detects a corruption.

@lionelnicolas
Contributor

Ok, thanks, I'll try to check that today.

@lionelnicolas
Contributor

I may have good news 😄

Janus setup

Compiled with libasan and libubsan.
Running with mimalloc disabled.

Stress test

I've added this to my janus CLI client, just after sending the watch command:

(...)

            rand_action = random.choice([1, 2, 3, 4])

            if rand_action == 1:
                logger.warning("STRESS : killing websocket")
                await self.ws.close()

            elif rand_action == 2:
                logger.warning("STRESS : detaching plugin")
                await self.send(
                    {
                        "janus": "detach",
                        "session_id": self.data.get("session_id"),
                        "handle_id": self.data.get("handle_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            elif rand_action == 3:
                logger.warning("STRESS : destroying session")
                await self.send(
                    {
                        "janus": "destroy",
                        "session_id": self.data.get("session_id"),
                        "transaction": self.transaction_create("instantiate-listener"),
                    }
                )
                await asyncio.sleep(0.2)
                await self.ws.close()

            else:
                logger.warning("STRESS : running normal behaviour (watch stream for 2 seconds)")

(...)

Then run 15 parallel jobs with 12000 iterations:

parallel -n0 --line-buffer --jobs 15 "./stress STREAM-TOKEN" ::: $(seq 12000)

Without applying this PR

After 12000 iterations, mountpoint_viewers was 6 after all viewers were torn down.

Using this PR

After 12000 iterations, mountpoint_viewers was 0.

Notes

I've not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

@lionelnicolas
Contributor

To be sure, I wanted to try with 25 parallel jobs:

parallel -n0 --line-buffer --jobs 25 "./stress STREAM-TOKEN" ::: $(seq 8000)

Without applying this PR

After 8000 iterations, mountpoint_viewers was 3 after all viewers were torn down.

Using this PR

After 8000 iterations, mountpoint_viewers was 0.

@lionelnicolas
Contributor

I can't approve the PR, but LGTM 👍

@tmatth
Contributor

tmatth commented Aug 3, 2023

Notes

I've not seen any asan / ubsan warnings in the logs, even when mountpoint_viewers was non-zero.

I imagine it's because, although you have dangling pointers in your test, asan won't complain if you're not actually dereferencing them at some point after they've been freed.

@atoppi
Member Author

atoppi commented Aug 3, 2023

@lionelnicolas excellent work! That proves the patch is working as intended and will fix at least the "already watching" issue.

@tmatth you're right, the bare insert of a dangling ptr is not enough to trigger the sanitizers, you need to deref it.
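
For illustration (a standalone example, not Janus code): AddressSanitizer only reports a heap-use-after-free when the freed memory is actually accessed, not when the stale pointer value is merely stored or copied around, which is exactly what the bare insert in the viewers list does.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
	int *p = malloc(sizeof(int));
	*p = 42;
	free(p);
	/* Storing or copying the dangling pointer is invisible to ASan:
	 * the pointer value itself is just a number. */
	int *copy = p;
	printf("stored dangling pointer %p, no report\n", (void *)copy);
	/* Only an actual access to the freed memory triggers a report, e.g.:
	 * printf("%d\n", *copy);   heap-use-after-free under -fsanitize=address */
	return 0;
}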

@lminiero
Member

lminiero commented Aug 4, 2023

Thanks for all the feedback! I'll merge then, and work on a port of the same fix to 1.x too.

lminiero merged commit 8162ce7 into 0.x on Aug 4, 2023
lminiero deleted the 0x-fix-bad-viewer-insert branch on August 4, 2023 09:00
lminiero added a commit that referenced this pull request Aug 4, 2023
@lminiero
Member

lminiero commented Aug 4, 2023

Done for 1.x in the PR above, please double check I didn't miss anything.

lminiero added a commit that referenced this pull request Aug 4, 2023

… (see #3246) (#3250)