Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cc 2166 speedup eventing websocket reconnect #2736

Conversation

JanFellner
Copy link
Contributor

As followup from #2734

Changed the implementation for the reconnect from the handling inside the lws_service thread to the scheduler approach as shown in the libsocket example. (https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/ws-client/minimal-ws-client-tx/minimal-ws-client.c)

The current approach is not working properly if the target is not localhost. The connection failed callback is triggered 5 seconds after the connect however the lws_service continues to sleep for 25 seconds.

Using the lws_sul_schedule the time when the next reconnect has to happen is exactly specified, so no need to do that in the lws_service loop and the loop just does the lws_service loop.

(!) Not yet tested with wsl 3.1 want to do that now but a short feedback in the mean time would be nice.

@januscla
Copy link

Thanks for your contribution, @JanFellner! Please make sure you sign our CLA, as it's a required step before we can merge this.

@lminiero
Copy link
Member

lminiero commented Jul 15, 2021

As I said in the issue, the schedule is a no go, because it's only available on newer versions of the library. Even doing two different implementations to accommodate both, I think will unneededly complicate things, especially considering your changes are a big refactoring already. I'd rather try and fix the one we have now for the issue you identified.

@JanFellner
Copy link
Contributor Author

The code is aligned with what libwebsocket suggest for >= 3.2 (no own grown code, just the plain sample)

I understand your concerns, tried my best documenting the code, using readable function names.
The difference are two different thread method implementations.

I did not find a solution to get it working the lws_service with 3.2 just sleeps and i see no way in waking it up.

(Testing with 3.1 in the mean time, loosen to what happens with the PR)

@JanFellner
Copy link
Contributor Author

Updated PR to work with lws 3.1 (now building with 3.1 and 4.2)
For lws 3.1 the behaviour is absolutly the same, lws_service wakes up every 25 seconds.
Specified a timeout of 10 seconds but that does not change the behavior

@JanFellner
Copy link
Contributor Author

CLA signed...
i am out now (received two pinball machines from italy need to check what i received :D)

Copy link
Member

@atoppi atoppi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JanFellner , I think this is a good enhancement.
You missed the handling of re-connection in case of LWS_CALLBACK_CLOSED.

I'm leaving some other notes that should be addressed.

* Schedules a connect attempt using the lws scheduler as
*/
void janus_wsevh_schedule_connect_attempt(void) {
#if (LWS_LIBRARY_VERSION_MAJOR >= 3 && LWS_LIBRARY_VERSION_MINOR > 2) || (LWS_LIBRARY_VERSION_MAJOR > 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LWS_LIBRARY_VERSION_MAJOR >= 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@JanFellner
Copy link
Contributor Author

@atoppi are you using any kind for code formatting with rules? (using the same env as you vscode)
Could not find any clang rules in the repo but i guess some sort of automation is needed that you don´t need to comment PR by coding style (which i don´t care i just use any ruleset available).
If not i will just follow your comments 😉

@atoppi
Copy link
Member

atoppi commented Jul 16, 2021

@JanFellner No I'm not applying any kind of template in the IDE, just following the style in the rest of code.

Copy link
Member

@atoppi atoppi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is still missing the reconnection schedule in case of LWS_CALLBACK_CLOSED.
I've left other notes to address.

/* Forward declarations */
void janus_wsevh_schedule_connect_attempt(void);
void janus_wsevh_calculate_reconnect_delay_on_fail(void);
// lws_sorted_usec_list_t is defined starting with lws 3.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use /* syntax for comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@JanFellner
Copy link
Contributor Author

While working on the code changes i lost the closed in mind, will address it

@JanFellner
Copy link
Contributor Author

Hope i did now cover everything :)

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 16, 2021

As much of the comments were styling related, i wonder if you ever discussed using code formatters.
I did not until 2 years ago and i regret not having used them before.
It just releases you from the burden of having to communicate how code should look like. The rule just formats it.
(We started with just specifying what is common in the team and not with rules that have conflict potential like line length limits)

@atoppi
Copy link
Member

atoppi commented Jul 16, 2021

@JanFellner nope, never discussed about this.
Are you thinking about a simple text document describing the common rules to follow or a specific file that auto-format your code in the IDE (like VSC clang-format) ?
If the latter, consider that @lminiero afaik is using geany to write code, so I'm not sure how he could use or validate it.

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 16, 2021

I am a friend about automation.
If you specify it in a standard you can also apply automated tools that validated stuff e.g. in a PR. so no unformatted code comes in the repo. So a clang file would for me be a starting point but it´s something you need to discuss.
The main task in the beginning is to setup the tooling, in vscode it´s just the plugin that you need to enable. That finds the file and does the magic for you. Other editors other ways...

and see it loosen from the rules you apply. What you want to specify, specify it. (or in other words, what you have to comment in a PR more than 5 times get it automated ;] )

@atoppi
Copy link
Member

atoppi commented Jul 16, 2021

It seems to work as intended on lws v4.2 :-)

However when specifying an invalid address, the timeout on my machine is 15 seconds.
I guess, if not specified, it is falling back to a default environment specific value.
I confirm that setting the timeout (e.g. to 5) the connection goes in failed state quicker.

@JanFellner
Copy link
Contributor Author

Added a connected log entry and removed a duplicate disconnected log entry

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 16, 2021

However when specifying an invalid address, the timeout on my machine is 15 seconds.
I guess, if not specified, it is falling back to a default environment specific value.
I confirm that setting the timeout (e.g. to 5) the connection goes in failed state quicker.

I added that value with d914fa4 for lws>=3.2

I used it now the whole day while developing debugging the signalling server.
The connection is always setup super fast and reliably (using lws 4.2)
looks good to me

@atoppi
Copy link
Member

atoppi commented Jul 19, 2021

@JanFellner lgtm 👍
waiting for @lminiero feedback before merging.

Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added some editorial notes and a question inline.

/*
* Websocket thread loop for websocket library newer than 3.2
* The reconnect is handled in a dedicated lws scheduler janus_wsevh_schedule_connect_attempt
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: comments should be in the style below

/* Bla bla
 * bla bla */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

JANUS_LOG(LOG_VERB, "Joining WebSocketsEventHandler (lws>=3.2) client thread\n");
int n = 0;
while(n >= 0 && g_atomic_int_get(&initialized) && !g_atomic_int_get(&stopping))
n = lws_service(context, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can cause lws_service to return a negative integer? I'm thinking of intermittent errors that could cause this loop to break before Janus is shutdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It´s copy pasted from the sample actually.

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/ws-client/minimal-ws-client/minimal-ws-client.c#n210

I will check inside libwebsocket if it returns a different value and what might be conditions

Copy link
Contributor Author

@JanFellner JanFellner Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you follow the functions through the lib there are some places where the function returns -1,
I did not validate them all but to me it looks like those are scenarios where the context is not properly settled and therefore further execution does not make sense...

  • lws_service() returns 1 or lws_plat_service()
  • lws_plat_service() returns _lws_plat_service_tsi()
  • _lws_plat_service_tsi() returns on seldom cases -1 but the main negative returns are here lws_service_fd_tsi()

https://libwebsockets.org/git/libwebsockets/tree/lib/core-net/service.c?h=v4.2-stable#n624

The question is whats better in the end.
A non sleeping 100% core consuming thread or a thread termination :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that this could result in a fatal scenario due to a temporary issue: as a first integration of the return value, I'd just print the negative value in case of errors, but I'd move on with the loop (so no n>=0 in the while). This would give us some more information from people's logs, in case it ever happens.

Copy link
Contributor Author

@JanFellner JanFellner Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check 78bf993

/*
* Websocket thread loop for websocket library prior to (less than) 3.2
* The reconnect is handled in the loop for lws < 3.2
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above on comments style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

JANUS_LOG(LOG_WARN, "Error attempting reconnection...\n");
continue;
janus_wsevh_connect_attempt(NULL);
if (!wsi) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: no space between 'if` and the bracket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

disconnected = janus_get_monotonic_time();
reconnect = TRUE;
JANUS_LOG(LOG_WARN, "WebSocketsEventHandler: Error attempting connection... (next retry in %ds)\n", reconnect_delay);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a log error above: please add the retry info there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

reconnect_delay = 1;
else if (reconnect_delay < JANUS_WSEVH_MAX_RETRY_SECS) {
reconnect_delay += reconnect_delay;
if (reconnect_delay > JANUS_WSEVH_MAX_RETRY_SECS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two if above have an extra space that per code style shouldn't be there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


/**
* Schedules a connect attempt using the lws scheduler as
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above on comments style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

#if (LWS_LIBRARY_VERSION_MAJOR >= 3 && LWS_LIBRARY_VERSION_MINOR >= 2) || (LWS_LIBRARY_VERSION_MAJOR >= 4)
lws_sul_schedule(context, 0, &sul_stagger, janus_wsevh_connect_attempt, reconnect_delay * LWS_US_PER_SEC);
#endif
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a missing empty line at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

#if !(((LWS_LIBRARY_VERSION_MAJOR == 3 && LWS_LIBRARY_VERSION_MINOR >= 2) || LWS_LIBRARY_VERSION_MAJOR >= 4))
#define lws_sorted_usec_list_t void
#endif
static void janus_wsevh_connect_attempt(lws_sorted_usec_list_t* sul);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per code style, the * should be next to the pointer, not the type:

lws_sorted_usec_list_t *sul

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

* Implements the connecting attempt to the upstream websocket server
* sets the connection result (lws_client_connect_info) to static wsi
*/
static void janus_wsevh_connect_attempt(lws_sorted_usec_list_t* sul) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here (* next to the pointer, not type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@JanFellner
Copy link
Contributor Author

Thanks @lminiero for the comments, will check those now...

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 19, 2021

Pushed

That one needs to be nailed down. I can also remove the negative check, i just took it from the samples...
Check my comments i made.
#2736 (comment)

Concerning the styling stuff please do also check #2736 (comment) and the following two posts as suggestion...
(Even if you don´t use it it would help others to use the style you are using without having to comment them on PRs :) )

@JanFellner
Copy link
Contributor Author

Logging lws_service result if it gets negative and just in case if it reverts back from negativ (or changes) as well.
Both classified as errors just to have them both in the log and not just one of them

@lminiero
Copy link
Member

Looks good to me too, now! Thanks for the quick feedback and fixes, merging ✌️

@lminiero
Copy link
Member

PS: just noticed the automated builds weren't started, I'll do those before merging just to be sure everything's fine.

@lminiero lminiero merged commit 1b9c4e2 into meetecho:master Jul 20, 2021
@JanFellner JanFellner deleted the feature/CC-2166-speedup-eventing-websocket-reconnect branch July 20, 2021 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants