Zilla Quickstart gRPC RouteGuide service hangs after lots of messages #719

Closed
vordimous opened this issue Jan 9, 2024 · 3 comments · Fixed by #844
Labels: bug (Something isn't working)

vordimous commented Jan 9, 2024

Describe the bug
After starting the Zilla quickstart, the gRPC RouteGuide service hangs once a lot of messages have passed through it.

To Reproduce
Steps to reproduce the behavior:

  1. Run the quickstart
  2. Invoke the ListFeature endpoint in the gRPC RouteGuide collection one or many times
     • each invocation should stream ~50 messages
  3. Switch to and invoke the RecordRoute endpoint, send multiple messages, and end the stream; a message should return stating the number of messages received
  4. Repeat steps 2 and 3 until one doesn't finish correctly
vordimous added the bug (Something isn't working) label on Jan 9, 2024
vordimous changed the title from "Zilla Quickstart not working with 0.9.64" to "Zilla Quickstart gRPC RouteGuide service hangs after lots of use" on Jan 9, 2024
vordimous changed the title from "Zilla Quickstart gRPC RouteGuide service hangs after lots of use" to "Zilla Quickstart gRPC RouteGuide service hangs after lots of messages" on Jan 9, 2024

vordimous commented

After working in the quickstart some more, this issue appears to be related to the overall message count running through Zilla. The mqtt-simulator is always on and producing messages; when I turned it off, I wasn't able to reproduce the issue. When I turned the simulator back on and connected using the Simulator Topics tab in the quickstart, I eventually got this error: 0x8E - Session taken over. This is the point at which the gRPC services stopped working as well.

vordimous commented

Confirmed this bug still exists using 0.9.69.

The issue is still related to throughput: I can't replicate it with the mqtt-simulator turned off. The mqtt-simulator is producing ~60 messages per minute.


akrambek commented Mar 6, 2024

The request hangs because, during startup, the consumer group can't reach a stable state in which it can assign partitions, so the fetch stream never starts delivering messages. The join group request fails to complete because a session timeout removes the member from the group. After further analysis, we found a 3-second delay in the Join Group request, and because we are using a connection pool, this delay is multiplied by the number of join group requests happening in parallel. The delay comes from the Kafka config group.initial.rebalance.delay.ms, which defaults to 3 seconds. When we set it to 0, everything started to work as expected. So the proposed solution is:

  • Move the join group request stream out of the coordinator stream and use a separate connection if group.initial.rebalance.delay.ms > 0; otherwise use the connection pool, since there won't be any delay.
  • Update the quickstart and any other examples that use a consumer group to set group.initial.rebalance.delay.ms = 0.
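
To make the arithmetic behind this explanation concrete, here is a rough Python sketch of how the delay could add up. It is a simplified model, not Zilla's implementation: it assumes that join group requests sharing one pooled connection are answered strictly one after another, and the 10-second session timeout is a hypothetical value chosen only for illustration. The function names are likewise made up for this example.

```python
# Simplified model (assumption): requests on one pooled connection complete
# strictly in order, so each JoinGroup waits for the ones queued ahead of it.
REBALANCE_DELAY_MS = 3000   # Kafka default for group.initial.rebalance.delay.ms
SESSION_TIMEOUT_MS = 10000  # hypothetical session timeout, for illustration only


def join_completion_times(num_joins, delay_ms, pooled):
    """Return the time (ms) at which each JoinGroup request completes."""
    if pooled:
        # Shared connection: the delay accumulates across queued requests.
        return [delay_ms * (i + 1) for i in range(num_joins)]
    # Separate connection per request: the delays overlap instead of adding up.
    return [delay_ms] * num_joins


def timed_out(times, session_timeout_ms):
    """Return indices of members whose join finished after the session timeout."""
    return [i for i, t in enumerate(times) if t > session_timeout_ms]


if __name__ == "__main__":
    pooled = join_completion_times(5, REBALANCE_DELAY_MS, pooled=True)
    separate = join_completion_times(5, REBALANCE_DELAY_MS, pooled=False)
    no_delay = join_completion_times(5, 0, pooled=True)
    print("pooled:", pooled, "timed out:", timed_out(pooled, SESSION_TIMEOUT_MS))
    print("separate:", separate, "timed out:", timed_out(separate, SESSION_TIMEOUT_MS))
    print("delay=0:", no_delay, "timed out:", timed_out(no_delay, SESSION_TIMEOUT_MS))
```

Under these assumptions, five parallel joins on a shared connection push the last two past the timeout, while either proposed change (separate connections, or a delay of 0) keeps every join within it.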

3 participants