Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After reconnecting to the agent, data cannot be received normally for a period of time #434

Closed
chrisjaxy opened this issue Dec 14, 2021 · 38 comments
Assignees

Comments

@chrisjaxy
Copy link

chrisjaxy commented Dec 14, 2021

Issue template

  • Hardware description: STM32F103
  • RTOS: freertos
  • Installation type: micro_ros_setup
  • Version or commit hash: foxy

Steps to reproduce the issue

We created a subscriber on stm32f103, and created a publisher on another machine to publish the same topic at a frequency of 50hz. When stm32f103 successfully reconnects to the agent, it cannot subscribe to this topic for a certain period of time.
After debugging, it was found that the problem appeared in the function "uxr_create_session_retries" of the file "session.c". After calling the function "uxr_reset_stream_storage", calling "wait_session_status" to receive data may cause "uxr_reset_stream_storage" to become invalid and cause problems.
We try to call "uxr_reset_stream_storage" again after calling "wait_session_status" to avoid the problem, but I don't know if this modification is feasible. I want to ask your suggestions?

bool uxr_create_session_retries(
        uxrSession* session,
        size_t retries)
{
    uxr_reset_stream_storage(&session->streams);

    uint8_t create_session_buffer[CREATE_SESSION_MAX_MSG_SIZE];
    ucdrBuffer ub;
    ucdr_init_buffer_origin_offset(&ub, create_session_buffer, CREATE_SESSION_MAX_MSG_SIZE, 0u, uxr_session_header_offset(
                &session->info));

    uxr_buffer_create_session(&session->info, &ub, (uint16_t)(session->comm->mtu - INTERNAL_RELIABLE_BUFFER_OFFSET));
    uxr_stamp_create_session_header(&session->info, ub.init);

    bool received = wait_session_status(session, create_session_buffer, ucdr_buffer_length(&ub), (size_t) retries);
    bool created = received && UXR_STATUS_OK == session->info.last_requested_status;
    
    printf("uxr_reset_stream_storage again\r\n");
    uxr_reset_stream_storage(&session->streams);

    return created;
}

Expected behavior

When we reconnect the agent successfully, the MCU can subscribe and respond to data normally.

Actual behavior

Additional information

@pablogs9
Copy link
Member

Can you share your application code? How are you reimplementing the reconnection?

@chrisjaxy
Copy link
Author

We refer to the reconnection code in https://github.com/micro-ROS/micro_ros_arduino/blob/galactic/examples/micro-ros_reconnection_example/micro-ros_reconnection_example.ino. Call "destroy_entities" via a trigger signal, and then reconnect via "create_entities". In addition, our other test of reconnection is to reset the MCU, so that it will directly connect through "create_entities".

@pablogs9
Copy link
Member

, it cannot subscribe to this topic for a certain period of time.

Does this mean that the micro-ROS node finally reconnects? Or it never receives a subscription again?

@chrisjaxy
Copy link
Author

, it cannot subscribe to this topic for a certain period of time.

Does this mean that the micro-ROS node finally reconnects? Or it never receives a subscription again?

The micro ROS node finally successfully reconnected. but it can not receives a subscription until the "uxr_receive_best_effort_message" function, seq_num catches up to stream->last_handled

@chrisjaxy
Copy link
Author

, it cannot subscribe to this topic for a certain period of time.

Does this mean that the micro-ROS node finally reconnects? Or it never receives a subscription again?

The micro ROS node finally successfully reconnected. but it can not receives a subscription until the "uxr_receive_best_effort_message" function, seq_num catches up to stream->last_handled

We found in the function "uxr_create_session_retries", after calling the function "uxr_reset_stream_storage", calling "wait_session_status" to receive data may cause "uxr_reset_stream_storage" to become invalid, it means the "stream->last_handled" was changed because of the mcu can still receive subscribed topics at 50hz.

@pablogs9
Copy link
Member

Which are the values of RMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT RMW_UXRCE_ENTITY_CREATION_TIMEOUT and RMW_UXRCE_ENTITY_DESTROY_TIMEOUT in your colcon.meta?

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 14, 2021

Which are the values of RMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT RMW_UXRCE_ENTITY_CREATION_TIMEOUT and RMW_UXRCE_ENTITY_DESTROY_TIMEOUT in your colcon.meta?

1000

@pablogs9
Copy link
Member

Does it have the same behaviour if you set RMW_UXRCE_ENTITY_DESTROY_TIMEOUT to 0 in the colcon.meta?

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 14, 2021

Does it have the same behaviour if you set RMW_UXRCE_ENTITY_DESTROY_TIMEOUT to 0 in the colcon.meta?

No, we creat a static library, RMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT this value is the default 1000.This is our colcon.meta

{
    "names": {
        "tracetools": {
            "cmake-args": [
                "-DTRACETOOLS_DISABLED=ON",
                "-DTRACETOOLS_STATUS_CHECKING_TOOL=OFF"
            ]
        },
        "rosidl_typesupport": {
            "cmake-args": [
                "-DROSIDL_TYPESUPPORT_SINGLE_TYPESUPPORT=ON"
            ]
        },
        "rcl": {
            "cmake-args": [
                "-DBUILD_TESTING=OFF",
                "-DRCL_COMMAND_LINE_ENABLED=OFF",
                "-DRCL_LOGGING_ENABLED=OFF"
            ]
        }, 
        "rcutils": {
            "cmake-args": [
                "-DENABLE_TESTING=OFF",
                "-DRCUTILS_NO_FILESYSTEM=ON",
                "-DRCUTILS_NO_THREAD_SUPPORT=ON",
                "-DRCUTILS_NO_64_ATOMIC=ON",
                "-DRCUTILS_AVOID_DYNAMIC_ALLOCATION=ON"
            ]
        },
        "microxrcedds_client": {
            "cmake-args": [
                "-DUCLIENT_PIC=OFF",
                "-DUCLIENT_PROFILE_UDP=OFF",
                "-DUCLIENT_PROFILE_TCP=OFF",
                "-DUCLIENT_PROFILE_DISCOVERY=OFF",
                "-DUCLIENT_PROFILE_SERIAL=OFF",
                "-UCLIENT_PROFILE_STREAM_FRAMING=ON",
                "-DUCLIENT_PROFILE_MULTITHREAD=ON",
                "-DUCLIENT_PROFILE_CUSTOM_TRANSPORT=ON"
            ]
        },
        "rmw_microxrcedds": {
            "cmake-args": [
                "-DRMW_UXRCE_MAX_NODES=1",
                "-DRMW_UXRCE_MAX_PUBLISHERS=8",
                "-DRMW_UXRCE_MAX_SUBSCRIPTIONS=6",
                "-DRMW_UXRCE_MAX_SERVICES=12",
                "-DRMW_UXRCE_MAX_CLIENTS=0",
                "-DRMW_UXRCE_MAX_HISTORY=2",
                "-DRMW_UXRCE_TRANSPORT=custom"
            ]
        }
    }
}

@pablogs9
Copy link
Member

Could you try to set the value to 0 and rebuild the library and check if the issue is solved?

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 14, 2021

Could you try to set the value to 0 and rebuild the library and check if the issue is solved?

We set the value to 0, but it still doesn't work. "stream->last_handled" will still be changed in the "wait_session_status" function, we still need to call the "uxr_reset_stream_storage" function again.

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 14, 2021

Could you try to set the value to 0 and rebuild the library and check if the issue is solved?

We set the value to 0, but it still doesn't work. "stream->last_handled" will still be changed in the "wait_session_status" function, we still need to call the "uxr_create_session_retries" function again.

Is it reasonable to call the "uxr_reset_stream_storage" function again after the "wait_session_status" function?

@pablogs9
Copy link
Member

According to the documentation:

  • uxr_create_session_retries should be called with a "A uxrSesssion structure previously initialized.". Link
  • when a session is inited with uxr_init_session() streams are set to 0. So should not have anything in the stream storage when creating the session.

@chrisjaxy
Copy link
Author

According to the documentation:

  • uxr_create_session_retries should be called with a "A uxrSesssion structure previously initialized.". Link
  • when a session is inited with uxr_init_session() streams are set to 0. So should not have anything in the stream storage when creating the session.

I see that in the rmw_init function, uxr_init_session is called first, streams are set to 0, and then the uxr_create_session_retries function is called. In the uxr_create_session_retries function, streams are set to other values because the continuously sent topic or other data is monitored in the wait_session_status function. I think there should be a way to solve this situation.

@pablogs9
Copy link
Member

Working on it

@Acuadros95
Copy link
Contributor

Could you please post the agent log for this scenario with the debug flag -v6 activated?

@chrisjaxy
Copy link
Author

agent.log
I set RMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT to 0, publish a topic at a frequency of 1 Hz, and at the same time subscribe to a topic that has been published at 50 Hz.

@Acuadros95
Copy link
Contributor

Acuadros95 commented Dec 15, 2021

A few questions:

  • There is something strange going on, as the client is having problems to create a session in the beginning of your log.
    That is why your are getting too many session re-established messages. Check the reliability of your serial connection.

  • Are you using threads on your application?

  • You are creating a lot of entities, could you try to replicate this with only one subscriber?

  • Could you elaborate on this? Are you restarting the device to test the re connection? That example is mean to check the agent connection using pings, then reconnect when possible.

    Call "destroy_entities" via a trigger signal, and then reconnect via "create_entities". In addition, our other test of reconnection
    is to reset the MCU, so that it will directly connect through "create_entities".

@chrisjaxy
Copy link
Author

A few questions:

  • There is something strange going on, as the client is having problems to create a session in the beginning of your log.
    That is why your are getting too many session re-established messages. Check the reliability of your serial connection.
  • Are you using threads on your application?
  • You are creating a lot of entities, could you try to replicate this with only one subscriber?
  • Could you elaborate on this? Are you restarting the device to test the re connection? That example is mean to check the agent connection using pings, then reconnect when possible.

    Call "destroy_entities" via a trigger signal, and then reconnect via "create_entities". In addition, our other test of reconnection
    is to reset the MCU, so that it will directly connect through "create_entities".

1.yes, we use threads on your application.

2.we creat 7 publishers, 4 subscribers, 10 service servers. but only 2 publishers and 1 subscribers are used in the test.

3.we used a button to "destroy_entities" and "create_entities" to reconnect agent, In addition, we restarted the program to reconnect agent.

@Acuadros95
Copy link
Contributor

Can you share your code?

Threading has its limitations, for example:

  • Creation and destruction of entities are not thread safe.
  • You can use the ping to check if the agent is available once all your sessions and entities have been created.
  • You can use the ping when there is no session or entity created and you are checking if there is an agent available.

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 16, 2021

Can you share your code?

Threading has its limitations, for example:

  • Creation and destruction of entities are not thread safe.
  • You can use the ping to check if the agent is available once all your sessions and entities have been created.
  • You can use the ping when there is no session or entity created and you are checking if there is an agent available.
  1. Creation and destruction of entities and rclc_executor_spin_some are on one thread and we use mutithread to pub topic.

  2. when we destroy_entities finish or restarted the program, we "create_entities" until ping ok.

  3. we want create a simple example to reproduce, and then show the code

@pablogs9
Copy link
Member

Ok, when you have an example code for reproducing the issue we will take a look.

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 16, 2021

This is a simple example code, I added some debugging printing.

bool uxr_create_session_retries(
        uxrSession* session,
        size_t retries)
{
    uxr_reset_stream_storage(&session->streams);

    uint8_t create_session_buffer[CREATE_SESSION_MAX_MSG_SIZE];
    ucdrBuffer ub;
    ucdr_init_buffer_origin_offset(&ub, create_session_buffer, CREATE_SESSION_MAX_MSG_SIZE, 0u, uxr_session_header_offset(
                &session->info));

    uxr_buffer_create_session(&session->info, &ub, (uint16_t)(session->comm->mtu - INTERNAL_RELIABLE_BUFFER_OFFSET));
    uxr_stamp_create_session_header(&session->info, ub.init);

    bool received = wait_session_status(session, create_session_buffer, ucdr_buffer_length(&ub), (size_t) retries);
    bool created = received && UXR_STATUS_OK == session->info.last_requested_status;

    printf("%s, stream->last_handled :%d\r\n", __FUNCTION__, session->streams.input_best_effort->last_handled);
    /* Force reset stream. add*/
    // uxr_reset_stream_storage(&session->streams);
    return created;
}
bool uxr_receive_best_effort_message(
        uxrInputBestEffortStream* stream,
        uxrSeqNum seq_num)
{
    printf("%s, stream->last_handled :%d, seq_num: %d\r\n", __FUNCTION__, stream->last_handled, seq_num);
    bool available_to_read = (0 > uxr_seq_num_cmp(stream->last_handled, seq_num));
    if (available_to_read)
    {
        stream->last_handled = seq_num;
    }

    return available_to_read;
}
rclc_support_t test_support;
rcl_node_t test_node;
rclc_executor_t test_executor;
rcl_allocator_t test_allocator;
rcl_publisher_t test_publisher;
rcl_subscription_t test_subscription;

std_msgs__msg__UInt8 pile_heartbeat;
sensor_msgs__msg__BatteryState battery;
bool micro_ros_init_successful = false;
bool micro_ros_destory_flag = false;

void battery_info(const void * msgin)
{
	const sensor_msgs__msg__BatteryState * msg = (const sensor_msgs__msg__BatteryState *)msgin;

	LOG_DEBUG("battert current : %1f ,percentage : %3f, status: %d", msg->current, msg->percentage, msg->power_supply_status);
}

bool create_entities()
{	
	test_allocator = rcl_get_default_allocator();

	// create init_options
	RCCHECK(rclc_support_init(&test_support, 0, NULL, &test_allocator));

	// create node
	RCCHECK(rclc_node_init_default(&test_node, "test", "", &test_support));

	// create publisher
	RCCHECK(rclc_publisher_init_default(
		&test_publisher,
		&test_node,
		ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, UInt8),
		"std_msgs_msg_Int8"));

	RCCHECK(rclc_subscription_init_best_effort(
		&test_subscription,
		&test_node,
		ROSIDL_GET_MSG_TYPE_SUPPORT(sensor_msgs, msg, BatteryState),
		"/battery"));

	// create executor
	test_executor = rclc_executor_get_zero_initialized_executor();
	RCCHECK(rclc_executor_init(&test_executor, &test_support.context, 1, &test_allocator));
	RCCHECK(rclc_executor_add_subscription(&test_executor, &test_subscription, &battery, battery_info, ON_NEW_DATA));

	micro_ros_init_successful = true;

	return true;
}

void destroy_entities()
{
	rcl_subscription_fini(&test_subscription, &test_node);
	rcl_publisher_fini(&test_publisher, &test_node);
	rcl_node_fini(&test_node);
	rclc_executor_fini(&test_executor);
	rclc_support_fini(&test_support);

	micro_ros_init_successful = false;
}

void StartDefaultTask(void *argument)
{
	bool pingFlag = false;
	MicroRosChannelConfig();

	while (1)
	{
		osDelay(5);
                if ((micro_ros_init_successful == false) && !pingFlag)
		{
			if (rmw_uros_ping_agent(50, 1) != RMW_RET_OK)
		 	{
				LOG_DEBUG("ping error!");
				osDelay(200);
				continue;
			}

			LOG_DEBUG("ping ok!");
			pingFlag = true;
		}

		if (!micro_ros_init_successful) {
			LOG_DEBUG("start create_entities!");
			if (create_entities() == false) {
				micro_ros_destory_flag = true;
				LOG_DEBUG("create_entities fail, destroy_entities!");
			} else {
				LOG_DEBUG("create_entities finish");
			}
		} else {
			rclc_executor_spin_some(&test_executor, RCL_MS_TO_NS(1));
		}

		if ((micro_ros_init_successful == true) && (micro_ros_destory_flag == true)){
			LOG_DEBUG("destroy_entities!");
			destroy_entities();
			micro_ros_destory_flag = false;
		}
        }
}
}

agent.log
MCU.log

@Acuadros95
Copy link
Contributor

ok, I have replicated it and you are right:

  • When the destroy process starts, the Agent is still sending the subscription data to the board ([==>> DDS <<==] on the Agent log).

  • This messages are saved on the serial buffer and they are processed on the destroy/creation process, we can see it clearly on the MCU log:

    [log_d]:pal_uros.c[0815]: destroy_entities!
    Kavabot:/$ uxr_receive_best_effort_message, stream->last_handled :576, seq_num: 577
    uxr_receive_best_effort_message, stream->last_handled :577, seq_num: 578
    uxr_receive_best_effort_message, stream->last_handled :578, seq_num: 579
    uxr_receive_best_effort_message, stream->last_handled :579, seq_num: 580
    uxr_receive_best_effort_message, stream->last_handled :580, seq_num: 581
    uxr_receive_best_effort_message, stream->last_handled :581, seq_num: 582
    uxr_receive_best_effort_message, stream->last_handled :582, seq_num: 583
    uxr_receive_best_effort_message, stream->last_handled :583, seq_num: 584
    uxr_receive_best_effort_message, stream->last_handled :584, seq_num: 585
    uxr_receive_best_effort_message, stream->last_handled :585, seq_num: 586
    [log_d]:pal_uros.c[0803]: start create_entities!
    Kavabot:/$ uxr_receive_best_effort_message, stream->last_handled :65535, seq_num: 568
    uxr_receive_best_effort_message, stream->last_handled :568, seq_num: 569
    uxr_receive_best_effort_message, stream->last_handled :569, seq_num: 570
    uxr_receive_best_effort_message, stream->last_handled :570, seq_num: 571
    uxr_receive_best_effort_message, stream->last_handled :571, seq_num: 572
    uxr_receive_best_effort_message, stream->last_handled :572, seq_num: 573
    
  • stream->last_handled is reset at the beginning of uxr_create_session_retries, but its messed up by older messages on wait_session_status.

It makes sense to call uxr_reset_stream_storage after the session is successfully created, I am going to test a solution. In the mean time, it is a good idea to flush the MCU UART buffer before calling create_entities().

@chrisjaxy
Copy link
Author

I added a service to the above program and uxr_reset_stream_storage after wait_session_status.I found that sometimes I could not use service call normally because DDS can not get the data. What could be the cause?
agent.log
first
MCU.log
second
servicecall.log
third

@chrisjaxy
Copy link
Author

I added a service to the above program and uxr_reset_stream_storage after wait_session_status.I found that sometimes I could not use service call normally because DDS can not get the data. What could be the cause? agent.log first MCU.log second servicecall.log third

I found that it is the same if uxr_reset_stream_storage is not added after wait_session_status. After reconnection, the service call may not be able to send data normally.

@Acuadros95
Copy link
Contributor

Services use subscriptions to get the request/response messages, so it makes sense.
Anyway, make sure that you start the service request on your ros2 terminal when the service is already created on the agent side.

@chrisjaxy
Copy link
Author

Services use subscriptions to get the request/response messages, so it makes sense. Anyway, make sure that you start the service request on your ros2 terminal when the service is already created on the agent side.

After the first connection is successful, the ros2 terminal will not fail to respond when calling the service call. When the reconnection is complete, sometimes the service call will not respond, service list can see it was created. What does "make sure that is already created" mean? Hasn't the service been created yet?

@Acuadros95
Copy link
Contributor

What does "make sure that is already created" mean? Hasn't the service been created yet?

When you reconnect the service is deleted and created again, so if you are on the middle of a request when this happens, the request will fail.

Also, it can be a good idea to increase the executor spin time on rclc_executor_spin_some(&test_executor, RCL_MS_TO_NS(1)); to atleas 100 or 200 milliseconds.

@chrisjaxy
Copy link
Author

What does "make sure that is already created" mean? Hasn't the service been created yet?

When you reconnect the service is deleted and created again, so if you are on the middle of a request when this happens, the request will fail.

Also, it can be a good idea to increase the executor spin time on rclc_executor_spin_some(&test_executor, RCL_MS_TO_NS(1)); to atleas 100 or 200 milliseconds.

After the reconnection is successful, as long as I don’t set micro_ros_destory_flag true, the code will only run rclc_executor_spin_some(&test_executor, RCL_MS_TO_NS(1)) and osDelay(5). The request may still fail sometimes. It’s the same if spin time set 100 -200.

@Acuadros95
Copy link
Contributor

Acuadros95 commented Dec 20, 2021

Are you using best effort services (rclc_service_init_best_effort)?
If that is the case, try with reliable ones: rclc_service_init_default

@chrisjaxy
Copy link
Author

Are you using best effort services (rclc_service_init_best_effort)? If that is the case, try with reliable ones: rclc_service_init_default

I tried to use rclc_service_init_default, the phenomenon is the same. After several consecutive reconnections, when I use the ros2 service call, there is no printing of [==>> DDS <<==] in the agent, so the terminal does not respond.

@Acuadros95
Copy link
Contributor

Acuadros95 commented Dec 20, 2021

We just merged the second uxr_reset_stream_storage:
eProsima/Micro-XRCE-DDS-Client#304

And regarding the services, what are you exactly testing?

  • What is the period between reconnections?
  • Are you sure you are not launching the request when a reconnections is happening?
  • Why are you testing the reconnection functionality on a serial transport? How could a disconnection happen on this case?

Also, this changes may help: micro-ROS/rmw_microxrcedds#224

@chrisjaxy
Copy link
Author

chrisjaxy commented Dec 21, 2021

We just merged the second uxr_reset_stream_storage: eProsima/Micro-XRCE-DDS-Client#304

And regarding the services, what are you exactly testing?

  • What is the period between reconnections?
  • Are you sure you are not launching the request when a reconnections is happening?
  • Why are you testing the reconnection functionality on a serial transport? How could a disconnection happen on this case?

Also, this changes may help: micro-ROS/rmw_microxrcedds#224

Thank you for your help.

We are testing the situation after the mcu reconnects through the microros if the agent is not closed.

  • There is no specific interval, and the reconnection is triggered manually.
  • We confirm launching the request after connecting. At this time, the ros2 terminal publishes the topic, and the MCU can subscribe to it, but there is a certain probability that the service cannot be responded to.
  • The reconnection situation may occur in: 1. The ota upgrade of the MCU. 2. The MCU is connected to the agent through Bluetooth, wifi transparent transmission, 3. The MCU is powered off or restarted due to some external or internal reasons.

micro-ROS/rmw_microxrcedds#224 does not seem to help. I found that the microros_ros threads increased after reconnection. Is it possible that this problem is related to the agent?

@Acuadros95
Copy link
Contributor

Is it possible that this problem is related to the agent?

I have replicated the issue and it looks quite possible, working on it.

@Acuadros95
Copy link
Contributor

We have found a bug on services destruction on the Agent side, check here: eProsima/Micro-XRCE-DDS-Agent#295
This fixes the service responses and the "thread leak" you found.

@chrisjaxy
Copy link
Author

We have found a bug on services destruction on the Agent side, check here: eProsima/Micro-XRCE-DDS-Agent#295 This fixes the service responses and the "thread leak" you found.

Thank you for your help, I think it has solved the issue

@Acuadros95
Copy link
Contributor

Great news!
Closing, reopen if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants