Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove lock in HelixStateTransitionHandler #1681

Merged
merged 2 commits into from
Apr 27, 2021

Conversation

xyuanlu
Copy link
Contributor

@xyuanlu xyuanlu commented Mar 24, 2021

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#1675 Remove requested state update in task framwork.
This is the first PR of the issue.
This PR is for lock scope deduction only, because lock scope change expects more review and has independent logic itself.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

*High level design of the lock scope:

We can prevent over write by checking the in memory state _stateModel and update ZK conditionally in message handling. Also the in memory state _stateModel update (in both task Runner thread and state transition handling thread), comparing and ZK update need to be protected by a lock.

*Is it safe to do so?
In current message handling design, HelixTaskExecutor (a message lister that creates and schedules all message handler) guarantees that only one on going state transition per each partition/task at a time, removing this synchronization shouldn't cause any problem.
When onMessage receives a STATE_TRANSITION_CANCELLATION messages and a ST message for the same entity, a HelixStateTransitionCancellationHanlder is created can call cancel(). Changing the lock scope does not have any influence on HelixStateTransitionCancellationHanlder.

*Why need to Remove lock in HelixStateTransitionHandle
In the design to remove requested state, task runner thread update CurrentState when task finishes or ends with error state.
However, it is possible that the state transition handler is also trying to update CurrentState. It could be an INIT→ RUNNING, RUNNING→CANCEL, or any other state transition message message. We will run into the write conflict.
This PR removes this redundant synchronization. Will add a smaller scoped lock in the PR that removes requested state.

Tests

  • The following tests are written for this issue:
    Our current Task framework tests covers all state transition types, should be able to cover this change.

  • The following is the result of the "mvn test" command on the appropriate module:

[ERROR] Failures:                                                                                                                                                                      
[ERROR] org.apache.helix.integration.rebalancer.DelayedAutoRebalancer.TestDelayedAutoRebalanceWithDisabledInstance.beforeTest(org.apache.helix.integration.rebalancer.DelayedAutoRebalan
cer.TestDelayedAutoRebalanceWithDisabledInstance)                                                                                                                                      
[ERROR]   Run 1: TestDelayedAutoRebalanceWithDisabledInstance.beforeTest:304->enableInstance:315 expected:<true> but was:<false>                                                       
[INFO]   Run 2: PASS                                                                                                                                                                   
[INFO]                                                                                                                                                                                 
[INFO]                                                                        
[ERROR] Tests run: 9, Failures: 1, Errors: 0, Skipped: 6                       
[INFO]                                                        
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE                                                                                                                                                                   
[INFO] ------------------------------------------------------------------------                                          
[INFO] Total time:  40.449 s                                                                                                                                                           
[INFO] Finished at: 2021-04-19T16:45:27-07:00  
[INFO] ------------------------------------------------------------------------
rerun:
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 78.981 s - in org.apache.helix.integration.rebalancer.DelayedAutoRebalancer.TestDelayedAutoRebalanceWithDisabledI
nstance                                                            

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@xyuanlu xyuanlu marked this pull request as ready for review March 24, 2021 20:17
Copy link
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: will it be save to exclude the invoke() for the locking? There could be two messages at the same time invoked for the same state mode: 1 regular ST and 1 cancel message. I am not quite sure whether that could cause problem.

@xyuanlu
Copy link
Contributor Author

xyuanlu commented Mar 26, 2021

Question: will it be save to exclude the invoke() for the locking? There could be two messages at the same time invoked for the same state mode: 1 regular ST and 1 cancel message. I am not quite sure whether that could cause problem.

Thanks for the offline discussion!

For STATE_TRANSITION type messages, for each target, it is garnered in validateAndProcessStateTransitionMessage, only one ST at a time .

When onMessage receives a STATE_TRANSITION_CANCELLATION messages and a ST message for the same entity, it will
either

  1. try to Delete the previous created stateTransitionHandler if the to-be-canceled state transition is not started. In this step, there will be no stateTransitionHandler created for this cancelation message.
  2. or If the the to-be-canceled state transition is on going, we will try to create a HelixStateTransitionCancellationHanlder for this cancelation message and It will directly call cancel().
    So cancelation's logic should not be infected by any lock scope change in HelixStateTransitionHandler
public void onMessage(){
    for (Message message : messages) {
        if (checkAndProcessNoOpMessage(message, instanceName, changeContext, manager, sessionId,
          stateTransitionHandlers)) {  /*  Call cancelNotStartedStateTransition */                        ------(1)
   .......
        msgHandler = createMessageHandler(message, msgWorkingContext);

        if (message.getMsgType().equals(MessageType.STATE_TRANSITION.name()) || message.getMsgType()
          .equals(MessageType.STATE_TRANSITION_CANCELLATION.name())) {
          if (validateAndProcessStateTransitionMessage(message, manager, stateTransitionHandlers,    ------(2)
              msgHandler)) {
    }
}
......
}
 private boolean validateAndProcessStateTransitionMessage{
     if (stateTransitionHandlers.containsKey(messageTarget)) {
        // If there are 2 messages in same batch about same partition's state transition,
        // the later one is discarded
        Message duplicatedMessage = stateTransitionHandlers.get(messageTarget)._message;
        String errMsg = String.format(
            "Duplicated state transition message: %s. Existing: %s->%s; New (Discarded): %s->%s",
            message.getMsgId(), duplicatedMessage.getFromState(), duplicatedMessage.getToState(),
            message.getFromState(), message.getToState());
        updateUnprocessableMessage(message, null /* exception */, errMsg, manager);
        return false;
      }
}

...
// Try to cancel this state transition that has not been started yet.
// Three Types of Cancellation: 1. Message arrived with previous state transition
//                              2. Message handled but task not started
//                              3. Message handled and task already started
// This method tries to handle the first two cases, it returns true if no further cancellation is needed,
// false if not been able to cancel the state transition (i.e, further cancellation is needed).
private boolean cancelNotStartedStateTransition(Message message,
      Map<String, MessageHandler> stateTransitionHandlers, HelixDataAccessor accessor,
      String instanceName) {
...
}

@jiajunwang
Copy link
Contributor

Question, Could you explain more about the necessity of the change? In detail, the main thread creates taskRunner thread which updates the state at the end. And the main thread continues to run after taskRunner thread is created. In this case, even the main thread holds the lock, the taskRunner can still execute. And the runner thread can wait until the main thread finishes the state update. I remember you mentioned this when we talked offline, but please document the detailed reason in this ticket too.

@xyuanlu
Copy link
Contributor Author

xyuanlu commented Mar 29, 2021

Question, Could you explain more about the necessity of the change? In detail, the main thread creates taskRunner thread which updates the state at the end. And the main thread continues to run after taskRunner thread is created. In this case, even the main thread holds the lock, the taskRunner can still execute. And the runner thread can wait until the main thread finishes the state update. I remember you mentioned this when we talked offline, but please document the detailed reason in this ticket too.

Thanks for the question.
For most task state transition messages, it is true that the main state transition handler thread creates taskRunner thread and the main thread continues to run after taskRunner thread is created.

However, when handling 'RUNNING' -> 'STOPPED' , stateTransition handler creates a taskRunner, wait for state transition finishes and then continue the rest part. So if we include the invoke() in the lock scope, we will have a deadlock here.


  @Transition(to = "STOPPED", from = "RUNNING")
  public String onBecomeStoppedFromRunning(Message msg, NotificationContext context) {
    String taskPartition = msg.getPartitionName();
    if (_taskRunner == null) {
      throw new IllegalStateException(String.format(
          "Invalid state transition. There is no running task for partition %s.", taskPartition));
    }

    _taskRunner.cancel();
    TaskResult r = _taskRunner.waitTillDone();
    LOG.info("Task {} completed with result {}.", msg.getPartitionName(), r);

    timeout_task.cancel(false);

    return r.getInfo();
}

@xyuanlu xyuanlu marked this pull request as draft April 2, 2021 21:07
@xyuanlu xyuanlu changed the title Reducing lock scope in HelixStateTransitionHandler Remove lock in HelixStateTransitionHandler Apr 2, 2021
@xyuanlu
Copy link
Contributor Author

xyuanlu commented Apr 2, 2021

Changing this PR to removing the redundant lock since it is not logically complete.
Will add a smaller scoped lock in the PR that removes requested state.

@xyuanlu xyuanlu marked this pull request as ready for review April 2, 2021 22:56
@xyuanlu xyuanlu force-pushed the task_requestedState2 branch from f0a5a46 to 37434f9 Compare April 15, 2021 22:23
@xyuanlu
Copy link
Contributor Author

xyuanlu commented Apr 16, 2021

This PR is ready to be merged. Approved by @dasahcc
Final commit message:

Remove synchronization on _staleModel in HelixStateTransitionHandler

In current message handling design, HelixTaskExecutor (a message lister that creates and schedules all message handler) guarantees that only one on-going state transition is created at a time. This change remove this duplicated synchronization in HelixStateTransitionHandler.

@xyuanlu xyuanlu force-pushed the task_requestedState2 branch 2 times, most recently from 2e363a4 to 9ade91e Compare April 20, 2021 23:41
@xyuanlu xyuanlu force-pushed the task_requestedState2 branch from 9ade91e to 1661895 Compare April 21, 2021 17:43
@xyuanlu xyuanlu force-pushed the task_requestedState2 branch from 1661895 to 6f95bf8 Compare April 23, 2021 21:36
@junkaixue junkaixue merged commit 07e3ed5 into apache:master Apr 27, 2021
@xyuanlu xyuanlu deleted the task_requestedState2 branch April 30, 2021 00:11
xyuanlu added a commit to xyuanlu/helix that referenced this pull request May 16, 2021
* reducing lock scope in HelixStateTransitionHandler
* remove lock
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants