[AIRFLOW-5567] BaseReschedulePokeOperator #6210
Conversation
My team was thinking of building similar features, but we haven't started work and it's possible our plans will change. Either way, I will gladly help implement this feature in the community. One observation I can add now: it is useful to be able to execute code after the operation is completed.
Executing code only before the operator runs is insufficient in more complex processes. The current implementation also lacks a mechanism for passing the context of an operation to the next step, e.g. the ID of the created object.
@mik-laj this is mostly a discussion piece of a PR, so thanks for joining in!
I think it's worth introducing a standardized way of passing a resource ID. This will also be very useful in extra links.
```python
        Check if we have the XCOM_EXTERNAL_RESOURCE_ID_KEY
        for this task and call submit_request if it is missing.
        """
        if not self.get_external_resource_id(context):
```
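For context, here is the shape of the pattern under review, reconstructed as a minimal sketch from the diff fragments quoted in this thread. The method names match the PR; the bodies are an approximation, not the exact PR code.

```python
from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY  # key added by this PR
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class BaseAsyncOperator(BaseSensorOperator):
    """Submit a long-running request once, then poke for completion."""

    def get_external_resource_id(self, context):
        return context['ti'].xcom_pull(
            task_ids=self.task_id, key=XCOM_EXTERNAL_RESOURCE_ID_KEY)

    def set_external_resource_id(self, context, value):
        context['ti'].xcom_push(key=XCOM_EXTERNAL_RESOURCE_ID_KEY, value=value)

    def execute(self, context):
        # Submit only if we have not done so already (the guard above),
        # then let the sensor machinery poke until poke() returns True.
        if not self.get_external_resource_id(context):
            self.set_external_resource_id(context, self.submit_request(context))
        super().execute(context)
        self.process_results(context)
```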
I think that we should not rely on this content. I can imagine that there may be operators that do not operate on any identifier.
If an operator doesn't operate on an identifier, what is there to poke? Maybe we can have BaseAsyncOperator set a default 'PLACEHOLDER_ID'?
It may be that only one such object can exist in the universe, in which case the question of a unique identifier always comes up. After thinking about it for a while, though, I have concerns about performance: now we have one query to the database, and if we also tried to save the state there would always be two queries.
Apologies, I'm really not following you here.
By "one object in the universe" do you mean there can only be one external resource?
Why is the object uniqueness in question?
Can you explain the performance concern? Performance of what? The task? The submit? I'm envisioning this async operator pattern only being used for long-running operations where you can tolerate a poke interval of more than one minute.
```python
                 **kwargs) -> None:
        super().__init__(mode='reschedule', *args, **kwargs)

    def submit_request(self, context) -> string:
```
Suggested change:

```diff
-    def submit_request(self, context) -> string:
+    def submit_request(self, context) -> IT:
```
I imagine that the identifier will be a different type, e.g. a number, or an array with several identifiers. What do you think about using generic types?
@mik-laj why not string / dict / json-serializable string? if this is using xcom, and xcom pickling is going away, is that not a good solution? Even if we were to add a new database column and not use xcom, wouldn't json serializability be desired? (have not really used generics and not sure what their value is here... interested in understanding more about what you are suggesting)
A generic class allows you to provide information about the type, and at the same time specify that the same type occurs in several places. It can still be a string or a dict or another serializable value, but that won't be specified by this class; it will have to be determined by the child class.
For example: suppose we have a Stack class with two methods, `push(A) -> None` and `poll() -> A`. If we want to express that `A` must be the same type in both cases, we can mark it using a generic. The child class then has to specify the type, and we don't have to, because we don't know what type it will be. I know it should be some subset of types, but I don't know which type exactly.
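A minimal sketch of that idea using Python's `typing` module and the Stack example above (names are illustrative):

```python
from typing import Generic, List, TypeVar

IT = TypeVar('IT')  # the identifier type, left open by the base class


class Stack(Generic[IT]):
    """push() and poll() are forced to agree on the same type IT."""

    def __init__(self) -> None:
        self._items: List[IT] = []

    def push(self, item: IT) -> None:
        self._items.append(item)

    def poll(self) -> IT:
        return self._items.pop()


class StringStack(Stack[str]):
    """The child class pins IT to str for both methods at once."""
```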
A very useful change. I hope that together we will be able to develop it further.
```python
from airflow.exceptions import AirflowException
from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY


class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
```
I had a similar thought process. Where I ended up was that the implementation of the common parent would be nearly identical to `BaseSensorOperator`, the logic being that `BaseAsyncOperator` just implements a few more methods (e.g. `[submit/process]_request`) and is opinionated that `mode=reschedule`.
The main reasoning to subclass vs. enhancing the existing Sensor is that I wanted to use the same `execute` method but have a class nomenclature indicating that subclasses of this are intended to take action. I think people are less likely to understand that a Sensor can take action rather than being a read-only poller.
The spirit of this class is to provide an improved way of doing what we'd traditionally do with Operators, so I wanted a class with the name Operator. Perhaps this is short-sighted.
I've refactored to do the submit / process in the execute method of `BaseSensorOperator`.
Ref. the name, I agree that it's important for people to consider this as something that takes action and that's implied by the Operator suffix. That's why I consider BaseSensorOperator in its current form to have a misleading name: it implies that it's both for taking action and sensing. Luckily the actual derived sensor implementations are all suffixed with Sensor - e.g. EmrStepSensor - which doesn't mislead.
I feel like having "Async" in your new class highlights the wrong thing. It's not that it is async in the sense of non-synchronous or rescheduled (Airflow terminology) that makes this class unique, since all sensors derived from BaseSensorOperator can operate as async/rescheduled: it implies not occupying a task slot. What differentiates your class is that it combines action and sensing. That's why I tend to think that the already used BaseSensorOperator would actually be a great name to describe what your new class does; so why not enhance the existing class to optionally do the action part?
There might be technical reasons to have a separate class - e.g. code is easier to read or to reduce risk by extending rather than modifying (SOLID) - but I can't think of a better alternative to "Async": BaseDoThenWaitOperator (hah); BaseActionSensor; maybe BaseAtomicOperator; eventually back to BaseSensorOperator.
I like the idea of async operators! However, I have a small concern regarding the approach to developing new operators. How should I decide whether I should create an async operator or an operator+sensor pair? Or should I try to create an operator that could be used in sync or async mode by passing an `async=True` flag?
@nuclearpinguin I think those concerns are quite valid.
IMO, if your operator and sensor refer to the same resource (job, query, etc.) you'd almost always want to use an async operator. This has to do with retries: if your sensor fails and you want to retry the submit, then it should be an async operator (this wouldn't be possible with operator+sensor without a whole re-run of the DAG). I'd love to hear differing opinions here.
I was just thinking about this today as well. Similarly: "for our existing operators, how do we expose async versions?" Are these new operators? I think you could develop a new operator that extends this one and branches in `execute`:

```python
if self.async:
    super().execute(context)  # use the `BaseAsyncOperator`'s execute defined in this PR
else:
    # Note: this is what most operators that wait on long-running operations look like.
    self.submit_request(context)
    while not self.poke():
        time.sleep(self.poke_interval)
    self.process_results(context)
```
I can refactor the
Digging deeper on
Dataproc operators will be rewritten to use the google-cloud-* libraries in the near future. Will that bother you?
@mik-laj Is there any operator you can think of that might be a better example to refactor as an async operator, so that it might not be throw-away work?
I have noticed, in looking at the existing Dataflow and Dataproc hooks, that many hooks conflate job submission and
Not sure what cloud provider you're using, but if you have access to AWS: EmrAddStepsOperator and EmrStepSensor. Even if you don't, these do have existing unit tests. There's an ongoing thread (link) in airflow-dev about exactly these.
```python
from airflow.utils.decorators import apply_defaults


class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
```
Is there a need to reference SkipMixin given that BaseSensorOperator already extends this class?
```python
                 **kwargs) -> None:
        super().__init__(mode='reschedule', *args, **kwargs)

    @abstractmethod
```
Should we extend ABC in this class? Maybe there is a reason we don't do this?
```python
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults
```
OK, I think I may have found an issue.
In live testing it appeared XCom was not recorded. In actuality, the problem is that XCom is cleared when the task restarts after a reschedule. So after the first rescheduled poke, the XCom is obliterated and the resource id is lost.
XCom data appears to be cleared at the start of each task. See here. So when the task restarts after a reschedule, we lose the resource id.
Probably a similar explanation for the invisible-logs issue I commented on earlier.
Here's my sample operator:

```python
class Op(BaseAsyncOperator):
    def submit_request(self, context: Dict) -> str:
        return '129uh8981h9u80eh'

    def poke(self, context):
        ti = context['ti']
        print(f"try_number: {ti.try_number}")
        for k, v in context.items():
            print(f"{k}: {v}")
        print("\n sleeping")
        import time
        time.sleep(60)
        return False
```
Not sure what the best way to resolve this is.
It's also curious that the state goes initially to `up_for_reschedule` before later becoming `up_for_retry`... I am not sure why that is, but I have not used sensors / rescheduling before.
> XCom data appears to be cleared at the start of each task. See here. So when the task restarts after a reschedule, we lose the resource id.

Ouch, that is quite a limitation for using XCom. Is there anywhere else to store task state? If not, some options:

1. Use `mode='poke'`, at the expense of using up a task slot for long-running tasks. Not ideal, but gets "correct" behaviour. It'd then be "atomic" rather than "async" behaviour.
2. Enhance `TaskInstance` to make clearing XCom data conditional by delegating it to the task/operator. There could be a new function like `BaseOperator.pre_execute_clear_state()` which can be overridden by implementers. `BaseOperator` is already aware of / coupled with `TaskInstance`, so I don't think we'd be breaking separation of concerns any more than it already is?
3. There might be enough justification to say that for rescheduled tasks (i.e. the transition from `State.UP_FOR_RESCHEDULE` to `State.RUNNING`), `TaskInstance` shouldn't clear XCom. The call to `run()` -> `_check_and_change_state_before_execution()` does know about this state change, but I see in the code that there are places which bypass the call to `run()` and go directly to `_run_raw_task()` (e.g. the CLI).
Seems that some form of option 2 is the least risky way to get the desired outcome.
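A minimal sketch of what option 2 could look like. The hook name `pre_execute_clear_state()` comes from the comment above; the `TaskInstance` methods shown here (including `was_rescheduled()`) are assumptions, not current Airflow code.

```python
class BaseOperator:
    def pre_execute_clear_state(self, ti):
        # Default behaviour stays as today: wipe XCom before every try.
        ti.clear_xcom_data()


class BaseAsyncOperator(BaseOperator):
    def pre_execute_clear_state(self, ti):
        # Retain XCom (and thus the external resource id) across reschedules;
        # only clear on a genuinely fresh attempt. was_rescheduled() is a
        # hypothetical helper on TaskInstance.
        if not ti.was_rescheduled():
            ti.clear_xcom_data()
```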
Ref. `State.UP_FOR_RESCHEDULE`: this was added back in January by @seelmann as part of AIRFLOW-2747. Perhaps he has some thoughts on whether or not we can change the contract for rescheduled tasks to say that we retain XCom state? It's a fairly recent feature, so that might be okay, e.g. operators may not be depending on the current behaviour.
It seems to me we definitely want to wipe XComs at the end of a DagRun, so as not to affect a re-run of this DAG for this execution date.
What kind of consensus should we build on that kind of change for XCom in reschedule scenarios?
We should be careful with renaming classes. I think the async capabilities should be merged into the BaseOperator.
I agree with trying to put the behaviour into BaseOperator. As it stands, the use-cases are:

1. Only operate (`BaseOperator` <- e.g. `EmrAddStepsOperator`)
2. Only sense (`BaseSensorOperator` <- e.g. `EmrStepSensor`)
3. Do both (to be determined)
4. Do none (`BaseOperator` <- e.g. `DummyOperator`)
Currently 1, 2 and 4 are implemented as variations of execute(); 2 is the odd one out in having a special class, as would 3 with the current proposal.
If BaseOperator had a default execute() that runs two phases, the other behaviours can be achieved by optionally implementing either of the phases.
It's a typical problem using class hierarchies for behaviours, that it's hard to mix and match. I was actually thinking about whether we could use traits to keep the behaviours cleanly separated into their own implementations. But it's probably overkill with so few.
Something else to consider about a potential "multi-phase" BaseOperator is that we already have `pre_execute()` + `execute()` + `post_execute()`, and the order is governed by `TaskInstance._run_raw_task()` (link). My understanding is that these hooks are always called, even when tasks are rescheduled. I'm not entirely sure of the purpose; the docs simply say "... it's mostly a hook for people deriving operators", and the only two places I can see in the source code where they're used are: CloudSqlQueryOperator, to ensure the db connection is created and torn down; and Lineage support, to ensure the lists of inlets and outlets are built and saved to XCom (another kind of state retention!).
When considering extra phases, they'd be orthogonal to above and really some form of "execution phase N": the BaseOperator starts to behave a bit like a SubDag where the overall task status / state machine runs its course while what would have been sub-tasks in the SubDag now run as sequential execution phases in the same task. Instead of the sub-tasks being visible in the UI as DAG nodes, these execution phases are internalised into the BaseOperator.
Why would we want to do that? The justification is that these execution phases are tightly coupled and should be retried together (and in order) as some kind of atomic unit.
If we were to generalise this much, I think we should support being able to run through the execution phases synchronously (taking up a task slot) or asynchronously (using reschedule).
Sorry this is getting a bit philosophical but I'm trying to provide some guiding context about why we're doing this! I also see it as a potential future foundation for a mechanism to compose operators via the Airflow API.
Back to the practical, can I propose:

1. New BaseOperator hooks are implemented - something like `execute_phase_start()` and `execute_phase_final()` - that by default simply `raise NotImplementedError()`, as per the current `execute()` implementation.
2. A new BaseOperator default `execute()` implementation that calls the two hooks in order. If the task was rescheduled, skip the first phase. I've seen code that detects this case in `BaseSensorOperator.execute()` (link).
3. Existing overrides of `BaseOperator.execute()` should then behave as-is, i.e. the phase hooks are simply never called. The current `BaseSensorOperator` implementation would continue to work, minimising the amount of work needed in this PR; it could later be reimplemented to use the new phased execution.
4. Any new operators implemented to combine an operation + sensor can override the new phase hooks, since they'd be multi-phase aware.
Allowing only two phases simplifies the implementation for now given the immediate requirements, but could even be extended in future for any number of phases if we wanted to store the current/next phase as part of the task instance state. I can't think of a good practical need right now.
What do you guys think, is this proposal (or something like it) workable?
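To make points 1 and 2 concrete, here is a rough sketch. The hook names come from the proposal above; `_was_rescheduled()` stands in for the reschedule-detection trick referenced there and is an assumption, not existing Airflow code.

```python
class BaseOperator:
    def execute_phase_start(self, context):
        raise NotImplementedError()  # e.g. submit the job

    def execute_phase_final(self, context):
        raise NotImplementedError()  # e.g. wait for and process results

    def execute(self, context):
        # Default implementation: run both phases in order, but skip the
        # first phase if this try is a continuation after a reschedule.
        if not self._was_rescheduled(context):
            self.execute_phase_start(context)
        self.execute_phase_final(context)

    def _was_rescheduled(self, context):
        # Hypothetical helper: could query TaskReschedule rows for this
        # task instance, as BaseSensorOperator.execute() does today.
        return False
```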
I can't see why we need another `execute_phase_{start,final}`. The pre_execute can be overridden by a custom implementation, but in practice this hasn't been done much.
So `TaskInstance._run_raw_task()` is what's being executed on the worker, and this will just actually run the whole task to the end.

> The justification is that these execution phases are tightly coupled and should be retried together (and in order) as some kind of atomic unit.

I really agree with that. We've seen situations where the DAG would start with creating a cluster, but this breaks the atomicity of the DAG. When the cluster is killed (maybe because of an idle timeout), the only way to retry the workflow is by restarting the whole DAG.
It is still not clear to me what the hierarchy will look like. I think we should move most of the reschedule logic into the BaseOperator. Maybe we can enrich the `context` which is being passed to the execute method with reschedule logic, to let the implementation of the execute step know that you're rescheduling.
```python
        self.assertIsNotNone(resource_id)
        self.assertTrue(resource_id.startswith('job_id'))

    def test_ok_with_reschedule(self):
```
@dstandish I'll try to improve this unit test until it reproduces your issue w/ XCom clearing.
The PR has been merged; #6370
Yes, but I need more time to get this to work.
@jaketf @JonnyIncognito I'm getting a lot of pushback on idempotency, which is a fair point. Keeping state makes the function impure by definition. How about adding labels to the services that we launch on GCP? The label set should contain the tuple (composer_instance, dag_id, task_id, execution_date) to make it unique. In that case, we could just query the state using the GCP APIs. WDYT?
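A sketch of the labelling idea. Everything here is illustrative: the composer instance name would have to come from configuration, and real GCP labels impose character restrictions on values.

```python
def unique_resource_labels(context):
    """Build a label set that uniquely identifies the external resource."""
    ti = context['ti']
    return {
        'composer_instance': 'my-composer-env',  # assumed known from config
        'dag_id': ti.dag_id,
        'task_id': ti.task_id,
        # GCP label values allow lowercase letters, digits, dashes, underscores.
        'execution_date': ti.execution_date.strftime('%Y-%m-%dt%H-%M-%S'),
    }
```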
@Fokko are you considering abandoning persisting TI state in the metastore, so that you'd persist state externally? I think this TI state persistence is a crucial feature and must be supported. XCom can support it with appropriate changes; we just have to figure out what those are. Another option is to just let the writer of the operator choose when XCom is cleared, by calling `clear_xcom_data` when appropriate.
I would like to investigate other options. I'm not ruling anything out, but I want to think about it. If we could avoid storing state in the metastore, it would make life simpler, and I like simple.
@Fokko the ability to retain state is super important for a rescheduled task, I agree with @dstandish on this. These are mostly used where external long-lived resources are created. For sensors we've dodged the issue because they tend to use XCom to fetch the state from a prior task. But if we're now going further and having multi-phased / composed tasks, there needs to be a general Airflow mechanism for that state rather than hacking around it with e.g. cloud-provider specific things like labels.
> Keeping state makes the function impure by definition.

Depends on the scope of the idempotency. I can see why this argument could be used for existing operators that are supposed to do the same thing for every call of `execute()`. However, ones with multiple phases should be idempotent by the time the task succeeds, across multiple calls to `execute()`, i.e. the scope is a bit wider.
How about we only clear the state when needed? Options:

1. Delegate clearing to the task (default implementation always clears). If you look at my proposal in the other thread (link), the task could clear all state on the initial phase but not for subsequent phases / reschedules.
2. Clear state only if there was no prior transition to `State.UP_FOR_RESCHEDULE`. I've seen code that detects this, but it'd add a cost for every user (link).
3. Add some concept of a reschedule counter to the task instance and only clear all state on the first execution. If it were more along the lines of a phase counter, it could later be used for tasks that have more than two phases, to figure out which code path to follow on each `execute()`.
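For reference, the reschedule-detection trick mentioned in option 2, roughly as `BaseSensorOperator.execute()` does it in Airflow 1.10 (a paraphrased sketch, not a verbatim copy):

```python
from airflow.models import TaskReschedule


def is_first_try(context):
    # Prior TaskReschedule rows mean this run is a continuation after a
    # reschedule, so state should be retained rather than cleared.
    return not TaskReschedule.find_for_task_instance(context['ti'])
```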
So the problem is with `ti.xcom_pull(include_prior_dates=True)`: if you run it on the first poll, the result might be different than on subsequent polls. Personally I would be fine with the first option, and having some specific key for storing the state.
@JonnyIncognito I'm a GCP-er and don't have AWS access.
@jaketf I'm almost certainly going to write an operator that combines EmrAddStepsOperator and EmrStepSensor on top of your work once we figure out the state retention issue, so I'd be happy to contribute it to your PR as an example if you don't get to a GCP example first. Question is what we're going to name this style of operators, related to the other thread about the class hierarchy and base naming. Options:
I suspect these are going to become very popular, so it's important to start off with a good convention!
```python
from airflow.ti_deps.deps.ready_to_reschedule import ReadyToRescheduleDep


class BaseReschedulePokeOperator(BaseOperator, SkipMixin, ABC):
```
I wonder how `on_kill` gets invoked if we shut down the task through the Airflow UI while the task is waiting to be rescheduled.
that is a really good point. I'm not sure. I imagine it wouldn't run until the task was rescheduled.
```python
        self.set_external_resource_id(context, None)

    @staticmethod
    def set_external_resource_id(context, value):
```
I would generalize this, and maybe call it `set_state()`. Please add type annotations.
So there were concerns with @Fokko's XCom change re idempotency. I think it makes sense to create a second table, very similar to XCom, but designed specifically to support stateful tasks. The table could perhaps be called TaskState. I previously shared the concern: why create another table that is almost identical to XCom? But the reality is that XCom is problematic for stateful tasks in a number of ways. Obviously there is the clearing / idempotency issue. But additionally, if you use trigger DAG, with XCom your next scheduled run won't get the current state, because it sorts by execution_date. WDYT?
@dstandish Sorry for the late reply. I think having a second table or changing the existing XCom table is just an implementation detail. Having a state table will have a fundamental impact on the idempotency of the execution of the tasks. Why would the manual triggering of a DAG introduce issues? The execution date will be equal to the moment that it was triggered, so I think it should work as well. Since this will introduce such a fundamental change to the way operators were intended (being idempotent), I think it would be great to first start an AIP on the topic, so we can have a clear and structured approach.
It's optional to use such a thing, just like it is with XCom. If you don't use it, nothing changes.
Because execution_date is the run date minus one interval, and
Outcome:
The schedule-interval edge PR would resolve the execution-date ordering problem. But if XCom is cleared at the start of a task, it remains problematic as a mechanism for state persistence.
An AIP sounds appropriate. But I'll just note that I see this more as better support for a common use pattern than a fundamental change of anything. I suspect stateful use of Airflow (including the use of XCom as a state persistence mechanism) is quite common.
This doesn't help this immediate PR... BUT... I just realized that, as a user, you can add a model, e.g. an XCom-like table that is designed how you want it. AND, using the plugins system, you can add an admin view for that model so you can inspect and edit the entries just as you would with XCom. So if you personally need to manage state for tasks and cannot wait until this functionality makes it into Airflow, you can roll your own, and it's not that hard (see the sketch below). If anyone wants sample code I am happy to share; you can ping me on Slack. So stoked to have stumbled upon this.
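A minimal sketch of such a roll-your-own model. Columns and names are illustrative; a real plugin would register this against Airflow's metadata database (reusing Airflow's own declarative `Base`) and add a ModelView for it.

```python
from sqlalchemy import Column, String, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()  # a real plugin would reuse Airflow's own Base


class TaskState(Base):
    """XCom-like table keyed only by (dag_id, task_id), so the latest state
    survives across runs and is unaffected by execution_date ordering."""

    __tablename__ = 'task_state'

    dag_id = Column(String(250), primary_key=True)
    task_id = Column(String(250), primary_key=True)
    state = Column(Text)  # e.g. a JSON blob with the external resource id
```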
Some of this topic seems to be covered in
I think the conversation here has been educational, and thanks all for chiming in. However, there was some discussion that adding support for stateful tasks in Airflow would have broader impact than just this rescheduling case. From my read through all these threads, it seems the key open questions for a rescheduling approach similar to this PR are:
I'm not that familiar with the AIP process, and whether it's something a small group of people from the community get aligned on in a meeting/Slack before filing, or something an individual just proposes.
I think the original AsyncOperator was a better name for this :). And in fact it looks like it is a quite highly requested feature in the Airflow Survey (https://airflow.apache.org/blog/airflow-survey/):
I am also interested in discussing this, so I will create a channel, #sig-async-operators, in Slack and invite you all there.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.