-
Notifications
You must be signed in to change notification settings - Fork 14.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amazon EMR on Amazon EKS #17178
Amazon EMR on Amazon EKS #17178
Conversation
- Add operators for starting job - Add operator for canceling a job - Add operator for checking job state
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
We currently have one operator that allows us to run Spark job on Kubernetes. It works with both EKS and GCP as well as any other Kubernetes platform. - SparkKubernetesOperator. Why would anyone use this operator instead of the generic operator for Kubernetes? |
Hey @mik-laj, I am aware of apache spark and livy operator and also EMR operator. However, EMR on EKS works differently because EMR launches virtual cluster in your EKS. The pods (spark master and executors) launched are ephemeral, only existing when a start job is invoked. For more information, kindly visit https://aws.amazon.com/emr/features/eks/ In addition, SparkKubernetesOperator is only suitable if you have Spark cluster has been setup in Kubernetes. In this case, it is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments/questions.
Realizing there is #16766 already submitted, the big difference between the two seems to be the addition of a couple different operators and the default behavior in #16766 being that the Operator waits for the job to complete.
I haven't done a thorough review of this, but I think (depending on if they're needed) we should try to merge 1 or more of the additional operators into my PR. I'm just not familiar enough with real-world DAGs to know to which degree the additional operators are necessary. If you have any feedback on that, would much appreciate it!
from airflow.providers.amazon.aws.hooks.emr_containers import EmrContainersHook | ||
|
||
|
||
class EmrContainersGetJobStateOperator(BaseOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what the use case is for an operator that just returns the job state?
Apologies if this is a silly question - still getting up to speed on typical Airflow patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you asked me that question, I don't think this has a straight forward use case as an operator.
airflow/providers/amazon/aws/operators/emr_containers_get_job_state.py
Outdated
Show resolved
Hide resolved
configuration_overrides=configuration_overrides, | ||
tags=tags, | ||
name=name, | ||
client_token=client_token, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see you defining client_token
anywhere. Have you tested this in a live environment to validate that it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being used in line 89;
response = emr_containers.start_job(**self.start_job_params)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but it's not initialized anywhere. The default is None
, which means if the client doesn't pass in a client_token
, a new job won't get created after the first one. I ran into a similar issue and had to generate a UUID as the default:
client_token or str(uuid4())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dacort for this.
…state.py Co-authored-by: Damon P. Cortesi <d.lifehacker@gmail.com>
Here's my current thinking on this - I like the addition of the Thoughts? |
I agree @dacort , let's do that. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
@wanderijames #16766 was merged. Would you like to proceed? |
Adding operators for interacting with EMR on EKS as per #17175
closes: #17175
related: #17175