
Add support for remote cluster deployments #42

Merged · 2 commits · Mar 18, 2025

Conversation

thepetk
Contributor

@thepetk thepetk commented Mar 10, 2025

What does this PR do?:

The PR adds support for remote cluster deployments. More specifically:

  • It updates the application-dev.yaml flow, so that in the case of a remote cluster deployment a different destination is used for the application.
  • In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality (a hedged sketch of the split follows this list).
  • The namespace initialization is moved to the Tekton app so we can run all pipelines from the selected namespace. It should be up to the remote cluster admin to prepare the external namespace so that the software templates can be installed correctly.
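To make the app / app-tekton split concrete, here is a minimal sketch of the two Argo CD Applications. This is a hedged illustration only: the names, repo URL, paths, and namespaces are placeholders, not the actual manifests generated by this PR.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                 # workload resources, deployed to the remote cluster
  namespace: ai-rhdh
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>.git
    targetRevision: main
    path: components/app
  destination:
    server: https://api.remote-cluster.example.com:6443   # remote cluster API server (placeholder)
    namespace: my-app-ns
  syncPolicy:
    automated: {}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-tekton          # Tekton resources, kept on the RHDH (host) cluster
  namespace: ai-rhdh
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>.git
    targetRevision: main
    path: components/app-tekton
  destination:
    server: https://kubernetes.default.svc                 # in-cluster (host) destination
    namespace: my-app-ns
  syncPolicy:
    automated: {}

Keeping the Tekton Application pinned to the in-cluster destination is what preserves the webhook-driven PipelineRuns on the RHDH cluster while the workload itself lands on the remote cluster.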

Which issue(s) this PR fixes:

Fixes RHDHPAI-580

PR acceptance criteria:

Testing and documentation do not need to be complete in order for this PR to be approved. We just need to ensure tracking issues are opened and linked to this PR, if they are not in the PR scope due to various constraints.

  • Tested and Verified

  • Documentation (READMEs, Product Docs, Blogs, Education Modules, etc.)

How to test changes / Special notes to the reviewer:

Set up the two clusters

  • Create two clusters.
  • Before you install RHDH on one cluster, you have to add the second cluster to the tekton-plugins configuration of the ai-rhdh-installer, so that the topology and Kubernetes plugins show resources for the other cluster.
  • Note: you have to create a cluster-reader role for the SA token required in tekton-plugins.yaml.
  • After completing the installation with the installer, add a remote-cluster secret in the RHDH namespace. It should look like:
...
metadata:
  labels:
    argocd.argoproj.io/secret-type: cluster
...
stringData:
  config: |
    {"bearerToken": "your-token", "tlsClientConfig": {"insecure": true}}
  name: name
  server: api-server
  • NOTE: You also need a way to set up everything on the second cluster (to initialize the namespace of the second cluster so we can deploy templates there). An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand. This way, once the new image of the application is created, Argo CD is able to pull the new image. Otherwise, you'll have to initialize the namespace manually.

Deploy templates

Examples

  1. RHOAI Deployment (screenshot)

  2. The deployed template overview (screenshot)

  3. The two different clusters in topology (screenshot)

  4. PipelineRuns (in host cluster) (screenshot)

  5. Deployments (screenshot)

Important Notes

One important note I'd like everyone to keep in mind during review concerns the namespace in which we run the PipelineRuns on the host cluster. The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

thepetk added 2 commits on March 10, 2025 12:59, both signed off by thepetk <thepetk@gmail.com>.
@thepetk changed the title from "WIP: Add support for remote cluster deployments" to "Add support for remote cluster deployments" on Mar 10, 2025
@maysunfaisal
Contributor

An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand.

@thepetk what if someone wants to deploy to multiple namespaces remotely? Do we need to ensure all of these namespaces are configured beforehand?

@maysunfaisal
Contributor

The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

For now, as a first iteration of this new feature, let's keep it simple and keep the same name. I am thinking of edge cases where someone can specify different namespaces on different clusters but there might be duplicate resources already present and it may result in failures; I think we should probably think a bit more. Any other concerns?

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

For now, as a first iteration of this new feature, let's keep it simple and keep the same name. I am thinking of edge cases where someone can specify different namespaces on different clusters but there might be duplicate resources already present and it may result in failures; I think we should probably think a bit more. Any other concerns?

Totally agree; pretty much this is the reason for using the same namespace. I also feel that many things might change, even once we finalize the updates on the software template side.

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.

The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand.

@thepetk what if someone wants to deploy to multiple namespaces remotely? Do we need to ensure all of these namespaces are configured beforehand?

Yes, I think so. At least for the current PR I feel it is out of scope. We could potentially provide a way through an installer script to ease this process (e.g. automatically prepare the namespace), but at the Software Template level I feel it should be a requirement on the remote side to have everything ready so it can accept deployments of the templates.

Contributor

@Jdubrick Jdubrick left a comment

Code generally looks good to me, just need to take some time and try it out as well :)

@maysunfaisal
Contributor

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.

The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

Interesting thought. I think we can open an issue to track the idea. My only concern would be protecting the pipeline resource names from clashing with so many other pipelines in the same namespace. I think we can raise it as a parking-lot topic.

@thepetk
Contributor Author

thepetk commented Mar 13, 2025

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.
The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

Interesting thought. I think we can open an issue to track the idea. My only concern would be protecting the pipeline resource names from clashing with so many other pipelines in the same namespace. I think we can raise it as a parking-lot topic.

Yeah, +1. I just wanted to first check whether anything was missing from my train of thought that would make it a big no-go. I'll raise it as a parking-lot item.

Contributor

@Jdubrick Jdubrick left a comment

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

@thepetk
Contributor Author

thepetk commented Mar 13, 2025

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@Jdubrick
Contributor

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@thepetk I agree with it being complicated, I think having an even more detailed version with maybe some examples or templated steps would help those reviewing + can be used for actual setup for end users

@maysunfaisal
Contributor

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

@thepetk
Contributor Author

thepetk commented Mar 14, 2025

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@thepetk I agree with it being complicated, I think having an even more detailed version with maybe some examples or templated steps would help those reviewing + can be used for actual setup for end users

@Jdubrick I tried my best <3. I went through the previous steps and added more details to each one of them. Regarding the actual setup for end users, I partially agree, as this flow is a good start for documentation, but in general it should be covered as part of RHDHPAI-622.

Detailed instructions for reviewers

Cluster Setup

Create two clusters.

  1. You could provision the first cluster (the host) via clusterbot. The OCP version of the cluster depends on the RHDH version you'll be installing: for RHDH 1.3 - 1.4 use OCP 4.16, for 1.2 use OCP 4.15. You don't need to install RHDH now; it will be part of the next step.

  2. You could provision the second cluster (the remote) via openshift-installer. In order to test the RHOAI cases, it is better to use a slightly bigger machine type.
    a. After downloading the installer, run openshift-install create install-config in the installer's root dir and update the compute section of install-config.yaml:

    ...
    compute:
    - name: worker
      platform:
        aws:
          type: g5.4xlarge

    b. Run the installer.
    c. Once the cluster has been provisioned you have to install RHOAI. You'll need to add the following operators (skeleton CRs are sketched after this list):
    d. Node Feature Discovery, with a NodeFeatureDiscovery CR that you'll have to wait for to become available.
    e. NVIDIA GPU Operator, with a ClusterPolicy CR that you'll have to wait for to become ready (~5 min).
    f. Finally, the OpenShift AI operator with a DataScienceCluster CR.
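For orientation only, skeletons of the three CRs might look like the following. These are hedged sketches: the names are typical defaults and the specs are left mostly empty, since the OperatorHub console pre-fills the full spec when you create them there; treat them as orientation rather than copy-paste manifests.

# Hedged skeletons - the console-generated CRs carry a much fuller spec.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}              # accept the operator defaults suggested by the console
---
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec: {}              # accept the GPU operator defaults suggested by the console
---
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components: {}      # enable the components you need (dashboard, workbenches, ...)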

Before you install RHDH on one cluster, you have to add the second cluster to the tekton-plugins configuration of the ai-rhdh-installer, so that the topology and Kubernetes plugins show resources for the other cluster.

  1. To install RHDH you'll have to use the ai-rhdh-installer.
  2. After installing and before configuring, you'll need to add a second entry to the pluginConfig.kubernetes.clusterLocatorMethods.clusters here. For more information about cluster configuration see here.
  3. Regarding the fields you need to fill in:
    a. Set the oc context to point to the remote cluster: oc config use-context <remote-cluster-context-name>
    b. Create a service account on the remote cluster. You could use kubectl create serviceaccount backstage-sa -n default
    c. Create a ClusterRoleBinding. You could use kubectl create clusterrolebinding backstage-view-binding --clusterrole=view --serviceaccount=default:backstage-sa
    d. Get the SA token. You could create a token with kubectl create token backstage-sa -n default.
    e. An example of the tekton-plugins cluster configuration entry is shown below (a fuller, hedged sketch follows this list):
 - authProvider: serviceAccount
   name: <sa name>
   serviceAccountToken: <sa token>
   skipTLSVerify: true
   url: https://api.mycluster.p3.openshiftapps.com/
  4. Run bash configure.sh to configure your RHDH installation on the host cluster.
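For orientation, a hedged sketch of how the clusters list could look in the plugin configuration. The field names follow the Backstage Kubernetes plugin; the exact nesting inside tekton-plugins.yaml (e.g. under pluginConfig) may differ, so verify against the installer's file.

# Hedged sketch - verify the exact nesting against tekton-plugins.yaml in ai-rhdh-installer.
kubernetes:
  serviceLocatorMethod:
    type: multiTenant
  clusterLocatorMethods:
    - type: config
      clusters:
        - name: host-cluster                   # the entry the installer already creates for the host
          authProvider: serviceAccount
          serviceAccountToken: <host sa token>
          skipTLSVerify: true
          url: https://api.host-cluster.example.com:6443
        - name: <sa name>                      # the new remote cluster entry from step 3e
          authProvider: serviceAccount
          serviceAccountToken: <sa token>
          skipTLSVerify: true
          url: https://api.mycluster.p3.openshiftapps.com/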

After completing the installation with the installer, add a remote-cluster secret on the RHDH namespace.

Now that your RHDH instance is up and running, monitoring the resources of the remote cluster too, you have to register the remote cluster in the RHDH Argo CD instance. For more information about this step see here.

More specifically, you'll need to add a new secret to the RHDH namespace (e.g. ai-rhdh). This could be an example:

kind: Secret
apiVersion: v1
metadata:
  name: remote-cluster
  namespace: ai-rhdh
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  config: |
    {"bearerToken": "your-token-see-below-how-you-obtain-it", "tlsClientConfig":{"insecure": true}}
  name: api-mycluster-p3-openshiftapps-com:443
  server: https://api.mycluster.p3.openshiftapps.com:443
type: Opaque

Note: to get the token for the bearerToken value, switch to the context of the remote cluster (oc config use-context <remote-cluster-context>) and run kubectl config view --minify --raw -o yaml. A short sketch of this step follows.
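A small sketch of that token-retrieval step. The oc whoami -t line is an extra suggestion from my side and only applies if your current kubeconfig credentials are token-based.

# Switch to the remote cluster context
oc config use-context <remote-cluster-context>

# Print the raw kubeconfig for the current context; with token-based credentials
# the bearer token shows up under users[].user.token in the output
kubectl config view --minify --raw -o yaml

# Alternative (only if you are logged in with a token): print it directly
oc whoami -t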

NOTE: You also need a way to set up everything on the second cluster (to initialize the namespace of the second cluster so we can deploy templates there). An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand. This way, once the new image of the application is created, Argo CD is able to pull the new image. Otherwise, you'll have to initialize the namespace manually.

This is just a suggestion from my side. When you try to deploy the gitops resources of the software template on the remote cluster, the selected namespace won't be initialized, which means you won't be able to pull the new image once the gitops repo is updated. A very quick workaround (not optimal, but it works) is to install RHDH on the remote cluster too and initialize the namespace you are going to use there. For example, before you deploy remotely you can simply install a chatbot software template in the same namespace from the RHDH instance of the remote cluster.

Note: again, this is a workaround to tackle the remote namespace initialization. A permanent solution is out of the scope of this PR and will be addressed as part of RHDHPAI-622 & RHDHPAI-581.

Software Template Setup

  1. On your ai-lab-template fork, update the import-gitops-template script to point to this branch.
  2. Temporarily generate the updated templates with bash generate.sh.
  3. You could update one of your templates with my test here: https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml. This way you'll be able to use the extra fields. Note that the template updates are scoped as part of RHDHPAI-581. (A hedged sketch of these steps follows this list.)
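A hedged sketch of these steps as shell commands. The edit to import-gitops-template is left manual on purpose, since the exact repo/revision reference inside that script isn't spelled out here; check the script in your fork before changing it.

# Clone your fork of ai-lab-template
git clone https://github.com/<your-user>/ai-lab-template.git
cd ai-lab-template

# Manually edit the import-gitops-template script so the gitops content is
# pulled from this PR's branch instead of the default one (the exact variable
# or reference to change depends on the script's current contents)
$EDITOR import-gitops-template

# Temporarily regenerate the templates with the updated gitops reference
bash generate.sh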

@thepetk
Contributor Author

thepetk commented Mar 14, 2025

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

I agree from the documentation perspective. As also mentioned in the instructions I just wrote, it should be covered as part of RHDHPAI-622 & RHDHPAI-581. I feel this should be OK, as the ai-lab-template & ai-rhdh-installer repos are IMO a better fit to host this documentation than ai-lab-app, whose content is consumed by the templates.

@maysunfaisal
Contributor

maysunfaisal commented Mar 14, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔

I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

@thepetk
Contributor Author

thepetk commented Mar 16, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔

I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.

With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

@maysunfaisal
Contributor

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔
I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.

With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

Okay, that makes sense. However, I did not see a deployment on the remote cluster, and I did get a bunch of Argo CD errors 🤔

@thepetk
Contributor Author

thepetk commented Mar 17, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔
I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.
With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

Okay, that makes sense. However, I did not see a deployment on the remote cluster, and I did get a bunch of Argo CD errors 🤔

Could you share the error trace?

Contributor

@Jdubrick Jdubrick left a comment

Is there a piece I am missing in the setup? I followed the instructions and the template prompts me for the remote URL, but after entering it the application is still deployed on the host and my remote SA doesn't have permissions when I select it in the 'Topology' tab.

@thepetk
Contributor Author

thepetk commented Mar 17, 2025

Is there a piece I am missing in the setup? I followed the instructions and the template prompts me for the remote URL, but after entering it the application is still deployed on the host and my remote SA doesn't have permissions when I select it in the 'Topology' tab.

Let's check it together; I think that might be better!

Contributor

@Jdubrick Jdubrick left a comment

lgtm after our meeting to go through the changes and the setup for testing. We should make sure to update the actual template UI so that the deployment namespace is clear.

Contributor

@maysunfaisal maysunfaisal left a comment

lgtm
good work

screenshots from the host cluster RHDH and ArgoCD of a remote deployment (two screenshots attached)

@thepetk
Contributor Author

thepetk commented Mar 18, 2025

@maysunfaisal @Jdubrick huge thanks for the review, especially with this complicated setup involved!

@thepetk thepetk merged commit 8d610dd into redhat-ai-dev:main Mar 18, 2025
1 check passed