
Add support for remote cluster deployments #42

Merged · 2 commits · Mar 18, 2025

Conversation

thepetk
Contributor

@thepetk thepetk commented Mar 10, 2025

What does this PR do?:

The PR adds support for remote cluster deployments. More specifically:

  • It updates the application-dev.yaml flow, so that in the case of a remote cluster deployment a different destination is used for the application.
  • In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality (a hedged sketch of the split follows this list).
  • The namespace initialization is moved to the Tekton app so we can run all pipelines from the selected namespace. It should be up to the remote cluster admin to prepare the external namespace so that the software templates can be installed correctly.
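To make the app / app-tekton split concrete, here is a minimal sketch of the two Argo CD Applications. This is a hedged illustration only: the names, repo URL, paths, and namespaces are placeholders, not the actual manifests generated by this PR.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                 # workload resources, deployed to the remote cluster
  namespace: ai-rhdh
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>.git
    targetRevision: main
    path: components/app
  destination:
    server: https://api.remote-cluster.example.com:6443   # remote cluster API server (placeholder)
    namespace: my-app-ns
  syncPolicy:
    automated: {}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-tekton          # Tekton resources, kept on the RHDH (host) cluster
  namespace: ai-rhdh
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>.git
    targetRevision: main
    path: components/app-tekton
  destination:
    server: https://kubernetes.default.svc                 # in-cluster (host) destination
    namespace: my-app-ns
  syncPolicy:
    automated: {}

Keeping the Tekton Application pinned to the in-cluster destination is what preserves the webhook-driven PipelineRuns on the RHDH cluster while the workload itself lands on the remote cluster.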

Which issue(s) this PR fixes:

Fixes RHDHPAI-580

PR acceptance criteria:

Testing and documentation do not need to be complete in order for this PR to be approved. We just need to ensure tracking issues are opened and linked to this PR, if they are not in the PR scope due to various constraints.

  • Tested and Verified

  • Documentation (READMEs, Product Docs, Blogs, Education Modules, etc.)

How to test changes / Special notes to the reviewer:

Set up the two clusters

  • Create two clusters.
  • Before you install RHDH on one cluster, you have to add the second cluster to the tekton-plugins configuration of the ai-rhdh-installer, so that the topology and Kubernetes plugins show resources for the other cluster.
  • Note: you have to create a cluster-reader role for the SA token required in tekton-plugins.yaml.
  • After completing the installation with the installer, add a remote-cluster secret in the RHDH namespace. It should look like:
...
metadata:
  labels:
    argocd.argoproj.io/secret-type: cluster
...
stringData:
  config: |
    {"bearerToken": "your-token", "tlsClientConfig": {"insecure": true}}
  name: name
  server: api-server
  • NOTE: You also need a way to set up everything on the second cluster (to initialize the namespace of the second cluster so we can deploy templates there). An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand. This way, once the new image of the application is created, Argo CD is able to pull the new image. Otherwise, you'll have to initialize the namespace manually.

Deploy templates

Examples

  1. RHOAI Deployment (screenshot)

  2. The deployed template overview (screenshot)

  3. The two different clusters in topology (screenshot)

  4. PipelineRuns (in host cluster) (screenshot)

  5. Deployments (screenshot)

Important Notes

One important note I'd like everyone to keep in mind during review concerns the namespace in which we run the PipelineRuns on the host cluster. The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

thepetk added 2 commits on March 10, 2025 12:59, both signed off by thepetk <thepetk@gmail.com>.
@thepetk changed the title from "WIP: Add support for remote cluster deployments" to "Add support for remote cluster deployments" on Mar 10, 2025
@maysunfaisal
Contributor

An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand.

@thepetk what if someone wants to deploy to multiple namespaces remotely? Do we need to ensure all of these namespaces are configured beforehand?

@maysunfaisal
Contributor

The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

For now, as a first iteration of this new feature, let's keep it simple and keep the same name. I am thinking of edge cases where someone can specify different namespaces on different clusters but there might be duplicate resources already present and it may result in failures; I think we should probably think a bit more. Any other concerns?

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

The current PR uses the same namespace name as the remote cluster one. My question is whether we would like to give the user an option to specify this too.

For now, as a first iteration of this new feature, let's keep it simple and keep the same name. I am thinking of edge cases where someone can specify different namespaces on different clusters but there might be duplicate resources already present and it may result in failures; I think we should probably think a bit more. Any other concerns?

Totally agree; pretty much this is the reason for using the same namespace. I also feel that many things might change, even once we finalize the updates on the software template side.

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.

The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

@thepetk
Contributor Author

thepetk commented Mar 12, 2025

An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand.

@thepetk what if someone wants to deploy to multiple namespaces remotely? Do we need to ensure all of these namespaces are configured beforehand?

Yes, I think so. At least for the current PR I feel it is out of scope. We could potentially provide a way through an installer script to ease this process (e.g. automatically prepare the namespace), but at the Software Template level I feel it should be a requirement on the remote side to have everything ready so it can accept deployments of the templates.

Contributor

@Jdubrick Jdubrick left a comment

Code generally looks good to me, just need to take some time and try it out as well :)

@maysunfaisal
Contributor

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.

The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

Interesting thought. I think we can open an issue to track the idea. My only concern would be protecting the pipeline resource names from clashing with so many other pipelines in the same namespace. I think we can raise it as a parking-lot topic.

@thepetk
Contributor Author

thepetk commented Mar 13, 2025

@maysunfaisal this PR also got me thinking: how do we feel about a separate Tekton Argo CD application for all cases? I'm thinking out loud here: we could set up a specific namespace on the RHDH cluster where all PipelineRuns run. This could potentially reduce the number of secrets we spread across different namespaces (with the initialize-namespace job), right? I was curious what you think.
The main reason I'm thinking this is that right now we initialize the namespace (if I understand correctly), and a user who only has access to the software template namespace gets access to organization-wide secrets (e.g. the Quay token). So using a namespace with restricted access to run the PipelineRuns might reduce this spread.

Interesting thought. I think we can open an issue to track the idea. My only concern would be protecting the pipeline resource names from clashing with so many other pipelines in the same namespace. I think we can raise it as a parking-lot topic.

Yeah, +1. I just wanted to first check whether anything was missing from my train of thought that would make it a big no-go. I'll raise it as a parking-lot item.

Contributor

@Jdubrick Jdubrick left a comment

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

@thepetk
Contributor Author

thepetk commented Mar 13, 2025

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@Jdubrick
Contributor

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@thepetk I agree with it being complicated, I think having an even more detailed version with maybe some examples or templated steps would help those reviewing + can be used for actual setup for end users

@maysunfaisal
Contributor

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

@thepetk
Contributor Author

thepetk commented Mar 14, 2025

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

Are we able to get a detailed instruction set for how to set everything up to utilize the remote cluster? I believe it would also make it easier for others looking to give it a try for review?

The truth is the setup here is quite complicated. I have tried to capture everything in the PR description. But ofc I can have a more detailed version tomorrow if that helps! Let me know!

@thepetk I agree with it being complicated, I think having an even more detailed version with maybe some examples or templated steps would help those reviewing + can be used for actual setup for end users

@Jdubrick I tried my best <3. I went through the previous steps and added more details to each one of them. Regarding the actual setup for end users, I partially agree, as this flow is a good start for documentation, but in general it should be covered as part of RHDHPAI-622.

Detailed instructions for reviewers

Cluster Setup

Create two clusters.

  1. You could provision the first cluster (the host) via clusterbot. The OCP version of the cluster depends on the RHDH version you'll be installing: for RHDH 1.3 - 1.4 use OCP 4.16, for 1.2 use OCP 4.15. You don't need to install RHDH now; it will be part of the next step.

  2. You could provision the second cluster (the remote) via openshift-installer. In order to test the RHOAI cases, it is better to use a slightly bigger machine type.
    a. After downloading the installer, run openshift-install create install-config in the installer's root dir and update the compute section of install-config.yaml:

    ...
    compute:
    - name: worker
      platform:
        aws:
          type: g5.4xlarge

    b. Run the installer.
    c. Once the cluster has been provisioned you have to install RHOAI. You'll need to add the following operators (skeleton CRs are sketched after this list):
    d. Node Feature Discovery, with a NodeFeatureDiscovery CR that you'll have to wait for to become available.
    e. NVIDIA GPU Operator, with a ClusterPolicy CR that you'll have to wait for to become ready (~5 min).
    f. Finally, the OpenShift AI operator with a DataScienceCluster CR.
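For orientation only, skeletons of the three CRs might look like the following. These are hedged sketches: the names are typical defaults and the specs are left mostly empty, since the OperatorHub console pre-fills the full spec when you create them there; treat them as orientation rather than copy-paste manifests.

# Hedged skeletons - the console-generated CRs carry a much fuller spec.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}              # accept the operator defaults suggested by the console
---
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec: {}              # accept the GPU operator defaults suggested by the console
---
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components: {}      # enable the components you need (dashboard, workbenches, ...)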

Before you install RHDH on one cluster, you have to add the second cluster to the tekton-plugins configuration of the ai-rhdh-installer, so that the topology and Kubernetes plugins show resources for the other cluster.

  1. To install RHDH you'll have to use the ai-rhdh-installer.
  2. After installing and before configuring, you'll need to add a second entry to the pluginConfig.kubernetes.clusterLocatorMethods.clusters here. For more information about cluster configuration see here.
  3. Regarding the fields you need to fill in:
    a. Set the oc context to point to the remote cluster: oc config use-context <remote-cluster-context-name>
    b. Create a service account on the remote cluster. You could use kubectl create serviceaccount backstage-sa -n default
    c. Create a ClusterRoleBinding. You could use kubectl create clusterrolebinding backstage-view-binding --clusterrole=view --serviceaccount=default:backstage-sa
    d. Get the SA token. You could create a token with kubectl create token backstage-sa -n default.
    e. An example of the tekton-plugins cluster configuration entry is shown below (a fuller, hedged sketch follows this list):
 - authProvider: serviceAccount
   name: <sa name>
   serviceAccountToken: <sa token>
   skipTLSVerify: true
   url: https://api.mycluster.p3.openshiftapps.com/
  4. Run bash configure.sh to configure your RHDH installation on the host cluster.
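For orientation, a hedged sketch of how the clusters list could look in the plugin configuration. The field names follow the Backstage Kubernetes plugin; the exact nesting inside tekton-plugins.yaml (e.g. under pluginConfig) may differ, so verify against the installer's file.

# Hedged sketch - verify the exact nesting against tekton-plugins.yaml in ai-rhdh-installer.
kubernetes:
  serviceLocatorMethod:
    type: multiTenant
  clusterLocatorMethods:
    - type: config
      clusters:
        - name: host-cluster                   # the entry the installer already creates for the host
          authProvider: serviceAccount
          serviceAccountToken: <host sa token>
          skipTLSVerify: true
          url: https://api.host-cluster.example.com:6443
        - name: <sa name>                      # the new remote cluster entry from step 3e
          authProvider: serviceAccount
          serviceAccountToken: <sa token>
          skipTLSVerify: true
          url: https://api.mycluster.p3.openshiftapps.com/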

After completing the installation with the installer, add a remote-cluster secret on the RHDH namespace.

Now that your RHDH instance is up and running, monitoring the resources of the remote cluster too, you have to register the remote cluster in the RHDH Argo CD instance. For more information about this step see here.

More specifically, you'll need to add a new secret to the RHDH namespace (e.g. ai-rhdh). This could be an example:

kind: Secret
apiVersion: v1
metadata:
  name: remote-cluster
  namespace: ai-rhdh
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  config: |
    {"bearerToken": "your-token-see-below-how-you-obtain-it", "tlsClientConfig":{"insecure": true}}
  name: api-mycluster-p3-openshiftapps-com:443
  server: https://api.mycluster.p3.openshiftapps.com:443
type: Opaque

Note: to get the token for the bearerToken value, switch to the context of the remote cluster (oc config use-context <remote-cluster-context>) and run kubectl config view --minify --raw -o yaml. A short sketch of this step follows.
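A small sketch of that token-retrieval step. The oc whoami -t line is an extra suggestion from my side and only applies if your current kubeconfig credentials are token-based.

# Switch to the remote cluster context
oc config use-context <remote-cluster-context>

# Print the raw kubeconfig for the current context; with token-based credentials
# the bearer token shows up under users[].user.token in the output
kubectl config view --minify --raw -o yaml

# Alternative (only if you are logged in with a token): print it directly
oc whoami -t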

NOTE: You also need a way to set up everything on the second cluster (to initialize the namespace of the second cluster so we can deploy templates there). An easy way to do it is to install RHDH on the second cluster and initialize the namespace there beforehand. This way, once the new image of the application is created, Argo CD is able to pull the new image. Otherwise, you'll have to initialize the namespace manually.

This is just a suggestion from my side. When you try to deploy the gitops resources of the software template on the remote cluster, the selected namespace won't be initialized, which means you won't be able to pull the new image once the gitops repo is updated. A very quick workaround (not optimal, but it works) is to install RHDH on the remote cluster too and initialize the namespace you are going to use there. For example, before you deploy remotely you can simply install a chatbot software template in the same namespace from the RHDH instance of the remote cluster.

Note: again, this is a workaround to tackle the remote namespace initialization. A permanent solution is out of the scope of this PR and will be addressed as part of RHDHPAI-622 & RHDHPAI-581.

Software Template Setup

  1. On your ai-lab-template fork, update the import-gitops-template script to point to this branch.
  2. Temporarily generate the updated templates with bash generate.sh.
  3. You could update one of your templates with my test here: https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml. This way you'll be able to use the extra fields. Note that the template updates are scoped as part of RHDHPAI-581. (A hedged sketch of these steps follows this list.)
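A hedged sketch of these steps as shell commands. The edit to import-gitops-template is left manual on purpose, since the exact repo/revision reference inside that script isn't spelled out here; check the script in your fork before changing it.

# Clone your fork of ai-lab-template
git clone https://github.com/<your-user>/ai-lab-template.git
cd ai-lab-template

# Manually edit the import-gitops-template script so the gitops content is
# pulled from this PR's branch instead of the default one (the exact variable
# or reference to change depends on the script's current contents)
$EDITOR import-gitops-template

# Temporarily regenerate the templates with the updated gitops reference
bash generate.sh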

@thepetk
Contributor Author

thepetk commented Mar 14, 2025

It probably makes sense to write up documentation because we can point whoever is eventually going to use it to that documentation. It should perhaps be part of the Acceptance Criteria.

I agree from the documentation perspective. As also mentioned in the instructions I just wrote, it should be covered as part of RHDHPAI-622 & RHDHPAI-581. I feel this should be OK, as the ai-lab-template & ai-rhdh-installer repos are IMO a better fit to host this documentation than ai-lab-app, whose content is consumed by the templates.

@maysunfaisal
Contributor

maysunfaisal commented Mar 14, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔

I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

@thepetk
Contributor Author

thepetk commented Mar 16, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔

I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.

With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

@maysunfaisal
Contributor

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔
I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.

With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

Okay, that makes sense. However, I did not see a deployment on the remote cluster, and I did get a bunch of Argo CD errors 🤔

@thepetk
Contributor Author

thepetk commented Mar 17, 2025

@thepetk I followed the instructions and it seems like the PLRs are running on the host cluster 🤔
I am using your template https://github.com/thepetk/ai-lab-template/blob/temp_changes/templates/codegen2/template.yaml

Yeah, this is the reason I went with two separate application components (app & app-tekton). The OpenShift Pipelines operator is installed in the host cluster, so the GitHub app is set to communicate with this cluster's (host's) webhook URL. This way, whenever a change happens on the GitHub repo, the PLRs are triggered and the new image is built.
With this setup we could have multiple remote clusters involved in the process and, of course, no secret related to the PLR functionality is shared across the clusters. I tried to capture it in the PR description too:

In order to keep the Tekton resources in place (so that all pipelines run from the RHDH cluster), we split application-dev into two parts (app and app-tekton). As a result, when a remote deployment is selected, the Tekton resources are monitored by a separate Argo CD app; this way we are able to maintain the PoC webhook functionality.

Okay, that makes sense. However, I did not see a deployment on the remote cluster, and I did get a bunch of Argo CD errors 🤔

Could you share the error trace?

Contributor

@Jdubrick Jdubrick left a comment

Is there a piece I am missing in the setup? I followed the instructions and the template prompts me for the remote URL, but after entering it the application is still deployed on the host and my remote SA doesn't have permissions when I select it in the 'Topology' tab.

@thepetk
Contributor Author

thepetk commented Mar 17, 2025

Is there a piece I am missing in the setup? I followed the instructions and the template prompts me for the remote URL, but after entering it the application is still deployed on the host and my remote SA doesn't have permissions when I select it in the 'Topology' tab.

Let's check it together; I think that might be better!

Contributor

@Jdubrick Jdubrick left a comment

lgtm after our meeting to go through the changes and the setup for testing. We should make sure to update the actual template UI so that the deployment namespace is clear.

Contributor

@maysunfaisal maysunfaisal left a comment

lgtm
good work

screenshots from the host cluster RHDH and ArgoCD of a remote deployment (two screenshots attached)

@thepetk
Contributor Author

thepetk commented Mar 18, 2025

@maysunfaisal @Jdubrick huge thanks for the review, especially with this complicated setup involved!

@thepetk thepetk merged commit 8d610dd into redhat-ai-dev:main Mar 18, 2025
1 check passed