Created Hotswap Best Practice file. #1041

Merged · 31 commits · Mar 27, 2025
4 changes: 4 additions & 0 deletions best-practices/README.md
@@ -11,3 +11,7 @@ This reference architecture is designed to assist platform administrators, cloud
## [Best Practices for Faster Workload Cold Start](/best-practices/startup-latency.md)

To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency.

## [Enabling Hotswap to Reduce Workload Rescheduling Time](/best-practices/hotswap.md)

To reduce workload rescheduling time during interruptions, we strongly recommend modifying your workloads so they interact smoothly with Hotswap.
157 changes: 157 additions & 0 deletions best-practices/hotswap.md
@@ -0,0 +1,157 @@
# Modifying Workload Deployments to Utilize Hotswap
This doc describes how to modify your workloads to reduce rescheduling time by utilizing Hotswap on Google Kubernetes Engine (GKE).

## Introduction

Hotswap is intended to reduce MTTR (Mean Time To Recovery) by reacting to node failures and interruptions and swapping the affected hardware out for healthy, active hardware. It works the same way for both GPUs and TPUs, treating accelerator capacity as fungible, and targets workload rescheduling time, which is often the bottleneck when dealing with interruptions. Traditionally, without Hotswap, customers have to wait until the unhealthy nodes hosting the workloads recover, which can take more than 5 minutes. With Hotswap, recovery comes down to the order of seconds.

## How Hotswap Takes Effect

Hotswap takes effect in two main ways:
1) When a node hosting workloads becomes unhealthy and spare, eligible accelerator hardware is available, Hotswap simply swaps the hardware hosting the workload with the spare.
2) When a node hosting workloads becomes unhealthy and there are no spares, Hotswap evicts a *lower priority* workload from an eligible slice and transfers that accelerator hardware to the *higher priority* job. Priority in this case is expressed by a PriorityClass, making this a more nuanced scenario that requires a little setup.

**Note:** Scenario 2 takes effect when multiple workloads are running on the **same cluster** and share the same set of accelerator node pools.
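
To confirm that your workloads actually share the same accelerator node pools, you can list the cluster's nodes along with the node pool and TPU labels that the node selectors later in this doc rely on (a quick sanity check, not a required step):

```
kubectl get nodes -L cloud.google.com/gke-nodepool,cloud.google.com/gke-tpu-accelerator,cloud.google.com/gke-tpu-topology
```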

### Priority Classes
For Hotswap to work, each workload needs a PriorityClass attached. PriorityClasses determine which workload gets preempted and which one keeps (or takes over) the hardware. **This is different from job-level priority.** Thankfully, Kubernetes makes it easy to incorporate these classes into your workloads.

### Example
To begin, let's set up two PriorityClasses to express our levels of priority. The first class gets a lower value, 1000000, and the higher priority class gets a value of 2000000, giving a clear differentiation between the two.

```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-prior-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
```
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-prior-job
value: 2000000
globalDefault: false
description: "This priority class should be used for hero pods only."
```
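
Both classes can be applied before any workload exists. Assuming you save the two manifests above into a single file, here named priority-classes.yaml for illustration, you can create and verify them with:

```
kubectl apply -f priority-classes.yaml
kubectl get priorityclass low-prior-job high-prior-job
```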

Now we can create a high priority JobSet workload, making sure to add the priority label to the pod template as well as the priorityClassName, to clearly mark the workload's priority. This workload uses v6e TPUs with a 4x4 topology to run a training job on Llama2-7B. **This is an example workload, so you would personalize the hardware to fit your needs.** The manifest below is saved as high_prio_job.yaml.
```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-v6e
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 1  # assumed value; set to the number of slices you need
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              priority: high
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            # restartPolicy: Never
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-prior-job
            containers:
            - name: jax-program
              image: gcr.io/tpu-prod-env-one-vm/rishi_v6e_test
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=rishibathinav6e
              - steps=300
              - base_output_directory=gs://tpu-vm-v6e-bucket
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
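
Once this JobSet has been applied (see the deployment step below), a quick way to confirm that its pods actually carry the intended PriorityClass is to print it alongside the node each pod landed on:

```
kubectl get pods -l priority=high \
  -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,NODE:.spec.nodeName
```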
Then we can create a low priority JobSet workload, again making sure to add the priority label and the priorityClassName. It runs the same training job as the high priority workload, just with the lower PriorityClass. The manifest below is saved as low_prio_job.yaml.

```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-v6e
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 1  # assumed value; set to the number of slices you need
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              priority: low
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            # restartPolicy: Never
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-prior-job
            containers:
            - name: jax-program
              image: gcr.io/tpu-prod-env-one-vm/rishi_v6e_test
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=rishibathinav6e
              - steps=300
              - base_output_directory=gs://tpu-vm-v6e-bucket
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```

Now that we have clearly differentiated priorities for the two JobSet specifications, we can deploy them:

```
kubectl apply -f low_prio_job.yaml
kubectl apply -f high_prio_job.yaml
```
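
You can then watch both JobSets come up and see which node pool each pod lands on; the -L priority flag surfaces the priority label set in the pod templates above:

```
kubectl get jobsets
kubectl get pods -L priority -o wide
```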

Now, when an infrastructure interruption hits your high priority job, Hotswap evicts the low priority job's pods from their nodes and gives those nodes to the high priority job to schedule on. This happens on the order of seconds, drastically reducing workload idle time. If you want to test that your workload setup works, you can simulate a disruption by cordoning the node pool that one of your high priority jobs is running on:
```
kubectl cordon -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}
```
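
While the node pool is cordoned, you can watch the preemption play out; the low priority pods should drop out of Running while the high priority pods are rescheduled:

```
kubectl get pods -L priority -w
```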

You will see the high priority job restart and get scheduled onto a healthy node pool. At the same time, the low priority job goes into failed status and its leader pod stays pending. Then go ahead and uncordon the nodes to simulate recovery of the infrastructure; you will see the low priority job get rescheduled back onto the recovered node pool:

```
kubectl uncordon -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}
```
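
Once the node pool is healthy again, the low priority pods should leave the pending state and land back on the recovered nodes, which you can confirm with:

```
kubectl get pods -l priority=low -o wide
```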