Improve error handling, s3 mounting, distributed tests for axlearn #1332

Open
wants to merge 61 commits into
base: main
Commits
e5ee472
Improve error handling, s3 mounting, distributed tests for
Steboss Mar 7, 2025
9de29cd
test mounted s3 bucket
Steboss Mar 10, 2025
6c6cd34
Fix action
Steboss Mar 10, 2025
aa63ac6
fix the bash shell and remember to mount the volume
Steboss Mar 10, 2025
0ae1b83
start working on the shell of the action
Steboss Mar 10, 2025
8f65cd4
try to fix using posix-sh-compatible
Steboss Mar 10, 2025
7e70be7
test on name of the volume and location
Steboss Mar 10, 2025
2eec743
check tests can run
Steboss Mar 13, 2025
d809b67
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss Mar 13, 2025
27f985d
create script for running 70B model
Steboss Mar 13, 2025
fca227e
test on 5 tests to see if mount works now
Steboss Mar 13, 2025
2f671cc
try to ls what's inside output folder
Steboss Mar 13, 2025
df34f12
try @aybchan pytest dist on axlearn
Steboss Mar 13, 2025
551bf94
test only the tests
Steboss Mar 13, 2025
89944f1
test again
Steboss Mar 13, 2025
3f78165
try with simpel bash to avoid bash conflicts for bad substitution
Steboss Mar 13, 2025
dacafd4
check if we can use the pipefail here
Steboss Mar 13, 2025
d36550c
do not use bash to run the test suite
Steboss Mar 13, 2025
4673919
do not use bash to run the test suite
Steboss Mar 13, 2025
7a0081b
add explicitly log dir
Steboss Mar 13, 2025
bc56ad8
error what
Steboss Mar 13, 2025
705b8f1
test whats wrong
Steboss Mar 13, 2025
144f217
try with shell bash
Steboss Mar 14, 2025
3106e3b
retry
Steboss Mar 14, 2025
5e0ee44
try simple test
Steboss Mar 14, 2025
73eb0bd
try to modify test script
Steboss Mar 14, 2025
ea02323
reset test and maybe the bad subs is in the submission step
Steboss Mar 14, 2025
f1be512
try with another subs
Steboss Mar 14, 2025
6b951e0
try to use a sh-like approach in the k8s action
Steboss Mar 14, 2025
43c9382
change to posix shell type
Steboss Mar 14, 2025
0cb48f4
do we need parallelism
Steboss Mar 14, 2025
73edb7b
try to fix the mps
Steboss Mar 14, 2025
d08a53b
do we really need it
Steboss Mar 14, 2025
c30296a
regardless parallelism test
Steboss Mar 14, 2025
4fd33d3
add echo
Steboss Mar 14, 2025
275ed81
add some logs
Steboss Mar 14, 2025
8e4f17e
try to modify the instructions for polling and set a 2 hours poll
Steboss Mar 17, 2025
a63fd82
check fail
Steboss Mar 17, 2025
2f345c8
try to simplify teh approach
Steboss Mar 17, 2025
d7c55c5
try a new build
Steboss Mar 17, 2025
6d1ae2c
fix step
Steboss Mar 17, 2025
935e72b
start craeting also a precommit file
Steboss Mar 17, 2025
d790590
fix the pre-commit so it avoids running on rosetta
Steboss Mar 17, 2025
e95e090
Fix the workers and gpus needed
Steboss Mar 17, 2025
b4f9fef
Update .github/eks-workflow-files/axlearn/axlearn-job.yml
Steboss Mar 18, 2025
5857445
@olupton comments fix
Steboss Mar 18, 2025
0c9c515
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss Mar 19, 2025
2a774a0
try to extend the timeout for nccl
Steboss Mar 20, 2025
05c77a8
Merge branch 'sbosisio/axlearn_improvements' of github.com:NVIDIA/JAX…
Steboss Mar 20, 2025
4600cbb
@olupton comments fix
Steboss Mar 20, 2025
8764fa5
update README file
Steboss Mar 20, 2025
2521a4f
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss Mar 21, 2025
ba6f277
@olupton comments
Steboss Mar 21, 2025
535a018
fix readme
Steboss Mar 21, 2025
cec2cf0
retrieve metrics direclty in script
Steboss Mar 24, 2025
9b29514
fix to yaml
Steboss Mar 24, 2025
72e05a8
fix script to retrieve output
Steboss Mar 25, 2025
8f40ae5
fix fuji script for metrics retrieval
Steboss Mar 25, 2025
dabc2cb
fix to world_size
Steboss Mar 25, 2025
2cba865
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss Mar 25, 2025
23cc9bc
Merge branch 'main' into sbosisio/axlearn_improvements
Steboss Mar 26, 2025
36 changes: 34 additions & 2 deletions .github/actions/submit-delete-k8s-job/action.yml
@@ -14,9 +14,10 @@ runs:
  steps:
    - name: Submit and Delete Kubernetes job
      uses: ./.github/actions/with-post-step
      shell: bash -eo pipefail
      with:
        main: |
          echo "Submit K8s job"
          echo "Submit K8s job ${{ inputs.job-config-file }}"
          kubectl apply -f "${{ inputs.job-config-file }}"

          # Wait for job to be created
@@ -32,6 +33,37 @@ runs:

          # Stream logs
          kubectl logs --all-containers=true --all-pods=true --follow job/${{ inputs.job-name }}


          # Check whether the job succeeded or failed
          while readarray -d : -t status < <(kubectl get job/${{ inputs.job-name }} -o 'jsonpath={.status.failed}:{.status.succeeded}'); do
            failures="${status[0]:-0}"
            successes="${status[1]:-0}"
            total=$((failures + successes))

            if [[ $total -lt 2 ]]; then
              # neither "failed" nor "succeeded" is 2, so wait
              sleep 1
            elif [[ $total -eq 2 ]]; then
              # we have total=2 => either 2 successes or 2 failures
              # (or 1 failed + 1 succeeded).
              # In any case, the job is done – break.
              break
            else
              # Just in case we get an unexpected number
              exit 255
            fi
          done

          # If job indicates a failure try to print out the info
          if [[ $failures -gt 0 ]]; then
            echo "Job ${{ inputs.job-name }} has $failures failures"
            # this is for batch jobs only
            pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=${{ inputs.job-name }} -o name)
            if [[ -n "${pods}" ]]; then
              kubectl describe ${pods}
            fi
            exit 1
          fi
        post: |
          echo "Deleting K8s job: ${{ inputs.job-name }}"
          kubectl delete -f "${{ inputs.job-config-file }}"
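
Note on the completion check in the hunk above: the loop reads a `failed:succeeded` pair from the job status, treats an empty field as 0, and considers the job finished once the two counts sum to the expected number of pods (2 in this PR). A minimal sketch of that decision, isolated from `kubectl` so it can be exercised locally, might look like the following. The function name `classify_job_status` and the `expected` parameter are hypothetical and not part of the PR; the PR hard-codes the value 2.

```shell
#!/usr/bin/env bash
# classify_job_status is a hypothetical helper, not part of the PR.
# $1: the "failed:succeeded" string, e.g. the output of the jsonpath query
#     '{.status.failed}:{.status.succeeded}' (either field may be empty)
# $2: the number of pods expected to finish (assumed to be 2 in the PR)
# Prints one of: pending, succeeded, failed, unexpected.
classify_job_status() {
  local raw="$1" expected="$2"
  local failures successes total

  # Split on ':'; an unset status field arrives as an empty string.
  IFS=: read -r failures successes <<< "${raw}"
  failures="${failures:-0}"
  successes="${successes:-0}"
  total=$((failures + successes))

  if (( total < expected )); then
    echo "pending"      # not all pods have reported yet, keep polling
  elif (( total > expected )); then
    echo "unexpected"   # more completions than pods, treat as an error
  elif (( failures > 0 )); then
    echo "failed"       # all pods are done and at least one failed
  else
    echo "succeeded"    # all pods are done and none failed
  fi
}
```

In a polling loop, the first argument would come from the same `kubectl get job/... -o 'jsonpath=...'` call used in the action, with a `sleep` between calls while the result is `pending`.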
1 change: 0 additions & 1 deletion .github/workflows/_ci.yaml
@@ -769,4 +769,3 @@ jobs:
with:
job-config-file: ".github/eks-workflow-files/axlearn/axlearn-fuji-model.yml"
job-name: ${{ env.JOB_NAME }}

8 changes: 4 additions & 4 deletions README.md
@@ -10,12 +10,12 @@ We support and test the following JAX frameworks and model architectures. More d

| Framework | Models | Use cases | Container |
| :--- | :---: | :---: | :---: |
| [maxtext](./rosetta/rosetta/projects/maxtext)| GPT, LLaMA, Gemma, Mistral, Mixtral | pretraining | `ghcr.io/nvidia/jax:maxtext` |
| [maxtext](./rosetta/rosetta/projects/maxtext)| GPT, LLaMA, Gemma, Mistral, Mixtral | pre-training | `ghcr.io/nvidia/jax:maxtext` |
| [t5x](./rosetta/rosetta/projects/t5x) | T5, ViT | pre-training, fine-tuning | `ghcr.io/nvidia/jax:t5x` |
| [t5x](./rosetta/rosetta/projects/imagen) | Imagen | pre-training | `ghcr.io/nvidia/t5x:imagen-2023-10-02.v3` |
| [big vision](./rosetta/rosetta/projects/paligemma) | PaliGemma | fine-tuning, evaluation | `ghcr.io/nvidia/jax:gemma` |
| levanter | GPT, LLaMA, MPT, Backpacks | pretraining, fine-tuning | `ghcr.io/nvidia/jax:levanter` |
| axlearn | Fuji | pretraining | `ghcr.io/nvidia/jax:axlearn` |
| levanter | GPT, LLaMA, MPT, Backpacks | pre-training, fine-tuning | `ghcr.io/nvidia/jax:levanter` |
| axlearn | Fuji | pre-training | `ghcr.io/nvidia/jax:axlearn` |

# Build Pipeline Status
<table>
@@ -269,7 +269,7 @@ We support and test the following JAX frameworks and model architectures. More d
</td>
<td>
<a href="https://gist.github.com/nvjax/913c2af68649fe568e9711c2dabb23ae#file-badge-maxtext-test-json">
<img style="height:1em;" src="https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2Fnvjax%2F913c2af68649fe568e9711c2dabb23ae%2Fraw%2Fbadge-axleran-test.json&logo=nvidia&label=A100%20distributed">
<img style="height:1em;" src="https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2Fnvjax%2F913c2af68649fe568e9711c2dabb23ae%2Fraw%2Fbadge-axlearn-test.json&logo=nvidia&label=A100%20distributed">
</a>
</td>
</tr>