Parameterize ONNX model tests. #65
Conversation
I'm looking for some early feedback, @zjgarvey or others. Early testing shows that running these tests on HIP would have caught some recent regressions. The specific mechanics used for setting flags, choosing which tests to run, and checking or reporting which stages passed/failed can be implemented in multiple ways.
```python
parser.addoption(
    "--test-config-file",
    type=Path,
    default=default_config_file,
    help="Config JSON file used to parameterize test cases",
)
```
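For context, the option registered above would typically be read back in a conftest fixture via `request.config.getoption`. A minimal sketch, with an illustrative fixture name and JSON loading that is not necessarily the PR's actual code:

```python
import json
from pathlib import Path

import pytest


@pytest.fixture(scope="session")
def test_config(request):
    # Hypothetical fixture: load the JSON config that parameterizes the test run.
    config_path: Path = request.config.getoption("--test-config-file")
    with open(config_path, "rt") as f:
        return json.load(f)
```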
- https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/run.py uses `--device=`, `--backend=`, `--target-chip=`, and `--test-filter=` arguments. Arbitrary flags are not supported, and test expectations are also not supported, so there is no way to directly signal if tests are unexpectedly passing or failing.
Here are some ideas, adding complexity in the conftest file but making it more flexible:
- Add more options here to use instead of `--test-config-file`, like `--iree-compile-flags`
- Make every option a flag, then use a flagfile pattern like https://stackoverflow.com/a/27434050 so "config.json" is just a collection of regular flags
  - Not sure how that would work with the lists of tests with expectations... I like having the full list of tests that will run be sorted and not split between groups like in the onnx op tests
- Could load a .py file that has the list of tests and statuses, or even just implements the `pytest_collection_modifyitems` hook (a minimal sketch follows this list)
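For that last idea, here is a rough sketch of what such a `pytest_collection_modifyitems` hook could look like. The `EXPECTED_OUTCOMES` mapping and its single entry are hypothetical placeholders, not names from this PR:

```python
import pytest

# Hypothetical mapping of test node IDs to expected outcomes.
EXPECTED_OUTCOMES = {
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]": "pass",
}


def pytest_collection_modifyitems(config, items):
    for item in items:
        outcome = EXPECTED_OUTCOMES.get(item.nodeid, "skip")
        if outcome == "skip":
            item.add_marker(pytest.mark.skip(reason="Not in the configured test list"))
        elif outcome != "pass":
            item.add_marker(pytest.mark.xfail(reason=f"Expected to {outcome}"))
```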
Using https://github.com/gsnedders/pytest-expect or https://github.com/projectcaluma/pytest-xfaillist is also an option. Both support updating the xfail files via an `--update-xfail` or `--generate-xfaillist` option. Neither supports xfail reasons, but that's not critical.

https://github.com/projectcaluma/pytest-xfaillist seems to only support a hardcoded `xfails.list` file name next to the config root (source here), which wouldn't work with separate lists of xfails depending on the configuration (e.g. backend choice). https://github.com/gsnedders/pytest-expect does allow specifying a file with `--xfail-file`.
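Going by the flags mentioned above, pytest-expect usage would presumably look something like this (the xfail file name is illustrative):

```
# Record the current pass/fail state into an xfail file.
pytest --update-xfail --xfail-file=xfails_gpu_rocm.txt

# Later runs consume that file so known failures report as xfail.
pytest --xfail-file=xfails_gpu_rocm.txt
```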
What I have right now in this "config JSON file" groups these three items:

1. The flags to use when compiling and running, allowing you to choose between CPU or GPU, for example
2. The list of tests to run
3. The list of tests that are expected to fail (and how)

We'll always want a custom implementation for (1). For (2), https://community.lambdatest.com/t/how-to-run-pytest-tests-from-a-list-of-test-paths/31682/2 has some answers but I don't see an existing plugin or standard convention. For (3), we could use one of those projects.
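For concreteness, a config file grouping those three items might look roughly like this. The flag-list keys and flag values are placeholders I'm making up here; only `tests_and_expected_outcomes` and the outcome strings match what's discussed elsewhere in this PR:

```json
{
  "iree_compile_flags": ["--iree-hal-target-backends=rocm"],
  "iree_run_module_flags": ["--device=hip"],
  "tests_and_expected_outcomes": {
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]": "pass",
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v1/model/inception-v1-12.onnx]": "fail-compile"
  }
}
```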
Moving the expected outcomes to a json is a good idea for making it customizable to the different configs for each test.
I think it will probably be helpful to have the ability to specify extra test-specific options for an individual test through the json file. E.g. for large models, it might be useful to pass additional importer and runtime flags for externalizing params. Although I'm not sure if models that large are going to be in the scope of this testing suite right now.
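If per-test options do get added, one possibility (sketched here only; the `expected_outcome`, `extra_compile_flags`, and `extra_run_flags` keys and the flag values are hypothetical, not part of the PR) is letting a test entry be an object instead of a bare outcome string:

```json
{
  "tests_and_expected_outcomes": {
    "tests/model_zoo/validated/some_large_model_test.py::test_models[some-large-model.onnx]": {
      "expected_outcome": "pass",
      "extra_compile_flags": ["--some-externalize-parameters-flag"],
      "extra_run_flags": ["--some-parameters-flag=model_params.bin"]
    }
  }
}
```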
One thing I think will be rather helpful is to have a way to take a list of URLs and automatically generate the test functions. For example:

```python
from typing import List

test_urls = [
    "https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/faster-rcnn/model/FasterRCNN-12.onnx",
    "https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12.onnx",
]


def generate_name(url: str):
    # e.g. ".../faster-rcnn/model/FasterRCNN-12.onnx" -> "test_faster-rcnn_FasterRCNN-12"
    split = url.split("/")
    return f'test_{split[-3]}_{split[-1].removesuffix(".onnx")}'


def make_function(url: str):
    def func(compare_between_iree_and_onnxruntime):
        compare_between_iree_and_onnxruntime(
            model_url=url,
            artifacts_subdir=artifacts_subdir,
        )

    return func


def define_functions(url_list: List[str]):
    for url in url_list:
        globals()[generate_name(url)] = make_function(url)


define_functions(test_urls)
```
Yes! However, I'm on the fence about specifically where to support extra options. We can add options in all these places:
I don't want to confuse developers with too many choices, but flexibility can be helpful for a variety of situations.
For example here, that sounds like those options should apply regardless of the backend configuration, so the flags could go in the test cases.
I think they should be, as we can choose which tests to run on what schedule. How I have the PR right now allows for the list of tests to be fully opt-in, using the default "skip" behavior. You can also filter with pytest:

```
# run only tests matching a string
-k resnet

# skip all large tests
-m "not size_large"
```

We can also use something like https://pypi.org/project/pytest-shard/ to run across multiple machines.
Yeah! I like how easy it looks to add test cases when they are in files like https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/onnx_tests/models/external_lists/onnx_model_zoo_computer_vision_1.txt. One of the tradeoffs I'm considering here is that it isn't obvious at all from a single test case name what the test is actually doing. I'd like for test suites to be forkable into user code, not be their own world of metaprogramming. The new tests that Rob added in https://github.com/iree-org/iree-test-suites/blob/main/sharktank_models/llama3.1/test_llama.py are on the other side of that spectrum:

What's in this test suite right now is closer to alt_e2eshark in that there is code for each test case... but that code is nearly all boilerplate:

```python
def test_age_gender_gender_googlenet(compare_between_iree_and_onnxruntime):
    compare_between_iree_and_onnxruntime(
        model_url="https://github.com/onnx/models/raw/main/validated/vision/body_analysis/age_gender/models/gender_googlenet.onnx",
        artifacts_subdir=artifacts_subdir,
    )
```

I can iterate on further parameterization as you suggest here... maybe using ... The ergonomics question is partially solved by ...
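One possible middle ground, sketched here rather than something this PR does: keep the model list in a plain text file like the alt_e2eshark external lists, but expand it into a single parametrized test with readable IDs so each case still shows up individually in reports. The file name and the `artifacts_subdir` value below are illustrative:

```python
from pathlib import Path

import pytest

# Hypothetical text file with one model URL (or relative model path) per line.
MODEL_LIST = Path(__file__).parent / "model_list.txt"
MODELS = [line.strip() for line in MODEL_LIST.read_text().splitlines() if line.strip()]


@pytest.mark.parametrize("model", MODELS, ids=lambda m: Path(m).stem)
def test_models(model, compare_between_iree_and_onnxruntime):
    compare_between_iree_and_onnxruntime(model_url=model, artifacts_subdir="model_zoo")
```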
Thanks again for the review comments @zjgarvey. I'm planning to pick this back up soon.
```python
@pytest.mark.parametrize(
    "model",
    [
        # fmt: off
        pytest.param("duc/model/ResNet101-DUC-12.onnx", marks=pytest.mark.size_large),
        pytest.param("faster-rcnn/model/FasterRCNN-12.onnx"),
        pytest.param("fcn/model/fcn-resnet50-12.onnx"),
        pytest.param("mask-rcnn/model/MaskRCNN-12.onnx"),
        pytest.param("retinanet/model/retinanet-9.onnx"),
        pytest.param("ssd/model/ssd-12.onnx"),
        pytest.param("ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx", marks=pytest.mark.xfail(raises=NotImplementedError)),
```
> One thing I think will be rather helpful is to have a way to take a list of urls and automatically generate the test functions.

How's this @zjgarvey?
Some details:
- If no marks are needed (all models in the list are supported by the test suite, no "large" or other special model tags needed), then the `pytest.param()` wrappers could be removed, for just:

  ```python
  @pytest.mark.parametrize(
      "model",
      [
          "duc/model/ResNet101-DUC-12.onnx",
          "faster-rcnn/model/FasterRCNN-12.onnx",
          "fcn/model/fcn-resnet50-12.onnx",
          "mask-rcnn/model/MaskRCNN-12.onnx",
  ```

- Using `fmt: off` to prevent the formatter from wrapping lines, so each test case gets its own line, even if it becomes very long
- Test function names could be customized here: https://stackoverflow.com/questions/37575690/override-a-pytest-parameterized-functions-name (see the sketch after this list). Along with that, I could change how the setup code in conftest.py decides which test cases to modify, using regex match or some shorthand, instead of the explicit

  ```json
  "tests_and_expected_outcomes": {
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v1/model/inception-v1-12.onnx]": "fail-compile",
  ```

  The explicit format is copy-paste friendly but not typing friendly :P
- I figure for any test cases that need extra arguments, they could be their own groups that call `compare_between_iree_and_onnxruntime` or another test helper function, instead of packing more options into this `@pytest.mark.parametrize`
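For reference, the test-name customization from that StackOverflow link boils down to the `ids=` argument of `@pytest.mark.parametrize`. A small sketch (the shorthand format here is mine, not the PR's):

```python
import pytest


def model_id(model_path: str) -> str:
    # e.g. "faster-rcnn/model/FasterRCNN-12.onnx" -> "FasterRCNN-12"
    return model_path.split("/")[-1].removesuffix(".onnx")


@pytest.mark.parametrize(
    "model",
    [
        "faster-rcnn/model/FasterRCNN-12.onnx",
        "fcn/model/fcn-resnet50-12.onnx",
    ],
    ids=model_id,
)
def test_models(model):
    ...
```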
Yeah, that makes sense to me.
I'll go through and take another look at this PR when I get the chance. If I don't get to it for a while and you'd like a review, please feel free to message/ping me.
Actually, I was able to look through this pretty quickly now, and it looks like a good change to me. What else needs to be done to undraft this PR?
I'm not totally content with a few design points yet, but this could be good enough to merge and start using. I'll iterate a bit on the PR description and docs so they reflect the current status, then mark as ready for review. Thanks for taking a look!
```python
# Download the model as needed.
# TODO(scotttodd): move to fixture with cache / download on demand
# TODO(scotttodd): overwrite if already existing? check SHA?
# TODO(scotttodd): redownload if file is corrupted (e.g. partial download)
onnx_path = test_artifacts_dir / f"{model_name}.onnx"
if not onnx_path.exists():
    urllib.request.urlretrieve(model_url, onnx_path)
```
Follow-up tasks
- Adjust file downloading / caching behavior to avoid redownloading and using significant bandwidth when used together with persistent self-hosted runners or github actions caches
For a sense of scale, the onnx_models/artifacts/ directory is around 34GB on my Windows system right now, including .mlir and .vmfb files. I don't want CI runs to redownload 10GB+ from GitHub each job run, since I think that cuts into the Git LFS bandwidth quota for https://github.com/onnx/models. The docs at https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage say that quota comes from the repository owner, not the user, so I want to be a good citizen here.
I may take ideas from #59 (comment) and build some caching layer that can be shared across the test suites here.
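As one possible shape for that caching layer (a sketch only, with a hypothetical helper name and no ties to the PR's fixtures): download into the artifacts directory, reuse an existing file when possible, and optionally verify a checksum so corrupted or partial downloads get refetched.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_model(url: str, dest: Path, expected_sha256: str | None = None) -> Path:
    # Hypothetical helper: reuse a previously downloaded file when possible.
    if dest.exists():
        if expected_sha256 is None:
            return dest
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        if digest == expected_sha256:
            return dest
        dest.unlink()  # Corrupted or partial download, fetch again.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest
```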
…P. (#19524)

This switches from running ONNX model compile->run correctness tests on only CPU to now run on GPU using the Vulkan and HIP APIs. We could also run on CUDA with #18814 and Metal with #18817. These new tests will help guard against regressions to full models, at least when using default flags. I'm planning on adding models coming from other frameworks (such as [LiteRT Models](https://github.com/iree-org/iree-test-suites/tree/main/litert_models)) in future PRs.

As these tests will run on every pull request and commit, I'm starting the test list with all tests that are passing on our current set of runners, with no (strict _or_ loose) XFAILs. The full set of tests will be run nightly in https://github.com/iree-org/iree-test-suites using nightly IREE releases... once we have runners with GPUs available in that repository.

See also iree-org/iree-test-suites#65 and iree-org/iree-test-suites#6.

## Sample logs

I have not done much triage on the test failures, but it does seem like Vulkan pass rates are substantially lower than CPU and ROCm. Test reports, including logs for all failures, are currently published as artifacts on actions runs in iree-test-suites, such as https://github.com/iree-org/iree-test-suites/actions/runs/12794322266. We could also archive test reports somewhere like https://github.com/nod-ai/e2eshark-reports and/or host the test reports on a website like https://nod-ai.github.io/shark-ai/llm/sglang/index.html?sort=result.

### CPU

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117085?pr=19524#step:8:395

```
============================== slowest durations ===============================
39.46s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[vgg/model/vgg19-7.onnx]
13.39s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
13.25s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
12.48s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
11.93s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
11.49s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
11.28s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
11.26s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
9.14s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.73s call tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
7.61s call tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/gender_googlenet.onnx]
7.57s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
7.27s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
4.86s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
4.61s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
4.58s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
3.08s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[squeezenet/model/squeezenet1.0-9.onnx]
2.02s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
1.90s call tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
================== 19 passed, 18 skipped in 184.96s (0:03:04) ==================
```

### ROCm

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117629?pr=19524#step:8:344

```
============================== slowest durations ===============================
9.40s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
9.15s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
9.05s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
8.73s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
7.95s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.94s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
7.81s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
7.13s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.95s call tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
5.15s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
4.52s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/googlenet/model/googlenet-12.onnx]
3.55s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
3.12s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
2.57s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
2.48s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
2.21s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.36s call tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
0.95s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============ 17 passed, 19 skipped, 1 xfailed in 100.10s (0:01:40) =============
```

### Vulkan

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681118044?pr=19524#step:8:216

```
============================== slowest durations ===============================
13.10s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
12.97s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
12.40s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
12.22s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
9.07s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
8.09s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.04s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
2.93s call tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.86s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
0.90s call tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============= 9 passed, 27 skipped, 1 xfailed in 79.62s (0:01:19) ==============
```

ci-exactly: build_packages, test_onnx
Progress on #6. See how this is used downstream in iree-org/iree#19524.
Overview
This replaces hardcoded flags like ... and inlined marks like ... with a JSON config file passed to the test runner via the `--test-config-file` option or the `IREE_TEST_CONFIG_FILE` environment variable.

During test case collection, each test case name is looked up in the config file to determine what the expected outcome is, from `["skip" (special option), "pass", "fail-import", "fail-compile", "fail-run"]`. By default, all tests are skipped. This design allows for out-of-tree testing to be performed using explicit test lists (encoded in a file, unlike the `-k` option), custom flags, and custom test expectations.
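Invocation then looks something like this (the config file name is illustrative):

```
# Explicit flag:
pytest --test-config-file=configs/onnx_models_gpu_rocm.json

# Or via the environment variable:
export IREE_TEST_CONFIG_FILE=configs/onnx_models_gpu_rocm.json
pytest
```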
Design details

Compare this implementation with these others:

- Uses `skip_compile_tests`, `skip_run_tests`, `expected_compile_failures`, and `expected_run_failures`. All tests are run by default.
- https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/run.py uses `--device=`, `--backend=`, `--target-chip=`, and `--test-filter=` arguments. Arbitrary flags are not supported, and test expectations are also not supported, so there is no way to directly signal if tests are unexpectedly passing or failing. A utility script can be used to diff the results of two test reports: https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py.
- Uses `@pytest.fixture(params=[...])` with `pytest.mark.target_hip` and other custom marks. This is more standard pytest and supports fluent ways to express other test configurations, but it makes annotating large numbers of tests pretty verbose and doesn't allow for out-of-tree configuration.

I'm imagining a few usage styles:
- ... `tests_and_expected_outcomes`, we could just limit testing to only models that are passing.

Follow-up tasks

- ... (`--update-xfail`?)