From ca539910d7a25cb79026890dc59928f579be49a1 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Fri, 11 Mar 2022 15:26:58 -0600 Subject: [PATCH 1/6] Add performance and longevity testing validation to the release template Signed-off-by: Travis Benedict --- .github/ISSUE_TEMPLATE/release_template.md | 68 ++++++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/.github/ISSUE_TEMPLATE/release_template.md b/.github/ISSUE_TEMPLATE/release_template.md index eaa9992190..8cfc906161 100644 --- a/.github/ISSUE_TEMPLATE/release_template.md +++ b/.github/ISSUE_TEMPLATE/release_template.md @@ -56,6 +56,74 @@ __REPLACE with OpenSearch wide initiatives to improve quality and consistency.__ - [ ] Code Complete (__REPLACE_RELEASE-minus-14-days__ - __REPLACE_RELEASE-minus-11-days__): Teams test their component within the distribution, ensuring integration, backwards compatibility, and perf tests pass. - [ ] Sanity Testing (__REPLACE_RELEASE-minus-8-days__ - __REPLACE_RELEASE-minus-6-days__): Sanity testing and fixing of critical issues found by teams. +### Performance testing validation - _Ends __REPLACE_RELEASE-minus-6-days___ +- [ ] Performance tests do not show a regression + +
How to identify regressions in performance tests +

+
+Disclaimer: the guidelines listed below were determined based on empirical testing using OpenSearch Benchmark.
+These tests were run against OpenSearch 1.2 build #762 and used the nyc_taxis workload with 2 warmup and 3 test iterations.
+The values listed below are **not** applicable to other configurations. More details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461
+
+Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications laid out in the table below.
+
+Please keep in mind the following:
+
+1. Expected values are rough estimates. These are only meant to establish a baseline understanding of test results.
+2. StDev% Mean is the standard deviation as a percentage of the mean. This is the expected variation between tests.
+    1. If the average of several tests consistently falls outside this bound, there may be a performance regression.
+3. MinMax% Diff is the worst-case variance between any two tests with the same configuration.
+    1. If there is a difference greater than this value, there is likely a performance regression or an issue with the test setup.
+    1. In general, comparing one-off test runs should be avoided if possible.
+
+
+|Instance Type|Security|Expected Indexing Throughput Avg (req/s)|Expected Indexing Error Rate|Indexing StDev% Mean|Indexing MinMax% Diff|Expected Query Latency p90 (ms)|Expected Query Latency p99 (ms)|Expected Query Error Rate|Query StDev% Mean|Query MinMax% Diff|
+|---|---|---|---|---|---|---|---|---|---|---|
+|m5.xlarge|Enabled|30554|0|~5%|~12%|431|449|0|~10%|~23%|
+|m5.xlarge|Disabled|34472|0|~5%|~15%|418|444|0|~10%|~25%|
+|m6g.xlarge|Enabled|38625|0|~3%|~8%|497|512|0|~8%|~23%|
+|m6g.xlarge|Disabled|45447|0|~2%|~3%|470|480|0|~5%|~15%|
+
+Note that performance regressions are based on decreased indexing throughput and/or increased query latency.
+
+Additionally, error rates on the order of 0.01% are acceptable, though higher ones may be cause for concern.
+
+
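+As a rough illustration of the comparison described above, a check along these lines could flag a suspicious set of runs. This is a hypothetical sketch, not part of any existing tooling; the metric names and data layout are assumptions, and the numbers are taken from the m5.xlarge (security enabled) row of the table above.
+
+```python
+# Hypothetical sketch: flag a possible regression by comparing the average of
+# several nightly runs against the expected value +/- the StDev% Mean band.
+EXPECTED = {
+    "indexing_throughput_avg": {"expected": 30554, "stdev_pct": 0.05, "higher_is_better": True},
+    "query_latency_p90_ms": {"expected": 431, "stdev_pct": 0.10, "higher_is_better": False},
+}
+
+def possible_regression(metric: str, run_averages: list) -> bool:
+    """True if the mean of several runs falls outside the expected band
+    in the bad direction (lower throughput, or higher latency)."""
+    spec = EXPECTED[metric]
+    observed = sum(run_averages) / len(run_averages)
+    band = spec["expected"] * spec["stdev_pct"]
+    if spec["higher_is_better"]:
+        return observed < spec["expected"] - band
+    return observed > spec["expected"] + band
+
+# Three nightly runs whose p90 query latency sits well above 431 ms +/- 10%
+print(possible_regression("query_latency_p90_ms", [492.0, 488.5, 501.2]))  # True
+```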

+
+ +- [ ] Longevity tests do not show any issues + +
How to identify issues in longevity tests +

+ +Navigate to the Jenkins build for a longevity test. Look at the Console Output + +Search for: + +``` +INFO:root:Test can be monitored on +``` + +Navigate to that link then click the link for "Live Dashboards" + +Use the following table to monitor metrics for the test: + +|Metric|Health Indicators / Expected Values|Requires investigations / Cause for concerns| +|---|---|---| +|Memory|saw tooth graph|upward trends| +|CPU| |upward trends or rising towards 100%| +|Threadpool|0 rejections|any rejections| +|Indexing Throughput|Consistent rate during each test iteration|downward trends| +|Query Throughput|Varies based on the query being issued|downward trends between iterations| +|Indexing Latency|Consistent during each test iteration|upward trends| +|Query Latency|Varies based on the query being issued|upward trends| + +
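+If the console output has been saved locally, a short script along these lines can pull out the link printed after `INFO:root:Test can be monitored on`. This is a hypothetical helper; the saved file name and the exact text following the log prefix are assumptions.
+
+```python
+# Hypothetical helper: extract the live-monitoring link from a saved Jenkins
+# console log. The text that follows "monitored on" is assumed to be a URL.
+import re
+
+def find_monitoring_links(console_output: str) -> list:
+    return re.findall(r"INFO:root:Test can be monitored on\s+(\S+)", console_output)
+
+with open("consoleText.txt") as f:  # console output downloaded from Jenkins
+    for link in find_monitoring_links(f.read()):
+        print(link)
+```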

+
+ + ### Release - _Ends {__REPLACE_RELEASE-day}_ - [ ] Declare a release candidate build, and publish all test results. From 1d22db4604dfe3c82c097ee1c733ad8959527413 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Wed, 16 Mar 2022 09:46:38 -0500 Subject: [PATCH 2/6] Move validation details to test_workflow README Signed-off-by: Travis Benedict --- .github/ISSUE_TEMPLATE/release_template.md | 62 ---------------------- src/test_workflow/README.md | 60 ++++++++++++++++++++- 2 files changed, 59 insertions(+), 63 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/release_template.md b/.github/ISSUE_TEMPLATE/release_template.md index 8cfc906161..bc8bbc8bc1 100644 --- a/.github/ISSUE_TEMPLATE/release_template.md +++ b/.github/ISSUE_TEMPLATE/release_template.md @@ -59,70 +59,8 @@ __REPLACE with OpenSearch wide initiatives to improve quality and consistency.__ ### Performance testing validation - _Ends __REPLACE_RELEASE-minus-6-days___ - [ ] Performance tests do not show a regression -
How to identify regressions in performance tests -

- -Disclaimer: the guidelines listed below were determined based on empirical testing using OpenSearch Benchmark. -These tests were run against OpenSearch 1.2 build #762 and used the nyc_taxis workload with 2 warmup and 3 test iterations. -The values listed below are **not** applicable to other configurations. More details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461 - -Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications layed out in the table - -Please keep in mind the following: - -1. Expected values are rough estimates. These are only meant to establish a baseline understanding of test results. -2. StDev% Mean is the standard deviation as a percentage of the mean. This is expected variation between tests. - 1. If the average of several tests consistently falls outside this bound there may be a performance regression. -3. MinMax% Diff is the worst case variance between any two tests with the same configuration. - 1. If there is a difference greater than this value than there is likely a performance regression or an issue with the test setup. - 1. In general, comparing one off test runs should be avoided if possible. - - -|Instance Type|Security|Expected Indexing Throughput Avg (req/s)|Expected Indexing Error Rate|Indexing StDev% Mean|Indexing MinMax% Diff|Expected Query Latency p90 (ms)|Expected Query Latency p99 (ms)|Expected Query Error Rate|Query StDev% Mean|Query MinMax% Diff| -|---|---|---|---|---|---|---|---|---|---|---| -|m5.xlarge|Enabled|30554|0|~5%|~12%|431|449|0|~10%|~23%| -|m5.xlarge|Disabled|34472|0|~5%|~15%|418|444|0|~10%|~25%| -|m6g.xlarge|Enabled|38625|0|~3%|~8%|497|512|0|~8%|~23| -|m6g.xlarge|Disabled|45447|0|~2%|~3%|470|480|0|~5%|~15%| - -Note that performance regressions are based on decreased indexing throughput and/or increased query latency. - -Additionally, error rates on the order of 0.01% are acceptable, though higher ones may be cause for concern. - - -

-
- - [ ] Longevity tests do not show any issues -
How to identify issues in longevity tests -

- -Navigate to the Jenkins build for a longevity test. Look at the Console Output - -Search for: - -``` -INFO:root:Test can be monitored on -``` - -Navigate to that link then click the link for "Live Dashboards" - -Use the following table to monitor metrics for the test: - -|Metric|Health Indicators / Expected Values|Requires investigations / Cause for concerns| -|---|---|---| -|Memory|saw tooth graph|upward trends| -|CPU| |upward trends or rising towards 100%| -|Threadpool|0 rejections|any rejections| -|Indexing Throughput|Consistent rate during each test iteration|downward trends| -|Query Throughput|Varies based on the query being issued|downward trends between iterations| -|Indexing Latency|Consistent during each test iteration|upward trends| -|Query Latency|Varies based on the query being issued|upward trends| - -

-
- ### Release - _Ends {__REPLACE_RELEASE-day}_ diff --git a/src/test_workflow/README.md b/src/test_workflow/README.md index f11921f62b..5fff6aa4aa 100644 --- a/src/test_workflow/README.md +++ b/src/test_workflow/README.md @@ -99,7 +99,65 @@ opensearch-dashboards=https://ci.opensearch.org/ci/dbc/bundle-build-dashboards/1 ### Performance Tests -TODO +TODO: Add instructions on how run performance tests with `test.sh` + + + +#### How to identify regressions in performance tests + +Disclaimer: the guidelines listed below were determined based on empirical testing using OpenSearch Benchmark. +These tests were run against OpenSearch 1.2 build #762 and used the nyc_taxis workload with 2 warmup and 3 test iterations. +The values listed below are **not** applicable to other configurations. More details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461 + +Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications layed out in the table + +Please keep in mind the following: + +1. Expected values are rough estimates. These are only meant to establish a baseline understanding of test results. +2. StDev% Mean is the standard deviation as a percentage of the mean. This is expected variation between tests. + 1. If the average of several tests consistently falls outside this bound there may be a performance regression. +3. MinMax% Diff is the worst case variance between any two tests with the same configuration. + 1. If there is a difference greater than this value than there is likely a performance regression or an issue with the test setup. + 1. In general, comparing one off test runs should be avoided if possible. + + +|Instance Type|Security|Expected Indexing Throughput Avg (req/s)|Expected Indexing Error Rate|Indexing StDev% Mean|Indexing MinMax% Diff|Expected Query Latency p90 (ms)|Expected Query Latency p99 (ms)|Expected Query Error Rate|Query StDev% Mean|Query MinMax% Diff| +|---|---|---|---|---|---|---|---|---|---|---| +|m5.xlarge|Enabled|30554|0|~5%|~12%|431|449|0|~10%|~23%| +|m5.xlarge|Disabled|34472|0|~5%|~15%|418|444|0|~10%|~25%| +|m6g.xlarge|Enabled|38625|0|~3%|~8%|497|512|0|~8%|~23| +|m6g.xlarge|Disabled|45447|0|~2%|~3%|470|480|0|~5%|~15%| + +Note that performance regressions are based on decreased indexing throughput and/or increased query latency. + +Additionally, error rates on the order of 0.01% are acceptable, though higher ones may be cause for concern. + + + +#### How to identify issues in longevity tests + +Navigate to the Jenkins build for a longevity test. 
Look at the Console Output + +Search for: + +``` +INFO:root:Test can be monitored on +``` + +Navigate to that link then click the link for "Live Dashboards" + +Use the following table to monitor metrics for the test: + +|Metric|Health Indicators / Expected Values|Requires investigations / Cause for concerns| +|---|---|---| +|Memory|saw tooth graph|upward trends| +|CPU| |upward trends or rising towards 100%| +|Threadpool|0 rejections|any rejections| +|Indexing Throughput|Consistent rate during each test iteration|downward trends| +|Query Throughput|Varies based on the query being issued|downward trends between iterations| +|Indexing Latency|Consistent during each test iteration|upward trends| +|Query Latency|Varies based on the query being issued|upward trends| + ## Testing in CI/CD From c6979b7fd877cbd76020170e75db86ab9144ee45 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Wed, 16 Mar 2022 09:49:32 -0500 Subject: [PATCH 3/6] Typo Signed-off-by: Travis Benedict --- src/test_workflow/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/test_workflow/README.md b/src/test_workflow/README.md index 5fff6aa4aa..be1fe7b897 100644 --- a/src/test_workflow/README.md +++ b/src/test_workflow/README.md @@ -99,7 +99,7 @@ opensearch-dashboards=https://ci.opensearch.org/ci/dbc/bundle-build-dashboards/1 ### Performance Tests -TODO: Add instructions on how run performance tests with `test.sh` +TODO: Add instructions on how to run performance tests with `test.sh` From 0d7281df0fbafbe164f054c4544b5b6166228f45 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Wed, 16 Mar 2022 15:57:31 -0500 Subject: [PATCH 4/6] Add TOC entry, remove disclaimer, update wording Signed-off-by: Travis Benedict --- src/test_workflow/README.md | 78 ++++++++++++++++++++----------------- 1 file changed, 43 insertions(+), 35 deletions(-) diff --git a/src/test_workflow/README.md b/src/test_workflow/README.md index be1fe7b897..3b40b616e4 100644 --- a/src/test_workflow/README.md +++ b/src/test_workflow/README.md @@ -3,13 +3,16 @@ - [Integration Tests](#integration-tests) - [Backwards Compatibility Tests](#backwards-compatibility-tests) - [Performance Tests](#performance-tests) + - [Identifying Regressions in Performance Tests](#identifying-regressions-in-performance-tests) + - [Identifying Regressions in Nightly Performance Tests](#identifying-regressions-in-nightly-performance-tests) + - [Identifying Issues in Longevity Tests](#identifying-issues-in-longevity-tests) - [Testing in CI/CD](#testing-in-cicd) - - [Test Workflow (in development)](#test-workflow-in-development) - - [Component-Level Details](#component-level-details) - - [test-orchestrator pipeline](#test-orchestrator-pipeline) - - [integTest job](#integtest-job) - - [bwcTest job](#bwctest-job) - - [perfTest job](#perftest-job) + - [Test Workflow (in development)](#test-workflow-in-development) + - [Component-Level Details](#component-level-details) + - [test-orchestrator pipeline](#test-orchestrator-pipeline) + - [integTest job](#integtest-job) + - [bwcTest job](#bwctest-job) + - [perfTest job](#perftest-job) - [Manifest Files](#manifest-files) - [Dependency Management](#dependency-management) - [S3 Permission Model](#s3-permission-model) @@ -99,23 +102,39 @@ opensearch-dashboards=https://ci.opensearch.org/ci/dbc/bundle-build-dashboards/1 ### Performance Tests -TODO: Add instructions on how to run performance tests with `test.sh` +TODO: Add instructions for running performance tests with `test.sh` +Performance tests from `test.sh` are executed 
using an internal service which automatically provisions hosts that run [OpenSearch Benchmark](https://github.com/opensearch-project/OpenSearch-Benchmark). Work to open source these internal features is being tracked [here](https://github.com/opensearch-project/opensearch-benchmark/issues/97).
+
+Comparable performance data can be generated by directly using OpenSearch Benchmark, assuming that the same cluster and workload setups are used. More details on the performance testing configuration used for the nightly runs can be found [here](https://github.com/opensearch-project/OpenSearch/issues/2461).
+
+In addition to the standard performance tests that run on the order of hours, longevity tests are run which apply load to a cluster for days or weeks. These tests are meant to validate cluster stability over a longer timeframe.
+Longevity tests are also executed with OpenSearch Benchmark, using a modified version of the [nyc_taxis workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) that repeats the schedule for hundreds of iterations.
+
+#### Identifying Regressions in Performance Tests
+
+Before trying to identify a performance regression, a set of baseline tests should be run in order to establish expected values for performance metrics and to understand the variance between tests for the same configuration. Performance regressions are primarily determined based on decreased indexing throughput and/or increased query latency.
+There is some amount of variance expected between any two tests. Empirically, tests for the same configuration have generally been found to differ by about 5% of the mean for average indexing throughput and by about 10% of the mean for p90 or p99 query latency. Note that these values may vary depending on the underlying hardware of the cluster and the workload being used.
+
+If performance metrics for a certain testing configuration consistently fall outside the range defined by the expected value for a metric +/- its standard deviation in the baseline tests, then there is likely a performance regression.
+
+The nightly performance runs use the nyc_taxis workload with 2 warmup and 3 test iterations; tests using this configuration can also use the particular values defined in [this section](#identifying-regressions-in-nightly-performance-tests) for identifying performance regressions.
+
+Additionally, error rates can be indicative of a performance regression. Error rates on the order of 0.01% are acceptable, though higher values are cause for concern. High error rates may point to issues with cluster availability or a change in the logic for processing a specific operation. 
+For tests using OpenSearch Benchmark with an external OpenSearch cluster configured as the data store, more details on the cause of the errors can be found by searching for the test execution ID in the `benchmark-metrics-*` index of the metrics data store. + + +##### Identifying Regressions in Nightly Performance Tests + +Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications laid out in the table below. +The data for this table came from tests using OpenSearch 1.2 build #762, more details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461. Please keep in mind the following: -1. Expected values are rough estimates. These are only meant to establish a baseline understanding of test results. -2. StDev% Mean is the standard deviation as a percentage of the mean. This is expected variation between tests. - 1. If the average of several tests consistently falls outside this bound there may be a performance regression. +1. Changing the number of iterations or the workload type for a test can drastically change performance characteristics. This table is not necessarily applicable to other workload configurations. +2. StDev% Mean is the standard deviation as a percentage of the mean. It is expected that metrics for a test will be +/- this value relative to the expected value. + 1. If the average of several tests consistently falls outside this bound relative to the expected value there may be a performance regression (or improvement). 3. MinMax% Diff is the worst case variance between any two tests with the same configuration. 1. If there is a difference greater than this value than there is likely a performance regression or an issue with the test setup. 1. In general, comparing one off test runs should be avoided if possible. @@ -128,25 +147,15 @@ Please keep in mind the following: |m6g.xlarge|Enabled|38625|0|~3%|~8%|497|512|0|~8%|~23| |m6g.xlarge|Disabled|45447|0|~2%|~3%|470|480|0|~5%|~15%| -Note that performance regressions are based on decreased indexing throughput and/or increased query latency. - -Additionally, error rates on the order of 0.01% are acceptable, though higher ones may be cause for concern. - - - -#### How to identify issues in longevity tests - -Navigate to the Jenkins build for a longevity test. Look at the Console Output - -Search for: - -``` -INFO:root:Test can be monitored on -``` - -Navigate to that link then click the link for "Live Dashboards" +#### Identifying Issues in Longevity Tests + +Longevity tests are long running performance tests meant to measure the stability of a cluster over the course of several days or weeks. +Internal tools provide dashboards for monitoring cluster behavior during these tests. Use the following steps to spot issues in automated longevity tests: -Use the following table to monitor metrics for the test: +1. Navigate to the Jenkins build for a longevity test. +2. In the Console Output search for ``` INFO:root:Test can be monitored on ``` +3. Navigate to that link then click the link for "Live Dashboards" +4. 
Use the table below to monitor metrics for the test: |Metric|Health Indicators / Expected Values|Requires investigations / Cause for concerns| |---|---|---| @@ -158,7 +167,6 @@ Use the following table to monitor metrics for the test: |Indexing Latency|Consistent during each test iteration|upward trends| |Query Latency|Varies based on the query being issued|upward trends| - ## Testing in CI/CD The CI/CD infrastructure is divided into two main workflows - `build` and `test`. The `build` workflow automates the process to generate all OpenSearch and OpenSearch Dashboards artifacts, and provide them as distributions to the `test` workflow, which runs exhaustive testing on the artifacts based on the artifact type. The next section talks in detail about the test workflow. @@ -193,7 +201,7 @@ The development of the bwc test automation is tracked by meta issue [#90](https: #### perfTest job -It is a Jenkins job that runs performance tests on the bundled artifact using mensor (rally). It reads the bundle-manifest, config files and spins up a remote cluster with the bundled artifact installed on it. It will run performance test with and without security for specified architecture of the opensearch bundle. The job will kick off the single node cdk that sets up a remote cluster. It will then run the performance tests on those cluster using the mensor APIs from the whitelisted account and remote cluster endpoint(accessible to mensor system). These tests are bundle level tests. Any plugin on-boarding does not need to be a separate process. If the plugin is a part of the bundle, it is already onboarded. +It is a Jenkins job that runs performance tests on the bundled artifact using OpenSearch Benchmark (Mensor). It reads the bundle-manifest, config files and spins up a remote cluster with the bundled artifact installed on it. It will run performance test with and without security for specified architecture of the opensearch bundle. The job will kick off the single node cdk that sets up a remote cluster. It will then run the performance tests on those cluster using the mensor APIs from the whitelisted account and remote cluster endpoint(accessible to mensor system). These tests are bundle level tests. Any plugin on-boarding does not need to be a separate process. If the plugin is a part of the bundle, it is already onboarded. Once the performance tests completes (usually takes 5-8 hours for nyc_taxis track), it will report the test results and publish a human readable report in S3 bucket. 
From b7194d44b825c57a43040cbd8240d4c19f37c696 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Wed, 16 Mar 2022 16:00:40 -0500 Subject: [PATCH 5/6] Fix TOC indentation Signed-off-by: Travis Benedict --- src/test_workflow/README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/test_workflow/README.md b/src/test_workflow/README.md index 3b40b616e4..5a4ef630c8 100644 --- a/src/test_workflow/README.md +++ b/src/test_workflow/README.md @@ -7,12 +7,12 @@ - [Identifying Regressions in Nightly Performance Tests](#identifying-regressions-in-nightly-performance-tests) - [Identifying Issues in Longevity Tests](#identifying-issues-in-longevity-tests) - [Testing in CI/CD](#testing-in-cicd) - - [Test Workflow (in development)](#test-workflow-in-development) - - [Component-Level Details](#component-level-details) - - [test-orchestrator pipeline](#test-orchestrator-pipeline) - - [integTest job](#integtest-job) - - [bwcTest job](#bwctest-job) - - [perfTest job](#perftest-job) + - [Test Workflow (in development)](#test-workflow-in-development) + - [Component-Level Details](#component-level-details) + - [test-orchestrator pipeline](#test-orchestrator-pipeline) + - [integTest job](#integtest-job) + - [bwcTest job](#bwctest-job) + - [perfTest job](#perftest-job) - [Manifest Files](#manifest-files) - [Dependency Management](#dependency-management) - [S3 Permission Model](#s3-permission-model) From 1841fc29e6889b4ba72688c1d02c9876790a6b99 Mon Sep 17 00:00:00 2001 From: Travis Benedict Date: Thu, 17 Mar 2022 11:54:28 -0500 Subject: [PATCH 6/6] Clean up nits Signed-off-by: Travis Benedict --- .github/ISSUE_TEMPLATE/release_template.md | 1 - src/test_workflow/README.md | 16 +++++++--------- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/release_template.md b/.github/ISSUE_TEMPLATE/release_template.md index bc8bbc8bc1..95aaf4caee 100644 --- a/.github/ISSUE_TEMPLATE/release_template.md +++ b/.github/ISSUE_TEMPLATE/release_template.md @@ -58,7 +58,6 @@ __REPLACE with OpenSearch wide initiatives to improve quality and consistency.__ ### Performance testing validation - _Ends __REPLACE_RELEASE-minus-6-days___ - [ ] Performance tests do not show a regression - - [ ] Longevity tests do not show any issues diff --git a/src/test_workflow/README.md b/src/test_workflow/README.md index 5a4ef630c8..340af02799 100644 --- a/src/test_workflow/README.md +++ b/src/test_workflow/README.md @@ -105,9 +105,9 @@ opensearch-dashboards=https://ci.opensearch.org/ci/dbc/bundle-build-dashboards/1 TODO: Add instructions for running performance tests with `test.sh` -Performance tests from `test.sh` are executed using an internal service which automatically provisions hosts that run [OpenSearch Benchmark](https://github.com/opensearch-project/OpenSearch-Benchmark). Work to open source these internal features is being tracked [here](https://github.com/opensearch-project/opensearch-benchmark/issues/97). +Performance tests from `test.sh` are executed using an internal service which automatically provisions hosts that run [OpenSearch Benchmark](https://github.com/opensearch-project/OpenSearch-Benchmark). Work to open source these internal features is being tracked in [opensearch-benchmark#97](https://github.com/opensearch-project/opensearch-benchmark/issues/97). -Comparable performance data can be generated by directly using OpenSearch Benchmark, assuming that the same cluster and workload setups are used. 
More details on the performance testing configuration used for the nightly runs can be found [here](https://github.com/opensearch-project/OpenSearch/issues/2461). +Comparable performance data can be generated by directly using OpenSearch Benchmark, assuming that the same cluster and workload setups are used. More details on the performance testing configuration used for the nightly runs can be found in [OpenSearch#2461](https://github.com/opensearch-project/OpenSearch/issues/2461). In addition to the standard performance tests that run on the order of hours, longevity tests are run which load to a cluster for days or weeks. These tests are meant to validate cluster stability over a longer timeframe. Longevity tests are also executed using OpenSearch Benchmark, using a modified version of the [nyc_taxis workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) that repeats the schedule for hundreds of iterations. @@ -128,16 +128,14 @@ For tests using OpenSearch Benchmark with an external OpenSearch cluster configu ##### Identifying Regressions in Nightly Performance Tests Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications laid out in the table below. -The data for this table came from tests using OpenSearch 1.2 build #762, more details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461. +The data for this table came from tests using OpenSearch 1.2 build #762 ([arm64](https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/1.2.4/762/linux/arm64/dist/opensearch/manifest.yml +)/[x64](https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/1.2.4/762/linux/x64/dist/opensearch/manifest.yml)), more details on the test setup can be found in [OpenSearch#2461](https://github.com/opensearch-project/OpenSearch/issues/2461). Please keep in mind the following: 1. Changing the number of iterations or the workload type for a test can drastically change performance characteristics. This table is not necessarily applicable to other workload configurations. -2. StDev% Mean is the standard deviation as a percentage of the mean. It is expected that metrics for a test will be +/- this value relative to the expected value. - 1. If the average of several tests consistently falls outside this bound relative to the expected value there may be a performance regression (or improvement). -3. MinMax% Diff is the worst case variance between any two tests with the same configuration. - 1. If there is a difference greater than this value than there is likely a performance regression or an issue with the test setup. - 1. In general, comparing one off test runs should be avoided if possible. +2. StDev% Mean is the standard deviation as a percentage of the mean. It is expected that metrics for a test will be +/- this value relative to the expected value. If the average of several tests consistently falls outside this bound relative to the expected value there may be a performance regression (or improvement). +3. MinMax% Diff is the worst case variance between any two tests with the same configuration. If there is a difference greater than this value than there is likely a performance regression or an issue with the test setup. In general, comparing one off test runs should be avoided if possible. 
|Instance Type|Security|Expected Indexing Throughput Avg (req/s)|Expected Indexing Error Rate|Indexing StDev% Mean|Indexing MinMax% Diff|Expected Query Latency p90 (ms)|Expected Query Latency p99 (ms)|Expected Query Error Rate|Query StDev% Mean|Query MinMax% Diff| @@ -201,7 +199,7 @@ The development of the bwc test automation is tracked by meta issue [#90](https: #### perfTest job -It is a Jenkins job that runs performance tests on the bundled artifact using OpenSearch Benchmark (Mensor). It reads the bundle-manifest, config files and spins up a remote cluster with the bundled artifact installed on it. It will run performance test with and without security for specified architecture of the opensearch bundle. The job will kick off the single node cdk that sets up a remote cluster. It will then run the performance tests on those cluster using the mensor APIs from the whitelisted account and remote cluster endpoint(accessible to mensor system). These tests are bundle level tests. Any plugin on-boarding does not need to be a separate process. If the plugin is a part of the bundle, it is already onboarded. +It is a Jenkins job that runs performance tests on the bundled artifact using [OpenSearch Benchmark](https://github.com/opensearch-project/OpenSearch-Benchmark) (Mensor). It reads the bundle-manifest, config files and spins up a remote cluster with the bundled artifact installed on it. It will run performance test with and without security for specified architecture of the opensearch bundle. The job will kick off the single node cdk that sets up a remote cluster. It will then run the performance tests on those cluster using the mensor APIs from the whitelisted account and remote cluster endpoint(accessible to mensor system). These tests are bundle level tests. Any plugin on-boarding does not need to be a separate process. If the plugin is a part of the bundle, it is already onboarded. Once the performance tests completes (usually takes 5-8 hours for nyc_taxis track), it will report the test results and publish a human readable report in S3 bucket.
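
The patches above note that, when error rates are high, more detail can be found by searching for the test execution ID in the `benchmark-metrics-*` index of the metrics data store. The following is a rough, hypothetical sketch of that kind of lookup; the endpoint, credentials, and the exact field name holding the execution ID are assumptions and will differ per setup.

```python
# Hypothetical lookup of one test execution's documents in the benchmark
# metrics data store. Host, credentials, and field name are placeholders.
import requests

METRICS_HOST = "https://metrics-datastore.example.com:9200"  # placeholder endpoint
TEST_EXECUTION_ID = "REPLACE_WITH_TEST_EXECUTION_ID"

query = {
    "size": 100,
    # Field name is an assumption; check the index mapping for the real one.
    "query": {"term": {"test-execution-id": TEST_EXECUTION_ID}},
}

resp = requests.post(
    f"{METRICS_HOST}/benchmark-metrics-*/_search",
    json=query,
    auth=("admin", "admin"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```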