Add performance and longevity testing validation to the release template #1752

Merged 6 commits on Mar 17, 2022
Changes from 1 commit:
Add TOC entry, remove disclaimer, update wording
Signed-off-by: Travis Benedict <benedtra@amazon.com>
travisbenedict committed Mar 16, 2022
commit 0d7281df0fbafbe164f054c4544b5b6166228f45
78 changes: 43 additions & 35 deletions src/test_workflow/README.md
@@ -3,13 +3,16 @@
- [Integration Tests](#integration-tests)
- [Backwards Compatibility Tests](#backwards-compatibility-tests)
- [Performance Tests](#performance-tests)
- [Identifying Regressions in Performance Tests](#identifying-regressions-in-performance-tests)
- [Identifying Regressions in Nightly Performance Tests](#identifying-regressions-in-nightly-performance-tests)
Member:

Did you mean all these to be at the same level?

Contributor Author:

I meant for the Nightly test section to be a subsection of performance tests in general, since it's worth reading that section first to get some background.

- [Identifying Issues in Longevity Tests](#identifying-issues-in-longevity-tests)
- [Testing in CI/CD](#testing-in-cicd)
- [Test Workflow (in development)](#test-workflow-in-development)
- [Component-Level Details](#component-level-details)
- [test-orchestrator pipeline](#test-orchestrator-pipeline)
- [integTest job](#integtest-job)
- [bwcTest job](#bwctest-job)
- [perfTest job](#perftest-job)
- [Manifest Files](#manifest-files)
- [Dependency Management](#dependency-management)
- [S3 Permission Model](#s3-permission-model)
@@ -99,23 +102,39 @@ opensearch-dashboards=https://ci.opensearch.org/ci/dbc/bundle-build-dashboards/1

### Performance Tests

TODO: Add instructions for running performance tests with `test.sh`


Performance tests from `test.sh` are executed using an internal service which automatically provisions hosts that run [OpenSearch Benchmark](https://github.com/opensearch-project/OpenSearch-Benchmark). Work to open source these internal features is being tracked [here](https://github.com/opensearch-project/opensearch-benchmark/issues/97).

Comparable performance data can be generated by directly using OpenSearch Benchmark, assuming that the same cluster and workload setups are used. More details on the performance testing configuration used for the nightly runs can be found [here](https://github.com/opensearch-project/OpenSearch/issues/2461).
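For reference, an invocation of OpenSearch Benchmark against an already-provisioned cluster might look like the following sketch. The endpoint is a placeholder, and the flags shown (`--workload`, `--target-hosts`, `--pipeline=benchmark-only`) are standard OpenSearch Benchmark options; confirm them against the version you are running.

```python
# Sketch: build an OpenSearch Benchmark command line approximating the
# nightly configuration (nyc_taxis workload against an existing cluster).
# The target host is a hypothetical placeholder.
target_hosts = "example-cluster:9200"  # placeholder endpoint

command = [
    "opensearch-benchmark", "execute-test",
    "--workload=nyc_taxis",
    f"--target-hosts={target_hosts}",
    # benchmark-only reuses an existing cluster instead of provisioning one
    "--pipeline=benchmark-only",
]

print(" ".join(command))
```

In practice this command would be run directly in a shell (or via `subprocess`); the point is that the same workload and cluster setup must be used for the results to be comparable with the nightly runs.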

In addition to the standard performance tests, which run on the order of hours, longevity tests apply load to a cluster for days or weeks. These tests are meant to validate cluster stability over a longer timeframe.
Longevity tests are also executed using OpenSearch Benchmark, using a modified version of the [nyc_taxis workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) that repeats the schedule for hundreds of iterations.
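The repetition described above can be sketched programmatically. The task fields below follow the general shape of an OpenSearch Benchmark workload schedule, but the base schedule itself is illustrative and is not the actual modified nyc_taxis workload.

```python
# Sketch: expand a base benchmark schedule into a longevity schedule by
# repeating it many times. Field names mimic the OpenSearch Benchmark
# workload (track) format; the operations themselves are illustrative.
base_schedule = [
    {"operation": "index", "clients": 8, "warmup-time-period": 120},
    {"operation": "default", "clients": 1, "iterations": 100},
]

def repeat_schedule(schedule, times):
    """Repeat the whole schedule `times` times, preserving task order."""
    return [dict(task) for _ in range(times) for task in schedule]

# Hundreds of iterations of the base schedule, as described above
longevity_schedule = repeat_schedule(base_schedule, 300)
print(len(longevity_schedule))  # 600 tasks: 300 passes over a 2-task schedule
```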

#### Identifying Regressions in Performance Tests

Before trying to identify a performance regression, a set of baseline tests should be run in order to establish expected values for performance metrics and to understand the variance between tests for the same configuration. Performance regressions are primarily indicated by decreased indexing throughput and/or increased query latency.
Some amount of variance is expected between any two tests. Empirically, tests for the same configuration have generally been found to differ by about 5% of the mean for average indexing throughput and by about 10% of the mean for p90 or p99 query latency. Note that these values may vary depending on the underlying hardware of the cluster and the workload being used.

If performance metrics for a given testing configuration consistently fall outside the range formed by the expected value for a metric +/- the standard deviation of that metric in the baseline tests, then there is likely a performance regression.
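The check described above can be sketched as follows. The one-standard-deviation threshold comes from the text; the sample numbers are illustrative.

```python
import statistics

def regression_suspected(baseline_runs, recent_runs, lower_is_better=False):
    """Flag a likely regression when recent results consistently fall outside
    baseline mean +/- one standard deviation (threshold from the text above)."""
    mean = statistics.mean(baseline_runs)
    stdev = statistics.stdev(baseline_runs)
    if lower_is_better:  # e.g. p90/p99 query latency: higher values are worse
        return all(run > mean + stdev for run in recent_runs)
    # e.g. average indexing throughput: lower values are worse
    return all(run < mean - stdev for run in recent_runs)

# Illustrative baseline indexing throughput (docs/s) from repeated runs
baseline = [38000, 38900, 38600, 38400, 38700]
print(regression_suspected(baseline, [36200, 36500, 36900]))  # True: consistently low
```

The key word is *consistently*: a single outlying run is within normal variance, so the helper requires every recent run to fall outside the baseline range before flagging anything.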

The nightly performance runs use the nyc_taxis workload with 2 warmup and 3 test iterations; tests using this configuration can also use the particular values defined in [this section](#identifying-regressions-in-nightly-performance-tests) for identifying performance regression.

Additionally, error rates can be indicative of a performance regression. Error rates on the order of 0.01% are acceptable, though higher values are cause for concern. High error rates may point to issues with cluster availability or a change in the logic for processing a specific operation.
For tests using OpenSearch Benchmark with an external OpenSearch cluster configured as the data store, more details on the cause of the errors can be found by searching for the test execution ID in the `benchmark-metrics-*` index of the metrics data store.
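A lookup in the metrics data store might use a query body along these lines. The index pattern comes from the text above, but the exact field name holding the test execution ID (`test-execution-id` here) is an assumption and may differ between OpenSearch Benchmark versions, as is the sample ID.

```python
import json

# Sketch: a search body for finding error details in the benchmark-metrics-*
# index by test execution ID. The field name "test-execution-id" is an
# assumption; check your metrics store's mapping before relying on it.
def error_query(test_execution_id):
    return {
        "query": {
            "bool": {
                "must": [{"term": {"test-execution-id": test_execution_id}}],
            }
        },
        "size": 100,
    }

body = error_query("f0b2616d-example")  # hypothetical execution ID
print(json.dumps(body, indent=2))
```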


##### Identifying Regressions in Nightly Performance Tests

Using the aggregate results from the nightly performance test runs, compare indexing and query metrics to the specifications laid out in the table below.
The data for this table came from tests using OpenSearch 1.2 build #762; more details on the test setup can be found here: https://github.com/opensearch-project/OpenSearch/issues/2461.

Please keep in mind the following:

1. Expected values are rough estimates. These are only meant to establish a baseline understanding of test results.
    1. Changing the number of iterations or the workload type for a test can drastically change performance characteristics. This table is not necessarily applicable to other workload configurations.
2. StDev% Mean is the standard deviation as a percentage of the mean. It is expected that metrics for a test will be +/- this value relative to the expected value.
    1. If the average of several tests consistently falls outside this bound relative to the expected value, there may be a performance regression (or improvement).
3. MinMax% Diff is the worst-case variance between any two tests with the same configuration.
    1. If there is a difference greater than this value, then there is likely a performance regression or an issue with the test setup.
    2. In general, comparing one-off test runs should be avoided if possible.
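The MinMax% Diff rule above can be sketched as a small comparison helper. The threshold value passed in is illustrative, taken from the kind of figures shown in the table for this configuration.

```python
def compare_runs(run_a, run_b, minmax_pct):
    """Flag when two runs of the same configuration differ by more than the
    worst-case MinMax% Diff for that metric (e.g. ~8% throughput difference
    on m6g.xlarge with security enabled, per the table)."""
    diff_pct = abs(run_a - run_b) / max(run_a, run_b) * 100
    return diff_pct > minmax_pct

# Two one-off throughput runs differing by ~10% exceed an ~8% MinMax% Diff,
# suggesting a regression or a problem with the test setup
print(compare_runs(38625, 34700, 8.0))  # True
```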
@@ -128,25 +147,15 @@ Please keep in mind the following:
|m6g.xlarge|Enabled|38625|0|~3%|~8%|497|512|0|~8%|~23%|
|m6g.xlarge|Disabled|45447|0|~2%|~3%|470|480|0|~5%|~15%|




#### Identifying Issues in Longevity Tests

Longevity tests are long-running performance tests meant to measure the stability of a cluster over the course of several days or weeks.
Internal tools provide dashboards for monitoring cluster behavior during these tests. Use the following steps to spot issues in automated longevity tests:

1. Navigate to the Jenkins build for a longevity test.
2. In the Console Output, search for `INFO:root:Test can be monitored on <link>`
3. Navigate to that link, then click the link for "Live Dashboards"
4. Use the table below to monitor metrics for the test:

|Metric|Health Indicators / Expected Values|Requires Investigation / Cause for Concern|
|---|---|---|
@@ -158,7 +167,6 @@ Use the following table to monitor metrics for the test:
|Indexing Latency|Consistent during each test iteration|Upward trends|
|Query Latency|Varies based on the query being issued|Upward trends|
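Step 2 of the process above (finding the monitoring link in the Console Output) can be sketched with a regex. The log line format is taken from the text; the sample URL is a placeholder.

```python
import re

# Sketch: pull the monitoring URL out of a Jenkins console log, matching the
# "INFO:root:Test can be monitored on <link>" line described above.
LINK_PATTERN = re.compile(r"INFO:root:Test can be monitored on (\S+)")

def find_monitoring_link(console_output):
    """Return the first monitoring URL in the console output, or None."""
    match = LINK_PATTERN.search(console_output)
    return match.group(1) if match else None

log = (
    "INFO:root:Starting test\n"
    "INFO:root:Test can be monitored on https://example.com/dashboard\n"
)
print(find_monitoring_link(log))  # https://example.com/dashboard
```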


## Testing in CI/CD

The CI/CD infrastructure is divided into two main workflows: `build` and `test`. The `build` workflow automates the process of generating all OpenSearch and OpenSearch Dashboards artifacts and provides them as distributions to the `test` workflow, which runs exhaustive testing on the artifacts based on the artifact type. The next section describes the test workflow in detail.
@@ -193,7 +201,7 @@ The development of the bwc test automation is tracked by meta issue [#90](https:

#### perfTest job

It is a Jenkins job that runs performance tests on the bundled artifact using OpenSearch Benchmark (Mensor). It reads the bundle manifest and config files and spins up a remote cluster with the bundled artifact installed on it. It runs performance tests with and without security for the specified architecture of the OpenSearch bundle. The job kicks off the single-node CDK that sets up a remote cluster, then runs the performance tests on that cluster using the Mensor APIs from the allowlisted account and the remote cluster endpoint (accessible to the Mensor system). These tests are bundle-level tests, so plugin onboarding does not need to be a separate process: if a plugin is part of the bundle, it is already onboarded.

Once the performance tests complete (usually 5-8 hours for the nyc_taxis track), the job reports the test results and publishes a human-readable report to an S3 bucket.