
Commit 102f00a

add generation time metrics (#613)
- Added performance metrics and updated the Readme with a description of how to use them
- Added C++ and Python samples for benchmarking

Sample to calculate and visualize performance metrics:

```python
import openvino_genai as ov_genai
import tqdm
import pandas as pd
import matplotlib.pylab as pl

pipe = ov_genai.LLMPipeline('TinyLlama-1.1B-Chat-v1.0/')
config = ov_genai.GenerationConfig(max_new_tokens=15)
metrics_df = pd.DataFrame(columns=['batch_size', 'throughput', 'ttft', 'tpot', 'std_throughput', 'std_ttft', 'std_tpot'])

num_iter = 3
for batch_size in tqdm.tqdm([1, 2, 4, 16, 32, 64, 128]):
    prompts = ["The Sky is blue because"] * batch_size
    res = pipe.generate(prompts, config)
    metrics = res.perf_metrics
    for _ in range(num_iter - 1):
        res = pipe.generate(prompts, config)
        metrics += res.perf_metrics
    metrics_df = metrics_df._append({
        'throughput': metrics.get_throughput().mean,
        'ttft': metrics.get_ttft().mean,
        'tpot': metrics.get_tpot().mean,
        'std_throughput': metrics.get_throughput().std,
        'std_ttft': metrics.get_ttft().std,
        'std_tpot': metrics.get_tpot().std,
        'batch_size': batch_size,
    }, ignore_index=True)

fig, axes = pl.subplots(nrows=3, ncols=1, figsize=(6, 8), sharex=True)
axes[0].plot(metrics_df['batch_size'], metrics_df['throughput'], '-o')
axes[1].plot(metrics_df['batch_size'], metrics_df['ttft'], '-o')
axes[2].plot(metrics_df['batch_size'], metrics_df['tpot'], '-o')
axes[0].set_ylabel('Throughput'), axes[1].set_ylabel('TTFT'), axes[2].set_ylabel('TPOT')
axes[2].set_xlabel('Batch Size')
axes[0].grid(True), axes[1].grid(True), axes[2].grid(True)
pl.tight_layout()
```

![Throughput, TTFT and TPOT vs. batch size](https://github.com/user-attachments/assets/021a94b4-fc75-4b5f-90e6-60db471a3810)

ticket: CVS-132859
2 parents 3bfbab5 + e553ef5 commit 102f00a

16 files changed (+744, -30 lines)

samples/CMakeLists.txt

+1
@@ -10,6 +10,7 @@ add_subdirectory(cpp/greedy_causal_lm)
 add_subdirectory(cpp/multinomial_causal_lm)
 add_subdirectory(cpp/prompt_lookup_decoding_lm)
 add_subdirectory(cpp/speculative_decoding_lm)
+add_subdirectory(cpp/benchmark_genai)
 
 install(FILES requirements.txt DESTINATION samples
     COMPONENT cpp_samples_genai)
+24
@@ -0,0 +1,24 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

find_package(OpenVINOGenAI REQUIRED PATHS
    "${CMAKE_BINARY_DIR}"  # Reuse the package from the build.
    ${OpenVINO_DIR}  # GenAI may be installed alongside OpenVINO.
)

FetchContent_Declare(cxxopts
    URL https://github.com/jarro2783/cxxopts/archive/refs/tags/v3.1.1.tar.gz
    URL_HASH SHA256=523175f792eb0ff04f9e653c90746c12655f10cb70f1d5e6d6d9491420298a08)
FetchContent_MakeAvailable(cxxopts)

add_executable(benchmark_genai benchmark_genai.cpp)
target_link_libraries(benchmark_genai PRIVATE openvino::genai cxxopts::cxxopts)
set_target_properties(benchmark_genai PROPERTIES
    COMPILE_PDB_NAME benchmark_genai
    # Ensure out of box LC_RPATH on macOS with SIP
    INSTALL_RPATH_USE_LINK_PATH ON)
install(TARGETS benchmark_genai
        RUNTIME DESTINATION samples_bin/
        COMPONENT samples_bin
        EXCLUDE_FROM_ALL)

samples/cpp/benchmark_genai/README.md

+47
@@ -0,0 +1,47 @@
# LLM benchmarking sample

This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text, and calculating various performance metrics.

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../requirements.txt](../../requirements.txt) for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```

## Usage

```sh
benchmark_vanilla_genai [OPTIONS]
```

### Options

- `-m, --model`: Path to the model and tokenizers base directory.
- `-p, --prompt` (default: `"The Sky is blue because"`): The prompt to generate text.
- `-nw, --num_warmup` (default: `1`): Number of warmup iterations.
- `-mt, --max_new_tokens` (default: `20`): Maximal number of new tokens.
- `-n, --num_iter` (default: `3`): Number of iterations.
- `-d, --device` (default: `"CPU"`): Device to run the model on.

### Output

```
benchmark_vanilla_genai -m TinyLlama-1.1B-Chat-v1.0 -n 10
```

```
Load time: 3405.69 ms
Generate time: 1430.77 ± 3.04 ms
Tokenization time: 0.51 ± 0.02 ms
Detokenization time: 0.37 ± 0.01 ms
TTFT: 81.60 ± 0.54 ms
TPOT: 71.52 ± 2.72 ms
Throughput tokens/s: 13.98 ± 0.53
```
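
As a rough sanity check for a single prompt, throughput is approximately the inverse of TPOT: 1000 ms ÷ 71.52 ms/token ≈ 13.98 tokens/s, which matches the throughput reported above.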

For more information on how performance metrics are calculated, please refer to the [performance-metrics tutorial](../../../src/README.md#performance-metrics).

@@ -0,0 +1,70 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "openvino/genai/llm_pipeline.hpp"
#include <cxxopts.hpp>

int main(int argc, char* argv[]) try {
    cxxopts::Options options("benchmark_vanilla_genai", "Help command");

    options.add_options()
    ("m,model", "Path to model and tokenizers base directory", cxxopts::value<std::string>()->default_value("."))
    ("p,prompt", "Prompt", cxxopts::value<std::string>()->default_value("The Sky is blue because"))
    ("nw,num_warmup", "Number of warmup iterations", cxxopts::value<size_t>()->default_value(std::to_string(1)))
    ("n,num_iter", "Number of iterations", cxxopts::value<size_t>()->default_value(std::to_string(3)))
    ("mt,max_new_tokens", "Maximal number of new tokens", cxxopts::value<size_t>()->default_value(std::to_string(20)))
    ("d,device", "device", cxxopts::value<std::string>()->default_value("CPU"))
    ("h,help", "Print usage");

    cxxopts::ParseResult result;
    try {
        result = options.parse(argc, argv);
    } catch (const cxxopts::exceptions::exception& e) {
        std::cout << e.what() << "\n\n";
        std::cout << options.help() << std::endl;
        return EXIT_FAILURE;
    }

    if (result.count("help")) {
        std::cout << options.help() << std::endl;
        return EXIT_SUCCESS;
    }

    std::string prompt = result["prompt"].as<std::string>();
    const std::string model_path = result["model"].as<std::string>();
    std::string device = result["device"].as<std::string>();
    size_t num_warmup = result["num_warmup"].as<size_t>();
    size_t num_iter = result["num_iter"].as<size_t>();

    ov::genai::GenerationConfig config;
    config.max_new_tokens = result["max_new_tokens"].as<size_t>();

    ov::genai::LLMPipeline pipe(model_path, device);

    for (size_t i = 0; i < num_warmup; i++)
        pipe.generate(prompt, config);

    ov::genai::DecodedResults res = pipe.generate(prompt, config);
    ov::genai::PerfMetrics metrics = res.perf_metrics;
    for (size_t i = 0; i < num_iter - 1; i++) {
        res = pipe.generate(prompt, config);
        metrics = metrics + res.perf_metrics;
    }

    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Load time: " << metrics.get_load_time() << " ms" << std::endl;
    std::cout << "Generate time: " << metrics.get_generate_duration().mean << " ± " << metrics.get_generate_duration().std << " ms" << std::endl;
    std::cout << "Tokenization time: " << metrics.get_tokenization_duration().mean << " ± " << metrics.get_tokenization_duration().std << " ms" << std::endl;
    std::cout << "Detokenization time: " << metrics.get_detokenization_duration().mean << " ± " << metrics.get_detokenization_duration().std << " ms" << std::endl;
    std::cout << "TTFT: " << metrics.get_ttft().mean << " ± " << metrics.get_ttft().std << " ms" << std::endl;
    std::cout << "TPOT: " << metrics.get_tpot().mean << " ± " << metrics.get_tpot().std << " ms/token " << std::endl;
    std::cout << "Throughput: " << metrics.get_throughput().mean << " ± " << metrics.get_throughput().std << " tokens/s" << std::endl;

    return 0;
} catch (const std::exception& error) {
    std::cerr << error.what() << '\n';
    return EXIT_FAILURE;
} catch (...) {
    std::cerr << "Non-exception object thrown\n";
    return EXIT_FAILURE;
}

+47
@@ -0,0 +1,47 @@
# LLM benchmarking sample

This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text, and calculating various performance metrics.

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../requirements.txt](../../requirements.txt) for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```

## Usage

```sh
python benchmark_vanilla_genai.py [OPTIONS]
```

### Options

- `-m, --model`: Path to the model and tokenizers base directory.
- `-p, --prompt` (default: `"The Sky is blue because"`): The prompt to generate text.
- `-nw, --num_warmup` (default: `1`): Number of warmup iterations.
- `-n, --num_iter` (default: `3`): Number of iterations.
- `-mt, --max_new_tokens` (default: `20`): Maximal number of new tokens.
- `-d, --device` (default: `"CPU"`): Device to run the model on.

### Output

```
python benchmark_vanilla_genai.py -m TinyLlama-1.1B-Chat-v1.0 -n 10
```

```
Load time: 3405.69 ms
Generate time: 1430.77 ± 3.04 ms
Tokenization time: 0.51 ± 0.02 ms
Detokenization time: 0.37 ± 0.01 ms
TTFT: 81.60 ± 0.54 ms
TPOT: 71.52 ± 2.72 ms
Throughput tokens/s: 13.98 ± 0.53
```

For more information on how performance metrics are calculated, see [performance metrics readme](../../../src/README.md#performance-metrics).

@@ -0,0 +1,49 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import openvino_genai as ov_genai

def main():
    parser = argparse.ArgumentParser(description="Help command")
    parser.add_argument("-m", "--model", type=str, help="Path to model and tokenizers base directory")
    parser.add_argument("-p", "--prompt", type=str, default="The Sky is blue because", help="Prompt")
    parser.add_argument("-nw", "--num_warmup", type=int, default=1, help="Number of warmup iterations")
    parser.add_argument("-n", "--num_iter", type=int, default=2, help="Number of iterations")
    parser.add_argument("-mt", "--max_new_tokens", type=int, default=20, help="Maximal number of new tokens")
    parser.add_argument("-d", "--device", type=str, default="CPU", help="Device")

    args = parser.parse_args()

    # Perf metrics are stored in DecodedResults.
    # In order to get DecodedResults instead of a string, the input should be a list.
    prompt = [args.prompt]
    model_path = args.model
    device = args.device
    num_warmup = args.num_warmup
    num_iter = args.num_iter

    config = ov_genai.GenerationConfig()
    config.max_new_tokens = args.max_new_tokens

    pipe = ov_genai.LLMPipeline(model_path, device)

    for _ in range(num_warmup):
        pipe.generate(prompt, config)

    res = pipe.generate(prompt, config)
    perf_metrics = res.perf_metrics
    for _ in range(num_iter - 1):
        res = pipe.generate(prompt, config)
        perf_metrics += res.perf_metrics

    print(f"Load time: {perf_metrics.get_load_time():.2f} ms")
    print(f"Generate time: {perf_metrics.get_generate_duration().mean:.2f} ± {perf_metrics.get_generate_duration().std:.2f} ms")
    print(f"Tokenization time: {perf_metrics.get_tokenization_duration().mean:.2f} ± {perf_metrics.get_tokenization_duration().std:.2f} ms")
    print(f"Detokenization time: {perf_metrics.get_detokenization_duration().mean:.2f} ± {perf_metrics.get_detokenization_duration().std:.2f} ms")
    print(f"TTFT: {perf_metrics.get_ttft().mean:.2f} ± {perf_metrics.get_ttft().std:.2f} ms")
    print(f"TPOT: {perf_metrics.get_tpot().mean:.2f} ± {perf_metrics.get_tpot().std:.2f} ms")
    print(f"Throughput: {perf_metrics.get_throughput().mean:.2f} ± {perf_metrics.get_throughput().std:.2f} tokens/s")

if __name__ == "__main__":
    main()

src/README.md

+91
@@ -196,6 +196,97 @@ int main(int argc, char* argv[]) {
}
```

### Performance Metrics

`openvino_genai.PerfMetrics` (referred to as `PerfMetrics` for simplicity) is a structure that holds performance metrics for each generate call. `PerfMetrics` holds fields with mean and standard deviations for the following metrics:
- Time To First Token (TTFT), ms
- Time per Output Token (TPOT), ms/token
- Generate total duration, ms
- Tokenization duration, ms
- Detokenization duration, ms
- Throughput, tokens/s

and:
- Load time, ms
- Number of generated tokens
- Number of tokens in the input prompt

Performance metrics are stored in the `perf_metrics` field of either `DecodedResults` or `EncodedResults`. In addition to the fields mentioned above, `PerfMetrics` has a member `raw_metrics` of type `openvino_genai.RawPerfMetrics` (referred to as `RawPerfMetrics` for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as median or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on `PerfMetrics`.
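
As an illustrative sketch of working with the raw values (the `RawPerfMetrics` field name `m_durations` and its units below are assumptions for illustration; check the `RawPerfMetrics` definition for the exact members):

```python
import numpy as np
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)

# raw_metrics keeps the per-call raw values behind the mean/std getters.
raw_metrics = result.perf_metrics.raw_metrics
# Assumed field: duration of each new-token generation step.
durations = np.array(raw_metrics.m_durations, dtype=float)
print(f"median token generation duration: {np.median(durations):.2f}")
print(f"p95 token generation duration: {np.percentile(durations, 95):.2f}")
```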

```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics

print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
```

```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    auto result = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    auto perf_metrics = result.perf_metrics;

    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
    std::cout << "TTFT: " << perf_metrics.get_ttft().mean << " ms" << std::endl;
    std::cout << "TPOT: " << perf_metrics.get_tpot().mean << " ms/token " << std::endl;
    std::cout << "Throughput: " << perf_metrics.get_throughput().mean << " tokens/s" << std::endl;
}
```
output:
```sh
mean_generate_duration: 76.28
mean_ttft: 42.58
mean_tpot 3.80
```

>**Note**: If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
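
For example, continuing the Python snippet above, the behavior described in the note amounts to:

```python
# Plain string input: generate() returns just the generated text, with no perf_metrics.
text_only = pipe.generate("The Sun is yellow because", max_new_tokens=20)

# List input: generate() returns DecodedResults, which carries perf_metrics.
results = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = results.perf_metrics
```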

Several `perf_metrics` objects can be added to each other. In that case `raw_metrics` are concatenated and mean/std values are recalculated. This accumulates statistics from several `generate()` calls.

```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics;

    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
    std::cout << "TTFT: " << perf_metrics.get_ttft().mean << " ms" << std::endl;
    std::cout << "TPOT: " << perf_metrics.get_tpot().mean << " ms/token " << std::endl;
    std::cout << "Throughput: " << perf_metrics.get_throughput().mean << " tokens/s" << std::endl;
}
```

```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
res_2 = pipe.generate(["Why Sky is blue because"], max_new_tokens=20)
perf_metrics = res_1.perf_metrics + res_2.perf_metrics

print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
```
For more examples of how metrics are used, please refer to the Python [benchmark_genai.py](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/3/samples/python/benchmark_genai/README.md) and C++ [benchmark_genai](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/3/samples/cpp/benchmark_genai/README.md) samples.

## How It Works

For information on how OpenVINO™ GenAI works, refer to the [How It Works Section](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/src/docs/HOW_IT_WORKS.md).

src/cpp/include/openvino/genai/llm_pipeline.hpp

+6
@@ -5,11 +5,13 @@

 #include <optional>
 #include <variant>
+#include <chrono>

 #include "openvino/core/any.hpp"
 #include "openvino/genai/generation_config.hpp"
 #include "openvino/genai/tokenizer.hpp"
 #include "openvino/genai/streamer_base.hpp"
+#include "openvino/genai/perf_metrics.hpp"

 namespace ov {
 namespace genai {
@@ -29,11 +31,13 @@ using StringInputs = std::variant<std::string, std::vector<std::string>>;
  *
  * @param tokens sequence of resulting tokens
  * @param scores sum of logarithmic probabilities of all tokens in the sequence
+ * @param metrics performance metrics with tpot, ttft, etc. of type ov::genai::PerfMetrics
  */
 class EncodedResults {
 public:
     std::vector<std::vector<int64_t>> tokens;
     std::vector<float> scores;
+    PerfMetrics perf_metrics;
 };

 /**
@@ -42,11 +46,13 @@ class EncodedResults {
  *
  * @param texts vector of resulting sequences
  * @param scores scores for each sequence
+ * @param metrics performance metrics with tpot, ttft, etc. of type ov::genai::PerfMetrics
  */
 class DecodedResults {
 public:
     std::vector<std::string> texts;
     std::vector<float> scores;
+    PerfMetrics perf_metrics;

     // @brief Convert DecodedResults to a string.
     operator std::string() const {
