README.md
python optillm.py
* Running on http://192.168.10.48:8000
2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit
```
## Usage
Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the `base_url` to `http://localhost:8000/v1`.
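
For example, a minimal sketch of pointing the standard OpenAI Python client at the proxy (the model name and prompt below are placeholders; use whatever model your upstream endpoint serves):

```python
import os
from openai import OpenAI

# Drop-in replacement: same client, only the base_url changes.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder - any model served by your upstream endpoint
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```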
In the diagram:
- `A` is an existing tool (like [oobabooga](https://github.com/oobabooga/text-generation-webui/)), framework (like [patchwork](https://github.com/patched-codes/patchwork)) or your own code where you want to use the results from optillm. You can use it directly using any OpenAI client sdk.
- `B` is the optillm service (running directly or in a docker container) that will send requests to the `base_url`.
- `C` is any service providing an OpenAI API compatible chat completions endpoint.

### Local inference server
We support loading any HuggingFace model or LoRA directly in optillm. To use the built-in inference server, set `OPTILLM_API_KEY` to any value (e.g. `export OPTILLM_API_KEY="optillm"`) and then use the same value in your OpenAI client. You can pass any HuggingFace model in the model field. If it is a private model, make sure you set the `HF_TOKEN` environment variable with your HuggingFace key. We also support adding any number of LoRAs on top of the model by using the `+` separator.

E.g. the following code loads the base model `meta-llama/Llama-3.2-1B-Instruct` and then adds two LoRAs on top - `patched-codes/Llama-3.2-1B-FixVulns` and `patched-codes/Llama-3.2-1B-FastApply`. You can specify which LoRA to use via the `active_adapter` param in the `extra_body` field of the OpenAI SDK client. By default we will load the last specified adapter.
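
A minimal sketch of such a request, assuming the built-in inference server is reached through the proxy endpoint from the Usage section; the prompt and the choice of active adapter are illustrative:

```python
from openai import OpenAI

# OPTILLM_API_KEY was set to "optillm" above, so use the same value here.
client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    # Base model plus two LoRAs, joined with the `+` separator.
    model="meta-llama/Llama-3.2-1B-Instruct+patched-codes/Llama-3.2-1B-FixVulns+patched-codes/Llama-3.2-1B-FastApply",
    messages=[{"role": "user", "content": "Fix the vulnerability in this query."}],
    temperature=0.2,
    # Select which adapter answers; without this, the last listed adapter is used.
    extra_body={"active_adapter": "patched-codes/Llama-3.2-1B-FixVulns"},
)
print(response.choices[0].message.content)
```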
You can also use the alternate decoding techniques like `cot_decoding` and `entropy_decoding` directly with the local inference server.
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=messages,
    temperature=0.2,
    extra_body={
        "decoding": "cot_decoding",  # or "entropy_decoding"
        # CoT specific params
        "k": 10,
        "aggregate_paths": True,
        # OR Entropy specific params
        "top_k": 27,
        "min_p": 0.03,
    }
)
```
### Starting the optillm proxy with an external server (e.g. llama.cpp or ollama)
- Set the `OPENAI_API_KEY` env variable to a placeholder value
  - e.g. `export OPENAI_API_KEY="sk-no-key"`
- Run `./llama-server -c 4096 -m path_to_model` to start the server with the specified model and a context length of 4096 tokens
- Run `python3 optillm.py --base_url base_url` to start the proxy
  - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`

> [!WARNING]
> Note that llama-server (and ollama) currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
> `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, `re2`, and `z3`. Use the built-in local inference server if you want to use the other approaches.
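
With llama-server and the proxy running as above, the client side is unchanged. A minimal sketch, assuming the defaults used in this section (proxy on port 8000, llama.cpp on port 8080); since llama-server was started with a single model, you can choose any model name you like:

```python
from openai import OpenAI

# The proxy listens on port 8000 and sends requests on to the --base_url given above.
client = OpenAI(api_key="sk-no-key", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="local-llama",  # llama-server hosts a single model, so any name works
    messages=[{"role": "user", "content": "Write a haiku about proxies."}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```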
0 commit comments