# LLM Attribution Library
The LLM Attribution Library is designed to compute the contribution of each token in a prompt to the generated response of a language model.
Usage examples can be found in the `examples/` folder.
## API Design
The attributors are designed to compute the contribution made by each token in an input string to the tokens generated by a language model.
### BaseLLMAttributor
`BaseLLMAttributor` is an abstract base class that defines the interface for all LLM attributors. It declares the `compute_attributions` method, which must be implemented by any concrete attributor class. This method takes an input text and computes the attribution scores for each token.
```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple

import torch


class BaseLLMAttributor(ABC):
    @abstractmethod
    def compute_attributions(
        self, input_text: str, **kwargs
    ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        pass
```
### LocalLLMAttributor
`LocalLLMAttributor` uses gradient-based attribution to quantify the influence of input tokens on the output of a model. For each output token, it computes the gradients with respect to the input embeddings. The L1 norm of these gradients is then used as the attribution score, representing the total influence of each input token on the output.

The `compute_attributions` method generates tokens from the input string and computes the gradients of the output with respect to the input embeddings. These gradients are used to compute the attribution scores.
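
For illustration, here is a usage sketch adapted from an example that previously appeared in this README (that example used the name `Attributor`; the exact constructor and return values are assumptions, not a documented signature):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from attribution.attribution import Attributor  # import path from the earlier example

model_id = "google/gemma-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumed call: returns per-token attribution scores for the generated output.
attributor = Attributor(model=model, tokenizer=tokenizer)
attr_scores, token_ids = attributor.compute_attributions("The capital of France is")
```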
#### Cleaning Up
A convenience method is provided to clean up memory used by Python and Torch. This can be useful when running the library in a cloud notebook environment:
```python
local_attributor.cleanup()
```
### APILLMAttributor
`APILLMAttributor` uses the OpenAI API to compute attributions. Given that gradients are not accessible, the attributor perturbs the input with a given `PerturbationStrategy` and measures the magnitude of change of the generated output with an `attribution_strategy`.

The `compute_attributions` method (see the sketch after this list):

1. Sends a chat completion request to the OpenAI API.
2. Uses a `PerturbationStrategy` to modify the input prompt, and sends the perturbed input to OpenAI's API to generate a perturbed output. Each token of the input prompt is perturbed separately, so that each input token receives its own attribution score.
3. Uses an `attribution_strategy` to compute the magnitude of change between the original and perturbed output.
4. Logs attribution scores to an `ExperimentLogger`, if one is passed.
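
A minimal usage sketch follows; the import paths and keyword names are assumptions inferred from the description above, not a documented API:

```python
# All names here are assumptions based on the surrounding prose; the real
# import paths and signatures may differ.
from attribution.api_attribution import APILLMAttributor
from attribution.experiment_logger import ExperimentLogger

attributor = APILLMAttributor()
logger = ExperimentLogger()

scores = attributor.compute_attributions(
    "The capital of France is",
    # perturbation_strategy=...,  # a PerturbationStrategy instance (see below)
    attribution_strategy="cosine",  # or "prob_diff", "token_displacement"
    logger=logger,  # optional: log the run for later comparison
)
```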
`PerturbationStrategy` is an abstract base class that defines the interface for all perturbation strategies. It declares the `get_replacement_token` method, which must be implemented by any concrete perturbation strategy class. This method takes a token id and returns a replacement token id.
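
From that description, the interface is presumably along these lines (a sketch, not the library's exact code):

```python
from abc import ABC, abstractmethod


class PerturbationStrategy(ABC):
    @abstractmethod
    def get_replacement_token(self, token_id: int) -> int:
        """Return the id of the token that replaces `token_id` in the prompt."""
        ...
```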
The `attribution_strategy` parameter is a string that specifies the method to use for computing attributions. The available strategies are "cosine", "prob_diff", and "token_displacement".

- **Cosine Similarity Attribution**: Measures the cosine similarity between the embeddings of the original and perturbed outputs. The embeddings are obtained from a pre-trained model (e.g. GPT2). The cosine similarity is calculated for each pair of tokens in the same position in the original and perturbed outputs; for example, the token in position 0 of the original response is compared to the token in position 0 of the perturbed response. Additionally, the cosine similarity of the entire sentence embeddings is computed. The difference in sentence similarity and the per-token similarities are returned (a rough sketch follows this list).
- **Probability Difference Attribution**: Calculates the absolute difference in probabilities for each token in the original and perturbed outputs. The probabilities are obtained from the `top_logprobs` field of the tokens, which contains the most likely tokens for each token position. The mean of these differences is returned, as well as the probability difference for each token position.
- **Token Displacement Attribution**: Calculates the displacement of each token in the original output within the perturbed output's `top_logprobs` predicted tokens. The `top_logprobs` field contains the most likely tokens for each token position. If a token from the original output is not found in the `top_logprobs` of the perturbed output, a maximum displacement value is assigned. The mean of these displacements is returned, as well as the displacement of each original output token.
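
As a rough illustration of the cosine strategy's core computation (shapes and mean-pooling here are assumptions; the library's actual implementation may differ):

```python
import torch
import torch.nn.functional as F


def cosine_attribution(
    original: torch.Tensor, perturbed: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """original, perturbed: (seq_len, hidden_dim) output-token embeddings."""
    # Compare tokens at matching positions, truncating to the shorter output.
    n = min(original.shape[0], perturbed.shape[0])
    token_sim = F.cosine_similarity(original[:n], perturbed[:n], dim=-1)
    # Sentence-level similarity from mean-pooled embeddings (pooling assumed).
    sentence_sim = F.cosine_similarity(
        original.mean(dim=0), perturbed.mean(dim=0), dim=0
    )
    # Return distances, so larger values indicate a larger change.
    return 1.0 - sentence_sim, 1.0 - token_sim
```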
### ExperimentLogger
The `ExperimentLogger` class is used to log the results of different experiment runs. It provides methods for starting and stopping an experiment, logging the input and output tokens, and logging the attribution scores. The `api_llm_attribution.ipynb` notebook shows an example of how to use `ExperimentLogger` to compare the results of different attribution strategies.
## Limitations
### Batch dimensions
Currently, this library only supports models that take inputs with a batch dimension. This is common across most modern models, but not always the case (e.g. GPT2).
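
To make the assumption concrete, a quick check with a standard Hugging Face tokenizer (model name reused from the example above):

```python
from transformers import AutoTokenizer

# The tokenizer returns tensors with a leading batch dimension, which is
# the input shape this library expects the model to accept.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
inputs = tokenizer("Attribute this prompt", return_tensors="pt")
print(inputs.input_ids.shape)  # torch.Size([1, seq_len])
```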
### Input Embeddings
This library only supports models that provide a common interface for passing in embeddings and generating outputs without sampling, of the form:
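
A minimal sketch of the assumed interface, following the Hugging Face `inputs_embeds` convention and reusing `model` and `inputs` from the snippets above (the concrete call pattern is an assumption):

```python
# Assumed interface: the model accepts precomputed input embeddings and
# returns logits deterministically (no sampling involved).
embeddings = model.get_input_embeddings()(inputs.input_ids)
outputs = model(inputs_embeds=embeddings)
logits = outputs.logits
```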
To run the attribution process on a device of your choice, pass the device identifier into the `Attributor` class constructor:
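
For example (the `device` argument name is an assumption based on this paragraph; the class name `Attributor` follows the text):

```python
attributor = Attributor(
    model=model,
    tokenizer=tokenizer,
    device="cuda:0",  # assumed parameter name; any valid torch device identifier
)
```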
The device identifier must match the device used on the first embeddings layer.
If no device is specified, the model device will be used by default.
## Development
To run the integration tests:
```bash
python -m pytest tests/integration
```
## Research
Some preliminary exploration and research into using attribution, and into quantitatively measuring attribution success, can be found in the `research` folder of this repository. We'd be excited to see this small library expand, including algorithmic improvements, further attribution and perturbation methods, and more rigorous and exhaustive experimentation. We welcome pull requests and issues from external collaborators.