Algorithm changes to handle output tokens shifting position in response to perturbation #8
Conversation
… Removed model passing to the OpenAIAttributor, using token_embeddings instead. Made openAI model an argument to Attributor instantiation.
… which wasn't documented anywhere and is a weird gotcha. I fixed it so that we don't assume the tokenizer does this and handle splitting words with spaces separately.
… the output tokens change position. Works much better now with prob_diff metric.
…length and token position
…entence_embeddings for consistency
…y strings (rather than assuming openAI choices). Fixed bug where it assumed cosine similarity (encoded by gpt2 tokenizer) would have the same number of tokens as the openai token output; this was causing stuff to get cut off in logging.
…urbation substrate to ""
There's a couple of files whose format doesn't adhere to what the Ruff linter expects. Do you mind installing the Ruff extension? It may also be good to install the VS Code extension for Pylance.
Some comments, mostly on style and bits in the README.
Also, as I think Seba said already, it looks like some linter/formatting errors have crept in, which probably means you don't have the Ruff extension installed.
This is partly my fault: I didn't put instructions in the repo on how to set up the pre-commit hook, which would have caught this, and I also didn't add a formatting check to the pipeline. The linter is failing in CI, though.
attribution/api_attribution.py
Outdated
    new_lp = all_top_logprobs[all_toks.index(otl.token)]
else:
    new_lp = -100
We can move this number to a constant defined at the top of the script.
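Something like this, as a sketch (the constant name and helper are hypothetical, not the repo's actual identifiers):

```python
# Hypothetical sketch: hoist the sentinel value into a module-level constant
# instead of hard-coding -100 inline.
MISSING_TOKEN_LOGPROB = -100.0


def lookup_logprob(token: str, all_toks: list[str], all_top_logprobs: list[float]) -> float:
    """Return the logprob of `token` if it appears among the top candidates, else the sentinel."""
    if token in all_toks:
        return all_top_logprobs[all_toks.index(token)]
    return MISSING_TOKEN_LOGPROB
```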
Example_PIZZA.ipynb
Outdated
Should we move this to /examples?
Also, this may be a small nitpick, but given that Python script names are usually lowercase and snake_case, maybe we can make it example_PIZZA.ipynb.
attribution/attribution_metrics.py
Outdated
# Return difference in sentence similarity and token similarities
return self_similarity - sentence_similarity, 1 - token_similarities
cd = 1 - cosine_similarity(original_output_emb, perturbed_output_emb)
token_distance = cd.min(axis=-1)
What's the intuition behind how this would behave for longer inputs/outputs?
I'm thinking that with longer strings, the chances of having frequent tokens that don't add semantic meaning ("the", "a", etc.) increase, and therefore the chances of token_distance being low due to these tokens appearing at different places also increase.
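For concreteness, here's a toy illustration of that worry (hand-made 2-D vectors standing in for real embeddings; the cosine_similarity call is the same one used in the diff above):

```python
# Toy example: "the" from the original output also occurs in the perturbed output,
# just in a different position, so its minimum pairwise distance collapses to ~0
# even though the surrounding sentence changed.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

original_output_emb = np.array([[1.0, 0.0],    # "the"
                                [0.0, 1.0]])   # "cat"
perturbed_output_emb = np.array([[0.6, 0.8],   # "dog"
                                 [1.0, 0.0]])  # "the" (moved position)

cd = 1 - cosine_similarity(original_output_emb, perturbed_output_emb)
token_distance = cd.min(axis=-1)
print(token_distance)  # ~[0.0, 0.2]: "the" looks unchanged regardless of where it appears
```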
On a different note, do you mind explaining to me why we compute the min across axis=-1?
Is this dimension of length token_embeddings.shape[-1], so the same as the vocabulary length? I may be wrong though.
Sure! So, cosine_similarity actually gives us a pairwise score between every unperturbed output embedding and every perturbed output embedding. So if the unperturbed output embeddings have shape (10, 768) and the perturbed ones have shape (6, 768), cosine_distance = 1 - cosine_similarity(original_output_emb, perturbed_output_emb) gives us a matrix of shape (10, 6). And we want the minimum value for each unperturbed output embedding, so we take the min over the last dimension to give an attribution vector of length 10, i.e. one element for each token in the original unperturbed output.
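A minimal shape sketch of that explanation, using random arrays in place of the real embeddings (768 is just an assumed embedding width):

```python
# Shape check for the pairwise-distance computation described above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

original_output_emb = np.random.rand(10, 768)  # 10 tokens in the unperturbed output
perturbed_output_emb = np.random.rand(6, 768)  # 6 tokens in the perturbed output

cd = 1 - cosine_similarity(original_output_emb, perturbed_output_emb)  # pairwise, shape (10, 6)
token_distance = cd.min(axis=-1)  # shape (10,): one distance per original output token
```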
Thanks! That makes sense. Yeah we're computing the pairwise similarities instead of comparing elements in the same position as before. I think that's good although I agree it has some limitations -- repeated tokens and frequent tokens which appear twice for different reasons may affect the attribution score. I think we can leave it as is for now 👍🏼
Approved from my side. Thanks for these changes.
Approved! I left a small comment on a duplicate notebook. Thanks for all these changes, it's looking really good.
Example_PIZZA.ipynb
Outdated
Thanks for moving this! Although I think a second copy was created, do you mind deleting Example_PIZZA.ipynb or merging it with examples/example_PIZZA.ipynb?
Algorithm changes to handle common instances where the output tokens change location in response to perturbation. For example:
"It's 9:47. How long until 10?" -> "13 minutes."
We perturb it (say, with word-wise fixed perturbation, in this case removing "It's"), and get:
"9:47. How long until 10?" -> "It's 13 minutes until 10."
Previously, this phrasing change would cause us to calculate attribution for "It's" in the input, based on the difference between the output tokens "13" and "It's". That doesn't make sense and was leading to instability and weird results.
I've changed it so that we look for matching tokens in the original vs perturbed output (and top 20 logprobs). This means we now compare the probabilities/cosine similarities of "13" and "13" instead of "13" and "It's".
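Roughly, the matching idea looks like this (a hedged sketch only; the helper name and data structures are illustrative and not the repo's exact implementation, though the -100 fallback mirrors the diff above):

```python
# For each token in the original output, look up its probability among the perturbed
# output's tokens / top-20 logprob candidates, rather than comparing whatever token
# happens to sit at the same position after perturbation.
MISSING_TOKEN_LOGPROB = -100.0


def matched_logprobs(original_tokens: list[str], perturbed_top_logprobs: dict[str, float]) -> list[float]:
    """perturbed_top_logprobs maps each candidate token to its logprob in the perturbed output."""
    lps = []
    for tok in original_tokens:
        # e.g. compare "13" with "13" wherever it now appears, not with "It's"
        lps.append(perturbed_top_logprobs.get(tok, MISSING_TOKEN_LOGPROB))
    return lps
```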
This PR also includes a couple of bug fixes and some minor changes that seemed sensible to me, so I'm afraid it's a bit of a mixed bag.
Minor:
I renamed the repo to PIZZA, for obvious reasons.
I also did a fair bit of general tidying up for a general audience, e.g.:
Major:
Bug fixes: