
Algorithm changes to handle output tokens shifting position in response to perturbation #8

Merged: 27 commits merged into main, Jul 5, 2024

Conversation

@jessicarumbelow (Contributor) commented on Jun 28, 2024

Algorithm changes to handle common instances where the output tokens change location in response to perturbation. For example:

"It's 9:47. How long until 10?" -> "13 minutes."

We perturb it (say, with word-wise fixed perturbation, in this case removing "It's"), and get:

"9:47. How long until 10?" -> "It's 13 minutes until 10."

Previously, this phrasing change would cause us to calculate attribution for "It's" in the input, based on the difference between the output tokens "13" and "It's". That doesn't make sense and was leading to instability and weird results.

I've changed it so that we look for matching tokens in the original vs perturbed output (and top 20 logprobs). This means we now compare the probabilities/cosine similarities of "13" and "13" instead of "13" and "It's".
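
For intuition, here is a minimal sketch of the matching step. This is not the repo's actual code; the function name and the assumption that each perturbed-output position carries a dict of its top-20 candidate tokens mapped to logprobs (roughly the OpenAI API shape) are mine.

# A minimal sketch of the matching idea, not the repo's actual code.
LOW_LOGPROB = -100.0  # fallback when a token is missing from the perturbed output entirely

def matched_logprobs(original_tokens, perturbed_top_logprobs):
    """For each original output token, take its logprob wherever it appears in the
    perturbed output's top logprobs, instead of comparing whatever token happens
    to sit at the same position."""
    new_logprobs = []
    for tok in original_tokens:
        found = LOW_LOGPROB
        for candidates in perturbed_top_logprobs:  # one dict per perturbed position
            if tok in candidates:
                found = candidates[tok]
                break
        new_logprobs.append(found)
    return new_logprobs

# e.g. matched_logprobs(["13", " minutes"], [{"It's": -0.1}, {"13": -0.2}, {" minutes": -0.3}])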

This PR also includes a couple of bug fixes and some minor changes that seemed sensible to me, so I'm afraid it's a bit of a mixed bag.

Minor:

I renamed the repo to PIZZA, for obvious reasons.
I also did a fair bit of tidying up for a general audience, e.g.:

  • Renamed APILLMAttributor to OpenAIAttributor, since that's the only API we support right now.
  • Removed local model passing to OpenAIAttributor – we actually only need the embeddings and tokenizer, so the previous use of the local model (and the resulting reliance on the embeddings being accessible via the default huggingface signature) was unnecessary. This way the user should be able to use any embedding matrix (see the sketch after this list).
  • Made the OpenAI model selection an argument instead of hardcoded.
  • Made a new notebook for the blog post examples. This is paused until Andrew's improvements are ready.
  • Consolidated all example code into the examples folder – previously it was split between "research" and "examples".
  • Changed the debug logging from logging just the perturbed token to logging the entire perturbed input. I feel this is much more readable and makes interpreting the results more intuitive, since you can see the same input that the model saw.
  • Changed some of the ordering in the readme to highlight the OpenAI attribution, since that's the most interesting part to my mind.
  • Changed tokenizer access in get_sentence_embeddings, for consistency.
  • Made the colour map nicer (centered, visually divergent).
  • Changed the default fixed perturbation substrate to the empty string "" (since using the " " space token generally causes double spaces in perturbed inputs).
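
As a rough illustration of the bullet about dropping the local model: the attributor only needs a tokenizer and an input-embedding matrix. This is a hedged setup sketch; the commented-out argument names are assumptions, not the verified OpenAIAttributor signature.

# Hypothetical setup sketch: only the tokenizer and the embedding matrix are
# needed, not a full local model passed into the attributor.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
embeddings = AutoModelForCausalLM.from_pretrained("gpt2").get_input_embeddings().weight.detach()

# Any (vocab_size, hidden_dim) embedding matrix would do here; it is only used
# to embed output tokens for the cosine-similarity metric.
# attributor = OpenAIAttributor(          # argument names are assumed
#     openai_model="gpt-3.5-turbo",
#     tokenizer=tokenizer,
#     token_embeddings=embeddings,
# )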

Major:

  • Added "ignore_output_token_location" to make attribution work even if the output tokens change position.
  • Similarly, changed cosine-similarity attribution so it doesn't assume same output length and token position after perturbation.
  • Refactored cosine_similarity so it works on any input and output strings (rather than assuming openAI choices, and getting the strings from them).

Bug fixes:

  • Word-wise perturbation previously relied on the tokenizer adding leading spaces, which wasn't documented anywhere and is a weird gotcha for people in the future who might want to use a different tokenizer that doesn't have an "add leading spaces" argument. I fixed it so that we don't assume the tokenizer does this and handle splitting words on spaces separately (a minimal sketch follows this list).
  • Fixed a bug where cosine similarity (encoded by the local tokenizer) was assumed to have the same number of tokens as the OpenAI token output – this was causing output to get cut off in logging.
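
As promised above, a minimal sketch of word-wise splitting that doesn't lean on tokenizer leading-space behaviour. The function and argument names here are made up for illustration and may differ from the repo's implementation.

# Illustrative sketch only; the actual implementation may differ.
def word_wise_units(text: str, substrate: str = ""):
    """Split the input into word-level perturbation units by splitting on spaces,
    without relying on the tokenizer adding leading spaces. Each perturbed variant
    replaces one word with the substrate (default: the empty string, i.e. removal)."""
    words = text.split(" ")
    perturbed_inputs = []
    for i in range(len(words)):
        kept = words[:i] + ([substrate] if substrate else []) + words[i + 1:]
        perturbed_inputs.append(" ".join(kept))
    return words, perturbed_inputs

# Example: removing "It's" from "It's 9:47. How long until 10?" yields
# "9:47. How long until 10?"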

@sebastian-sosa (Contributor) commented on Jul 2, 2024

There are a couple of files whose formatting doesn't adhere to what the ruff linter expects. Do you mind installing the ruff extension in VS Code and enabling it, so that it fixes styling errors on save?

It may also be good to install the Pylance VS Code extension.

@jessicarumbelow changed the title from "Jessica" to "Algorithm changes to handle output tokens shifting position in response to perturbation" on Jul 2, 2024
@robbiemccorkell (Contributor) left a comment:

Some comments, mostly on style and bits in the README.

Also, as I think Seba said already, it looks like some linter/formatting errors have crept in, which probably means you don't have the Ruff extension installed.

This is partly my fault, because I didn't put instructions in the repo on how to set up the pre-commit hook, which would have caught this. I also didn't add a formatting check to the pipeline, though the linter is failing in CI.

    new_lp = all_top_logprobs[all_toks.index(otl.token)]
else:
    new_lp = -100
Contributor:

We can move this number to a constant defined on top of the script.
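
For concreteness, the suggestion might look something like this; the constant name and the helper wrapper are assumptions, not the actual code.

MISSING_TOKEN_LOGPROB = -100  # defined once at the top of the script

def logprob_for(token, all_toks, all_top_logprobs):
    """Return the logprob of `token` from the perturbed output's top logprobs,
    or the fallback constant if it doesn't appear at all."""
    if token in all_toks:
        return all_top_logprobs[all_toks.index(token)]
    return MISSING_TOKEN_LOGPROB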

Contributor:

Should we move this to /examples?

Also, this may be a small nitpick, but given that python script names usually are lowercase and snake_case, maybe we can make it example_PIZZA.ipynb

# Return difference in sentence similarity and token similarities
return self_similarity - sentence_similarity, 1 - token_similarities
cd = 1-cosine_similarity(original_output_emb, perturbed_output_emb)
token_distance = cd.min(axis=-1)
Contributor:

What's the intuition behind how this would behave for longer inputs/outputs?

I'm thinking that with longer strings, the chances of having frequent tokens that don't add semantic meaning (the, a, etc.) increase, and therefore the chance of token_distance being low because these tokens appear in different places also increases.

Contributor:

On a different note, do you mind explaining to me why we compute the min across axis=-1?

Is this dimension of length token_embeddings.shape[-1], so same as the vocabulary length? I may be wrong though.

Contributor (author):

Sure! So, cosine_similarity actually gives us a pairwise score between every unperturbed output embedding and every perturbed output embedding. So if unperturbed output embeddings is shape (10,768), and the perturbed is (6, 768), cosine_distance = 1-cosine_similarity(original_output_emb, perturbed_output_emb) gives us a matrix of shape (10, 6). And we want the minimum values for each unperturbed output embedding, so we take the min over the last dimension to give an attribution vector length 10 – i.e. one element for each token in the original unperturbed output.
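
A small sketch of those shapes, using sklearn's pairwise cosine_similarity and random embeddings as stand-ins for the real token embeddings:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

original_output_emb = np.random.rand(10, 768)   # 10 unperturbed output tokens
perturbed_output_emb = np.random.rand(6, 768)   # 6 perturbed output tokens

# Pairwise distance between every unperturbed token and every perturbed token
cosine_distance = 1 - cosine_similarity(original_output_emb, perturbed_output_emb)  # shape (10, 6)

# For each unperturbed token, keep the distance to its closest perturbed token,
# giving one attribution value per original output token.
token_distance = cosine_distance.min(axis=-1)   # shape (10,)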

Contributor (author):

Yeah, we can't really capture repetitions - it's not recognised as a change in the output, because we pick the lowest distance for each embedding. I'm not sure how to get around this though.

Contributor:

Thanks! That makes sense. Yeah we're computing the pairwise similarities instead of comparing elements in the same position as before. I think that's good although I agree it has some limitations -- repeated tokens and frequent tokens which appear twice for different reasons may affect the attribution score. I think we can leave it as is for now 👍🏼

@robbiemccorkell (Contributor) left a review:

Approved from my side. Thanks for these changes.

@sebastian-sosa (Contributor) left a review:

Approved! I left a small comment on a duplicate notebook. Thanks for all these changes, it's looking really good.

Contributor:

Thanks for moving this! It looks like a second copy was created, though. Do you mind deleting Example_PIZZA.ipynb or merging it with examples/example_PIZZA.ipynb?


@robbiemccorkell merged commit 79d4136 into main on Jul 5, 2024 (1 check passed).
@robbiemccorkell deleted the jessica branch on July 5, 2024 at 14:30.