Catch max tokens before exceeding it #193

Closed
aedocw opened this issue Jan 11, 2024 · 6 comments
Labels: enhancement (New feature or request)

@aedocw (Owner) commented Jan 11, 2024

Sometimes a sentence is too long and Coqui sends a warning saying the text exceeds the max token count and may result in truncated speech. On Discord, a user shared code that uses the GPT-2 tokenizer to count the tokens. It would be good to take this and either implement it as-is, or use parts of it to break up sentences more intelligently before sending them to TTS.

import re
from transformers import GPT2Tokenizer

def split_text_into_chunks(text, max_tokens=225):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # lowercase the text
    text = text.lower()

    # Step 1: Split by new line
    lines = text.split('\n')
    # Remove empty lines
    lines = [line for line in lines if line.strip()]
    # Remove Note Lines starting with "{{" and ending with "}}"
    lines = [line for line in lines if not (line.startswith("{{") and line.endswith("}}"))]

    chunks = []

    for line in lines:
        line_tokens = tokenizer.tokenize(line)
        line_length = len(line_tokens)

        # Step 2: Check the length of the chunks
        if line_length <= max_tokens:
            chunks.append(line)
        else:
            # Step 3: Reduce by sentence if the line is longer than the max tokens
            sentences = re.split(r'(?<=[.!?]) +', line)
            current_chunk = []
            current_length = 0

            for sentence in sentences:
                sentence_tokens = tokenizer.tokenize(sentence)
                sentence_length = len(sentence_tokens)

                if current_length + sentence_length <= max_tokens:
                    current_chunk.append(sentence)
                    current_length += sentence_length
                else:
                    # Add the current chunk to chunks and start a new chunk
                    if current_chunk:
                        chunks.append(' '.join(current_chunk).strip())

                    current_chunk = [sentence]
                    current_length = sentence_length

            # Add the last chunk if it's not empty
            if current_chunk:
                chunks.append(' '.join(current_chunk).strip())

    return chunks
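
A minimal usage sketch for the function above (assuming transformers is installed and the function is in scope; the sample text and chunk size are made up for illustration):

sample = (
    "First sentence of the chapter. A second, fairly short sentence.\n"
    "{{ a note line that should be dropped }}\n"
    "A much longer paragraph would be split sentence by sentence here."
)
for i, chunk in enumerate(split_text_into_chunks(sample, max_tokens=225)):
    print(i, len(chunk), chunk)
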
@Vodou4460

Hey, this seems very interesting.

aedocw self-assigned this Jan 11, 2024
aedocw added the enhancement label Jan 11, 2024
@aedocw (Owner, Author) commented Jan 15, 2024

I pushed up a branch that has this code in it. Seems to work fine in initial testing for me, but I only saw this error very rarely. If anyone has text that reliably triggers this, please give it a try and report back.

In implementing this I realized I'm breaking things up into sentences multiple times. Once I'm satisfied this works well, I'll go back and clean things up so it really just goes from chapters, to token-length sentence chunks, to TTS.

@aedocw (Owner, Author) commented Jan 16, 2024

I don't think this is working, or maybe it's the way I've implemented it? It seems to be conflating some things. The specific error I thought this would address is the "character limit of XX for language" warning, and elsewhere I've seen errors about exceeding something like 400 tokens. I think these are two separate limitations.

What this code does is take a line and use yet another tokenizer to break it into tokens, then count the tokens in that line:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> line = "this is a line that is being used for this example."
>>> line_tokens = tokenizer.tokenize(line)
>>> line_tokens
['this', 'Ġis', 'Ġa', 'Ġline', 'Ġthat', 'Ġis', 'Ġbeing', 'Ġused', 'Ġfor', 'Ġthis', 'Ġexample', '.']
>>> len(line_tokens)
12
>>> len(line)
51

The ideal solution would:

  1. Find what the character limit is for each language (this must be exposed somewhere, just have to dig around).
  2. Break up any sentence longer than that many characters (a rough sketch follows after this list).
  3. No need to count tokens to stay under 400, because every sentence is going to be kept under 250 characters or so.
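
A rough sketch of what steps 1 and 2 could look like (the limit values mirror the XTTS tokenizer quoted in the next comment; falling back to splitting on whitespace is my assumption, not anything settled here):

# Per-language character limits (same values as the XTTS tokenizer quoted below).
CHAR_LIMITS = {"en": 250, "de": 253, "fr": 273, "es": 239, "ja": 71}

def split_sentence_by_chars(sentence, lang="en"):
    """Split one sentence into pieces no longer than the language's character limit."""
    limit = CHAR_LIMITS.get(lang, 250)
    if len(sentence) <= limit:
        return [sentence]
    pieces, current = [], ""
    # Fall back to breaking on whitespace, keeping each piece under the limit.
    # (A single word longer than the limit is passed through unchanged.)
    for word in sentence.split():
        if current and len(current) + 1 + len(word) > limit:
            pieces.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        pieces.append(current)
    return pieces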

@mmol67 commented Jan 17, 2024

Find what the character limit is for each language (this must be exposed somewhere, just have to dig around)

Hello. Are you talking about this?
class VoiceBpeTokenizer:
    def __init__(self, vocab_file=None):
        self.tokenizer = None
        if vocab_file is not None:
            self.tokenizer = Tokenizer.from_file(vocab_file)
        self.char_limits = {
            "en": 250,
            "de": 253,
            "fr": 273,
            "es": 239,
            "it": 213,
            "pt": 203,
            "pl": 224,
            "zh-cn": 82,
            "ar": 166,
            "cs": 186,
            "ru": 182,
            "nl": 251,
            "tr": 226,
            "ja": 71,
            "hu": 224,
            "ko": 95,
        }

I found the reference here: coqui-ai/TTS#3197

They say they changed the limit to 2500 and it's working (sometimes)!?!
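
If that's the dict XTTS actually enforces, the limit could in principle be read off the tokenizer instance instead of being hard-coded. A hypothetical helper (how you get hold of the VoiceBpeTokenizer instance depends on the TTS version, so treat this as a sketch):

def char_limit_for(tokenizer, lang, default=250):
    # char_limits is the dict shown in the snippet above; fall back to the English
    # value if the attribute is missing or the language isn't listed.
    return getattr(tokenizer, "char_limits", {}).get(lang, default)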

@unifirer (quoting the Jan 15 comment above):
I pushed up a branch that has this code in it. Seems to work fine in initial testing for me, but I only saw this error very rarely. If anyone has text that reliably triggers this, please give it a try and report back.

In implementing this I realize I'm breaking things up into sentences multiple times. Once I'm satisfied this works well, I'll try to go back and clean things up so it really just goes from chapters, to token-length sentence chunks, to tts.

Try any book; it happens 50 times per book.

@aedocw (Owner, Author) commented Apr 6, 2024

Since switching to sending only one sentence at a time to TTS, I have not been able to reproduce this. Closing now, but if you can reproduce it reliably, please include a sample that triggers it.

@aedocw aedocw closed this as completed Apr 6, 2024