Catch max tokens before exceeding it #193

Closed
aedocw opened this issue Jan 11, 2024 · 6 comments
Labels: enhancement (New feature or request)

@aedocw (Owner) commented Jan 11, 2024

Sometimes a sentence is too long and Coqui sends a warning saying the text exceeds the max token count and may result in truncated speech. On Discord, a user shared code that uses the GPT-2 tokenizer to count the tokens. It would be good to take this and either implement it as-is, or use parts of it to break up sentences more intelligently before sending them to TTS.

import re
from transformers import GPT2Tokenizer

def split_text_into_chunks(text, max_tokens=225):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # lowercase the text
    text = text.lower()

    # Step 1: Split by new line
    lines = text.split('\n')
    # Remove empty lines
    lines = [line for line in lines if line.strip()]
    # Remove Note Lines starting with "{{" and ending with "}}"
    lines = [line for line in lines if not (line.startswith("{{") and line.endswith("}}"))]

    chunks = []

    for line in lines:
        line_tokens = tokenizer.tokenize(line)
        line_length = len(line_tokens)

        # Step 2: Check the length of the chunks
        if line_length <= max_tokens:
            chunks.append(line)
        else:
            # Step 3: Reduce by sentence if the line is longer than the max tokens
            sentences = re.split(r'(?<=[.!?]) +', line)
            current_chunk = []
            current_length = 0

            for sentence in sentences:
                sentence_tokens = tokenizer.tokenize(sentence)
                sentence_length = len(sentence_tokens)

                if current_length + sentence_length <= max_tokens:
                    current_chunk.append(sentence)
                    current_length += sentence_length
                else:
                    # Add the current chunk to chunks and start a new chunk
                    if current_chunk:
                        chunks.append(' '.join(current_chunk).strip())

                    current_chunk = [sentence]
                    current_length = sentence_length

            # Add the last chunk if it's not empty
            if current_chunk:
                chunks.append(' '.join(current_chunk).strip())

    return chunks
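
A minimal usage sketch for the function above (assuming transformers is installed and the function is in scope; the sample text and chunk size are made up for illustration):

sample = (
    "First sentence of the chapter. A second, fairly short sentence.\n"
    "{{ a note line that should be dropped }}\n"
    "A much longer paragraph would be split sentence by sentence here."
)
for i, chunk in enumerate(split_text_into_chunks(sample, max_tokens=225)):
    print(i, len(chunk), chunk)
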
@Vodou4460

Hey, this seems very interesting.

aedocw self-assigned this Jan 11, 2024
aedocw added the enhancement label Jan 11, 2024
@aedocw (Owner, Author) commented Jan 15, 2024

I pushed up a branch that has this code in it. Seems to work fine in initial testing for me, but I only saw this error very rarely. If anyone has text that reliably triggers this, please give it a try and report back.

In implementing this I realized I'm breaking things up into sentences multiple times. Once I'm satisfied this works well, I'll go back and clean things up so it really just goes from chapters, to token-length sentence chunks, to TTS.

@aedocw (Owner, Author) commented Jan 16, 2024

I don't think this is working, or maybe it's the way I've implemented it? It seems to be conflating some things. The specific error I thought this would address is the "character limit of XX for language" warning, and elsewhere I've seen errors about exceeding something like 400 tokens. I think these are two separate limitations.

What this code does is take a line and use yet another tokenizer to break it into tokens, then count the tokens in that line:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> line = "this is a line that is being used for this example."
>>> line_tokens = tokenizer.tokenize(line)
>>> line_tokens
['this', 'Ġis', 'Ġa', 'Ġline', 'Ġthat', 'Ġis', 'Ġbeing', 'Ġused', 'Ġfor', 'Ġthis', 'Ġexample', '.']
>>> len(line_tokens)
12
>>> len(line)
51

The ideal solution would:

  1. Find what the character limit is for each language (this must be exposed somewhere, just have to dig around).
  2. Break up any sentence longer than that many characters (a rough sketch follows after this list).
  3. No need to count tokens to stay under 400, because every sentence is going to be kept under 250 characters or so.
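
A rough sketch of what steps 1 and 2 could look like (the limit values mirror the XTTS tokenizer quoted in the next comment; falling back to splitting on whitespace is my assumption, not anything settled here):

# Per-language character limits (same values as the XTTS tokenizer quoted below).
CHAR_LIMITS = {"en": 250, "de": 253, "fr": 273, "es": 239, "ja": 71}

def split_sentence_by_chars(sentence, lang="en"):
    """Split one sentence into pieces no longer than the language's character limit."""
    limit = CHAR_LIMITS.get(lang, 250)
    if len(sentence) <= limit:
        return [sentence]
    pieces, current = [], ""
    # Fall back to breaking on whitespace, keeping each piece under the limit.
    # (A single word longer than the limit is passed through unchanged.)
    for word in sentence.split():
        if current and len(current) + 1 + len(word) > limit:
            pieces.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        pieces.append(current)
    return pieces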

@mmol67 commented Jan 17, 2024

Find what the character limit is for each language (this must be exposed somewhere, just have to dig around)

Hello. Are you talking about this?
class VoiceBpeTokenizer:
    def __init__(self, vocab_file=None):
        self.tokenizer = None
        if vocab_file is not None:
            self.tokenizer = Tokenizer.from_file(vocab_file)
        self.char_limits = {
            "en": 250,
            "de": 253,
            "fr": 273,
            "es": 239,
            "it": 213,
            "pt": 203,
            "pl": 224,
            "zh-cn": 82,
            "ar": 166,
            "cs": 186,
            "ru": 182,
            "nl": 251,
            "tr": 226,
            "ja": 71,
            "hu": 224,
            "ko": 95,
        }

I found the reference here: coqui-ai/TTS#3197

They say they changed the limit to 2500 and it's working (sometimes)!?!
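
If that's the dict XTTS actually enforces, the limit could in principle be read off the tokenizer instance instead of being hard-coded. A hypothetical helper (how you get hold of the VoiceBpeTokenizer instance depends on the TTS version, so treat this as a sketch):

def char_limit_for(tokenizer, lang, default=250):
    # char_limits is the dict shown in the snippet above; fall back to the English
    # value if the attribute is missing or the language isn't listed.
    return getattr(tokenizer, "char_limits", {}).get(lang, default)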

@unifirer (quoting the Jan 15 comment above):
I pushed up a branch that has this code in it. Seems to work fine in initial testing for me, but I only saw this error very rarely. If anyone has text that reliably triggers this, please give it a try and report back.

In implementing this I realize I'm breaking things up into sentences multiple times. Once I'm satisfied this works well, I'll try to go back and clean things up so it really just goes from chapters, to token-length sentence chunks, to tts.

Try any book; it happens 50 times per book.

@aedocw (Owner, Author) commented Apr 6, 2024

Since switching to sending only one sentence at a time to TTS, I have not been able to reproduce this. Closing now, but if you can reproduce it reliably, please include a sample that triggers it.

@aedocw aedocw closed this as completed Apr 6, 2024