A comprehensive implementation of n-gram language models with various smoothing techniques for natural text generation and analysis.
SmoothLM is an implementation of n-gram language models with three smoothing techniques: Laplace (add-one), Good-Turing, and linear interpolation. The project includes tools for text tokenization, language model training, next-word prediction, and perplexity calculation, providing a complete toolkit for understanding and implementing statistical language models.
- 📚 Tokenization: Custom regex-based tokenizer supporting various text patterns
- 🧩 N-gram Models: Implementation of n-gram models (n=1,3,5)
- 🧠 Smoothing Techniques:
- Laplace (Add-One) Smoothing
- Good-Turing Smoothing
- Linear Interpolation
- 🔍 Text Generation: Next word prediction using trained models
- 📊 Perplexity Analysis: Calculate and visualize perplexity scores
- 🧪 OOD Testing: Evaluation of model behavior in out-of-distribution scenarios
# Clone the repository
git clone https://github.com/yourusername/SmoothLM.git
cd SmoothLM
# Set up a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
python3 tokenizer.py
# Input prompt will appear
# Example: "Is that what you mean? I am unsure."
# Output: [['Is', 'that', 'what', 'you', 'mean', '?'], ['I', 'am', 'unsure', '.']]
python3 language_model.py <lm_type> <corpus_path>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to training corpus
# Example:
python3 language_model.py i ./corpus/pride_and_prejudice.txt
# Input prompt: "I am a woman."
# Output: score: 0.69092021
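A score of this kind is typically derived from the product of per-token conditional probabilities under the chosen model. Below is a minimal sketch of that computation; `ngram_prob(history, word)` and the `<s>`/`</s>` boundary markers are assumptions for illustration, not the project's actual API.

```python
import math

def sentence_score(tokens, ngram_prob, n=3):
    """Score a tokenized sentence as the product of conditional n-gram
    probabilities. `ngram_prob(history, word)` is a hypothetical callback
    returning P(word | history) under the chosen smoothing method."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    log_prob = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        log_prob += math.log(ngram_prob(history, padded[i]))
    return math.exp(log_prob)

# Toy check with a uniform model over a 100-word vocabulary:
print(sentence_score("I am a woman .".split(), lambda history, word: 1 / 100))
```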
python3 generator.py <lm_type> <corpus_path> <k>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to training corpus
# k: number of candidate words to display
# Example:
python3 generator.py i ./corpus/pride_and_prejudice.txt 3
# Input: "An apple a day keeps the doctor"
# Output:
# away 0.4
# happy 0.2
# fresh 0.1
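Conceptually, the generator ranks candidate words by their conditional probability given the last n−1 tokens of the prompt and keeps the top k. A minimal sketch under that assumption; `vocab` and `ngram_prob(history, word)` are hypothetical stand-ins for the trained model's vocabulary and probability lookup:

```python
def predict_next(prompt_tokens, vocab, ngram_prob, n=3, k=3):
    """Return the k most probable next words given the last n-1 prompt tokens."""
    padded = ["<s>"] * (n - 1) + prompt_tokens
    history = tuple(padded[len(padded) - (n - 1):])
    # Score every vocabulary word under the model and keep the k best.
    candidates = [(word, ngram_prob(history, word)) for word in vocab]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]
```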
python3 perplexity.py <lm_type> <corpus_path> <n> <split>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to corpus
# n: n-gram size (1, 3, or 5)
# split: 'train' or 'test'
# Example:
python3 perplexity.py g ./corpus/ulysses.txt 3 test
# Outputs perplexity scores to a file
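Perplexity is the exponentiated negative average log-probability per token over the chosen split. A minimal sketch of that calculation (the `ngram_prob` callback is again a hypothetical stand-in, not the script's actual interface):

```python
import math

def corpus_perplexity(sentences, ngram_prob, n=3):
    """Perplexity = exp of the negative mean per-token log-probability.

    `sentences` is a list of token lists; `ngram_prob(history, word)` returns
    P(word | history) under the chosen smoothing method."""
    total_log_prob, total_tokens = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            total_log_prob += math.log(ngram_prob(history, padded[i]))
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)
```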
The project evaluates different n-gram models and smoothing techniques for text generation:
- Unigram (n=1): Poor performance, with incoherent and essentially random predictions
- Trigram (n=3): Improved, but still lacks fluency, with inconsistencies in subject and meaning
- 5-gram (n=5): Best performance among the unsmoothed models, producing the most coherent results
- Laplace: Improved handling of unseen contexts, but overall weaker than other methods
- Good-Turing: Best overall performance, especially for higher n-values
- Linear Interpolation: Good performance that could be improved with better λ weights
Key findings from perplexity evaluation:
- Test set perplexity is consistently higher than training set perplexity
- Good-Turing outperforms Laplace for lower-order n-grams
- Linear interpolation shows challenges with data sparsity in higher-order n-grams
- Laplace smoothing produces higher perplexity than other methods
SmoothLM/
├── tokenizer.py # Text tokenization implementation
├── language_model.py # Language model implementation
├── generator.py # Text generation implementation
├── script.py # Perplexity calculation
├── graph.py # Plotting graphs
├── Pride and Prejudice - Jane Austen.txt # Jane Austen's novel corpus
├── Ulysses - James Joyce.txt # James Joyce's novel corpus
├── output/
│   ├── 2022101094_good_turing_1_test_perplexity_pride   # Perplexity score files
│   └── ...
└── README.md
The tokenizer handles various text patterns (see the sketch below the list), including:
- Regular words and punctuation
- URLs, hashtags, and mentions
- Percentages and numerical expressions
- Time expressions and periods
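As an illustration of the approach (the exact patterns in `tokenizer.py` may differ), a regex-based tokenizer along these lines splits text into sentences and then matches URLs, mentions, hashtags, numbers/percentages, words, and punctuation as separate tokens:

```python
import re

# Illustrative patterns; the actual regexes in tokenizer.py may differ.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"            # URLs
    r"|[@#]\w+"                # mentions and hashtags
    r"|\d+(?:\.\d+)?%?"        # numbers and percentages
    r"|\w+(?:'\w+)?"           # words, including simple contractions
    r"|[^\w\s]"                # punctuation
)
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def tokenize(text):
    """Split text into sentences, then each sentence into tokens."""
    return [TOKEN_PATTERN.findall(sentence)
            for sentence in SENTENCE_SPLIT.split(text) if sentence.strip()]

print(tokenize("Is that what you mean? I am unsure."))
# [['Is', 'that', 'what', 'you', 'mean', '?'], ['I', 'am', 'unsure', '.']]
```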
Implementation follows the formula:

$$P_{GT}(w_1 \dots w_n) = \frac{r^*}{N}, \qquad \text{where } r^* = \frac{(r+1)\,S(r+1)}{S(r)}$$

For unseen events: $P_{GT}(w_1 \dots w_n) = N_1 / N$
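A minimal sketch of this adjustment, using raw frequency-of-frequency counts N_r in place of the smoothed values S(r) called for above (the real implementation presumably smooths these counts, e.g. by regression, so that S(r+1) is never zero):

```python
from collections import Counter

def good_turing_probs(ngram_counts):
    """Map each seen n-gram to its Good-Turing probability r*/N and return the
    total probability mass N1/N reserved for unseen n-grams.

    Uses raw frequency-of-frequencies N_r as a stand-in for the smoothed S(r).
    """
    N = sum(ngram_counts.values())
    freq_of_freq = Counter(ngram_counts.values())   # N_r: how many n-grams occur r times
    probs = {}
    for ngram, r in ngram_counts.items():
        if freq_of_freq.get(r + 1, 0) > 0:
            r_star = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            r_star = r                              # fall back to the raw count
        probs[ngram] = r_star / N
    unseen_mass = freq_of_freq.get(1, 0) / N        # N_1 / N
    return probs, unseen_mass
```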
Combines multiple n-gram models with λ weights:

$$P(w_n \mid w_1 \dots w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \dots + \lambda_n P_n(w_n \mid w_1 \dots w_{n-1})$$
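A minimal sketch for the trigram case, with hypothetical `p_uni`, `p_bi`, and `p_tri` maximum-likelihood callbacks and illustrative fixed λ weights (the project may set λ differently, e.g. by deleted interpolation):

```python
def interpolated_prob(history, word, p_uni, p_bi, p_tri,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(word | history) as a lambda-weighted sum of unigram, bigram, and
    trigram estimates. `history` holds the two preceding words; p_uni, p_bi,
    and p_tri are hypothetical MLE callbacks; the lambda weights sum to 1."""
    w1, w2 = history                      # w1 w2 word  ==  w_{n-2} w_{n-1} w_n
    l1, l2, l3 = lambdas
    return (l1 * p_uni(word)
            + l2 * p_bi(word, w2)
            + l3 * p_tri(word, w1, w2))
```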
This project was developed as part of the Introduction to NLP (CS7.401) course at IIIT Hyderabad for Spring 2025. The implementation follows the assignment guidelines and was completed by January 23rd, 2025.
- Introduction to N-Gram Language Modeling Methods
- Jurafsky & Martin - Speech and Language Processing
- Good-Turing Smoothing Paper
- Linear Interpolation for Language Modeling
Mayank Mittal (2022101094), International Institute of Information Technology, Hyderabad