A comprehensive implementation of n-gram language models with various smoothing techniques for natural text generation and analysis.
SmoothLM is an implementation of n-gram language models with three smoothing techniques: Laplace (add-one), Good-Turing, and linear interpolation. The project includes tools for text tokenization, language model training, next-word prediction, and perplexity calculation, providing a complete toolkit for understanding and implementing statistical language models.
- 📚 Tokenization: Custom regex-based tokenizer supporting various text patterns
- 🧩 N-gram Models: Implementation of n-gram models (n=1,3,5)
- 🧠 Smoothing Techniques:
- Laplace (Add-One) Smoothing
- Good-Turing Smoothing
- Linear Interpolation
- 🔍 Text Generation: Next word prediction using trained models
- 📊 Perplexity Analysis: Calculate and visualize perplexity scores
- 🧪 OOD Testing: Evaluation of model behavior in out-of-distribution scenarios
# Clone the repository
git clone https://github.com/yourusername/SmoothLM.git
cd SmoothLM
# Set up a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
python3 tokenizer.py
# Input prompt will appear
# Example: "Is that what you mean? I am unsure."
# Output: [['Is', 'that', 'what', 'you', 'mean', '?'], ['I', 'am', 'unsure', '.']]
python3 language_model.py <lm_type> <corpus_path>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to training corpus
# Example:
python3 language_model.py i ./corpus/pride_and_prejudice.txt
# Input prompt: "I am a woman."
# Output: score: 0.69092021
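A score of this kind is typically derived from the product of per-token conditional probabilities under the chosen model. Below is a minimal sketch of that computation; `ngram_prob(history, word)` and the `<s>`/`</s>` boundary markers are assumptions for illustration, not the project's actual API.

```python
import math

def sentence_score(tokens, ngram_prob, n=3):
    """Score a tokenized sentence as the product of conditional n-gram
    probabilities. `ngram_prob(history, word)` is a hypothetical callback
    returning P(word | history) under the chosen smoothing method."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    log_prob = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        log_prob += math.log(ngram_prob(history, padded[i]))
    return math.exp(log_prob)

# Toy check with a uniform model over a 100-word vocabulary:
print(sentence_score("I am a woman .".split(), lambda history, word: 1 / 100))
```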
python3 generator.py <lm_type> <corpus_path> <k>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to training corpus
# k: number of candidate words to display
# Example:
python3 generator.py i ./corpus/pride_and_prejudice.txt 3
# Input: "An apple a day keeps the doctor"
# Output:
# away 0.4
# happy 0.2
# fresh 0.1
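Conceptually, the generator ranks candidate words by their conditional probability given the last n−1 tokens of the prompt and keeps the top k. A minimal sketch under that assumption; `vocab` and `ngram_prob(history, word)` are hypothetical stand-ins for the trained model's vocabulary and probability lookup:

```python
def predict_next(prompt_tokens, vocab, ngram_prob, n=3, k=3):
    """Return the k most probable next words given the last n-1 prompt tokens."""
    padded = ["<s>"] * (n - 1) + prompt_tokens
    history = tuple(padded[len(padded) - (n - 1):])
    # Score every vocabulary word under the model and keep the k best.
    candidates = [(word, ngram_prob(history, word)) for word in vocab]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]
```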
python3 perplexity.py <lm_type> <corpus_path> <n> <split>
# lm_type: 'l' (Laplace), 'g' (Good-Turing), 'i' (Interpolation)
# corpus_path: path to corpus
# n: n-gram size (1, 3, or 5)
# split: 'train' or 'test'
# Example:
python3 perplexity.py g ./corpus/ulysses.txt 3 test
# Outputs perplexity scores to a file
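Perplexity is the exponentiated negative average log-probability per token over the chosen split. A minimal sketch of that calculation (the `ngram_prob` callback is again a hypothetical stand-in, not the script's actual interface):

```python
import math

def corpus_perplexity(sentences, ngram_prob, n=3):
    """Perplexity = exp of the negative mean per-token log-probability.

    `sentences` is a list of token lists; `ngram_prob(history, word)` returns
    P(word | history) under the chosen smoothing method."""
    total_log_prob, total_tokens = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            total_log_prob += math.log(ngram_prob(history, padded[i]))
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)
```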
The project evaluates different n-gram models and smoothing techniques for text generation:
- Unigram (n=1): Poor performance, with incoherent and essentially random predictions
- Trigram (n=3): Improved, but still lacks fluency, with inconsistencies in subject and meaning
- 5-gram (n=5): Best performance among the unsmoothed models, producing the most coherent results
- Laplace: Improved handling of unseen contexts, but overall weaker than other methods
- Good-Turing: Best overall performance, especially for higher n-values
- Linear Interpolation: Good performance that could be improved with better λ weights
Key findings from perplexity evaluation:
- Test set perplexity is consistently higher than training set perplexity
- Good-Turing outperforms Laplace for lower-order n-grams
- Linear interpolation shows challenges with data sparsity in higher-order n-grams
- Laplace smoothing produces higher perplexity than other methods
SmoothLM/
├── tokenizer.py # Text tokenization implementation
├── language_model.py # Language model implementation
├── generator.py # Text generation implementation
├── script.py # Perplexity calculation
├── graph.py # Plotting graphs
├── Pride and Prejudice - Jane Austen.txt # Jane Austen's novel corpus
├── Ulysses - James Joyce.txt # James Joyce's novel corpus
├── output/
│   ├── 2022101094_good_turing_1_test_perplexity_pride   # Perplexity score files
│   └── ...
└── README.md
The tokenizer handles various text patterns (see the sketch below the list), including:
- Regular words and punctuation
- URLs, hashtags, and mentions
- Percentages and numerical expressions
- Time expressions and periods
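As an illustration of the approach (the exact patterns in `tokenizer.py` may differ), a regex-based tokenizer along these lines splits text into sentences and then matches URLs, mentions, hashtags, numbers/percentages, words, and punctuation as separate tokens:

```python
import re

# Illustrative patterns; the actual regexes in tokenizer.py may differ.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"            # URLs
    r"|[@#]\w+"                # mentions and hashtags
    r"|\d+(?:\.\d+)?%?"        # numbers and percentages
    r"|\w+(?:'\w+)?"           # words, including simple contractions
    r"|[^\w\s]"                # punctuation
)
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def tokenize(text):
    """Split text into sentences, then each sentence into tokens."""
    return [TOKEN_PATTERN.findall(sentence)
            for sentence in SENTENCE_SPLIT.split(text) if sentence.strip()]

print(tokenize("Is that what you mean? I am unsure."))
# [['Is', 'that', 'what', 'you', 'mean', '?'], ['I', 'am', 'unsure', '.']]
```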
Implementation follows the formula:

$$P_{GT}(w_1 \dots w_n) = \frac{r^*}{N}, \qquad \text{where } r^* = \frac{(r+1)\,S(r+1)}{S(r)}$$

For unseen events: $P_{GT}(w_1 \dots w_n) = N_1 / N$
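A minimal sketch of this adjustment, using raw frequency-of-frequency counts N_r in place of the smoothed values S(r) called for above (the real implementation presumably smooths these counts, e.g. by regression, so that S(r+1) is never zero):

```python
from collections import Counter

def good_turing_probs(ngram_counts):
    """Map each seen n-gram to its Good-Turing probability r*/N and return the
    total probability mass N1/N reserved for unseen n-grams.

    Uses raw frequency-of-frequencies N_r as a stand-in for the smoothed S(r).
    """
    N = sum(ngram_counts.values())
    freq_of_freq = Counter(ngram_counts.values())   # N_r: how many n-grams occur r times
    probs = {}
    for ngram, r in ngram_counts.items():
        if freq_of_freq.get(r + 1, 0) > 0:
            r_star = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            r_star = r                              # fall back to the raw count
        probs[ngram] = r_star / N
    unseen_mass = freq_of_freq.get(1, 0) / N        # N_1 / N
    return probs, unseen_mass
```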
Combines multiple n-gram models with λ weights:

$$P(w_n \mid w_1 \dots w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \dots + \lambda_n P_n(w_n \mid w_1 \dots w_{n-1})$$
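A minimal sketch for the trigram case, with hypothetical `p_uni`, `p_bi`, and `p_tri` maximum-likelihood callbacks and illustrative fixed λ weights (the project may set λ differently, e.g. by deleted interpolation):

```python
def interpolated_prob(history, word, p_uni, p_bi, p_tri,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(word | history) as a lambda-weighted sum of unigram, bigram, and
    trigram estimates. `history` holds the two preceding words; p_uni, p_bi,
    and p_tri are hypothetical MLE callbacks; the lambda weights sum to 1."""
    w1, w2 = history                      # w1 w2 word  ==  w_{n-2} w_{n-1} w_n
    l1, l2, l3 = lambdas
    return (l1 * p_uni(word)
            + l2 * p_bi(word, w2)
            + l3 * p_tri(word, w1, w2))
```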
This project was developed as part of the Introduction to NLP (CS7.401) course at IIIT Hyderabad for Spring 2025. The implementation follows the assignment guidelines and was completed by January 23rd, 2025.
- Introduction to N-Gram Language Modeling Methods
- Jurafsky & Martin - Speech and Language Processing
- Good-Turing Smoothing Paper
- Linear Interpolation for Language Modeling
Mayank Mittal (2022101094), International Institute of Information Technology, Hyderabad