Domain Classification for Keboola

This project provides scripts to train a model for classifying domains based on their descriptions, and then make predictions using the trained model. The scripts are specifically designed to work within the Keboola data platform.

Model Architecture

Best Model Overview

The best-performing model in this project is a Logistic Regression classifier with the following characteristics:

Embedding Model: Sentence-BERT all-MiniLM-L6-v2 for converting text descriptions into meaningful numeric vectors
Class Weighting: Employs balanced class weights to handle class imbalance in the training data
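Training Data: Trained on the full dataset with no train/test split, so every labeled example is used for fitting; the metrics reported by the training script are therefore computed on the training data itself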
Random Seed: Uses a fixed random seed (42) for all random processes to ensure reproducibility
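
As an illustration of the class weighting, the sketch below computes "balanced" weights in plain Python. It assumes the weights follow scikit-learn's convention for class_weight='balanced', that is n_samples / (n_classes * count); the README does not state the exact formula used by the training script. The 'other' label and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count).

    This mirrors scikit-learn's 'balanced' heuristic; whether the training
    script uses exactly this formula is an assumption.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution, for illustration only.
labels = ["other"] * 6 + ["partner"] * 2 + ["partner-seo"] * 2
weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Rare classes receive larger weights, so misclassifying a 'partner' example costs more during training than misclassifying an example from a frequent class.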

How the Model Works

Text Embedding: Domain descriptions are transformed into 384-dimensional embeddings using Sentence-BERT
Classification: A logistic regression model predicts the category for each domain description
Domain Aggregation: URL-level predictions are aggregated to domain-level using a rule-based approach:
- If any URL for a domain is classified as 'partner', the entire domain is classified as 'partner'
- If any URL is classified as 'partner-seo' (and none are 'partner'), the domain is classified as 'partner-seo'
- Otherwise, the majority classification is used
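
The aggregation rule above can be sketched as a small function. This is a minimal illustration, not the code from the scripts: the function name, the 'other' and 'customer' labels, and the tie-breaking behaviour for the majority vote (first label encountered wins) are assumptions.

```python
from collections import Counter


def aggregate_domain(url_labels):
    """Collapse URL-level labels for one domain into a single label."""
    if not url_labels:
        raise ValueError("a domain needs at least one URL-level prediction")
    # Rule 1: any 'partner' URL makes the whole domain 'partner'.
    if "partner" in url_labels:
        return "partner"
    # Rule 2: otherwise any 'partner-seo' URL makes the domain 'partner-seo'.
    if "partner-seo" in url_labels:
        return "partner-seo"
    # Rule 3: otherwise take the most frequent label (ties: first seen wins,
    # which is an assumption; the README does not specify tie-breaking).
    return Counter(url_labels).most_common(1)[0][0]


# Hypothetical URL-level predictions grouped by domain.
predictions = {
    "example-a.com": ["other", "partner", "other"],
    "example-b.com": ["partner-seo", "other", "other"],
    "example-c.com": ["other", "customer", "other"],
}
domain_labels = {domain: aggregate_domain(labels) for domain, labels in predictions.items()}
print(domain_labels)
```

Because a single 'partner' or 'partner-seo' URL overrides the majority, the rule trades precision for recall on those two classes.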

Performance Considerations

The model prioritizes recall for the 'partner' and 'partner-seo' classes over overall accuracy
Class weights are applied to handle the imbalanced nature of the domain categories
The weighted score metric is calculated as: (0.7 × partner_recall) + (0.3 × partner_seo_recall)
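
A worked example of the weighted score, as a sketch in plain Python. The helper names and the sample labels are hypothetical, and returning 0.0 when a class has no true examples is an assumption made only for this illustration.

```python
def recall(y_true, y_pred, label):
    """Fraction of true `label` examples that were predicted as `label`."""
    predicted_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predicted_for_label:
        return 0.0  # assumption: no true examples of this class counts as 0
    return sum(p == label for p in predicted_for_label) / len(predicted_for_label)


def weighted_score(y_true, y_pred):
    """(0.7 x partner_recall) + (0.3 x partner_seo_recall)."""
    return 0.7 * recall(y_true, y_pred, "partner") + 0.3 * recall(y_true, y_pred, "partner-seo")


# Hypothetical labels: one of two 'partner' examples is missed.
y_true = ["partner", "partner", "partner-seo", "partner-seo", "other"]
y_pred = ["partner", "other", "partner-seo", "partner-seo", "other"]
score = weighted_score(y_true, y_pred)
print(round(score, 2))
```

Here partner recall is 0.5 and partner-seo recall is 1.0, so the score is 0.7 × 0.5 + 0.3 × 1.0 = 0.65.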

Scripts Overview

There are two main scripts:

train_best_model_keboola.py - Trains a model on the complete dataset and saves the model files
predict_keboola.py - Loads a trained model and makes predictions on new domain data

Input/Output Structure

For Training (train_best_model_keboola.py)

Input Tables:

/data/in/tables/domains_train.csv - Contains domains with the 'domain', 'description', and 'category' columns for training

Output Tables:

/data/out/tables/url_level_predictions.csv - URL-level predictions on the training data
/data/out/tables/data_domains_predictions.csv - Domain-level aggregated predictions
/data/out/tables/key_metrics.csv - Evaluation metrics (recall, accuracy, etc.), computed on the training data
/data/out/tables/domains_predictions_summary.csv - Summary of predictions by category

Output Files:

/data/out/files/best_classifier_model.joblib - The trained classifier model
/data/out/files/embedding_model/ - The sentence transformer embedding model
/data/out/files/class_weights.json - Class weights used in training
/data/out/files/model_config.json - Model configuration details
/data/out/files/confusion_matrices/ - Generated confusion matrix images
/data/out/files/training_log.log - Detailed training log

For Prediction (predict_keboola.py)

Input Tables:

/data/in/tables/data_domains_classification.csv - Contains domains with the 'domain' and 'description' columns

Input Files:

/data/in/files/best_classifier_model.joblib - The trained classifier model
/data/in/files/embedding_model/ - The sentence transformer embedding model

Output Tables:

/data/out/tables/url_level_predictions.csv - URL-level predictions
/data/out/tables/data_domains_predictions.csv - Domain-level aggregated predictions
/data/out/tables/domains_predictions_summary.csv - Summary of predictions by category

Output Files:

/data/out/tables/prediction_log.txt - Detailed prediction log

Keboola Configuration

Training Component

Create a Keboola component with Python 3.7+
Upload the train_best_model_keboola.py script
Configure input mappings:
- Map your domain training data to /data/in/tables/domains_train.csv
Configure output mappings:
- Map all files in /data/out/files/ to your storage
- Map all tables in /data/out/tables/ to your storage

Prediction Component

Create a Keboola component with Python 3.7+
Upload the predict_keboola.py script
Configure input mappings:
- Map your new domain data to /data/in/tables/data_domains_classification.csv
- Map the previously saved model files (best_classifier_model.joblib and the embedding_model/ directory) to /data/in/files/
Configure output mappings:
- Map all tables in /data/out/tables/ to your storage

Dependencies

The scripts require the following Python packages:

pandas
numpy
scikit-learn
sentence-transformers
joblib
matplotlib
seaborn

Install these in the Keboola Python environment before running either script. The repository also includes a requirements.txt file.
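
One way to confirm the environment is complete before running a script is sketched below. The mapping from package name to import name is the only non-obvious part (scikit-learn imports as sklearn, sentence-transformers as sentence_transformers); the function name is hypothetical and not part of the project.

```python
import importlib.util

# Package name (as installed with pip) -> module name (as imported).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "sentence-transformers": "sentence_transformers",
    "joblib": "joblib",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages such as sentence-transformers.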

Prediction Process

The domain classification follows these rules for domain-level aggregation:

If any URL for a domain is classified as 'partner', the domain is classified as 'partner'
If any URL is classified as 'partner-seo' and none are 'partner', the domain is classified as 'partner-seo'
Otherwise, the majority classification is used

Troubleshooting

If you encounter issues:

Check the log files (training_log.log for training, prediction_log.txt for prediction) for detailed error information
Ensure all required input files are present with the correct format
Verify that the file paths in the input/output mappings match the paths expected by the scripts
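
The input checks above can be automated with a short script. This is a sketch under assumptions: the function name is hypothetical, and it only verifies that the file exists and that the header row contains the expected column names listed in the Input/Output Structure section.

```python
import csv
import os


def check_input_table(path, required_columns):
    """Return a list of problems found; an empty list means the table looks fine."""
    if not os.path.isfile(path):
        return [f"missing file: {path}"]
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [
        f"{path}: missing column '{column}'"
        for column in required_columns
        if column not in header
    ]


# Paths and column names as documented for the two scripts.
problems = check_input_table(
    "/data/in/tables/domains_train.csv", ["domain", "description", "category"]
)
problems += check_input_table(
    "/data/in/tables/data_domains_classification.csv", ["domain", "description"]
)
for problem in problems:
    print(problem)
```

Running this at the start of a component makes a wrong input mapping fail with a clear message instead of an error deep inside the script.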

