This repository contains code for building machine learning models to detect food-related hazards and categorize them into predefined categories. It demonstrates the progression from a basic Support Vector Machine (SVM) model to an advanced BioBERT-based classification system.
Support Vector Machines (SVM) were used as the baseline model for this task:
- Text Encoding: The `text` column was transformed using TF-IDF (with a maximum of 3,500 features).
- Model Training: Separate SVM classifiers with a linear kernel were trained for each target (`hazard-category`, `product-category`, `hazard`, and `product`).
- Performance: The model achieved reasonable results but was limited by the simplicity of the feature representation and the lack of contextual embeddings.
SVM was chosen as the basic model due to its simplicity and efficiency in handling smaller datasets. While effective for basic classification tasks, its performance is constrained when compared to modern transformer-based methods like BioBERT.
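A minimal sketch of this baseline, assuming scikit-learn's `TfidfVectorizer` and `LinearSVC` (only the `max_features=3500` setting, the `text` column, and the four targets come from the description above; everything else is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train = pd.read_csv("data/incidents_train.csv")

# TF-IDF encoding of the `text` column, capped at 3,500 features
vectorizer = TfidfVectorizer(max_features=3500)
X = vectorizer.fit_transform(train["text"])

# One linear-kernel SVM per target column
models = {}
for target in ["hazard-category", "product-category", "hazard", "product"]:
    clf = LinearSVC()
    clf.fit(X, train[target])
    models[target] = clf
```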
The advanced model leverages BioBERT for sequence classification:
- Deep Learning Framework: Built with PyTorch and HuggingFace Transformers.
- Pre-trained Model: BioBERT (`dmis-lab/biobert-base-cased-v1.1`) was fine-tuned for this task.
- Features:
  - Uses BioBERT embeddings for text encoding.
  - Supports multi-target prediction across hazard and product categories.
  - Saves predictions for both primary tasks.
This repository was created to participate in SemEval 2025 Task 9: The Food Hazard Detection Challenge. The challenge evaluates explainable classification systems for titles of food-incident reports collected from the web. These algorithms aim to assist automated crawlers in identifying and extracting food safety issues from web sources, including social media. Transparency is a crucial aspect of this task due to the potential high economic impact of food hazards.
- Uses BioBERT for sequence classification tasks.
- Trains on labeled food hazard data and validates on unlabeled data.
- Implements custom text preprocessing and label encoding.
- Supports multi-target prediction across hazard categories and product categories.
- Saves predictions for both primary tasks.
- Python 3.7+
- Google Colab environment (with Google Drive integration)
- Libraries:
- `torch`: For building and training the deep learning model.
- `transformers`: For pre-trained BioBERT models and tokenization.
- `pandas`: For loading and preprocessing structured data.
- `scikit-learn`: For label encoding (`LabelEncoder`) and evaluation metrics (`f1_score`, `classification_report`).
- `tqdm`: For progress bars during training.
- `re` (standard library): For text preprocessing (removing special characters, cleaning text).
The model is configured with the following settings:
- Maximum Sequence Length: 256
- Batch Size: 16
- Learning Rate: 0.00005
- Epochs: 50
- Early Stopping Patience: 6
- Model Name: `dmis-lab/biobert-base-cased-v1.1`
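Collected as a single Python configuration block, these settings would look like the following (the variable names here are illustrative, not taken from the source code):

```python
# Illustrative configuration mirroring the settings listed above
CONFIG = {
    "model_name": "dmis-lab/biobert-base-cased-v1.1",
    "max_seq_length": 256,
    "batch_size": 16,
    "learning_rate": 5e-5,
    "epochs": 50,
    "early_stopping_patience": 6,
}
```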
- Mount Google Drive:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```
- Install Required Libraries:

  ```bash
  pip install torch transformers pandas scikit-learn tqdm
  ```
- Load and Preprocess Data:
  - Download the labeled training data and load it into a DataFrame.
  - Clean the text using a custom function (a sketch of such a function appears after this list).
  - Apply similar preprocessing to the unlabeled validation data.
- Train the Model:
  - Train separate models for each target (`hazard-category`, `product-category`, `hazard`, `product`).
  - Use early stopping to avoid overfitting.
  - Save the best-performing model for each target.
- Generate Predictions:
  - Load the saved models.
  - Generate predictions on the unlabeled validation data.
  - Save the predictions to CSV files.
- Note: For training, the `text` column was used instead of the `title` column, as it provided better predictions. This improvement is likely due to the additional context and detailed information available in the `text` column.
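A minimal sketch of what the custom cleaning function might look like (the source only states that `re` is used to remove special characters and clean the text; the exact rules below are assumptions):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters (assumed rule)
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

# Applied to both the labeled training data and the unlabeled validation data, e.g.:
# train["text"] = train["text"].apply(clean_text)
```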
- Labeled Training Data: Contains food-related hazard information and associated categories.
  - URL: https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv
  - Alternatively, located in the `data` folder as the file `incidents_train.csv`.
- Unlabeled Validation Data: This dataset is used by the competition to evaluate the results.
  - Located in the `data` folder as a zip file named `public_data.zip`, which contains the file `incidents.csv`.
- Predictions:
  - `ST1_predictions.csv`: Contains the text and the predicted hazard and product categories.
  - `ST2_predictions.csv`: Contains the text and the predicted hazard and product names.
- Processed Predictions:
  - `ST1_predictions_cleaned.csv`: Contains the predicted hazard and product categories.
  - `ST2_predictions_cleaned.csv`: Contains the predicted hazard and product names.
The performance of the model is evaluated using the F1 Macro score on two tasks:
- Task 1 (ST1: hazard and product categories):
  - F1 Macro: 0.7558958415
  - Leaderboard position (as of 11/29/2024 23:26): 5th
- Task 2 (ST2: hazard and product names):
  - F1 Macro: 0.4714
  - Leaderboard position (as of 11/29/2024 23:26): 3rd
- Task 1 generally performs better due to the well-defined categorical nature of hazard and product categories.
- Task 2 is more challenging, as it involves predicting specific product and hazard names, which may require finer-grained understanding.
The F1 Macro score is computed as follows:
```python
import numpy as np
from sklearn.metrics import f1_score

def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Inputs should be array-like so that the element-wise comparison
    # below produces a boolean mask (plain lists would not work here).
    hazards_true, hazards_pred = np.asarray(hazards_true), np.asarray(hazards_pred)
    products_true, products_pred = np.asarray(products_true), np.asarray(products_pred)

    # Compute macro F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro')

    # Compute macro F1 for products, restricted to samples whose hazard
    # was predicted correctly
    mask = hazards_pred == hazards_true
    f1_products = f1_score(products_true[mask], products_pred[mask], average='macro')

    return (f1_hazards + f1_products) / 2.
```
This metric emphasizes the importance of correctly predicting both hazards and products: a perfect score of 1.0 means both are entirely accurate, while predicting all hazards correctly but all products incorrectly yields a score of 0.5.
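A quick usage example on toy arrays (values are illustrative): all three hazards are correct, so `f1_hazards` is 1.0; one of the three products is wrong, and the macro F1 over the four product labels involved works out to 0.5, giving an overall score of 0.75.

```python
import numpy as np

hazards_true  = np.array(["biological", "chemical", "allergens"])
products_true = np.array(["milk", "nuts", "meat"])
hazards_pred  = np.array(["biological", "chemical", "allergens"])  # all hazards correct
products_pred = np.array(["milk", "nuts", "fish"])                 # one product wrong

print(compute_score(hazards_true, products_true, hazards_pred, products_pred))  # 0.75
```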
- Clone this repository and upload it to Google Colab.
- Ensure your Google Drive is mounted.
- Update file paths and configuration settings if needed.
- Run the script to train models and generate predictions.
- Tokenizer: HuggingFace's `AutoTokenizer`
- Model: HuggingFace's `AutoModelForSequenceClassification`
- Loss Function: Cross-Entropy Loss
- Optimizer: Adam
- Learning Rate Scheduler: ReduceLROnPlateau
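Under these choices, the setup might look like the following sketch (the `num_labels` value and the scheduler arguments are assumptions; only the components named above come from the source):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dmis-lab/biobert-base-cased-v1.1"
num_labels = 10  # hypothetical: set to the number of classes for the current target

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# HuggingFace models compute cross-entropy internally when `labels` are passed,
# but the loss can also be instantiated explicitly:
loss_fn = torch.nn.CrossEntropyLoss()

optimizer = Adam(model.parameters(), lr=5e-5)  # learning rate from the configuration
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=2)  # patience here is an assumption
```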
```
|-- incidents_train.csv          # Labeled training data
|-- incidents.csv                # Unlabeled validation data
|-- ST1_predictions.csv          # Predictions for ST1
|-- ST2_predictions.csv          # Predictions for ST2
|-- ST1_predictions_cleaned.csv  # Processed ST1 predictions
|-- ST2_predictions_cleaned.csv  # Processed ST2 predictions
```
This project is licensed under the GNU General Public License Version 3, 29 June 2007. See the LICENSE file for details.
- HuggingFace Transformers: For providing pre-trained models and tokenizer utilities that formed the backbone of this project.
- BioBERT: For offering a domain-specific transformer model pre-trained on biomedical text and well suited to this classification task.
- PyTorch: For being the core deep learning framework used to implement and train the neural network.
- ChatGPT: For providing insights and suggestions during the development of this project.
- Stack Overflow: For being an invaluable resource for resolving coding challenges and gaining technical insights.