This repository contains code for building machine learning models to detect food-related hazards and categorize them into predefined categories. It demonstrates the progression from a basic Support Vector Machine (SVM) model to an advanced BioBERT-based classification system.
Support Vector Machines (SVM) were used as the baseline model for this task:
- Text Encoding: The `text` column was transformed using TF-IDF (with a maximum of 3,500 features).
- Model Training: Separate SVM classifiers with a linear kernel were trained for each target (`hazard-category`, `product-category`, `hazard`, and `product`).
- Performance: The model achieved reasonable results but was limited by the simplicity of the feature representation and the lack of contextual embeddings.
SVM was chosen as the basic model due to its simplicity and efficiency in handling smaller datasets. While effective for basic classification tasks, its performance is constrained when compared to modern transformer-based methods like BioBERT.
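A minimal sketch of this baseline, assuming scikit-learn's `TfidfVectorizer` and `LinearSVC` (only the `max_features=3500` setting, the `text` column, and the four targets come from the description above; everything else is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train = pd.read_csv("data/incidents_train.csv")

# TF-IDF encoding of the `text` column, capped at 3,500 features
vectorizer = TfidfVectorizer(max_features=3500)
X = vectorizer.fit_transform(train["text"])

# One linear-kernel SVM per target column
models = {}
for target in ["hazard-category", "product-category", "hazard", "product"]:
    clf = LinearSVC()
    clf.fit(X, train[target])
    models[target] = clf
```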
The advanced model leverages BioBERT for sequence classification:
- Deep Learning Framework: Built with PyTorch and HuggingFace Transformers.
- Pre-trained Model: BioBERT (`dmis-lab/biobert-base-cased-v1.1`) was fine-tuned for this task.
- Features:
  - Uses BioBERT embeddings for text encoding.
  - Supports multi-target prediction across hazard and product categories.
  - Saves predictions for both primary tasks.
This repository was created to participate in SemEval 2025 Task 9: The Food Hazard Detection Challenge. The challenge evaluates explainable classification systems for titles of food-incident reports collected from the web. These algorithms aim to assist automated crawlers in identifying and extracting food safety issues from web sources, including social media. Transparency is a crucial aspect of this task due to the potential high economic impact of food hazards.
- Uses BioBERT for sequence classification tasks.
- Trains on labeled food hazard data and validates on unlabeled data.
- Implements custom text preprocessing and label encoding.
- Supports multi-target prediction across hazard categories and product categories.
- Saves predictions for both primary tasks.
- Python 3.7+
- Google Colab environment (with Google Drive integration)
- Libraries:
- `torch`: For building and training the deep learning model.
- `transformers`: For pre-trained BioBERT models and tokenization.
- `pandas`: For loading and preprocessing structured data.
- `scikit-learn`: For label encoding (`LabelEncoder`) and evaluation metrics (`f1_score`, `classification_report`).
- `tqdm`: For progress bars during training.
- `re` (standard library): For text preprocessing (removing special characters, cleaning text).
The model is configured with the following settings:
- Maximum Sequence Length: 256
- Batch Size: 16
- Learning Rate: 0.00005
- Epochs: 50
- Early Stopping Patience: 6
- Model Name: `dmis-lab/biobert-base-cased-v1.1`
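Collected as a single Python configuration block, these settings would look like the following (the variable names here are illustrative, not taken from the source code):

```python
# Illustrative configuration mirroring the settings listed above
CONFIG = {
    "model_name": "dmis-lab/biobert-base-cased-v1.1",
    "max_seq_length": 256,
    "batch_size": 16,
    "learning_rate": 5e-5,
    "epochs": 50,
    "early_stopping_patience": 6,
}
```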
- Mount Google Drive:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```
- Install Required Libraries:

  ```bash
  pip install torch transformers pandas scikit-learn tqdm
  ```
- Load and Preprocess Data:
  - Download the labeled training data and load it into a DataFrame.
  - Clean the text using a custom function (a sketch of such a function appears after this list).
  - Apply similar preprocessing to the unlabeled validation data.
- Train the Model:
  - Train separate models for each target (`hazard-category`, `product-category`, `hazard`, `product`).
  - Use early stopping to avoid overfitting.
  - Save the best-performing model for each target.
- Generate Predictions:
  - Load the saved models.
  - Generate predictions on the unlabeled validation data.
  - Save the predictions to CSV files.
- Note: For training, the `text` column was used instead of the `title` column, as it provided better predictions. This improvement is likely due to the additional context and detailed information available in the `text` column.
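A minimal sketch of what the custom cleaning function might look like (the source only states that `re` is used to remove special characters and clean the text; the exact rules below are assumptions):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters (assumed rule)
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

# Applied to both the labeled training data and the unlabeled validation data, e.g.:
# train["text"] = train["text"].apply(clean_text)
```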
- Labeled Training Data: Contains food-related hazard information and associated categories.
  - URL: https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv
  - Alternatively, located in the `data` folder as the file `incidents_train.csv`.
- Unlabeled Validation Data: This dataset is used by the competition to evaluate the results.
  - Located in the `data` folder as a zip file named `public_data.zip`, which contains the file `incidents.csv`.
- Predictions:
  - `ST1_predictions.csv`: Contains the text and the predicted hazard and product categories.
  - `ST2_predictions.csv`: Contains the text and the predicted hazard and product names.
- Processed Predictions:
  - `ST1_predictions_cleaned.csv`: Contains the predicted hazard and product categories.
  - `ST2_predictions_cleaned.csv`: Contains the predicted hazard and product names.
The performance of the model is evaluated using the F1 Macro score on two tasks:
- Task 1 (ST1: hazard and product categories):
  - F1 Macro: 0.7558958415
  - Leaderboard position (as of 11/29/2024 23:26): 5th
- Task 2 (ST2: hazard and product names):
  - F1 Macro: 0.4714
  - Leaderboard position (as of 11/29/2024 23:26): 3rd
- Task 1 generally performs better due to the well-defined categorical nature of hazard and product categories.
- Task 2 is more challenging, as it involves predicting specific product and hazard names, which may require finer-grained understanding.
The F1 Macro score is computed as follows:
```python
import numpy as np
from sklearn.metrics import f1_score

def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Inputs should be array-like so that the element-wise comparison
    # below produces a boolean mask (plain lists would not work here).
    hazards_true, hazards_pred = np.asarray(hazards_true), np.asarray(hazards_pred)
    products_true, products_pred = np.asarray(products_true), np.asarray(products_pred)

    # Compute macro F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro')

    # Compute macro F1 for products, restricted to samples whose hazard
    # was predicted correctly
    mask = hazards_pred == hazards_true
    f1_products = f1_score(products_true[mask], products_pred[mask], average='macro')

    return (f1_hazards + f1_products) / 2.
```
This metric emphasizes the importance of correctly predicting both hazards and products: a perfect score of 1.0 means both are entirely accurate, while predicting all hazards correctly but all products incorrectly yields a score of 0.5.
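A quick usage example on toy arrays (values are illustrative): all three hazards are correct, so `f1_hazards` is 1.0; one of the three products is wrong, and the macro F1 over the four product labels involved works out to 0.5, giving an overall score of 0.75.

```python
import numpy as np

hazards_true  = np.array(["biological", "chemical", "allergens"])
products_true = np.array(["milk", "nuts", "meat"])
hazards_pred  = np.array(["biological", "chemical", "allergens"])  # all hazards correct
products_pred = np.array(["milk", "nuts", "fish"])                 # one product wrong

print(compute_score(hazards_true, products_true, hazards_pred, products_pred))  # 0.75
```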
- Clone this repository and upload it to Google Colab.
- Ensure your Google Drive is mounted.
- Update file paths and configuration settings if needed.
- Run the script to train models and generate predictions.
- Tokenizer: HuggingFace's `AutoTokenizer`
- Model: HuggingFace's `AutoModelForSequenceClassification`
- Loss Function: Cross-Entropy Loss
- Optimizer: Adam
- Learning Rate Scheduler: ReduceLROnPlateau
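Under these choices, the setup might look like the following sketch (the `num_labels` value and the scheduler arguments are assumptions; only the components named above come from the source):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dmis-lab/biobert-base-cased-v1.1"
num_labels = 10  # hypothetical: set to the number of classes for the current target

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# HuggingFace models compute cross-entropy internally when `labels` are passed,
# but the loss can also be instantiated explicitly:
loss_fn = torch.nn.CrossEntropyLoss()

optimizer = Adam(model.parameters(), lr=5e-5)  # learning rate from the configuration
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=2)  # patience here is an assumption
```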
```
|-- incidents_train.csv          # Labeled training data
|-- incidents.csv                # Unlabeled validation data
|-- ST1_predictions.csv          # Predictions for ST1
|-- ST2_predictions.csv          # Predictions for ST2
|-- ST1_predictions_cleaned.csv  # Processed ST1 predictions
|-- ST2_predictions_cleaned.csv  # Processed ST2 predictions
```
This project is licensed under the GNU General Public License Version 3, 29 June 2007. See the LICENSE file for details.
- HuggingFace Transformers: For providing pre-trained models and tokenizer utilities that formed the backbone of this project.
- BioBERT: For offering a domain-specific transformer model pre-trained on biomedical text and well suited to this classification task.
- PyTorch: For being the core deep learning framework used to implement and train the neural network.
- ChatGPT: For providing insights and suggestions during the development of this project.
- Stack Overflow: For being an invaluable resource for resolving coding challenges and gaining technical insights.