project Natural-language-processing
Disease_recognition_in_Spanish.ipynb Recognition and tagging of diseases in Spanish-language medical records and scientific literature (fine-tuning a pretrained Hugging Face model of the RoBERTa family on a token classification task: customising a NER tagging model to recognise a new entity, using spaCy)
NER_Finetuning_HuggingFace_model.ipynb Fine-tuning a RoBERTa-family model (from Hugging Face) for Named Entity Recognition: training on a specific dataset
Extract_text_from_PDF_to_JSON.ipynb Exploring various methods for text scraping from large PDF files
swear_words_filter_testing.ipynb Evaluation of approaches to profanity detection / swear-word filtering (five libraries tested)
word_frequency_barchart_wordcloud.py Word Frequency Bar Chart and Word Cloud (from Shakespeare’s Hamlet)
Input/ .* Texts for analysis, datasets, and masks for WordCloud
Output/ .* Generated datasets and word clouds
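
As a rough illustration of the word-frequency step behind word_frequency_barchart_wordcloud.py, here is a minimal sketch using only the Python standard library. It is not the script from this repository: the function name word_frequencies, the tokenising regex, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10):
    """Return the top_n most common lowercase words in text.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Lowercase the text, then keep runs of letters
    # (apostrophes inside a word, as in "o'er", are kept).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Counter tallies each word; most_common sorts by descending count.
    return Counter(words).most_common(top_n)


# Sample sentence chosen for the example (an assumption, not repo data).
sample = "To be, or not to be, that is the question."
top = word_frequencies(sample, top_n=3)
print(top)
```

The resulting (word, count) pairs could then be passed to a plotting library for the bar chart, or to a word cloud generator, which is presumably what the script does with the text in Input/.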