This is a simple application built on Django that provides access via a REST API. Django's technical data is stored in the default sqlite3 database.
The app uses spaCy to run Named Entity Recognition (NER) and MongoDB to store the results.
- Clone the repository.
- Move to the project's folder and run `pip install -r requirements.txt`.
- Set the environment variable `DJANGO_SETTINGS_MODULE=ner_django_app.settings`. Note: migrations are not required here since we are not using Django models.
- If you have a custom installation of MongoDB, set the environment variables `MONGO_DB_NAME` and `MONGO_DB_URI`.
- Make sure that the MongoDB server is running.
- Optional: you can set `DEBUG=True`.
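
The environment variables in the setup steps could be read along the following lines. This is only a sketch of the idea, not the project's actual settings code: `read_settings` is a hypothetical helper, the default URI assumes a stock local MongoDB install, and the default database name `ner_results` is invented for illustration.

```python
import os

# Assumed defaults: a stock local MongoDB and a hypothetical database name.
DEFAULT_MONGO_DB_URI = "mongodb://localhost:27017"
DEFAULT_MONGO_DB_NAME = "ner_results"


def read_settings(env=None):
    """Collect the environment-driven settings named in the setup steps."""
    env = os.environ if env is None else env
    return {
        "DJANGO_SETTINGS_MODULE": env.get(
            "DJANGO_SETTINGS_MODULE", "ner_django_app.settings"
        ),
        "MONGO_DB_URI": env.get("MONGO_DB_URI", DEFAULT_MONGO_DB_URI),
        "MONGO_DB_NAME": env.get("MONGO_DB_NAME", DEFAULT_MONGO_DB_NAME),
        # Environment values are strings, so "True" must be parsed explicitly.
        "DEBUG": env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes"),
    }
```

Calling `read_settings()` with no arguments reads from the real process environment; passing a plain dict makes the function easy to check without touching the environment.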
- Go to the project folder.
- Put the input archives containing the patents as XML files into the `./inputs` folder. Note: this input method was chosen because the original requirement was the ability to process ~10,000 files, which is usually more convenient when the files are stored on SFTP or another folder-like storage.
- Start the Django development server with the command `./ner_django_app/manage.py runserver 8000`.
- Go to localhost:8000 (you will be automatically redirected to the required page).
- Press the `Run Pipeline` button to process all archives in the `./inputs` folder.
- You will see the message "Success!" if everything is fine, or "Something went wrong" if there was a failure.
- Check MongoDB: the results are stored in the `patents` collection of the `MONGO_DB_NAME` database.
- You can drop this collection by pressing the `Clean MongoDB` button.
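
The processing triggered from the page can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the archives are ZIP files, `iter_patent_xml` and `store_results` are hypothetical helper names, and `collection` is any object exposing `insert_many`, such as a pymongo collection.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_patent_xml(inputs_dir="./inputs"):
    """Yield (archive name, member name, plain text) for each XML file found."""
    for archive in sorted(Path(inputs_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                if not member.lower().endswith(".xml"):
                    continue  # ignore non-XML members of the archive
                root = ET.fromstring(zf.read(member))
                # Join all text nodes; real patent XML may need field-aware parsing.
                text = " ".join(t.strip() for t in root.itertext() if t.strip())
                yield archive.name, member, text


def store_results(collection, records, batch_size=500):
    """Insert records in batches and return how many documents were stored."""
    batch, stored = [], 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            stored += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        stored += len(batch)
    return stored
```

Streaming the archives through a generator and writing in batches keeps memory use bounded, which matters when the input folder holds thousands of files.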
This application runs only general-purpose NER and does not search for chemistry terms. A more specialized library, e.g. ChemDataExtractor, could be used for that.
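
To illustrate what general-purpose NER returns, here is a hedged sketch of entity extraction with spaCy. It is an assumption that a stock English model such as `en_core_web_sm` is used; the README does not say which model the project loads. `extract_entities` and `load_pipeline` are hypothetical helpers. Stock models emit generic labels (for example `ORG`, `GPE`, `DATE`), so chemical names are not singled out as their own category.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; the model name is an assumption, not project config."""
    import spacy  # imported lazily so the helper below works with any callable

    return spacy.load(model_name)


def extract_entities(nlp, text):
    """Run an NER pipeline over text and return the entities as plain dicts."""
    doc = nlp(text)
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]
```

With spaCy and the model installed, `extract_entities(load_pipeline(), "Bayer filed the patent in Germany in 2015.")` returns one dict per detected entity, ready to be stored as a MongoDB document.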