Implementation of the paper - "Integrating Topics and Syntax"

Link to the paper

Installation

Set up an environment with conda or pip and install the packages below:

python=3.10
ipykernel
jupyter
scikit-learn
spacy
pandas
matplotlib
nltk
gensim
tqdm

Setting up the dataset

Preprocessing

python preprocess_data.py \
  --size <num-of-docs> \
  --dataset <news|nips>

The above command preprocesses the dataset and writes the vocabulary and the documents, each as a list of token ids, into a folder named {dataset}_{num-docs}.
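The output described above can be pictured with a small sketch. This is an illustration only, not the code in preprocess_data.py: the helper names build_vocab and encode, the whitespace tokenization, and the toy documents are all assumptions.

```python
from collections import Counter

def build_vocab(docs, min_count=1):
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for doc in docs for tok in doc)
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}

def encode(docs, vocab):
    """Replace tokens with their ids, dropping out-of-vocabulary tokens."""
    return [[vocab[tok] for tok in doc if tok in vocab] for doc in docs]

# Toy corpus; lowercasing and whitespace splitting are assumptions.
docs = [text.lower().split() for text in ["the cat sat", "the dog sat down"]]
vocab = build_vocab(docs)
token_ids = encode(docs, vocab)

# Folder name following the {dataset}_{num-docs} pattern.
out_dir = f"news_{len(docs)}"
```

Sorting the vocabulary before assigning ids keeps the mapping reproducible across runs, which matters when the model output is later mapped back to words.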

Training the model

python main.py \
  --alpha <symmetric Dirichlet parameter for the document-specific topic distributions> \
  --beta <Dirichlet parameter for the topic-specific word distributions> \
  --delta <Dirichlet parameter for the class-specific word distributions> \
  --gamma <Dirichlet parameter for the transition distributions between classes> \
  --num_iter <iterations of Gibbs sampling> \
  --num_topics <T> \
  --num_classes <C> \
  --dataset <path-to-preprocessed-dataset>

The model output files are written to the folder out/{alpha}_{beta}_{gamma}_{delta}_{num_topics}_{num_classes}_{num_iterations}_{dataset}.
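To make the role of the four hyperparameters concrete, here is a minimal sketch of the composite model's generative process as described in the paper: alpha is the prior on a document's topic mixture, beta on each topic's word distribution, delta on each syntactic class's word distribution, and gamma on the class transition rows. It is not the sampler in main.py. The function names, the choice of class 0 as the semantic class, and the uniformly random start class are assumptions made for illustration.

```python
import random

def dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(length, num_topics, num_classes, vocab_size,
                      alpha, beta, gamma, delta, seed=0):
    """Generate (class, word id) pairs for one document."""
    rng = random.Random(seed)
    theta = dirichlet(alpha, num_topics, rng)  # document's topic mixture
    phi_z = [dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    phi_c = [dirichlet(delta, vocab_size, rng) for _ in range(num_classes)]
    pi = [dirichlet(gamma, num_classes, rng) for _ in range(num_classes)]

    semantic_class = 0                        # assumption: class 0 emits topic words
    prev_class = rng.randrange(num_classes)   # assumption: random start class
    words = []
    for _ in range(length):
        # The HMM picks the next class from the previous one.
        cls = rng.choices(range(num_classes), weights=pi[prev_class])[0]
        if cls == semantic_class:
            # Semantic class: pick a topic, then a word from that topic.
            topic = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(range(vocab_size), weights=phi_z[topic])[0]
        else:
            # Syntactic class: word comes from the class's own distribution.
            word = rng.choices(range(vocab_size), weights=phi_c[cls])[0]
        words.append((cls, word))
        prev_class = cls
    return words

sample = generate_document(length=12, num_topics=3, num_classes=4,
                           vocab_size=20, alpha=0.5, beta=0.1,
                           gamma=0.5, delta=0.1)
```

Gibbs sampling inverts this process: given only the word ids, it repeatedly resamples each word's class and topic assignment.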

Evaluation

Document classification

To run document classification on the newsgroup dataset:

python doc_classifier.py \
  --theta_file <path-to-theta.txt> \
  --skip_indices_file <path-to-skipped_indices.txt> \
  --train_test_split <split-fraction>

Here, theta.txt contains the document-topic counts, and skipped_indices.txt contains the indices of the documents that were skipped when training the model.
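The skipped indices matter because the topic counts and the class labels must line up row by row before a classifier can be trained. A minimal sketch of that alignment step follows. It assumes theta.txt holds one row of counts per retained document and that the skipped indices refer to positions in the original dataset; the helper names and toy values are hypothetical and not taken from doc_classifier.py.

```python
def topic_proportions(counts):
    """Normalize one document's topic counts into proportions."""
    total = sum(counts)
    if total == 0:
        # A document with no topic assignments gets a uniform vector.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]

def drop_skipped(labels, skipped_indices):
    """Remove the labels of documents that were skipped during training."""
    skipped = set(skipped_indices)
    return [label for i, label in enumerate(labels) if i not in skipped]

# Toy values: three original documents, the second one was skipped.
theta_counts = [[3, 1], [0, 4]]
all_labels = ["sci", "rec", "talk"]
skipped = [1]

features = [topic_proportions(row) for row in theta_counts]
labels = drop_skipped(all_labels, skipped)
```

After this step, features and labels have the same length and can be split into train and test sets according to the chosen split fraction.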

Topic Coherence score

Creates a plot of the topic coherence score against the number of iterations, computed using gensim. You must supply the correct directory containing the phi_z.txt file.

python metrics.py
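For intuition about what a coherence score measures, the sketch below computes the UMass coherence of a single topic in plain Python. This is a stand-in for illustration, not the gensim computation that metrics.py runs, and which coherence measure the script uses is not stated here. The function name and the toy corpus are assumptions.

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    top_words is ordered from most to least probable under the topic, and
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denominator = doc_freq(top_words[l])
            if denominator == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denominator)
    return score

documents = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"]]
coherent = umass_coherence(["cat", "dog"], documents)
less_coherent = umass_coherence(["cat", "fish"], documents)
```

Words that co-occur in many documents give a score closer to zero, so here the pair "cat", "dog" scores higher than "cat", "fish".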

Pure LDA

Trains an LDA model in gensim on the supplied data and creates a plot of coherence against the number of topics.

python pure_lda.py

References

LDA from scratch

abhijithasokan/integrating_topics_and_syntax
