MGAD: Multilingual Generation of Analogy Datasets

Submitted to LREC2018

Description

We present a novel, minimally supervised method of generating word embedding evaluation datasets for a large number of languages. Our approach uses existing dependency treebanks and parsers to create language-specific syntactic analogy datasets that do not rely on translation or human annotation. As part of our work, we offer syntactic analogy datasets for three previously unexplored languages: Arabic, Hindi, and Russian. These can be found in the data/ subdirectory.

Usage

Before running extract.py, we recommend providing a feature template for generating syntactic analogies in-file. The following is a sample template written for Hindi:

NOUN|Number=Plur|Case=Nom   NOUN|Number=Sing|Case=Nom
VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part  VERB|Case=Nom|VerbForm=Inf

The features expressed here can be found at the Universal Dependencies website.
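
To make the template layout concrete, here is a minimal sketch of how one template line could be split into its parts. It assumes each line holds two whitespace-separated specifications, and that each specification is a UD part-of-speech tag followed by `|`-separated `Feature=Value` pairs. The function names `parse_spec` and `parse_template_line` are hypothetical and are not taken from extract.py.

```python
def parse_spec(spec):
    """Split 'NOUN|Number=Plur|Case=Nom' into ('NOUN', {'Number': 'Plur', 'Case': 'Nom'})."""
    upos, *pairs = spec.split("|")
    feats = dict(pair.split("=", 1) for pair in pairs)
    return upos, feats


def parse_template_line(line):
    """Parse one template line into two (upos, feats) specifications.

    Assumption: the two specifications are separated by whitespace
    (spaces or a tab), as in the sample template above.
    """
    first, second = line.split()
    return parse_spec(first), parse_spec(second)
```

Under these assumptions, the first line of the Hindi sample yields a plural nominative noun specification paired with a singular nominative one.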

To run extract.py, enter the following command in a terminal, where the corpus is a CoNLL-U-formatted UD treebank:

cat corpus.connllu | python extract.py --all > output_file.txt
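
For readers unfamiliar with the input format, the sketch below shows one way to read CoNLL-U tokens from a stream such as standard input and test them against a part-of-speech tag and feature set. It is an illustration of the format only, not the logic of extract.py. The names `read_conllu` and `matches` are hypothetical. The column layout (10 tab-separated fields, with FORM, LEMMA, UPOS, and FEATS in columns 2, 3, 4, and 6) follows the CoNLL-U specification.

```python
def read_conllu(stream):
    """Yield (form, lemma, upos, feats) for each regular token in a CoNLL-U stream."""
    for line in stream:
        line = line.rstrip("\n")
        # Skip sentence boundaries (blank lines) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword token ranges (e.g. 1-2),
        # and empty nodes (e.g. 1.1), whose IDs are not plain integers.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        # FEATS is '_' when a token carries no morphological features.
        if cols[5] == "_":
            feats = {}
        else:
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        yield cols[1], cols[2], cols[3], feats


def matches(upos, feats, want_upos, want_feats):
    """Return True if a token has the wanted tag and all wanted features."""
    if upos != want_upos:
        return False
    return all(feats.get(name) == value for name, value in want_feats.items())
```

Passing `sys.stdin` to `read_conllu` would consume a treebank piped in the same way as the command above. Matching requires only the listed features, so a token with additional features still qualifies. That is a design choice of this sketch and may differ from the behavior of extract.py.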
