rutrastone/MGAD

MGAD: Multilingual Generation of Analogy Datasets

Submitted to LREC 2018

Description

We present a novel, minimally supervised method of generating word embedding evaluation datasets for a large number of languages. Our approach utilizes existing dependency treebanks and parsers in order to create language-specific syntactic analogy datasets that do not rely on translation or human annotation. As part of our work, we offer syntactic analogy datasets for three previously unexplored languages: Arabic, Hindi, and Russian. These can be found in the data/ subdirectory.

Usage

Before running extract.py, we recommend providing a feature template for generating syntactic analogies in-file. The following is a sample template written for Hindi:

NOUN|Number=Plur|Case=Nom   NOUN|Number=Sing|Case=Nom
VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part  VERB|Case=Nom|VerbForm=Inf

The part-of-speech tags and morphological features used here are documented on the Universal Dependencies website.
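To make the template format concrete, here is a minimal sketch of how one template line could be read. It assumes each line holds two whitespace-separated specifications, each consisting of a UPOS tag followed by `|`-separated `Feature=Value` pairs, as in the Hindi sample above. The function names `parse_spec` and `parse_template_line` are hypothetical and are not taken from extract.py.

```python
def parse_spec(spec):
    """Split one specification, e.g. 'NOUN|Number=Plur|Case=Nom',
    into a UPOS tag and a dict of morphological features."""
    upos, *feats = spec.split("|")
    return upos, dict(feat.split("=", 1) for feat in feats)


def parse_template_line(line):
    """Split a template line into its two specifications.

    Assumption: the two sides are separated by whitespace (spaces or tabs)
    and neither side contains whitespace itself.
    """
    left, right = line.split()
    return parse_spec(left), parse_spec(right)


if __name__ == "__main__":
    sample = "NOUN|Number=Plur|Case=Nom   NOUN|Number=Sing|Case=Nom"
    print(parse_template_line(sample))
```

Each side of the line then becomes a tag plus a feature dictionary that can be compared against the UPOS and FEATS columns of a treebank token.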

To run extract.py, enter the following command in a terminal, where the corpus is a CoNLL-U-formatted UD treebank:

cat corpus.connllu | python extract.py --all > output_file.txt
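For readers unfamiliar with the input format, the sketch below shows one way to read CoNLL-U tokens and test them against a single feature specification. It is an illustration only and does not reproduce the logic of extract.py. It assumes the standard ten tab-separated CoNLL-U columns (FORM, LEMMA, UPOS, and FEATS are columns 2, 3, 4, and 6). The helper names and the sample tokens are hypothetical.

```python
def parse_feats(field):
    """Turn a FEATS column such as 'Case=Nom|Number=Plur' into a dict ('_' means none)."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def iter_tokens(lines):
    """Yield one dict per regular token, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges (e.g. 1-2), and empty nodes (e.g. 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield {
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
        }


def matches(token, upos, feats):
    """True if the token has the given UPOS tag and every listed feature value."""
    return token["upos"] == upos and all(
        token["feats"].get(key) == value for key, value in feats.items()
    )


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        "# sent_id = 1",
        "1\tladke\tladka\tNOUN\tNN\tCase=Nom|Gender=Masc|Number=Plur\t2\tnsubj\t_\t_",
        "2\taae\taa\tVERB\tVM\tAspect=Perf|Number=Plur\t0\troot\t_\t_",
        "",
    ]
    wanted = {"Number": "Plur", "Case": "Nom"}
    for tok in iter_tokens(sample):
        if matches(tok, "NOUN", wanted):
            print(tok["form"], tok["lemma"])
```

To read from a pipe as in the command above, `sys.stdin` can be passed to `iter_tokens` in place of the sample list.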
