-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KenLM wrapper #71
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LM class should be able to directly evaluate a sentence string instead of a list of word token indexes. String tokenizing, dictionary loading, dictionary lookup etc. should be included inside it to make the usage easier.
|
||
.. code-block:: python | ||
|
||
pip install https://github.com/kpu/kenlm/archive/master.zip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move it to "requirements.txt"?
|
||
pip install https://github.com/kpu/kenlm/archive/master.zip | ||
|
||
Please refer to **Scalable Modified Kneser-Ney Language Model Estimation** for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add a reference link.
- Add a dot at the end of the sentence.
class KenLMModel(object): | ||
""" | ||
Wrapper for KenLM language model. | ||
You should install KenLM python interface first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should install --> Please install.
|
||
class KenLMModel(object): | ||
""" | ||
Wrapper for KenLM language model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any full name for KenLM ? Add a reference link.
""" | ||
Initialize variables and load model. | ||
|
||
:param model_path: Path of language model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a dot. Same below.
:param id_list: Id list of word. | ||
:param id_str_dict: list | ||
""" | ||
assert len(id_list) > 0, 'invalid id list' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace it with raise ValueError
.
import kenlm | ||
self._model = kenlm.LanguageModel(self._model_path) | ||
|
||
def score_sentence_ids(self, id_list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string type is better than an id list.
And this class should be responsible for tokenization, dictionary loading and dictionary looking-up. Don't leave it to users.
end token explicitly if the input sentence has been completed. | ||
|
||
:param id_list: Id list of word. | ||
:param id_str_dict: list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add return and rtype in the comments.
return math.pow(10, score) | ||
|
||
|
||
if __name__ == '__main__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove below or add this to a unit test file.
id_str_dict, | ||
verbose=False): | ||
""" | ||
Initialize variables and load model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function comments should be moved above to class comments.
resolves #2229