add language model #60
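It looks like the pre-commit tool was not used when this PR was submitted: the code has not been formatted, and the Travis CI check flags it. Please format these scripts with pre-commit and push again. You can refer to this link: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/contribute_to_paddle_cn.md#使用-pre-commit-钩子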
Please revise the code.
language_model/lm_ngram.py
Outdated
name="fourthw", type=paddle.data_type.integer_value(vocab_size))

# embedding layer
Efirst = wordemb(firstword)
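- `table_projection` cannot stand on its own; it can only be used as an input to `mixed_layer`. Do lines 44-47 really work?
- The way this is written is problematic. Please follow the approach used here: https://github.com/PaddlePaddle/models/blob/develop/text_classification/text_classification_cnn.py#L26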
language_model/lm_ngram.py
Outdated
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
Lines 44-47:
`Efirst` --> `first_emb`
Please rename the other variables the same way so the naming style stays consistent.
language_model/lm_ngram.py
Outdated
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)

contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
contextemb --> context_emb
language_model/lm_ngram.py
Outdated
# hidden layer
hidden = paddle.layer.fc(
    input=contextemb, size=hidden_size, act=paddle.activation.Relu())
Why was `Relu` chosen directly as the activation?
language_model/lm_ngram.py
Outdated
    input=hidden, size=hidden_size, act=paddle.activation.Relu())

# fc and output layer
predictword = paddle.layer.fc(
predictword --> predict_word
language_model/lm_rnn.py
Outdated
# generate
texts = {}  # type: {text : prob}
texts[input] = 1
for _ in range(num_words):
Text generation must stop when `<EOS>` is encountered.
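A minimal sketch of the requested behaviour, under stated assumptions: `next_word_probs` is a hypothetical callable standing in for the trained model, and the loop is greedy rather than tracking the `{text : prob}` candidates that the PR code keeps. It only illustrates where the `<EOS>` check belongs.

```python
EOS = "<EOS>"  # end-of-sentence token named in the review comment


def generate(prefix, next_word_probs, num_words):
    """Greedily extend `prefix` by at most `num_words` words.

    `next_word_probs` is a hypothetical stand-in for the model: it takes the
    current list of words and returns a {word: probability} dict.
    """
    words = list(prefix)
    for _ in range(num_words):
        probs = next_word_probs(words)
        # Pick the most probable next word (greedy choice, an assumption).
        word = max(probs, key=probs.get)
        if word == EOS:
            break  # stop generating; the EOS token itself is not appended
        words.append(word)
    return words
```

With this structure, `num_words` becomes an upper bound on the generated length instead of a fixed count.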
language_model/lm_rnn.py
Outdated
output_layer=output, parameters=parameters)

# generate text
while True:
Change this to read from a specified input file path and write the output to another file.
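One way this could look, as a sketch only: `generate_file` and its `generate_line` argument are hypothetical names, with `generate_line` standing in for whatever function turns one input prompt into generated text.

```python
import io


def generate_file(input_path, output_path, generate_line):
    """Read one prompt per line from input_path and write one result per line.

    `generate_line` is a hypothetical callable (prompt string -> output string)
    that stands in for the model's text generation.
    """
    with io.open(input_path, "r", encoding="utf-8") as fin, \
            io.open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            prompt = line.strip()
            if not prompt:
                continue  # skip blank lines in the input file
            fout.write(generate_line(prompt) + u"\n")
```

Replacing the interactive `while True:` loop with a function like this also makes prediction scriptable and easier to test.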
language_model/lm_rnn.py
Outdated
train()

# -- predict --
predict()
Please put training and prediction in separate files.
language_model/lm_rnn.py
Outdated
if rnn_type == 'lstm':
    rnn_cell = paddle.networks.simple_lstm(input=emb, size=hidden_size)
    for _ in range(num_layer - 1):
        rnn_cell = paddle.networks.simple_lstm(
Please specify the activation explicitly; do not leave it out.
language_model/lm_ngram.py
Outdated
""" | ||

assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
The document is very poorly formatted. Please follow the documentation formatting conventions.
language_model/README.md
Outdated
## Introduction
A language model (LM) is a probability distribution model; put simply, it is a model used to compute the probability of a sentence. Given a sentence (a sequence of words):

<div align=center><img src='images/s.png'/></div>
- The image markup is incorrect; please see https://github.com/PaddlePaddle/book/wiki/Github-Markdown%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98
- Please write formulas directly in LaTeX; do not use images for formulas.
language_model/README.md
Outdated
Its probability can be expressed as:

<div align=center><img src='images/ps.png'/> (Eq. 1)</div>
Please write the formula in LaTeX instead of using an image.
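As a sketch of what this could look like: the images are not visible in this conversation, so the symbols $w_t$ and $T$ below are assumptions. The second line is the standard chain-rule factorization of a sentence probability, which is presumably what (Eq. 1) shows.

```latex
S = w_1, w_2, \ldots, w_T

P(S) = P(w_1, w_2, \ldots, w_T)
     = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \tag{1}
```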
language_model/README.md
Outdated
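<div align=center><img src='images/ps.png'/> (Eq. 1)</div>

A language model can compute P(S) in (Eq. 1) and its intermediate results. **It can be used to determine which word sequence is more likely, or, given several words, to predict the most likely next word.**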
Please use LaTeX for variables.
language_model/README.md
Outdated
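## About this example
Common ways to implement a language model include N-Gram, RNN, and seq2seq. This example implements language models based on N-Gram and RNN. **The file structure of this example is as follows**:

* data_util.py: reads the corpus and builds, saves, and loads the dictionary.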
data_util.py --> data_util
language_model/README.md
Outdated
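* lm_rnn.py: defines, trains, and runs prediction with the RNN-based language model.
* lm_ngram.py: defines, trains, and runs prediction with the n-gram-based language model.

***Note:** N-Gram language models generally perform worse than RNN language models, so the RNN-based model is recommended in practice. This example therefore focuses on the RNN-based model and only briefly covers the N-Gram model.*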
- Please write the N in N-Gram as a LaTeX formula: $N$-Gram.
- For $N$-Gram, please point readers to PaddleBook.
language_model/README.md
Outdated
* 2. Initialize the model: this includes the model structure, the parameters, the optimizer (the demo uses Adam), and the trainer, as follows:

```python
The indentation is too messy.
language_model/README.md
Outdated
    cost=cost, parameters=parameters, update_equation=adam_optimizer)
```

* 3. Define the callback function event_handler to track how the loss changes during training and to save the model parameters at the end of each pass:
- Use a text block.
- The indentation is too messy.
language_model/README.md
Outdated
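* Encoding: utf-8; this example has already been adapted for Chinese.
* Content format: one sentence per line; the words in each line are separated by a single space.
* Configure the data settings in the \_\_main\_\_ function of lm\_rnn.py as needed: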
The indentation is too messy.
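For reference, the corpus format described in the quoted README lines (UTF-8, one sentence per line, words separated by a single space) can be read with a short sketch like the one below. `read_corpus` and `build_dict` are hypothetical names chosen for illustration, not the functions in this PR's data_util.py, and the frequency-ordered word ids are an assumption.

```python
import io
from collections import Counter


def read_corpus(path):
    """Yield each non-empty line of a UTF-8 corpus as a list of words."""
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            words = line.split()  # words are separated by whitespace
            if words:
                yield words


def build_dict(sentences, min_count=1):
    """Map each word to an integer id, most frequent words first."""
    counts = Counter(w for sentence in sentences for w in sentence)
    kept = [w for w, c in counts.items() if c >= min_count]
    # Sort by descending frequency, then alphabetically for stable ids.
    kept.sort(key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(kept)}
```

Usage would be along the lines of `word_dict = build_dict(read_corpus(train_file))`, where `train_file` is whatever path the `__main__` configuration points at.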
language_model/README.md
Outdated
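## n-gram language model

The n-gram model is also called an (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Therefore (Eq. 1) can be approximated as:
<div align=center><img src='images/ps2.png'/></div>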
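Written in LaTeX rather than as an image, the approximation described in the quoted text would presumably read as follows. The symbols $w_t$ (the $t$-th word) and $T$ (the sentence length) are assumptions, since the image itself is not visible here:

```latex
P(S) = P(w_1, w_2, \ldots, w_T)
     \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})
```

Each factor conditions on only the preceding $n-1$ words, which is exactly the limited-history assumption stated in the text.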
language_model/README.md
Outdated
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name))  # load parameters
```

* 2. Predict the next word from a context of 4 (n-1) words and print it:
Please pay attention to the code block indentation; it is too messy.
LGTM
language_model/README.md
Outdated
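# Language model

## Introduction
A language model (LM) is a probability distribution model; put simply, it is a model used to compute the probability of a sentence. **It can be used to determine which word sequence is more likely, or, given several words, to predict the most likely next word.**It is an important foundational model in natural language processing.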
This bold marker does not render correctly; please fix it.
I need someone to review whether my code is correct.