
add language model #60

Merged
merged 31 commits into from
Jun 14, 2017

Conversation

zhaopu7
Contributor

@zhaopu7 zhaopu7 commented Jun 1, 2017

Need someone to review whether my code is correct.

@zhaopu7 zhaopu7 changed the title from "add language model code and PTB data" to "add language model" on Jun 2, 2017
@lcy-seso
Collaborator

lcy-seso commented Jun 2, 2017

It looks like the pre-commit tool was not used when submitting this PR, so the code is not formatted and the Travis-CI check failed. Please format these scripts with pre-commit and push again. See: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/contribute_to_paddle_cn.md#使用-pre-commit-钩子.

Collaborator

@lcy-seso lcy-seso left a comment

Please revise the code.

name="fourthw", type=paddle.data_type.integer_value(vocab_size))

# embedding layer
Efirst = wordemb(firstword)
Collaborator

Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)

Collaborator

Lines 44 ~ 47:
Efirst --> first_emb
Keep a consistent naming style for the remaining variables as well.

Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)

contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
Collaborator

contextemb --> context_emb


# hidden layer
hidden = paddle.layer.fc(
    input=contextemb, size=hidden_size, act=paddle.activation.Relu())
Collaborator

Why was Relu chosen directly as the activation?

input=hidden, size=hidden_size, act=paddle.activation.Relu())

# fc and output layer
predictword = paddle.layer.fc(
Collaborator

predictword --> predict_word

# generate
texts = {} # type: {text : prob}
texts[input] = 1
for _ in range(num_words):
Collaborator

Text generation should terminate when it encounters `<EOS>`.
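The reviewer's point — stop generating as soon as the end-of-sentence token appears — could be sketched as below. `next_word` is a hypothetical stand-in for the model's actual predictor, not code from this PR:

```python
EOS = '<EOS>'

def generate(next_word, start, max_words):
    """Greedy generation that terminates as soon as <EOS> is produced."""
    words = list(start)
    for _ in range(max_words):
        w = next_word(words)
        if w == EOS:  # stop at end-of-sentence, as the review requests
            break
        words.append(w)
    return words
```

Without the `<EOS>` check, the loop always runs `max_words` iterations and appends the sentinel token into the output.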

output_layer=output, parameters=parameters)

# generate text
while True:
Collaborator

Change this to take an input file path and write the results to a separate output file.

train()

# -- predict --
predict()
Collaborator

Please put training and prediction in separate files.

if rnn_type == 'lstm':
    rnn_cell = paddle.networks.simple_lstm(input=emb, size=hidden_size)
    for _ in range(num_layer - 1):
        rnn_cell = paddle.networks.simple_lstm(
Collaborator

Please specify the activation explicitly; do not omit it.

"""

assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0


Collaborator

@lcy-seso lcy-seso left a comment

The document formatting is very poor. Please follow the documentation formatting conventions.

## Introduction
A language model (LM) is a probability distribution model; put simply, it computes the probability of a sentence. Given a sentence (a sequence of words):

<div align=center><img src='images/s.png'/></div>
Collaborator

  1. The image markup is wrong; please refer to https://github.com/PaddlePaddle/book/wiki/Github-Markdown%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98
  2. Please write the formulas directly in LaTeX; do not use images for formulas.


Its probability can be expressed as:

<div align=center><img src='images/ps.png'/> &nbsp;&nbsp;&nbsp;&nbsp;(式1)</div>
Collaborator

Please write LaTeX formulas; do not use images.
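For reference, (式1) presumably shows the standard chain-rule decomposition of the sentence probability; written directly in LaTeX, as the review requests, it would read:

```latex
P(S) = P(w_1, w_2, \ldots, w_m)
     = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})
```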


<div align=center><img src='images/ps.png'/> &nbsp;&nbsp;&nbsp;&nbsp;(式1)</div>

A language model can compute P(S) in (Eq. 1) as well as its intermediate results. **It can be used to determine which word sequence is more likely, or, given several words, to predict the most likely next word.**
Collaborator

Please typeset variables as LaTeX formulas.

## About this example
Common implementations of language models include N-Gram, RNN, and seq2seq. This example implements N-Gram- and RNN-based language models. **The file structure of this example is as follows:**

* data_util.py: implements corpus reading as well as vocabulary building, saving, and loading.
Collaborator

data_util.py --> data_util

* lm_rnn.py: implements the definition, training, and prediction of the RNN-based language model.
* lm_ngram.py: implements the definition, training, and prediction of the n-gram-based language model.

***Note:** An N-Gram language model generally performs worse than an RNN-based one, so the RNN-based model is recommended in practice. This example therefore focuses on the RNN-based model and covers the N-Gram model only briefly.*
Collaborator

  1. Please typeset the N in N-Gram as a LaTeX formula: $N$-Gram.
  2. For $N$-Gram, please point readers to the PaddleBook.


* 2. Initialize the model, including its structure, parameters, optimizer (Adam in this demo), and the trainer, as follows:

```python
Collaborator

The indentation is messy.

cost=cost, parameters=parameters, update_equation=adam_optimizer)
```

* 3. Define the callback function event_handler to track the change of loss during training and to save the model parameters at the end of each pass:
Collaborator

  1. Use a code block.
  2. The indentation is messy.

* Encoding: UTF-8; this example already handles Chinese text.
* Content format: one sentence per line, with words separated by a single space.
* Adjust the data configuration in the \_\_main\_\_ function of lm\_rnn.py as needed:

Collaborator

The indentation is messy.

## n-gram language model

The n-gram model, also known as the (n-1)-th order Markov model, makes a finite-history assumption: the probability of the current word depends only on the preceding n-1 words. (Eq. 1) can therefore be approximated as:
<div align=center><img src='images/ps2.png'/></div>
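The approximation in the image is presumably the standard n-gram factorization; written in LaTeX, consistent with the chain-rule formula above, it would read:

```latex
P(S) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```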
Collaborator

parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name)) # load parameters
```

* 2. Predict the next word from the preceding 4 (n-1) words and print it:
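The prediction step described here — picking the most probable next word from the model's output distribution — can be sketched as follows; `pick_next_word` is a hypothetical helper, not code from lm_ngram.py:

```python
def pick_next_word(prob, idx_to_word):
    """Return the word with the highest probability in the output distribution."""
    # prob is a sequence of per-word probabilities indexed by word id
    next_id = max(range(len(prob)), key=lambda i: prob[i])
    return idx_to_word[next_id]
```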
Collaborator

Please mind the code-block indentation; it is messy.

Collaborator

@lcy-seso lcy-seso left a comment

LGTM

# Language Model

## Introduction
A language model (LM) is a probability distribution model; put simply, it computes the probability of a sentence. **It can be used to determine which word sequence is more likely, or, given several words, to predict the most likely next word.** It is an important foundational model in natural language processing.
Collaborator

The bold markup here renders incorrectly; please fix it.

@lcy-seso lcy-seso merged commit be4ad5f into PaddlePaddle:develop Jun 14, 2017