implement master server client, remove unnecessary dummy variable #2429
go/master/service.go
Outdated
@@ -13,18 +15,15 @@ const (
	targetTaskCount = 300
Where is this used? I couldn't find any usage!
@gongweibao Thanks! Removed.
//
// SetDataset can be call multiple times. But only the first call will
// be honored.
func (s *Service) SetDataset(globPaths []string, dummy *int) error {
This function is getting a bit long. Would it be better to split it up?
Thank you! Done.
go/master/service.go
Outdated
// TODO(helin): client need to retry in this
// error case. Gotcha: RPC client can't
// compare returned error with predefined
// erros like io.EOF. Because interface don't
erros=>errors
Thanks! Done.
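The gotcha in the TODO quoted above can be shown without starting a server. This is a minimal sketch, assuming the usual net/rpc behaviour where an error returned by a remote method reaches the caller as an rpc.ServerError carrying only the message text. ErrNoMoreTask, overTheWire and isSameError are hypothetical names for illustration and are not part of the PR:

```go
package main

import (
	"errors"
	"fmt"
	"net/rpc"
)

// ErrNoMoreTask is a hypothetical sentinel error defined on the server side.
var ErrNoMoreTask = errors.New("no more available task")

// overTheWire mimics what a net/rpc client hands back: only the error
// text survives, wrapped in rpc.ServerError (a string type).
func overTheWire(err error) error {
	return rpc.ServerError(err.Error())
}

// isSameError compares by message, the only thing that survives the RPC.
func isSameError(got, want error) bool {
	return got != nil && want != nil && got.Error() == want.Error()
}

func main() {
	got := overTheWire(ErrNoMoreTask)
	fmt.Println("equal by identity:", got == ErrNoMoreTask)
	fmt.Println("equal by message: ", isSameError(got, ErrNoMoreTask))
}
```

Comparing by message is brittle if the wording changes, so a shared constant for the text, or an explicit status field in the reply struct, may be a sturdier choice.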
	return c.client.Close()
}

// Connect connects the connection to a address.
func (c *Conn) Connect(addr string) error {
One thing I don't understand: why can't Connect take the lock once for the whole process instead of locking twice?
c.mu.Lock()
defer c.mu.Unlock()
The rpc.DialHTTP call that follows is a heavy operation, and I don't want it to block other c.mu.Lock() callers. For example, suppose the master addr is initially a and rpc.DialHTTP is in progress. If the master addr then changes to b, we don't want Connect("b") to be blocked by Connect("a").
Later, waiting and retrying on Dial failure will be added here, which makes the operation even heavier.
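The two-lock pattern described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: fakeClient and slowDial are hypothetical stand-ins for *rpc.Client and rpc.DialHTTP so the example runs without a server, and the policy for a concurrent Connect (last writer wins) is a guess.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeClient stands in for *rpc.Client so the sketch runs without a server.
type fakeClient struct{ addr string }

func (f *fakeClient) Close() error { return nil }

// slowDial is a hypothetical stand-in for rpc.DialHTTP.
func slowDial(addr string) (*fakeClient, error) {
	time.Sleep(20 * time.Millisecond) // simulate a heavy dial
	return &fakeClient{addr: addr}, nil
}

// Conn holds the current client, guarded by mu.
type Conn struct {
	mu     sync.Mutex
	client *fakeClient
}

// Connect takes the lock twice, briefly each time, and dials in between.
func (c *Conn) Connect(addr string) error {
	// First critical section: detach the old client.
	c.mu.Lock()
	old := c.client
	c.client = nil
	c.mu.Unlock()
	if old != nil {
		_ = old.Close()
	}

	// The slow part runs without the lock, so other callers are not blocked.
	client, err := slowDial(addr)
	if err != nil {
		return err
	}

	// Second critical section: publish the new client.
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.client != nil {
		// A concurrent Connect already published a client. This sketch
		// replaces it; the real code may choose differently.
		_ = c.client.Close()
	}
	c.client = client
	return nil
}

// Addr reports the address of the current client, or "" if there is none.
func (c *Conn) Addr() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.client == nil {
		return ""
	}
	return c.client.addr
}

func main() {
	c := &Conn{}
	if err := c.Connect("a"); err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected to", c.Addr())
}
```

Holding the mutex only around the pointer swaps keeps the critical sections short, which matters more once retry and backoff are added to the dial step.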
Got it. Thanks!
	return c
}

func (c *Client) monitorMaster(addr Addresser) {
I have a couple of questions here:
- This function appears to reconnect to the master at regular intervals. Would it be better to reconnect only when an RPC call hits a network error?
- There is only one address, so I don't understand what curMaster and lastMaster mean.
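If the master goes down and is restarted on a different IP, the latest IP can be obtained from the Addresser parameter, and the client connects to that latest IP.
Thanks, helin. I understand now.
I suggest adding the sentence above to the comment.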
// Addresser provide the address of the master server.
type Addresser interface {
	Address() string
}
Thanks! Done.
// get the latest address of the master server,
// connect to the new address once address changed.
curMaster := addr.Address()
if curMaster != lastMaster {
..
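The address check that monitorMaster performs on each poll can be sketched as a single step. This is a minimal sketch under assumptions: staticAddr and reconnectIfChanged are hypothetical names, the connect callback stands in for the real connection logic, and treating an empty address as "unknown" is a guess rather than the PR's behaviour.

```go
package main

import "fmt"

// Addresser provides the address of the master server.
type Addresser interface {
	Address() string
}

// staticAddr is a hypothetical Addresser used only for this sketch.
type staticAddr string

func (s staticAddr) Address() string { return string(s) }

// reconnectIfChanged performs one polling step: it reads the latest
// address and calls connect only when it differs from lastMaster.
// It returns the address that should be remembered for the next poll.
func reconnectIfChanged(addr Addresser, lastMaster string, connect func(string) error) string {
	curMaster := addr.Address()
	if curMaster == "" || curMaster == lastMaster {
		// Unknown or unchanged address: nothing to do.
		return lastMaster
	}
	if err := connect(curMaster); err != nil {
		// Keep the old value so the next poll tries again.
		return lastMaster
	}
	return curMaster
}

func main() {
	var dialed []string
	connect := func(a string) error {
		dialed = append(dialed, a)
		return nil
	}

	last := ""
	// Simulate three polls; the master moves to a new IP on the third.
	for _, a := range []staticAddr{"10.0.0.1:8080", "10.0.0.1:8080", "10.0.0.2:8080"} {
		last = reconnectIfChanged(a, last, connect)
	}
	fmt.Println(dialed)
}
```

In the real client this step would presumably run in a loop with a sleep between polls.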
go/master/service.go
Outdated
for _, s := range globPaths {
	match, err := filepath.Glob(s)
	if err != nil {
		panic(err)
The master panics when it receives a bad argument?
Thanks for pointing that out! It should not panic. Done.
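One way to return the error to the caller instead of panicking is sketched below. expandGlobs is a hypothetical helper name, and the actual fix in the PR may be structured differently. The sketch relies on filepath.Glob reporting filepath.ErrBadPattern for malformed patterns.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// expandGlobs is a hypothetical helper: it expands every pattern and
// reports bad input as an error, so a malformed request from a client
// cannot crash the master.
func expandGlobs(globPaths []string) ([]string, error) {
	var paths []string
	for _, g := range globPaths {
		matches, err := filepath.Glob(g)
		if err != nil {
			// filepath.Glob only fails on a malformed pattern.
			return nil, fmt.Errorf("bad glob pattern %q: %w", g, err)
		}
		paths = append(paths, matches...)
	}
	if len(paths) == 0 {
		return nil, errors.New("no valid dataset specified")
	}
	return paths, nil
}

func main() {
	// "[" is an unterminated character class, so it is rejected.
	_, err := expandGlobs([]string{"["})
	fmt.Println("error:", err)
}
```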
go/master/service.go
Outdated
@@ -123,6 +214,8 @@ func (s *Service) GetTask(dummy int, task *Task) error {
		return err
	}

	*task = t.Task
Line 247 should log the failed task, which makes it easier to collect statistics and debug when errors occur.
Also, should we set a threshold so that the Job fails once the failed tasks reach a certain fraction of all tasks?
How about changing the type of taskQueues.Failed from []Task to []taskEntry, so that we can use numTimeout to record the number of failures?
Thanks! I have added a log message that reports how many times the task failed, so we have a record of what happened. I have not changed the type, though. Do you think that is sufficient? :)
-- Helin
Cool, I think it's sufficient in this PR.
--Yanxu
@gongweibao Good idea! The error log has been added.
I don't think failing the Job is necessary. It is enough for the user to know that a task failed; let the user decide under what conditions the Job counts as failed.
@Yancey1989 Thanks! I have added a log message that reports how many times the task failed, so we have a record of what happened. I have not changed the type, though. Do you think that is sufficient? :)
go/master/service.go
Outdated
	select {
	case <-s.ready:
	}

	s.mu.Lock()
	defer s.mu.Unlock()

	t, ok := s.taskQueues.Pending[taskID]
If the taskID belongs to a different epoch, there will be a problem here.
If the task is no longer in the pending state, we don't need to check whether it is still unfinished (if it is not pending, it has either finished or failed, and neither case needs handling).
go/master/service.go
Outdated
		return errors.New("no more available task")
	}
	s.taskQueues.Todo = s.taskQueues.Done
	s.taskQueues.Todo = nil
Todo=nil? Done=nil?
Good catch! Thanks! Done. Added this condition into test case as well.
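The bug flagged here is that Todo is assigned twice, so the finished tasks are lost and Done is never cleared. Below is a minimal sketch of the intended rollover, using the append form that appears in a later hunk of this PR. The trimmed-down Task and taskQueues types and the startNextPass name are assumptions for illustration.

```go
package main

import "fmt"

// Task and taskQueues are trimmed-down stand-ins for the PR's types.
type Task struct{ ID int }

type taskQueues struct {
	Todo []Task
	Done []Task
}

// startNextPass (hypothetical name) moves every finished task back to
// Todo and then clears Done, so the next pass sees each task once.
func (q *taskQueues) startNextPass() {
	q.Todo = append(q.Todo, q.Done...)
	q.Done = nil // the buggy version cleared Todo here instead
}

func main() {
	q := &taskQueues{Done: []Task{{ID: 1}, {ID: 2}}}
	q.startNextPass()
	fmt.Println("todo:", len(q.Todo), "done:", len(q.Done))
}
```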
go/master/service.go
Outdated
	s.initDone = true
	return nil
}

// GetTask gets a new task from the service.
func (s *Service) GetTask(dummy int, task *Task) error {
This function is long and complex. We need to submit test cases for GetTask and TaskFinish in a separate PR.
Thanks! It had indeed become rather complex, so I reworked it and split out a function.
There is already a test case, please see: https://github.com/PaddlePaddle/Paddle/pull/2429/files#diff-dc14b64eab9d49fd494527c071e6121aR82
@@ -123,6 +214,8 @@ func (s *Service) GetTask(dummy int, task *Task) error {
		return err
	}
Should we add an assert at line 211 to ensure that task t is not already in Pending?
Of course, in theory it should not be.
I don't think that is needed. I think unit tests are enough to guarantee correctness here.
@@ -162,17 +255,27 @@ func (s *Service) GetTask(dummy int, task *Task) error {
Should there also be a TaskFailed interface, so that a trainer that knows it has failed can report the error directly instead of waiting for the timeout?
Also, @typhoonzero: if a Kubernetes container exits, does the master need to know, and can it find out promptly? We often used to see programs die shortly after starting (for example, a core dump). If this can be detected, the waiting time can be cut substantially.
If a Kubernetes container exits, does the master need to know, and can it find out promptly? We often used to see programs die shortly after starting (for example, a core dump). If this can be detected, the waiting time can be cut substantially.
Does "the waiting time can be cut substantially" refer to the trainer or the master? I take "container exits" to mean the trainer's container exiting. Trainers are started as a Kubernetes Job, and whenever any container exits with a non-zero return value, Kubernetes automatically restarts that container.
The master needs to determine whether a task has failed, for example because of a core dump. Currently it uses a timeout to decide.
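With kubernetes + docker, an immediate core dump should not happen: running a stable image avoids dependency and library version mismatch problems. A core-dump-type problem would more likely be something like an FPE after running for a while. Other errors could indeed make a trainer fail right after startup, for example a bug in the user's train.py.
But I think this kind of problem should be handled by kubernetes rather than the master. For example, kubernetes could decide that if 50% of the trainers in a job have failed, the whole job fails, at which point there is no reason for the corresponding master and pserver to keep running.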
If it core dumps, it can't call a function to notify the master anyway.
I suggest optimizing this only if the problem actually shows up and is serious; otherwise it just makes the system more complex and harder to maintain.
Yes. We will optimize these issues later based on how things run in practice. I also agree with adding a master RPC interface such as TaskFail, to prevent tasks that are bound to fail from being dispatched again.
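The TaskFailed interface discussed in this thread did not exist in the PR at this point. The following is a hypothetical sketch of what such an RPC method could look like, assuming a per-task failure counter and a maxFailure limit. The field names and the policy are illustrative guesses, not the project's design.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Task and taskEntry are trimmed-down, hypothetical versions of the PR's types.
type Task struct{ ID int }

type taskEntry struct {
	Task       Task
	NumFailure int // hypothetical counter of reported failures
}

// Service keeps only the state this sketch needs.
type Service struct {
	mu         sync.Mutex
	maxFailure int
	todo       []taskEntry
	pending    map[int]taskEntry
	failed     []taskEntry
}

// TaskFailed follows the net/rpc method shape (argument, reply pointer,
// error). A trainer calls it to report a failure immediately instead of
// letting the master wait for the timeout.
func (s *Service) TaskFailed(taskID int, _ *int) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	t, ok := s.pending[taskID]
	if !ok {
		return errors.New("pending task not found")
	}
	delete(s.pending, taskID)

	t.NumFailure++
	if t.NumFailure > s.maxFailure {
		// Give up: the task keeps failing, so stop dispatching it.
		s.failed = append(s.failed, t)
		return nil
	}
	// Otherwise put it back so another trainer can retry it.
	s.todo = append(s.todo, t)
	return nil
}

func main() {
	s := &Service{
		maxFailure: 1,
		pending:    map[int]taskEntry{7: {Task: Task{ID: 7}}},
	}
	err := s.TaskFailed(7, nil)
	fmt.Println(err, len(s.todo), len(s.failed))
}
```

As noted in the thread, a trainer that crashes outright cannot make this call, so the timeout path would still be needed alongside it.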
go/master/service.go
Outdated
@@ -123,6 +214,8 @@ func (s *Service) GetTask(dummy int, task *Task) error {
		return err
	}

	*task = t.Task
I think task information should be logged at get task, task finish, retry task, and task to fail, so that when we find a problem while debugging we can trace it through the task's lifecycle. Ideally it should also be easy to filter by keyword :).
Good idea! Done.
go/master/service.go
Outdated
		s.taskQueues.Todo = append(s.taskQueues.Todo, s.taskQueues.Done...)
		s.taskQueues.Done = nil
	}
This needs to be logged.
Thanks! Done.
go/master/service.go
Outdated
	}

	// task finished, reset timeout
	t.NumTimeout = 0
	s.taskQueues.Done = append(s.taskQueues.Done, t)
	delete(s.taskQueues.Pending, taskID)

	if len(s.taskQueues.Pending) == 0 {
		s.taskQueues.Todo = append(s.taskQueues.Todo, s.taskQueues.Done...)
This needs to be logged.
Thanks! Done.
go/master/service.go
Outdated
	s.mu.Lock()
	defer s.mu.Unlock()

	t, ok := s.taskQueues.Pending[taskID]
	if !ok {
		return ErrPendingTaskNotFound
		return errors.New("pending task not found")
This needs to be logged.
Thanks! Done.
When building with recordio (github.com/PaddlePaddle/recordio) it says
../../../recordio/chunk.go:95: undefined: snappy.NewBufferedWriter
@typhoonzero It works on my machine,
✝➜ master git:(master_client) ✗ pwd
/root/gopath/src/github.com/PaddlePaddle/Paddle/go/cmd/master
✝➜ master git:(master_client) ✗ go build master.go
✝➜ master git:(master_client) ✗
Can you try
go get ./...
-- Helin
It seems that my github.com/golang/snappy is not on the master branch for some reason. Updating this repo fixed the problem. Thanks!
-- Wuyi
go/pserver/client.go
Outdated
for i := range knownServers {
	if knownServers[i].Addr != curServers[i].Addr {
for i := range lastServers {
	if lastServers[i].Addr != curServers[i].Addr {
Would it be better to change it to the pattern below:
if lastServers[i].Addr == curServers[i].Addr{
continue
}
...
Thank you! Good idea. Done.
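The early-continue form suggested above can be sketched in full as follows. The Server type and the changedServers helper are assumptions for illustration; the elided body in the suggestion is replaced here by simply collecting the indices that changed.

```go
package main

import "fmt"

// Server is a trimmed-down stand-in for the pserver client's type.
type Server struct {
	Index int
	Addr  string
}

// changedServers (hypothetical helper) returns the indices whose
// address differs between the last and the current server list.
func changedServers(lastServers, curServers []Server) []int {
	var changed []int
	for i := range lastServers {
		if i >= len(curServers) {
			break // assumption: ignore entries missing from the current list
		}
		if lastServers[i].Addr == curServers[i].Addr {
			continue // unchanged, skip early to keep the main path flat
		}
		changed = append(changed, i)
	}
	return changed
}

func main() {
	last := []Server{{0, "a:1"}, {1, "b:1"}}
	cur := []Server{{0, "a:1"}, {1, "c:1"}}
	fmt.Println(changedServers(last, cur))
}
```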
go/master/service.go
Outdated
	}

	if len(paths) == 0 {
		return nil, errors.New("no valid datset specified")
datset => dataset
Thanks! Done.
LGTM
LGTM++
LGTM
)

func main() {
	port := flag.Int("port", 8080, "port of the master server.")
	dataset := flag.String("training_dataset", "", "dataset: comma separated path to RecordIO paths, supports golb patterns.")
Will we follow wangyi's suggestion here? It is a strategy for avoiding an already-occupied port: the server listens on port number "0", the operating system (k8s) then assigns an idle port, and the program can query for that port.
I'm not familiar with k8s. Does each pod have its own port namespace? I have googled it but am still not sure. If so, please ignore this comment.
@dzhwinter That strategy of using 0 for port is mostly used for unit tests when we don't have control over the testing environment. E.g., https://github.com/PaddlePaddle/Paddle/blob/develop/go/pserver/client_test.go#L20
For non-container based production use, admin will know which port is free. And if we are using container (k8s is based on container technology) this is not a problem, because typically there is only one program running inside the container, so all ports not reserved by OS will be free.
Got it. Thanks!
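The port-0 strategy discussed above, which the reply describes as mostly useful in unit tests, can be sketched with the standard net package. listenOnFreePort is a hypothetical helper name, and binding to 127.0.0.1 is an assumption suited to tests rather than production.

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort asks the OS for any free TCP port on localhost and
// returns the listener together with the port that was assigned.
func listenOnFreePort() (net.Listener, int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	// With port 0 the real port is only known after Listen returns.
	return l, l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	l, port, err := listenOnFreePort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on port", port)
}
```

A test can pass the returned port to the client under test, which avoids collisions when several tests run on the same machine.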
No description provided.