mjob create -o emits "socket hang up" #194
Comments
This should be fixed in 1.2.8.
For reference, I'll describe the issue and debugging process in detail. Some users reported that this would happen very reliably:
I was never able to reproduce it nearly every time the way those users did, but I was able to see it reliably using Node 0.10.29 after up to 60-70 tries in a row. The problem was not specific to any of the public load balancer IPs.

The packet captures for the failed attempts showed two TCP connections and only three HTTP requests: the requests to create the job, to add inputs, and to end inputs. There were no requests to check the job's state. Termination was a little weird in that we went through the simultaneous-close TCP path, but that's not especially unusual, and there was nothing to explain the "socket hang up" error. It looked like the client just shut down the connection before it should have (and then, presumably, tried to use it).

Here's how many attempts it took to reproduce it using different Node versions. I ran these tests from a zone in our Amsterdam datacenter to try to match the latency the user saw from their network:
I discovered that the problem went away if I modified the code to disable Node's built-in HTTP agent, but the problem wasn't with the agent itself. I set NODE_DEBUG=tls,http in hopes that I would see a message indicating why the agent was closing the connection. I found this message was only reported for the cases that failed:
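For context, "disabling the agent" here just means opting each request out of Node's shared connection pool so it gets a fresh connection. Below is a minimal sketch of that kind of experiment, not the actual node-manta change; the host is a placeholder:

```js
var https = require('https');

// Passing agent: false opts this one request out of the shared
// connection pool, so it gets its own connection and its own teardown.
var req = https.request({
    host: 'example.com',   // placeholder, not the real Manta endpoint
    path: '/',
    agent: false
}, function (res) {
    res.resume();
});
req.on('error', function (err) {
    console.error(err.message);
});
req.end();
```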
From the code, it looked like this was happening when the CryptoStream was being destroyed. In some desperation, I used DTrace to trace writes to stderr and emit a JS stack trace when this happened, but (not surprisingly) that event was asynchronous with respect to the thing that caused it. Since it was CryptoStream.end() that caused it, I traced the same thing using this messy D script:
I ran that as "./trace.d PID", where PID was the pid of the shell from which I tried to reproduce the problem. (The D script only traces child processes of that shell.) Then from that shell, I tried to repro using this script (it started simpler, and grew complicated as I needed more data):
Here's the stack trace DTrace caught that pointed me to the bug:
Notice that we're ending the stream from lib/queue.js, a queue that's used for adding inputs. The problem was that we close the client when we finish adding inputs, but then when we go to check the job state, we pick a socket that we've just destroyed, and we get this "socket hang up" error.

The only question is how this ever worked. Our suspicion is that on lower-latency connections, the socket shutdown process normally happens fast enough that we don't get around to making the status request until all those sockets have been shut down, at which point we just create a new connection for the last request. On a higher-latency connection, the shutdown process must still be pending when we try to make the next request, so we pick up one of those sockets, only to find it immediately shut down on us.
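For anyone unfamiliar with the error itself: Node's HTTP client reports "socket hang up" when the connection goes away before any response comes back. Here's a small standalone sketch, not node-manta code, that produces the same error by having the server drop the connection before responding, which is roughly what happens when the status request lands on a socket that's already being torn down:

```js
var http = require('http');

// Server that accepts the request and then drops the connection
// without ever sending a response.
var server = http.createServer(function (req, res) {
    req.socket.destroy();
});

server.listen(0, function () {
    var clientReq = http.request({
        port: server.address().port,
        path: '/'
    });
    clientReq.on('error', function (err) {
        // Prints "socket hang up"; err.code is 'ECONNRESET'
        console.error(err.message);
        server.close();
    });
    clientReq.end();
});
```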
Users have reported "socket hang up" errors from "mjob create -o".