
mjob create -o emits "socket hang up" #194

Closed
davepacheco opened this issue Jul 14, 2014 · 2 comments

@davepacheco (Contributor)

Users have reported "socket hang up" errors from "mjob create -o".

davepacheco self-assigned this Jul 14, 2014
davepacheco pushed a commit to davepacheco/node-manta that referenced this issue Jul 14, 2014
@davepacheco (Contributor, Author)

This should be fixed in 1.2.8.

@davepacheco (Contributor, Author)

For reference, I'll describe the issue and debugging process in detail.

Some users reported that this would happen very reliably:

$ echo /$MANTA_USER/stor/1.sh | mjob create -o -m "wc -l" -r "wc -l"
added 1 input to a446c456-0b95-11e4-9c8b-5707efc995c6
mjob create: error: socket hang up

I was never able to reproduce it nearly every time the way users did, but with Node 0.10.29 I could reproduce it reliably, though it sometimes took 60-70 consecutive attempts. The problem was not specific to any of the public loadbalancer IPs. The packet captures for the failed attempts showed two TCP connections and only three HTTP requests: the requests to create the job, to add inputs, and to end inputs. There were no requests to check the job's state. Termination was a little weird in that we went through the simultaneous-close TCP path, but that's not all that unusual, and there was nothing there to explain the "socket hang up" error. It looked like the client just shut down the connection before it should have (and then, presumably, tried to use it).

Here's how many attempts it took to reproduce it using different Node versions. I ran these tests from a zone in our Amsterdam datacenter to try to match the latency the user saw from their network:

Version   Number of attempts before reproducing "socket hang-up"
v0.10.11: 32 attempts, 1 attempt, 42 attempts, 43 attempts
v0.10.24: 64 attempts, 42 attempts, 52 attempts, 1 attempt
v0.10.29: 75 attempts, 15 attempts, 62 attempts, 21 attempts

I discovered that the problem went away if I modified the code to disable Node's built-in HTTP agent, but the problem wasn't with the agent itself. I set NODE_DEBUG=tls,http in hopes of seeing a message indicating why the agent was closing the connection, and I found one message that showed up only in the runs that failed:

HTTP: HTTP socket close
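
For context, "disabling the built-in HTTP agent" means opting requests out of Node's connection pooling so that every request gets its own socket. Here's a minimal sketch of that idea (this is not the actual change I tried in node-manta, and the host below is just a placeholder):

var https = require('https');

// With agent: false, this request bypasses the shared https Agent entirely:
// it gets a dedicated connection that is not kept alive or reused.  With no
// socket reuse, the hang-up described in this issue has no chance to occur.
var req = https.request({
    host: 'example.manta.host',   // placeholder host
    path: '/',
    method: 'GET',
    agent: false
}, function (res) {
    res.resume();
});
req.on('error', function (err) {
    console.error(err.message);
});
req.end();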

From the code, it looked like the "HTTP socket close" message was printed when the CryptoStream was being destroyed. In some desperation, I used DTrace to trace writes to stderr and emit a JavaScript stack trace when that message was written, but (not surprisingly) that event was asynchronous with respect to whatever caused it. Since it was CryptoStream.end() that caused it, I traced the writes of the CryptoStream debug messages instead, using this messy D script:

#!/usr/sbin/dtrace -ws

/*
 * First clause: when a node process descended from the shell given as the
 * script's argument ($1) writes the "HTTP socket close" debug message to
 * stderr (fd 2), print its pid, abort that process with SIGABRT, and stop
 * tracing.
 */
syscall::write:entry
/arg0 == 2 && progenyof($1) && execname == "node" &&
 strstr(copyinstr(arg1, arg2), "HTTP socket close") != NULL/
{
        trace(pid);
        raise(SIGABRT);
        exit(0);
}

/*
 * Second clause: when one of those node processes writes a CryptoStream
 * end/destroy debug message to stderr, print its pid and a JavaScript
 * stack trace.
 */
syscall::write:entry
/arg0 == 2 && progenyof($1) && execname == "node" &&
 (this->s = copyinstr(arg1, arg2)) != NULL &&
 (strstr(this->s, "cleartext.destroy") != NULL ||
  strstr(this->s, "cleartext.end") != NULL ||
  strstr(this->s, "encrypted.destroy") != NULL ||
  strstr(this->s, "encrypted.end") != NULL)/
{
        trace(pid);
        jstack(150, 8192);
}

I ran that as "./trace.d PID", where PID was the pid of the shell from which I tried to reproduce the problem. (The D script only traces child processes of that shell.) Then from that shell, I tried to reproduce the problem using this script (it started out simpler and grew more complicated as I needed more data):

#!/bin/bash

#
# tryfail: try to reproduce node-manta#194
#

#
# Configuration
#

#
# Log this whole run of tests to this temporary file.  We'll also create
# several similarly-named files for each run.
#
tf_tmpfile=/var/tmp/$(basename $0).$$

# If true, record a packet capture for each attempt.
tf_dosnoop=false

#
# Manta host to use.  We override this so that we can more easily sift through
# packet captures.
#
tf_mantahost="165.225.172.22"

# IP interface to snoop on.
tf_ipinterface=net0


#
# Internal state
#

# number of iterations so far
tf_count=0
# last pid launched in the background
tf_pid=
# whether the last test failed
tf_failed=false


#
# Setup
#
export MANTA_URL=https://$tf_mantahost
export MANTA_TLS_INSECURE=1

cat > $tf_tmpfile <<EOF
using tmpfile $tf_tmpfile
using node: $(node -v) (from $(which node))
using MANTA_URL=$MANTA_URL (and MANTA_TLS_INSECURE=1)
EOF
cat $tf_tmpfile

#
# Run tests until it fails.
#
while :; do
        tf_count=$((tf_count + 1))

        if [[ $tf_dosnoop == "true" ]]; then
                snoop -o $tf_tmpfile.$tf_count.snoop -d $tf_ipinterface \
                    $tf_mantahost >>$tf_tmpfile 2>&1 &
                tf_pid=$!
        fi

        export NODE_DEBUG=http,tls
        echo /dap/stor/hello | mjob create -v -m wc -o \
            >> $tf_tmpfile 2>$tf_tmpfile.$tf_count.log || tf_failed=true
        export NODE_DEBUG=

        if [[ $tf_dosnoop == "true" ]]; then
                sleep 1
                kill $tf_pid
                wait
        fi

        [[ $tf_failed == "true" ]] && break;

        if [[ $((tf_count % 10)) -eq 0 ]]; then
                echo "note: no luck after $tf_count attempts, but still trying"
        fi
done

echo
echo "possibly repro'd after $tf_count attempts:"
echo "--------------------"
tail -1 $tf_tmpfile.$tf_count.log
echo "--------------------"
echo "client bunyan log: $tf_tmpfile.$tf_count.log"
if [[ $tf_dosnoop == "true" ]]; then
        echo "snoop capture: $tf_tmpfile.$tf_count.snoop"
fi

Here's the stack trace DTrace caught that pointed me to the bug:

             0xfed17fa5
              0x840a511
              0x840be5e
              0x82b1741
              << internal code >>
              (anon) as fs.writeSync at fs.js pos 13750
              << adaptor >>
              (anon) as SyncWriteStream.write at fs.js pos 44728
              << adaptor >>
              (anon) as Console.warn at console.js pos 2152
              << adaptor >>
              << internal >>
              << entry >>
              0x8467758
              0x8468f00
              0x8569314
              << internal code >>
              (anon) as b at native v8natives.js pos 26469
              << adaptor >>
              (anon) as debug at tls.js pos 2390
              (anon) as CryptoStream.end at tls.js pos 17430
              << adaptor >>
              (anon) as (anon) at /root/node_modules/manta/node_modules/restify/lib/clients/http_client.js line 578
              << adaptor >>
              forEach at native array.js pos 12950
              << adaptor >>
              (anon) as (anon) at /root/node_modules/manta/node_modules/restify/lib/clients/http_client.js line 578
              << adaptor >>
              forEach at native array.js pos 12950
              << adaptor >>
              close at /root/node_modules/manta/node_modules/restify/lib/clients/http_client.js line 578
              close at /root/node_modules/manta/lib/client.js line 1112
              _cb at /root/node_modules/manta/bin/mjob pos 3605
              << adaptor >>
              << internal >>
              g at events.js pos 5742
              emit at events.js pos 2216
              end at /root/node_modules/manta/lib/queue.js pos 1648
              << adaptor >>
              << internal >>
              (anon) as f at /root/node_modules/manta/node_modules/once/once.js pos 290
              closeOnDrain at /root/node_modules/manta/lib/queue.js pos 1950
              << internal >>
              g at events.js pos 5742
              emit at events.js pos 2216
              _drain at /root/node_modules/manta/lib/queue.js pos 1145
              _tickCallback at node.js pos 13561
              << internal >>
              << entry >>
              0x8467758
              0x8468f00
              0x842f92c
              0x82a08a2
              0x82a09b0
              0x82b981b
              0x82b9bee
              0x840f826
              0x840fe35
              0x8414cdf
              0x8408c3a
              0x82a30d7
              0x82b380b
              0x81fba03

Notice that we're ending the stream from lib/queue.js, a queue that's used for adding inputs.

The problem was that we close the client when we finish adding inputs, but then, when we go to check the job state, we pick a socket that we've just destroyed, and we get this "socket hang up" error. The only question is how this ever worked. Our suspicion is that on lower-latency connections, the socket shutdown normally completes before we get around to making the status request, so all of those sockets are already gone by then and we just create a new connection for the last request. On a higher-latency connection, the shutdown is still pending when we make the next request, so we pick up one of those dying sockets, only to have it shut down on us immediately.
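
For anyone unfamiliar with the error itself: Node's HTTP client reports "socket hang up" (ECONNRESET) when a request's connection goes away before any response arrives. Here's a small self-contained sketch of that generic failure mode; it is not the node-manta code path, but the server destroying the socket below plays the same role as the client reusing a connection that is already being torn down:

var http = require('http');

// A server that tears down the connection without ever sending a response.
var server = http.createServer(function (req, res) {
    req.connection.destroy();
});

server.listen(0, function () {
    var req = http.request({
        port: server.address().port,
        path: '/'
    }, function (res) {
        console.log('unexpected response: ' + res.statusCode);
    });

    // The connection dies before any response comes back, so the request
    // emits "socket hang up".
    req.on('error', function (err) {
        console.log(err.message);
        server.close();
    });

    req.end();
});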
