Hostnames never re-resolved into IP addresses #115

Closed
briancervenka opened this issue Nov 2, 2015 · 32 comments

Comments

@briancervenka

I have a cluster of a few relay instances pointing to an Amazon AWS ELB instance. It is a temporary cluster running to tee traffic from our clients, so that it is sent both to an older radar cluster and to a newer radar cluster with a new hashing scheme and backend nodes.

Over time, AWS changes the DNS CNAMEs for an ELB instance to point to new IP addresses as it restructures the backend. Carbon-c-relay, though, seems to resolve the IP addresses once and keep them until restart. It would be nice if it periodically refreshed the IP list behind the CNAME it is aiming at: either on every connection, cached for the DNS TTL, or perhaps only after a specific number of connection errors occur.

My config file is this:
cluster retired
any_of useall oldelb.internal.amazonaws.com:2001;

cluster new
any_of useall newelb.internal.amazonaws.com:2001;

match * send to retired;
match * send to new stop;

@briancervenka
Author

Looks like this would not be straightforward to implement. It would probably need to pass hostnames instead of IP addresses into server_new, and let resolution happen at connection time instead of inside config_new. The useall option may make this tougher to do.

@grobian
Owner

grobian commented Nov 3, 2015

Yes, the design is not really suited for this; I need to think about it. I don't want zillions of hostname lookups either.

@grobian
Owner

grobian commented Nov 5, 2015

What actually would help/work is to SIGHUP the relay process periodically, or whenever a DNS change happens. It will re-read the config, and hence also re-resolve the hostnames in use.

@grobian
Owner

grobian commented Nov 7, 2016

You can emulate this behaviour by SIGHUP-ing the relay, I think that's the best you can get for now.
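
For example, a crontab entry along these lines would do it every 5 minutes (the process name "relay" is an assumption; match it to the binary name on your system):

# reload the relay config, and thereby re-resolve its hostnames, every 5 minutes
*/5 * * * * pkill -1 relay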

@grobian grobian closed this as completed Nov 7, 2016
@tehlers320

Collectd went the other way on this issue: collectd/collectd@4a89ba5.

This is really inconvenient; I just had to reload my relay for the same issue.

@grobian
Owner

grobian commented Nov 10, 2016

So, if the relay would automagically reload itself every X minutes, you'd be happy?

@grobian grobian reopened this Nov 10, 2016
@tehlers320

If it's just a reload, then anybody can just cron a SIGHUP; that's a workable solution. If anything, honoring the DNS TTLs would be the way to go. It shouldn't be carbon-c-relay's job or concern to do nscd's work. Thanks for all of your great work on carbon-c-relay, btw.

@grobian
Owner

grobian commented Nov 10, 2016

Ah, so you want to have min(TTLs) as refresh interval or something. I see.

@piotr1212
Contributor

In an ideal world I think it should resolve on every connect and let the OS (nscd/dnsmasq/systemd-resolved) do the caching.

In the real world many systems don't have DNS caching configured....

@grobian
Owner

grobian commented Jan 12, 2017

Since we/I have always been using IP addresses (to avoid any resolution whatsoever), that is definitely the configuration for people who don't want any resolution to take place.

So, thinking this a bit through:

  • the "useall" clause is there for expanding all DNS pointers, when this is used resolving will only take place during reading config
  • without that, the first record returned by the resolver is used
  • it shouldn't be that much of a deal to move that to the connection code instead
  • people who don't want this should use IPs, I think
  • is there a use-case for wanting to resolve, but never switch IPs?

grobian added a commit that referenced this issue Jan 12, 2017
Determine when we don't have explicit IP addresses and re-resolve every
time we attempt to connect.  This also means that we now try all
addresses returned in order instead of just the first entry.
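
For illustration, the connect-time resolution described in that commit message could look roughly like the following (a minimal sketch only, with a made-up function name, not the actual carbon-c-relay code):

/* Sketch: resolve the configured hostname at connect time and try every
 * address the resolver returns, in order, instead of resolving once while
 * reading the config. Illustrative only. */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_destination(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;                    /* resolution failed */

    /* walk all returned records; the first one that connects wins */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;                        /* -1 if no address worked */
}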
grobian added a commit that referenced this issue Jan 12, 2017
@grobian
Owner

grobian commented Jan 12, 2017

There, I guess we'll have to see how that holds up now. :)

@grobian grobian closed this as completed Jan 12, 2017
@szibis

szibis commented Jun 26, 2017

This is still an issue: carbon-c-relay does not reconnect after the ELB backend goes down and comes back up. The IPs behind the ELB stay the same, so re-resolving changes nothing.

For example, I have some telegraf instances in an auto-scaling group behind an ELB, and after the nodes are replaced the relays just stop sending data. The ELB is only available as a DNS endpoint.

After I restarted all the telegrafs, this is what we see on the relay nodes:

tcp        1      0 10.1.3.53:43957         54.164.107.221:2003     CLOSE_WAIT
tcp        1      0 10.1.3.53:50974         52.86.86.19:2003        CLOSE_WAIT
tcp        1      0 10.1.3.53:42321         54.164.52.56:2003       CLOSE_WAIT
tcp        1      0 10.1.3.53:40545         52.20.198.174:2003      CLOSE_WAIT
tcp        1      0 10.1.3.53:55262         54.174.165.14:2003      CLOSE_WAIT
tcp        1      0 10.1.3.53:51640         52.204.84.6:2003        CLOSE_WAIT
tcp        1      0 10.1.3.53:35064         52.20.177.181:2003      CLOSE_WAIT
tcp        1      0 10.1.3.53:44177         54.236.94.159:2003      CLOSE_WAIT

On the telegraf node:

tcp6       0      0 :::2003                 :::*                    LISTEN

A relay restart is needed to restore traffic; SIGHUP does not help. It looks like there is no reconnect from the relay side. After the relay restart:

tcp6       0      0 :::2003                 :::*                    LISTEN
tcp6  7540466      0 172.23.83.99:2003       172.23.51.10:28297      ESTABLISHED
tcp6  6802956      0 172.23.83.99:2003       172.23.5.137:33178      ESTABLISHED
tcp6  7379974      0 172.23.83.99:2003       172.23.120.24:37858     ESTABLISHED
tcp6  7547503      0 172.23.83.99:2003       172.23.67.52:42339      ESTABLISHED
tcp6  7718030      0 172.23.83.99:2003       172.23.67.52:42349      ESTABLISHED
tcp6  6793848      0 172.23.83.99:2003       172.23.120.24:37848     ESTABLISHED
tcp6  7626775      0 172.23.83.99:2003       172.23.5.137:33188      ESTABLISHED
tcp6  7542310      0 172.23.83.99:2003       172.23.67.52:42329      ESTABLISHED

Only a relay restart helps in this situation, which makes this hard to use at the moment.

Used carbon-c-relay version: carbon-c-relay v3.1 (ac8cc1)

/usr/bin/relay -p 2013 -w 32 -b 10000 -q 3000000 -B 32 -T 500 -f /etc/carbon-c-relay/relay.conf

Part of relay config

cluster telegraf
    any_of useall metrics-proxy.monitoring:2003

The ELB works in TCP mode on both the front end and the back end, and the idle TCP connection timeout is set to 60 seconds.

[screenshot: ELB connection graph]

As you can see, the relay makes only a few connections and never closes them by itself.

@piotr1212
Contributor

piotr1212 commented Jun 26, 2017

idle TCP connection timeout is set to 60 seconds

I'm no expert on ELB but if this does what I think it does (drop connections after 60s no traffic) then this is extremely low. You will at least have to configure all your systems with a lower keepalive_time in /proc/sys/net/ipv4/tcp_keepalive_time. If you don't do that the OS has no way of "knowing" the connection is down and you'll see a lot of stale ESTABLISHED connections in netstat.
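
For example (the values here are illustrative; pick something well below the ELB idle timeout, and note this only affects sockets that enable SO_KEEPALIVE):

# probe after 30s idle, every 10s, give up after 3 unanswered probes
sysctl -w net.ipv4.tcp_keepalive_time=30
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=3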

@grobian
Owner

grobian commented Jun 27, 2017

@szibis what is the TTL on the DNS record(s) behind metrics-proxy.monitoring? Also, you specified useall, which expands the addresses, so it never re-resolves. What happens if you drop useall?

I think the missing bit here is that the relay "sees" this as a single target; you used useall to expand it into multiple targets, but that makes them static. Since the config didn't change, the SIGHUP didn't trigger a re-resolve. This is a use-case that wasn't considered above, and I don't think it is easy to implement. Making the SIGHUP trigger a re-resolve should be possible though.
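
For illustration, dropping useall from the cluster above would leave:

cluster telegraf
    any_of metrics-proxy.monitoring:2003;

i.e. a single cluster member whose hostname gets (re)resolved when connecting, per the change referenced earlier in this thread.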

@szibis

szibis commented Jun 27, 2017

@grobian for the ELB there is no TTL; it is an internal alias in AWS which is updated immediately when it changes.

Yes, useall could be a problem: if all the nodes behind the ELB are replaced, or AWS changes the ELB addresses it exposes to the public, this could stop working. But that is not the issue in this case, because the IPs are not changed in this short interval. I need useall because I need to spread long-lived connections across all the IPs behind the ELB DNS name.

For me, having SIGHUP re-resolve would be a very good option for the ELB.

From AWS doc:

Note that TCP keep-alive probes do not prevent the load balancer from terminating the connection because they do not send data in the payload.

@szibis

szibis commented Jun 27, 2017

All I do now is increase the idle timeout on the ELB to the maximum of 3600 seconds, but this only moves the problem in time; it still exists.

@grobian
Owner

grobian commented Jun 27, 2017

Ok, I'm slightly confused then. If the IPs aren't changed, then there is nothing to re-resolve, is there?

Re-reading your first comment, you mention the ELB is restarted or something, and carbon-c-relay not re-connecting. However, your post also seems to suggest connections still stay in ESTABLISHED state, e.g. what @piotr1212 said, the OS/relay cannot know the connection is dead, a proper reset wasn't sent. The relay should wait for approx 10 seconds on a write to succeed, after that it backs out with a log message and closes the connection. Do you see "failed to write()" log messages?

@piotr1212
Contributor

From AWS doc:
Note that TCP keep-alive probes do not prevent the load balancer from terminating the connection because they do not send data in the payload.

That is unfortunate. You could still lower the keepalive: it won't keep the connection open, but the OS will notice that the keepalive probes get no response, conclude the connection is stale, and close the socket earlier. Not perfect, but better.

All I do now is to increase the idle timeout on ELB to maximum 3600 seconds and this only move problem in time but still exist.

It's not very likely there will be no traffic on the connection for longer than 3600 seconds, while not perfect it might work in practice.

The relay should wait for approx 10 seconds on a write to succeed, after that it backs out with a log message and closes the connection. Do you see "failed to write()" log messages?

I don't think this works when the connection is in established state and the packets are silently dropped.

@szibis

szibis commented Jun 27, 2017

Nope, no "failed to write()" messages for these hosts.

Starting relay:

tcp        0      0 10.1.3.53:50903         52.44.76.160:2003       ESTABLISHED
tcp        0      0 10.1.3.53:43040         52.6.113.3:2003         ESTABLISHED
tcp        0      0 10.1.3.53:42497         34.193.255.117:2003     ESTABLISHED
tcp        0      0 10.1.3.53:38261         34.225.175.229:2003     ESTABLISHED
tcp        0      0 10.1.3.53:54926         54.175.210.7:2003       ESTABLISHED
tcp        0      0 10.1.3.53:42420         52.22.187.162:2003      ESTABLISHED
tcp        0      0 10.1.3.53:52155         54.173.66.133:2003      ESTABLISHED
tcp        0      0 10.1.3.53:35500         52.2.36.92:2003         ESTABLISHED
tcp        0      0 10.1.3.53:59475         34.199.63.111:2003      ESTABLISHED
tcp        0      0 10.1.3.53:41256         52.20.82.245:2003       ESTABLISHED
tcp        0      0 10.1.3.53:56868         34.196.224.197:2003     ESTABLISHED
tcp        0      0 10.1.3.53:33560         52.86.89.95:2003        ESTABLISHED

These IPs come from resolving the ELBs; each ELB resolves to 4 IPs. We have 3 ELB DNS names in the relay config.

On each ELB all instances are active, 4 instances per ELB. The health check is on the telegraf TCP:2003 port, at 5-second intervals with a 3-second timeout.

On the telegraf side the process is running and available, and the ELB health checks against telegraf pass.

After a couple of hours all traffic drops and all connections from the relay through this ELB to telegraf are in CLOSE_WAIT state.

[screenshot: ELB CloudWatch graph]

And after some time on relays:

tcp        1      0 10.1.3.53:52362         54.173.66.133:2003      CLOSE_WAIT
tcp        0      0 10.1.3.53:35684         52.2.36.92:2003         ESTABLISHED
tcp        0      0 10.1.3.53:59668         34.199.63.111:2003      ESTABLISHED
tcp        1      0 10.1.3.53:55113         54.175.210.7:2003       CLOSE_WAIT
tcp        1      0 10.1.3.53:51103         52.44.76.160:2003       CLOSE_WAIT
tcp        1      0 10.1.3.53:33749         52.86.89.95:2003        CLOSE_WAIT
tcp        1      0 10.1.3.53:42684         34.193.255.117:2003     CLOSE_WAIT
tcp        1      0 10.1.3.53:43224         52.6.113.3:2003         CLOSE_WAIT
tcp        1      0 10.1.3.53:41454         52.20.82.245:2003       CLOSE_WAIT
tcp        1      0 10.1.3.53:57047         34.196.224.197:2003     CLOSE_WAIT
tcp        1      0 10.1.3.53:42610         52.22.187.162:2003      CLOSE_WAIT
tcp        1      0 10.1.3.53:38462         34.225.175.229:2003     CLOSE_WAIT

Eventually all connections will be in CLOSE_WAIT state.

And some tcpdump output for one of these CLOSE_WAIT connections:

10:14:58.516324 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [.], seq 29462:32358, ack 1, win 211, options [nop,nop,TS val 561785533 ecr 10474666], length 2896
10:14:58.516332 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 32358:33038, ack 1, win 211, options [nop,nop,TS val 561785533 ecr 10474666], length 680
10:14:58.813869 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33038:33230, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 192
10:14:58.813881 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33230:33423, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 193
10:14:58.813885 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33423:33614, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 191
10:14:58.813888 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33614:33730, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 116
10:14:58.813892 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33730:33837, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 107
10:14:58.813895 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33837:33948, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 111
10:14:58.813897 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 33948:34065, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474666], length 117
10:14:58.814538 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [.], seq 34369:37265, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474741], length 2896
10:14:58.814545 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [.], seq 37265:40161, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474741], length 2896
10:14:58.814600 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [.], seq 40161:43057, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474741], length 2896
10:14:58.814616 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [.], seq 43057:44505, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474741], length 1448
10:14:58.814619 IP 10.1.3.53.55113 > 54.175.210.7.2003: Flags [P.], seq 44505:44529, ack 1, win 211, options [nop,nop,TS val 561785607 ecr 10474741], length 24

@szibis

szibis commented Jun 27, 2017

After pkill -1 relay

tcp        0      0 10.1.3.53:36701         52.2.36.92:2003         ESTABLISHED
tcp        0      0 10.1.3.53:52112         52.44.76.160:2003       ESTABLISHED
tcp        0      0 10.1.3.53:53366         54.173.66.133:2003      ESTABLISHED
tcp        0      0 10.1.3.53:42467         52.20.82.245:2003       ESTABLISHED
tcp        0      0 10.1.3.53:43629         52.22.187.162:2003      ESTABLISHED
tcp        0      0 10.1.3.53:56136         54.175.210.7:2003       ESTABLISHED
tcp        0      0 10.1.3.53:43693         34.193.255.117:2003     ESTABLISHED
tcp        0      0 10.1.3.53:34753         52.86.89.95:2003        ESTABLISHED
tcp        0      0 10.1.3.53:39469         34.225.175.229:2003     ESTABLISHED
tcp        0      0 10.1.3.53:60680         34.199.63.111:2003      ESTABLISHED
tcp        0      0 10.1.3.53:44245         52.6.113.3:2003         ESTABLISHED
tcp        0      0 10.1.3.53:58057         34.196.224.197:2003     ESTABLISHED

On a telegraf host:

tcp6       0      0 :::2003                 :::*                    LISTEN
tcp6  8157044      0 172.23.112.3:2003       172.23.36.161:45973     ESTABLISHED
tcp6  124418      0 172.23.112.3:2003       172.23.3.247:3812       ESTABLISHED

And on another telegraf host:

tcp6       0      0 :::2003                 :::*                    LISTEN
tcp6       0      0 172.23.18.37:2003       172.23.100.164:56920    ESTABLISHED

But somehow, after a couple of hours, even this SIGHUP from cron every 3 minutes stops working. I need to confirm this; I'm leaving it for later today.

@szibis

szibis commented Jun 27, 2017

After restarting all 9 relays, on the telegraf side:

tcp6  1281359      0 172.23.112.3:2003       172.23.100.164:6941     ESTABLISHED
tcp6  4432279      0 172.23.112.3:2003       172.23.3.247:4616       ESTABLISHED
tcp6  3488699      0 172.23.112.3:2003       172.23.36.161:46915     ESTABLISHED
tcp6  3616907      0 172.23.112.3:2003       172.23.93.179:44495     ESTABLISHED
tcp6  2638934      0 172.23.112.3:2003       172.23.3.247:4578       ESTABLISHED
tcp6  1127992      0 172.23.112.3:2003       172.23.36.161:46945     ESTABLISHED
tcp6  3634935      0 172.23.112.3:2003       172.23.100.164:6923     ESTABLISHED
tcp6  8621461      0 172.23.112.3:2003       172.23.3.247:3812       ESTABLISHED
tcp6  1345620      0 172.23.112.3:2003       172.23.93.179:44511     ESTABLISHED

Everything works for some time, just like in the screenshot from the ELB CloudWatch console graph.

@grobian
Owner

grobian commented Jun 27, 2017

I may be a bit slow here, but can you describe what your architecture looks like? I see relays -> ELB, and relays -> telegraf; you also mention telegraf -> ELB. How does it work, who talks to whom?

@szibis

szibis commented Jun 27, 2017

  1. relays -> (influx-proxy) ELB -> telegraf with auto-scaling-group (socket_listener) -> influxdb cluster
  2. relays -> hashing with replication-factor=2 to graphite-stores with go-carbon
  3. relays -> relays-aggregators -> graphite-stores with go-carbon

Currently, the problem exists in the first flow.

@piotr1212
Contributor

I'm a bit confused about what your architecture exactly looks like. For flow 1:
What does the "(influx-proxy)" do?
Does the relay connect to the ELB or does ELB only manage the DNS records?
You said it happens when the backend goes down/up, the backend is telegraf?
IP's and DNS records don't change while this happens?

@szibis

szibis commented Jun 27, 2017

Yes, the relay sends traffic to the ELB, and behind this ELB I have a cluster of telegraf instances auto-scaled by an ASG based on spot instances.

What does the "(influx-proxy)" do?

It is just the name of the ELB.

Does the relay connect to the ELB or does ELB only manage the DNS records?

When an instance fails, a new one is started and the ASG adds it to the ELB.
The ELB is always available as a DNS record to connect to; I use this endpoint in carbon-c-relay, which resolves the record and sends data to the IPs (the ELB IPs behind the ELB domain).

You said it happens when the backend goes down/up, the backend is telegraf?

Yes, it happens when any telegraf goes down and is restored by the ASG, or just when I restart the telegraf daemon. If I do this on all telegraf instances, then all carbon-c-relay connections to these ELB IPs go to CLOSE_WAIT.

IP's and DNS records don't change while this happens?

Yes, the IPs don't change, and the ELB domain never changes.

My question is: why does carbon-c-relay make only one connection per IP?
Would an option to force a reconnect after a defined number of seconds, or after a defined number of metrics sent, work around this as a proof of concept?

I have similar problems with diamond and its graphite and influxdb handlers. Closing the connections and then reconnecting solves it: in the graphite handler after every N seconds, in the influxdb handler after every N sent batches of metrics. But maybe this is a different issue.

@grobian
Owner

grobian commented Jun 28, 2017

Ok, so it seems the ELB does weird stuff (IMO). Apart from that, I don't know if it load-balances at all (c-relay only makes 1 connection to each IP), and it seems not to reset connections that are forwarded.

I'm wondering whether it would be possible to have the IPs of the telegraf nodes in the carbon-c-relay config (DNS useall would be fine). This way, SIGHUP will update the IP list if it changes in DNS, and the relay will use ALL telegrafs because it will connect to ALL of them, instead of making just a single connection to the ELB. It probably is also able to detect write errors in this case, which makes it fail over.
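
Concretely, something along these lines (the addresses below are placeholders for the actual telegraf nodes; alternatively a DNS record covering the telegraf nodes could be used with useall):

cluster telegraf
    any_of
        172.23.0.11:2003
        172.23.0.12:2003
        172.23.0.13:2003
        172.23.0.14:2003;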

@szibis

szibis commented Jun 28, 2017

@grobian and there is a second part of the issue. I tested it: when I have more than 15 or 20 (I don't remember exactly right now, but we can test it one more time) telegrafs defined as any_of and then add more, I get a buffer overflow in carbon-c-relay - tested on 3.1.

And when I use the telegrafs directly, I can use static IPs without any DNS, but this scenario needs more testing.

@grobian
Owner

grobian commented Jun 28, 2017

I'd be very interested in the buffer overflow.

@magec

magec commented Sep 12, 2017

I have an issue related to this. I would like to use a simple forward cluster with a couple of carbon-cache nodes, simply for HA. I use consul and a DNS on top of it, so whenever a new carbon-cache server gets discovered, consul adds it to the DNS response. The thing is, if you use useall, the resolution is done at 'reading config time', so you won't notice changes in the infrastructure. If you skip useall, you get resolution but the metric is only sent to one of the carbon-caches. I want to re-resolve and also send it to every server. Is there a way of doing this?

@grobian
Owner

grobian commented Sep 12, 2017

this is currently not possible, it requires a complete config reload

@magec

magec commented Sep 12, 2017

Ok, I expected that. Thanks for the answer either way, and for carbon-c-relay!

@grobian
Owner

grobian commented Sep 13, 2017

Just to clarify, I think the current design of the router cannot cope with "dynamic" clusters like this. There are likely too many assumptions that the cluster composition is static in terms of size. Re-resolving works out fine per destination, but re-creation of an entire cluster requires a reload of the entire config. It could perhaps be possible to trigger a config reload on something other than SIGHUP, e.g. a config file change (mtime) or a watched DNS record change (via TTL, or something, I don't know). Such an implementation would be somewhat expensive, since it would lock the relay down on every actual config change (like reload does currently).
