Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for dynamically reloading sampling strategies #2040

Closed
wants to merge 16 commits into from

Conversation

annanay25
Copy link
Member

@annanay25 annanay25 commented Jan 27, 2020

Which problem is this PR solving?

Short description of the changes

  • Create new viper instance to watch changes to the sampling strategies file
  • Wrap the sampling strategy store in a RWMutex for concurrent access.

Unverified

This user has not yet uploaded their public signing key.
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25
Copy link
Member Author

Going through the discussion on #1688 -
IIUC fsnotify requires the creation of new watchers for every file being monitored if we desire notifications on independent updates (we don't want the UI to reload when the sampling strategies file is changed) - so I don't think there any advantage of creating a "central component" for hot reload. Also, since the reload callbacks are going to be different for each component, we'll need a go-routine running in each of these components anyway.

The obvious con is repetition of code.

Signed-off-by: Annanay <annanayagarwal@gmail.com>
@codecov
Copy link

codecov bot commented Jan 28, 2020

Codecov Report

Merging #2040 into master will decrease coverage by 1.12%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2040      +/-   ##
==========================================
- Coverage   97.39%   96.27%   -1.13%     
==========================================
  Files         207      214       +7     
  Lines       10307    10564     +257     
==========================================
+ Hits        10039    10171     +132     
- Misses        224      332     +108     
- Partials       44       61      +17     
Impacted Files Coverage Δ
...in/sampling/strategystore/static/strategy_store.go 91.83% <71.42%> (-8.17%) ⬇️
cmd/env/command.go 100.00% <0.00%> (ø)
cmd/query/app/flags.go 100.00% <0.00%> (ø)
cmd/collector/app/metrics.go 100.00% <0.00%> (ø)
cmd/collector/app/options.go 100.00% <0.00%> (ø)
cmd/collector/app/span_processor.go 100.00% <0.00%> (ø)
cmd/collector/app/zipkin/http_handler.go 100.00% <0.00%> (ø)
cmd/collector/app/grpcserver/grpc_server.go
cmd/collector/app/thrift_span_handler.go
cmd/collector/app/service_name_normalizer.go
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29e7131...8721e11. Read the comment docs.

Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25
Copy link
Member Author

annanay25 commented Jan 29, 2020

Deployment spec used in test -

---
kind: ConfigMap 
apiVersion: v1 
metadata:
  name: jaeger-sampling-strategies
data:
  config: |-
    {
      "service_strategies": [
        {
          "service": "foo",
          "type": "probabilistic",
          "param": 0.8,
          "operation_strategies": [
            {
              "operation": "op1",
              "type": "probabilistic",
              "param": 0.2
            },
            {
              "operation": "op2",
              "type": "probabilistic",
              "param": 0.4
            }
          ]
        },
        {
          "service": "bar",
          "type": "ratelimiting",
          "param": 5
        }
      ],
      "default_strategy": {
        "type": "probabilistic",
        "param": 0.5
      }
    }

---
kind: Pod 
apiVersion: v1 
metadata:
  name: jaeger-collector
spec:
  containers:
    - name: jaeger-collector
      image: annanay25/jaeger-collector:latest
      imagePullPolicy: IfNotPresent
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: jaeger-sampling-strategies
        items:
          - key: config
            path: config

Output -

2020/01/28 20:13:33 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1580242413.4304128,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1580242413.4310129,"caller":"flags/admin.go:115","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1580242413.4312766,"caller":"flags/admin.go:121","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1580242413.4314523,"caller":"flags/admin.go:107","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1580242413.4318726,"caller":"memory/factory.go:56","msg":"Memory storage initialized","configuration":{"MaxTraces":0}}
{"level":"info","ts":1580242413.4350672,"caller":"app/span_processor.go:130","msg":"Dynamically adjusting the queue size at runtime.","memory-mib":0,"queue-size-warmup":2000}
{"level":"info","ts":1580242413.4376912,"caller":"static/strategy_store.go:102","msg":"watching","file":"/etc/config/config"}
{"level":"info","ts":1580242413.4380167,"caller":"collector/main.go:125","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"warn","ts":1580242413.4380605,"caller":"collector/main.go:126","msg":"TChannel has been deprecated and will be removed in a future release"}
{"level":"info","ts":1580242413.4382987,"caller":"grpcserver/grpc_server.go:64","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1580242413.4395673,"caller":"collector/main.go:157","msg":"Starting jaeger-collector HTTP server","http-host-port":":14268"}
{"level":"info","ts":1580242413.4396791,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"warn","ts":1580242473.4400034,"caller":"app/span_processor.go:239","msg":"The dynamic queue size warmup value is set, but not the amount of memory to use. Skipping."}
.
.
{"level":"warn","ts":1580242497.5086741,"caller":"static/strategy_store.go:68","msg":"the sampling strategies file has been removed"}
{"level":"info","ts":1580242497.5095131,"caller":"static/strategy_store.go:75","msg":"Updating sampling strategies file!","Strategies":{"default_strategy":{"service":"","operation_strategies":null,"type":"probabilistic","param":0.5},"service_strategies":[{"service":"foo","operation_strategies":[{"operation":"op1","type":"probabilistic","param":0.2},{"operation":"op2","type":"probabilistic","param":0.4}],"type":"probabilistic","param":0.7},{"service":"bar","operation_strategies":null,"type":"ratelimiting","param":5}]}}
.
.

.
{"level":"warn","ts":1580242658.4557033,"caller":"static/strategy_store.go:68","msg":"the sampling strategies file has been removed"}
{"level":"info","ts":1580242658.4559944,"caller":"static/strategy_store.go:75","msg":"Updating sampling strategies file!","Strategies":{"default_strategy":{"service":"","operation_strategies":null,"type":"probabilistic","param":0.5},"service_strategies":[{"service":"foo","operation_strategies":[{"operation":"op1","type":"probabilistic","param":0.2},{"operation":"op2","type":"probabilistic","param":0.4}],"type":"probabilistic","param":0.9},{"service":"bar","operation_strategies":null,"type":"ratelimiting","param":5}]}}

Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25 annanay25 marked this pull request as ready for review February 3, 2020 15:02
@annanay25 annanay25 requested a review from a team as a code owner February 3, 2020 15:02
@annanay25 annanay25 changed the title WIP - Add support for dynamically reloading sampling strategies Add support for dynamically reloading sampling strategies Feb 3, 2020
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
This reverts commit c7eb095.

Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25
Copy link
Member Author

@yurishkuro @jpkrohling - Can we merge this?

// This is a workaround for k8s configmaps. Since k8s loads configmaps as
// symlinked files within the containers, changes to the configmap register
// as `fsnotify.Remove` events.
if err := watcher.Add(strategiesFile); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a race condition here? If the file was removed, even temporarily, and you call watcher.Add() on non-existing file, will it not error out immediately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this line is specifically added for the k8s configmap scenario. It is added to handle successive reloads without restarting the collector (without this line, the collector reloads the strategies only the first time the file was changed). I've verified that it works.

However, if the config file was manually removed, yes it will error out immediately, but that works for us.

annanay25 and others added 2 commits February 11, 2020 20:24
Co-Authored-By: Yuri Shkuro <yurishkuro@users.noreply.github.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
}
continue
}
if event.Op&fsnotify.Write == fsnotify.Write {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about Create event? I think it would've addressed my question above.

for {
select {
case event := <-watcher.Events:
if event.Op&fsnotify.Remove == fsnotify.Remove {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these treated as bitmasks? The Op constants are defined as ordinary numbers (https://godoc.org/github.com/fsnotify/fsnotify#Op).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

select {
case event := <-watcher.Events:
if event.Op&fsnotify.Remove == fsnotify.Remove {
if event.Name == strategiesFile {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that we're watching just a single file, when would the name not equal?

logger.Info("watching", zap.String("file", options.StrategiesFile))
}

dir := filepath.Dir(options.StrategiesFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to watch the dir?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

func (h *strategyStore) runWatcherLoop(watcher *fsnotify.Watcher, strategiesFile string) {
for {
select {
case event := <-watcher.Events:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not simplify this by handling all event? E.g.

case event := <-watcher.Events:
    if event.Op == fsnotify.Remove {
        // resubscribe to handle k8s use case
    }
    h.loadAndParseStrategies(strategiesFile)
    continue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic is actually very close to this

  1. Remove events could occur in a non k8s environment. In which case we need to handle only write events, hence the following block-
if event.Op&fsnotify.Write == fsnotify.Write {
	h.loadAndParseStrategies(strategiesFile)
}
  1. If they occur in a k8s environment, we need to reload the file strategies file hence the first part

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: the second case isn't specific to Kubernetes, but it's how Kubernetes mounts those files in the local file system (with two symlinks at different levels). The same would happen on non-Kubernetes environments as well, if people use symlinks on directories pointing to the "current" version of a config map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that you want special handing for Remove by re-subscribing, and all events should result in parsing the file. Right now the code ignores several events altogether.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the code ignores several events altogether.

?

@annanay25
Copy link
Member Author

annanay25 commented Feb 11, 2020

OK, I think I added the directory watcher to get past Codecov. Could you check if the logic at this commit works? 31ea29e

PR coverage is 54%, but that's because we're not writing k8s specific test cases.

Copy link
Contributor

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, as long as Travis passes.

Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25
Copy link
Member Author

Gah, Codecov is at 40%, but the only lines left out are fsnotify errors.

@annanay25
Copy link
Member Author

@jpkrohling @yurishkuro - soft ping, can we move forward with this? The test cases for this will have to be written in the operator repo

@yurishkuro
Copy link
Member

Please address review comments. Only 40% of new code is covered, please bring it above the 95% target.

@annanay25
Copy link
Member Author

@yurishkuro - I've already added one test case. Test cases to cover the remaining lines of code will have to be added in the operator repo.

@yurishkuro
Copy link
Member

Tests in the operator repo do not affect code coverage. Why can’t we unit test?

@annanay25
Copy link
Member Author

Tests in the operator repo do not affect code coverage

Yes, I'm aware.

Lines checking for fsnotify errors are the only ones untested. Not sure how they can be unit tested.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing errors is no different from testing normal code, with the usual technique - using a stub. The fsnotify package can be abstracted with a small interface that could be stubbed in tests.

func (h *strategyStore) loadAndParseStrategies(strategiesFile string) error {
s, err := loadStrategies(strategiesFile)
if err != nil {
h.logger.Warn("using the last saved configuration for sampling strategies.", zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
h.logger.Warn("using the last saved configuration for sampling strategies.", zap.Error(err))
h.logger.Warn("Unable to reload the sampling strategies file. Using the last saved configuration.", zap.Error(err), zap.String("file", strategiesFile))

logger.Info("watching", zap.String("file", options.StrategiesFile))
}

dir := filepath.Dir(options.StrategiesFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

func (h *strategyStore) runWatcherLoop(watcher *fsnotify.Watcher, strategiesFile string) {
for {
select {
case event := <-watcher.Events:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the code ignores several events altogether.

?

for {
select {
case event := <-watcher.Events:
if event.Op&fsnotify.Remove == fsnotify.Remove {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@NareshBhanawat
Copy link

NareshBhanawat commented Mar 26, 2020

instead of monitoring file change, why dont we expose an API which can accept json payload of sampling strategy?

@defool
Copy link
Contributor

defool commented Apr 19, 2020

Checking file periodically is another way to reload sampling strategies, which is simpler and more reliable. Can I create a PR to do this? @yurishkuro

@yurishkuro
Copy link
Member

@defool take a look at #2186

@yurishkuro
Copy link
Member

#2188 implements a simpler, timer-based reload, instead of watching for file changes. I suggest we consider closing this PR - @annanay25

@annanay25 annanay25 closed this May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

collector: reload sampling strategies on file content change
5 participants