Commit 889f4f1

Improve e2e troubleshooting (#448)
* Improve e2e troubleshooting

  Improve / fix some issues with e2e tests:
  - Add more logs; print some useful information, such as when the cluster is still up
  - Improve readiness (e.g. had agent pods crashing)
  - Use more up-to-date templates for Loki and Kafka (similar to what we have in the docs repo)

* remove -a flag; do not tag e2e

  Tagging e2e is not necessary and has some undesired side effects, such as excluding these e2e source files from building/linting, which can hide some problems

* Add doc
1 parent 48eb61b commit 889f4f1

13 files changed

+276
-72
lines changed

Makefile

+4 -3
@@ -144,7 +144,7 @@ docker-generate: ## Create the container that generates the eBPF binaries
 .PHONY: compile
 compile: ## Compile ebpf agent project
 	@echo "### Compiling project"
-	GOARCH=${GOARCH} GOOS=$(GOOS) go build -mod vendor -a -o bin/netobserv-ebpf-agent cmd/netobserv-ebpf-agent.go
+	GOARCH=${GOARCH} GOOS=$(GOOS) go build -mod vendor -o bin/netobserv-ebpf-agent cmd/netobserv-ebpf-agent.go
 
 .PHONY: build-and-push-bc-image
 build-and-push-bc-image: docker-generate ## Build and push bytecode image
@@ -153,7 +153,7 @@ build-and-push-bc-image: docker-generate ## Build and push bytecode image
 .PHONY: test
 test: ## Test code using go test
 	@echo "### Testing code"
-	GOOS=$(GOOS) go test -mod vendor -a ./... -coverpkg=./... -coverprofile cover.all.out
+	GOOS=$(GOOS) go test -mod vendor ./pkg/... ./cmd/... -coverpkg=./... -coverprofile cover.all.out
 
 .PHONY: cov-exclude-generated
 cov-exclude-generated:
@@ -175,7 +175,8 @@ tests-e2e: prereqs ## Run e2e tests
 	go clean -testcache
 	# making the local agent image available to kind in two ways, so it will work in different
 	# environments: (1) as image tagged in the local repository (2) as image archive.
-	$(OCI_BIN) build . --build-arg TARGETARCH=$(GOARCH) -t localhost/ebpf-agent:test
+	rm -f ebpf-agent.tar || true
+	$(OCI_BIN) build . --build-arg LDFLAGS="" --build-arg TARGETARCH=$(GOARCH) -t localhost/ebpf-agent:test
 	$(OCI_BIN) save -o ebpf-agent.tar localhost/ebpf-agent:test
 	GOOS=$(GOOS) go test -p 1 -timeout 30m -v -mod vendor -tags e2e ./e2e/...

README.md

+4
@@ -132,6 +132,10 @@ make generate
 
 Regularly tested on Fedora.
 
+### Running end-to-end tests
+
+Refer to the specific documentation: [e2e readme](./e2e/README.md)
+
 ## Known issues
 
 ### External Traffic in Openshift (OVN-Kubernetes CNI)

e2e/README.md

+66
@@ -0,0 +1,66 @@
+## eBPF Agent e2e tests
+
+e2e tests can be run with:
+
+```bash
+make tests-e2e
+```
+
+If you use podman, you may need to run it as root instead:
+
+```bash
+sudo make tests-e2e
+```
+
+### What it does
+
+It builds an image with the current code, including pre-generated BPF bytecode, starts a KIND cluster and deploys the agent on it. It also deploys a typical NetObserv stack, which includes flowlogs-pipeline, Loki and/or Kafka.
+
+It then runs a couple of smoke tests on that cluster, such as sending pings between pods and verifying that the expected flows are created.
+
+The tests leverage Kube's [e2e-framework](https://github.com/kubernetes-sigs/e2e-framework). They are based on manifest files that you can find in [this directory](./cluster/base/).
+
+### How to troubleshoot
+
+During the tests, you can run any `kubectl` command against the KIND cluster.
+
+If you use podman/root and don't want to open a root session, you can simply copy the root kube config:
+
+```bash
+sudo cp /root/.kube/config /tmp/agent-kind-kubeconfig
+sudo -E chown $USER:$USER /tmp/agent-kind-kubeconfig
+export KUBECONFIG=/tmp/agent-kind-kubeconfig
+```
+
+Then:
+
+```bash
+$ kubectl get pods
+NAME                    READY   STATUS    RESTARTS   AGE
+flp-29bmd               1/1     Running   0          6s
+loki-7c98dfd6d4-c8q9m   1/1     Running   0          56s
+```
+
+### Cleanup
+
+The KIND cluster should be cleaned up after the tests. Sometimes it isn't, for example after a forced exit or certain kinds of failures.
+When that's the case, you should see a message telling you to manually clean up the cluster:
+
+```
+^CSIGTERM received, cluster might still be running
+To clean up, run: kind delete cluster --name basic-test-cluster20241212-125815
+FAIL github.com/netobserv/netobserv-ebpf-agent/e2e/basic 172.852s
+```
+
+If no such message is shown, you can manually retrieve the name of the cluster to delete:
+
+```bash
+$ kind get clusters
+basic-test-cluster20241212-125815
+
+$ kind delete cluster --name=basic-test-cluster20241212-125815
+Deleting cluster "basic-test-cluster20241212-125815" ...
+Deleted nodes: ["basic-test-cluster20241212-125815-control-plane"]
+```
+
+If the cluster is not cleaned up, a subsequent run of the e2e tests will fail because addresses (ports) are already in use.
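
The cleanup hint shown in the sample output can also be picked up programmatically, for instance by a wrapper script that deletes a leftover cluster before the next run. The following is a minimal sketch, not part of this commit: the helper name `clusterNameFromOutput` is hypothetical, and the prefix it searches for is assumed to match the sample message quoted in this README.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cleanupPrefix is copied from the sample message in the README; it is an
// assumption that the real output always uses this exact wording.
const cleanupPrefix = "To clean up, run: kind delete cluster --name "

// clusterNameFromOutput (hypothetical helper) scans test output line by line
// and returns the cluster name from the first cleanup hint, or "" if none.
func clusterNameFromOutput(output string) string {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		idx := strings.Index(line, cleanupPrefix)
		if idx < 0 {
			continue
		}
		// The cluster name is the first token after the prefix.
		rest := strings.Fields(line[idx+len(cleanupPrefix):])
		if len(rest) > 0 {
			return rest[0]
		}
	}
	return ""
}

func main() {
	out := "^CSIGTERM received, cluster might still be running\n" +
		"To clean up, run: kind delete cluster --name basic-test-cluster20241212-125815\n" +
		"FAIL github.com/netobserv/netobserv-ebpf-agent/e2e/basic 172.852s\n"
	fmt.Println(clusterNameFromOutput(out)) // prints basic-test-cluster20241212-125815
}
```

The returned name can then be passed to `kind delete cluster --name`, as in the manual steps of the Cleanup section.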

e2e/basic/common.go

+5 -6
@@ -1,5 +1,3 @@
-//go:build e2e
-
 package basic
 
 import (
@@ -37,7 +35,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
 			return ctx
 		},
 	).Assess("correctness of client -> server (as Service) request flows",
-		func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+		func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
 			lq := bt.lokiQuery(t,
 				`{DstK8S_OwnerName="server",SrcK8S_OwnerName="client"}`+
 					`|="\"DstAddr\":\"`+pci.serverServiceIP+`\""`)
@@ -82,7 +80,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
 			return ctx
 		},
 	).Assess("correctness of client -> server (as Pod) request flows",
-		func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+		func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
 			lq := bt.lokiQuery(t,
 				`{DstK8S_OwnerName="server",SrcK8S_OwnerName="client"}`+
 					`|="\"DstAddr\":\"`+pci.serverPodIP+`\""`)
@@ -124,7 +122,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
 			return ctx
 		},
 	).Assess("correctness of server (from Service) -> client response flows",
-		func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+		func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
 			lq := bt.lokiQuery(t,
 				`{DstK8S_OwnerName="client",SrcK8S_OwnerName="server"}`+
 					`|="\"SrcAddr\":\"`+pci.serverServiceIP+`\""`)
@@ -167,7 +165,7 @@ func (bt *FlowCaptureTester) DoTest(t *testing.T, isIPFIX bool) {
 			return ctx
 		},
 	).Assess("correctness of server (from Pod) -> client response flows",
-		func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
+		func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
 			lq := bt.lokiQuery(t,
 				`{DstK8S_OwnerName="client",SrcK8S_OwnerName="server"}`+
 					`|="\"SrcAddr\":\"`+pci.serverPodIP+`\""`)
@@ -282,6 +280,7 @@ func (bt *FlowCaptureTester) lokiQuery(t *testing.T, logQL string) tester.LokiQu
 		query, err = bt.Cluster.Loki().Query(1, logQL)
 		require.NoError(t, err)
 		require.NotNil(t, query)
+		require.NotNil(t, query.Data)
 		require.NotEmpty(t, query.Data.Result)
 	}, test.Interval(time.Second))
 	result := query.Data.Result[0]
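
The added `require.NotNil(t, query.Data)` guards the later access to `query.Data.Result`. The sketch below illustrates the same layered check outside of a test. The types `queryResponse` and `queryData` and the helper `firstResult` are hypothetical stand-ins, and it is an assumption that `Data` is a pointer in the real response type (which the new nil assertion suggests).

```go
package main

import (
	"errors"
	"fmt"
)

// queryData and queryResponse are hypothetical stand-ins for the Loki
// response types used by the e2e tests; only the shape matters here.
type queryData struct {
	Result []string
}

type queryResponse struct {
	Data *queryData // assumed to be a pointer, hence the nil check
}

// firstResult checks each level before dereferencing the next one, so a
// response without data yields an error instead of a nil-pointer panic.
func firstResult(q *queryResponse) (string, error) {
	if q == nil {
		return "", errors.New("nil query response")
	}
	if q.Data == nil {
		return "", errors.New("query response has no data")
	}
	if len(q.Data.Result) == 0 {
		return "", errors.New("query returned no results")
	}
	return q.Data.Result[0], nil
}

func main() {
	// A response that exists but carries no data: without the second
	// check, reading q.Data.Result would panic.
	_, err := firstResult(&queryResponse{})
	fmt.Println(err)

	v, err := firstResult(&queryResponse{Data: &queryData{Result: []string{"flow"}}})
	fmt.Println(v, err)
}
```

Inside a retry loop such as `test.Eventually`, a failed assertion leads to another attempt, whereas a nil-pointer panic would abort the test run.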

e2e/basic/flow_test.go

+1 -2
@@ -1,5 +1,3 @@
-//go:build e2e
-
 package basic
 
 import (
@@ -152,6 +150,7 @@ func getPingFlows(t *testing.T, newerThan time.Time, expectedBytes int) (sent, r
 	}, test.Interval(time.Second))
 
 	test.Eventually(t, time.Minute, func(t require.TestingT) {
+		// testCluster.Loki().DebugPrint(100, `{app="netobserv-flowcollector",DstK8S_OwnerName="pinger"}`)
 		query, err = testCluster.Loki().
 			Query(1, fmt.Sprintf(`{SrcK8S_OwnerName="server",DstK8S_OwnerName="pinger"}`+
 				`|~"\"Proto\":1[,}]"`+ // Proto 1 == ICMP
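
The `|~` line filter in the query above carries the regular expression `"Proto":1[,}]` once the LogQL string escapes are removed. A plausible reading, stated here as an assumption rather than the author's documented intent, is that the trailing `[,}]` keeps the filter from also matching protocol numbers that merely start with 1, such as 17 (UDP). The sketch below checks that behaviour with Go's `regexp` package on made-up JSON lines.

```go
package main

import (
	"fmt"
	"regexp"
)

// protoICMP is the expression from the line filter with the LogQL string
// escaping removed. The sample lines below are invented for illustration.
var protoICMP = regexp.MustCompile(`"Proto":1[,}]`)

func main() {
	lines := []string{
		`{"Proto":1,"Bytes":100}`,  // ICMP, field in the middle of the object
		`{"Bytes":100,"Proto":1}`,  // ICMP, field at the end of the object
		`{"Proto":17,"Bytes":100}`, // UDP, must not match
	}
	for _, l := range lines {
		fmt.Println(protoICMP.MatchString(l), l)
	}
}
```

The first two lines match and the third does not, because the character after `1` has to be either a comma or a closing brace.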

e2e/cluster/base/02-loki.yml

+63 -11
@@ -20,6 +20,11 @@ data:
     server:
       http_listen_port: 3100
       grpc_listen_port: 9096
+      grpc_server_max_recv_msg_size: 10485760
+      http_server_read_timeout: 1m
+      http_server_write_timeout: 1m
+      log_level: error
+    target: all
     common:
       path_prefix: /loki-store
       storage:
@@ -31,9 +36,32 @@ data:
         instance_addr: 127.0.0.1
         kvstore:
           store: inmemory
+    compactor:
+      compaction_interval: 5m
+      retention_enabled: true
+      retention_delete_delay: 2h
+      retention_delete_worker_count: 150
+    frontend:
+      compress_responses: true
+    ingester:
+      chunk_encoding: snappy
+      chunk_retain_period: 1m
+    query_range:
+      align_queries_with_step: true
+      cache_results: true
+      max_retries: 5
+      results_cache:
+        cache:
+          enable_fifocache: true
+          fifocache:
+            max_size_bytes: 500MB
+            validity: 24h
+      parallelise_shardable_queries: true
+    query_scheduler:
+      max_outstanding_requests_per_tenant: 2048
     schema_config:
       configs:
-        - from: 2020-10-24
+        - from: 2022-01-01
           store: boltdb-shipper
           object_store: filesystem
           schema: v11
@@ -47,15 +75,39 @@ data:
         active_index_directory: /loki-store/index
         shared_store: filesystem
         cache_location: /loki-store/boltdb-cache
-  datasource.yaml: |
-    apiVersion: 1
-    datasources:
-    - name: Loki
-      type: loki
-      access: proxy
-      url: http://localhost:3100
-      isDefault: true
-      version: 1
+        cache_ttl: 24h
+    limits_config:
+      ingestion_rate_strategy: global
+      ingestion_rate_mb: 10
+      ingestion_burst_size_mb: 10
+      max_label_name_length: 1024
+      max_label_value_length: 2048
+      max_label_names_per_series: 30
+      reject_old_samples: true
+      reject_old_samples_max_age: 15m
+      creation_grace_period: 10m
+      enforce_metric_name: false
+      max_line_size: 256000
+      max_line_size_truncate: false
+      max_entries_limit_per_query: 10000
+      max_streams_per_user: 0
+      max_global_streams_per_user: 0
+      unordered_writes: true
+      max_chunks_per_query: 2000000
+      max_query_length: 721h
+      max_query_parallelism: 32
+      max_query_series: 10000
+      cardinality_limit: 100000
+      max_streams_matchers_per_query: 1000
+      max_concurrent_tail_requests: 10
+      retention_period: 24h
+      max_cache_freshness_per_query: 5m
+      max_queriers_per_tenant: 0
+      per_stream_rate_limit: 3MB
+      per_stream_rate_limit_burst: 15MB
+      max_query_lookback: 0
+      min_sharding_lookback: 0s
+      split_queries_by_interval: 1m
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -83,7 +135,7 @@ spec:
             name: loki-config
       containers:
         - name: loki
-          image: grafana/loki:2.4.1
+          image: grafana/loki:2.9.0
           volumeMounts:
             - mountPath: "/loki-store"
               name: loki-store
