connectivity: Introduce BGP CP connectivity tests #2649

rastislavs · 2024-06-28T13:39:38Z

Introduces BGP Control Plane connectivity tests that work with an FRR router instance running in the host network namespace on the node-without-cilium.

The aim of this PR is mostly to introduce the BGP + FRR testing infrastructure; more testing scenarios need to be added in follow-up PRs. At the moment, we are testing:

BGPv1 control plane, IPv4 + IPv6 scenario
BGPv2 control plane, IPv4 + IPv6 scenario

In both scenarios, we test PodCIDR advertisements and ClusterIP service advertisements.

Example run with debug enabled:

[=] [cilium-test] Test [bgp-control-plane-v1] [80/84]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (10.244.1.188:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.1.188:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (10.244.0.254:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.0.254:8080]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node (10.96.136.144:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.136.144:8080]
..  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node (10.96.194.250:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.194.250:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (fd00:10:244::9227:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244::9227]:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (fd00:10:244:1::1a8:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244:1::1a8]:8080]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node (fd00:10:96::4567:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4567]:8080]
..  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node (fd00:10:96::4763:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4763]:8080]
.  🐛 Finalizing Test bgp-control-plane-v1
  [-] Scenario [bgp-control-plane-v2/bgpv2-advertisements]
[=] [cilium-test] Test [bgp-control-plane-v2] [81/84]
  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (10.244.0.254:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.0.254:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (10.244.1.188:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.1.188:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node (10.96.194.250:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.194.250:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node (10.96.136.144:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.136.144:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (fd00:10:244::9227:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244::9227]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (fd00:10:244:1::1a8:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244:1::1a8]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node (fd00:10:96::4567:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4567]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node (fd00:10:96::4763:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4763]:8080]
.  🐛 Finalizing Test bgp-control-plane-v2
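
The curl invocations in the debug output above differ only in the target URL, where IPv6 addresses are wrapped in brackets. A minimal Go sketch of how such an argument list could be assembled follows; the `curlArgs` helper is hypothetical and is not taken from the PR, while the flags and `-w` format are copied from the log.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// curlArgs builds a curl argument list resembling the debug output above.
// The function name and signature are hypothetical, for illustration only.
func curlArgs(ip string, port int) []string {
	// net.JoinHostPort wraps IPv6 literals in brackets, e.g. [fd00::1]:8080.
	hostPort := net.JoinHostPort(ip, strconv.Itoa(port))
	return []string{
		"curl",
		"-w", "%{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}",
		"--silent", "--fail", "--show-error",
		"--output", "/dev/null",
		"--connect-timeout", "2",
		"--max-time", "10",
		"http://" + hostPort,
	}
}

func main() {
	// Addresses taken from the example run above.
	for _, ip := range []string{"10.244.1.188", "fd00:10:244::9227"} {
		args := curlArgs(ip, 8080)
		fmt.Println(args[len(args)-1])
	}
}
```

Using `net.JoinHostPort` instead of string concatenation avoids having to special-case IPv6 targets in the IPv4 + IPv6 scenarios.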

Example e2e job: https://github.com/cilium/cilium/actions/runs/9778585998/job/26995747615

harsimran-pabla

Awesome stuff, some minor comments otherwise 🚀

connectivity/check/deployment.go

connectivity/check/frr.go

connectivity/tests/bgp.go

YutaroHayakawa

Thanks for working on this! Great stuff! I had a few comments.

One more thing. Is it possible to collect the FRR's running state on test failure? Otherwise, it's a bit hard to investigate the failure.

connectivity/tests/bgp.go

rastislavs · 2024-07-03T13:09:18Z

@YutaroHayakawa

> Is it possible to collect the FRR's running state on test failure? Otherwise, it's a bit hard to investigate the failure.

I added a DumpFRRBGPState function that dumps the BGP state into the log if the test fails.
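
As a rough illustration of the idea (not the PR's actual DumpFRRBGPState implementation), dumping FRR state on failure could look like the sketch below. The `execFunc` type, the `dumpBGPState` name, and the exact set of `vtysh` show commands are assumptions made for this example; a real test would run the commands inside the FRR pod.

```go
package main

import (
	"fmt"
	"strings"
)

// execFunc runs a command in the FRR container and returns its output.
// Hypothetical abstraction; a real test would wrap a pod exec call.
type execFunc func(cmd []string) (string, error)

// dumpBGPState is an illustrative stand-in, not the PR's DumpFRRBGPState.
// It logs the output of a few FRR show commands, but only for failed tests.
func dumpBGPState(exec execFunc, failed bool, logf func(format string, args ...any)) {
	if !failed {
		return
	}
	for _, show := range []string{
		"show bgp summary",
		"show bgp ipv4 unicast",
		"show bgp ipv6 unicast",
	} {
		out, err := exec([]string{"vtysh", "-c", show})
		if err != nil {
			logf("failed to run %q: %v", show, err)
			continue
		}
		logf("FRR %q:\n%s", show, out)
	}
}

func main() {
	// Fake executor so the sketch runs without an FRR instance.
	fake := func(cmd []string) (string, error) {
		return "(output of " + strings.Join(cmd, " ") + ")", nil
	}
	dumpBGPState(fake, true, func(format string, args ...any) {
		fmt.Println(fmt.Sprintf(format, args...))
	})
}
```

Passing the executor and logger in as functions keeps the dump logic testable without a cluster.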

YutaroHayakawa

My concerns are addressed. Thanks!

nebril

ci-structure LGTM

joestringer · 2024-07-24T04:21:54Z

connectivity/check/deployment.go

+	if ct.Features[features.BGPControlPlane].Enabled && ct.Features[features.NodeWithoutCilium].Enabled {
+		_, err = ct.clients.src.GetDaemonSet(ctx, ct.params.TestNamespace, frrDaemonSetNameName, metav1.GetOptions{})
+		if err != nil {
+			ct.Logf("✨ [%s] Deploying %s daemonset...", ct.clients.src.ClusterName(), frrDaemonSetNameName)


I don't think that the CLI should be deploying FRR like this.

The CLI is intended to be run in production environments. Deploying another BGP speaker into the live production network as part of the tests seems inadvisable.

Usually we solve this by deploying the environment with the relevant platform details (such as BGP infrastructure) in the GitHub Action, and then only turn on the specific tests based on detecting the available configuration knobs and infrastructure that can be used to do the testing.

Ah, I see, we hide it in the "unsafe" category of tests: https://github.com/cilium/cilium-cli/pull/2649/files/b0dafd55b753585cf8a174b8e3a3d7d9f5e78790#diff-f2e389864746a093ca953d2ce4f6bca61ef7fe01c0ea08eca12324246abcd1afR22

Some background to my thinking here: In Cilium's "CI 2.0", AKA Ginkgo running in Jenkins, we did not do a good job of isolating the platform configuration and environment setup from the actual tests. Over time, this resulted in (a) random tests provisioning stuff and not properly cleaning up, leading to flakes, and (b) a large amount of the overall test runtime going towards spinning up and tearing down the environments. After that, while setting up the notion of "CI 3.0", AKA cilium-cli based testing, the intent was to start with GitHub Actions representing target environments of various kinds, then define only the actual runtime tests inside the cilium-cli. This way, when you run the cilium-cli test suite, maybe it deploys a few Pods, but for the most part it is just executing a bunch of different commands in the various Pods and checking the results. It also matches much more closely the likely usage by end users who may want to validate whether their environment successfully implements the features they need.

I understand the concerns @joestringer, and thanks for explaining the background. Let me first mention some rationale behind these tests and hopefully we can come to a conclusion on how to solve them:

As part of cilium-cli, we already have tests that require "NodeWithoutCilium" (example) and even install routes on those nodes (see WithIPRoutesFromOutsideToPodCIDRs)

Compared to these, the BGP tests differ only in how the routes are installed (indirectly via FRR vs. directly via cilium-cli), plus the fact that an additional FRR daemonset needs to be deployed on the "NodeWithoutCilium"

The "NodeWithoutCilium" is sufficient for testing most of the BGP features; there is no need for a more complex infra setup or separate GitHub workflow(s) to bring it up

To address your concerns, I think we have the following options:

If installing the routes on the NodeWithoutCilium is the concern, we could also configure the test FRR to not install any routes. We would just check FRR's RIB in the assertions, so the tests would not affect the environment in any way

If the FRR pod deployment on the NodeWithoutCilium as part of the cilium-cli test setup is the main concern, we can of course remove this from cilium-cli and introduce a new GitHub workflow that would bring it up manually and then run the tests
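
The first option, asserting on FRR's RIB instead of installing routes, could be sketched roughly as follows. This assumes the RIB is obtained as JSON (for example from a command such as `vtysh -c "show bgp ipv4 unicast json"`) and that the output contains a `routes` object keyed by prefix; both the JSON shape and the `missingPrefixes` helper are assumptions for this example, and the sample data is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bgpTable models only the part of the JSON output used here. The "routes"
// key (a map keyed by prefix) is an assumption about FRR's JSON format.
type bgpTable struct {
	Routes map[string]json.RawMessage `json:"routes"`
}

// missingPrefixes returns the expected prefixes that are absent from the
// RIB dump. Hypothetical helper, not part of the PR.
func missingPrefixes(ribJSON []byte, expected []string) ([]string, error) {
	var table bgpTable
	if err := json.Unmarshal(ribJSON, &table); err != nil {
		return nil, fmt.Errorf("parsing RIB JSON: %w", err)
	}
	var missing []string
	for _, prefix := range expected {
		if _, ok := table.Routes[prefix]; !ok {
			missing = append(missing, prefix)
		}
	}
	return missing, nil
}

func main() {
	// Made-up sample; a real test would read this from the FRR pod.
	sample := []byte(`{"routes": {"10.244.0.0/24": [{}], "10.96.136.144/32": [{}]}}`)
	missing, err := missingPrefixes(sample, []string{"10.244.0.0/24", "10.244.1.0/24"})
	fmt.Println(missing, err)
}
```

A check like this verifies advertisements without touching the node's routing table, at the cost of no longer exercising the data path with curl.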

CC @harsimran-pabla

Hmm. Taking a step back, my base assumption for the functional code in Cilium CLI is that it provides a way for users to manage and assess the state of their cluster. Then in terms of providing testable environments, we would declare those environments outside the CLI code (such as inside GHA). This way, let's say a user wanted to run smoke tests for their environment running BGP and understand whether the cluster is operating successfully, they could deploy the exact same tests to validate this compared to what we use in CI to validate releases.

I think that this notion falls apart a bit when it comes to the NodesWithoutCilium feature and other integration tests (as opposed to smoke tests). So part of the gap here may also just be that smoke tests are specifically not part of the goal for this feature - this is purely about pre-release pipeline testing and not solving the user smoke test use case. I'll note that all of the --unsafe functionality falls into this category.

Maybe something I'm looking for is to cleanly separate cilium connectivity test as the baseline smoke tests that deploy a range of Pod topologies and validate connectivity between them, vs. the functionality which bootstraps testing environments and runs integration tests.

I chatted out-of-band a little with Michi and he floated an idea that maybe some of this logic should exist in other commands. My examples above largely assume that this functionality is built into the GitHub Actions due to the clean separation it provides. That said I'm open to the idea that maybe the better language to implement that logic is Go and the same objectives could just be achieved with different commands. Whether that looks like having a cilium deploy-dev-env command for this, or moving some of this under cilium bgp ... I'm not exactly sure. Part of me would prefer if overall this CI testing focused logic is all under a command that is clearly marked for development only, such as cilium dev .... Though in general the UX should probably follow cilium <verb> <noun>.

When it comes to the NodesWithoutCilium / --unsafe, we've already made the jump to integration testing and assuming that this is not a user environment, so my initial point for this thread is moot. In practical terms I don't think it makes a big difference if we're installing routes or FRR on this node that is deployed for testing purposes, so maybe there's not much actionable here. I probably jumped a little bit quickly from "We introduced this test into v1.16 -> broke v1.15 upgrade" to "why are we deploying FRR?" without recognizing the --unsafe guards in this test. Ultimately #2712 was the root cause for the issue I was facing.

I did have some similar concerns/feedback when we initially introduced the route installation logic, so this is a bit of a continuation of those concerns: how can we ensure that the CLI continues to provide a good user experience as we develop the integration testing side of the CLI? But that's a broader discussion, so not necessarily directly related to this specific PR (though I see more PRs with similar approaches to testing, so it's worth spending some time thinking about this topic).

I personally like the idea of having two categories of tests: one for smoke tests that can really run in any cluster (cilium connectivity test) and another that requires some special/non-production environment (e.g. tests requiring NodesWithoutCilium / --unsafe), something like cilium integration test. The latter are currently skipped anyway when users run the tests to smoke-test their production clusters. At the same time, the "integration" tests may still be useful even outside of the Cilium CI, e.g. for verifying Cilium integration into various k8s platforms and that advanced Cilium features work well there (maybe we could also call them conformance tests).
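
One way to picture the two categories is a simple filter over test descriptions, sketched below. The `testCase` type, the `Integration` flag, and `selectTests` are hypothetical names for illustration and do not reflect how cilium-cli is actually structured; the test names are only examples.

```go
package main

import "fmt"

// testCase is a hypothetical description of one connectivity test.
type testCase struct {
	Name        string
	Integration bool // needs a non-production setup, e.g. a node without Cilium
}

// selectTests returns the names of the tests to run. Smoke tests always run;
// integration tests run only when the caller opts in (similar in spirit to
// the --unsafe guard mentioned in the thread).
func selectTests(all []testCase, allowIntegration bool) []string {
	var names []string
	for _, t := range all {
		if t.Integration && !allowIntegration {
			continue
		}
		names = append(names, t.Name)
	}
	return names
}

func main() {
	tests := []testCase{
		{Name: "pod-to-pod", Integration: false},
		{Name: "bgp-control-plane-v1", Integration: true},
	}
	fmt.Println(selectTests(tests, false)) // smoke tests only
	fmt.Println(selectTests(tests, true))  // smoke and integration tests
}
```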

rastislavs temporarily deployed to ci June 28, 2024 13:39 — with GitHub Actions Inactive

rastislavs force-pushed the bgp-tests branch from 0cb4387 to 5245ba1 Compare June 28, 2024 13:49

rastislavs had a problem deploying to ci June 28, 2024 13:49 — with GitHub Actions Error

rastislavs force-pushed the bgp-tests branch from 5245ba1 to 168d160 Compare June 28, 2024 13:53

rastislavs temporarily deployed to ci June 28, 2024 13:54 — with GitHub Actions Inactive

harsimran-pabla reviewed Jun 28, 2024

rastislavs force-pushed the bgp-tests branch from 168d160 to 16c6314 Compare July 1, 2024 06:20

rastislavs temporarily deployed to ci July 1, 2024 06:20 — with GitHub Actions Inactive

rastislavs mentioned this pull request Jul 1, 2024

ci: enable BGP Control Plane in e2e tests cilium/cilium#33488

Merged

YutaroHayakawa requested changes Jul 1, 2024

connectivity/tests/bgp.go (resolved)

connectivity/tests/bgp.go (resolved)

rastislavs added 2 commits July 3, 2024 14:52

connectivity: Introduce BGP CP connectivity tests

e6ba851

Introduces BGP Control Plane connectivity tests, working with FRR router instance running in the host network namespace on the node-without-cilium. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>

CODEOWNERS: Assign BGP/FRR source files to sig-bgp

b0dafd5

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>

rastislavs force-pushed the bgp-tests branch from 16c6314 to b0dafd5 Compare July 3, 2024 12:52

rastislavs temporarily deployed to ci July 3, 2024 12:52 — with GitHub Actions Inactive

rastislavs marked this pull request as ready for review July 3, 2024 14:30

rastislavs requested review from a team as code owners July 3, 2024 14:30

rastislavs requested review from youngnick, nebril and squeed July 3, 2024 14:30

maintainer-s-little-helper bot requested a review from YutaroHayakawa July 4, 2024 03:47

YutaroHayakawa approved these changes Jul 4, 2024

nebril approved these changes Jul 4, 2024

squeed approved these changes Jul 4, 2024

youngnick removed their request for review July 4, 2024 23:37

michi-covalent requested a review from harsimran-pabla July 6, 2024 17:09

harsimran-pabla approved these changes Jul 8, 2024

maintainer-s-little-helper bot added the ready-to-merge label Jul 8, 2024

aditighag merged commit 736139d into cilium:main Jul 9, 2024
13 checks passed

joestringer reviewed Jul 24, 2024

joestringer mentioned this pull request Jul 26, 2024

CONTRIBUTING.md: Add guidelines regarding cluster state mutation #2726

Closed
