Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectivity: Introduce BGP CP connectivity tests #2649

Merged
merged 2 commits into from
Jul 9, 2024

Conversation

rastislavs
Copy link
Contributor

@rastislavs rastislavs commented Jun 28, 2024

Introduces BGP Control Plane connectivity tests, working with FRR router instance running in the host network namespace on the node-without-cilium.

The aim of this PR is mostly to introduce the BGP + FRR testing infrastructure, more testing scenarios need to be added in follow-up PRs. At the moment, we are testing:

  • BGPv1 control plane, IPv4 + IPv6 scenario,
  • BGPv2 control plane, IPv4 + IPv6 scenario,
  • in both scenarios we test PodCIDR advertisements and ClusterIP service advertisements.
Example run with debug enabled:
[=] [cilium-test] Test [bgp-control-plane-v1] [80/84]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (10.244.1.188:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.1.188:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (10.244.0.254:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.0.254:8080]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node (10.96.136.144:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.136.144:8080]
..  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node (10.96.194.250:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.194.250:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (fd00:10:244::9227:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244::9227]:8080]
.  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-pod-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (fd00:10:244:1::1a8:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244:1::1a8]:8080]
  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node (fd00:10:96::4567:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4567]:8080]
..  [.] Action [bgp-control-plane-v1/bgpv1-advertisements/curl-echo-service-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node (fd00:10:96::4763:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4763]:8080]
.  🐛 Finalizing Test bgp-control-plane-v1
  [-] Scenario [bgp-control-plane-v2/bgpv2-advertisements]
[=] [cilium-test] Test [bgp-control-plane-v2] [81/84]
  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (10.244.0.254:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.0.254:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (10.244.1.188:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.244.1.188:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv4-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-other-node (10.96.194.250:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.194.250:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv4-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (172.22.0.4) -> cilium-test/echo-same-node (10.96.136.144:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://10.96.136.144:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node-58999bbffd-qsrq8 (fd00:10:244::9227:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244::9227]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-pod-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node-7f896b84-sbqv4 (fd00:10:244:1::1a8:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:244:1::1a8]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv6-0: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-same-node (fd00:10:96::4567:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4567]:8080]
.  [.] Action [bgp-control-plane-v2/bgpv2-advertisements/curl-echo-service-ipv6-1: cilium-test/echo-external-node-89864b5bd-4lzt4 (fc00:c111::4) -> cilium-test/echo-other-node (fd00:10:96::4763:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fd00:10:96::4763]:8080]
.  🐛 Finalizing Test bgp-control-plane-v2

Example e2e job: https://github.com/cilium/cilium/actions/runs/9778585998/job/26995747615

Copy link
Contributor

@harsimran-pabla harsimran-pabla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff, some minor comments otherwise 🚀

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Great stuff! I had a few comments.

One more thing. Is it possible to collect the FRR's running state on test failure? Otherwise, it's a bit hard to investigate the failure.

Introduces BGP Control Plane connectivity tests, working
with FRR router instance running in the host network
namespace on the node-without-cilium.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
@rastislavs
Copy link
Contributor Author

@YutaroHayakawa

Is it possible to collect the FRR's running state on test failure? Otherwise, it's a bit hard to investigate the failure.

I added DumpFRRBGPState function that will dump BGP state into the log in case that the test failed.

@rastislavs rastislavs marked this pull request as ready for review July 3, 2024 14:30
@rastislavs rastislavs requested review from a team as code owners July 3, 2024 14:30
@rastislavs rastislavs requested review from youngnick, nebril and squeed July 3, 2024 14:30
Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns are addressed. Thanks!

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ci-structure LGTM

@youngnick youngnick removed their request for review July 4, 2024 23:37
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 8, 2024
@aditighag aditighag merged commit 736139d into cilium:main Jul 9, 2024
13 checks passed
if ct.Features[features.BGPControlPlane].Enabled && ct.Features[features.NodeWithoutCilium].Enabled {
_, err = ct.clients.src.GetDaemonSet(ctx, ct.params.TestNamespace, frrDaemonSetNameName, metav1.GetOptions{})
if err != nil {
ct.Logf("✨ [%s] Deploying %s daemonset...", ct.clients.src.ClusterName(), frrDaemonSetNameName)
Copy link
Member

@joestringer joestringer Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the CLI should be deploying FRR like this.

The CLI is intended to be run in production environments. Deploying another BGP speaker into the live production network as part of the tests seems inadvisable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we solve this by deploying the environment with the relevant platform details (such as BGP infrastructure) in the GitHub Action, and then only turn on the specific tests based on detecting the available configuration knobs and infrastructure that can be used to do the testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some background to my thinking here: In Cilium's "CI 2.0" AKA Ginkgo running in Jenkins, we did not do a good job of isolating the platform configuration and environment setup vs. the actual tests. Over time what this resulted in is (a) random tests would provision stuff and not properly clean up, leading to flakes and (b) a large amount of the overall test runtime went towards spinning up and tearing down the environments. After that, while setting up the notion of "CI 3.0" AKA cilium-cli based testing, the intent was to start with GitHub actions representing target environments of various kinds, then define only the actual runtime tests inside the cilium-cli. This way, when you run the cilium-cli testsuite, maybe it deploys a few Pods but for the most part it is just executing a bunch of different commands in the various Pods and checking the results of the tests. It also matches much more closely with the likely usage from end-users who may want to validate whether their environment is successfully implementing the features they need to use.

Copy link
Contributor Author

@rastislavs rastislavs Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the concerns @joestringer, and thanks for explaining the background. Let me first mention some rationale behind these tests and hopefully we can come to a conclusion on how to solve them:

  • As part of cilium-cli, we already have tests that require "NodeWithoutCilium" (example) and even install routes on those nodes (see WithIPRoutesFromOutsideToPodCIDRs)
  • In comparison to these, BGP tests are different only in the way how the routes are installed (indirectly via FRR vs. directly via cilium-cli), plus the fact that an additional FRR daemonset needs to be deployed on the "NodeWithoutCilium"
  • The "NodeWithoutCilium" is sufficient for testing most of the BGP features, there is no need for more complex infra setup, and separate GitHub workflow(s) to bring it up

To address your concerns, I think we have following options:

  • If installing the routes on the NodeWithoutCilium is the concern, we could also configure the test FRR to not install any routes - we would just check the FRRs RIB to assert, this way the tests would not affect the environment in any way
  • If the FRR pod deployment on the NodeWithoutCilium as part of the cilium-cli tests setup is the main concern, we can OFC remove this from cilium-cli and introduce a new Github workflow that would bring it up manually and then run the tests

CC @harsimran-pabla

Copy link
Member

@joestringer joestringer Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Taking a step back, my base assumption for the functional code in Cilium CLI is that it provides a way for users to manage and assess the state of their cluster. Then in terms of providing testable environments, we would declare those environments outside the CLI code (such as inside GHA). This way, let's say a user wanted to run smoke tests for their environment running BGP and understand whether the cluster is operating successfully, they could deploy the exact same tests to validate this compared to what we use in CI to validate releases.

I think that this notion falls apart a bit when it comes to the NodesWithoutCilium feature and other integration tests (as opposed to smoke tests). So part of the gap here may also just be that smoke tests are specifically not part of the goal for this feature - this is purely about pre-release pipeline testing and not solving the user smoke test use case. I'll note that all of the --unsafe functionality falls into this category.

Maybe something I'm looking for is to cleanly separate cilium connectivity test as the baseline smoke tests that deploy a range of Pod topologies and validate connectivity between them, vs. the functionality which bootstraps testing environments and runs integration tests.

I chatted out-of-band a little with Michi and he floated an idea that maybe some of this logic should exist in other commands. My examples above largely assume that this functionality is built into the GitHub Actions due to the clean separation it provides. That said I'm open to the idea that maybe the better language to implement that logic is Go and the same objectives could just be achieved with different commands. Whether that looks like having a cilium deploy-dev-env command for this, or moving some of this under cilium bgp ... I'm not exactly sure. Part of me would prefer if overall this CI testing focused logic is all under a command that is clearly marked for development only, such as cilium dev .... Though in general the UX should probably follow cilium <verb> <noun>.

Copy link
Member

@joestringer joestringer Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it comes to the NodesWithoutCilium / --unsafe, we've already made the jump to integration testing and assuming that this is not a user environment, so my initial point for this thread is moot. In practical terms I don't think it makes a big difference if we're installing routes or FRR on this node that is deployed for testing purposes, so maybe there's not much actionable here. I probably jumped a little bit quickly from "We introduced this test into v1.16 -> broke v1.15 upgrade" to "why are we deploying FRR?" without recognizing the --unsafe guards in this test. Ultimately #2712 was the root cause for the issue I was facing.

I did have some similar concerns/feedback when we initially introduced the route installation logic, so this is a bit of a continuation of those concerns. How can we ensure that the CLI continues to provide a good user experience as we develop the integration testing side of the CLI. But that's a broader discussion, so not necessarily directly related to this specific PR (though I see more PRs with similar approaches to testing, so it's worth spending some time thinking about this topic).

Copy link
Contributor Author

@rastislavs rastislavs Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like the idea of having two categories of tests, one for smoke tests that can really run in any cluster (cilium connectivity test) and another that requires some special/non-production environment (e.g. tests requiring NodesWithoutCilium / --unsafe) - something like cilium integration test. These are anyway currently skipped when running by users to smoke-test their production clusters. At the same time, the "integration" tests still may be useful even outside of the cilium CI, e.g. for verifying Cilium integration into various k8s platforms, to verify that advanced Cilium features work well there (maybe we could also call them conformance tests).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants