-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectivity: Introduce BGP CP connectivity tests #2649
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome stuff, some minor comments otherwise 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Great stuff! I had a few comments.
One more thing. Is it possible to collect the FRR's running state on test failure? Otherwise, it's a bit hard to investigate the failure.
Introduces BGP Control Plane connectivity tests, working with FRR router instance running in the host network namespace on the node-without-cilium. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
I added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concerns are addressed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ci-structure LGTM
if ct.Features[features.BGPControlPlane].Enabled && ct.Features[features.NodeWithoutCilium].Enabled { | ||
_, err = ct.clients.src.GetDaemonSet(ctx, ct.params.TestNamespace, frrDaemonSetNameName, metav1.GetOptions{}) | ||
if err != nil { | ||
ct.Logf("✨ [%s] Deploying %s daemonset...", ct.clients.src.ClusterName(), frrDaemonSetNameName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the CLI should be deploying FRR like this.
The CLI is intended to be run in production environments. Deploying another BGP speaker into the live production network as part of the tests seems inadvisable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we solve this by deploying the environment with the relevant platform details (such as BGP infrastructure) in the GitHub Action, and then only turn on the specific tests based on detecting the available configuration knobs and infrastructure that can be used to do the testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see, we hide it in the "unsafe" category of tests: https://github.com/cilium/cilium-cli/pull/2649/files/b0dafd55b753585cf8a174b8e3a3d7d9f5e78790#diff-f2e389864746a093ca953d2ce4f6bca61ef7fe01c0ea08eca12324246abcd1afR22
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some background to my thinking here: In Cilium's "CI 2.0" AKA Ginkgo running in Jenkins, we did not do a good job of isolating the platform configuration and environment setup vs. the actual tests. Over time what this resulted in is (a) random tests would provision stuff and not properly clean up, leading to flakes and (b) a large amount of the overall test runtime went towards spinning up and tearing down the environments. After that, while setting up the notion of "CI 3.0" AKA cilium-cli based testing, the intent was to start with GitHub actions representing target environments of various kinds, then define only the actual runtime tests inside the cilium-cli. This way, when you run the cilium-cli testsuite, maybe it deploys a few Pods but for the most part it is just executing a bunch of different commands in the various Pods and checking the results of the tests. It also matches much more closely with the likely usage from end-users who may want to validate whether their environment is successfully implementing the features they need to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the concerns @joestringer, and thanks for explaining the background. Let me first mention some rationale behind these tests and hopefully we can come to a conclusion on how to solve them:
- As part of cilium-cli, we already have tests that require "NodeWithoutCilium" (example) and even install routes on those nodes (see WithIPRoutesFromOutsideToPodCIDRs)
- In comparison to these, BGP tests are different only in the way how the routes are installed (indirectly via FRR vs. directly via cilium-cli), plus the fact that an additional FRR daemonset needs to be deployed on the "NodeWithoutCilium"
- The "NodeWithoutCilium" is sufficient for testing most of the BGP features, there is no need for more complex infra setup, and separate GitHub workflow(s) to bring it up
To address your concerns, I think we have following options:
- If installing the routes on the NodeWithoutCilium is the concern, we could also configure the test FRR to not install any routes - we would just check the FRRs RIB to assert, this way the tests would not affect the environment in any way
- If the FRR pod deployment on the NodeWithoutCilium as part of the cilium-cli tests setup is the main concern, we can OFC remove this from cilium-cli and introduce a new Github workflow that would bring it up manually and then run the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Taking a step back, my base assumption for the functional code in Cilium CLI is that it provides a way for users to manage and assess the state of their cluster. Then in terms of providing testable environments, we would declare those environments outside the CLI code (such as inside GHA). This way, let's say a user wanted to run smoke tests for their environment running BGP and understand whether the cluster is operating successfully, they could deploy the exact same tests to validate this compared to what we use in CI to validate releases.
I think that this notion falls apart a bit when it comes to the NodesWithoutCilium
feature and other integration tests (as opposed to smoke tests). So part of the gap here may also just be that smoke tests are specifically not part of the goal for this feature - this is purely about pre-release pipeline testing and not solving the user smoke test use case. I'll note that all of the --unsafe
functionality falls into this category.
Maybe something I'm looking for is to cleanly separate cilium connectivity test
as the baseline smoke tests that deploy a range of Pod topologies and validate connectivity between them, vs. the functionality which bootstraps testing environments and runs integration tests.
I chatted out-of-band a little with Michi and he floated an idea that maybe some of this logic should exist in other commands. My examples above largely assume that this functionality is built into the GitHub Actions due to the clean separation it provides. That said I'm open to the idea that maybe the better language to implement that logic is Go and the same objectives could just be achieved with different commands. Whether that looks like having a cilium deploy-dev-env
command for this, or moving some of this under cilium bgp ...
I'm not exactly sure. Part of me would prefer if overall this CI testing focused logic is all under a command that is clearly marked for development only, such as cilium dev ...
. Though in general the UX should probably follow cilium <verb> <noun>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When it comes to the NodesWithoutCilium
/ --unsafe
, we've already made the jump to integration testing and assuming that this is not a user environment, so my initial point for this thread is moot. In practical terms I don't think it makes a big difference if we're installing routes or FRR on this node that is deployed for testing purposes, so maybe there's not much actionable here. I probably jumped a little bit quickly from "We introduced this test into v1.16 -> broke v1.15 upgrade" to "why are we deploying FRR?" without recognizing the --unsafe
guards in this test. Ultimately #2712 was the root cause for the issue I was facing.
I did have some similar concerns/feedback when we initially introduced the route installation logic, so this is a bit of a continuation of those concerns. How can we ensure that the CLI continues to provide a good user experience as we develop the integration testing side of the CLI. But that's a broader discussion, so not necessarily directly related to this specific PR (though I see more PRs with similar approaches to testing, so it's worth spending some time thinking about this topic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally like the idea of having two categories of tests, one for smoke tests that can really run in any cluster (cilium connectivity test
) and another that requires some special/non-production environment (e.g. tests requiring NodesWithoutCilium
/ --unsafe
) - something like cilium integration test
. These are anyway currently skipped when running by users to smoke-test their production clusters. At the same time, the "integration" tests still may be useful even outside of the cilium CI, e.g. for verifying Cilium integration into various k8s platforms, to verify that advanced Cilium features work well there (maybe we could also call them conformance tests).
Introduces BGP Control Plane connectivity tests, working with FRR router instance running in the host network namespace on the
node-without-cilium
.The aim of this PR is mostly to introduce the BGP + FRR testing infrastructure, more testing scenarios need to be added in follow-up PRs. At the moment, we are testing:
Example run with debug enabled:
Example e2e job: https://github.com/cilium/cilium/actions/runs/9778585998/job/26995747615