enhancement(observability): Initial API for tap
#6363
Conversation
Signed-off-by: Lee Benson <lee@leebenson.com>
@@ -116,8 +116,8 @@ goauth = { version = "0.8.1", optional = true }
smpl_jwt = { version = "0.5.0", optional = true }

# API
-async-graphql = { version = "2.4.10", optional = true }
-async-graphql-warp = { version = "2.4.10", optional = true }
+async-graphql = { version = "=2.4.11", optional = true }
This one-liner took a surprisingly long time to figure out :-)
The 2.5 branch of async-graphql bumps to Warp 0.3, which changes the signature of Warp filters and how they interact with the GraphQL pieces. The bump was needed because of an issue with Context I was running into (the mechanism used to pass the event inspector to the active HTTP request without resorting to a global; I'm trying to move away from the latter.)
Unfortunately, we have other code relying on 0.2.5, which broke a lot of things. After some trial and error, it seems this particular patch version is both stapled to the right Warp version for us and fixes the Context issue.
I'd like to follow up with a general bump to Warp 0.3 later to avoid being stuck on this rung of async-graphql (and also because it's the latest anyway.)
👍 do you mind opening an issue for the warp upgrade?
#6383 👍
@@ -216,6 +216,7 @@ tokio-postgres = { version = "0.5.5", optional = true, features = ["runtime", "w
thread_local = "= 1.0.1"
dashmap = "3"
encoding_rs = { version = "0.8", features = ["serde"] }
+parking_lot = "0.11.1"
Using parking_lot here for its RwLock implementation, which according to the readme:
The numbers for RwLock vary depending on the number of reader and writer threads, but are almost always faster than the standard library RwLock, and even up to 50x faster in some cases.
Given this claim, I was surprised it wasn't being used in other places where we've reached for the stdlib version. There may be design decisions/considerations I'm not privy to that preclude its use - please advise if this sticks out as being undesirable.
If this holds up as a decent choice, there's further usage in API code I could swap it out for.
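For a concrete sense of the ergonomic difference being discussed, here's a minimal sketch; the struct and field are hypothetical, not code from this PR:

```rust
use parking_lot::RwLock; // drop-in replacement for std::sync::RwLock

// Hypothetical shared state guarded by a lock.
struct TapState {
    subscribers: RwLock<Vec<String>>,
}

impl TapState {
    fn add_subscriber(&self, name: String) {
        // parking_lot locks return the guard directly: there is no poisoning,
        // so no `.unwrap()` as with the standard library RwLock.
        self.subscribers.write().push(name);
    }

    fn subscriber_count(&self) -> usize {
        self.subscribers.read().len()
    }
}
```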
src/api/schema/events/log.rs
impl LogEvent {
    /// Log message
    async fn message(&self) -> Option<String> {
        serde_json::from_value(json!(self.0.get("message")?)).ok()
This seems a bit round the houses to extract a String. Is there a way to avoid the initial serialisation to a JSON string? I'm pretty sure I'm missing something obvious.
I think you probably want to match:
self.0.get("message").and_then(|message| match message {
    v @ Value::Bytes(_) => Some(v.to_string_lossy()),
    _ => None,
})
Thanks! I actually missed an existing method of event::Value that reduces this to:
Some(self.0.get("message")?.to_string_lossy())
    &'a self,
    ctx: &'a Context<'a>,
    component_name: String,
) -> impl Stream<Item = LogEvent> + 'a {
I can see this eventually being extracted out into a general filter, but think it's okay inline for now.
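For reference, the kind of general filter that could eventually be extracted might look something like this sketch; the stream item shape (component name paired with the event) and the function name are assumptions for illustration, not this PR's actual types:

```rust
use futures::{Stream, StreamExt};

// Hypothetical: narrow a stream of (component_name, event) pairs down to the
// events emitted by a single component, yielding just the events.
fn filter_by_component<S, E>(events: S, component_name: String) -> impl Stream<Item = E>
where
    S: Stream<Item = (String, E)>,
{
    events.filter_map(move |(name, event)| {
        let keep = name == component_name;
        async move { if keep { Some(event) } else { None } }
    })
}
```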
This reverts commit acc8e5e.
Co-authored-by: Pablo Sichert <mail@pablosichert.com>
Hey @leebenson, I'm trying this out. Update: I wasn't able to get this to work with transforms. Also, I kinda think non-existent components should return an error.
Thanks @sghall. The shift to a sink may have dislodged something with transforms -- will audit tomorrow. Thanks for checking this out. 👍 on returning an error, too.
src/app.rs
let tap_controller = if cfg!(feature = "api") {
    let tap_controller = TapController::new();

    let tap_sinks = config
Thinking the issue with transforms is here. I tried this locally and it seemed to do the trick:
let tap_sinks = config
    .sources
    .keys()
    .chain(config.transforms.keys())
    .map(|name| {
        (format!("_tap_{}", &name), tap_controller.make_sink(name))
    })
    .collect::<IndexMap<String, SinkOuter>>();
Yep that’s it! Nicely caught 🎉
I don't believe the tap is surviving a config change (hot update). Oh, yep, you said that:
Handling reloads. There's currently no mechanism to re-register components that may have changed.
Think that would be a nice follow-up, or include it here. I think that's definitely needed if you're trying to use this to debug your config.
Putting this back into draft for now, to implement the reload plumbing.
I think this should be alright to review. I've done some refactoring to work with existing topology plumbing, and attempted to limit the surface area of tap, so that it's mostly an implementation detail of building config and not something that needs configuring directly. I've also attempted to improve the log output to make it clearer what's being instantiated during startup.
Follow-up work
There's at least a couple of areas I'd like to improve on in follow-up work. The other work mentioned in the opening PR notes will also be tackled separately - filtering, velocity control, etc.
    &self,
    cx: SinkContext,
) -> crate::Result<(super::VectorSink, super::Healthcheck)> {
    let mut lock = self.locked_tx.write();
Using the blocking mutex in the async context will lead to reactor locking. I suggest using async-aware mutexes - they'll park the task while waiting, rather than blocking the thread.
Do you have a suggested lib/lock that you'd recommend in this scenario? I saw https://lib.rs/crates/future-parking_lot, but have no experience with it.
(Or at least the current tokio version equivalent of it.)
Yep, tokio mutexes are good! Also, https://docs.rs/futures/0.3.12/futures/lock/struct.Mutex.html, and a potentially more interesting https://docs.rs/futures/0.3.12/futures/lock/struct.BiLock.html - it's more lightweight than the usual mutex, but it is unconventional.
There's also https://docs.rs/arc-swap/1.2.0/arc_swap/ - it also doesn't block, so it's fine to use within an async context.
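For illustration, here's a minimal sketch of the async-aware approach being suggested, using tokio's RwLock. The locked_tx field mirrors the diff above, but the surrounding struct, method, and Sender type are stand-ins, not this PR's actual code:

```rust
use tokio::sync::{mpsc, RwLock};

// Hypothetical stand-in for a sink config holding a sender behind a lock.
struct TapSink {
    locked_tx: RwLock<Option<mpsc::Sender<String>>>,
}

impl TapSink {
    // `.write().await` parks the task while waiting for the guard instead of
    // blocking the thread, so the reactor keeps making progress.
    async fn take_tx(&self) -> Option<mpsc::Sender<String>> {
        let mut guard = self.locked_tx.write().await;
        guard.take()
    }
}
```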
@@ -212,8 +212,9 @@ tokio-postgres = { version = "0.5.5", features = ["runtime", "with-chrono-0_4"],
toml = "0.5.8"
typetag = "0.1.6"
url = "2.2.0"
-uuid = { version = "0.8", features = ["serde", "v4"], optional = true }
+uuid = { version = "0.8", features = ["serde", "v4"] }
I realise I'm undoing a recent PR to add uuid as optional. The crate size is ~37kb, so I don't think this adds much bloat. This prevents having to add feature guards in a few places. If this is contentious, I can add the guards.
Signed-off-by: Lee Benson <lee@leebenson.com>
I'm not sure I follow this line of reasoning. Sinks very much narrow the scope of their input to particular components, just not via the sink API itself. It seems like we're reimplementing some existing functionality of the topology here. Adding a new sink to the running topology looks roughly like:

let new_sink = config.build();
let (tx, rx) = mpsc::channel();
let sink_task = tokio::spawn(new_sink.run(rx));
for input in config.inputs {
    topology.outputs[input].send(ControlMessage::Add(config.name, tx.clone()))
}

And with that, the output of the components specified in the sink's inputs is sent to the new sink.

How I imagined this working was roughly that the API endpoint would have a reference to the current topology and could attach and detach a tap sink this way on demand. This would be simpler with a somewhat more ergonomic API around the topology, but fundamentally it should be possible without introducing any tap-specific concepts into the topology code itself. This would really be ideal, because the code is already somewhat hard to follow and simplifying it will only get more difficult if we pile in additional concepts.

This certainly isn't the only way to build this, as you've demonstrated, but I'm not convinced I'd explained it properly and wanted to make sure we at least consider this approach thoroughly before deciding to add to the topology as we're doing here.
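To make that concrete, a rough sketch of the API-side flow described above; ControlMessage mirrors the snippet, while TopologyHandle, the String event type, and the method names are all made up for illustration:

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

// Hypothetical control messages, mirroring the snippet above.
enum ControlMessage {
    Add(String, mpsc::Sender<String>),
    Remove(String),
}

// Hypothetical handle the API layer would hold, mapping component names to
// the control channel of each component's output fanout.
struct TopologyHandle {
    outputs: HashMap<String, mpsc::Sender<ControlMessage>>,
}

impl TopologyHandle {
    // When a subscription starts: attach a throwaway "tap" sink to a component
    // and hand its receiver back to the subscriber.
    async fn tap(&self, component: &str, sink_name: &str) -> Option<mpsc::Receiver<String>> {
        let (tx, rx) = mpsc::channel(100);
        let control = self.outputs.get(component)?;
        control.send(ControlMessage::Add(sink_name.to_string(), tx)).await.ok()?;
        Some(rx)
    }

    // When the subscription ends: detach the tap sink again.
    async fn untap(&self, component: &str, sink_name: &str) {
        if let Some(control) = self.outputs.get(component) {
            let _ = control.send(ControlMessage::Remove(sink_name.to_string())).await;
        }
    }
}
```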
@lukesteensen - please ignore the original PR notes regarding this not being viable as a sink. I wasn't thinking about it in the right way. This is no longer relevant. Thanks for your other notes re: simplifying the connection between topology and tap. That makes a lot of sense. I'll give this a play today.
Based on feedback here, I wound up working from a clean branch to reimplement. This has drifted quite a lot, and I'll open a new PR when ready -- closing.
Introduces the beginnings of tap - a way to observe event::LogEvents via the API. The intended use-case is an (upcoming) vector tap CLI subcommand + UI dashboard.

Demo
14 second demo (no sound):
Note that this shows a Subscription.events field; this was changed to Subscription.logEvents:
https://www.loom.com/share/9bceb26e5f37447da3834bd346047487
4 min version (narrated):
This expands on the above demo with commentary, and additional fields that have since been added:
https://www.loom.com/share/b543370c88f84e86869f99286060e189

Current scope
This PR adds an initial Subscription.logEvents subscription, which returns all log events generated by a source or transform component, identified by componentName: String.
Later work will include:
- LogEvent.
- A vector tap client to consume the API.

Example query
Assuming a configured source, a subscription query will return payloads at 1 second intervals (controlled via batch_interval = 1.0).
Later work will include the ability to override the natural velocity of events and sample at user-defined intervals.

Implementation
As event::LogEvents are emitted via sources/transforms, a new event::EventInspector inspects the events and uses a tokio broadcast channel to send them to subscribers. In the API, a listener is created ad hoc based on an active subscription.
Based on the discussion in #3212, I initially attempted to model the emitter as a sink to use the existing fanout pattern. This seemed unfeasible - the StreamSink trait's run method receives an event without awareness of its origin. Since a key characteristic of tap is the ability to narrow the scope to particular components, this appeared to require wholesale changes to existing implementors, which felt redundant and potentially wasteful given the narrow use-case.
Instead, I adopted an existing pattern where an EventProcessed is emitted via an .inspect() call on the input rx channel itself, prior to being received by the downstream component. One side-effect of this is that clients technically receive the event before a connected component has had a chance to receive it. I felt this is okay though, given that the component being inspected is the one that actually emits the event; there's no concept of downstream ack'ing that I felt we needed to cater for.
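To make the broadcast-channel mechanism concrete, here is a minimal sketch of the shape described above; the types are simplified stand-ins reconstructed for illustration, not the PR's actual event::EventInspector:

```rust
use tokio::sync::broadcast;

// Simplified stand-in for event::LogEvent.
#[derive(Clone, Debug)]
struct LogEvent {
    component_name: String,
    message: String,
}

// Simplified stand-in for event::EventInspector.
struct EventInspector {
    tx: broadcast::Sender<LogEvent>,
}

impl EventInspector {
    fn new(capacity: usize) -> Self {
        let (tx, _rx) = broadcast::channel(capacity);
        Self { tx }
    }

    // Called on the topology side as events pass through a component's input
    // channel (e.g. from an `.inspect()` on the receiving stream).
    fn inspect(&self, event: &LogEvent) {
        // A send error only means no subscriber is currently listening.
        let _ = self.tx.send(event.clone());
    }

    // Called on the API side when a `logEvents` subscription starts; each
    // active subscription gets its own receiver on the broadcast channel.
    fn subscribe(&self) -> broadcast::Receiver<LogEvent> {
        self.tx.subscribe()
    }
}
```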