
Gateway publish to orchestrator. #3211

Merged: 8 commits merged into ai-video on Nov 7, 2024
Conversation

@j0sh (Collaborator) commented on Oct 18, 2024:

  • Requires Scaffolding for realtime-to-realtime #3210
  • Sets up trickle HTTP endpoints on the orchestrator (requires golang 1.22 for the new routes; will send a separate patch to bump the go.mod and do any CI adjustments)
  • Pulls an RTMP stream from MediaMTX on the gateway when a new stream comes in
  • Converts the RTMP stream into MPEG-TS segments
  • Publishes the segments to the orchestrator via trickle HTTP (a condensed sketch of this flow follows below)
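
For orientation, here is a condensed sketch of that gateway-side flow, stitched together from snippets elsewhere in this PR. The trickle publisher names (NewTricklePublisher, Write, NextSegment) are assumptions for illustration, not the exact API introduced here.

// Sketch only (assumed API): pull RTMP from MediaMTX, segment it, and
// publish each segment to the orchestrator over trickle HTTP.
ssr := media.NewSwitchableSegmentReader()
// RunSegmentation pulls the RTMP stream and hands segments to ssr.Read.
go media.RunSegmentation("rtmp://localhost/"+streamName, ssr.Read)

// publishURL is the endpoint the orchestrator handed out for this job.
pub, err := trickle.NewTricklePublisher(publishURL) // assumed constructor
if err != nil {
	return err
}
for {
	seg := ssr.NextSegment() // assumed accessor; blocks until the next segment
	if seg == nil {
		break // stream ended
	}
	if err := pub.Write(seg); err != nil { // one POST per segment / GOP
		break
	}
}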

@leszko (Contributor) left a comment:

Good work, I think this is a very important PR because it describes the protocol between G<>O. I added 2 comments, because I think I still don't understand how we plan for it all to work.

Other than that, @j0sh, is it possible to actually run the code from this PR? If so, could you describe how to run it locally? I think if I play with it, I can have some more comments and understand it better.

// Kick off the RTMP pull and segmentation as soon as possible
ssr := media.NewSwitchableSegmentReader()
go func() {
media.RunSegmentation("rtmp://localhost/"+streamName, ssr.Read)
@leszko (Contributor):

Does this mean that the Gateway will pull the stream from MediaMTX? Isn't that the opposite of what you designed in the diagram?


@j0sh (Collaborator, Author):

Yes, the gateway initiates the RTMP pull, but the flow of media still goes from mediamtx -> gateway.

I guess there could be a better distinction between "who initiates the pull" vis-à-vis the actual flow of media, but RTMP isn't really a request-response protocol in the same way the other HTTP flows in this diagram are. Open to suggestions.

@leszko (Contributor):

Maybe you can add one more arrow with "initiate RTMP" 🙃 For me it is/was pretty confusing.

u = sess.Transcoder() + u
return url.Parse(u)
}
pub, err := appendHostname(resp.JSON200.PublishUrl)
@leszko (Contributor):

Why do you need PublishUrl on the Orchestrator? Isn't it the Gateway publishing to the Orchestrator?

I think in general it's hard to grasp how this trickle server works. Could you describe it somewhere? I understand it should correspond to the diagram from your doc, but where is the publish/subscribe part?


@j0sh (Collaborator, Author):

> Why do you need PublishUrl on the Orchestrator? Isn't it the Gateway publishing to the Orchestrator?

Yeah, this was a bit of a spur-of-the-moment addition when I was looking at the overall flow, where we are doing this request -> response call anyway. This gives the exact endpoint where a publish should happen. Likewise for the subscribe URL: it tells the gateway where to pull the results.

We could skip this entirely and hard-code the URLs via well-known paths, distinguish jobs via IDs in HTTP headers, etc., but this is an easy way for us to add a bit of topological flexibility without breaking the protocol later (eg, routing the stream to a different machine).

In fact, from some of the conversations on Discord right now, there is probably another way to make this even more robust: return a list of subscribe URLs (think multiple renditions of low-latency video transcoding).
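
To make that concrete, a small sketch based on the appendHostname snippet above; the SubscribeUrl field is an assumption mirroring PublishUrl.

// Sketch: the orchestrator returns relative publish/subscribe paths, and
// the gateway prefixes the session's transcoder host before parsing.
appendHostname := func(u string) (*url.URL, error) {
	return url.Parse(sess.Transcoder() + u)
}
pub, err := appendHostname(resp.JSON200.PublishUrl)
if err != nil {
	return err
}
sub, err := appendHostname(resp.JSON200.SubscribeUrl) // assumed field name
if err != nil {
	return err
}
// pub is where the gateway POSTs segments; sub is where it pulls results.
// Because both come from the response, the orchestrator can later route
// either one to a different machine without a protocol change.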

@j0sh force-pushed the ja/add-mediamtx-listener branch from a92dacb to 31077ff on October 21, 2024
@j0sh force-pushed the ja/ai-live-publish branch from 110cbc6 to 737d637 on October 21, 2024
@j0sh (Collaborator, Author) commented on Oct 21, 2024:

Force-pushed to fix merge conflicts from the ai-video rebase

> Is it possible to actually run the code from this PR? If yes, could you describe how to run it locally?

Absolutely! The big thing is making sure you have MediaMTX running. This is pretty simple: it is a single executable plus a config file. Download a pre-built release and use this config file [1]. Stick the executable in the same directory as the config and run ./mediamtx

Then for the gateway:

./livepeer -gateway -rtmpAddr :1936 -httpAddr :5936 -orchAddr localhost:8935

For the orchestrator + worker:

./livepeer -orchestrator -aiWorker -aiModels 'live-video-to-video:stream-diffusion:false' -serviceAddr localhost:8935 -transcoder

Publish to http://<mediamtx-host>:8889/streamname/publish ; see more details in livepeer/ai-runner#209

> I still don't understand how we plan it all to work

That's fair; as mentioned earlier, I think that is partially a result of recent pressure to deliver things without enough time to fully design an end-to-end flow within go-livepeer.

BTW, I am afraid this PR is probably not quite enough for you to base your work on just yet, if the plan is still to carry payments within the media stream. The trickle server on the orchestrator only behaves as a simple pipe between publisher and subscriber; it does not (yet) have any mechanism to execute additional code based on incoming segments, eg for us to process PM tickets or record metrics for selection, and we need to adjust the publisher API to also include custom headers. I'll have those in within the next day or so.
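
One possible shape for those hooks, sketched purely to show where ticket processing and metrics could slot in; every name here is hypothetical:

// Hypothetical sketch: the trickle server invokes a callback per incoming
// segment so the orchestrator can process PM tickets from custom headers or
// record selection metrics before relaying bytes to subscribers.
type SegmentCallback func(streamName string, seq int, headers http.Header) error

type ServerConfig struct {
	Mux       *http.ServeMux
	OnSegment SegmentCallback // nil means behave as a plain pipe
}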

NB: One elephant in the room here is "how do we integrate mediamtx into our infrastructure" and I will spin up a separate thread to discuss that.

[1] The only difference between this config file and the MediaMTX sample config is the addition of a runOnReady curl hook, plus enabling STUN support for WebRTC (because my cloud dev box doesn't work without it; prod might vary). Diff below:

diff --git b/mediamtx.yml a/mediamtx.yml
index c3aed76..cf7c60c 100644
--- b/mediamtx.yml
+++ a/mediamtx.yml
@@ -376,8 +376,8 @@ webrtcAdditionalHosts: []
 # ICE servers. Needed only when local listeners can't be reached by clients.
 # STUN servers allows to obtain and share the public IP of the server.
 # TURN/TURNS servers forces all traffic through them.
-webrtcICEServers2: []
-  # - url: stun:stun.l.google.com:19302
+webrtcICEServers2:
+  - url: stun:stun.l.google.com:19302
   # if user is "AUTH_SECRET", then authentication is secret based.
   # the secret must be inserted into the password field.
   # username: ''
@@ -643,7 +643,7 @@ pathDefaults:
   #   a regular expression.
   # * MTX_SOURCE_TYPE: source type
   # * MTX_SOURCE_ID: source ID
-  runOnReady:
+  runOnReady: curl http://localhost:5936/live-video-start -F stream=$MTX_PATH
   # Restart the command if it exits.
   runOnReadyRestart: no
   # Command to run when the stream is not available anymore.
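
For reference, a minimal sketch of the receiving side of that runOnReady hook; only the path and the stream form field come from the config above, the handler body is an assumption.

package main

import (
	"log"
	"net/http"
)

func main() {
	// MediaMTX runs: curl http://localhost:5936/live-video-start -F stream=$MTX_PATH
	http.HandleFunc("/live-video-start", func(w http.ResponseWriter, r *http.Request) {
		streamName := r.FormValue("stream") // $MTX_PATH from the curl hook
		if streamName == "" {
			http.Error(w, "missing stream", http.StatusBadRequest)
			return
		}
		// Assumed next step: start the RTMP pull from MediaMTX at
		// rtmp://localhost/<streamName> and begin segmenting.
		log.Printf("stream ready: %s", streamName)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":5936", nil))
}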

@leszko (Contributor) commented on Oct 22, 2024:

> Absolutely! The big thing is making sure you have MediaMTX running. This is pretty simple: it is a single executable plus a config file.

Thanks. I think I have it working 🙏

@j0sh force-pushed the ja/add-mediamtx-listener branch from 9e461a7 to e8f2568 on October 24, 2024
Base automatically changed from ja/add-mediamtx-listener to ai-video on October 29, 2024
@j0sh force-pushed the ja/ai-live-publish branch from 737d637 to 53c7495 on October 31, 2024
codecov bot commented on Oct 31, 2024:

Codecov Report

Attention: Patch coverage is 0% with 973 lines in your changes missing coverage. Please review.

Project coverage is 34.98009%. Comparing base (6845c68) to head (39febf2).
Report is 1 commit behind head on ai-video.

Files with missing lines        Patch %     Lines
trickle/trickle_server.go       0.00000%    319 Missing ⚠️
media/rtmp2segment.go           0.00000%    203 Missing ⚠️
trickle/trickle_publisher.go    0.00000%    126 Missing ⚠️
trickle/trickle_subscriber.go   0.00000%    115 Missing ⚠️
trickle/local_subscriber.go     0.00000%     47 Missing ⚠️
server/ai_process.go            0.00000%     43 Missing ⚠️
trickle/local_publisher.go      0.00000%     43 Missing ⚠️
media/segment_reader.go         0.00000%     22 Missing ⚠️
server/ai_live_video.go         0.00000%     19 Missing ⚠️
server/ai_mediaserver.go        0.00000%     19 Missing ⚠️
... and 2 more

@@                 Coverage Diff                 @@
##            ai-video       #3211         +/-   ##
===================================================
- Coverage   35.93147%   34.98009%   -0.95138%     
===================================================
  Files            126         135          +9     
  Lines          34961       35909        +948     
===================================================
- Hits           12562       12561          -1     
- Misses         21693       22642        +949     
  Partials         706         706                 
Files with missing lines        Coverage Δ
server/rpc.go                   67.86787% <ø> (ø)
media/select_linux.go           0.00000% <0.00000%> (ø)
server/ai_http.go               12.74900% <0.00000%> (-0.05100%) ⬇️
server/ai_live_video.go         0.00000% <0.00000%> (ø)
server/ai_mediaserver.go        11.18644% <0.00000%> (-0.43328%) ⬇️
media/segment_reader.go         0.00000% <0.00000%> (ø)
server/ai_process.go            0.61350% <0.00000%> (-0.02171%) ⬇️
trickle/local_publisher.go      0.00000% <0.00000%> (ø)
trickle/local_subscriber.go     0.00000% <0.00000%> (ø)
trickle/trickle_subscriber.go   0.00000% <0.00000%> (ø)
... and 3 more

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@j0sh force-pushed the ja/ai-live-publish branch 3 times, most recently from a494055 to 4db8c94 on November 4, 2024
@j0sh marked this pull request as ready for review on November 4, 2024
@j0sh requested a review from rickstaa as a code owner on November 4, 2024
@j0sh (Collaborator, Author) commented on Nov 4, 2024:

Marking as ready for review so we can get this in and start building other things on top of it. It is also causing merge conflicts with new pipelines, so best to get those out of the way.

@@ -49,6 +50,8 @@ func startAIServer(lp lphttp) error {

openapi3filter.RegisterBodyDecoder("image/png", openapi3filter.FileBodyDecoder)

trickle.ConfigureServerWithMux(lp.transRPC)
@leszko (Contributor):

Do we maybe want to have a feature flag for this?

For context, we'll most probably merge ai-video into master tomorrow. Then, this change may go to our transcoding broadcaster.

@j0sh (Collaborator, Author):

This would only be triggered on orchestrators with AI enabled, and in order to do anything, they would also need a live pipeline loaded; otherwise, selection should fail on the gateway side.

We can still add a (temporary?) flag if there is something else it would protect against.

@leszko (Contributor):

Yeah, maybe you're right. We can go with this part without a feature flag.

var FirstByteTimeout = errors.New("pending read timeout")

func ConfigureServerWithMux(mux *http.ServeMux) {
/* TODO we probably want to configure the below
@leszko (Contributor):

Why not configure it right in this PR?

@j0sh (Collaborator, Author):

Just hadn't gotten around to it yet. Updated the PR with some of this.

mux.HandleFunc("DELETE "+BaseServerPath+"{streamName}", streamManager.handleDelete)
}

func (sm *StreamManager) getStream(streamName string) (*Stream, bool) {
@leszko (Contributor):

Nit: don't you want to reorder the functions to follow the stepdown rule? I guess it makes the code simpler to read.

@j0sh (Collaborator, Author):

Old habits from C, where you have to define (or at least declare) functions before using them. I am somewhat used to looking up for a given definition rather than down, but I suppose it does not really matter in golang. If the convention is stepdown style, we can do that, but there will always be some subjectivity about what is "important" enough to come first (declaration-first at least gives something of a rule to follow).

stream, exists := sm.streams[streamName]
if !exists {
stream = &Stream{
segments: make([]*Segment, 5),
@leszko (Contributor):

Suggested change:

-segments: make([]*Segment, 5),
+segments: make([]*Segment, maxSegmentsPerStream),

Shouldn't you use maxSegmentsPerStream instead of 5 here?

@j0sh (Collaborator, Author):

Yep, fixed in cccdde8

return stream
}

func (sm *StreamManager) clearAllStreams() {
@leszko (Contributor):

Where is it (or will it be) used?

@j0sh (Collaborator, Author):

It was used for shutdown originally, but I think it is not used at the moment.

return
}

// TODO properly clear sessions once we have a good solution
@leszko (Contributor):

What is the proper solution for clearing the stream?

@j0sh (Collaborator, Author):

Clearing the stream is not the problem; session reuse was (eg, trying to start a new session with the same name). We can work around the issue for now by loosening up some of the constraints around sequence numbering. In practice I think this will be less of an issue as long as the O hands out a fresh ID for each session, which it does here.

s.segments = make([]*Segment, maxSegmentsPerStream)
}

func (sm *StreamManager) handleDelete(w http.ResponseWriter, r *http.Request) {
@leszko (Contributor):

IIUC the Gateway will send a request to delete the stream. All good, but what happens if the Gateway never sends that request? Shouldn't we have some automatic delete/cleanup timeout?

@j0sh (Collaborator, Author):

Yeah, we will need to sweep the server periodically, but that can come later.
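
A sketch of what such a sweep might look like, assuming streams record a last-write timestamp; lastWrite, streamTTL, and the mutex are hypothetical names:

// Hypothetical sweep: once a minute, drop streams idle longer than streamTTL.
func (sm *StreamManager) startSweeper(streamTTL time.Duration) {
	go func() {
		for range time.Tick(time.Minute) {
			sm.mutex.Lock()
			for name, stream := range sm.streams {
				if time.Since(stream.lastWrite) > streamTTL {
					delete(sm.streams, name)
				}
			}
			sm.mutex.Unlock()
		}
	}()
}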

// After reaching the last read byte, start buffering up the bytes of Segment Y
// as they come in. Process requests normally.

// Handle post requests for a given index
@leszko (Contributor):

I'm trying to understand the logic of this function. Could you maybe describe it in the docs above?

My understanding is that it's an HTTP POST request which is kept open forever, with the body streamed the whole time, and that is why you have this timeoutReader. Is that correct?

@j0sh (Collaborator, Author):

Thought I answered this earlier, but I added some explanation in the trickle README that should hopefully clear things up. (Also removed the long comment right above this.)

To sum up, POST requests are not meant to be kept open forever. For video, they are a GOP / segment in length.

Clients will pre-connect the next POST in the sequence to minimize set-up time, and we use the timeoutReader to send down periodic 100 Continue keepalives on that preconnect while waiting for content to come in. As soon as content comes in (eg, the first byte), we stop sending keepalives and proceed as normal. Added some comments to clarify that.
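
To make the keepalive mechanics concrete, here is a minimal sketch consistent with that description; FirstByteTimeout is the sentinel declared earlier in trickle_server.go, everything else is illustrative rather than the actual implementation.

// Illustrative only: reads time out until the first byte arrives, so the
// caller can send 100 Continue keepalives on a preconnected POST; once the
// first byte is read, reads pass straight through. Assumes "io" and "time"
// are imported.
type timeoutReader struct {
	r             io.Reader
	timeout       time.Duration
	firstByteRead bool
	pending       chan firstByte // in-flight single-byte read, if any
}

type firstByte struct {
	b   byte
	n   int
	err error
}

func (tr *timeoutReader) Read(p []byte) (int, error) {
	if tr.firstByteRead {
		return tr.r.Read(p) // content is flowing; no keepalive timeout
	}
	if tr.pending == nil {
		tr.pending = make(chan firstByte, 1)
		go func() {
			var b [1]byte
			n, err := tr.r.Read(b[:])
			tr.pending <- firstByte{b[0], n, err}
		}()
	}
	select {
	case fb := <-tr.pending:
		tr.firstByteRead = true
		if fb.n > 0 && len(p) > 0 {
			p[0] = fb.b
			return 1, fb.err
		}
		return 0, fb.err
	case <-time.After(tr.timeout):
		// Caller responds with a 100 Continue keepalive, then retries.
		return 0, FirstByteTimeout
	}
}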

@leszko (Contributor) left a comment:

@j0sh I added some comments, but in general, it's hard for me to understand the details of what happens in the Trickle/Transport protocol. I think I understand how it works between G and O, but I don't understand the details of the protocol itself.

Some things that could help:

  1. Add a README or comments in trickle_server.go on how it works from an overall perspective (my understanding is that the Publisher sends a forever-lasting HTTP POST req and the subscriber sends a forever-lasting GET req, but I'm not sure about it 🙃 )
  2. Add better comments to the parts that are not obvious, some examples:
    • Why do we have firstByteRead, and why is there a difference between failing to read the 1st byte and the 3rd byte?
    • Why do we have timeoutReader?
    • What is idx in the stream? If it's the index of a segment in the stream and it's set as a POST param, then I guess my understanding that we have one long-lasting HTTP POST connection is wrong
    • ...
  3. Add unit tests

Another option, if we don't have time to make it right ☝️, is to just merge it. I just don't feel comfortable if we go live and you're the only person who understands it. But we could merge it and improve on it later if it helps with the Realtime Video work distribution and unblocks some other work. Then, I'm OK with merging it behind some feature flag.

@j0sh (Collaborator, Author) commented on Nov 5, 2024:

@leszko Thanks for the review!

Added a README with some details, updated the code to address some of the TODOs and PR feedback, plus bug fixes
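
As a companion to that README, a stdlib-only sketch of the subscriber side as described in this thread (one GET per segment index); the path layout and status handling are assumptions:

// Sketch: pull segments from a trickle channel, one GET per sequence number.
// Each response body streams in as the publisher POSTs it.
func subscribe(baseURL string, handle func(seq int, body io.Reader) error) error {
	for seq := 0; ; seq++ {
		resp, err := http.Get(fmt.Sprintf("%s/%d", baseURL, seq)) // assumed path layout
		if err != nil {
			return err
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			return fmt.Errorf("segment %d: %s", seq, resp.Status)
		}
		err = handle(seq, resp.Body)
		resp.Body.Close()
		if err != nil {
			return err
		}
	}
}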

@j0sh force-pushed the ja/ai-live-publish branch 2 times, most recently from 42d056a to 67576cc on November 6, 2024
seq int
}

func NewLocalSubscriber(sm *Server, channelName string) *TrickleLocalSubscriber {
@leszko (Contributor):

Why do we need it if it's not used?

@j0sh (Collaborator, Author):

It is used in #3232

@@ -0,0 +1,57 @@
# Trickle Protocol
@leszko (Contributor):

Thanks for the README. Super helpful!

Could you also add info about what the changefeed is?

@j0sh (Collaborator, Author):

Added a brief description in 5760116, but we don't use it here. It's still a little iffy, but I have been testing it out with a pub-sub example that has been running on the demo machine.

if idx == -1 {
idx = s.latestWrite
// TODO figure out how to better handle restarts while maintaining ordering
/* } else if idx > s.latestWrite { */
@leszko (Contributor):

Leftover?

@j0sh (Collaborator, Author):

Removed in 5760116

@leszko (Contributor) left a comment:

@j0sh Thanks for adding the README.

I think it's OK to merge it. We'll need to work on better documentation and unit tests, but we can do that in a separate PR. I can say I understand 50% of how trickle works internally, but I guess it's good enough for now 🙃

On second thought, I think we don't need any feature flag.

@j0sh force-pushed the ja/ai-live-publish branch from a070df4 to 45dd661 on November 7, 2024
@j0sh force-pushed the ja/ai-live-publish branch from 45dd661 to 5760116 on November 7, 2024
@j0sh merged commit 19cee56 into ai-video on Nov 7, 2024 (15 checks passed)
@j0sh deleted the ja/ai-live-publish branch on November 7, 2024