slow parallel processing #37

tobowers · 2018-07-17T07:38:31Z

Hi, I'm still researching this so it's a work-in-progress but I have narrowed it down enough that I thought it worthy of discussion. I've been using #30 in a pre-production project and I noticed some bottlenecks in parallel code. I started to research and I narrowed it down to WrapObject creating blocking-like behavior.

In a real-world project with hundreds of parallel tasks, I was noticing 200-500ms times in wrapping simple objects.

I narrowed it down to https://github.com/ipfs/go-ipld-cbor/blob/refmt/node.go#L184 but haven't yet found the root cause.

The rest of that function does not cause long blocking behavior.

Here's a failing test:

type testEncodeStruct struct {
	Id string
	Number int
}

func TestParallelWrap(t *testing.T) {
	cbornode.RegisterCborType(testEncodeStruct{})

	times := 10000

	obj := &testEncodeStruct{Id: "test", Number: 100}

	responses := make(chan int, times)

	for i := 0; i < times; i++ {
		go func() {
			now := time.Now()
			node,err := cbornode.WrapObject(obj, multihash.SHA2_256, -1)
			end := int(time.Now().Sub(now))
			assert.Nil(t, err)
			assert.Len(t, node.RawData(), 18)
			responses <- end
		}()
	}

	respSlice := make([]int, times)

	for i := 0; i < times; i++ {
		respSlice[i] = <-responses
	}
	t.Logf("max was: %v", max(respSlice))

	assert.True(t, max(respSlice)< int(100 * time.Millisecond))

}

whyrusleeping · 2018-07-17T20:41:23Z

@tobowers Thanks for reporting! Getting usage on that PR definitely helps us move the ball forward

Stebalien · 2018-07-17T21:40:33Z

@tobowers could you try my PR #38?

tobowers · 2018-07-18T07:51:28Z

So... this could totally be my own shitty parallel code causing context switching... so I really hope I'm not leading us down a rabbit hole :). However, anyway: using this code https://github.com/QuorumControl/encodeexplore I do not see an overall speed improvement using the new rfmt-pool branch: ( https://github.com/QuorumControl/encodeexplore/compare/test/refmt-pool )

I'm still seeing spikes in decode/encode... these lines: https://github.com/QuorumControl/encodeexplore/blob/master/gossip/gossip.go#L234-L239

producing:

ERROR[07-18|09:47:58.545] decode                                   took=1.051360211s
ERROR[07-18|09:47:58.622] decode                                   took=1.128933855s
ERROR[07-18|09:47:58.693] decode                                   took=1.191881103s
ERROR[07-18|09:47:58.716] decode                                   took=1.190810543s
ERROR[07-18|09:47:58.717] decode                                   took=1.191089245s
ERROR[07-18|09:47:58.735] decode                                   took=1.19506691s

tobowers · 2018-07-18T07:52:23Z

FYI: These encodes/decodes are getting called 10s of thousands of times.

Stebalien · 2018-07-18T12:43:38Z

So, WrapObject is CPU heavy as it computes a sha256 hash. We could do that lazily but we worried that we'd always want it anyways so we might as well compute it up-front. However, it shouldn't take a second.

I'd try using pprof: https://blog.golang.org/profiling-go-programs. A CPU profile should give you a good idea of what's happening.

tobowers · 2018-07-18T13:00:07Z

When I was looking before it wasn't the sha256, but instead the rfmt decode. I will check again with the profile. Thanks for the link!

Stebalien · 2018-07-18T13:14:23Z

Thanks. The profile will be really helpful.

tobowers · 2018-07-18T13:16:36Z

Interesting:

      flat  flat%   sum%        cum   cum%
     3.84s  6.82%  6.82%      3.84s  6.82%  runtime.memmove
     3.70s  6.57% 13.39%     14.71s 26.13%  runtime.mallocgc
     2.71s  4.81% 18.21%      4.79s  8.51%  runtime.scanobject
     2.65s  4.71% 22.91%      2.94s  5.22%  runtime.heapBitsForObject
     2.29s  4.07% 26.98%      2.29s  4.07%  runtime.usleep
     1.61s  2.86% 29.84%      2.49s  4.42%  runtime.bulkBarrierPreWrite
     1.60s  2.84% 32.68%      1.60s  2.84%  runtime.memclrNoHeapPointers
     1.33s  2.36% 35.04%      1.33s  2.36%  syscall.Syscall
     1.29s  2.29% 37.34%      1.50s  2.66%  runtime.heapBitsSetType
     1.22s  2.17% 39.50%      4.55s  8.08%  runtime.gcWriteBarrier
     1.14s  2.02% 41.53%      1.14s  2.02%  runtime.nextFreeFast (inline)
     1.08s  1.92% 43.45%      6.36s 11.30%  fmt.(*pp).doPrintf
     0.99s  1.76% 45.20%      0.99s  1.76%  github.com/quorumcontrol/encodeexplore/vendor/github.com/minio/sha256-simd.blockAvx2
     0.97s  1.72% 46.93%     22.84s 40.57%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.traverse
     0.97s  1.72% 48.65%      2.54s  4.51%  reflect.Value.Slice
     0.95s  1.69% 50.34%      3.81s  6.77%  runtime.wbBufFlush1
     0.93s  1.65% 51.99%      4.78s  8.49%  fmt.(*pp).printArg
     0.80s  1.42% 53.41%      1.52s  2.70%  fmt.(*fmt).fmt_integer
     0.63s  1.12% 54.53%      0.63s  1.12%  runtime.heapBits.bits (inline)
     0.63s  1.12% 55.65%      0.63s  1.12%  runtime.mach_semaphore_signal
     0.54s  0.96% 56.61%      0.54s  0.96%  runtime.(*mspan).refillAllocCache
     0.52s  0.92% 57.53%      0.52s  0.92%  runtime.acquirem (inline)
     0.51s  0.91% 58.44%      1.11s  1.97%  fmt.(*buffer).WriteString (inline)
     0.50s  0.89% 59.33%      3.11s  5.52%  runtime.convT2Estring
     0.49s  0.87% 60.20%      0.54s  0.96%  runtime.(*gcBits).bitp (inline)
     0.48s  0.85% 61.05%      0.48s  0.85%  runtime.releasem (inline)
     0.47s  0.83% 61.88%      3.75s  6.66%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Marshaller).Step
     0.47s  0.83% 62.72%      1.24s  2.20%  reflect.Value.assignTo
     0.42s  0.75% 63.46%      3.75s  6.66%  reflect.Value.Set
     0.41s  0.73% 64.19%         1s  1.78%  sync.(*Pool).Put
     0.38s  0.67% 64.87%     13.37s 23.75%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Unmarshaller).Step
     0.37s  0.66% 65.52%      2.07s  3.68%  runtime.mapassign_fast64
     0.36s  0.64% 66.16%      0.76s  1.35%  runtime.evacuate_fast64
     0.35s  0.62% 66.79%      0.35s  0.62%  runtime.(*bmap).setoverflow (inline)
     0.35s  0.62% 67.41%      0.75s  1.33%  sync.(*Pool).pin
     0.34s   0.6% 68.01%      2.85s  5.06%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Marshaller).Recurse
     0.34s   0.6% 68.61%      0.80s  1.42%  reflect.Value.Bytes
     0.32s  0.57% 69.18%      0.32s  0.57%  runtime.mach_semaphore_wait
     0.31s  0.55% 69.73%      0.61s  1.08%  runtime.gcmarknewobject
     0.30s  0.53% 70.27%     12.81s 22.75%  fmt.Sprintf

tobowers · 2018-07-18T13:18:43Z

(pprof) top20 -cum
Showing nodes accounting for 6.56s, 11.65% of 56.30s total
Dropped 279 nodes (cum <= 0.28s)
Showing top 20 nodes out of 192
      flat  flat%   sum%        cum   cum%
         0     0%     0%     43.50s 77.26%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.WrapObject
     0.01s 0.018% 0.018%     25.01s 44.42%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).Start.func1.1
         0     0% 0.018%     24.91s 44.25%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).OnMessage
         0     0% 0.018%     23.65s 42.01%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).Start.func1.2
     0.01s 0.018% 0.036%     23.01s 40.87%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).doGossip
     0.04s 0.071%  0.11%     22.88s 40.64%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.newObject
     0.97s  1.72%  1.83%     22.84s 40.57%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.traverse
     0.17s   0.3%  2.13%     20.43s 36.29%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/shared.TokenPump.Run
         0     0%  2.13%     15.32s 27.21%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor/encoding.(*PooledCloner).Clone
         0     0%  2.13%     15.27s 27.12%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt.(*cloner).Clone
         0     0%  2.13%     15.26s 27.10%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt.cloner.Clone
     0.29s  0.52%  2.65%     15.07s 26.77%  runtime.systemstack
     3.70s  6.57%  9.22%     14.71s 26.13%  runtime.mallocgc
     0.38s  0.67%  9.89%     13.37s 23.75%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Unmarshaller).Step
     0.25s  0.44% 10.34%     13.23s 23.50%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineWildcard).Step
     0.10s  0.18% 10.52%     12.89s 22.90%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).Step
     0.30s  0.53% 11.05%     12.81s 22.75%  fmt.Sprintf
     0.06s  0.11% 11.15%     12.77s 22.68%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).(github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.step_AcceptValue)-fm
     0.28s   0.5% 11.65%     12.71s 22.58%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).step_AcceptValue
         0     0% 11.65%     12.44s 22.10%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.compLinks
(pprof)

Stebalien · 2018-07-18T13:23:13Z

Could you post a graph? Try running the svg command.

Stebalien · 2018-07-18T13:23:55Z

It looks like we really should consider traversing lazily.

tobowers · 2018-07-18T13:40:44Z

https://github.com/QuorumControl/encodeexplore/blob/test/refmt-pool/profile001.svg

tobowers · 2018-07-18T13:49:23Z

without the noise from sprintf: https://github.com/QuorumControl/encodeexplore/blob/test/refmt-pool/profile002.svg

Stebalien · 2018-07-18T14:15:32Z

Yeah. Calculating the tree up-front is slowing us down. There's no reason we really need to do that.

whyrusleeping · 2018-07-18T18:54:37Z

@Stebalien the reason we do it up front is to make sure that the other methods that rely on it having been computed don't need to return an error

Stebalien · 2018-07-18T20:59:39Z

I see. The performance issue is concatenating the strings so we could always do that later. However, I think a more complete fix is to make two node interfaces (a lazy one for resolving, computing the cid, deserialising into an object and an eager one for introspecting).

tobowers · 2018-07-19T09:41:36Z

In that 2nd one, the concatenation doesn't seem to take up as much time (I just switched the fmt.Sprintf to a standard string concatenation and a strconv for the integer.)

Stebalien · 2018-07-19T18:36:37Z

I guess calling that "the" performance issue is incorrect. However, it still takes up 10% of the time.

whyrusleeping · 2018-07-19T19:37:12Z

a more complete fix is to make two node interfaces

Yeah, Thats what @warpfork was saying too. It sounds like a good idea to me

warpfork · 2018-07-25T19:37:13Z

This is not related to parallel processing, but since I've got the issue open, and it's about performance...

I've been nudging around some benchmarks of go-ipld-cbor and one thing that's come up as a big timesink on some of my pprof graphs is the handling of CIDs. CIDs use the refmt 'transform' function feature... and that's one of those features which is powerful and expressive but by nature does some things that are not very friendly to the GC. That's something I'm going to look at trying to optimize better.

In general GC can be a big factor of performance, and also distressingly hard to figure out the impacts of because it can quietly make other code seem slower. The graph view and the weights of the arrows heading into the GC nodes are probably the most informative; I don't really know of a great way to get a textual list of that from the other pprof commands (although if anyone has suggetions... ❤️ )

aschmahmann · 2020-06-19T17:41:43Z

Any IPLD folks have thoughts on this @warpfork?

warpfork · 2020-06-19T21:32:20Z

I think we could probably mark this closed. Some performance observations were made here, but I'm not sure what actions would be described next by this issue. (More efforts to reduce allocations and reduce GC pressure, maybe? But how we would actually do that probably needs other discussions than this.)

We've also done an enormous amount of work in the redesigned go-ipld-prime interfaces to be sensitive to GC pressure and design to avoid unnecessary allocations at all levels. (In essence, that library also avoids all the precomputations of tree structure that seems to have been flagged as problematic here(?) -- and we needed to tackle that with a new library and new interfaces, because there was really no incremental way to change something as central as that.) I'd suggest looking to that library for improvements now and in the future.

tobowers mentioned this issue Jul 17, 2018

blocking parallel execution? polydawn/refmt#31

Open

dignifiedquire mentioned this issue Sep 26, 2018

Improve performance of message application filecoin-project/venus#979

Closed

aschmahmann closed this as completed Jun 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow parallel processing #37

slow parallel processing #37

tobowers commented Jul 17, 2018 •

edited

Loading

whyrusleeping commented Jul 17, 2018

Stebalien commented Jul 17, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

whyrusleeping commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 19, 2018 •

edited

Loading

Stebalien commented Jul 19, 2018

whyrusleeping commented Jul 19, 2018

warpfork commented Jul 25, 2018

aschmahmann commented Jun 19, 2020

warpfork commented Jun 19, 2020

slow parallel processing #37

slow parallel processing #37

Comments

tobowers commented Jul 17, 2018 • edited Loading

whyrusleeping commented Jul 17, 2018

Stebalien commented Jul 17, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 18, 2018

tobowers commented Jul 18, 2018

Stebalien commented Jul 18, 2018

whyrusleeping commented Jul 18, 2018

Stebalien commented Jul 18, 2018

tobowers commented Jul 19, 2018 • edited Loading

Stebalien commented Jul 19, 2018

whyrusleeping commented Jul 19, 2018

warpfork commented Jul 25, 2018

aschmahmann commented Jun 19, 2020

warpfork commented Jun 19, 2020

tobowers commented Jul 17, 2018 •

edited

Loading

tobowers commented Jul 19, 2018 •

edited

Loading