Sharded directory fetching is unusably slow #4908
This seems like an easy enough fix, so I will look into it. If someone beats me to it, please remove my assignment. |
@kevina have fun 😄. Unfortunately, it's actually a bit frustrating. Parallelizing fetching all the children of a single node is simple; however, many of the nodes deep in sharded directory trees only have a few children, so the speedup is a bit depressing. At the end of the day, it becomes a memory/parallelism + throughput/latency tradeoff. |
But parallelization of child node fetching should address the scenario with 1e6 children at a single level, right?
|
@ajbouh due to sharding, we have at most 256 children at each level. Fetching 256 at a time is great; however, many of the deeper (partially filled) nodes in the tree end up with 5-10 children. |
But right now we fetch only one at a time, so isn't that a 5-256x improvement? Initial fetch of ImageNet took hours...
|
@ajbouh in practice, more like 4x. Definitely an improvement but we can do much better. |
Yeah we need to be reading the blocks as they come in from the network, and then fetching any other needed blocks in parallel. This should be possible, but I have not looked into the code yet. However, we would need to limit the number of requests fetched in parallel somehow. @Stebalien do you have some good test hashes? |
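For illustration, here is a minimal sketch of bounded-parallel fetching of shard children, assuming hypothetical fetchNode/Links helpers rather than the real go-ipfs DAG API; the concurrency limit is an arbitrary placeholder. The semaphore slot is released before recursing so that only network requests are bounded and deep, sparsely filled shards still keep the pipe full:

```go
package shardfetch

import (
	"context"
	"sync"
)

// Link and Node are simplified stand-ins for the real go-ipfs DAG types.
type Link struct{ Cid string }
type Node interface{ Links() []Link }

// fetchNode is a hypothetical helper that retrieves one node by CID.
var fetchNode func(ctx context.Context, cid string) (Node, error)

// walkShard visits every node under root while keeping at most
// maxInFlight block requests outstanding at any one time.
func walkShard(ctx context.Context, root Node, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight) // bounds concurrent fetches
	var wg sync.WaitGroup

	var visit func(n Node)
	visit = func(n Node) {
		for _, l := range n.Links() {
			l := l
			wg.Add(1)
			go func() {
				defer wg.Done()
				sem <- struct{}{} // take a slot for the network request only
				child, err := fetchNode(ctx, l.Cid)
				<-sem // release before recursing so nested shards cannot deadlock
				if err != nil {
					return // a real implementation would collect and report errors
				}
				visit(child) // recurse immediately so sparse shards keep requests flowing
			}()
		}
	}
	visit(root)
	wg.Wait()
}
```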
@kevina I just created a large directory with tiny files locally and tested with iptb. I find that's generally the best way to make a reproducible test. |
@Stebalien where did the 4x number come from? |
@kevina most of the shards had few directories and we'd wait until we'd downloaded all of them before moving on. This gave us a sawtooth pattern where we were often only downloading a few stragglers. |
@kevina excellent! Have you tried to ls the ImageNet CID with this change? |
Yes, that directory is _huge_ and even with batching there are still a huge number of network requests, so I have not let it complete. I am doing that now and will report back, but I encourage you to try it out also.
|
Is someone tracking the optimization work needed to get this |
@ajbouh it just finished; it completed in around 30 minutes, not great but better. There are around 1281167 entries consisting of around 112220 blocks. That's a lot of blocks to retrieve, so I am not sure how much better we can do. The PR retrieves the blocks in batch sizes of up to 320 (see the code for the reason for this number), and it seemed to be taxing the resources on my machine, so I am not sure how much larger I want to make this number. |
@kevina I'm not sure we're talking about the same CID here. I'm talking about one with ~10^6 entries? Is it easy to determine how many bytes are required to represent the sharded directory? It seems we should expect it to go as fast as an |
I am testing:
My initial numbers were wrong, so I updated the count. It is not the size that is important but the number of blocks that need to be retrieved. With HAMT sharding of a directory object, the block size is likely to be smaller than with normal sharding of a file, which is broken up into equal-size segments (I forget the exact number, but I think it's around 43k). |
I see, so perhaps sharded directories just aren't designed for this use case and we should be thinking about using something else? We need to be able to quickly enumerate all entries so we can decide which to fetch next. Perhaps a single manifest file with a known name is the easiest way to accomplish this? |
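To make the manifest idea concrete, here is a rough sketch under made-up conventions (the `.manifest` name and the newline-delimited format are not anything IPFS defines): enumerating the directory becomes one sequential read of a single file instead of a HAMT traversal.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// readManifest streams entry names out of a newline-delimited manifest file.
// In practice the reader would come from an ipfs cat-style stream of a
// well-known path such as <dir>/.manifest (a hypothetical convention).
func readManifest(path string, each func(name string) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		if err := each(s.Text()); err != nil {
			return err
		}
	}
	return s.Err()
}

func main() {
	// Print every entry name as it is read; a training pipeline could
	// batch, sample, or shuffle here instead.
	_ = readManifest("dataset/.manifest", func(name string) error {
		fmt.Println(name)
		return nil
	})
}
```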
@ajbouh perhaps; however, the number of blocks required is also really high. @whyrusleeping @Stebalien thoughts? |
Investigating... |
@kevina's code looks reasonable. Probably want to combine that with bitswap sessions and a higher bitswap activeWants count. Once concurrency of fetching is no longer the issue, there are other optimizations to look at, namely requester-side batching of blocks that we receive. Right now every block we get through bitswap gets put to the datastore individually; batching those together could add some significant improvements. In any case, @ajbouh do you need the entire list of names for your operation? Listing 10 million directory entries is going to be slow (on the order of tens of seconds) unless we work some fancy caching magic. Maybe there's a better way we can query this information? |
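A minimal sketch of the requester-side batching idea, using simplified stand-in types rather than the real go-datastore/bitswap interfaces; the batch size is a placeholder:

```go
package blockbatch

// Block, Batch and BatchingStore are simplified stand-ins for the real
// go-ipfs interfaces; they exist only to illustrate the batching pattern.
type Block struct {
	Cid  string
	Data []byte
}

type Batch interface {
	Put(key string, value []byte) error
	Commit() error
}

type BatchingStore interface {
	Batch() (Batch, error)
}

// putMany writes a slice of received blocks in a single batched commit,
// amortizing per-write overhead instead of paying it once per block.
func putMany(ds BatchingStore, blocks []Block) error {
	b, err := ds.Batch()
	if err != nil {
		return err
	}
	for _, blk := range blocks {
		if err := b.Put(blk.Cid, blk.Data); err != nil {
			return err
		}
	}
	return b.Commit()
}

// drainAndBatch collects blocks from an incoming channel (e.g. blocks
// arriving from bitswap) and flushes them whenever the buffer reaches
// batchSize or the channel closes.
func drainAndBatch(ds BatchingStore, in <-chan Block, batchSize int) error {
	buf := make([]Block, 0, batchSize)
	for blk := range in {
		buf = append(buf, blk)
		if len(buf) >= batchSize {
			if err := putMany(ds, buf); err != nil {
				return err
			}
			buf = buf[:0]
		}
	}
	if len(buf) > 0 {
		return putMany(ds, buf)
	}
	return nil
}
```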
Yes, I need to stream through all entries in a directory, batching, sampling and shuffling them in a consistent and user-specifiable manner. @whyrusleeping what are you thinking the primary bottleneck is? If we're talking about 1M entries that each need about 100 bytes, that's only a 100MB total download. This seems like something that we should be able to do in 10 seconds or less on a fast connection. If it's already on the local disk it should be even faster. What am I missing here? |
@kevina were you using iptb on a separate network when you tested that (i.e., would bitswap sessions have affected it)? |
@Stebalien I was not even using iptb, just testing it from my computer. |
@kevina could you run a quick test with iptb? That'll tell us how much bitswap sessions would help and how much things like network latency/bandwidth affect it. |
@Stebalien can you be a little more specific on what different combinations you want to test? |
@Stebalien okay, I tested commit 3f79eab in PR #4979 as I think you wanted. I started an iptb testbed with just 2 nodes, connected them, and then ran
It took 14m40s. The second ipfs node in the cluster already has the hash and all the parent shards. |
@kevina how long does it take to run that when you already have all the blocks? Also, are these nodes using badger or flatfs? |
@whyrusleeping Good point about making things easy to measure. Being able to ls the directory (with 10^6 files in it) over LAN in under 10 seconds is a great starting point. As is < 1 second to see the first entry in the directory. What else can I provide to help? |
@ajbouh Any other nicely measurable perf requirements you can think of are definitely appreciated, but I think this is enough to go on. Things to note, it may be easiest to make a separate |
Yeah, I think I'm using the streaming API under the hood. For context: this is part of a larger goal to train a state-of-the-art machine learning model from your laptop with Google's TPUs. I would much rather use IPFS for this, as using cloud storage makes working with open source folks very difficult. It also makes working from your laptop much harder. TPUs are approximately $1 for 10 minutes of use. Getting the overhead of data loading/fetching down to just a few seconds is absolutely critical. For clarity, cloud storage has essentially zero up-front overhead for already-hosted datasets. Looking forward to getting this figured out! |
@ajbouh that's really cool! Let's get this train moving then :)
Unless you're running custom ipfs code, I don't think you're getting what you think you are. In ls here: https://github.com/ipfs/go-ipfs/blob/master/core/commands/ls.go#L170 it collects all the results up and then outputs them all at once. I threw together a quick PoC of a fully streaming ls command here: https://github.com/ipfs/go-ipfs/compare/hack/fastls?expand=1 We should think about how to integrate that properly. |
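For anyone following along, a rough illustration of the collect-versus-stream difference (simplified stand-in types, not the actual go-ipfs ls code): rather than returning a complete slice after the whole HAMT has been walked, entries are sent on a channel as soon as the shard block containing them arrives, so the caller sees the first names long before the traversal finishes.

```go
package streamls

import "context"

// Entry and Node are simplified stand-ins for the real unixfs types.
type Entry struct{ Name string }

type Node interface {
	Entries() []Entry   // leaf entries stored directly in this shard block
	Children() []string // CIDs of child shard blocks
}

// fetchNode is a hypothetical helper that retrieves one shard block by CID.
var fetchNode func(ctx context.Context, cid string) (Node, error)

// streamLs walks the shard tree and emits entries as they are discovered
// instead of buffering the full listing in memory first.
func streamLs(ctx context.Context, root Node, out chan<- Entry) error {
	defer close(out)
	stack := []Node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, e := range n.Entries() {
			select {
			case out <- e: // emit immediately; the consumer can start work now
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		for _, cid := range n.Children() {
			child, err := fetchNode(ctx, cid)
			if err != nil {
				return err
			}
			stack = append(stack, child)
		}
	}
	return nil
}
```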
Correction, not using the streaming API just yet, but we are using custom code. That said, TensorFlow's own directory listing logic is not streaming, so some creativity will be required on my part for some operations: https://github.com/tensorflow/tensorflow/blob/e7f158858479400f17a1b6351e9827e3aa83e7ff/tensorflow/core/platform/file_system.h#L116 Agreed on getting the train moving! Based on other threads, it seems like badger isn't a short term option. Who has the baton for this right now? |
@ajbouh @hannahhoward has picked this back up. See the linked issue for more details. |
@ajbouh I am not sure if we've cut a new release since ipfs/go-unixfs#19 was merged, but I'd be curious to hear how this affects your performance. |
Thanks for the ping, @hannahhoward. I am also curious about the performance but have not tried a recent build myself. Have you tried the operations I referenced in tesserai/iptf#2? They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 |
@hannahhoward we haven't. |
I believe this has been addressed in 0.4.18. Please reopen if needed. |
Hi @eingenito! Have you tried the operations I referenced in tesserai/iptf#2? They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 |
The first part was addressed in 0.4.18 but the second part is the sessions improvements that'll land in 0.4.19. Hopefully that'll be sufficient. |
Will the session stuff address the duplicated blocks issue? |
Significantly but it's still far from perfect. |
@b5 ^ |
This is probably as fast as it's going to get for the foreseeable future. |
Hi @Stebalien, do you mind quantifying how fast things are now? |
Version information:
0.4.15-dev
Type:
Bug/performance issue
Description:
More context is available over in tesserai/iptf#2
I'm trying to get reasonable performance for just listing the names of entries in a sharded directory that's not yet cached locally. This operation takes hours right now. With @Stebalien's help I've been able to determine that it's only requesting one hash at a time (as indicated by `ipfs bitswap wantlist`). Seems like IPFS should be requesting more than one block at a time in this scenario. Creating a separate issue to track this specific performance issue separately from others.
child of #5487