Sharded directory fetching is unusably slow #4908
This seems like an easy enough fix, so I will look into it. If someone beats me to it, please remove my assignment. |
@kevina have fun 😄. Unfortunately, it's actually a bit frustrating. Parallelizing fetching all the children of a single node is simple; however, many of the nodes deep in sharded directory trees only have a few children, so the speedup is a bit depressing. At the end of the day, it becomes a memory/parallelism + throughput/latency tradeoff. |
But parallelization of child node fetching should address the scenario with 1e6 children at a single level, right?
|
@ajbouh due to sharding, we have at most 256 children at each level. Fetching 256 at a time is great; however, many of the deeper (partially filled) nodes in the tree end up with 5-10 children. |
But right now we fetch only one at a time, so isn't that a 5-256x improvement? Initial fetch of ImageNet took hours...
|
@ajbouh in practice, more like 4x. Definitely an improvement but we can do much better. |
Yeah we need to be reading the blocks as they come in from the network, and then fetching any other needed blocks in parallel. This should be possible, but I have not looked into the code yet. However, we would need to limit the number of requests fetched in parallel somehow. @Stebalien do you have some good test hashes? |
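For illustration, here is a minimal sketch of bounded-parallel fetching of shard children, assuming hypothetical fetchNode/Links helpers rather than the real go-ipfs DAG API; the concurrency limit is an arbitrary placeholder. The semaphore slot is released before recursing so that only network requests are bounded and deep, sparsely filled shards still keep the pipe full:

```go
package shardfetch

import (
	"context"
	"sync"
)

// Link and Node are simplified stand-ins for the real go-ipfs DAG types.
type Link struct{ Cid string }
type Node interface{ Links() []Link }

// fetchNode is a hypothetical helper that retrieves one node by CID.
var fetchNode func(ctx context.Context, cid string) (Node, error)

// walkShard visits every node under root while keeping at most
// maxInFlight block requests outstanding at any one time.
func walkShard(ctx context.Context, root Node, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight) // bounds concurrent fetches
	var wg sync.WaitGroup

	var visit func(n Node)
	visit = func(n Node) {
		for _, l := range n.Links() {
			l := l
			wg.Add(1)
			go func() {
				defer wg.Done()
				sem <- struct{}{} // take a slot for the network request only
				child, err := fetchNode(ctx, l.Cid)
				<-sem // release before recursing so nested shards cannot deadlock
				if err != nil {
					return // a real implementation would collect and report errors
				}
				visit(child) // recurse immediately so sparse shards keep requests flowing
			}()
		}
	}
	visit(root)
	wg.Wait()
}
```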
@kevina I just created a large directory with tiny files locally and tested with iptb. I find that's generally the best way to make a reproducible test. |
@Stebalien where did the 4x number come from? |
@kevina most of the shards had few directories and we'd wait until we'd downloaded all of them before moving on. This gave us a sawtooth pattern where we were often only downloading a few stragglers. |
@kevina excellent! Have you tried to ls the ImageNet CID with this change? |
Yes, that directory is _huge_ and even with batching there are still a huge number of network requests, so I have not let it complete. I am doing that now and will report back, but I encourage you to try it out also.
|
Is someone tracking the optimization work needed to get this |
@ajbouh it just finished; it completed in around 30 minutes, not great but better. There are around 1281167 entries consisting of around 112220 blocks. That's a lot of blocks to retrieve, so I am not sure how much better we can do. The PR retrieves the blocks in batch sizes of up to 320 (see the code for the reason for this number), and it seemed to be taxing the resources on my machine, so I am not sure how much larger I want to make this number. |
@kevina I'm not sure we're talking about the same CID here. I'm talking about one with ~10^6 entries? Is it easy to determine how many bytes are required to represent the sharded directory? It seems we should expect it to go as fast as an |
I am testing:
My initial numbers were wrong, so I updated the count. It is not the size that is important but the number of blocks that need to be retrieved. With HAMT sharding of a directory object, the block size is likely to be smaller than with normal sharding of a file, which is broken up into equal-size segments (I forget the exact number, but I think it's around 43k). |
I see, so perhaps sharded directories just aren't designed for this use case and we should be thinking about using something else? We need to be able to quickly enumerate all entries so we can decide which to fetch next. Perhaps a single manifest file with a known name is the easiest way to accomplish this? |
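To make the manifest idea concrete, here is a rough sketch under made-up conventions (the `.manifest` name and the newline-delimited format are not anything IPFS defines): enumerating the directory becomes one sequential read of a single file instead of a HAMT traversal.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// readManifest streams entry names out of a newline-delimited manifest file.
// In practice the reader would come from an ipfs cat-style stream of a
// well-known path such as <dir>/.manifest (a hypothetical convention).
func readManifest(path string, each func(name string) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		if err := each(s.Text()); err != nil {
			return err
		}
	}
	return s.Err()
}

func main() {
	// Print every entry name as it is read; a training pipeline could
	// batch, sample, or shuffle here instead.
	_ = readManifest("dataset/.manifest", func(name string) error {
		fmt.Println(name)
		return nil
	})
}
```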
@ajbouh perhaps; however, the number of blocks required is also really high. @whyrusleeping @Stebalien thoughts? |
Investigating... |
@kevina's code looks reasonable. Probably want to combine that with bitswap sessions and a higher bitswap activeWants count. Once concurrency of fetching is no longer the issue, there are other optimizations to look at, namely requester-side batching of blocks that we receive. Right now every block we get through bitswap gets put to the datastore individually; batching those together could add some significant improvements. In any case, @ajbouh do you need the entire list of names for your operation? Listing 10 million directory entries is going to be slow (on the order of tens of seconds) unless we work some fancy caching magic. Maybe there's a better way we can query this information? |
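A minimal sketch of the requester-side batching idea, using simplified stand-in types rather than the real go-datastore/bitswap interfaces; the batch size is a placeholder:

```go
package blockbatch

// Block, Batch and BatchingStore are simplified stand-ins for the real
// go-ipfs interfaces; they exist only to illustrate the batching pattern.
type Block struct {
	Cid  string
	Data []byte
}

type Batch interface {
	Put(key string, value []byte) error
	Commit() error
}

type BatchingStore interface {
	Batch() (Batch, error)
}

// putMany writes a slice of received blocks in a single batched commit,
// amortizing per-write overhead instead of paying it once per block.
func putMany(ds BatchingStore, blocks []Block) error {
	b, err := ds.Batch()
	if err != nil {
		return err
	}
	for _, blk := range blocks {
		if err := b.Put(blk.Cid, blk.Data); err != nil {
			return err
		}
	}
	return b.Commit()
}

// drainAndBatch collects blocks from an incoming channel (e.g. blocks
// arriving from bitswap) and flushes them whenever the buffer reaches
// batchSize or the channel closes.
func drainAndBatch(ds BatchingStore, in <-chan Block, batchSize int) error {
	buf := make([]Block, 0, batchSize)
	for blk := range in {
		buf = append(buf, blk)
		if len(buf) >= batchSize {
			if err := putMany(ds, buf); err != nil {
				return err
			}
			buf = buf[:0]
		}
	}
	if len(buf) > 0 {
		return putMany(ds, buf)
	}
	return nil
}
```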
Yes, I need to stream through all entries in a directory, batching, sampling and shuffling them in a consistent and user-specifiable manner. @whyrusleeping what are you thinking the primary bottleneck is? If we're talking about 1M entries that each need about 100 bytes, that's only a 100MB total download. This seems like something that we should be able to do in 10 seconds or less on a fast connection. If it's already on the local disk it should be even faster. What am I missing here? |
@kevina were you using iptb on a separate network when you tested that (i.e., would bitswap sessions have affected it)? |
@Stebalien I was not even using iptb, just testing it from my computer. |
@kevina could you run a quick test with iptb? That'll tell us how much bitswap sessions would help and how much things like network latency/bandwidth affect it. |
@Stebalien can you be a little more specific on what different combinations you want to test? |
@Stebalien okay, I tested commit 3f79eab in PR #4979 as I think you wanted. I started an iptb testbed with just 2 nodes, connected them, and then ran
It took 14m40s. The second ipfs node in the cluster already has the hash and all the parent shards. |
@kevina how long does it take to run that when you already have all the blocks? Also, are these nodes using badger or flatfs? |
@whyrusleeping Good point about making things easy to measure. Being able to ls the directory (with 10^6 files in it) over LAN in under 10 seconds is a great starting point. As is < 1 second to see the first entry in the directory. What else can I provide to help? |
@ajbouh Any other nicely measurable perf requirements you can think of are definitely appreciated, but I think this is enough to go on. Things to note, it may be easiest to make a separate |
Yeah, I think I'm using the streaming API under the hood. For context: this is part of a larger goal to train a state-of-the-art machine learning model from your laptop with Google's TPUs. I would much rather use IPFS for this, as using cloud storage makes working with open source folks very difficult. It also makes working from your laptop much harder. TPUs are approximately $1 for 10 minutes of use. Getting the overhead of data loading/fetching down to just a few seconds is absolutely critical. For clarity, cloud storage has essentially zero up-front overhead for already-hosted datasets. Looking forward to getting this figured out! |
@ajbouh that's really cool! Let's get this train moving then :)
Unless you're running custom ipfs code, I don't think you're getting what you think you are. In ls here: https://github.com/ipfs/go-ipfs/blob/master/core/commands/ls.go#L170 it collects all the results up and then outputs them all at once. I threw together a quick PoC of a fully streaming ls command here: https://github.com/ipfs/go-ipfs/compare/hack/fastls?expand=1 We should think about how to integrate that properly. |
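For anyone following along, a rough illustration of the collect-versus-stream difference (simplified stand-in types, not the actual go-ipfs ls code): rather than returning a complete slice after the whole HAMT has been walked, entries are sent on a channel as soon as the shard block containing them arrives, so the caller sees the first names long before the traversal finishes.

```go
package streamls

import "context"

// Entry and Node are simplified stand-ins for the real unixfs types.
type Entry struct{ Name string }

type Node interface {
	Entries() []Entry   // leaf entries stored directly in this shard block
	Children() []string // CIDs of child shard blocks
}

// fetchNode is a hypothetical helper that retrieves one shard block by CID.
var fetchNode func(ctx context.Context, cid string) (Node, error)

// streamLs walks the shard tree and emits entries as they are discovered
// instead of buffering the full listing in memory first.
func streamLs(ctx context.Context, root Node, out chan<- Entry) error {
	defer close(out)
	stack := []Node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, e := range n.Entries() {
			select {
			case out <- e: // emit immediately; the consumer can start work now
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		for _, cid := range n.Children() {
			child, err := fetchNode(ctx, cid)
			if err != nil {
				return err
			}
			stack = append(stack, child)
		}
	}
	return nil
}
```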
Correction, not using the streaming API just yet, but we are using custom code. That said, TensorFlow's own directory listing logic is not streaming, so some creativity will be required on my part for some operations: https://github.com/tensorflow/tensorflow/blob/e7f158858479400f17a1b6351e9827e3aa83e7ff/tensorflow/core/platform/file_system.h#L116 Agreed on getting the train moving! Based on other threads, it seems like badger isn't a short term option. Who has the baton for this right now? |
@ajbouh @hannahhoward has picked this back up. See the linked issue for more details. |
@ajbouh I am not sure if we've cut a new release since ipfs/go-unixfs#19 was merged, but I'd be curious to hear how this affects your performance. |
Thanks for the ping, @hannahhoward. I am also curious about the performance but have not tried a recent build myself. Have you tried the operations I referenced in tesserai/iptf#2? They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 |
@hannahhoward we haven't. |
I believe this has been addressed in 0.4.18. Please reopen if needed. |
Hi @eingenito! Have you tried the operations I referenced in tesserai/iptf#2? They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 |
The first part was addressed in 0.4.18 but the second part is the sessions improvements that'll land in 0.4.19. Hopefully that'll be sufficient. |
Will the session stuff address the duplicated blocks issue? |
Significantly but it's still far from perfect. |
@b5 ^ |
This is probably as fast as it's going to get for the foreseeable future. |
Hi @Stebalien, do you mind quantifying how fast things are now? |
Version information:
0.4.15-dev
Type:
Bug/performance issue
Description:
More context is available over in tesserai/iptf#2
I'm trying to get reasonable performance for just listing the names of entries in a sharded directory that's not yet cached locally. This operation takes hours right now. With @Stebalien's help I've been able to determine that it's only requesting one hash at a time (as indicated by `ipfs bitswap wantlist`). Seems like IPFS should be requesting more than one block at a time in this scenario. Creating a separate issue to track this specific performance issue separately from others.
child of #5487