| 1 | +What golang Kinds work best to implement CIDs? |
| 2 | +============================================== |
| 3 | + |
| 4 | +There are many possible ways to implement CIDs. This package explores them. |
| 5 | + |
| 6 | +### criteria |
| 7 | + |
| 8 | +There are a few criteria to consider:
| 9 | + |
| 10 | +- We want the best performance when operating on the type (getters, mostly); |
| 11 | +- We want to minimize the number of memory allocations we need; |
| 12 | +- We want types which can be used as map keys, because this is common. |
| 13 | + |
| 14 | +The priority of these criteria is open to argument, but it's probably |
| 15 | +mapkeys > minalloc > anythingelse. |
| 16 | +(Mapkeys and minalloc are also quite entangled, since if we don't pick a |
| 17 | +representation that can work natively as a map key, we'll end up needing |
| 18 | +a `KeyRepr()` method which gives us something that does work as a map key, |
| 19 | +and that will almost certainly involve a malloc itself.)
| 20 | + |
| 21 | +### options |
| 22 | + |
| 23 | +There are quite a few different ways to go: |
| 24 | + |
| 25 | +- Option A: CIDs as a struct; multihash as bytes. |
| 26 | +- Option B: CIDs as a string. |
| 27 | +- Option C: CIDs as an interface with multiple implementors. |
| 28 | +- Option D: CIDs as a struct; multihash also as a struct or string. |
| 29 | +- Option E: CIDs as a struct; content as strings plus offsets. |
| 30 | + |
| 31 | +The current approach on the master branch is Option A. |
| 32 | + |
| 33 | +Option D is distinct from Option A because multihash as bytes transitively
| 34 | +causes the CID struct to be non-comparable and thus not suitable for map keys
| 35 | +as per https://golang.org/ref/spec#KeyType . (It's also a bit more work to |
| 36 | +pursue Option D because it's just a bigger splash radius of change; but also, |
| 37 | +something we might also want to do soon, because we *do* also have these same |
| 38 | +map-key-usability concerns with multihash alone.) |
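A minimal sketch of the comparability difference between Option A and Option D. The `cidBytes` and `cidStr` types and the `countDistinct` helper are hypothetical names for illustration, not the package's real definitions.

```go
package main

import "fmt"

// cidBytes is a hypothetical Option A layout: the []byte field makes the
// whole struct non-comparable, so `map[cidBytes]int` does not compile.
type cidBytes struct {
	version uint64
	codec   uint64
	hash    []byte
}

// cidStr is a hypothetical Option D layout: every field is comparable,
// so the struct can be used directly as a map key.
type cidStr struct {
	version uint64
	codec   uint64
	hash    string
}

// countDistinct deduplicates CIDs by using them as map keys.
func countDistinct(cids []cidStr) int {
	seen := make(map[cidStr]struct{}, len(cids))
	for _, c := range cids {
		seen[c] = struct{}{}
	}
	return len(seen)
}

func main() {
	cids := []cidStr{
		{version: 1, codec: 0x70, hash: "abc"},
		{version: 1, codec: 0x70, hash: "abc"}, // duplicate of the first
		{version: 1, codec: 0x71, hash: "abc"},
	}
	fmt.Println(countDistinct(cids))
}
```

Declaring `map[cidBytes]int` anywhere in this program would be rejected by the compiler as an invalid map key type.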
| 39 | + |
| 40 | +Option E is distinct from Option D because Option E would always maintain
| 41 | +the binary format of the cid internally, and so could yield it again without |
| 42 | +malloc, while still potentially having faster access to components than |
| 43 | +Option B since it wouldn't need to re-parse varints to access later fields. |
| 44 | + |
| 45 | +Option C is the avoid-choices choice, but note that interfaces are not free; |
| 46 | +since "minimize mallocs" is one of our major goals, we cannot use interfaces |
| 47 | +whimsically. |
| 48 | + |
| 49 | +Note there is no proposal for migrating to `type Cid []byte`, because that
| 50 | +is generally considered to be strictly inferior to `type Cid string`. |
| 51 | + |
| 52 | + |
| 53 | +Discoveries |
| 54 | +----------- |
| 55 | + |
| 56 | +### using interfaces as map keys forgoes a lot of safety checks |
| 57 | + |
| 58 | +Using interfaces as map keys pushes a bunch of type checking to runtime. |
| 59 | +E.g., it's totally valid at compile time to push a type which is non-comparable |
| 60 | +into a map key; it will panic at *runtime* instead of failing at compile-time. |
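The runtime panic can be reproduced with a small sketch. The `Cid` interface, the `sliceCid` implementor, and the `insertPanics` helper are all hypothetical names made up for this example.

```go
package main

import "fmt"

// Cid is a hypothetical interface standing in for Option C.
type Cid interface {
	Version() uint64
}

// sliceCid is a hypothetical implementor that holds a slice, which makes it
// non-comparable even though it satisfies Cid.
type sliceCid struct {
	raw []byte
}

func (c sliceCid) Version() uint64 { return 1 }

// insertPanics reports whether using k as a map key panicked at runtime.
func insertPanics(m map[Cid]struct{}, k Cid) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	m[k] = struct{}{} // compiles fine; the hashability check happens here
	return false
}

func main() {
	m := make(map[Cid]struct{})
	fmt.Println(insertPanics(m, sliceCid{raw: []byte{0x01}}))
}
```

The compiler accepts `map[Cid]struct{}` because interface types are comparable in general. The dynamic type is only checked when the key is hashed.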
| 61 | + |
| 62 | +There's also no way to define equality checks between implementors of the |
| 63 | +interface: golang will always use its innate concept of comparison for the |
| 64 | +concrete types. This means it's effectively *never safe* to use two different
| 65 | +concrete implementations of an interface in the same map; you may add elements |
| 66 | +which are semantically "equal" in your mind, and end up very confused later |
| 67 | +when both impls of the same "equal" object have been stored. |
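A sketch of that confusion, assuming two hypothetical implementors (`strCid` and `structCid`) of a hypothetical `Cid` interface that wrap the same bytes:

```go
package main

import "fmt"

// Cid is a hypothetical interface with two implementors below.
type Cid interface {
	Bytes() string
}

// strCid and structCid are hypothetical implementations that can carry
// exactly the same content.
type strCid string

func (c strCid) Bytes() string { return string(c) }

type structCid struct {
	raw string
}

func (c structCid) Bytes() string { return c.raw }

// distinctKeys counts how many separate map entries the given CIDs occupy.
func distinctKeys(cids ...Cid) int {
	seen := make(map[Cid]struct{})
	for _, c := range cids {
		seen[c] = struct{}{}
	}
	return len(seen)
}

func main() {
	a := strCid("same-bytes")
	b := structCid{raw: "same-bytes"}
	// Same content, but the dynamic types differ, so Go treats them as
	// different keys.
	fmt.Println(a.Bytes() == b.Bytes(), distinctKeys(a, b))
}
```

Interface values compare equal only when both the dynamic type and the dynamic value match, and there is no hook to override that for map keys.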
| 68 | + |
| 69 | +### sentinel values are possible in any impl, but some are clearer than others |
| 70 | + |
| 71 | +When using `*Cid`, the nil value is a clear sentinel for 'invalid'; |
| 72 | +when using `type Cid string`, the zero value is a clear sentinel; |
| 73 | +when using `type Cid struct` per Option A or D... the only valid check is |
| 74 | +for a nil multihash field, since version=0 and codec=0 are both valid values. |
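The three sentinel styles might look roughly like this. `CidStr`, `CidStruct`, and the `Defined` method are hypothetical names chosen for the sketch.

```go
package main

import "fmt"

// CidStr is a hypothetical Option B type; its zero value "" is the sentinel.
type CidStr string

func (c CidStr) Defined() bool { return c != "" }

// CidStruct is a hypothetical Option A type. version and codec may
// legitimately be 0, so only the multihash can mark "undefined".
type CidStruct struct {
	version uint64
	codec   uint64
	hash    []byte
}

func (c CidStruct) Defined() bool { return c.hash != nil }

func main() {
	var p *CidStruct // pointer style: nil is the sentinel
	fmt.Println(p == nil, CidStr("").Defined(), CidStruct{}.Defined())
}
```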
| 75 | + |
| 76 | +### usability as a map key is important |
| 77 | + |
| 78 | +We already covered this in the criteria section, but for clarity: |
| 79 | + |
| 80 | +- Option A: ❌ |
| 81 | +- Option B: ✔ |
| 82 | +- Option C: ~ (caveats, and depends on concrete impl) |
| 83 | +- Option D: ✔ |
| 84 | +- Option E: ✔ |
| 85 | + |
| 86 | +### living without offsets requires parsing |
| 87 | + |
| 88 | +Since CID (and multihash!) are defined using varints, they require parsing; |
| 89 | +we can't just jump into the string at a known offset in order to yield e.g. |
| 90 | +the multicodec number. |
| 91 | + |
| 92 | +In order to get to the 'meat' of the CID (the multihash content), we first |
| 93 | +must parse: |
| 94 | + |
| 95 | +- the CID version varint; |
| 96 | +- the multicodec varint; |
| 97 | +- the multihash type enum varint; |
| 98 | +- and the multihash length varint. |
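The four parsing steps above can be sketched as follows. This assumes a binary CIDv1 held in a string. The `uvarint` and `digestOf` helpers are hypothetical and hand-rolled so the example can read from a string without converting it to `[]byte`. The sample input uses a deliberately short 4-byte digest.

```go
package main

import (
	"errors"
	"fmt"
)

// uvarint decodes one unsigned LEB128 varint from the front of s and returns
// the value and the number of bytes consumed (0 on failure). The 9-byte cap
// is an assumption taken from the multiformats unsigned-varint convention.
func uvarint(s string) (value uint64, n int) {
	for i := 0; i < len(s) && i < 9; i++ {
		b := s[i]
		value |= uint64(b&0x7f) << (7 * uint(i))
		if b < 0x80 {
			return value, i + 1
		}
	}
	return 0, 0
}

// digestOf skips the four leading varints (CID version, multicodec,
// multihash type, multihash length) and returns the digest that follows.
func digestOf(cid string) (string, error) {
	rest := cid
	var fields [4]uint64
	for i := range fields {
		v, n := uvarint(rest)
		if n == 0 {
			return "", errors.New("truncated or oversized varint")
		}
		fields[i] = v
		rest = rest[n:]
	}
	if uint64(len(rest)) != fields[3] {
		return "", errors.New("digest length mismatch")
	}
	return rest, nil // a substring of cid, so no copy is made
}

func main() {
	// version=1, codec=0x70, hash type=0x12, length=4, then the digest.
	digest, err := digestOf("\x01\x70\x12\x04abcd")
	fmt.Println(digest, err)
}
```

Every call repeats the walk over the header, which is the overhead that stored offsets would avoid.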
| 99 | + |
| 100 | +Since there are many applications where we want to jump straight to the |
| 101 | +multihash content (for example, when doing CAS sharding -- see the |
| 102 | +[disclaimer](https://github.com/multiformats/multihash#disclaimers) about |
| 103 | +bias in leading bytes), this overhead may be interesting. |
| 104 | + |
| 105 | +Whether this overhead is significant is hard to say from microbenchmarking;
| 106 | +it depends largely on usage patterns. If these traversals are a significant |
| 107 | +timesink, it would be an argument for Option D/E. |
| 108 | +If these traversals are *not* a significant timesink, we might be wiser |
| 109 | +to keep to Option B, because keeping a struct full of offsets will add several |
| 110 | +words of memory usage per CID, and we keep a *lot* of CIDs. |
| 111 | + |
| 112 | +### interfaces cause boxing which is a significant performance cost |
| 113 | + |
| 114 | +See `BenchmarkCidMap_CidStr` and friends. |
| 115 | + |
| 116 | +Long story short: using interfaces *anywhere* will cause the compiler to |
| 117 | +implicitly generate boxing and unboxing code (e.g. `runtime.convT2E`); |
| 118 | +this is both another function call, and more concerningly, results in |
| 119 | +large numbers of unbatchable memory allocations. |
| 120 | + |
| 121 | +Numbers without context are dangerous, but if you need one: 33%. |
| 122 | +It's a big deal. |
| 123 | + |
| 124 | +This means attempts to "use interfaces, but switch to concrete impls when |
| 125 | +performance is important" are a red herring: it doesn't work that way. |
| 126 | + |
| 127 | +This is not a general indictment of using interfaces -- but
| 128 | +if a situation is at the scale where it's become important to mind whether |
| 129 | +or not pointers are a performance impact, then that situation also |
| 130 | +is one where you have to think twice before using interfaces. |
| 131 | + |
| 132 | +### one way or another: let's get rid of that star |
| 133 | + |
| 134 | +We should switch to handling `Cid` everywhere and remove `*Cid` entirely.
| 135 | +Regardless of whether we do this by migrating to interface, or string |
| 136 | +implementations, or simply structs with no pointers... once we get there, |
| 137 | +refactoring to any of the *others* can become a no-op from the perspective |
| 138 | +of any downstream code that uses CIDs. |
| 139 | + |
| 140 | +(This means all access via functions, never references to fields -- even if |
| 141 | +we were to use a struct implementation. *Pretend* there's an interface,
| 142 | +in other words.) |
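A sketch of what accessor-only usage could look like with a value type. The `NewCid` constructor, the method names, and the `describe` helper are hypothetical. The point is that downstream code like `describe` never touches a field or a pointer.

```go
package main

import "fmt"

// Cid is a hypothetical value type with unexported fields, so callers can
// only reach its contents through methods.
type Cid struct {
	version uint64
	codec   uint64
	hash    string
}

// NewCid is a hypothetical constructor.
func NewCid(version, codec uint64, hash string) Cid {
	return Cid{version: version, codec: codec, hash: hash}
}

func (c Cid) Version() uint64 { return c.version }
func (c Cid) Codec() uint64   { return c.codec }
func (c Cid) Hash() string    { return c.hash }

// describe stands in for downstream code: it takes Cid by value and uses
// only accessors, so it would keep compiling if Cid later became a string
// type or an interface with the same methods.
func describe(c Cid) string {
	return fmt.Sprintf("v%d codec=0x%x", c.Version(), c.Codec())
}

func main() {
	fmt.Println(describe(NewCid(1, 0x70, "abcd")))
}
```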
| 143 | + |
| 144 | +There are probably `gofix` incantations which can help us with this migration. |