Optimizing data for Display Names #3260

Open · Tracked by #3913
sffc opened this issue Apr 4, 2023 · 13 comments
Labels
A-data (Area: Data coverage or quality), A-design (Area: Architecture or design), C-dnames (Component: Language/Region/... Display Names), S-medium (Size: Less than a week, larger bug fix or enhancement), T-core (Type: Required functionality), U-ecma402 (User: ECMA-402 compatibility)

Comments

@sffc (Member) commented Apr 4, 2023

The DisplayNames component comes with a large amount of data. It is the largest locale-specific data in ICU and will also likely be the largest in ICU4X.

There are a few things that make DisplayNames interesting:

  1. The majority of the display names are probably not useful to carry for most clients. For example, users speaking Japanese are more likely to need the translation for the Katakana script than the translation for the Cherokee script. We should explore something like japanext and likelysubtagsext where we have a core set and an extended set.
  2. Regional variants often override only a small number of strings. For example, en-GB and en-US might be equivalent for all region names except for one or two. This doesn't play nicely with the deduplication mechanism we've relied on thus far.

CC @snktd @robertbastian @markusicu

sffc added the A-design, discuss, A-data, and C-dnames labels on Apr 4, 2023
@robertbastian (Member):

I think 2 is a big issue, and I think it also happens for other data. We could, instead of loading a single data struct in the formatter constructor, load all structs for the whole fallback chain. This could use naive fallback (i.e. chopping off tags), so no additional data would be needed. We can then remove redundant entries from en-GB and en-001 if they are in en (if we're using naive we'd still have duplication across GB and 001 though).
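
To make this concrete, here is a toy sketch of the naive-fallback idea (plain std types and made-up data, not the ICU4X provider API): each locale's payload stores only the names that differ from its parent, and lookup walks the chain produced by chopping subtags off the locale.

use std::collections::HashMap;

// Naive fallback: "en-GB" -> "en" by repeatedly chopping off the last subtag.
fn fallback_chain(locale: &str) -> Vec<&str> {
    let mut chain = vec![locale];
    let mut current = locale;
    while let Some(idx) = current.rfind('-') {
        current = &current[..idx];
        chain.push(current);
    }
    chain
}

// Look up a region display name, consulting the most specific payload first.
fn display_name<'a>(
    payloads: &'a HashMap<String, HashMap<String, String>>,
    locale: &str,
    region: &str,
) -> Option<&'a str> {
    fallback_chain(locale)
        .into_iter()
        .find_map(|loc| payloads.get(loc)?.get(region).map(String::as_str))
}

fn main() {
    let payloads: HashMap<String, HashMap<String, String>> = HashMap::from([
        // "en" carries the full set; the regional variants carry only overrides.
        ("en".into(), HashMap::from([("PA".into(), "Panama".into())])),
        ("en-001".into(), HashMap::new()),
        ("en-GB".into(), HashMap::new()),
    ]);
    // en-GB inherits "Panama" via the chain en-GB -> en.
    assert_eq!(display_name(&payloads, "en-GB", "PA"), Some("Panama"));
}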

@sffc (Member Author) commented May 11, 2023

Discuss with:

sffc added the discuss-triaged label on May 25, 2023
@sffc (Member Author) commented Jul 5, 2023

Discussed on 2023-07-04. We will use the auxiliary key model, similar to currency formatter (#1441), which resolves the issues in the OP.

sffc removed the discuss and discuss-triaged labels on Jul 5, 2023
sffc added this to the 1.4 Blocking ⟨P1⟩ milestone on Jul 5, 2023
sffc added the T-core and S-medium labels on Jul 5, 2023
Manishearth moved this to Not a blocker in icu4x 2.0 on Feb 23, 2024
sffc removed this from icu4x 2.0 on Feb 29, 2024
sffc added the discuss-priority label on Nov 20, 2024
@sffc (Member Author) commented Nov 23, 2024

Options:

  1. Fine grained data marker attributes
  2. Core/Extended

Discussion:

  • @zbraniecki Seems like it would be a footgun if clients aren't loading display names that they need.
  • @sffc How would you draw the line?
  • @zbraniecki Not really sure.
  • @zbraniecki Most clients are not in a position to decide whether "Tuvalu" should be included. It seems option 2 is easier for clients, if we can make a reasonable set. We are in a better position to draw this line than our customers.
  • @sffc Option 1 is better for data deduplication. It's possible though that we could do something similar to how we reduced data size for time zones (where we allow fallback to root).
  • @sffc My main worry with Option 1 is its impact on compile times with so many finely sliced data marker attributes.

sffc added the U-ecma402 label and removed the discuss-priority label on Nov 23, 2024
@sffc (Member Author) commented Nov 25, 2024

@sffc (Member Author) commented Dec 14, 2024

Trying to organize my thoughts here.

There are two different ways we have been discussing to reduce data size:

  1. Deduplicating identical strings
    • Example: en-GB and en-US and fr have the same translation for Panama. We'd like to ship that translation only once.
  2. Removing cold strings
    • Example: the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

These two paths can be explored mostly independently, though some solutions impact both.

Deduplicating identical strings

Two general paths for this which are compatible with the ICU4X architecture:

  1. Make every string its own data marker attribute. This is what we decided to do last summer, Optimizing data for Display Names #3260 (comment), and it is what we currently do for the experimental unit and currency formats. (A rough sketch of this approach appears below.)
    • Pro: Smallest postcard data size
    • Pro: Allows for cold strings to be removed via datagen (solves problem 2)
    • Pro: Easy to implement
    • Con: Finely sliced data markers increase build times and baked data size (Baked data is big, and compiles slowly, for finely sliced data markers #5230)
    • Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)
  2. Load multiple payloads, such as "locale plus root" or "locale plus script", which have been deduplicated. This is what @robertbastian suggested in Optimizing data for Display Names #3260 (comment) and similar to what he implemented to deduplicate time zone names in Remove generic metazone values that match location values #5751
    • Pro: Doesn't test the boundaries of the Rust compiler
    • Pro: Should be fairly efficient for display name lookup
    • Con: Leaves some data size on the table (unclear how much)
    • Con: A bit harder to implement
    • Con: Increases stack size of DisplayNames type
    • Con: May favor Latin-script languages, but see Improve time zone location name deduplication #5901

Is there another approach I'm missing?
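
For concreteness, here is a rough sketch of approach 1, where each display name is addressed by (locale, data marker attribute) and a load returns a single string (toy types and made-up data, not the actual DataProvider API). The postcard size win comes from the exporter storing identical payloads only once; the cost is one provider query per name.

use std::collections::HashMap;

// Toy stand-in for a provider keyed by (locale, marker attribute), where the
// attribute is the region code. In real data this would be something like a
// ZeroTrie keyed by "locale/attribute".
struct SingleNameProvider {
    data: HashMap<(String, String), String>,
}

impl SingleNameProvider {
    fn load(&self, locale: &str, attribute: &str) -> Option<&str> {
        self.data
            .get(&(locale.to_owned(), attribute.to_owned()))
            .map(String::as_str)
    }
}

fn main() {
    let provider = SingleNameProvider {
        data: HashMap::from([
            (("en".into(), "PA".into()), "Panama".into()),
            // Identical strings across locales can be deduplicated by the exporter.
            (("fr".into(), "PA".into()), "Panama".into()),
        ]),
    };
    // The formatter issues one data request per display name it renders:
    assert_eq!(provider.load("fr", "PA"), Some("Panama"));
}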

Removing cold strings

Two ways to go about this:

  1. Utilize data marker attributes (see above)
    • All pros and cons as above
    • Pro: The set of strings included for each locale is fully customizable by the client, though we could still ship default configurations
    • Con: A non-default set of strings requires running datagen
  2. Introduce Minimal/Core/Extended keys, similar to LocaleExpander.
    • Pro: Clients do not need to fiddle with datagen in order to meet their requirements
    • Pro: Puts the i18n team in charge of deciding what constitutes a "cold" string
    • Con: The choices offered to clients are coarse (such as minimal, core, or extended) and not customizable
    • Con: Increases stack size of DisplayNames type
    • Con: May reduce performance relative to not having the separate keys

Thoughts on any of the above?

@Manishearth @hsivonen

@Manishearth (Member) commented Dec 17, 2024

While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'd rather attempt to deduplicate. Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the rust compiler limits are.

Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)

I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".

We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

@sffc (Member Author) commented Dec 17, 2024

While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'm not too worried about confusion with our default data. I think our default data should always be at least the "core" data set, which I'm considering to be the cartesian product of all modern locales. If we take the multiple-keys approach, the constructors can have names such as new_minimal, new, and new_extended.
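
For illustration, a sketch of what that constructor surface could look like under the multiple-keys approach (the tier names come from the paragraph above; the type name, signatures, and data tiers are hypothetical, not an agreed API):

// Hypothetical tiers; not an agreed API.
pub struct RegionDisplayNames {
    // would hold whichever payload tiers were loaded
}

impl RegionDisplayNames {
    /// Minimal set: only the hottest display names for the requested locale.
    pub fn new_minimal(_locale: &str) -> Self {
        unimplemented!("sketch only")
    }
    /// Core set (the default): roughly the cartesian product of all modern locales.
    pub fn new(_locale: &str) -> Self {
        unimplemented!("sketch only")
    }
    /// Extended set: everything CLDR provides, including cold strings.
    pub fn new_extended(_locale: &str) -> Self {
        unimplemented!("sketch only")
    }
}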

I'd rather attempt to deduplicate.

Yes, deduplication is something we should do regardless of how/if we decide to handle cold strings.

Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the rust compiler limits are.

Ideally we work out the compiler fundamentals before implementing a solution. It seems unlikely we'll have this ready in 2024, but it would be nice to have a path forward and implement this in 2025.

Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)

I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".

We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

True. What I had in mind when I wrote my comment was more like, "if you need one name at a time, a map lookup is about as fast as a data provider lookup, but if you need an iterator over all names, a map lookup is probably faster."

@Manishearth (Member):

Yep, 100%.

@robertbastian (Member):

I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter. We also don't have any data on which combinations are "hot".

@sffc (Member Author) commented Jan 9, 2025

I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter.

It's about tradeoffs. @zbraniecki likes to use the following thought experiment: in an environment with a fixed amount of space, what produces the better outcome for global users: including these cold display name strings, or including a whole new locale?

I do think we should keep the cold strings in the default configuration. But, I also think we are in a position to give this lever to clients who need it.

We also don't have any data on which combinations are "hot".

CLDR has some data on usage patterns between languages and regions. The data could probably use some improvement.

@hsivonen (Member) commented Jan 9, 2025

the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

I think it's bad to slice up the data like this. When a South Korean app asks a user for their country-or-region of residence as part of user registration (let's assume for the sake of simplicity here that it's legitimate to ask for such a thing), why should the menu exclude Uganda?

AFAICT, slicing up the DisplayNames data like this doesn't make much sense, since DisplayNames makes the most sense to use when you need to deal with a large set of possible names, especially if translating them is outside the core competence of the app's UI localizers. If the app only needs to deal with a small set of translatable strings that are familiar to the translators, the app doesn't need DisplayNames.

Continuing with the Korean example: Anecdotally, if a South Korean app has non-Korean UI languages, the UI languages are Korean, English, Japanese, and Chinese (without clearly saying Simplified or Traditional). That's a small set, but it doesn't follow that it makes sense to provide DisplayNames pruned to a small set that's somehow geographically proximate plus English. The app does not need DisplayNames—hot or cold—to get labels for an in-app UI for switching between these. It can use whatever mechanism it uses for its UI strings in general.

(And making English have a broader set of supported names than Korean on the assumption that people whose country-or-region of residence is in the cold set for Korean would use the English UI anyway would fail cases like Koreans who have moved abroad and are visiting registering for an account with the UI in Korean but country-or-region of residence in the cold set. So that would be bad.)

Having an API that works for presumed-hot values but not for presumed-cold values makes for an unreliable API. If developers perceive an API as too unreliable to use, then the utility per size of even the hot set becomes worse.

As far as the JavaScript ECMA-402 side goes, as I said in the TG2 meeting, I think it would be better to not ship DisplayNames than to ship DisplayNames in an unreliable state.

sffc added the discuss-priority label on Feb 3, 2025
@Manishearth (Member):

  • @sffc Two ways to reduce data size:
    1. Deduplicate identical strings
    2. Remove cold strings
  • @sffc I want to focus on (1). I had two approaches:
    1. Use finely-sliced data markers
    2. Load multiple payloads
  • @sffc Ideally we could use finely-sliced data markers, but they have a lot of problems of their own, as discussed at length in Baked data is big, and compiles slowly, for finely sliced data markers #5230
  • @robertbastian Finely-sliced data markers might not be a good API. We might not know the regions to load until runtime.
  • @sffc We previously discussed a design for a DateTime formatter that only supports a single time zone, but the design we chose allows formatting all time zones. In the case of display names, it's okay to say that you have the data marker when you choose the region.
  • @Manishearth There is a parallel here: there are 2 types of API users, just as there are 2 types of MessageFormat users: people who want to format individual messages vs. people who want to work with bundles and load many at once.
  • @robertbastian I don't think that is analogous at all.
  • @sffc That isn't the first analogy I would draw. The question of display names applies to currencies and units as well, or any case with a lot of identifiers. An approach that we have partially implemented for this is to have a way to query the data marker attributes by doing prefix matching. We had to do this for loading the model for Segmenter. We can iterate over all available regions at runtime.
  • @robertbastian I don't think we ever iterate. Iterating would require a b-tree map, and that takes a lot of ___.
  • @sffc I think Buffer will support it.
  • @robertbastian I don't like the idea of loading N payloads. To do that, you need to allocate DataPayload<> into a Vec.
  • @sffc Do we actually need that? Another design is, if you want a multi-{unit, region}, the formatter just saves a reference to the provider.
  • @robertbastian Then it would have a lifetime, which would be a first for us. Last time we designed a display name formatter for a single region, and yesterday we said that we can do that. In DateTime, you can't create a DateTimeFormatter per time zone.
  • @sffc I think this is a policy decision.
  • @robertbastian It's one of the 3 options on the table. The other options are heap allocations or formatters with lifetimes on the providers.
  • @sffc I think those options are in the same bucket. I want to evaluate the technical merits of the options first, and then design the API after that.
  • @sffc One piece of data that we don't have is the degree of the data size difference between data markers & ___.
  • @robertbastian It depends on the user's calling code. If the caller only ever formats GB, one wins over the other. But if all options are equal, then it doesn't make a difference.
  • @sffc There's loaded data size and linked data size. What is the impact on the linked data size? The linked data size still depends on auxiliary keys. None of the proposals on the table use actual data markers on the region. You have to do slicing, but you can do slicing manually.
  • @robertbastian With finely-sliced data markers, you can do filtering on the data keys. But with multiple payloads, you can't do such filtering.
  • @Manishearth We can revisit the policy about auxiliary keys. Part of the issue is that finely-sliced data doesn't scale to compile. It takes gigabytes of RAM to work, and we see this on CI. It seems to be a function of the number of tokens.
  • @Manishearth I don't think we're aligned on what the actual problems and constraints are.
  • @sffc ZeroTrie indexes into a static array. If it were to index into a ZeroVec, that would solve the problem. It's one of the points that I listed in issue Rename CodePointSet to CodePointInversionList #2230.
  • @sffc I want to discuss the merits of the two approaches (finely-sliced data markers & loading multiple payloads).
  • @Manishearth Yes. We're talking about the relative importance of these use cases. Between "format a specific set of locales" vs. "format any and all locales", do we have a preference? is there a tradeoff?
  • @sffc Two things to add, not completely related: For display names, and probably for units, the model that finely-sliced data would push us towards, where the unit / region is given in the constructor, would be okay. In the case of units, the measure unit is generally in the constructor. In the case of display names, in general, you will want multiple of them. But display names are so thin that you can just use your data provider and provide it to your static function (? constructor?).
  • @robertbastian You can skip the formatter thing and go straight to the formatted thing.
  • @sffc My second statement, unrelated to the previous, is that I think we still have headway to make multi-payload competitive with the link size we can get from finely-sliced. When I say link size, I assume that you're not doing custom slicing. Assuming that you link all regions and all locales, I think there's headway to make multi-payload competitive. Every locale has a base and overrides. Overrides are where the locale differs from the base. That allows locales to share the Latin names and only store the difference when they differ. This helps low-resource languages that don't have complete data yet, so they default to the root locale that has the Latin names in those cases of missing data. We have an exception to ___ when it's a performance-related thing. If we shipped the Azeri pack but only included 5 region names and not the other 30, then it would fall back to root. That somewhat harms the i18n, but not to the degree that we couldn't make a policy that this is a cost we're willing to accept.
  • @Manishearth Every locale has a base and overrides. Is the base the root locale? Or does every locale have its own base?
  • @sffc The base for a locale is script-based.
  • @robertbastian I think what you're designing with the multi-payload solution is a workaround. I think finely-sliced data markers would allow us to get full deduplication. So if we can do that, we should.
  • @Manishearth I don't think it's entirely a workaround. We never designed our data loading infrastructure to have more than one payload. It is theoretically possible to design a data provider to load two separate keys referring to differing size subsets of the same data.
  • @sffc Showing data from provider/source/data/debug/datetime/LocationsRootV1 and provider/source/data/debug/datetime/LocationsV1. Example: the fr locale has ~400 lines in LocationsRootV1 but only 223 lines in LocationsV1. A lot of the locales in LocationsRootV1 have identical contents, so they get deduplicated, and we only store one copy.
  • @Manishearth Can we design finely-sliced data markers so that they support multi-payload loading?
  • @robertbastian Instead of passing around a CountryFormatter, you'd pass around your data provider and do on-demand loading. You wouldn't store a Vec of DataPayloads, but you need to build some kind of map, so effectively you're rebuilding the data provider
  • @Manishearth We've previously established that data loading is supposed to be cheap, so that seems fine.
  • @sffc The only downside that I can think of is the iteration over all of the things in the provider is not something that we support.
  • @robertbastian We have that, it's just behind the alloc feature. We can make it no-alloc
  • @Manishearth We've cared about the alloc stuff for a while.
  • @sffc One of the reasons that we haven't had the iteration provider is that we don't want users to use it to determine what locales are supported.
  • @Manishearth Do we need iteration over locales for display names?
  • @sffc Having a region picker is the main use case.
  • @robertbastian But is iteration over display names the right way to do that? We don't do that for time zones; we iterate through time zone IDs, not time zone display names.
  • @robertbastian Not all locales have display names for all regions. So you can't simply iterate over display names.
  • @Manishearth Yeah, so the list of regions could be either "the regions for my app" or "all regions ICU4X supports".
  • @robertbastian Something like "known regions".
  • @sffc In general, ICU4X compiles with all regions, and you can slice which regions you want.
  • @robertbastian All regions changes with every CLDR release.
  • @Manishearth Fallback also affects which locales we choose. These are interesting questions, but they don't tie into display names, they tie into a different issue. We should file an issue for this and talk about it separately.
  • @sffc ___ has an API supportedValuesOf. Maybe the basis is the LocaleCanonicalizer.
  • @robertbastian I don't think LocaleCanonicalizer is the struct to do this.
  • @sffc Or maybe the struct is LocaleExpander. The higher-level point is that this example can be a basis for display names.
  • @Manishearth Also, while I'm not a fan of doing ECMA-specific APIs, we could have a DisplayNamesSupportedValues marker (etc.). It's locale-unaware anyway, so doing this API based on a locale is incorrect: we can have a big vec of stuff.
  • @sffc One other problem with finely-sliced: we generally index each one of them on the region code, which is two letters. But then we have to repeat the name of the unit in the name of the key along with the locale when we load units. So we end up with a key like en-kilometer.
  • @robertbastian Display names change when CLDR changes display names. Units change when ISO changes them. So they shouldn't be tied together. You should not checksum them together and use one to look up the other. We don't do this in time zones. The display names are not checksummed to the IANA mapper.
  • @sffc I am trying to parse @robertbastian's comment because I think checksumming is only feasible in the multi-payload approach, and I think I described that. I don't know why you say it's not feasible.
  • @robertbastian We didn't want to load the IANA mapper.
  • @Manishearth I think this is possible with the finely-sliced approach. I think it would work as such: If you have the locale French, then you would get the display names map without auxiliary keys. It would use index numbers as the keys, with a side map that correlates each index number to the actual key name. This needs the checksum to be 100% correct. It's not my preferred approach, but we can do that. We can use at most 2 bytes for the index number.
  • @sffc It may not benefit display names, but units would benefit from using 2 bytes.
  • @Manishearth Units get added over time.
  • @sffc I like the checksum approach.
  • @Manishearth Turning this into numbers is okay.
  • @sffc This sounds like a code driven thing, and I would like things to be data driven.
  • @Manishearth This is a small amount of data.
  • @Manishearth This would give our supportedValuesOf API for free.
  • @robertbastian I'm not a big fan of checksumming markers together. If we can put this into a special compression locale for the same marker, it would just need to be in the same data struct somehow.
  • @Manishearth For datagen specifically, we can tweak it, instead of doing the regular databake implementation, call this special one for this one key. It produces two ZeroTries, one with the metadata, and one with the special stuff for the key. Doing it in datagen would work, but it would not be so easy in Postcard.
  • @sffc Another approach without doing the checksum, as long as BakedProvider or Postcard know the full space of all possible attributes, is that they can do this table thing internally and deal with during the load operation. Not sure if that was what @Manishearth was saying or not.
// singleton, checksummed
struct Compression<'data> {
    tzids: ZeroVec<'data, TimeZoneId>, // e.g. [deber, usden]
}

// compression locale
struct DisplayNames<'data> {
    names: ZeroMap<'data, TimeZoneId, str>, // e.g. [(deber, ""), (usden, "")]
}
  • @robertbastian At some point you can just run deflate on the payload bytes instead of inventing our own compression schemes
  • @Manishearth It sounds like we mostly agree on finely sliced? There's just an inefficiency we want to handle, we know we can handle it with a split metadata table, and the disagreement is about whether that table should be a singleton or purely internal in baked and postcard.
  • @sffc I'm onboard with finely sliced on the condition that we fix Baked data is big, and compiles slowly, for finely sliced data markers #5230.
  • @robertbastian Only thing we lose is the ability to filter regions on datagen time.
  • @sffc We can do it in multi-payload. There would be a three-layer approach. A singleton und locale would contain a display name for every region. And then the Latn and fr locales could choose to strip some of the display names. The observable behavior is that all of the display names are supported, but the i18n is less good, which we could decide is consistent with our policy.
  • @robertbastian Why are we discussing this compression a lot? ZeroTrie is efficient, yes?
  • @sffc Manish asked what the challenges were: I pointed out that the multipayload approach allows for the attributes to be better compressed via the checksum approach, but now we have a path for that compression in both models.
  • @Manishearth: Current impression is that everyone is on board with finely sliced, provided we fix Baked data is big, and compiles slowly, for finely sliced data markers #5230. We have a path towards compression with two routes (explicit metadata singleton vs internal stuff), with pros and cons that can be discussed. We have ideas for Baked data is big, and compiles slowly, for finely sliced data markers #5230.

Discussion on #5230:

  • @Manishearth If the data is a simple string, baked data can optimize how it's stored. The official data struct could be a repr(transparent).
  • @robertbastian Now we can make anything the data struct.
  • @sffc DisplayNames, maybe, but I want this to work in general for finely-sliced data including currencies which have a more complex data payload.
  • @robertbastian The optimization could work for Cow<'static> something. Which can include both strings and patterns.
  • @sffc This requires us to use the Yoke code path in DataPayload. Which might make "alloc" features harder to add.
  • @Manishearth We've resolved that problem. But we need to use ZeroFrom. It's unfortunate that we can't use StaticRef; but I'm okay with that.
  • @Manishearth I think there's one cost left. The Yoke code path may still link in a Drop impl, which could include panics and things. Code that used from_static_ref linked a constructor.
  • @robertbastian We could add a third variant to DataPayload. The Yoke variant is being abused as the Owned variant. This can be Copy if we have a Cow where we can turn off the alloc variant.
  • @Manishearth The specific cost is, even in a crate that is using alloc but cares about code size, Drop and panic impls get linked. Which can be a code size bloat that could be significant. One of the answers is, don't use alloc if you care about code size. I don't think we can solve the problem by adding a third variant to DataPayload. The destructor of Rc is also a problem, but it can be improved a bit. That's something that can be fixed upstream.
  • @sffc Another approach is for finely sliced data markers to specify the owned and borrowed versions separately.
  • @robertbastian We don't bake the owned, we stuff it into a zerovec and bake the zerovec.
  • @sffc We end up with a static reference to the ULE type not the data struct value type
  • @robertbastian: Right now icu_provider is independent of ULE shenanigans.
  • @sffc: DataMarker gains a new optional field for the borrowed variant. Providers that benefit from that can use it.
  • @Manishearth not a fan of adding a new variant
  • @sffc solution has some merits, cost is it makes our type system a bit more complicated.
  • @Manishearth I think we have a solution here already; the downside of the solution is "yoke cost", which we've solved in no_alloc and is mostly gone in alloc, except for causing a small code size bloat in the situation where you have enabled debug assertions in std (unusual), care heavily about code size but have alloc enabled (unusual), and are avoiding linking in panic formatting code.
  • @sffc We know that ZeroFrom, DataPayload::get, and Cow::get are not completely free. We could, in principle, avoid all of those costs if we added a Borrowed associated type.
  • @Manishearth I think these costs are very minor relative to a DataProvider load operation.
  • @sffc The DataPayload load operation can be quite cheap, especially if we give it certain optimizations.
  • @Manishearth But still on a different order of magnitude than these pointer indirections.
  • @robertbastian The reason you may think get is expensive is not because of the discriminant/branch, it's expensive because the owned data payload cannot perform inlining across the get, which you can do with static. That's why we have borrowed. This is solved by borrowed variants, doesn't need to be solved in data payload any more.
  • @Manishearth: Solution: we could make bake export a function on data markers that have this optimization applied.
    • Example: pub fn load_direct_display_names_v1(req: DataRequest) -> &'static str
  • @sffc How do we get the type &'static str?
  • @Manishearth It can be a macro thing.
  • @Manishearth In DisplayNames we don't even need to do DisplayNamesBorrowed, we can just have this be a pub function.

Plan for #5230:

  • Have a "varzerovec optimization" for databake that, instead of doing the regular array of data structs, does a single varzerovec of data structs. The load impl converts to a yoke, which is not much of a cost, and not a cost at all in no_alloc. (A sketch of the resulting layout follows this plan.)
  • This optimization is triggered on a hardcoded list of concrete data structs (Cow<'static, str>, ZeroCow<'static, Foo>, ...) in the baked exporter, using Any downcasting to obtain the ULE bytes.
  • This optimization never triggers on "empty" (no locales/data) datagen invocations, like stubdata.
  • Future improvement:
    • The optimization can also generate a Baked::load_direct_{data_marker}() function that can be used for borrowed-variants or for direct-loading free functions. This is not necessary at the start but can be added as necessary.
    • For handling the "empty" case we may need to hardcode markers, or have this in DataMarkerInfo. We're not fully agreed on the best way to do this.
    • Note: we do have code currently assuming the variant of the DataPayload and unwrapping; this would heuristically change the variant that is returned from load. We will need to be aware of this if we go ahead with this optimization.

Agree: @Manishearth @robertbastian @sffc
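
To illustrate what the layout change amounts to, here is a toy sketch (plain statics and made-up names, not actual databake output): instead of one baked static per data struct, all values live in one contiguous buffer with an offset table, which is roughly what a single varzerovec of data structs gives us, and the load impl slices into it.

// Hypothetical layout; real databake would emit zerovec types and the load
// impl would wrap the result in a yoke.
static BUFFER: &str = "PanamaGermanyUnited States";
static OFFSETS: &[usize] = &[0, 6, 13, 26];

// What a generated load_direct-style function could boil down to.
fn load_direct(index: usize) -> Option<&'static str> {
    let start = *OFFSETS.get(index)?;
    let end = *OFFSETS.get(index + 1)?;
    Some(&BUFFER[start..end])
}

fn main() {
    assert_eq!(load_direct(1), Some("Germany"));
    assert_eq!(load_direct(3), None);
}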

Possible plans for key table optimization (#3260):

We have two possible optimizations (options 2 and 3 below). Both involve a multi-level scheme where the actual aux keys stored in the trie are integers instead of region codes or units, and we use a separate shared singleton lookup table that maps aux keys to the integers that get stored in the trie. (A sketch follows the list below.)

  1. Do not do anything special: let ZeroTrie solve it
  2. Apply the side-table optimization internally in databake and postcard.
    • It may need BlobSchemaV4. Downside: new schema.
    • Or we could store the side-table in a \0t key, for example.
  3. Use a checksum, with a singleton key defining the key table. The logic will have to be done manually in the component. Downside: opaque to filtering
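
A minimal sketch of the shared key-table idea from option 3 (toy std types and made-up data; a real implementation would use ZeroTrie/ZeroVec, and the component would wire up the checksum logic manually, per the note above):

use std::collections::HashMap;

// Shared, checksummed singleton: maps attribute strings (region codes, unit
// names, ...) to the small integers used as keys in every locale payload.
struct KeyTable {
    checksum: u64,
    attributes: Vec<String>, // index in this Vec == integer key
}

// Per-locale payload keyed by the integers instead of repeating the strings.
struct LocalePayload {
    key_table_checksum: u64,
    names: HashMap<u16, String>,
}

fn display_name<'a>(table: &KeyTable, payload: &'a LocalePayload, region: &str) -> Option<&'a str> {
    // The checksum guards against mixing a payload with a key table from a
    // different data build; the integer keys are only meaningful together.
    assert_eq!(table.checksum, payload.key_table_checksum);
    let index = table.attributes.iter().position(|a| a.as_str() == region)? as u16;
    payload.names.get(&index).map(String::as_str)
}

fn main() {
    let table = KeyTable {
        checksum: 0xC0FFEE,
        attributes: vec!["DE".into(), "PA".into(), "US".into()],
    };
    let fr = LocalePayload {
        key_table_checksum: 0xC0FFEE,
        names: HashMap::from([(0, "Allemagne".into()), (1, "Panama".into())]),
    };
    assert_eq!(display_name(&table, &fr, "DE"), Some("Allemagne"));
}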
