Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global tables should be initialised data, not populated at run time #126

Open
adrianmay opened this issue Mar 10, 2025 · 8 comments
Open

Comments

@adrianmay
Copy link

adrianmay commented Mar 10, 2025

"Initialised data" is what you get in C if you write "int a=1;" at global scope. It's as much a part of the executable or library image as the machine code and there's no need to populate it at run time.

fec contains some global constant tables with values that are identical for all users. These are populated by "initialize". If these were initialised data instead, there'd be no "initialize" function and no need to check whether or not it had been called.

This would also provide the overwhelming advantage that the whole zfec library would be pure (i.e. not need IO in Haskell). An initialize function (even if called internally) forces the whole interface to return IO to tell Haskell to call it. It's clear that people want this because the current code achieves it with unsafePerformIO, but this causes a bug at present where Haskell doesn't know that the other functions only work after calling initialize. #84 knew this and advocated IO on the interface, as does my present draft MR, but I think the approach I describe here would be best.

The tables are about 65k, which is not large, but even if they were larger, it would be hard to argue that RAM is at less of a premium than disk, and having it on disk allows the RAM to load it on demand and discard it when pressure on RAM is high.

This would be accomplished by running the logic of "initialize" in isolation, printing the results in C syntax, and using that as a source file. For bonus points, this procedure could be run as a pre-build step, but personally I'd be just as happy to generate this source file once and check it into source control (also retaining the code that generated it).

@hacklschorsch
Copy link
Member

This would also fix #105 if I am not mistaken?

@meejah
Copy link
Collaborator

meejah commented Mar 13, 2025

Today @shapr and I spent a little bit of pairing time thinking about this. We agree "in general" that static data is better for this case than runtime-generated tables -- 17 years ago the answer may indeed have been the implemented solution.

It's not immediately obvious to me that there is "a table" (versus one for every n, k combination). Is this indeed just a single table, or does it depend on the N and K values?

Also, the whole concept of a static table smells like premature optimization to us -- do we need tables at all here? Can we just "do the math directly" (and let the optimizer do its thing)? Is there enough data here to blow out L1 or even L2 caches, such that it's actually counter-productive? (From some quick searching it seems that Haswell has 32KiB L1 caches, although it's not immediately obvious if that means "per die" or "per core".

@adrianmay
Copy link
Author

The tables that depend on n and k will continue to be built when fec is called. This ticket pertains to the tables that currently get set up by fec_init which takes no parameters.

I agree that flooding the cache just for the sake of multiplication is probably bad for performance. But is it just a simple multiplication? I haven't looked in detail, but there must be some kind of modulo arithmetic going on. I'll look more closely and comment again.

@adrianmay
Copy link
Author

It's definitely not a simple multiplication table. For instance, the 3s row goes 0,3,6,5,12,15,10,9...

It's symmetrical about the diagonal so we could in principle halve the size of it, but it would be a pain.

BTW, there are three smaller tables as well as the 256*256 byte "multiplication" table.

So shall I proceed?

@meejah
Copy link
Collaborator

meejah commented Mar 13, 2025

Yeah, it's not simple and we didn't spend enough to time understand precisely what it's even doing ... however, without performance tests proving that this is better, my instinct is still to "let the optimizer optimize".

Of course, you're not really proposing that here (right?) just to change when/how the table is populated.

I've not really delved into this library much before, so I'm not sure how much pain this relieves or doesn't ;) so probably not the best person (at least without spending more time) to answer what the best use of time here is

@exarkun
Copy link
Member

exarkun commented Mar 13, 2025

This isn't proof one way or another but the current implementation closely follows https://dl.acm.org/doi/10.1145/263876.263881 which has performance and efficiency considerations. The paper is quite old so it's entirely possible that the tradeoffs it considers are not appropriate to hardware currently in use.

However, another thing pointed out there is that the matrix is indeed not "simple multiplication". I very much doubt that zfec would perform acceptable without it since every single byte of input is transformed by some element of this matrix and and that transformation would involve many times more operations without the pre-computed matrix.

The question of when exactly the matrix is built seems worth considering but it seems to me that it does need to be built somewhere prior to an actual encode / decode operation.

@adrianmay
Copy link
Author

My motivation isn't primarily performance. It's to alleviate the need to call initialize. It seems like tighter programming to eliminate the uninitialised state altogether. Haskell has is own reason to care about this because with initialise the whole interface has to be in IO to avoid laziness bugs, but without initialize it could all safely be pure.

As JP says, there has to be a table, and it has to be in RAM when it's in use. After the change I'm proposing, it will always be on disk so it only needs to be loaded on demand and can be discarded if it hasn't been used for a while. But if it's been calculated at run time, it has to be preserved in RAM all the time. That's one performance implication. The other is that calculating it at run time wastes time. I can't see a performance disadvantage.

BTW, it took me only an hour to implement this on the train this morning.

@meejah
Copy link
Collaborator

meejah commented Mar 13, 2025

I certainly have no objections to making the table static data vs. runtime generated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants