
[zstd] remove src minimal size limit on decompressSizeHint #118

Closed
wants to merge 3 commits

Conversation

XiaochenCui

In some circumstances we want to force the zstd-decompressed data into a buffer we have pre-allocated, regardless of whether the compressed input is smaller than `zstdFrameHeaderSizeMax`.

For example, in our OLAP system kernel we have a unified buffer pool that manages all bytes allocated for queries. An accidentally created large array has a fatal impact on the correctness and performance of the system. (In this case, any compressed data smaller than 18 bytes results in a newly allocated byte slice of 1 MB, which is intolerable.) A minimal sketch of the pattern we are after follows the list below.

I also considered other workarounds:

- provide a global option to control how decompression buffers are allocated; the pro is that it keeps our conditional branches simpler, but the cons are that it makes the library harder to maintain and gets tricky when different components of a system want different strategies.
- provide a new decompression API, which would break the consistency of the library.
- provide a new argument in the old API, with the same downside as the previous point.
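
For illustration, here is a minimal sketch of the pattern we are after (the 64-byte pool capacity is only an example value; `Compress` and `Decompress` with a caller-supplied destination slice are this library's existing API):

package main

import (
	"fmt"

	"github.com/DataDog/zstd"
)

func main() {
	// A slice handed out by our (hypothetical) buffer pool, sized from
	// metadata we already track for the column.
	pooled := make([]byte, 0, 64)

	compressed, err := zstd.Compress(nil, make([]byte, 52))
	if err != nil {
		panic(err)
	}

	// We want the decompressed bytes to land in `pooled` whenever its
	// capacity is sufficient, instead of in a freshly allocated 1 MB slice.
	out, err := zstd.Decompress(pooled, compressed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("decompressed %d bytes, cap of result: %d\n", len(out), cap(out))
}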

@XiaochenCui
Author

Closed because the tests did not pass.

Restricted the minimum hint size to 1 to prevent an `index out of range` panic on slice access.
@XiaochenCui
Author

Bug fixed

@XiaochenCui XiaochenCui reopened this May 23, 2022
@XiaochenCui
Author

@Viq111
I would sincerely appreciate a review, or a better solution to this issue.

@Viq111
Collaborator

Viq111 commented May 26, 2022

Hello @XiaochenCui, thanks for the PR.
Would you have an example payload that is smaller than 18 bytes and still valid zstd?
Indeed, the 18 bytes for the frame header is the maximum; we could change the check to the minimum, which is 2 bytes.

@XiaochenCui
Author

Hello @XiaochenCui, thanks for the PR. Would you have an example payload that is smaller than 18 bytes and still valid zstd? Indeed, the 18 bytes for the frame header is the maximum; we could change the check to the minimum, which is 2 bytes.

Yes, we have a compressed column in our OLAP system with a length of 16 bytes. It's generated from a buffer of 52 zero bytes.

Here is the code:

originalData := make([]byte, 52)
logging.Printf("len of original data: %d", len(originalData))

compressedData := Compress(CompressZSTD, originalData)
logging.Printf("len of compressed data: %d, content: %v", len(compressedData), compressedData)

The `Compress` function is just a wrapper around `zstd2.CompressLevel(...)`.

And here is the output:

2022-05-29 16:56:50 | INFO  | compress_test.go:90 > len of original data: 52
2022-05-29 16:56:50 | INFO  | compress_test.go:93 > len of compress data: 16, content: [40 181 47 253 32 52 61 0 0 8 0 1 0 224 116 66]
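
And the decompression step where we hit the large allocation looks roughly like this, continuing the snippet above (the 64-byte pre-allocated capacity is just an example of a pool-provided buffer):

dst := make([]byte, 0, 64)
decompressed, err := zstd2.Decompress(dst, compressedData)
if err != nil {
	panic(err)
}
// Because the size hint only inspects the frame header when the input is at
// least zstdFrameHeaderSizeMax (18) bytes long, this 16-byte input falls back
// to the ~1 MB default hint and `decompressed` comes back as a newly
// allocated slice instead of reusing `dst`.
logging.Printf("len of decompressed data: %d, cap: %d", len(decompressed), cap(decompressed))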

@XiaochenCui
Copy link
Author

With zstd 1.5.0, the library always used the buffer we supplied for the decompressed data.

However, with zstd 1.5.2 it allocates a huge slice of about 1 MB whenever the compressed data is smaller than 18 bytes, which is a critical problem for both performance and consistency. We downgraded the library to 1.5.0 to work around this, but I really want to solve it at the root. (I believe the root cause is `decompressSizeHint`; tell me if it isn't.)

@XiaochenCui
Author

Here is the code in a form that is more convenient for you:

originalData := make([]byte, 52)
logging.Printf("len of original data: %d", len(originalData))

// compressedData = Compress(CompressZSTD, originalData)
compressedData, err := zstd2.CompressLevel(make([]byte, 0, len(originalData)), originalData, zstd2.DefaultCompression)
if err != nil {
	panic(err)
}
logging.Printf("len of compressed data: %d, content: %v", len(compressedData), compressedData)

@XiaochenCui
Author

@Viq111
I have updated the minimum hint size to `zstdFrameHeaderSizeMin`, which is 2.
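
For clarity, the intent is only to relax the lower bound checked before reading the frame header; roughly like this (a hypothetical sketch, not the actual diff, and every name except `zstdFrameHeaderSizeMin` is made up for illustration):

const (
	zstdFrameHeaderSizeMin = 2       // minimum zstd frame header size, as discussed above
	defaultDecompressHint  = 1 << 20 // the ~1 MB fallback hint mentioned earlier
)

// sizeHintSketch falls back to the large default only when the input cannot
// even contain a minimal frame header, instead of requiring the maximal
// 18-byte header as before.
func sizeHintSketch(src []byte, frameContentSize func([]byte) int) int {
	if len(src) < zstdFrameHeaderSizeMin {
		return defaultDecompressHint
	}
	if hint := frameContentSize(src); hint > 0 {
		return hint
	}
	return defaultDecompressHint
}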

By the way, why do we need to "prevent DOS from maliciously-created payloads"? I'm not well versed in this area of application security.

Viq111 added a commit that referenced this pull request Jun 2, 2022
Fix #118
With the introduction of #115 to prevent zip bombs, the check was too strict and used too large a size boundary. This fixes it and adds a test.
@Viq111
Collaborator

Viq111 commented Jun 2, 2022

Hello!

Ah, I was working on the code in parallel yesterday; here is the PR with a test, #120, which should fix your issue (thanks for reporting!)

By the way, why do we need to "prevent DOS from maliciously-created payloads"? I'm not well versed in this area of application security.
See the original PR that introduced the change: #115 (and related issue #60) for the explanation.

If that's OK, I will merge #120 instead of this one, as it has the added test?

@XiaochenCui
Author

Hello!

Ah, I was working on the code in parallel yesterday; here is the PR with a test, #120, which should fix your issue (thanks for reporting!)

By the way, why do we need to "prevent DOS from maliciously-created payloads"? I'm not well versed in this area of application security.
See the original PR that introduced the change: #115 (and related issue #60) for the explanation.

If that's OK, I will merge #120 instead of this one, as it has the added test?

Sure, I've reviewed that PR and it's great! Thanks for tracking down this issue.

@XiaochenCui XiaochenCui closed this Jun 4, 2022
kodiakhq bot referenced this pull request in cloudquery/cloudquery Jan 1, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [github.com/DataDog/zstd](https://github.com/DataDog/zstd) | indirect | patch | `v1.5.0` -> `v1.5.2` |

---

### Release Notes

<details>
<summary>DataDog/zstd</summary>

### [`v1.5.2`](https://github.com/DataDog/zstd/releases/tag/v1.5.2): zstd 1.5.2

[Compare Source](https://github.com/DataDog/zstd/compare/v1.5.2...v1.5.2)

This release updates the upstream zstd version to [1.5.2](https://github.com/facebook/zstd/releases/tag/v1.5.2) ([https://github.com/DataDog/zstd/pull/116](https://github.com/DataDog/zstd/pull/116))

The update `1.5.0` -> `1.5.2` overall has a similar performance profile. Please note that depending on the workload, performance could vary by -10% / +10%

### [`v1.5.2+patch1`](https://github.com/DataDog/zstd/releases/tag/v1.5.2%2Bpatch1): zstd 1.5.2 - wrapper patches 1

[Compare Source](https://github.com/DataDog/zstd/compare/v1.5.0...v1.5.2)

#### What's Changed

-   Fix unnecessarily allocated large decompression buffer by [@&#8203;XiaochenCui](https://github.com/XiaochenCui) ([#&#8203;118](https://github.com/DataDog/zstd/issues/118)) & [@&#8203;Viq111](https://github.com/Viq111) in [https://github.com/DataDog/zstd/pull/120](https://github.com/DataDog/zstd/pull/120)
-   Add SetNbWorkers api to the writer code (see [#&#8203;108](https://github.com/DataDog/zstd/issues/108)) by [@&#8203;bsergean](https://github.com/bsergean) in [https://github.com/DataDog/zstd/pull/117](https://github.com/DataDog/zstd/pull/117)
    -   For large workloads, the performance can be improved by 3-6x (see [https://github.com/DataDog/zstd/pull/117#issuecomment-1147812767](https://github.com/DataDog/zstd/pull/117#issuecomment-1147812767))
    -   `Write()` becomes async with workers > 1; make sure you read the method documentation before using it

#### New Contributors

-   [@&#8203;bsergean](https://github.com/bsergean) made their first contribution in [https://github.com/DataDog/zstd/pull/117](https://github.com/DataDog/zstd/pull/117)
-   [@&#8203;XiaochenCui](https://github.com/XiaochenCui) for his work on [https://github.com/DataDog/zstd/pull/118](https://github.com/DataDog/zstd/pull/118) that led to [#&#8203;120](https://github.com/DataDog/zstd/issues/120)

**Full Changelog**: DataDog/zstd@v1.5.2...v1.5.2+patch1

</details>
