
Pass pointers to C as unsafe.Pointer, not uintptr (memory corruption) #91

Merged — Viq111 merged 10 commits into 1.x from uintptr-bug on Feb 16, 2021

Conversation

@evanj (Contributor) commented Feb 7, 2021

The unsafe package says: "Conversion of a uintptr back to Pointer is
not valid in general." This code was violating that rule by passing
uintptr values to C, which would then interpret them as pointers. This
causes memory corruption: #90

This change replaces all uses of uintptr with unsafe.Pointer to avoid
this memory corruption. This has the disadvantage of marking every
argument as escaping to the heap, which means the buffers used to call
the zstd functions must be allocated on the heap. I suspect this is not
a huge problem, since I would expect high-performance code to already
be managing its zstd buffers carefully.

The bug is as follows:

  • In zstd_stream_test.go: payload := []byte("Hello World!") is
    marked as "does not escape to heap" (from go test -gcflags -m).
    Therefore, it is allocated on the stack.
  • The test calls Writer.Write, which converts the argument to uintptr:
    srcPtr = uintptr(unsafe.Pointer(&srcData[0]))
  • Writer.Write then calls Cgo: C.ZSTD_compressStream2_wrapper(...)
  • The Go runtime decides the stack needs to be larger, so it copies
    it to a new location.
  • (Another thread): The Go runtime decides to reuse the old stack
    location, so it replaces the "Hello World!" bytes with new data.
  • (Original thread): Calls zstd, which reads the wrong bytes.
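
To make the pattern concrete, here is a minimal sketch of the broken and fixed
call styles. This is illustrative only: the C helper below is a hypothetical
stand-in for the library's cgo wrappers, and only the pointer-passing pattern
is the point.

    package main

    /*
    #include <stddef.h>
    #include <string.h>

    // Hypothetical C helper that reads srcSize bytes from src.
    static void consume(const void *src, size_t srcSize) {
        char tmp[64];
        memcpy(tmp, src, srcSize < sizeof(tmp) ? srcSize : sizeof(tmp));
    }
    */
    import "C"

    import "unsafe"

    // consumeBuggy launders the pointer through a uintptr variable before the
    // cgo call. The runtime no longer tracks it as a pointer, so if the
    // goroutine stack is grown/moved (or the old stack is reused) before C
    // reads the address, C sees stale bytes. This mirrors the bug above.
    func consumeBuggy(src []byte) {
        srcPtr := uintptr(unsafe.Pointer(&src[0]))
        C.consume(unsafe.Pointer(srcPtr), C.size_t(len(src)))
    }

    // consumeFixed converts to unsafe.Pointer directly in the call expression,
    // so the cgo pointer-passing rules keep the memory valid for the duration
    // of the call. The cost is that the argument now escapes to the heap.
    func consumeFixed(src []byte) {
        C.consume(unsafe.Pointer(&src[0]), C.size_t(len(src)))
    }

    func main() {
        consumeFixed([]byte("Hello World!"))
    }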

This change adds a test which nearly always crashes for me. While
investigating the other uses of uintptr, I also was able to trigger
a similar crash when calling CompressLevel, but only with:
GODEBUG=efence=1 go test .

I also added tests for Ctx.CompressLevel because it was not
obviously being tested. I did not reproduce this problem with that
function, but I suspect the same bug exists, since it uses the same
pattern.

For a minimal reproduction of this bug, see:
https://github.com/evanj/cgouintptrbug

@evanj (Contributor, Author) commented Feb 7, 2021

Based on this test, we should probably change the CircleCI configuration to run the tests with GODEBUG=efence=1 to guard against these kinds of extremely subtle issues. I can make an attempt at that change at some point later.

Viq111 added a commit that referenced this pull request Feb 8, 2021
See #91
A bug was found that can only be consistently reproduced with efence=1, so enable it in the tests to make sure we catch it
Viq111 added a commit that referenced this pull request Feb 8, 2021
See #91

We need to add a test with efence enabled.
I tried to add it directly to the current suite, but unfortunately efence uses a lot of memory, so we can only test it with a smaller payload; take the license file as an example.
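
As a hedged sketch of what that gating could look like (the helper name and
test name here are illustrative, not the actual code in this commit), the
memory-hungry cases can simply be skipped when efence is enabled:

    package zstd

    import (
        "os"
        "strings"
        "testing"
    )

    // efenceEnabled reports whether the test process was started with
    // GODEBUG=efence=1. Illustrative helper; the real commit may gate this
    // differently.
    func efenceEnabled() bool {
        return strings.Contains(os.Getenv("GODEBUG"), "efence=1")
    }

    func TestStreamVeryLargePayload(t *testing.T) {
        if efenceEnabled() {
            t.Skip("efence makes large allocations too expensive; covered with a smaller payload instead")
        }
        // ... exercise compression/decompression with a large payload here ...
    }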
@Viq111 (Collaborator) commented Feb 8, 2021

Thanks for the PR!
I tried to import the tests onto the 1.x branch (without the fix) and ran with efence on CircleCI, but unfortunately it still looks like it's hard to replicate (i.e., the tests pass without the fix: https://app.circleci.com/pipelines/github/DataDog/zstd/61/workflows/a2a5fbd9-d76e-4de8-b981-5510372bf198/jobs/213)

@evanj (Contributor, Author) commented Feb 9, 2021

Rats, you are right. I swear this was crashing reliably for me! Let me investigate a bit more...

@evanj (Contributor, Author) commented Feb 9, 2021

My test for TestCompressLevel was incorrect; I think I screwed up pulling the version of the test into my branch. I'll fix this PR in a moment, but it does fail in CircleCI: https://app.circleci.com/pipelines/github/DataDog/zstd/64/workflows/e8bc7fb9-a336-4f30-b077-42c6da4206b4/jobs/225

The TestStreamCompressionDecompressionParallel test does not fail with GODEBUG=efence=1 which ... surprises me? I get memory corruption pretty reliably when running that test by itself on my machine, but not on CircleCI. I'll investigate that a bit more ...

This bug is very sensitive to stack sizes. Calling Decompress caused
that function to copy the stack, which then avoided the bug.
@evanj (Contributor, Author) commented Feb 9, 2021

Okay, I have the other test for stream compression crashing in CircleCI as well: I needed to use more goroutines for some reason. https://app.circleci.com/pipelines/github/DataDog/zstd/67/workflows/c164d97f-9db4-46d5-83cf-2c9fa37b447f/jobs/235

This bug interestingly seems to go away with GODEBUG=efence=1; I think because that flag ensures that memory allocations never reuse the same memory. It only marks pages as "inaccessible" for large allocations, so I need to make the stack really big to get it to crash for me on multiple systems. I'm going to update the tests in this PR with the ones that seem to be more reliable.

In summary: this bug is very tricky, and may not happen in some cases!
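
For reference, a hedged sketch of the recursive stack-growth trick mentioned
here (names, sizes, and depth are illustrative; the actual tests added in this
PR may differ): grow the goroutine stack with a recursive helper so it has been
copied at least once before calling into the compressor.

    package zstd

    import "testing"

    // useStackSpaceCompress burns stack space recursively so the goroutine
    // stack must grow (and therefore be copied) before the Compress call at
    // the bottom of the recursion. Illustrative only.
    func useStackSpaceCompress(t *testing.T, depth int, payload []byte) byte {
        if depth == 0 {
            compressed, err := Compress(nil, payload)
            if err != nil {
                t.Fatal(err)
            }
            decompressed, err := Decompress(nil, compressed)
            if err != nil {
                t.Fatal(err)
            }
            if string(decompressed) != string(payload) {
                t.Fatalf("round trip mismatch: %q", decompressed)
            }
            return 0
        }
        var pad [1024]byte // each frame adds roughly 1 KB to the stack
        copy(pad[:], payload)
        return pad[0] + useStackSpaceCompress(t, depth-1, payload)
    }

    func TestCompressWithDeepStack(t *testing.T) {
        // A small payload that escape analysis could otherwise keep on the stack.
        useStackSpaceCompress(t, 1000, []byte("Hello World!"))
    }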

@Viq111 (Collaborator) left a comment


Thanks for finding this!
It's a bit unfortunate we had to walk back some of the optimizations, but it definitely makes sense given the potential memory corruption.

Let me know if you are able to get a test we can consistently reproduce on CircleCI. Ideally we'd want that before merging this PR, and if needed we can clean up test helpers like stackDepth if it turns out they cannot reproduce the bug.

@Viq111 (Collaborator) commented Feb 11, 2021

I ran the benchmarks against the same payload we use in the tests (i.e., mr from http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia), which is a fairly large ~9 MB file, on a MacBook Pro 2019, and there is no notable change. I'm guessing it will likely have a slight impact on very small payloads (where you need to make a lot of calls into cgo):

goos: darwin
goarch: amd64
ᐅ benchstat before uintptr-bug
name                    old time/op   new time/op   delta
CtxCompression-16        91.7ms ± 6%   90.9ms ± 3%    ~     (p=0.574 n=8+8)
CtxDecompression-16      11.7ms ± 1%   11.8ms ± 1%  +0.50%  (p=0.024 n=9+9)
StreamCompression-16     96.4ms ± 2%   95.3ms ± 2%    ~     (p=0.055 n=8+10)
StreamDecompression-16   14.0ms ± 2%   13.9ms ± 1%    ~     (p=0.720 n=10+9)
Compression-16           91.2ms ± 3%   90.5ms ± 3%    ~     (p=0.579 n=10+10)
Decompression-16         11.8ms ± 2%   11.8ms ± 2%    ~     (p=0.853 n=10+10)

name                    old speed     new speed     delta
CtxCompression-16       109MB/s ± 6%  110MB/s ± 3%    ~     (p=0.574 n=8+8)
CtxDecompression-16     851MB/s ± 1%  847MB/s ± 1%  -0.49%  (p=0.024 n=9+9)
StreamCompression-16    103MB/s ± 2%  105MB/s ± 2%    ~     (p=0.055 n=8+10)
StreamDecompression-16  713MB/s ± 2%  717MB/s ± 1%    ~     (p=0.720 n=10+9)
Compression-16          109MB/s ± 3%  110MB/s ± 2%    ~     (p=0.543 n=10+10)
Decompression-16        846MB/s ± 2%  847MB/s ± 2%    ~     (p=0.853 n=10+10)

@Viq111 mentioned this pull request Feb 11, 2021
@Viq111 added the blocker (Something that needs to be fixed before next release) and bug labels Feb 11, 2021
@evanj (Contributor, Author) commented Feb 11, 2021

I believe the tests that I added in commit 88ff5e1 do reproduce consistently on CircleCI. I needed to increase the number of goroutines for the parallel test, then use the recursive-function stack-depth trick to get it to crash with GODEBUG=efence=1. It would be great if you want to check that I didn't screw this up again, but I did verify these runs separately:

https://app.circleci.com/pipelines/github/DataDog/zstd/64/workflows/e8bc7fb9-a336-4f30-b077-42c6da4206b4/jobs/225
https://app.circleci.com/pipelines/github/DataDog/zstd/67/workflows/c164d97f-9db4-46d5-83cf-2c9fa37b447f/jobs/235

@Viq111 (Collaborator) commented Feb 12, 2021

Thanks! I also ran the tests on my branch and can confirm: https://app.circleci.com/pipelines/github/DataDog/zstd/79/workflows/a3fd4930-b663-4d4c-a57a-2a9c00740a70

go-efence seems to catch it. Without efence, go-1.15 failed but go-1.14 succeeded, so it seems we can't replicate it 100% of the time. I'm going to import the CircleCI config to add the efence test to make sure we catch these in the future as well, and then this is good to go!

See #91
We caught a nasty bug caused by a misuse of uintptr in Go. That was fixed in that PR,
but also add a test case to the CircleCI matrix to make sure we catch these in the
future if they happen again
Unfortunately, linux32 limits us to 3 GB of memory on 32-bit systems.
We will need to disable the memory-hungry tests there.

We also only need 4 GB for the Docker instance
@Viq111 (Collaborator) commented Feb 12, 2021

@evanj I think I fixed the remaining test problems with the last 4 commits.
Let me know if they look good to you and I can merge

@evanj (Contributor, Author) commented Feb 16, 2021

It is very annoying that this bug depends on many implementation-specific details, so it is hard to replicate. These test fixes look good to me. Thanks for checking and being so persistent with this!

@Viq111 merged commit 8cb8bac into 1.x on Feb 16, 2021
@Viq111 deleted the uintptr-bug branch on February 16, 2021 16:11