Support creating smaller eStargz images (--estargz-external-toc and --estargz-min-chunk-size) #956

Merged: 1 commit, Nov 2, 2022

@ktock ktock commented Oct 21, 2022

This commit lets users optionally create smaller eStargz images by passing the following flags to ctr-remote i convert or ctr-remote i optimize:

  • --estargz-external-toc: Separates the TOC JSON into another image (called the "TOC image"). The resulting eStargz doesn't contain the TOC, so it is expected to be smaller than a normal eStargz.

  • --estargz-min-chunk-size: Specifies the minimum number of bytes of data that must be written in one gzip stream. If it's > 0, multiple files and chunks can be written into one gzip stream. Fewer gzip headers and a smaller resulting blob can be expected.

About --estargz-external-toc

eStargz supports separating the TOC into an external image called the TOC image. This type of eStargz is the same as a normal eStargz, except that the layer blob doesn't contain the TOC JSON file (stargz.index.json) and has a special footer.

A TOC image is an OCI image containing TOCs. Each layer contains a TOC JSON file (stargz.index.json) in its root directory. Each layer descriptor in the manifest of the TOC image must carry the annotation containerd.io/snapshot/stargz/layer.digest, whose value is the digest of the eStargz layer corresponding to that TOC.

{
  "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
  "digest": "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
  "size": 154425,
  "annotations": {
    "containerd.io/snapshot/stargz/layer.digest": "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
  }
}

Stargz snapshotter uses this annotation to find the TOC JSON of the eStargz layer being mounted. Currently, it assumes that the TOC image's reference name is the eStargz image's reference name with an -esgztoc suffix. For example, if an eStargz image is named ghcr.io/stargz-containers/ubuntu:22.04-esgz, stargz snapshotter acquires the TOC image from ghcr.io/stargz-containers/ubuntu:22.04-esgz-esgztoc. Note that future versions of stargz snapshotter will support more ways to locate the TOC image (e.g. allowing a custom suffix, using OCI Reference Type, etc.).
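The lookup described above can be sketched in Go as follows. This is a minimal illustration, not the snapshotter's actual code: tocImageRef and findTOCLayer are hypothetical helpers, the descriptor and manifest types cover only the fields needed here, and the sample manifest simply wraps the descriptor shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// layerDigestAnnotation is the annotation key named in the description above.
const layerDigestAnnotation = "containerd.io/snapshot/stargz/layer.digest"

// descriptor holds only the OCI descriptor fields this sketch needs.
type descriptor struct {
	MediaType   string            `json:"mediaType"`
	Digest      string            `json:"digest"`
	Size        int64             `json:"size"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// manifest holds only the layer list of an OCI image manifest.
type manifest struct {
	Layers []descriptor `json:"layers"`
}

// tocImageRef (hypothetical helper) appends the currently hardcoded suffix.
func tocImageRef(esgzRef string) string {
	return esgzRef + "-esgztoc"
}

// findTOCLayer (hypothetical helper) returns the TOC image layer whose
// annotation points at the given eStargz layer digest.
func findTOCLayer(manifestJSON []byte, esgzLayerDigest string) (descriptor, error) {
	var m manifest
	if err := json.Unmarshal(manifestJSON, &m); err != nil {
		return descriptor{}, fmt.Errorf("parsing manifest: %w", err)
	}
	for _, l := range m.Layers {
		if v, ok := l.Annotations[layerDigestAnnotation]; ok && v == esgzLayerDigest {
			return l, nil
		}
	}
	return descriptor{}, fmt.Errorf("no TOC layer found for %s", esgzLayerDigest)
}

// sampleManifest wraps the descriptor from the description in a minimal manifest.
const sampleManifest = `{
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
      "size": 154425,
      "annotations": {
        "containerd.io/snapshot/stargz/layer.digest": "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
      }
    }
  ]
}`

func main() {
	fmt.Println(tocImageRef("ghcr.io/stargz-containers/ubuntu:22.04-esgz"))

	toc, err := findTOCLayer([]byte(sampleManifest),
		"sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(toc.Digest, toc.Size)
}
```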

Usage

convert:

# ctr-remote i pull ghcr.io/stargz-containers/ubuntu:22.04
# ctr-remote i convert --oci --estargz --estargz-external-toc ghcr.io/stargz-containers/ubuntu:22.04 registry2:5000/ubuntu:22.04-ex

Layers in the eStargz image (registry2:5000/ubuntu:22.04-ex) don't contain TOC JSON. The TOC image (registry2:5000/ubuntu:22.04-ex-esgztoc) contains the TOCs of all layers of the eStargz image.

Push them to the same registry:

# ctr-remote i push --plain-http registry2:5000/ubuntu:22.04-ex
# ctr-remote i push --plain-http registry2:5000/ubuntu:22.04-ex-esgztoc

Pull it lazily:

# ctr-remote i rpull --plain-http registry2:5000/ubuntu:22.04-ex
fetching sha256:14fb0ea2... application/vnd.oci.image.index.v1+json
fetching sha256:24471b45... application/vnd.oci.image.manifest.v1+json
fetching sha256:d2e4737e... application/vnd.oci.image.config.v1+json
# mount | grep "stargz on"
stargz on /var/lib/containerd-stargz-grpc/snapshotter/snapshots/1/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)

Optional --estargz-lossless flag for lossless conversion

ctr-remote i convert supports an optional flag, --estargz-lossless, that can be specified together with --estargz-external-toc. It converts an image to eStargz without changing the diffID (uncompressed digest), so that even an eStargz-agnostic gzip decompressor (e.g. gunzip) can restore the original tar blob. --estargz-record-in can't be used with this flag.
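The property relied on here is a general one of gzip: a blob made of several concatenated gzip streams decompresses to the concatenation of their payloads, so the uncompressed digest is unaffected by how the data is split into streams. The following sketch demonstrates only that property with Go's standard library; it is not the converter's implementation, and gzipInStreams, diffID and digestOf are hypothetical helper names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
)

// gzipInStreams compresses each part as its own gzip stream and
// concatenates the streams into one blob.
func gzipInStreams(parts ...[]byte) ([]byte, error) {
	var out bytes.Buffer
	for _, p := range parts {
		zw := gzip.NewWriter(&out)
		if _, err := zw.Write(p); err != nil {
			return nil, err
		}
		if err := zw.Close(); err != nil {
			return nil, err
		}
	}
	return out.Bytes(), nil
}

// digestOf returns the SHA-256 digest of b in "sha256:<hex>" form.
func digestOf(b []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(b))
}

// diffID decompresses the blob with a plain gzip reader (which handles
// concatenated streams by default) and digests the result.
func diffID(blob []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return digestOf(raw), nil
}

func main() {
	original := []byte("stand-in for an uncompressed tar blob")
	blob, err := gzipInStreams(original[:10], original[10:])
	if err != nil {
		panic(err)
	}
	got, err := diffID(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println("diffID preserved:", got == digestOf(original))
}
```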

About --estargz-min-chunk-size

This option allows writing multiple small files into a single gzip stream. This is expected to reduce the number of gzip headers and the size of the resulting blob.

To implement this, we need to introduce a new field, innerOffset, to the TOC.
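To get a feel for the per-stream overhead, the following sketch compares the total compressed size when each small file gets its own gzip stream against the size when all files share one stream. It uses only Go's standard library on made-up stand-in files; gzipSize, compare and smallFiles are hypothetical helpers, and the numbers it prints say nothing about real images.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipSize returns the compressed size of the payloads written into a
// single gzip stream at the highest compression level.
func gzipSize(payloads ...[]byte) (int, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return 0, err
	}
	for _, p := range payloads {
		if _, err := zw.Write(p); err != nil {
			return 0, err
		}
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// compare returns the total size with one stream per file and the size
// with all files in one shared stream.
func compare(files [][]byte) (int, int, error) {
	separate := 0
	for _, f := range files {
		n, err := gzipSize(f)
		if err != nil {
			return 0, 0, err
		}
		separate += n
	}
	shared, err := gzipSize(files...)
	if err != nil {
		return 0, 0, err
	}
	return separate, shared, nil
}

// smallFiles builds n tiny stand-in files.
func smallFiles(n int) [][]byte {
	files := make([][]byte, n)
	for i := range files {
		files[i] = []byte(fmt.Sprintf("contents of small file %d\n", i))
	}
	return files
}

func main() {
	separate, shared, err := compare(smallFiles(100))
	if err != nil {
		panic(err)
	}
	fmt.Printf("one stream per file: %d B, one shared stream: %d B\n", separate, shared)
}
```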

// TOCEntry is an entry in the stargz file's TOC (Table of Contents).
type TOCEntry struct {
	// ...(omit)...

	// InnerOffset is an optional field that indicates the uncompressed offset
	// of this "reg" or "chunk" payload in the stream that starts at Offset.
	// This field makes it possible to put multiple "reg" or "chunk" payloads
	// in one chunk that share the same Offset but have different InnerOffset.
	InnerOffset int64 `json:"innerOffset,omitempty"`

	// ...(omit)...
}

This field allows the following structure.

[Figure: minchunksize(1)]

Unfortunately, older versions of stargz-snapshotter can't correctly interpret images created with innerOffset and may show a broken view of the filesystem.
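The role of innerOffset can be sketched as follows: several payloads are written into one gzip stream, and a reader reaches a given payload by decompressing from the start of the stream and discarding InnerOffset bytes. This is a simplified illustration under stated assumptions, not the estargz package's code: the file and entry types and the packShared and readEntry helpers are hypothetical, and the whole blob is a single stream at Offset 0.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// file is a stand-in for a small regular file.
type file struct {
	Name string
	Data []byte
}

// entry mirrors only the TOC fields this sketch needs.
type entry struct {
	Name        string
	Offset      int64 // compressed offset of the gzip stream in the blob
	InnerOffset int64 // uncompressed offset of the payload inside that stream
	Size        int64 // uncompressed payload size
}

// packShared writes all files into one gzip stream and records where each
// payload starts inside the uncompressed stream.
func packShared(files []file) ([]byte, []entry, error) {
	var blob bytes.Buffer
	zw := gzip.NewWriter(&blob)
	var entries []entry
	var inner int64
	for _, f := range files {
		if _, err := zw.Write(f.Data); err != nil {
			return nil, nil, err
		}
		entries = append(entries, entry{Name: f.Name, InnerOffset: inner, Size: int64(len(f.Data))})
		inner += int64(len(f.Data))
	}
	if err := zw.Close(); err != nil {
		return nil, nil, err
	}
	return blob.Bytes(), entries, nil
}

// readEntry opens the stream at e.Offset, discards the bytes that precede
// the payload, and reads exactly e.Size bytes.
func readEntry(blob []byte, e entry) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob[e.Offset:]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	if _, err := io.CopyN(io.Discard, zr, e.InnerOffset); err != nil {
		return nil, fmt.Errorf("discarding %d bytes: %w", e.InnerOffset, err)
	}
	payload := make([]byte, e.Size)
	if _, err := io.ReadFull(zr, payload); err != nil {
		return nil, fmt.Errorf("reading %q: %w", e.Name, err)
	}
	return payload, nil
}

func main() {
	blob, entries, err := packShared([]file{
		{"a.txt", []byte("AAAA")},
		{"b.txt", []byte("BB")},
		{"c.txt", []byte("CCCCCC")},
	})
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		got, err := readEntry(blob, e)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s innerOffset=%d payload=%s\n", e.Name, e.InnerOffset, got)
	}
}
```

A reader that ignores InnerOffset would return the start of the stream for every entry, which is consistent with the broken filesystem view mentioned above.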

Usage

conversion:

# ctr-remote i pull ghcr.io/stargz-containers/ubuntu:22.04
# ctr-remote i convert --oci --estargz --estargz-min-chunk-size=50000 ghcr.io/stargz-containers/ubuntu:22.04 registry2:5000/ubuntu:22.04-chunk50000
# ctr-remote i push --plain-http registry2:5000/ubuntu:22.04-chunk50000

Pull it lazily:

# ctr-remote i rpull --plain-http registry2:5000/ubuntu:22.04-chunk50000
fetching sha256:5d1409a2... application/vnd.oci.image.index.v1+json
fetching sha256:859e2b50... application/vnd.oci.image.manifest.v1+json
fetching sha256:c07a44b9... application/vnd.oci.image.config.v1+json
# mount | grep "stargz on"
stargz on /var/lib/containerd-stargz-grpc/snapshotter/snapshots/1/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)

Comparison

Comparison of image size and extraction time of a single file

This compares image sizes across different configurations. It also compares the average time taken to extract a single file from the resulting blob. Note that we use the highest gzip compression level (gzip.BestCompression) for eStargz by default. The TOC image isn't included in the image size.

kdeneon/plasma:latest

| benchmark | estargz/original (increased size) | average time to read file |
| --- | --- | --- |
| min-chunk-size=0 | 1.022 (40866736 B) | 5.83ms |
| min-chunk-size=0,externalTOC | 1.019 (35536826 B) | 5.78ms |
| min-chunk-size=1000 | 1.018 (33216221 B) | 5.71ms |
| min-chunk-size=1000,externalTOC | 1.015 (27881919 B) | 5.77ms |
| min-chunk-size=10000 | 1.006 (12021539 B) | 6.78ms |
| min-chunk-size=10000,externalTOC | 1.004 (6663504 B) | 6.75ms |
| min-chunk-size=25000 | 1.003 (6004067 B) | 8.38ms |
| min-chunk-size=25000,externalTOC | 1.000 (637889 B) | 8.13ms |
| min-chunk-size=50000 | 1.002 (3857263 B) | 10.7ms |
| min-chunk-size=50000,externalTOC | 0.999 (-1506057 B) | 10.6ms |

python:3.9-org

| benchmark | estargz/original (increased size) | average time to read file |
| --- | --- | --- |
| min-chunk-size=0 | 1.033 (11350121 B) | 4.74ms |
| min-chunk-size=0,externalTOC | 1.029 (9933107 B) | 4.74ms |
| min-chunk-size=1000 | 1.028 (9797255 B) | 4.83ms |
| min-chunk-size=1000,externalTOC | 1.024 (8378028 B) | 4.72ms |
| min-chunk-size=10000 | 1.009 (3007542 B) | 5.58ms |
| min-chunk-size=10000,externalTOC | 1.005 (1582342 B) | 5.57ms |
| min-chunk-size=25000 | 1.004 (1400329 B) | 7.12ms |
| min-chunk-size=25000,externalTOC | 1.000 (-26734 B) | 7.04ms |
| min-chunk-size=50000 | 1.002 (807473 B) | 9.91ms |
| min-chunk-size=50000,externalTOC | 0.998 (-619813 B) | 9.8ms |

ubuntu:22.04

| benchmark | estargz/original (increased size) | average time to read file |
| --- | --- | --- |
| min-chunk-size=0 | 1.066 (1996332 B) | 5.01ms |
| min-chunk-size=0,externalTOC | 1.061 (1841964 B) | 4.99ms |
| min-chunk-size=1000 | 1.058 (1756194 B) | 5.08ms |
| min-chunk-size=1000,externalTOC | 1.053 (1601588 B) | 4.99ms |
| min-chunk-size=10000 | 1.019 (589458 B) | 5.83ms |
| min-chunk-size=10000,externalTOC | 1.014 (434273 B) | 5.76ms |
| min-chunk-size=25000 | 1.009 (279215 B) | 7.52ms |
| min-chunk-size=25000,externalTOC | 1.004 (123735 B) | 7.37ms |
| min-chunk-size=50000 | 1.006 (169840 B) | 10.1ms |
| min-chunk-size=50000,externalTOC | 1.000 (14264 B) | 10.2ms |

The image gets smaller with a larger min-chunk-size and/or with an external TOC. It's even possible to make the eStargz smaller than the original image.
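As a worked example of reading the tables, the saving attributable to the external TOC at a given min-chunk-size is the difference between the two "increased size" values for that setting. The sketch below does this arithmetic for the ubuntu:22.04 rows; the row type and externalTOCSaving are hypothetical names, and the result covers only the eStargz image itself because the TOC image isn't included in the measured size.

```go
package main

import "fmt"

// row holds the "increased size" values (in bytes) from the ubuntu:22.04
// table for one min-chunk-size setting.
type row struct {
	minChunkSize int
	inlineTOC    int64 // increase with the TOC kept in the layer blob
	externalTOC  int64 // increase with the TOC moved to the TOC image
}

// ubuntu lists the ubuntu:22.04 measurements quoted above.
var ubuntu = []row{
	{0, 1996332, 1841964},
	{1000, 1756194, 1601588},
	{10000, 589458, 434273},
	{25000, 279215, 123735},
	{50000, 169840, 14264},
}

// externalTOCSaving returns how many bytes smaller the external-TOC variant is.
func externalTOCSaving(r row) int64 {
	return r.inlineTOC - r.externalTOC
}

func main() {
	for _, r := range ubuntu {
		fmt.Printf("min-chunk-size=%d: external TOC saves %d B\n",
			r.minChunkSize, externalTOCSaving(r))
	}
}
```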

Note that with a larger min-chunk-size, the time to extract a file can increase as well. One possible reason is that files within a single gzip stream aren't individually seekable, so extracting one file from the blob also requires decompressing the neighboring files in the same stream.

To mitigate this performance drawback of min-chunk-size, stargz-snapshotter introduces logic that aggressively and automatically caches files in a stream. When stargz-snapshotter extracts a file from a gzip stream that contains multiple files, it automatically caches all of the neighboring files to speed up future accesses.
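The caching idea can be sketched as follows: on a cache miss, the stream is decompressed once and every payload in it is stored, so later reads of neighboring files need no further decompression. This is a toy illustration under simplifying assumptions (one in-memory stream, a map as the cache); streamCache, entry and gzipBytes are hypothetical names and do not reflect stargz-snapshotter's actual cache.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// entry describes one payload inside a shared gzip stream.
type entry struct {
	Name        string
	InnerOffset int64
	Size        int64
}

// streamCache keeps payloads that were already extracted, keyed by name.
type streamCache struct {
	files          map[string][]byte
	decompressions int // how many times a stream had to be decompressed
}

func newStreamCache() *streamCache {
	return &streamCache{files: map[string][]byte{}}
}

// read returns the named payload. On a miss it decompresses the whole
// stream once and caches every entry found in it.
func (c *streamCache) read(stream []byte, entries []entry, name string) ([]byte, error) {
	if b, ok := c.files[name]; ok {
		return b, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(stream))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	c.decompressions++
	for _, e := range entries {
		end := e.InnerOffset + e.Size
		if e.InnerOffset < 0 || end > int64(len(raw)) {
			return nil, fmt.Errorf("entry %q is out of range", e.Name)
		}
		c.files[e.Name] = raw[e.InnerOffset:end]
	}
	b, ok := c.files[name]
	if !ok {
		return nil, fmt.Errorf("%q is not in this stream", name)
	}
	return b, nil
}

// gzipBytes compresses b into a single gzip stream.
func gzipBytes(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(b); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	stream, err := gzipBytes([]byte("AAAABBCCCCCC"))
	if err != nil {
		panic(err)
	}
	entries := []entry{{"a", 0, 4}, {"b", 4, 2}, {"c", 6, 6}}
	cache := newStreamCache()
	for _, name := range []string{"b", "a", "c"} {
		got, err := cache.read(stream, entries, name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s=%s decompressions=%d\n", name, got, cache.decompressions)
	}
}
```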

Comparison of file read throughput (tar-ing the rootfs of the container)

This compares the time taken to tar the rootfs of a container running on stargz-snapshotter's FUSE.

Command: /bin/bash -c 'time ( tar --exclude=/sys --exclude=/proc --exclude=/dev -cf - / | cat > /dev/null )'

(average of 3 times)

| min-chunk-size | 0 | 1000 | 10000 | 25000 | 50000 |
| --- | --- | --- | --- | --- | --- |
| kdeneon/plasma:latest | 2m11.508s | 2m11.162s | 2m4.930s | 2m3.928s | 2m1.908s |
| python:3.9-org | 29.451s | 28.805s | 27.241s | 27.117s | 25.928s |
| ubuntu:22.04 | 3.986s | 3.876s | 3.580s | 3.434s | 3.368s |

No performance drawback has been observed for min-chunk-size-enabled images.

HelloBench

[Figure: HelloBench result]

No performance drawback has been observed for min-chunk-size-enabled images.

TODOs

  • Allow configuring the name of TOC image (currently suffix -esgztoc is hardcoded)

@ktock ktock force-pushed the externalmetadata branch 2 times, most recently from 58bbb4f to 3dc794f on October 21, 2022 02:13
@ktock ktock requested a review from AkihiroSuda on October 21, 2022 05:23
stargz on /var/lib/containerd-stargz-grpc/snapshotter/snapshots/1/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
```

> NOTE: This flag creates an eStargz image with newly-added `innerOffset` functionality of eStargz. Older version of Stargz Snapshotter cannot perform lazy pulling for the images created with this flag.
Member

s/Older version/Versions before vX.Y.Z/

Member Author

Fixed

ent.Name = cleanEntryName(ent.Name)
if ent.Type == "reg" || ent.Type == "chunk" {
Member

better to use switch{}, but can be another PR

Member Author

Fixed to use switch.
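For illustration, the condition quoted above could be expressed with a switch along these lines. This is only a sketch of the style being suggested: isRegOrChunk is a hypothetical helper, and the body of the original branch is not shown in this thread.

```go
package main

import "fmt"

// isRegOrChunk reports whether a TOC entry type is "reg" or "chunk",
// the two types tested by the condition quoted in the review.
func isRegOrChunk(entryType string) bool {
	switch entryType {
	case "reg", "chunk":
		return true
	default:
		return false
	}
}

func main() {
	for _, t := range []string{"reg", "chunk", "dir", "symlink"} {
		fmt.Printf("%s: %v\n", t, isRegOrChunk(t))
	}
}
```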

if in, err := io.CopyN(io.Discard, dr, e.InnerOffset-nr); err != nil || in != e.InnerOffset-nr {
return 0, fmt.Errorf("discard of remaining %d bytes != %v, %v", e.InnerOffset-nr, in, err)
}
nr += e.InnerOffset - nr
Member

Does this mean nr = nr + e.InnerOffset - nr, i.e., nr = e.InnerOffset ?

Member Author

Yes. Refactored this to use nr = e.InnerOffset .


// MinChunkSize optionally controls the minimum number of bytes
// of data must be written in one gzip stream before a new gzip
// NOTE: This adds a TOC property that old reader doesn't understand.
Member

s/old/prior to vX.Y.Z/

Member Author

Fixed

subfield := "STARGZEXTERNALTOC" // len("STARGZEXTERNALTOC") = 17
binary.LittleEndian.PutUint16(header[2:4], uint16(len(subfield))) // little-endian per RFC1952
gz.Header.Extra = append(header, []byte(subfield)...)
gz.Close()
Member

catch err

Member Author

Fixed
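A self-contained sketch of the pattern under review, with the error from Close checked, might look like the following. It writes an empty gzip stream whose header carries one RFC 1952 extra subfield and reads it back. The helper names are hypothetical, and the subfield ID bytes 'S' and 'G' are an assumption for illustration since the quoted hunk doesn't show them.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

// emptyGzipWithExtra writes an empty gzip stream whose header carries one
// extra subfield: 2 ID bytes, a 2-byte little-endian length, then the data.
func emptyGzipWithExtra(si1, si2 byte, subfield string) ([]byte, error) {
	var buf bytes.Buffer
	gz, err := gzip.NewWriterLevel(&buf, gzip.NoCompression)
	if err != nil {
		return nil, err
	}
	header := make([]byte, 4)
	header[0], header[1] = si1, si2
	binary.LittleEndian.PutUint16(header[2:4], uint16(len(subfield))) // little-endian per RFC 1952
	gz.Header.Extra = append(header, []byte(subfield)...)
	// Close writes the header and trailer, so its error must not be dropped.
	if err := gz.Close(); err != nil {
		return nil, fmt.Errorf("closing gzip writer: %w", err)
	}
	return buf.Bytes(), nil
}

// readExtra parses the gzip header and returns its raw Extra field.
func readExtra(blob []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return zr.Header.Extra, nil
}

func main() {
	// 'S' and 'G' are assumed ID bytes, used here only for illustration.
	blob, err := emptyGzipWithExtra('S', 'G', "STARGZEXTERNALTOC")
	if err != nil {
		panic(err)
	}
	extra, err := readExtra(blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("subfield length=%d data=%s\n",
		binary.LittleEndian.Uint16(extra[2:4]), extra[4:])
}
```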

"zstd-fastest": tutil.ZstdCompressionWithLevel(zstd.SpeedFastest),
"zstd-default": tutil.ZstdCompressionWithLevel(zstd.SpeedDefault),
"zstd-bettercompression": tutil.ZstdCompressionWithLevel(zstd.SpeedBetterCompression),
"gzip-nocompression": tutil.GzipCompressionWithLevel(gzip.NoCompression),
Member

maybe:

Suggested change
"gzip-nocompression": tutil.GzipCompressionWithLevel(gzip.NoCompression),
"gzip-no-compression": tutil.GzipCompressionWithLevel(gzip.NoCompression),

Member Author

Fixed


var runes = []rune("1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

func RandomContents(n int) string {
Member

Suggested change
func RandomContents(n int) string {
func RandomString(n int) string {

Member Author

Fixed

b := make([]rune, n)
for i := range b {
b[i] = runes[rand.Intn(len(runes))]
}
Member

This would be more efficient https://pkg.go.dev/crypto/rand#Read

Member

(Maybe use base64 for asciization)

Member Author

Thank you for the suggestion. Fixed.
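One way to combine the two suggestions above (crypto/rand for the bytes, base64 to keep the result ASCII) is sketched below. randomASCII is a hypothetical helper written for this illustration; it is not the code that was merged, which, per the later comment in this conversation, returns raw bytes instead.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// randomASCII returns a string of exactly n ASCII characters built from
// cryptographically random bytes.
func randomASCII(n int) (string, error) {
	if n < 0 {
		return "", fmt.Errorf("negative length %d", n)
	}
	// base64 turns every 3 raw bytes into 4 characters, so ceil(3n/4) raw
	// bytes always encode to at least n characters.
	raw := make([]byte, (n*3+3)/4)
	if _, err := rand.Read(raw); err != nil {
		return "", fmt.Errorf("failed rand.Read: %w", err)
	}
	return base64.RawStdEncoding.EncodeToString(raw)[:n], nil
}

func main() {
	s, err := randomASCII(16)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(s), s)
}
```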

return string(b)
}

func ViewContents(c []byte) string {
Member

What is this?

Member Author

This is a helper function that returns an abbreviated version of a long string for printing.
But the implementation looked wrong, so I fixed it.
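A helper of this kind might look like the sketch below. The real ViewContents is not shown in this thread, so abbreviate, its max parameter and its output format are all assumptions made for illustration.

```go
package main

import "fmt"

// abbreviate returns c as a string. When c is longer than max bytes, the
// middle is replaced by a note saying how many bytes were omitted.
func abbreviate(c []byte, max int) string {
	if max < 0 {
		max = 0
	}
	if len(c) <= max {
		return string(c)
	}
	half := max / 2
	return fmt.Sprintf("%s...(%d bytes omitted)...%s",
		c[:half], len(c)-2*half, c[len(c)-half:])
}

func main() {
	fmt.Println(abbreviate([]byte("short"), 8))
	fmt.Println(abbreviate([]byte("a fairly long payload that is noisy in test logs"), 8))
}
```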

Usage: "Separate TOC JSON into another image (called \"TOC image\"). The name of TOC image is the original + \"-esgztoc\" suffix. Both eStargz and the TOC image should be pushed to the same registry. stargz-snapshotter refers to the TOC image when it pulls the result eStargz image.",
},
cli.BoolFlag{
Name: "estargz-lossless",
Member

"Lossless images" may sound like graphic images as in PNG images 😆

estargz-keep-diff-id might be less confusing, but no strong opinion

Member Author

Thank you for the suggestion. Renamed the flag to estargz-keep-diff-id.

## eStargz image with an external TOC

eStargz supports separating TOC into another image called *TOC image*.
This type of eStargz is the same as the normal eStargz but doesn't contain TOC JSON file (`stargz.index.json`) in the layer blob and has a special footer.
Member

It would be helpful to explain the motivation of this.

Member Author

Added the motivation.

Pull it lazily:

```console
# ctr-remote i rpull --plain-http registry2:5000/ubuntu:22.04-ex
Member Author

typo:

Suggested change
# ctr-remote i rpull --plain-http registry2:5000/ubuntu:22.04-ex
# ctr-remote i rpull --plain-http registry2:5000/ubuntu:22.04-chunk50000

Member Author

Fixed

if _, err := rand.Read(b); err != nil {
t.Fatalf("failed rand.Read: %v", err)
}
return string(b)
Member

This can be non-ascii

Member Author

Thank you for the review. I think it's fine that they're non-ascii. Fixed the function name to RandomBytes.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@AkihiroSuda AkihiroSuda merged commit 4642de4 into containerd:main Nov 2, 2022
@ktock ktock deleted the externalmetadata branch November 2, 2022 07:47