Similar to the Writer, a Reader can be reused using the `Reset` method.
For the best possible throughput, there is a `EncodeBuffer(buf []byte)` function available.
However, it requires that the provided buffer isn't used after it is handed over to S2 and until the stream is flushed or closed.
For smaller data blocks, there is also a non-streaming interface: `Encode()`, `EncodeBetter()` and `Decode()`.
Do however note that these functions (similar to Snappy) does not provide validation of data,
so data corruption may be undetected. Stream encoding provides CRC checks of data.
It is possible to efficiently skip forward in a compressed stream using the `Skip()` method.
For big skips the decompressor is able to skip blocks without decompressing them.
## Single Blocks
Similar to Snappy S2 offers single block compression.
Blocks do not offer the same flexibility and safety as streams,
but may be preferable for very small payloads, less than 100K.
Using a simple `dst := s2.Encode(nil, src)` will compress `src` and return the compressed result.
It is possible to provide a destination buffer.
If the buffer has a capacity of `s2.MaxEncodedLen(len(src))` it will be used.
If not a new will be allocated.
Alternatively `EncodeBetter`/`EncodeBest` can also be used for better, but slightly slower compression.
Similarly to decompress a block you can use `dst, err := s2.Decode(nil, src)`.
Again an optional destination buffer can be supplied.
The `s2.DecodedLen(src)` can be used to get the minimum capacity needed.
If that is not satisfied a new buffer will be allocated.
Block function always operate on a single goroutine since it should only be used for small payloads.
# Commandline tools
Some very simply commandline tools are provided; `s2c` for compression and `s2d` for decompression.
Binaries can be downloaded on the [Releases Page](https://github.com/klauspost/compress/releases).
Installing then requires Go to be installed. To install them, use:
`go install github.com/klauspost/compress/s2/cmd/s2c@latest && go install github.com/klauspost/compress/s2/cmd/s2d@latest`
To build binaries to the current folder use:
`go build github.com/klauspost/compress/s2/cmd/s2c && go build github.com/klauspost/compress/s2/cmd/s2d`
## s2c
```
Usage: s2c [options] file1 file2
Compresses all files supplied as input separately.
Output files are written as 'filename.ext.s2' or 'filename.ext.snappy'.
By default output files will be overwritten.
Use - as the only file name to read from stdin and write to stdout.
Wildcards are accepted: testdir/*.txt will compress all files in testdir ending with .txt
Directories can be wildcards as well. testdir/*/*.txt will match testdir/subdir/b.txt
File names beginning with 'http://' and 'https://' will be downloaded and compressed.
Only http response code 200 is accepted.
Options:
-bench int
Run benchmark n times. No output will be written
-blocksize string
Max block size. Examples: 64K, 256K, 1M, 4M. Must be power of two and <= 4MB (default "4M")
-c Write all output to stdout. Multiple input files will be concatenated
-cpu int
Compress using this amount of threads (default 32)
-faster
Compress faster, but with a minor compression loss
-help
Display help
-index
Add seek index (default true)
-o string
Write output to another file. Single input file only
-pad string
Pad size to a multiple of this value, Examples: 500, 64K, 256K, 1M, 4M, etc (default "1")
-q Don't write any output to terminal, except errors
-rm
Delete source file(s) after successful compression
-safe
Do not overwrite output files
-slower
Compress more, but a lot slower
-snappy
Generate Snappy compatible output stream
-verify
Verify written files
```
## s2d
```
Usage: s2d [options] file1 file2
Decompresses all files supplied as input. Input files must end with '.s2' or '.snappy'.
Output file names have the extension removed. By default output files will be overwritten.
Use - as the only file name to read from stdin and write to stdout.
Wildcards are accepted: testdir/*.txt will compress all files in testdir ending with .txt
Directories can be wildcards as well. testdir/*/*.txt will match testdir/subdir/b.txt
File names beginning with 'http://' and 'https://' will be downloaded and decompressed.
Extensions on downloaded files are ignored. Only http response code 200 is accepted.
Options:
-bench int
Run benchmark n times. No output will be written
-c Write all output to stdout. Multiple input files will be concatenated
-help
Display help
-o string
Write output to another file. Single input file only
-offset string
Start at offset. Examples: 92, 64K, 256K, 1M, 4M. Requires Index
-q Don't write any output to terminal, except errors
-rm
Delete source file(s) after successful decompression
-safe
Do not overwrite output files
-tail string
Return last of compressed file. Examples: 92, 64K, 256K, 1M, 4M. Requires Index
-verify
Verify files, but do not write output
```
## s2sx: self-extracting archives
s2sx allows creating self-extracting archives with no dependencies.
By default, executables are created for the same platforms as the host os,
but this can be overridden with `-os` and `-arch` parameters.
Extracted files have 0666 permissions, except when untar option used.
```
Usage: s2sx [options] file1 file2
Compresses all files supplied as input separately.
If files have '.s2' extension they are assumed to be compressed already.
Output files are written as 'filename.s2sx' and with '.exe' for windows targets.
If output is big, an additional file with ".more" is written. This must be included as well.
By default output files will be overwritten.
Wildcards are accepted: testdir/*.txt will compress all files in testdir ending with .txt
Directories can be wildcards as well. testdir/*/*.txt will match testdir/subdir/b.txt
Options:
-arch string
Destination architecture (default "amd64")
-c Write all output to stdout. Multiple input files will be concatenated
-cpu int
Compress using this amount of threads (default 32)
-help
Display help
-max string
Maximum executable size. Rest will be written to another file. (default "1G")
-os string
Destination operating system (default "windows")
-q Don't write any output to terminal, except errors
-rm
Delete source file(s) after successful compression
-safe
Do not overwrite output files
-untar
Untar on destination
```
Available platforms are:
* darwin-amd64
* darwin-arm64
* linux-amd64
* linux-arm
* linux-arm64
* linux-mips64
* linux-ppc64le
* windows-386
* windows-amd64
By default, there is a size limit of 1GB for the output executable.
When this is exceeded the remaining file content is written to a file called
output+`.more`. This file must be included for a successful extraction and
placed alongside the executable for a successful extraction.
This file *must* have the same name as the executable, so if the executable is renamed,
so must the `.more` file.
This functionality is disabled with stdin/stdout.
### Self-extracting TAR files
If you wrap a TAR file you can specify `-untar` to make it untar on the destination host.
Files are extracted to the current folder with the path specified in the tar file.
Note that tar files are not validated before they are wrapped.
For security reasons files that move below the root folder are not allowed.
# Performance
This section will focus on comparisons to Snappy.
This package is solely aimed at replacing Snappy as a high speed compression package.
If you are mainly looking for better compression [zstandard](https://github.com/klauspost/compress/tree/master/zstd#zstd)
gives better compression, but typically at speeds slightly below "better" mode in this package.
Compression is increased compared to Snappy, mostly around 5-20% and the throughput is typically 25-40% increased (single threaded) compared to the Snappy Go implementation.
Streams are concurrently compressed. The stream will be distributed among all available CPU cores for the best possible throughput.
A "better" compression mode is also available. This allows to trade a bit of speed for a minor compression gain.
The content compressed in this mode is fully compatible with the standard decoder.
Snappy vs S2 **compression** speed on 16 core (32 thread) computer, using all threads and a single thread (1 CPU):
*Note: S2 dictionary compression is currently at an early implementation stage, with no assembly for
neither encoding nor decoding. Performance improvements can be expected in the future.*
Adding dictionaries allow providing a custom dictionary that will serve as lookup in the beginning of blocks.
The same dictionary *must* be used for both encoding and decoding.
S2 does not keep track of whether the same dictionary is used,
and using the wrong dictionary will most often not result in an error when decompressing.
Blocks encoded *without* dictionaries can be decompressed seamlessly *with* a dictionary.
This means it is possible to switch from an encoding without dictionaries to an encoding with dictionaries
and treat the blocks similarly.
Similar to [zStandard dictionaries](https://github.com/facebook/zstd#the-case-for-small-data-compression),
the same usage scenario applies to S2 dictionaries.
> Training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits. Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
S2 further limits the dictionary to only be enabled on the first 64KB of a block.
This will remove any negative (speed) impacts of the dictionaries on bigger blocks.
### Compression
Using the [github_users_sample_set](https://github.com/facebook/zstd/releases/download/v1.1.3/github_users_sample_set.tar.zst)
and a 64KB dictionary trained with zStandard the following sizes can be achieved.
All decompression functions map directly to equivalent s2 functions.
| Snappy | S2 replacement |
|------------------------|--------------------|
| snappy.Decode(...) | s2.Decode(...) |
| snappy.DecodedLen(...) | s2.DecodedLen(...) |
| snappy.NewReader(...) | s2.NewReader(...) |
Features like [quick forward skipping without decompression](https://pkg.go.dev/github.com/klauspost/compress/s2#Reader.Skip)
are also available for Snappy streams.
If you know you are only decompressing snappy streams, setting [`ReaderMaxBlockSize(64<<10)`](https://pkg.go.dev/github.com/klauspost/compress/s2#ReaderMaxBlockSize)
on your Reader will reduce memory consumption.
# Concatenating blocks and streams.
Concatenating streams will concatenate the output of both without recompressing them.
While this is inefficient in terms of compression it might be usable in certain scenarios.
The 10 byte 'stream identifier' of the second stream can optionally be stripped, but it is not a requirement.
Blocks can be concatenated using the `ConcatBlocks` function.
Snappy blocks/streams can safely be concatenated with S2 blocks and streams.
Streams with indexes (see below) will currently not work on concatenated streams.
# Stream Seek Index
S2 and Snappy streams can have indexes. These indexes will allow random seeking within the compressed data.
The index can either be appended to the stream as a skippable block or returned for separate storage.
When the index is appended to a stream it will be skipped by regular decoders,
so the output remains compatible with other decoders.
## Creating an Index
To automatically add an index to a stream, add `WriterAddIndex()` option to your writer.
Then the index will be added to the stream when `Close()` is called.
```
// Add Index to stream...
enc := s2.NewWriter(w, s2.WriterAddIndex())
io.Copy(enc, r)
enc.Close()
```
If you want to store the index separately, you can use `CloseIndex()` instead of the regular `Close()`.
This will return the index. Note that `CloseIndex()` should only be called once, and you shouldn't call `Close()`.
```
// Get index for separate storage...
enc := s2.NewWriter(w)
io.Copy(enc, r)
index, err := enc.CloseIndex()
```
The `index` can then be used needing to read from the stream.
This means the index can be used without needing to seek to the end of the stream
or for manually forwarding streams. See below.
Finally, an existing S2/Snappy stream can be indexed using the `s2.IndexStream(r io.Reader)` function.
## Using Indexes
To use indexes there is a `ReadSeeker(random bool, index []byte) (*ReadSeeker, error)` function available.
Calling ReadSeeker will return an [io.ReadSeeker](https://pkg.go.dev/io#ReadSeeker) compatible version of the reader.
If 'random' is specified the returned io.Seeker can be used for random seeking, otherwise only forward seeking is supported.
Enabling random seeking requires the original input to support the [io.Seeker](https://pkg.go.dev/io#Seeker) interface.
```
dec := s2.NewReader(r)
rs, err := dec.ReadSeeker(false, nil)
rs.Seek(wantOffset, io.SeekStart)
```
Get a seeker to seek forward. Since no index is provided, the index is read from the stream.
This requires that an index was added and that `r` supports the [io.Seeker](https://pkg.go.dev/io#Seeker) interface.
A custom index can be specified which will be used if supplied.
When using a custom index, it will not be read from the input stream.
```
dec := s2.NewReader(r)
rs, err := dec.ReadSeeker(false, index)
rs.Seek(wantOffset, io.SeekStart)
```
This will read the index from `index`. Since we specify non-random (forward only) seeking `r` does not have to be an io.Seeker
```
dec := s2.NewReader(r)
rs, err := dec.ReadSeeker(true, index)
rs.Seek(wantOffset, io.SeekStart)
```
Finally, since we specify that we want to do random seeking `r` must be an io.Seeker.
The returned [ReadSeeker](https://pkg.go.dev/github.com/klauspost/compress/s2#ReadSeeker) contains a shallow reference to the existing Reader,
meaning changes performed to one is reflected in the other.
To check if a stream contains an index at the end, the `(*Index).LoadStream(rs io.ReadSeeker) error` can be used.
## Manually Forwarding Streams
Indexes can also be read outside the decoder using the [Index](https://pkg.go.dev/github.com/klauspost/compress/s2#Index) type.
This can be used for parsing indexes, either separate or in streams.
In some cases it may not be possible to serve a seekable stream.
This can for instance be an HTTP stream, where the Range request
is sent at the start of the stream.
With a little bit of extra code it is still possible to use indexes
to forward to specific offset with a single forward skip.
It is possible to load the index manually like this:
```
var index s2.Index
_, err = index.Load(idxBytes)
```
This can be used to figure out how much to offset the compressed stream:
For compact storage [RemoveIndexHeaders](https://pkg.go.dev/github.com/klauspost/compress/s2#RemoveIndexHeaders) can be used to remove any redundant info from
a serialized index. If you remove the header it must be restored before [Loading](https://pkg.go.dev/github.com/klauspost/compress/s2#Index.Load).
This is expected to save 20 bytes. These can be restored using [RestoreIndexHeaders](https://pkg.go.dev/github.com/klauspost/compress/s2#RestoreIndexHeaders). This removes a layer of security, but is the most compact representation. Returns nil if headers contains errors.
Each block is structured as a snappy skippable block, with the chunk ID 0x99.
The block can be read from the front, but contains information so it can be read from the back as well.
Numbers are stored as fixed size little endian values or [zigzag encoded](https://developers.google.com/protocol-buffers/docs/encoding#signed_integers) [base 128 varints](https://developers.google.com/protocol-buffers/docs/encoding),
with un-encoded value length of 64 bits, unless other limits are specified.
// Adjust compressed offset for next loop, integer truncating division must be used.
CompressGuess += cOff/2
}
```
To decode from any given uncompressed offset `(wantOffset)`:
* Iterate entries until `entry[n].UncompressedOffset > wantOffset`.
* Start decoding from `entry[n-1].CompressedOffset`.
* Discard `entry[n-1].UncompressedOffset - wantOffset` bytes from the decoded stream.
See [using indexes](https://github.com/klauspost/compress/tree/master/s2#using-indexes) for functions that perform the operations with a simpler interface.
* Frame [Stream identifier](https://github.com/google/snappy/blob/master/framing_format.txt#L68) changed from `sNaPpY` to `S2sTwO`.
* [Framed compressed blocks](https://github.com/google/snappy/blob/master/format_description.txt) can be up to 4MB (up from 64KB).
* Compressed blocks can have an offset of `0`, which indicates to repeat the last seen offset.
Repeat offsets must be encoded as a [2.2.1. Copy with 1-byte offset (01)](https://github.com/google/snappy/blob/master/format_description.txt#L89), where the offset is 0.
The length is specified by reading the 3-bit length specified in the tag and decode using this table:
| Length | Actual Length |
|--------|----------------------|
| 0 | 4 |
| 1 | 5 |
| 2 | 6 |
| 3 | 7 |
| 4 | 8 |
| 5 | 8 + read 1 byte |
| 6 | 260 + read 2 bytes |
| 7 | 65540 + read 3 bytes |
This allows any repeat offset + length to be represented by 2 to 5 bytes.