mirror of
https://git.tukaani.org/xz.git
synced 2024-04-04 12:36:23 +02:00
Advanced features of liblzma
----------------------------

0. Introduction

Most developers need only the basic features of liblzma. These
features allow single-threaded encoding and decoding of .lzma files
in streamed mode.

In some cases developers want more. The .lzma file format is
designed to allow multi-threaded encoding and decoding and limited
random-access reading. These features are possible in non-streamed
mode and, to a limited extent, also in streamed mode.

To take advantage of these features, the application needs a custom
.lzma file format handler. liblzma provides a set of tools to ease
this task, but it's still quite a bit of work to get a good custom
.lzma handler done.


1. Where to begin

Start by reading the .lzma file format specification. Understanding
the basics of the .lzma file structure is required to implement a
custom .lzma file handler and to understand the rest of this document.


2. The basic components

2.1. Stream Header and tail

Stream Header begins the .lzma Stream and Stream tail ends it. Stream
Header is defined in the file format specification, but Stream tail
isn't (thus I write "tail" with a lower-case letter). Stream tail is
simply the Stream Flags and the Footer Magic Bytes fields together.
It was done this way in liblzma because the Block coders take care
of the rest of the fields in the Stream Footer.

For now, the size of Stream Header is fixed at 11 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
should use instead of a hardcoded number. Similarly, Stream tail
is fixed at 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.

It is possible that a future version of the .lzma format will have
variable-sized Stream Header and tail. As of writing, this seems so
unlikely that it was considered simplest to just use constants
instead of providing functions to get and store the sizes of the
Stream Header and tail.


2.x. Stream tail

For now, the size of Stream tail is fixed at 3 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
should use instead of a hardcoded number.


3. Keeping track of size information

The lzma_info_* functions found in <lzma/info.h> should ease the
task of keeping track of the sizes of the Blocks and also of the
Stream as a whole. Using these functions is strongly recommended,
because there are surprisingly many situations where an error can
occur, and these functions check for possible errors every time some
new information becomes available.

If you find the lzma_info_* functions lacking something that you
would find useful, please contact the author.


3.1. Start offset of the Stream

If you are storing the .lzma Stream inside another file format, or
for some other reason are placing the .lzma Stream somewhere other
than at the beginning of the file, you should tell liblzma the
starting offset of the Stream using lzma_info_start_offset_set().

The start offset of the Stream is used for two distinct purposes.
First, knowing the start offset of the Stream allows
lzma_info_alignment_get() to correctly calculate the alignment of
every Block. This information is given to the Block encoder, which
will calculate the size of Header Padding so that Compressed Data
is aligned at an optimal offset.
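
The padding calculation can be illustrated with a minimal sketch.
It is only an illustration: header_padding_for() and its parameters
are hypothetical names and not part of liblzma, and the real work is
done by lzma_info_alignment_get() and the Block encoder.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function: return how many padding
 * bytes are needed so that data which would otherwise start at `offset`
 * (measured from the beginning of the file) starts at a multiple of
 * `alignment` instead.
 */
uint64_t
header_padding_for(uint64_t offset, uint64_t alignment)
{
    assert(alignment != 0);

    const uint64_t misalignment = offset % alignment;
    return misalignment == 0 ? 0 : alignment - misalignment;
}
```

Suppose the Stream starts at file offset 3 and Compressed Data would
begin 17 bytes into the Stream. Measured from the start of the file
the data begins at offset 20, which is already a multiple of four,
so no padding is needed. Ignoring the start offset would instead add
three bytes of padding and misalign the data.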

The other use for the start offset of the Stream is in random-access
reading. If you set the start offset of the Stream, lzma_info_locate()
will be able to calculate the offset relative to the beginning of the
file containing the Stream (instead of the offset relative to the
beginning of the Stream).
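
The translation from a Stream-relative offset to a file-relative
offset is a single addition. A minimal sketch, assuming 64-bit
unsigned offsets, follows. The name stream_to_file_offset() is
hypothetical and not part of liblzma; lzma_info_locate() does this
internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function: convert an offset that
 * is relative to the beginning of the Stream into an offset relative
 * to the beginning of the containing file. Returns false if the sum
 * would not fit in uint64_t, which can only happen with bogus input.
 */
bool
stream_to_file_offset(uint64_t stream_start, uint64_t stream_relative,
        uint64_t *file_offset)
{
    assert(file_offset != NULL);

    if (stream_relative > UINT64_MAX - stream_start)
        return false;

    *file_offset = stream_start + stream_relative;
    return true;
}
```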


3.2. Size of Stream Header

While the size of Stream Header is constant (11 bytes) in the current
version of the .lzma file format, this may change in the future.


3.3. Size of Header Metadata Block

This information is needed when doing random-access reading, and
to verify the value of this field stored in Footer Metadata Block.


3.4. Total Size of the Data Blocks


3.5. Uncompressed Size of Data Blocks


3.6. Index




x. Alignment

There are a few slightly different types of alignment issues when
working with .lzma files.

The .lzma format doesn't strictly require any kind of alignment.
However, if the encoder carefully optimizes the alignment in all
situations, it can improve compression ratio and the speed of the
encoder and decoder, and it can slightly help if the files get
damaged and need recovery.

Alignment has the most significant effect on compression ratio. FIXME


x.1. Compression ratio

Some filters take advantage of the alignment of the input data.
To get the best compression ratio, make sure that you feed these
filters correctly aligned data.

Some filters (e.g. LZMA) don't necessarily mind too much if the
input doesn't match the preferred alignment. With these filters
the penalty in compression ratio depends on the specific type of
data being compressed.

Other filters (e.g. PowerPC executable filter) won't work at all
with data that is improperly aligned. While the data can still
be de-filtered back to its original form, the benefit of the
filtering (better compression ratio) is completely lost, because
these filters expect certain patterns at properly aligned offsets.
The compression ratio may even be worse with incorrectly aligned
input than without the filter.


x.1.1. Inter-filter alignment

When there are multiple filters chained, checking the alignment can
be useful not only with the input of the first filter and output of
the last filter, but also between the filters.

Inter-filter alignment is especially important with the Subblock
filter.


x.1.2. Further compression with external tools

This is a relatively rare situation in practice, but still worth
understanding.

Let's say that there are several SPARC executables, which are each
filtered to separate .lzma files using only the SPARC filter. If
Uncompressed Size is written to the Block Header, the size of Block
Header may vary between the .lzma files. If no Padding is used in
the Block Header to correct the alignment, the starting offset of
the Compressed Data field will be differently aligned in different
.lzma files.

All these .lzma files are archived into a single .tar archive. Due
to the nature of the .tar format, every file is aligned inside the
archive to an offset that is a multiple of 512 bytes.

The .tar archive is compressed into a new .lzma file using the LZMA
filter with options that prefer input alignment of four bytes. Now
if the independent .lzma files don't have the same alignment of
the Compressed Data fields, the LZMA filter will be unable to take
advantage of the input alignment between the files in the .tar
archive, which reduces compression ratio.

Thus, even if you have only a single Block per file, it can be good
for compression ratio to align the Compressed Data to an optimal
offset.


x.2. Speed

Most modern computers are faster when multi-byte data is located
at aligned offsets in RAM. Proper alignment of the Compressed Data
fields can slightly increase the speed of some filters.


x.3. Recovery

Aligning every Block Header to start at an offset with big enough
alignment may ease or at least speed up recovery of broken files.


y. Typical usage cases

y.x. Parsing the Stream backwards

You may need to parse the Stream backwards if you need to get
information such as the sizes of the Stream, Index, or Extra.
The basic procedure is as follows.

Locate the end of the Stream. If the Stream is stored as is in a
standalone .lzma file, simply seek to the end of the file and start
reading backwards using an appropriate buffer size. The file format
specification allows an arbitrary amount of Footer Padding (zero or
more NUL bytes), which you must skip before trying to decode the
Stream tail.
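
Skipping the Footer Padding within one buffer can be sketched as
follows. This is an illustration only: skip_footer_padding() is a
hypothetical name and not part of liblzma. It assumes the caller has
already read the last bytes of the file into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function: given a buffer holding
 * the last `size` bytes of the file, return the number of bytes that
 * remain once trailing NUL bytes (Footer Padding) are ignored. A return
 * value of zero means the buffer contained only padding, so the caller
 * has to read an earlier part of the file and try again.
 */
size_t
skip_footer_padding(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    return size;
}
```

The returned count marks the position just past the last non-NUL
byte, that is, the end of the Stream within the buffer.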

Once you have located the end of the Stream (a non-NUL byte), make
sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
Stream in a buffer. If there aren't enough bytes left in the file,
the file is too small to contain a valid Stream. Decode the Stream
tail using lzma_stream_tail_decoder(). Store the offset of the first
byte of the Stream tail; you will need it later.

You may now want to do some internal verifications, e.g. check that
the Check type is supported by the liblzma build you are using.

Decode the Backward Size field with lzma_vli_reverse_decode(). The
field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
Backward Size is not zero. Store the offset of the first byte of
the Backward Size; you will need it later.

Now you know the Total Size of the last Block of the Stream. It's the
value of Backward Size plus the size of the Backward Size field. Note
that you cannot use lzma_vli_size() to calculate the size since there
might be padding; you need to use the real observed size of the
Backward Size field.
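
The sum can be sketched as below. last_block_total_size() is a
hypothetical name and not part of liblzma. The field_size argument
is assumed to be the number of bytes the reverse decoder actually
consumed, not a value computed from the decoded integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function: compute the Total Size
 * of the last Block from the decoded Backward Size and the observed
 * size of the Backward Size field. Returns false for a zero Backward
 * Size, a zero-length field, or a sum that does not fit in uint64_t.
 */
bool
last_block_total_size(uint64_t backward_size, size_t field_size,
        uint64_t *total_size)
{
    assert(total_size != NULL);

    if (backward_size == 0 || field_size == 0)
        return false;

    if (backward_size > UINT64_MAX - (uint64_t)field_size)
        return false;

    *total_size = backward_size + (uint64_t)field_size;
    return true;
}
```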

At this point, the operation continues differently for Single-Block
and Multi-Block Streams.


y.x.1. Single-Block Stream

There might be an Uncompressed Size field present in the Stream
Footer. You cannot know it for sure unless you have already parsed
the Block Header earlier. For security reasons, you probably want to
try to decode the Uncompressed Size field, but you must not indicate
any error if decoding fails. Later you can give the decoded
Uncompressed Size to the Block decoder if Uncompressed Size isn't
otherwise known; this prevents it from producing too much output in
case of a (possibly intentionally) corrupt file.

Calculate the start offset of the Stream:

    backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE

backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
invalid input files.
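
One way to do the subtraction safely with unsigned 64-bit offsets is
sketched below. single_block_stream_start() is a hypothetical name
and not part of liblzma. The header size is passed in as a parameter,
so that a real caller can supply LZMA_STREAM_HEADER_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function: compute
 *
 *     backward_offset - backward_size - header_size
 *
 * and return false instead of wrapping around if either subtraction
 * would go below zero, which indicates an invalid input file.
 */
bool
single_block_stream_start(uint64_t backward_offset,
        uint64_t backward_size, uint64_t header_size,
        uint64_t *stream_start)
{
    assert(stream_start != NULL);

    if (backward_size > backward_offset)
        return false;

    const uint64_t block_start = backward_offset - backward_size;

    if (header_size > block_start)
        return false;

    *stream_start = block_start - header_size;
    return true;
}
```

Checking each operand before subtracting means a crafted Backward
Size cannot produce a huge bogus offset through unsigned wraparound.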

Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found in the Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.

Decode the Block Header. Verify that it isn't a Metadata Block, since
Single-Block Streams cannot have Metadata. If Uncompressed Size is
present in the Block Header, the value you tried to decode from the
Stream Footer must be ignored, since Uncompressed Size wasn't actually
present there. If Block Header doesn't have Uncompressed Size, and
decoding the Uncompressed Size field from the Stream Footer failed,
the file is corrupt.
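
The three possible outcomes can be summarized in a small decision
function. This is a sketch only: the type size_source and the
function resolve_uncompressed_size() are hypothetical names and not
part of liblzma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type, not part of liblzma. */
typedef enum {
    SIZE_FROM_HEADER,
    SIZE_FROM_FOOTER,
    SIZE_CORRUPT
} size_source;

/*
 * Hypothetical helper, not a liblzma function: decide which
 * Uncompressed Size to trust for a Single-Block Stream. On
 * SIZE_CORRUPT, *uncompressed_size is left untouched.
 */
size_source
resolve_uncompressed_size(bool header_has_size, uint64_t header_size,
        bool footer_decode_ok, uint64_t footer_size,
        uint64_t *uncompressed_size)
{
    assert(uncompressed_size != NULL);

    if (header_has_size) {
        /* The field was not really present in the Stream Footer,
         * so whatever was tentatively decoded there is ignored. */
        *uncompressed_size = header_size;
        return SIZE_FROM_HEADER;
    }

    if (!footer_decode_ok)
        return SIZE_CORRUPT;

    *uncompressed_size = footer_size;
    return SIZE_FROM_FOOTER;
}
```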

If you were only looking for the Uncompressed Size of the Stream,
you now have that information, and you can stop processing the Stream.

To decode the Block, the same instructions apply as described in
FIXME. However, because you have some extra known information decoded
from the Stream Footer, you should give this information to the Block
decoder so that it can verify it while decoding:
  - If Uncompressed Size is not present in the Block Header, set
    lzma_options_block.uncompressed_size to the value you decoded
    from the Stream Footer.
  - Always set lzma_options_block.total_size to backward_size +
    size_of_backward_size (you calculated this sum earlier already).


y.x.2. Multi-Block Stream

Calculate the start offset of the Footer Metadata Block:

    backward_offset - backward_size

backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
broken input files.

Decode the Block Header. Verify that it is a Metadata Block. Set
lzma_options_block.total_size to backward_size + size_of_backward_size
(you calculated this sum earlier already). Then decode the Footer
Metadata Block.

Store the decoded Footer Metadata in the lzma_info structure using
lzma_info_set_metadata(). Also set the offset of the Backward Size
field using lzma_info_size_set(). Then you can get the start offset
of the Stream using lzma_info_size_get(). Note that any of these steps
may fail, so don't omit error checking.

Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found in the Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.

If you were only looking for the Uncompressed Size of the Stream,
it's possible that you already have it now. If Uncompressed Size (or
whatever information you were looking for) isn't available yet,
continue by also decoding the Header Metadata Block. (If some
information is missing, the Header Metadata Block has to be present.)

Decoding the Data Blocks goes the same way as described in FIXME.


y.x.3. Variations

If you know the offset of the beginning of the Stream, you may want
to parse the Stream Header before parsing the Stream tail.