xz-archive/tests/files/README


.xz and .lzma Test Files
------------------------

0. Introduction

    This directory contains bunch of files to test handling of .xz,
    .lzma (LZMA_Alone), and .lz (lzip) files in decoder implementations.
    Many of the files have been created by hand with a hex editor, thus
    there is no better "source code" than the files themselves. All the
    test files and this README have been put into the public domain.


1. File Types

    Good files (good-*) must decode successfully without requiring
    a lot of CPU time or RAM.

    Unsupported files (unsupported-*) are good files, but headers
    indicate features not supported by the current file format
    specification.

    Bad files (bad-*) must cause the decoder to give an error. Like
    with the good files, these files must not require a lot of CPU
    time or RAM before they get detected to be broken.


2. Descriptions of Individual .xz Files

2.1. Good Files

    good-0-empty.xz has one Stream with no Blocks.

    good-0pad-empty.xz has one Stream with no Blocks followed by
    four-byte Stream Padding.

    good-0cat-empty.xz has two zero-Block Streams concatenated without
    Stream Padding.

    good-0catpad-empty.xz has two zero-Block Streams concatenated with
    four-byte Stream Padding between the Streams.

    good-1-check-none.xz has one Stream with one Block with two
    uncompressed LZMA2 chunks and no integrity check.

    good-1-check-crc32.xz has one Stream with one Block with two
    uncompressed LZMA2 chunks and CRC32 check.

    good-1-check-crc64.xz is like good-1-check-crc32.xz but with CRC64.

    good-1-check-sha256.xz is like good-1-check-crc32.xz but with
    SHA256.

    good-2-lzma2.xz has one Stream with two Blocks with one uncompressed
    LZMA2 chunk in each Block.

    good-1-block_header-1.xz has both Compressed Size and Uncompressed
    Size in the Block Header. This has also four extra bytes of Header
    Padding.

    good-1-block_header-2.xz has known Compressed Size.

    good-1-block_header-3.xz has known Uncompressed Size.

    good-1-delta-lzma2.tiff.xz is an image file that compresses
    better with Delta+LZMA2 than with plain LZMA2.

    good-1-x86-lzma2.xz uses the x86 filter (BCJ) and LZMA2. The
    uncompressed file is compress_prepared_bcj_x86 found from the tests
    directory.

    good-1-sparc-lzma2.xz uses the SPARC filter and LZMA2. The
    uncompressed file is compress_prepared_bcj_sparc found from the tests
    directory.

    good-1-arm64-lzma2-1.xz uses the ARM64 filter and LZMA2. The
    uncompressed data is constructed so that it tests integer
    wrap around and sign extension.

    good-1-arm64-lzma2-2.xz is like good-1-arm64-lzma2-1.xz but with
    non-zero start offset. XZ Embedded doesn't support this file.

    good-1-riscv-lzma2-1.xz uses the RISC-V filter and LZMA2. The
    uncompressed data is constructed so it tests all of the instructions
    that should be encoded and a few that should not. Additionally, the
    file contains random bytes to help test unforeseen corner cases.

    good-1-riscv-lzma2-2.xz is like good-1-riscv-lzma2-1.xz but with
    non-zero start offset. XZ Embedded doesn't support this file.

    good-1-lzma2-1.xz has two LZMA2 chunks, of which the second sets
    new properties.

    good-1-lzma2-2.xz has two LZMA2 chunks, of which the second resets
    the state without specifying new properties.

    good-1-lzma2-3.xz has two LZMA2 chunks, of which the first is
    uncompressed and the second is LZMA. The first chunk resets dictionary
    and the second sets new properties.

    good-1-lzma2-4.xz has three LZMA2 chunks: First is LZMA, second is
    uncompressed with dictionary reset, and third is LZMA with new
    properties but without dictionary reset.

    good-1-lzma2-5.xz has an empty LZMA2 stream with only the end of
    payload marker. XZ Utils 5.0.1 and older incorrectly see this file
    as corrupt.

    good-1-3delta-lzma2.xz has three Delta filters and LZMA2.

    good-1-empty-bcj-lzma2.xz has an empty Block that uses PowerPC BCJ
    and LZMA2. liblzma from XZ Utils 5.0.1 and older may incorrectly
    return LZMA_BUF_ERROR in some cases. See commit message
    d8db706acb8316f9861abd432cfbe001dd6d0c5c for the details.


2.2. Unsupported Files

    unsupported-check.xz uses Check ID 0x02 which isn't supported by
    the current version of the file format. It is implementation-defined
    how this file handled (it may reject it, or decode it possibly with
    a warning).

    unsupported-block_header.xz has a non-null byte in Header Padding,
    which may indicate presence of a new unsupported field.

    unsupported-filter_flags-1.xz has unsupported Filter ID 0x7F.

    unsupported-filter_flags-2.xz specifies only Delta filter in the
    List of Filter Flags, but Delta isn't allowed as the last filter in
    the chain. It could be a little more correct to detect this file as
    corrupt instead of unsupported, but saying it is unsupported is
    simpler in case of liblzma.

    unsupported-filter_flags-3.xz specifies two LZMA2 filters in the
    List of Filter Flags. LZMA2 is allowed only as the last filter in the
    chain. It could be a little more correct to detect this file as
    corrupt instead of unsupported, but saying it is unsupported is
    simpler in case of liblzma.


2.3. Bad Files

    bad-0pad-empty.xz has one Stream with no Blocks followed by
    five-byte Stream Padding. Stream Padding must be a multiple of four
    bytes, thus this file is corrupt.

    bad-0catpad-empty.xz has two zero-Block Streams concatenated with
    five-byte Stream Padding between the Streams.

    bad-0cat-alone.xz is good-0-empty.xz concatenated with an empty
    LZMA_Alone file.

    bad-0cat-header_magic.xz is good-0cat-empty.xz but with one byte
    wrong in the Header Magic Bytes field of the second Stream. liblzma
    gives LZMA_DATA_ERROR for this. (LZMA_FORMAT_ERROR is used only if
    the first Stream of a file has invalid Header Magic Bytes.)

    bad-0-header_magic.xz is good-0-empty.xz but with one byte wrong
    in the Header Magic Bytes field. liblzma gives LZMA_FORMAT_ERROR for
    this.

    bad-0-footer_magic.xz is good-0-empty.xz but with one byte wrong
    in the Footer Magic Bytes field. liblzma gives LZMA_DATA_ERROR for
    this.

    bad-0-empty-truncated.xz is good-0-empty.xz without the last byte
    of the file.

    bad-0-nonempty_index.xz has no Blocks but Index claims that there is
    one Block.

    bad-0-backward_size.xz has wrong Backward Size in Stream Footer.

    bad-1-stream_flags-1.xz has different Stream Flags in Stream Header
    and Stream Footer.

    bad-1-stream_flags-2.xz has wrong CRC32 in Stream Header.

    bad-1-stream_flags-3.xz has wrong CRC32 in Stream Footer.

    bad-1-vli-1.xz has two-byte variable-length integer in the
    Uncompressed Size field in Block Header while one-byte would be enough
    for that value. It's important that the file gets rejected due to too
    big integer encoding instead of due to Uncompressed Size not matching
    the value stored in the Block Header. That is, the decoder must not
    try to decode the Compressed Data field.

    bad-1-vli-2.xz has ten-byte variable-length integer as Uncompressed
    Size in Block Header. It's important that the file gets rejected due
    to too big integer encoding instead of due to Uncompressed Size not
    matching the value stored in the Block Header. That is, the decoder
    must not try to decode the Compressed Data field.

    bad-1-block_header-1.xz has Block Header that ends in the middle of
    the Filter Flags field.

    bad-1-block_header-2.xz has Block Header that has Compressed Size and
    Uncompressed Size but no List of Filter Flags field.

    bad-1-block_header-3.xz has wrong CRC32 in Block Header.

    bad-1-block_header-4.xz has too big Compressed Size in Block Header
    (2^63 - 1 bytes while maximum is a little less, because the whole
    Block must stay smaller than 2^63). It's important that the file
    gets rejected due to invalid Compressed Size value; the decoder
    must not try decoding the Compressed Data field.

    bad-1-block_header-5.xz has zero as Compressed Size in Block Header.

    bad-1-block_header-6.xz has corrupt Block Header which may crash
    xz -lvv in XZ Utils 5.0.3 and earlier. It was fixed in the commit
    c0297445064951807803457dca1611b3c47e7f0f.

    bad-2-index-1.xz has wrong Unpadded Sizes in Index.

    bad-2-index-2.xz has wrong Uncompressed Sizes in Index.

    bad-2-index-3.xz has non-null byte in Index Padding.

    bad-2-index-4.xz wrong CRC32 in Index.

    bad-2-index-5.xz has zero as Unpadded Size. It is important that the
    file gets rejected specifically due to Unpadded Size having an invalid
    value.

    bad-3-index-uncomp-overflow.xz has Index whose Uncompressed Size
    fields have huge values whose sum exceeds the maximum allowed size
    of 2^63 - 1 bytes. In this file the sum is exactly 2^64.
    lzma_index_append() in liblzma <= 5.2.6 lacks the integer overflow
    check for the uncompressed size and thus doesn't catch the error
    when decoding the Index field in this file. This makes "xz -l"
    not detect the error and will display 0 as the uncompressed size.
    Note that regular decompression isn't affected by this bug because
    it uses lzma_index_hash_append() instead.

    bad-2-compressed_data_padding.xz has non-null byte in the padding of
    the Compressed Data field of the first Block.

    bad-1-check-crc32.xz has wrong Check (CRC32).

    bad-1-check-crc32-2.xz has Compressed Size and Uncompressed Size in
    Block Header but wrong Check (CRC32) in the actual data. This file
    differs by one byte from good-1-block_header-1.xz: the last byte of
    the Check field is wrong. This file is useful for testing error
    detection in the threaded decoder when a worker thread is configured
    to pass input one byte at a time to the Block decoder.

    bad-1-check-crc64.xz has wrong Check (CRC64).

    bad-1-check-sha256.xz has wrong Check (SHA-256).

    bad-1-lzma2-1.xz has LZMA2 stream whose first chunk (uncompressed)
    doesn't reset the dictionary.

    bad-1-lzma2-2.xz has two LZMA2 chunks, of which the second chunk
    indicates dictionary reset, but the LZMA compressed data tries to
    repeat data from the previous chunk.

    bad-1-lzma2-3.xz sets new invalid properties (lc=8, lp=0, pb=0) in
    the middle of Block.

    bad-1-lzma2-4.xz has two LZMA2 chunks, of which the first is
    uncompressed and the second is LZMA. The first chunk resets dictionary
    as it should, but the second chunk tries to reset state without
    specifying properties for LZMA.

    bad-1-lzma2-5.xz is like bad-1-lzma2-4.xz but doesn't try to reset
    anything in the header of the second chunk.

    bad-1-lzma2-6.xz has reserved LZMA2 control byte value (0x03).

    bad-1-lzma2-7.xz has EOPM at LZMA level.

    bad-1-lzma2-8.xz is like good-1-lzma2-4.xz but doesn't set new
    properties in the third LZMA2 chunk.

    bad-1-lzma2-9.xz has LZMA2 stream that is truncated at the end of
    a LZMA2 chunk (no end marker). The uncompressed size of the partial
    LZMA2 stream exceeds the value stored in the Block Header.

    bad-1-lzma2-10.xz has LZMA2 stream that, from point of view of a
    LZMA2 decoder, extends past the end of Block (and even the end of
    the file). Uncompressed Size in Block Header is bigger than the
    invalid LZMA2 stream may produce (even if a decoder reads until
    the end of the file). The Check type is None to nullify certain
    simple size-based sanity checks in a Block decoder.

    bad-1-lzma2-11.xz has LZMA2 stream that lacks the end of
    payload marker. When Compressed Size bytes have been decoded,
    Uncompressed Size bytes of output will have been produced but
    the LZMA2 decoder doesn't indicate end of stream.


3. Descriptions of Individual .lzma Files

3.1. Good Files

    good-unknown_size-with_eopm.lzma has unknown size in the header
    and end of payload marker at the end.

    good-known_size-without_eopm.lzma has a known size in the header
    and no end of payload marker at the end.

    good-known_size-with_eopm.lzma has a known size in the header
    and end of payload marker at the end. XZ Utils 5.2.5 and older
    will give an error at the end of the file after producing the
    correct uncompressed output.


3.2. Bad Files

    bad-unknown_size-without_eopm.lzma has unknown size in the header
    but no end of payload marker at the end. This file might be seen
    by a decoder as if it were truncated.

    bad-too_big_size-with_eopm.lzma has too big uncompressed size in
    the header and the end of payload marker will be detected before
    the specified number of bytes have been decoded.

    bad-too_small_size-without_eopm-1.lzma has too small uncompressed
    size in the header. The decoder will look for end of payload marker
    but instead find a literal that would produce more output.

    bad-too_small_size-without_eopm-2.lzma is like -1 above but instead
    of a literal the problem occurs with a short repeated match.

    bad-too_small_size-without_eopm-3.lzma is like -1 above but instead
    of a literal the problem occurs in the middle of a match.


4. Descriptions of Individual .lz (lzip) Files

4.1. Good Files

    good-1-v0.lz contains a single version 0 member. lzip 1.17 and
    *older* can decompress this; support for version 0 was removed
    in lzip 1.18.

    good-1-v0-trailing-1.lz is like good-1-v0.lz but contains
    trailing data that the decompressor must ignore.

    good-1-v1.lz contains a single version 1 member. lzip 1.3 and
    newer can decompress this.

    good-1-v1-trailing-1.lz is like good-1-v1.lz but contains
    trailing data that the decompressor must ignore.

    good-1-v1-trailing-2.lz is like good-1-v1.lz but contains
    trailing data whose first three bytes match the .lz magic bytes.
    With lzip >= 1.20 this file results in an error unless one uses
    the command line option --loose-trailing. lzip 1.3 to 1.19 decode
    this file successfully by default. XZ Utils uses the old behavior
    because it allows lzma_code() to stop at the first byte of the
    trailing data as long as the first byte isn't 0x4C (L in US-ASCII);
    otherwise the first 1-3 bytes that equal to the magic bytes are
    consumed and lost in lzma_code(), and this is visible in xz too:

        $ ( xz -dc ; cat ) < good-1-v1-trailing-2.lz
        Hello
        World!
        Trailing garbage

        $ ( xz -dc --single-stream ; cat ) < good-1-v1-trailing-2.lz
        Hello
        World!
        LZITrailing garbage

    good-2-v0-v1.lz contains two members of which the first is
    version 0 and the second version 1. lzip versions 1.3 to 1.17
    (inclusive) can decompress this.

    good-2-v1-v0.lz contains two members of which the first is
    version 1 and the second version 0. lzip versions 1.3 to 1.17
    (inclusive) can decompress this.

    good-2-v1-v1.lz contains two version 1 members. lzip versions 1.3
    and newer can decompress this.


4.2. Unsupported Files

    unsupported-1-v234.lz is like good-1-v1.lz except the version
    field has been set to 234 (0xEA) which, as of writing, isn't
    defined or supported by any .lz implementation.


4.3. Bad Files

    bad-1-v1-magic-1.lz is like good-1-v1.lz but the first magic byte
    is wrong.

    bad-1-v1-magic-2.lz is like good-1-v1.lz but the last (fourth)
    magic byte is wrong.

    bad-1-v1-dict-1.lz has too low value in the dictionary size field.

    bad-1-v1-dict-2.lz has too high value in the dictionary size field.

    bad-1-v1-crc32.lz has wrong CRC32 value.

    bad-1-v0-uncomp-size.lz is version 0 format with incorrect value
    in the uncompressed size field.

    bad-1-v1-uncomp-size.lz is version 1 format with incorrect value
    in the uncompressed size field.

    bad-1-v1-member-size.lz has incorrect value in the member size
    field.

    bad-1-v1-trailing-magic.lz has the four .lz magic bytes as trailing
    data. This should be detected as a truncated file and thus result
    in an error. That is, the last four bytes of the file should not be
    ignored as trailing garbage. lzip >= 1.18 matches this behavior
    while older versions ignore the last four bytes and don't indicate
    an error.