Sources: Read-only Streams of Data

Reading archives with zipsource

You can wrap any readable Julia IO object with the zipsource function. Iterating over the returned object yields the archived files in archive order. Information about each file can be accessed through the info method called on the object returned from the iterator. That object is readable like any standard Julia IO object, but it is not writable.

Here are some examples:

Iterating through files from an archive on disk

This is perhaps the most common way to work with ZIP archives: reading them from disk and processing the files they contain. Because zipsource reads from the beginning of the file to the end, you can only iterate through files in archive order; random access to files is not possible. Here is an example of this kind of file iteration:

using ZipStreams

# open an archive from an IO object
open("archive.zip") do io
    zs = zipsource(io)

    # iterate through files
    for f in zs
        
        # get information about each file from the info method
        println(info(f).name)

        # read from the file just like any other IO object
        println(readline(f))
        
        # read the rest of the file into a String
        println(read(f, String))
    end
end

You can use the next_file method to access the next file in the archive without iterating in a loop. The method returns nothing if it reaches the end of the archive.

using ZipStreams

open("archive.zip") do io
    zs = zipsource(io)
    f = next_file(zs) # the first file in the archive, or nothing if there are no files archived
    # ...
    f = next_file(zs) # the next file in the archive, or nothing if there was only one file
    # ...
end

Because reading ZIP archives from disk is a common use case, a convenience method taking a file name argument is provided:

using ZipStreams

zs = zipsource("archive.zip") # Note: the caller is responsible for closing this to free the file handle
# ... 
close(zs)

In addition, a method that takes a unary function as its first argument is included so that users can manage the lifetime of any file handles opened by zipsource with a do block, in the same way as open(...) do x ... end:

using ZipStreams

zipsource("archive.zip") do zs
    # ...
end # file handle is automatically closed at the end of the block

The same method is defined for IO arguments, but it works slightly differently: the object passed is not closed when the block ends. It assumes that the caller is responsible for the IO object's lifetime. However, manually calling close on the source will always close the wrapped IO object. Here is an example:

using ZipStreams

io = open("archive.zip")
zipsource(io) do zs
    # ...
end
@assert isopen(io) == true

seekstart(io)
zipsource(io) do zs
    # ...
    close(zs) # called manually
end
@assert isopen(io) == false

Verifying the content of ZIP archives

A ZIP archive stores the size and checksum of each file in two of three possible locations. One copy is stored either immediately before the archived file data (in the "Local File Header") or immediately after the archived file data (in the "Data Descriptor"). A second copy is always stored at the end of the archive (in the "Central Directory"). Because the Central Directory is considered the ground truth, the Local File Header and Data Descriptor may report inaccurate values. To verify that the content of the file matches the values in the Local File Header, use the is_valid! method on the archived file. To verify that all file content in the archive matches the values in the Central Directory, use the is_valid! method on the archive itself. These methods will return false if they detect any inconsistencies.

For example, to validate the data in a single file stored in the archive:

using ZipStreams

zipsource("archive.zip") do zs
    f = next_file(zs)
    @assert is_valid!(f) # throws if there is an inconsistency
end

To validate the data in all of the remaining files in the archive:

using ZipStreams

io = open("archive.zip")
zipsource(io) do zs
    @assert is_valid!(zs) # validate all files and the archive itself
end

seekstart(io)
zipsource(io) do zs
    f = next_file(zs)
    read(f, UInt8) # read something from the first file
    @assert is_valid!(zs) # validate all files except the first!
end

close(io)

Reading from two places in an archive at once

Do not attempt to read from two places in an open archive at once, or jump between one open file and another, as this will result in undefined behavior!

The is_valid! methods consume the data in the source and return a Boolean. When called on an archived file, they accept an optional first argument: an IO object into which the remaining file data will be written. When called on the archive itself with an optional first argument IO object, is_valid! writes the contents of the remaining files into that object, concatenated in archive order, excluding any files that have already been read by iterating or with next_file. For example:

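using ZipStreams

# FILE_CONTENTS, FILE1_CONTENTS, and FILE2_CONTENTS are placeholders for the
# expected raw bytes of the archived files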

zs = zipsource("archive.zip")
f1 = next_file(zs)
data1 = IOBuffer()
@assert is_valid!(data1, f1) # data1 contains all the file data as raw bytes
@assert take!(data1) == FILE_CONTENTS
close(zs)

zs = zipsource("archive.zip")
f2 = next_file(zs)
println(readline(f2)) # read a line off the file first
data2 = IOBuffer()
@assert is_valid!(data2, f2) # data2 contains the remaining file data excluding the first line!
@assert sizeof(take!(data2)) < sizeof(FILE_CONTENTS)
close(zs)

zs = zipsource("archive.zip")
all_data = IOBuffer()
@assert is_valid!(all_data, zs) # all_data now contains all files concatenated together
@assert take!(all_data) == vcat(FILE1_CONTENTS, FILE2_CONTENTS) # and so on for every file in the archive
close(zs)

The exclamation point in the method name is a warning to the user that these methods consume the data in the file or archive, as demonstrated in this example:

using ZipStreams

zs = zipsource("archive.zip")
is_valid!(zs)
@assert eof(zs) == true
close(zs) # the caller is still responsible for closing the source

API

ZipStreams.ZipArchiveSource (Type)
ZipArchiveSource

A read-only lazy streamable representation of a Zip archive.

The authoritative record of files present in a Zip archive is stored in the Central Directory at the end of the archive. This makes appending new files easy: the new file data overwrites the old Central Directory, and a new Central Directory with the updated contents is written afterward. It also makes deleting files easy: the Central Directory is overwritten with the updated contents, and compliant Zip archive extraction programs ignore the actual bytes in the file and trust only the new Central Directory.

Unfortunately, this choice makes reading the contents of a Zip archive sub-optimal, especially over streaming IO interfaces like networks, where seeking to the end of the file requires reading all of the file's contents first.

However, this package chooses not to be a compliant Zip archive reader. By ignoring the Central Directory, one can begin extracting data from a Zip archive immediately upon reading the first Local File Header record it sees in the stream, greatly reducing latency to first read on large files, and also reducing the amount of data necessary to cache on disk or in memory.
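To make the streaming approach concrete, the sketch below parses the fixed 30-byte portion of a Local File Header from any readable IO, using only Base Julia. It illustrates the record layout of the ZIP format and is not the parser ZipStreams uses; LocalHeaderSketch, read_local_header, and uses_data_descriptor are hypothetical names introduced for this example.

```julia
# Hypothetical names for illustration; these are not part of ZipStreams.
struct LocalHeaderSketch
    flags::UInt16
    method::UInt16
    crc32::UInt32
    compressed_size::UInt32
    uncompressed_size::UInt32
    name::String
end

# Every Local File Header starts with these four bytes ("PK\x03\x04" on disk).
const LOCAL_HEADER_SIGNATURE = 0x04034b50

# Parse one Local File Header; all multi-byte fields are little-endian.
function read_local_header(io::IO)
    ltoh(read(io, UInt32)) == LOCAL_HEADER_SIGNATURE || error("not a Local File Header")
    read(io, 2)                               # version needed to extract (ignored)
    flags = ltoh(read(io, UInt16))
    method = ltoh(read(io, UInt16))
    read(io, 4)                               # last-modified time and date (ignored)
    crc = ltoh(read(io, UInt32))
    csize = ltoh(read(io, UInt32))
    usize = ltoh(read(io, UInt32))
    name_len = Int(ltoh(read(io, UInt16)))
    extra_len = Int(ltoh(read(io, UInt16)))
    name = String(read(io, name_len))
    read(io, extra_len)                       # extra field (ignored)
    return LocalHeaderSketch(flags, method, crc, csize, usize, name)
end

# Bit 3 of the flags means sizes and CRC-32 follow the data in a Data Descriptor.
uses_data_descriptor(h::LocalHeaderSketch) = (h.flags & 0x0008) != 0

# Build a sample header in memory and parse it back.
buf = IOBuffer()
write(buf, htol(LOCAL_HEADER_SIGNATURE))
write(buf, htol(UInt16(20)))   # version needed to extract
write(buf, htol(0x0008))       # flags: bit 3 set
write(buf, htol(UInt16(8)))    # compression method 8 (Deflate)
write(buf, htol(UInt32(0)))    # last-modified time and date
write(buf, htol(UInt32(0)))    # CRC-32 (zero when a Data Descriptor is used)
write(buf, htol(UInt32(0)))    # compressed size
write(buf, htol(UInt32(0)))    # uncompressed size
write(buf, htol(UInt16(9)))    # file name length
write(buf, htol(UInt16(0)))    # extra field length
write(buf, "hello.txt")
seekstart(buf)

header = read_local_header(buf)
println(header.name, " uses a Data Descriptor: ", uses_data_descriptor(header))
```

When bit 3 is set, the header itself carries no usable sizes, so a streaming reader only learns them after consuming the file data.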

A ZipArchiveSource is a wrapper around an IO object that allows the user to extract files as they are read from the stream instead of waiting to read the file information from the Central Directory at the end of the stream.

ZipArchiveSource objects can be iterated. Each iteration returns an IO object that will lazily extract (and decompress) file data from the archive.

Information about each file in the archive is stored in the directory property of the struct as the file is read from the archive.

Create ZipArchiveSource objects using the zipsource function.

ZipStreams.zipsource (Function)
zipsource(io)
zipsource(f, io)

Create a read-only lazy streamable representation of a Zip archive.

The first form returns a ZipArchiveSource wrapped around io that allows the user to extract files as they are read from the stream by iterating over the returned object. io can be an object that inherits from Base.IO (technically only requiring read, eof, isopen, close, and bytesavailable to be defined) or an AbstractString file name, which will open the file in read-only mode and wrap that IOStream.

The second form takes a unary function as the first argument. The constructed ZipArchiveSource object will be passed to the function and the result of the function will be returned to the user. This allows compatibility with do blocks. If io is an AbstractString file name, the file will be automatically closed when the block exits. If io is a Base.IO object as described above, it will not be closed when the block exits, allowing the caller to have control over the lifetime of the argument.
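The ownership rule described above (close what you opened, leave alone what you were given) is a common Julia pattern. The sketch below shows the pattern with plain files and Base only. It is not the ZipStreams implementation, and with_stream is a hypothetical name used just for this illustration.

```julia
# Hypothetical helper for illustration; this is not ZipStreams code.

# Given a file name: the helper opens the file, so the helper closes it.
function with_stream(f::Function, path::AbstractString)
    io = open(path, "r")
    try
        return f(io)
    finally
        close(io)
    end
end

# Given an IO object: the caller opened it, so the caller closes it.
with_stream(f::Function, io::IO) = f(io)

# Create a small temporary file to read from.
path, tmp = mktemp()
write(tmp, "example")
close(tmp)

# File name form: the handle is closed when the block exits.
captured = Ref{IO}()
first_byte = with_stream(path) do io
    captured[] = io
    read(io, UInt8)
end
closed_after_block = !isopen(captured[])

# IO form: the handle stays open after the block exits.
io = open(path)
with_stream(io) do s
    read(s, UInt8)
end
still_open = isopen(io)

close(io)
rm(path)
```

Dispatching on the argument type keeps the two lifetimes separate without any extra flags.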

Reading before knowing where files end can be dangerous!

The Central Directory in the Zip archive is the authoritative source for file locations, compressed and uncompressed sizes, and CRC-32 checksums. A Local File Header can lie about this information, leading to improper file extraction. We highly recommend that users validate the file contents against the Central Directory using the is_valid! method before beginning to trust the extracted files from uncontrolled sources.

ZipStreams.next_file (Function)
next_file(archive) -> Union{IO, Nothing}

Read the next file in the archive and return a readable IO object or nothing.

While files remain in the archive, this is the same as calling first(iterate(archive)). At the end of the archive it returns nothing.

ZipStreams.is_valid! (Function)
is_valid!([sink::IO,] zf::ZipFileSource) -> Bool

Validate that the contents read from an archived file match the information stored in the Local File Header, optionally writing the remaining file data to a sink.

If the contents of the file do not match the information in the Local File Header, the method will describe the detected error using @error logging. The method checks that the compressed and uncompressed file sizes match what is in the header and that the CRC-32 of the uncompressed data matches what is reported in the header. Validation will work even on files that have been partially read.

The exclamation mark in the function name is a warning to the user that the function destructively reads bytes from the ZipFileSource. If sink is provided, the remaining unread bytes from zf will be extracted into sink.

Because data cannot be written to a ZipFileSource, repeated calls to is_valid! will return the same result each time, but will only extract data to sink on the first call.
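The size and CRC-32 comparison described above can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not the code ZipStreams runs: crc32_bitwise and matches_header are hypothetical names, the checksum uses a slow bit-by-bit loop, and only the uncompressed size is compared.

```julia
# Hypothetical helpers for illustration; these are not part of ZipStreams.

# Bit-by-bit CRC-32 as used by the ZIP format (reflected polynomial 0xedb88320).
function crc32_bitwise(data::AbstractVector{UInt8})
    crc = 0xffffffff
    for byte in data
        crc = xor(crc, UInt32(byte))
        for _ in 1:8
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc & 0x00000001) != 0 ? xor(crc >> 1, 0xedb88320) : crc >> 1
        end
    end
    return xor(crc, 0xffffffff)
end

# Compare extracted bytes with the size and checksum that a header claims.
function matches_header(data::AbstractVector{UInt8}, expected_size::Integer, expected_crc::UInt32)
    return length(data) == expected_size && crc32_bitwise(data) == expected_crc
end

data = collect(codeunits("123456789"))
ok = matches_header(data, 9, 0xcbf43926)    # header values agree with the data
bad = matches_header(data, 9, 0x00000000)   # header reports a wrong checksum
println("consistent header: ", ok, ", inconsistent header: ", bad)
```

A header that misreports either value makes the comparison fail, which is the kind of inconsistency the validation is meant to surface.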

is_valid!([sink::IO,] source::ZipArchiveSource) -> Bool

Validate the files in the archive source against the Central Directory at the end of the archive.

This method consumes all the remaining data from source and returns false if the file information read from the file headers does not match the information in the Central Directory. Files that have already been consumed prior to calling this method will still be validated, but the local headers of those files will not be validated against the local data that has already been consumed.

The exclamation mark in the function name is a warning to the user that the function destructively reads bytes from the ZipArchiveSource. If sink is provided, the remaining unread bytes from source will be extracted and the data from the remaining files will be written as concatenated bytes into sink.

Because data cannot be written to a ZipArchiveSource, repeated calls to is_valid! will return the same result each time, but will only extract data to sink on the first call.

Files using descriptors

If a file stored within source uses a Data Descriptor rather than storing the size of the file in the Local File Header, the file must be read to the end in order to properly record the lengths for checking against the Central Directory. Failure to read such a file to the end will result in is_valid! returning false when called on the archive.

See also is_valid!(::ZipFileSource).
