Sources: Read-only Streams of Data
Reading archives with zipsource
You can wrap any readable Julia IO object with the zipsource function. The returned object can be iterated to read archived files in archive order. Information about each file can be accessed through the info method called on the object returned from the iterator. That object is readable like any standard Julia IO object, but it is not writable.
Here are some examples:
Iterating through files from an archive on disk
This is perhaps the most common way to work with ZIP archives: reading them from disk and doing things with the contained files. Because zipsource reads from the beginning of the file to the end, you can only iterate through files in archive order and cannot randomly access files. Here is an example of how to work with this kind of file iteration:
using ZipStreams
# open an archive from an IO object
open("archive.zip") do io
    zs = zipsource(io)
    # iterate through files
    for f in zs
        # get information about each file from the info method
        println(info(f).name)
        # read from the file just like any other IO object
        println(readline(f))
        println(read(f, String))
    end
end
You can use the next_file method to access the next file in the archive without iterating in a loop. The method returns nothing if it reaches the end of the archive.
using ZipStreams
open("archive.zip") do io
    zs = zipsource(io)
    f = next_file(zs) # the first file in the archive, or nothing if there are no files archived
    # ...
    f = next_file(zs) # the next file in the archive, or nothing if there was only one file
    # ...
end
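Because next_file returns nothing at the end of the archive, it can also drive a manual loop. The following is a minimal sketch that uses only the functions described on this page; the helper name list_names is hypothetical and not part of ZipStreams, and the sketch assumes that info(f).name can be stored as a String.

```julia
using ZipStreams

# Hypothetical helper (not part of ZipStreams): collect the names of all
# archived files by calling next_file until it returns nothing.
function list_names(path::AbstractString)
    names = String[]
    zipsource(path) do zs
        f = next_file(zs)
        while f !== nothing
            push!(names, info(f).name)
            f = next_file(zs)  # advance to the following file, or nothing at the end
        end
    end
    return names
end
```

Calling list_names("archive.zip") should list the file names in archive order, which is the same order a for loop over the source would visit them.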
Because reading ZIP files from an archive on disk is a common use case, a convenience method taking a file name argument is provided:
using ZipStreams
zs = zipsource("archive.zip") # Note: the caller is responsible for closing this to free the file handle
# ...
close(zs)
In addition, a method that takes a unary function as its first argument is included so that users can manage the lifetime of any file handles opened by zipsource with a do block, in the same way as open() do x ... end:
using ZipStreams
zipsource("archive.zip") do zs
    # ...
end # file handle is automatically closed at the end of the block
The same method is defined for IO arguments, but it works slightly differently: the object passed is not closed when the block ends. It assumes that the caller is responsible for the IO object's lifetime. However, manually calling close on the source will always close the wrapped IO object. Here is an example:
using ZipStreams
io = open("archive.zip")
zipsource(io) do zs
    # ...
end
@assert isopen(io) == true
seekstart(io)
zipsource(io) do zs
    # ...
    close(zs) # called manually
end
@assert isopen(io) == false
Verifying the content of ZIP archives
A ZIP archive stores file sizes and checksums in two of three locations: either immediately before the archived file data (in the "Local File Header") or immediately after the archived file data (in the "Data Descriptor"), and always at the end of the archive (in the "Central Directory"). Because the Central Directory is considered the ground truth, the Local File Header and Data Descriptor may report inaccurate values. To verify that the content of the file matches the values in the Local File Header, use the is_valid! method on the archived file. To verify that all file content in the archive matches the values in the Central Directory, use the is_valid! method on the archive itself. These methods will return false if they detect any inconsistencies.
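To make the Local File Header concrete, here is a minimal sketch that decodes its fixed 30-byte prefix using only Julia's standard library. It is not how ZipStreams parses headers internally; the names LocalHeaderSummary and parse_local_header are hypothetical, and the sample header is built by hand rather than read from a real archive. Bit 3 of the flags field signals that the sizes and checksum are deferred to a Data Descriptor.

```julia
# Hypothetical type and function names; this is an illustration, not ZipStreams API.
struct LocalHeaderSummary
    crc32::UInt32
    compressed_size::UInt32
    uncompressed_size::UInt32
    name::String
    uses_data_descriptor::Bool
end

function parse_local_header(io::IO)
    # All multi-byte fields in a ZIP archive are little-endian.
    signature = ltoh(read(io, UInt32))
    signature == 0x04034b50 || error("not a Local File Header")
    read(io, UInt16)                    # version needed to extract (ignored)
    flags = ltoh(read(io, UInt16))      # general purpose bit flags
    read(io, UInt16)                    # compression method (ignored)
    read(io, UInt16)                    # last modification time (ignored)
    read(io, UInt16)                    # last modification date (ignored)
    crc = ltoh(read(io, UInt32))
    csize = ltoh(read(io, UInt32))
    usize = ltoh(read(io, UInt32))
    name_len = Int(ltoh(read(io, UInt16)))
    extra_len = Int(ltoh(read(io, UInt16)))
    name = String(read(io, name_len))
    skip(io, extra_len)                 # skip the extra field
    return LocalHeaderSummary(crc, csize, usize, name, (flags & 0x0008) != 0)
end

# Build a sample header by hand: flag bit 3 is set, so the checksum and
# sizes are zero here and would follow the file data in a Data Descriptor.
buf = IOBuffer()
write(buf, htol(0x04034b50))                    # signature "PK\x03\x04"
write(buf, htol(0x0014))                        # version needed: 2.0
write(buf, htol(0x0008))                        # flags: bit 3 set
write(buf, htol(0x0008))                        # method 8: deflate
write(buf, htol(0x0000), htol(0x0000))          # time, date
write(buf, htol(0x00000000), htol(0x00000000), htol(0x00000000))  # crc, sizes
write(buf, htol(UInt16(9)), htol(0x0000))       # name length, extra length
write(buf, "hello.txt")
seekstart(buf)

hdr = parse_local_header(buf)
println(hdr.name, " uses a Data Descriptor: ", hdr.uses_data_descriptor)
```

A header like this one carries no usable size or checksum, which is why a streaming reader has to read such a file to its end before it can check anything.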
For example, to validate the data in a single file stored in the archive:
using ZipStreams
zipsource("archive.zip") do zs
    f = next_file(zs)
    @assert is_valid!(f) # the assertion fails if there is an inconsistency
end
To validate the data in all of the remaining files in the archive:
using ZipStreams
io = open("archive.zip")
zipsource(io) do zs
    @assert is_valid!(zs) # validate all files and the archive itself
end
seekstart(io)
zipsource(io) do zs
    f = next_file(zs)
    read(f, UInt8) # read something from the first file
    @assert is_valid!(zs) # validate all files except the first!
end
close(io)
The is_valid! methods consume the data in the source and return a Boolean. When called on an archived file, you can pass as an optional first argument an IO object into which the remaining file data will be dumped. When called on the archive itself with an optional first argument IO object, it will dump the contents of the remaining files into the object, concatenated and in archive order, excluding any files that have already been read by iterating or with next_file.
Do not attempt to read from two places in an open archive at once, or jump between one open file and another, as this will result in undefined behavior!
using ZipStreams
zs = zipsource("archive.zip")
f1 = next_file(zs)
data1 = IOBuffer()
@assert is_valid!(data1, f1) # data1 contains all the file data as raw bytes
@assert take!(data1) == FILE_CONTENTS
close(zs)
zs = zipsource("archive.zip")
f2 = next_file(zs)
println(readline(f2)) # read a line off the file first
data2 = IOBuffer()
@assert is_valid!(data2, f2) # data2 contains the remaining file data excluding the first line!
@assert sizeof(take!(data2)) < sizeof(FILE_CONTENTS)
close(zs)
zs = zipsource("archive.zip")
all_data = IOBuffer()
@assert is_valid!(all_data, zs) # all_data now contains all files concatenated together
@assert take!(all_data) == vcat(FILE1_CONTENTS, FILE2_CONTENTS, etc)
close(zs)
The exclamation point in the method name is a warning to the user that these methods consume the data in the file or archive, as demonstrated in this example:
using ZipStreams
zs = zipsource("archive.zip")
is_valid!(zs)
@assert eof(zs) == true
API
ZipStreams.ZipArchiveSource — Type
ZipArchiveSource
A read-only lazy streamable representation of a Zip archive.
The authoritative record of files present in a Zip archive is stored in the Central Directory at the end of the archive. This allows for easy appending of new files to the archive by overwriting the Central Directory and adding a new Central Directory with the updated contents afterward. It also allows for easy deletion of files from the old archive by overwriting the Central Directory with the updated contents and relying on compliant Zip archive extraction programs ignoring the actual bytes in the file and only trusting the new Central Directory.
Unfortunately, this choice makes reading the contents of a Zip archive sub-optimal, especially over streaming IO interfaces like networks, where seeking to the end of the file requires reading all of the file's contents first.
However, this package chooses not to be a compliant Zip archive reader. By ignoring the Central Directory, one can begin extracting data from a Zip archive immediately upon reading the first Local File Header record it sees in the stream, greatly reducing latency to first read on large files, and also reducing the amount of data necessary to cache on disk or in memory.
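The cost of relying on the Central Directory can be illustrated with a short sketch. A compliant reader first has to locate the End of Central Directory record, which sits at the very end of the archive, so the whole stream must be available before the first file can be extracted. The helper find_eocd below is hypothetical and is not part of ZipStreams; it assumes the archive is already fully in memory as a byte vector.

```julia
# Signature of the End of Central Directory record: "PK\x05\x06".
const EOCD_SIGNATURE = UInt8[0x50, 0x4b, 0x05, 0x06]

# Hypothetical helper: return the 1-based offset of the End of Central
# Directory record, or nothing if it cannot be found.
function find_eocd(bytes::AbstractVector{UInt8})
    # The record is at least 22 bytes long, so the scan starts 22 bytes
    # before the end and walks backward toward the start of the data.
    for start in (length(bytes) - 21):-1:1
        if view(bytes, start:start+3) == EOCD_SIGNATURE
            return start
        end
    end
    return nothing
end

# An empty ZIP archive consists of nothing but a 22-byte record.
empty_archive = vcat(EOCD_SIGNATURE, zeros(UInt8, 18))
println(find_eocd(empty_archive))
```

The scan cannot begin until the last byte has arrived, which is exactly the latency that reading Local File Headers in stream order avoids.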
A ZipArchiveSource is a wrapper around an IO object that allows the user to extract files as they are read from the stream instead of waiting to read the file information from the Central Directory at the end of the stream.
ZipArchiveSource objects can be iterated. Each iteration returns an IO object that will lazily extract (and decompress) file data from the archive.
Information about each file in the archive is stored in the directory property of the struct as the file is read from the archive.
Create ZipArchiveSource objects using the zipsource function.
ZipStreams.zipsource — Function
zipsource(io)
zipsource(f, io)
Create a read-only lazy streamable representation of a Zip archive.
The first form returns a ZipArchiveSource wrapped around io that allows the user to extract files as they are read from the stream by iterating over the returned object. io can be an object that inherits from Base.IO (technically only requiring read, eof, isopen, close, and bytesavailable to be defined) or an AbstractString file name, which will open the file in read-only mode and wrap that IOStream.
The second form takes a unary function as the first argument. The constructed ZipArchiveSource object will be passed to the function and the results of the function will be returned to the user. This allows compatibility with do blocks. If io is an AbstractString file name, the file will be automatically closed when the block exits. If io is a Base.IO object as described above, it will not be closed when the block exits, allowing the caller to have control over the lifetime of the argument.
The Central Directory in the Zip archive is the authoritative source for file locations, compressed and uncompressed sizes, and CRC-32 checksums. A Local File Header can lie about this information, leading to improper file extraction. We highly recommend that users validate the file contents against the Central Directory using the is_valid! method before beginning to trust the extracted files from uncontrolled sources.
ZipStreams.next_file — Function
next_file(archive) => Union{IO, Nothing}
Read the next file in the archive and return a readable IO object or nothing.
This is the same as calling first(iterate(archive)).
ZipStreams.is_valid! — Function
is_valid!([sink::IO,] zf::ZipFileSource) -> Bool
Validate that the contents read from an archived file match the information stored in the Local File Header, optionally writing remaining file information to a sink.
If the contents of the file do not match the information in the Local File Header, the method will describe the detected error using @error logging. The method checks that the compressed and uncompressed file sizes match what is in the header and that the CRC-32 of the uncompressed data matches what is reported in the header. Validation will work even on files that have been partially read.
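As an illustration of the kind of comparison described above, the following sketch computes a CRC-32 with a simple bitwise loop and checks it, together with the data length, against expected values. It is not the package's implementation: crc32_bitwise and matches_header are hypothetical names, and a real reader would update the checksum incrementally while streaming rather than hold all of the data in memory.

```julia
# Bitwise CRC-32 using the reflected polynomial 0xedb88320, which is the
# checksum variant used by the ZIP format. Slow, but dependency-free.
function crc32_bitwise(data::AbstractVector{UInt8})
    crc = 0xffffffff
    for byte in data
        crc = xor(crc, UInt32(byte))
        for _ in 1:8
            if (crc & 0x00000001) == 0x00000001
                crc = xor(crc >> 1, 0xedb88320)
            else
                crc = crc >> 1
            end
        end
    end
    return xor(crc, 0xffffffff)
end

# Hypothetical helper: do the extracted bytes agree with the uncompressed
# size and CRC-32 that a header reports?
function matches_header(data::AbstractVector{UInt8}, expected_size::Integer, expected_crc::UInt32)
    return length(data) == expected_size && crc32_bitwise(data) == expected_crc
end

payload = Vector{UInt8}("123456789")  # standard CRC test input
println(matches_header(payload, 9, 0xcbf43926))
```

A mismatch in either the size or the checksum makes matches_header return false, mirroring the false result that is_valid! reports for an inconsistent file.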
The exclamation mark in the function name is a warning to the user that the function destructively reads bytes from the ZipFileSource. If sink is provided, the remaining unread bytes from zf will be extracted into sink.
Because data cannot be written to a ZipFileSource, repeated calls to is_valid! will return the same result each time, but will only extract data to sink on the first call.
is_valid!([sink::IO,] source::ZipArchiveSource) -> Bool
Validate the files in the archive source against the Central Directory at the end of the archive.
This method consumes all the remaining data from source and returns false if the file information from the file headers read does not match the information in the Central Directory. Files that have already been consumed prior to calling this method will still be validated, but the local headers of those files will not be validated against the local data that has already been consumed.
The exclamation mark in the function name is a warning to the user that the function destructively reads bytes from the ZipArchiveSource. If sink is provided, the remaining unread bytes from source will be extracted and the data from the remaining files will be written as concatenated bytes into sink.
Because data cannot be written to a ZipArchiveSource, repeated calls to is_valid! will return the same result each time, but will only extract data to sink on the first call.
If a file stored within source uses a Data Descriptor rather than storing the size of the file in the Local File Header, the file must be read to the end in order to properly record the lengths for checking against the Central Directory. Failure to read such a file to the end will result in is_valid! returning false when called on the archive.
See also is_valid!(::ZipFileSource).
ZipStreams.info — Method
info(zipfile)
Return a ZipFileInformation object describing the file.