While writing the nix-nar crate, I had to bend Rust’s Iterator API to do things it wasn’t designed for. The resulting code looks strange, so let’s walk through increasingly complex implementations to see why it has to be that way.

The full code for this post is here.

The goal

The nix-nar crate implements an encoder and decoder for the NAR format. NAR is like tar, but its encoding rules are stricter, so there’s only one valid NAR encoding for a given set of files. This is useful, for instance, if we’re doing repeatable builds and storing artefacts in NAR files: two builds that produce the same files are guaranteed to produce the exact same NAR, whereas with tar the order of files might differ, resulting in different hashes for the archive.

The encoder API is easy, but the decoder has requirements that don’t mesh well together:

  • The decoder should list the files in the archive, one at a time. However, it should not load the whole list at once to avoid problems with archives of very many small files,

  • Users should be able to choose whether to load a file from the archive or not. If they choose to skip a file, it shouldn’t be loaded into memory,

  • More generally, files should never be fully stored in memory to avoid problems with big files, and

  • The file listing and file reading APIs should follow Rust conventions and use standard traits where possible.

Ultimately, we want usage to look like this:

let dec = Decoder::new(BufReader::new(File::open("my-archive.nar")?))?;
for entry in dec.entries()? {
    let entry = entry?;
    match entry.content {
        Content::Directory => create_dir(entry.path)?,
        Content::Symlink { target } => create_symlink(entry.path, target)?,
        Content::File { mut data, .. } => {
            let mut out = File::create(entry.path)?;
            io::copy(&mut data, &mut out)?;
        }
    }
}

The key bits of Rust machinery in use are the Read and Iterator traits. The data in Content::File above is a struct that behaves like a readable file, but it actually references a slice of bytes in the real underlying file. Because it implements the Read trait, this is transparent to users.

The Iterator implementation in dec.entries() is more convoluted. Before we delve into it, let’s look at a simpler example.

A simple Iterator over a Vec

The simplest iterator we could possibly write is one over a Vec.

pub struct MyVec<T>(Vec<T>);

pub struct MyIterator<'a, T> {
    vec: &'a MyVec<T>,
    idx: usize,
}

impl<T> MyVec<T> {
    pub fn new(vec: Vec<T>) -> Self {
        Self(vec)
    }
    pub fn iter(&self) -> MyIterator<'_, T> {
        MyIterator { vec: self, idx: 0 }
    }
}

impl<'a, T> Iterator for MyIterator<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<Self::Item> {
        if self.idx < self.vec.0.len() {
            self.idx += 1;
            Some(&self.vec.0[self.idx - 1])
        } else {
            None
        }
    }
}

We define the MyVec wrapper around a real Vec, and the MyIterator struct which will be our iterator. The latter holds a reference to MyVec so that it can access elements, and it also stores the current index into the vector.

To implement Iterator, we need to declare the type of the Items returned by next(), and then define next() itself. Per the Iterator trait, next takes a mutable reference to the iterator, and returns read-only references to the underlying values of type T. This function signature will make our jobs harder in a bit.
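As a quick check of that signature, the sketch below restates MyVec and MyIterator so that it compiles on its own, then calls next() twice by hand and keeps both items alive. This is allowed because Item = &'a T borrows from the MyVec, not from the &mut self passed to next(). The helper name first_two is made up for illustration.

```rust
pub struct MyVec<T>(Vec<T>);

pub struct MyIterator<'a, T> {
    vec: &'a MyVec<T>,
    idx: usize,
}

impl<T> MyVec<T> {
    pub fn new(vec: Vec<T>) -> Self {
        Self(vec)
    }

    pub fn iter(&self) -> MyIterator<'_, T> {
        MyIterator { vec: self, idx: 0 }
    }
}

impl<'a, T> Iterator for MyIterator<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<Self::Item> {
        // `get` returns `None` once we run past the end.
        let item = self.vec.0.get(self.idx)?;
        self.idx += 1;
        Some(item)
    }
}

/// Hypothetical helper: returns the first two items, if both exist.
pub fn first_two<T>(v: &MyVec<T>) -> Option<(&T, &T)> {
    let mut it = v.iter();
    let a = it.next()?; // the `&mut it` borrow ends with this call...
    let b = it.next()?; // ...so `next()` can run again while `a` is alive
    // Both references outlive `it` because they point into `v`.
    Some((a, b))
}
```

Keep this property in mind: it is exactly what a for loop and adapters such as collect() rely on.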

Interlude: A Reader with predictable output

In the next section, we’ll write an iterator which reads blocks from files. As in, it will return the first 10 bytes of the file, then the next 10 bytes, and so on. This is simpler than parsing NAR files, and it illustrates all the interesting problems.

Before that, we need something to run the iterator on. We could use a concrete file, but the Read trait lets us generate configurable data at runtime:

use std::io::{self, Read};

// A struct implementing `Read` which generates 'A' for 10 bytes,
// then 'B' for 10 bytes, etc.
pub struct AlphabetReader {
    idx: u64,
    size: u64,
}

impl AlphabetReader {
    pub fn new(size: u64) -> Self {
        Self { idx: 0, size }
    }
}

impl Read for AlphabetReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
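        // Report EOF at the end of the data; an empty `buf` can't take a byte.
        if self.idx >= self.size || buf.is_empty() {
            Ok(0)
        } else {
            // b'A' for bytes 0..10, b'B' for 10..20, ..., wrapping after b'Z'.
            buf[0] = b'A' + ((self.idx / 10) % 26) as u8;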
            self.idx += 1;
            Ok(1)
        }
    }
}

The main attraction in Read is the fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> function. We’re given a buffer, and we’re supposed to fill as much of it as possible, and then return the number of bytes written. In our implementation, we fill just one byte because it’s easy, and it’s more likely to reveal bugs later.
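Because read may return fewer bytes than requested, callers have to loop until they have what they need. The sketch below shows that loop with a hypothetical read_full helper and a one-byte-at-a-time Trickle reader; both names are invented for this example. The standard library’s Read::read_exact does a similar job, but fails with UnexpectedEof on short input instead of reporting a count.

```rust
use std::io::{self, Read};

/// Hypothetical reader that hands out its bytes one per `read` call.
pub struct Trickle<'a>(pub &'a [u8]);

impl Read for Trickle<'_> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match (self.0.split_first(), buf.first_mut()) {
            (Some((byte, rest)), Some(slot)) => {
                *slot = *byte;
                self.0 = rest;
                Ok(1)
            }
            // Either no data is left or the buffer has zero length.
            _ => Ok(0),
        }
    }
}

/// Keep calling `read` until `buf` is full or the reader reports EOF.
/// Returns how many bytes were actually placed in `buf`.
pub fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```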

An Iterator with eager reads

Now that we have our data source, let’s write a strict block iterator which loads everything into memory. This way, we avoid having to worry about ownership and references for now.

use std::io::Read;

pub struct MyReader<R: Read>(R);

pub struct Blocks<R> {
    reader: R,
    block_size: u64,
}

impl<R: Read> MyReader<R> {
    pub fn new(reader: R) -> Self {
        Self(reader)
    }

    pub fn blocks(self, block_size: u64) -> Blocks<R> {
        Blocks {
            reader: self.0,
            block_size,
        }
    }
}

impl<R: Read> Iterator for Blocks<R> {
    type Item = Result<Vec<u8>, std::io::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = vec![0; self.block_size as usize];
        let mut bytes_read = 0;
        loop {
            match self.reader.read(&mut buf[bytes_read..]) {
                Ok(0) => break,
                Ok(n) => {
                    bytes_read += n;
                    if bytes_read as u64 == self.block_size {
                        break;
                    }
                }
                Err(err) => return Some(Err(err)),
            }
        }
        if bytes_read == 0 {
            None
        } else {
            buf.truncate(bytes_read);
            Some(Ok(buf))
        }
    }
}

We wrap the given Read into MyReader, so that we can define methods on it. Our iterator is Blocks, which takes the Read from MyReader, and also stores the requested block size.

The iterator returns Result<Vec<u8>, std::io::Error>. Everything is wrapped in a Result so that we can pass IO errors to the user. The actual bytes from the file are in the Vec<u8>. Since Vec is an owned type, we don’t have to deal with borrowing here: nothing from outside the iterator ever references data inside of it.

We try this out with a simple loop over the iterator’s values:

let alpha_reader = alphabet_reader::AlphabetReader::new(100);
let reader = blocks1::MyReader::new(alpha_reader);
const BLOCK_SIZE: u64 = 10;
for (idx, block) in reader.blocks(BLOCK_SIZE).enumerate().take(10) {
    let block = block?;
    println!("block {idx}: {}", std::str::from_utf8(&block)?);
}
$ cargo run -q -- blocks1
block 0: AAAAAAAAAA
block 1: BBBBBBBBBB
block 2: CCCCCCCCCC
block 3: DDDDDDDDDD
block 4: EEEEEEEEEE
block 5: FFFFFFFFFF
block 6: GGGGGGGGGG
block 7: HHHHHHHHHH
block 8: IIIIIIIIII
block 9: JJJJJJJJJJ

This works, but reading each block into memory isn’t good. If this were a real archive file, and we encountered a multi-gigabyte Blu-ray image, we’d be in trouble.

Manual iteration with lazy reading

Keeping with the theme of solving simpler problems, let’s now write a type that lets us walk through the file blocks without reading them eagerly. This isn’t what we ultimately want because it’s not an Iterator, but it shows that the problem is solvable.

use std::io::{Read, Take};

pub struct MyReader<R: Read> {
    reader: R,
}

pub struct Blocks<R> {
    reader: R,
    block_size: u64,
}

impl<R: Read> MyReader<R> {
    pub fn new(reader: R) -> Self {
        Self { reader }
    }

    pub fn blocks(self, block_size: u64) -> Blocks<R> {
        Blocks {
            reader: self.reader,
            block_size,
        }
    }
}

impl<R: Read> Blocks<R> {
    pub fn next(&mut self) -> Option<Take<R>> {
        //               👇 create a `Read` over a slice of the data
        Some(self.reader.take(self.block_size))
    }
}

This is the same code as before, except for the next() function. The magic happens in the returned Take. This is a type which wraps a Read, but only lets users access a slice of the underlying data.

We try compiling and get an error:

error[E0507]: cannot move out of `self.reader` which is behind a mutable reference
   --> src/main.rs:224:18
    |
224 |             Some(self.reader.take(self.block_size))
    |                  ^^^^^^^^^^^ move occurs because `self.reader` has type `R`, which does not implement the `Copy` trait

The Take isn’t just wrapping the R reader–it’s also taking ownership of it. This would be fine if we called next() only once, but we want to call it multiple times, so the iterator’s return value can’t steal R. We can fix this by calling take on a reference to R. This way, each return value gets a copy of the reference to the reader. This also means we can have multiple references to the reader at the same time, but more on this later.

impl<R: Read> Blocks<R> {
    //                                   👇 reference
    pub fn next(&mut self) -> Option<Take<&R>> {
        //    👇 take reference
        Some((&self.reader).take(self.block_size))
    }
}
error[E0599]: the method `read_exact` exists for struct `std::io::Take<&AlphabetReader>`, but its trait bounds were not satisfied
    --> src/main.rs:36:31
     |
36   |                           block.read_exact(&mut buf)?;
     |                                 ^^^^^^^^^^ method cannot be called on `std::io::Take<&AlphabetReader>` due to unsatisfied trait bounds
     |
     = note: the following trait bounds were not satisfied:
             `&AlphabetReader: std::io::Read`
             which is required by `std::io::Take<&AlphabetReader>: std::io::Read`

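The problem now is that Take<T> only implements Read when T does, and while Read is implemented for AlphabetReader (the error names the concrete type because the generic R has been instantiated), it is not implemented for &AlphabetReader. We can’t add an impl for a bare &R ourselves, because Rust’s orphan rules don’t allow implementing a foreign trait such as Read for a reference to an arbitrary type. What we can do is wrap R in a struct of our own and implement Read for references to that: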

pub struct MyReader<R: Read> {
    inner: MyReaderInner<R>,
}

pub struct MyReaderInner<R> {
    reader: R,
}

pub struct Blocks<R> {
    reader: MyReaderInner<R>,
    block_size: u64,
}

Now, we implement Read for &mut MyReaderInner<R>. The reference has to be mutable in order to call read on it.

impl<R: Read> Read for &mut MyReaderInner<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let i = self.reader.read(buf)?;
        Ok(i)
    }
}

impl<R: Read> Blocks<R> {
    //                        👇 the tower of types grows ever taller
    pub fn next(&mut self) -> Option<Take<&mut MyReaderInner<R>>> {
        Some(self.reader.take(self.block_size))
    }
}

Let’s see if this works by iterating over the first few blocks:

const BLOCK_SIZE: u64 = 10;
const BUF_SIZE: usize = 10;
let mut idx = 0;
let mut blocks = reader.blocks(BLOCK_SIZE);
while idx < 10 {
    match blocks.next() {
        None => break,
        Some(mut block) => {
            let mut buf = [0; BUF_SIZE];
            block.read_exact(&mut buf)?;
            println!("block {idx}: {}", std::str::from_utf8(&buf)?);
        }
    }
    idx += 1;
}
$ cargo run -q -- blocks_manual
block 0: AAAAAAAAAA
block 1: BBBBBBBBBB
block 2: CCCCCCCCCC
block 3: DDDDDDDDDD
block 4: EEEEEEEEEE
block 5: FFFFFFFFFF
block 6: GGGGGGGGGG
block 7: HHHHHHHHHH
block 8: IIIIIIIIII
block 9: JJJJJJJJJJ

It works! But that’s a lot of ceremony for what should’ve been a simple for loop. It’s twice as long as what we had before. We can do better: we just have to implement Iterator for Blocks. How hard can it be?
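For comparison, the standard library already covers the mutable-reference case on its own: Read is implemented for &mut R whenever R: Read, and Read::by_ref hands out such a reference. The sketch below walks a reader block by block using only those pieces. The function read_blocks is hypothetical, and it collects each block into a String purely to keep the example short, so it is not lazy in the way Blocks is.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read up to `max_blocks` blocks of `block_size`
/// bytes each, stopping early at the first empty block.
pub fn read_blocks<R: Read>(
    mut reader: R,
    block_size: u64,
    max_blocks: usize,
) -> io::Result<Vec<String>> {
    let mut out = Vec::new();
    for _ in 0..max_blocks {
        // `by_ref()` lends `&mut R`, so `take` does not consume `reader`
        // and we can create another window on the next iteration.
        let mut block = reader.by_ref().take(block_size);
        let mut text = String::new();
        if block.read_to_string(&mut text)? == 0 {
            break;
        }
        out.push(text);
    }
    Ok(out)
}
```

Each Take<&mut R> here borrows the reader mutably, so only one window can exist at a time.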

Iterating towards a lazy Iterator

We already have our next function. Let’s try putting it into the Iterator template:

impl<R> Iterator for Blocks<R> {
    type Item<'a> = Take<&'a mut MyReaderInner<R>>;

    fn next(&mut self) -> Option<Self::Item> {
        Some(self.reader.take(self.block_size))
    }
}
error[E0658]: generic associated types are unstable
   --> src/main.rs:242:9
    |
242 |         type Item<'a> = Take<&'a mut MyReaderInner<R>>;
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: see issue #44265 <https://github.com/rust-lang/rust/issues/44265> for more information

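We need a lifetime for the &mut MyReaderInner<R> reference, so we try to introduce it on the trait’s associated type, but the compiler used for this post rejects that because generic associated types (GATs) were still unstable. GATs have since been stabilised in Rust 1.65, but that doesn’t rescue this attempt: Iterator declares Item without a lifetime parameter, and an impl can’t add one. We need a workaround.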
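For reference, this is roughly what a trait with a lifetime-generic item looks like. It is a minimal sketch that assumes a toolchain with stable GATs (Rust 1.65 or later). LendingIterator and ChunksOwned are hypothetical names, not part of the standard library, and for loops don’t work with such a trait.

```rust
/// Hypothetical trait: like `Iterator`, but each item may borrow from
/// the `&mut self` passed to `next`.
pub trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;

    fn next<'a>(&'a mut self) -> Option<Self::Item<'a>>;
}

/// Owns a buffer and lends out mutable chunks of it, one at a time.
pub struct ChunksOwned {
    data: Vec<u8>,
    pos: usize,
    size: usize,
}

impl ChunksOwned {
    pub fn new(data: Vec<u8>, size: usize) -> Self {
        // A chunk size of zero would never make progress.
        Self { data, pos: 0, size: size.max(1) }
    }

    pub fn into_inner(self) -> Vec<u8> {
        self.data
    }
}

impl LendingIterator for ChunksOwned {
    type Item<'a> = &'a mut [u8] where Self: 'a;

    fn next<'a>(&'a mut self) -> Option<Self::Item<'a>> {
        if self.pos >= self.data.len() {
            return None;
        }
        let end = (self.pos + self.size).min(self.data.len());
        let chunk = &mut self.data[self.pos..end];
        self.pos = end;
        Some(chunk)
    }
}
```

A while let Some(chunk) = chunks.next() loop drives it. Since each chunk borrows the iterator mutably, it has to be dropped before the next call.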

Since we can’t introduce the lifetime in the type, let’s do it in the impl declaration:

//   👇
impl<'a, R> Iterator for Blocks<R> {
    type Item = Take<&'a mut MyReaderInner<R>>;

    fn next(&mut self) -> Option<Self::Item> {
        Some(self.reader.take(self.block_size))
    }
}
error[E0207]: the lifetime parameter `'a` is not constrained by the impl trait, self type, or predicates
   --> src/main.rs:255:10
    |
255 |     impl<'a, R> Iterator for Blocks<R> {
    |          ^^ unconstrained lifetime parameter

A lifetime introduced in the impl must be used in the same declaration. Easily done: we can turn Blocks<R> into Blocks<'a, R> by adding a PhantomData<&'a mut R> to it:

//                 👇 a lifetime to use later in the Iterator
pub struct Blocks<'a, R> {
    reader: MyReaderInner<R>,
    block_size: u64,
    // 👇 added to use the 'a
    phantom_data: PhantomData<&'a mut R>,
}

impl<'a, R: Read> Iterator for Blocks<'a, R> {
    //                👇 the lifetime 👆
    type Item = Take<&'a mut MyReaderInner<R>>;

    fn next(&mut self) -> Option<Self::Item> {
        Some(<&mut MyReaderInner<R> as Read>::take(
            &mut self.reader,
            self.block_size,
        ))
    }
}

We also need to explicitly say that we’re using Read::take. Otherwise, Rust guesses that take is coming from the Iterator trait, which is not what we want. This is the only version of the code in this post where Rust gets confused like this, and I don’t have an explanation as to why.

Let’s see if it compiles:

error[E0495]: cannot infer an appropriate lifetime for borrow expression due to conflicting requirements
   --> src/main.rs:251:17
    |
251 |                 &mut self.reader,
    |                 ^^^^^^^^^^^^^^^^
    |
note: first, the lifetime cannot outlive the anonymous lifetime defined here...
   --> src/main.rs:249:17
    |
249 |         fn next(&mut self) -> Option<Self::Item> {
    |                 ^^^^^^^^^
note: ...so that reference does not outlive borrowed content
   --> src/main.rs:251:17
    |
251 |                 &mut self.reader,
    |                 ^^^^^^^^^^^^^^^^
note: but, the lifetime must be valid for the lifetime `'a` as defined here...
   --> src/main.rs:246:10
    |
246 |     impl<'a, R: Read> Iterator for Blocks<'a, R> {
    |          ^^
note: ...so that the types are compatible
   --> src/main.rs:249:50
    |
249 |           fn next(&mut self) -> Option<Self::Item> {
    |  __________________________________________________^
250 | |             Some(<&mut MyReaderInner<R> as Read>::take(
251 | |                 &mut self.reader,
252 | |                 self.block_size,
253 | |             ))
254 | |         }
    | |_________^
    = note: expected `<blocks_manual::Blocks<'a, R> as Iterator>`
               found `<blocks_manual::Blocks<'_, R> as Iterator>`

That’s a long error, but I think it boils down to the fact that nothing constrains the 'a lifetime of Take<&'a mut MyReaderInner<R>> to be smaller than the implicit lifetime in the &mut self function argument. In the non-Iterator version of the code, the two lifetimes were the same because they were mentioned in the function signature: pub fn next(&mut self) -> Option<Take<&mut MyReaderInner<R>>>.

Let’s try expanding the function signature and constraining the lifetimes:

impl<'a, R: Read> Iterator for Blocks<'a, R> {
    type Item = Take<&'a mut MyReaderInner<R>>;
    
    //       👇 explicit lifetime   👇 expand Item to make the 'a visible
    fn next(&'a mut self) -> Option<Take<&'a mut MyReaderInner<R>>> {
        Some(<&mut MyReaderInner<R> as Read>::take(
            &mut self.reader,
            self.block_size,
        ))
    }
}
error[E0308]: method not compatible with trait
   --> src/main.rs:249:9
    |
249 |         fn next(&'a mut self) -> Option<Take<&'a mut MyReaderInner<R>>> {
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ lifetime mismatch
    |
    = note: expected fn pointer `fn(&mut blocks_manual::Blocks<'a, R>) -> Option<std::io::Take<&mut blocks_manual::MyReaderInner<R>>>`
               found fn pointer `fn(&'a mut blocks_manual::Blocks<'a, R>) -> Option<std::io::Take<&'a mut blocks_manual::MyReaderInner<R>>>`

We can’t use next(&'a mut self) because the Iterator trait expects next(&mut self). Let’s unwind a few steps: we added a PhantomData<&'a mut R> to Blocks to introduce the 'a lifetime, but we didn’t constrain it, and this caused problems later. We want the lifetime in Blocks<'a, R> to be the same as the lifetime of the &'a mut MyReaderInner<R> we’ll use later. But that also has to be smaller than the implicit lifetime in next(&mut self), and we can’t add any constraints here because the Iterator trait doesn’t have any.

So, we need the &'a mut MyReaderInner<R> lifetime to be smaller than the &mut self one, for all possible instances of the latter. Practically, I think this means we can’t create the &mut MyReaderInner<R> reference ahead of time. Since we need to create it on the fly, we can stick the struct into a RefCell, and then borrow it:

pub struct Blocks<R> {
    reader: RefCell<MyReaderInner<R>>,
    block_size: u64,
}

This almost works, but we still need the 'a lifetime in Blocks, so that we can later use it in Item (which has to be a reference so that it doesn’t take ownership of the reader). So, we instead push the RefCell into MyReaderInner:

pub struct MyReader<R: Read> {
    inner: MyReaderInner<R>,
}

pub struct MyReaderInner<R> {
    //      👇 this lets us create fresh borrows on demand
    reader: RefCell<R>,
}

pub struct Blocks<'a, R> {
    //       👇 this will be the lifetime of the `Iterator::Item`
    reader: &'a MyReaderInner<R>,
    block_size: u64,
}

//                     👇 no need for a mut reference here...
impl<R: Read> Read for &MyReaderInner<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        //                  👇 ... because we can `borrow_mut` the `RefCell`
        let i = self.reader.borrow_mut().read(buf)?;
        Ok(i)
    }
}

impl<'a, R: Read> Iterator for Blocks<'a, R> {
    type Item = Take<&'a MyReaderInner<R>>;

    fn next(&mut self) -> Option<Self::Item> {
        //          👇 copies the shared `&'a` reference; the `RefCell` is
        //          👇 only borrowed later, inside `read`
        Some(self.reader.take(self.block_size))
    }
}

This finally compiles, and the code to use it is much more ergonomic:

const BLOCK_SIZE: u64 = 10;
const BUF_SIZE: usize = 10;
for (idx, mut block) in reader.blocks(BLOCK_SIZE).enumerate().take(10) {
    let mut buf = [0; BUF_SIZE];
    block.read_exact(&mut buf)?;
    println!("block {idx}: {}", std::str::from_utf8(&buf)?);
}
$ cargo run -q -- blocks2
block 0: AAAAAAAAAA
block 1: BBBBBBBBBB
block 2: CCCCCCCCCC
block 3: DDDDDDDDDD
block 4: EEEEEEEEEE
block 5: FFFFFFFFFF
block 6: GGGGGGGGGG
block 7: HHHHHHHHHH
block 8: IIIIIIIIII
block 9: JJJJJJJJJJ

Printing the first 10 bytes of the first 10 blocks seems to work. What if we want to print just the first 5 bytes?

const BUF_SIZE: usize = 5;
$ cargo run -q -- blocks2
block 0: AAAAA
block 1: AAAAA
block 2: BBBBB
block 3: BBBBB
block 4: CCCCC
block 5: CCCCC
block 6: DDDDD
block 7: DDDDD
block 8: EEEEE
block 9: EEEEE

It is printing only 5 bytes for each block which is good, but the second block should’ve been BBBBB. We’re reading the bytes in each block on-demand, but we didn’t account for the case where the user doesn’t consume all the bytes in each block. So, the Blocks iterator can advance, while the position in the underlying file trails behind.
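The same drift can be reproduced with nothing but the standard library. In the sketch below, window_prefixes is a hypothetical helper that opens a fresh take(block_size) window for every block but only reads the first prefix bytes of each.

```rust
use std::io::{self, Cursor, Read};

/// Read the first `prefix` bytes from each of `blocks` windows, where
/// every window is a fresh `take(block_size)` on the same reader.
/// Assumes `prefix <= block_size` and valid UTF-8 input.
pub fn window_prefixes(
    data: &[u8],
    block_size: u64,
    prefix: usize,
    blocks: usize,
) -> io::Result<Vec<String>> {
    let mut reader = Cursor::new(data);
    let mut out = Vec::new();
    for _ in 0..blocks {
        let mut window = reader.by_ref().take(block_size);
        let mut buf = vec![0u8; prefix];
        window.read_exact(&mut buf)?;
        out.push(String::from_utf8_lossy(&buf).into_owned());
        // The unread remainder of `window` is NOT skipped here, so the
        // next window starts wherever this one stopped reading.
    }
    Ok(out)
}
```

With 10-byte blocks and a 5-byte prefix, the second window starts at offset 5 rather than 10, which matches the output above.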

Syncing the Iterator and file position

The underlying problem is that calling Read::take creates an imperfect window into the underlying file data. If we consume all the data in the Take, the cursor in the file reaches the end of the window, and everything works as expected. But if we don’t consume all the data, then the cursor in the file is left at some partway point in the window. The next Take then starts from that position, rather than from the start of the next block.

To fix this, we keep track of the file position ourselves:

pub struct MyReader<R: Read> {
    inner: MyReaderInner<R>,
}

pub struct MyReaderInner<R> {
    reader: RefCell<R>,
    // 👇 the position in the file
    pos: Cell<u64>,
}

pub struct Blocks<'a, R: Read> {
    reader: &'a MyReaderInner<R>,
    block_size: u64,
    // 👇 the position where the next `Take` should start
    next: u64,
}

impl<'a, R: Read> Read for &'a MyReaderInner<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let i = self.reader.borrow_mut().read(buf)?;
        // 👇 set the current position as the user pulls from
        // 👇 the `MyReaderInner` wrapper
        self.pos.set(self.pos.get() + i as u64);
        Ok(i)
    }
}

impl<'a, R: Read> Iterator for Blocks<'a, R> {
    type Item = Result<Take<&'a MyReaderInner<R>>, std::io::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        //         👇 skip to the next starting position...
        match self.skip_bytes(self.next - self.reader.pos.get()) {
            Ok(()) => {
                // ... 👇 and prepare the next next starting position
                self.next += self.block_size;
                Some(Ok(self.reader.take(self.block_size)))
            }
            Err(err) => Some(Err(err)),
        }
    }
}

impl<'a, R: Read> Blocks<'a, R> {
    /// Skip bytes by reading them from the `Read`.  This would be
    /// more efficient if we also required `Seek`.
    fn skip_bytes(&mut self, mut bytes_to_skip: u64) -> Result<(), std::io::Error> {
        while bytes_to_skip > 0 {
            let mut buf = [0u8; 4096 * 8];
            let n = std::cmp::min(bytes_to_skip, buf.len() as u64);
            match self.reader.read(&mut buf[..n as usize])? {
                0 => return Ok(()),
                n => {
                    bytes_to_skip -= n as u64;
                }
            }
        }
        Ok(())
    }
}

Tracking the position is easy because we wrote our own Read impl for &MyReaderInner: we just update the position whenever the user calls read. We do the “skipping” by reading data from the file and immediately discarding it. If we wanted to be more efficient, we could do actual skipping, but that would require the Seek trait.
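The two skipping strategies can be sketched side by side with standard-library calls only. The function names skip_by_reading and skip_by_seeking are hypothetical. The first copies the unwanted bytes into io::sink(), which is a compact alternative to the hand-written loop above; the second assumes the reader also implements Seek, which the code in this post does not require.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Discard up to `n` bytes by reading them. Works for any `Read`.
/// Returns how many bytes were actually discarded (fewer at EOF).
pub fn skip_by_reading<R: Read>(reader: &mut R, n: u64) -> io::Result<u64> {
    io::copy(&mut reader.by_ref().take(n), &mut io::sink())
}

/// Move the cursor forward by `n` bytes without reading them.
/// Returns the new absolute position. Unlike the version above, this
/// gives no indication of whether the new position is past the end.
pub fn skip_by_seeking<S: Seek>(reader: &mut S, n: i64) -> io::Result<u64> {
    reader.seek(SeekFrom::Current(n))
}
```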

Stopping at EOF

The big remaining problem is that our blocks iterator never actually stops iterating. When the reader reaches EOF, the iterator will just continue to hand out Takes which contain no data. This is slightly tricky to solve because we only require the Read trait on the reader, and that can’t be queried for EOF. That said, once EOF is reached, skip_bytes will no longer be able to skip anything, so we could use that as the signal to stop iterating:

fn next(&mut self) -> Option<Self::Item> {
    match self.skip_bytes(self.next - self.reader.pos.get()) {
        Ok(0) => {
            self.next += self.block_size;
            Some(Ok(self.reader.take(self.block_size)))
        }
        // 👇 if we failed to skip, we must be at EOF
        Ok(_) => None,
        Err(err) => Some(Err(err)),
    }
}

//                                                         👇 return the bytes not read
fn skip_bytes(&mut self, mut bytes_to_skip: u64) -> Result<u64, std::io::Error> {
    while bytes_to_skip > 0 {
        let mut buf = [0u8; 4096 * 8];
        let n = std::cmp::min(bytes_to_skip, buf.len() as u64);
        // 👇 we don't just pass errors up any more; we treat
        // 👇 errors as reaching EOF
        match self.reader.read(&mut buf[..n as usize]) {
            Ok(0) | Err(_) => return Ok(bytes_to_skip),
            Ok(n) => {
                bytes_to_skip -= n as u64;
            }
        }
    }
    Ok(0)
}

This isn’t perfect because the iterator might still return one empty Take if the file ends on a block boundary, but it’s good enough in practice.
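If we were willing to require BufRead instead of plain Read (an assumption; the code above doesn’t), the boundary case could be handled by peeking at the buffer. BufRead::fill_buf returns the buffered bytes without consuming them, and an empty slice means EOF. The helpers at_eof and count_blocks below are hypothetical names for a minimal sketch of that idea.

```rust
use std::io::{self, BufRead, Read};

/// Returns `true` if `reader` has no more data, without consuming any.
pub fn at_eof<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    Ok(reader.fill_buf()?.is_empty())
}

/// Count the blocks in `reader`, discarding their contents. Because EOF
/// is checked before each block, no trailing empty block is counted.
pub fn count_blocks<R: BufRead>(mut reader: R, block_size: u64) -> io::Result<usize> {
    let mut blocks = 0;
    while !at_eof(&mut reader)? {
        // Drain one block's worth of bytes (fewer for a short last block).
        io::copy(&mut reader.by_ref().take(block_size), &mut io::sink())?;
        blocks += 1;
    }
    Ok(blocks)
}
```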

The thing that doesn’t work

Earlier we glossed over the fact that we’re returning multiple references to the same underlying file. Now, it’s time to face the music. Processing the iterator values out of order just doesn’t work:

const BLOCK_SIZE: u64 = 10;
const BUF_SIZE: usize = 5;
let mut blocks = reader.blocks(BLOCK_SIZE);
let mut block1 = blocks.next().unwrap()?;
let mut block2 = blocks.next().unwrap()?;

// Read first from block2, then block1
let mut buf = [0; BUF_SIZE];
block2.read_exact(&mut buf)?;
println!("block 2: {}", std::str::from_utf8(&buf)?);
block1.read_exact(&mut buf)?;
println!("block 1: {}", std::str::from_utf8(&buf)?);
$ cargo run -q -- blocks_out_of_order
block 2: BBBBB
block 1: BBBBB

It’s the same underlying problem as before. The Takes we return limit the amount of data read from the underlying file, but don’t do anything to ensure that the starting position is correct. For any one Take, if anything else changes the file position after it was created (e.g. the byte skipping code, or a different Take), then it gets out of sync. We could code around this by writing MyTake which keeps track of the file position, and seeks to the right place before reading, but that’s beyond the scope of this post. It would also result in unpredictable runtime performance, as seeking and then reading is much slower than just reading.

For now, we just put a warning in the docs. To quote the tar crate:

Note that care must be taken to consider each entry within an archive in sequence. If entries are processed out of sequence (from what the iterator returns), then the contents read for each entry may be corrupted.
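To give an idea of what the position-tracking Take mentioned above might involve, here is a rough sketch. SeekingTake is a hypothetical type, it is not what nix-nar or tar do, and it assumes the underlying reader implements Seek as well as Read. It seeks before every read, so it pays the performance cost described earlier.

```rust
use std::cell::RefCell;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical replacement for `Take`: a window of `len` bytes starting
/// at absolute offset `start`, which re-seeks before every read.
pub struct SeekingTake<'a, R> {
    reader: &'a RefCell<R>,
    start: u64,
    len: u64,
    consumed: u64,
}

impl<'a, R> SeekingTake<'a, R> {
    pub fn new(reader: &'a RefCell<R>, start: u64, len: u64) -> Self {
        Self { reader, start, len, consumed: 0 }
    }
}

impl<R: Read + Seek> Read for SeekingTake<'_, R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = self.len - self.consumed;
        if remaining == 0 || buf.is_empty() {
            return Ok(0);
        }
        // Never read past the end of our own window.
        let max = remaining.min(buf.len() as u64) as usize;
        let mut reader = self.reader.borrow_mut();
        // Other windows may have moved the cursor, so put it back first.
        reader.seek(SeekFrom::Start(self.start + self.consumed))?;
        let n = reader.read(&mut buf[..max])?;
        self.consumed += n as u64;
        Ok(n)
    }
}
```

With this, reading from the second window before the first yields the bytes each window was created for, at the price of one seek per read.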

Conclusion

The main takeaway is that, although the Iterator API is ergonomic to use, it’s also fairly constrained and doesn’t allow us to do everything. Specifically, Item has no way to name the lifetime of the &mut self borrow passed to next(), so next() can’t return anything that borrows from the iterator itself. Items can be owned values, or references to data that lives outside the iterator and outlives it (like the shared &'a MyReaderInner<R> we ended up with, or the &'a MyVec<T> in the first example), but they can’t be references into fields the iterator owns.