Description
Converting a compressed qcow2 image is about 10 times slower compared with qemu-img convert.
qemu-img convert uses many techniques to speed up the operation:
- Using a thread pool for reading and decompressing clusters
- Using block status to avoid reading unallocated areas
- Using the fast zlib C library for decompression
- Caching L2 tables to avoid disk access when checking whether a cluster is allocated or zero, and when getting the cluster host offset
- Zero detection using 4k granularity (the default)
- Unordered writes - having multiple in-flight reads and writes
We can learn from qemu-img and apply some of these techniques in this library.
How the thread pool affects performance
We can use -m N to control the number of coroutines, practically controlling the number of threads doing I/O and decompression in parallel.
Testing was done with limactl create, since we don't have a useful tool for benchmarking.
Testing shows that single-threaded qemu-img convert is 4 times faster, and with 8 threads it is 9 times faster.
Limactl create
% time limactl create --plain --tty=false
...
INFO[0000] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/346ee1ff9e381b78ba08e2a29445960b5cd31c51f896fc346b82e26e345a5b9a/data"
INFO[0000] Converting "/Users/nsoffer/.lima/default/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/default/diffdisk"
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 198.96 MiB/s
...
limactl create --plain --tty=false 19.88s user 1.47s system 116% cpu 18.294 total
qemu-img convert with different numbers of coroutines
% SRC=/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/346ee1ff9e381b78ba08e2a29445960b5cd31c51f896fc346b82e26e345a5b9a/data
% for i in 1 2 4 8; do echo "=== $i coroutines"; time qemu-img convert -O raw -m $i $SRC tmp.img; done
=== 1 coroutines
qemu-img convert -O raw -m $i $SRC tmp.img 2.27s user 1.63s system 85% cpu 4.567 total
=== 2 coroutines
qemu-img convert -O raw -m $i $SRC tmp.img 2.33s user 1.81s system 154% cpu 2.687 total
=== 4 coroutines
qemu-img convert -O raw -m $i $SRC tmp.img 2.37s user 2.09s system 213% cpu 2.097 total
=== 8 coroutines
qemu-img convert -O raw -m $i $SRC tmp.img 2.38s user 2.10s system 223% cpu 2.004 total
How unordered writes affect performance
With ordered writes (the default), when a read (and decompression) finishes, the coroutine waits until it can write the data to the target image in order. This decreases performance and CPU usage.
With unordered writes, when a read (and decompression) finishes, the coroutine writes the data to the target without waiting for previous reads. This increases fragmentation of the file, but greatly improves performance in some cases (e.g. writing to shared block storage can be 6 times faster).
Unordered writes are more effective with direct I/O (1.5 times faster); with buffered I/O they are only 1.14 times faster.
This is not a great improvement for our use case, but we can consider it later.
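For the library, the same idea could look like the minimal sketch below (illustrative names, not this library's API): a positional write such as *os.File.WriteAt (pwrite) is safe to call from multiple goroutines, so an unordered writer needs no coordination at all, while an ordered writer must hold back a finished cluster until all earlier offsets have been written.

package sketch

import "os"

// writeUnordered writes a finished cluster immediately at its own offset,
// without waiting for clusters at lower offsets to be written first.
func writeUnordered(dst *os.File, data []byte, off int64) error {
    _, err := dst.WriteAt(data, off)
    return err
}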
Buffered I/O
% for i in 1 2 4 8; do echo "=== $i coroutines"; time qemu-img convert -O raw -m $i -W $SRC tmp.img; done
=== 1 coroutines
qemu-img convert -O raw -m $i -W $SRC tmp.img 2.26s user 1.75s system 91% cpu 4.360 total
=== 2 coroutines
qemu-img convert -O raw -m $i -W $SRC tmp.img 2.32s user 1.89s system 166% cpu 2.533 total
=== 4 coroutines
qemu-img convert -O raw -m $i -W $SRC tmp.img 2.41s user 2.43s system 262% cpu 1.844 total
=== 8 coroutines
qemu-img convert -O raw -m $i -W $SRC tmp.img 2.45s user 3.34s system 330% cpu 1.752 total
Direct I/O
% for i in 1 2 4 8; do echo "=== $i coroutines"; time qemu-img convert -O raw -m $i -W -t none $SRC tmp.img; done
=== 1 coroutines
qemu-img convert -O raw -m $i -W -t none $SRC tmp.img 2.25s user 1.75s system 88% cpu 4.538 total
=== 2 coroutines
qemu-img convert -O raw -m $i -W -t none $SRC tmp.img 2.32s user 1.86s system 166% cpu 2.517 total
=== 4 coroutines
qemu-img convert -O raw -m $i -W -t none $SRC tmp.img 2.42s user 2.69s system 311% cpu 1.639 total
=== 8 coroutines
qemu-img convert -O raw -m $i -W -t none $SRC tmp.img 2.44s user 3.56s system 445% cpu 1.347 total
Improving qcow2reader performance
Caching L2 tables
Using #31, limactl is 1.6 times faster with compressed images, but still 2.7 times slower compared with single-threaded qemu-img.
With uncompressed images we are as fast as qemu-img, but qemu-img can be 2 times faster using direct I/O and unordered writes.
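For context, caching L2 tables means keeping decoded L2 tables in memory, so that looking up whether a cluster is allocated or zero, or finding its host offset, does not read the table from disk again. A minimal sketch of the idea (not the actual #31 implementation; the l2Cache name and layout are illustrative):

package sketch

import (
    "encoding/binary"
    "io"
)

// l2Cache keeps decoded L2 tables in memory, keyed by the table's offset
// in the image file, so repeated cluster lookups avoid disk reads.
type l2Cache struct {
    tables map[int64][]uint64
}

// lookup returns the decoded L2 table at off, reading it from the image
// only on the first access. entries is the number of 8-byte entries per
// table (cluster size / 8 for qcow2).
func (c *l2Cache) lookup(img io.ReaderAt, off int64, entries int) ([]uint64, error) {
    if t, ok := c.tables[off]; ok {
        return t, nil
    }
    buf := make([]byte, entries*8)
    if _, err := img.ReadAt(buf, off); err != nil {
        return nil, err
    }
    t := make([]uint64, entries)
    for i := range t {
        t[i] = binary.BigEndian.Uint64(buf[i*8:])
    }
    c.tables[off] = t
    return t, nil
}

Limactl create with #31: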
% time limactl create --plain --tty=false
INFO[0000] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/346ee1ff9e381b78ba08e2a29445960b5cd31c51f896fc346b82e26e345a5b9a/data"
INFO[0000] Converting "/Users/nsoffer/.lima/default/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/default/diffdisk"
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 296.81 MiB/s
...
limactl create --plain --tty=false 12.11s user 0.70s system 103% cpu 12.355 total
Comparing uncompressed images - for reference:
% cat test.yaml
images:
- location: /tmp/test.qcow2
arch: "aarch64"
vmType: vz
plain: true
% time limactl create --tty=false test.yaml
...
INFO[0000] Downloaded the image from "/tmp/test.qcow2"
INFO[0000] Converting "/Users/nsoffer/.lima/test/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/test/diffdisk"
3.50 GiB / 3.50 GiB [---------------------------------------] 100.00% 3.87 GiB/s
...
limactl create --tty=false test.yaml 0.70s user 0.53s system 82% cpu 1.494 total
Converting the same image with qemu-img using different numbers of threads:
% for i in 1 2 4 8; do echo "=== $i coroutines"; time qemu-img convert -O raw -m $i /tmp/test.qcow2 tmp.img; done
=== 1 coroutines
qemu-img convert -O raw -m $i /tmp/test.qcow2 tmp.img 0.03s user 0.74s system 83% cpu 0.927 total
=== 2 coroutines
qemu-img convert -O raw -m $i /tmp/test.qcow2 tmp.img 0.03s user 0.76s system 84% cpu 0.933 total
=== 4 coroutines
qemu-img convert -O raw -m $i /tmp/test.qcow2 tmp.img 0.04s user 0.77s system 88% cpu 0.916 total
=== 8 coroutines
qemu-img convert -O raw -m $i /tmp/test.qcow2 tmp.img 0.05s user 0.79s system 88% cpu 0.953 total
Converting the same image with qemu-img convert using direct I/O and unordered writes:
% for i in 1 2 4 8; do echo "=== $i coroutines"; time qemu-img convert -O raw -m $i -W -t none /tmp/test.qcow2 tmp.img; done
=== 1 coroutines
qemu-img convert -O raw -m $i -W -t none /tmp/test.qcow2 tmp.img 0.05s user 0.57s system 58% cpu 1.056 total
=== 2 coroutines
qemu-img convert -O raw -m $i -W -t none /tmp/test.qcow2 tmp.img 0.05s user 0.58s system 107% cpu 0.591 total
=== 4 coroutines
qemu-img convert -O raw -m $i -W -t none /tmp/test.qcow2 tmp.img 0.07s user 0.71s system 175% cpu 0.440 total
=== 8 coroutines
qemu-img convert -O raw -m $i -W -t none /tmp/test.qcow2 tmp.img 0.08s user 0.91s system 226% cpu 0.435 total
Faster decompression
It looks like the built-in flate library is slow. We can find a better library or use the C zlib library via cgo.
#33 added a test for a fully allocated compressed image. This test shows that decompression is extremely slow.
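One possible drop-in, assuming the current code decompresses clusters with the standard library's compress/flate: github.com/klauspost/compress provides an API-compatible flate package that is typically much faster and does not require cgo. A hedged sketch (decompressCluster is a hypothetical helper, not this library's function):

package sketch

import (
    "bytes"
    "io"

    flate "github.com/klauspost/compress/flate" // API-compatible with compress/flate
)

// decompressCluster inflates one compressed cluster into out, which must
// be exactly one cluster long. Changing the import back to "compress/flate"
// restores the (slower) standard library decoder.
func decompressCluster(compressed, out []byte) error {
    zr := flate.NewReader(bytes.NewReader(compressed))
    defer zr.Close()
    _, err := io.ReadFull(zr, out)
    return err
}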
Faster zero reads
With the inefficient io.Reader interface, we have to fill a buffer with zeros when reading a zero cluster. This should be many times faster than reading data from storage, close to memory bandwidth (e.g. 50 GB/s). However, the new test for a completely zero image (#33) shows that this is almost 3 times slower compared with reading a fully allocated image.
Improving this will not save a lot of time, but it can be a very simple change, like copying from a preallocated zero buffer, or using memset() via cgo.
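A minimal sketch of the preallocated zero buffer approach (note that the Go compiler also compiles a plain loop that sets every byte to zero into a memclr call, so a simple clearing loop may perform similarly):

package sketch

// zeroes is a shared, preallocated buffer that is never written to.
var zeroes = make([]byte, 64*1024)

// fillZero fills b with zeros by copying from the preallocated buffer
// instead of writing the destination byte by byte.
func fillZero(b []byte) {
    for len(b) > 0 {
        n := copy(b, zeroes)
        b = b[n:]
    }
}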
Using an internal thread pool for I/O and decompression
We can split the read request (e.g. 1 MiB) into 16 sub-requests (for the default cluster size) and process them in a goroutine pool; see the sketch below.
The request can complete when:
- all sub-requests completed
- at least one sub-request failed
This keeps the io.ReaderAt interface, which makes the library easy to use, but it will never perform as well as an async interface.
This also adds a lot of complexity to the library.
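A rough sketch of that fan-out behind the io.ReaderAt interface, using golang.org/x/sync/errgroup as the goroutine pool; Reader and readCluster are hypothetical stand-ins for the library's types, and the concurrency limit of 8 is arbitrary:

package sketch

import "golang.org/x/sync/errgroup"

const clusterSize = 64 * 1024 // qcow2 default cluster size

// Reader is a hypothetical stand-in for the library's image reader.
type Reader struct{}

// readCluster is a hypothetical helper that reads (and, for compressed
// clusters, decompresses) the data for one sub-request into b.
func (r *Reader) readCluster(b []byte, off int64) error {
    // ... cluster lookup, read, and decompression ...
    return nil
}

// ReadAt splits the request into cluster-sized sub-requests and runs them
// in a bounded goroutine pool. It completes when all sub-requests have
// finished, and returns the first error if any sub-request failed.
func (r *Reader) ReadAt(b []byte, off int64) (int, error) {
    g := new(errgroup.Group)
    g.SetLimit(8) // bound the number of in-flight sub-requests
    for start := 0; start < len(b); start += clusterSize {
        end := start + clusterSize
        if end > len(b) {
            end = len(b)
        }
        sub := b[start:end]
        subOff := off + int64(start)
        g.Go(func() error {
            return r.readCluster(sub, subOff)
        })
    }
    if err := g.Wait(); err != nil {
        return 0, err
    }
    return len(b), nil
}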
Providing a block status interface
Using io.ReaderAt (and io.Reader) will never be efficient for sparse images. The efficient way to copy a sparse image is:
- Get the next run of clusters with the same status (allocated, zero, unallocated, compressed)
- Process the clusters:
  - Zero clusters (clusters allocated but marked as zero): skip
  - Unallocated clusters: skip
  - Uncompressed clusters: read all clusters (in one read call if possible)
  - Compressed clusters: read and decompress each cluster
- Write the data to the target
So we need an interface like:
type Qcow2Reader interface {
    BlockStatus(offset int64) (*Status, error)
    ReadAt(b []byte, offset int64) (int, error)
}

With this interface, the user of the library can find runs of clusters of the same type and process them in a thread pool, reading and decompressing in parallel.
Implementing BlockStatus with backing files is not trivial, but it is much less work compared to implementing an internal thread pool.
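As a hypothetical sketch of how a caller could drive such an interface (the Status fields Length, Allocated and Zero are illustrative assumptions, and a real copy tool would hand each data run to a worker pool instead of processing it inline):

package sketch

import "io"

// Status describes a run of clusters starting at the queried offset.
// The field names are assumptions for illustration only.
type Status struct {
    Length     int64 // length of the run in bytes
    Allocated  bool  // the run is allocated in the image
    Zero       bool  // the run reads as zeros
    Compressed bool  // the run is stored compressed
}

// Qcow2Reader is the interface proposed above.
type Qcow2Reader interface {
    BlockStatus(offset int64) (*Status, error)
    ReadAt(b []byte, offset int64) (int, error)
}

// copyImage copies src to dst, skipping zero and unallocated runs and
// copying data runs in 1 MiB chunks.
func copyImage(src Qcow2Reader, dst io.WriterAt, size int64) error {
    buf := make([]byte, 1024*1024)
    for off := int64(0); off < size; {
        st, err := src.BlockStatus(off)
        if err != nil {
            return err
        }
        length := st.Length
        if off+length > size {
            length = size - off
        }
        if !st.Allocated || st.Zero {
            off += length // nothing to copy for unallocated or zero runs
            continue
        }
        // Data run (compressed or not): copy in buffer-sized chunks.
        for n := int64(0); n < length; {
            chunk := buf
            if rest := length - n; rest < int64(len(chunk)) {
                chunk = chunk[:rest]
            }
            if _, err := src.ReadAt(chunk, off+n); err != nil {
                return err
            }
            if _, err := dst.WriteAt(chunk, off+n); err != nil {
                return err
            }
            n += int64(len(chunk))
        }
        off += length
    }
    return nil
}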
Related issues
Tasks
- Cache L2 tables (Cache l2 tables #31)
- Faster zero reads (Optimize zero reads #34)
- Faster decompression (Add parallel convert #36)
- BlockStatus interface