
Conversation

@synarete commented Sep 3, 2025

Use a wrapper shell script over 'fio' plus a container to perform a set of I/O tests. Execute both an I/O throughput workload and random I/O with data verification.
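
For illustration, a minimal sketch of such a wrapper, assuming the /testdir mount point and the fio option sets quoted later in this thread; the run_fio helper name is hypothetical and the actual run_fio.sh may differ:

    #!/bin/bash
    # Hypothetical sketch of the fio wrapper; not the exact run_fio.sh from this PR.
    set -euo pipefail

    TESTDIR="${1:-/testdir}"    # directory exported to the fio container

    run_fio() {
        echo "run_fio.sh: fio $*"
        fio --directory="$TESTDIR" --group_reporting --output-format=json "$@"
    }

    # Throughput workload: time-based sequential mixed read/write.
    run_fio --name=fio_simple_64k --size=1G --runtime=120 --time_based \
        --ioengine=pvsync2 --sync=1 --direct=0 --rw=readwrite --bs=64K --numjobs=1

    # Random writes with data verification (xxhash checksum on read-back).
    run_fio --name=fio_verify_4jobs --size=1G --ioengine=pvsync2 --sync=1 --direct=0 \
        --rw=randwrite --do_verify=1 --verify_state_save=0 --verify=xxhash --bs=64K --numjobs=4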

@synarete synarete requested a review from spuiuk September 3, 2025 17:48

@Shwetha-Acharya left a comment


lgtm!

spuiuk previously approved these changes Sep 10, 2025

@spuiuk left a comment


ACK

Edit: Removing the ACK as we have seen issues when running the tests.

@spuiuk spuiuk requested a review from anoopcs9 September 10, 2025 22:58

spuiuk commented Sep 10, 2025

Built and pushed the container image to quay. Will retry the tests to confirm that they are running as expected.
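
For reference, the build and push would typically look something like the following; the quay repository path below is a placeholder, not the actual repository used by this CI:

    # Hypothetical example only; substitute the real Containerfile path and quay repository.
    podman build -t quay.io/<org>/fio-tester:latest -f Containerfile .
    podman push quay.io/<org>/fio-tester:latest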


spuiuk commented Sep 10, 2025

/retest all


spuiuk commented Sep 10, 2025

@synarete, can you rebase and push? We should be able to see the output of the tests in the checks now that the fio containers have been set up.

@synarete

@synarete, can you rebase and push? We should be able to see the output of the tests in the checks now that the fio containers have been set up.

Done

@anoopcs9

We have a problem w.r.t. disk space.

run_fio.sh: fio --name=fio_simple_nproc --size=1G --directory=/testdir --runtime=120 --ioengine=pvsync2 --group_reporting --time_based --sync=1 --direct=0 --rw=readwrite --output-format=json --bs=8K --numjobs=35
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device
fio: pid=0, err=28/file:filesetup.c:253, func=fsync, error=No space left on device

Our disks are configured with 10G. So either we increase it or limit the number of jobs. Or do both?

@synarete

We have a problem w.r.t. disk space.

run_fio.sh: fio --name=fio_simple_nproc --size=1G --directory=/testdir --runtime=120 --ioengine=pvsync2 --group_reporting --time_based --sync=1 --direct=0 --rw=readwrite --output-format=json --bs=8K --numjobs=35

Our disks are configured with 10G. So either we increase it or limit the number of jobs. Or do both?

I see numjobs=35 (which we get from nproc). I will add an upper limit to numjobs (say, 8), but why do we use such an unconventional, odd number of CPUs?
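
With --size=1G per job, numjobs=35 would lay out roughly 35G of test files, well over the 10G disks, so capping the job count should be enough. A possible sketch of the cap, assuming the job count is currently taken straight from nproc:

    # Hypothetical sketch: derive numjobs from nproc but cap it at 8.
    numjobs="$(nproc)"
    max_jobs=8
    if [ "$numjobs" -gt "$max_jobs" ]; then
        numjobs="$max_jobs"
    fi
    fio --name=fio_simple_nproc --size=1G --directory=/testdir --runtime=120 --time_based \
        --ioengine=pvsync2 --group_reporting --sync=1 --direct=0 --rw=readwrite \
        --output-format=json --bs=8K --numjobs="$numjobs"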

@anoopcs9

Our disks are configured with 10G. So either we increase it or limit the number of jobs. Or do both?

I see numjobs=35 (which we get from nproc). I will add an upper limit to numjobs (say, 8)

Thanks, I hope we can use --max-jobs.

but why do we use such an unconventional, odd number of CPUs?

This is not explicitly set; rather, it comes from the host system within the fio container.


@anoopcs9 left a comment


@spuiuk Can you rebuild and push the fio container using the latest version?


spuiuk commented Sep 11, 2025

Pushed the new version of the container.


spuiuk commented Sep 11, 2025

/retest all

@anoopcs9

@synarete Do verify workloads consume more space than normal ones? Now it fails as below:

run_fio.sh: fio --name=fio_verify_4jobs --size=1G --directory=/testdir --ioengine=pvsync2 --group_reporting --sync=1 --direct=0 --rw=randwrite --do_verify=1 --verify_state_save=0 --verify=xxhash --output-format=json --bs=64K --numjobs=4
fio: pid=30, err=28/file:io_u.c:1896, func=io_u error, error=No space left on device
fio: io_u error on file /testdir/fio_verify_4jobs.3.0: No space left on device: write offset=198246400, buflen=65536
fio: pid=29, err=28/file:io_u.c:1896, func=io_u error, error=No space left on device
fio: io_u error on file /testdir/fio_verify_4jobs.0.0: No space left on device: write offset=608567296, buflen=65536
fio: pid=32, err=28/file:io_u.c:1896, func=io_u error, error=No space left on device
fio: io_u error on file /testdir/fio_verify_4jobs.1.0: No space left on device: write offset=236388352, buflen=65536
fio: io_u error on file /testdir/fio_verify_4jobs.2.0: No space left on device: write offset=239665152, buflen=65536
fio: pid=31, err=28/file:io_u.c:1896, func=io_u error, error=No space left on device

@synarete

@synarete Do verify workloads consume more space than normal ones? Now it fails as below:

run_fio.sh: fio --name=fio_verify_4jobs --size=1G --directory=/testdir --ioengine=pvsync2 --group_reporting --sync=1 --direct=0 --rw=randwrite --do_verify=1 --verify_state_save=0 --verify=xxhash --output-format=json --bs=64K --numjobs=4

It looks like it. I will reduce the size on verify runs.
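
For example, keeping the same verify options but with a smaller per-job size; the 256M value below is only an illustration and the final number may differ:

    # Hypothetical sketch: smaller per-job size for the data-verification run.
    fio --name=fio_verify_4jobs --size=256M --directory=/testdir --ioengine=pvsync2 \
        --group_reporting --sync=1 --direct=0 --rw=randwrite --do_verify=1 \
        --verify_state_save=0 --verify=xxhash --output-format=json --bs=64K --numjobs=4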

@spuiuk spuiuk self-requested a review September 11, 2025 13:12

spuiuk commented Sep 11, 2025

New version of the container pushed.


spuiuk commented Sep 11, 2025

/retest all

@spuiuk spuiuk dismissed their stale review September 11, 2025 14:43

Noticed issues with test run.

@anoopcs9

XFS (and GPFS) tests are looking good but CephFS tests show different errors:

  • For default (without mgr) setup on a proxy enabled share (same should be the case without proxy):

    run_fio.sh: fio --name=fio_simple_64k --size=1G --directory=/testdir --runtime=120 --ioengine=pvsync2 --group_reporting --time_based --sync=1 --direct=0 --rw=readwrite --output-format=json --bs=64K --numjobs=1
    fio: io_u error on file /testdir/fio_simple_64k.0.0: Permission denied: write offset=0, buflen=65536
    fio: pid=6, err=13/file:io_u.c:1896, func=io_u error, error=Permission denied

  • For mgr based setup on a normal vfs-ceph-new share without proxy (same should be the case with proxy):

    run_fio.sh: fio --name=fio_simple_64k --size=1G --directory=/testdir --runtime=120 --ioengine=pvsync2 --group_reporting --time_based --sync=1 --direct=0 --rw=readwrite --output-format=json --bs=64K --numjobs=1
    fio: io_u error on file /testdir/fio_simple_64k.0.0: Resource temporarily unavailable: write offset=0, buflen=65536
    fio: pid=6, err=11/file:io_u.c:1896, func=io_u error, error=Resource temporarily unavailable

To me it looks like the new ceph module doesn't complete the fio workload. The old vfs module based share could get past the fio test.

testcases/containers/test_containers.py::test_containers[192.168.123.10-share-cephfs-default-vfs-fio] PASSED [ 94%]

@synarete

XFS (and GPFS) tests are looking good but CephFS tests show different errors.

Interesting -- I did not encounter those failures on my setup (without proxy). Will dig into it.


anoopcs9 commented Sep 15, 2025

/retest centos-ci/cephfs

Use a wrapper shell script over 'fio' + a container to perform a set of
I/O testing. Execute both I/O throughput workload as well as random I/O
with data verification.

Signed-off-by: Shachar Sharon <[email protected]>
@synarete

I tried to reproduce on my local ceph cluster (without proxy) but failed to do so. Indeed, I had a crash because I used non-standard ceph images, but as soon as I switched back to normal ceph (image: quay.ceph.io/ceph-ci/ceph:main, libcephfs: centos-release-ceph-reef-1.0-1.el9s.noarch, libcephfs2-18.2.7-2.el9s.x86_64) everything works fine, repeated a few times. Which version of libcephfs are we using for CI?

@anoopcs9

Which version of libcephfs are we using for CI?

Whatever is the latest build available from the ceph main branch.
