How come searching for a single character with `-q` is notably faster when splitting the file into smaller columns? #3039

5HT2 · 2025-05-01T03:54:52Z

Sorry, this is more of a technical question about "what is going on"; I'm somewhat struggling with the title itself.

I did search for existing issues and couldn't find any that talk about this sort of behavior, but I don't feel like it's worthy of an issue in its own right.

Background information as to what I am doing

In short:

- I have a script (`dump-pdfs`) which runs a bunch of processing on a PDF file and returns an as-clean-as-possible text form of said PDF, including drawings and margins, etc.
- I'm lazy, in that I don't necessarily want to have to type out the awfully-formatted filenames that I start with (it just renames the files using a regex afterwards).

This leads to two problems, only one of which is relevant here: how do I skip unnecessarily processing the same steps more than once on the same file?

The processing step in question is just responsible for deleting unwanted Unicode and other characters, for example the form feed (0x0c / \f).

In testing, on a directory of ~730 text files, this processing step takes about 421.313ms, fairly consistently, when said files are cached in RAM.
However, if I pre-process each file using xxd to essentially split it into a 2-char column, the step takes ~275.159ms on average, or 285.3ms at worst.
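
A comparison like this can be reproduced by wrapping each check in a function and timing a loop over the directory. The sketch below is an assumption about how such a measurement could be set up, not the exact commands behind the numbers above. `check_direct`, `check_hex`, and `count_hits` are hypothetical names, and it assumes `rg` and `xxd` are on `PATH`.

```shell
# Hypothetical helpers for comparing the two form-feed checks.
# Assumes ripgrep (rg) and xxd are installed.

# Search the raw file for a form feed byte. The regex escape \x0c matches
# the same byte as the literal $'\x0c' used in the script.
check_direct() {
    rg -q '\x0c' -- "$1"
}

# Dump one byte per line as hex, then look for a line that is exactly "0c".
check_hex() {
    xxd -c 1 -p < "$1" | rg -q '^0c$'
}

# Run a check function over every .txt file in a directory and print how
# many files contained a form feed.
count_hits() {
    check=$1
    dir=$2
    hits=0
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue
        if "$check" "$f"; then
            hits=$((hits + 1))
        fi
    done
    echo "$hits"
}
```

Running `time (count_hits check_direct ./texts)` and `time (count_hits check_hex ./texts)` (with `./texts` standing in for the real directory) should print the same count for both, so any difference in wall time comes from the check itself. Note that this also counts process startup for every file.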

As an example, the change is as follows. I'll include the full processing step for context:

# Some PDFs contain these due to the way that they were rendered.
# Cleanup line endings, used to remove the following:
# - 0x0c (\f)
# - 0x0d 0x0a → 0x0a (\r\n → \n)
process_fmt() {
    local pdf="$1"
    local txt="${pdf:r}.txt"

    if [[ ! -f "$txt" ]]; then
        log 1 "Failed to cleanup PDF, no such file: $txt" || return $?
    fi

    # Check for existence of 0x0c, no need to run dos2unix otherwise
    # We use this instead of (rg $'\x0c' -q) because it's twice as fast somehow
    xxd -c 1 -p "$txt" | rg '^0c$' -q || {
        log $? "Skipped cleaning up $txt"
        return 0
    }

    log 0 "Cleaning up line endings in $txt..."
    dos2unix "$txt" &>/dev/null || {
        log $? "fatal: dos2unix returned $?" || return $?
    }

    log "Deleting 0x0c (form feed) characters from $txt..."
    tr -d '\f' <"$txt" >"${txt}.temp"
    mv "${txt}.temp" "$txt"

    log "Finished cleaning up: $txt"
}

The speed improvement comes from this change:

- rg $'\x0c' -q "$txt" || {
+ xxd -c 1 -p "$txt" | rg '^0c$' -q || {
=     log $? "Skipped cleaning up $txt"
=     return 0
= }

Again, all xxd is doing here is splitting the file into 2-char wide columns (one byte per line, as hex). It shouldn't really have any I/O advantage over rg, as the files are cached in RAM regardless.
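
To make concrete what `xxd -c 1 -p` feeds into `rg`: each input byte becomes two hex digits plus a newline, so the piped stream is three times the size of the original file. A small sketch, with function names made up for illustration and assuming `xxd` is installed:

```shell
# Print the number of bytes xxd emits for a file in one-byte-per-line mode.
hex_column_bytes() {
    xxd -c 1 -p < "$1" | wc -c | tr -d ' '
}

# Show the transformed stream for a tiny sample: 'a', form feed, 'b', newline.
show_hex_column_demo() {
    printf 'a\fb\n' | xxd -c 1 -p
}
```

For that 4-byte sample, `show_hex_column_demo` prints `61`, `0c`, `62`, and `0a` on separate lines, and `hex_column_bytes` reports 12 for a file with the same contents. So the faster variant is actually handing `rg` more data to scan, not less.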

From what I understand, the --quiet flag should exit as soon as a match is found, so the position of the first 0x0c in the file shouldn't favor either approach.
I tried taking a look at the code (presumably in crates/core/main.rs#run()), but it's unfortunately going over my head, and I don't have a ton of time to decipher the levels of abstraction myself.
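
Since `--quiet` suppresses output, the result of the check is only visible through the exit status. A minimal sketch of branching on it, assuming ripgrep's documented exit codes (0 when a match is found, 1 when none is found, 2 on error); `form_feed_status` is a hypothetical helper:

```shell
# Report what rg -q concluded about a file, based on its exit status.
# Assumes ripgrep (rg) is installed.
form_feed_status() {
    rg -q '\x0c' -- "$1"
    case $? in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}
```

In the `process_fmt` function, the `||` branch fires for any non-zero status, so a failed search and a clean "no match" are both treated as "skip".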

Is this just a case of exact char-by-char matches being faster than searching continuously through the file, as you can just skip to the next line if the first character != 0, or is something else going on?

Thank you for your time, I'm hoping someone can help me understand this behavior a bit better.

BurntSushi (Maintainer) · 2025-05-01T11:20:31Z

Could you provide an example input where you see the speed difference, and then show me the commands used to run with that input?
