Replies: 1 comment
Could you provide an example input where you see the speed difference, and then show me the commands used to run with that input?
Sorry, this is more of a technical question as to "what is going on"; I'm somewhat struggling with the title itself.
I did search for existing issues and couldn't find any that talk about this sort of behavior, but I don't feel like it's worthy of an issue in its own right.
Background information as to what I am doing
In short:

- I have a script (`dump-pdfs`) which runs a bunch of processing on a PDF file and returns an as-clean-as-possible text form of said PDF, including drawings and margins, etc.
- I'm lazy, in that I don't necessarily want to have to type out the awfully-formatted filenames that I start with (it just renames the files using a regex afterwards).
This leads to two problems, only one of which is relevant here: how do I skip unnecessarily processing the same steps more than once on the same file?
The processing step in question is just responsible for deleting unwanted Unicode & other characters, one of which is a form feed (`0x0c` / `\f`), as an example.

In testing, on a directory of ~730 text files, this processing step takes about `421.313ms`, fairly consistently when said files are cached into RAM.

However, if I pre-process each file using `xxd` to essentially split it into a 2-char column, the same step takes ~`275.159ms` on average, or `285.3ms` at worst.

As an example, the change is as follows; I'll include the full processing step for context:
The speed improvement comes from this change:
Again, all `xxd` is doing here is splitting the file into 2-char wide columns; it shouldn't really have some kind of advantage I/O-wise over `rg`, as the files are cached into RAM regardless.

From what I understand, the `--quiet` flag should exit as soon as a match is found, meaning an order-of-indexing advantage regarding where in the file `0x0c` is first encountered shouldn't be making a difference.

I tried taking a look at the code (presumably in `crates/core/main.rs#run()`) but it's unfortunately going over my head, as I don't have a ton of time to decipher the levels of abstraction myself.

Is this just a case of exact char-by-char matches being faster than searching continuously through the file, as you can just skip to the next line if the first character != 0, or is something else going on?
Thank you for your time, I'm hoping someone can help me understand this behavior a bit better.