This is a response to your email, but I thought I'd post it publicly since it might interest other people.
Here are some fairly unstructured thoughts without a real "conclusion":
I'd say SQLite is definitely a possibility for that use case.
- SQLite has an "integrated" zip-file replacement https://sqlite.org/sqlar/doc/trunk/README.md (not that great for this since it uses DEFLATE and doesn't directly have any other file metadata or indexes, but it's precedent for storing files in SQLite).
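
  For reference, the entire sqlar format is just one table (this is the schema given in the sqlar documentation):

  ```sql
  -- sqlar archive table as defined in the sqlar README
  create table sqlar(
    name text primary key,  -- name of the file
    mode int,               -- access permissions (unix mode bits)
    mtime int,              -- last modification time
    sz int,                 -- original file size
    data blob               -- compressed content (raw content if compression doesn't help)
  );
  ```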
- SQLite has an incremental BLOB I/O API that allows a byte sequence stored in the database to be read as a stream (as opposed to fetching it completely with a normal SQL query).
- Audacity recently changed their project format, which consists of many audio tracks, to use SQLite instead of a directory of files: https://www.audacityteam.org/audacity-3-0-0-released/ . These can be pretty large audio files, so it seems to work fine for that.
- SQLite has a limit of 2GB for a single blob, which is probably not that relevant for webrecorder but can definitely be hit when using it to archive normal files.
- I've looked at the WACZ format before and it seems well thought out, but a bit unnecessarily complicated - kind of reinventing a database structure inside a ZIP file. I also understand in general why people like plain-text and similar formats - but is there a specific reason why you need to retain normal WARC files with side-car indexes in your WACZ format?
- Putting everything in a cleanly normalized SQLite database has a few obvious advantages: for one, it becomes trivial to continuously insert more requests without having to rewrite any indexes - all indexing is managed automatically, and adding a new key by which to query the data is trivial. For example, in browsertrix-crawler I couldn't really find out whether I can stop the crawling and resume it later, or whether it always has to run through completely - I guess it only writes out the final archive at the end?
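
  As a sketch of how cheap "adding a new key to query by" is (assuming a requests table like the one sketched at the end of this comment), it's a single statement, and SQLite keeps the index up to date on every later insert:

  ```sql
  -- hypothetical example: make lookups by timestamp efficient
  create index requests_by_timestamp on requests(timestamp);
  ```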
- You can write to and read from an SQLite database from multiple processes at the same time with ease (single writer but multiple readers with journal_mode=WAL).
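
  Enabling that is a one-time setting on the database (it persists in the file):

  ```sql
  -- switch to write-ahead logging; readers no longer block the writer
  pragma journal_mode=WAL;
  ```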
- The SQLite database storage format is pretty efficient - for example, it transparently uses varints for all integers. Text/binary data is not compressed, though.
- I see that accessing CDXJ uses binary search for looking things up - SQLite uses a B-tree (as do basically all databases), which in general should result in a constant factor fewer requests, since the tree depth is lower. It can also be updated more efficiently (afaik). Since request latency is significant for partial fetching, this is probably a good improvement. You can check the depth of the B-tree with sqlite3_analyzer. I just looked at a DB with 2M rows, and the SQLite index has depth 4, while a normal binary search would have depth ~20.
- SQLite always first looks up the rowid in an index B-tree, then the real row in the table B-tree. Since each HTTP request has fairly high overhead, it can make sense to pre-cache the inner nodes of the B-trees beforehand. Someone implemented this (streamlining the HTTP request pattern with a sidecar file): phiresky/sql.js-httpvfs#15 . In your WACZ format, the equivalent would be caching the first few layers of the tree induced by the binary search (but it would be less useful, since that tree is much deeper).
- CDXJ gzips groups of lines - this is something that would be harder with SQLite, since you can't compress many rows together without denormalizing the DB in some way (e.g. storing many real rows per DB row).
- To solve the above I've actually developed https://github.com/phiresky/sqlite-zstd , which allows transparent row-level compression in SQLite with dictionaries - so if you have 10000 similar entries, all the common information need only be stored once. It's great, but a bit experimental, and right now there's no WASM support.
- SQLite would be pretty much a black box in this context - you probably don't want to modify the SQLite core, so anything you need that falls outside of SQLite's flexibility will be hard to do.
- One disadvantage of SQLite is that it would be harder to reason about exactly how much data needs to be fetched when querying by something specific, and having a WASM "blob" between layers of your code is annoying in general.
- SQLite is very robust and established, but using it from within the browser is not. So if your use case were only native programs I'd say definitely use it instead of a home-brew format, but for a web-based one it's less clear.
To use SQLite as your file format, you'd have to decide on the database structure. That would be something like:
```sql
create table requests(id integer primary key autoincrement, timestamp date, headers text, content blob);
create table entry_points(request_id integer references requests(id), ts date, title text, text text); -- pages
create table archive_metadata(key text primary key, value text);
```
You could then also easily add a full-text search index so you can search for websites with specific text in the title or even the content.
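
A minimal sketch of what that could look like with SQLite's FTS5 extension, reusing the entry_points table from the schema above:

```sql
-- full-text index over page titles and extracted text (FTS5, external-content table)
create virtual table entry_points_fts using fts5(title, text, content='entry_points');

-- (re)build the index from the existing rows; triggers could keep it in sync afterwards
insert into entry_points_fts(entry_points_fts) values ('rebuild');

-- example: find pages whose title or text mentions "climate"
select rowid, title from entry_points_fts where entry_points_fts match 'climate';
```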
The "cleanest" way would be to put everything in 3NF form. That way you could add indexes on whatever your want, including looking for requests with a specific header value etc. But you probably want to denormalize the data somewhat since you always want all the headers together with the content (like in WARC), without any joining overhead. Then it's harder to index the headers individually, but the main use case is (probably) efficient. Putting it in a more normal form would mean something like
```sql
create table requests(id integer primary key autoincrement, timestamp date, ...);
create table headers(id integer primary key autoincrement, key text);
create table header_values(id integer primary key autoincrement, header integer references headers(id), value text);
create table request_headers(request_id integer references requests(id), header_value integer references header_values(id));
```
Note that this is much more complicated, but it also deduplicates the data, since most header keys and even header values are probably used multiple times.
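
To illustrate the joining overhead mentioned above: reassembling the headers of a single request from this normalized schema would look roughly like this (request id 42 is just an example value):

```sql
-- fetch all header key/value pairs for one request
select h.key, hv.value
from request_headers rh
join header_values hv on hv.id = rh.header_value
join headers h on h.id = hv.header
where rh.request_id = 42;
```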