
Conversation

@nickva
Contributor

@nickva nickva commented Jul 21, 2025

Overview

Implement a time-to-sequence mapping data structure, then use it to enable the _changes?since=$time feature.

This started as an experiment wondering if we could have a simple data structure to map rough time intervals to db sequences: nothing too exact, just something on the order of hours, days, months, years. The original idea came from a discussion with Glynn Bird, who wondered if it would be possible to do such a thing (thanks, @glynnbird!), and the idea of using exponentially decaying intervals comes from our recent rewrite of the couch_stats histograms.

Time-Seq Data Structure

The data structure, called "time-seq" below, is a list of 60 key-value pairs mapping time bins to db sequences. The structure represents exponentially decaying time intervals. This decaying behavior is the trade-off for being small and having a fixed size -- the further back in time we go, the lower the accuracy. However, this is how we often regard time in general: when we talk about "yesterday", we refer to individual hours; when we talk about "last month", we may talk about individual days; when talking about two years ago, we may care about months only; and so on.
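
For illustration, here is a rough sketch of what the bins could look like (made-up times and sequences; the real structure also carries a version and a bit more bookkeeping, see the first commit). Each {Time, Seq} pair maps the start of a rounded time interval to the first update sequence observed in it, newest entry first.

```erlang
%% Hypothetical time-seq bins (illustration only). The Unix-second timestamps
%% fall on 3-hour block boundaries; each maps to the first update sequence
%% observed in that interval.
Bins = [
    {1753131600, 101}, % 2025-07-21T21:00:00Z -> first seq in this block: 101
    {1753120800, 31},  % 2025-07-21T18:00:00Z -> first seq in this block: 31
    {1753056000, 1}    % 2025-07-21T00:00:00Z -> first seq in this block: 1
].
```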

Just the time-seq implementation and the associated tests are in the first commit. It has some additional info in the commit message and module comments. Property tests were written by @iilyak (thank you!), and along with the eunit tests we got to 100% test coverage.

Serialization: Upgrade / Downgrade Behavior

Another unexpected benefit of using a small data structure is that it fits inside the header. And, with an additional bit of luck, the implementation turned out to also be downgrade-safe. This was accomplished by reusing a very old, unused header field; on downgrade, older versions of CouchDB will simply ignore the new time-seq field. With this "trick" we can avoid having to create an intermediate downgrade target release. The addition of the time-seq data structure to the header is implemented in the second commit. That commit also implements how the structure is updated: that happens in couch_db_updater right before the writes are committed.

Dealing With Time

Since we're dealing with time, we're bound to have some sharp edges. On some systems time can jump backwards briefly after boot until NTP sync kicks in, or it may misbehave in other ways. There are a few mitigations implemented to help with these sharp edges (the first three are sketched in code right after the list):

  • Round timestamps to three hour blocks. We only care about very rough synchronization -- on the order of hours. Even if the clock is off by days, users can still rely on since intervals larger than whole days (weeks, months).
  • Ignore updates from the past. Once the time catches up, updates will continue. Users don't have to do or configure anything; this behavior is always enabled.
  • Ignore times below a configurable threshold. If a system is known to jump back to some fixed time in the past after boot, the user may configure a minimum threshold so any updates below it are ignored.
  • Implement API endpoints to inspect and reset time-seq structures: GET $db/_time_seq and DELETE $db/_time_seq. The result of GET $db/_time_seq contains all the time-seq bins with formatted timestamps mapped to the number of changes in each bin. The DELETE call resets the time-seq structure. This lets users inspect and reset the structure if they detect that something unexpected happened with time synchronization, for example if the date jumped forward to 2050 or something like that.
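
As a rough sketch of the first three mitigations (illustration only, with made-up module, function, and variable names; the real config handling lives in the PR commits):

```erlang
-module(time_seq_sketch).
-export([round_time/1, accept_update/3]).

-define(BLOCK_SEC, 3 * 60 * 60). % three hour blocks

%% Round a Unix timestamp (seconds) down to the start of its 3-hour block.
round_time(Time) when is_integer(Time), Time >= 0 ->
    (Time div ?BLOCK_SEC) * ?BLOCK_SEC.

%% Accept an update only if its rounded time is not older than the newest
%% recorded bin and not below the configured minimum threshold.
accept_update(Time, NewestBinTime, MinThreshold) ->
    Rounded = round_time(Time),
    Rounded >= NewestBinTime andalso Rounded >= MinThreshold.
```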

The third commit implements the new $db/_time_seq API endpoint and the general fabric level integration of the new feature.

_changes?since=$time Implementation

The _changes?since=$time feature is implemented in the fourth commit. Due to all the preparatory steps, this commit is pretty simple. We handle the new parameter variant just like we handle the special now value for descending changes feeds. After the initial start-argument processing, the rest of the changes feed logic proceeds as before.
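
Conceptually, the translation step looks something like the sketch below (hypothetical helper name; couch_time_seq:since/2 is the lookup from the first commit, and calendar:rfc3339_to_system_time/1 is the stock OTP parser). Since time-seq is tracked per shard, the lookup takes a shard's time-seq structure; after the translation everything proceeds as a plain sequence-based changes feed.

```erlang
%% Translate a time-based since value into a regular update sequence for one
%% shard, then start the changes feed from that sequence as usual.
since_time_to_seq(TSeq, SinceTime) when is_binary(SinceTime) ->
    Time = calendar:rfc3339_to_system_time(binary_to_list(SinceTime)),
    couch_time_seq:since(TSeq, Time).
```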

A small example, copied from the _changes commit comment, with a db I updated every few hours during the day:

% http get $DB/db/_time_seq

{
    "00000000-ffffffff": {
        "[email protected]": [
            ["2025-07-21T01:00:00Z", 15],
            ["2025-07-21T05:00:00Z", 2],
            ["2025-07-21T19:00:00Z", 9],
            ["2025-07-21T20:00:00Z", 5],
            ["2025-07-21T21:00:00Z", 70],
            ["2025-07-21T22:00:00Z", 10]
        ]
    }
}

_changes?since=2025-07-21T22:00:00Z will return only the documents changed in that last hour:

% http get $DB/db/_changes'?since=2025-07-21T22:00:00Z' | jq -r '.results[].id'

101
102
103
104
105
106
107
108
109
110

Even the (somewhat) hidden since_seq replication parameter should work, so we can replicate from a particular point in time:

% http post 'http://adm:pass@localhost:15984/_replicate' \
  source:='"http://adm:pass@localhost:15984/db"' \
  target:='"http://adm:pass@localhost:15984/tgt"' \
  since_seq:='"2025-07-21T22:00:00Z"'

{
    "history": [
        {
            "bulk_get_attempts": 10,
            "bulk_get_docs": 10,
            "doc_write_failures": 0,
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "end_time": "Mon, 21 Jul 2025 22:11:59 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "session_id": "19252b97e34088aeaaa6cde6694a419f",
            "start_last_seq": "2025-07-21T22:00:00Z",
            "start_time": "Mon, 21 Jul 2025 22:11:55 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 4,
    "session_id": "19252b97e34088aeaaa6cde6694a419f",
    "source_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews"
}

The target db now has only the documents written in that last hour:

% http $DB/tgt/_all_docs | jq -r '.rows[].id'

101
102
103
104
105
106
107
108
109
110

Downgrade Testing

Ran a downgrade test. Updated a db with the PR branch. Switched to main, then verified it was possible to read and write the same dbs without any issue.

Performance impact

Ran the quick and dirty built-in fabric_bench test, using q=8 and small docs. Didn't notice any significant difference between main and the PR branch:

  - main
    _bulk_get rate (hz): 29000, 27000, 26000, 26000, 29000, 30000
    single doc update (hz): 320, 330, 350, 320, 330

  - PR
    _bulk_get rate (hz): 30000, 30000, 30000, 29000, 26000, 27000
    single doc update (hz): 340, 310, 310, 330, 340, 330

What Happens Over Time

To get a feel for how the rollup works, I ran a test which updated the data structure once per hour for 1 million hours.

   3000-01-01T00:00:00Z -> 82176
   3009-05-18T00:00:00Z -> 83712
   3018-12-05T00:00:00Z -> 85584
   3028-09-09T00:00:00Z -> 82416
   3038-02-03T00:00:00Z -> 85488
   3047-11-05T00:00:00Z -> 82704
   3057-04-12T00:00:00Z -> 82944
   3066-09-28T00:00:00Z -> 85872
   3076-07-15T00:00:00Z -> 83520
   3086-01-24T00:00:00Z -> 41472
   3090-10-18T00:00:00Z -> 41472
   3095-07-12T00:00:00Z -> 41760
   3100-04-17T00:00:00Z -> 41472
   3105-01-09T00:00:00Z -> 20736
   3107-05-23T00:00:00Z -> 20736
   3109-10-03T00:00:00Z -> 10368
   3110-12-09T00:00:00Z -> 10368
   3112-02-14T00:00:00Z -> 5184
   3112-09-17T00:00:00Z -> 3456
   3113-02-08T00:00:00Z -> 864
   3113-03-16T00:00:00Z -> 864
   3113-04-21T00:00:00Z -> 864
   3113-05-27T00:00:00Z -> 864
   3113-07-02T00:00:00Z -> 864
   3113-08-07T00:00:00Z -> 288
   3113-08-19T00:00:00Z -> 288
   3113-08-31T00:00:00Z -> 288
   3113-09-12T00:00:00Z -> 288
   3113-09-24T00:00:00Z -> 288
   3113-10-06T00:00:00Z -> 288
   3113-10-18T00:00:00Z -> 288
   3113-10-30T00:00:00Z -> 288
   3113-11-11T00:00:00Z -> 288
   3113-11-23T00:00:00Z -> 288
   3113-12-05T00:00:00Z -> 288
   3113-12-17T00:00:00Z -> 288
   3113-12-29T00:00:00Z -> 96
   3114-01-02T00:00:00Z -> 48
   3114-01-04T00:00:00Z -> 48
   3114-01-06T00:00:00Z -> 48
   3114-01-08T00:00:00Z -> 48
   3114-01-10T00:00:00Z -> 48
   3114-01-12T00:00:00Z -> 48
   3114-01-14T00:00:00Z -> 48
   3114-01-16T00:00:00Z -> 48
   3114-01-18T00:00:00Z -> 48
   3114-01-20T00:00:00Z -> 24
   3114-01-21T00:00:00Z -> 24
   3114-01-22T00:00:00Z -> 24
   3114-01-23T00:00:00Z -> 24
   3114-01-24T00:00:00Z -> 24
   3114-01-25T00:00:00Z -> 24
   3114-01-26T00:00:00Z -> 24
   3114-01-27T00:00:00Z -> 24
   3114-01-28T00:00:00Z -> 24
   3114-01-29T00:00:00Z -> 24
   3114-01-30T00:00:00Z -> 6
   3114-01-30T06:00:00Z -> 6
   3114-01-30T12:00:00Z -> 3
   3114-01-30T15:00:00Z -> 1

Noticed a few things:

  • During the last day there are 4 individual intervals, so we can determine which changes occurred about 3 to 6 hours apart.
  • There are 11 individual days, then days are combined into pairs, so if we ask for changes since=3114-01-09T00:00:00Z we may also get changes from 3114-01-08T00:00:00Z.
  • Most of the bins are devoted to keeping track of the sequences in the current year. That's exactly what we'd expect: we can efficiently get the changes from recent intervals.
  • Even after 100 years we can still target intervals less than 10 years apart.

@rnewson
Member

rnewson commented Jul 22, 2025

noting we use ISO 8601 for date/time elsewhere in the codebase.

@iilyak
Contributor

iilyak commented Jul 22, 2025

http get $DB/db/_changes'?since=2025-07-21T22:00:00Z'

The re-use of since with a different type will create problems for the OpenAPI spec.

@nickva
Contributor Author

nickva commented Jul 22, 2025

noting we use ISO 8601 for date/time elsewhere in the codebase.

@rnewson good point. I did start by saying they are ISO 8601, but then noticed the new-ish Erlang calendar module uses RFC 3339, so I flipped to that. I think technically RFC 3339 is a bit more restrictive (https://ijmacd.github.io/rfc3339-iso8601), but it also allows stuff we don't accept here, like a space instead of 'T', or an underscore.

@nickva
Contributor Author

nickva commented Jul 22, 2025

http get $DB/db/_changes'?since=2025-07-21T22:00:00Z'

The re-use of since with a different type will create problems for the OpenAPI spec.

I think it's in line with how now is used, and even the implementation is similar (another clause on top of the now handling).

Having a new parameter like from or time_since would be an option, however then:

  • We'd also need to alter the changes_args record, which is sent between nodes, so we may need an intermediate release to avoid breaking online upgrades, etc.
  • Any place that passes through a since parameter, like the replicator, would need to be updated to know about two since parameters.
  • The sequences emitted in the response are no different than with any other since type: 0, now or 1-xyz. That is, the user still gets back regular "seq" and "last_seq" values.

That's why I opted to stick with just "since".

@nickva nickva force-pushed the tseq branch 3 times, most recently from 0263859 to 2e839b1 on July 28, 2025 07:00
@nickva
Contributor Author

nickva commented Jul 28, 2025

  • I updated the algorithm used for merging/rollup. It's now a bit simpler: the new algorithm merges bins together, first the shortest intervals (multiple hours), then longer ones (multiple days), etc. The previous one was a bit more complicated, trying to create a new set of bins and then fitting the old bins into the new ones.

  • Removed a few more mentions of ISO 8601 vs RFC 3339, focused more on "here is the accepted format" point.

  • Moved the "range-to-hex" DRY refactoring to a separate PR to keep this one smaller.

  • Added docs for the new config values and the http APIs.

  • Added some more specs

  • Added more test coverage.

  • Updated the main comment with some tests: a downgrade test, a quick perf test, and an investigation of what happens over longer time spans, that is, after 100+ years.

@nickva nickva requested a review from iilyak July 29, 2025 04:35
@nickva nickva force-pushed the tseq branch 2 times, most recently from 0391132 to 1d79884 on August 4, 2025 20:15
@iilyak-ibm

What is your vision to handle a case when timestamps become inconsistent with each other on different nodes (for one reason or another)?

@nickva
Contributor Author

nickva commented Aug 12, 2025

What is your vision to handle a case when timestamps become inconsistent with each other on different nodes (for one reason or another)?

There is some discussion about this in the first and second commit comments. There is the _time_seq endpoint to make the structure visible, and it returns results from all shards, so users can detect if something is off. The structure can always be reset safely, without affecting the main data. The condition of it being out of sync is similar to a rewind, which is expected sometimes and is documented. In this case we also default to sending more data rather than less: if a shard copy is blown away and rebuilt, its updates will appear in the changes feed as if they were created at the time they replicated in. It's like a shard range getting a rewind back to 0. So users should be prepared to reprocess the same rows, just like with regular sequences. If anyone needs to rely on strict timestamps, it's up to them to insert explicit timestamps in the documents and index on them.

@iilyak-ibm

Can we recreate _time_seq from the _time_seq of another node? We don't need to be exact, approximation of an age would do (correct up to the bin placement).

@nickva
Contributor Author

nickva commented Aug 12, 2025

Can we recreate _time_seq from the _time_seq of another node? We don't need to be exact, approximation of an age would do (correct up to the bin placement).

We probably could, and it would be easier if we rebuilt nodes while there is no interactive traffic going to them (we'd "lock" them, so to speak, which is exactly what we do for shard splits). Then each document in the rebuild replication request could include a time bin in the #doc.meta fields, and we could associate those timestamps with the sequences on the shard.

It would be neat to have a fast-forward rebuild like that; it would be like rsync-ing the shard over and then renaming it, but all in Erlang. Another way to think about it: it's like what we do for shard splitting, but allowing the target count to be 1 instead of 2+ and making the calls go to another node, not only the local one.

% feature removed in 3.x, but field kept to avoid changing db record size
% and breaking rolling cluster upgrade
waiting_delayed_commit_deprecated,
time_seq,
Member

retain a comment that we repurposed the item.

Contributor Author

Good idea, I'll describe its history a bit:

    % In 2.x versions this field was called waiting_delayed_commit.
    % In 3.0->3.5 versions it was deprecated and named waiting_delayed_commit_deprecated.
    % In 3.6+ it was repurposed to keep the time_seq structure.
    % This repurposing and deprecating is done in order to avoid changing db
    % record sizes and breaking cross-cluster online upgrades.


-spec update(time_seq(), update_seq()) -> time_seq().
update(#{v := ?VER} = Ctx, Seq) when is_integer(Seq), Seq >= 0 ->
    update(Ctx, now_unix_sec(), Seq).
Member

that the function calls now_unix_sec() itself makes it harder to write tests (i.e., your tests use a far-future real date in order to accommodate the inability to inject time at this point).

I'd prefer to see the injection of real time in a way that allows us to test all the edge cases and boundary values if possible.

Contributor Author

@nickva nickva Sep 2, 2025

I had both update/2 and update/3; tests use update/3 with their own time. But it is a bit confusing, I agree, so I'll update it to have just update/3 and let couch_db_updater pass in the time.

It may be neater to have couch_db_updater grab the time from couch_time_seq:timestamp(), in case we want to change the resolution, plug in another time source, or something like that in the future. That way all API calls get an explicit timestamp (always), but the timestamp itself is generated in an opaque way by the couch_time_seq module.
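
Something along these lines, as a sketch of the proposed shape (not the final code):

```erlang
%% In couch_db_updater: always pass an explicit timestamp, but obtain it from
%% couch_time_seq so the time source and resolution stay encapsulated there.
update_time_seq(TSeq, UpdateSeq) ->
    Timestamp = couch_time_seq:timestamp(),
    couch_time_seq:update(TSeq, Timestamp, UpdateSeq).
```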

Member

I like that idea.

-define(M, ?D * 30).
-define(Y, ?D * 365).
%% erlfmt-ignore
-define(INTERVALS, [
Member

@rnewson rnewson Sep 4, 2025

thinking these should be configurable per database.

Contributor Author

@nickva nickva Sep 4, 2025

Hmm, they are sort of tested and adjusted to work out nicely when intervals are merged: almost all of them merge neatly in pairs (3 hours merge to 6 hours, 2 years merge to 4 years, etc.), and that's on purpose. I can see users getting tangled in the config settings for those. It would be a foot-gun, and they might only find out months later that they misconfigured something.

Also, our per-db settings are kind of odd currently, with revs_limit and such being per shard and stored in the header; then we have props, which is supposed to be universal, but we don't allow setting it dynamically (and it also lives in two places -- the shard doc and the db shards). So it would be a bit of work on top of this existing PR to add per-db settings for intervals, and it may also break "downgradability". Maybe it's something to add in the future, if this is used quite a bit and there is user demand for it?

I can see adding a lower base interval option for testing / evaluation (making it 3 minutes maybe?).

Member

sure, I get that, but for example we can't do "3 months" and the ability to do "16 years" is a bit outlandish to me. Yes, agree that per-db state is tricky (and the security object is a bad precedent since clustering).

Perhaps a broader discussion on what the intervals should be then.

Member

(I'm +1 on this PR otherwise, it's just the hard-coded intervals that I struggle with)

Contributor

@iilyak iilyak Sep 4, 2025

A note about efficiency: we could pre-calculate the values and keep the formulas as comments, because otherwise we are going to do pointless multiplications over and over.

Member

decompile the .beam to be sure? I will go find my collection of pearls just in case it doesn't.

Member

@rnewson rnewson Sep 4, 2025

cat a.erl
-module(a).

-export([a/0]).

-define(NUMBER, 1 * 3 * 6 * 1000).

a() ->
    ?NUMBER.

➜  couchdb git:(lucene-10) ✗ erl -pa .
Erlang/OTP 26 [erts-14.2.5.11] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.5.11 (press Ctrl+G to abort, type help(). for help)
1> beam_disasm:file(a).
{beam_file,a,
           [{a,0,2},{module_info,0,4},{module_info,1,6}],
           [{vsn,[177408199690294671755095659682854774622]}],
           [{version,"8.4.3.3"},
            {options,[]},
            {source,"/Users/rnewson/Source/couchdb/a.erl"}],
           [{function,a,0,2,
                      [{label,1},
                       {line,1},
                       {func_info,{atom,a},{atom,a},0},
                       {label,2},
                       {move,{integer,18000},{x,0}},
                       return]},
            {function,module_info,0,4,
                      [{line,0},
                       {label,3},
                       {func_info,{atom,a},{atom,module_info},0},
                       {label,4},
                       {move,{atom,a},{x,0}},
                       {call_ext_only,1,{extfunc,erlang,get_module_info,1}}]},
            {function,module_info,1,6,
                      [{line,0},
                       {label,5},
                       {func_info,{atom,a},{atom,module_info},1},
                       {label,6},
                       {move,{x,0},{x,1}},
                       {move,{atom,a},{x,0}},
                       {call_ext_only,2,{extfunc,erlang,get_module_info,2}}]}]}
3> a:a().
18000

Member

anyway, back to the point. Ok, so 3 months etc does work, that is good enough for me. thanks for the explanation.

Contributor Author

Perfect, thanks for checking it out:

-define(NUMBER, 1 * 3 * 6 * 1000).

And the returned value is: {move,{integer,18000},{x,0}}

Contributor

Thank you for checking. Erlang was not capable of doing it for a long time.

Contributor

@iilyak iilyak left a comment

Looks great +1 from me.

@iilyak
Contributor

iilyak commented Sep 4, 2025

Looks great +1 from me.

Please wait for rnewson's conclusion before merging.

@glynnbird
Contributor

glynnbird commented Sep 4, 2025 via email

@nickva
Contributor Author

nickva commented Sep 4, 2025

If you still want to retain 60 buckets, how about:

  • 1 bucket a day for 30 days
  • 1 bucket a month for 24 months for changes older than 30 days
  • 1 bucket a year for 5 years for changes older than 30 days + 24 months
  • 1 bucket for everything older than that

30 days + 24 months + 5 years + 1 = 60

I started with something like that in the beginning but wanted something finer than a 24 h day; I could see someone wanting to know in what part of the day changes happened (morning, afternoon, etc.), so "hours" got about 4 buckets. I also had sharper transitions like that, from the 30th day straight to a month, but wanted something more gradual, so we may skip 1 or 2 days in between, as opposed to going from 1 day straight to a month.

It's also worth pointing out that if there hasn't been enough time to fill up decades and such, the rest of the 60 bins will still be utilized by more recent intervals. For instance, for the first 60 * 3h updates (180 hours, or 7.5 days), all 60 bins would be filled with 3h intervals. Then, to make room, we'd merge some into 6h, so it becomes a mix of 3h and 6h intervals (with the 6h ones towards the end).

We can try a few more example schedules and see which ones we like better. I'll generate some in a bit.

Member

@rnewson rnewson left a comment

excellent work.

@nickva
Contributor Author

nickva commented Sep 5, 2025

I tried a schedule with more individual days:

-define(INTERVALS, [
    ?D,
    ?M,
    ?Y, ?Y * 2, ?Y * 4, ?Y * 8, ?Y * 16
]).

  3000-01-01T00:00:00Z  : 995808
  3113-08-09T00:00:00Z  : 720
  3113-09-08T00:00:00Z  : 384
  3113-09-24T00:00:00Z  : 336
  3113-10-08T00:00:00Z  : 192
  3113-10-16T00:00:00Z  : 192
  3113-10-24T00:00:00Z  : 192
  3113-11-01T00:00:00Z  : 144
  3113-11-07T00:00:00Z  : 96
  3113-11-11T00:00:00Z  : 96
  3113-11-15T00:00:00Z  : 96
  3113-11-19T00:00:00Z  : 96
  3113-11-23T00:00:00Z  : 96
  3113-11-27T00:00:00Z  : 96
  3113-12-01T00:00:00Z  : 96
  3113-12-05T00:00:00Z  : 48
  3113-12-07T00:00:00Z  : 48
  3113-12-09T00:00:00Z  : 48
  3113-12-11T00:00:00Z  : 48
  3113-12-13T00:00:00Z  : 48
  3113-12-15T00:00:00Z  : 48
  3113-12-17T00:00:00Z  : 48
  3113-12-19T00:00:00Z  : 48
  3113-12-21T00:00:00Z  : 48
  3113-12-23T00:00:00Z  : 48
  3113-12-25T00:00:00Z  : 48
  3113-12-27T00:00:00Z  : 48
  3113-12-29T00:00:00Z  : 24
  3113-12-30T00:00:00Z  : 24
  3113-12-31T00:00:00Z  : 24
  3114-01-01T00:00:00Z  : 24
  3114-01-02T00:00:00Z  : 24
  3114-01-03T00:00:00Z  : 24
  3114-01-04T00:00:00Z  : 24
  3114-01-05T00:00:00Z  : 24
  3114-01-06T00:00:00Z  : 24
  3114-01-07T00:00:00Z  : 24
  3114-01-08T00:00:00Z  : 24
  3114-01-09T00:00:00Z  : 24
  3114-01-10T00:00:00Z  : 24
  3114-01-11T00:00:00Z  : 24
  3114-01-12T00:00:00Z  : 24
  3114-01-13T00:00:00Z  : 24
  3114-01-14T00:00:00Z  : 24
  3114-01-15T00:00:00Z  : 24
  3114-01-16T00:00:00Z  : 24
  3114-01-17T00:00:00Z  : 24
  3114-01-18T00:00:00Z  : 24
  3114-01-19T00:00:00Z  : 24
  3114-01-20T00:00:00Z  : 24
  3114-01-21T00:00:00Z  : 24
  3114-01-22T00:00:00Z  : 24
  3114-01-23T00:00:00Z  : 24
  3114-01-24T00:00:00Z  : 24
  3114-01-25T00:00:00Z  : 24
  3114-01-26T00:00:00Z  : 24
  3114-01-27T00:00:00Z  : 24
  3114-01-28T00:00:00Z  : 24
  3114-01-29T00:00:00Z  : 24
  3114-01-30T00:00:00Z  : 16

We get more than 30 individual days, then about 6 months. So a lot more days, but then everything sort of gets squashed into the oldest bin after two years. To keep the algorithm simple we do simple pair-wise merging, so sharp jumps from days to months don't work as well -- that would need another merge strategy (a custom clause to merge days into months, months into years). I had actually started that way, but then the algorithm had more special cases and was a bit more fiddly.
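
The pair-wise merge itself can be pictured with a toy sketch like the one below (illustration only; it ignores the real module's versioning and, importantly, the choice of which pair to merge, which is driven by the interval schedule). Merging a pair keeps only the older bin of the two, since its Time and Seq already describe the start of the combined interval and the first sequence observed in it.

```erlang
%% Toy sketch: Bins is a [{Time, Seq}] list, newest first. Merge the adjacent
%% pair at positions N and N+1 by dropping the newer bin of the pair.
merge_pair_at(Bins, N) when N >= 1, N < length(Bins) ->
    {Newer, [_DroppedNewer, Older | Rest]} = lists:split(N - 1, Bins),
    Newer ++ [Older | Rest].
```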

The way I generated these is with some eunit test functions added to the main couch_time_seq.erl module and calling couch_time_seq:test_hist().

-define(TEST_TIME, "3000-01-01T00:00:00Z").

test_time() ->
    calendar:rfc3339_to_system_time(?TEST_TIME).

test_hist() ->
    test_hist(1_000_000).

test_hist(N) ->
    TSeq = update_cnt(N, hours(1)),
    Hist = couch_time_seq:histogram(TSeq, N),
    lists:foreach(fun([T, V]) ->
      io:format("  ~s  : ~B~n", [T, V])
    end, Hist).

hours(Hours) ->
    Hours * 3600.

update_cnt(N, TimeInc) ->
    update_cnt(N, test_time(), 0, TimeInc, couch_time_seq:new()).

update_cnt(0, _Time, _Seq, _TimeInc, TSeq) ->
    TSeq;
update_cnt(Cnt, Time, Seq, TimeInc, TSeq) ->
    TSeq1 = couch_time_seq:update(TSeq, Time, Seq),
    Time1 = Time + TimeInc,
    Seq1 = Seq + 1,
    update_cnt(Cnt - 1, Time1, Seq1, TimeInc, TSeq1).

This data structure maps time intervals to database update sequences. The idea
is to be able to quickly determine which changes occurred in a time interval.

The main goal of the design is to have a small data structure that fits well
under a few KBs and yet represents time intervals from a few hours up to
decades. This goal was accomplished by using exponentially decaying time
intervals. The further back in time we go, the longer the intervals get. This
matches how humans usually keep track of time: if we're talking about
yesterday, we may care about hours; if we talk about last month, we may care
about single days; and if we talk about last year, we may only care about the
months or quarters, and so on. If we accept this historical loss of accuracy,
we can hit the design goals of having only 60 time bins and a small,
under-500B on-disk representation.

The data structure format is a KV list of integers which looks like:
`[{Time, Seq}, {OlderTime, OlderSeq}, ...]`. Times are rounded to whole three
hour blocks.

The head of the KV list is the youngest entry. The `Time` value is the time of
the earliest sequence in that time interval. The `Seq` value indicates the
first sequence observed in the time interval.

During updates, if we're into the next three hour block and all the bins are
already filled, then the bins are "rolled up". That means finding some older
bins to merge together to make room for the new one, such that the bin count
does not increase and stays at or below the maximum limit.

The main API functions are (a short usage sketch follows the list):
  * `new()` : create a new time sequence (`TSeq`) context.
  * `update(TSeq, Seq)` : insert a new sequence into the timeline.
  * `since(TSeq, Time) -> Seq` : get the sequence right before the timestamp.
  * `histogram(TSeq, UpdateSeq)` : return formatted time bins and the count of
     updates which occurred during each interval. Use this for debugging or to
     give users an idea how many changes occurred in each interval. If the
     database was upgraded with some existing updates already, those are
     represented as occurring in a time bin starting at 1970-01-01.
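
A minimal usage sketch following the API as listed above (illustrative only; as discussed in review, the final code passes an explicit timestamp to update, so arities may differ slightly):

```erlang
%% Create a context, record a couple of update sequences, then ask which
%% sequence a changes feed should start from for a given wall-clock time,
%% and format the bins for inspection.
time_seq_demo() ->
    TSeq0 = couch_time_seq:new(),
    TSeq1 = couch_time_seq:update(TSeq0, 42),
    TSeq2 = couch_time_seq:update(TSeq1, 43),
    Time = calendar:rfc3339_to_system_time("2025-07-21T22:00:00Z"),
    StartSeq = couch_time_seq:since(TSeq2, Time),
    Hist = couch_time_seq:histogram(TSeq2, 43),
    {StartSeq, Hist}.
```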

Since we're using the operating system's knowledge of time, the solution is
not perfect. However, there are a few mitigations to help with some scenarios:

  * Time values are rounded to three hour blocks. Even if the synchronization
  is known to be off by a day, the user can always restrict usage of the
  `since` parameter to a larger interval, for example only asking about time
  intervals greater than a few days.

  * Ignore updates which appear to happen back in time. Those are likely from
  a not yet synchronized clock after boot. Compare update times to the last
  entry or to a config setting. Users can set the config setting to a
  particular time, for example to 1971 if they know their system jumps to 1970
  after boot due to some hardware default. Any updates during that time won't
  be registered, but the sequence will catch up once the NTP synchronization
  kicks in. It's best to set it to a much more recent time. The default is a
  recent date from before this feature was implemented. Future releases may
  bump that up.

  * If, due to some misconfiguration, time jumps far ahead, say, to year
  3000, or any other time configuration mishap occurs, it's always safe to
  reset the time-seq structure and simply start fresh at the new time. The
  plan is for the structure not to be interleaved into the doc bodies or the
  rev tree, but instead to keep it in a separate, off-to-the-side, unused
  header field. As such, it can always be safely inspected and reset if
  needed.

There are EUnit and property tests for 100% test coverage.

Thanks to Ilya (@iilyak) for writing the property tests!

Since time-seq is fixed size, well under 1KB when serialized, handle it like
we handle epochs in the header. That is simpler than having a new btree, or
having to juggle file term pointers. When we write the 4KB db header block,
most of it is empty anyway, so we'll use a few hundred more bytes from there
for the time-seq data structure and, as a result, gain the ability to map
update sequences to time intervals.

This change is downgrade-safe because it's backwards compatible with
previously supported disk format versions. It's possible to safely downgrade
to a version from before this feature was added. That is achieved by re-using
a very old field from the header that was set to 0 for many years. Downgraded
versions will simply ignore the new data structure. This means we don't need
to run compaction to upgrade anything, or to create an extra intermediate
release version in between to allow for safe downgrades.

For simplicity, time-seq tracking is per-shard. During shard splitting or
compaction the time-seq data structure is preserved. If the user moves a shard
to another node, it will also be preserved. However, if shard files are
manually truncated and rebuilt, then the updates in that shard file will
appear at a later time. As such, the user might then get more (older)
documents from that copy. In the context of the time-based _changes feed
implementation this would look like a rewind for that shard copy. However, we
have rewinds for regular changes feeds when shards are manipulated externally,
and that behavior is documented, so this is in line with it.

This is an escape hatch in case something went wrong with time
synchronization. Users should always be able to reset the time-seq structure
and start from scratch.

In fabric, the get* and set* calls are somewhat similar to how db metadata
calls like get_revs_limit / set_revs_limit work; however, to keep all the
time-seq logic together, they were added to a single `fabric_time_seq` module.

To inspect the time-seq structure use `GET $db/_time_seq`. In the result,
each shard's time-seq data structure is returned. It's a mapping of formatted
time in YYYY-MM-DDTHH:MM:SSZ format to the count of sequence updates which
occurred in that time interval for that shard. It may look something like:

```json
{
    "00000000-7fffffff": {
        "[email protected]": [["2025-07-21T16:00:00Z", 1]],
        "[email protected]": [["2025-07-21T16:00:00Z", 1]],
        "[email protected]": [["2025-07-21T16:00:00Z", 1]]
    },
    "80000000-ffffffff": {
        "[email protected]": [["2025-07-21T16:00:00Z", 3]],
        "[email protected]": [["2025-07-21T16:00:00Z", 3]],
        "[email protected]": [["2025-07-21T16:00:00Z", 3]]
    }
}
```

For consistency, the result shape here is modeled after the `$db/_shards`
endpoint.

The `DELETE $db/_time_seq` API endpoint will reset the data structure. After
calling it, the result from `GET $db/_time_seq` will look like:

```json
{
    "00000000-7fffffff": {
        "[email protected]": [],
        "[email protected]": [],
        "[email protected]": []
    },
    "80000000-ffffffff": {
        "[email protected]": [],
        "[email protected]": [],
        "[email protected]": []
    }
}
```

Use the new time-seq feature to stream changes from before a point in time.

This can be used for backups or any case where it helps to associate a range
of sequence updates with a time interval. The time-seq exponentially decaying
interval rules apply: the further back in time, the less accurate the time
intervals will be.

The API change consists of making `since` accept a standard time value and
streaming the changes starting right before that time value, based on the
known time-seq intervals. The time format of the since parameter is
YYYY-MM-DDTHH:MM:SSZ, which is valid as both ISO 8601 and RFC 3339.

From an API design point of view this feature can be regarded as an extension
of the other `since` values like `now` or `0`.

Implementation-wise, the change is treated similarly to how we treat the
`now` special value: before the changes request starts, we translate the time
value to a proper `since` sequence. After that, we continue on with that
regular sequence as if nothing special happened. Consequently, the shape of
the emitted result is exactly the same as with any other change sequences.
This is an extra "plus" for consistency and compatibility.

To get a feel for the feature, I created a small db and updated it every few
hours during the day:

`http get $DB/db/_time_seq`

```
{
    "00000000-ffffffff": {
        "[email protected]": [
            ["2025-07-21T01:00:00Z", 15],
            ["2025-07-21T05:00:00Z", 2],
            ["2025-07-21T19:00:00Z", 9],
            ["2025-07-21T20:00:00Z", 5],
            ["2025-07-21T21:00:00Z", 70],
            ["2025-07-21T22:00:00Z", 10]
        ]
    }
}
```

A changes feed with `since=2025-07-21T22:00:00Z` will return only the
documents changed in that last hour:

```
% http get $DB/db/_changes'?since=2025-07-21T22:00:00Z' | jq -r '.results[].id'

101
102
103
104
105
106
107
108
109
110
```

Even the somewhat obscure `since_seq` replication parameter should work, so we
can replicate from a particular point in time:

```
% http post 'http://adm:pass@localhost:15984/_replicate' \
  source:='"http://adm:pass@localhost:15984/db"' \
  target:='"http://adm:pass@localhost:15984/tgt"' \
  since_seq:='"2025-07-21T22:00:00Z"'

{
    "history": [
        {
            "bulk_get_attempts": 10,
            "bulk_get_docs": 10,
            "doc_write_failures": 0,
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "end_time": "Mon, 21 Jul 2025 22:11:59 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "session_id": "19252b97e34088aeaaa6cde6694a419f",
            "start_last_seq": "2025-07-21T22:00:00Z",
            "start_time": "Mon, 21 Jul 2025 22:11:55 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 4,
    "session_id": "19252b97e34088aeaaa6cde6694a419f",
    "source_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews"
}
```

The target db now has only documents written in that last hour:

```
% http $DB/tgt/_all_docs | jq -r '.rows[].id'

101
102
103
104
105
106
107
108
109
110
```
@nickva nickva merged commit aff29ae into main Sep 5, 2025
24 checks passed
@nickva nickva deleted the tseq branch September 5, 2025 06:38