
Conversation

@nickva
Contributor

@nickva nickva commented Jul 21, 2025

Overview

Implement a time-to-sequence mapping data structure, then use it to enable the _changes?since=$time feature.

This started as an experiment wondering if we could have a simple data structure to map rough time intervals to db sequences: nothing too exact, just something on the order of hours, days, months, years. The original idea came from a discussion with Glynn Bird, who wondered if it would be possible to do such a thing (thanks, @glynnbird!), and the idea of using exponentially decaying intervals comes from our recent rewrite of the couch_stats histograms.

Time-Seq Data Structure

The data structure, called "time-seq" below, is a list of 60 key-value pairs mapping time bins to db sequences. The structure represents exponentially decaying time intervals. This decaying behavior is the trade-off for being small and having a fixed size -- the further back in time we go, the lower the accuracy. However, this is how we often regard time in general: when we talk about "yesterday", we refer to individual hours; when we talk about "last month", we may talk about individual days; when talking about two years ago, we may care about months only; and so on.
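
For illustration, here is a rough sketch of what the bins could look like (made-up times and sequences; the real structure also carries a version and a bit more bookkeeping, see the first commit). Each {Time, Seq} pair maps the start of a rounded time interval to the first update sequence observed in it, newest entry first.

```erlang
%% Hypothetical time-seq bins (illustration only). The Unix-second timestamps
%% fall on 3-hour block boundaries; each maps to the first update sequence
%% observed in that interval.
Bins = [
    {1753131600, 101}, % 2025-07-21T21:00:00Z -> first seq in this block: 101
    {1753120800, 31},  % 2025-07-21T18:00:00Z -> first seq in this block: 31
    {1753056000, 1}    % 2025-07-21T00:00:00Z -> first seq in this block: 1
].
```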

Just the time-seq implementation and the associated tests are in the first commit. It has some additional info in the commit message and module comments. Property tests were written by @iilyak (thank you!), and along with the eunit tests we got to 100% test coverage.

Serialization: Upgrade / Downgrade Behavior

Another unexpected benefit of using a small data structure is that it fits inside the header. And, with an additional bit of luck, the implementation turned out to also be downgrade-safe. This was accomplished by reusing a very old, unused header field; on downgrade, older versions of CouchDB will simply ignore the new time-seq field. With this "trick" we can avoid having to create an intermediate downgrade target release. The addition of the time-seq data structure to the header is implemented in the second commit. That commit also implements how the structure is updated: that happens in couch_db_updater right before the writes are committed.

Dealing With Time

Since we're dealing with time, we're bound to have some sharp edges. On some systems time can jump backwards briefly after boot until NTP sync kicks in, or it may misbehave in other ways. There are a few mitigations implemented to help with these sharp edges (the first three are sketched in code right after the list):

  • Round timestamps to three hour blocks. We only care about very rough synchronization -- on the order of hours. Even if the clock is off by days, users can still rely on since intervals larger than whole days (weeks, months).
  • Ignore updates from the past. Once the time catches up, updates will continue. Users don't have to do or configure anything; this behavior is always enabled.
  • Ignore times below a configurable threshold. If a system is known to jump back to some fixed time in the past after boot, the user may configure a minimum threshold so any updates below it are ignored.
  • Implement API endpoints to inspect and reset time-seq structures: GET $db/_time_seq and DELETE $db/_time_seq. The result of GET $db/_time_seq contains all the time-seq bins with formatted timestamps mapped to the number of changes in each bin. The DELETE call resets the time-seq structure. This lets users inspect and reset the structure if they detect that something unexpected happened with time synchronization, for example if the date jumped forward to 2050 or something like that.
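
As a rough sketch of the first three mitigations (illustration only, with made-up module, function, and variable names; the real config handling lives in the PR commits):

```erlang
-module(time_seq_sketch).
-export([round_time/1, accept_update/3]).

-define(BLOCK_SEC, 3 * 60 * 60). % three hour blocks

%% Round a Unix timestamp (seconds) down to the start of its 3-hour block.
round_time(Time) when is_integer(Time), Time >= 0 ->
    (Time div ?BLOCK_SEC) * ?BLOCK_SEC.

%% Accept an update only if its rounded time is not older than the newest
%% recorded bin and not below the configured minimum threshold.
accept_update(Time, NewestBinTime, MinThreshold) ->
    Rounded = round_time(Time),
    Rounded >= NewestBinTime andalso Rounded >= MinThreshold.
```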

The third commit implements the new $db/_time_seq API endpoint and the general fabric level integration of the new feature.

_changes?since=$time Implementation

The _changes?since=$time feature is implemented in the fourth commit. Due to all the preparatory steps, this commit is pretty simple. We handle the new parameter variant just like we handle the special now value for descending changes feeds. After the initial start-argument processing, the rest of the changes feed logic proceeds as before.
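
Conceptually, the translation step looks something like the sketch below (hypothetical helper name; couch_time_seq:since/2 is the lookup from the first commit, and calendar:rfc3339_to_system_time/1 is the stock OTP parser). Since time-seq is tracked per shard, the lookup takes a shard's time-seq structure; after the translation everything proceeds as a plain sequence-based changes feed.

```erlang
%% Translate a time-based since value into a regular update sequence for one
%% shard, then start the changes feed from that sequence as usual.
since_time_to_seq(TSeq, SinceTime) when is_binary(SinceTime) ->
    Time = calendar:rfc3339_to_system_time(binary_to_list(SinceTime)),
    couch_time_seq:since(TSeq, Time).
```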

A small example, copied from the _changes commit comment, with a db I updated every few hours during the day:

% http get $DB/db/_time_seq

{
    "00000000-ffffffff": {
        "[email protected]": [
            ["2025-07-21T01:00:00Z", 15],
            ["2025-07-21T05:00:00Z", 2],
            ["2025-07-21T19:00:00Z", 9],
            ["2025-07-21T20:00:00Z", 5],
            ["2025-07-21T21:00:00Z", 70],
            ["2025-07-21T22:00:00Z", 10]
        ]
    }
}

_changes?since=2025-07-21T22:00:00Z will return only the documents changed in that last hour:

% http get $DB/db/_changes'?since=2025-07-21T22:00:00Z' | jq -r '.results[].id'

101
102
103
104
105
106
107
108
109
110

Even the (somewhat) hidden since_seq replication parameter should work, so we can replicate from a particular point in time:

% http post 'http://adm:pass@localhost:15984/_replicate' \
  source:='"http://adm:pass@localhost:15984/db"' \
  target:='"http://adm:pass@localhost:15984/tgt"' \
  since_seq:='"2025-07-21T22:00:00Z"'

{
    "history": [
        {
            "bulk_get_attempts": 10,
            "bulk_get_docs": 10,
            "doc_write_failures": 0,
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "end_time": "Mon, 21 Jul 2025 22:11:59 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "session_id": "19252b97e34088aeaaa6cde6694a419f",
            "start_last_seq": "2025-07-21T22:00:00Z",
            "start_time": "Mon, 21 Jul 2025 22:11:55 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 4,
    "session_id": "19252b97e34088aeaaa6cde6694a419f",
    "source_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews"
}

The target db now has only the documents written in that last hour:

% http $DB/tgt/_all_docs | jq -r '.rows[].id'

101
102
103
104
105
106
107
108
109
110

Downgrade Testing

Ran a downgrade test. Updated a db with the PR branch. Switched to main, then verified it was possible to read and write the same dbs without any issue.

Performance impact

Ran the quick and dirty built-in fabric_bench test, using q=8 and small docs. Didn't notice any significant difference between main and the PR branch:

  - main
    _bulk_get rate (hz): 29000, 27000, 26000, 26000, 29000, 30000
    single doc update (hz): 320, 330, 350, 320, 330

  - PR
    _bulk_get rate (hz): 30000, 30000, 30000, 29000, 26000, 27000
    single doc update (hz): 340, 310, 310, 330, 340, 330

What Happens Over Time

To get a feel for how the rollup works, I ran a test which updated the data structure once per hour for 1 million hours.

   3000-01-01T00:00:00Z -> 82176
   3009-05-18T00:00:00Z -> 83712
   3018-12-05T00:00:00Z -> 85584
   3028-09-09T00:00:00Z -> 82416
   3038-02-03T00:00:00Z -> 85488
   3047-11-05T00:00:00Z -> 82704
   3057-04-12T00:00:00Z -> 82944
   3066-09-28T00:00:00Z -> 85872
   3076-07-15T00:00:00Z -> 83520
   3086-01-24T00:00:00Z -> 41472
   3090-10-18T00:00:00Z -> 41472
   3095-07-12T00:00:00Z -> 41760
   3100-04-17T00:00:00Z -> 41472
   3105-01-09T00:00:00Z -> 20736
   3107-05-23T00:00:00Z -> 20736
   3109-10-03T00:00:00Z -> 10368
   3110-12-09T00:00:00Z -> 10368
   3112-02-14T00:00:00Z -> 5184
   3112-09-17T00:00:00Z -> 3456
   3113-02-08T00:00:00Z -> 864
   3113-03-16T00:00:00Z -> 864
   3113-04-21T00:00:00Z -> 864
   3113-05-27T00:00:00Z -> 864
   3113-07-02T00:00:00Z -> 864
   3113-08-07T00:00:00Z -> 288
   3113-08-19T00:00:00Z -> 288
   3113-08-31T00:00:00Z -> 288
   3113-09-12T00:00:00Z -> 288
   3113-09-24T00:00:00Z -> 288
   3113-10-06T00:00:00Z -> 288
   3113-10-18T00:00:00Z -> 288
   3113-10-30T00:00:00Z -> 288
   3113-11-11T00:00:00Z -> 288
   3113-11-23T00:00:00Z -> 288
   3113-12-05T00:00:00Z -> 288
   3113-12-17T00:00:00Z -> 288
   3113-12-29T00:00:00Z -> 96
   3114-01-02T00:00:00Z -> 48
   3114-01-04T00:00:00Z -> 48
   3114-01-06T00:00:00Z -> 48
   3114-01-08T00:00:00Z -> 48
   3114-01-10T00:00:00Z -> 48
   3114-01-12T00:00:00Z -> 48
   3114-01-14T00:00:00Z -> 48
   3114-01-16T00:00:00Z -> 48
   3114-01-18T00:00:00Z -> 48
   3114-01-20T00:00:00Z -> 24
   3114-01-21T00:00:00Z -> 24
   3114-01-22T00:00:00Z -> 24
   3114-01-23T00:00:00Z -> 24
   3114-01-24T00:00:00Z -> 24
   3114-01-25T00:00:00Z -> 24
   3114-01-26T00:00:00Z -> 24
   3114-01-27T00:00:00Z -> 24
   3114-01-28T00:00:00Z -> 24
   3114-01-29T00:00:00Z -> 24
   3114-01-30T00:00:00Z -> 6
   3114-01-30T06:00:00Z -> 6
   3114-01-30T12:00:00Z -> 3
   3114-01-30T15:00:00Z -> 1

Noticed a few things:

  • During the last day there are 4 individual intervals, so we can determine which changes occurred about 3 to 6 hours apart.
  • There are 11 individual days, then days are combined into pairs, so if we ask for changes since=3114-01-09T00:00:00Z we may also get changes from 3114-01-08T00:00:00Z.
  • Most of the bins are devoted to keeping track of the sequences in the current year. That's exactly what we'd expect: we can efficiently get the changes from recent intervals.
  • Even after 100 years we can still target intervals less than 10 years apart.

@rnewson
Member

rnewson commented Jul 22, 2025

noting we use ISO 8601 for date/time elsewhere in the codebase.

@iilyak
Contributor

iilyak commented Jul 22, 2025

http get $DB/db/_changes'?since=2025-07-21T22:00:00Z'

The re-use of since with a different type will create problems for the OpenAPI spec.

@nickva
Contributor Author

nickva commented Jul 22, 2025

noting we use ISO 8601 for date/time elsewhere in the codebase.

@rnewson good point. I did start by saying they are ISO 8601, but then noticed the new-ish Erlang calendar module uses RFC 3339, so I flipped to that. I think technically RFC 3339 is a bit more restrictive (https://ijmacd.github.io/rfc3339-iso8601), but it also allows stuff we don't accept here, like a space instead of 'T', or an underscore.

@nickva
Contributor Author

nickva commented Jul 22, 2025

http get $DB/db/_changes'?since=2025-07-21T22:00:00Z'

The re-use of since with a different type will create problems for the OpenAPI spec.

I think it's in line with how now is used, and even the implementation is similar (another clause on top of the now handling).

Having a new parameter like from or time_since would be an option, however then:

  • We'd also need to alter the changes_args record, which is sent between nodes, so we may need an intermediate release to avoid breaking online upgrades, etc.
  • Any place that passes through a since parameter, like the replicator, would need to be updated to know about two since parameters.
  • The sequences emitted in the response are no different than with any other since type: 0, now or 1-xyz. That is, the user still gets back regular "seq" and "last_seq" values.

That's why I opted to stick with just "since".

@nickva nickva force-pushed the tseq branch 3 times, most recently from 0263859 to 2e839b1 on July 28, 2025 07:00
@nickva
Contributor Author

nickva commented Jul 28, 2025

  • I updated the algorithm used for merging/rollup. It's now a bit simpler: the new algorithm merges bins together, first the shortest intervals (multiple hours), then longer ones (multiple days), etc. The previous one was a bit more complicated, trying to create a new set of bins and then fitting the old bins into the new ones.

  • Removed a few more mentions of ISO 8601 vs RFC 3339, focused more on "here is the accepted format" point.

  • Moved the "range-to-hex" DRY refactoring to a separate PR to keep this one smaller.

  • Added docs for the new config values and the http APIs.

  • Added some more specs

  • Added more test coverage.

  • Updated the main comment with some tests: a downgrade test, a quick perf test, and an investigation of what happens over longer time spans, that is, after 100+ years.

@nickva nickva requested a review from iilyak July 29, 2025 04:35
@nickva nickva force-pushed the tseq branch 2 times, most recently from 0391132 to 1d79884 on August 4, 2025 20:15
@iilyak-ibm

What is your vision to handle a case when timestamps become inconsistent with each other on different nodes (for one reason or another)?

@nickva
Contributor Author

nickva commented Aug 12, 2025

What is your vision to handle a case when timestamps become inconsistent with each other on different nodes (for one reason or another)?

There is some discussion about this in the first and second commit comments. There is the _time_seq endpoint to make the structure visible, and it returns results from all shards, so users can detect if something is off. The structure can always be reset safely, without affecting the main data. The condition of it being out of sync is similar to a rewind, which is expected sometimes and is documented. In this case we also default to sending more data rather than less: if a shard copy is blown away and rebuilt, its updates will appear in the changes feed as if they were created at the time they replicated in. It's like a shard range getting a rewind back to 0. So users should be prepared to reprocess the same rows, just like with regular sequences. If anyone needs to rely on strict timestamps, it's up to them to insert explicit timestamps in the documents and index on them.

@iilyak-ibm

Can we recreate _time_seq from the _time_seq of another node? We don't need to be exact, approximation of an age would do (correct up to the bin placement).

@nickva
Contributor Author

nickva commented Aug 12, 2025

Can we recreate _time_seq from the _time_seq of another node? We don't need to be exact, approximation of an age would do (correct up to the bin placement).

We probably could, and it would be easier if we rebuilt nodes while there is no interactive traffic going to them (we'd "lock" them, so to speak, which is exactly what we do for shard splits). Then each document in the rebuild replication request could include a time bin in the #doc.meta fields, and we could associate those timestamps with the sequences on the shard.

It would be neat to have a fast-forward rebuild like that; it would be like rsync-ing the shard over and then renaming it, but all in Erlang. Another way to think about it: it's like what we do for shard splitting, but allowing the target count to be 1 instead of 2+ and making the calls go to another node, not only the local one.

% feature removed in 3.x, but field kept to avoid changing db record size
% and breaking rolling cluster upgrade
waiting_delayed_commit_deprecated,
time_seq,
Member

retain a comment that we repurposed the item.

Contributor Author

Good idea, I'll describe its history a bit:

    % In 2.x versions this field was called waiting_delayed_commit.
    % In 3.0->3.5 versions it was deprecated and named waiting_delayed_commit_deprecated.
    % In 3.6+ it was repurposed to keep the time_seq structure.
    % This repurposing and deprecating is done in order to avoid changing db
    % record sizes and breaking cross-cluster online upgrades.


-spec update(time_seq(), update_seq()) -> time_seq().
update(#{v := ?VER} = Ctx, Seq) when is_integer(Seq), Seq >= 0 ->
    update(Ctx, now_unix_sec(), Seq).
Member

that the function calls now_unix_sec() itself makes it harder to write tests (i.e., your tests use a far-future real date in order to accommodate the inability to inject time at this point).

I'd prefer to see the injection of real time in a way that allows us to test all the edge cases and boundary values if possible.

Contributor Author

@nickva nickva Sep 2, 2025

I had both update/2 and update/3; tests use update/3 with their own time. But it is a bit confusing, I agree, so I'll update it to have just update/3 and let couch_db_updater pass in the time.

It may be neater to have couch_db_updater grab the time from couch_time_seq:timestamp(), in case we want to change the resolution, plug in another time source, or something like that in the future. That way all API calls get an explicit timestamp (always), but the timestamp itself is generated in an opaque way by the couch_time_seq module.
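
Something along these lines, as a sketch of the proposed shape (not the final code):

```erlang
%% In couch_db_updater: always pass an explicit timestamp, but obtain it from
%% couch_time_seq so the time source and resolution stay encapsulated there.
update_time_seq(TSeq, UpdateSeq) ->
    Timestamp = couch_time_seq:timestamp(),
    couch_time_seq:update(TSeq, Timestamp, UpdateSeq).
```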

Member

I like that idea.

-define(M, ?D * 30).
-define(Y, ?D * 365).
%% erlfmt-ignore
-define(INTERVALS, [
Member

@rnewson rnewson Sep 4, 2025

thinking these should be configurable per database.

Contributor Author

@nickva nickva Sep 4, 2025

Hmm, they are sort of tested and adjusted to work out nicely when intervals are merged: almost all of them merge neatly in pairs (3 hours merge to 6 hours, 2 years merge to 4 years, etc.), and that's on purpose. I can see users getting tangled in the config settings for those. It would be a foot-gun, and they might only find out months later that they misconfigured something.

Also, our per-db settings are kind of odd currently, with revs_limit and such being per shard and stored in the header; then we have props, which is supposed to be universal, but we don't allow setting it dynamically (and it also lives in two places -- the shard doc and the db shards). So it would be a bit of work on top of this existing PR to add per-db settings for intervals, and it may also break "downgradability". Maybe it's something to add in the future, if this is used quite a bit and there is user demand for it?

I can see adding a lower base interval option for testing / evaluation (making it 3 minutes maybe?).

Member

sure, I get that, but for example we can't do "3 months" and the ability to do "16 years" is a bit outlandish to me. Yes, agree that per-db state is tricky (and the security object is a bad precedent since clustering).

Perhaps a broader discussion on what the intervals should be then.

Member

(I'm +1 on this PR otherwise, it's just the hard-coded intervals that I struggle with)

Contributor

@iilyak iilyak Sep 4, 2025

A note about efficiency: we could pre-calculate the values and keep the formulas as comments, because otherwise we are going to do pointless multiplications over and over.

Member

decompile the .beam to be sure? I will go find my collection of pearls just in case it doesn't.

Member

@rnewson rnewson Sep 4, 2025

cat a.erl
-module(a).

-export([a/0]).

-define(NUMBER, 1 * 3 * 6 * 1000).

a() ->
    ?NUMBER.

➜  couchdb git:(lucene-10) ✗ erl -pa .
Erlang/OTP 26 [erts-14.2.5.11] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.5.11 (press Ctrl+G to abort, type help(). for help)
1> beam_disasm:file(a).
{beam_file,a,
           [{a,0,2},{module_info,0,4},{module_info,1,6}],
           [{vsn,[177408199690294671755095659682854774622]}],
           [{version,"8.4.3.3"},
            {options,[]},
            {source,"/Users/rnewson/Source/couchdb/a.erl"}],
           [{function,a,0,2,
                      [{label,1},
                       {line,1},
                       {func_info,{atom,a},{atom,a},0},
                       {label,2},
                       {move,{integer,18000},{x,0}},
                       return]},
            {function,module_info,0,4,
                      [{line,0},
                       {label,3},
                       {func_info,{atom,a},{atom,module_info},0},
                       {label,4},
                       {move,{atom,a},{x,0}},
                       {call_ext_only,1,{extfunc,erlang,get_module_info,1}}]},
            {function,module_info,1,6,
                      [{line,0},
                       {label,5},
                       {func_info,{atom,a},{atom,module_info},1},
                       {label,6},
                       {move,{x,0},{x,1}},
                       {move,{atom,a},{x,0}},
                       {call_ext_only,2,{extfunc,erlang,get_module_info,2}}]}]}
3> a:a().
18000

Member

anyway, back to the point. Ok, so 3 months etc does work, that is good enough for me. thanks for the explanation.

Contributor Author

Perfect, thanks for checking it out:

-define(NUMBER, 1 * 3 * 6 * 1000).

And the returned value is: {move,{integer,18000},{x,0}}

Contributor

Thank you for checking. Erlang was not capable of doing it for a long time.

Contributor

@iilyak iilyak left a comment

Looks great +1 from me.

@iilyak
Contributor

iilyak commented Sep 4, 2025

Looks great +1 from me.

Please wait for rnewson's conclusion before merging.

@glynnbird
Contributor

glynnbird commented Sep 4, 2025 via email

@nickva
Contributor Author

nickva commented Sep 4, 2025

If you still want to retain 60 buckets, how about:

  • 1 bucket a day for 30 days
  • 1 bucket a month for 24 months for changes older than 30 days
  • 1 bucket a year for 5 years for changes older than 30 days + 24 months
  • 1 bucket for everything older than that

30 days + 24 months + 5 years + 1 = 60

I started with something like that in the beginning but wanted something finer than a 24 h day; I could see someone wanting to know in what part of the day changes happened (morning, afternoon, etc.), so "hours" got about 4 buckets. I also had sharper transitions like that, from the 30th day straight to a month, but wanted something more gradual, so we may skip 1 or 2 days in between, as opposed to going from 1 day straight to a month.

It's also worth pointing out that if there hasn't been enough time to fill up decades and such, the rest of the 60 bins will still be utilized by more recent intervals. For instance, for the first 60 * 3h updates (180 hours, or 7.5 days), all 60 bins would be filled with 3h intervals. Then, to make room, we'd merge some into 6h, so it becomes a mix of 3h and 6h intervals (with the 6h ones towards the end).

We can try a few more example schedules and see which ones we like better. I'll generate some in a bit.

Member

@rnewson rnewson left a comment

excellent work.

@nickva
Contributor Author

nickva commented Sep 5, 2025

I tried a schedule with more individual days:

-define(INTERVALS, [
    ?D,
    ?M,
    ?Y, ?Y * 2, ?Y * 4, ?Y * 8, ?Y * 16
]).

  3000-01-01T00:00:00Z  : 995808
  3113-08-09T00:00:00Z  : 720
  3113-09-08T00:00:00Z  : 384
  3113-09-24T00:00:00Z  : 336
  3113-10-08T00:00:00Z  : 192
  3113-10-16T00:00:00Z  : 192
  3113-10-24T00:00:00Z  : 192
  3113-11-01T00:00:00Z  : 144
  3113-11-07T00:00:00Z  : 96
  3113-11-11T00:00:00Z  : 96
  3113-11-15T00:00:00Z  : 96
  3113-11-19T00:00:00Z  : 96
  3113-11-23T00:00:00Z  : 96
  3113-11-27T00:00:00Z  : 96
  3113-12-01T00:00:00Z  : 96
  3113-12-05T00:00:00Z  : 48
  3113-12-07T00:00:00Z  : 48
  3113-12-09T00:00:00Z  : 48
  3113-12-11T00:00:00Z  : 48
  3113-12-13T00:00:00Z  : 48
  3113-12-15T00:00:00Z  : 48
  3113-12-17T00:00:00Z  : 48
  3113-12-19T00:00:00Z  : 48
  3113-12-21T00:00:00Z  : 48
  3113-12-23T00:00:00Z  : 48
  3113-12-25T00:00:00Z  : 48
  3113-12-27T00:00:00Z  : 48
  3113-12-29T00:00:00Z  : 24
  3113-12-30T00:00:00Z  : 24
  3113-12-31T00:00:00Z  : 24
  3114-01-01T00:00:00Z  : 24
  3114-01-02T00:00:00Z  : 24
  3114-01-03T00:00:00Z  : 24
  3114-01-04T00:00:00Z  : 24
  3114-01-05T00:00:00Z  : 24
  3114-01-06T00:00:00Z  : 24
  3114-01-07T00:00:00Z  : 24
  3114-01-08T00:00:00Z  : 24
  3114-01-09T00:00:00Z  : 24
  3114-01-10T00:00:00Z  : 24
  3114-01-11T00:00:00Z  : 24
  3114-01-12T00:00:00Z  : 24
  3114-01-13T00:00:00Z  : 24
  3114-01-14T00:00:00Z  : 24
  3114-01-15T00:00:00Z  : 24
  3114-01-16T00:00:00Z  : 24
  3114-01-17T00:00:00Z  : 24
  3114-01-18T00:00:00Z  : 24
  3114-01-19T00:00:00Z  : 24
  3114-01-20T00:00:00Z  : 24
  3114-01-21T00:00:00Z  : 24
  3114-01-22T00:00:00Z  : 24
  3114-01-23T00:00:00Z  : 24
  3114-01-24T00:00:00Z  : 24
  3114-01-25T00:00:00Z  : 24
  3114-01-26T00:00:00Z  : 24
  3114-01-27T00:00:00Z  : 24
  3114-01-28T00:00:00Z  : 24
  3114-01-29T00:00:00Z  : 24
  3114-01-30T00:00:00Z  : 16

We get more than 30 individual days, then about 6 months. So a lot more days, but then everything sort of gets squashed into the oldest bin after two years. To keep the algorithm simple we do simple pair-wise merging, so sharp jumps from days to months don't work as well -- that would need another merge strategy (a custom clause to merge days into months, months into years). I had actually started that way, but then the algorithm had more special cases and was a bit more fiddly.
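
The pair-wise merge itself can be pictured with a toy sketch like the one below (illustration only; it ignores the real module's versioning and, importantly, the choice of which pair to merge, which is driven by the interval schedule). Merging a pair keeps only the older bin of the two, since its Time and Seq already describe the start of the combined interval and the first sequence observed in it.

```erlang
%% Toy sketch: Bins is a [{Time, Seq}] list, newest first. Merge the adjacent
%% pair at positions N and N+1 by dropping the newer bin of the pair.
merge_pair_at(Bins, N) when N >= 1, N < length(Bins) ->
    {Newer, [_DroppedNewer, Older | Rest]} = lists:split(N - 1, Bins),
    Newer ++ [Older | Rest].
```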

The way I generated these is with some eunit test functions added to the main couch_time_seq.erl module and calling couch_time_seq:test_hist().

-define(TEST_TIME, "3000-01-01T00:00:00Z").

test_time() ->
    calendar:rfc3339_to_system_time(?TEST_TIME).

test_hist() ->
    test_hist(1_000_000).

test_hist(N) ->
    TSeq = update_cnt(N, hours(1)),
    Hist = couch_time_seq:histogram(TSeq, N),
    lists:foreach(fun([T, V]) ->
      io:format("  ~s  : ~B~n", [T, V])
    end, Hist).

hours(Hours) ->
    Hours * 3600.

update_cnt(N, TimeInc) ->
    update_cnt(N, test_time(), 0, TimeInc, couch_time_seq:new()).

update_cnt(0, _Time, _Seq, _TimeInc, TSeq) ->
    TSeq;
update_cnt(Cnt, Time, Seq, TimeInc, TSeq) ->
    TSeq1 = couch_time_seq:update(TSeq, Time, Seq),
    Time1 = Time + TimeInc,
    Seq1 = Seq + 1,
    update_cnt(Cnt - 1, Time1, Seq1, TimeInc, TSeq1).

This data structure maps time intervals to database update sequences. The idea
is to be able to quickly determine which changes occurred in a time interval.

The main goal of the design is to have a small data structure that fits well
under a few KBs and yet represents time intervals from a few hours up to
decades. This goal was accomplished by using exponentially decaying time
intervals. The further back in time we go, the longer the intervals get. This
matches how humans usually keep track of time: if we're talking about
yesterday, we may care about hours; if we talk about last month, we may care
about single days; and if we talk about last year, we may only care about the
months or quarters, and so on. If we accept this historical loss of accuracy,
we can hit the design goals of having only 60 time bins and a small,
under-500B on-disk representation.

The data structure format is a KV list of integers which looks like:
`[{Time, Seq}, {OlderTime, OlderSeq}, ...]`. Times are rounded to whole three
hour blocks.

The head of the KV list is the youngest entry. The `Time` value is the time of
the earliest sequence in that time interval. The `Seq` value indicates the
first sequence observed in the time interval.

During updates, if we're into the next three hour block and all the bins are
already filled, then the bins are "rolled up". That means finding some older
bins to merge together to make room for the new one, such that the bin count
does not increase and stays at or below the maximum limit.

The main API functions are (a short usage sketch follows the list):
  * `new()` : create a new time sequence (`TSeq`) context.
  * `update(TSeq, Seq)` : insert a new sequence into the timeline.
  * `since(TSeq, Time) -> Seq` : get the sequence right before the timestamp.
  * `histogram(TSeq, UpdateSeq)` : return formatted time bins and the count of
     updates which occurred during each interval. Use this for debugging or to
     give users an idea how many changes occurred in each interval. If the
     database was upgraded with some existing updates already, those are
     represented as occurring in a time bin starting at 1970-01-01.
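
A minimal usage sketch following the API as listed above (illustrative only; as discussed in review, the final code passes an explicit timestamp to update, so arities may differ slightly):

```erlang
%% Create a context, record a couple of update sequences, then ask which
%% sequence a changes feed should start from for a given wall-clock time,
%% and format the bins for inspection.
time_seq_demo() ->
    TSeq0 = couch_time_seq:new(),
    TSeq1 = couch_time_seq:update(TSeq0, 42),
    TSeq2 = couch_time_seq:update(TSeq1, 43),
    Time = calendar:rfc3339_to_system_time("2025-07-21T22:00:00Z"),
    StartSeq = couch_time_seq:since(TSeq2, Time),
    Hist = couch_time_seq:histogram(TSeq2, 43),
    {StartSeq, Hist}.
```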

Since we're using the operating system's knowledge of time, the solution is
not perfect. However, there are a few mitigations to help with some scenarios:

  * Time values are rounded to three hour blocks. Even if the synchronization
  is known to be off by a day, the user can always restrict usage of the
  `since` parameter to a larger interval, for example only asking about time
  intervals greater than a few days.

  * Ignore updates which appear to happen back in time. Those are likely from
  a not yet synchronized clock after boot. Compare update times to the last
  entry or to a config setting. Users can set the config setting to a
  particular time, for example to 1971 if they know their system jumps to 1970
  after boot due to some hardware default. Any updates during that time won't
  be registered, but the sequence will catch up once the NTP synchronization
  kicks in. It's best to set it to a much more recent time. The default is a
  recent date from before this feature was implemented. Future releases may
  bump that up.

  * If, due to some misconfiguration, time jumps far ahead, say, to year
  3000, or any other time configuration mishap occurs, it's always safe to
  reset the time-seq structure and simply start fresh at the new time. The
  plan is for the structure not to be interleaved into the doc bodies or the
  rev tree, but instead to keep it in a separate, off-to-the-side, unused
  header field. As such, it can always be safely inspected and reset if
  needed.

There are EUnit and property tests for 100% test coverage.

Thanks to Ilya (@iilyak) for writing the property tests!

Since time-seq is fixed size, well under 1KB when serialized, handle it like
we handle epochs in the header. That is simpler than having a new btree, or
having to juggle file term pointers. When we write the 4KB db header block,
most of it is empty anyway, so we'll use a few hundred more bytes from there
for the time-seq data structure and, as a result, gain the ability to map
update sequences to time intervals.

This change is downgrade-safe because it's backwards compatible with
previously supported disk format versions. It's possible to safely downgrade
to a version from before this feature was added. That is achieved by re-using
a very old field from the header that was set to 0 for many years. Downgraded
versions will simply ignore the new data structure. This means we don't need
to run compaction to upgrade anything, or to create an extra intermediate
release version in between to allow for safe downgrades.

For simplicity, time-seq tracking is per-shard. During shard splitting or
compaction the time-seq data structure is preserved. If the user moves a shard
to another node, it will also be preserved. However, if shard files are
manually truncated and rebuilt, then the updates in that shard file will
appear at a later time. As such, the user might then get more (older)
documents from that copy. In the context of the time-based _changes feed
implementation this would look like a rewind for that shard copy. However, we
have rewinds for regular changes feeds when shards are manipulated externally,
and that behavior is documented, so this is in line with it.

This is an escape hatch in case something went wrong with time
synchronization. Users should always be able to reset the time-seq structure
and start from scratch.

In fabric, the get* and set* calls are somewhat similar to how db metadata
calls like get_revs_limit / set_revs_limit work; however, to keep all the
time-seq logic together, they were added to a single `fabric_time_seq` module.

To inspect the time-seq structure use `GET $db/_time_seq`. In the result,
each shard's time-seq data structure is returned. It's a mapping of formatted
time in YYYY-MM-DDTHH:MM:SSZ format to the count of sequence updates which
occurred in that time interval for that shard. It may look something like:

```json
{
    "00000000-7fffffff": {
        "[email protected]": [["2025-07-21T16:00:00Z", 1]],
        "[email protected]": [["2025-07-21T16:00:00Z", 1]],
        "[email protected]": [["2025-07-21T16:00:00Z", 1]]
    },
    "80000000-ffffffff": {
        "[email protected]": [["2025-07-21T16:00:00Z", 3]],
        "[email protected]": [["2025-07-21T16:00:00Z", 3]],
        "[email protected]": [["2025-07-21T16:00:00Z", 3]]
    }
}
```

For consistency, the result shape here is modeled after the `$db/_shards`
endpoint.

The `DELETE $db/_time_seq` API endpoint will reset the data structure. After
calling it, the result from `GET $db/_time_seq` will look like:

```json
{
    "00000000-7fffffff": {
        "[email protected]": [],
        "[email protected]": [],
        "[email protected]": []
    },
    "80000000-ffffffff": {
        "[email protected]": [],
        "[email protected]": [],
        "[email protected]": []
    }
}
```

Use the new time-seq feature to stream changes from before a point in time.

This can be used for backups or any case where it helps to associate a range
of sequence updates with a time interval. The time-seq exponentially decaying
interval rules apply: the further back in time, the less accurate the time
intervals will be.

The API change consists of making `since` accept a standard time value and
streaming the changes starting right before that time value, based on the
known time-seq intervals. The time format of the since parameter is
YYYY-MM-DDTHH:MM:SSZ, which is valid as both ISO 8601 and RFC 3339.

From an API design point of view this feature can be regarded as an extension
of the other `since` values like `now` or `0`.

Implementation-wise, the change is treated similarly to how we treat the
`now` special value: before the changes request starts, we translate the time
value to a proper `since` sequence. After that, we continue on with that
regular sequence as if nothing special happened. Consequently, the shape of
the emitted result is exactly the same as with any other change sequences.
This is an extra "plus" for consistency and compatibility.

To get a feel for the feature, I created a small db and updated it every few
hours during the day:

`http get $DB/db/_time_seq`

```
{
    "00000000-ffffffff": {
        "[email protected]": [
            ["2025-07-21T01:00:00Z", 15],
            ["2025-07-21T05:00:00Z", 2],
            ["2025-07-21T19:00:00Z", 9],
            ["2025-07-21T20:00:00Z", 5],
            ["2025-07-21T21:00:00Z", 70],
            ["2025-07-21T22:00:00Z", 10]
        ]
    }
}
```

A changes feed with `since=2025-07-21T22:00:00Z` will return only the
documents changed in that last hour:

```
% http get $DB/db/_changes'?since=2025-07-21T22:00:00Z' | jq -r '.results[].id'

101
102
103
104
105
106
107
108
109
110
```

Even the somewhat obscure `since_seq` replication parameter should work, so we
can replicate from a particular point in time:

```
% http post 'http://adm:pass@localhost:15984/_replicate' \
  source:='"http://adm:pass@localhost:15984/db"' \
  target:='"http://adm:pass@localhost:15984/tgt"' \
  since_seq:='"2025-07-21T22:00:00Z"'

{
    "history": [
        {
            "bulk_get_attempts": 10,
            "bulk_get_docs": 10,
            "doc_write_failures": 0,
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "end_time": "Mon, 21 Jul 2025 22:11:59 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
            "session_id": "19252b97e34088aeaaa6cde6694a419f",
            "start_last_seq": "2025-07-21T22:00:00Z",
            "start_time": "Mon, 21 Jul 2025 22:11:55 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 4,
    "session_id": "19252b97e34088aeaaa6cde6694a419f",
    "source_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews"
}
```

The target db now has only documents written in that last hour:

```
% http $DB/tgt/_all_docs | jq -r '.rows[].id'

101
102
103
104
105
106
107
108
109
110
```
@nickva nickva merged commit aff29ae into main Sep 5, 2025
24 checks passed
@nickva nickva deleted the tseq branch September 5, 2025 06:38