Time-based since parameter for _changes
#5603
Conversation
noting we use ISO 8601 for date/time elsewhere in the codebase.
The re-use of
@rnewson good point I did start with saying they are ISO 8601, but then noticed the new-ish Erlang calendar module used the
I think it's in line with how Having some new parameter
That's why I opted to stick with just "since"
What is your vision to handle a case when timestamps become inconsistent with each other on different nodes (for one reason or another)?
There is some discussion about it in the first and second commit comments. There is the
Can we recreate _time_seq from the _time_seq of another node? We don't need to be exact; an approximation of an age would do (correct up to the bin placement).
We probably could, more easily if we rebuilt nodes while there is no interactive traffic going to them (we'd "lock" them, so to speak, which is exactly what we do for shard splits). Then, each document in the rebuild replication request could include a times bin in the
It would be neat to have a fast-forward rebuild like that; it would be like rsync-ing the shard over and then renaming it, but all in Erlang. Or, another way to think about it: it's like what we'd do for shard splitting, but allowing the target count to be 1 instead of 2+ and making the calls go to another node, not only the local one.
```erlang
    % feature removed in 3.x, but field kept to avoid changing db record size
    % and breaking rolling cluster upgrade
    waiting_delayed_commit_deprecated,
    time_seq,
```
retain a comment that we repurposed the item.
Good idea, I'll describe its history a bit:
```erlang
% In 2.x versions this field was called waiting_delayed_commit.
% In 3.0->3.5 versions it was deprecated and named waiting_delayed_commit_deprecated.
% In 3.6+ it was repurposed to keep the time_seq structure.
% This repurposing and deprecating is done in order to avoid changing db
% record sizes and breaking cross-cluster online upgrades.
```
src/couch/src/couch_time_seq.erl (Outdated)
```erlang
-spec update(time_seq(), update_seq()) -> time_seq().
update(#{v := ?VER} = Ctx, Seq) when is_integer(Seq), Seq >= 0 ->
    update(Ctx, now_unix_sec(), Seq).
```
The fact that the function calls now_unix_sec() itself makes it harder to write tests (i.e., your tests use a far-future real date in order to accommodate the inability to inject time at this point).
I'd prefer to see the injection of real time in a way that allows us to test all the edge cases and boundary values, if possible.
I had both an update/2 and an update/3; tests use update/3 with their own time. But it is a bit confusing, I agree, so I'll update it to have just update/3 and let couch_db_updater pass in the time.
It may be neater to have couch_db_updater grab the time from couch_time_seq:timestamp(), in case we want to change the resolution, plug in another time source, or something like that in the future. So all API calls get an explicit timestamp (always), but the timestamp itself is generated in an opaque way by the couch_time_seq module.
I like that idea.
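For illustration only, the injected-timestamp design discussed above could be sketched like this in Python. The names (`timestamp`, `update`) mirror the Erlang API being discussed but are hypothetical here, and the `update` body is a trivial stand-in, not the real merge logic:

```python
import time

def timestamp():
    # Opaque time source: resolution or backing clock can change later
    # without touching any callers, which always pass the value explicitly.
    return int(time.time())

def update(tseq, ts, seq):
    # Record that update sequence `seq` was first seen at time `ts`.
    # (Stand-in body: just prepend; the real structure merges bins.)
    assert seq >= 0
    return [(ts, seq)] + tseq

# Production caller always fetches the time and passes it in:
live = update([], timestamp(), 1)

# A test injects a fixed time instead of relying on the wall clock:
fixed = update([], 1_700_000_000, 1)
assert fixed == [(1_700_000_000, 1)]
```

The point of the design is that every API call takes an explicit timestamp, so edge cases and boundary values become trivially testable.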
```erlang
-define(M, ?D * 30).
-define(Y, ?D * 365).
%% erlfmt-ignore
-define(INTERVALS, [
```
thinking these should be configurable per database.
Hmm, they are sort of tested and adjusted to work out nicely when intervals are merged; almost all of them merge neatly in pairs (3 hours merge to 6 hours, 2 years merge to 4 years, etc.), and that's on purpose. I can see users getting tangled in the config settings with those. It would be a foot-gun, and they might only find out months later that they misconfigured something.
Also, our per-db settings are kind of odd currently, with revs_limit and such being per shard and stored in the header; then we have props, which is supposed to be universal, but we don't allow dynamically setting it (and it also lives in two places -- the shard doc and also in the db shards). So it would be a bit of work on top of this existing PR to add per-DB settings for intervals, and it may also break "downgradability". Maybe it's something to add in the future, if this is used quite a bit and there is user demand for it?
I can see adding a lower base interval option for testing / evaluation (making it 3 minutes maybe?).
sure, I get that, but for example we can't do "3 months" and the ability to do "16 years" is a bit outlandish to me. Yes, agree that per-db state is tricky (and the security object is a bad precedent since clustering).
Perhaps a broader discussion on what the intervals should be then.
(I'm +1 on this PR otherwise, it's just the hard-coded intervals that I struggle with)
A note about efficiency: we could pre-calculate the values and keep the formulas as comments, because otherwise we are going to do pointless multiplications over and over.
decompile the .beam to be sure? I will go find my collection of pearls just in case it doesn't.
```
% cat a.erl
-module(a).
-export([a/0]).
-define(NUMBER, 1 * 3 * 6 * 1000).
a() ->
    ?NUMBER.

➜ couchdb git:(lucene-10) ✗ erl -pa .
Erlang/OTP 26 [erts-14.2.5.11] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V14.2.5.11 (press Ctrl+G to abort, type help(). for help)
1> beam_disasm:file(a).
{beam_file,a,
    [{a,0,2},{module_info,0,4},{module_info,1,6}],
    [{vsn,[177408199690294671755095659682854774622]}],
    [{version,"8.4.3.3"},
     {options,[]},
     {source,"/Users/rnewson/Source/couchdb/a.erl"}],
    [{function,a,0,2,
        [{label,1},
         {line,1},
         {func_info,{atom,a},{atom,a},0},
         {label,2},
         {move,{integer,18000},{x,0}},
         return]},
     {function,module_info,0,4,
        [{line,0},
         {label,3},
         {func_info,{atom,a},{atom,module_info},0},
         {label,4},
         {move,{atom,a},{x,0}},
         {call_ext_only,1,{extfunc,erlang,get_module_info,1}}]},
     {function,module_info,1,6,
        [{line,0},
         {label,5},
         {func_info,{atom,a},{atom,module_info},1},
         {label,6},
         {move,{x,0},{x,1}},
         {move,{atom,a},{x,0}},
         {call_ext_only,2,{extfunc,erlang,get_module_info,2}}]}]}
3> a:a().
18000
```
anyway, back to the point. Ok, so 3 months etc does work, that is good enough for me. thanks for the explanation.
Perfect, thanks for checking it out: `-define(NUMBER, 1 * 3 * 6 * 1000).` and the returned value is `{move,{integer,18000},{x,0}}`.
Thank you for checking. Erlang was not capable of doing it for a long time.
iilyak
left a comment
Looks great, +1 from me.
Please wait for rnewson's conclusion before merging.
If you still want to retain 60 buckets, how about:
- 1 bucket a day for 30 days
- 1 bucket a month for 24 months for changes older than 30 days
- 1 bucket a year for 5 years for changes older than 30 days + 24 months
- 1 bucket for everything older than that
30 days + 24 months + 5 years + 1 = 60
On Thu, 4 Sept 2025, Nick Vatamaniuc commented on this pull request, in src/couch/src/couch_time_seq.erl:
```erlang
% - With the ?INTERVALS schedule defined below ran 1 update per hour for 1M
%   updates starting at year 3000 and ending at year 3114 and obtained:
%   * Less than 10 years of spacing between years at the start: 3000, 3009, 3018 ...
%   * Ten individual latest days: 3114-01-20 -> 3114-01-30
%   * Seven individual latest months: 3113-07 -> 3114-01
% - Uncompressed term_to_binary(TSeq) = 920B
% - RAM flat size erts_debug:flat_size(TSeq) * erlang:system_info(wordsize) = 2KB
%
-define(MAX_BIN_COUNT, 60).

-define(H, 3600).
-define(D, ?H * 24).
-define(M, ?D * 30).
-define(Y, ?D * 365).
%% erlfmt-ignore
-define(INTERVALS, [
```
3 months already works; I ran a simulation with 1 update per hour for 1 million hours:
- Can do 11 days down to individual days: 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, then can do another 10 but skipping one or two days in between.
- Can do 7 months down to individual months: 01, 12, 11, 10, 09, 08, 07, then it can do 3 or 4 more months skipping one or two in between.
- Left less resolution for the years: about 3 individual years, then 4 more skipping 2 or 3 in between. Then it switches to decades and such, which probably doesn't matter as much.
```
3000-01-01T00:00:00Z -> 82176
3009-05-18T00:00:00Z -> 83712
3018-12-05T00:00:00Z -> 85584
3028-09-09T00:00:00Z -> 82416
3038-02-03T00:00:00Z -> 85488
3047-11-05T00:00:00Z -> 82704
3057-04-12T00:00:00Z -> 82944
3066-09-28T00:00:00Z -> 85872
3076-07-15T00:00:00Z -> 83520
3086-01-24T00:00:00Z -> 41472
3090-10-18T00:00:00Z -> 41472
3095-07-12T00:00:00Z -> 41760
3100-04-17T00:00:00Z -> 41472
3105-01-09T00:00:00Z -> 20736
3107-05-23T00:00:00Z -> 20736
3109-10-03T00:00:00Z -> 10368
3110-12-09T00:00:00Z -> 10368
3112-02-14T00:00:00Z -> 5184
3112-09-17T00:00:00Z -> 3456
3113-02-08T00:00:00Z -> 864
3113-03-16T00:00:00Z -> 864
3113-04-21T00:00:00Z -> 864
3113-05-27T00:00:00Z -> 864
3113-07-02T00:00:00Z -> 864
3113-08-07T00:00:00Z -> 288
3113-08-19T00:00:00Z -> 288
3113-08-31T00:00:00Z -> 288
3113-09-12T00:00:00Z -> 288
3113-09-24T00:00:00Z -> 288
3113-10-06T00:00:00Z -> 288
3113-10-18T00:00:00Z -> 288
3113-10-30T00:00:00Z -> 288
3113-11-11T00:00:00Z -> 288
3113-11-23T00:00:00Z -> 288
3113-12-05T00:00:00Z -> 288
3113-12-17T00:00:00Z -> 288
3113-12-29T00:00:00Z -> 96
3114-01-02T00:00:00Z -> 48
3114-01-04T00:00:00Z -> 48
3114-01-06T00:00:00Z -> 48
3114-01-08T00:00:00Z -> 48
3114-01-10T00:00:00Z -> 48
3114-01-12T00:00:00Z -> 48
3114-01-14T00:00:00Z -> 48
3114-01-16T00:00:00Z -> 48
3114-01-18T00:00:00Z -> 48
3114-01-20T00:00:00Z -> 24
3114-01-21T00:00:00Z -> 24
3114-01-22T00:00:00Z -> 24
3114-01-23T00:00:00Z -> 24
3114-01-24T00:00:00Z -> 24
3114-01-25T00:00:00Z -> 24
3114-01-26T00:00:00Z -> 24
3114-01-27T00:00:00Z -> 24
3114-01-28T00:00:00Z -> 24
3114-01-29T00:00:00Z -> 24
3114-01-30T00:00:00Z -> 6
3114-01-30T06:00:00Z -> 6
3114-01-30T12:00:00Z -> 3
3114-01-30T15:00:00Z -> 1
```
I started with something like that in the beginning, but wanted something finer than a 24h day. I could see someone wanting to know in what part of the day changes happened (morning, afternoon, etc.), so "hours" got about 4 buckets. I also initially had sharper transitions like that, from the 30th day straight to a month, but wanted something more gradual, so we may skip one or two days in between, as opposed to going from 1 day straight to a month. It's also worth pointing out that if there is not enough time to fill up decades and such, the rest of the 60 bins will still be utilized with more recent intervals. For instance, for the first 60 * 3h updates, all 60 bins would be filled with 3h intervals. Then, to make room, we'd merge some into 6h, so it will become a mix of 3h and 6h intervals after that (with the 6h ones towards the end). We can try a few more example schedules and see which ones we like better. I'll generate some in a bit.
rnewson
left a comment
excellent work.
I tried a schedule with more individual days:
```erlang
-define(INTERVALS, [
    ?D,
    ?M,
    ?Y, ?Y * 2, ?Y * 4, ?Y * 8, ?Y * 16
]).
```
We get more than 30 days, then about 6 months. So a lot more days, but then everything sort of gets squashed into the oldest bin after two years. To keep the algorithm simple we do simple pair-wise merging, so sharp jumps from days to months don't work as well -- we would need another merge strategy (a custom clause to merge days to months, months to years). I had actually started that way, but then the algorithm had more special cases and was a bit more fiddly.

The way I generated these is with some eunit test functions added to the main couch_time_seq.erl module:
```erlang
-define(TEST_TIME, "3000-01-01T00:00:00Z").

test_time() ->
    calendar:rfc3339_to_system_time(?TEST_TIME).

test_hist() ->
    test_hist(1_000_000).

test_hist(N) ->
    TSeq = update_cnt(N, hours(1)),
    Hist = couch_time_seq:histogram(TSeq, N),
    lists:foreach(fun([T, V]) ->
        io:format(" ~s : ~B~n", [T, V])
    end, Hist).

hours(Hours) ->
    Hours * 3600.

update_cnt(N, TimeInc) ->
    update_cnt(N, test_time(), 0, TimeInc, couch_time_seq:new()).

update_cnt(0, _Time, _Seq, _TimeInc, TSeq) ->
    TSeq;
update_cnt(Cnt, Time, Seq, TimeInc, TSeq) ->
    TSeq1 = couch_time_seq:update(TSeq, Time, Seq),
    Time1 = Time + TimeInc,
    Seq1 = Seq + 1,
    update_cnt(Cnt - 1, Time1, Seq1, TimeInc, TSeq1).
```
This data structure maps time intervals to database update sequences. The idea
is to be able to quickly determine which changes occurred in a time interval.
The main goal of the design is to have a small data structure to fit well under
a few KBs and yet represent time intervals from a few hours up to decades. This
goal was accomplished by using exponentially decaying time intervals. The
further back in time we go, the longer the intervals get. This matches how
humans usually keep track of time: if we're talking about yesterday, we may
care about hours; if we talk about last month, we may care about single days;
and if we talk about last year, we may only care about the months or quarters,
and so on. If we accept this historical loss of accuracy, we can hit the design
goals of having only 60 time bins and a small, under 500B on-disk
representation.
The data structure format is a KV list of integers which looks like:
`[{Time, Seq}, {OlderTime, OlderSeq}, ...]`. Times are rounded to whole three
hour blocks.
The head of the KV list is the youngest entry. The `Time` value is the time of
the earliest sequence in that time interval. The `Seq` value indicates the
first sequence observed in the time interval.
During updates, if we're into the next three hour block and all the bins are
filled already, then the bins are "rolled up". That means finding some older
bins to merge together to make some room for the new one, such that the
bin count does not increase and stays at or below the maximum limit.
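As an illustrative sketch (not the PR's Erlang code), the rounding-and-rollup behavior described above could look like this in Python, assuming a plain newest-first list of `(time, seq)` pairs and a deliberately simplistic rollup that merges the two oldest bins (the real structure merges pairs according to the interval schedule):

```python
BLOCK = 3 * 3600          # three-hour time blocks
MAX_BINS = 60             # fixed bin budget

def update(tseq, now, seq):
    # Round the wall-clock time down to its 3h block
    t = now - now % BLOCK
    if tseq and tseq[0][0] == t:
        return tseq       # same block: keep the first seq seen in it
    tseq = [(t, seq)] + tseq
    if len(tseq) > MAX_BINS:
        # Roll up: merge the two oldest bins into one. The merged bin
        # starts at the older time with the older first-seen sequence.
        tseq = tseq[:-2] + [tseq[-1]]
    return tseq

# One update every 3 hours for 100 updates: the list never exceeds 60 bins
tseq = []
for i in range(100):
    tseq = update(tseq, i * BLOCK, i)
assert len(tseq) == MAX_BINS
```

The key invariant is that inserting a new bin never grows the list past the maximum: something older always gets merged to make room.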
The main API functions are:
* `new()` : create a new time sequence (`TSeq`) context.
* `update(TSeq, Seq)` : insert a new sequence into the timeline.
* `since(TSeq, Time) -> Seq` : get the sequence right before the timestamp.
* `histogram(TSeq, UpdateSeq)` return formatted time bins and the count of updates which
occurred during each interval. Use this for debugging or to give users an idea
how many changes occurred in each interval. If the database was upgraded
with some existing updates already, those are represented as occurring in
a time bin starting in 1970-01-01.
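A minimal Python sketch of the `since(TSeq, Time)` lookup on the KV-list shape described above; the bin data and helper names here are made up for illustration:

```python
def since(tseq, time_):
    # Walk newest -> oldest; the first bin starting at or before `time_`
    # holds the sequence to start the changes feed from.
    for t, seq in tseq:
        if t <= time_:
            return seq
    return 0  # asked about a time older than anything recorded

# Made-up bins: newest first, times in unix seconds, values are first seqs
tseq = [(9000, 42), (6000, 17), (0, 0)]
assert since(tseq, 7000) == 17     # falls inside the middle interval
assert since(tseq, 10_000) == 42   # newer than every bin start
```

Because each bin records the first sequence in its interval, answering with the bin that starts at or before the requested time may return slightly older changes, which matches the decaying-accuracy trade-off described above.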
Since we're using the operating system's knowledge of time, the solution is not
perfect. However, there are a few mitigations to help with some scenarios:
* Time values are rounded to three hour blocks. Even if the synchronization
is known to be off by a day, the user can always restrict the usage of the
`since` parameter to a larger interval, for example only ask about time
intervals greater than a few days.
* Ignore updates which appear to happen back in time. Those are likely from a
not yet synchronized clock after boot. Compare update times to the last entry
or to a config setting. Users can set the config setting to a particular
time, for example to 1971 if they know their systems jumps to 1970 after boot
due to some hardware default. Any updates during that time won't be
registered but the sequence will catch up once the NTP synchronization kicks
in. It's best to set it to a much more recent time. The default is a recent
date from before this feature was implemented. Future releases may bump that up.
* If, due to some misconfiguration, time jumps far ahead, say to year 3000,
or any other time configuration mishap occurred, it's always safe to reset the
time-seq structure and simply start fresh at the new time. The plan is for
the structure to not be interleaved into the doc bodies or the rev tree, but
instead to have it in a separate, off-to-the-side, previously unused header
field. As such, it can always be safely inspected and reset if needed.
There are EUnit and property tests for 100% test coverage.
Thanks to Ilya (@iilyak) for writing the property tests!
Since time-seq is fixed size, well under 1KB when serialized, handle it like we handle epochs in the header. That is simpler than having a new btree, or having to juggle file term pointers. When we write the 4KB db header block, most of it is empty anyway, so we'll use a few more hundred bytes from there for the time-seq data structure and, as a result, gain the ability to map update sequences to time intervals.

This change is downgrade-safe because it's backwards compatible with previously supported disk format versions. It's possible to safely downgrade to a version from before this feature was added. That is achieved by re-using a very old field from the header that was set to 0 for many years. Downgraded versions will simply ignore the new data structure. This means we don't need to run compaction to upgrade anything, or create an extra intermediate release version in between to allow for safe downgrades.

For simplicity, time-seq tracking is per-shard. During shard splitting or compaction the time-seq data structure is preserved. If the user moves a shard to another node, it will also be preserved. However, if shard files are manually truncated and rebuilt, then the updates in that shard file will appear at a later time. As such, the user might then get more (older) documents from that copy. In the context of the time-based _changes feed implementation this would look like a rewind for that shard copy. However, we have those for regular changes feeds when shards are manipulated externally, and it's documented, so it's in line with the current behavior.
This is an escape hatch in case something went wrong with time synchronization.
Users should always be able to reset the time seq structure and start from
scratch.
In fabric, the get* and set* calls are somewhat similar to how db metadata
calls like get_revs_limit / set_revs_limit work; however, to keep all the
time-seq logic together, they were added to the single `fabric_time_seq` module.
To inspect the time-seq structure use `GET $db/_time_seq`. In the result each
shard time-seq data structure is returned. It's a mapping of formatted time in
YYYY-MM-DDTHH:MM:SSZ format to count of sequence updates which occurred in that
time interval for that shard. It may look something like:
```json
{
"00000000-7fffffff": {
"[email protected]": [["2025-07-21T16:00:00Z", 1]],
"[email protected]": [["2025-07-21T16:00:00Z", 1]],
"[email protected]": [["2025-07-21T16:00:00Z", 1]]
},
"80000000-ffffffff": {
"[email protected]": [["2025-07-21T16:00:00Z", 3]],
"[email protected]": [["2025-07-21T16:00:00Z", 3]],
"[email protected]": [["2025-07-21T16:00:00Z", 3]]
}
}
```
For consistency here the result shape is modeled after the $db/_shards
endpoint.
The `DELETE $db/_time_seq` API endpoint will reset the data structure. After
calling it, the result from `GET $db/_time_seq` will look like:
```json
{
"00000000-7fffffff": {
"[email protected]": [],
"[email protected]": [],
"[email protected]": []
},
"80000000-ffffffff": {
"[email protected]": [],
"[email protected]": [],
"[email protected]": []
}
}
```
Use the new time-seq feature to stream changes from before a point in time.
This can be used for backups or any case where it helps to associate a
range of sequence updates with a time interval. The time-seq exponentially decaying
interval rules apply: the further back in time, the less accurate the time
intervals will be.
The API change consists of making `since` accept a standard time value and
streaming the changes starting right before that time value, based on the known
time-seq intervals. The time format of the `since` parameter is
YYYY-MM-DDTHH:MM:SSZ. It's valid as either ISO 8601 or RFC 3339 format.
From an API design point of view, this feature can be regarded as an extension of
the other `since` values like `now` or `0`.
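A hedged sketch (Python, with illustrative names, not the PR's Erlang) of what the translation step amounts to: parse the RFC 3339 / ISO 8601 `since` value to unix seconds once, then use the ordinary newest-first bin lookup:

```python
from datetime import datetime, timezone

def parse_since(value):
    # 'YYYY-MM-DDTHH:MM:SSZ' -> unix seconds (valid RFC 3339 and ISO 8601)
    dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

def since_seq(tseq, value):
    # Translate the time once, before the feed starts, then fall back to
    # the regular bin lookup on the newest-first (time, seq) list.
    t = parse_since(value)
    for bin_time, seq in tseq:
        if bin_time <= t:
            return seq
    return 0

assert parse_since("1970-01-01T00:00:00Z") == 0
```

After this one-time translation, the feed proceeds with a plain sequence, which is why the emitted results keep exactly the same shape as before.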
Implementation-wise, the change is treated similarly to how we treat the `now`
special value: before the changes request starts, we translate the time value
to a proper `since` sequence. After that, we continue on with that regular
sequence as if nothing special happened. Consequently, the shape of the emitted
result is exactly the same as with any previous change sequences. This is an
extra "plus" for consistency and compatibility.
To get a feel for the feature, I created a small db and updated it every few
hours during the day:
`http get $DB/db/_time_seq`
```
{
"00000000-ffffffff": {
"[email protected]": [
["2025-07-21T01:00:00Z", 15],
["2025-07-21T05:00:00Z", 2],
["2025-07-21T19:00:00Z", 9],
["2025-07-21T20:00:00Z", 5],
["2025-07-21T21:00:00Z", 70],
["2025-07-21T22:00:00Z", 10]
]
}
}
```
Change feed with `since=2025-07-21T22:00:00Z` will return documents changed
since that last hour only:
```
% http get $DB/db/_changes'?since=2025-07-21T22:00:00Z' | jq -r '.results[].id'
101
102
103
104
105
106
107
108
109
110
```
Even the somewhat obscure `since_seq` replication parameter should work, so we
can replicate from a particular point in time:
```
% http post 'http://adm:pass@localhost:15984/_replicate' \
source:='"http://adm:pass@localhost:15984/db"' \
target:='"http://adm:pass@localhost:15984/tgt"' \
since_seq:='"2025-07-21T22:00:00Z"'
{
"history": [
{
"bulk_get_attempts": 10,
"bulk_get_docs": 10,
"doc_write_failures": 0,
"docs_read": 10,
"docs_written": 10,
"end_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
"end_time": "Mon, 21 Jul 2025 22:11:59 GMT",
"missing_checked": 10,
"missing_found": 10,
"recorded_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews",
"session_id": "19252b97e34088aeaaa6cde6694a419f",
"start_last_seq": "2025-07-21T22:00:00Z",
"start_time": "Mon, 21 Jul 2025 22:11:55 GMT"
}
],
"ok": true,
"replication_id_version": 4,
"session_id": "19252b97e34088aeaaa6cde6694a419f",
"source_last_seq": "111-g1AAAABLeJzLYWBgYMxgTmHgz8tPSTV0MDQy1zMAQsMcoARTIkMeC8N_IMjKYE7MzwUKsacaG6UYGSVhasgCALN1Ews"
}
```
The target db now has only documents written in that last hour:
```
% http $DB/tgt/_all_docs | jq -r '.rows[].id'
101
102
103
104
105
106
107
108
109
110
```
Overview
Implement a time-to-sequence mapping data structure, then use it to enable the
`_changes?since=$time` feature. This started as an experiment wondering if we could have a simple data structure to map rough time intervals to db sequences, nothing too exact, just something on the order of hours, days, months, years. The original idea came from a discussion with Glynn Bird, with him wondering if it would be possible to do such a thing (thanks, @glynnbird!), and the idea of using exponentially decaying intervals came from our recent rewrite of the couch_stats histograms.
Time-Seq Data Structure
The data structure, called "time-seq" further below, is a list of 60 key-value pairs mapping time bins to db sequences. The structure can represent exponentially decaying time intervals. This decaying behavior is a trade-off of being small and having a fixed size -- the further back in time we go, the lower the accuracy. However, this is how we often regard time in general, when we talk about "yesterday", we refer to individual hours; when we talk about "last month", we may talk about individual days; when talking about two years ago, we may care about months only, etc.
Just the `time-seq` implementation and the associated tests are in the first commit. It has some additional info in the commit and module comments. Property tests were written by @iilyak (thank you!) and along with the eunit tests we got to 100% test coverage.
Serialization: Upgrade / Downgrade Behavior
Another unexpected benefit of using a small data structure is that it fits inside the header. And, with an additional bit of luck, the implementation turned out to also be downgrade-safe. This was accomplished by reusing a very old unused header field. This way, on downgrade, the older versions of CouchDB will ignore the new time-seq field. With this "trick" we can avoid having to create an intermediate downgrade target release. The addition of the time-seq data structure to the header is implemented in the second commit. That commit also implements how the structure is updated: that happens in the `couch_db_updater` right before the writes are committed.
Dealing With Time
Since we're dealing with time, we're bound to have some sharp edges. On some systems time could jump backwards briefly after boot until NTP sync kicks in, or it may misbehave in other ways. There are a few mitigations implemented to help with these sharp edges:
* Use `since` intervals larger than whole days (weeks, months).
* `GET $db/_time_seq` and `DELETE $db/_time_seq`. The result of `GET $db/_time_seq` contains all the time-seq bins with formatted timestamps mapped to the number of changes in that bin. The `DELETE` call resets the `_time_seq` structure. This allows users to inspect and reset any time-seq structure if they detect something unexpected happened with the time synchronization, for example if the date jumped forward to 2050 or something like that.

The third commit implements the new `$db/_time_seq` API endpoint and the general fabric-level integration of the new feature.
`_changes?since=$time` Implementation
The `_changes?since=$time` feature is implemented in the fourth commit. Due to all the preparatory steps this commit is pretty simple. We handle the new parameter variant just like we handle the special `now` value for descending changes feeds. After the initial start argument processing, the rest of the changes feed logic proceeds as before.
_changescommit comment with a db I updated every few hours during the day:_change?since=2025-07-21T22:00:00Zwill return documents changed since thatlast hour only:
Even the (somewhat) hidden `since_seq` replication parameter should work, so we can replicate from a particular point in time. The target db then has only the documents written in that last hour.
Downgrade Testing
Ran a downgrade test: updated a db with the PR branch, switched to main, then verified it was possible to read and write the same dbs without any issue.
Performance Impact
Ran the quick-and-dirty built-in fabric_bench test, with q=8 and small docs. Didn't notice any significant difference between main and the PR branch.
What Happens Over Time
To get a feel for how the rollup works, ran a test which updated the data structure once per hour for 1 million hours. Noticed a few things:
* With `since=3114-01-09T00:00:00Z` we may also get changes from `3114-01-08T00:00:00Z`.