# Detecting and fixing state drift with service subscriptions

## Problem

While service certificates and subscriptions hugely decrease startup time and delivery delays on server restarts, they introduce the risk of losing subscriptions in case of state drift. They also do not provide an efficient mechanism for validating that the list of subscribed queues is in sync.

How can state drift happen?

There are several possibilities:
- a lost broker response would make the broker consider the queue associated, while the client won't know it and will have to re-associate. While not a problem in itself, as it will be resolved, it would make drift be detected more frequently (regardless of the detection logic used). That service certificates are used on clients with good connections makes it less likely, though.
- server state restored from a backup after some failure. Nothing can be done to recover lost queues, but we may restore lost service associations.
- queue blocking or removal by the server operator because of a policy violation.
- server downgrade (when it loses all service associations) with a subsequent upgrade - the client would think the queues are associated while they are not, and would not receive any messages at all in this scenario.
- any other server-side error or logic error.

In addition to the possibility of drift, we simply need confidence that service subscriptions work as intended, without skipping queues. We ignored this consideration for notifications, as the tolerance to lost notifications is higher, but we can't ignore it for messages.

## Solution

The previously considered approach of sending NIL to all queues without messages is very expensive in traffic (most queues don't have messages), and it also makes detecting and validating drift in the client very expensive because of asynchronous/concurrent events.

We cannot read all queues into memory, we cannot aggregate all responses in memory, and we cannot create database writes on every single service subscription to, say, 1m queues (a realistic number), as that simply won't work well even at the current scale.

An approach with an efficient way to detect drift that loads the full list of IDs only when drift is detected also won't work well, as drifts may be common, so we need both an efficient way to detect that there is a diff and an efficient way to reconcile it.

### Drift detection

Both client and server would maintain the number of associated queues and a "symmetric" hash over the set of queue IDs. The requirements for this hash algorithm are:
- not cryptographically strong, to be fast.
- 128 bits, to minimize collisions over a large set of millions of queues.
- symmetric - the result should not depend on ID order.
- allows fast additions and removals.

In this way, every time an association is added or removed (including a queue being marked as deleted), both peers would recompute this hash in the same transaction.

The client would suspend sending and processing any other commands to the server and the queues of this server until the SOKS response is received from this server, to prevent drift. This can be achieved with per-server semaphores/locks in memory, as sketched below. UI clients need to become responsive sooner than these responses are received, but we do not use service certificates on UI clients, and chat relays may prevent operations on server queues until the SOKS response is received.
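
A minimal sketch of such a per-server gate using STM; the `ServerGate` type and function names are illustrative assumptions, not the actual client implementation:

```haskell
import Control.Concurrent.STM

-- One gate per server: empty while SUBS is in flight, full once SOKS arrived.
newtype ServerGate = ServerGate (TMVar ())

newServerGate :: IO ServerGate
newServerGate = ServerGate <$> newTMVarIO ()

-- Close the gate before sending SUBS; other commands to this server must wait.
closeGate :: ServerGate -> IO ()
closeGate (ServerGate v) = atomically $ takeTMVar v

-- Open the gate once the SOKS response has been processed.
openGate :: ServerGate -> IO ()
openGate (ServerGate v) = atomically $ putTMVar v ()

-- Any other command to this server blocks here until the gate is open.
awaitGate :: ServerGate -> IO ()
awaitGate (ServerGate v) = atomically $ readTMVar v
```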

The SOKS response would include both the count of associated queues (as now) and the hash over all associated queue IDs (to be added). If both count and hash match, the client will not do anything. If either does not match, the client would perform a full sync (see below).

There is value in doing the same in the notification server as well, to detect and "fix" drifts.

The algorithm to compute hashes can be the following (a code sketch follows the list):

1. Compute the hash of each queue ID using xxHash3_128 (the [xxhash-ffi](https://hackage.haskell.org/package/xxhash-ffi) library). The hashes don't need to be stored or loaded all at once: initially, it can be done with streaming if it is detected on start that there is no pre-computed hash.
2. Combine the hashes using XOR. XOR is both commutative and associative, so it produces the same aggregate hash irrespective of the ID order.
3. Adding a queue ID to the pre-computed hash requires a single XOR with the ID hash: `new_aggregate = aggregate XOR hash(queue_id)`.
4. Removing a queue ID from the pre-computed hash requires the same XOR (XOR is involutory - it undoes itself): `new_aggregate = aggregate XOR hash(queue_id)`.
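
A minimal sketch of this aggregate, with illustrative names; two seeded `xxh64` calls from xxhash-ffi stand in for a 128-bit xxHash3 binding, whose exact API depends on the library version:

```haskell
import Data.Bits (xor)
import Data.ByteString (ByteString)
import Data.Digest.XXHash.FFI (xxh64) -- xxhash-ffi; stand-in for xxHash3_128
import Data.List (foldl')
import Data.Word (Word64)

-- 128-bit aggregate modelled as two 64-bit words.
data AggregateHash = AggregateHash !Word64 !Word64
  deriving (Eq, Show)

emptyAggregate :: AggregateHash
emptyAggregate = AggregateHash 0 0

-- Hash one queue ID; two seeded 64-bit hashes approximate a 128-bit hash.
hashQueueId :: ByteString -> AggregateHash
hashQueueId qId = AggregateHash (xxh64 qId 0) (xxh64 qId 1)

-- XOR is commutative, associative and involutory, so the same function both
-- adds a queue ID to the aggregate and removes it.
toggleQueueId :: AggregateHash -> ByteString -> AggregateHash
toggleQueueId (AggregateHash a b) qId =
  let AggregateHash x y = hashQueueId qId in AggregateHash (a `xor` x) (b `xor` y)

-- The initial computation can stream over all queue IDs without keeping them.
aggregateAll :: [ByteString] -> AggregateHash
aggregateAll = foldl' toggleQueueId emptyAggregate
```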

These hashes need to be computed per user/server in the client and per service certificate in the server - on startup, both have to validate them and compute them once if necessary.

There can also be a start-up option to recompute the hash(es) to detect and fix any errors.

This is all rather simple and would help detect drifts.

### Synchronization when drift is detected

The assumption here is that in most cases drifts are rare and isolated to a few IDs (e.g., this is the case with the notification server).

But the algorithm should be resilient to losing all associations, and it should not be substantially worse than simply restoring all associations or loading all IDs.

We have `c_n` and `c_hash` for the client-side count and hash of queue IDs, and `s_n` and `s_hash` for the server-side ones, which are returned in the SOKS response to the SUBS command. The client then decides as follows (this three-way decision is sketched after the list):

1. If `c_n /= s_n || c_hash /= s_hash`, the client must perform a sync.

2. If `abs(c_n - s_n) / max(c_n, s_n) > 0.5` (more than half of the queues are different), the client will request the full list of queues and will perform a diff with the queues it has. While performing the diff the client will continue to block operations with this user/server.

3. Otherwise the client would perform an algorithm for determining the difference between the queue IDs on the client and the server. This algorithm can be made efficient (`O(log N)`) by relying on efficient sorting of IDs and database loading of ranges, via computing and communicating hashes of ranges and performing a binary search on ranges, with batching to optimize network traffic.
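
A minimal sketch of this decision, reusing `AggregateHash` from the sketch above; `SyncAction` and `syncAction` are hypothetical names, not protocol types:

```haskell
-- Possible outcomes once the SOKS response arrives (illustrative names).
data SyncAction = NoSync | FullListSync | RangeSync
  deriving (Eq, Show)

syncAction :: Int -> AggregateHash -> Int -> AggregateHash -> SyncAction
syncAction c_n c_hash s_n s_hash
  | c_n == s_n && c_hash == s_hash = NoSync -- counts and hashes agree
  | countDiff / fromIntegral (max c_n s_n) > 0.5 = FullListSync -- >50% differ
  | otherwise = RangeSync -- binary search over hashed ID ranges
  where
    countDiff = fromIntegral (abs (c_n - s_n)) :: Double
```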

This algorithm is similar to Merkle tree reconciliation, but it is optimized for database reading of ordered ranges and for our 16kb block size, to minimize network requests.

The algorithm (one comparison step is sketched after the list):
1. The client requests all ranges from the server.
2. The server computes hashes for N ranges of IDs and sends them to the client. Each range includes a start_id, an optional end_id (omitted for single-ID ranges) and the XOR-hash of the range. N is determined based on the block size and the range size.
3. The client performs the same computation for the same ranges and compares them with the ranges returned by the server, while detecting any gaps between ranges and missing range boundaries.
4. If more than half of the ranges don't match, the client requests the full list. Otherwise it repeats the same algorithm for each mismatched range and for the gaps.
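
One comparison step might look like this minimal sketch; `QueueId`, `RangeHash` and `localRangeHash` are simplified stand-ins, and gap detection is omitted:

```haskell
type QueueId = Integer -- simplified stand-in for RecipientId
type RangeHash = Integer -- simplified stand-in for the XOR range hash

-- A range as sent by the server: start ID, optional end ID (absent for
-- single-ID ranges) and the XOR-hash over the IDs it covers.
data Range = Range {rStart :: QueueId, rEnd :: Maybe QueueId, rHash :: RangeHash}

-- Ranges whose locally computed hash differs from the server's; these get
-- subdivided and compared again, giving the binary search over ranges.
mismatchedRanges :: (QueueId -> Maybe QueueId -> RangeHash) -> [Range] -> [Range]
mismatchedRanges localRangeHash =
  filter (\r -> localRangeHash (rStart r) (rEnd r) /= rHash r)
```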

It can be further optimized by merging adjacent ranges and by batching all range requests; this is quite simple.

Once the client determines the lists of missing and extra queues, it can (a diff sketch follows the list):
- create associations (via SUB) for the missing queues,
- request removal of associations (a new command, e.g. BUS) for the extra queues on the server.
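
Once the mismatched ranges are narrowed down to individual IDs, a single merge pass over the two sorted ID lists yields both lists; a minimal sketch with illustrative names:

```haskell
-- Given sorted client-side and server-side ID lists, return the IDs missing
-- on the server and the IDs extra on the server: a merge over sorted lists.
diffIds :: Ord a => [a] -> [a] -> ([a], [a])
diffIds cs [] = (cs, [])
diffIds [] ss = ([], ss)
diffIds (c : cs) (s : ss)
  | c == s = diffIds cs ss
  | c < s = let (missing, extra) = diffIds cs (s : ss) in (c : missing, extra)
  | otherwise = let (missing, extra) = diffIds (c : cs) ss in (missing, s : extra)
```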

The pseudocode for the algorithm:

For the server to return all ranges, or the subranges of a requested range:

```haskell
getSubRanges :: Maybe (RecipientId, RecipientId) -> M [(RecipientId, Maybe RecipientId, Hash)]
getSubRanges range_ = do
  ((min_id, max_id), s_n) <- case range_ of
    Nothing -> getAssociatedQueueRange -- for the certificate in the client session
    Just range -> (range,) <$> getAssociatedQueueCount range
  if
    | s_n <= max_N -> reply_with_single_queue_ranges
    | otherwise -> do
        let range_size = s_n `div` max_N
        read_all_ranges -- in a recursive loop, with max_id, range_hash and next_min_id in each step
        reply_ranges
```

We don't need to implement this synchronization logic right now, so the client logic is not included here - it is sufficient to implement drift detection, and the action to fix the drift would be to disable and re-enable certificates via some command-line parameter of the CLI.