FEDS: FEderated DataSpaces

An adaptation to the modern Syndicated Actor Model of the prototype dataspace federation developed as part of the FRµIT project in 2019.

Protocol variations

So far, I have explored three alternative protocols for federating assertions among nodes connected in a spanning tree. Each implementation is roughly 40 lines of Racket code.

All the protocol variations considered here have the same core structure. Each node has exactly one internal dataspace, called a fragment, speaking a custom protocol. Each node has some entities that link to adjacent nodes' fragments, and some entities that proxy "local" access to the node's own fragment. Local access makes a fragment appear like a normal dataspace, with all the federation machinery hidden away in its internal operation. Links operate in quite different ways for each protocol variation.

In every case, the protocols establish a rooted, directed tree of nodes, not a DAG or a general graph. Each node has exactly one "upstream" node. The connection between a node and its upstream is called an uplink.

For simplicity of presentation, I'll discuss ideas as if only assertions exist, without messages. Implementing support for messages is reasonably straightforward given solutions for assertions.
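
To make the protocol walk-throughs below concrete, the sketches in this README use a small toy Python model rather than the actual Racket implementation: a fragment is reduced to a deduplicated collection of assertions plus pattern observers, entities are modelled as plain callables, and patterns as predicate functions. The class below is illustrative scaffolding for those sketches only; none of its names come from the real code, and retraction is omitted entirely.

```python
# Toy stand-in for a fragment dataspace, used by the sketches below.
# Entities are modelled as callables and patterns as predicates; there is
# no retraction, only assertion.

class Fragment:
    def __init__(self, name):
        self.name = name
        self.assertions = []        # deduplicated list of asserted values
        self.observers = []         # (predicate, callback) pairs

    def assert_(self, value):
        if value in self.assertions:
            return                  # dataspaces deduplicate assertions
        self.assertions.append(value)
        for pred, callback in list(self.observers):
            if pred(value):
                callback(value)

    def observe(self, pred, callback):
        # a newly installed observer immediately sees matching existing assertions
        self.observers.append((pred, callback))
        for value in list(self.assertions):
            if pred(value):
                callback(value)
```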

Common concepts: relay object, assertion relay

Two frequently-occurring concepts are the notions of relay object and assertion relay, related but distinct.

A relay object is an entity which takes each received assertion matching some pattern, performs some simple, optional transformation on it such as adding or removing an envelope, and asserts the result in turn at some designated target entity. Relaying may be, but is usually not, conditional (on e.g. some property of the assertion).

An assertion relay is like a relay object, but for use with an existing dataspace. It is an assertion of interest - an Observe record - that matches some (dataspace) pattern, transforms the bindings into a new value, and asserts the new value at some target entity.
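
In the toy model above, the two notions might look like this. This is a minimal sketch: RelayObject and assertion_relay are names invented here for illustration, not the Racket implementation's.

```python
class RelayObject:
    """Entity that transforms each received assertion (e.g. adding or
    stripping an envelope) and asserts the result at a target entity."""
    def __init__(self, transform, target):
        self.transform = transform
        self.target = target

    def __call__(self, value):
        self.target(self.transform(value))

def assertion_relay(fragment, pattern, rewrite, target):
    """An Observe over an existing dataspace: each match is rewritten and
    asserted at the target entity."""
    fragment.observe(pattern, lambda value: target(rewrite(value)))

# Example: a relay object wrapping values as <content v> into a fragment, and
# an assertion relay forwarding each <content v> onward as <link f v>.
f = Fragment("example")
assertion_relay(f,
                lambda v: v[0] == "content",      # pattern
                lambda v: ("link", f, v[1]),      # transformation
                print)                            # target entity
wrap = RelayObject(lambda v: ("content", v), f.assert_)
wrap(42)                                          # target sees ("link", f, 42)
```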

The "copy-up" protocol

This protocol works by copying all non-subscription assertions upward in the tree. The root of the federation tree then has a copy of every non-subscription assertion made anywhere in any fragment connected to the tree. Subscription assertions are aggregated before being relayed up the tree. Information flows back down the tree "out of band" by the normal Syndicate assertion flow mechanism: aggregated distributor entity references are embedded in Observe records and targeted by <at...> records.

Create = <create @kind LinkKind @reply #:#:any> .
LinkKind = =uplink / =local .

LinkEnvelope = <link @id any @body any> .
Content = <content @body any> .
FromUplink = <from-uplink @distributor #:any @body any> .
At = <at @observer #:any @body any> .

Each local access entity (which, remember, is acting like a plain dataspace for "local" consumers unaware of federation) takes assertions v and places them in the local fragment wrapped in a <link #f v> envelope.

An uplink is an assertion relay from the local fragment's <content v> assertions to the remote fragment as <link f v>, where f is the entity reference for the local fragment.

Internal to each fragment, every <link _ v> where v is not an observation record <Observe pat e> results in <content v> being asserted directly back into the fragment.

Observation records are handled differently: if <link _ <Observe pat _>> is asserted into a fragment, a "distributor" relay object d is created which takes each assertion x it receives and asserts a <from-uplink d x> into the fragment. A <content <Observe pat d>> assertion makes the distributor visible to the uplink. Then, for each specific observer e of the pattern pat, an assertion relay takes <from-uplink d x> and asserts x indirectly at e by way of placing an <at e x> record in the fragment.

The purpose of the <at...> records is deduplication. Every fragment has an assertion relay responding to every <at e x> with an assertion of x at e.

Finally, for each <link f <Observe pat e>> in a fragment, an Observe assertion is placed into the fragment with pattern <group <rec link> {0: <bind <_>> 1: pat}> and observer entity e'. The entity e' is a relay object that expects assertions containing the bindings [v ...] from pat plus an extra binding i for the ID of the asserting fragment. If i is #f, or is not equal to f, <at e [v ...]> is then asserted.
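
A hedged, end-to-end sketch of these rules in the toy model from earlier (illustrative Python, not the 40-line Racket implementation): patterns are predicates, so whole matched values are relayed rather than binding tuples, None stands in for #f, and a Fragment object stands in for its own entity reference.

```python
# Copy-up rules in the toy model. Records are tuples tagged with the schema
# labels. Assumes the Fragment class from the earlier sketch.

def is_observe(v):
    return isinstance(v, tuple) and v[0] == "Observe"

def local_assert(fragment, v):
    # local access entity: place v in the fragment wrapped as <link #f v>
    fragment.assert_(("link", None, v))

def uplink(child, parent):
    # relay each <content v> in the child upward as <link child v> in the parent
    child.observe(lambda v: v[0] == "content",
                  lambda v: parent.assert_(("link", child, v[1])))

def install_copy_up(f):
    # <at e x>  ==>  assert x directly at entity e (deduplication point)
    f.observe(lambda v: v[0] == "at", lambda v: v[1](v[2]))

    # <link _ v>, v not an observation  ==>  <content v> back into the fragment
    f.observe(lambda v: v[0] == "link" and not is_observe(v[2]),
              lambda v: f.assert_(("content", v[2])))

    # <link i <Observe pat e>>  ==>  distributor plus link-tagged observation
    def on_link_observe(v):
        _, link_id, (_, pat, e) = v

        # distributor: matches arriving from further up the tree come back
        # wrapped as <from-uplink d x>, then flow to e via <at e x>
        def distributor(x):
            f.assert_(("from-uplink", distributor, x))
        f.assert_(("content", ("Observe", pat, distributor)))
        f.observe(lambda w: w[0] == "from-uplink" and w[1] is distributor,
                  lambda w: f.assert_(("at", e, w[2])))

        # matches already present as <link i x> in this fragment, suppressing
        # echoes of assertions that arrived from link_id itself
        def on_match(w):
            if w[1] is None or w[1] is not link_id:
                f.assert_(("at", e, w[2]))
        f.observe(lambda w: w[0] == "link" and pat(w[2]), on_match)

    f.observe(lambda v: v[0] == "link" and is_observe(v[2]), on_link_observe)

# Two-node demo: a subscriber at the root sees an assertion made at a leaf.
root, leaf = Fragment("root"), Fragment("leaf")
install_copy_up(root)
install_copy_up(leaf)
uplink(leaf, root)

def is_temp(v):
    return isinstance(v, tuple) and v[0] == "temperature"

seen = []
local_assert(root, ("Observe", is_temp, seen.append))
local_assert(leaf, ("temperature", "kitchen", 21.5))
print(seen)   # [('temperature', 'kitchen', 21.5)]
```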

The "observer-flood" protocol

This protocol floods the tree with every individual subscription. Replica subscriptions then react locally to matching assertions. Any assertion matching a given subscription causes assertion of an <at...> record, one for each subscriber. These are relayed hop-by-hop toward the target entity. No subscription aggregation happens at all: if multiple entities subscribe to a given pattern, a separate respondent assertion traverses the network for every match, for every subscriber.

This protocol variation is roughly equivalent to, and is directly inspired by, the FRμIT federation experiments I did in 2019. It's much simpler, though!

CreateUplink = <uplink @downlinkFragment #:any @reply #:#:any> .
CreateDownlink = <downlink @reply #:#:any> .

From = <from @fragment #:any @body any> .
At = <at @observer #:any @item any> .

Each local access entity takes assertions v and places them in the local fragment f wrapped in a <from f v> envelope.

Uplinks are symmetrical (once established): consider the half-uplink from fragment a to fragment b, where a symmetrical half-uplink from b to a also exists. Assertions <from f <Observe pat e>> result, when f ≠ b, in <from a <Observe pat e>> being asserted at b, and also in mirroring of assertions <at e x> from b as assertions <at e x> at a.

Within each fragment f, <from f <Observe _ e>> causes establishment of a relay of <at e x> to x at e.

Finally, at fragment f, <from _ <Observe pat e>> causes assertion of <Observe <group <rec from> {0: f 1: pat}> e'>. The entity e' takes all assertions x and asserts <at e x> in f.
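
Again as an illustrative sketch in the toy model (reusing the Fragment class from earlier; entity references are compared by identity, patterns are predicates, and only the rules named above are modelled):

```python
# Observer-flood rules in the toy model. Subscriptions are flooded whole,
# entity included; only <at e x> answers travel back along the same path.

def is_observe(v):
    return isinstance(v, tuple) and v[0] == "Observe"

def local_assert(fragment, v):
    # local access entity at fragment f: wrap v as <from f v>
    fragment.assert_(("from", fragment, v))

def half_uplink(a, b):
    # forward a's subscriptions one hop to b (unless they arrived from b), and
    # mirror b's <at e x> answers for those subscriptions back down to a
    def on_sub(v):
        _, origin, obs = v
        if origin is b:
            return
        e = obs[2]
        b.assert_(("from", a, obs))
        b.observe(lambda w: w[0] == "at" and w[1] is e,
                  lambda w: a.assert_(("at", e, w[2])))
    a.observe(lambda v: v[0] == "from" and is_observe(v[2]), on_sub)

def install_observer_flood(f):
    def on_sub(v):
        _, origin, (_, pat, e) = v
        if origin is f:
            # locally-originated subscription: deliver <at e x> to e itself
            f.observe(lambda w: w[0] == "at" and w[1] is e,
                      lambda w: e(w[2]))
        # every subscription, local or flooded in, watches this fragment's own
        # local assertions and answers each match with <at e x>
        f.observe(lambda w: w[0] == "from" and w[1] is f and pat(w[2]),
                  lambda w: f.assert_(("at", e, w[2])))
    f.observe(lambda v: v[0] == "from" and is_observe(v[2]), on_sub)

# Two-node demo with symmetric half-uplinks.
left, right = Fragment("left"), Fragment("right")
install_observer_flood(left)
install_observer_flood(right)
half_uplink(left, right)
half_uplink(right, left)

def is_temp(v):
    return isinstance(v, tuple) and v[0] == "temperature"

seen = []
local_assert(right, ("Observe", is_temp, seen.append))   # subscriber at right
local_assert(left, ("temperature", "hallway", 19.0))     # publisher at left
print(seen)   # [('temperature', 'hallway', 19.0)]
```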

The "aggregate-subs" protocol

As in the "observer-flood" protocol, subscriptions are copied across the whole tree. The important difference is that just the subscription patterns are replicated. Subscription entities are treated separately, as a fragment-local concern.

Each subscription is labelled with a "next-hop" fragment. Each local assertion (from a local access entity pointing to this particular fragment) that matches the subscription's pattern is labelled with the pattern, plus the fragment it originated with (a "previous-hop" label), and copied to all "next-hop" fragments for the matching pattern. For a given "next-hop", an assertion is only copied if it has a different "previous-hop" label. This, plus the strict tree structure of the federated network, avoids routing loops.

CreateUplink = <uplink @downlinkFragment #:any @reply #:#:any> .
CreateDownlink = <downlink @reply #:#:any> .

Local = <local @item any> .
At = <at @fragment #:any @item any> .
Sub = <sub @fragment #:any @pat any> .
Pub = <pub @fragment #:any @pat any @vs [any ...]> .

Each local access entity takes assertions v and places them in the local fragment wrapped in a <local v> envelope.

Uplinks are, as in the "observer-flood" protocol, symmetrical. Consider again a half-uplink from fragment a to fragment b. Every <sub f pat> assertion from a results in assertion of <at b <sub a pat>> if f ≠ b. Notice there is no mention of entities in these sub records.

Within a fragment f, each <sub _ pat> assertion causes assertion of <Observe <group <rec local> {0: pat}> e>, where e is a relay object taking the bindings [v ...] and asserting <pub f pat [v ...]> at f.

Separately, each <sub t pat> at f, when t ≠ f, establishes an assertion relay from <pub a pat x> at f to <at t <pub f pat x>> whenever a ≠ t.

Local observers at f, <local <Observe pat e>>, trigger assertion of <sub f pat> at f, plus establishment of an assertion relay sending each <pub _ pat x> at f directly as x to e.

Finally, an assertion relay at every fragment f sends each <at t x> directly as x to t. The purpose of <at...> in this protocol is local deduplication of assertions before sending them on to a potentially-distant peer, even though that peer is a deduplicating dataspace itself. This saves on network traffic.
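
A sketch of these rules in the same toy model (illustrative only, reusing the Fragment class and conventions from the earlier sketches): only the pattern, never the subscribing entity, crosses a fragment boundary, and pubs arriving from a next hop are never forwarded back to it.

```python
# Aggregate-subs rules in the toy model. Patterns are predicates; at-records
# carry fragments (next hops), not entities.

def is_observe(v):
    return isinstance(v, tuple) and v[0] == "Observe"

def local_assert(fragment, v):
    # local access entity: wrap v as <local v>
    fragment.assert_(("local", v))

def half_uplink(a, b):
    # flood a's subscription patterns one hop to b, unless they arrived from b
    a.observe(lambda v: v[0] == "sub" and v[1] is not b,
              lambda v: a.assert_(("at", b, ("sub", a, v[2]))))

def install_aggregate_subs(f):
    # <at t x>  ==>  assert x at fragment t (after local deduplication)
    f.observe(lambda v: v[0] == "at", lambda v: v[1].assert_(v[2]))

    # <local <Observe pat e>>  ==>  <sub f pat>, plus delivery of matching
    # <pub _ pat x> records directly to the local subscriber e
    def on_local_observe(v):
        _, (_, pat, e) = v
        f.assert_(("sub", f, pat))
        f.observe(lambda w: w[0] == "pub" and w[2] is pat,
                  lambda w: e(w[3]))
    f.observe(lambda v: v[0] == "local" and is_observe(v[1]), on_local_observe)

    # <sub t pat>  ==>  watch this fragment's local assertions; if t is a
    # remote next hop, forward matching pubs toward t, skipping pubs whose
    # previous hop was t itself (loop avoidance)
    def on_sub(v):
        _, t, pat = v
        f.observe(lambda w: w[0] == "local" and pat(w[1]),
                  lambda w: f.assert_(("pub", f, pat, w[1])))
        if t is not f:
            f.observe(lambda w: w[0] == "pub" and w[2] is pat and w[1] is not t,
                      lambda w: f.assert_(("at", t, ("pub", f, pat, w[3]))))
    f.observe(lambda v: v[0] == "sub", on_sub)

# Two-node demo: subscriber and publisher on different fragments.
left, right = Fragment("left"), Fragment("right")
install_aggregate_subs(left)
install_aggregate_subs(right)
half_uplink(left, right)
half_uplink(right, left)

def is_temp(v):
    return isinstance(v, tuple) and v[0] == "temperature"

seen = []
local_assert(right, ("Observe", is_temp, seen.append))   # subscriber at right
local_assert(left, ("temperature", "cellar", 11.0))      # publisher at left
print(seen)   # [('temperature', 'cellar', 11.0)]
```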

Discussion

An interesting consideration is the sync operation from the Syndicated Actor Model. It seems to me that synchronizing to a federated dataspace should reach all the leaves of the spanning tree and come back before completing, making it in principle quite expensive! Idea: amortize the cost by having an epoch counter constantly ticking away, so that local sync operations can simply wait for the epoch counter to increment, say, twice.
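
A rough sketch of that idea follows. It is hypothetical: EpochSync and its methods are invented here for illustration, and the two-increment rule is simply the "increment, say, twice" suggestion above (two increments ensure at least one full round trip began after the sync was requested).

```python
import threading
import time

class EpochSync:
    """Hypothetical helper: one tree-wide epoch counter ticks continuously,
    and a local sync just waits for it to advance twice instead of walking
    the spanning tree itself."""

    def __init__(self):
        self.epoch = 0
        self.changed = threading.Condition()

    def tick(self):
        # called each time a full root -> leaves -> root round trip completes
        with self.changed:
            self.epoch += 1
            self.changed.notify_all()

    def sync(self, increments=2):
        # waiting out two increments guarantees at least one complete round
        # trip started strictly after this sync was requested
        with self.changed:
            target = self.epoch + increments
            while self.epoch < target:
                self.changed.wait()

# toy usage: a background "federation" ticking every 100 ms
es = EpochSync()

def ticker():
    for _ in range(5):
        time.sleep(0.1)
        es.tick()

threading.Thread(target=ticker, daemon=True).start()
es.sync()
print("synced at epoch", es.epoch)
```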

The federation implementation from the FRμIT project of 2019 is hundreds of lines long. What makes for such a difference? The answer is that the 2019 code was implementing an early form of the Syndicate wire protocol in addition to providing federation. The current implementation disentangles these two considerations into a design where federation can be elegantly added to a network of Syndicated actors, without revision of the core Syndicate protocol or the Syndicated Actor Model itself.

Subscription aggregation is done naively in the current protocols. The structure of dataspace implementations could usefully be reflected in the federation protocol, perhaps treating the "skeleton", "constant positions" and "binding positions" of subscriptions separately when aggregating. This starts to move toward a routing-protocol-like system. Imagine some structure to the subscriptions that happens to reflect locality of connection among nodes in the spanning tree: nodes "downstream" of a given node tend to have subscriptions with a "common prefix", and the common prefix gets shorter as you travel up the tree. If subscriptions were aggregated based on their common prefixes, you would end up with something analogous to IP subnet aggregation and comparable to IP routing protocols.
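
For a feel of what prefix-based aggregation might look like, here is a hypothetical sketch (not part of any of the current protocols) with subscriptions modelled simply as tuples of path components:

```python
# Hypothetical prefix aggregation: a node forwards upstream only the
# subscriptions not already covered by a shorter ("wider") one, in the
# spirit of IP route aggregation.

def covers(p, q):
    """True if subscription prefix p covers subscription prefix q."""
    return len(p) <= len(q) and q[:len(p)] == p

def aggregate(subscriptions):
    """Drop any subscription already covered by a shorter one."""
    kept = []
    for s in sorted(set(subscriptions), key=len):
        if not any(covers(k, s) for k in kept):
            kept.append(s)
    return kept

# Example: two downstream nodes interested in rooms of the same building,
# plus one broader subscription that covers them both.
downstream = [
    ("sensor", "building-3", "room-12", "temperature"),
    ("sensor", "building-3", "room-14", "temperature"),
    ("sensor", "building-3"),
]
print(aggregate(downstream))   # [('sensor', 'building-3')]
```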