# Preserves

Synit makes **extensive** use of *Preserves*, a programming-language-independent language for
data.

 - [Preserves homepage](https://preserves.dev/)
 - [Preserves specification](https://preserves.dev/preserves.html)
 - [Preserves Schema specification](https://preserves.dev/preserves-schema.html)
 - [Source code](https://gitlab.com/preserves/preserves) for many (not all) of the implementations
 - Implementations for
   [Nim](https://git.sr.ht/~ehmry/preserves-nim),
   [Python](https://pypi.org/project/preserves/),
   [Racket](https://pkgs.racket-lang.org/package/preserves),
   [Rust](https://docs.rs/preserves/latest/preserves/),
   [Squeak Smalltalk](https://squeaksource.com/Preserves.html),
   [TypeScript/JavaScript](https://www.npmjs.com/org/preserves)

The Preserves data language is in many ways comparable to JSON, XML, S-expressions, CBOR, ASN.1
BER, and so on. From the [specification
document](https://preserves.dev/preserves.html):

> Preserves supports *records* with user-defined *labels*, embedded *references*, and the usual
> suite of atomic and compound data types, including *binary* data as a distinct type from text
> strings.

## Why does Synit rely on Preserves?

There are four aspects of Preserves that make it particularly relevant to Synit:

 - the core Preserves [data language](#grammar-of-values) has a robust semantics;
 - a [canonical form](#canonical-form) exists for every Preserves value;
 - Preserves values may have [capability references](#capabilities) embedded within them; and
 - Preserves has a [schema language](#schemas) useful for specifying protocols among actors.

## Grammar of values

Preserves has programming-language-independent *semantics*: the specification defines an
*equivalence relation* over Preserves values.[^preserves-ordering-exists-too] This makes it a
solid foundation for a multi-language, multi-process, potentially distributed system like
Synit.[^dataspaces-need-data-with-semantics]

### Values and Types

Preserves values come in various *types*: a few basic atomic types, plus sequence, set,
dictionary, and record compound types. From the specification:

        Value = Atom            Atom = Boolean
              | Compound             | Float
              | Embedded             | Double
                                     | SignedInteger
     Compound = Record               | String
              | Sequence             | ByteString
              | Set                  | Symbol
              | Dictionary

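
To make the role of the equivalence relation concrete, here is a minimal sketch in Rust of a value type shaped like part of the grammar above. It is a hand-written illustration, not the API of the `preserves` crate: the `Value` enum and the `dict` helper are assumptions made for this example, floats, records, and embedded values are left out for brevity, and the derived ordering is just whatever Rust derives, not the total order defined by the specification. The point is only that equality is decided by the values themselves, never by how they happened to be written down or assembled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory value type covering part of the grammar above.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Boolean(bool),
    SignedInteger(i64),
    String(String),
    ByteString(Vec<u8>),
    Symbol(String),
    Sequence(Vec<Value>),
    Set(BTreeSet<Value>),
    Dictionary(BTreeMap<Value, Value>),
}

/// Build a dictionary from key/value pairs given in any order.
fn dict(entries: Vec<(Value, Value)>) -> Value {
    // Collecting into a BTreeMap discards the order in which the
    // entries were supplied, so it cannot influence equality.
    Value::Dictionary(entries.into_iter().collect())
}
```

With these definitions, two dictionaries built from the same entries in different orders compare equal, while the string `"hello"`, the symbol `hello`, and a byte string holding the same bytes are three different values.
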
### Concrete syntax

Preserves offers *multiple* syntaxes, each useful in different settings. Values are
automatically, losslessly translatable from one syntax to another because Preserves' semantics
are syntax-independent.

The core Preserves specification defines two syntaxes: a text-based, human-readable, JSON-like
syntax that is a syntactic superset of JSON, and a completely equivalent compact binary syntax,
which is crucial to the definition of [canonical form](#canonical-form) for Preserves
values.[^syrup]

Here are a few example values, written using the text syntax (see [the
specification](https://preserves.dev/preserves.html#textual-syntax) for the
grammar):

         Boolean : #t #f
           Float : 1.0f 10.4e3f -100.6f
          Double : 1.0 10.4e3 -100.6
         Integer : 1 0 -100
          String : "Hello, world!\n"
      ByteString : #"bin\x00str\x00" #[YmluAHN0cgA] #x"62696e0073747200"
          Symbol : hello-world |hello world| = ! hello? || ...
          Record : <label field1 field2 ...>
        Sequence : [value1 value2 ...]
             Set : #{value1 value2 ...}
      Dictionary : {key1: value1 key2: value2 ...: ...}
        Embedded : #!value

Commas are optional in sequences, sets, and dictionaries.

### Canonical form

Every Preserves value can be serialized into a *canonical form* using the [binary
syntax](https://preserves.dev/preserves.html#compact-binary-syntax) along with
[a few simple rules](https://preserves.dev/canonical-binary.html) about
serialization ordering of elements in sets and keys in dictionaries.

Having a canonical form means that, for example, a cryptographic hash of a value's canonical
serialization can be used as a unique fingerprint for the value.

For example, the SHA-512 digest of the canonical serialization of the value

```preserves
<sms-delivery <address international "31653131313">
              <address international "31655512345">
              <rfc3339 "2022-02-09T08:18:29.88847+01:00">
              "This is a test SMS message">
```

is

    bfea9bd5ddf7781e34b6ca7e146ba2e442ef8ce04fd5ff912f889359945d0e2967a77a13
    c86b13959dcce7e8ba3950d303832b825648609447b3d147677163ce

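
The idea behind the ordering rules can be sketched in a few lines of Rust. This is a toy under loud assumptions: `encode_str` is a made-up length-prefixed encoding that merely stands in for the real Preserves binary syntax, and `encode_dict_canonically` is a hypothetical helper handling only string keys and values. What it shows is the general technique of sorting entries by their encoded bytes, so that a dictionary has exactly one serialization no matter what order its entries were supplied in.

```rust
/// Toy encoding of a string: one length byte, then the UTF-8 bytes.
/// This is NOT the Preserves binary syntax, only a stand-in for it.
fn encode_str(s: &str) -> Vec<u8> {
    assert!(s.len() < 256, "toy encoding only handles short strings");
    let mut out = vec![s.len() as u8];
    out.extend_from_slice(s.as_bytes());
    out
}

/// Serialize dictionary entries so that the output does not depend on
/// the order in which the entries are given. Keys are assumed unique.
fn encode_dict_canonically(entries: &[(&str, &str)]) -> Vec<u8> {
    // Encode every key and value first.
    let mut encoded: Vec<(Vec<u8>, Vec<u8>)> = entries
        .iter()
        .map(|(k, v)| (encode_str(k), encode_str(v)))
        .collect();
    // Sort bytewise; because keys are unique, this orders by encoded key.
    encoded.sort();
    // Concatenate the entries in their sorted order.
    let mut out = Vec::new();
    for (k, v) in encoded {
        out.extend(k);
        out.extend(v);
    }
    out
}
```

Because the bytes are the same for every ordering of the input, hashing them yields a fingerprint that any two parties holding the same value will agree on.
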
### Capabilities

Preserves values can include *embedded references*, written as values with a `#!` prefix. For
example, a command adding `<some-setting>` to the user settings database might look like this
as it travels over a Unix pipe connecting a program to the root dataspace:

```preserves
<user-settings-command <assert <some-setting>> #![0 123]>
```

The `user-settings-command` structure includes the `assert` command itself, plus an embedded
capability reference, `#![0 123]`, which encodes a transport-specific reference to an object.
(See the [Syndicate Protocol](../protocol.md#capabilities-on-the-wire) for a concrete example
of this.)

The syntax of values under `#!` differs depending on the medium carrying the message.
For example, point-to-point transports need to be able to refer to "my references" (`#![0 `*n*`]`) and "your
references" (`#![1 `*n*`]`), while multicast/broadcast media (like Ethernet) need to be able to name
references within specific, named conversational participants (`#![<udp [192 168 1 10] 5999> `*n*`]`),
and in-memory representations need to use direct pointers (`#!140425190562944`).

In every case, the references themselves work like Unix file descriptors: an integer or similar
that unforgeably denotes, in a local context, some complex data structure on the other side of
a trust boundary.

When capability-bearing Preserves values are read off a transport, the capabilities are
[automatically rewritten](../protocol.md#inbound-rewriting) into references to in-memory proxy
objects. The [reverse process](../protocol.md#outbound-rewriting) of rewriting capability
references happens when an in-memory value is serialized for transmission.

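
As a rough illustration of inbound rewriting on a point-to-point transport, consider the following Rust sketch. Everything in it is hypothetical: `WireRef`, `LocalRef`, and `Connection` are names invented for this example, local objects are represented by plain strings, and the real protocol tracks considerably more state. It only aims to show why a number under `#!` behaves like a file descriptor: it means something solely in the context of one connection's table.

```rust
use std::collections::HashMap;

/// A reference as written under `#!` on a point-to-point transport.
#[derive(Debug, Clone, PartialEq)]
enum WireRef {
    /// `#![0 n]`: "my reference", an object on the sender's side.
    Mine(u64),
    /// `#![1 n]`: "your reference", an object on the receiver's side.
    Yours(u64),
}

/// Hypothetical in-memory form of a reference after inbound rewriting.
#[derive(Debug, Clone, PartialEq)]
enum LocalRef {
    /// Proxy standing in for object `n` on the far side of the link.
    Proxy(u64),
    /// One of our own objects, identified by a name for simplicity.
    Local(String),
}

/// Per-connection table of the objects we have exposed to the peer.
struct Connection {
    exported: HashMap<u64, String>,
}

impl Connection {
    fn new() -> Self {
        Connection { exported: HashMap::new() }
    }

    /// Expose a local object to the peer under the number `n`.
    fn export(&mut self, n: u64, name: &str) {
        self.exported.insert(n, name.to_string());
    }

    /// Rewrite a reference read off the transport. A number we never
    /// exported resolves to nothing, so the peer cannot forge access.
    fn rewrite_inbound(&self, r: &WireRef) -> Option<LocalRef> {
        match r {
            WireRef::Mine(n) => Some(LocalRef::Proxy(*n)),
            WireRef::Yours(n) => self.exported.get(n).cloned().map(LocalRef::Local),
        }
    }
}
```

In this sketch, a peer that sends `#![1 999]` without ever having been given reference 999 simply gets no object back, which is the sense in which such references are unforgeable within their local context.
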
## Schemas

Preserves comes with a schema language suitable for defining protocols among actors/programs in
Synit. Because Preserves is a superset of JSON, its schemas can be used for parsing JSON just
as well as for native Preserves values.[^you-have-to-use-a-preserves-reader] From the [schema
specification](https://preserves.dev/preserves-schema.html):

> A Preserves schema connects Preserves Values to host-language data
> structures. Each definition within a schema can be processed by a
> compiler to produce
>
>  - a host-language *type definition*;
>  - a partial *parsing* function from Values to instances of the
>    produced type; and
>  - a total *serialization* function from instances of the type to
>    Values.
>
> Every parsed Value retains enough information to always be able to
> be serialized again, and every instance of a host-language data
> structure contains, by construction, enough information to be
> successfully serialized.

Instead of taking host-language data structure definitions as primary, in the way that systems
like [Serde](https://serde.rs/) do, Preserves schemas take *the shape of the serialized data*
as primary.

To see the difference, let's look at an example.

### Example: Book Outline

Systems like [Serde](https://serde.rs/) concentrate on defining (de)serializers for
host-language type definitions.

Serde starts from definitions like the following.[^this-example-from-mdbook] It generates
(de)serialization code for various different *data* languages (such as JSON, XML, CBOR, etc.)
in a single *programming* language: Rust.

```rust
pub struct BookOutline {
    pub sections: Vec<BookItem>,
}
pub enum BookItem {
    Chapter(Chapter),
    Separator,
    PartTitle(String),
}
pub struct Chapter {
    pub name: String,
    pub sub_items: Vec<BookItem>,
}
```

The (de)serializers are able to convert between in-memory and serialized representations such
as the following JSON document. The focus is on Rust: interpreting the produced documents from
other languages is out of scope for Serde.

```json
{
  "sections": [
    { "PartTitle": "Part I" },
    "Separator",
    {
      "Chapter": {
        "name": "Chapter One",
        "sub_items": []
      }
    },
    {
      "Chapter": {
        "name": "Chapter Two",
        "sub_items": []
      }
    }
  ]
}
```

By contrast, Preserves schemas map a single *data* language to and from multiple *programming*
languages. Each specific programming language has its own schema compiler, which generates type
definitions and (de)serialization code for that language from a language-independent grammar.

For example, a schema able to parse values compatible with those produced by Serde for the type
definitions above is the following:

```preserves
version 1 .

BookOutline = {
  "sections": @sections [BookItem ...],
} .

BookItem = @chapter { "Chapter": @value Chapter }
         / @separator "Separator"
         / @partTitle { "PartTitle": @value string } .

Chapter = {
  "name": @name string,
  "sub_items": @sub_items [BookItem ...],
} .
```

Using the Rust schema compiler, we see types such as the following, which are similar to but
not the same as the original Rust types above:

```rust
pub struct BookOutline {
    pub sections: std::vec::Vec<BookItem>
}
pub enum BookItem {
    Chapter { value: std::boxed::Box<Chapter> },
    Separator,
    PartTitle { value: std::string::String }
}
pub struct Chapter {
    pub name: std::string::String,
    pub sub_items: std::vec::Vec<BookItem>
}
```

Using the TypeScript schema compiler, we see

```typescript
export type BookOutline = {"sections": Array<BookItem>};

export type BookItem = (
    {"_variant": "chapter", "value": Chapter} |
    {"_variant": "separator"} |
    {"_variant": "partTitle", "value": string}
);

export type Chapter = {"name": string, "sub_items": Array<BookItem>};
```

Using the Racket schema compiler, we see

```racket
(struct BookOutline (sections))
(define (BookItem? p)
  (or (BookItem-chapter? p)
      (BookItem-separator? p)
      (BookItem-partTitle? p)))
(struct BookItem-chapter (value))
(struct BookItem-separator ())
(struct BookItem-partTitle (value))
(struct Chapter (name sub_items))
```

and so on.

### Example: Book Outline redux, using Records

The schema for book outlines above accepts Preserves (JSON) documents compatible with the
(de)serializers produced by Serde for a Rust-native type.

Instead, we might choose to define a Preserves-native data definition, and to work from
that:[^lose-compatibility]

```preserves
version 1 .
BookOutline = <book-outline @sections [BookItem ...]> .
BookItem = Chapter / =separator / @partTitle string .
Chapter = <chapter @name string @sub_items [BookItem ...]> .
```

The schema compilers produce **exactly the same type definitions**[^well-almost-exactly] for
this variation. The differences are in the (de)serialization code only.

Here's the Preserves value equivalent to the example above, expressed using the Preserves-native schema:

```preserves
<book-outline [
  "Part I"
  separator
  <chapter "Chapter One" []>
  <chapter "Chapter Two" []>
]>
```

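
To give a feel for the "partial parsing, total serialization" pairing described in the schema specification, here is a hand-written Rust sketch for the `BookItem` definition of the Preserves-native schema. It is not the output of the Rust schema compiler: the cut-down `Value` type, the flattened `BookItem` enum, and the function names `parse_book_item` and `unparse_book_item` are all assumptions made for illustration. Parsing returns an `Option` because not every value has the right shape, while serialization always succeeds, so parsing a value and serializing the result gives back the value you started with.

```rust
/// Cut-down stand-in for a generic Preserves value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String),
    Symbol(String),
    Sequence(Vec<Value>),
    /// A record: its label (always a symbol name here) and its fields.
    Record(String, Vec<Value>),
}

/// Hand-written host-language type for the `BookItem` schema definition.
#[derive(Debug, Clone, PartialEq)]
enum BookItem {
    Chapter { name: String, sub_items: Vec<BookItem> },
    Separator,
    PartTitle(String),
}

/// Partial: returns `None` when the value does not match the schema.
fn parse_book_item(v: &Value) -> Option<BookItem> {
    match v {
        // <chapter @name string @sub_items [BookItem ...]>
        Value::Record(label, fields) if label == "chapter" => match fields.as_slice() {
            [Value::String(name), Value::Sequence(items)] => Some(BookItem::Chapter {
                name: name.clone(),
                // Fails as a whole if any nested item fails to parse.
                sub_items: items.iter().map(parse_book_item).collect::<Option<Vec<_>>>()?,
            }),
            _ => None,
        },
        // =separator
        Value::Symbol(s) if s == "separator" => Some(BookItem::Separator),
        // @partTitle string
        Value::String(s) => Some(BookItem::PartTitle(s.clone())),
        _ => None,
    }
}

/// Total: every `BookItem` can be turned back into a value.
fn unparse_book_item(item: &BookItem) -> Value {
    match item {
        BookItem::Chapter { name, sub_items } => Value::Record(
            "chapter".to_string(),
            vec![
                Value::String(name.clone()),
                Value::Sequence(sub_items.iter().map(unparse_book_item).collect()),
            ],
        ),
        BookItem::Separator => Value::Symbol("separator".to_string()),
        BookItem::PartTitle(s) => Value::String(s.clone()),
    }
}
```
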
---

#### Notes

[^preserves-ordering-exists-too]: The specification defines a total order relation over
    Preserves values as well.

[^dataspaces-need-data-with-semantics]: In particular, *dataspaces* need the assertion data
    they contain to have a sensible equivalence predicate in order to be useful at all. If you
    can't reliably tell whether two values are the same or different, how are you supposed to
    use them to look things up in anything database-like?
    Languages like JSON, which [don't have a well-defined equivalence
    relation](https://preserves.dev/why-not-json.html#json-syntax-doesnt-mean-anything),
    aren't good enough. When programs communicate with each other, they need to be sure that
    their peers will understand the information they receive exactly as it was sent.

[^syrup]: Besides the two core syntaxes, other serialization syntaxes are in use in other
    systems. For example, the [Spritely](https://gitlab.com/spritely)
    [Goblins](https://gitlab.com/spritely/goblins) actor library uses a serialization syntax
    called [Syrup](https://github.com/ocapn/syrup#pseudo-specification), reminiscent of
    [`bencode`](https://en.wikipedia.org/wiki/Bencode).

[^you-have-to-use-a-preserves-reader]: You have to use a Preserves text-syntax reader on JSON
    terms to do this, though: JSON values like `null`, `true`, and `false` naively read as
    Preserves *symbols*. Preserves doesn't have the concept of `null`.

[^this-example-from-mdbook]: This example is a simplified form of the preprocessor type
    definitions for
    [mdBook](https://rust-lang.github.io/mdBook/for_developers/preprocessors.html), the system
    used to render these pages. I use a real [Preserves schema
    definition](https://git.syndicate-lang.org/synit/synit/src/branch/main/manual/book.prs) for
    parsing and producing Serde's JSON representation of mdBook `Book` structures in order to
    [preprocess the text](https://git.syndicate-lang.org/synit/synit/src/branch/main/manual/mdbook-ditaa).

[^lose-compatibility]: By doing so, we lose compatibility with the Serde structures, but the
    point is to show the kinds of schemas available to us once we move away from strict
    compatibility with existing data formats.

[^well-almost-exactly]: Well, almost exactly the same. The only difference is in the Rust
    types, which use tuple-style instead of record-style structs for chapters and part titles.