439 lines
20 KiB
Markdown
439 lines
20 KiB
Markdown
---
|
||
no_site_title: true
|
||
title: "Preserves: an Expressive Data Language"
|
||
---
|
||
|
||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||
{{ site.version_date }}. Version {{ site.version }}.
|
||
|
||
{% include what-is-preserves.md %}
|
||
|
||
This document defines the core semantics and data model of Preserves and
|
||
presents a handful of examples. Two other core documents define
|
||
|
||
- a [human-readable text syntax](preserves-text.html), and
|
||
- a [machine-oriented binary syntax](preserves-binary.html)
|
||
|
||
for the Preserves data model.
|
||
|
||
## <a id="semantics"></a><a id="starting-with-semantics"></a>Values
|
||
|
||
Preserves *values* are given meaning independent of their syntax. We
|
||
will write "`Value`" when we mean the set of all Preserves values or an
|
||
element of that set.
|
||
|
||
`Value`s fall into two broad categories: *atomic* and *compound*
|
||
data. Every `Value` is finite and non-cyclic. Embedded values, called
|
||
`Embedded`s, are a third, special-case category.
|
||
|
||
{% include value-grammar.md %}
|
||
|
||
**Total order.**<a name="total-order"></a> As we go, we will
|
||
incrementally specify a total order over `Value`s. Two values of the
|
||
same kind are compared using kind-specific rules. The ordering among
|
||
values of different kinds is essentially arbitrary, but having a total
|
||
order is convenient for many tasks, so we define it as
|
||
follows:
|
||
|
||
(Values) Atom < Compound < Embedded
|
||
|
||
(Compounds) Record < Sequence < Set < Dictionary
|
||
|
||
(Atoms) Boolean < Float < Double < SignedInteger
|
||
< String < ByteString < Symbol
|
||
|
||
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
|
||
neither is less than the other according to the total order.
|
||
|
||
### Signed integers.
|
||
|
||
A `SignedInteger` is an arbitrarily-large signed integer.
|
||
`SignedInteger`s are compared as mathematical integers.
|
||
|
||
### Unicode strings.
|
||
|
||
A `String` is a sequence of [Unicode
|
||
scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted]
|
||
`String`s are compared lexicographically, scalar value by
|
||
scalar value.[^utf8-is-awesome]
|
||
|
||
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
|
||
gives the same result as a lexicographic byte-by-byte comparison
|
||
of the UTF-8 encoding of a string!
|
||
|
||
[^nul-permitted]: All Unicode scalar values are permitted, including NUL
|
||
(scalar value zero). Because scalar values are defined as code points
|
||
*excluding* surrogate code points
|
||
(D800<sub>16</sub>–DFFF<sub>16</sub>), surrogates are *not* permitted
|
||
in Preserves Unicode data.
|
||
|
||
### Binary data.
|
||
|
||
A `ByteString` is a sequence of octets. `ByteString`s are compared
|
||
lexicographically.
|
||
|
||
### Symbols.
|
||
|
||
Programming languages like Lisp and Prolog frequently use string-like
|
||
values called *symbols*.[^even-java-has-quasi-symbols] Here, a `Symbol`
|
||
is, like a `String`, a sequence of Unicode scalar values representing an
|
||
identifier of some kind. `Symbol`s are also compared lexicographically
|
||
by scalar value.
|
||
|
||
[^even-java-has-quasi-symbols]: Even Java has quasi-symbols in the form
|
||
of its "interned strings". A Java Preserves implementation might
|
||
intern Preserves `Symbol`s while leaving Preserves `String`s
|
||
uninterned.
|
||
|
||
### Booleans.
|
||
|
||
There are two `Boolean`s, “false” and “true”. The “false” value is
|
||
less-than the “true” value.
|
||
|
||
### IEEE floating-point values.
|
||
|
||
`Float`s and `Double`s are single- and double-precision IEEE 754
|
||
floating-point values, respectively. `Float`s, `Double`s and
|
||
`SignedInteger`s are disjoint; by the rules [above](#total-order), every
|
||
`Float` is less than every `Double`, and every `SignedInteger` is
|
||
greater than both. Two `Float`s or two `Double`s are to be ordered by
|
||
the `totalOrder` predicate defined in section 5.10 of [IEEE Std
|
||
754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
|
||
|
||
### Records.
|
||
|
||
A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
|
||
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
|
||
[^iri-labels] `Record`s are ordered first by label, then
|
||
lexicographically[^lexicographical-sequences] by field sequence.
|
||
|
||
[^extensibility]: The [Racket](https://racket-lang.org/) programming
|
||
language defines
|
||
“[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
|
||
structure types, which map well to our `Record`s. Racket supports
|
||
record extensibility by encoding record supertypes into record
|
||
labels as specially-formatted lists.
|
||
|
||
[^iri-labels]: It is occasionally (but seldom) necessary to
|
||
interpret such `Symbol` labels as IRIs. Where a
|
||
label can be read as a relative IRI, it is notionally interpreted
|
||
with respect to the IRI
|
||
`urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
|
||
be read as an absolute IRI, it stands for that IRI; and otherwise,
|
||
it cannot be read as an IRI at all, and so the label simply stands
|
||
for itself—for its own `Value`.
|
||
|
||
[^lexicographical-sequences]: When comparing sequences of values for
|
||
the total order, [lexicographical
|
||
ordering](https://en.wikipedia.org/wiki/Lexicographic_order) is
|
||
used. Elements are drawn pairwise from the two sequences to be
|
||
compared. If one is smaller than the other according to the total
|
||
order, the sequence it was drawn from is the smaller of the
|
||
sequences. If the end of one sequence is reached, while the other
|
||
sequence has elements remaining, the shorter sequence is considered
|
||
smaller. Otherwise, all the elements compared equal and neither was
|
||
longer than the other, so they compare equal. For example,
|
||
- `[#f]` is ordered before `[foo]` because `Boolean` appears before `Symbol` in the kind ordering;
|
||
- `[x]` before `[x y]` because there is no element remaining to compare against `y`;
|
||
- `[a b]` before `[x]` because `a` is smaller than `x`; and
|
||
- `[x y]` before `[x z]` because `y` is ordered before `z` according to the ordering rules for `Symbol`.
|
||
|
||
### Sequences.
|
||
|
||
A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
|
||
lexicographically.[^lexicographical-sequences]
|
||
|
||
### Sets.
|
||
|
||
A `Set` is an unordered finite set of `Value`s. It contains no
|
||
duplicate values, following the [equivalence relation](#equivalence)
|
||
induced by the total order on `Value`s. Two `Set`s are compared by
|
||
sorting their elements ascending using the [total order](#total-order)
|
||
and comparing the resulting `Sequence`s.[^lexicographical-sequences]
|
||
|
||
### Dictionaries.
|
||
|
||
A `Dictionary` is an unordered finite collection of pairs of `Value`s.
|
||
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
|
||
pairwise distinct. Instances of `Dictionary` are compared by
|
||
lexicographic[^lexicographical-sequences] comparison of the sequences
|
||
resulting from ordering each `Dictionary`'s pairs in ascending order by
|
||
key.
|
||
|
||
### Embeddeds.
|
||
|
||
An `Embedded` allows inclusion of *domain-specific*, potentially
|
||
*stateful* or *located* data into a `Value`.[^embedded-rationale]
|
||
`Embedded`s may be used to denote stateful objects, network services,
|
||
object capabilities, file descriptors, Unix processes, or other
|
||
possibly-stateful things. Because each `Embedded` is a domain-specific
|
||
datum, comparison of two `Embedded`s is done according to
|
||
domain-specific rules.
|
||
|
||
[^embedded-rationale]: **Rationale.** Why include `Embedded`s as a
|
||
special class, distinct from, say, a specially-labeled `Record`?
|
||
First, a `Record` can only hold other `Value`s: in order to embed
|
||
values such as live pointers to Java objects, some means of
|
||
"escaping" from the `Value` data type must be provided. Second,
|
||
`Embedded`s are meant to be able to denote stateful entities, for
|
||
which comparison by address is appropriate; however, we do not
|
||
wish to place restrictions on the *nature* of these entities: if
|
||
we had used `Record`s instead of distinct `Embedded`s, users would
|
||
have to invent an encoding of domain data into `Record`s that
|
||
reflected domain ordering into `Value` ordering. This is often
|
||
difficult and may not always be possible. Finally, because
|
||
`Embedded`s are intended to be able to represent network and
|
||
memory *locations*, they must be able to be rewritten at network
|
||
and process boundaries. Having a distinct class allows generic
|
||
`Embedded` rewriting without the quotation-related complications
|
||
of encoding references as, say, `Record`s.
|
||
|
||
*Motivating Examples.* In a Java or Python implementation, an `Embedded` may
|
||
denote a reference to a Java or Python object; comparison would be
|
||
done via the language's own rules for equivalence and ordering. In a
|
||
Unix application, an `Embedded` may denote an open file descriptor or
|
||
a process ID. In an HTTP-based application, each `Embedded` might be a
|
||
URL, compared according to
|
||
[RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a
|
||
`Value` is serialized for storage or transfer, `Embedded`s will
|
||
usually be represented as ordinary `Value`s, in which case the
|
||
ordinary rules for comparing `Value`s will apply.
|
||
|
||
## <a id="examples"></a>Appendix. Examples
|
||
|
||
The definitions above are independent of any particular concrete syntax.
|
||
The examples of `Value`s that follow are written using [the Preserves
|
||
text syntax](preserves-text.html), and the example encoded byte
|
||
sequences use [the Preserves binary encoding](preserves-binary.html).
|
||
|
||
### Ordering.
|
||
|
||
The total ordering specified [above](#total-order) means that the following statements are true:
|
||
|
||
- `"bzz"` < `"c"` < `"caa"` < `#!"a"`
|
||
- `#t` < `3.0f` < `3.0` < `3` < `"3"` < `|3|` < `[]` < `#!#t`
|
||
- `[#f]` < `[foo]`, because `Boolean` appears before `Symbol` in the kind ordering
|
||
- `[x]` < `[x y]`, because there is no element remaining to compare against `y`
|
||
- `[a b]` < `[x]`, because `a` is smaller than `x`
|
||
- `[x y]` < `[x z]`, because `y` is ordered before `z`
|
||
|
||
### Simple examples.
|
||
|
||
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
||
<!-- translated from various JSON blobs floating around the internet. -->
|
||
|
||
| Value | Encoded byte sequence |
|
||
|-----------------------------------------------------|---------------------------------------------------------------------------------|
|
||
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
|
||
| `[1 2 3 4]` | B5 B0 01 01 B0 01 02 B0 01 03 B0 01 04 84 |
|
||
| `[-2 -1 0 1]` | B5 B0 01 FE B0 01 FF B0 00 B0 01 01 84 |
|
||
| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
|
||
| `"z水𝄞"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
|
||
| `"z水\uD834\uDD1E"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E |
|
||
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
|
||
| `-257` | B0 02 FE FF |
|
||
| `-1` | B0 01 FF |
|
||
| `0` | B0 00 |
|
||
| `1` | B0 01 01 |
|
||
| `255` | B0 02 00 FF |
|
||
| `1.0f` | 87 04 3F 80 00 00 |
|
||
| `1.0` | 87 08 3F F0 00 00 00 00 00 00 |
|
||
| `-1.202e300` | 87 08 FE 3C B7 B7 59 BF 04 26 |
|
||
| `#xf"7f800000"`, positive `Float` infinity | 87 04 7F 80 00 00 |
|
||
| `#xd"fff0000000000000"`, negative `Double` infinity | 87 08 FF F0 00 00 00 00 00 00 |
|
||
|
||
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
|
||
|
||
<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">
|
||
|
||
encodes to
|
||
|
||
B4 # Record
|
||
B5 # Sequence
|
||
B3 06 74 69 74 6C 65 64 # Symbol, "titled"
|
||
B3 06 70 65 72 73 6F 6E # Symbol, "person"
|
||
B0 01 02 # SignedInteger, "2"
|
||
B3 05 74 68 69 6E 67 # Symbol, "thing"
|
||
B0 01 01 # SignedInteger, "1"
|
||
84 # End (sequence)
|
||
B0 01 65 # SignedInteger, "101"
|
||
B1 09 42 6C 61 63 6B 77 65 6C 6C # String, "Blackwell"
|
||
B4 # Record
|
||
B3 04 64 61 74 65 # Symbol, "date"
|
||
B0 02 07 1D # SignedInteger, "1821"
|
||
B0 01 02 # SignedInteger, "2"
|
||
B0 01 03 # SignedInteger, "3"
|
||
84 # End (record)
|
||
B1 02 44 72 # String, "Dr"
|
||
84 # End (record)
|
||
|
||
[^extensibility2]: It happens to line up with Racket's
|
||
representation of a record label for an inheritance hierarchy
|
||
where `titled` extends `person` extends `thing`:
|
||
|
||
(struct date (year month day) #:prefab)
|
||
(struct thing (id) #:prefab)
|
||
(struct person thing (name date-of-birth) #:prefab)
|
||
(struct titled person (title) #:prefab)
|
||
|
||
For more detail on Racket's representations of record labels, see
|
||
[the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).
|
||
|
||
### JSON examples.
|
||
|
||
Preserves text syntax is a superset of JSON,[^json-string-caveat] so the
|
||
examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13)
|
||
read as valid Preserves.
|
||
|
||
[^json-string-caveat]: There is one caveat to be aware of. [Section 8.2
|
||
of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2)
|
||
explicitly permits unpaired [surrogate code
|
||
point](https://unicode.org/glossary/#surrogate_code_point)s in JSON
|
||
texts without specifying an interpretation for them. Preserves mandates
|
||
UTF-8 in its binary syntax, forbids unpaired surrogates in its text
|
||
syntax, and disallows surrogate code points in `String`s and `Symbol`s,
|
||
meaning that any valid JSON text including an unpaired surrogate will
|
||
not be parseable using the Preserves text syntax rules.
|
||
|
||
The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
|
||
JSON numbers read (unambiguously) either as `SignedInteger`s or as
|
||
`Double`s.[^json-superset]
|
||
|
||
[^json-superset]: The following [schema](./preserves-schema.html)
|
||
definitions match exactly the JSON subset of a Preserves input:
|
||
|
||
version 1 .
|
||
JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
|
||
/ @array [JSON ...] / @object { string: JSON ...:... } .
|
||
JSONBoolean = =true / =false .
|
||
|
||
The first RFC 8259 example:
|
||
|
||
{
|
||
"Image": {
|
||
"Width": 800,
|
||
"Height": 600,
|
||
"Title": "View from 15th Floor",
|
||
"Thumbnail": {
|
||
"Url": "http://www.example.com/image/481989943",
|
||
"Height": 125,
|
||
"Width": 100
|
||
},
|
||
"Animated" : false,
|
||
"IDs": [116, 943, 234, 38793]
|
||
}
|
||
}
|
||
|
||
when read using the Preserves text syntax encodes via the binary syntax
|
||
as follows:
|
||
|
||
B7
|
||
B1 05 "Image"
|
||
B7
|
||
B1 03 "IDs" B5
|
||
B0 01 74
|
||
B0 02 03 AF
|
||
B0 02 00 EA
|
||
B0 03 00 97 89
|
||
84
|
||
B1 05 "Title" B1 14 "View from 15th Floor"
|
||
B1 05 "Width" B0 02 03 20
|
||
B1 06 "Height" B0 02 02 58
|
||
B1 08 "Animated" B3 05 "false"
|
||
B1 09 "Thumbnail"
|
||
B7
|
||
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
|
||
B1 05 "Width" B0 01 64
|
||
B1 06 "Height" B0 01 7D
|
||
84
|
||
84
|
||
84
|
||
|
||
The second RFC 8259 example:
|
||
|
||
[
|
||
{
|
||
"precision": "zip",
|
||
"Latitude": 37.7668,
|
||
"Longitude": -122.3959,
|
||
"Address": "",
|
||
"City": "SAN FRANCISCO",
|
||
"State": "CA",
|
||
"Zip": "94107",
|
||
"Country": "US"
|
||
},
|
||
{
|
||
"precision": "zip",
|
||
"Latitude": 37.371991,
|
||
"Longitude": -122.026020,
|
||
"Address": "",
|
||
"City": "SUNNYVALE",
|
||
"State": "CA",
|
||
"Zip": "94085",
|
||
"Country": "US"
|
||
}
|
||
]
|
||
|
||
encodes to binary as follows:
|
||
|
||
B5
|
||
B7
|
||
B1 03 "Zip" B1 05 "94107"
|
||
B1 04 "City" B1 0D "SAN FRANCISCO"
|
||
B1 05 "State" B1 02 "CA"
|
||
B1 07 "Address" B1 00
|
||
B1 07 "Country" B1 02 "US"
|
||
B1 08 "Latitude" 87 08 40 42 E2 26 80 9D 49 52
|
||
B1 09 "Longitude" 87 08 C0 5E 99 56 6C F4 1F 21
|
||
B1 09 "precision" B1 03 "zip"
|
||
84
|
||
B7
|
||
B1 03 "Zip" B1 05 "94085"
|
||
B1 04 "City" B1 09 "SUNNYVALE"
|
||
B1 05 "State" B1 02 "CA"
|
||
B1 07 "Address" B1 00
|
||
B1 07 "Country" B1 02 "US"
|
||
B1 08 "Latitude" 87 08 40 42 AF 9D 66 AD B4 03
|
||
B1 09 "Longitude" 87 08 C0 5E 81 AA 4F CA 42 AF
|
||
B1 09 "precision" B1 03 "zip"
|
||
84
|
||
84
|
||
|
||
## Appendix. Merging Values
|
||
|
||
The *merge* of two `Value`s is a combination of the two values that includes all information
|
||
from each that is missing from the other. If the values are incompatible, they have no merge.
|
||
|
||
- the merge of two `Atom`s has no value if they are not [equal](#equivalence); otherwise, it
|
||
has value equal to (an arbitrary) one of the atoms.
|
||
|
||
- the merge of two `Embedded`s depends on the interpretation of the embedded values, and so is
|
||
implementation-defined.
|
||
|
||
- the merge of two `Compound`s is:
|
||
|
||
- if both are `Sequence`s, let `n` be the minimum of the lengths of the two sequences. If
|
||
every merge of corresponding positions up to `n` in the sequences is defined, the result
|
||
is defined, with elements merged up to position `n` and simply copied from the longer
|
||
sequence from position `n` onward; otherwise, it is undefined.
|
||
|
||
- if both are `Record`s, the `Record` with the merge of the two input records' labels as its
|
||
label and the merge of the inputs' field sequences as its fields;
|
||
|
||
- if both are `Dictionary`s, if every merge of values associated with keys common to both
|
||
inputs is defined, the result is defined, with merged values at common keys and simply
|
||
copied from either side for keys unique to that side; otherwise, it is undefined.
|
||
|
||
- Otherwise, the merge is undefined.
|
||
|
||
**Examples.**
|
||
|
||
- `merge [1, [2], 3] [1, [2, 99], 3, 4, 5] = merge [1, [2, 99], 3, 4, 5]`
|
||
- `merge [1, 2, 3] [1, 5, 3] = ⊥`
|
||
- `merge #{a, b, c} #{a, b, c} = ⊥`
|
||
- `merge {a: 1, b: [2]} {b: [2, 99] c: 3} = {a: 1, b: [2, 99], c: 3}`
|
||
- `merge {a: 1, b: [2]} {a: 5, b: [2]} = ⊥`
|
||
|
||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||
## Notes
|