---
no_site_title: true
title: "Preserves: an Expressive Data Language"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.

*Preserves* is a data model, with associated serialization formats.

It supports *records* with user-defined *labels*, embedded *references*,
and the usual suite of atomic and compound data types, including
*binary* data as a distinct type from text strings. Its *annotations*
allow separation of data from metadata such as
[comments](conventions.html#comments), trace information, and provenance
information.

Preserves departs from many other data languages in defining how to
*compare* two values. Comparison is based on the data model, not on
syntax or on data structures of any particular implementation
language.

This document defines the core semantics and data model of Preserves and
presents a handful of examples. Two other core documents define

 - a [human-readable text syntax](preserves-text.html), and
 - a [machine-oriented binary syntax](preserves-binary.html)

for the Preserves data model.

## <a id="semantics"></a><a id="starting-with-semantics"></a>Values

Preserves *values* are given meaning independent of their syntax. We
will write "`Value`" when we mean the set of all Preserves values or an
element of that set.

`Value`s fall into two broad categories: *atomic* and *compound*
data. Every `Value` is finite and non-cyclic. Embedded values, called
`Embedded`s, are a third, special-case category.

                            Value = Atom
                                  | Compound
                                  | Embedded

                             Atom = Boolean
                                  | Float
                                  | Double
                                  | SignedInteger
                                  | String
                                  | ByteString
                                  | Symbol

                         Compound = Record
                                  | Sequence
                                  | Set
                                  | Dictionary

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:

            (Values)        Atom < Compound < Embedded

            (Compounds)     Record < Sequence < Set < Dictionary

            (Atoms)         Boolean < Float < Double < SignedInteger
                              < String < ByteString < Symbol

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.

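The cross-kind part of the ordering can be sketched in Python. This is an illustration only, not part of the specification: the `(kind, payload)` pair representation and the names `KIND_ORDER`, `compare` and `equivalent` are assumptions made for the example, and the same-kind branch simply borrows Python's built-in ordering, which is only adequate for `Boolean`, `SignedInteger`, `String`, `ByteString` and `Symbol` payloads.

```python
from functools import cmp_to_key

# Kind ranks, transcribed from the three ordering rules above.
KIND_ORDER = [
    "Boolean", "Float", "Double", "SignedInteger",
    "String", "ByteString", "Symbol",            # Atoms
    "Record", "Sequence", "Set", "Dictionary",   # Compounds
    "Embedded",
]
KIND_RANK = {kind: rank for rank, kind in enumerate(KIND_ORDER)}


def compare(a, b):
    """Three-way comparison of hypothetical (kind, payload) pairs."""
    (kind_a, payload_a), (kind_b, payload_b) = a, b
    if kind_a != kind_b:
        # Different kinds: only the rank of the kind matters.
        return -1 if KIND_RANK[kind_a] < KIND_RANK[kind_b] else 1
    # Same kind: Python's ordering stands in for the kind-specific rule.
    # This shortcut is NOT valid for Float, Double, compounds or Embedded.
    return (payload_a > payload_b) - (payload_a < payload_b)


def equivalent(a, b):
    """Two values are equal when neither is less than the other."""
    return compare(a, b) == 0


values = [("Symbol", "3"), ("String", "3"),
          ("SignedInteger", 3), ("Boolean", True)]
ordered = sorted(values, key=cmp_to_key(compare))
```

Sorting `values` places the `Boolean` first and the `Symbol` last, regardless of the payloads, because the kinds differ.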
### Signed integers.

A `SignedInteger` is an arbitrarily-large signed integer.
`SignedInteger`s are compared as mathematical integers.

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
`String`s are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!

  [^nul-permitted]: All Unicode code-points are permitted, including NUL
    (code point zero).

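As a rough check of the claim that code-point order and UTF-8 byte order agree, the following Python sketch compares a few sample strings both ways. The function names and the sample list are invented for this illustration. It also shows that ordering by UTF-16 code units does not have the same property once characters outside the Basic Multilingual Plane are involved.

```python
def compare_code_points(a: str, b: str) -> int:
    """Three-way lexicographic comparison, code point by code point."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if ord(x) < ord(y) else 1
    # One string is a prefix of the other: the shorter one is smaller.
    return (len(a) > len(b)) - (len(a) < len(b))


def compare_utf8_bytes(a: str, b: str) -> int:
    """Three-way lexicographic comparison of the UTF-8 encodings."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


# Illustrative samples only, including NUL and a non-BMP character.
SAMPLES = ["", "\u0000", "bzz", "c", "caa", "\u00e9", "\uffee", "\U0001f600"]

agree = all(
    compare_code_points(a, b) == compare_utf8_bytes(a, b)
    for a in SAMPLES
    for b in SAMPLES
)

# UTF-16 code units order U+1F600 (a surrogate pair) before U+FFEE.
pair = ["\U0001f600", "\uffee"]
by_code_point = sorted(pair)
by_utf16_units = sorted(pair, key=lambda s: s.encode("utf-16-be"))
```

Implementations whose native strings are UTF-16 (as in Java or JavaScript) therefore cannot rely on their default string comparison to obtain the order described here.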
### Binary data.

A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.

### Symbols.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.

### Booleans.

There are two `Boolean`s, “false” and “true”. The “false” value is
less-than the “true” value.

### IEEE floating-point values.

`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
`SignedInteger`s are disjoint; by the rules [above](#total-order), every
`Float` is less than every `Double`, and every `SignedInteger` is
greater than both. Two `Float`s or two `Double`s are to be ordered by
the `totalOrder` predicate defined in section 5.10 of [IEEE Std
754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).

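One common way to obtain an ordering consistent with `totalOrder` for the binary interchange formats is to map each bit pattern to an unsigned integer key. The Python sketch below illustrates that technique; it is not taken from the Preserves implementations, and the names `total_order_key`, `double_bits` and `double_key` are assumptions of this example.

```python
import struct


def total_order_key(bits: int, width: int) -> int:
    """Map an IEEE 754 bit pattern to an unsigned integer sort key.

    Negative patterns (sign bit set) have every bit flipped, so larger
    magnitudes sort lower; non-negative patterns only gain the top bit,
    so they sort above all negative ones.
    """
    sign_bit = 1 << (width - 1)
    all_ones = (1 << width) - 1
    if bits & sign_bit:
        return bits ^ all_ones
    return bits | sign_bit


def double_bits(x: float) -> int:
    """Bit pattern of a Python float viewed as a big-endian binary64."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def double_key(x: float) -> int:
    return total_order_key(double_bits(x), 64)


doubles = [1.0, float("inf"), -0.0, 0.0, float("-inf"), -1.0]
ordered = sorted(doubles, key=double_key)
```

Unlike the usual `<` on floating-point numbers, this key distinguishes `-0.0` from `0.0` and gives NaN bit patterns a definite position (below negative infinity or above positive infinity, depending on their sign bit).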
### Records.

A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.

  [^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

  [^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself, that is, for its own `Value`.

### Sequences.

A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.

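The comparison rules for `Sequence`, `Set` and `Dictionary` can be sketched in Python as follows. To keep the example short it is restricted to collections of integers, where Python's own ordering coincides with the `SignedInteger` rule. The function names are invented for the illustration, and comparing a pair by key first and then by value is an assumption of this sketch rather than something the text above spells out.

```python
def compare_sequences(xs, ys) -> int:
    """Three-way lexicographic comparison of two sequences."""
    for x, y in zip(xs, ys):
        if x != y:
            return -1 if x < y else 1
    # Equal up to the shorter length: the shorter sequence is smaller.
    return (len(xs) > len(ys)) - (len(xs) < len(ys))


def compare_sets(a: set, b: set) -> int:
    """Sort each set ascending, then compare the resulting sequences."""
    return compare_sequences(sorted(a), sorted(b))


def compare_dictionaries(a: dict, b: dict) -> int:
    """Order each dictionary's (key, value) pairs by key, then compare.

    Keys are pairwise distinct, so sorting the pairs sorts by key.
    Assumption: a pair is compared key first, then value.
    """
    return compare_sequences(sorted(a.items()), sorted(b.items()))


set_result = compare_sets({3, 1}, {1, 2})               # [1, 3] vs [1, 2]
dict_result = compare_dictionaries({1: 5}, {1: 6})      # same key, values differ
```

A full implementation would replace the built-in `<` and `sorted` with the total order over all `Value`s, applied recursively.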
### Embeddeds.

An `Embedded` allows inclusion of *domain-specific*, potentially
*stateful* or *located* data into a `Value`.[^embedded-rationale]
`Embedded`s may be used to denote stateful objects, network services,
object capabilities, file descriptors, Unix processes, or other
possibly-stateful things. Because each `Embedded` is a domain-specific
datum, comparison of two `Embedded`s is done according to
domain-specific rules.

  [^embedded-rationale]: **Rationale.** Why include `Embedded`s as a
    special class, distinct from, say, a specially-labeled `Record`?
    First, a `Record` can only hold other `Value`s: in order to embed
    values such as live pointers to Java objects, some means of
    "escaping" from the `Value` data type must be provided. Second,
    `Embedded`s are meant to be able to denote stateful entities, for
    which comparison by address is appropriate; however, we do not
    wish to place restrictions on the *nature* of these entities: if
    we had used `Record`s instead of distinct `Embedded`s, users would
    have to invent an encoding of domain data into `Record`s that
    reflected domain ordering into `Value` ordering. This is often
    difficult and may not always be possible. Finally, because
    `Embedded`s are intended to be able to represent network and
    memory *locations*, they must be able to be rewritten at network
    and process boundaries. Having a distinct class allows generic
    `Embedded` rewriting without the quotation-related complications
    of encoding references as, say, `Record`s.

*Examples.* In a Java or Python implementation, an `Embedded` may
denote a reference to a Java or Python object; comparison would be
done via the language's own rules for equivalence and ordering. In a
Unix application, an `Embedded` may denote an open file descriptor or
a process ID. In an HTTP-based application, each `Embedded` might be a
URL, compared according to
[RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a
`Value` is serialized for storage or transfer, `Embedded`s will
usually be represented as ordinary `Value`s, in which case the
ordinary rules for comparing `Value`s will apply.

## Examples

The definitions above are independent of any particular concrete syntax.
The examples of `Value`s that follow are written using [the Preserves
text syntax](preserves-text.html), and the example encoded byte
sequences use [the Preserves binary encoding](preserves-binary.html).

### Ordering.

The total ordering specified [above](#total-order) means that the following statements are true:

    "bzz" < "c" < "caa" < #!"a"
    #t < 3.0f < 3.0 < 3 < "3" < |3| < [] < #!#t

### Simple examples.

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

| Value                                               | Encoded byte sequence                                                           |
|-----------------------------------------------------|---------------------------------------------------------------------------------|
| `<capture <discard>>`                               | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]`                                         | B5 91 92 93 94 84                                                               |
| `[-2 -1 0 1]`                                       | B5 9E 9F 90 91 84                                                               |
| `"hello"` (format B)                                | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
| `["a" b #"c" [] #{} #t #f]`                         | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84                           |
| `-257`                                              | A1 FE FF                                                                        |
| `-1`                                                | 9F                                                                              |
| `0`                                                 | 90                                                                              |
| `1`                                                 | 91                                                                              |
| `255`                                               | A1 00 FF                                                                        |
| `1.0f`                                              | 82 3F 80 00 00                                                                  |
| `1.0`                                               | 83 3F F0 00 00 00 00 00 00                                                      |
| `-1.202e300`                                        | 83 FE 3C B7 B7 59 BF 04 26                                                      |
| `#xf"7f800000"`, positive `Float` infinity          | 82 7F 80 00 00                                                                  |
| `#xd"fff0000000000000"`, negative `Double` infinity | 83 FF F0 00 00 00 00 00 00                                                      |

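Several rows of the table can be reproduced with a few lines of Python. The sketch below is inferred from the table rows alone and is not the normative encoder, which is defined in the binary syntax document. In particular, the range of integers given a one-byte form (assumed here to be -3 to 12), the 16-byte cap on longer integers, and the restriction to strings shorter than 128 bytes are assumptions of this example, as is the function name `encode`.

```python
import struct


def encode(value) -> bytes:
    """Encode a small subset of Python values, following the table rows."""
    if isinstance(value, bool):              # checked before int: bool is an int
        return b"\x81" if value else b"\x80"
    if isinstance(value, int):
        if -3 <= value <= 12:                # assumed range of the one-byte form
            return bytes([0x90 + (value & 0x0F)])
        # Minimal big-endian two's-complement length, sign bit included.
        magnitude = value if value >= 0 else ~value
        length = (magnitude.bit_length() + 8) // 8
        if length > 16:                      # assumption: larger sizes not handled
            raise ValueError("integer too large for this sketch")
        return bytes([0xA0 + length - 1]) + value.to_bytes(
            length, "big", signed=True)
    if isinstance(value, float):             # Python floats are doubles
        return b"\x83" + struct.pack(">d", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 127:                  # assumption: one-byte lengths only
            raise ValueError("string too long for this sketch")
        return b"\xB1" + bytes([len(data)]) + data
    if isinstance(value, list):              # Sequence: B5 ... 84
        return b"\xB5" + b"".join(encode(v) for v in value) + b"\x84"
    raise TypeError(f"unsupported value: {value!r}")


sequence_hex = encode([1, 2, 3, 4]).hex()
negative_hex = encode(-257).hex()
```

Comparing the output with the table, `encode([1, 2, 3, 4])` yields `B5 91 92 93 94 84` and `encode(-257)` yields `A1 FE FF`.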
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    <[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

    B4                                ;; Record
      B5                              ;; Sequence
        B3 06 74 69 74 6C 65 64       ;; Symbol, "titled"
        B3 06 70 65 72 73 6F 6E       ;; Symbol, "person"
        92                            ;; SignedInteger, "2"
        B3 05 74 68 69 6E 67          ;; Symbol, "thing"
        91                            ;; SignedInteger, "1"
      84                              ;; End (sequence)
      A0 65                           ;; SignedInteger, "101"
      B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
      B4                              ;; Record
        B3 04 64 61 74 65             ;; Symbol, "date"
        A1 07 1D                      ;; SignedInteger, "1821"
        92                            ;; SignedInteger, "2"
        93                            ;; SignedInteger, "3"
      84                              ;; End (record)
      B1 02 44 72                     ;; String, "Dr"
    84                                ;; End (record)

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

### JSON examples.

Preserves text syntax is a superset of JSON, so the examples from [RFC
8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
Preserves.

The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
JSON numbers read (unambiguously) either as `SignedInteger`s or as
`Double`s.[^json-superset]

  [^json-superset]: The following [schema](./preserves-schema.html)
    definitions match exactly the JSON subset of a Preserves input:

        version 1 .
        JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
             / @array [JSON ...] / @object { string: JSON ...:... } .
        JSONBoolean = =true / =false .

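The mapping from JSON values to Preserves kinds described above can be illustrated with Python's standard `json` module. This is a sketch, not a Preserves reader: the `(kind, payload)` representation and the name `json_to_kinds` are assumptions of the example, and it relies on Python's own rule for telling integers from floating-point numbers, which may not match the Preserves text syntax in every edge case.

```python
import json


def json_to_kinds(value):
    """Tag a parsed JSON value with the Preserves kind it would read as."""
    if value is None:
        return ("Symbol", "null")
    if isinstance(value, bool):               # checked before int: bool is an int
        return ("Symbol", "true" if value else "false")
    if isinstance(value, int):
        return ("SignedInteger", value)
    if isinstance(value, float):
        return ("Double", value)
    if isinstance(value, str):
        return ("String", value)
    if isinstance(value, list):
        return ("Sequence", [json_to_kinds(v) for v in value])
    if isinstance(value, dict):                # JSON object keys are Strings
        return ("Dictionary", {k: json_to_kinds(v) for k, v in value.items()})
    raise TypeError(f"not a JSON value: {value!r}")


text = '{"Animated": false, "Width": 800, "Latitude": 37.7668, "Note": null}'
tagged = json_to_kinds(json.loads(text))
```

Note that `false` becomes the `Symbol` named `false`, not a Preserves `Boolean`, which is what the `B3 05 "false"` bytes in the encoded example for this section reflect.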
The first RFC 8259 example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

when read using the Preserves text syntax encodes via the binary syntax
as follows:

    B7
      B1 05 "Image"
      B7
        B1 03 "IDs" B5
          A0 74
          A1 03 AF
          A1 00 EA
          A2 00 97 89
        84
        B1 05 "Title" B1 14 "View from 15th Floor"
        B1 05 "Width" A1 03 20
        B1 06 "Height" A1 02 58
        B1 08 "Animated" B3 05 "false"
        B1 09 "Thumbnail"
          B7
            B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
            B1 05 "Width" A0 64
            B1 06 "Height" A0 7D
          84
      84
    84

The second RFC 8259 example:

    [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
    ]

encodes to binary as follows:

    B5
      B7
        B1 03 "Zip" B1 05 "94107"
        B1 04 "City" B1 0D "SAN FRANCISCO"
        B1 05 "State" B1 02 "CA"
        B1 07 "Address" B1 00
        B1 07 "Country" B1 02 "US"
        B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
        B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
        B1 09 "precision" B1 03 "zip"
      84
      B7
        B1 03 "Zip" B1 05 "94085"
        B1 04 "City" B1 09 "SUNNYVALE"
        B1 05 "State" B1 02 "CA"
        B1 07 "Address" B1 00
        B1 07 "Country" B1 02 "US"
        B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
        B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
        B1 09 "precision" B1 03 "zip"
      84
    84

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes