preserves/preserves.md

390 lines
16 KiB
Markdown
Raw Normal View History

---
2019-08-18 21:08:55 +00:00
no_site_title: true
title: "Preserves: an Expressive Data Language"
---
2018-09-24 11:59:22 +00:00
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
2022-06-18 17:11:08 +00:00
{{ site.version_date }}. Version {{ site.version }}.
2022-06-18 17:11:08 +00:00
*Preserves* is a data model, with associated serialization formats.
2022-06-18 17:11:08 +00:00
It supports *records* with user-defined *labels*, embedded *references*,
and the usual suite of atomic and compound data types, including
*binary* data as a distinct type from text strings. Its *annotations*
allow separation of data from metadata such as
[comments](conventions.html#comments), trace information, and provenance
information.
2018-09-24 11:59:22 +00:00
2021-01-29 11:03:28 +00:00
Preserves departs from many other data languages in defining how to
*compare* two values. Comparison is based on the data model, not on
syntax or on data structures of any particular implementation
language.
2018-09-24 11:59:22 +00:00
2022-06-18 17:11:08 +00:00
This document defines the core semantics and data model of Preserves and
presents a handful of examples. Two other core documents define
2022-06-18 17:11:08 +00:00
- a [human-readable text syntax](preserves-text.html), and
- a [machine-oriented binary syntax](preserves-binary.html)
2022-06-18 17:11:08 +00:00
for the Preserves data model.
## <a id="semantics"></a><a id="starting-with-semantics"></a>Values
Preserves *values* are given meaning independent of their syntax. We
will write "`Value`" when we mean the set of all Preserves values or an
element of that set.
`Value`s fall into two broad categories: *atomic* and *compound*
data. Every `Value` is finite and non-cyclic. Embedded values, called
`Embedded`s, are a third, special-case category.
Value = Atom
2021-01-29 11:03:28 +00:00
| Compound
| Embedded
2018-09-23 21:44:43 +00:00
Atom = Boolean
2021-01-29 11:03:28 +00:00
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
Compound = Record
2021-01-29 11:03:28 +00:00
| Sequence
| Set
| Dictionary
**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
2019-10-23 21:58:47 +00:00
follows:
(Values) Atom < Compound < Embedded
(Compounds) Record < Sequence < Set < Dictionary
2018-09-23 21:44:43 +00:00
(Atoms) Boolean < Float < Double < SignedInteger
< String < ByteString < Symbol
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.
### Signed integers.
2022-06-18 17:11:08 +00:00
A `SignedInteger` is an arbitrarily-large signed integer.
`SignedInteger`s are compared as mathematical integers.
### Unicode strings.
A `String` is a sequence of Unicode
2022-06-18 17:11:08 +00:00
[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
`String`s are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
gives the same result as a lexicographic byte-by-byte comparison
of the UTF-8 encoding of a string!
2022-06-18 17:11:08 +00:00
[^nul-permitted]: All Unicode code-points are permitted, including NUL
(code point zero).
### Binary data.
A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.
### Symbols.
Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.
### Booleans.
There are two `Boolean`s, “false” and “true”. The “false” value is
less-than the “true” value.
### IEEE floating-point values.
`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
2022-06-18 17:11:08 +00:00
`SignedInteger`s are disjoint; by the rules [above](#total-order), every
`Float` is less than every `Double`, and every `SignedInteger` is
greater than both. Two `Float`s or two `Double`s are to be ordered by
the `totalOrder` predicate defined in section 5.10 of [IEEE Std
754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
### Records.
A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.
[^extensibility]: The [Racket](https://racket-lang.org/) programming
language defines
2018-09-24 17:34:07 +00:00
“[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
structure types, which map well to our `Record`s. Racket supports
record extensibility by encoding record supertypes into record
labels as specially-formatted lists.
[^iri-labels]: It is occasionally (but seldom) necessary to
interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
label can be read as a relative IRI, it is notionally interpreted
with respect to the IRI
`urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
be read as an absolute IRI, it stands for that IRI; and otherwise,
it cannot be read as an IRI at all, and so the label simply stands
2018-09-24 11:59:22 +00:00
for itself—for its own `Value`.
### Sequences.
A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.
### Sets.
A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
2018-09-24 11:59:22 +00:00
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.
2018-09-24 11:59:22 +00:00
### Dictionaries.
2018-09-24 11:59:22 +00:00
A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
2018-09-24 11:59:22 +00:00
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.
### Embeddeds.
2021-01-29 11:03:28 +00:00
An `Embedded` allows inclusion of *domain-specific*, potentially
*stateful* or *located* data into a `Value`.[^embedded-rationale]
`Embedded`s may be used to denote stateful objects, network services,
object capabilities, file descriptors, Unix processes, or other
possibly-stateful things. Because each `Embedded` is a domain-specific
datum, comparison of two `Embedded`s is done according to
2021-01-29 11:03:28 +00:00
domain-specific rules.
[^embedded-rationale]: **Rationale.** Why include `Embedded`s as a
2021-01-29 11:03:28 +00:00
special class, distinct from, say, a specially-labeled `Record`?
First, a `Record` can only hold other `Value`s: in order to embed
values such as live pointers to Java objects, some means of
"escaping" from the `Value` data type must be provided. Second,
`Embedded`s are meant to be able to denote stateful entities, for
2021-01-29 11:03:28 +00:00
which comparison by address is appropriate; however, we do not
wish to place restrictions on the *nature* of these entities: if
we had used `Record`s instead of distinct `Embedded`s, users would
2021-01-29 11:03:28 +00:00
have to invent an encoding of domain data into `Record`s that
reflected domain ordering into `Value` ordering. This is often
difficult and may not always be possible. Finally, because
`Embedded`s are intended to be able to represent network and
memory *locations*, they must be able to be rewritten at network
and process boundaries. Having a distinct class allows generic
`Embedded` rewriting without the quotation-related complications
of encoding references as, say, `Record`s.
*Examples.* In a Java or Python implementation, an `Embedded` may
denote a reference to a Java or Python object; comparison would be
done via the language's own rules for equivalence and ordering. In a
Unix application, an `Embedded` may denote an open file descriptor or
a process ID. In an HTTP-based application, each `Embedded` might be a
2021-01-29 11:03:28 +00:00
URL, compared according to
[RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a
`Value` is serialized for storage or transfer, `Embedded`s will
usually be represented as ordinary `Value`s, in which case the
2021-01-29 11:03:28 +00:00
ordinary rules for comparing `Value`s will apply.
## Examples
2022-06-18 17:11:08 +00:00
The definitions above are independent of any particular concrete syntax.
The examples of `Value`s that follow are written using [the Preserves
text syntax](preserves-text.html), and the example encoded byte
sequences use [the Preserves binary encoding](preserves-binary.html).
### Ordering.
The total ordering specified [above](#total-order) means that the following statements are true:
2021-01-29 11:03:28 +00:00
"bzz" < "c" < "caa" < #!"a"
#t < 3.0f < 3.0 < 3 < "3" < |3| < [] < #!#t
2019-07-03 23:35:56 +00:00
### Simple examples.
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->
| Value | Encoded byte sequence |
|-----------------------------------------------------|---------------------------------------------------------------------------------|
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]` | B5 91 92 93 94 84 |
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
| `-257` | A1 FE FF |
| `-1` | 9F |
| `0` | 90 |
| `1` | 91 |
| `255` | A1 00 FF |
| `1.0f` | 82 3F 80 00 00 |
| `1.0` | 83 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
| `#xf"7f800000"`, positive `Float` infinity | 82 7F 80 00 00 |
| `#xd"fff0000000000000"`, negative `Double` infinity | 83 FF F0 00 00 00 00 00 00 |
2018-09-23 21:35:00 +00:00
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
2019-08-11 22:54:57 +00:00
<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">
encodes to
B4 ;; Record
B5 ;; Sequence
B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
92 ;; SignedInteger, "2"
B3 05 74 68 69 6E 67 ;; Symbol, "thing"
91 ;; SignedInteger, "1"
84 ;; End (sequence)
A0 65 ;; SignedInteger, "101"
B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
B4 ;; Record
B3 04 64 61 74 65 ;; Symbol, "date"
A1 07 1D ;; SignedInteger, "1821"
92 ;; SignedInteger, "2"
93 ;; SignedInteger, "3"
84 ;; End (record)
B1 02 44 72 ;; String, "Dr"
84 ;; End (record)
[^extensibility2]: It happens to line up with Racket's
representation of a record label for an inheritance hierarchy
where `titled` extends `person` extends `thing`:
(struct date (year month day) #:prefab)
(struct thing (id) #:prefab)
(struct person thing (name date-of-birth) #:prefab)
(struct titled person (title) #:prefab)
For more detail on Racket's representations of record labels, see
[the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).
2019-07-03 23:35:56 +00:00
### JSON examples.
2022-06-18 17:11:08 +00:00
Preserves text syntax is a superset of JSON, so the examples from [RFC
8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
Preserves.
The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
JSON numbers read (unambiguously) either as `SignedInteger`s or as
`Double`s.[^json-superset]
[^json-superset]: The following [schema](./preserves-schema.html)
definitions match exactly the JSON subset of a Preserves input:
version 1 .
JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
/ @array [JSON ...] / @object { string: JSON ...:... } .
JSONBoolean = =true / =false .
The first RFC 8259 example:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
2022-06-18 17:11:08 +00:00
when read using the Preserves text syntax encodes via the binary syntax
as follows:
B7
B1 05 "Image"
B7
2020-12-30 15:43:18 +00:00
B1 03 "IDs" B5
A0 74
A1 03 AF
A1 00 EA
A2 00 97 89
84
B1 05 "Title" B1 14 "View from 15th Floor"
B1 05 "Width" A1 03 20
B1 06 "Height" A1 02 58
B1 08 "Animated" B3 05 "false"
B1 09 "Thumbnail"
B7
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
B1 05 "Width" A0 64
B1 06 "Height" A0 7D
84
84
84
2022-06-18 17:11:08 +00:00
The second RFC 8259 example:
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
encodes to binary as follows:
B5
B7
B1 03 "Zip" B1 05 "94107"
B1 04 "City" B1 0D "SAN FRANCISCO"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
B1 09 "precision" B1 03 "zip"
84
B7
B1 03 "Zip" B1 05 "94085"
B1 04 "City" B1 09 "SUNNYVALE"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
B1 09 "precision" B1 03 "zip"
84
84
2019-07-14 18:09:19 +00:00
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes