--- no_site_title: true title: "Preserves: an Expressive Data Language" --- Tony Garnock-Jones {{ site.version_date }}. Version {{ site.version }}. {% include what-is-preserves.md %} This document defines the core semantics and data model of Preserves and presents a handful of examples. Two other core documents define - a [human-readable text syntax](preserves-text.html), and - a [machine-oriented binary syntax](preserves-binary.html) for the Preserves data model. ## Values Preserves *values* are given meaning independent of their syntax. We will write "`Value`" when we mean the set of all Preserves values or an element of that set. `Value`s fall into two broad categories: *atomic* and *compound* data. Every `Value` is finite and non-cyclic. Embedded values, called `Embedded`s, are a third, special-case category. {% include value-grammar.md %} **Total order.** As we go, we will incrementally specify a total order over `Value`s. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows: (Values) Atom < Compound < Embedded (Compounds) Record < Sequence < Set < Dictionary (Atoms) Boolean < Double < SignedInteger < String < ByteString < Symbol **Equivalence.** Two `Value`s are equal if neither is less than the other according to the total order. ### Signed integers. A `SignedInteger` is an arbitrarily-large signed integer. `SignedInteger`s are compared as mathematical integers. ### Unicode strings. A `String` is a sequence of [Unicode scalar value](http://www.unicode.org/glossary/#unicode_scalar_value)s.[^nul-permitted] `String`s are compared lexicographically, scalar value by scalar value.[^utf8-is-awesome] [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! [^nul-permitted]: All Unicode scalar values are permitted, including NUL (scalar value zero). Because scalar values are defined as code points *excluding* surrogate code points (D80016–DFFF16), surrogates are *not* permitted in Preserves Unicode data. ### Binary data. A `ByteString` is a sequence of octets. `ByteString`s are compared lexicographically. ### Symbols. Programming languages like Lisp and Prolog frequently use string-like values called *symbols*.[^even-java-has-quasi-symbols] Here, a `Symbol` is, like a `String`, a sequence of Unicode scalar values representing an identifier of some kind. `Symbol`s are also compared lexicographically by scalar value. [^even-java-has-quasi-symbols]: Even Java has quasi-symbols in the form of its "interned strings". A Java Preserves implementation might intern Preserves `Symbol`s while leaving Preserves `String`s uninterned. ### Booleans. There are two `Boolean`s, “false” and “true”. The “false” value is less-than the “true” value. ### IEEE floating-point values. `Double`s are double-precision IEEE 754 floating-point values.[^other-ieee754-precisions] `Double`s and `SignedInteger`s are disjoint; by the rules [above](#total-order), every `Double` is less than every `SignedInteger`. Two `Double`s are to be ordered by the `totalOrder` predicate defined in section 5.10 of [IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). [^other-ieee754-precisions]: Every value inhabiting a smaller IEEE 754 type (e.g. single- or half-precision) can be injected into and projected from double-precision losslessly and in an order-preserving way. ### Records. A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A label can be any `Value`, but is usually a `Symbol`.[^extensibility] [^iri-labels] `Record`s are ordered first by label, then lexicographically[^lexicographical-sequences] by field sequence. [^extensibility]: The [Racket](https://racket-lang.org/) programming language defines “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))” structure types, which map well to our `Record`s. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. [^iri-labels]: It is occasionally (but seldom) necessary to interpret such `Symbol` labels as IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself—for its own `Value`. [^lexicographical-sequences]: When comparing sequences of values for the total order, [lexicographical ordering](https://en.wikipedia.org/wiki/Lexicographic_order) is used. Elements are drawn pairwise from the two sequences to be compared. If one is smaller than the other according to the total order, the sequence it was drawn from is the smaller of the sequences. If the end of one sequence is reached, while the other sequence has elements remaining, the shorter sequence is considered smaller. Otherwise, all the elements compared equal and neither was longer than the other, so they compare equal. For example, - `[#f]` is ordered before `[foo]` because `Boolean` appears before `Symbol` in the kind ordering; - `[x]` before `[x y]` because there is no element remaining to compare against `y`; - `[a b]` before `[x]` because `a` is smaller than `x`; and - `[x y]` before `[x z]` because `y` is ordered before `z` according to the ordering rules for `Symbol`. ### Sequences. A `Sequence` is a sequence of `Value`s. `Sequence`s are compared lexicographically.[^lexicographical-sequences] ### Sets. A `Set` is an unordered finite set of `Value`s. It contains no duplicate values, following the [equivalence relation](#equivalence) induced by the total order on `Value`s. Two `Set`s are compared by sorting their elements ascending using the [total order](#total-order) and comparing the resulting `Sequence`s.[^lexicographical-sequences] ### Dictionaries. A `Dictionary` is an unordered finite collection of pairs of `Value`s. Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are pairwise distinct. Instances of `Dictionary` are compared by lexicographic[^lexicographical-sequences] comparison of the sequences resulting from ordering each `Dictionary`'s pairs in ascending order by key. ### Embeddeds. An `Embedded` allows inclusion of *domain-specific*, potentially *stateful* or *located* data into a `Value`.[^embedded-rationale] `Embedded`s may be used to denote stateful objects, network services, object capabilities, file descriptors, Unix processes, or other possibly-stateful things. Because each `Embedded` is a domain-specific datum, comparison of two `Embedded`s is done according to domain-specific rules. [^embedded-rationale]: **Rationale.** Why include `Embedded`s as a special class, distinct from, say, a specially-labeled `Record`? First, a `Record` can only hold other `Value`s: in order to embed values such as live pointers to Java objects, some means of "escaping" from the `Value` data type must be provided. Second, `Embedded`s are meant to be able to denote stateful entities, for which comparison by address is appropriate; however, we do not wish to place restrictions on the *nature* of these entities: if we had used `Record`s instead of distinct `Embedded`s, users would have to invent an encoding of domain data into `Record`s that reflected domain ordering into `Value` ordering. This is often difficult and may not always be possible. Finally, because `Embedded`s are intended to be able to represent network and memory *locations*, they must be able to be rewritten at network and process boundaries. Having a distinct class allows generic `Embedded` rewriting without the quotation-related complications of encoding references as, say, `Record`s. *Motivating Examples.* In a Java or Python implementation, an `Embedded` may denote a reference to a Java or Python object; comparison would be done via the language's own rules for equivalence and ordering. In a Unix application, an `Embedded` may denote an open file descriptor or a process ID. In an HTTP-based application, each `Embedded` might be a URL, compared according to [RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a `Value` is serialized for storage or transfer, `Embedded`s will usually be represented as ordinary `Value`s, in which case the ordinary rules for comparing `Value`s will apply. ## Appendix. Examples The definitions above are independent of any particular concrete syntax. The examples of `Value`s that follow are written using [the Preserves text syntax](preserves-text.html), and the example encoded byte sequences use [the Preserves binary encoding](preserves-binary.html). ### Ordering. The total ordering specified [above](#total-order) means that the following statements are true: - `"bzz"` < `"c"` < `"caa"` < `#!"a"` - `#t` < `3.0f` < `3.0` < `3` < `"3"` < `|3|` < `[]` < `#!#t` - `[#f]` < `[foo]`, because `Boolean` appears before `Symbol` in the kind ordering - `[x]` < `[x y]`, because there is no element remaining to compare against `y` - `[a b]` < `[x]`, because `a` is smaller than `x` - `[x y]` < `[x z]`, because `y` is ordered before `z` ### Simple examples. | Value | Encoded byte sequence | |-----------------------------------------------------|---------------------------------------------------------------------------------| | `>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 | | `[1 2 3 4]` | B5 B0 01 01 B0 01 02 B0 01 03 B0 01 04 84 | | `[-2 -1 0 1]` | B5 B0 01 FE B0 01 FF B0 00 B0 01 01 84 | | `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' | | `"z水𝄞"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E | | `"z水\uD834\uDD1E"` | B1 08 'z' E6 B0 B4 F0 9D 84 9E | | `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 | | `-257` | B0 02 FE FF | | `-1` | B0 01 FF | | `0` | B0 00 | | `1` | B0 01 01 | | `255` | B0 02 00 FF | | `1.0f` | 87 04 3F 80 00 00 | | `1.0` | 87 08 3F F0 00 00 00 00 00 00 | | `-1.202e300` | 87 08 FE 3C B7 B7 59 BF 04 26 | | `#xd"fff0000000000000"`, negative `Double` infinity | 87 08 FF F0 00 00 00 00 00 00 | The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record` <[titled person 2 thing 1] 101 "Blackwell" "Dr"> encodes to B4 # Record B5 # Sequence B3 06 74 69 74 6C 65 64 # Symbol, "titled" B3 06 70 65 72 73 6F 6E # Symbol, "person" B0 01 02 # SignedInteger, "2" B3 05 74 68 69 6E 67 # Symbol, "thing" B0 01 01 # SignedInteger, "1" 84 # End (sequence) B0 01 65 # SignedInteger, "101" B1 09 42 6C 61 63 6B 77 65 6C 6C # String, "Blackwell" B4 # Record B3 04 64 61 74 65 # Symbol, "date" B0 02 07 1D # SignedInteger, "1821" B0 01 02 # SignedInteger, "2" B0 01 03 # SignedInteger, "3" 84 # End (record) B1 02 44 72 # String, "Dr" 84 # End (record) [^extensibility2]: It happens to line up with Racket's representation of a record label for an inheritance hierarchy where `titled` extends `person` extends `thing`: (struct date (year month day) #:prefab) (struct thing (id) #:prefab) (struct person thing (name date-of-birth) #:prefab) (struct titled person (title) #:prefab) For more detail on Racket's representations of record labels, see [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29). ### JSON examples. Preserves text syntax is a superset of JSON,[^json-string-caveat] so the examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid Preserves. [^json-string-caveat]: There is one caveat to be aware of. [Section 8.2 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.2) explicitly permits unpaired [surrogate code point](https://unicode.org/glossary/#surrogate_code_point)s in JSON texts without specifying an interpretation for them. Preserves mandates UTF-8 in its binary syntax, forbids unpaired surrogates in its text syntax, and disallows surrogate code points in `String`s and `Symbol`s, meaning that any valid JSON text including an unpaired surrogate will not be parseable using the Preserves text syntax rules. The JSON literals `true`, `false` and `null` all read as `Symbol`s, and JSON numbers read (unambiguously) either as `SignedInteger`s or as `Double`s.[^json-superset] [^json-superset]: The following [schema](./preserves-schema.html) definitions match exactly the JSON subset of a Preserves input: version 1 . JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null / @array [JSON ...] / @object { string: JSON ...:... } . JSONBoolean = =true / =false . The first RFC 8259 example: { "Image": { "Width": 800, "Height": 600, "Title": "View from 15th Floor", "Thumbnail": { "Url": "http://www.example.com/image/481989943", "Height": 125, "Width": 100 }, "Animated" : false, "IDs": [116, 943, 234, 38793] } } when read using the Preserves text syntax encodes via the binary syntax as follows: B7 B1 05 "Image" B7 B1 03 "IDs" B5 B0 01 74 B0 02 03 AF B0 02 00 EA B0 03 00 97 89 84 B1 05 "Title" B1 14 "View from 15th Floor" B1 05 "Width" B0 02 03 20 B1 06 "Height" B0 02 02 58 B1 08 "Animated" B3 05 "false" B1 09 "Thumbnail" B7 B1 03 "Url" B1 26 "http://www.example.com/image/481989943" B1 05 "Width" B0 01 64 B1 06 "Height" B0 01 7D 84 84 84 The second RFC 8259 example: [ { "precision": "zip", "Latitude": 37.7668, "Longitude": -122.3959, "Address": "", "City": "SAN FRANCISCO", "State": "CA", "Zip": "94107", "Country": "US" }, { "precision": "zip", "Latitude": 37.371991, "Longitude": -122.026020, "Address": "", "City": "SUNNYVALE", "State": "CA", "Zip": "94085", "Country": "US" } ] encodes to binary as follows: B5 B7 B1 03 "Zip" B1 05 "94107" B1 04 "City" B1 0D "SAN FRANCISCO" B1 05 "State" B1 02 "CA" B1 07 "Address" B1 00 B1 07 "Country" B1 02 "US" B1 08 "Latitude" 87 08 40 42 E2 26 80 9D 49 52 B1 09 "Longitude" 87 08 C0 5E 99 56 6C F4 1F 21 B1 09 "precision" B1 03 "zip" 84 B7 B1 03 "Zip" B1 05 "94085" B1 04 "City" B1 09 "SUNNYVALE" B1 05 "State" B1 02 "CA" B1 07 "Address" B1 00 B1 07 "Country" B1 02 "US" B1 08 "Latitude" 87 08 40 42 AF 9D 66 AD B4 03 B1 09 "Longitude" 87 08 C0 5E 81 AA 4F CA 42 AF B1 09 "precision" B1 03 "zip" 84 84 ## Appendix. Merging Values The *merge* of two `Value`s is a combination of the two values that includes all information from each that is missing from the other. If the values are incompatible, they have no merge. - the merge of two `Atom`s has no value if they are not [equal](#equivalence); otherwise, it has value equal to (an arbitrary) one of the atoms. - the merge of two `Embedded`s depends on the interpretation of the embedded values, and so is implementation-defined. - the merge of two `Compound`s is: - if both are `Sequence`s, let `n` be the minimum of the lengths of the two sequences. If every merge of corresponding positions up to `n` in the sequences is defined, the result is defined, with elements merged up to position `n` and simply copied from the longer sequence from position `n` onward; otherwise, it is undefined. - if both are `Record`s, the `Record` with the merge of the two input records' labels as its label and the merge of the inputs' field sequences as its fields; - if both are `Dictionary`s, if every merge of values associated with keys common to both inputs is defined, the result is defined, with merged values at common keys and simply copied from either side for keys unique to that side; otherwise, it is undefined. - Otherwise, the merge is undefined. **Examples.** - `merge [1, [2], 3] [1, [2, 99], 3, 4, 5] = merge [1, [2, 99], 3, 4, 5]` - `merge [1, 2, 3] [1, 5, 3] = ⊥` - `merge #{a, b, c} #{a, b, c} = ⊥` - `merge {a: 1, b: [2]} {b: [2, 99] c: 3} = {a: 1, b: [2, 99], c: 3}` - `merge {a: 1, b: [2]} {a: 5, b: [2]} = ⊥` ## Notes