diff --git a/syndicate/mc/preserve.md b/syndicate/mc/preserve.md new file mode 100644 index 0000000..cc3849c --- /dev/null +++ b/syndicate/mc/preserve.md @@ -0,0 +1,982 @@ +--- +--- + + +# Preserves: Semantic Serialization of Node-labelled Data + + _________ + <_________> Tony Garnock-Jones + | FRμIT | September 2018 + |Preserves| Version 0.0.2 + \_________/ +   + + [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt + [spki]: http://world.std.com/~cme/html/spki.html + [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints + [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map + +## Introduction + +Most data serialization formats used on the web represent +*edge-labelled* semi-structured data. + +This document proposes a data model and serialization format that +takes a *node-labelled* approach. + +This makes it both extensible and much more like S-expressions, making +it easily able to represent the *labelled sums of products* as seen in +Rust, Haskell, OCaml, and other functional programming languages. + +## Starting with Semantics + +Taking inspiration from functional programming, we start with a +definition of the *values* that we want to work with and give them +meaning independent of their syntax. We will treat syntax separately, +later in this document. + + Value = Atom + | Compound + + Atom = SignedInteger + | String + | ByteString + | Symbol + | Boolean + | Float + | Double + | MIMEData + + Compound = Record + | Sequence + | Set + | Dictionary + +Our `Value`s fall into two broad categories: *atomic* and *compound* +data.[^zephyr-asdl] + + [^zephyr-asdl]: This design was loosely inspired by S-expressions, + as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others, + and by the ML type system, as seen in languages such as SML, + OCaml, Haskell, Rust, and many others. 
It is also related to + Zephyr ASDL (h/t + [Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)), + which doesn't offer much in the way of atoms, but offers + general-purpose labelled sums and products. See D. C. Wang, A. W. + Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax + Description Language,” in USENIX Conference on Domain-Specific + Languages, 1997, pp. 213–228. + [PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf) + +**Total order.** As we go, we will +incrementally specify a total order over `Value`s. Two values of the +same kind are compared using kind-specific rules. The ordering among +values of different kinds is essentially arbitrary, but having a total +order is convenient for many tasks, so we define it as +follows:[^ordering-by-syntax] + + (Values) Compound < Atom + + (Compounds) Record < Sequence < Set < Dictionary + + (Atoms) SignedInteger < String < ByteString < Symbol + < Boolean < Float < Double < MIMEData + + [^ordering-by-syntax]: The observant reader may note that the + ordering here is the same as that implied by the tagging scheme + used in the concrete binary syntax for `Value`s. + +**Equivalence.** Two `Value`s are equal if +neither is less than the other according to the total order. + + + + +### Signed integers. + +A `SignedInteger` is a signed integer of arbitrary width. +`SignedInteger`s are compared as mathematical integers. We will write +examples of `SignedInteger`s using standard mathematical notation. + +**Examples.** 10; -6; 0. + +**Non-examples.** NaN (the clue is in the name!); ∞ (not finite); 0.2 +(not an integer); 1/7 (likewise); 2+*i*3 (likewise); √2 (likewise). + +### Unicode strings. + +A `String` is a sequence of Unicode +[code-point](http://www.unicode.org/glossary/#code_point)s. 
Two +`String`s are compared lexicographically, code-point by +code-point.[^utf8-is-awesome] We will write examples of `String`s as text +surrounded by double-quotes “`"`” using a monospace font. + + [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this + gives the same result as a lexicographic byte-by-byte comparison + of the UTF-8 encoding of a string! + +**Examples.** `"Hello world"`, an eleven-code-point string; `"z水𝄞"`, +the string containing the three Unicode code-points `z` (0x7A), `水` +(0x6C34) and `𝄞` (0x1D11E); `""`, the empty string. + +**Normalization forms.** Unicode defines multiple +[normalization forms](http://unicode.org/reports/tr15/) for text. No +particular normalization form is required for `String`s; +[see below](#normalization-forms). + +### Binary data. + +A `ByteString` is an ordered sequence of zero or more integers in the +inclusive range [0..255]. `ByteString`s are compared +lexicographically, byte by byte. We will only write examples of +`ByteString`s that contain bytes mapping to printable ASCII +characters, using “`#"`” as an opening quote mark and “`"`” as a +closing quote mark. + +**Examples.** The `ByteString` containing the integers 65, 66 and 67 +(corresponding to ASCII characters `A`, `B` and `C`) is written as +`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite +appearances, these are *binary* data. + +### Symbols or identifiers. + +Programming languages like Lisp and Prolog frequently use string-like +values called *symbols*. Here, a `Symbol` is, like a `String`, a +sequence of Unicode code-points, intended to represent an identifier +of some kind. `Symbol`s are also compared lexicographically by +code-point. We will write examples including only non-empty sequences +of non-whitespace characters, using a monospace font without quotation +marks. + +**Examples.** `hello-world`; `utf8-string`; `exact-integer?`. + +### Booleans. + +There are exactly two `Boolean` values, “false” and “true”.
The +“false” value compares less-than the “true” value. We write `#f` for +“false”, and `#t` for “true”. + +**Examples.** `#f`; `#t`. + +### IEEE floating-point values. + +A `Float` is a single-precision IEEE 754 floating-point value; a +`Double` is a double-precision IEEE 754 floating-point value. +`Float`s, `Double`s and `SignedInteger`s are considered disjoint, and +so by the rules [above](#total-order), every `Float` is less than +every `Double`, and every `SignedInteger` is less than both. Two +`Float`s or two `Double`s are to be ordered by the `totalOrder` +predicate defined in section 5.10 of +[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). +We write examples using standard mathematical notation, avoiding NaN +and infinities, using a suffix `f` or `d` to indicate `Float` or +`Double`, respectively. + +**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d. + +**Non-examples.** 10, -6, and 0, because writing them this way +indicates `SignedInteger`s, not `Float`s or `Double`s. + +### MIME-type tagged binary data. + +A `MIMEData` is a pair of a `Symbol` denoting a +[media type](https://tools.ietf.org/html/rfc6838) and a `ByteString` +body, intended to be interpreted as an encoding of a document having +that media type. While each media type may define its own rules for +comparing documents, we define ordering among `MIMEData` +*representations* of such media types lexicographically over the +(`Symbol`, `ByteString`) pair. We write examples using the same syntax +as for byte strings, but with the media type `Symbol` sandwiched +between the “`#`” and the first “`"`”. + +**Examples.** `#application/octet-stream""`; `#text/plain"ABC"`; +`#application/xml""`; `#text/csv"123,234,345"`. + +### Records. + +A `Record` is a *labelled* tuple of zero or more `Value`s, called the +record's *fields*. 
A record's label is, itself, a `Value`, though it +will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s +are compared lexicographically as if they were just tuples; that is, +first by their labels, and then by the remainder of their fields. We +will only write examples of `Record`s having labels that are `Symbol`s +entirely composed of ASCII characters. Such `Record`s will be written +as a parenthesised, space-separated sequence of their label followed +by their fields. + + [^extensibility]: The [Racket](https://racket-lang.org/) programming + language defines + [“prefab”](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct)) + structure types, which map well to our `Record`s. Racket supports + record extensibility by encoding record supertypes into record + labels as specially-formatted lists. + + [^iri-labels]: It is occasionally (but seldom) necessary to + interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a + label can be read as a relative IRI, it is notionally interpreted + with respect to the IRI + `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can + be read as an absolute IRI, it stands for that IRI; and otherwise, + it cannot be read as an IRI at all, and so the label simply stands + for itself - for its own `Value`. + +**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is +written `(foo 1 2 3)`; the `Record` with label `void` and no fields is +written `(void)`. + +### Sequences. + +A `Sequence` is a general-purpose, variable-length ordered sequence of +zero or more `Value`s. `Sequence`s are compared lexicographically, +appealing to the ordering on `Value`s for comparisons at each position +in the `Sequence`s. We write examples space-separated, surrounded with +square brackets. + +**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of +`SignedInteger`s 1, 2 and 3. + +### Sets. + +A `Set` is an unordered finite set of `Value`s. 
It contains no +duplicate values, following the [equivalence relation](#equivalence) +induced by the total order on `Value`s. Two `Set`s are compared by +sorting their elements using the [total order](#total-order) and +comparing the resulting sequences as `Sequence`s. We write examples +space-separated, surrounded with curly braces, prefixed by `#set`. + +**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set +containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set +containing 4, the string `"hello"`, the record with label `void` and +no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`, +the set containing a `SignedInteger` and a `Float`, both denoting the +number 1; `#set{#application/xml"<x/>" #application/xml"<x />"}`, a +set containing two different `MIMEData` +values.[^mimedata-xml-difference] + + [^mimedata-xml-difference]: The two XML documents `<x/>` and `<x />` + differ by bytewise comparison, and thus yield different `MIMEData` + values, even though under the semantics of XML they denote + identical XML infosets. + +**Non-examples.** `#set{1 1 1}`, because it contains multiple +equivalent `Value`s. + +### Dictionaries, hash-tables or maps. + +A `Dictionary` is an unordered finite collection of zero or more pairs +of `Value`s. Each pair comprises a *key* and a *value*. Keys in a +`Dictionary` must be pairwise distinct. Instances of `Dictionary` are +compared by lexicographic comparison of the sequences resulting from +ordering each `Dictionary`'s pairs in ascending order by key. Examples +are written as a `#dict`-prefixed, curly-brace-surrounded sequence of +space-separated key-value pairs, each written with a colon between the +key and value. + +**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the +dictionary mapping the `Symbol` `a` to the `SignedInteger` 1; +`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having +a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence` +values.
+ +**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate +keys; `#dict{[]:[] []:99}`, for the same reason. + +## Syntax + +Now we have discussed `Value`s and their meanings, we may turn to +techniques for *representing* `Value`s for communication or storage. + +The syntax we have used for the examples so far is inadequate in many +ways, not least of which is that it cannot represent every `Value`. + +Separation of the meaning of a piece of syntax from the syntax itself +opens the door to domain-specific syntaxes, all equivalent and +interconvertible.[^asn1] With a robust semantic foundation, +connections to other data languages can also be made. + + [^asn1]: Those who remember + [ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx) + will recall BER, DER, PER, CER, XER and so on, each appropriate to + a different setting. Similarly, + [Rivest's S-Expression design][sexp.txt] offers a human-friendly + syntax, a syntax robust to network-induced message corruption, and + an unambiguous, simple and easily-parsed machine-friendly syntax + for the same underlying values. + +### Binary syntax + +For now, we limit our attention to an easily-parsed, easily-produced +machine-readable syntax. + +Every `Value` is represented as one or more bytes describing first its +kind and its length, and then its specific contents. + +For a value `v`, we write `[[v]]` for the encoding of v. + +The following figure summarises the definitions below: + + tt nn mmmm varint(m) contents + ------------------------------- + + 00 00 mmmm ... application-specific Record + 00 01 mmmm ... application-specific Record + 00 10 mmmm ... application-specific Record + 00 11 mmmm ... Record + + 01 00 mmmm ... Sequence + 01 01 mmmm ... Set + 01 10 mmmm ... Dictionary + + 10 00 mmmm ... SignedInteger, big-endian binary + 10 01 mmmm ... String, UTF-8 binary + 10 10 mmmm ... Bytes + 10 11 mmmm ... 
Symbol, UTF-8 binary + + 11 00 0000 False + 11 00 0001 True + 11 00 0010 Float, 32 bits big-endian binary + 11 00 0011 Double, 64 bits big-endian binary + + 11 01 mmmm ... MIME-type-labelled binary data + + If mmmm = 1111, varint(m) is present; otherwise, m is the length + +#### Type and Length representation + +A `Value`'s type and length is represented by use of a function +`header(t,n,m)` that yields a sequence of bytes when `t`, `n` and `m` +are appropriate non-negative integers. + + header(t,n,m) = leadbyte(t,n,m) when m < 15 + or leadbyte(t,n,15) ++ varint(m) otherwise + +The lead byte in a `Value`'s representation is constructed by a function + + leadbyte(t,n,m) = [t*64 + n*16 + m] + +The lead byte describes the rest of the representation as +follows:[^some-encodings-unused] + + leadbyte(0,-,-) represents a Record + leadbyte(1,-,-) represents a Sequence, Set or Dictionary + leadbyte(2,-,-) represents an Atom with variable-length binary representation + leadbyte(3,0,-) represents an Atom with fixed-length binary representation + leadbyte(3,1,-) represents certain special variable-length values + + [^some-encodings-unused]: Some encodings are unused. All such + encodings are reserved for future versions of this specification. + +Variable-length representations use the value of `m` to encode their +lengths: + + - Lengths between 0 and 14 are represented using `leadbyte` with `m` + values 0 through 14. + - Lengths of 15 or greater are represented by `m` value 15, and + additional "length bytes" describing the length then follow the + lead byte. + +These additional length bytes are formatted as +[base 128 varints][varint]. Quoting the +[Google Protocol Buffers][varint] definition, + +> Each byte in a varint, except the last byte, has the most +> significant bit (msb) set – this indicates that there are further +> bytes to come. 
The lower 7 bits of each byte are used to store the +> two's complement representation of the number in groups of 7 bits, +> least significant group first. + +**Examples.** + + - The varint representation of 15 is just the byte 15. + - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2. + - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3. + +We write `varint(m)` for the varint-encoding of `m`. + +#### Records + + [[ (L F_1 ... F_m) ]] = header(0,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] + +For `m` fields, `m+1` is supplied to `header`, to account for the +encoding of the record label. + +##### Application-specific short form for labels + +Any given protocol using Preserves may additionally define an +interpretation for `n ∈ {0,1,2}`, mapping each *short form label +number* `n` to a specific record label. When encoding `m` fields with +short form label number `n`, the header is `header(0,n,m)` (with `m` +rather than `m+1`) since the label is implicit. + +**Examples.** A protocol may choose to map records +labelled `void` to `n=0`, making + + [[(void)]] = header(0,0,0) = [0x00] + +or it may map records labelled `person` to short form label number 1, +making + + [[(person "Dr" "Elizabeth" "Blackwell")]] + = header(0,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] + = [0x13] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] + +#### Sequences, Sets and Dictionaries + + [[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]] + + [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[Y_1]] ++ ... ++ [[Y_m]] + where [Y_1 ... Y_m] = sort([X_1 ... X_m]) + + [[ #dict{K_1:V_1 ... K_m:V_m} ]] + = header(1,2,m) ++ [[K'_1]] ++ [[V'_1]] ++ ... ++ [[K'_m]] ++ [[V'_m]] + where [[K'_1 V'_1] ... [K'_m V'_m]] + = sort([[K_1 V_1] ... [K_m V_m]]) + +Note that `n=3` is unused and reserved.
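
The `varint`, `leadbyte` and `header` definitions above, together with the Record and Sequence rules, can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the helper names `encode_sequence` and `encode_record` are hypothetical, and both assume their arguments are already-encoded `Value`s (byte strings).

```python
from typing import Iterable


def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        group = m & 0x7F          # low 7 bits
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: more bytes follow
        else:
            out.append(group)         # last byte: msb clear
            return bytes(out)


def leadbyte(t: int, n: int, m: int) -> bytes:
    """leadbyte(t,n,m) = [t*64 + n*16 + m]."""
    return bytes([t * 64 + n * 16 + m])


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte alone when m < 15, otherwise lead byte with m=15 plus varint(m)."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)


def encode_sequence(encoded_items: Iterable[bytes]) -> bytes:
    """Hypothetical helper: [[ [X_1 ... X_m] ]] given the already-encoded X_i."""
    items = list(encoded_items)
    return header(1, 0, len(items)) + b"".join(items)


def encode_record(encoded_label: bytes, encoded_fields: Iterable[bytes]) -> bytes:
    """Hypothetical helper: generic Record, header(0,3,m+1) ++ label ++ fields."""
    fields = list(encoded_fields)
    return header(0, 3, len(fields) + 1) + encoded_label + b"".join(fields)
```

For instance, `header(0, 1, 3)` yields the single byte `0x13` used in the `person` example, and a length of 15 or more switches to the two-part form with a trailing varint.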
+ +#### Variable-length Atoms + +##### SignedInteger + + [[ x ]] when x ∈ SignedInteger = header(2,0,m) ++ intbytes(x) + where m = |intbytes(x)| + and intbytes(x) = a big-endian two's-complement representation + of the signed integer x, taking exactly as + many whole bytes as needed to unambiguously + identify the value + +For example, + + [[ -257 ]] = [0x82, 0xFE, 0xFF] + [[ -256 ]] = [0x82, 0xFF, 0x00] + [[ -255 ]] = [0x82, 0xFF, 0x01] + [[ -254 ]] = [0x82, 0xFF, 0x02] + [[ -129 ]] = [0x82, 0xFF, 0x7F] + [[ -128 ]] = [0x81, 0x80] + [[ -127 ]] = [0x81, 0x81] + [[ -2 ]] = [0x81, 0xFE] + [[ -1 ]] = [0x81, 0xFF] + [[ 0 ]] = [0x80] + [[ 1 ]] = [0x81, 0x01] + [[ 127 ]] = [0x81, 0x7F] + [[ 128 ]] = [0x82, 0x00, 0x80] + [[ 255 ]] = [0x82, 0x00, 0xFF] + [[ 256 ]] = [0x82, 0x01, 0x00] + [[ 32767 ]] = [0x82, 0x7F, 0xFF] + [[ 32768 ]] = [0x83, 0x00, 0x80, 0x00] + [[ 65535 ]] = [0x83, 0x00, 0xFF, 0xFF] + [[ 65536 ]] = [0x83, 0x01, 0x00, 0x00] + [[ 131072 ]] = [0x83, 0x02, 0x00, 0x00] + +##### String + + [[ S ]] when S ∈ String = header(2,1,m) ++ utf8(S) + where m = |utf8(S)| + and utf8(S) = the UTF-8 encoding of S + +##### ByteString + + [[ B ]] when B ∈ ByteString = header(2,2,m) ++ B + where m = |B| + +##### Symbol + + [[ S ]] when S ∈ Symbol = header(2,3,m) ++ utf8(S) + where m = |utf8(S)| + and utf8(S) = the UTF-8 encoding of S + +#### Fixed-length Atoms + +##### Booleans + + [[ #f ]] = header(3,0,0) = [0xC0] + [[ #t ]] = header(3,0,1) = [0xC1] + +##### Floats and Doubles + + [[ F ]] when F ∈ Float = header(3,0,2) ++ binary32(F) + [[ D ]] when D ∈ Double = header(3,0,3) ++ binary64(D) + where binary32(F) and binary64(D) are big-endian 4- and 8-byte + IEEE 754 binary representations + +#### Special variable-length values + +##### MIMEData + +Each `MIMEData` value consists of a media type `Symbol` and a raw +binary body.
+ + [[ M ]] when M ∈ MIMEData = header(3,1,m) ++ [[T]] ++ B + where m = |B| + and T is the Symbol media type of M + and B is the ByteString body of M + +## Examples + + + + +For the following examples, imagine an application that maps `Record` +short form label number 0 to label `discard`, 1 to `capture`, and 2 to +`observe`. + +| Value | Encoded hexadecimal byte sequence | +|--------------------------------------------------------------------|----------------------------------------------------| +| `(capture (discard))` | 11 00 | +| `(observe (speak (discard) (capture (discard))))` | 21 33 B5 73 70 65 61 6B 00 11 00 | +| `[1 2 3 4]` | 44 81 01 81 02 81 03 81 04 | +| `[-2 -1 0 1]` | 44 81 FE 81 FF 80 81 01 | +| `["hello" there #"world" [] #set{} #t #f]` | 47 95 68 65 6C 6C 6F B5 74 68 65 72 65 A5 77 6F 72 6C 64 40 50 C1 C0 | +| `-257` | 82 FE FF | +| `-1` | 81 FF | +| `0` | 80 | +| `1` | 81 01 | +| `255` | 82 00 FF | +| `1f` | C2 3F 80 00 00 | +| `1d` | C3 3F F0 00 00 00 00 00 00 | +| `-1.202e300d` | C3 FE 3C B7 B7 59 BF 04 26 | + +Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Value` + + ([titled person 2 thing 1] + 101 + "Blackwell" + (date 1821 2 3) + "Dr") + +encodes to + + 35 ;; Record, generic, 4+1 + 45 ;; Sequence, 5 + b6 74 69 74 6c 65 64 ;; Symbol, "titled" + b6 70 65 72 73 6f 6e ;; Symbol, "person" + 81 02 ;; SignedInteger, "2" + b5 74 68 69 6e 67 ;; Symbol, "thing" + 81 01 ;; SignedInteger, "1" + 81 65 ;; SignedInteger, "101" + 99 42 6c 61 63 6b 77 65 6c 6c ;; String, "Blackwell" + 34 ;; Record, generic, 3+1 + b4 64 61 74 65 ;; Symbol, "date" + 82 07 1d ;; SignedInteger, "1821" + 81 02 ;; SignedInteger, "2" + 81 03 ;; SignedInteger, "3" + 92 44 72 ;; String, "Dr" + + [^extensibility2]: It happens to line up with Racket's + representation of a record label for an inheritance hierarchy + where `titled` extends `person` extends `thing`: + + (struct date (year month day) #:prefab) + (struct thing (id) #:prefab) + (struct person thing
(name date-of-birth) #:prefab) + (struct titled person (title) #:prefab) + +## Conventions for Common Data Types + +The `Value` data type is essentially an S-Expression, able to +represent semi-structured data over `ByteString`, `String`, +`SignedInteger` atoms and so on. + +However, users need a wide variety of data types for representing +domain-specific values such as various kinds of encoded and normalized +text, calendrical values, machine words, and so on. + +We use appropriately-labelled `Record`s to denote these +domain-specific data types. + +All of these conventions are optional. They form a layer atop the core +`Value` structure. Non-domain-specific tools do not in general need to +treat them specially. + +**Validity.** Many of the labels we will describe in this section come + with side-conditions on the contents of labelled `Record`s. It is + possible to construct an instance of `Value` that violates these + side-conditions without ceasing to be a `Value` or becoming + unrepresentable. However, we say that such a `Value` is *invalid* + because it fails to honour the necessary side-conditions. + Implementations *SHOULD* allow two modes of working: one which + treats all `Value`s identically, without regard for side-conditions, + and one which enforces validity (i.e. side-conditions) when reading, + writing, or constructing `Value`s. + +### Text + +#### Normalization forms + +In order for users to unambiguously signal or require a particular +[normalization form](http://unicode.org/reports/tr15/), we define a +`NormalizedString`, which is a `Record` labelled with +`unicode-normalization` and having two fields, the first of which is a +`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`, +`nfkc`, `nfkd`), and the second of which is a `String` whose +underlying code point representation *MUST* be normalized according to +the named normalization form. + +#### IRIs (URIs, URLs, URNs, etc.) 
+ +An `IRI` is a `Record` labelled with `iri` and having one field, a +`String` which is the IRI itself and which *MUST* be a valid absolute +or relative IRI. + +### Machine words + +The definition of `SignedInteger` captures all integers. However, in +certain circumstances it can be valuable to assert that a number +inhabits a particular range, such as a fixed-width machine word. + +A family of labels `i`*n* and `u`*n* for *n* ∈ {16,32,64} denote +*n*-bit-wide signed and unsigned range restrictions, respectively. +Records with these labels *MUST* have one field, a `SignedInteger`, +which *MUST* fall within the appropriate range. That is, to be valid, + - in `(i16 `*x*`)`, -32768 <= *x* <= 32767. + - in `(u16 `*x*`)`, 0 <= *x* <= 65535. + - in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647. + - etc. + +### Anonymous Tuples and Unit + +A `Tuple` is a `Record` with label `tuple` and zero or more fields, +denoting an anonymous tuple of values. + +The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called +"unit" or "void" (but *not* e.g. JavaScript's "undefined" value). + +### Null and Undefined + +Tony Hoare's +"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)" +can be represented with the 0-ary `Record` `(null)`. An "undefined" +value can be represented as `(undefined)`. + +### Dates and Times + +Dates, times, moments, and timestamps can be represented with a +`Record` with label `rfc3339` having a single field, a `String`, which +*MUST* conform to one of the `full-date`, `partial-time`, `full-time`, +or `date-time` productions of +[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6). + +## Representing Values in Programming Languages + +We have given a definition of `Value` and its semantics, and proposed +a concrete syntax for communicating and storing `Value`s. We now turn +to **suggested** representations of `Value`s as *programming-language +values* for various programming languages. 
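
As one illustration of how a language binding might enforce the optional validity side-conditions, the range rules for the `i`*n* and `u`*n* labels given under "Machine words" above could be checked roughly as follows. This is a sketch under stated assumptions: a `Record` is modelled here simply as a Python string label plus a list of fields, and the function names `machine_word_range` and `is_valid_machine_word` are hypothetical.

```python
from typing import List, Optional, Tuple


def machine_word_range(label: str) -> Optional[Tuple[int, int]]:
    """Inclusive (lo, hi) range for labels i16/i32/i64/u16/u32/u64, else None."""
    if len(label) < 2 or label[0] not in "iu" or label[1:] not in ("16", "32", "64"):
        return None
    bits = int(label[1:])
    if label[0] == "i":
        # Signed n-bit range: -2^(n-1) .. 2^(n-1) - 1
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Unsigned n-bit range: 0 .. 2^n - 1
    return 0, (1 << bits) - 1


def is_valid_machine_word(label: str, fields: List[object]) -> bool:
    """Check the side-conditions: exactly one SignedInteger field, within range."""
    bounds = machine_word_range(label)
    if bounds is None:
        raise ValueError("not a machine-word label: %r" % (label,))
    if len(fields) != 1:
        return False
    x = fields[0]
    # bool is a subclass of int in Python, but a Boolean is not a SignedInteger.
    if isinstance(x, bool) or not isinstance(x, int):
        return False
    lo, hi = bounds
    return lo <= x <= hi
```

A binding operating in the validity-enforcing mode might call such a check when reading or constructing a `Record`; in the other mode it would skip the check and treat `(i16 40000)` as an ordinary `Value`.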
+ +When designing a language mapping, an important consideration is +roundtripping: serialization after deserialization, and vice versa, +should both be identities. + +### JavaScript + + - `SignedInteger` ↔ numbers or `BigInt` [[1](https://developers.google.com/web/updates/2018/05/bigint), [2](https://github.com/tc39/proposal-bigint)] + - `String` ↔ strings + - `ByteString` ↔ `Uint8Array` + - `Symbol` ↔ `Symbol.for(...)` + - `Boolean` ↔ `Boolean` + - `Float` and `Double` ↔ numbers, + - `MIMEData` ↔ `{ "type": aString, "data": aUint8Array }` + - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors + - `(undefined)` ↔ the undefined value + - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production + - `Sequence` ↔ `Array` + - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true` + - `Dictionary` ↔ a `Map` + +### Scheme/Racket + + - `SignedInteger` ↔ exact numbers + - `String` ↔ strings + - `ByteString` ↔ byte vector (Racket: "Bytes") + - `Symbol` ↔ symbols + - `Boolean` ↔ booleans + - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats) + - `MIMEData` ↔ a structure with a `type` and a `data` field (Racket: `(struct mime (type data))`) + - `Record` ↔ structures (Racket: prefab struct) + - `Sequence` ↔ lists + - `Set` ↔ Racket: sets + - `Dictionary` ↔ Racket: hash-table + +### Java + + - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger` + - `String` ↔ `String` + - `ByteString` ↔ `byte[]` + - `Symbol` ↔ a simple data class wrapping a `String` + - `Boolean` ↔ `Boolean` + - `Float` and `Double` ↔ `Float` and `Double` + - `MIMEData` ↔ an implementation of `javax.activation.DataSource`, maybe? + - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping? 
+ - `Sequence` ↔ an implementation of `java.util.List` + - `Set` ↔ an implementation of `java.util.Set` + - `Dictionary` ↔ an implementation of `java.util.Map` + +### Erlang + + - `SignedInteger` ↔ integers + - `String` ↔ tuple of `utf8` and a binary + - `ByteString` ↔ a binary + - `Symbol` ↔ the underlying string converted to an Erlang atom, if + some kind of an "unsafe" mode is set on the decoder (because Erlang + atoms are not GC'd); otherwise perhaps a tuple of `symbol` and a + binary of the utf-8 + - `Boolean` ↔ `true` and `false` + - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision) + - `MIMEData` ↔ tuple of the type as a utf8 binary, and the data as a binary + - `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions + - `Sequence` ↔ a list + - `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?) + - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17) + +## Appendix. Table of lead byte values + + 0x - short form Record label index 0 + 1x - short form Record label index 1 + 2x - short form Record label index 2 + 3x - Record + 4x - Sequence + 5x - Set + 6x - Dictionary + (7x) RESERVED + 8x - SignedInteger + 9x - String + Ax - Bytes + Bx - Symbol + C0 - False + C1 - True + C2 - Float + C3 - Double + (Cx) RESERVED C4-CF + Dx - MIMEData + (Ex) RESERVED + (Fx) RESERVED + +## Appendix. Why not Just Use JSON? + + + +JSON offers *syntax* for numbers, strings, booleans, null, arrays and +string-keyed maps. However, it suffers from two major problems. First, +it offers no *semantics* for the syntax: it is left to each +implementation to determine how to treat each JSON term. This causes +[interoperability](http://seriot.ch/parsing_json.php) and even +[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html) +issues. 
Second, JSON's lack of support for type tags leads to awkward +and incompatible *encodings* of type information in terms of the fixed +suite of constructors on offer. + +There are other minor problems with JSON having to do with its syntax. +Examples include its relative verbosity and its lack of support for +binary data. + +### JSON syntax doesn't *mean* anything + +When are two JSON values the same? When are they different? + + +The specifications are largely silent on these questions. Different +JSON implementations give different answers. + +Specifically, JSON does not: + + - assign any meaning to numbers,[^meaning-ieee-double] + - determine how strings are to be compared,[^string-key-comparison] + - determine whether object key ordering is significant,[^json-member-ordering] or + - determine whether duplicate object keys are permitted, what it + would mean if they were, or how to determine a duplicate in the + first place.[^json-key-uniqueness] + +In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats] + + [^meaning-ieee-double]: + [Section 6 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-6) + does go so far as to indicate “good interoperability can be + achieved” by imagining that parsers are able reliably to + understand the syntax of numbers as denoting an IEEE 754 + double-precision floating-point value. + + [^string-key-comparison]: + [Section 8.3 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-8.3) + suggests that *if* an implementation compares strings used as + object keys “code unit by code unit”, then it will interoperate + with *other such implementations*, but neither requires this + behaviour nor discusses comparisons of strings used in other + contexts. 
+ + [^json-member-ordering]: + [Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4) + remarks that “[implementations] differ as to whether or not they + make the ordering of object members visible to calling software.” + + [^json-key-uniqueness]: + [Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4) + is the only place in the specification that mentions the issue. It + explicitly sanctions implementations supporting duplicate keys, + noting only that “when the names within an object are not unique, + the behavior of software that receives such an object is + unpredictable.” Implementations are free to choose any behaviour + at all in this situation, including signalling an error, or + discarding all but one of a set of duplicates. + + [^xml-infoset]: The XML world has the concept of + [XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely + speaking, XML infoset is the *denotation* of an XML document; the + *meaning* of the document. + + [^other-formats]: Most other recent data languages are like JSON in + specifying only a syntax with no associated semantics. While some + do make a sketch of a semantics, the result is often + underspecified (e.g. in terms of how strings are to be compared), + overly machine-oriented (e.g. treating 32-bit integers as + fundamentally distinct from 64-bit integers and from + floating-point numbers), overly fine (e.g. giving visibility to + the order in which map entries are written), or all three. + +Some examples: + + - are the JSON values `1`, `1.0`, and `1e0` the same or different? + - are the JSON values `1.0` and `1.0000000000000001` the same or different? + - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"` + (UTF-8 `7061cc88726f6e`) the same or different? + - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same + or different? + - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the + same? Are all three legal? 
+ - are `{"päron":1}` and `{"päron":1}` the same or different? + +### JSON can multiply nicely, but it can't add very well + +JSON includes a fixed set of types: numbers, strings, booleans, null, +arrays and string-keyed maps. Domain-specific data must be *encoded* +into these types. For example, dates and email addresses are often +represented as strings with an implicit internal structure. + +There is no convention for *labelling* a value as belonging to a +particular category. This makes it difficult to extract, say, all +email addresses, or all URLs, from an arbitrary JSON document. + +Instead, JSON-encoded data are often labelled in an ad-hoc way. +Multiple incompatible approaches exist. For example, a "money" +structure containing a `currency` field and an `amount` may be +represented in any number of ways: + + { "_type": "money", "currency": "EUR", "amount": 10 } + { "type": "money", "value": { "currency": "EUR", "amount": 10 } } + [ "money", { "currency": "EUR", "amount": 10 } ] + { "@money": { "currency": "EUR", "amount": 10 } } + +This causes particular problems when JSON is used to represent *sum* +or *union* types, such as "either a value or an error, but not both". +Again, multiple incompatible approaches exist. + +For example, imagine an API for depositing money in an account. The +response might be either a "success" response indicating the new +balance, or one of a set of possible errors. + +Sometimes, a *pair* of values is used, with `null` marking the option +not taken.[^interesting-failure-mode] + + { "ok": { "balance": 210 }, "error": null } + { "ok": null, "error": "Unauthorized" } + + [^interesting-failure-mode]: What is the meaning of a document where + both `ok` and `error` are non-null? What might happen when a + program is presented with such a document? 
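
As a rough sketch of what the pair-style encoding above demands of a consumer, the following Python function decodes it into a tagged tuple. The name `decode_pair` and its rejection policy are assumptions made for illustration only. JSON does not say which combinations of `ok` and `error` are legal, so every consumer has to pick a policy of its own.

```python
import json


def decode_pair(text):
    """Decode a pair-style response into ("ok", value) or ("error", value).

    Hypothetical helper: rejecting the "both" and "neither" cases is one
    policy among many, chosen here only to make the ambiguity visible.
    """
    doc = json.loads(text)
    ok, error = doc.get("ok"), doc.get("error")
    if ok is not None and error is not None:
        # The case the footnote asks about: nothing in JSON resolves it.
        raise ValueError("both 'ok' and 'error' are non-null")
    if ok is None and error is None:
        # Equally unspecified: no branch was taken at all.
        raise ValueError("neither 'ok' nor 'error' is set")
    return ("ok", ok) if ok is not None else ("error", error)


print(decode_pair('{ "ok": { "balance": 210 }, "error": null }'))
# prints ('ok', {'balance': 210})
print(decode_pair('{ "ok": null, "error": "Unauthorized" }'))
# prints ('error', 'Unauthorized')
```

Another consumer could just as reasonably prefer `error` whenever both are present, or treat a missing pair as success with an empty payload. Two such programs would disagree about the same document.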
+ +The branch not chosen is sometimes present, sometimes omitted as if it +were an optional field: + + { "ok": { "balance": 210 } } + { "error": "Unauthorized" } + +Sometimes, an array of a label and a value is used: + + [ "ok", { "balance": 210 } ] + [ "error", "Unauthorized" ] + +Sometimes, the shape of the data is sufficient to distinguish among +the alternatives, and the label is left implicit: + + { "balance": 210 } + "Unauthorized" + +JSON itself does not offer any guidance for which of these options to +choose. In many real cases on the web, poor choices have led to +encodings that are irrecoverably ambiguous. + +--- +--- + +# Open questions + +Q. Should "symbols" instead be URIs? Relative, usually; relative to +what? Some domain-specific base URI? + +Q. What about general rationals, subsuming integers and IEEE floats +(except NaN and the Infinities)? + +Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps] + + [^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as + they are? While they include binary data and sequences, and an + obvious equivalence for them exists, they lack numbers *per se* as + well as any kind of unordered structure such as sets or maps. In + addition, while "display hints" allow labelling of binary data + with an intended interpretation, they cannot be attached to any + other kind of structure, and the "hint" itself can only be a + binary blob. + +Q. Should `MIMEData` be a special syntax for `Record`s with a single +`ByteString` field? + +A. Not even. It should probably just be moved to the "conventions" +section. Compare: + + D5 BA text/plain hello -- using special MIMEData encoding + 32 BA text/plain A5 hello -- using bog standard type-labelled Record + +Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol` +label (recursive!?) and a single `String` field? + +Q. Should `String` be a special syntax for `(utf8 Bytes)`? Again, +recursiveness problems...? + +Q. 
Should `Dictionary` be a special syntax for etc etc.? `Set`? +`Float`? `Double`? + + --> Rule of thumb: if there's a special equivalence predicate for it, + it needs to be built-in syntax. Otherwise it can be a regular + record. (So: `Boolean` might not make the cut for special + treatment?? Likewise `String`...? Ugh those are psychologically + important perhaps) + +Q. Are the language mappings reasonable? How about one for Python? + +--- + +OK so. No built-in `MIMEData`, but maybe a conventional `(mime-data +Symbol Bytes)`? Applications can put it in a short slot if they like. + +Streaming: needed for variable-sized structures. Tricky to design +syntax for this that isn't gratuitously warty. End byte value. + +Literal small integers: could be nice? Not absolutely necessary. + +Give algorithm for computing size of integers. + +Give up on sorting requirement for representation of sets and +dictionaries?? Probably a good idea if there are streaming forms of +them because that sounds impossible to do?? + +Maybe reorder: fixed-length atoms first, then variable-length atoms, +then fixed-length compounds, then variable-length compounds? The reason +is that we could then perhaps put the streaming forms of the +variable-length ones very last. + +---