preserve.md based on codec.md which I'm about to check in

2018-09-23 14:37:20 +01:00 · 2018-09-23 14:37:20 +01:00 · 9255ce1a72
parent 9b4a4a2cc4
commit 9255ce1a72
1 changed files with 982 additions and 0 deletions
--- a/preserve.md
+++ b/preserve.md
@@ -0,0 +1,982 @@
+---
+---
+<style>
+body { padding-top: 2rem; font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", serif; max-width: 40em; margin: auto; font-size: 120%; }
+h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
+h2 { border-bottom: solid #4f81bd 1px; }
+pre, code { background-color: #eee; }
+pre { padding: 0.33rem; }
+</style>
+
+# Preserves: Semantic Serialization of Node-labelled Data
+
+       _________
+      <_________>   Tony Garnock-Jones <tonyg@leastfixedpoint.com>
+      |  FRμIT  |   September 2018
+      |Preserves|   Version 0.0.2
+      \_________/
+     
+
+  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
+  [spki]: http://world.std.com/~cme/html/spki.html
+  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
+  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
+
+## Introduction
+
+Most data serialization formats used on the web represent
+*edge-labelled* semi-structured data.
+
+This document proposes a data model and serialization format that
+takes a *node-labelled* approach.
+
+This makes it extensible and much more like S-expressions, so it can
+easily represent the *labelled sums of products* seen in Rust,
+Haskell, OCaml, and other functional programming languages.
+
+## Starting with Semantics
+
+Taking inspiration from functional programming, we start with a
+definition of the *values* that we want to work with and give them
+meaning independent of their syntax. We will treat syntax separately,
+later in this document.
+
+                          Value = Atom
+                                | Compound
+
+                           Atom = SignedInteger
+                                | String
+                                | ByteString
+                                | Symbol
+                                | Boolean
+                                | Float
+                                | Double
+                                | MIMEData
+
+                       Compound = Record
+                                | Sequence
+                                | Set
+                                | Dictionary
+
+Our `Value`s fall into two broad categories: *atomic* and *compound*
+data.[^zephyr-asdl]
+
+  [^zephyr-asdl]: This design was loosely inspired by S-expressions,
+    as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
+    and by the ML type system, as seen in languages such as SML,
+    OCaml, Haskell, Rust, and many others. It is also related to
+    Zephyr ASDL (h/t
+    [Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)),
+    which doesn't offer much in the way of atoms, but offers
+    general-purpose labelled sums and products. See D. C. Wang, A. W.
+    Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax
+    Description Language,” in USENIX Conference on Domain-Specific
+    Languages, 1997, pp. 213–228.
+    [PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf)
+
+**Total order.**<a name="total-order"></a> As we go, we will
+incrementally specify a total order over `Value`s. Two values of the
+same kind are compared using kind-specific rules. The ordering among
+values of different kinds is essentially arbitrary, but having a total
+order is convenient for many tasks, so we define it as
+follows:[^ordering-by-syntax]
+
+            (Values)        Compound < Atom
+
+            (Compounds)     Record < Sequence < Set < Dictionary
+
+            (Atoms)         SignedInteger < String < ByteString < Symbol
+                              < Boolean < Float < Double < MIMEData
+
+  [^ordering-by-syntax]: The observant reader may note that the
+    ordering here is the same as that implied by the tagging scheme
+    used in the concrete binary syntax for `Value`s.
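+
+For instance, taking one representative of each kind (written in the
+example notation introduced in the sections below) and reading the
+rules above from left to right, the following chain holds:
+
+    (void) < [] < #set{} < #dict{} < 0 < "" < #"" < a < #f < 0f < 0d
+           < #application/octet-stream""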
+
+**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
+neither is less than the other according to the total order.
+
+<!-- We should avoid unnecessary restrictions such as machine-oriented -->
+<!-- fixed-width integer or floating-point values where possible. -->
+
+### Signed integers.
+
+A `SignedInteger` is a signed integer of arbitrary width.
+`SignedInteger`s are compared as mathematical integers. We will write
+examples of `SignedInteger`s using standard mathematical notation.
+
+**Examples.** 10; -6; 0.
+
+**Non-examples.** NaN (the clue is in the name!); ∞ (not finite); 0.2
+(not an integer); 1/7 (likewise); 2+*i*3 (likewise); √2 (likewise).
+
+### Unicode strings.
+
+A `String` is a sequence of Unicode
+[code-point](http://www.unicode.org/glossary/#code_point)s. Two
+`String`s are compared lexicographically, code-point by
+code-point.[^utf8-is-awesome] We will write examples of `String`s as
+text surrounded by double-quotes “`"`”, using a monospace font.
+
+  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
+    gives the same result as a lexicographic byte-by-byte comparison
+    of the UTF-8 encoding of a string!
+
+**Examples.** `"Hello world"`, an eleven-code-point string; `"z水𝄞"`,
+the string containing the three Unicode code-points `z` (0x7A), `水`
+(0x6C34) and `𝄞` (0x1D11E); `""`, the empty string.
+
+**Normalization forms.** Unicode defines multiple
+[normalization forms](http://unicode.org/reports/tr15/) for text. No
+particular normalization form is required for `String`s;
+[see below](#normalization-forms).
+
+### Binary data.
+
+A `ByteString` is an ordered sequence of zero or more integers in the
+inclusive range [0..255]. `ByteString`s are compared
+lexicographically, byte by byte. We will only write examples of
+`ByteString`s that contain bytes mapping to printable ASCII
+characters, using “`#"`” as an opening quote mark and “`"`” as a
+closing quote mark.
+
+**Examples.** The `ByteString` containing the integers 65, 66 and 67
+(corresponding to ASCII characters `A`, `B` and `C`) is written as
+`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite
+appearances, these are *binary* data.
+
+### Symbols or identifiers.
+
+Programming languages like Lisp and Prolog frequently use string-like
+values called *symbols*. Here, a `Symbol` is, like a `String`, a
+sequence of Unicode code-points, intended to represent an identifier
+of some kind. `Symbol`s are also compared lexicographically by
+code-point. We will write examples including only non-empty sequences
+of non-whitespace characters, using a monospace font without quotation
+marks.
+
+**Examples.** `hello-world`; `utf8-string`; `exact-integer?`.
+
+### Booleans.
+
+There are exactly two `Boolean` values, “false” and “true”. The
+“false” value compares less-than the “true” value. We write `#f` for
+“false”, and `#t` for “true”.
+
+**Examples.** `#f`; `#t`.
+
+### IEEE floating-point values.
+
+A `Float` is a single-precision IEEE 754 floating-point value; a
+`Double` is a double-precision IEEE 754 floating-point value.
+`Float`s, `Double`s and `SignedInteger`s are considered disjoint, and
+so by the rules [above](#total-order), every `Float` is less than
+every `Double`, and every `SignedInteger` is less than both. Two
+`Float`s or two `Double`s are to be ordered by the `totalOrder`
+predicate defined in section 5.10 of
+[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
+We write examples using standard mathematical notation, avoiding NaN
+and infinities, using a suffix `f` or `d` to indicate `Float` or
+`Double`, respectively.
+
+**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d.
+
+**Non-examples.** 10, -6, and 0, because writing them this way
+indicates `SignedInteger`s, not `Float`s or `Double`s.
+
+### MIME-type tagged binary data.
+
+A `MIMEData` is a pair of a `Symbol` denoting a
+[media type](https://tools.ietf.org/html/rfc6838) and a `ByteString`
+body, intended to be interpreted as an encoding of a document having
+that media type. While each media type may define its own rules for
+comparing documents, we define ordering among `MIMEData`
+*representations* of such media types lexicographically over the
+(`Symbol`, `ByteString`) pair. We write examples using the same syntax
+as for byte strings, but with the media type `Symbol` sandwiched
+between the “`#`” and the first “`"`”.
+
+**Examples.** `#application/octet-stream""`; `#text/plain"ABC"`;
+`#application/xml"<xhtml/>"`; `#text/csv"123,234,345"`.
+
+### Records.
+
+A `Record` is a *labelled* tuple of zero or more `Value`s, called the
+record's *fields*. A record's label is, itself, a `Value`, though it
+will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
+are compared lexicographically as if they were just tuples; that is,
+first by their labels, and then by the remainder of their fields. We
+will only write examples of `Record`s having labels that are `Symbol`s
+entirely composed of ASCII characters. Such `Record`s will be written
+as a parenthesised, space-separated sequence of their label followed
+by their fields.
+
+  [^extensibility]: The [Racket](https://racket-lang.org/) programming
+    language defines
+    [“prefab”](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))
+    structure types, which map well to our `Record`s. Racket supports
+    record extensibility by encoding record supertypes into record
+    labels as specially-formatted lists.
+
+  [^iri-labels]: It is occasionally (but seldom) necessary to
+    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
+    label can be read as a relative IRI, it is notionally interpreted
+    with respect to the IRI
+    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
+    be read as an absolute IRI, it stands for that IRI; and otherwise,
+    it cannot be read as an IRI at all, and so the label simply stands
+    for itself - for its own `Value`.
+
+**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
+written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
+written `(void)`.
+
+### Sequences.
+
+A `Sequence` is a general-purpose, variable-length ordered sequence of
+zero or more `Value`s. `Sequence`s are compared lexicographically,
+appealing to the ordering on `Value`s for comparisons at each position
+in the `Sequence`s. We write examples space-separated, surrounded with
+square brackets.
+
+**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
+`SignedInteger`s 1, 2 and 3.
+
+### Sets.
+
+A `Set` is an unordered finite set of `Value`s. It contains no
+duplicate values, following the [equivalence relation](#equivalence)
+induced by the total order on `Value`s. Two `Set`s are compared by
+sorting their elements using the [total order](#total-order) and
+comparing the resulting sequences as `Sequence`s. We write examples
+space-separated, surrounded with curly braces, prefixed by `#set`.
+
+**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
+containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
+containing 4, the string `"hello"`, the record with label `void` and
+no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
+the set containing a `SignedInteger` and a `Float`, both denoting the
+number 1; `#set{#application/xml"<x/>" #application/xml"<x />"}`, a
+set containing two different `MIMEData`
+values.[^mimedata-xml-difference]
+
+  [^mimedata-xml-difference]: The two XML documents `<x/>` and `<x />`
+    differ by bytewise comparison, and thus yield different `MIMEData`
+    values, even though under the semantics of XML they denote
+    the same XML infoset.
+
+**Non-examples.** `#set{1 1 1}`, because it contains multiple
+equivalent `Value`s.
+
+### Dictionaries, hash-tables or maps.
+
+A `Dictionary` is an unordered finite collection of zero or more pairs
+of `Value`s. Each pair comprises a *key* and a *value*. Keys in a
+`Dictionary` must be pairwise distinct. Instances of `Dictionary` are
+compared by lexicographic comparison of the sequences resulting from
+ordering each `Dictionary`'s pairs in ascending order by key. Examples
+are written as a `#dict`-prefixed, curly-brace-surrounded sequence of
+space-separated key-value pairs, each written with a colon between the
+key and value.
+
+**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
+dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
+`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having
+a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
+values.
+
+**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
+keys; `#dict{[]:[] []:99}`, for the same reason.
+
+## Syntax
+
+Now that we have discussed `Value`s and their meanings, we may turn to
+techniques for *representing* `Value`s for communication or storage.
+
+The syntax we have used for the examples so far is inadequate in many
+ways, not least of which is that it cannot represent every `Value`.
+
+Separation of the meaning of a piece of syntax from the syntax itself
+opens the door to domain-specific syntaxes, all equivalent and
+interconvertible.[^asn1] With a robust semantic foundation,
+connections to other data languages can also be made.
+
+  [^asn1]: Those who remember
+    [ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)
+    will recall BER, DER, PER, CER, XER and so on, each appropriate to
+    a different setting. Similarly,
+    [Rivest's S-Expression design][sexp.txt] offers a human-friendly
+    syntax, a syntax robust to network-induced message corruption, and
+    an unambiguous, simple and easily-parsed machine-friendly syntax
+    for the same underlying values.
+
+### Binary syntax
+
+For now, we limit our attention to an easily-parsed, easily-produced
+machine-readable syntax.
+
+Every `Value` is represented as one or more bytes describing first its
+kind and its length, and then its specific contents.
+
+For a value `v`, we write `[[v]]` for the encoding of `v`.
+
+The following figure summarises the definitions below:
+
+    tt nn mmmm  varint(m)  contents
+    -------------------------------
+
+    00 00 mmmm  ...        application-specific Record
+    00 01 mmmm  ...        application-specific Record
+    00 10 mmmm  ...        application-specific Record
+    00 11 mmmm  ...        Record
+
+    01 00 mmmm  ...        Sequence
+    01 01 mmmm  ...        Set
+    01 10 mmmm  ...        Dictionary
+
+    10 00 mmmm  ...        SignedInteger, big-endian binary
+    10 01 mmmm  ...        String, UTF-8 binary
+    10 10 mmmm  ...        ByteString
+    10 11 mmmm  ...        Symbol, UTF-8 binary
+
+    11 00 0000             False
+    11 00 0001             True
+    11 00 0010             Float, 32 bits big-endian binary
+    11 00 0011             Double, 64 bits big-endian binary
+
+    11 01 mmmm ...         MIME-type-labelled binary data
+
+    If mmmm = 1111, varint(m) is present; otherwise, m is the length
+
+#### Type and Length representation
+
+A `Value`'s type and length are represented by use of a function
+`header(t,n,m)` that yields a sequence of bytes when `t`, `n` and `m`
+are appropriate non-negative integers.
+
+    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
+                    or leadbyte(t,n,15) ++ varint(m)   otherwise
+
+The lead byte in a `Value`'s representation is constructed by a function
+
+    leadbyte(t,n,m) = [t*64 + n*16 + m]
+
+The lead byte describes the rest of the representation as
+follows:[^some-encodings-unused]
+
+    leadbyte(0,-,-) represents a Record
+    leadbyte(1,-,-) represents a Sequence, Set or Dictionary
+    leadbyte(2,-,-) represents an Atom with variable-length binary representation
+    leadbyte(3,0,-) represents an Atom with fixed-length binary representation
+    leadbyte(3,1,-) represents certain special variable-length values
+
+  [^some-encodings-unused]: Some encodings are unused. All such
+    encodings are reserved for future versions of this specification.
+
+Variable-length representations use the value of `m` to encode their
+lengths:
+
+ - Lengths between 0 and 14 are represented using `leadbyte` with `m`
+   values 0 through 14.
+ - Lengths of 15 or greater are represented by `m` value 15, and
+   additional "length bytes" describing the length then follow the
+   lead byte.
+
+These additional length bytes are formatted as
+[base 128 varints][varint]. Quoting the
+[Google Protocol Buffers][varint] definition,
+
+> Each byte in a varint, except the last byte, has the most
+> significant bit (msb) set – this indicates that there are further
+> bytes to come. The lower 7 bits of each byte are used to store the
+> two's complement representation of the number in groups of 7 bits,
+> least significant group first.
+
+**Examples.**
+
+ - The varint representation of 15 is just the byte 15.
+ - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
+ - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
+
+We write `varint(m)` for the varint-encoding of `m`.
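+
+As a worked illustration, the varint for 300 can be derived one
+7-bit group at a time, and then combined with `leadbyte` to form
+complete headers. These cases are worked by hand from the definitions
+above rather than taken from a reference implementation:
+
+    300 = 2*128 + 44
+    low group  44 = 0101100, more bytes follow  ->  128 + 44 = 172
+    high group  2 = 0000010, last byte          ->             2
+    varint(300)   = [172, 2] = [0xAC, 0x02]
+
+    header(2,1,5)   = leadbyte(2,1,5)                 = [0x95]
+    header(2,1,14)  = leadbyte(2,1,14)                = [0x9E]
+    header(2,1,15)  = leadbyte(2,1,15) ++ varint(15)  = [0x9F, 0x0F]
+    header(2,1,300) = leadbyte(2,1,15) ++ varint(300) = [0x9F, 0xAC, 0x02]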
+
+#### Records
+
+    [[ (L F_1 ... F_m) ]] = header(0,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]
+
+For `m` fields, `m+1` is supplied to `header`, to account for the
+encoding of the record label.
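+
+As a worked illustration, consider the `Record` `(foo 1 2 3)` from
+the examples earlier, assuming no short form label has been assigned
+to `foo`. It has three fields, so `header` receives 4. The bytes for
+the `Symbol` label and the `SignedInteger` fields follow the
+lead-byte figure above and the atom encodings defined below:
+
+    [[ (foo 1 2 3) ]]
+        = header(0,3,4) ++ [[foo]] ++ [[1]] ++ [[2]] ++ [[3]]
+        = [0x34] ++ [0xB3, 0x66, 0x6F, 0x6F]
+                 ++ [0x81, 0x01] ++ [0x81, 0x02] ++ [0x81, 0x03]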
+
+##### Application-specific short form for labels
+
+Any given protocol using Preserves may additionally define an
+interpretation for `n ∈ {0,1,2}`, mapping each *short form label
+number* `n` to a specific record label. When encoding `m` fields with
+short form label number `n`, the header is `header(0,n,m)` (rather
+than `m+1`) since the label is implicit.
+
+**Examples.** A protocol may choose to map records
+labelled `void` to `n=0`, making
+
+    [[(void)]] = header(0,0,0) = [0x00]
+
+or it may map records labelled `person` to short form label number 1,
+making
+
+    [[(person "Dr" "Elizabeth" "Blackwell")]]
+        = header(0,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+        =        [0x13] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+
+#### Sequences, Sets and Dictionaries
+
+    [[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
+
+    [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[Y_1]] ++ ... ++ [[Y_m]]
+        where [Y_1 ... Y_m] = sort([X_1 ... X_m])
+
+    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
+      = header(1,2,m) ++ [[K'_1]] ++ [[V'_1]] ++ ... ++ [[K'_m]] ++ [[V'_m]]
+          where [[K'_1 V'_1] ... [K'_m V'_m]]
+                  = sort([[K_1 V_1] ... [K_m V_m]])
+
+Note that `n=3` is unused and reserved.
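+
+Because elements and keys are sorted before encoding, the order in
+which a `Set` or `Dictionary` is written down does not affect its
+encoding. A worked illustration, derived by hand from the definitions
+above (with `m` counting key-value pairs for the `Dictionary`, as
+written there):
+
+    [[ #set{3 1 2} ]] = [[ #set{1 2 3} ]]
+        = header(1,1,3) ++ [[1]] ++ [[2]] ++ [[3]]
+        = [0x53, 0x81, 0x01, 0x81, 0x02, 0x81, 0x03]
+
+    [[ #dict{b:2 a:1} ]] = [[ #dict{a:1 b:2} ]]
+        = header(1,2,2) ++ [[a]] ++ [[1]] ++ [[b]] ++ [[2]]
+        = [0x62, 0xB1, 0x61, 0x81, 0x01, 0xB1, 0x62, 0x81, 0x02]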
+
+#### Variable-length Atoms
+
+##### SignedInteger
+
+    [[ x ]] when x ∈ SignedInteger = header(2,0,m) ++ intbytes(x)
+      where           m = |intbytes(x)|
+        and intbytes(x) = a big-endian two's-complement representation
+                          of the signed integer x, taking exactly as
+                          many whole bytes as needed to unambiguously
+                          identify the value
+
+For example,
+
+    [[   -257 ]] = [0x82, 0xFE, 0xFF]
+    [[   -256 ]] = [0x82, 0xFF, 0x00]
+    [[   -255 ]] = [0x82, 0xFF, 0x01]
+    [[   -254 ]] = [0x82, 0xFF, 0x02]
+    [[   -129 ]] = [0x82, 0xFF, 0x7F]
+    [[   -128 ]] = [0x81, 0x80]
+    [[   -127 ]] = [0x81, 0x81]
+    [[     -2 ]] = [0x81, 0xFE]
+    [[     -1 ]] = [0x81, 0xFF]
+    [[      0 ]] = [0x80]
+    [[      1 ]] = [0x81, 0x01]
+    [[    127 ]] = [0x81, 0x7F]
+    [[    128 ]] = [0x82, 0x00, 0x80]
+    [[    255 ]] = [0x82, 0x00, 0xFF]
+    [[    256 ]] = [0x82, 0x01, 0x00]
+    [[  32767 ]] = [0x82, 0x7F, 0xFF]
+    [[  32768 ]] = [0x83, 0x00, 0x80, 0x00]
+    [[  65535 ]] = [0x83, 0x00, 0xFF, 0xFF]
+    [[  65536 ]] = [0x83, 0x01, 0x00, 0x00]
+    [[ 131072 ]] = [0x83, 0x02, 0x00, 0x00]
+
+##### String
+
+    [[ S ]] when S ∈ String = header(2,1,m) ++ utf8(S)
+      where       m = |utf8(S)|
+        and utf8(S) = the UTF-8 encoding of S
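+
+Note that `m` counts UTF-8 bytes, not code points. As a worked
+illustration, the three-code-point string from the examples earlier
+occupies 1 + 3 + 4 = 8 bytes in UTF-8:
+
+    [[ "z水𝄞" ]] = header(2,1,8) ++ utf8("z水𝄞")
+               = [0x98, 0x7A, 0xE6, 0xB0, 0xB4, 0xF0, 0x9D, 0x84, 0x9E]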
+
+##### ByteString
+
+    [[ B ]] when B ∈ ByteString = header(2,2,m) ++ B
+                        where m = |B|
+
+##### Symbol
+
+    [[ S ]] when S ∈ Symbol = header(2,3,m) ++ utf8(S)
+      where       m = |utf8(S)|
+        and utf8(S) = the UTF-8 encoding of S
+
+#### Fixed-length Atoms
+
+##### Booleans
+
+    [[ #f ]] = header(3,0,0) = [0xC0]
+    [[ #t ]] = header(3,0,1) = [0xC1]
+
+##### Floats and Doubles
+
+    [[ F ]] when F ∈ Float  = header(3,0,2) ++ binary32(F)
+    [[ D ]] when D ∈ Double = header(3,0,3) ++ binary64(D)
+      where binary32(F) and binary64(D) are big-endian 4- and 8-byte
+            IEEE 754 binary representations
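+
+As a worked illustration, here are two of the values used as examples
+in the floating-point section earlier, with the IEEE 754 bit patterns
+derived by hand from sign, biased exponent and fraction:
+
+    [[  10f ]] = header(3,0,2) ++ binary32(10)
+               = [0xC2, 0x41, 0x20, 0x00, 0x00]
+    [[ 0.5d ]] = header(3,0,3) ++ binary64(0.5)
+               = [0xC3, 0x3F, 0xE0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]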
+
+#### Special variable-length values
+
+##### MIMEData
+
+Each `MIMEData` value comprises a media type `Symbol` and a raw
+binary body.
+
+    [[ M ]] when M ∈ MIMEData = header(3,1,m) ++ [[T]] ++ B
+      where m = |B|
+        and T is the Symbol media type of M
+        and B is the ByteString body of M
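+
+As a worked illustration, derived by hand from the definition above:
+`m` is the length of the body only, while the media type carries its
+own header as an ordinary `Symbol` (here ten bytes long).
+
+    [[ #text/plain"ABC" ]]
+        = header(3,1,3) ++ [[text/plain]] ++ [0x41, 0x42, 0x43]
+        = [0xD3] ++ [0xBA, 0x74, 0x65, 0x78, 0x74, 0x2F,
+                           0x70, 0x6C, 0x61, 0x69, 0x6E]
+                 ++ [0x41, 0x42, 0x43]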
+
+## Examples
+
+<!-- TODO: Give some examples of large and small Preserves, perhaps -->
+<!-- translated from various JSON blobs floating around the internet. -->
+
+For the following examples, imagine an application that maps `Record`
+short form label number 0 to label `discard`, 1 to `capture`, and 2 to
+`observe`.
+
+| Value                                                              | Encoded hexadecimal byte sequence                  |
+|--------------------------------------------------------------------|----------------------------------------------------|
+| `(capture (discard))`                                              | 11 00                                              |
+| `(observe (speak (discard) (capture (discard))))`                  | 21 33 B5 73 70 65 61 6B 00 11 00                   |
+| `[1 2 3 4]`                                                        | 44 81 01 81 02 81 03 81 04                         |
+| `[-2 -1 0 1]`                                                      | 44 81 FE 81 FF 80 81 01                            |
+| `["hello" there #"world" [] #set{} #t #f]`                         | 47 95 68 65 6C 6C 6F B5 74 68 65 72 65 A5 77 6F 72 6C 64 40 50 C1 C0 |
+| `-257`                                                             | 82 FE FF                                           |
+| `-1`                                                               | 81 FF                                              |
+| `0`                                                                | 80                                                 |
+| `1`                                                                | 81 01                                              |
+| `255`                                                              | 82 00 FF                                           |
+| `1f`                                                               | C2 3F 80 00 00                                     |
+| `1d`                                                               | C3 3F F0 00 00 00 00 00 00                         |
+| `-1.202e300d`                                                      | C3 FE 3C B7 B7 59 BF 04 26                         |
+
+Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Value`
+
+    ([titled person 2 thing 1]
+       101
+       "Blackwell"
+       (date 1821 2 3)
+       "Dr")
+
+encodes to
+
+    35                              ;; Record, generic, 4+1
+      45                              ;; Sequence, 5
+        b6 74 69 74 6c 65 64            ;; Symbol, "titled"
+        b6 70 65 72 73 6f 6e            ;; Symbol, "person"
+        81 02                           ;; SignedInteger, "2"
+        b5 74 68 69 6e 67               ;; Symbol, "thing"
+        81 01                           ;; SignedInteger, "1"
+      81 65                           ;; SignedInteger, "101"
+      99 42 6c 61 63 6b 77 65 6c 6c   ;; String, "Blackwell"
+      34                              ;; Record, generic, 3+1
+        b4 64 61 74 65                  ;; Symbol, "date"
+        82 07 1d                        ;; SignedInteger, "1821"
+        81 02                           ;; SignedInteger, "2"
+        81 03                           ;; SignedInteger, "3"
+      92 44 72                        ;; String, "Dr"
+
+  [^extensibility2]: It happens to line up with Racket's
+    representation of a record label for an inheritance hierarchy
+    where `titled` extends `person` extends `thing`:
+
+        (struct date (year month day) #:prefab)
+        (struct thing (id) #:prefab)
+        (struct person thing (name date-of-birth) #:prefab)
+        (struct titled person (title) #:prefab)
+
+## Conventions for Common Data Types
+
+The `Value` data type is essentially an S-Expression, able to
+represent semi-structured data over `ByteString`, `String`,
+`SignedInteger` atoms and so on.
+
+However, users need a wide variety of data types for representing
+domain-specific values such as various kinds of encoded and normalized
+text, calendrical values, machine words, and so on.
+
+We use appropriately-labelled `Record`s to denote these
+domain-specific data types.
+
+All of these conventions are optional. They form a layer atop the core
+`Value` structure. Non-domain-specific tools do not in general need to
+treat them specially.
+
+**Validity.** Many of the labels we will describe in this section come
+  with side-conditions on the contents of labelled `Record`s. It is
+  possible to construct an instance of `Value` that violates these
+  side-conditions without ceasing to be a `Value` or becoming
+  unrepresentable. However, we say that such a `Value` is *invalid*
+  because it fails to honour the necessary side-conditions.
+  Implementations *SHOULD* allow two modes of working: one which
+  treats all `Value`s identically, without regard for side-conditions,
+  and one which enforces validity (i.e. side-conditions) when reading,
+  writing, or constructing `Value`s.
+
+### Text
+
+#### Normalization forms
+
+In order for users to unambiguously signal or require a particular
+[normalization form](http://unicode.org/reports/tr15/), we define a
+`NormalizedString`, which is a `Record` labelled with
+`unicode-normalization` and having two fields, the first of which is a
+`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
+`nfkc`, `nfkd`), and the second of which is a `String` whose
+underlying code point representation *MUST* be normalized according to
+the named normalization form.
+
+#### IRIs (URIs, URLs, URNs, etc.)
+
+An `IRI` is a `Record` labelled with `iri` and having one field, a
+`String` which is the IRI itself and which *MUST* be a valid absolute
+or relative IRI.
+
+### Machine words
+
+The definition of `SignedInteger` captures all integers. However, in
+certain circumstances it can be valuable to assert that a number
+inhabits a particular range, such as a fixed-width machine word.
+
+The labels `i`*n* and `u`*n*, for *n* ∈ {16,32,64}, denote
+*n*-bit-wide signed and unsigned range restrictions, respectively.
+Records with these labels *MUST* have one field, a `SignedInteger`,
+which *MUST* fall within the appropriate range. That is, to be valid,
+ - in `(i16 `*x*`)`, -32768 <= *x* <= 32767.
+ - in `(u16 `*x*`)`, 0 <= *x* <= 65535.
+ - in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647.
+ - etc.
+
+### Anonymous Tuples and Unit
+
+A `Tuple` is a `Record` with label `tuple` and zero or more fields,
+denoting an anonymous tuple of values.
+
+The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called
+"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).
+
+### Null and Undefined
+
+Tony Hoare's
+"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
+can be represented with the 0-ary `Record` `(null)`. An "undefined"
+value can be represented as `(undefined)`.
+
+### Dates and Times
+
+Dates, times, moments, and timestamps can be represented with a
+`Record` with label `rfc3339` having a single field, a `String`, which
+*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
+or `date-time` productions of
+[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
+
+## Representing Values in Programming Languages
+
+We have given a definition of `Value` and its semantics, and proposed
+a concrete syntax for communicating and storing `Value`s. We now turn
+to **suggested** representations of `Value`s as *programming-language
+values* for various programming languages.
+
+When designing a language mapping, an important consideration is
+roundtripping: serialization after deserialization, and vice versa,
+should both be identities.
+
+### JavaScript
+
+ - `SignedInteger` ↔ numbers or `BigInt` [[1](https://developers.google.com/web/updates/2018/05/bigint), [2](https://github.com/tc39/proposal-bigint)]
+ - `String` ↔ strings
+ - `ByteString` ↔ `Uint8Array`
+ - `Symbol` ↔ `Symbol.for(...)`
+ - `Boolean` ↔ `Boolean`
+ - `Float` and `Double` ↔ numbers
+ - `MIMEData` ↔ `{ "type": aString, "data": aUint8Array }`
+ - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
+    - `(undefined)` ↔ the undefined value
+    - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
+ - `Sequence` ↔ `Array`
+ - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
+ - `Dictionary` ↔ a `Map`
+
+### Scheme/Racket
+
+ - `SignedInteger` ↔ exact numbers
+ - `String` ↔ strings
+ - `ByteString` ↔ byte vector (Racket: "Bytes")
+ - `Symbol` ↔ symbols
+ - `Boolean` ↔ booleans
+ - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
+ - `MIMEData` ↔ a structure with a `type` and a `data` field (Racket: `(struct mime (type data))`)
+ - `Record` ↔ structures (Racket: prefab struct)
+ - `Sequence` ↔ lists
+ - `Set` ↔ Racket: sets
+ - `Dictionary` ↔ Racket: hash-table
+
+### Java
+
+ - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
+ - `String` ↔ `String`
+ - `ByteString` ↔ `byte[]`
+ - `Symbol` ↔ a simple data class wrapping a `String`
+ - `Boolean` ↔ `Boolean`
+ - `Float` and `Double` ↔ `Float` and `Double`
+ - `MIMEData` ↔ an implementation of `javax.activation.DataSource`, maybe?
+ - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
+ - `Sequence` ↔ an implementation of `java.util.List`
+ - `Set` ↔ an implementation of `java.util.Set`
+ - `Dictionary` ↔ an implementation of `java.util.Map`
+
+### Erlang
+
+ - `SignedInteger` ↔ integers
+ - `String` ↔ tuple of `utf8` and a binary
+ - `ByteString` ↔ a binary
+ - `Symbol` ↔ the underlying string converted to an Erlang atom, if
+   some kind of an "unsafe" mode is set on the decoder (because Erlang
+   atoms are not GC'd); otherwise perhaps a tuple of `symbol` and a
+   binary of the utf-8
+ - `Boolean` ↔ `true` and `false`
+ - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
+ - `MIMEData` ↔ tuple of the type as a utf8 binary, and the data as a binary
+ - `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions
+ - `Sequence` ↔ a list
+ - `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
+ - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP 17)
+
+## Appendix. Table of lead byte values
+
+     0x - short form Record label index 0
+     1x - short form Record label index 1
+     2x - short form Record label index 2
+     3x - Record
+     4x - Sequence
+     5x - Set
+     6x - Dictionary
+    (7x)  RESERVED
+     8x - SignedInteger
+     9x - String
+     Ax - ByteString
+     Bx - Symbol
+     C0 - False
+     C1 - True
+     C2 - Float
+     C3 - Double
+    (Cx)  RESERVED C4-CF
+     Dx - MIMEData
+    (Ex)  RESERVED
+    (Fx)  RESERVED
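+
+The table can be read as a two-level dispatch: the high nibble selects
+the kind of value, except in the `Cx` row, where the whole byte
+matters. A minimal Python sketch of that lookup follows. The names
+`KIND_BY_HIGH_NIBBLE`, `KIND_BY_C_ROW_BYTE` and `classify_lead_byte`
+are hypothetical, and the meaning of the low nibble in the other rows
+is deliberately ignored here.
+
```python
# Kind selected by the high nibble of the lead byte (rows 0x-Fx of the table).
# Nibbles 7, E and F are absent because the table marks them RESERVED.
KIND_BY_HIGH_NIBBLE = {
    0x0: "short form Record label index 0",
    0x1: "short form Record label index 1",
    0x2: "short form Record label index 2",
    0x3: "Record",
    0x4: "Sequence",
    0x5: "Set",
    0x6: "Dictionary",
    0x8: "SignedInteger",
    0x9: "String",
    0xA: "Bytes",
    0xB: "Symbol",
    0xD: "MIMEData",
}

# In the Cx row the whole byte is significant; C4-CF are RESERVED.
KIND_BY_C_ROW_BYTE = {
    0xC0: "False",
    0xC1: "True",
    0xC2: "Float",
    0xC3: "Double",
}


def classify_lead_byte(lead: int) -> str:
    """Return the table entry for a lead byte, or "RESERVED"."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead byte must be in the range 0x00-0xFF")
    high = lead >> 4
    if high == 0xC:
        return KIND_BY_C_ROW_BYTE.get(lead, "RESERVED")
    return KIND_BY_HIGH_NIBBLE.get(high, "RESERVED")
```
+
+A decoder would presumably reject any byte classified as `RESERVED`
+rather than guess at its meaning.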
+
+## Appendix. Why not Just Use JSON?
+
+<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
+
+JSON offers *syntax* for numbers, strings, booleans, null, arrays and
+string-keyed maps. However, it suffers from two major problems. First,
+it offers no *semantics* for the syntax: it is left to each
+implementation to determine how to treat each JSON term. This causes
+[interoperability](http://seriot.ch/parsing_json.php) and even
+[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
+issues. Second, JSON's lack of support for type tags leads to awkward
+and incompatible *encodings* of type information in terms of the fixed
+suite of constructors on offer.
+
+There are other minor problems with JSON having to do with its syntax.
+Examples include its relative verbosity and its lack of support for
+binary data.
+
+### JSON syntax doesn't *mean* anything
+
+When are two JSON values the same? When are they different?
+<!-- When is one JSON value "less than" another? -->
+
+The specifications are largely silent on these questions. Different
+JSON implementations give different answers.
+
+Specifically, JSON does not:
+
+ - assign any meaning to numbers,[^meaning-ieee-double]
+ - determine how strings are to be compared,[^string-key-comparison]
+ - determine whether object key ordering is significant,[^json-member-ordering] or
+ - determine whether duplicate object keys are permitted, what it
+   would mean if they were, or how to determine a duplicate in the
+   first place.[^json-key-uniqueness]
+
+In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
+
+  [^meaning-ieee-double]:
+    [Section 6 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-6)
+    does go so far as to indicate “good interoperability can be
+    achieved” by imagining that parsers are able reliably to
+    understand the syntax of numbers as denoting an IEEE 754
+    double-precision floating-point value.
+
+  [^string-key-comparison]:
+    [Section 8.3 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-8.3)
+    suggests that *if* an implementation compares strings used as
+    object keys “code unit by code unit”, then it will interoperate
+    with *other such implementations*, but neither requires this
+    behaviour nor discusses comparisons of strings used in other
+    contexts.
+
+  [^json-member-ordering]:
+    [Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
+    remarks that “[implementations] differ as to whether or not they
+    make the ordering of object members visible to calling software.”
+
+  [^json-key-uniqueness]:
+    [Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
+    is the only place in the specification that mentions the issue. It
+    explicitly sanctions implementations supporting duplicate keys,
+    noting only that “when the names within an object are not unique,
+    the behavior of software that receives such an object is
+    unpredictable.” Implementations are free to choose any behaviour
+    at all in this situation, including signalling an error, or
+    discarding all but one of a set of duplicates.
+
+  [^xml-infoset]: The XML world has the concept of
+    [XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
+    speaking, XML infoset is the *denotation* of an XML document; the
+    *meaning* of the document.
+
+  [^other-formats]: Most other recent data languages are like JSON in
+    specifying only a syntax with no associated semantics. While some
+    do make a sketch of a semantics, the result is often
+    underspecified (e.g. in terms of how strings are to be compared),
+    overly machine-oriented (e.g. treating 32-bit integers as
+    fundamentally distinct from 64-bit integers and from
+    floating-point numbers), overly fine (e.g. giving visibility to
+    the order in which map entries are written), or all three.
+
+Some examples:
+
+ - are the JSON values `1`, `1.0`, and `1e0` the same or different?
+ - are the JSON values `1.0` and `1.0000000000000001` the same or different?
+ - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
+   (UTF-8 `7061cc88726f6e`) the same or different?
+ - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
+   or different?
+ - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
+   same? Are all three legal?
+ - are `{"päron":1}` and `{"päron":1}` the same or different?
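+
+As one data point, the following Python sketch records the answers
+that CPython's standard `json` module happens to give to these
+questions. These are the choices of a single implementation, not
+answers mandated by JSON; other parsers are free to differ. The
+strings are written with `\u` escapes so that the two spellings of
+`päron` cannot be confused.
+
```python
import json
import unicodedata

# 1, 1.0 and 1e0 compare equal, but are not all of the same Python type.
one_int = json.loads("1")
one_float = json.loads("1.0")
one_exp = json.loads("1e0")
numbers_compare_equal = one_int == one_float == one_exp
numbers_same_type = type(one_int) is type(one_float)

# Both literals are read as the same IEEE 754 double.
long_fraction_equals_one = json.loads("1.0") == json.loads("1.0000000000000001")

# Precomposed (U+00E4) versus decomposed (a + U+0308) spellings of "päron".
precomposed = json.loads(r'"p\u00e4ron"')
decomposed = json.loads(r'"pa\u0308ron"')
strings_equal = precomposed == decomposed
strings_equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Member order is not visible once the objects have become dicts.
objects_equal_ignoring_order = (
    json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
)

# Duplicate keys are accepted silently; only the last value is kept.
duplicate_keys_result = json.loads('{"a":1, "a":2}')

# The two spellings of "päron" are treated as two distinct keys.
distinct_key_count = len(json.loads(r'{"p\u00e4ron":1, "pa\u0308ron":2}'))
```
+
+Here `numbers_compare_equal` is true while `numbers_same_type` is
+false, and `strings_equal` is false while `strings_equal_after_nfc` is
+true: even within one implementation, the answer depends on which
+notion of "same" the program asks about.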
+
+### JSON can multiply nicely, but it can't add very well
+
+JSON includes a fixed set of types: numbers, strings, booleans, null,
+arrays and string-keyed maps. Domain-specific data must be *encoded*
+into these types. For example, dates and email addresses are often
+represented as strings with an implicit internal structure.
+
+There is no convention for *labelling* a value as belonging to a
+particular category. This makes it difficult to extract, say, all
+email addresses, or all URLs, from an arbitrary JSON document.
+
+Instead, JSON-encoded data are often labelled in an ad-hoc way.
+Multiple incompatible approaches exist. For example, a "money"
+structure containing a `currency` field and an `amount` may be
+represented in any number of ways:
+
+    { "_type": "money", "currency": "EUR", "amount": 10 }
+    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
+    [ "money", { "currency": "EUR", "amount": 10 } ]
+    { "@money": { "currency": "EUR", "amount": 10 } }
+
+This causes particular problems when JSON is used to represent *sum*
+or *union* types, such as "either a value or an error, but not both".
+Again, multiple incompatible approaches exist.
+
+For example, imagine an API for depositing money in an account. The
+response might be either a "success" response indicating the new
+balance, or one of a set of possible errors.
+
+Sometimes, a *pair* of values is used, with `null` marking the option
+not taken.[^interesting-failure-mode]
+
+    { "ok": { "balance": 210 }, "error": null }
+    { "ok": null, "error": "Unauthorized" }
+
+  [^interesting-failure-mode]: What is the meaning of a document where
+    both `ok` and `error` are non-null? What might happen when a
+    program is presented with such a document?
+
+The branch not taken is not always present. Sometimes it is omitted
+entirely, as if it were an optional field:
+
+    { "ok": { "balance": 210 } }
+    { "error": "Unauthorized" }
+
+Sometimes, an array of a label and a value is used:
+
+    [ "ok", { "balance": 210 } ]
+    [ "error", "Unauthorized" ]
+
+Sometimes, the shape of the data is sufficient to distinguish among
+the alternatives, and the label is left implicit:
+
+    { "balance": 210 }
+    "Unauthorized"
+
+JSON itself does not offer any guidance for which of these options to
+choose. In many real cases on the web, poor choices have led to
+encodings that are irrecoverably ambiguous.
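+
+To make the cost concrete, here is a sketch of what a client has to
+write if it cannot know in advance which of the encodings above a
+server uses for the deposit response. The function name
+`decode_deposit_response` and the normalized result shape (a pair of a
+label and a payload) are assumptions made for this illustration only.
+
```python
from typing import Any, Tuple


def decode_deposit_response(doc: Any) -> Tuple[str, Any]:
    """Normalize a parsed response to ("ok", payload) or ("error", message)."""
    # Array of a label and a value: ["ok", {...}] or ["error", "..."].
    if isinstance(doc, list) and len(doc) == 2 and doc[0] in ("ok", "error"):
        return (doc[0], doc[1])

    # Implicit label: a bare string is taken to be an error message.
    if isinstance(doc, str):
        return ("error", doc)

    if isinstance(doc, dict):
        # Pair of fields, with the branch not taken either null or omitted.
        if "ok" in doc or "error" in doc:
            ok, error = doc.get("ok"), doc.get("error")
            if ok is not None and error is not None:
                raise ValueError("both 'ok' and 'error' carry a value")
            if ok is not None:
                return ("ok", ok)
            if error is not None:
                return ("error", error)
            raise ValueError("neither 'ok' nor 'error' carries a value")
        # Implicit label: an object with a balance is taken to be a success.
        if "balance" in doc:
            return ("ok", doc)

    raise ValueError("unrecognized response shape")
```
+
+Even this decoder has to guess. An implicitly labelled success payload
+that happened to contain a field named `error` would be misread as the
+pair encoding, and nothing in the JSON itself can rule that out.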
+
+---
+---
+
+# Open questions
+
+Q. Should "symbols" instead be URIs? Relative, usually; relative to
+what? Some domain-specific base URI?
+
+Q. What about general rationals, subsuming integers and IEEE floats
+(except NaN and the Infinities)?
+
+Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps]
+
+  [^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as
+    they are? While they include binary data and sequences, and an
+    obvious equivalence for them exists, they lack numbers *per se* as
+    well as any kind of unordered structure such as sets or maps. In
+    addition, while "display hints" allow labelling of binary data
+    with an intended interpretation, they cannot be attached to any
+    other kind of structure, and the "hint" itself can only be a
+    binary blob.
+
+Q. Should `MIMEData` be a special syntax for `Record`s with a single
+`ByteString` field?
+
+A. Not even. It should probably just be moved to the "conventions"
+section. Compare:
+
+    D5 BA text/plain    hello   -- using special MIMEData encoding
+    32 BA text/plain A5 hello   -- using bog standard type-labelled Record
+
+Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
+label (recursive!?) and a single `String` field?
+
+Q. Should `String` be a special syntax for `(utf8 Bytes)`? Again,
+recursiveness problems...?
+
+Q. Should `Dictionary` be a special syntax for etc etc.? `Set`?
+`Float`? `Double`?
+
+ --> Rule of thumb: if there's a special equivalence predicate for it,
+     it needs to be built-in syntax. Otherwise it can be a regular
+     record. (So: `Boolean` might not make the cut for special
+     treatment?? Likewise `String`...? Ugh those are psychologically
+     important perhaps)
+
+Q. Are the language mappings reasonable? How about one for Python?
+
+---
+
+OK so. No built-in `MIMEData`, but maybe a conventional `(mime-data
+Symbol Bytes)`? Applications can put it in a short slot if they like.
+
+Streaming: needed for variable-sized structures. Tricky to design
+syntax for this that isn't gratuitously warty. End byte value.
+
+Literal small integers: could be nice? Not absolutely necessary.
+
+Give algorithm for computing size of integers.
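+
+One candidate for that algorithm, as a Python sketch. It assumes
+integers are written in two's complement using the smallest whole
+number of bytes, and that zero still occupies one byte; both
+assumptions need checking against the `SignedInteger` encoding rules,
+which take precedence. The name `signed_integer_size` is hypothetical.
+
```python
def signed_integer_size(n: int) -> int:
    """Smallest number of bytes (at least 1) holding n in two's complement."""
    # For negative n, ~n equals -n - 1, which is non-negative and needs the
    # same number of bits below the sign bit as n does.
    magnitude = n if n >= 0 else ~n
    # bit_length() value bits plus one sign bit, rounded up to whole bytes:
    # (bits + 1 + 7) // 8 simplifies to bits // 8 + 1.
    return magnitude.bit_length() // 8 + 1
```
+
+For example, under these assumptions 127 and -128 each fit in one
+byte, while 128 and -129 each need two.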
+
+Give up on sorting requirement for representation of sets and
+dictionaries?? Probably a good idea if there are streaming forms of
+them because that sounds impossible to do??
+
+Maybe reorder: fixed-length atoms first, then variable-length atoms,
+then fixed-length compounds, then variable-length compounds? Reason
+being that then maybe can put the streaming forms of the
+variable-length ones very last.
+
+---