Trim and improve

2018-09-24 12:59:22 +01:00 · b4d4092b90
parent b2eb53e664
commit b4d4092b90
1 changed file with 242 additions and 261 deletions
--- a/syndicate/mc/preserve.md
+++ b/syndicate/mc/preserve.md
@@ -1,14 +1,16 @@
 ---
 ---
+<title>Preserves: an Expressive Data Language</title>
 <style>
 body { font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif; }
 @media screen {
  body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
 }
 @media print {
-  body { margin-left: 2rem; margin-right: 2rem; }
-  h2 { page-break-before: always }
-  h2:first-of-type { page-break-before: auto; }
+  @page { margin: 1.5cm; }
+  body { margin-left: 2rem; margin-right: 2rem; }
+  h1, h2 { page-break-before: always }
+  h1:first-of-type, h2:first-of-type { page-break-before: auto; }
 }
 h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
 h2 { border-bottom: solid #4f81bd 1px; }
@@ -17,29 +19,45 @@ code { font-size: 75%; }
 pre { padding: 0.33rem; }
 </style>

-# Preserves: Semantic Serialization of Node-labelled Data
+# Preserves: an Expressive Data Language

-       _________
-      <_________>   Tony Garnock-Jones <tonyg@leastfixedpoint.com>
-      |  FRμIT  |   September 2018
-      |Preserves|   Version 0.0.2
-      \_________/
-     
+Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
+September 2018. Version 0.0.2.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map

-Most data serialization formats used on the web represent
-*edge-labelled* semi-structured data.
+This document proposes a data model and serialization format called
+*Preserves*.

-This document proposes a data model and serialization format that
-takes a *node-labelled* approach.
+Preserves supports *records* with user-defined *labels*. This makes it
+more expressive[^macro-expressiveness] than most data languages in use
+on the web and allows it to easily represent the *labelled sums of
+products* as seen in many functional programming languages.

-This makes it both extensible and much more like S-expressions, making
-it easily able to represent the *labelled sums of products* as seen in
-Rust, Haskell, OCaml, and other functional programming languages.
+Preserves also supports the usual suite of atomic and compound data
+types, in particular including *binary* data as a distinct type from
+text strings.
+
+Finally, Preserves defines precisely how to compare two values with
+each other in terms of the data model, not in terms of syntax or of
+the data structures of any particular implementation language.
+
+  [^macro-expressiveness]: By "expressive" I mean *macro-expressive*
+    in the sense of Felleisen's 1991 paper, "On the Expressive Power
+    of Programming Languages".
+
+    Roughly speaking, there's no way in a JSON document to introduce a
+    new kind of information (such as binary data, or a date-stamp, or
+    a "person" object) in an *unambiguous way* without *global
+    agreement* from every potential consumer of the document. With an
+    extensible labelled record type, there is.
+
+    Felleisen, Matthias. “On the Expressive Power of Programming
+    Languages.” Science of Computer Programming 17, no. 1--3 (1991):
+    35–75.

 ## Starting with Semantics

@@ -65,20 +83,12 @@ later in this document.
                                | Dictionary

 Our `Value`s fall into two broad categories: *atomic* and *compound*
-data.[^zephyr-asdl]
+data.[^inspiration]

-  [^zephyr-asdl]: This design was loosely inspired by S-expressions,
+  [^inspiration]: This design was loosely inspired by S-expressions,
    as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
-    and by the ML type system, as seen in languages such as SML,
-    OCaml, Haskell, Rust, and many others. It is also related to
-    Zephyr ASDL (h/t
-    [Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)),
-    which doesn't offer much in the way of atoms, but offers
-    general-purpose labelled sums and products. See D. C. Wang, A. W.
-    Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax
-    Description Language,” in USENIX Conference on Domain-Specific
-    Languages, 1997, pp. 213–228.
-    [PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf)
+    as well as by the ML type system, as seen in languages such as
+    SML, OCaml, Haskell, Rust, and many others.

 **Total order.**<a name="total-order"></a> As we go, we will
 incrementally specify a total order over `Value`s. Two values of the
@@ -101,9 +111,6 @@ follows:[^ordering-by-syntax]
 **Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
 neither is less than the other according to the total order.

-<!-- We should avoid unnecessary restrictions such as machine-oriented -->
-<!-- fixed-width integer or floating-point values where possible. -->
-
 ### Signed integers.

 A `SignedInteger` is a signed integer of arbitrary width.
@@ -120,8 +127,8 @@ examples of `SignedInteger`s using standard mathematical notation.
 A `String` is a sequence of Unicode
 [code-point](http://www.unicode.org/glossary/#code_point)s. Two
 `String`s are compared lexicographically, code-point by
-code-point.[^utf8-is-awesome] We will write examples of `String`s text
-surrounded by double-quotes “`"`” using a monospace font.
+code-point.[^utf8-is-awesome] We will write examples of `String`s as
+text surrounded by double-quotes “`"`” using a monospace font.

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
@@ -176,7 +183,7 @@ A `Float` is a single-precision IEEE 754 floating-point value; a
 `Double` is a double-precision IEEE 754 floating-point value.
 `Float`s, `Double`s and `SignedInteger`s are considered disjoint, and
 so by the rules [above](#total-order), every `Float` is less than
-every `Double`, and every `SignedInteger` is less than both. Two
+every `Double`, and every `SignedInteger` is greater than both. Two
 `Float`s or two `Double`s are to be ordered by the `totalOrder`
 predicate defined in section 5.10 of
 [IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
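
As an illustration only (it is not part of the specification), one common way to realise the `totalOrder` predicate for `Double`s is to reinterpret the 64-bit pattern as a signed integer and flip the magnitude bits of negative values. The helper name `total_order_key` is hypothetical; a `Float` version would use the 32-bit formats `">f"` and `">i"` with the mask `0x7FFFFFFF`.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a Double to an int whose ordering matches IEEE 754 totalOrder."""
    # Reinterpret the big-endian binary64 pattern as a signed 64-bit integer.
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    # Negative floats sort in reverse of their magnitude bits, so flip the
    # 63 non-sign bits whenever the sign bit is set.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


values = [1.5, float("inf"), 0.0, -0.0, -2.0, float("-inf")]
ordered = sorted(values, key=total_order_key)
```

Under this key, negative zero sorts strictly before positive zero, which ordinary floating-point comparison cannot express.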
@@ -196,10 +203,8 @@ record's *fields*. A record's label is, itself, a `Value`, though it
 will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
 are compared lexicographically as if they were just tuples; that is,
 first by their labels, and then by the remainder of their fields. We
-will only write examples of `Record`s having labels that are `Symbol`s
-entirely composed of ASCII characters. Such `Record`s will be written
-as a parenthesised, space-separated sequence of their label followed
-by their fields.
+will write examples of `Record`s as a parenthesised, space-separated
+sequence of their label `Value` followed by their field `Value`s.

  [^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
@@ -215,19 +220,19 @@ by their fields.
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
-    for itself - for its own `Value`.
+    for itself—for its own `Value`.

 **Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
 written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
 written `(void)`.

+**Non-examples.** `()`, because it lacks a label.
+
 ### Sequences.

 A `Sequence` is a general-purpose, variable-length ordered sequence of
-zero or more `Value`s. `Sequence`s are compared lexicographically,
-appealing to the ordering on `Value`s for comparisons at each position
-in the `Sequence`s. We write examples space-separated, surrounded with
-square brackets.
+zero or more `Value`s. `Sequence`s are compared lexicographically. We
+write examples space-separated, surrounded with square brackets.

 **Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
 `SignedInteger`s 1, 2 and 3.
@@ -237,18 +242,18 @@ square brackets.
 A `Set` is an unordered finite set of `Value`s. It contains no
 duplicate values, following the [equivalence relation](#equivalence)
 induced by the total order on `Value`s. Two `Set`s are compared by
-sorting their elements using the [total order](#total-order) and
-comparing the resulting sequences as `Sequence`s. We write examples
+sorting their elements ascending using the [total order](#total-order)
+and comparing the resulting `Sequence`s. We write examples
 space-separated, surrounded with curly braces, prefixed by `#set`.

 **Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
 containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
 containing 4, the string `"hello"`, the record with label `void` and
 no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
-the set containing a `SignedInteger` and a `Float`, both denoting the
-number 1; `#set{(mime application/xml #"<x/>") (mime
-application/xml #"<x />")}`, a set containing two different
-type-labelled byte arrays.[^mime-xml-difference]
+the set containing a `SignedInteger` and a `Float`; `#set{(mime
+application/xml #"<x/>") (mime application/xml #"<x />")}`, a set
+containing two different type-labelled byte
+arrays.[^mime-xml-difference]

  [^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
    differ by bytewise comparison, and thus yield different record
@@ -258,50 +263,31 @@ type-labelled byte arrays.[^mime-xml-difference]
 **Non-examples.** `#set{1 1 1}`, because it contains multiple
 equivalent `Value`s.

-### Dictionaries, hash-tables or maps.
+### Dictionaries.

-A `Dictionary` is an unordered finite collection of zero or more pairs
-of `Value`s. Each pair comprises a *key* and a *value*. Keys in a
-`Dictionary` must be pairwise distinct. Instances of `Dictionary` are
-compared by lexicographic comparison of the sequences resulting from
-ordering each `Dictionary`'s pairs in ascending order by key. Examples
-are written as a `#dict`-prefixed, curly-brace-surrounded sequence of
+A `Dictionary` is an unordered finite collection of pairs of `Value`s.
+Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must
+be pairwise distinct. Instances of `Dictionary` are compared by
+lexicographic comparison of the sequences resulting from ordering each
+`Dictionary`'s pairs in ascending order by key. Examples are written
+as a `#dict`-prefixed, curly-brace-surrounded sequence of
 space-separated key-value pairs, each written with a colon between the
 key and value.

 **Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
 dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
-`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having
-a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
-values.
+`#dict{[1 2 3]:a}`, mapping `[1 2 3]` to `a`; `#dict{"hi":0 hi:0
+there:[]}`, having a `String` and two `Symbol` keys, and
+`SignedInteger` and `Sequence` values.

 **Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
-keys; `#dict{[]:[] []:99}`, for the same reason.
+keys; `#dict{[7 8]:[] [7 8]:99}`, for the same reason.

 ## Syntax

 Now we have discussed `Value`s and their meanings, we may turn to
 techniques for *representing* `Value`s for communication or storage.

-The syntax we have used for the examples so far is inadequate in many
-ways, not least of which is that it cannot represent every `Value`.
-
-Separation of the meaning of a piece of syntax from the syntax itself
-opens the door to domain-specific syntaxes, all equivalent and
-interconvertible.[^asn1] With a robust semantic foundation,
-connections to other data languages can also be made.
-
-  [^asn1]: Those who remember
-    [ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)
-    will recall BER, DER, PER, CER, XER and so on, each appropriate to
-    a different setting. Similarly,
-    [Rivest's S-Expression design][sexp.txt] offers a human-friendly
-    syntax, a syntax robust to network-induced message corruption, and
-    an unambiguous, simple and easily-parsed machine-friendly syntax
-    for the same underlying values.
-
-### Binary syntax
-
 For now, we limit our attention to an easily-parsed, easily-produced
 machine-readable syntax.

@@ -312,42 +298,7 @@ encoded details of the `Value` itself.

 For a value `v`, we write `[[v]]` for the `Repr` of v.

-The following figure summarises the definitions below:
-
-    tt nn mmmm  varint(m)  contents
-    -------------------------------
-
-    00 00 0000             False
-    00 00 0001             True
-    00 00 0010             Float, 32 bits big-endian binary
-    00 00 0011             Double, 64 bits big-endian binary
-    00 00 x1xx             RESERVED
-    00 00 1xxx             RESERVED
-    00 01 xxxx             RESERVED
-    00 10 ttnn             Start Stream <tt,nn>
-                             When tt = 00 --> error
-                                       01 --> each chunk is a <tt,nn> piece
-                                       1x --> each chunk is a single encoded Value
-    00 11 ttnn             End Stream <tt,nn> (must match preceding Start Stream)
-
-    01 00 mmmm  ...        SignedInteger, big-endian binary
-    01 01 mmmm  ...        String, UTF-8 binary
-    01 10 mmmm  ...        ByteString
-    01 11 mmmm  ...        Symbol, UTF-8 binary
-
-    10 00 mmmm  ...        application-specific Record
-    10 01 mmmm  ...        application-specific Record
-    10 10 mmmm  ...        application-specific Record
-    10 11 mmmm  ...        Record
-
-    11 00 mmmm  ...        Sequence
-    11 01 mmmm  ...        Set
-    11 10 mmmm  ...        Dictionary
-    11 11 xxxx             RESERVED
-
-    If mmmm = 1111, varint(m) is present; otherwise, m is the length
-
-#### Type and Length representation
+### Type and Length representation

 Each `Repr` takes one of three possible forms:

@@ -365,13 +316,13 @@ Each `Repr` takes one of three possible forms:
   begins before the number of elements or bytes in the corresponding
   `Value` is known.

-Applications may choose between formats (B) and (C) depending on their
+Applications may choose between formats B and C depending on their
 needs at serialization time.

-Every `Repr`, however, starts with a *lead byte* describing the
-remainder of the representation.
+Every `Repr` starts with a *lead byte* describing the remainder of the
+representation.

-##### The lead byte
+#### The lead byte

 The lead byte is constructed by a function `leadbyte`:

@@ -387,18 +338,18 @@ follows:[^some-encodings-unused]
    encodings are reserved for future versions of this specification.

 - `leadbyte(0,0,-)` (format A) represents an Atom with fixed-length binary representation.
- - `leadbyte(0,1,-)` (format A) is RESERVED.
+ - `leadbyte(0,1,-)` (format A) is reserved.
 - `leadbyte(0,2,-)` (format C) is a Stream Start byte.
 - `leadbyte(0,3,-)` (format C) is a Stream End byte.
 - `leadbyte(1,-,-)` (format B) represents an Atom with variable-length binary representation.
 - `leadbyte(2,-,-)` (format B) represents a Record.
 - `leadbyte(3,-,-)` (format B) represents a Sequence, Set or Dictionary.

-##### Encoding data of fixed length (format A)
+#### Encoding data of fixed length (format A)

 Each specific type of data defines its own rules for this format.

-##### Encoding data of known length (format B)
+#### Encoding data of known length (format B)

 A `Repr` where the length of the `Value` to be encoded is variable but
 known uses the value of `m` in `leadbyte` to encode its length. The
@@ -434,15 +385,15 @@ definition,
 - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
 - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
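
The lead byte, varint and header rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the bit layout `tt nn mmmm` and the rule that `mmmm = 1111` signals a following varint are taken from the bit-field table in this document, and the function bodies are assumptions built on that reading.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack the three fields as the bit pattern tt nn mmmm."""
    assert 0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16
    return bytes([(t << 6) | (n << 4) | m])


def varint(v: int) -> bytes:
    """Base-128 encoding, least-significant group first.

    The high bit of each byte is set when more bytes follow.
    """
    assert v >= 0
    out = bytearray()
    while v >= 0x80:
        out.append((v & 0x7F) | 0x80)
        v >>= 7
    out.append(v)
    return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lengths below 15 fit in the lead byte; longer ones add a varint."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)


print(varint(300).hex(" "))        # the two bytes 172 and 2, in hexadecimal
print(header(1, 3, 24).hex(" "))   # header of a 24-byte Symbol
```

The second call corresponds to the `7F 18` prefix seen in the `application/octet-stream` example later in the document.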

-##### Streaming data of unknown length (format C)
+#### Streaming data of unknown length (format C)

 A `Repr` where the length of the `Value` to be encoded is variable and
 not known at the time serialization of the `Value` starts is encoded
-by a single Stream Start byte, followed by zero or more *chunks*,
-followed by a matching Stream End byte:
+by a single Stream Start (“open”) byte, followed by zero or more
+*chunks*, followed by a matching Stream End (“close”) byte:

-    startbyte(t,n) = leadbyte(0,2, t*4 + n)
-      endbyte(t,n) = leadbyte(0,3, t*4 + n)
+     open(t,n) = leadbyte(0,2, t*4 + n)
+    close(t,n) = leadbyte(0,3, t*4 + n)

 For a `Repr` of a `Value` containing binary data, each chunk is to be
 a format B `Repr` of the same type as the overall `Repr`.
@@ -450,35 +401,34 @@ a format B `Repr` of the same type as the overall `Repr`.
 For a `Repr` of a `Value` containing other `Value`s, each chunk is to
 be a single `Repr`.
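
A small Python sketch of format C for binary data may help here. It assumes the `tt nn mmmm` lead-byte layout and streams a `ByteString` (type 1, `n = 2`) as a series of short format B chunks. The names `open_byte`, `close_byte` and `stream_bytestring` are hypothetical, and chunks of 15 bytes or more are rejected only to keep the sketch short (a fuller version would emit a varint length).

```python
from typing import Iterable


def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack the three fields as the bit pattern tt nn mmmm."""
    return bytes([(t << 6) | (n << 4) | m])


def open_byte(t: int, n: int) -> bytes:
    """Stream Start: leadbyte(0, 2, t*4 + n)."""
    return leadbyte(0, 2, t * 4 + n)


def close_byte(t: int, n: int) -> bytes:
    """Stream End: leadbyte(0, 3, t*4 + n)."""
    return leadbyte(0, 3, t * 4 + n)


def stream_bytestring(chunks: Iterable[bytes]) -> bytes:
    """Emit open(1,2), one format B ByteString per chunk, then close(1,2)."""
    out = bytearray(open_byte(1, 2))
    for chunk in chunks:
        if len(chunk) >= 15:
            raise ValueError("sketch only handles chunks shorter than 15 bytes")
        out += leadbyte(1, 2, len(chunk)) + chunk
    out += close_byte(1, 2)
    return bytes(out)


encoded = stream_bytestring([b"ab", b"cde"])
print(encoded.hex(" "))
```

The same open and close helpers reproduce the `0x29` and `0x39` bytes used for the short-form record example, since `open(2,1) = leadbyte(0,2,9)`.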

-#### Records
+### Records

 Format B (known length):

-    [[ (L F_1 ... F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]
+    [[ (L F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

 For `m` fields, `m+1` is supplied to `header`, to account for the
 encoding of the record label.

 Format C (streaming):

-    [[ (L F_1 ... F_m) ]]
-           = startbyte(2,3) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,3)
+    [[ (L F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)

 Applications *SHOULD* prefer the known-length format for encoding
 `Record`s.

-##### Application-specific short form for labels
+#### Application-specific short form for labels

 Any given protocol using Preserves may additionally define an
 interpretation for `n ∈ {0,1,2}`, mapping each *short form label
 number* `n` to a specific record label. When encoding `m` fields with
 short form label number `n`, format B becomes

-    header(2,n,m) ++ [[F_1]] ++ ... ++ [[F_m]]
+    header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]

 and format C becomes

-    startbyte(2,n) ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,n)
+    open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)

 **Examples.** For example, a protocol may choose to map records
 labelled `void` to `n=0`, making
@@ -494,30 +444,29 @@ making

 for format B, or

-        = startbyte(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ endbyte(2,1)
-        =         [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
+        = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
+        =    [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]

 for format C.

-#### Sequences, Sets and Dictionaries
+### Sequences, Sets and Dictionaries

 Format B (known length):

-    [[ [X_1 ... X_m] ]] = header(3,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
+                 [[ [X_1...X_m] ]] = header(3,0,m)   ++ [[X_1]] ++...++ [[X_m]]
+             [[ #set{X_1...X_m} ]] = header(3,1,m)   ++ [[X_1]] ++...++ [[X_m]]
+    [[ #dict{K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
+                                                     ++ [[K_m]] ++ [[V_m]]

-    [[ #set{X_1 ... X_m} ]] = header(3,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]
-
-    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
-      = header(3,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
+Note that `m*2` is given to `header` for a `Dictionary`, since there
+are two `Value`s in each key-value pair.

 Format C (streaming):

-    [[ [X_1 ... X_m] ]] = startbyte(3,0) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,0)
-
-    [[ #set{X_1 ... X_m} ]] = startbyte(3,1) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,1)
-
-    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
-      = startbyte(3,2) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]] ++ endbyte(3,2)
+                 [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
+             [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
+    [[ #dict{K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
+                                               ++ [[K_m]] ++ [[V_m]] ++ close(3,2)

 Applications may use whichever format suits their needs on a
 case-by-case basis.
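
To make the format B rules for compound values concrete, here is a deliberately tiny Python encoder. It is a sketch under stated assumptions: only `SignedInteger`s from 0 to 127 and collections of fewer than 15 encoded items are supported, Python `list`, `set` and `dict` stand in for `Sequence`, `Set` and `Dictionary`, and the function names are illustrative.

```python
def header(t: int, n: int, m: int) -> bytes:
    """Lead byte tt nn mmmm; this sketch supports short lengths only."""
    if not 0 <= m < 15:
        raise ValueError("sketch only handles lengths below 15")
    return bytes([(t << 6) | (n << 4) | m])


def encode(v) -> bytes:
    """Encode a small subset of Values in format B."""
    if type(v) is int and 0 <= v < 128:
        # SignedInteger: zero has an empty body, 1..127 fit in one byte.
        return header(1, 0, 0) if v == 0 else header(1, 0, 1) + bytes([v])
    if isinstance(v, list):
        return header(3, 0, len(v)) + b"".join(encode(x) for x in v)
    if isinstance(v, (set, frozenset)):
        # Element order is whatever iteration yields; no sorting is done.
        return header(3, 1, len(v)) + b"".join(encode(x) for x in v)
    if isinstance(v, dict):
        # Two Values per key-value pair, hence m*2.
        body = b"".join(encode(k) + encode(val) for k, val in v.items())
        return header(3, 2, 2 * len(v)) + body
    raise TypeError(f"unsupported value in this sketch: {v!r}")


print(encode([1, 2, 3]).hex(" "))
print(encode({1: 2}).hex(" "))
```

Note how the dictionary header carries 2 for a single pair, matching the `m*2` rule.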
@@ -538,26 +487,30 @@ order.
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

-Note that `header(3,3,m)` and `startbyte(3,3)`/`endbyte(3,3)` is unused and reserved.
+    However, a quality implementation may wish to offer the programmer
+    the option of serializing with set elements and dictionary keys in
+    sorted order.

-#### Variable-length Atoms
+Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.

-##### SignedInteger
+### Variable-length Atoms
+
+#### SignedInteger

 Format B (known length):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)
-      where           m = |intbytes(x)|
-        and intbytes(x) = a big-endian two's-complement representation
-                          of the signed integer x, taking exactly as
-                          many whole bytes as needed to unambiguously
-                          identify the value

 Format C *MUST NOT* be used for `SignedInteger`s.

+The function `intbytes(x)` gives the big-endian two's-complement
+binary representation of `x`, taking exactly as many whole bytes as
+needed to unambiguously identify the value and its sign, and `m =
+|intbytes(x)|`.
+
 The value 0 needs zero bytes to identify the value, so `intbytes(0)`
 is the empty byte string. Non-zero values need at least one byte; the
-most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
+most-significant bit in the first byte in `intbytes(x)` for `x`≠0 is
 the sign bit.

 For example,
@@ -583,59 +536,49 @@ For example,
    [[  65536 ]] = [0x43, 0x01, 0x00, 0x00]
    [[ 131072 ]] = [0x43, 0x02, 0x00, 0x00]

-##### String
+#### String, ByteString and Symbol
+
+Syntax for these three types varies only in the value of `n` supplied
+to `header`, `open`, and `close`. In each case, the payload following
+the header is a binary sequence; for `String` and `Symbol`, it is a
+UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
+is the raw data contained within the `Value` unmodified.

 Format B (known length):

-    [[ S ]] when S ∈ String = header(1,1,m) ++ utf8(S)
-      where       m = |utf8(x)|
-        and utf8(x) = the UTF-8 encoding of S
+              [[ S ]] = header(1,n,m) ++ encode(S)
+              where m = |encode(S)|
+    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
+                        (2,S)        if S ∈ ByteString
+                        (3,utf8(S))  if S ∈ Symbol

-To stream a `String`, emit `startbyte(1,1)` and then a sequence of
-zero or more format B `String` chunks, followed by `endbyte(1,1)`.
+To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
+then a sequence of zero or more format B chunks, followed by
+`close(1,n)`. For a `String`, every chunk must be a `String`;
+likewise, for `ByteString` and `Symbol`.

-While the overall content of a streamed `String` must be valid UTF-8,
-individual chunks do not have to conform to UTF-8.
+While the overall content of a streamed `String` or `Symbol` must be
+valid UTF-8, individual chunks do not have to conform to UTF-8.

-##### ByteString
-
-Format B (known length):
-
-    [[ B ]] when B ∈ ByteString = header(1,2,m) ++ B
-                        where m = |B|
-
-To stream a `ByteString`, emit `startbyte(1,2)` and then a sequence of
-zero or more format B `ByteString` chunks, followed by `endbyte(1,2)`.
-
-##### Symbol
-
-Format B (known length):
-
-    [[ S ]] when S ∈ Symbol = header(1,3,m) ++ utf8(S)
-      where       m = |utf8(x)|
-        and utf8(x) = the UTF-8 encoding of S
-
-To stream a `Symbol`, emit `startbyte(1,3)` and then a sequence of
-zero or more format B `Symbol` chunks, followed by `endbyte(1,3)`.
-
-#### Fixed-length Atoms
+### Fixed-length Atoms

 Fixed-length atoms all use format A, and do not have a length
 representation. They repurpose the bits that format B `Repr`s use to
 specify lengths. Applications *MUST NOT* use format C with
-`startbyte(0,n)` or `endbyte(0,n)` for any `n`.
+`open(0,n)` or `close(0,n)` for any `n`.

-##### Booleans
+#### Booleans

    [[ #f ]] = header(0,0,0) = [0x00]
    [[ #t ]] = header(0,0,1) = [0x01]

-##### Floats and Doubles
+#### Floats and Doubles

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
-      where binary32(F) and binary64(D) are big-endian 4- and 8-byte
-            IEEE 754 binary representations
+
+The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
+8-byte IEEE 754 binary representations of `F` and `D`, respectively.
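
For illustration, the fixed-length atoms can be encoded in Python with the standard `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64. The function names below are hypothetical, and the sketch assumes the lead bytes `0x00` to `0x03` listed for `header(0,0,0)` through `header(0,0,3)`.

```python
import struct


def encode_bool(b: bool) -> bytes:
    """#f is the single byte 0x00 and #t is 0x01."""
    return bytes([0x01 if b else 0x00])


def encode_float(f: float) -> bytes:
    """header(0,0,2) followed by the big-endian binary32 pattern."""
    return b"\x02" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """header(0,0,3) followed by the big-endian binary64 pattern."""
    return b"\x03" + struct.pack(">d", d)


print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
```

Because these encodings have no length field, a decoder knows from the lead byte alone that 4 or 8 bytes follow.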

 ## Examples

@@ -705,14 +648,33 @@ encodes to

 The `Value` data type is essentially an S-Expression, able to
 represent semi-structured data over `ByteString`, `String`,
-`SignedInteger` atoms and so on.
+`SignedInteger` atoms and so on.[^why-not-spki-sexps]
+
+  [^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
+    similar to Preserves. However, while they include binary data and
+    sequences, and an obvious equivalence for them exists, they lack
+    numbers *per se* as well as any kind of unordered structure such
+    as sets or maps. In addition, while "display hints" allow
+    labelling of binary data with an intended interpretation, they
+    cannot be attached to any other kind of structure, and the "hint"
+    itself can only be a binary blob.

 However, users need a wide variety of data types for representing
 domain-specific values such as various kinds of encoded and normalized
 text, calendrical values, machine words, and so on.

-We use appropriately-labelled `Record`s to denote these
-domain-specific data types.
+Appropriately-labelled `Record`s denote these domain-specific data
+types.[^why-dictionaries]
+
+  [^why-dictionaries]: Given `Record`'s existence, it may seem odd
+    that `Dictionary`, `Set`, `Float`, etc. are given special
+    treatment. Preserves aims to offer a useful basic equivalence
+    predicate to programmers, and so if a data type demands a special
+    equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
+    then the type should be included in the base language. Otherwise,
+    it can be represented as a `Record` and treated separately. Both
+    `Boolean` and `String` are seeming exceptions: they merit
+    inclusion because of their cultural importance.

 All of these conventions are optional. They form a layer atop the core
 `Value` structure. Non-domain-specific tools do not in general need to
@@ -740,11 +702,13 @@ being a `ByteString`, the binary data.

 While each media type may define its own rules for comparing
 documents, we define ordering among `MIMEData` *representations* of
-such media types lexicographically over the (`Symbol`, `ByteString`)
-pair.
+such media types following the general rules for ordering of
+`Record`s.

 **Examples.**

+| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
+|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
 | `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
 | `(mime text/plain #"ABC")`                 | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
 | `(mime application/xml #"<xhtml/>")`       | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
@@ -813,7 +777,84 @@ Dates, times, moments, and timestamps can be represented with a
 or `date-time` productions of
 [section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).

-## Representing Values in Programming Languages
+## Security Considerations
+
+**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
+`Symbol`s may include chunks of zero length. This opens up a
+possibility for denial-of-service: an attacker may begin streaming a
+string, sending an endless sequence of zero-length chunks, appearing
+to make progress but not actually doing so. Implementations may place
+optional reasonable restrictions on the number of consecutive empty
+chunks that may appear in a stream, and may even supply an optional
+mode that rejects empty chunks entirely.
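
One possible shape for such a restriction, sketched in Python, is a wrapper around whatever iterator a decoder uses to pull chunk bodies from a stream. Everything here is an assumption made for illustration: the names `limit_empty_chunks` and `TooManyEmptyChunks`, the default limit of 16, and the idea that chunk bodies arrive as `bytes` objects.

```python
from typing import Iterable, Iterator


class TooManyEmptyChunks(ValueError):
    """Raised when a stream exceeds the configured run of empty chunks."""


def limit_empty_chunks(
    chunks: Iterable[bytes], max_consecutive_empty: int = 16
) -> Iterator[bytes]:
    """Pass chunks through, rejecting long runs of zero-length chunks.

    A limit of 0 rejects empty chunks entirely.
    """
    empties = 0
    for chunk in chunks:
        if len(chunk) == 0:
            empties += 1
            if empties > max_consecutive_empty:
                raise TooManyEmptyChunks(
                    f"more than {max_consecutive_empty} consecutive empty chunks"
                )
        else:
            # Any non-empty chunk counts as progress and resets the run.
            empties = 0
        yield chunk


body = b"".join(limit_empty_chunks([b"ab", b"", b"cde"], max_consecutive_empty=1))
```

Counting consecutive rather than total empty chunks keeps well-behaved producers that occasionally flush an empty chunk working, while still bounding the work an attacker can force without making progress.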
+
+**Canonical form for cryptographic hashing and signing.** As
+specified, the encoding rules for `Value`s do not force canonical
+serializations for `Set` or `Dictionary` values. Two serializations of
+the same `Value` may yield different binary `Repr`s.
+
+## Appendix. Table of lead byte values
+
+     00 - False
+     01 - True
+     02 - Float
+     03 - Double
+    (0x)  RESERVED 04-0F
+    (1x)  RESERVED 10-1F
+     2x - Start Stream
+     3x - End Stream
+
+     4x - SignedInteger
+     5x - String
+     6x - ByteString
+     7x - Symbol
+
+     8x - short form Record label index 0
+     9x - short form Record label index 1
+     Ax - short form Record label index 2
+     Bx - Record
+
+     Cx - Sequence
+     Dx - Set
+     Ex - Dictionary
+    (Fx)  RESERVED F0-FF
+
+## Appendix. Bit fields within lead byte values
+
+     tt nn mmmm  contents
+     ---------- ---------
+
+     00 00 0000  False
+     00 00 0001  True
+     00 00 0010  Float, 32 bits big-endian binary
+     00 00 0011  Double, 64 bits big-endian binary
+
+     00 10 ttnn  Start Stream <tt,nn>
+                   When tt = 00 --> error
+                             01 --> each chunk is a <tt,nn> piece
+                             1x --> each chunk is a single encoded Value
+     00 11 ttnn  End Stream <tt,nn> (must match preceding Start Stream)
+
+     01 00 mmmm  SignedInteger, big-endian binary
+     01 01 mmmm  String, UTF-8 binary
+     01 10 mmmm  ByteString
+     01 11 mmmm  Symbol, UTF-8 binary
+
+     10 00 mmmm  application-specific Record
+     10 01 mmmm  application-specific Record
+     10 10 mmmm  application-specific Record
+     10 11 mmmm  Record
+
+     11 00 mmmm  Sequence
+     11 01 mmmm  Set
+     11 10 mmmm  Dictionary
+
+     If mmmm = 1111, a varint(m) follows, giving the length, before
+     the body; otherwise, m is the length of the body to follow.
+
+
+
+## Appendix. Representing Values in Programming Languages

 We have given a definition of `Value` and its semantics, and proposed
 a concrete syntax for communicating and storing `Value`s. We now turn
@@ -881,32 +922,6 @@ should both be identities.
 - `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
 - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)

-## Appendix. Table of lead byte values
-
-     00 - False
-     01 - True
-     02 - Float
-     03 - Double
-    (0x)  RESERVED 04-0F
-    (1x)  RESERVED 10-1F
-     2x - Start Stream
-     3x - End Stream
-
-     4x - SignedInteger
-     5x - String
-     6x - ByteString
-     7x - Symbol
-
-     8x - short form Record label index 0
-     9x - short form Record label index 1
-     Ax - short form Record label index 2
-     Bx - Record
-
-     Cx - Sequence
-     Dx - Set
-     Ex - Dictionary
-    (Fx)  RESERVED F0-FF
-
 ## Appendix. Why not Just Use JSON?

 <!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
@@ -1060,47 +1075,13 @@ JSON itself does not offer any guidance for which of these options to
 choose. In many real cases on the web, poor choices have led to
 encodings that are irrecoverably ambiguous.

---
---
-
 # Open questions

 Q. Should "symbols" instead be URIs? Relative, usually; relative to
 what? Some domain-specific base URI?

-Q. What about general rationals, subsuming integers and IEEE floats
-(except NaN and the Infinities)?
-
-Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps]
-
-  [^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as
-    they are? While they include binary data and sequences, and an
-    obvious equivalence for them exists, they lack numbers *per se* as
-    well as any kind of unordered structure such as sets or maps. In
-    addition, while "display hints" allow labelling of binary data
-    with an intended interpretation, they cannot be attached to any
-    other kind of structure, and the "hint" itself can only be a
-    binary blob.
-
-Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
-label (recursive!?) and a single `String` field?
-
-Q. Should `String` be a special syntax for `(utf8 ByteString)`? Again,
-recursiveness problems...?
-
-Q. Should `Dictionary` be a special syntax for etc etc.? `Set`?
-`Float`? `Double`?
-
- --> Rule of thumb: if there's a special equivalence predicate for it,
-     it needs to be built-in syntax. Otherwise it can be a regular
-     record. (So: `Boolean` might not make the cut for special
-     treatment?? Likewise `String`...? Ugh those are psychologically
-     important perhaps)
-
 Q. Are the language mappings reasonable? How about one for Python?

---
+Q. Literal small integers: could be nice? Not absolutely necessary.

-Literal small integers: could be nice? Not absolutely necessary.
-
---
+## Notes