MUCH simpler binary format, inspired by Syrup; alterations to text format

2020-12-28 23:25:02 +01:00 · 5d719c2c6f
parent ccf4f97ed8
commit 5d719c2c6f
6 changed files with 399 additions and 596 deletions
--- a/2
+++ b/2
@@ -1,2 +1,2 @@
 Preserves: an Expressive Data Language
-Copyright 2018-2019 Tony Garnock-Jones
+Copyright 2018-2020 Tony Garnock-Jones
--- a/TUTORIAL.md
+++ b/TUTORIAL.md
@@ -38,7 +38,7 @@ For that, see the [Preserves specification](preserves.html).
 If you're familiar with JSON, Preserves looks fairly similar:
-``` javascript
+```
    {"name": "Missy Rose",
     "species": "Felis Catus",
     "age": 13,
@@ -49,35 +49,35 @@ Preserves also has something we can use for debugging/development
 information called "annotations"; they aren't actually read in as data
 but we can use them for comments.
 (They can also be used for other development tools and are not
-restricted to strings; more on this later, but for now interpret them
+restricted to strings; more on this later, but for now, we will stick
-as comments.)
+to the special comment annotation syntax.)
-``` javascript
+```
-    @"I'm an annotation... basically a comment.  Ignore me!"
+    ;I'm an annotation... basically a comment. Ignore me!
-    "I'm data!  Don't ignore me!"
+    "I'm data! Don't ignore me!"
 ```
 Preserves supports some data types you're probably already familiar
 with from JSON, and which look fairly similar in the textual format:
-``` javascript
+```
-    @"booleans"
+    ;booleans
-    #true
+    #t
-    #false
+    #f
-    
+
-    @"various kinds of numbers:"
+    ;various kinds of numbers:
    42
    123556789012345678901234567890
    -10
    13.5
-    
+
-    @"strings"
+    ;strings
    "I'm feeling stringy!"
-    
+
-    @"sequences (lists)"
+    ;sequences (lists)
    ["cat", "dog", "mouse", "goldfish"]
-    
+
-    @"dictionaries (hashmaps)"
+    ;dictionaries (hashmaps)
    {"cat": "meow",
     "dog": "woof",
     "goldfish": "glub glub",
@@ -90,16 +90,16 @@ with from JSON, and which look fairly similar in the textual format:
 ## Going beyond JSON
 We can observe a few differences from JSON already; it's possible to
-express numbers of arbitrary length in Preserves, and booleans look a little
+*reliably* express integers of arbitrary length in Preserves, and booleans look a little
 bit different.
 A few more interesting differences:
-``` javascript
+```
-    @"Preserves treats commas as whitespace, so these are the same"
+    ;Preserves treats commas as whitespace, so these are the same
    ["cat", "dog", "mouse", "goldfish"]
    ["cat" "dog" "mouse" "goldfish"]
-    
+
-    @"We can use anything as keys in dictionaries, not just strings"
+    ;We can use anything as keys in dictionaries, not just strings
    {1: "the loneliest number",
     ["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
     {"dictionaries": "as keys???"}: "well, why not?"}
@@ -107,17 +107,17 @@ A few more interesting differences:
 Preserves technically provides a few types of numbers:
-``` javascript
+```
-    @"Signed Integers"
+    ;Signed Integers
    42
    -42
    5907212309572059846509324862304968273468909473609826340
    -5907212309572059846509324862304968273468909473609826340
-    
+
-    @"Floats (Single-precision IEEE floats) (notice the trailing f)"
+    ;Floats (Single-precision IEEE floats) (notice the trailing f)
    3.1415927f
-    
+
-    @"Doubles (Double-precision IEEE floats)"
+    ;Doubles (Double-precision IEEE floats)
    3.141592653589793
 ```
@@ -129,33 +129,33 @@ Often they're meant to be used for something that has symbolic importance
 to the program, but not textual importance (other than to guide the
 programmer&#x2026; not unlike variable names).
-``` javascript
+```
-    @"A symbol (NOT a string!)"
+    ;A symbol (NOT a string!)
    JustASymbol
-    
+
-    @"You can do mixedCase or CamelCase too of course, pick your poison"
+    ;You can do mixedCase or CamelCase too of course, pick your poison
-    @"(but be consistent, for the sake of your collaborators!"
+    ;(but be consistent, for the sake of your collaborators!)
    iAmASymbol
    i-am-a-symbol
-    
+
-    @"A list of symbols"
+    ;A list of symbols
    [GET, PUT, POST, DELETE]
-    
+
-    @"A symbol with spaces in it"
+    ;A symbol with spaces in it
    |this is just one symbol believe it or not|
 ```
 We can also add binary data, aka ByteStrings:
-``` javascript
+```
-    @"Some binary data, base64 encoded"
+    ;Some binary data, base64 encoded
-    #base64{cGljdHVyZSBvZiBhIGNhdA==}
+    #[cGljdHVyZSBvZiBhIGNhdA==]
-    
+
-    @"Some other binary data, hexadecimal encoded"
+    ;Some other binary data, hexadecimal encoded
-    #hex{616263}
+    #x"616263"
-    
+
-    @"Same binary data as above, base64 encoded"
+    ;Same binary data as above, base64 encoded
-    #base64{YWJj}
+    #[YWJj]
 ```
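For readers checking the new `ByteString` notation by hand, here is a minimal Python sketch (not part of this commit) that decodes the payloads between `#x"` … `"` and `#[` … `]` using only the standard library. The helper names `decode_hex_payload` and `decode_base64_payload` are hypothetical, and the sketch covers only the text between the delimiters, not a full Preserves reader. It is also more lenient than the grammar, since it strips all whitespace before decoding.

```python
import base64


def decode_hex_payload(text: str) -> bytes:
    """Decode the text found between #x" and ": hex digit pairs plus whitespace."""
    return bytes.fromhex("".join(text.split()))


def decode_base64_payload(text: str) -> bytes:
    """Decode the text found between #[ and ]: plain or URL-safe Base64."""
    compact = "".join(text.split())
    # Map the URL-safe alphabet onto the standard one before decoding.
    standard = compact.replace("-", "+").replace("_", "/")
    return base64.b64decode(standard)


print(decode_hex_payload("616263"))                       # b'abc'
print(decode_base64_payload("YWJj"))                      # b'abc'
print(decode_base64_payload("cGljdHVyZSBvZiBhIGNhdA=="))  # b'picture of a cat'
```

The first two calls return the same bytes, matching the tutorial's note that `#x"616263"` and `#[YWJj]` carry the same binary data.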
 What's neat about this is that we don't have to "pay the cost" of
@@ -165,48 +165,41 @@ the length of the binary data is the length of the binary data.
 Conveniently, Preserves also includes Sets, which are collections of
 unique elements where ordering of items is unimportant.
-``` javascript
+```
-    #set{flour, salt, water}
+    #{flour, salt, water}
 ```
 <a id="orgefafe56"></a>
-## Total ordering and canonicalization
+## Canonicalization
 This is a good time to mention that even though from a semantic
 perspective sets and dictionaries do not carry information about the
 ordering of their elements (and Preserves doesn't care what order we
 enter them in for our hand-written-as-text Preserves documents),
-Preserves has a well-defined "total ordering".
+[Preserves provides support for canonical ordering](canonical-binary.html)
 when serializing.
-Based on this total ordering, Preserves provides support for canonical
+In canonicalizing output mode, Preserves will always write out a given
-ordering when serializing; in this mode, Preserves will always write
+value using exactly the same bytes, every time. This is important and
-out the elements in the same order, every time.
+useful for many contexts, but especially for cryptographic signatures
-When combined with binary serialization, this is Preserves' "canonical
+and hashing.
 form".
 This is important and useful for many contexts, but especially for
 cryptographic signatures and hashing.
-``` javascript
+```
-    @"This hand-typed Preserves document..."
+    ;This hand-typed Preserves document...
    {monkey: {"noise": "ooh-ooh",
-              "eats": #set{"bananas", "berries"}}
+              "eats": #{"bananas", "berries"}}
     cat: {"noise": "meow",
-           "eats": #set{"kibble", "cat treats", "tinned meat"}}}
+           "eats": #{"kibble", "cat treats", "tinned meat"}}}
-    
+
-    @"Will always, always be written out in this order when canonicalized:"
+    ;Will always, always be written out in this order (except in
-    {cat: {"eats": #set{"cat treats", "kibble", "tinned meat"},
+    ;binary, of course) when canonicalized:
    {cat: {"eats": #{"cat treats", "kibble", "tinned meat"},
           "noise": "meow"}
-     monkey: {"eats": #set{"bananas", "berries"},
+     monkey: {"eats": #{"bananas", "berries"},
              "noise": "ooh-ooh"}}
 ```
 Clever implementations can get canonicalized output for free by
 carefully ordering set elements and dictionary entries at construction
 time, but even in simple implementations, canonical serialization is
 almost as cheap as normal serialization.
 <a id="org0366627"></a>
 ## Defining our own types using Records
@@ -216,7 +209,7 @@ sense, it's a meta-type.
 `Record` objects have a label and a series of arguments (or "fields").
 For example, we can make a `Date` record:
-``` javascript
+```
    <Date 2019 8 15>
 ```
@@ -228,7 +221,7 @@ We could instead just decide to encode our date data in a string,
 like "2019-08-15".
 A document using such a date structure might look like so:
-``` javascript
+```
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "1915-10-04"}
@@ -243,13 +236,13 @@ know the date exactly.
 This causes a problem.
 Now we might have two kinds of entries:
-``` javascript
+```
-    @"Exact date known"
+    ;Exact date known
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "1915-10-04"}
-    
+
-    @"Not sure about exact date..."
+    ;Not sure about exact date...
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "Sometime in October 1915?  Or was that when he became an insect?"}
@@ -261,13 +254,13 @@ like a date", but doing this kind of thing is prone to errors and weird
 edge cases.
 No, it's better to be able to have a separate type:
-``` javascript
+```
-    @"Exact date known"
+    ;Exact date known
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": <Date 1915 10 04>}
-    
+
-    @"Not sure about exact date..."
+    ;Not sure about exact date...
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": <Unknown "Sometime in October 1915?  Or was that when he became an insect?">}
@@ -285,7 +278,7 @@ the meaning the label signifies for it to be of use.
 Still, there are plenty of interesting labels we can define.
 Here is one for an "iri", a hyperlink:
-``` javascript
+```
    <iri "https://dustycloud.org/blog/">
 ```
@@ -294,11 +287,11 @@ Records are usually symbols but aren't necessarily so.
 They can also be strings or numbers or even dictionaries.
 And very interestingly, they can also be other records:
-``` javascript
+```
-    <<iri "https://www.w3.org/ns/activitystreams#Note">
+    < <iri "https://www.w3.org/ns/activitystreams#Note">
-     {"to": [<iri "https://chatty.example/ben/">],
+      {"to": [<iri "https://chatty.example/ben/">],
-      "attributedTo": <iri "https://social.example/alyssa/">,
+       "attributedTo": <iri "https://social.example/alyssa/">,
-      "content": "Say, did you finish reading that book I lent you?"}>
+       "content": "Say, did you finish reading that book I lent you?"} >
 ```
 Do you see it?  This Record's label is&#x2026; an `iri` Record!
@@ -327,16 +320,18 @@ Annotations are not strictly a necessary feature, but they are useful
 in some circumstances.
 We have previously shown them used as comments:
-``` javascript
+```
-    @"I'm a comment!"
+    ;I'm a comment!
    "I am not a comment, I am data!"
 ```
 Annotations annotate the values they precede.
 It is possible to have multiple annotations on a value.
 The `;`-based comment syntax is syntactic sugar for the general
 `@`-prefixed string annotation syntax.
-``` javascript
+```
-    @"I am annotating this number"
+    ;I am annotating this number
    @"And so am I!"
    42
 ```
@@ -349,7 +344,7 @@ Many implementations will, in the same mode, also supply line number
 and column information attached to each read value.
 So what's the point of them then?
-If annotations were just for comments, there would be indeed hardly
+If annotations were just for comments, there would be indeed hardly any
 point at all&#x2026; it would be simpler to just provide a comment syntax.
 However, annotations can be used for more than just comments.
@@ -360,13 +355,17 @@ For instance, here's a reply from an HTTP API service running in
 "debug" mode annotated with the time it took to produce the reply and
 the internal name of the server that produced the response:
-``` javascript
+```
    @<ResponseTime <Milliseconds 64.4>>
    @<BackendServer "humpty-dumpty.example.com">
    <Success
      <Employees [
-        <Employee "Alyssa P. Hacker" #set{<Role Programmer>, <Role Manager>}, <Date 2018, 1, 24>>
+        <Employee "Alyssa P. Hacker"
-        <Employee "Ben Bitdiddle" #set{<Role Programmer>}, <Date 2019, 2, 13>> ]>>
+                  #{<Role Programmer>, <Role Manager>}
                  <Date 2018, 1, 24>>
        <Employee "Ben Bitdiddle"
                  #{<Role Programmer>}
                  <Date 2019, 2, 13>> ]>>
 ```
 The annotations aren't related to the data requested, which is all
--- a/canonical-binary.md
+++ b/canonical-binary.md
@@ -20,22 +20,17 @@ are equal.
 This document specifies canonical form for the Preserves compact
 binary syntax.
-**General rules.**
+**Annotations.**
 Streaming formats ("format C") *MUST NOT* be used.
 Annotations *MUST NOT* be present.
 Whenever there is a choice between fixed-length ("format A") or
 variable-length ("format B") formats, the fixed-length format *MUST* be
 used.
 **Sets.**
 The elements of a `Set` *MUST* be serialized sorted in ascending order
-following the total order relation defined in the
+by comparing their canonical encoded binary representations.
 [Preserves specification][spec].
 **Dictionaries.**
 The key-value pairs in a `Dictionary` *MUST* be serialized sorted in
-ascending order by key, following the total order relation defined in
+ascending order by comparing the canonical encoded binary
-the [Preserves specification][spec].[^no-need-for-by-value]
+representations of their keys.[^no-need-for-by-value]
  [^no-need-for-by-value]: There is no need to order by (key, value)
    pair, since a `Dictionary` has no duplicate keys.
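The new ordering rule can be illustrated with a short Python sketch (not part of this commit). It assumes that "ascending order by comparing their canonical encoded binary representations" means plain lexicographic comparison of the encoded bytes, which is how Python orders `bytes` values. The tag bytes `0xB3` (symbol), `0xB6` (set), `0xB7` (dictionary) and `0x84` (end) are taken from the preserves.md and conventions.md changes in this same commit. The helpers `sym`, `canonical_set` and `canonical_dict` are hypothetical names, and `sym` assumes names shorter than 128 bytes so that the length fits in one byte.

```python
def sym(name: str) -> bytes:
    """Encode a short symbol: tag 0xB3, one length byte, UTF-8 bytes."""
    raw = name.encode("utf-8")
    assert len(raw) < 128, "sketch only handles single-byte lengths"
    return b"\xb3" + bytes([len(raw)]) + raw


def canonical_set(encoded_elements) -> bytes:
    """Wrap already-encoded elements as a Set, sorted by their encoded bytes."""
    return b"\xb6" + b"".join(sorted(encoded_elements)) + b"\x84"


def canonical_dict(encoded_pairs) -> bytes:
    """Wrap (key_bytes, value_bytes) pairs as a Dictionary, sorted by encoded key."""
    ordered = sorted(encoded_pairs, key=lambda pair: pair[0])
    return b"\xb7" + b"".join(k + v for k, v in ordered) + b"\x84"


# Insertion order does not matter; the encoded bytes decide the output order.
a = canonical_set([sym("flour"), sym("salt"), sym("water")])
b = canonical_set([sym("water"), sym("flour"), sym("salt")])
print(a == b)  # True
print(a.hex(" "))
```

Under these assumptions `salt` is written before `flour` and `water`, because its length byte (`04`) sorts before theirs (`05`). The order follows the encoded bytes, not the alphabetical order of the text.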
@@ -43,7 +38,9 @@ the [Preserves specification][spec].[^no-need-for-by-value]
 **Other kinds of `Value`.**
 There are no special canonicalization restrictions on
 `SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s,
-`Float`s, `Double`s, `Record`s, or `Sequence`s.
+`Float`s, `Double`s, `Record`s, or `Sequence`s. The constraints given
 for these `Value`s in the [specification][spec] suffice to ensure
 canonicity.
 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes
--- a/conventions.md
+++ b/conventions.md
@@ -65,28 +65,29 @@ interior portions of a tree.
 ## Comments.
 `String` values used as annotations are conventionally interpreted as
-comments.
+comments. Special syntax exists for such string annotations, though
 the usual `@`-prefixed annotation notation can also be used.
-    @"I am a comment for the Dictionary"
+    ;I am a comment for the Dictionary
    {
-      @"I am a comment for the key"
+      ;I am a comment for the key
-      key: @"I am a comment for the value"
+      key: ;I am a comment for the value
           value
    }
-    @"I am a comment for this entire IOList"
+    ;I am a comment for this entire IOList
    [
-      #hex{00010203}
+      #x"00010203"
-      @"I am a comment for the middle half of the IOList"
+      ;I am a comment for the middle half of the IOList
-      @"A second comment for the same portion of the IOList"
+      ;A second comment for the same portion of the IOList
-      @ @"I am the first and only comment for the following comment"
+      @ ;I am the first and only comment for the following comment
        "A third (itself commented!) comment for the same part of the IOList"
      [
-        @"I am a comment for the following ByteString"
+        ;I am a comment for the following ByteString
-        #hex{04050607}
+        #x"04050607"
-        #hex{08090A0B}
+        #x"08090A0B"
      ]
-      #hex{0C0D0E0F}
+      #x"0C0D0E0F"
    ]
 ## MIME-type tagged binary data.
@@ -105,12 +106,17 @@ such media types following the general rules for ordering of
 **Examples.**
-| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
+    «<mime application/octet-stream #"abcde">»
-|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
+      = B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84
-| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
+
-| `<mime text/plain #"ABC">`                 | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
+    «<mime text/plain #"ABC">»
-| `<mime application/xml #"<xhtml/>">`       | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
+      = B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84
-| `<mime text/csv #"123,234,345">`           | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |
+
    «<mime application/xml #"<xhtml/>">»
      = B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84
    «<mime text/csv #"123,234,345">»
      = B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
 ## Unicode normalization forms.
--- a/preserves.md
+++ b/preserves.md
@@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
 ---
 Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
-May 2020. Version 0.0.8.
+Jan 2021. Version 0.4.0.
  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
@@ -12,6 +12,7 @@ May 2020. Version 0.0.8.
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405
  [canonical]: canonical-binary.html
 This document proposes a data model and serialization format called
 *Preserves*.
@@ -42,20 +43,20 @@ Our `Value`s fall into two broad categories: *atomic* and *compound*
 data. Every `Value` is finite and non-cyclic.
                          Value = Atom
-                                | Compound
+                                                | Compound
                           Atom = Boolean
-                                | Float
+                                                | Float
-                                | Double
+                                                | Double
-                                | SignedInteger
+                                                | SignedInteger
-                                | String
+                                                | String
-                                | ByteString
+                                                | ByteString
-                                | Symbol
+                                                | Symbol
                       Compound = Record
-                                | Sequence
+                                                | Sequence
-                                | Set
+                                                | Set
-                                | Dictionary
+                                                | Dictionary
 **Total order.**<a name="total-order"></a> As we go, we will
 incrementally specify a total order over `Value`s. Two values of the
@@ -215,14 +216,13 @@ label-`Value` followed by its field-`Value`s.
 `Sequence`s are enclosed in square brackets. `Dictionary` values are
 curly-brace-enclosed colon-separated pairs of values. `Set`s are
-written either as one or more values enclosed in curly braces, or zero
+written as values enclosed by the tokens `#{` and
 or more values enclosed by the tokens `#set{` and
 `}`.[^printing-collections] It is an error for a set to contain
 duplicate elements or for a dictionary to contain duplicate keys.
          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
-               Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
+               Set = "#{" *Value ws "}"
  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
@@ -232,9 +232,10 @@ duplicate elements or for a dictionary to contain duplicate keys.
    commas separating, and commas terminating elements or key/value
    pairs within a collection.
-`Boolean`s are the simple literal strings `#true` and `#false`.
+`Boolean`s are the simple literal strings `#t` and `#f` for true and
 false, respectively.
-           Boolean = %s"#true" / %s"#false"
+           Boolean = %s"#t" / %s"#f"
 Numeric data follow the
 [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
@@ -310,9 +311,10 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
-    Multilingual Plane. We encourage implementations to avoid escaping
+    Multilingual Plane. We encourage implementations to avoid using
-    such characters when producing output, and instead to rely on the
+    `\u` escapes when producing output, and instead to rely on the
-    UTF-8 encoding of the entire document to handle them correctly.
+    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.
 A `ByteString` may be written in any of three different forms.
@@ -327,16 +329,16 @@ value with `\x`.
      binunescaped = %x20-21 / %x23-5B / %x5D-7E
 The second is as a sequence of pairs of hexadecimal digits interleaved
-with whitespace and surrounded by `#hex{` and `}`.
+with whitespace and surrounded by `#x"` and `"`.
-       ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
+       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
 The third is as a sequence of
 [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
-with whitespace and surrounded by `#base64{` and `}`. Plain and
+with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
-URL-safe Base64 characters are allowed.
+Base64 characters are allowed.
-       ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
+       ByteString =/ "#[" *(ws / base64char) ws "]" /
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
 A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
@@ -365,10 +367,10 @@ double quote mark.
 Finally, any `Value` may be represented by escaping from the textual
 syntax to the [compact binary syntax](#compact-binary-syntax) by
 prefixing a `ByteString` containing the binary representation of the
-`Value` with `#value`.[^rationale-switch-to-binary]
+`Value` with `#`.[^rationale-switch-to-binary]
 [^no-literal-binary-in-text] [^compact-value-annotations]
-           Compact = %s"#value" ws ByteString
+           Compact = "#" ws ByteString
  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
@@ -387,8 +389,8 @@ prefixing a `ByteString` containing the binary representation of the
    access the representation of the text from within the text itself.
  [^compact-value-annotations]: Any text-syntax annotations preceding
-    the `#value` are prepended to any binary-syntax annotations
+    the `#` are prepended to any binary-syntax annotations yielded by
-    yielded by decoding the `ByteString`.
+    decoding the `ByteString`.
 ### Annotations.
@@ -403,6 +405,17 @@ Each annotation is preceded by `@`; the underlying annotated value
 follows its annotations. Here we extend only the syntactic nonterminal
 named “`Value`” without altering the semantic class of `Value`s.
 **Comments.** Strings annotating a `Value` are conventionally
 interpreted as comments associated with that value. Comments are
 sufficiently common that special syntax exists for them.
            Value =/ ws
                     ";" *(%x00-09 / %x0B-0C / %x0E-%x10FFFF) newline
                     Value
 When written this way, everything between the `;` and the newline is
 included in the string annotating the `Value`.
 **Equivalence.** Annotations appear within syntax denoting a `Value`;
 however, the annotations are not part of the denoted value. They are
 only part of the syntax. Annotations do not play a part in
@@ -421,86 +434,25 @@ different.
 ## Compact Binary Syntax
-A `Repr` is a binary-syntax encoding, or representation, of either a
+A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
-`Value` or an annotation on a `Repr`.
+For a value `v`, we write `«v»` for the `Repr` of v.
 Each `Repr` comprises one or more bytes describing the kind of
 represented information and the length of the representation, followed
 by the encoded details.
 For a value `v`, we write `[[v]]` for the `Repr` of v.
 ### Type and Length representation.
-Each `Repr` takes one of three possible forms:
+Each `Repr` starts with a tag byte, describing the kind of information
 represented. Depending on the tag, a length indicator, further encoded
 information, and/or an ending tag may follow.
- - (A) type-specific form, used for simple values such as `Boolean`s
+    tag                          (simple atomic data and small integers)
-   or `Float`s as well as for introducing annotations.
+    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)
- - (B) a variable-length form with length specified up-front, used for
+The unique end tag is byte value `0x84`.
   compound and variable-length atomic data structures when their
   sizes are known at the time serialization begins.
- - (C) a variable-length streaming form with unknown or unpredictable
+If present after a tag, the length of a following piece of binary data
-   length, used in cases when serialization begins before the number
+is formatted as a [base 128 varint][varint].[^see-also-leb128] We
-   of elements or bytes in the corresponding `Value` is known.
+write `varint(m)` for the varint-encoding of `m`. Quoting the
 Applications may choose between formats B and C depending on their
 needs at serialization time.
 #### The lead byte.
 Every `Repr` starts with a *lead byte*, constructed by
 `leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:
    leadbyte(t,n,m) = [t*64 + n*16 + m]
 The arguments `t`, `n` and `m` describe the rest of the
 representation.[^some-encodings-unused]
  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.
 | `t` | `n` | `m` | Meaning                                                         |
 | --- | --- | --- | -------                                                         |
 |  0  |  0  | 0–3 | (format A) An `Atom` with fixed-length binary representation    |
 |  0  |  0  | 4   | (format C) Stream end                                           |
 |  0  |  0  | 5   | (format A) Annotation                                           |
 |  0  |  2  |     | (format C) Stream start                                         |
 |  0  |  3  |     | (format A) Certain small `SignedInteger`s                       |
 |  1  |     |     | (format B) An `Atom` with variable-length binary representation |
 |  2  |     |     | (format B) A `Compound` with variable-length representation     |
 |  3  |  3  | 15  | (format A) 0xFF byte; no-op                                     |
 #### Encoding data of type-specific length (format A).
 Each type of data defines its own rules for this format.
 Of particular note is lead byte `0xFF`, which is a no-op byte acting
 as a kind of pseudo-whitespace in a binary-syntax encoding.
 #### Encoding data of known length (format B).
 Format B is used where the length `l` of the `Value` to be encoded is
 known when serialization begins. Format B `Repr`s use `m` in
 `leadbyte` to encode `l`. The length counts *bytes* for atomic
 `Value`s, but counts *contained values* for compound `Value`s.
 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
 - A length of 15 or greater is represented by `m=15` and additional
   bytes describing the length following the lead byte.
 The function `header(t,n,m)` yields an appropriate sequence of bytes
 describing a `Repr`'s type and length when `t`, `n` and `m` are
 appropriate non-negative integers:
    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                    or leadbyte(t,n,15) ++ varint(m)   otherwise
 The additional length bytes are formatted as
 [base 128 varints][varint].[^see-also-leb128] We write `varint(m)` for
 the varint-encoding of `m`. Quoting the
 [Google Protocol Buffers][varint] definition,
  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
@@ -515,174 +467,114 @@ the varint-encoding of `m`. Quoting the
 The following table illustrates varint-encoding.
-| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
+| Number, `m`                                   | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
-| ------      | -------------------                       | ------------      |
+| ------                                        | -------------------                       | ------------      |
-| 15          | `0001111`                                 | 15                |
+| 15                                            | `0001111`                                 | 15                |
-| 300         | `0000010 0101100`                         | 172 2             |
+| 300                                           | `0000010 0101100`                         | 172 2             |
-| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
+| 1000000000                                    | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
 It is an error for a varint-encoded `m` in a `Repr` to be anything
 other than the unique shortest encoding for that `m`. That is, a
-varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. However,
+varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
 the `varint(m)` encoding of a length *MUST NOT* be used when `m`<15,
 meaning that a `Repr` *MUST NOT* contain any varint-encoding with
 final byte `0`.
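The varint table and the shortest-encoding rule above can be checked with a short Python sketch (not part of this commit). The function names `varint` and `decode_varint` are illustrative. The decoder's rejection of a trailing zero byte is one possible way to enforce the "unique shortest encoding" requirement, not necessarily how any Preserves implementation does it.

```python
def varint(m: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least-significant group first."""
    if m < 0:
        raise ValueError("varint is defined for non-negative integers only")
    out = bytearray()
    while True:
        low = m & 0x7F
        m >>= 7
        if m:
            out.append(low | 0x80)  # more groups follow: set the continuation bit
        else:
            out.append(low)
            return bytes(out)


def decode_varint(data: bytes) -> tuple[int, int]:
    """Return (value, bytes consumed); reject non-shortest encodings."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            if byte == 0 and i > 0:
                raise ValueError("non-shortest varint: final byte is 0")
            return value, i + 1
    raise ValueError("truncated varint")


print(list(varint(15)))          # [15]
print(list(varint(300)))         # [172, 2]
print(list(varint(1000000000)))  # [128, 148, 235, 220, 3]
```

The three printed lists match the rows of the table above.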
-#### Streaming data of unknown length (format C).
+### Records, Sequences, Sets and Dictionaries.
-A `Repr` where the length of the `Value` to be encoded is variable and
+          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
-not known at the time serialization of the `Value` starts is encoded
+            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
-by a single Stream Start (“open”) byte, followed by zero or more
+           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
-*chunks*, followed by a matching Stream End (“close”) byte:
+    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
     open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
       close() = leadbyte(0,0, 4)       = [0x04]
 For a format C `Repr` of an atomic `Value`, each chunk is to be a
 format B `Repr` of a `ByteString`, no matter the type of the overall
 `Value`. Annotations are not allowed on these individual chunks.
 For a format C `Repr` of a compound `Value`, each chunk is to be a
 single `Repr`, which may itself be annotated.
 Each chunk within a format C `Repr` *MUST* have non-zero length.
 Software that decodes `Repr`s *MUST* reject `Repr`s that include
 zero-length chunks.
 ### Records.
 Format B (known length):
    [[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
 For `m` fields, `m+1` is supplied to `header`, to account for the
 encoding of the record label.
 Format C (streaming):
    [[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
 Applications *SHOULD* prefer the known-length format for encoding
 `Record`s.
 ### Sequences, Sets and Dictionaries.
 Format B (known length):
            [[ [X_1...X_m] ]] = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
        [[ #set{X_1...X_m} ]] = header(2,2,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]
 Note that `m*2` is given to `header` for a `Dictionary`, since there
 are two `Value`s in each key-value pair.
 Format C (streaming):
            [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
        [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
    [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close()
 Applications may use whichever format suits their needs on a
 case-by-case basis.
 There is *no* ordering requirement on the `E_i` elements or
 `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
-order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.
+order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
 addition, implementations *SHOULD* default to writing set elements and
 dictionary key/value pairs in order sorted lexicographically by their
 `Repr`s[^not-sorted-semantically], and *MAY* offer the option of
 serializing in some other implementation-defined order.
  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
-    elements) be in sorted order for serialized `Value`s, because (a)
+    elements) be in sorted order for serialized `Value`s; however, a
-    where canonicalization is used for cryptographic signatures, it is
+    [canonical form][canonical] for `Repr`s does exist where a sorted
-    more reliable to simply retain the exact binary form of the signed
+    ordering is required.
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.
-    However, a quality implementation may wish to offer the programmer
+  [^not-sorted-semantically]: It's important to note that the sort
-    the option of serializing with set elements and dictionary keys in
+    ordering for writing out set elements and dictionary key/value
-    sorted order.
+    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.
    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
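The default writing order described above (sort by `Repr` bytes, not by semantic value) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the inputs are assumed to be already-encoded `Repr`s, and distinct `Repr`s are used as a stand-in for distinct `Value`s, which only holds for unannotated, shortest-form encodings. Dictionary pairs are sorted by key `Repr` alone; since keys are distinct, this is assumed to give the same order as sorting whole pairs.

```python
END = 0x84  # end marker that closes every compound Repr in the new format


def encode_set(element_reprs):
    """Encode a Set from already-encoded element Reprs (bytes objects)."""
    if len(set(element_reprs)) != len(element_reprs):
        raise ValueError("set elements must be pairwise distinct")
    # Sort by encoded bytes, not by the semantic order of the elements.
    return bytes([0xB6]) + b"".join(sorted(element_reprs)) + bytes([END])


def encode_dictionary(pair_reprs):
    """Encode a Dictionary from (key Repr, value Repr) pairs."""
    keys = [key for key, _ in pair_reprs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    # Keys are distinct, so ordering by key Repr fixes the pair order.
    ordered = sorted(pair_reprs, key=lambda pair: pair[0])
    body = b"".join(key + value for key, value in ordered)
    return bytes([0xB7]) + body + bytes([END])


# Reprs of -257, 0 and 13: the negative number sorts last by Repr.
reprs = [bytes.fromhex("A1FEFF"), bytes.fromhex("90"), bytes.fromhex("A00D")]
print(encode_set(reprs).hex(" "))  # b6 90 a0 0d a1 fe ff 84
```

Note how `-257` is written after `0` and `13`, which is the effect the `not-sorted-semantically` footnote describes.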
 ### SignedIntegers.
-Format B/A (known length/fixed-size):
+    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if  (-3≤x≤-1)
                                 ([0x90] + x)                        if  ( 0≤x≤12)
                               where m =        |intbytes(x)|
-    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
+Integers in the range [-3,12] are compactly represented with tags
-                                     header(0,3,x+16)              if -3≤x<0
+between `0x90` and `0x9F` because they are so frequently used.
-                                     header(0,3,x)                 if 0≤x<13
+Integers up to 16 bytes long are represented with a single-byte tag
-
+encoding the length of the integer. Larger integers are represented
-Integers in the range [-3,12] are compactly represented using format A
+with an explicit varint length. Every `SignedInteger` *MUST* be
-because they are so frequently used. Other integers are represented
+represented with its shortest possible encoding.
 using format B.
 Format C *MUST NOT* be used for `SignedInteger`s. Format A *MUST* be
 used for integers in the range -3 to 12, inclusive.
 The function `intbytes(x)` gives the big-endian two's-complement
 binary representation of `x`, taking exactly as many whole bytes as
 needed to unambiguously identify the value and its sign, and `m =
 |intbytes(x)|`. The most-significant bit in the first byte in
-`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]
+`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
 example,
      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00
      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00
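The tag-based `SignedInteger` rules can be sketched as below. This is a hedged illustration rather than the reference encoder: the function names are hypothetical, and `varint` is assumed to be the little-endian base-128 encoding defined elsewhere in the specification (it is not defined in this section).

```python
def varint(n):
    """Little-endian base-128 varint (assumed definition, see lead-in)."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def intbytes(x):
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # For negative x, ~x has the same number of magnitude bits as needed.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    # One extra bit is needed for the sign, hence the +1 whole byte.
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def encode_signed_integer(x):
    if -3 <= x <= -1:
        return bytes([0xA0 + x])  # tags 0x9D..0x9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])  # tags 0x90..0x9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body  # length folded into the tag
    return bytes([0xB0]) + varint(m) + body  # explicit varint length


print(encode_signed_integer(-257).hex(" "))  # a1 fe ff
```

Checking the small ranges first means each integer gets exactly one encoding here, in line with the shortest-encoding requirement.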
  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
 For example,
    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 3D       [[    128 ]] = 42 00 80
    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 3E       [[    255 ]] = 42 00 FF
    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 3F       [[    256 ]] = 42 01 00
    [[   -254 ]] = 42 FF 02    [[      0 ]] = 30       [[  32767 ]] = 42 7F FF
    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 31       [[  32768 ]] = 43 00 80 00
    [[   -128 ]] = 41 80       [[     12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00
 ### Strings, ByteStrings and Symbols.
-Syntax for these three types varies only in the value of `n` supplied
+Syntax for these three types varies only in the tag used. For `String`
-to `header` and `open`. In each case, the payload following the header
+and `Symbol`, the data following the tag is a UTF-8 encoding of the
-is a binary sequence; for `String` and `Symbol`, it is a UTF-8
+`Value`'s code points, while for `ByteString` it is the raw data
-encoding of the `Value`'s code points, while for `ByteString` it is
+contained within the `Value` unmodified.
 the raw data contained within the `Value` unmodified.
-Format B (known length):
+    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
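A minimal sketch of the three tagged encodings above (`0xB1`, `0xB2`, `0xB3`) follows. The function names are hypothetical, and `varint` is assumed to be the little-endian base-128 encoding defined elsewhere in the specification.

```python
def varint(n):
    """Little-endian base-128 varint (assumed definition, see lead-in)."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def _tagged(tag, payload):
    # Tag byte, then payload length in bytes, then the payload itself.
    return bytes([tag]) + varint(len(payload)) + payload


def encode_string(s):
    return _tagged(0xB1, s.encode("utf-8"))


def encode_byte_string(b):
    return _tagged(0xB2, bytes(b))


def encode_symbol(name):
    return _tagged(0xB3, name.encode("utf-8"))


print(encode_string("hello").hex(" "))  # b1 05 68 65 6c 6c 6f
```

The length counts UTF-8 bytes, not code points, so a `String` containing non-ASCII characters has a length larger than its character count.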
-              [[ S ]] = header(1,n,m) ++ encode(S)
+### Booleans.
              where m = |encode(S)|
    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                        (2,S)        if S ∈ ByteString
                        (3,utf8(S))  if S ∈ Symbol
-To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
+    «#f» = [0x80]
-then a sequence of zero or more format B chunks, followed by
+    «#t» = [0x81]
 `close()`. Every chunk must be a `ByteString`, and no chunk may be
 annotated.
-While the overall content of a streamed `String` or `Symbol` must be
+### Floats and Doubles.
 valid UTF-8, individual chunks do not have to conform to UTF-8.
-### Fixed-length Atoms.
+    «F» when F ∈ Float  = [0x82] ++ binary32(F)
-
+    «D» when D ∈ Double = [0x83] ++ binary64(D)
 Fixed-length atoms all use format A, and do not have a length
 representation. They repurpose the bits that format B `Repr`s use to
 specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
 for any `n`.
 #### Booleans.
    [[ #false ]] = header(0,0,0) = [0x00]
    [[  #true ]] = header(0,0,1) = [0x01]
 #### Floats and Doubles.
    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.
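For the new tags `0x80` to `0x83`, the fixed-size atoms can be sketched with Python's `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64. The function names are hypothetical.

```python
import struct


def encode_boolean(value):
    return bytes([0x81 if value else 0x80])


def encode_float(f):
    # ">f" is big-endian IEEE 754 binary32 (4 bytes).
    return bytes([0x82]) + struct.pack(">f", f)


def encode_double(d):
    # ">d" is big-endian IEEE 754 binary64 (8 bytes).
    return bytes([0x83]) + struct.pack(">d", d)


print(encode_float(1.0).hex(" "))   # 82 3f 80 00 00
print(encode_double(1.0).hex(" "))  # 83 3f f0 00 00 00 00 00 00
```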
@ -690,40 +582,43 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 ### Annotations.
 To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
-`[0x05] ++ [[v]]`.
+`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
 syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
 `a` and `b`, is
-For example, the `Repr` corresponding to textual syntax `@a@b[]`,
+    «@a @b []»
-i.e. an empty sequence annotated with two symbols, `a` and `b`, is
+      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
-
+      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
    [[ @a @b [] ]]
      = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
      = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
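In the new format, where the annotation tag is `0x85`, the prepending rule can be sketched as below. The function name `annotate` is hypothetical, and its inputs are assumed to be already-encoded `Repr`s.

```python
ANNOTATION = bytes([0x85])


def annotate(repr_bytes, annotation_reprs):
    """Prefix a Repr with one [0x85] ++ annotation Repr per annotation."""
    prefix = b"".join(ANNOTATION + a for a in annotation_reprs)
    return prefix + repr_bytes


empty_sequence = bytes.fromhex("B584")  # «[]»
symbol_a = bytes.fromhex("B30161")      # «a»
symbol_b = bytes.fromhex("B30162")      # «b»
print(annotate(empty_sequence, [symbol_a, symbol_b]).hex(" "))
# 85 b3 01 61 85 b3 01 62 b5 84
```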
 ## Examples
 ### Ordering.
 The total ordering specified [above](#total-order) means that the following statements are true:
    "bzz" < "c" < "caa"
    #t < 3.0f < 3.0 < 3 < "3" < |3| < []
 ### Simple examples.
 <!-- TODO: Give some examples of large and small Preserves, perhaps -->
 <!-- translated from various JSON blobs floating around the internet. -->
-| Value                                             | Encoded byte sequence                                                               |
+| Value                       | Encoded byte sequence                                                           |
-|---------------------------------------------------|-------------------------------------------------------------------------------------|
+|-----------------------------|---------------------------------------------------------------------------------|
-| `<capture <discard>>`                             | 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd'                 |
+| `<capture <discard>>`       | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
-| `[1 2 3 4]` (format B)                            | 94 31 32 33 34                                                                      |
+| `[1 2 3 4]`                 | B5 91 92 93 94 84                                                               |
-| `[1 2 3 4]` (format C)                            | 29 31 32 33 34 04                                                                   |
+| `[-2 -1 0 1]`               | B5 9E 9F 90 91 84                                                               |
-| `[-2 -1 0 1]`                                     | 94 3E 3F 30 31                                                                      |
+| `"hello"`                   | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
-| `"hello"` (format B)                              | 55 'h' 'e' 'l' 'l' 'o'                                                              |
+| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84                           |
-| `"hello"` (format C, 2 chunks)                    | 25 62 'h' 'e' 63 'l' 'l' 'o' 35                                                     |
+| `-257`                      | A1 FE FF                                                                        |
-| `"hello"` (format C, 5 chunks)                    | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35                                            |
+| `-1`                        | 9F                                                                              |
-| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
+| `0`                         | 90                                                                              |
-| `-257`                                            | 42 FE FF                                                                            |
+| `1`                         | 91                                                                              |
-| `-1`                                              | 3F                                                                                  |
+| `255`                       | A1 00 FF                                                                        |
-| `0`                                               | 30                                                                                  |
+| `1.0f`                      | 82 3F 80 00 00                                                                  |
-| `1`                                               | 31                                                                                  |
+| `1.0`                       | 83 3F F0 00 00 00 00 00 00                                                      |
-| `255`                                             | 42 00 FF                                                                            |
+| `-1.202e300`                | 83 FE 3C B7 B7 59 BF 04 26                                                      |
 | `1.0f`                                            | 02 3F 80 00 00                                                                      |
 | `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                                          |
 | `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                                          |
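The first two rows of the new table can be reproduced with a small sketch of the `Record` (`0xB4`) and `Sequence` (`0xB5`) rules. The helper names are hypothetical, and the `symbol` helper is deliberately limited to payloads shorter than 128 bytes so that the length fits in one byte without a full varint implementation.

```python
END = bytes([0x84])  # end marker shared by all compound Reprs


def symbol(name):
    data = name.encode("utf-8")
    if len(data) >= 128:
        raise ValueError("this sketch handles single-byte lengths only")
    return bytes([0xB3, len(data)]) + data


def record(label_repr, *field_reprs):
    return bytes([0xB4]) + label_repr + b"".join(field_reprs) + END


def sequence(*item_reprs):
    return bytes([0xB5]) + b"".join(item_reprs) + END


# <capture <discard>>: a record whose only field is another record.
print(record(symbol("capture"), record(symbol("discard"))).hex(" "))

# [1 2 3 4]: small integers are single tag bytes 0x91..0x94.
print(sequence(*(bytes([0x90 + i]) for i in (1, 2, 3, 4))).hex(" "))
# b5 91 92 93 94 84
```

Because compounds are closed by an end marker instead of carrying a count, an encoder along these lines can emit elements as they become available.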
 The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
@ -731,21 +626,24 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
 encodes to
-    85                              ;; Record, generic, 4+1
+    B4                                ;; Record
-      95                              ;; Sequence, 5
+      B5                                ;; Sequence
-        76 74 69 74 6C 65 64            ;; Symbol, "titled"
+        B3 06 74 69 74 6C 65 64           ;; Symbol, "titled"
-        76 70 65 72 73 6F 6E            ;; Symbol, "person"
+        B3 06 70 65 72 73 6F 6E           ;; Symbol, "person"
-        32                              ;; SignedInteger, "2"
+        92                                ;; SignedInteger, "2"
-        75 74 68 69 6E 67               ;; Symbol, "thing"
+        B3 05 74 68 69 6E 67              ;; Symbol, "thing"
-        31                              ;; SignedInteger, "1"
+        91                                ;; SignedInteger, "1"
-      41 65                           ;; SignedInteger, "101"
+      84                                ;; End (sequence)
-      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
+      A0 65                             ;; SignedInteger, "101"
-      84                              ;; Record, generic, 3+1
+      B1 09 42 6C 61 63 6B 77 65 6C 6C  ;; String, "Blackwell"
-        74 64 61 74 65                  ;; Symbol, "date"
+      B4                                ;; Record
-        42 07 1D                        ;; SignedInteger, "1821"
+        B3 04 64 61 74 65                 ;; Symbol, "date"
-        32                              ;; SignedInteger, "2"
+        A1 07 1D                          ;; SignedInteger, "1821"
-        33                              ;; SignedInteger, "3"
+        92                                ;; SignedInteger, "2"
-      52 44 72                        ;; String, "Dr"
+        93                                ;; SignedInteger, "3"
      84                                ;; End (record)
      B1 02 44 72                       ;; String, "Dr"
    84                                ;; End (record)
  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
@ -785,23 +683,27 @@ read as `Symbol`s. The first example:
 encodes to binary as follows:
-    B2
+    B7
-      55 "Image"
+      B1 05 "Image"
-      BC
+      B7
-        55 "Width"    42 03 20
+        B1 05 "Title"    B1 14 "View from 15th Floor"
-        55 "Title"    5F 14 "View from 15th Floor"
+        B1 05 "Width"    A1 03 20
-        58 "Animated" 75 "false"
+        B1 06 "Height"   A1 02 58
-        56 "Height"   42 02 58
+        B1 08 "Animated" B3 05 "false"
-        59 "Thumbnail"
+        B1 09 "Thumbnail"
-          B6
+          B7
-            55 "Width"  41 64
+            B1 03 "Url"    B1 26 "http://www.example.com/image/481989943"
-            53 "Url"    5F 26 "http://www.example.com/image/481989943"
+            B1 03 "IDs"    B5
-            56 "Height" 41 7D
+                             A0 74
-            53 "IDs"    94
+                             A1 03 AF
-                          41 74
+                             A1 00 EA
-                          42 03 AF
+                             A2 00 97 89
-                          42 00 EA
+                           84
-                          43 00 97 89
+            B1 05 "Width"  A0 64
            B1 06 "Height" A0 7D
          84
      84
    84
 and the second example:
@ -830,55 +732,51 @@ and the second example:
 encodes to binary as follows:
-    92
+    B5
-      BF 10
+      B7
-        59 "precision"  53 "zip"
+        B1 03 "Zip"        B1 05 "94107"
-        58 "Latitude"   03 40 42 E2 26 80 9D 49 52
+        B1 04 "City"       B1 0D "SAN FRANCISCO"
-        59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
+        B1 05 "State"      B1 02 "CA"
-        57 "Address"    50
+        B1 07 "Address"    B1 00
-        54 "City"       5D "SAN FRANCISCO"
+        B1 07 "Country"    B1 02 "US"
-        55 "State"      52 "CA"
+        B1 08 "Latitude"   83 40 42 E2 26 80 9D 49 52
-        53 "Zip"        55 "94107"
+        B1 09 "Longitude"  83 C0 5E 99 56 6C F4 1F 21
-        57 "Country"    52 "US"
+        B1 09 "precision"  B1 03 "zip"
-      BF 10
+      84
-        59 "precision"  53 "zip"
+      B7
-        58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
+        B1 03 "Zip"        B1 05 "94085"
-        59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
+        B1 04 "City"       B1 09 "SUNNYVALE"
-        57 "Address"    50
+        B1 05 "State"      B1 02 "CA"
-        54 "City"       59 "SUNNYVALE"
+        B1 07 "Address"    B1 00
-        55 "State"      52 "CA"
+        B1 07 "Country"    B1 02 "US"
-        53 "Zip"        55 "94085"
+        B1 08 "Latitude"   83 40 42 AF 9D 66 AD B4 03
-        57 "Country"    52 "US"
+        B1 09 "Longitude"  83 C0 5E 81 AA 4F CA 42 AF
        B1 09 "precision"  B1 03 "zip"
      84
    84
 ## Security Considerations
-**Empty chunks.** Chunks of zero length are prohibited in streamed
+**Whitespace.** The textual format allows arbitrary whitespace in many
-(format C) `Repr`s. However, a malicious or broken encoder may include
+positions. Consider optional restrictions on the amount of consecutive
-them nonetheless. This opens up a possibility for denial-of-service:
+whitespace that may appear.
 an attacker may begin streaming a `String`, for example, sending an
 endless sequence of zero length chunks, appearing to make progress but
 not actually doing so. Implementations *MUST* reject zero length
 chunks when decoding, and *MUST NOT* produce them when encoding.
-**Whitespace and no-ops.** Similarly, the binary format allows `0xFF`
+**Annotations.** Similarly, in modes where a `Value` is being read
-no-ops and the textual format allows arbitrary whitespace in many
+while annotations are skipped, an endless sequence of annotations may
-positions. In streaming transfer situations, consider optional
+give an illusion of progress.
 restrictions on the amount of consecutive whitespace or the number of
 consecutive no-ops that may appear.
-**Annotations.** Also similarly, in modes where a `Value` is being
+**Canonical form for cryptographic hashing and signing.** No canonical
-read while annotations are skipped, an endless sequence of annotations
+textual encoding of a `Value` is specified. A
-may give an illusion of progress.
+[canonical form][canonical] exists for binary encoded `Value`s, and
-
+implementations *SHOULD* produce canonical binary encodings by
-**Canonical form for cryptographic hashing and signing.** As
+default; however, an implementation *MAY* permit two serializations of
-specified, neither the textual nor the compact binary encoding rules
+the same `Value` to yield different binary `Repr`s.
 for `Value`s force canonical serializations. Two serializations of the
 same `Value` may yield different binary `Repr`s.
 ## Acknowledgements
-The use of low-order bits of each lead byte for the length of short
+The use of the low-order bits in certain SignedInteger tags for the
-values is inspired by a similar feature of [CBOR](http://cbor.io/).
+length of the following data is inspired by a similar feature of
 [CBOR](http://cbor.io/).
 The treatment of commas as whitespace in the text syntax is inspired
 by the same feature of [EDN](https://github.com/edn-format/edn).
@ -889,126 +787,42 @@ syntax.
 ## Appendix. Autodetection of textual or binary syntax
-Whitespace characters `0x09` (ASCII HT (tab)), `0x0A` (LF), `0x0D`
+Every tag byte in a binary Preserves `Document` falls within the range
-(CR), `0x20` (space) and `0x2C` (comma) are ignored at the start of a
+[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
-textual-syntax Preserves `Document`, and their UTF-8 encodings are
+bytes*, and will never occur as the first byte of a UTF-8 encoded code
-reserved lead byte values in binary-syntax Preserves.
+point. This means no binary-encoded document can be misinterpreted as
 valid UTF-8.
-The byte `0xFF`, signifying a no-op in binary-syntax Preserves, has no
+Conversely, a UTF-8 document must start with a valid codepoint,
-meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in
+meaning in particular that it must not start with a byte in the range
-a valid textual-syntax Preserves `Document`.
+[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
 Preserves document can be misinterpreted as a binary-syntax document.
-If applications prefix their textual-syntax documents with e.g. a
+Examination of the top two bits of the first byte of a document gives
-space or newline character, and their binary-syntax documents with a
+its syntax: if the top two bits are `10`, it should be interpreted as
-`0xFF` byte, consumers of these documents may reliably autodetect the
+a binary-syntax document; otherwise, it should be interpreted as text.
 syntax being used. In a network protocol supporting this kind of
 autodetection, clients may transmit LF or `0xFF` to select text or
 binary syntax, respectively.
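The top-two-bits rule for the new tag range [`0x80`, `0xBF`] can be sketched as follows. The function name is hypothetical, and rejecting empty input is an assumption of this sketch, since an empty document has no first byte to examine.

```python
def detect_syntax(document):
    """Classify a document as 'binary' or 'text' from its first byte."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # Top two bits 10 means a tag byte in 0x80..0xBF, i.e. binary syntax.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"


print(detect_syntax(bytes.fromhex("B584")))  # binary
print(detect_syntax(b"[1 2 3]"))             # text
```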
-Furthermore, if an application consistently uses `Record`s for its
+## Appendix. Table of tag values
 top-level messages,[^records-and-nonatoms] eschewing `Atom`s in
 particular, then autodetection of the encoding used for a given input
 can be done as follows:
-| First byte of encoded input              | Encoding | Other conclusions                       |
+     80 - False
-| ---                                      | ---      | ---                                     |
+     81 - True
-| `0x80`--`0x8F`                           | binary   | `Record` (format B)                     |
+     82 - Float
-| `0x28`                                   | binary   | `Record` (format C)                     |
+     83 - Double
-| `0x05`                                   | binary   | annotated value (presumably a `Record`) |
+     84 - End marker
-| `0xFF`                                   | binary   | no-op; value will follow                |
+     85 - Annotation
-| ---                                      | ---      | ---                                     |
+    (8x)  RESERVED 86-8F
 | `0x3C` ("<")                             | text     | `Record`                                |
 | `0x40` ("@")                             | text     | annotated value (presumably a `Record`) |
 | `0x09`, `0x0A`, `0x0D`, `0x20` or `0x2C` | text     | whitespace; value will follow           |
-  [^records-and-nonatoms]: Similar reasoning can be used to permit
+     9x - Small integers 0..12,-3..-1
-    unambiguous detection of encoding when `Collection`s are allowed
+     An - Medium integers, (n+1) bytes long
-    as top-level messages as well as `Record`s.
+     B0 - Large integers, variable length
     B1 - String
     B2 - ByteString
     B3 - Symbol
-## Appendix. Table of lead byte values
+     B4 - Record
-
+     B5 - Sequence
-     00 - False
+     B6 - Set
-     01 - True
+     B7 - Dictionary
     02 - Float
     03 - Double
     04 - End stream
     05 - Annotation
    (0x)  RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
    (1x)  RESERVED
     2x - Start Stream (NB. 20, 2C specially reserved)
     3x - Small integers 0..12,-3..-1
     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol
     8x - Record
     9x - Sequence
     Ax - Set
     Bx - Dictionary
    (Cx)  RESERVED C0-CF
    (Dx)  RESERVED D0-DF
    (Ex)  RESERVED E0-EF
    (Fx)  RESERVED F0-FE
     FF   No-op
 ## Appendix. Bit fields within lead byte values
     tt nn mmmm  contents
     ---------- ---------
     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary
     00 00 0100  End Stream (to match a previous Start Stream)
     00 00 0101  Annotation; two more Reprs follow
     00 00 1001  (ASCII HT (tab))  \
     00 00 1010  (ASCII LF)        |- Reserved: may be used to indicate
     00 00 1101  (ASCII CR)        /    use of text encoding
     00 01 xxxx  error, RESERVED
     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                               When nn = 00 --> (ASCII space)
                                           Reserved: may be used to indicate
                                             use of text encoding
                                         otherwise --> error
                             01 --> each chunk is a ByteString
                             10 --> each chunk is a single encoded Value
                             11 --> error (RESERVED)
                               When nn = 00 --> (ASCII comma)
                                           Reserved: may be used to indicate
                                             use of text encoding
                                         otherwise --> error
     00 11 xxxx  Small integers 0..12,-3..-1
     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary
     10 00 mmmm  Record
     10 01 mmmm  Sequence
     10 10 mmmm  Set
     10 11 mmmm  Dictionary
     11 00 xxxx  error, RESERVED
     11 01 xxxx  error, RESERVED
     11 10 xxxx  error, RESERVED
     11 11 1111  no-op; unambiguous indication of binary Preserves format
 Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
 `m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
 decoding the varint that follows.
 Then, `l` is the length of the body that follows, counted in bytes for
 `tt`=`01` and in `Repr`s for `tt`=`10`.
 ## Appendix. Binary SignedInteger representation
@ -1016,17 +830,17 @@ Languages that provide fixed-width machine word types may find the
 following table useful in encoding and decoding binary `SignedInteger`
 values.
-| Integer range                                  | Bytes required | Encoding (hex)                               |
+| Integer range                              | Bytes required | Encoding (hex)                               |
-| ---                                            | ---            | ---                                          |
+| ---                                        | ---            | ---                                          |
-| -3 ≤ n < 13 (numbers -3..12 encoded specially) | 1              | `3X`                                         |
+| -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
-| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)        | 2              | `41` `XX`                                    |
+| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
-| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16)     | 3              | `42` `XX` `XX`                               |
+| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
-| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24)     | 4              | `43` `XX` `XX` `XX`                          |
+| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
-| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32)     | 5              | `44` `XX` `XX` `XX` `XX`                     |
+| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
-| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40)     | 6              | `45` `XX` `XX` `XX` `XX` `XX`                |
+| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
-| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48)     | 7              | `46` `XX` `XX` `XX` `XX` `XX` `XX`           |
+| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
-| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56)     | 8              | `47` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
+| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
-| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64)     | 9              | `48` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
+| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
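Going the other way, a decoder for the new `SignedInteger` tags might look like the sketch below. The function name is hypothetical, the varint following `0xB0` is assumed to be little-endian base-128 as defined elsewhere in the specification, and the sketch does not reject non-shortest encodings, which a strict decoder may want to do.

```python
def decode_signed_integer(data):
    """Return (value, bytes consumed) for a SignedInteger Repr at data[0]."""
    tag = data[0]
    if 0x90 <= tag <= 0x9C:
        return tag - 0x90, 1  # small integers 0..12
    if 0x9D <= tag <= 0x9F:
        return tag - 0xA0, 1  # small integers -3..-1
    if 0xA0 <= tag <= 0xAF:
        length, start = tag - 0xA0 + 1, 1  # 1..16 body bytes
    elif tag == 0xB0:
        length, shift, start = 0, 0, 1
        while True:  # read the varint length, low 7 bits first
            byte = data[start]
            start += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if byte < 0x80:
                break
    else:
        raise ValueError(f"not a SignedInteger tag: {tag:#04x}")
    body = data[start:start + length]
    if len(body) != length:
        raise ValueError("truncated SignedInteger")
    return int.from_bytes(body, "big", signed=True), start + length


print(decode_signed_integer(bytes.fromhex("A1FEFF")))  # (-257, 3)
```

In a language with fixed-width machine words, the `0xA0` to `0xA7` cases map onto the i8 to i64 rows of the table, and anything longer needs a big-integer type.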
 <!-- Heading to visually offset the footnotes from the main document: -->
 ## Notes
--- a/questions.md
+++ b/questions.md
@ -29,16 +29,3 @@ not. There's only one (?) at the moment, the `%i"f"` in `Float`;
 should it be changed to case-sensitive?
 Q. Should `IOList`s be wrapped in an identifying unary record constructor?
 TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
 TODO: Probably should add a canonicalized subset. Consider adding
 explicit "I promise this is canonical" marker, like a BOM, which
 identifies a binary value as (first) binary and (second, optionally)
 as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
 text; this might be a good candidate for a marker sequence.
 ((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
 link escape"; it is not a printable ASCII character, and is disallowed
 in the textual Preserves grammar; and it is also mnemonic for "version
 0", since it is the Preserves binary encoding of the small integer
 zero.))