Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

2019-07-13 22:20:22 -04:00 · 2019-07-13 22:20:22 -04:00 · e2b859e55d
parent d349e89ea4
commit e2b859e55d
2 changed files with 224 additions and 172 deletions
--- a/preserves.css
+++ b/preserves.css
@ -1,5 +1,6 @@
 body {
-  font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
+    font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
+    box-sizing: border-box;
 }
@media screen {
  body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
@ -59,3 +60,20 @@ h2#notes:before {
 }

 .footnotes > ol { padding: 0; font-size: 90%; }
+
+table {
+    border-collapse: collapse;
+    width: 100%;
+}
+thead tr {
+    border-bottom: solid black 1px;
+}
+th {
+    font-weight: normal;
+    text-align: left;
+    padding-right: 0.5rem;
+    padding-bottom: 0.3rem;
+}
+td {
+    padding-right: 0.5rem;
+}
--- a/preserves.md
+++ b/preserves.md
@ -6,7 +6,7 @@
 # Preserves: an Expressive Data Language

 Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
-November 2018. Version 0.0.4.
+June 2019. Version 0.0.5.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
@ -72,9 +72,8 @@ follows:[^ordering-by-syntax]
                              < String < ByteString < Symbol

  [^ordering-by-syntax]: The observant reader may note that the
-    ordering here is (almost) the same as that implied by the tagging
-    scheme used in the concrete binary syntax for `Value`s. (The
-    exception is the syntax for small integers near zero.)
+    ordering here is the same as that implied by the tagging scheme
+    used in the concrete binary syntax for `Value`s.

 **Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
 neither is less than the other according to the total order.
@ -400,7 +399,8 @@ itself have annotations.
            Value =/ ws "@" Value Value

 Each annotation is preceded by `@`; the underlying annotated value
-follows its annotations.
+follows its annotations. Here we extend only the syntactic nonterminal
+named "`Value`" without altering the semantic class of `Value`s.

 **Equivalence.** Annotations appear within syntax denoting a `Value`;
 however, the annotations are not part of the denoted value. They are
@ -411,19 +411,24 @@ Reflective tools such as debuggers, user interfaces, and message
 routers and relays---tools which process `Value`s generically---may
 use annotated inputs to tailor their operation, or may insert
 annotations in their outputs. By contrast, in ordinary programs, as a
-rule of thumb, the presence, absence or specific value of an
-annotation should not change the control flow or output of the
-program. Annotations are data *describing* `Value`s, and are not in
-the domain of any specific application of `Value`s. That is, an
-annotation will almost never cause a non-reflective program to do
-anything observably different.
+rule of thumb, the presence, absence or content of an annotation
+should not change the control flow or output of the program.
+Annotations are data *describing* `Value`s, and are not in the domain
+of any specific application of `Value`s. That is, an annotation will
+almost never cause a non-reflective program to do anything observably
+different.

 ## Compact Binary Syntax

-A `Repr` is an encoding, or representation, of a specific `Value`.
-Each `Repr` comprises one or more bytes describing first the kind of
-represented `Value` and the length of the representation, and then the
-encoded details of the `Value` itself.
+A `Repr` is a binary-syntax encoding, or representation, of either
+
+ - a `Value`,
+ - a "placeholder" for a `Value`, or
+ - an annotation on a `Repr`.
+
+Each `Repr` comprises one or more bytes describing the kind of
+represented information and the length of the representation, followed
+by the encoded details.

 For a value `v`, we write `[[v]]` for the `Repr` of v.

@ -431,19 +436,16 @@ For a value `v`, we write `[[v]]` for the `Repr` of v.

 Each `Repr` takes one of three possible forms:

- - (A) a fixed-length form, used for simple values such as `Boolean`s
-   or `Float`s.
+ - (A) a type-specific form, used for simple values such as
+   `Boolean`s or `Float`s, for placeholders, and for introducing
+   annotations.

 - (B) a variable-length form with length specified up-front, used for
-   almost all `Record`s as well as for most `Sequence`s and `String`s,
-   when their sizes are known at the time serialization begins.
+   compound and variable-length atomic data structures when their
+   sizes are known at the time serialization begins.

 - (C) a variable-length streaming form with unknown or unpredictable
-   length, used only seldom for `Record`s, since the number of fields
-   in a `Record` is usually statically known, but sometimes used for
-   `Sequence`s, `String`s etc., such as in cases when serialization
-   begins before the number of elements or bytes in the corresponding
-   `Value` is known.
+   length, used in cases when serialization begins before the number
+   of elements or bytes in the corresponding `Value` is known.

 Applications may choose between formats B and C depending on their
 needs at serialization time.
@ -455,30 +457,33 @@ Every `Repr` starts with a *lead byte*, constructed by

    leadbyte(t,n,m) = [t*64 + n*16 + m]

-The arguments `t` and `n` describe the rest of the
-representation:[^some-encodings-unused]
+The arguments `t`, `n` and `m` describe the rest of the
+representation.[^some-encodings-unused]

  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

- - `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation.
- - `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s.
- - `t`=0, `n`=2 (format C) is a Stream Start byte.
- - `t`=0, `n`=3 (format C) is a Stream End byte.
- - `t`=1 (format B) represents an `Atom` with variable-length binary representation.
- - `t`=2 (format B) represents a `Record`.
- - `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`.
+| `t` | `n` | `m` | Meaning |
+| --- | --- | --- | ------- |
+|  0  |  0  | 0–3 | (format A) An `Atom` with fixed-length binary representation |
+|  0  |  0  | 4   | (format C) Stream end |
+|  0  |  0  | 5   | (format A) Annotation |
+|  0  |  1  |     | (format A) Placeholder for an application-specific `Value` |
+|  0  |  2  |     | (format C) Stream start |
+|  0  |  3  |     | (format A) Certain small `SignedInteger`s |
+|  1  |     |     | (format B) An `Atom` with variable-length binary representation |
+|  2  |     |     | (format B) A `Compound` with variable-length representation |

-#### Encoding data of fixed length (format A).
+#### Encoding data of type-specific length (format A).

-Each specific type of data defines its own rules for this format.
+Each type of data defines its own rules for this format.

 #### Encoding data of known length (format B).

-A `Repr` where the length of the `Value` to be encoded is variable but
-known uses the value of `m` in `leadbyte` to encode its length. The
-length counts *bytes* for atomic `Value`s, but counts *contained
-values* for compound `Value`s.
+Format B is used where the length `l` of the `Value` to be encoded is
+known when serialization begins. Format B `Repr`s use `m` in
+`leadbyte` to encode `l`. The length counts *bytes* for atomic
+`Value`s, but counts *contained values* for compound `Value`s.

 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
@ -503,11 +508,13 @@ definition,
 > two's complement representation of the number in groups of 7 bits,
 > least significant group first.

-**Examples.**
+The following table illustrates varint-encoding.

- - The varint representation of 15 is just the byte 15.
- - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
- - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
+| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
+| ------ | ------------------- | ------------ |
+| 15 | `0001111` | 15 |
+| 300 | `0000010 0101100` | 172 2 |
+| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
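
As a rough illustration of the table above, the following Python sketch implements `leadbyte`, `varint` and a `header` helper. The function names mirror the notation used in this document; the Python rendering is an assumption, as is the reading that `header` uses `m`=15 followed by `varint(l)` for any length of 15 or more. It is not a reference implementation.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte."""
    assert 0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16
    return bytes([t * 64 + n * 16 + m])


def varint(m: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    assert m >= 0
    out = bytearray()
    while m >= 128:
        out.append((m & 0x7F) | 0x80)  # high bit set: more groups follow
        m >>= 7
    out.append(m)                      # final group, high bit clear
    return bytes(out)


def header(t: int, n: int, l: int) -> bytes:
    """Lead byte carrying length l, spilling into a varint when l >= 15."""
    if l < 15:
        return leadbyte(t, n, l)
    return leadbyte(t, n, 15) + varint(l)


print(list(varint(300)))          # the 300 row of the table
print(header(0, 1, 102).hex())    # a lead byte with m=15 plus a one-byte varint
```

The three rows of the table correspond to `varint(15)`, `varint(300)` and `varint(1000000000)`.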

 #### Streaming data of unknown length (format C).

@ -516,61 +523,69 @@ not known at the time serialization of the `Value` starts is encoded
 by a single Stream Start (“open”) byte, followed by zero or more
 *chunks*, followed by a matching Stream End (“close”) byte:

-     open(t,n) = leadbyte(0,2, t*4 + n)
-    close(t,n) = leadbyte(0,3, t*4 + n)
+     open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
+       close() = leadbyte(0,0, 4)       = [0x04]

-For a `Repr` of a `Value` containing binary data, each chunk is to be
-a format B `Repr` of a `ByteString`, no matter the type of the overall
-`Repr`.
+For a format C `Repr` of an atomic `Value`, each chunk is to be a
+format B `Repr` of a `ByteString`, no matter the type of the overall
+`Value`. Annotations are not allowed on these individual chunks.

-For a `Repr` of a `Value` containing other `Value`s, each chunk is to
-be a single `Repr`.
+For a format C `Repr` of a compound `Value`, each chunk is to be a
+single `Repr`, which may itself be annotated.
+
+Each chunk within a format C `Repr` *MUST* have non-zero length.
+Software that decodes `Repr`s *MUST* reject `Repr`s that include
+zero-length chunks.
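
A minimal sketch of the streaming rules just described, assuming a hypothetical helper `stream_string` that takes the chunks as already-encoded UTF-8 byte strings. It emits `open(1,1)` for a `String`, wraps each chunk as a format B `ByteString`, refuses zero-length chunks, and finishes with the single `close()` byte. The helper names are illustrative only.

```python
STREAM_END = b"\x04"  # close() = leadbyte(0,0,4)


def varint(m: int) -> bytes:
    out = bytearray()
    while m >= 128:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t: int, n: int, l: int) -> bytes:
    if l < 15:
        return bytes([t * 64 + n * 16 + l])
    return bytes([t * 64 + n * 16 + 15]) + varint(l)


def open_stream(t: int, n: int) -> bytes:
    """open(t,n) = leadbyte(0,2, t*4 + n)."""
    return bytes([0x20 + t * 4 + n])


def stream_string(chunks) -> bytes:
    """Encode a String in format C from an iterable of UTF-8 byte chunks."""
    out = bytearray(open_stream(1, 1))           # String: t=1, n=1
    for chunk in chunks:
        if len(chunk) == 0:
            raise ValueError("zero-length chunks are forbidden in format C")
        out += header(1, 2, len(chunk)) + chunk  # each chunk is a ByteString
    out += STREAM_END
    return bytes(out)


print(stream_string([b"he", b"llo"]).hex(" "))
```

A decoder would apply the same check in reverse, rejecting the input as soon as it reads a chunk header whose length is zero.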

 ### Records.

 Format B (known length):

-    [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
+    [[ L(F_1...F_m) ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

 For `m` fields, `m+1` is supplied to `header`, to account for the
 encoding of the record label.

 Format C (streaming):

-    [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
+    [[ L(F_1...F_m) ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()

 Applications *SHOULD* prefer the known-length format for encoding
 `Record`s.

-#### Application-specific short form for labels.
+### Placeholders.

-Any given protocol using Preserves may additionally define an
-interpretation for `n`∈{0,1,2}, mapping each *short form label
-number* `n` to a specific record label. When encoding `m` fields with
-short form label number `n`, format B becomes
+Any given protocol using Preserves may define an interpretation for
+numbered *placeholders* in the binary syntax, mapping each
+*placeholder number* `n` to a specific `Value`. For example, a
+placeholder number may be assigned for a frequently-used `Record`
+label.

-    header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]
+A `Value` `v` for which placeholder number `n` has been assigned may
+be tersely encoded as

-and format C becomes
+    [[v]] = header(0,1,n)  when n is a placeholder number for v

-    open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)
+**Examples.** A protocol may choose to assign placeholder number 4 to
+the symbol `void`, making

-**Examples.** For example, a protocol may choose to map records
-labelled `void` to `n=0`, making
+    [[void]] = header(0,1,4) = [0x14]
+    [[void()]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]

-    [[void()]] = header(2,0,0) = [0x80]
+or it may map symbol `person` to placeholder number 102, making

-or it may map records labelled `person` to short form label number 1,
-making
+    [[person]] = header(0,1,102) = [0x1F, 0x66]
+
+and so

    [[person("Dr", "Elizabeth", "Blackwell")]]
-        = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
-        =        [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+      = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+      =          [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

 for format B, or

-        = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
-        =    [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
+    open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
+       = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]

 for format C.
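
The placeholder examples above can be reproduced with a short Python sketch. The helpers `placeholder`, `string` and `record` are hypothetical names chosen for illustration, and the placeholder numbers 4 and 102 are the ones assumed in the examples, not part of any real protocol.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while m >= 128:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t: int, n: int, l: int) -> bytes:
    if l < 15:
        return bytes([t * 64 + n * 16 + l])
    return bytes([t * 64 + n * 16 + 15]) + varint(l)


def placeholder(number: int) -> bytes:
    """Repr of whatever Value the protocol assigned to this placeholder number."""
    return header(0, 1, number)


def string(s: str) -> bytes:
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data


def record(label_repr: bytes, *field_reprs: bytes) -> bytes:
    """Format B Record: the count covers the label plus the fields."""
    return header(2, 0, len(field_reprs) + 1) + label_repr + b"".join(field_reprs)


void_record = record(placeholder(4))
person = record(placeholder(102), string("Dr"), string("Elizabeth"), string("Blackwell"))
print(void_record.hex(" "))
print(person.hex(" "))
```

Because a placeholder stands for a whole `Value`, the same `record` helper works unchanged whether the label is a placeholder or an ordinary `Symbol`.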

@ -578,9 +593,9 @@ for format C.

 Format B (known length):

-            [[ [X_1...X_m] ]] = header(3,0,m)   ++ [[X_1]] ++...++ [[X_m]]
-        [[ #set{X_1...X_m} ]] = header(3,1,m)   ++ [[X_1]] ++...++ [[X_m]]
-    [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
+            [[ [X_1...X_m] ]] = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
+        [[ #set{X_1...X_m} ]] = header(2,2,m)   ++ [[X_1]] ++...++ [[X_m]]
+    [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

 Note that `m*2` is given to `header` for a `Dictionary`, since there
@ -588,10 +603,10 @@ are two `Value`s in each key-value pair.

 Format C (streaming):

-            [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
-        [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
-    [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
-                                          ++ [[K_m]] ++ [[V_m]] ++ close(3,2)
+            [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
+        [[ #set{X_1...X_m} ]] = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close()
+    [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
+                                          ++ [[K_m]] ++ [[V_m]] ++ close()

 Applications may use whichever format suits their needs on a
 case-by-case basis.
@ -616,15 +631,13 @@ order.
    the option of serializing with set elements and dictionary keys in
    sorted order.

-Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.
-
 ### SignedIntegers.

 Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
-                                     header(0,1,x+16)              if -3≤x<0
-                                     header(0,1,x)                 if 0≤x<13
+                                     header(0,3,x+16)              if -3≤x<0
+                                     header(0,3,x)                 if 0≤x<13

 Integers in the range [-3,12] are compactly represented using format A
 because they are so frequently used. Other integers are represented
@ -644,22 +657,22 @@ needed to unambiguously identify the value and its sign, and `m =

 For example,

-    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 1D       [[    128 ]] = 42 00 80
-    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 1E       [[    255 ]] = 42 00 FF
-    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 1F       [[    256 ]] = 42 01 00
-    [[   -254 ]] = 42 FF 02    [[      0 ]] = 10       [[  32767 ]] = 42 7F FF
-    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 11       [[  32768 ]] = 43 00 80 00
-    [[   -128 ]] = 41 80       [[     12 ]] = 1C       [[  65535 ]] = 43 00 FF FF
+    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 3D       [[    128 ]] = 42 00 80
+    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 3E       [[    255 ]] = 42 00 FF
+    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 3F       [[    256 ]] = 42 01 00
+    [[   -254 ]] = 42 FF 02    [[      0 ]] = 30       [[  32767 ]] = 42 7F FF
+    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 31       [[  32768 ]] = 43 00 80 00
+    [[   -128 ]] = 41 80       [[     12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00

 ### Strings, ByteStrings and Symbols.

 Syntax for these three types varies only in the value of `n` supplied
-to `header`, `open`, and `close`. In each case, the payload following
-the header is a binary sequence; for `String` and `Symbol`, it is a
-UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
-is the raw data contained within the `Value` unmodified.
+to `header` and `open`. In each case, the payload following the header
+is a binary sequence; for `String` and `Symbol`, it is a UTF-8
+encoding of the `Value`'s code points, while for `ByteString` it is
+the raw data contained within the `Value` unmodified.

 Format B (known length):

@ -671,7 +684,8 @@ Format B (known length):

 To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
 then a sequence of zero or more format B chunks, followed by
-`close(1,n)`. Every chunk must be a `ByteString`.
+`close()`. Every chunk must be a `ByteString`, and no chunk may be
+annotated.

 While the overall content of a streamed `String` or `Symbol` must be
 valid UTF-8, individual chunks do not have to conform to UTF-8.
@ -680,8 +694,8 @@ valid UTF-8, individual chunks do not have to conform to UTF-8.

 Fixed-length atoms all use format A, and do not have a length
 representation. They repurpose the bits that format B `Repr`s use to
-specify lengths. Applications *MUST NOT* use format C with
-`open(0,n)` or `close(0,n)` for any `n`.
+specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
+for any `n`.

 #### Booleans.

@ -696,6 +710,18 @@ specify lengths. Applications *MUST NOT* use format C with
 The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 8-byte IEEE 754 binary representations of `F` and `D`, respectively.

+### Annotations.
+
+To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
+`[0x05] ++ [[v]]`.
+
+For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
+an empty sequence annotated with two symbols, `a` and `b`, is
+
+    [[ @a @b [] ]]
+      = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
+      = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
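
The same example can be written as a Python sketch. The helpers `symbol`, `sequence` and `annotate` are hypothetical names; for brevity the first two handle only lengths below 15, so no varint is needed.

```python
ANNOTATION = b"\x05"  # leadbyte(0,0,5)


def symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    assert len(data) < 15, "sketch handles short symbols only"
    return bytes([0x70 + len(data)]) + data


def sequence(*item_reprs: bytes) -> bytes:
    assert len(item_reprs) < 15, "sketch handles short sequences only"
    return bytes([0x90 + len(item_reprs)]) + b"".join(item_reprs)


def annotate(repr_: bytes, *annotation_reprs: bytes) -> bytes:
    """Prefix repr_ with its annotations, in the order given."""
    out = bytearray()
    for annotation in annotation_reprs:
        out += ANNOTATION + annotation
    return bytes(out) + repr_


encoded = annotate(sequence(), symbol("a"), symbol("b"))
print(encoded.hex(" "))
```

Each annotation contributes one `0x05` byte followed by a complete `Repr`, so a reader that wants to skip annotations must still parse each annotation `Repr` to find where the annotated value begins.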
+
 ## Examples

 ### Simple examples.
@ -703,25 +729,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
 <!-- TODO: Give some examples of large and small Preserves, perhaps -->
 <!-- translated from various JSON blobs floating around the internet. -->

-For the following examples, imagine an application that maps `Record`
-short form label number 0 to label `discard`, 1 to `capture`, and 2 to
+For the following examples, imagine an application that maps
+placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
 `observe`.

 | Value                                             | Encoded hexadecimal byte sequence                                    |
 |---------------------------------------------------|----------------------------------------------------------------------|
-| `capture(discard())`                              | 91 80                                                                |
-| `observe(speak(discard(), capture(discard())))`   | A1 B3 75 73 70 65 61 6B 80 91 80                                     |
-| `[1 2 3 4]` (format B)                            | C4 11 12 13 14                                                       |
-| `[1 2 3 4]` (format C)                            | 2C 11 12 13 14 3C                                                    |
-| `[-2 -1 0 1]`                                     | C4 1E 1F 10 11                                                       |
+| `capture(discard())`                              | 82 11 81 10                                                          |
+| `observe(speak(discard(), capture(discard())))`   | 82 12 83 75 73 70 65 61 6B 81 10 82 11 81 10                         |
+| `[1 2 3 4]` (format B)                            | 94 31 32 33 34                                                       |
+| `[1 2 3 4]` (format C)                            | 29 31 32 33 34 04                                                    |
+| `[-2 -1 0 1]`                                     | 94 3E 3F 30 31                                                       |
 | `"hello"` (format B)                              | 55 68 65 6C 6C 6F                                                    |
 | `"hello"` (format C, 2 chunks)                    | 25 62 68 65 63 6C 6C 6F 35                                           |
-| `"hello"` (format C, 5 chunks)                    | 25 62 68 65 62 6C 6C 60 60 61 6F 35                                  |
-| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
+| `"hello"` (format C, 5 chunks)                    | 25 61 68 61 65 61 6C 61 6C 61 6F 04                                  |
+| `["hello" there #"world" [] #set{} #true #false]` | 97 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 90 A0 01 00 |
 | `-257`                                            | 42 FE FF                                                             |
-| `-1`                                              | 1F                                                                   |
-| `0`                                               | 10                                                                   |
-| `1`                                               | 11                                                                   |
+| `-1`                                              | 3F                                                                   |
+| `0`                                               | 30                                                                   |
+| `1`                                               | 31                                                                   |
 | `255`                                             | 42 00 FF                                                             |
 | `1.0f`                                            | 02 3F 80 00 00                                                       |
 | `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                           |
@ -733,20 +759,20 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R

 encodes to

-    B5                              ;; Record, generic, 4+1
-      C5                              ;; Sequence, 5
+    85                              ;; Record, generic, 4+1
+      95                              ;; Sequence, 5
        76 74 69 74 6C 65 64            ;; Symbol, "titled"
        76 70 65 72 73 6F 6E            ;; Symbol, "person"
-        12                              ;; SignedInteger, "2"
+        32                              ;; SignedInteger, "2"
        75 74 68 69 6E 67               ;; Symbol, "thing"
-        11                              ;; SignedInteger, "1"
+        31                              ;; SignedInteger, "1"
      41 65                           ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
-      B4                              ;; Record, generic, 3+1
+      84                              ;; Record, generic, 3+1
        74 64 61 74 65                  ;; Symbol, "date"
        42 07 1D                        ;; SignedInteger, "1821"
-        12                              ;; SignedInteger, "2"
-        13                              ;; SignedInteger, "3"
+        32                              ;; SignedInteger, "2"
+        33                              ;; SignedInteger, "3"
      52 44 72                        ;; String, "Dr"

  [^extensibility2]: It happens to line up with Racket's
@ -787,19 +813,19 @@ read as `Symbol`s. The first example:

 encodes to binary as follows:

-    E2
+    B2
      55 "Image"
-      EC
+      BC
        55 "Width"    42 03 20
        55 "Title"    5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height"   42 02 58
        59 "Thumbnail"
-          E6
+          B6
            55 "Width"  41 64
            53 "Url"    5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
-            53 "IDs"    C4
+            53 "IDs"    94
                          41 74
                          42 03 AF
                          42 00 EA
@ -832,8 +858,8 @@ and the second example:

 encodes to binary as follows:

-    C2
-      EF 10
+    92
+      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 E2 26 80 9D 49 52
        59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
@ -842,7 +868,7 @@ encodes to binary as follows:
        55 "State"      52 "CA"
        53 "Zip"        55 "94107"
        57 "Country"    52 "US"
-      EF 10
+      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
        59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
@ -957,16 +983,17 @@ such media types following the general rules for ordering of

 | Value                                      | Encoded hexadecimal byte sequence                                                                                 |
 |--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
-| `mime(application/octet-stream #"abcde")`  | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
-| `mime(text/plain #"ABC")`                  | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
-| `mime(application/xml #"<xhtml/>")`        | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
-| `mime(text/csv #"123,234,345")`            | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |
+| `mime(application/octet-stream #"abcde")`  | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
+| `mime(text/plain #"ABC")`                  | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
+| `mime(application/xml #"<xhtml/>")`        | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
+| `mime(text/csv #"123,234,345")`            | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |

 Applications making heavy use of `mime` records may choose to use a
-short form label number for the record type. For example, if short
-form label number 1 were chosen, the second example above,
-`mime(text/plain "ABC")`, would be encoded with "92" in place of "B3
-74 6D 69 6D 65".
+placeholder number for the symbol `mime` as well as the symbols for
+individual media types. For example, if placeholder number 1 were
+chosen for `mime`, and placeholder number 7 for `text/plain`, the
+second example above, `mime(text/plain #"ABC")`, would be encoded as
+`83 11 17 63 41 42 43`.

 ### Unicode normalization forms.

@ -1027,20 +1054,23 @@ or `date-time` productions of

 ## Security Considerations

-**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
-`Symbol`s may include chunks of zero length. This opens up a
-possibility for denial-of-service: an attacker may begin streaming a
-string, sending an endless sequence of zero length chunks, appearing
-to make progress but not actually doing so. Implementations may place
-optional reasonable restrictions on the number of consecutive empty
-chunks that may appear in a stream, and may even supply an optional
-mode that rejects empty chunks entirely.
+**Empty chunks.** Chunks of zero length are prohibited in streamed
+(format C) `Repr`s. However, a malicious or broken encoder may include
+them nonetheless, which opens up a possibility for denial-of-service:
+an attacker may begin streaming a `String`, for example, and then send
+an endless sequence of zero-length chunks, appearing to make progress
+but not actually doing so. Implementations *MUST* reject zero-length
+chunks when decoding, and *MUST NOT* produce them when encoding.

 **Whitespace.** Similarly, the textual format for `Value`s allows
 arbitrary whitespace in many positions. In streaming transfer
 situations, consider optional restrictions on the amount of
 consecutive whitespace that may appear in a serialized `Value`.

+**Annotations.** Similarly, in modes where a `Value` is being read
+while annotations are skipped, an endless sequence of annotations may
+give an illusion of progress.
+
 **Canonical form for cryptographic hashing and signing.** As
 specified, neither the textual nor the compact binary encoding rules
 for `Value`s force canonical serializations. Two serializations of the
@ -1052,24 +1082,26 @@ same `Value` may yield different binary `Repr`s.
     01 - True
     02 - Float
     03 - Double
-    (0x)  RESERVED 04-0F
-     1x - Small integers 0..12,-3..-1
+     04 - End Stream
+     05 - Annotation
+    (0x)  RESERVED 06-0F
+     1x - Placeholder
     2x - Start Stream
-     3x - End Stream
+     3x - Small integers 0..12,-3..-1

     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol

-     8x - short form Record label index 0
-     9x - short form Record label index 1
-     Ax - short form Record label index 2
-     Bx - Record
+     8x - Record
+     9x - Sequence
+     Ax - Set
+     Bx - Dictionary

-     Cx - Sequence
-     Dx - Set
-     Ex - Dictionary
+    (Cx)  RESERVED C0-CF
+    (Dx)  RESERVED D0-DF
+    (Ex)  RESERVED E0-EF
    (Fx)  RESERVED F0-FF

 ## Appendix. Bit fields within lead byte values
@ -1081,31 +1113,40 @@ same `Value` may yield different binary `Repr`s.
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary
+     00 00 0100  End Stream (to match a previous Start Stream)
+     00 00 0101  Annotation; two more Reprs follow

-     00 01 xxxx  Small integers 0..12,-3..-1
+     00 01 mmmm  Placeholder; m is the placeholder number

     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                             01 --> each chunk is a ByteString
-                             1x --> each chunk is a single encoded Value
-     00 11 ttnn  End Stream <tt,nn> (must match preceding Start Stream)
+                             10 --> each chunk is a single encoded Value
+                             11 --> error (RESERVED)
+
+     00 11 xxxx  Small integers 0..12,-3..-1

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

-     10 00 mmmm  application-specific Record
-     10 01 mmmm  application-specific Record
-     10 10 mmmm  application-specific Record
-     10 11 mmmm  Record
+     10 00 mmmm  Record
+     10 01 mmmm  Sequence
+     10 10 mmmm  Set
+     10 11 mmmm  Dictionary

-     11 00 mmmm  Sequence
-     11 01 mmmm  Set
-     11 10 mmmm  Dictionary
+     11 nn mmmm  error, RESERVED

-     If mmmm = 1111, a varint(m) follows, giving the length, before
-     the body; otherwise, m is the length of the body to follow.
+Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
+`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
+decoding the varint that follows.
+
+Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
+is the length of the body that follows, counted in bytes for `tt`=`01`
+and in `Repr`s for `tt`=`10`.
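
The decoding rule above might be sketched in Python as follows. The function names are hypothetical, and the sketch covers only lead bytes whose low four bits carry a length or placeholder number (`ttnn`=`0001`, `tt`=`01` or `tt`=`10`); the other `tt`=`00` forms give those bits different meanings.

```python
def split_leadbyte(b: int):
    """Return the (tt, nn, mmmm) bit fields of a lead byte."""
    return b >> 6, (b >> 4) & 0x3, b & 0xF


def read_header(buf: bytes, pos: int = 0):
    """Return (t, n, l, next_pos) for the header starting at buf[pos]."""
    t, n, m = split_leadbyte(buf[pos])
    pos += 1
    if m < 15:
        return t, n, m, pos
    # m == 15: the real value follows as a varint, least significant group first.
    l, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        l |= (byte & 0x7F) << shift
        shift += 7
        if byte < 128:        # high bit clear marks the last group
            return t, n, l, pos


print(read_header(bytes([0x1F, 0x66])))   # placeholder header
print(read_header(bytes([0x84])))         # Record header with a short count
```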
+
+<!-- Not yet ready

 ## Appendix. Representing Values in Programming Languages

@ -1118,6 +1159,9 @@ When designing a language mapping, an important consideration is
 roundtripping: serialization after deserialization, and vice versa,
 should both be identities.

+Also, the presence or absence of annotations on a `Value` should not
+affect comparisons of that `Value` to others in any way.
+
 ### JavaScript.

 - `Boolean` ↔ `Boolean`
@ -1211,6 +1255,8 @@ or `Record`s.
 - `Set` ↔ `Set`
 - `Dictionary` ↔ `Dictionary`

+-->
+
 ## Appendix. Why not Just Use JSON?

 <!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
@ -1367,18 +1413,14 @@ Q. Should "symbols" instead be URIs? Relative, usually; relative to
 what? Some domain-specific base URI?

 Q. Literal small integers: are they pulling their weight? They're not
-absolutely necessary. They mess up the connection between
-value-type-ordering and repr-tag-ordering! (The connection between
-*value* ordering and *repr* ordering is already irretrievably messed
-up: length prefixes blow lexicographic ordering away, sign bits are
-the wrong way around, floats are sign-magnitude, etc etc.)
+absolutely necessary.

 Q. Should we go for trying to make the data ordering line up with the
 encoding ordering? We'd have to only use streaming forms, and avoid
 the small integer encoding, and not store record arities, and sort
 sets and dictionaries, and mask floats and doubles (perhaps
 [like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
-and pick a specific `NaN`, and I don't know what to do about
+and perhaps pick a specific `NaN`, and I don't know what to do about
 SignedIntegers. Perhaps make them more like float formats, with the
 byte count acting as a kind of exponent underneath the sign bit.

@ -1413,11 +1455,3 @@ link escape"; it is not a printable ASCII character, and is disallowed
 in the textual Preserves grammar; and it is also mnemonic for "version
 0", since it is the Preserves binary encoding of the small integer
 zero.))
-
-IN PROGRESS: Remove the special short syntax for application-specific record
-label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record,
-Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free.
-
-TODO: Forbid empty chunks.
-
-## Notes