Progress

2018-09-23 22:35:00 +01:00 · 2018-09-23 22:35:00 +01:00 · 00a69ae012
parent f2f57385ce
commit 00a69ae012
1 changed files with 265 additions and 151 deletions
--- a/syndicate/mc/preserve.md
+++ b/syndicate/mc/preserve.md
@@ -298,73 +298,122 @@ connections to other data languages can also be made.
 For now, we limit our attention to an easily-parsed, easily-produced
 machine-readable syntax.

-Every `Value` is represented as one or more bytes describing first its
-kind and its length, and then its specific contents.
+A `Repr` is an encoding, or representation, of a specific `Value`.
+Each `Repr` comprises one or more bytes describing first the kind of
+represented `Value` and the length of the representation, and then the
+encoded details of the `Value` itself.

-For a value `v`, we write `[[v]]` for the encoding of v.
+For a value `v`, we write `[[v]]` for the `Repr` of v.

 The following figure summarises the definitions below:

    tt nn mmmm  varint(m)  contents
    -------------------------------

-    00 00 mmmm  ...        application-specific Record
-    00 01 mmmm  ...        application-specific Record
-    00 10 mmmm  ...        application-specific Record
-    00 11 mmmm  ...        Record
+    00 00 0000             False
+    00 00 0001             True
+    00 00 0010             Float, 32 bits big-endian binary
+    00 00 0011             Double, 64 bits big-endian binary
+    00 00 x1xx             RESERVED
+    00 00 1xxx             RESERVED
+    00 01 xxxx             RESERVED
+    00 10 ttnn             Start Stream <tt,nn>
+                             When tt = 00 --> error
+                                       01 --> each chunk is a <tt,nn> piece
+                                       1x --> each chunk is a single encoded Value
+    00 11 ttnn             End Stream <tt,nn> (must match preceding Start Stream)

-    01 00 mmmm  ...        Sequence
-    01 01 mmmm  ...        Set
-    01 10 mmmm  ...        Dictionary
+    01 00 mmmm  ...        SignedInteger, big-endian binary
+    01 01 mmmm  ...        String, UTF-8 binary
+    01 10 mmmm  ...        Bytes
+    01 11 mmmm  ...        Symbol, UTF-8 binary

-    10 00 mmmm  ...        SignedInteger, big-endian binary
-    10 01 mmmm  ...        String, UTF-8 binary
-    10 10 mmmm  ...        Bytes
-    10 11 mmmm  ...        Symbol, UTF-8 binary
+    10 00 mmmm  ...        application-specific Record
+    10 01 mmmm  ...        application-specific Record
+    10 10 mmmm  ...        application-specific Record
+    10 11 mmmm  ...        Record

-    11 00 0000             False
-    11 00 0001             True
-    11 00 0010             Float, 32 bits big-endian binary
-    11 00 0011             Double, 64 bits big-endian binary
+    11 00 mmmm  ...        Sequence
+    11 01 mmmm  ...        Set
+    11 10 mmmm  ...        Dictionary
+    11 11 xxxx             RESERVED

    If mmmm = 1111, varint(m) is present; otherwise, m is the length

 #### Type and Length representation

-A `Value`'s type and length is represented by use of a function
-`header(t,n,m)` that yields a sequence of bytes when `t`, `n` and `m`
-are appropriate non-negative integers.
+Each `Repr` takes one of three possible forms:

-    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
-                    or leadbyte(t,n,15) ++ varint(m)   otherwise
+ - (A) a fixed-length form, used for simple values such as `Boolean`s
+   or `Float`s.

-The lead byte in a `Value`'s representation is constructed by a function
+ - (B) a variable-length form with length specified up-front, used for
+   almost all `Record`s as well as for most `Sequence`s and `String`s,
+   when their sizes are known at the time serialization begins.
+
+ - (C) a variable-length streaming form with unknown or unpredictable
+   length, seldom used for `Record`s, since the number of fields in a
+   `Record` is usually statically known, but sometimes used for
+   `Sequence`s, `String`s etc., for example when serialization begins
+   before the number of elements or bytes in the corresponding
+   `Value` is known.
+
+Applications may choose between formats (B) and (C) depending on their
+needs at serialization time.
+
+Every `Repr`, however, starts with a *lead byte* describing the
+remainder of the representation.
+
+##### The lead byte
+
+The lead byte is constructed by a function `leadbyte`:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

+Both `t` and `n` are two-bit unsigned numbers; `m` is a four-bit
+unsigned number.
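
As a rough illustration of this bit layout, the following Python sketch packs and unpacks a lead byte. The helper `split_leadbyte` is a hypothetical name introduced here for illustration; it is not part of the specification.

```python
from typing import Tuple


def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must fit in 2 bits, m in 4 bits")
    return bytes([t * 64 + n * 16 + m])


def split_leadbyte(b: int) -> Tuple[int, int, int]:
    """Hypothetical inverse of leadbyte: recover (t, n, m) from a byte value."""
    return (b >> 6) & 0x3, (b >> 4) & 0x3, b & 0xF


print(leadbyte(2, 1, 3).hex())   # 93
print(split_leadbyte(0x93))      # (2, 1, 3)
```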
+
 The lead byte describes the rest of the representation as
 follows:[^some-encodings-unused]

-    leadbyte(0,-,-) represents a Record
-    leadbyte(1,-,-) represents a Sequence, Set or Dictionary
-    leadbyte(2,-,-) represents an Atom with variable-length binary representation
-    leadbyte(3,0,-) represents an Atom with fixed-length binary representation
-
  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

-Variable-length representations use the value of `m` to encode their
-lengths:
+ - `leadbyte(0,0,-)` (format A) represents an Atom with fixed-length binary representation.
+ - `leadbyte(0,1,-)` (format A) is RESERVED.
+ - `leadbyte(0,2,-)` (format C) is a Stream Start byte.
+ - `leadbyte(0,3,-)` (format C) is a Stream End byte.
+ - `leadbyte(1,-,-)` (format B) represents an Atom with variable-length binary representation.
+ - `leadbyte(2,-,-)` (format B) represents a Record.
+ - `leadbyte(3,-,-)` (format B) represents a Sequence, Set or Dictionary.

- - Lengths between 0 and 14 are represented using `leadbyte` with `m`
-   values 0 through 14.
- - Lengths of 15 or greater are represented by `m` value 15, and
-   additional "length bytes" describing the length then follow the
-   lead byte.
+##### Encoding data of fixed length (format A)

-These additional length bytes are formatted as
-[base 128 varints][varint]. Quoting the
-[Google Protocol Buffers][varint] definition,
+Each specific type of data defines its own rules for this format.
+
+##### Encoding data of known length (format B)
+
+A `Repr` where the length of the `Value` to be encoded is variable but
+known uses the value of `m` in `leadbyte` to encode its length. The
+length counts *bytes* for atomic `Value`s, but counts *contained
+values* for compound `Value`s.
+
+ - A length `l` between 0 and 14 is represented using `leadbyte` with
+   `m=l`.
+ - A length of 15 or greater is represented by `m=15` and additional
+   bytes describing the length following the lead byte.
+
+The function `header(t,n,m)` yields an appropriate sequence of bytes
+describing a `Repr`'s type and length when `t`, `n` and `m` are
+appropriate non-negative integers:
+
+    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
+                    or leadbyte(t,n,15) ++ varint(m)   otherwise
+
+The additional length bytes are formatted as
+[base 128 varints][varint]. We write `varint(m)` for the
+varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
+definition,

 > Each byte in a varint, except the last byte, has the most
 > significant bit (msb) set – this indicates that there are further
@@ -378,43 +427,93 @@ These additional length bytes are formatted as
 - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
 - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
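
The varint examples and the `header` definition can be checked with a short Python sketch. It assumes lengths are non-negative integers and follows the least-significant-group-first order that the two examples imply; it is an illustration, not a reference implementation.

```python
def varint(m: int) -> bytes:
    """Base 128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # more groups follow: set the msb
        else:
            out.append(low7)         # last byte: msb clear
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte, plus a varint length when m does not fit in four bits."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)


print(list(varint(300)))          # [172, 2]
print(list(varint(1000000000)))   # [128, 148, 235, 220, 3]
print(header(1, 3, 24).hex())     # 7f18
```

The last line corresponds to a 24-byte `Symbol`: the lead byte carries `m=15` and the real length follows as a varint.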

-We write `varint(m)` for the varint-encoding of `m`.
+##### Streaming data of unknown length (format C)
+
+A `Repr` where the length of the `Value` to be encoded is variable and
+not known at the time serialization of the `Value` starts is encoded
+by a single Stream Start byte, followed by zero or more *chunks*,
+followed by a matching Stream End byte:
+
+    startbyte(t,n) = leadbyte(0,2, t*4 + n)
+      endbyte(t,n) = leadbyte(0,3, t*4 + n)
+
+For a `Repr` of a `Value` containing binary data, each chunk is to be
+a format B `Repr` of the same type as the overall `Repr`.
+
+For a `Repr` of a `Value` containing other `Value`s, each chunk is to
+be a single `Repr`.
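
A minimal Python sketch of the framing follows. The `stream` helper is a hypothetical name, and the chunks are assumed to be already-encoded `Repr`s supplied by the caller; rejecting `t = 0` mirrors the "error" entry in the summary figure.

```python
from typing import Iterable


def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])


def startbyte(t: int, n: int) -> bytes:
    if t == 0:
        raise ValueError("tt = 00 cannot be streamed")
    return leadbyte(0, 2, t * 4 + n)


def endbyte(t: int, n: int) -> bytes:
    if t == 0:
        raise ValueError("tt = 00 cannot be streamed")
    return leadbyte(0, 3, t * 4 + n)


def stream(t: int, n: int, chunks: Iterable[bytes]) -> bytes:
    """Hypothetical helper: wrap encoded chunks in matching start/end bytes."""
    return startbyte(t, n) + b"".join(chunks) + endbyte(t, n)


# Stream the Sequence [1 2 3 4]; each chunk is one encoded SignedInteger.
ints = [bytes([0x41, i]) for i in (1, 2, 3, 4)]
print(stream(3, 0, ints).hex())  # 2c41014102410341043c
```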

 #### Records

-    [[ (L F_1 ... F_m) ]] = header(0,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]
+Format B (known length):
+
+    [[ (L F_1 ... F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]

 For `m` fields, `m+1` is supplied to `header`, to account for the
 encoding of the record label.

+Format C (streaming):
+
+    [[ (L F_1 ... F_m) ]]
+           = startbyte(2,3) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,3)
+
+Applications *SHOULD* prefer the known-length format for encoding
+`Record`s.
+
 ##### Application-specific short form for labels

 Any given protocol using Preserves may additionally define an
 interpretation for `n ∈ {0,1,2}`, mapping each *short form label
 number* `n` to a specific record label. When encoding `m` fields with
-short form label number `n`, the header is `header(0,n,m)` (rather
-than `m+1`) since the label is implicit.
+short form label number `n`, format B becomes
+
+    header(2,n,m) ++ [[F_1]] ++ ... ++ [[F_m]]
+
+and format C becomes
+
+    startbyte(2,n) ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,n)

 **Examples.** A protocol may choose to map records
 labelled `void` to `n=0`, making

-    [[(void)]] = header(0,0,0) = [0x00]
+    [[(void)]] = header(2,0,0) = [0x80]

 or it may map records labelled `person` to short form label number 1,
 making

    [[(person "Dr" "Elizabeth" "Blackwell")]]
-        = header(0,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]`
-        =        [0x13] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]`
+        = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+        =        [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
+
+for format B, or
+
+        = startbyte(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ endbyte(2,1)
+        =         [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
+
+for format C.
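
The difference between the generic and short-form headers (`m+1` versus `m`) can be seen in a small Python sketch. The function names are hypothetical, and the simplified `header` deliberately omits the varint form, so it only handles lengths below 15.

```python
from typing import List


def header(t: int, n: int, m: int) -> bytes:
    if m >= 15:
        raise NotImplementedError("sketch omits the varint length form")
    return bytes([t * 64 + n * 16 + m])


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data


def encode_symbol(s: str) -> bytes:
    data = s.encode("utf-8")
    return header(1, 3, len(data)) + data


def encode_record(label: bytes, fields: List[bytes]) -> bytes:
    """Generic Record: the label counts towards the length."""
    return header(2, 3, len(fields) + 1) + label + b"".join(fields)


def encode_short_record(n: int, fields: List[bytes]) -> bytes:
    """Short form: the label is implied by n, so only fields are counted."""
    if n not in (0, 1, 2):
        raise ValueError("short form label numbers are 0, 1 or 2")
    return header(2, n, len(fields)) + b"".join(fields)


fields = [encode_string(s) for s in ("Dr", "Elizabeth", "Blackwell")]
short = encode_short_record(1, fields)
generic = encode_record(encode_symbol("person"), fields)
print(short[:1].hex(), generic[:1].hex())  # 93 b4
print(len(generic) - len(short))           # 7
```

The seven bytes saved are the encoded `person` symbol: one lead byte plus six bytes of UTF-8.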

 #### Sequences, Sets and Dictionaries

-    [[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
+Format B (known length):

-    [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]
+    [[ [X_1 ... X_m] ]] = header(3,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
+
+    [[ #set{X_1 ... X_m} ]] = header(3,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]

    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
-      = header(1,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
+      = header(3,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
+
+Format C (streaming):
+
+    [[ [X_1 ... X_m] ]] = startbyte(3,0) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,0)
+
+    [[ #set{X_1 ... X_m} ]] = startbyte(3,1) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,1)
+
+    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
+      = startbyte(3,2) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]] ++ endbyte(3,2)
+
+Applications may use whichever format suits their needs on a
+case-by-case basis.

 There is *no* ordering requirement on the `X_i` elements or
 `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
@@ -432,19 +531,23 @@ order.
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

-Note that `n=3` is unused and reserved.
+Note that `header(3,3,m)` and `startbyte(3,3)`/`endbyte(3,3)` are unused and reserved.

 #### Variable-length Atoms

 ##### SignedInteger

-    [[ x ]] when x ∈ SignedInteger = header(2,0,m) ++ intbytes(x)
+Format B (known length):
+
+    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)
      where           m = |intbytes(x)|
        and intbytes(x) = a big-endian two's-complement representation
                          of the signed integer x, taking exactly as
                          many whole bytes as needed to unambiguously
                          identify the value

+Format C *MUST NOT* be used for `SignedInteger`s.
+
 The value 0 needs zero bytes to identify the value, so `intbytes(0)`
 is the empty byte string. Non-zero values need at least one byte; the
 most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
@@ -452,55 +555,78 @@ the sign bit.

 For example,

-    [[   -257 ]] = [0x82, 0xFE, 0xFF]
-    [[   -256 ]] = [0x82, 0xFF, 0x00]
-    [[   -255 ]] = [0x82, 0xFF, 0x01]
-    [[   -254 ]] = [0x82, 0xFF, 0x02]
-    [[   -129 ]] = [0x82, 0xFF, 0x7F]
-    [[   -128 ]] = [0x81, 0x80]
-    [[   -127 ]] = [0x81, 0x81]
-    [[     -2 ]] = [0x81, 0xFE]
-    [[     -1 ]] = [0x81, 0xFF]
-    [[      0 ]] = [0x80]
-    [[      1 ]] = [0x81, 0x01]
-    [[    127 ]] = [0x81, 0x7F]
-    [[    128 ]] = [0x82, 0x00, 0x80]
-    [[    255 ]] = [0x82, 0x00, 0xFF]
-    [[    256 ]] = [0x82, 0x01, 0x00]
-    [[  32767 ]] = [0x82, 0x7F, 0xFF]
-    [[  32768 ]] = [0x83, 0x00, 0x80, 0x00]
-    [[  65535 ]] = [0x83, 0x00, 0xFF, 0xFF]
-    [[  65536 ]] = [0x83, 0x01, 0x00, 0x00]
-    [[ 131072 ]] = [0x83, 0x02, 0x00, 0x00]
+    [[   -257 ]] = [0x42, 0xFE, 0xFF]
+    [[   -256 ]] = [0x42, 0xFF, 0x00]
+    [[   -255 ]] = [0x42, 0xFF, 0x01]
+    [[   -254 ]] = [0x42, 0xFF, 0x02]
+    [[   -129 ]] = [0x42, 0xFF, 0x7F]
+    [[   -128 ]] = [0x41, 0x80]
+    [[   -127 ]] = [0x41, 0x81]
+    [[     -2 ]] = [0x41, 0xFE]
+    [[     -1 ]] = [0x41, 0xFF]
+    [[      0 ]] = [0x40]
+    [[      1 ]] = [0x41, 0x01]
+    [[    127 ]] = [0x41, 0x7F]
+    [[    128 ]] = [0x42, 0x00, 0x80]
+    [[    255 ]] = [0x42, 0x00, 0xFF]
+    [[    256 ]] = [0x42, 0x01, 0x00]
+    [[  32767 ]] = [0x42, 0x7F, 0xFF]
+    [[  32768 ]] = [0x43, 0x00, 0x80, 0x00]
+    [[  65535 ]] = [0x43, 0x00, 0xFF, 0xFF]
+    [[  65536 ]] = [0x43, 0x01, 0x00, 0x00]
+    [[ 131072 ]] = [0x43, 0x02, 0x00, 0x00]
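
One way to compute `intbytes` is sketched below in Python. The bit-counting approach is an assumption about how an implementation might find the minimal length, and the encoder is limited to integers whose representation is shorter than 15 bytes.

```python
def intbytes(x: int) -> bytes:
    """Minimal big-endian two's-complement bytes; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; ~x handles negative values such as -128,
    # which fits in one byte even though abs(-128) needs eight bits.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    length = (magnitude_bits + 1 + 7) // 8  # +1 for the sign bit, round up
    return x.to_bytes(length, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    body = intbytes(x)
    if len(body) >= 15:
        raise NotImplementedError("sketch omits the varint length form")
    return bytes([0x40 | len(body)]) + body  # header(1,0,m) for m < 15


for x in (-257, -128, 0, 128):
    print(x, encode_signed_integer(x).hex())
# -257 42feff
# -128 4180
# 0 40
# 128 420080
```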

 ##### String

-    [[ S ]] when S ∈ String = header(2,1,m) ++ utf8(S)
+Format B (known length):
+
+    [[ S ]] when S ∈ String = header(1,1,m) ++ utf8(S)
      where       m = |utf8(S)|
        and utf8(S) = the UTF-8 encoding of S

+To stream a `String`, emit `startbyte(1,1)` and then a sequence of
+zero or more format B `String` chunks, followed by `endbyte(1,1)`.
+
+While the overall content of a streamed `String` must be valid UTF-8,
+individual chunks do not have to conform to UTF-8.
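
The following Python sketch illustrates why chunk boundaries need not respect UTF-8: a chunk may end in the middle of a multi-byte character, as long as the concatenated payload decodes cleanly. Both function names are hypothetical, and the sketch only handles chunks shorter than 15 bytes.

```python
from typing import List


def stream_string(pieces: List[bytes]) -> bytes:
    """Frame raw UTF-8 pieces as format B String chunks inside a stream."""
    out = bytearray([0x25])                # startbyte(1,1)
    for piece in pieces:
        if len(piece) >= 15:
            raise NotImplementedError("sketch omits the varint length form")
        out.append(0x50 | len(piece))      # header(1,1,m) for m < 15
        out += piece
    out.append(0x35)                       # endbyte(1,1)
    return bytes(out)


def decode_streamed_string(buf: bytes) -> str:
    """Concatenate chunk payloads, then validate UTF-8 once at the end."""
    if buf[0] != 0x25:
        raise ValueError("not a streamed String")
    pos, payload = 1, bytearray()
    while buf[pos] != 0x35:
        lead = buf[pos]
        if lead >> 4 != 0x5 or lead & 0xF == 15:
            raise ValueError("sketch only handles short String chunks")
        m = lead & 0xF
        payload += buf[pos + 1:pos + 1 + m]
        pos += 1 + m
    return payload.decode("utf-8")


print(stream_string([b"he", b"llo"]).hex())   # 25526865536c6c6f35
# "é" is the two bytes C3 A9; split them across two chunks.
split = stream_string([b"\xc3", b"\xa9"])
print(decode_streamed_string(split) == "é")   # True
```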
+
 ##### ByteString

-    [[ B ]] when B ∈ ByteString = header(2,2,m) ++ B
+Format B (known length):
+
+    [[ B ]] when B ∈ ByteString = header(1,2,m) ++ B
                        where m = |B|

+To stream a `ByteString`, emit `startbyte(1,2)` and then a sequence of
+zero or more format B `ByteString` chunks, followed by `endbyte(1,2)`.
+
 ##### Symbol

-    [[ S ]] when S ∈ Symbol = header(2,2,m) ++ utf8(S)
+Format B (known length):
+
+    [[ S ]] when S ∈ Symbol = header(1,3,m) ++ utf8(S)
      where       m = |utf8(S)|
        and utf8(S) = the UTF-8 encoding of S

+To stream a `Symbol`, emit `startbyte(1,3)` and then a sequence of
+zero or more format B `Symbol` chunks, followed by `endbyte(1,3)`.
+
 #### Fixed-length Atoms

+Fixed-length atoms all use format A, and do not have a length
+representation. They repurpose the bits that format B `Repr`s use to
+specify lengths. Applications *MUST NOT* use format C with
+`startbyte(0,n)` or `endbyte(0,n)` for any `n`.
+
 ##### Booleans

-    [[ #f ]] = header(3,0,0) = [0xC0]
-    [[ #t ]] = header(3,0,1) = [0xC1]
+    [[ #f ]] = header(0,0,0) = [0x00]
+    [[ #t ]] = header(0,0,1) = [0x01]

 ##### Floats and Doubles

-    [[ F ]] when F ∈ Float  = header(3,0,2) ++ binary32(F)
-    [[ D ]] when D ∈ Double = header(3,0,3) ++ binary64(D)
+    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
+    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
      where binary32(F) and binary64(D) are big-endian 4- and 8-byte
            IEEE 754 binary representations
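
The fixed-length atoms can be sketched in Python with the standard `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64 values. The function names are illustrative only.

```python
import struct


def encode_boolean(b: bool) -> bytes:
    return b"\x01" if b else b"\x00"


def encode_float(f: float) -> bytes:
    # Lead byte 0x02, then 4 bytes of big-endian binary32.
    return b"\x02" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    # Lead byte 0x03, then 8 bytes of big-endian binary64.
    return b"\x03" + struct.pack(">d", d)


print(encode_float(1.0).hex())    # 023f800000
print(encode_double(1.0).hex())   # 033ff0000000000000
```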

@@ -515,21 +641,25 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to

 | Value                                                              | Encoded hexadecimal byte sequence                  |
 |--------------------------------------------------------------------|----------------------------------------------------|
-| `(capture (discard))`                                              | 11 00                                              |
-| `(observe (speak (discard) (capture (discard))))`                  | 21 33 B5 73 70 65 61 6B 00 11 00                   |
-| `[1 2 3 4]`                                                        | 44 81 01 81 02 81 03 81 04                         |
-| `[-2 -1 0 1]`                                                      | 54 81 FE 81 FF 80 81 01                            |
-| `["hello" there #"world" [] #set{} #t #f]`                         | 47 95 68 65 6C 6C 6F A5 74 68 65 72 65 40 50 C1 C0 |
-| `-257`                                                             | 82 FE FF                                           |
-| `-1`                                                               | 81 FF                                              |
-| `0`                                                                | 80                                                 |
-| `1`                                                                | 81 01                                              |
-| `255`                                                              | 82 00 FF                                           |
-| `1f`                                                               | C2 3F 80 00 00                                     |
-| `1d`                                                               | C3 3F F0 00 00 00 00 00 00                         |
-| `-1.202e300d`                                                      | C3 FE 3C B7 B7 59 BF 04 26                         |
+| `(capture (discard))`                                              | 91 80                                              |
+| `(observe (speak (discard) (capture (discard))))`                  | A1 B3 75 73 70 65 61 6B 80 91 80                   |
+| `[1 2 3 4]` (format B)                                             | C4 41 01 41 02 41 03 41 04                         |
+| `[1 2 3 4]` (format C)                                             | 2C 41 01 41 02 41 03 41 04 3C                      |
+| `[-2 -1 0 1]`                                                      | C4 41 FE 41 FF 40 41 01                            |
+| `"hello"` (format B)                                               | 55 68 65 6C 6C 6F                                  |
+| `"hello"` (format C, 2 chunks)                                     | 25 52 68 65 53 6C 6C 6F 35                         |
+| `"hello"` (format C, 5 chunks)                                     | 25 52 68 65 52 6C 6C 50 50 51 6F 35                |
+| `["hello" there #"world" [] #set{} #t #f]`                         | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
+| `-257`                                                             | 42 FE FF                                           |
+| `-1`                                                               | 41 FF                                              |
+| `0`                                                                | 40                                                 |
+| `1`                                                                | 41 01                                              |
+| `255`                                                              | 42 00 FF                                           |
+| `1f`                                                               | 02 3F 80 00 00                                     |
+| `1d`                                                               | 03 3F F0 00 00 00 00 00 00                         |
+| `-1.202e300d`                                                      | 03 FE 3C B7 B7 59 BF 04 26                         |

-Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Value`
+Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record`

    ([titled person 2 thing 1]
       101
@@ -539,21 +669,21 @@ Finally, a larger example, using a non-`Symbol` label for a record.[^extensibili

 encodes to

-    35                              ;; Record, generic, 4+1
-      45                              ;; Sequence, 5
-        B6 74 69 74 6C 65 64            ;; Symbol, "titled"
-        B6 70 65 72 73 6F 6E            ;; Symbol, "person"
-        81 02                           ;; SignedInteger, "2"
-        B5 74 68 69 6E 67               ;; Symbol, "thing"
-        81 01                           ;; SignedInteger, "1"
-      81 65                           ;; SignedInteger, "101"
-      99 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
-      34                              ;; Record, generic, 3+1
-        B4 64 61 74 65                  ;; Symbol, "date"
-        82 07 1D                        ;; SignedInteger, "1821"
-        81 02                           ;; SignedInteger, "2"
-        81 03                           ;; SignedInteger, "3"
-      92 44 72                        ;; String, "Dr"
+    B5                              ;; Record, generic, 4+1
+      C5                              ;; Sequence, 5
+        76 74 69 74 6C 65 64            ;; Symbol, "titled"
+        76 70 65 72 73 6F 6E            ;; Symbol, "person"
+        41 02                           ;; SignedInteger, "2"
+        75 74 68 69 6E 67               ;; Symbol, "thing"
+        41 01                           ;; SignedInteger, "1"
+      41 65                           ;; SignedInteger, "101"
+      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
+      B4                              ;; Record, generic, 3+1
+        74 64 61 74 65                  ;; Symbol, "date"
+        42 07 1D                        ;; SignedInteger, "1821"
+        41 02                           ;; SignedInteger, "2"
+        41 03                           ;; SignedInteger, "3"
+      52 44 72                        ;; String, "Dr"

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
@@ -608,15 +738,15 @@ pair.

 **Examples.**

-| `(mime application/octet-stream #"abcde")` | 33 B4 6D 69 6D 65 BF 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D A5 61 62 63 64 65 |
-| `(mime text/plain "ABC")`                  | 33 B4 6D 69 6D 65 BA 74 65 78 74 2F 70 6C 61 69 6E 93 41 42 43                                                    |
-| `(mime application/xml "<xhtml/>")`        | 33 B4 6D 69 6D 65 BF 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 98 3C 78 68 74 6D 6C 2F 3E                   |
-| `(mime text/csv "123,234,345")`            | 33 B4 6D 69 6D 65 B8 74 65 78 74 2F 63 73 76 9B 31 32 33 2C 32 33 34 2C 33 34 35                                  |
+| `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
+| `(mime text/plain #"ABC")`                 | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
+| `(mime application/xml #"<xhtml/>")`       | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
+| `(mime text/csv #"123,234,345")`           | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |

 Applications making heavy use of `mime` records may choose to use a
 short form label number for the record type. For example, if short
 form label number 1 were chosen, the second example above, `(mime
-text/plain "ABC")`, would be encoded with "12" in place of "33 B4 6D
+text/plain #"ABC")`, would be encoded with "92" in place of "B3 74 6D
 69 6D 65".

 ### Text
@@ -746,26 +876,29 @@ should both be identities.

 ## Appendix. Table of lead byte values

-     0x - short form Record label index 0
-     1x - short form Record label index 1
-     2x - short form Record label index 2
-     3x - Record
-     4x - Sequence
-     5x - Set
-     6x - Dictionary
-    (7x)  RESERVED
-     8x - SignedInteger
-     9x - String
-     Ax - Bytes
-     Bx - Symbol
-     C0 - False
-     C1 - True
-     C2 - Float
-     C3 - Double
-    (Cx)  RESERVED C4-CF
-    (Dx)  RESERVED
-    (Ex)  RESERVED
-    (Fx)  RESERVED
+     00 - False
+     01 - True
+     02 - Float
+     03 - Double
+    (0x)  RESERVED 04-0F
+    (1x)  RESERVED 10-1F
+     2x - Start Stream
+     3x - End Stream
+
+     4x - SignedInteger
+     5x - String
+     6x - Bytes
+     7x - Symbol
+
+     8x - short form Record label index 0
+     9x - short form Record label index 1
+     Ax - short form Record label index 2
+     Bx - Record
+
+     Cx - Sequence
+     Dx - Set
+     Ex - Dictionary
+    (Fx)  RESERVED F0-FF
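
The revised table can be turned into a small classifier, sketched here in Python. The function name `describe_leadbyte` and the returned strings are hypothetical; they simply mirror the table rows.

```python
def describe_leadbyte(b: int) -> str:
    """Map a lead byte value (0-255) to its row in the table above."""
    t, n, m = (b >> 6) & 0x3, (b >> 4) & 0x3, b & 0xF
    if t == 0:
        if n == 0:
            atoms = {0: "False", 1: "True", 2: "Float", 3: "Double"}
            return atoms.get(m, "RESERVED")
        if n == 1:
            return "RESERVED"
        return "Start Stream" if n == 2 else "End Stream"
    if t == 1:
        return ["SignedInteger", "String", "Bytes", "Symbol"][n]
    if t == 2:
        if n == 3:
            return "Record"
        return "short form Record label index %d" % n
    return ["Sequence", "Set", "Dictionary", "RESERVED"][n]


for b in (0x01, 0x2C, 0x55, 0x93, 0xB5, 0xC4, 0xF0):
    print("%02X" % b, describe_leadbyte(b))
```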

 ## Appendix. Why not Just Use JSON?

@@ -942,15 +1075,6 @@ Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexp
    other kind of structure, and the "hint" itself can only be a
    binary blob.

-Q. Should `MIMEData` be a special syntax for `Record`s with a single
-`ByteString` field?
-
-A. Not even. It should probably just be moved to the "conventions"
-section. Compare:
-
-    D5 BA text/plain    hello   -- using special MIMEData encoding
-    32 BA text/plain A5 hello   -- using bog standard type-labelled Record
-
 Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
 label (recursive!?) and a single `String` field?

@@ -970,16 +1094,6 @@ Q. Are the language mappings reasonable? How about one for Python?

 ---

-Streaming: needed for variable-sized structures. Tricky to design
-syntax for this that isn't gratuitously warty. End byte value.
-
-SIGH. Streaming for text/bytes too I SUPPOSE. Chunks, like CBOR
-
 Literal small integers: could be nice? Not absolutely necessary.

-Maybe reorder: fixed-length atoms first, then variable-length atoms,
-then fixed-length compounds, then variable-length compounds? Reason
-being that then maybe can put the streaming forms of the
-variable-length ones very last.
-
 ---