From f2f57385ce482f3842fa870cb06c4a30f758aa6a Mon Sep 17 00:00:00 2001
From: Tony Garnock-Jones
Date: Sun, 23 Sep 2018 18:14:58 +0100
Subject: [PATCH] Many improvements

---
 syndicate/mc/preserve.md | 127 ++++++++++++++++++++-------------------
 1 file changed, 65 insertions(+), 62 deletions(-)

diff --git a/syndicate/mc/preserve.md b/syndicate/mc/preserve.md
index cc3849c..3b67b6f 100644
--- a/syndicate/mc/preserve.md
+++ b/syndicate/mc/preserve.md
@@ -51,7 +51,6 @@ later in this document.
              | Boolean
              | Float
              | Double
-             | MIMEData
 
     Compound = Record
              | Sequence
@@ -86,7 +85,7 @@ follows:[^ordering-by-syntax]
     (Compounds)     Record < Sequence < Set < Dictionary
 
     (Atoms)         SignedInteger < String < ByteString < Symbol
-                       < Boolean < Float < Double < MIMEData
+                       < Boolean < Float < Double
 
   [^ordering-by-syntax]: The observant reader may note that the
   ordering here is the same as that implied by the tagging scheme
@@ -183,21 +182,6 @@ and infinities, using a suffix `f` or `d` to indicate `Float` or
 **Non-examples.** 10, -6, and 0, because writing them this way
 indicates `SignedInteger`s, not `Float`s or `Double`s.
 
-### MIME-type tagged binary data.
-
-A `MIMEData` is a pair of a `Symbol` denoting a
-[media type](https://tools.ietf.org/html/rfc6838) and a `ByteString`
-body, intended to be interpreted as an encoding of a document having
-that media type. While each media type may define its own rules for
-comparing documents, we define ordering among `MIMEData`
-*representations* of such media types lexicographically over the
-(`Symbol`, `ByteString`) pair. We write examples using the same syntax
-as for byte strings, but with the media type `Symbol` sandwiched
-between the “`#`” and the first “`"`”.
-
-**Examples.** `#application/octet-stream""`; `#text/plain"ABC"`;
-`#application/xml""`; `#text/csv"123,234,345"`.
-
 ### Records.
 
 A `Record` is a *labelled* tuple of zero or more `Value`s, called the
@@ -255,12 +239,12 @@ containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
 containing 4, the string `"hello"`, the record with label `void` and
 no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
 the set containing a `SignedInteger` and a `Float`, both denoting the
-number 1; `#set{#application/xml"" #application/xml""}`, a
-set containing two different `MIMEData`
-values.[^mimedata-xml-difference]
+number 1; `#set{(mime application/xml #"") (mime
+application/xml #"")}`, a set containing two different
+type-labelled byte arrays.[^mime-xml-difference]
 
-  [^mimedata-xml-difference]: The two XML documents `` and ``
-  differ by bytewise comparison, and thus yield different `MIMEData`
+  [^mime-xml-difference]: The two XML documents `` and ``
+  differ by bytewise comparison, and thus yield different record
   values, even though under the semantics of XML they denote
   identical XML infoset.
 
@@ -343,8 +327,6 @@ The following figure summarises the definitions below:
     11 00 0010      Float, 32 bits big-endian binary
     11 00 0011      Double, 64 bits big-endian binary
 
-    11 01 mmmm ...  MIME-type-labelled binary data
-
     If mmmm = 1111, varint(m) is present; otherwise, m is the length
 
 #### Type and Length representation
@@ -367,7 +349,6 @@ follows:[^some-encodings-unused]
     leadbyte(1,-,-) represents a Sequence, Set or Dictionary
     leadbyte(2,-,-) represents an Atom with variable-length binary representation
     leadbyte(3,0,-) represents an Atom with fixed-length binary representation
-    leadbyte(3,1,-) represents certain special variable-length values
 
   [^some-encodings-unused]: Some encodings are unused. All such
   encodings are reserved for future versions of this specification.
@@ -430,13 +411,26 @@ making
 
     [[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
 
-    [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[Y_1]] ++ ... ++ [[Y_m]]
-      where [Y_1 ... Y_m] = sort([X_1 ... X_m])
+    [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]
 
     [[ #dict{K_1:V_1 ... K_m:V_m} ]]
-      = header(1,2,m) ++ [[K'_1]] ++ [[V'_1]] ++ ... ++ [[K'_m]] ++ [[V'_m]]
-      where [[K'_1 V'_1] ... [K'_m V'_m]]
-              = sort([[K_1 V_1] ... [K_m V_m]])
+      = header(1,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
+
+There is *no* ordering requirement on the `X_i` elements or
+`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
+order.
+
+  [^no-sorting-rationale]: In the BitTorrent encoding format,
+  [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
+  dictionary key/value pairs must be sorted by key. This is a
+  necessary step for ensuring serialization of `Value`s is
+  canonical. We do not require that key/value pairs (or set
+  elements) be in sorted order for serialized `Value`s, because (a)
+  where canonicalization is used for cryptographic signatures, it is
+  more reliable to simply retain the exact binary form of the signed
+  document than to depend on canonical de- and re-serialization, and
+  (b) sorting keys or elements makes no sense in streaming
+  serialization formats.
 
 Note that `n=3` is unused and reserved.
 
@@ -451,6 +445,11 @@ Note that `n=3` is unused and reserved.
                   many whole bytes as needed to unambiguously identify
                   the value
 
+The value 0 needs zero bytes to identify the value, so `intbytes(0)`
+is the empty byte string. Non-zero values need at least one byte; the
+most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
+the sign bit.
+
 For example,
 
     [[ -257 ]] = [0x82, 0xFE, 0xFF]
@@ -505,18 +504,6 @@ For example,
     where binary32(F) and binary64(D) are big-endian 4- and 8-byte
     IEEE 754 binary representations
 
-#### Special variable-length values
-
-##### MIMEData
-
-Each `MIMEData` value is comprised of a media type `Symbol` and a raw
-binary body.
-
-    [[ M ]] when M ∈ MIMEData = header(3,1,m) ++ [[T]] ++ B
-                         where m = |B|
-                           and T is the Symbol media type of M
-                           and B is the ByteString body of M
-
 
 ## Examples
 
@@ -554,16 +541,16 @@ encodes to
 
     35                             ;; Record, generic, 4+1
     45                             ;; Sequence, 5
-    b6 74 69 74 6c 65 64           ;; Symbol, "titled"
-    b6 70 65 72 73 6f 6e           ;; Symbol, "person"
+    B6 74 69 74 6C 65 64           ;; Symbol, "titled"
+    B6 70 65 72 73 6F 6E           ;; Symbol, "person"
     81 02                          ;; SignedInteger, "2"
-    b5 74 68 69 6e 67              ;; Symbol, "thing"
+    B5 74 68 69 6E 67              ;; Symbol, "thing"
     81 01                          ;; SignedInteger, "1"
     81 65                          ;; SignedInteger, "101"
-    99 42 6c 61 63 6b 77 65 6c 6c  ;; String, "Blackwell"
+    99 42 6C 61 63 6B 77 65 6C 6C  ;; String, "Blackwell"
     34                             ;; Record, generic, 3+1
-    b4 64 61 74 65                 ;; Symbol, "date"
-    82 07 1d                       ;; SignedInteger, "1821"
+    B4 64 61 74 65                 ;; Symbol, "date"
+    82 07 1D                       ;; SignedInteger, "1821"
     81 02                          ;; SignedInteger, "2"
     81 03                          ;; SignedInteger, "3"
     92 44 72                       ;; String, "Dr"
@@ -605,6 +592,33 @@ treat them specially.
    and one which enforces validity (i.e. side-conditions) when
    reading, writing, or constructing `Value`s.
 
+### MIME-type tagged binary data
+
+Many internet protocols use
+[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
+to indicate the format of some associated binary data. For this
+purpose, we define `MIMEData` to be a record labelled `mime` with two
+fields, the first being a `Symbol`, the media type, and the second
+being a `ByteString`, the binary data.
+
+While each media type may define its own rules for comparing
+documents, we define ordering among `MIMEData` *representations* of
+such media types lexicographically over the (`Symbol`, `ByteString`)
+pair.
+
+**Examples.**
+
+| `(mime application/octet-stream #"abcde")` | 33 B4 6D 69 6D 65 BF 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D A5 61 62 63 64 65 |
+| `(mime text/plain "ABC")` | 33 B4 6D 69 6D 65 BA 74 65 78 74 2F 70 6C 61 69 6E 93 41 42 43 |
+| `(mime application/xml "<xhtml/>")` | 33 B4 6D 69 6D 65 BF 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 98 3C 78 68 74 6D 6C 2F 3E |
+| `(mime text/csv "123,234,345")` | 33 B4 6D 69 6D 65 B8 74 65 78 74 2F 63 73 76 9B 31 32 33 2C 32 33 34 2C 33 34 35 |
+
+Applications making heavy use of `mime` records may choose to use a
+short form label number for the record type. For example, if short
+form label number 1 were chosen, the second example above, `(mime
+text/plain "ABC")`, would be encoded with "12" in place of "33 B4 6D
+69 6D 65".
+
 ### Text
 
 #### Normalization forms
@@ -681,7 +695,6 @@ should both be identities.
   - `Symbol` ↔ `Symbol.for(...)`
   - `Boolean` ↔ `Boolean`
   - `Float` and `Double` ↔ numbers,
- - `MIMEData` ↔ `{ "type": aString, "data": aUint8Array }`
   - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
   - `(undefined)` ↔ the undefined value
   - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
@@ -697,7 +710,6 @@ should both be identities.
   - `Symbol` ↔ symbols
   - `Boolean` ↔ booleans
   - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
- - `MIMEData` ↔ a structure with a `type` and a `data` field (Racket: `(struct mime (type data))`)
   - `Record` ↔ structures (Racket: prefab struct)
   - `Sequence` ↔ lists
   - `Set` ↔ Racket: sets
@@ -711,7 +723,6 @@ should both be identities.
   - `Symbol` ↔ a simple data class wrapping a `String`
   - `Boolean` ↔ `Boolean`
   - `Float` and `Double` ↔ `Float` and `Double`
- - `MIMEData` ↔ an implementation of `javax.activation.DataSource`, maybe?
   - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
   - `Sequence` ↔ an implementation of `java.util.List`
   - `Set` ↔ an implementation of `java.util.Set`
@@ -728,7 +739,6 @@ should both be identities.
    binary of the utf-8
   - `Boolean` ↔ `true` and `false`
   - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
- - `MIMEData` ↔ tuple of the type as a utf8 binary, and the data as a binary
   - `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions
   - `Sequence` ↔ a list
   - `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
@@ -753,7 +763,7 @@ should both be identities.
      C2 - Float
      C3 - Double
     (Cx) RESERVED C4-CF
-     Dx - MIMEData
+    (Dx) RESERVED
     (Ex) RESERVED
     (Fx) RESERVED
 
@@ -960,20 +970,13 @@ Q. Are the language mappings reasonable? How about one for Python?
 
 ---
 
-OK so. No built-in `MIMEData`, but maybe a conventional `(mime-data
-Symbol Bytes)`? Applications can put it in a short slot if they like.
-
 Streaming: needed for variable-sized structures. Tricky to design
 syntax for this that isn't gratuitously warty. End byte value.
 
+SIGH. Streaming for text/bytes too I SUPPOSE. Chunks, like CBOR
+
 Literal small integers: could be nice? Not absolutely necessary.
 
-Give algorithm for computing size of integers.
-
-Give up on sorting requirement for representation of sets and
-dictionaries?? Probably a good idea if there are streaming forms of
-them because that sounds impossible to do??
-
 Maybe reorder: fixed-length atoms first, then variable-length atoms,
 then fixed-length compounds, then variable-length compounds? Reason
 being that then maybe can put the streaming forms of the