Many improvements

This commit is contained in:
Tony Garnock-Jones 2018-09-23 18:14:58 +01:00
parent 2996970cbe
commit f2f57385ce
1 changed files with 65 additions and 62 deletions

@ -51,7 +51,6 @@ later in this document.
| Boolean | Boolean
| Float | Float
| Double | Double
| MIMEData
Compound = Record Compound = Record
| Sequence | Sequence
@ -86,7 +85,7 @@ follows:[^ordering-by-syntax]
(Compounds) Record < Sequence < Set < Dictionary (Compounds) Record < Sequence < Set < Dictionary
(Atoms) SignedInteger < String < ByteString < Symbol (Atoms) SignedInteger < String < ByteString < Symbol
< Boolean < Float < Double < MIMEData < Boolean < Float < Double
[^ordering-by-syntax]: The observant reader may note that the [^ordering-by-syntax]: The observant reader may note that the
ordering here is the same as that implied by the tagging scheme ordering here is the same as that implied by the tagging scheme
@ -183,21 +182,6 @@ and infinities, using a suffix `f` or `d` to indicate `Float` or
**Non-examples.** 10, -6, and 0, because writing them this way **Non-examples.** 10, -6, and 0, because writing them this way
indicates `SignedInteger`s, not `Float`s or `Double`s. indicates `SignedInteger`s, not `Float`s or `Double`s.
### MIME-type tagged binary data.
A `MIMEData` is a pair of a `Symbol` denoting a
[media type](https://tools.ietf.org/html/rfc6838) and a `ByteString`
body, intended to be interpreted as an encoding of a document having
that media type. While each media type may define its own rules for
comparing documents, we define ordering among `MIMEData`
*representations* of such media types lexicographically over the
(`Symbol`, `ByteString`) pair. We write examples using the same syntax
as for byte strings, but with the media type `Symbol` sandwiched
between the “`#`” and the first “`"`”.
**Examples.** `#application/octet-stream""`; `#text/plain"ABC"`;
`#application/xml"<xhtml/>"`; `#text/csv"123,234,345"`.
### Records. ### Records.
A `Record` is a *labelled* tuple of zero or more `Value`s, called the A `Record` is a *labelled* tuple of zero or more `Value`s, called the
@ -255,12 +239,12 @@ containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
containing 4, the string `"hello"`, the record with label `void` and containing 4, the string `"hello"`, the record with label `void` and
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`, no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
the set containing a `SignedInteger` and a `Float`, both denoting the the set containing a `SignedInteger` and a `Float`, both denoting the
number 1; `#set{#application/xml"<x/>" #application/xml"<x />"}`, a number 1; `#set{(mime application/xml #"<x/>") (mime
set containing two different `MIMEData` application/xml #"<x />")}`, a set containing two different
values.[^mimedata-xml-difference] type-labelled byte arrays.[^mime-xml-difference]
[^mimedata-xml-difference]: The two XML documents `<x/>` and `<x />` [^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
differ by bytewise comparison, and thus yield different `MIMEData` differ by bytewise comparison, and thus yield different record
values, even though under the semantics of XML they denote values, even though under the semantics of XML they denote
identical XML infosets. identical XML infosets.
@ -343,8 +327,6 @@ The following figure summarises the definitions below:
11 00 0010 Float, 32 bits big-endian binary 11 00 0010 Float, 32 bits big-endian binary
11 00 0011 Double, 64 bits big-endian binary 11 00 0011 Double, 64 bits big-endian binary
11 01 mmmm ... MIME-type-labelled binary data
If mmmm = 1111, varint(m) is present; otherwise, m is the length If mmmm = 1111, varint(m) is present; otherwise, m is the length
#### Type and Length representation #### Type and Length representation
@ -367,7 +349,6 @@ follows:[^some-encodings-unused]
leadbyte(1,-,-) represents a Sequence, Set or Dictionary leadbyte(1,-,-) represents a Sequence, Set or Dictionary
leadbyte(2,-,-) represents an Atom with variable-length binary representation leadbyte(2,-,-) represents an Atom with variable-length binary representation
leadbyte(3,0,-) represents an Atom with fixed-length binary representation leadbyte(3,0,-) represents an Atom with fixed-length binary representation
leadbyte(3,1,-) represents certain special variable-length values
[^some-encodings-unused]: Some encodings are unused. All such [^some-encodings-unused]: Some encodings are unused. All such
encodings are reserved for future versions of this specification. encodings are reserved for future versions of this specification.
@ -430,13 +411,26 @@ making
[[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]] [[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
[[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[Y_1]] ++ ... ++ [[Y_m]] [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]
where [Y_1 ... Y_m] = sort([X_1 ... X_m])
[[ #dict{K_1:V_1 ... K_m:V_m} ]] [[ #dict{K_1:V_1 ... K_m:V_m} ]]
= header(1,2,m) ++ [[K'_1]] ++ [[V'_1]] ++ ... ++ [[K'_m]] ++ [[V'_m]] = header(1,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
where [[K'_1 V'_1] ... [K'_m V'_m]]
= sort([[K_1 V_1] ... [K_m V_m]]) There is *no* ordering requirement on the `X_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order.
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s, because (a)
where canonicalization is used for cryptographic signatures, it is
more reliable to simply retain the exact binary form of the signed
document than to depend on canonical de- and re-serialization, and
(b) sorting keys or elements makes no sense in streaming
serialization formats.
Note that `n=3` is unused and reserved. Note that `n=3` is unused and reserved.
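To illustrate the relaxed rule, here is a minimal Python sketch (not part of the specification) that encodes a two-element set in both element orders. The helper names `header` and `encode_set` are hypothetical, and the sketch covers only counts below 15, where the count fits in the lead byte.

```python
def header(t, n, m):
    """Lead byte laid out as tt nn mmmm; only the short case m < 15 is sketched."""
    if not 0 <= m < 15:
        raise ValueError("counts of 15 or more need a varint, omitted in this sketch")
    return bytes([(t << 6) | (n << 4) | m])


def encode_set(encoded_elements):
    """header(1,1,m) followed by the already-encoded elements, in the order given."""
    return header(1, 1, len(encoded_elements)) + b"".join(encoded_elements)


# Encoded SignedIntegers 1 and 2, as they appear in the worked example bytes.
one = bytes([0x81, 0x01])
two = bytes([0x81, 0x02])

a = encode_set([one, two])  # 52 81 01 81 02
b = encode_set([two, one])  # 52 81 02 81 01
```

Both byte strings denote the same set, `#set{1 2}`, so a reader should not rely on either order.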
@ -451,6 +445,11 @@ Note that `n=3` is unused and reserved.
many whole bytes as needed to unambiguously many whole bytes as needed to unambiguously
identify the value identify the value
The value 0 needs zero bytes to identify the value, so `intbytes(0)`
is the empty byte string. Non-zero values need at least one byte; the
most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
the sign bit.
For example, For example,
[[ -257 ]] = [0x82, 0xFE, 0xFF] [[ -257 ]] = [0x82, 0xFE, 0xFF]
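One possible way to compute `intbytes` is sketched below in Python. This is an illustration under the description above (big-endian two's complement, as few whole bytes as possible, empty for zero), not a normative algorithm; the function name simply mirrors the notation used here.

```python
def intbytes(x):
    """Shortest big-endian two's-complement encoding of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits needed, plus one bit for the sign.
    # For negative x, ~x == -x - 1 gives the right magnitude (so -128 fits in one byte).
    magnitude = x if x > 0 else ~x
    bits = magnitude.bit_length() + 1
    nbytes = (bits + 7) // 8
    return x.to_bytes(nbytes, "big", signed=True)


print(intbytes(-257).hex())  # feff
print(intbytes(1821).hex())  # 071d
```

Prefixing the result for -257 with the lead byte `0x82` gives the three bytes shown in the example above.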
@ -505,18 +504,6 @@ For example,
where binary32(F) and binary64(D) are big-endian 4- and 8-byte where binary32(F) and binary64(D) are big-endian 4- and 8-byte
IEEE 754 binary representations IEEE 754 binary representations
#### Special variable-length values
##### MIMEData
Each `MIMEData` value is comprised of a media type `Symbol` and a raw
binary body.
[[ M ]] when M ∈ MIMEData = header(3,1,m) ++ [[T]] ++ B
where m = |B|
and T is the Symbol media type of M
and B is the ByteString body of M
## Examples ## Examples
<!-- TODO: Give some examples of large and small Preserves, perhaps --> <!-- TODO: Give some examples of large and small Preserves, perhaps -->
@ -554,16 +541,16 @@ encodes to
35 ;; Record, generic, 4+1 35 ;; Record, generic, 4+1
45 ;; Sequence, 5 45 ;; Sequence, 5
b6 74 69 74 6c 65 64 ;; Symbol, "titled" B6 74 69 74 6C 65 64 ;; Symbol, "titled"
b6 70 65 72 73 6f 6e ;; Symbol, "person" B6 70 65 72 73 6F 6E ;; Symbol, "person"
81 02 ;; SignedInteger, "2" 81 02 ;; SignedInteger, "2"
b5 74 68 69 6e 67 ;; Symbol, "thing" B5 74 68 69 6E 67 ;; Symbol, "thing"
81 01 ;; SignedInteger, "1" 81 01 ;; SignedInteger, "1"
81 65 ;; SignedInteger, "101" 81 65 ;; SignedInteger, "101"
99 42 6c 61 63 6b 77 65 6c 6c ;; String, "Blackwell" 99 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
34 ;; Record, generic, 3+1 34 ;; Record, generic, 3+1
b4 64 61 74 65 ;; Symbol, "date" B4 64 61 74 65 ;; Symbol, "date"
82 07 1d ;; SignedInteger, "1821" 82 07 1D ;; SignedInteger, "1821"
81 02 ;; SignedInteger, "2" 81 02 ;; SignedInteger, "2"
81 03 ;; SignedInteger, "3" 81 03 ;; SignedInteger, "3"
92 44 72 ;; String, "Dr" 92 44 72 ;; String, "Dr"
@ -605,6 +592,33 @@ treat them specially.
and one which enforces validity (i.e. side-conditions) when reading, and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s. writing, or constructing `Value`s.
### MIME-type tagged binary data
Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.
While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types lexicographically over the (`Symbol`, `ByteString`)
pair.
**Examples.**
| `(mime application/octet-stream #"abcde")` | 33 B4 6D 69 6D 65 BF 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D A5 61 62 63 64 65 |
| `(mime text/plain #"ABC")` | 33 B4 6D 69 6D 65 BA 74 65 78 74 2F 70 6C 61 69 6E A3 41 42 43 |
| `(mime application/xml #"<xhtml/>")` | 33 B4 6D 69 6D 65 BF 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C A8 3C 78 68 74 6D 6C 2F 3E |
| `(mime text/csv #"123,234,345")` | 33 B4 6D 69 6D 65 B8 74 65 78 74 2F 63 73 76 AB 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above, `(mime
text/plain #"ABC")`, would be encoded with "12" in place of "33 B4 6D
69 6D 65".
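The first example row can be reproduced with a short Python sketch. It is an illustration, not a reference encoder: all helper names are hypothetical, and `varint` assumes a base-128 encoding with the least-significant group first, which this section does not define (for the lengths used here it is a single byte either way). `short_record` reflects our reading of the "12" lead byte as `header(0, label_number, field_count)`.

```python
def varint(m):
    """ASSUMPTION: base-128, least-significant group first, high bit = 'more follows'."""
    out = bytearray()
    while True:
        low = m & 0x7F
        m >>= 7
        if m:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def header(t, n, m):
    """Lead byte tt nn mmmm; mmmm = 1111 means the length follows as a varint."""
    if m < 15:
        return bytes([(t << 6) | (n << 4) | m])
    return bytes([(t << 6) | (n << 4) | 15]) + varint(m)


def symbol(name):
    data = name.encode("utf-8")
    return header(2, 3, len(data)) + data


def bytestring(data):
    return header(2, 2, len(data)) + data


def record(label, *fields):
    """Generic record: the count covers the label plus the fields."""
    return header(0, 3, len(fields) + 1) + label + b"".join(fields)


def short_record(label_number, *fields):
    """Short-form record (assumed layout): the label number replaces the encoded label."""
    return header(0, label_number, len(fields)) + b"".join(fields)


def mime(media_type, body):
    return record(symbol("mime"), symbol(media_type), bytestring(body))


encoded = mime("application/octet-stream", b"abcde")
print(encoded.hex(" ").upper())
```

With short form label number 1, `short_record(1, symbol("text/plain"), bytestring(b"ABC"))` begins with the single byte `12` instead of the six-byte generic prefix.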
### Text ### Text
#### Normalization forms #### Normalization forms
@ -681,7 +695,6 @@ should both be identities.
- `Symbol` ↔ `Symbol.for(...)` - `Symbol` ↔ `Symbol.for(...)`
- `Boolean` ↔ `Boolean` - `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ numbers, - `Float` and `Double` ↔ numbers,
- `MIMEData` ↔ `{ "type": aString, "data": aUint8Array }`
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
- `(undefined)` ↔ the undefined value - `(undefined)` ↔ the undefined value
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
@ -697,7 +710,6 @@ should both be identities.
- `Symbol` ↔ symbols - `Symbol` ↔ symbols
- `Boolean` ↔ booleans - `Boolean` ↔ booleans
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats) - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
- `MIMEData` ↔ a structure with a `type` and a `data` field (Racket: `(struct mime (type data))`)
- `Record` ↔ structures (Racket: prefab struct) - `Record` ↔ structures (Racket: prefab struct)
- `Sequence` ↔ lists - `Sequence` ↔ lists
- `Set` ↔ Racket: sets - `Set` ↔ Racket: sets
@ -711,7 +723,6 @@ should both be identities.
- `Symbol` ↔ a simple data class wrapping a `String` - `Symbol` ↔ a simple data class wrapping a `String`
- `Boolean` ↔ `Boolean` - `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ `Float` and `Double` - `Float` and `Double` ↔ `Float` and `Double`
- `MIMEData` ↔ an implementation of `javax.activation.DataSource`, maybe?
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping? - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
- `Sequence` ↔ an implementation of `java.util.List` - `Sequence` ↔ an implementation of `java.util.List`
- `Set` ↔ an implementation of `java.util.Set` - `Set` ↔ an implementation of `java.util.Set`
@ -728,7 +739,6 @@ should both be identities.
binary of the utf-8 binary of the utf-8
- `Boolean` ↔ `true` and `false` - `Boolean` ↔ `true` and `false`
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision) - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
- `MIMEData` ↔ tuple of the type as a utf8 binary, and the data as a binary
- `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions - `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions
- `Sequence` ↔ a list - `Sequence` ↔ a list
- `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?) - `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
@ -753,7 +763,7 @@ should both be identities.
C2 - Float C2 - Float
C3 - Double C3 - Double
(Cx) RESERVED C4-CF (Cx) RESERVED C4-CF
Dx - MIMEData (Dx) RESERVED
(Ex) RESERVED (Ex) RESERVED
(Fx) RESERVED (Fx) RESERVED
@ -960,20 +970,13 @@ Q. Are the language mappings reasonable? How about one for Python?
--- ---
OK so. No built-in `MIMEData`, but maybe a conventional `(mime-data
Symbol Bytes)`? Applications can put it in a short slot if they like.
Streaming: needed for variable-sized structures. Tricky to design Streaming: needed for variable-sized structures. Tricky to design
syntax for this that isn't gratuitously warty. End byte value. syntax for this that isn't gratuitously warty. End byte value.
SIGH. Streaming for text/bytes too I SUPPOSE. Chunks, like CBOR
Literal small integers: could be nice? Not absolutely necessary. Literal small integers: could be nice? Not absolutely necessary.
Give algorithm for computing size of integers.
Give up on sorting requirement for representation of sets and
dictionaries?? Probably a good idea if there are streaming forms of
them because that sounds impossible to do??
Maybe reorder: fixed-length atoms first, then variable-length atoms, Maybe reorder: fixed-length atoms first, then variable-length atoms,
then fixed-length compounds, then variable-length compounds? Reason then fixed-length compounds, then variable-length compounds? Reason
being that then maybe can put the streaming forms of the being that then maybe can put the streaming forms of the