From e2b859e55dc993bffd77dd6bee7dd52ea6a469c1 Mon Sep 17 00:00:00 2001 From: Tony Garnock-Jones Date: Sat, 13 Jul 2019 22:20:22 -0400 Subject: [PATCH] Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks --- preserves.css | 20 ++- preserves.md | 376 +++++++++++++++++++++++++++----------------------- 2 files changed, 224 insertions(+), 172 deletions(-) diff --git a/preserves.css b/preserves.css index 35d1881..0c33afb 100644 --- a/preserves.css +++ b/preserves.css @@ -1,5 +1,6 @@ body { - font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif; + font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif; + box-sizing: border-box; } @media screen { body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; } @@ -59,3 +60,20 @@ h2#notes:before { } .footnotes > ol { padding: 0; font-size: 90%; } + +table { + border-collapse: collapse; + width: 100%; +} +thead tr { + border-bottom: solid black 1px; +} +th { + font-weight: normal; + text-align: left; + padding-right: 0.5rem; + padding-bottom: 0.3rem; +} +td { + padding-right: 0.5rem; +} diff --git a/preserves.md b/preserves.md index 8f53a60..9d57045 100644 --- a/preserves.md +++ b/preserves.md @@ -6,7 +6,7 @@ # Preserves: an Expressive Data Language Tony Garnock-Jones -November 2018. Version 0.0.4. +June 2019. Version 0.0.5. [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [spki]: http://world.std.com/~cme/html/spki.html @@ -72,9 +72,8 @@ follows:[^ordering-by-syntax] < String < ByteString < Symbol [^ordering-by-syntax]: The observant reader may note that the - ordering here is (almost) the same as that implied by the tagging - scheme used in the concrete binary syntax for `Value`s. (The - exception is the syntax for small integers near zero.) + ordering here is the same as that implied by the tagging scheme + used in the concrete binary syntax for `Value`s. 
**Equivalence.** Two `Value`s are equal if neither is less than the other according to the total order. @@ -400,7 +399,8 @@ itself have annotations. Value =/ ws "@" Value Value Each annotation is preceded by `@`; the underlying annotated value -follows its annotations. +follows its annotations. Here we extend only the syntactic nonterminal +named "`Value`" without altering the semantic class of `Value`s. **Equivalence.** Annotations appear within syntax denoting a `Value`; however, the annotations are not part of the denoted value. They are @@ -411,19 +411,24 @@ Reflective tools such as debuggers, user interfaces, and message routers and relays---tools which process `Value`s generically---may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a -rule of thumb, the presence, absence or specific value of an -annotation should not change the control flow or output of the -program. Annotations are data *describing* `Value`s, and are not in -the domain of any specific application of `Value`s. That is, an -annotation will almost never cause a non-reflective program to do -anything observably different. +rule of thumb, the presence, absence or content of an annotation +should not change the control flow or output of the program. +Annotations are data *describing* `Value`s, and are not in the domain +of any specific application of `Value`s. That is, an annotation will +almost never cause a non-reflective program to do anything observably +different. ## Compact Binary Syntax -A `Repr` is an encoding, or representation, of a specific `Value`. -Each `Repr` comprises one or more bytes describing first the kind of -represented `Value` and the length of the representation, and then the -encoded details of the `Value` itself. +A `Repr` is a binary-syntax encoding, or representation, of either + + - a `Value`, + - a "placeholder" for a `Value`, or + - an annotation on a `Repr`. 
+ +Each `Repr` comprises one or more bytes describing the kind of +represented information and the length of the representation, followed +by the encoded details. For a value `v`, we write `[[v]]` for the `Repr` of v. @@ -431,19 +436,16 @@ For a value `v`, we write `[[v]]` for the `Repr` of v. Each `Repr` takes one of three possible forms: - - (A) a fixed-length form, used for simple values such as `Boolean`s - or `Float`s. + - (A) type-specific form, used for simple values such as `Boolean`s + or `Float`s, for placeholders, and for introducing annotations. - (B) a variable-length form with length specified up-front, used for - almost all `Record`s as well as for most `Sequence`s and `String`s, - when their sizes are known at the time serialization begins. + compound and variable-length atomic data structures when their + sizes are known at the time serialization begins. - (C) a variable-length streaming form with unknown or unpredictable - length, used only seldom for `Record`s, since the number of fields - in a `Record` is usually statically known, but sometimes used for - `Sequence`s, `String`s etc., such as in cases when serialization - begins before the number of elements or bytes in the corresponding - `Value` is known. + length, used in cases when serialization begins before the number + of elements or bytes in the corresponding `Value` is known. Applications may choose between formats B and C depending on their needs at serialization time. @@ -455,30 +457,33 @@ Every `Repr` starts with a *lead byte*, constructed by leadbyte(t,n,m) = [t*64 + n*16 + m] -The arguments `t` and `n` describe the rest of the -representation:[^some-encodings-unused] +The arguments `t`, `n` and `m` describe the rest of the +representation.[^some-encodings-unused] [^some-encodings-unused]: Some encodings are unused. All such encodings are reserved for future versions of this specification. - - `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation. 
- - `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s. - - `t`=0, `n`=2 (format C) is a Stream Start byte. - - `t`=0, `n`=3 (format C) is a Stream End byte. - - `t`=1 (format B) represents an `Atom` with variable-length binary representation. - - `t`=2 (format B) represents a `Record`. - - `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`. +| `t` | `n` | `m` | Meaning | +| --- | --- | --- | ------- | +| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation | +| 0 | 0 | 4 | (format C) Stream end | +| 0 | 0 | 5 | (format A) Annotation | +| 0 | 1 | | (format A) Placeholder for an application-specific `Value` | +| 0 | 2 | | (format C) Stream start | +| 0 | 3 | | (format A) Certain small `SignedInteger`s | +| 1 | | | (format B) An `Atom` with variable-length binary representation | +| 2 | | | (format B) A `Compound` with variable-length representation | -#### Encoding data of fixed length (format A). +#### Encoding data of type-specific length (format A). -Each specific type of data defines its own rules for this format. +Each type of data defines its own rules for this format. #### Encoding data of known length (format B). -A `Repr` where the length of the `Value` to be encoded is variable but -known uses the value of `m` in `leadbyte` to encode its length. The -length counts *bytes* for atomic `Value`s, but counts *contained -values* for compound `Value`s. +Format B is used where the length `l` of the `Value` to be encoded is +known when serialization begins. Format B `Repr`s use `m` in +`leadbyte` to encode `l`. The length counts *bytes* for atomic +`Value`s, but counts *contained values* for compound `Value`s. - A length `l` between 0 and 14 is represented using `leadbyte` with `m=l`. @@ -503,11 +508,13 @@ definition, > two's complement representation of the number in groups of 7 bits, > least significant group first. -**Examples.** +The following table illustrates varint-encoding. 
- - The varint representation of 15 is just the byte 15. - - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2. - - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3. +| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | +| ------ | ------------------- | ------------ | +| 15 | `0001111` | 15 | +| 300 | `0000010 0101100` | 172 2 | +| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | #### Streaming data of unknown length (format C). @@ -516,61 +523,69 @@ not known at the time serialization of the `Value` starts is encoded by a single Stream Start (“open”) byte, followed by zero or more *chunks*, followed by a matching Stream End (“close”) byte: - open(t,n) = leadbyte(0,2, t*4 + n) - close(t,n) = leadbyte(0,3, t*4 + n) + open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n] + close() = leadbyte(0,0, 4) = [0x04] -For a `Repr` of a `Value` containing binary data, each chunk is to be -a format B `Repr` of a `ByteString`, no matter the type of the overall -`Repr`. +For a format C `Repr` of an atomic `Value`, each chunk is to be a +format B `Repr` of a `ByteString`, no matter the type of the overall +`Value`. Annotations are not allowed on these individual chunks. -For a `Repr` of a `Value` containing other `Value`s, each chunk is to -be a single `Repr`. +For a format C `Repr` of a compound `Value`, each chunk is to be a +single `Repr`, which may itself be annotated. + +Each chunk within a format C `Repr` *MUST* have non-zero length. +Software that decodes `Repr`s *MUST* reject `Repr`s that include +zero-length chunks. ### Records. Format B (known length): - [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] + [[ L(F_1...F_m) ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] For `m` fields, `m+1` is supplied to `header`, to account for the encoding of the record label. 
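The lead byte, varint and known-length header rules revised in this patch can be sketched in Python as follows. This is a minimal illustration and not part of the patch: `leadbyte`, `varint` and `header` mirror the names used in the spec text, while `symbol` and `record` are hypothetical helpers introduced here. The rule that a length of 15 or more is written as `m=15` followed by a varint is taken from the bit-field appendix.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte."""
    assert 0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16
    return bytes([t * 64 + n * 16 + m])


def varint(value: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    assert value >= 0
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # final group, high bit clear
            return bytes(out)


def header(t: int, n: int, length: int) -> bytes:
    """Format B header: short lengths live in the lead byte, long ones in a varint."""
    if length < 15:
        return leadbyte(t, n, length)
    return leadbyte(t, n, 15) + varint(length)


def symbol(name: str) -> bytes:
    """Hypothetical helper: format B Repr of a Symbol (t=1, n=3), UTF-8 payload."""
    data = name.encode("utf-8")
    return header(1, 3, len(data)) + data


def record(label: bytes, *fields: bytes) -> bytes:
    """Hypothetical helper: format B Repr of a Record from already-encoded Reprs.

    The label counts as one contained value, hence len(fields) + 1.
    """
    return header(2, 0, len(fields) + 1) + label + b"".join(fields)
```

Under these assumptions, `header(0, 1, 102)` yields the bytes `1F 66` and `record(header(0, 1, 4))` yields `81 14`, which agree with the `person` and `void()` placeholder examples given in this patch.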
Format C (streaming): - [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3) + [[ L(F_1...F_m) ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close() Applications *SHOULD* prefer the known-length format for encoding `Record`s. -#### Application-specific short form for labels. +### Placeholders. -Any given protocol using Preserves may additionally define an -interpretation for `n`∈{0,1,2}, mapping each *short form label -number* `n` to a specific record label. When encoding `m` fields with -short form label number `n`, format B becomes +Any given protocol using Preserves may define an interpretation for +numbered *placeholders* in the binary syntax, mapping each +*placeholder number* `n` to a specific `Value`. For example, a +placeholder number may be assigned for a frequently-used `Record` +label. - header(2,n,m) ++ [[F_1]] ++...++ [[F_m]] +A `Value` `v` for which placeholder number `n` has been assigned may +be tersely encoded as -and format C becomes + [[v]] = header(0,1,n) when n is a placeholder number for v - open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n) +**Examples.** For example, a protocol may choose to assign placeholder +number 4 to the symbol `void`, making -**Examples.** For example, a protocol may choose to map records -labelled `void` to `n=0`, making + [[void]] = header(0,1,4) = [0x14] + [[void()]] = header(2,0,1) ++ [[void]] = [0x81, 0x14] - [[void()]] = header(2,0,0) = [0x80] +or it may map symbol `person` to placeholder number 102, making -or it may map records labelled `person` to short form label number 1, -making + [[person]] = header(0,1,102) = [0x1F, 0x66] + +and so [[person("Dr", "Elizabeth", "Blackwell")]] - = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] - = [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] + = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] + = [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] for format B, or 
- = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1) - = [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39] + open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close() + = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04] for format C. @@ -578,9 +593,9 @@ for format C. Format B (known length): - [[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]] - [[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]] - [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++... + [[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]] + [[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]] + [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++... ++ [[K_m]] ++ [[V_m]] Note that `m*2` is given to `header` for a `Dictionary`, since there @@ -588,10 +603,10 @@ are two `Value`s in each key-value pair. Format C (streaming): - [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0) - [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1) - [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++... - ++ [[K_m]] ++ [[V_m]] ++ close(3,2) + [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close() + [[ #set{X_1...X_m} ]] = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close() + [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++... + ++ [[K_m]] ++ [[V_m]] ++ close() Applications may use whichever format suits their needs on a case-by-case basis. @@ -616,15 +631,13 @@ order. the option of serializing with set elements and dictionary keys in sorted order. -Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved. - ### SignedIntegers. 
Format B/A (known length/fixed-size): [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x - header(0,1,x+16) if -3≤x<0 - header(0,1,x) if 0≤x<13 + header(0,3,x+16) if -3≤x<0 + header(0,3,x) if 0≤x<13 Integers in the range [-3,12] are compactly represented using format A because they are so frequently used. Other integers are represented @@ -644,22 +657,22 @@ needed to unambiguously identify the value and its sign, and `m = For example, - [[ -257 ]] = 42 FE FF [[ -3 ]] = 1D [[ 128 ]] = 42 00 80 - [[ -256 ]] = 42 FF 00 [[ -2 ]] = 1E [[ 255 ]] = 42 00 FF - [[ -255 ]] = 42 FF 01 [[ -1 ]] = 1F [[ 256 ]] = 42 01 00 - [[ -254 ]] = 42 FF 02 [[ 0 ]] = 10 [[ 32767 ]] = 42 7F FF - [[ -129 ]] = 42 FF 7F [[ 1 ]] = 11 [[ 32768 ]] = 43 00 80 00 - [[ -128 ]] = 41 80 [[ 12 ]] = 1C [[ 65535 ]] = 43 00 FF FF + [[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80 + [[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF + [[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00 + [[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF + [[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00 + [[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF [[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00 [[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00 ### Strings, ByteStrings and Symbols. Syntax for these three types varies only in the value of `n` supplied -to `header`, `open`, and `close`. In each case, the payload following -the header is a binary sequence; for `String` and `Symbol`, it is a -UTF-8 encoding of the `Value`'s code points, while for `ByteString` it -is the raw data contained within the `Value` unmodified. +to `header` and `open`. In each case, the payload following the header +is a binary sequence; for `String` and `Symbol`, it is a UTF-8 +encoding of the `Value`'s code points, while for `ByteString` it is +the raw data contained within the `Value` unmodified. 
Format B (known length): @@ -671,7 +684,8 @@ Format B (known length): To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and then a sequence of zero or more format B chunks, followed by -`close(1,n)`. Every chunk must be a `ByteString`. +`close()`. Every chunk must be a `ByteString`, and no chunk may be +annotated. While the overall content of a streamed `String` or `Symbol` must be valid UTF-8, individual chunks do not have to conform to UTF-8. @@ -680,8 +694,8 @@ valid UTF-8, individual chunks do not have to conform to UTF-8. Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B `Repr`s use to -specify lengths. Applications *MUST NOT* use format C with -`open(0,n)` or `close(0,n)` for any `n`. +specify lengths. Applications *MUST NOT* use format C with `open(0,n)` +for any `n`. #### Booleans. @@ -696,6 +710,18 @@ specify lengths. Applications *MUST NOT* use format C with The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. +### Annotations. + +To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with +`[0x05] ++ [[v]]`. + +For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e. +an empty sequence annotated with two symbols, `a` and `b`, is + + [[ @a @b [] ]] + = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]] + = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90] + ## Examples ### Simple examples. @@ -703,25 +729,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and -For the following examples, imagine an application that maps `Record` -short form label number 0 to label `discard`, 1 to `capture`, and 2 to +For the following examples, imagine an application that maps +placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to `observe`. 
| Value | Encoded hexadecimal byte sequence | |---------------------------------------------------|----------------------------------------------------------------------| -| `capture(discard())` | 91 80 | -| `observe(speak(discard(), capture(discard())))` | A1 B3 75 73 70 65 61 6B 80 91 80 | -| `[1 2 3 4]` (format B) | C4 11 12 13 14 | -| `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C | -| `[-2 -1 0 1]` | C4 1E 1F 10 11 | +| `capture(discard())` | 82 11 81 10 | +| `observe(speak(discard(), capture(discard())))` | 82 12 83 75 73 70 65 61 6B 81 10 82 11 81 10 | +| `[1 2 3 4]` (format B) | 94 31 32 33 34 | +| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 | +| `[-2 -1 0 1]` | 94 3E 3F 30 31 | | `"hello"` (format B) | 55 68 65 6C 6C 6F | -| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 35 | -| `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 | -| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 | +| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 04 | +| `"hello"` (format C, 5 chunks) | 25 61 68 61 65 61 6C 61 6C 61 6F 04 | +| `["hello" there #"world" [] #set{} #true #false]` | 97 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 90 A0 01 00 | | `-257` | 42 FE FF | -| `-1` | 1F | -| `0` | 10 | -| `1` | 11 | +| `-1` | 3F | +| `0` | 30 | +| `1` | 31 | | `255` | 42 00 FF | | `1.0f` | 02 3F 80 00 00 | | `1.0` | 03 3F F0 00 00 00 00 00 00 | @@ -733,20 +759,20 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R encodes to - B5 ;; Record, generic, 4+1 - C5 ;; Sequence, 5 + 85 ;; Record, generic, 4+1 + 95 ;; Sequence, 5 76 74 69 74 6C 65 64 ;; Symbol, "titled" 76 70 65 72 73 6F 6E ;; Symbol, "person" - 12 ;; SignedInteger, "2" + 32 ;; SignedInteger, "2" 75 74 68 69 6E 67 ;; Symbol, "thing" - 11 ;; SignedInteger, "1" + 31 ;; SignedInteger, "1" 41 65 ;; SignedInteger, "101" 59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell" - B4 ;; Record, generic, 3+1 + 84 ;; Record, generic, 3+1 74 64 61 74 65 ;; 
Symbol, "date" 42 07 1D ;; SignedInteger, "1821" - 12 ;; SignedInteger, "2" - 13 ;; SignedInteger, "3" + 32 ;; SignedInteger, "2" + 33 ;; SignedInteger, "3" 52 44 72 ;; String, "Dr" [^extensibility2]: It happens to line up with Racket's @@ -787,19 +813,19 @@ read as `Symbol`s. The first example: encodes to binary as follows: - E2 + B2 55 "Image" - EC + BC 55 "Width" 42 03 20 55 "Title" 5F 14 "View from 15th Floor" 58 "Animated" 75 "false" 56 "Height" 42 02 58 59 "Thumbnail" - E6 + B6 55 "Width" 41 64 53 "Url" 5F 26 "http://www.example.com/image/481989943" 56 "Height" 41 7D - 53 "IDs" C4 + 53 "IDs" 94 41 74 42 03 AF 42 00 EA @@ -832,8 +858,8 @@ and the second example: encodes to binary as follows: - C2 - EF 10 + 92 + BF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 E2 26 80 9D 49 52 59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 @@ -842,7 +868,7 @@ encodes to binary as follows: 55 "State" 52 "CA" 53 "Zip" 55 "94107" 57 "Country" 52 "US" - EF 10 + BF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF @@ -957,16 +983,17 @@ such media types following the general rules for ordering of | Value | Encoded hexadecimal byte sequence | |--------------------------------------------|-------------------------------------------------------------------------------------------------------------------| -| `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 | -| `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 | -| `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E | -| `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 | +| `mime(application/octet-stream #"abcde")` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 
73 74 72 65 61 6D 65 61 62 63 64 65 | +| `mime(text/plain #"ABC")` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 | +| `mime(application/xml #"<xhtml/>")` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E | +| `mime(text/csv #"123,234,345")` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 | Applications making heavy use of `mime` records may choose to use a -short form label number for the record type. For example, if short -form label number 1 were chosen, the second example above, -`mime(text/plain "ABC")`, would be encoded with "92" in place of "B3 -74 6D 69 6D 65". +placeholder number for the symbol `mime` as well as the symbols for +individual media types. For example, if placeholder number 1 were +chosen for `mime`, and placeholder number 7 for `text/plain`, the +second example above, `mime(text/plain #"ABC")`, would be encoded as +`83 11 17 63 41 42 43`. ### Unicode normalization forms. @@ -1027,20 +1054,23 @@ or `date-time` productions of ## Security Considerations -**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and -`Symbol`s may include chunks of zero length. This opens up a -possibility for denial-of-service: an attacker may begin streaming a -string, sending an endless sequence of zero length chunks, appearing -to make progress but not actually doing so. Implementations may place -optional reasonable restrictions on the number of consecutive empty -chunks that may appear in a stream, and may even supply an optional -mode that rejects empty chunks entirely. +**Empty chunks.** Chunks of zero length are prohibited in streamed +(format C) `Repr`s. However, a malicious or broken encoder may include +them nonetheless. This opens up a possibility for denial-of-service: +an attacker may begin streaming a `String`, for example, sending an +endless sequence of zero length chunks, appearing to make progress but +not actually doing so. 
Implementations *MUST* reject zero length +chunks when decoding, and *MUST NOT* produce them when encoding. **Whitespace.** Similarly, the textual format for `Value`s allows arbitrary whitespace in many positions. In streaming transfer situations, consider optional restrictions on the amount of consecutive whitespace that may appear in a serialized `Value`. +**Annotations.** Also similarly, in modes where a `Value` is being +read while annotations are skipped, an endless sequence of annotations +may give an illusion of progress. + **Canonical form for cryptographic hashing and signing.** As specified, neither the textual nor the compact binary encoding rules for `Value`s force canonical serializations. Two serializations of the @@ -1052,24 +1082,26 @@ same `Value` may yield different binary `Repr`s. 01 - True 02 - Float 03 - Double - (0x) RESERVED 04-0F - 1x - Small integers 0..12,-3..-1 + 04 - End stream + 05 - Annotation + (0x) RESERVED 06-0F + 1x - Placeholder 2x - Start Stream - 3x - End Stream + 3x - Small integers 0..12,-3..-1 4x - SignedInteger 5x - String 6x - ByteString 7x - Symbol - 8x - short form Record label index 0 - 9x - short form Record label index 1 - Ax - short form Record label index 2 - Bx - Record + 8x - Record + 9x - Sequence + Ax - Set + Bx - Dictionary - Cx - Sequence - Dx - Set - Ex - Dictionary + (Cx) RESERVED C0-CF + (Dx) RESERVED D0-DF + (Ex) RESERVED E0-EF (Fx) RESERVED F0-FF ## Appendix. Bit fields within lead byte values @@ -1081,31 +1113,40 @@ same `Value` may yield different binary `Repr`s. 
00 00 0001 True 00 00 0010 Float, 32 bits big-endian binary 00 00 0011 Double, 64 bits big-endian binary + 00 00 0100 End Stream (to match a previous Start Stream) + 00 00 0101 Annotation; two more Reprs follow - 00 01 xxxx Small integers 0..12,-3..-1 + 00 01 mmmm Placeholder; m is the placeholder number 00 10 ttnn Start Stream When tt = 00 --> error 01 --> each chunk is a ByteString - 1x --> each chunk is a single encoded Value - 00 11 ttnn End Stream (must match preceding Start Stream) + 10 --> each chunk is a single encoded Value + 11 --> error (RESERVED) + + 00 11 xxxx Small integers 0..12,-3..-1 01 00 mmmm SignedInteger, big-endian binary 01 01 mmmm String, UTF-8 binary 01 10 mmmm ByteString 01 11 mmmm Symbol, UTF-8 binary - 10 00 mmmm application-specific Record - 10 01 mmmm application-specific Record - 10 10 mmmm application-specific Record - 10 11 mmmm Record + 10 00 mmmm Record + 10 01 mmmm Sequence + 10 10 mmmm Set + 10 11 mmmm Dictionary - 11 00 mmmm Sequence - 11 01 mmmm Set - 11 10 mmmm Dictionary + 11 nn mmmm error, RESERVED - If mmmm = 1111, a varint(m) follows, giving the length, before - the body; otherwise, m is the length of the body to follow. +Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If +`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of +decoding the varint that follows. + +Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l` +is the length of the body that follows, counted in bytes for `tt`=`01` +and in `Repr`s for `tt`=`10`. + + + ## Appendix. Why not Just Use JSON? @@ -1367,18 +1413,14 @@ Q. Should "symbols" instead be URIs? Relative, usually; relative to what? Some domain-specific base URI? Q. Literal small integers: are they pulling their weight? They're not -absolutely necessary. They mess up the connection between -value-type-ordering and repr-tag-ordering! 
(The connection between -*value* ordering and *repr* ordering is already irretrievably messed -up: length prefixes blow lexicographic ordering away, sign bits are -the wrong way around, floats are sign-magnitude, etc etc.) +absolutely necessary. Q. Should we go for trying to make the data ordering line up with the encoding ordering? We'd have to only use streaming forms, and avoid the small integer encoding, and not store record arities, and sort sets and dictionaries, and mask floats and doubles (perhaps [like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)), -and pick a specific `NaN`, and I don't know what to do about +and perhaps pick a specific `NaN`, and I don't know what to do about SignedIntegers. Perhaps make them more like float formats, with the byte count acting as a kind of exponent underneath the sign bit. @@ -1413,11 +1455,3 @@ link escape"; it is not a printable ASCII character, and is disallowed in the textual Preserves grammar; and it is also mnemonic for "version 0", since it is the Preserves binary encoding of the small integer zero.)) - -IN PROGRESS: Remove the special short syntax for application-specific record -label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record, -Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free. - -TODO: Forbid empty chunks. - -## Notes
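The renumbered tags introduced by this patch (small integers at `3x`, the single Stream End byte `04`, and the annotation prefix `05`) can be checked against the patch's own examples with a short Python sketch. This is an illustration under stated assumptions rather than a reference implementation: `signed_integer`, `sequence`, `stream_sequence` and `annotate` are hypothetical helper names, and `intbytes` is assumed to be the shortest big-endian two's complement form, as the integer examples suggest.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])


def varint(value: int) -> bytes:
    out = bytearray()
    while True:
        group, value = value & 0x7F, value >> 7
        out.append(group | 0x80 if value else group)
        if not value:
            return bytes(out)


def header(t: int, n: int, length: int) -> bytes:
    if length < 15:
        return leadbyte(t, n, length)
    return leadbyte(t, n, 15) + varint(length)


def signed_integer(x: int) -> bytes:
    """Format A for -3..12, otherwise format B with a minimal two's complement body."""
    if -3 <= x < 0:
        return leadbyte(0, 3, x + 16)
    if 0 <= x < 13:
        return leadbyte(0, 3, x)
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    length = (magnitude_bits + 8) // 8  # one extra bit for the sign, rounded up
    return header(1, 0, length) + x.to_bytes(length, "big", signed=True)


def symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return header(1, 3, len(data)) + data


def sequence(*items: bytes) -> bytes:
    """Format B Sequence built from already-encoded Reprs."""
    return header(2, 1, len(items)) + b"".join(items)


def stream_sequence(*items: bytes) -> bytes:
    """Format C Sequence: open(2,1), the chunks, then the shared close byte 0x04."""
    if any(len(item) == 0 for item in items):
        raise ValueError("zero-length chunks are forbidden in format C")
    return leadbyte(0, 2, 2 * 4 + 1) + b"".join(items) + leadbyte(0, 0, 4)


def annotate(annotation: bytes, repr_: bytes) -> bytes:
    """Prefix a Repr with [0x05] followed by the Repr of the annotation."""
    return b"\x05" + annotation + repr_
```

For instance, `annotate(symbol("a"), annotate(symbol("b"), sequence()))` produces `05 71 61 05 71 62 90`, the byte sequence the patch gives for `@a @b []`, and `stream_sequence(*(signed_integer(i) for i in (1, 2, 3, 4)))` produces `29 31 32 33 34 04`.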