From 5d719c2c6fae2b0118fdc4fa6a901efc495a39e4 Mon Sep 17 00:00:00 2001 From: Tony Garnock-Jones Date: Mon, 28 Dec 2020 23:25:02 +0100 Subject: [PATCH] MUCH simpler binary format, inspired by Syrup; alterations to text format --- NOTICE | 2 +- TUTORIAL.md | 191 ++++++------ canonical-binary.md | 17 +- conventions.md | 44 +-- preserves.md | 728 +++++++++++++++++--------------------------- questions.md | 13 - 6 files changed, 399 insertions(+), 596 deletions(-) diff --git a/NOTICE b/NOTICE index 24d04ae..a401f75 100644 --- a/NOTICE +++ b/NOTICE @@ -1,2 +1,2 @@ Preserves: an Expressive Data Language -Copyright 2018-2019 Tony Garnock-Jones +Copyright 2018-2020 Tony Garnock-Jones diff --git a/TUTORIAL.md b/TUTORIAL.md index 564373f..0c954b5 100644 --- a/TUTORIAL.md +++ b/TUTORIAL.md @@ -38,7 +38,7 @@ For that, see the [Preserves specification](preserves.html). If you're familiar with JSON, Preserves looks fairly similar: -``` javascript +``` {"name": "Missy Rose", "species": "Felis Catus", "age": 13, @@ -49,35 +49,35 @@ Preserves also has something we can use for debugging/development information called "annotations"; they aren't actually read in as data but we can use them for comments. (They can also be used for other development tools and are not -restricted to strings; more on this later, but for now interpret them -as comments.) +restricted to strings; more on this later, but for now, we will stick +to the special comment annotation syntax.) -``` javascript - @"I'm an annotation... basically a comment. Ignore me!" - "I'm data! Don't ignore me!" +``` + ;I'm an annotation... basically a comment. Ignore me! + "I'm data! Don't ignore me!" 
``` Preserves supports some data types you're probably already familiar with from JSON, and which look fairly similar in the textual format: -``` javascript - @"booleans" - #true - #false - - @"various kinds of numbers:" +``` + ;booleans + #t + #f + + ;various kinds of numbers: 42 123556789012345678901234567890 -10 13.5 - - @"strings" + + ;strings "I'm feeling stringy!" - - @"sequences (lists)" + + ;sequences (lists) ["cat", "dog", "mouse", "goldfish"] - - @"dictionaries (hashmaps)" + + ;dictionaries (hashmaps) {"cat": "meow", "dog": "woof", "goldfish": "glub glub", @@ -90,16 +90,16 @@ with from JSON, and which look fairly similar in the textual format: ## Going beyond JSON We can observe a few differences from JSON already; it's possible to -express numbers of arbitrary length in Preserves, and booleans look a little +*reliably* express integers of arbitrary length in Preserves, and booleans look a little bit different. A few more interesting differences: -``` javascript - @"Preserves treats commas as whitespace, so these are the same" +``` + ;Preserves treats commas as whitespace, so these are the same ["cat", "dog", "mouse", "goldfish"] ["cat" "dog" "mouse" "goldfish"] - - @"We can use anything as keys in dictionaries, not just strings" + + ;We can use anything as keys in dictionaries, not just strings {1: "the loneliest number", ["why", "was", 6, "afraid", "of", 7]: "because 7 8 9", {"dictionaries": "as keys???"}: "well, why not?"} @@ -107,17 +107,17 @@ A few more interesting differences: Preserves technically provides a few types of numbers: -``` javascript - @"Signed Integers" +``` + ;Signed Integers 42 -42 5907212309572059846509324862304968273468909473609826340 -5907212309572059846509324862304968273468909473609826340 - - @"Floats (Single-precision IEEE floats) (notice the trailing f)" + + ;Floats (Single-precision IEEE floats) (notice the trailing f) 3.1415927f - - @"Doubles (Double-precision IEEE floats)" + + ;Doubles (Double-precision IEEE floats) 
3.141592653589793 ``` @@ -129,33 +129,33 @@ Often they're meant to be used for something that has symbolic importance to the program, but not textual importance (other than to guide the programmer… not unlike variable names). -``` javascript - @"A symbol (NOT a string!)" +``` + ;A symbol (NOT a string!) JustASymbol - - @"You can do mixedCase or CamelCase too of course, pick your poison" - @"(but be consistent, for the sake of your collaborators!" + + ;You can do mixedCase or CamelCase too of course, pick your poison + ;(but be consistent, for the sake of your collaborators!) iAmASymbol i-am-a-symbol - - @"A list of symbols" + + ;A list of symbols [GET, PUT, POST, DELETE] - - @"A symbol with spaces in it" + + ;A symbol with spaces in it |this is just one symbol believe it or not| ``` We can also add binary data, aka ByteStrings: -``` javascript - @"Some binary data, base64 encoded" - #base64{cGljdHVyZSBvZiBhIGNhdA==} - - @"Some other binary data, hexadecimal encoded" - #hex{616263} - - @"Same binary data as above, base64 encoded" - #base64{YWJj} +``` + ;Some binary data, base64 encoded + #[cGljdHVyZSBvZiBhIGNhdA==] + + ;Some other binary data, hexadecimal encoded + #x"616263" + + ;Same binary data as above, base64 encoded + #[YWJj] ``` What's neat about this is that we don't have to "pay the cost" of @@ -165,48 +165,41 @@ the length of the binary data is the length of the binary data. Conveniently, Preserves also includes Sets, which are collections of unique elements where ordering of items is unimportant. -``` javascript - #set{flour, salt, water} +``` + #{flour, salt, water} ``` -## Total ordering and canonicalization +## Canonicalization This is a good time to mention that even though from a semantic perspective sets and dictionaries do not carry information about the ordering of their elements (and Preserves doesn't care what order we enter them in for our hand-written-as-text Preserves documents), -Preserves has a well-defined "total ordering". 
+[Preserves provides support for canonical ordering](canonical-binary.html) +when serializing. -Based on this total ordering, Preserves provides support for canonical -ordering when serializing; in this mode, Preserves will always write -out the elements in the same order, every time. -When combined with binary serialization, this is Preserves' "canonical -form". -This is important and useful for many contexts, but especially for -cryptographic signatures and hashing. +In canonicalizing output mode, Preserves will always write out a given +value using exactly the same bytes, every time. This is important and +useful for many contexts, but especially for cryptographic signatures +and hashing. -``` javascript - @"This hand-typed Preserves document..." +``` + ;This hand-typed Preserves document... {monkey: {"noise": "ooh-ooh", - "eats": #set{"bananas", "berries"}} + "eats": #{"bananas", "berries"}} cat: {"noise": "meow", - "eats": #set{"kibble", "cat treats", "tinned meat"}}} - - @"Will always, always be written out in this order when canonicalized:" - {cat: {"eats": #set{"cat treats", "kibble", "tinned meat"}, + "eats": #{"kibble", "cat treats", "tinned meat"}}} + + ;Will always, always be written out in this order (except in + ;binary, of course) when canonicalized: + {cat: {"eats": #{"cat treats", "kibble", "tinned meat"}, "noise": "meow"} - monkey: {"eats": #set{"bananas", "berries"}, + monkey: {"eats": #{"bananas", "berries"}, "noise": "ooh-ooh"}} ``` -Clever implementations can get canonicalized output for free by -carefully ordering set elements and dictionary entries at construction -time, but even in simple implementations, canonical serialization is -almost as cheap as normal serialization. - - ## Defining our own types using Records @@ -216,7 +209,7 @@ sense, it's a meta-type. `Record` objects have a label and a series of arguments (or "fields"). 
For example, we can make a `Date` record: -``` javascript +``` ``` @@ -228,7 +221,7 @@ We could instead just decide to encode our date data in a string, like "2019-08-15". A document using such a date structure might look like so: -``` javascript +``` {"name": "Gregor Samsa", "description": "humanoid trapped in an insect body", "born": "1915-10-04"} @@ -243,13 +236,13 @@ know the date exactly. This causes a problem. Now we might have two kinds of entries: -``` javascript - @"Exact date known" +``` + ;Exact date known {"name": "Gregor Samsa", "description": "humanoid trapped in an insect body", "born": "1915-10-04"} - - @"Not sure about exact date..." + + ;Not sure about exact date... {"name": "Gregor Samsa", "description": "humanoid trapped in an insect body", "born": "Sometime in October 1915? Or was that when he became an insect?"} @@ -261,13 +254,13 @@ like a date", but doing this kind of thing is prone to errors and weird edge cases. No, it's better to be able to have a separate type: -``` javascript - @"Exact date known" +``` + ;Exact date known {"name": "Gregor Samsa", "description": "humanoid trapped in an insect body", "born": } - - @"Not sure about exact date..." + + ;Not sure about exact date... {"name": "Gregor Samsa", "description": "humanoid trapped in an insect body", "born": } @@ -285,7 +278,7 @@ the meaning the label signifies for it to be of use. Still, there are plenty of interesting labels we can define. Here is one for an "iri", a hyperlink: -``` javascript +``` ``` @@ -294,11 +287,11 @@ Records are usually symbols but aren't necessarily so. They can also be strings or numbers or even dictionaries. And very interestingly, they can also be other records: -``` javascript - < - {"to": [], - "attributedTo": , - "content": "Say, did you finish reading that book I lent you?"}> +``` + < + {"to": [], + "attributedTo": , + "content": "Say, did you finish reading that book I lent you?"} > ``` Do you see it? This Record's label is… an `iri` Record! 
@@ -327,16 +320,18 @@ Annotations are not strictly a necessary feature, but they are useful in some circumstances. We have previously shown them used as comments: -``` javascript - @"I'm a comment!" +``` + ;I'm a comment! "I am not a comment, I am data!" ``` Annotations annotate the values they precede. It is possible to have multiple annotations on a value. +The `;`-based comment syntax is syntactic sugar for the general +`@`-prefixed string annotation syntax. -``` javascript - @"I am annotating this number" +``` + ;I am annotating this number @"And so am I!" 42 ``` @@ -349,7 +344,7 @@ Many implementations will, in the same mode, also supply line number and column information attached to each read value. So what's the point of them then? -If annotations were just for comments, there would be indeed hardly +If annotations were just for comments, there would be indeed hardly any point at all… it would be simpler to just provide a comment syntax. However, annotations can be used for more than just comments. @@ -360,13 +355,17 @@ For instance, here's a reply from an HTTP API service running in "debug" mode annotated with the time it took to produce the reply and the internal name of the server that produced the response: -``` javascript +``` @> @ , }, > - }, > ]>> + , } + > + } + > ]>> ``` The annotations aren't related to the data requested, which is all diff --git a/canonical-binary.md b/canonical-binary.md index 50d7fa5..7d192b4 100644 --- a/canonical-binary.md +++ b/canonical-binary.md @@ -20,22 +20,17 @@ are equal. This document specifies canonical form for the Preserves compact binary syntax. -**General rules.** -Streaming formats ("format C") *MUST NOT* be used. +**Annotations.** Annotations *MUST NOT* be present. -Whenever there is a choice between fixed-length ("format A") or -variable-length ("format B") formats, the fixed-length format *MUST* be -used. 
**Sets.** The elements of a `Set` *MUST* be serialized sorted in ascending order -following the total order relation defined in the -[Preserves specification][spec]. +by comparing their canonical encoded binary representations. **Dictionaries.** The key-value pairs in a `Dictionary` *MUST* be serialized sorted in -ascending order by key, following the total order relation defined in -the [Preserves specification][spec].[^no-need-for-by-value] +ascending order by comparing the canonical encoded binary +representations of their keys.[^no-need-for-by-value] [^no-need-for-by-value]: There is no need to order by (key, value) pair, since a `Dictionary` has no duplicate keys. @@ -43,7 +38,9 @@ the [Preserves specification][spec].[^no-need-for-by-value] **Other kinds of `Value`.** There are no special canonicalization restrictions on `SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s, -`Float`s, `Double`s, `Record`s, or `Sequence`s. +`Float`s, `Double`s, `Record`s, or `Sequence`s. The constraints given +for these `Value`s in the [specification][spec] suffice to ensure +canonicity. ## Notes diff --git a/conventions.md b/conventions.md index f9f765d..29d56a4 100644 --- a/conventions.md +++ b/conventions.md @@ -65,28 +65,29 @@ interior portions of a tree. ## Comments. `String` values used as annotations are conventionally interpreted as -comments. +comments. Special syntax exists for such string annotations, though +the usual `@`-prefixed annotation notation can also be used. 
- @"I am a comment for the Dictionary" + ;I am a comment for the Dictionary { - @"I am a comment for the key" - key: @"I am a comment for the value" + ;I am a comment for the key + key: ;I am a comment for the value value } - @"I am a comment for this entire IOList" + ;I am a comment for this entire IOList [ - #hex{00010203} - @"I am a comment for the middle half of the IOList" - @"A second comment for the same portion of the IOList" - @ @"I am the first and only comment for the following comment" + #x"00010203" + ;I am a comment for the middle half of the IOList + ;A second comment for the same portion of the IOList + @ ;I am the first and only comment for the following comment "A third (itself commented!) comment for the same part of the IOList" [ - @"I am a comment for the following ByteString" - #hex{04050607} - #hex{08090A0B} + ;"I am a comment for the following ByteString" + #x"04050607" + #x"08090A0B" ] - #hex{0C0D0E0F} + #x"0C0D0E0F" ] ## MIME-type tagged binary data. @@ -105,12 +106,17 @@ such media types following the general rules for ordering of **Examples.** -| Value | Encoded hexadecimal byte sequence | -|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------| -| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 | -| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 | -| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E | -| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 | + «<mime application/octet-stream #"abcde">» + = B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84 + + «<mime text/plain #"ABC">» + = B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84 + + «<mime application/xml #"<xhtml/>">» + = B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84 + + «<mime text/csv #"123,234,345">» + = B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84 ## Unicode normalization forms.
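A minimal Python sketch for checking the `mime` byte sequences in the conventions.md examples above against the tag-based binary syntax that this patch introduces in preserves.md (`0xB4` record, `0xB1` string, `0xB2` byte string, `0xB3` symbol, `0x84` end tag, varint lengths). The names `Symbol`, `Record`, `varint` and `encode` are hypothetical helpers written for this note, not part of any Preserves implementation, and the record being checked is read back from the listed bytes (label `mime`, symbol `text/plain`, byte string `ABC`).

```python
def varint(n: int) -> bytes:
    """Base 128 varint: 7-bit groups, least significant group first.

    Every byte except the last has its high bit set.
    """
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)         # final group
            return bytes(out)


class Symbol(str):
    """Marks a Python string as a Symbol rather than a String (hypothetical)."""


class Record:
    """A label plus zero or more fields (hypothetical helper)."""

    def __init__(self, label, *fields):
        self.label = label
        self.fields = fields


def encode(v) -> bytes:
    """Encode the small subset of Values needed for the mime examples."""
    if isinstance(v, Record):
        body = b"".join(encode(x) for x in (v.label, *v.fields))
        return b"\xB4" + body + b"\x84"       # record tag ... end tag
    if isinstance(v, Symbol):                 # check before str: Symbol is a str
        data = v.encode("utf-8")
        return b"\xB3" + varint(len(data)) + data
    if isinstance(v, str):
        data = v.encode("utf-8")
        return b"\xB1" + varint(len(data)) + data
    if isinstance(v, bytes):
        return b"\xB2" + varint(len(v)) + v
    raise TypeError(f"unsupported value in this sketch: {v!r}")


# The text/plain example: label `mime`, symbol `text/plain`, byte string "ABC".
encoded = encode(Record(Symbol("mime"), Symbol("text/plain"), b"ABC"))
expected = (b"\xB4" + b"\xB3\x04mime" + b"\xB3\x0Atext/plain"
            + b"\xB2\x03ABC" + b"\x84")
```

Comparing `encoded` with `expected` reproduces the `text/plain` line of the examples. The sketch leaves out integers, floats, sequences, sets, dictionaries and annotations, which the preserves.md part of the patch specifies separately.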
diff --git a/preserves.md b/preserves.md index ceb6651..aa1b11b 100644 --- a/preserves.md +++ b/preserves.md @@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language" --- Tony Garnock-Jones -May 2020. Version 0.0.8. +Jan 2021. Version 0.4.0. [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [spki]: http://world.std.com/~cme/html/spki.html @@ -12,6 +12,7 @@ May 2020. Version 0.0.8. [LEB128]: https://en.wikipedia.org/wiki/LEB128 [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map [abnf]: https://tools.ietf.org/html/rfc7405 + [canonical]: canonical-binary.html This document proposes a data model and serialization format called *Preserves*. @@ -42,20 +43,20 @@ Our `Value`s fall into two broad categories: *atomic* and *compound* data. Every `Value` is finite and non-cyclic. Value = Atom - | Compound + | Compound Atom = Boolean - | Float - | Double - | SignedInteger - | String - | ByteString - | Symbol + | Float + | Double + | SignedInteger + | String + | ByteString + | Symbol Compound = Record - | Sequence - | Set - | Dictionary + | Sequence + | Set + | Dictionary **Total order.** As we go, we will incrementally specify a total order over `Value`s. Two values of the @@ -215,14 +216,13 @@ label-`Value` followed by its field-`Value`s. `Sequence`s are enclosed in square brackets. `Dictionary` values are curly-brace-enclosed colon-separated pairs of values. `Set`s are -written either as one or more values enclosed in curly braces, or zero -or more values enclosed by the tokens `#set{` and +written as values enclosed by the tokens `#{` and `}`.[^printing-collections] It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys. 
Sequence = "[" *Value ws "]" Dictionary = "{" *(Value ws ":" Value) ws "}" - Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}" + Set = "#{" *Value ws "}" [^printing-collections]: **Implementation note.** When implementing printing of `Value`s using the textual syntax, consider supporting @@ -232,9 +232,10 @@ duplicate elements or for a dictionary to contain duplicate keys. commas separating, and commas terminating elements or key/value pairs within a collection. -`Boolean`s are the simple literal strings `#true` and `#false`. +`Boolean`s are the simple literal strings `#t` and `#f` for true and +false, respectively. - Boolean = %s"#true" / %s"#false" + Boolean = %s"#t" / %s"#f" Numeric data follow the [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with @@ -310,9 +311,10 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] [^escaping-surrogate-pairs]: In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic - Multilingual Plane. We encourage implementations to avoid escaping - such characters when producing output, and instead to rely on the - UTF-8 encoding of the entire document to handle them correctly. + Multilingual Plane. We encourage implementations to avoid using + `\u` escapes when producing output, and instead to rely on the + UTF-8 encoding of the entire document to handle non-ASCII + codepoints correctly. A `ByteString` may be written in any of three different forms. @@ -327,16 +329,16 @@ value with `\x`. binunescaped = %x20-21 / %x23-5B / %x5D-7E The second is as a sequence of pairs of hexadecimal digits interleaved -with whitespace and surrounded by `#hex{` and `}`. +with whitespace and surrounded by `#x"` and `"`. - ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}" + ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22 The third is as a sequence of [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved -with whitespace and surrounded by `#base64{` and `}`. 
Plain and -URL-safe Base64 characters are allowed. +with whitespace and surrounded by `#[` and `]`. Plain and URL-safe +Base64 characters are allowed. - ByteString =/ %s"#base64{" *(ws / base64char) ws "}" / + ByteString =/ "#[" *(ws / base64char) ws "]" / base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as @@ -365,10 +367,10 @@ double quote mark. Finally, any `Value` may be represented by escaping from the textual syntax to the [compact binary syntax](#compact-binary-syntax) by prefixing a `ByteString` containing the binary representation of the -`Value` with `#value`.[^rationale-switch-to-binary] +`Value` with `#`.[^rationale-switch-to-binary] [^no-literal-binary-in-text] [^compact-value-annotations] - Compact = %s"#value" ws ByteString + Compact = "#" ws ByteString [^rationale-switch-to-binary]: **Rationale.** The textual syntax cannot express every `Value`: specifically, it cannot express the @@ -387,8 +389,8 @@ prefixing a `ByteString` containing the binary representation of the access the representation of the text from within the text itself. [^compact-value-annotations]: Any text-syntax annotations preceding - the `#value` are prepended to any binary-syntax annotations - yielded by decoding the `ByteString`. + the `#` are prepended to any binary-syntax annotations yielded by + decoding the `ByteString`. ### Annotations. @@ -403,6 +405,17 @@ Each annotation is preceded by `@`; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “`Value`” without altering the semantic class of `Value`s. +**Comments.** Strings annotating a `Value` are conventionally +interpreted as comments associated with that value. Comments are +sufficiently common that special syntax exists for them. 
+ + Value =/ ws + ";" *(%x00-09 / %x0B-0C / %x0E-%x10FFFF) newline + Value + +When written this way, everything between the `;` and the newline is +included in the string annotating the `Value`. + **Equivalence.** Annotations appear within syntax denoting a `Value`; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in @@ -421,86 +434,25 @@ different. ## Compact Binary Syntax -A `Repr` is a binary-syntax encoding, or representation, of either a -`Value` or an annotation on a `Repr`. - -Each `Repr` comprises one or more bytes describing the kind of -represented information and the length of the representation, followed -by the encoded details. - -For a value `v`, we write `[[v]]` for the `Repr` of v. +A `Repr` is a binary-syntax encoding, or representation, of a `Value`. +For a value `v`, we write `«v»` for the `Repr` of v. ### Type and Length representation. -Each `Repr` takes one of three possible forms: +Each `Repr` starts with a tag byte, describing the kind of information +represented. Depending on the tag, a length indicator, further encoded +information, and/or an ending tag may follow. - - (A) type-specific form, used for simple values such as `Boolean`s - or `Float`s as well as for introducing annotations. + tag (simple atomic data and small integers) + tag ++ binarydata (most integers) + tag ++ length ++ binarydata (large integers, strings, symbols, and binary) + tag ++ repr ++ ... ++ endtag (compound data) - - (B) a variable-length form with length specified up-front, used for - compound and variable-length atomic data structures when their - sizes are known at the time serialization begins. +The unique end tag is byte value `0x84`. - - (C) a variable-length streaming form with unknown or unpredictable - length, used in cases when serialization begins before the number - of elements or bytes in the corresponding `Value` is known. 
- -Applications may choose between formats B and C depending on their -needs at serialization time. - -#### The lead byte. - -Every `Repr` starts with a *lead byte*, constructed by -`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16: - - leadbyte(t,n,m) = [t*64 + n*16 + m] - -The arguments `t`, `n` and `m` describe the rest of the -representation.[^some-encodings-unused] - - [^some-encodings-unused]: Some encodings are unused. All such - encodings are reserved for future versions of this specification. - -| `t` | `n` | `m` | Meaning | -| --- | --- | --- | ------- | -| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation | -| 0 | 0 | 4 | (format C) Stream end | -| 0 | 0 | 5 | (format A) Annotation | -| 0 | 2 | | (format C) Stream start | -| 0 | 3 | | (format A) Certain small `SignedInteger`s | -| 1 | | | (format B) An `Atom` with variable-length binary representation | -| 2 | | | (format B) A `Compound` with variable-length representation | -| 3 | 3 | 15 | (format A) 0xFF byte; no-op | - -#### Encoding data of type-specific length (format A). - -Each type of data defines its own rules for this format. - -Of particular note is lead byte `0xFF`, which is a no-op byte acting -as a kind of pseudo-whitespace in a binary-syntax encoding. - -#### Encoding data of known length (format B). - -Format B is used where the length `l` of the `Value` to be encoded is -known when serialization begins. Format B `Repr`s use `m` in -`leadbyte` to encode `l`. The length counts *bytes* for atomic -`Value`s, but counts *contained values* for compound `Value`s. - - - A length `l` between 0 and 14 is represented using `leadbyte` with - `m=l`. - - A length of 15 or greater is represented by `m=15` and additional - bytes describing the length following the lead byte. 
- -The function `header(t,n,m)` yields an appropriate sequence of bytes -describing a `Repr`'s type and length when `t`, `n` and `m` are -appropriate non-negative integers: - - header(t,n,m) = leadbyte(t,n,m) when m < 15 - or leadbyte(t,n,15) ++ varint(m) otherwise - -The additional length bytes are formatted as -[base 128 varints][varint].[^see-also-leb128] We write `varint(m)` for -the varint-encoding of `m`. Quoting the +If present after a tag, the length of a following piece of binary data +is formatted as a [base 128 varint][varint].[^see-also-leb128] We +write `varint(m)` for the varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint] definition, [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned @@ -515,174 +467,114 @@ the varint-encoding of `m`. Quoting the The following table illustrates varint-encoding. -| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | -| ------ | ------------------- | ------------ | -| 15 | `0001111` | 15 | -| 300 | `0000010 0101100` | 172 2 | -| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | +| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | +| ------ | ------------------- | ------------ | +| 15 | `0001111` | 15 | +| 300 | `0000010 0101100` | 172 2 | +| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | It is an error for a varint-encoded `m` in a `Repr` to be anything other than the unique shortest encoding for that `m`. That is, a -varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. However, -the `varint(m)` encoding of a length *MUST NOT* be used when `m`<15, -meaning that a `Repr` *MUST NOT* contain any varint-encoding with -final byte `0`. +varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. -#### Streaming data of unknown length (format C). +### Records, Sequences, Sets and Dictionaries. 
-A `Repr` where the length of the `Value` to be encoded is variable and -not known at the time serialization of the `Value` starts is encoded -by a single Stream Start (“open”) byte, followed by zero or more -*chunks*, followed by a matching Stream End (“close”) byte: - - open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n] - close() = leadbyte(0,0, 4) = [0x04] - -For a format C `Repr` of an atomic `Value`, each chunk is to be a -format B `Repr` of a `ByteString`, no matter the type of the overall -`Value`. Annotations are not allowed on these individual chunks. - -For a format C `Repr` of a compound `Value`, each chunk is to be a -single `Repr`, which may itself be annotated. - -Each chunk within a format C `Repr` *MUST* have non-zero length. -Software that decodes `Repr`s *MUST* reject `Repr`s that include -zero-length chunks. - -### Records. - -Format B (known length): - - [[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] - -For `m` fields, `m+1` is supplied to `header`, to account for the -encoding of the record label. - -Format C (streaming): - - [[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close() - -Applications *SHOULD* prefer the known-length format for encoding -`Record`s. - -### Sequences, Sets and Dictionaries. - -Format B (known length): - - [[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]] - [[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]] - [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++... - ++ [[K_m]] ++ [[V_m]] - -Note that `m*2` is given to `header` for a `Dictionary`, since there -are two `Value`s in each key-value pair. - -Format C (streaming): - - [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close() - [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close() - [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++... - ++ [[K_m]] ++ [[V_m]] ++ close() - -Applications may use whichever format suits their needs on a -case-by-case basis.
+ «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84] + «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84] + «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84] + «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84] There is *no* ordering requirement on the `E_i` elements or `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any -order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. +order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In +addition, implementations *SHOULD* default to writing set elements and +dictionary key/value pairs in order sorted lexicographically by their +`Repr`s[^not-sorted-semantically], and *MAY* offer the option of +serializing in some other implementation-defined order. [^no-sorting-rationale]: In the BitTorrent encoding format, [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of `Value`s is canonical. We do not require that key/value pairs (or set - elements) be in sorted order for serialized `Value`s, because (a) - where canonicalization is used for cryptographic signatures, it is - more reliable to simply retain the exact binary form of the signed - document than to depend on canonical de- and re-serialization, and - (b) sorting keys or elements makes no sense in streaming - serialization formats. + elements) be in sorted order for serialized `Value`s; however, a + [canonical form][canonical] for `Repr`s does exist where a sorted + ordering is required. - However, a quality implementation may wish to offer the programmer - the option of serializing with set elements and dictionary keys in - sorted order. + [^not-sorted-semantically]: It's important to note that the sort + ordering for writing out set elements and dictionary key/value + pairs is *not* the same as the sort ordering implied by the + semantic ordering of those elements or keys.
For example, the + `Repr` of a negative number very far from zero will start with a + byte that is *greater* than the byte which starts the `Repr` of + zero, making it sort lexicographically later by `Repr`, despite + being semantically *less than* zero. + + **Rationale**. This is for ease-of-implementation reasons: not all + languages can easily represent sorted sets or sorted dictionaries, + but encoding and then sorting byte strings is much more likely to + be within easy reach. ### SignedIntegers. -Format B/A (known length/fixed-size): + «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16 + ([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16 + ([0xA0] + x) if (-3≤x≤-1) + ([0x90] + x) if ( 0≤x≤12) + where m = |intbytes(x)| - [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x - header(0,3,x+16) if -3≤x<0 - header(0,3,x) if 0≤x<13 - -Integers in the range [-3,12] are compactly represented using format A -because they are so frequently used. Other integers are represented -using format B. - -Format C *MUST NOT* be used for `SignedInteger`s. Format A *MUST* be -used for integers in the range -3 to 12, inclusive. +Integers in the range [-3,12] are compactly represented with tags +between `0x90` and `0x9F` because they are so frequently used. +Integers up to 16 bytes long are represented with a single-byte tag +encoding the length of the integer. Larger integers are represented +with an explicit varint length. Every `SignedInteger` *MUST* be +represented with its shortest possible encoding. The function `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and `m = |intbytes(x)|`.
The most-significant bit in the first byte in -`intbytes(x)` is the sign bit.[^zero-intbytes] +`intbytes(x)` is the sign bit.[^zero-intbytes] For +example, + + «87112285931760246646623899502532662132736» + = B0 12 01 00 00 00 00 00 00 00 + 00 00 00 00 00 00 00 00 + 00 00 + + «-257» = A1 FE FF «-3» = 9D «128» = A1 00 80 + «-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF + «-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00 + «-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF + «-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00 + «-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF + «-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00 + «-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00 [^zero-intbytes]: The value 0 needs zero bytes to identify the value, so `intbytes(0)` is the empty byte string. Non-zero values need at least one byte. -For example, - - [[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80 - [[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF - [[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00 - [[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF - [[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00 - [[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF - [[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00 - [[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00 - ### Strings, ByteStrings and Symbols. -Syntax for these three types varies only in the value of `n` supplied -to `header` and `open`. In each case, the payload following the header -is a binary sequence; for `String` and `Symbol`, it is a UTF-8 -encoding of the `Value`'s code points, while for `ByteString` it is -the raw data contained within the `Value` unmodified. +Syntax for these three types varies only in the tag used. For `String` +and `Symbol`, the data following the tag is a UTF-8 encoding of the +`Value`'s code points, while for `ByteString` it is the raw data +contained within the `Value` unmodified. 
-Format B (known length): + «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String + [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString + [0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol - [[ S ]] = header(1,n,m) ++ encode(S) - where m = |encode(S)| - and (n,encode(S)) = (1,utf8(S)) if S ∈ String - (2,S) if S ∈ ByteString - (3,utf8(S)) if S ∈ Symbol +### Booleans. -To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and -then a sequence of zero or more format B chunks, followed by -`close()`. Every chunk must be a `ByteString`, and no chunk may be -annotated. + «#f» = [0x80] + «#t» = [0x81] -While the overall content of a streamed `String` or `Symbol` must be -valid UTF-8, individual chunks do not have to conform to UTF-8. +### Floats and Doubles. -### Fixed-length Atoms. - -Fixed-length atoms all use format A, and do not have a length -representation. They repurpose the bits that format B `Repr`s use to -specify lengths. Applications *MUST NOT* use format C with `open(0,n)` -for any `n`. - -#### Booleans. - - [[ #false ]] = header(0,0,0) = [0x00] - [[ #true ]] = header(0,0,1) = [0x01] - -#### Floats and Doubles. - - [[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F) - [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D) + «F» when F ∈ Float = [0x82] ++ binary32(F) + «D» when D ∈ Double = [0x83] ++ binary64(D) The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. @@ -690,40 +582,43 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and ### Annotations. To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with -`[0x05] ++ [[v]]`. +`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual +syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols, +`a` and `b`, is -For example, the `Repr` corresponding to textual syntax `@a@b[]`, -i.e. 
an empty sequence annotated with two symbols, `a` and `b`, is
-
- [[ @a @b [] ]]
- = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
- = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
+ «@a @b []»
+ = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
+ = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
 ## Examples
+### Ordering.
+
+The total ordering specified [above](#total-order) means that the following statements are true:
+
+ "bzz" < "c" < "caa"
+ #t < 3.0f < 3.0 < 3 < "3" < |3| < []
+
 ### Simple examples.
-| Value | Encoded byte sequence |
-|---------------------------------------------------|-------------------------------------------------------------------------------------|
-| `<capture <discard>>` | 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd' |
-| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
-| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
-| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
-| `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' |
-| `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 |
-| `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 |
-| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
-| `-257` | 42 FE FF |
-| `-1` | 3F |
-| `0` | 30 |
-| `1` | 31 |
-| `255` | 42 00 FF |
-| `1.0f` | 02 3F 80 00 00 |
-| `1.0` | 03 3F F0 00 00 00 00 00 00 |
-| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
+| Value | Encoded byte sequence |
+|-----------------------------|---------------------------------------------------------------------------------|
+| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
+| `[1 2 3 4]` | B5 91 92 93 94 84 |
+| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
+| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
+| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
+| `-257` | A1 FE FF |
+| `-1` | 9F |
+| `0` | 90 |
+| `1` | 91 |
+| `255` | A1 00 FF |
+| `1.0f` | 82 3F 80 00 00 |
+| `1.0` | 83 3F F0 00 00 00 00 00 00 |
+| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
 The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
@@ -731,21 +626,24 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
 encodes to
- 85 ;; Record, generic, 4+1
- 95 ;; Sequence, 5
- 76 74 69 74 6C 65 64 ;; Symbol, "titled"
- 76 70 65 72 73 6F 6E ;; Symbol, "person"
- 32 ;; SignedInteger, "2"
- 75 74 68 69 6E 67 ;; Symbol, "thing"
- 31 ;; SignedInteger, "1"
- 41 65 ;; SignedInteger, "101"
- 59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
- 84 ;; Record, generic, 3+1
- 74 64 61 74 65 ;; Symbol, "date"
- 42 07 1D ;; SignedInteger, "1821"
- 32 ;; SignedInteger, "2"
- 33 ;; SignedInteger, "3"
- 52 44 72 ;; String, "Dr"
+ B4 ;; Record
+ B5 ;; Sequence
+ B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
+ B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
+ 92 ;; SignedInteger, "2"
+ B3 05 74 68 69 6E 67 ;; Symbol, "thing"
+ 91 ;; SignedInteger, "1"
+ 84 ;; End (sequence)
+ A0 65 ;; SignedInteger, "101"
+ B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
+ B4 ;; Record
+ B3 04 64 61 74 65 ;; Symbol, "date"
+ A1 07 1D ;; SignedInteger, "1821"
+ 92 ;; SignedInteger, "2"
+ 93 ;; SignedInteger, "3"
+ 84 ;; End (record)
+ B1 02 44 72 ;; String, "Dr"
+ 84 ;; End (record)
 [^extensibility2]: It happens to line up with Racket's representation of a record label for an inheritance hierarchy
@@ -785,23 +683,27 @@ read as `Symbol`s.
The first example: encodes to binary as follows: - B2 - 55 "Image" - BC - 55 "Width" 42 03 20 - 55 "Title" 5F 14 "View from 15th Floor" - 58 "Animated" 75 "false" - 56 "Height" 42 02 58 - 59 "Thumbnail" - B6 - 55 "Width" 41 64 - 53 "Url" 5F 26 "http://www.example.com/image/481989943" - 56 "Height" 41 7D - 53 "IDs" 94 - 41 74 - 42 03 AF - 42 00 EA - 43 00 97 89 + B7 + B1 05 "Image" + B7 + B1 05 "Title" B1 14 "View from 15th Floor" + B1 05 "Width" A1 03 20 + B1 06 "Height" A1 02 58 + B1 08 "Animated" B3 05 "false" + B1 09 "Thumbnail" + B7 + B1 03 "Url" B1 26 "http://www.example.com/image/481989943" + B1 03 "IDs" B5 + A0 74 + A1 03 AF + A1 00 EA + A2 00 97 89 + 84 + B1 05 "Width" A0 64 + B1 06 "Height" A0 7D + 84 + 84 + 84 and the second example: @@ -830,55 +732,51 @@ and the second example: encodes to binary as follows: - 92 - BF 10 - 59 "precision" 53 "zip" - 58 "Latitude" 03 40 42 E2 26 80 9D 49 52 - 59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 - 57 "Address" 50 - 54 "City" 5D "SAN FRANCISCO" - 55 "State" 52 "CA" - 53 "Zip" 55 "94107" - 57 "Country" 52 "US" - BF 10 - 59 "precision" 53 "zip" - 58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 - 59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF - 57 "Address" 50 - 54 "City" 59 "SUNNYVALE" - 55 "State" 52 "CA" - 53 "Zip" 55 "94085" - 57 "Country" 52 "US" + B5 + B7 + B1 03 "Zip" B1 05 "94107" + B1 04 "City" B1 0D "SAN FRANCISCO" + B1 05 "State" B1 02 "CA" + B1 07 "Address" B1 00 + B1 07 "Country" B1 02 "US" + B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52 + B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21 + B1 09 "precision" B1 03 "zip" + 84 + B7 + B1 03 "Zip" B1 05 "94085" + B1 04 "City" B1 09 "SUNNYVALE" + B1 05 "State" B1 02 "CA" + B1 07 "Address" B1 00 + B1 07 "Country" B1 02 "US" + B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03 + B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF + B1 09 "precision" B1 03 "zip" + 84 + 84 ## Security Considerations -**Empty chunks.** Chunks of zero length are prohibited in streamed -(format C) `Repr`s. 
However, a malicious or broken encoder may include -them nonetheless. This opens up a possibility for denial-of-service: -an attacker may begin streaming a `String`, for example, sending an -endless sequence of zero length chunks, appearing to make progress but -not actually doing so. Implementations *MUST* reject zero length -chunks when decoding, and *MUST NOT* produce them when encoding. +**Whitespace.** The textual format allows arbitrary whitespace in many +positions. Consider optional restrictions on the amount of consecutive +whitespace that may appear. -**Whitespace and no-ops.** Similarly, the binary format allows `0xFF` -no-ops and the textual format allows arbitrary whitespace in many -positions. In streaming transfer situations, consider optional -restrictions on the amount of consecutive whitespace or the number of -consecutive no-ops that may appear. +**Annotations.** Similarly, in modes where a `Value` is being read +while annotations are skipped, an endless sequence of annotations may +give an illusion of progress. -**Annotations.** Also similarly, in modes where a `Value` is being -read while annotations are skipped, an endless sequence of annotations -may give an illusion of progress. - -**Canonical form for cryptographic hashing and signing.** As -specified, neither the textual nor the compact binary encoding rules -for `Value`s force canonical serializations. Two serializations of the -same `Value` may yield different binary `Repr`s. +**Canonical form for cryptographic hashing and signing.** No canonical +textual encoding of a `Value` is specified. A +[canonical form][canonical] exists for binary encoded `Value`s, and +implementations *SHOULD* produce canonical binary encodings by +default; however, an implementation *MAY* permit two serializations of +the same `Value` to yield different binary `Repr`s. 
## Acknowledgements -The use of low-order bits of each lead byte for the length of short -values is inspired by a similar feature of [CBOR](http://cbor.io/). +The use of the low-order bits in certain SignedInteger tags for the +length of the following data is inspired by a similar feature of +[CBOR](http://cbor.io/). The treatment of commas as whitespace in the text syntax is inspired by the same feature of [EDN](https://github.com/edn-format/edn). @@ -889,126 +787,42 @@ syntax. ## Appendix. Autodetection of textual or binary syntax -Whitespace characters `0x09` (ASCII HT (tab)), `0x0A` (LF), `0x0D` -(CR), `0x20` (space) and `0x2C` (comma) are ignored at the start of a -textual-syntax Preserves `Document`, and their UTF-8 encodings are -reserved lead byte values in binary-syntax Preserves. +Every tag byte in a binary Preserves `Document` falls within the range +[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation +bytes*, and will never occur as the first byte of a UTF-8 encoded code +point. This means no binary-encoded document can be misinterpreted as +valid UTF-8. -The byte `0xFF`, signifying a no-op in binary-syntax Preserves, has no -meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in -a valid textual-syntax Preserves `Document`. +Conversely, a UTF-8 document must start with a valid codepoint, +meaning in particular that it must not start with a byte in the range +[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax +Preserves document can be misinterpreted as a binary-syntax document. -If applications prefix their textual-syntax documents with e.g. a -space or newline character, and their binary-syntax documents with a -`0xFF` byte, consumers of these documents may reliably autodetect the -syntax being used. In a network protocol supporting this kind of -autodetection, clients may transmit LF or `0xFF` to select text or -binary syntax, respectively. 
+Examination of the top two bits of the first byte of a document gives
+its syntax: if the top two bits are `10`, it should be interpreted as
+a binary-syntax document; otherwise, it should be interpreted as text.
-Furthermore, if an application consistently uses `Record`s for its
-top-level messages,[^records-and-nonatoms] eschewing `Atom`s in
-particular, then autodetection of the encoding used for a given input
-can be done as follows:
+## Appendix. Table of tag values
-| First byte of encoded input | Encoding | Other conclusions |
-| --- | --- | --- |
-| `0x80`--`0x8F` | binary | `Record` (format B) |
-| `0x28` | binary | `Record` (format C) |
-| `0x05` | binary | annotated value (presumably a `Record`) |
-| `0xFF` | binary | no-op; value will follow |
-| --- | --- | --- |
-| `0x7B` ("<") | text | `Record` |
-| `0x40` ("@") | text | annotated value (presumably a `Record`) |
-| `0x09`, `0x0A`, `0x0D`, `0x20` or `0x2C` | text | whitespace; value will follow |
+ 80 - False
+ 81 - True
+ 82 - Float
+ 83 - Double
+ 84 - End marker
+ 85 - Annotation
+ (8x) RESERVED 86-8F
- [^records-and-nonatoms]: Similar reasoning can be used to permit
- unambiguous detection of encoding when `Collection`s are allowed
- as top-level messages as well as `Record`s.
+ 9x - Small integers 0..12,-3..-1
+ An - SignedInteger, (n+1) bytes long
+ B0 - SignedInteger, variable length
+ B1 - String
+ B2 - ByteString
+ B3 - Symbol
-## Appendix. Table of lead byte values
-
- 00 - False
- 01 - True
- 02 - Float
- 03 - Double
- 04 - End stream
- 05 - Annotation
- (0x) RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
- (1x) RESERVED
- 2x - Start Stream (NB. 20, 2C specially reserved)
- 3x - Small integers 0..12,-3..-1
-
- 4x - SignedInteger
- 5x - String
- 6x - ByteString
- 7x - Symbol
-
- 8x - Record
- 9x - Sequence
- Ax - Set
- Bx - Dictionary
-
- (Cx) RESERVED C0-CF
- (Dx) RESERVED D0-DF
- (Ex) RESERVED E0-EF
- (Fx) RESERVED F0-FE
- FF No-op
-
-## Appendix.
Bit fields within lead byte values - - tt nn mmmm contents - ---------- --------- - - 00 00 0000 False - 00 00 0001 True - 00 00 0010 Float, 32 bits big-endian binary - 00 00 0011 Double, 64 bits big-endian binary - 00 00 0100 End Stream (to match a previous Start Stream) - 00 00 0101 Annotation; two more Reprs follow - - 00 00 1001 (ASCII HT (tab)) \ - 00 00 1010 (ASCII LF) |- Reserved: may be used to indicate - 00 00 1101 (ASCII CR) / use of text encoding - - 00 01 xxxx error, RESERVED - - 00 10 ttnn Start Stream - When tt = 00 --> error - When nn = 00 --> (ASCII space) - Reserved: may be used to indicate - use of text encoding - otherwise --> error - 01 --> each chunk is a ByteString - 10 --> each chunk is a single encoded Value - 11 --> error (RESERVED) - When nn = 00 --> (ASCII comma) - Reserved: may be used to indicate - use of text encoding - otherwise --> error - - 00 11 xxxx Small integers 0..12,-3..-1 - - 01 00 mmmm SignedInteger, big-endian binary - 01 01 mmmm String, UTF-8 binary - 01 10 mmmm ByteString - 01 11 mmmm Symbol, UTF-8 binary - - 10 00 mmmm Record - 10 01 mmmm Sequence - 10 10 mmmm Set - 10 11 mmmm Dictionary - - 11 00 xxxx error, RESERVED - 11 01 xxxx error, RESERVED - 11 10 xxxx error, RESERVED - 11 11 1111 no-op; unambiguous indication of binary Preserves format - -Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If -`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of -decoding the varint that follows. - -Then, `l` is the length of the body that follows, counted in bytes for -`tt`=`01` and in `Repr`s for `tt`=`10`. + B4 - Record + B5 - Sequence + B6 - Set + B7 - Dictionary ## Appendix. Binary SignedInteger representation @@ -1016,17 +830,17 @@ Languages that provide fixed-width machine word types may find the following table useful in encoding and decoding binary `SignedInteger` values. 
-| Integer range | Bytes required | Encoding (hex) |
-| --- | --- | --- |
-| -3 ≤ n < 13 (numbers -3..12 encoded specially) | 1 | `3X` |
-| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `41` `XX` |
-| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `42` `XX` `XX` |
-| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `43` `XX` `XX` `XX` |
-| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `44` `XX` `XX` `XX` `XX` |
-| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `45` `XX` `XX` `XX` `XX` `XX` |
-| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `46` `XX` `XX` `XX` `XX` `XX` `XX` |
-| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `47` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
-| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `48` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
+| Integer range | Bytes required | Encoding (hex) |
+| --- | --- | --- |
+| -3 ≤ n ≤ 12 | 1 | `9X` |
+| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
+| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
+| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
+| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
+| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
+| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
+| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
+| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
 ## Notes
diff --git a/questions.md b/questions.md
index ffacc84..a3737a3 100644
--- a/questions.md
+++ b/questions.md
@@ -29,16 +29,3 @@ not. There's only one (?) at the moment, the `%i"f"` in `Float`; should
 it be changed to case-sensitive?
 
 Q. Should `IOList`s be wrapped in an identifying unary record constructor?
-
-TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
-
-TODO: Probably should add a canonicalized subset. Consider adding
-explicit "I promise this is canonical" marker, like a BOM, which
-identifies a binary value as (first) binary and (second, optionally)
-as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
-text; this might be a good candidate for a marker sequence.
-((Actually, perhaps `0x10` would be good!
It corresponds to DLE, "data -link escape"; it is not a printable ASCII character, and is disallowed -in the textual Preserves grammar; and it is also mnemonic for "version -0", since it is the Preserves binary encoding of the small integer -zero.))
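
Reviewer's sketch, not part of the patch: the new `SignedInteger` rules in `preserves.md` can be checked against the patch's own example bytes with a few lines of Python. The function names below are hypothetical. The `varint` helper assumes a little-endian base-128 encoding with a continuation bit, which this excerpt of the spec does not define, so treat that part as an assumption.

```python
def intbytes(x: int) -> bytes:
    """Minimal big-endian two's-complement bytes of x; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; one more bit is needed for the sign.
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    # (magnitude_bits + 1) rounded up to whole bytes equals magnitude_bits // 8 + 1.
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def varint(n: int) -> bytes:
    """ASSUMPTION: little-endian base-128 varint, high bit set on all but the last byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_signed_integer(x: int) -> bytes:
    """Shortest encoding of x under the tag scheme introduced by this patch."""
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # tags 0x90..0x9C
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # tags 0x9D..0x9F
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body   # tags 0xA0..0xAF carry the length
    return b"\xB0" + varint(m) + body         # explicit varint length


if __name__ == "__main__":
    for value in (-257, -3, 0, 12, 13, 128, 32768):
        print(value, encode_signed_integer(value).hex(" ").upper())
```

Running the script prints one line per value, which can be compared by eye with the example table in the `SignedIntegers` hunk.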