From 7055a6467c0d4492c234382fd803e12747d26d98 Mon Sep 17 00:00:00 2001 From: Tony Garnock-Jones Date: Fri, 10 Jun 2022 17:33:52 +0200 Subject: [PATCH] New "blue jelly" machine-oriented binary syntax, inspired by argdata --- _config.yml | 2 +- preserves-binary.md | 242 ++++++++++++++++++++++++-------------------- preserves-text.md | 36 +++---- preserves.md | 142 ++++++++++++-------------- representations.md | 2 + 5 files changed, 220 insertions(+), 204 deletions(-) diff --git a/_config.yml b/_config.yml index c24e804..57eb78b 100644 --- a/_config.yml +++ b/_config.yml @@ -14,4 +14,4 @@ defaults: title: "Preserves" version_date: "June 2022" -version: "0.6.3" +version: "0.7.0" diff --git a/preserves-binary.md b/preserves-binary.md index ddae30c..1dd9dca 100644 --- a/preserves-binary.md +++ b/preserves-binary.md @@ -6,9 +6,11 @@ title: "Preserves: Binary Syntax" Tony Garnock-Jones {{ site.version_date }}. Version {{ site.version }}. - [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [LEB128]: https://en.wikipedia.org/wiki/LEB128 + [argdata]: https://github.com/NuxiNL/argdata [canonical]: canonical-binary.html + [google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints + [vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity *Preserves* is a data model, with associated serialization formats. This document defines one of those formats: a binary syntax for `Value`s from @@ -24,49 +26,52 @@ For a value `v`, we write `«v»` for the `Repr` of v. ### Type and Length representation. Each `Repr` starts with a tag byte, describing the kind of information -represented. Depending on the tag, a length indicator, further encoded -information, and/or an ending tag may follow. +represented. - tag (simple atomic data and small integers) - tag ++ binarydata (most integers) - tag ++ length ++ binarydata (large integers, strings, symbols, and binary) - tag ++ repr ++ ... 
++ endtag (compound data) +However, inspired by [argdata][], a `Repr` does *not* describe its own +length. Instead, the surrounding context must supply the length of the +`Repr`. -The unique end tag is byte value `0x84`. +As a consequence, `Repr`s for `Compound` values store the lengths of +their contained values. Each contained `Value` is represented as a +length in bytes followed by its own `Repr`. -If present after a tag, the length of a following piece of binary data -is formatted as a [base 128 varint][varint].[^see-also-leb128] We -write `varint(m)` for the varint-encoding of `m`. Quoting the -[Google Protocol Buffers][varint] definition, + Each length is stored as an [argdata][]-compatible +big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint +stores seven bits of the length. All bytes have a clear upper bit, +except the final byte, which has the upper bit set. We write +`len(m)` for the varint-encoding of a non-negative integer `m`, +defined recursively as follows: - [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned - integers. Varints and LEB128-encoded integers differ only for - signed integers, which are not used in Preserves. + len(m) = e(m, 128) + where e(v, d) = [v + d] if v < 128 + e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128 -> Each byte in a varint, except the last byte, has the most -> significant bit (msb) set – this indicates that there are further -> bytes to come. The lower 7 bits of each byte are used to store the -> two's complement representation of the number in groups of 7 bits, -> least significant group first. + [^see-also-leb128]: Argdata's length representation is very close to + [Variable-length quantity (VLQ)][VLQ] encoding, differing only in + the flipped interpretation of the high bit of each byte. It is + big-endian, unlike [LEB128][] encoding ([as used by + Google][google-varint] in protobufs). The following table illustrates varint-encoding. 
-| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | -| ------ | ------------------- | ------------ | -| 15 | `0001111` | 15 | -| 300 | `0000010 0101100` | 172 2 | -| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | +| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes | +|-------------|-------------------------------------------|-----------------| +| 15 | `0001111` | 143 | +| 300 | `0000010 0101100` | 2 172 | +| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 | -It is an error for a varint-encoded `m` in a `Repr` to be anything -other than the unique shortest encoding for that `m`. That is, a -varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. +It is an error for a varint-encoded `m` in a `Repr` to be anything other +than the unique shortest encoding for that `m`. That is, a +varint-encoding of `m` *MUST NOT* start with `0`. ### Records, Sequences, Sets and Dictionaries. - «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84] - «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84] - «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84] - «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84] + «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m») + «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m») + «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m») + «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m») + where seq(R_1, ... R_m) = len(R_1) ++ R_1 ++...++ len(R_m) ++ R_m There is *no* ordering requirement on the `E_i` elements or `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any @@ -89,7 +94,7 @@ serializing in some other implementation-defined order. ordering for writing out set elements and dictionary key/value pairs is *not* the same as the sort ordering implied by the semantic ordering of those elements or keys. 
For example, the - `Repr` of a negative number very far from zero will start with + `Repr` of a negative number very far from zero will start with a byte that is *greater* than the byte which starts the `Repr` of zero, making it sort lexicographically later by `Repr`, despite being semantically *less than* zero. @@ -101,39 +106,31 @@ serializing in some other implementation-defined order. ### SignedIntegers. - «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16 - ([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16 - ([0xA0] + x) if (-3≤x≤-1) - ([0x90] + x) if ( 0≤x≤12) - where m = |intbytes(x)| + «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x) -Integers in the range [-3,12] are compactly represented with tags -between `0x90` and `0x9F` because they are so frequently used. -Integers up to 16 bytes long are represented with a single-byte tag -encoding the length of the integer. Larger integers are represented -with an explicit varint length. Every `SignedInteger` *MUST* be -represented with its shortest possible encoding. +The function `intbytes(x)` gives the big-endian two's-complement binary +representation of `x`, taking exactly as many whole bytes as needed to +unambiguously identify the value and its sign. As a special case, +`intbytes(0)` is the empty byte sequence. The most-significant bit in +the first byte in `intbytes(x)` (for `x`≠0) is the sign +bit.[^zero-intbytes] Every `SignedInteger` *MUST* be represented with +its shortest possible encoding. -The function `intbytes(x)` gives the big-endian two's-complement -binary representation of `x`, taking exactly as many whole bytes as -needed to unambiguously identify the value and its sign, and `m = -|intbytes(x)|`. 
The most-significant bit in the first byte in -`intbytes(x)` is the sign bit.[^zero-intbytes] For -example, +For example, «87112285931760246646623899502532662132736» - = B0 12 01 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 - 00 00 + = A3 01 00 00 00 00 00 00 00 + 00 00 00 00 00 00 00 00 + 00 00 - «-257» = A1 FE FF «-3» = 9D «128» = A1 00 80 - «-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF - «-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00 - «-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF - «-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00 - «-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF - «-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00 - «-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00 + «-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80 + «-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF + «-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00 + «-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF + «-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00 + «-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF + «-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00 + «-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00 [^zero-intbytes]: The value 0 needs zero bytes to identify the value, so `intbytes(0)` is the empty byte string. Non-zero values @@ -146,19 +143,19 @@ and `Symbol`, the data following the tag is a UTF-8 encoding of the `Value`'s code points, while for `ByteString` it is the raw data contained within the `Value` unmodified. - «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String - [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString - [0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol + «S» = [0xA4] ++ utf8(S) if S ∈ String + [0xA5] ++ S if S ∈ ByteString + [0xA6] ++ utf8(S) if S ∈ Symbol ### Booleans. - «#f» = [0x80] - «#t» = [0x81] + «#f» = [0xA0] + «#t» = [0xA1] ### Floats and Doubles. 
- «F» when F ∈ Float = [0x82] ++ binary32(F) - «D» when D ∈ Double = [0x83] ++ binary64(D) + «F» when F ∈ Float = [0xA2] ++ binary32(F) + «D» when D ∈ Double = [0xA2] ++ binary64(D) The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. @@ -166,20 +163,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and ### Embeddeds. The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to -represent the denoted object, prefixed with `[0x86]`. +represent the denoted object, prefixed with `[0xBF]`. - «#!V» = [0x86] ++ «V» + «#!V» = [0xBF] ++ «V» ### Annotations. -To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with -`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual -syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols, -`a` and `b`, is +To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ..., +v_m]`, surround `r` as follows: + + [0xBE] ++ len(r) ++ r ++ len(v_1) ++ v_1 ++...++ len(v_m) ++ v_m + +The `Repr` `r` *MUST NOT* already have annotations; that is, it must not begin with `0xBE`. + +For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e. +an empty sequence annotated with two symbols, `a` and `b`, is «@a @b []» - = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]» - = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84] + = [0xBE] ++ len(«[]») ++ «[]» ++ len(«a») ++ «a» ++ len(«b») ++ «b» + = [0xBE, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62] ## Security Considerations @@ -194,45 +196,67 @@ implementations *SHOULD* produce canonical binary encodings by default; however, an implementation *MAY* permit two serializations of the same `Value` to yield different binary `Repr`s. +## Acknowledgements + +The exclusion of lengths from `Repr`s, placing lengths instead ahead of +contained values in sequences, is inspired by [argdata][]. + ## Appendix. 
Autodetection of textual or binary syntax -Every tag byte in a binary Preserves `Document` falls within the range +Every tag byte in a binary Preserves `Repr` falls within the range [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation bytes*, and will never occur as the first byte of a UTF-8 encoded code -point. This means no binary-encoded document can be misinterpreted as +point. This means no binary-encoded `Repr` can be misinterpreted as valid UTF-8. -Conversely, a UTF-8 document must start with a valid codepoint, +Conversely, a UTF-8 `Document` must start with a valid codepoint, meaning in particular that it must not start with a byte in the range [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax -Preserves document can be misinterpreted as a binary-syntax document. +Preserves `Document` can be misinterpreted as a binary-syntax `Repr`. -Examination of the top two bits of the first byte of a document gives -its syntax: if the top two bits are `10`, it should be interpreted as -a binary-syntax document; otherwise, it should be interpreted as text. +Examination of the top two bits of the first byte of an encoded `Value` +gives its syntax: if the top two bits are `10`, it should be interpreted +as a binary-syntax `Repr`; otherwise, it should be interpreted as text. + +**Streaming.** Autodetection is still possible when streaming an +undetermined number of `Value`s across, say, a TCP/IP connection: + + - If the text syntax is to be used for the connection, simply start + writing each `Document` one after the other. Documents for `Atom`s + *MUST* be separated from their neighbours by whitespace; in general, + whitespace *SHOULD* be used to separate adjacent documents. + Specifically, whitespace separating adjacent documents *SHOULD* be + ASCII newline (10). + + - If the binary syntax is to be used for the connection, start the + connection with byte `0xA8` (sequence). 
After the initial byte, send + each value `v` as `len(«v») ++ «v»`. A side effect of this approach + is that the entire stream, when complete, is a valid `Sequence` + `Repr`. ## Appendix. Table of tag values - 80 - False - 81 - True - 82 - Float - 83 - Double - 84 - End marker - 85 - Annotation - 86 - Embedded - (8x) RESERVED 87-8F + (8x) RESERVED 80-8F + (9x) RESERVED 90-9F - 9x - Small integers 0..12,-3..-1 - An - Medium integers, (n+1) bytes long - B0 - Large integers, variable length - B1 - String - B2 - ByteString - B3 - Symbol + A0 - False + A1 - True + A2 - Float or Double (length disambiguates) + A3 - SignedIntegers (0 is encoded with no bytes at all) + A4 - String (no trailing NUL is added) + A5 - ByteString + A6 - Symbol - B4 - Record - B5 - Sequence - B6 - Set - B7 - Dictionary + A7 - Record + A8 - Sequence + A9 - Set + AA - Dictionary + + (Ax) RESERVED AB-AF + + (Bx) RESERVED B0-BD + BE - Annotations. {BE Lval val Lann0 ann0 Lann1 ann1 ...} + BF - Embedded ## Appendix. Binary SignedInteger representation @@ -242,15 +266,15 @@ values. 
| Integer range | Bytes required | Encoding (hex) | | --- | --- | --- | -| -3 ≤ n ≤ 12 | 1 | `9X` | -| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` | -| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` | -| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` | +| 0 | 1 | `A3` | +| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` | +| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` | +| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` | | -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` | -| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | ## Notes diff --git a/preserves-text.md b/preserves-text.md index e6a77c0..a33ace4 100644 --- a/preserves-text.md +++ b/preserves-text.md @@ -206,8 +206,8 @@ object, prefixed with `#!`. Embedded = "#!" Value Finally, any `Value` may be represented by escaping from the textual -syntax to the [machine-oriented binary syntax](preserves-binary.html) -by prefixing a `ByteString` containing the binary representation of the +syntax to the [machine-oriented binary syntax](preserves-binary.html) by +prefixing a `ByteString` containing the binary representation of the `Value` with `#=`.[^rationale-switch-to-binary] [^no-literal-binary-in-text] [^machine-value-annotations] @@ -216,18 +216,18 @@ by prefixing a `ByteString` containing the binary representation of the [^rationale-switch-to-binary]: **Rationale.** The textual syntax cannot express every `Value`: specifically, it cannot express the several million floating-point NaNs, or the two floating-point - Infinities. 
Since the machine-oriented binary format for `Value`s - expresses each `Value` with precision, embedding binary `Value`s - solves the problem. + Infinities. Since the machine-oriented binary format for `Value`s expresses + each `Value` with precision, embedding binary `Value`s solves the + problem. [^no-literal-binary-in-text]: Every text is ultimately physically - stored as bytes; therefore, it might seem possible to escape to the - raw form of binary encoding from within a piece of textual syntax. - However, while bytes must be involved in any *representation* of - text, the text *itself* is logically a sequence of *code points* and - is not *intrinsically* a binary structure at all. It would be - incoherent to expect to be able to access the representation of the - text from within the text itself. + stored as bytes; therefore, it might seem possible to escape to + the raw binary encoding from within a + piece of textual syntax. However, while bytes must be involved in + any *representation* of text, the text *itself* is logically a + sequence of *code points* and is not *intrinsically* a binary + structure at all. It would be incoherent to expect to be able to + access the representation of the text from within the text itself. [^machine-value-annotations]: Any text-syntax annotations preceding the `#` are prepended to any binary-syntax annotations yielded by @@ -235,11 +235,11 @@ by prefixing a `ByteString` containing the binary representation of the ## Annotations -When written down, a `Value` may have an associated sequence of -*annotations* carrying “out-of-band” contextual metadata about the -value. Each annotation is, in turn, a `Value`, and may itself have -annotations. The ordering of annotations attached to a `Value` is -significant. +When written down, a `Value` may have an associated +sequence of *annotations* carrying “out-of-band” contextual metadata +about the value. Each annotation is, in turn, a `Value`, and may +itself have annotations. 
The ordering of annotations attached to a +`Value` is significant. Value =/ ws "@" Value Value @@ -276,7 +276,7 @@ different. ## Security Considerations -**Whitespace.** The textual format allows arbitrary whitespace in many +**Whitespace.** The text syntax allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear. diff --git a/preserves.md b/preserves.md index 411d0b7..f52fc2d 100644 --- a/preserves.md +++ b/preserves.md @@ -220,21 +220,21 @@ The total ordering specified [above](#total-order) means that the following stat -| Value | Encoded byte sequence | -|-----------------------------|---------------------------------------------------------------------------------| -| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 | -| `[1 2 3 4]` | B5 91 92 93 94 84 | -| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 | -| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' | -| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 | -| `-257` | A1 FE FF | -| `-1` | 9F | -| `0` | 90 | -| `1` | 91 | -| `255` | A1 00 FF | -| `1.0f` | 82 3F 80 00 00 | -| `1.0` | 83 3F F0 00 00 00 00 00 00 | -| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 | +| Value | Encoded byte sequence | +|-----------------------------|------------------------------------------------------------------------------| +| `<capture <discard>>` | A7 88 A6 'c' 'a' 'p' 't' 'u' 'r' 'e' 8A A7 88 A6 'd' 'i' 's' 'c' 'a' 'r' 'd' | +| `[1 2 3 4]` | A8 82 A3 01 82 A3 02 82 A3 03 82 A3 04 | +| `[-2 -1 0 1]` | A8 82 A3 FE 82 A3 FF 81 A3 82 A3 01 | +| `"hello"` | A4 'h' 'e' 'l' 'l' 'o' | +| `["a" b #"c" [] #{} #t #f]` | A8 82 A4 'a' 82 A6 'b' 82 A5 'c' 81 A8 81 A9 81 A1 81 A0 | +| `-257` | A3 FE FF | +| `-1` | A3 FF | +| `0` | A3 | +| `1` | A3 01 | +| `255` | A3 00 FF | +| `1.0f` | A2 3F 80 00 00 | +| `1.0` | A2 3F F0 00 00 00 00 00 00 | +| `-1.202e300` | A2 FE 3C B7 B7 59 BF 04 26 | The next example uses a non-`Symbol` 
label for a record.[^extensibility2] The `Record` @@ -242,24 +242,21 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R encodes to - B4 ;; Record - B5 ;; Sequence - B3 06 74 69 74 6C 65 64 ;; Symbol, "titled" - B3 06 70 65 72 73 6F 6E ;; Symbol, "person" - 92 ;; SignedInteger, "2" - B3 05 74 68 69 6E 67 ;; Symbol, "thing" - 91 ;; SignedInteger, "1" - 84 ;; End (sequence) - A0 65 ;; SignedInteger, "101" - B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell" - B4 ;; Record - B3 04 64 61 74 65 ;; Symbol, "date" - A1 07 1D ;; SignedInteger, "1821" - 92 ;; SignedInteger, "2" - 93 ;; SignedInteger, "3" - 84 ;; End (record) - B1 02 44 72 ;; String, "Dr" - 84 ;; End (record) + A7 ;; Record + 9E A8 ;; Length 30, Sequence + 87 A6 74 69 74 6C 65 64 ;; Length 7, Symbol, "titled" + 87 A6 70 65 72 73 6F 6E ;; Length 7, Symbol, "person" + 82 A3 02 ;; Length 2, SignedInteger, "2" + 86 A6 74 68 69 6E 67 ;; Length 6, Symbol, "thing" + 82 A3 01 ;; Length 2, SignedInteger, "1" + 82 A3 65 ;; Length 2, SignedInteger, "101" + 8A A4 42 6C 61 63 6B 77 65 6C 6C ;; Length 10, String, "Blackwell" + 91 A7 ;; Length 17, Record + 85 A6 64 61 74 65 ;; Length 5, Symbol, "date" + 83 A3 07 1D ;; Length 3, SignedInteger, "1821" + 82 A3 02 ;; Length 2, SignedInteger, "2" + 82 A3 03 ;; Length 2, SignedInteger, "3" + 83 A4 44 72 ;; Length 3, String, "Dr" [^extensibility2]: It happens to line up with Racket's representation of a record label for an inheritance hierarchy @@ -311,27 +308,23 @@ The first RFC 8259 example: when read using the Preserves text syntax encodes via the binary syntax as follows: - B7 - B1 05 "Image" - B7 - B1 03 "IDs" B5 - A0 74 - A1 03 AF - A1 00 EA - A2 00 97 89 - 84 - B1 05 "Title" B1 14 "View from 15th Floor" - B1 05 "Width" A1 03 20 - B1 06 "Height" A1 02 58 - B1 08 "Animated" B3 05 "false" - B1 09 "Thumbnail" - B7 - B1 03 "Url" B1 26 "http://www.example.com/image/481989943" - B1 05 "Width" A0 64 - B1 06 "Height" A0 7D - 84 - 84 - 84 + AA + 
86 A4 "Image" + 01 AC AA + 89 A4 "Animated" 86 A6 "false" + 87 A4 "Height" 83 A3 02 58 + 84 A4 "IDs" 91 A8 + 82 A3 74 + 83 A3 03 AF + 83 A3 00 EA + 84 A3 00 97 89 + 8A A4 "Thumbnail" + C3 AA + 87 A4 "Height" 82 A3 7D + 84 A4 "Url" A7 A4 "http://www.example.com/image/481989943" + 86 A4 "Width" 82 A3 64 + 86 A4 "Title" 95 A4 "View from 15th Floor" + 86 A4 "Width" 83 A3 03 20 The second RFC 8259 example: @@ -360,28 +353,25 @@ The second RFC 8259 example: encodes to binary as follows: - B5 - B7 - B1 03 "Zip" B1 05 "94107" - B1 04 "City" B1 0D "SAN FRANCISCO" - B1 05 "State" B1 02 "CA" - B1 07 "Address" B1 00 - B1 07 "Country" B1 02 "US" - B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52 - B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21 - B1 09 "precision" B1 03 "zip" - 84 - B7 - B1 03 "Zip" B1 05 "94085" - B1 04 "City" B1 09 "SUNNYVALE" - B1 05 "State" B1 02 "CA" - B1 07 "Address" B1 00 - B1 07 "Country" B1 02 "US" - B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03 - B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF - B1 09 "precision" B1 03 "zip" - 84 - 84 + A8 + FE AA + 88 A4 "Address" 81 A4 + 85 A4 "City" 8E A4 "SAN FRANCISCO" + 88 A4 "Country" 83 A4 "US" + 89 A4 "Latitude" 89 A2 40 42 E2 26 80 9D 49 52 + 8A A4 "Longitude" 89 A2 C0 5E 99 56 6C F4 1F 21 + 86 A4 "State" 83 A4 "CA" + 84 A4 "Zip" 86 A4 "94107" + 8A A4 "precision" 84 A4 "zip" + FA AA + 88 A4 "Address" 81 A4 + 85 A4 "City" 8A A4 "SUNNYVALE" + 88 A4 "Country" 83 A4 "US" + 89 A4 "Latitude" 89 A2 40 42 AF 9D 66 AD B4 03 + 8A A4 "Longitude" 89 A2 C0 5E 81 AA 4F CA 42 AF + 86 A4 "State" 83 A4 "CA" + 84 A4 "Zip" 86 A4 "94085" + 8A A4 "precision" 84 A4 "zip" ## Notes diff --git a/representations.md b/representations.md index 9efc9bb..21e986f 100644 --- a/representations.md +++ b/representations.md @@ -2,6 +2,8 @@ title: "Representing Values in Programming Languages" --- + [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map + **NOT YET READY** We have given a definition of `Value` and its semantics, and 
proposed