From db5c890e1ce6c2f7371e832723e9fdc2fda56929 Mon Sep 17 00:00:00 2001
From: Tony Garnock-Jones
Date: Sat, 29 Sep 2018 17:50:57 +0100
Subject: [PATCH] Simplify, repair, and regularise embedded binary values in textual syntax

---
 implementations/racket/preserves/main.rkt | 13 +++++-
 preserves.md                              | 51 ++++++++++-------------
 2 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/implementations/racket/preserves/main.rkt b/implementations/racket/preserves/main.rkt
index 381b03d..7e9fbc7 100644
--- a/implementations/racket/preserves/main.rkt
+++ b/implementations/racket/preserves/main.rkt
@@ -466,8 +466,10 @@
       [#\# (match i
              [(px #px#"^#set\\{" (list _))
               (sequence-fold (set) (lambda (acc) (set-add acc (read-value))) values #\})]
-             [(px #px#"^#hexvalue\\{" (list _))
-              (decode (read-hex-binary '()) (lambda () (parse-error "Invalid #hexvalue encoding")))]
+             [(px #px#"^#value" (list _))
+              (define bs (read-value))
+              (when (not (bytes? bs)) (parse-error "ByteString must follow #value"))
+              (decode bs)]
              [(px #px#"^#true" (list _))
               #t]
              [(px #px#"^#false" (list _))
@@ -631,6 +633,13 @@
   (cross-check "#base64{SGk}" #"Hi" (#x62 "Hi"))
   (cross-check "#base64{ S G k }" #"Hi" (#x62 "Hi"))
 
+  (cross-check "#value#\"fcorymb\"" #"corymb" (#x66 "corymb"))
+  (cross-check "#value#\"\x01\"" #t (#x01))
+  (cross-check "#value#base64{AQ}" #t (#x01))
+  (cross-check "#value#base64{AQ==}" #t (#x01))
+  (cross-check "#value #base64{AQ==}" #t (#x01))
+  (cross-check "#value ;;comment\n #base64{AQ==}" #t (#x01))
+
   (check-equal? (string->preserve "[]") '())
   (check-equal? (string->preserve "{}") (hash))
   (check-equal? (string->preserve "\"\"") "")
diff --git a/preserves.md b/preserves.md
index 42359ee..79e1e27 100644
--- a/preserves.md
+++ b/preserves.md
@@ -317,21 +317,6 @@ tokens `#set{` and `}`.[^printing-collections]
     commas separating, and commas terminating elements or key/value
     pairs within a collection.
 
-Any `Value` may be represented using the
-[compact binary syntax](#compact-binary-syntax) by directly prefixing
-the binary form of the `Value` with ASCII `SOH` (`%x01`), or by
-enclosing a hexadecimal representation of the binary form of the
-`Value` in the tokens `#hexvalue{` and `}`.[^rationale-switch-to-binary]
-
-    Compact = %x01 / %s"#hexvalue{" *(ws / HEXDIG) ws "}"
-
-  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
-    cannot express every `Value`: specifically, it cannot express the
-    several million floating-point NaNs, or the two floating-point
-    Infinities. Since the compact binary format for `Value`s expresses
-    each `Value` with precision, embedding binary `Value`s solves the
-    problem.
-
 `Boolean`s are the simple literal strings `#true` and `#false`.
 
     Boolean = %s"#true" / %s"#false"
@@ -456,6 +441,29 @@ double quote mark.
     definition of "token representation", and with the
     [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
 
+Finally, any `Value` may be represented by escaping from the textual
+syntax to the [compact binary syntax](#compact-binary-syntax) by
+prefixing a `ByteString` containing the binary representation of the
+`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]
+
+    Compact = %s"#value" ws ByteString
+
+  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
+    cannot express every `Value`: specifically, it cannot express the
+    several million floating-point NaNs, or the two floating-point
+    Infinities. Since the compact binary format for `Value`s expresses
+    each `Value` with precision, embedding binary `Value`s solves the
+    problem.
+
+  [^no-literal-binary-in-text]: Every text is ultimately physically
+    stored as bytes; therefore, it might seem possible to escape to
+    the raw binary form of compact binary encoding from within a
+    piece of textual syntax. However, while bytes must be involved in
+    any *representation* of text, the text *itself* is logically a
+    sequence of *code points* and is not *intrinsically* a binary
+    structure at all. It would be incoherent to expect to be able to
+    access the representation of the text from within the text itself.
+
 ## Compact Binary Syntax
 
 A `Repr` is an encoding, or representation, of a specific `Value`.
@@ -1395,17 +1403,4 @@ tell whether it is an open-parenthesis or not! For this reason, I've
 disallowed whitespace between a label `Value` and the open-parenthesis
 of the fields. Is this reasonable??
 
-Q. Should SOH-prefixed binary values embedded in a textual representation
-be length-prefixed, too - byte strings, essentially? Also, why not
-base64 embedded binary values? The length-prefixing might help with
-being able to avoid having to care whether the embedded value is well-
-formed or not; on the other hand, it means streaming-format embeddings
-aren't possible.
-
-TODO. The SOH-prefixed embedded binary idea is probably incoherent.
-Textual form is *text*, not binary, and since it's code-points, we
-cannot rely on having access to a hypothetical underlying bytestream.
-Remove it, and consider generalizing `#hexvalue{}` to include
-`#base64value{}` or similar.
-
 ## Notes