Simplify, repair, and regularise embedded binary values in textual syntax

This commit is contained in:
Tony Garnock-Jones 2018-09-29 17:50:57 +01:00
parent 6feb320aad
commit db5c890e1c
2 changed files with 34 additions and 30 deletions

View File

@ -466,8 +466,10 @@
[#\# (match i
[(px #px#"^#set\\{" (list _))
(sequence-fold (set) (lambda (acc) (set-add acc (read-value))) values #\})]
[(px #px#"^#hexvalue\\{" (list _))
(decode (read-hex-binary '()) (lambda () (parse-error "Invalid #hexvalue encoding")))]
[(px #px#"^#value" (list _))
(define bs (read-value))
(when (not (bytes? bs)) (parse-error "ByteString must follow #value"))
(decode bs)]
[(px #px#"^#true" (list _))
#t]
[(px #px#"^#false" (list _))
@ -631,6 +633,13 @@
(cross-check "#base64{SGk}" #"Hi" (#x62 "Hi"))
(cross-check "#base64{ S G k }" #"Hi" (#x62 "Hi"))
(cross-check "#value#\"fcorymb\"" #"corymb" (#x66 "corymb"))
(cross-check "#value#\"\x01\"" #t (#x01))
(cross-check "#value#base64{AQ}" #t (#x01))
(cross-check "#value#base64{AQ==}" #t (#x01))
(cross-check "#value #base64{AQ==}" #t (#x01))
(cross-check "#value ;;comment\n #base64{AQ==}" #t (#x01))
(check-equal? (string->preserve "[]") '())
(check-equal? (string->preserve "{}") (hash))
(check-equal? (string->preserve "\"\"") "")

View File

@ -317,21 +317,6 @@ tokens `#set{` and `}`.[^printing-collections]
commas separating, and commas terminating elements or key/value
pairs within a collection.
Any `Value` may be represented using the
[compact binary syntax](#compact-binary-syntax) by directly prefixing
the binary form of the `Value` with ASCII `SOH` (`%x01`), or by
enclosing a hexadecimal representation of the binary form of the
`Value` in the tokens `#hexvalue{` and `}`.[^rationale-switch-to-binary]
Compact = %x01 <binary data> / %s"#hexvalue{" *(ws / HEXDIG) ws "}"
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
cannot express every `Value`: specifically, it cannot express the
several million floating-point NaNs, or the two floating-point
Infinities. Since the compact binary format for `Value`s expresses
each `Value` with precision, embedding binary `Value`s solves the
problem.
`Boolean`s are the simple literal strings `#true` and `#false`.
Boolean = %s"#true" / %s"#false"
@ -456,6 +441,29 @@ double quote mark.
definition of "token representation", and with the
[R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]
Compact = %s"#value" ws ByteString
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
cannot express every `Value`: specifically, it cannot express the
several million floating-point NaNs, or the two floating-point
Infinities. Since the compact binary format for `Value`s expresses
each `Value` with precision, embedding binary `Value`s solves the
problem.
[^no-literal-binary-in-text]: Every text is ultimately physically
stored as bytes; therefore, it might seem possible to escape to
the raw binary form of compact binary encoding from within a
pieces of textual syntax. However, while bytes must be involved in
any *representation* of text, the text *itself* is logically a
sequence of *code points* and is not *intrinsically* a binary
structure at all. It would be incoherent to expect to be able to
access the representation of the text from within the text itself.
## Compact Binary Syntax
A `Repr` is an encoding, or representation, of a specific `Value`.
@ -1395,17 +1403,4 @@ tell whether it is an open-parenthesis or not! For this reason, I've
disallowed whitespace between a label `Value` and the open-parenthesis
of the fields. Is this reasonable??
Q. Should SOH-prefixed binary values embedded in a textual representation
be length-prefixed, too - byte strings, essentially? Also, why not
base64 embedded binary values? The length-prefixing might help with
being able to avoid having to care whether the embedded value is well-
formed or not; on the other hand, it means streaming-format embeddings
aren't possible.
TODO. The SOH-prefixed embedded binary idea is probably incoherent.
Textual form is *text*, not binary, and since it's code-points, we
cannot rely on having access to a hypothetical underlying bytestream.
Remove it, and consider generalizing `#hexvalue{}` to include
`#base64value{}` or similar.
## Notes