More TODOs in the text; initial textual reader in Racket

This commit is contained in:
Tony Garnock-Jones 2018-09-27 19:25:28 +01:00
parent f9497d64c5
commit b807b38a44
2 changed files with 454 additions and 24 deletions

View File

@ -302,12 +302,20 @@ grouping of its field-`Value`s.
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as a simple curly-brace-enclosed non-empty sequence of
values, or as a possibly-empty sequence of values enclosed by the
tokens `#set{` and `}`.
tokens `#set{` and `}`.[^printing-collections]
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
[^printing-collections]: **Implementation note.** When implementing
printing of `Value`s using the textual syntax, consider supporting
(a) optional pretty-printing with indentation, (b) optional
JSON-compatible print mode for that subset of `Value` that is
compatible with JSON, and (c) optional submodes for no commas,
commas separating, and commas terminating elements or key/value
pairs within a collection.
Any `Value` may be represented using the
[compact binary syntax](#compact-binary-syntax) by directly prefixing
the binary form of the `Value` with ASCII `SOH` (`%x01`), or by
@ -419,28 +427,26 @@ URL-safe Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
A `Symbol` may be written in a "bare" form,[^cf-sexp-token] so long as
A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol, or in a quoted form. The quoted form is much the same as the
syntax for `String`s, including embedded escape syntax, except using a
bar or pipe character (`|`) instead of a double quote mark.
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct
symcont = ALPHA / sympunct / DIGIT / "-" / "."
symstart = ALPHA / sympunct / symunicode
symcont = ALPHA / sympunct / symunicode / DIGIT / "-" / "."
sympunct = "~" / "!" / "@" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "<" / ">" / "/"
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
symunicode = <any code point greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co>
[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
definition of "token representation".
TODO: More unicode in unescaped symbols?
### Printing
TODO: Recommend a JSON-compatible print mode. Recommend a submode with
trailing commas.
definition of "token representation", and with the
[R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
## Compact Binary Syntax
@ -722,6 +728,8 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
## Examples
### Simple examples
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->
@ -749,7 +757,7 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record`
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
[titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")
@ -783,6 +791,97 @@ encodes to
For more detail on Racket's representations of record labels, see
[the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).
---
### JSON examples
The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
encodes to binary as follows:
E2
55 "Image"
EC
55 "Width" 42 03 20
55 "Title" 5F 14 "View from 15th Floor"
58 "Animated" 75 "false"
56 "Height" 42 02 58
59 "Thumbnail"
E6
55 "Width" 41 64
53 "Url" 5F 26 "http://www.example.com/image/481989943"
56 "Height" 41 7D
53 "IDs" C4
41 74
42 03 AF
42 00 EA
43 00 97 89
and the second example:
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
encodes to binary as follows:
C2
EF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
57 "Address" 50
54 "City" 5D "SAN FRANCISCO"
55 "State" 52 "CA"
53 "Zip" 55 "94107"
57 "Country" 52 "US"
EF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
57 "Address" 50
54 "City" 59 "SUNNYVALE"
55 "State" 52 "CA"
53 "Zip" 55 "94085"
57 "Country" 52 "US"
## Conventions for Common Data Types
The `Value` data type is essentially an S-Expression, able to
@ -918,8 +1017,6 @@ or `date-time` productions of
## Security Considerations
TODO: Lots of whitespace is just like lots of empty chunks
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a
@ -929,6 +1026,12 @@ optional reasonable restrictions on the number of consecutive empty
chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely.
**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace and comments that may appear in a serialized
`Value`.
**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the

View File

@ -4,6 +4,8 @@
(provide (struct-out stream-of)
(struct-out record)
short-form-labels
read-preserve
string->preserve
encode
decode
wire-value)
@ -176,15 +178,15 @@
(define s (bytes->string/utf-8 bs))
(lambda () (ks (if (= minor 3) (string->symbol s) s) rest))))]))
(define (build-record label fields)
(with-handlers [(exn:fail:contract? (lambda (e) (record label fields)))]
(apply make-prefab-struct label fields)))
(define (decode-record minor fields rest ks kf)
(define (build key fs)
(ks (with-handlers [(exn:fail:contract? (lambda (e) (record key fs)))]
(apply make-prefab-struct key fs))
rest))
(match* (minor fields)
[(3 (list* key fs)) (build key fs)]
[(3 (list* key fs)) (ks (build-record key fs) rest)]
[(3 '()) (kf)]
[(n fs) (build (vector-ref (short-form-labels) n) fs)]))
[(n fs) (ks (build-record (vector-ref (short-form-labels) n) fs) rest)]))
(define (decode-collection minor vs rest ks kf)
(match minor
@ -244,6 +246,241 @@
;;---------------------------------------------------------------------------
(define (read-preserve [i (current-input-port)])
(local-require (only-in syntax/readerr raise-read-error))
(local-require net/base64)
(local-require file/sha1)
(define-match-expander px
(syntax-rules ()
[(_ re pat) (app (lambda (v) (regexp-try-match re v)) pat)]))
(define (skip-whitespace)
(regexp-match? #px#"^(\\s|,)*" i)
(match (peek-char i)
[#\; (regexp-match? #px#"[^\r\n]*[\r\n]" i) (skip-whitespace)]
[_ #t]))
(define (parse-error fmt . args)
(define-values [line column pos] (port-next-location i))
(raise-read-error (format "read-preserve: ~a" (apply format fmt args))
(object-name i)
line
column
pos
#f))
(define (eof-guard ch)
(match ch
[(? eof-object?) (parse-error "Unexpected end of input")]
[ch ch]))
(define (peek/no-eof) (eof-guard (peek-char i)))
(define (read/no-eof) (eof-guard (read-char i)))
(define (read-sequence terminator)
(sequence-fold '() accumulate-value reverse terminator))
(define (accumulate-value acc)
(cons (read-value) acc))
(define (read-dictionary-or-set)
(sequence-fold #f
(lambda (acc)
(define k (read-value))
(skip-whitespace)
(match (peek-char i)
[#\: (when (set? acc) (parse-error "Unexpected key/value separator in set"))
(read-char i)
(define v (read-value))
(hash-set (or acc (hash)) k v)]
[_ (when (hash? acc) (parse-error "Missing expected key/value separator"))
(set-add (or acc (set)) k)]))
values
#\}))
(define PIPE #\|)
(define (read-raw-symbol)
(let loop ((acc '()))
(match (peek-char i)
[(or #\{ #\} #\[ #\] #\" #\; #\, #\# #\: (? char-whitespace?) (? eof-object?) (== PIPE))
(string->symbol (list->string (reverse acc)))]
[_ (loop (cons (read-char i) acc))])))
(define (read-encoded-binary kind valid-char? finish)
(let loop ((acc '()))
(skip-whitespace)
(let ((ch (read/no-eof)))
(cond [(valid-char? ch)
(loop (cons ch acc))]
[(eqv? ch #\})
(finish (list->string (reverse acc)))]
[else (parse-error "Invalid ~a character" kind)]))))
(define (read-base64-binary)
(read-encoded-binary "base64"
(lambda (ch)
(or (and (char>=? ch #\A) (char<=? ch #\Z))
(and (char>=? ch #\a) (char<=? ch #\z))
(and (char>=? ch #\0) (char<=? ch #\9))
(memv ch '(#\+ #\/ #\- #\_ #\=))))
(lambda (s)
(base64-decode (string->bytes/latin-1 s)))))
(define (read-hex-binary)
(read-encoded-binary "hex"
(lambda (ch)
(or (and (char>=? ch #\A) (char<=? ch #\F))
(and (char>=? ch #\a) (char<=? ch #\f))
(and (char>=? ch #\0) (char<=? ch #\9))))
(lambda (s)
(hex-string->bytes s))))
(define (read-stringlike xform-item finish terminator-char hexescape-char hexescape-proc)
(let loop ((acc '()))
(match (read/no-eof)
[(== terminator-char) (finish (reverse acc))]
[#\\ (match (read/no-eof)
[(== hexescape-char) (loop (cons (hexescape-proc) acc))]
[(and c (or (== terminator-char) #\\ #\/)) (loop (cons (xform-item c) acc))]
[#\b (loop (cons (xform-item #\u08) acc))]
[#\f (loop (cons (xform-item #\u0C) acc))]
[#\n (loop (cons (xform-item #\u0A) acc))]
[#\r (loop (cons (xform-item #\u0D) acc))]
[#\t (loop (cons (xform-item #\u09) acc))]
[c (parse-error "Invalid escape code \\~a" c)])]
[c (loop (cons (xform-item c) acc))])))
(define (read-string terminator-char)
(read-stringlike values
list->string
terminator-char
#\u
(lambda ()
(integer->char
(match i
[(px #px#"^[a-fA-F0-9]{4}" (list hexdigits))
(define n1 (string->number (bytes->string/utf-8 hexdigits)))
(if (<= #xd800 n1 #xdfff) ;; surrogate pair first half
(match i
[(px #px#"^\\\\u([a-fA-F0-9]{4})" (list hexdigits2))
(define n2 (string->number (bytes->string/utf-8 hexdigits2)))
(if (<= #xdc00 n2 #xdfff)
(+ (arithmetic-shift (- n1 #xd800) 10)
(- n2 #xdc00)
#x10000)
(parse-error "Bad second half of surrogate pair"))]
[_ (parse-error "Missing second half of surrogate pair")])
n1)]
[_ (parse-error "Bad string \\u escape")])))))
(define (read-literal-binary)
(read-stringlike (lambda (c)
(define b (char->integer c))
(when (>= b 256)
(parse-error "Invalid code point ~a (~v) in literal binary" b c))
b)
list->bytes
#\"
#\x
(lambda ()
(match i
[(px #px#"^[a-fA-F0-9]{2}" (list hexdigits))
(string->number (bytes->string/utf-8 hexdigits))]
[_ (parse-error "Bad binary \\x escape")]))))
(define (read-intpart acc-rev)
(match (peek-char i)
[#\0 (read-fracexp (cons (read-char i) acc-rev))]
[_ (read-digit+ acc-rev read-fracexp)]))
(define (read-digit* acc-rev k)
(match (peek-char i)
[(? char-numeric?) (read-digit* (cons (read-char i) acc-rev) k)]
[_ (k acc-rev)]))
(define (read-digit+ acc-rev k)
(match (peek-char i)
[(? char-numeric?) (read-digit* (cons (read-char i) acc-rev) k)]
[_ (parse-error "Incomplete number")]))
(define (read-fracexp acc-rev)
(match (peek-char i)
[#\. (read-digit+ (cons (read-char i) acc-rev) read-exp)]
[_ (read-exp acc-rev)]))
(define (read-exp acc-rev)
(match (peek-char i)
[(or #\e #\E) (read-sign-and-exp (cons (read-char i) acc-rev))]
[_ (finish-number acc-rev)]))
(define (read-sign-and-exp acc-rev)
(match (peek-char i)
[(or #\+ #\-) (read-digit+ (cons (read-char i) acc-rev) finish-number)]
[_ (read-digit+ acc-rev finish-number)]))
(define (finish-number acc-rev)
(define s (list->string (reverse acc-rev)))
(define n (string->number s))
(when (not n) (parse-error "Invalid number: ~v" s))
n)
(define (read-number)
(match (peek/no-eof)
[#\- (read-intpart (list (read-char i)))]
[_ (read-intpart (list))]))
(define (sequence-fold acc accumulate-one finish terminator-char)
(let loop ((acc acc))
(skip-whitespace)
(match (peek/no-eof)
[(== terminator-char) (read-char i) (finish acc)]
[_ (loop (accumulate-one acc))])))
(define (collect-fields head)
(skip-whitespace)
(match (peek-char i)
[#\( (collect-fields (build-record head (read-sequence #\))))]
[_ head]))
(define (read-value)
(skip-whitespace)
(collect-fields
(match (peek-char i)
[(? eof-object? o) o]
[#\{ (read-char i) (read-dictionary-or-set)]
[#\[ (read-char i) (read-sequence #\])]
[(or #\- #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (read-number)]
[#\" (read-char i) (read-string #\")]
[(== PIPE) (read-char i) (string->symbol (read-string PIPE))]
[#\# (match i
[(px #px#"^#set\\{" (list _))
(sequence-fold (set) accumulate-value values #\})]
[(px #px#"^#hexvalue\\{" (list _))
(decode (read-hex-binary) (lambda () (parse-error "Invalid #hexvalue encoding")))]
[(px #px#"^#true" (list _))
#t]
[(px #px#"^#false" (list _))
#f]
[(px #px#"^#\"" (list _))
(read-literal-binary)]
[(px #px#"^#hex\\{" (list _))
(read-hex-binary)]
[(px #px#"^#base64\\{" (list _))
(read-base64-binary)]
[_
(parse-error "Invalid preserve value")])]
[#\: (parse-error "Unexpected key/value separator between items")]
[_ (read-raw-symbol)])))
(read-value))
(define (string->preserve s)
(read-preserve (open-input-string s)))
;;---------------------------------------------------------------------------
(module+ test
(require rackunit)
(require (for-syntax racket syntax/srcloc))
@ -444,4 +681,94 @@
#xE2 #x7A "first-name" #x59 "Elizabeth"
#xE2 #x77 "surname" #x59 "Blackwell"
))
(let ()
(local-require json)
(define rfc8259-example1 (string->preserve #<<EOF
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
EOF
))
(define rfc8259-example2 (string->preserve #<<EOF
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
EOF
))
(check-both-directions/nondeterministic
rfc8259-example1
(#xe2
#x55 "Image"
#xec
#x55 "Width" #x42 #x03 #x20
#x55 "Title" #x5f #x14 "View from 15th Floor"
#x58 "Animated" #x75 "false"
#x56 "Height" #x42 #x02 #x58
#x59 "Thumbnail"
#xe6
#x55 "Width" #x41 #x64
#x53 "Url" #x5f #x26 "http://www.example.com/image/481989943"
#x56 "Height" #x41 #x7d
#x53 "IDs" #xc4
#x41 #x74
#x42 #x03 #xaf
#x42 #x00 #xea
#x43 #x00 #x97 #x89
))
(check-both-directions/nondeterministic
rfc8259-example2
(#xc2
#xef #x10
#x59 "precision" #x53 "zip"
#x58 "Latitude" #x03 #x40 #x42 #xe2 #x26 #x80 #x9d #x49 #x52
#x59 "Longitude" #x03 #xc0 #x5e #x99 #x56 #x6c #xf4 #x1f #x21
#x57 "Address" #x50
#x54 "City" #x5D "SAN FRANCISCO"
#x55 "State" #x52 "CA"
#x53 "Zip" #x55 "94107"
#x57 "Country" #x52 "US"
#xef #x10
#x59 "precision" #x53 "zip"
#x58 "Latitude" #x03 #x40 #x42 #xaf #x9d #x66 #xad #xb4 #x03
#x59 "Longitude" #x03 #xc0 #x5e #x81 #xaa #x4f #xca #x42 #xaf
#x57 "Address" #x50
#x54 "City" #x59 "SUNNYVALE"
#x55 "State" #x52 "CA"
#x53 "Zip" #x55 "94085"
#x57 "Country" #x52 "US"
))
)
)