-
Couldn't load subscription status.
- Fork 240
Description
I've been trying to use fq to edit torrent files.
Encoding is easy (patch attached), but torepr|to_bencode will not produce the original input, data is lost when doing torepr on bencoded strings like info.pieces which contains sha1 bytes. Presumably tovalue being called on bencode strings is the lossy step in _bencode_torepr in bencode.jq.
There are ways to represent mostly UTF-8 strings in JSON, like UTF-8b (used in Python's default approach to file names; invalid bytes are smuggled as unpaired surrogates, escaped with "\uDCxx" in a JSON string), the mapping is bijective, this is round-trip safe. This is valid JSON but not interoperable in the sense of RFC 8259 8.2; jq mangles it with replacement characters. And for SHA1 hashes like info.pieces it is wasteful ("\uDCxx" is 6 bytes instead of 1).
A better way would probably be to pick a different byte string type when data isn't valid UTF-8 (the UTF-8 string requirement should be enforced on dict keys though).
Does fq have a good type for byte strings? One that will not be confused with arrays?
Could that type be emitted by the default tovalue function, in cases where tovalue would currently perform a lossy conversion?
(Linking #639. Byte strings need a new value type which jqlang/jq#2736 has rejected)