Given that interoperability is one of the explicitly mentioned design criteria, it seems a bit of an oversight that they forgot to specify how the encoding actually works. It says nothing about how you encode the 32-bit integer into the five characters, and the description of the decoding operation is also pretty vague:
> The five characters SHALL each be converted into a value 0 to 84, and accumulated by multiplication by 85, from most to least significant.
There is of course the example and the reference implementation and everyone knows how this kind of encoding works, so you can certainly figure out how you are supposed to implement it.
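Presumably it's meant as something like this (a sketch only, assuming the alphabet from the spec; invalid characters aren't checked here):

    #include <stdint.h>
    #include <string.h>

    /* The 85-character alphabet from the Z85 spec. */
    static const char z85[] =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    /* One reading of the quoted decode step: map each character to its
     * 0..84 alphabet index and accumulate by multiplication by 85,
     * most significant character first. No input validation. */
    static uint32_t decode_chunk(const char in[5])
    {
        uint32_t value = 0;
        for (int i = 0; i < 5; i++)
            value = value * 85 + (uint32_t)(strchr(z85, in[i]) - z85);
        return value;
    }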
Also, the interoperability rationale isn't future-proof: when someone invents a variant of Z85 with, say, a padding rule, then Z85 will have the same interoperability problem as base64.
The mapping seems fairly clear, even if an implementation is not provided. Just write the 32-bit integer in base 85, using the provided numerals. Most significant numeral first. Since 2^32 < 85^5 you can do it in five characters, though some sequences cannot be achieved. Just like you would do with any other numeric base.
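Concretely, that's just (a sketch, using the alphabet from the spec):

    #include <stdint.h>

    static const char z85[] =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    /* Write one 32-bit value as five base-85 numerals, most
     * significant first. Some five-character outputs can never
     * occur, since 85^5 > 2^32. */
    static void encode_chunk(uint32_t value, char out[5])
    {
        for (int i = 4; i >= 0; i--) {
            out[i] = z85[value % 85];
            value /= 85;
        }
    }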
But where does it say that it has to be done like that? There are a lot of functions that map [0, 2³² - 1] to [0, 84]⁵ or to [0, 85⁵ - 1] which they at least imply by mentioning the base-85 representation. There is of course the canonical mapping, the identity function, but it never says that we are supposed to use it.
That is one way to extend the encoding from 4-byte chunks to arbitrary byte lengths.
Another is to encode 4k+i bytes as 5k+i+1 base-85 characters, for 0 < i < 4. That way the encoding length immediately determines the input length.
And there's again plenty of space, since 85^(i+1) > 256^i for i < 4.
This leaves encoding lengths of 5k+1 unused. These could be used to support arbitrary bit lengths, i.e. for encoding an input of 4(k-1) bytes + i bits, with 0 < i < 32: set the final character to the value i, and let the preceding five base-85 characters encode the i bits, zero-padded.
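A sketch of that 5k+i+1 scheme for the trailing partial chunk (a proposed extension, not anything in the spec):

    #include <stddef.h>
    #include <stdint.h>

    static const char z85[] =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    /* Proposed extension, not part of the spec: encode a trailing
     * partial chunk of n bytes (1 <= n <= 3) as n+1 base-85
     * characters, most significant first. This fits because
     * 85^(n+1) > 256^n for n < 4. */
    static void encode_tail(const uint8_t *bytes, size_t n, char *out)
    {
        uint32_t value = 0;
        for (size_t i = 0; i < n; i++)
            value = value * 256 + bytes[i];  /* big-endian accumulate */
        for (size_t i = n + 1; i-- > 0; ) {  /* n+1 output characters */
            out[i] = z85[value % 85];
            value /= 85;
        }
    }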
The "reference implementation" in C seems to just let the int wrap around. It also does an out of bounds array access if the string contains non-ASCII chars.
Sounds like the reference implementation only specifies what should be done with valid values then? I’d avoid using it in production without separate validation.
And the author specifies Z85 only for 4 byte chunks of data - not all data comes in 32 bit chunks. If the authors want Z85 to be widely used, specifying padding would be helpful.
It looks like they’re counting on most data being padded to 32 bits. The simplicity due to lack of padding rules seems to be the only real advantage this encoding has over base 64. Otherwise you’re right, offering up a new standard because there are too many competing base 64 standards makes no sense.
Good to know I'm not the only one who has done this! I once wrote a template generation tool that embedded its input configuration into its output files as base64-encoded gzip'd JSON. Worked quite well at allowing template regeneration.
It's not defined in the format, sure. That doesn't mean the format is bad. The format also doesn't specify how to interpret the binary data, or how to encode multiple strings, or automatically checksum it for you. But you can obviously use the format to exchange strings of variable length, or encode both integers and images, send multiple strings, or add your own checksum.
It's not "impossible to encode" that content just because you need to decide how to represent it. It's not "a mess" if some people use fixed-length strings and some use length-prefixed strings. It's just reality for any encoding scheme - you build layers around it, according to how you want to use it.
The same way the context of your software determines whether this binary string is supposed to represent a float, a name, a hash, or a pixel-art masterpiece, it will also determine the appropriate serialization.
You could only encode the ones whose length was a multiple of 8. 0x00000000 would be 00000. 0x0000000000000000 would be 00000 00000. How would you encode binary 000000 in hex?
"Hex(adecimal)" is not an encoding scheme, it's a number basis. "Binary 00000" as a number is just zero, which is written equivalently as "0", "00"... in hexadecimal. This is different from a chunk of data consisting of six bits that all happen to be zero. And indeed, you need to decide on how to convey that, and simply saying "hex" - or "binary" or "decimal" for that matter - is not enough, that is correct.
> The four octets SHALL be treated as an unsigned 32-bit integer in network byte order (big endian). The five characters SHALL be output from most significant to least significant (big endian).
Why oh why??!
If it were little endian, you could probably skip the "must be multiple of 5 chars/4 bytes" requirement, not to mention that 99.9999% of processors out there are running in little-endian mode.
There is nothing "envious" about network byte order.
You wouldn't be able to skip the 5 char/4 byte requirement, you'd just be able to strip 0x00 bytes from the end. That actually complicates things, since you then need to specify in the spec whether handling that is a requirement for a conforming parser/reader.
I don't quite understand what you're saying, but it should be possible to infer the length from the number of bytes received.
Assuming n is an integer:
* 5n bytes received = 4n bytes data
* 5n+1 bytes received is [invalid]
* 5n+2 bytes received = 4n+1 bytes data
* 5n+3 bytes received = 4n+2 bytes data
* 5n+4 bytes received = 4n+3 bytes data
This is like modified Base64, which doesn't need any padding.
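The inference in code (a sketch; the (size_t)-1 is just an arbitrary error marker):

    #include <stddef.h>

    /* Decoded byte count implied by an encoded character count, per
     * the table above. Returns (size_t)-1 for the impossible 5n+1
     * case. */
    static size_t decoded_len(size_t encoded)
    {
        size_t r = encoded % 5;
        if (r == 1)
            return (size_t)-1;       /* no input length maps here */
        return encoded / 5 * 4 + (r ? r - 1 : 0);
    }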
You never need padding as long as you know how many input characters are missing. My point is that if you encode the single-byte binary input 0x01 as "00001" (big endian) instead of "10000" (little endian), you avoid the temptation for people to trim off the zeroes (leaving "1"). This means your decode() input will always be a multiple of 5 characters by construction.
This comes down to whether there should be 5 valid encodings ("10000", "1000", "100", "10", "1") of a single 0x01 byte, or one. The variable-length encoding of integers in Protocol Buffers has the same malleability problem.
It's also not clear to me why you say 6 char input is invalid.
In your scheme you can't tell the difference between the single byte binary input 0x01, and the four byte binary input 0x00,0x00,0x00,0x01.
Those are the same if you're treating the binary data as a stream of 32 bit numbers, but not if it's a stream of an arbitrary number of octets.
Your parent is suggesting that if after chunking the input into 5s, your last chunk is "10" you would treat that as 0x01, "100" as 0x00,0x01, "1000" as 0x00,0x00,0x01 and only "10000" as 0x00,0x00,0x00,0x01. That's not four encodings of the same value at all.
Treating "1" (or any single leftover character) as invalid in such a scheme makes sense: a single character can only encode 85 values (0x00 to 0x54), not enough to represent even one full byte.
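Here's a sketch of that tail decode (an illustration of the scheme under discussion, not the spec), with a range check so each byte string has exactly one valid encoding:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static const char z85[] =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    /* Illustration only: decode a trailing group of m characters,
     * least significant character first, into m-1 bytes (m == 5 is a
     * full 4-byte group). A single leftover character is rejected, as
     * are values too large for m-1 bytes, which rules out the
     * malleable non-canonical encodings. Returns 0 on success. */
    static int decode_tail(const char *in, size_t m, uint8_t *out)
    {
        if (m < 2 || m > 5)
            return -1;                       /* "1" alone is invalid */
        uint64_t value = 0, scale = 1;
        for (size_t i = 0; i < m; i++) {
            value += (uint64_t)(strchr(z85, in[i]) - z85) * scale;
            scale *= 85;
        }
        if (value >> (8 * (m - 1)))
            return -1;                       /* too large for m-1 bytes */
        for (size_t i = m - 1; i-- > 0; ) {  /* m-1 bytes, MSB first */
            out[i] = (uint8_t)(value & 0xFF);
            value >>= 8;
        }
        return 0;
    }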
I'm not sure if this is a fair question or not, but suppose I have a 6-byte blob I want to send. I can pad it out to 8 bytes, use this scheme to encode it, and send it.
Then I want the receiver to understand that only the first 6 bytes of their decoded results are part of the transmission -- how do I do that?
Base64 has a special character ('=') that is used for encoding padding, but this method doesn't seem to have that. The spec says "it is up to the application to ensure that frames and strings are padded if necessary", which suggests they've scoped this problem out.
I suppose I can always build a little "packet" that starts with the payload length, so that the receiver can infer the existence of padding if there is additional data beyond the advertised payload length, but now the receiver and I need to agree on that protocol.
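For instance, such a packet might look like this (purely illustrative; the 4-byte big-endian length prefix is my own invention, not anything the spec defines):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative framing, not part of Z85: prefix the payload with
     * a 4-byte big-endian length, then zero-pad the frame to a
     * multiple of 4 bytes so it can be Z85-encoded. 'frame' must hold
     * at least 4 + len + 3 bytes. Returns the padded frame length;
     * the receiver reads the length and ignores anything past it. */
    static size_t frame_payload(const uint8_t *payload, uint32_t len,
                                uint8_t *frame)
    {
        frame[0] = (uint8_t)(len >> 24);
        frame[1] = (uint8_t)(len >> 16);
        frame[2] = (uint8_t)(len >> 8);
        frame[3] = (uint8_t)len;
        memcpy(frame + 4, payload, len);
        size_t total = 4 + len;
        while (total % 4 != 0)          /* pad up to a 4n boundary */
            frame[total++] = 0;
        return total;
    }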
Padding is actually not really necessary in base64, as you can infer the length from the number of characters received.
Unfortunately for Z85, they made the highly questionable decision to use big-endian, which means it can't take base64's route. You could probably define an incomplete group at the end to be right-aligned or similar, but you may as well be sensible and just go little-endian.
> The binary frame SHALL have a length that is divisible by 4 with no remainder. The string frame SHALL have a length that is divisible by 5 with no remainder. It is up to the application to ensure that frames and strings are padded if necessary.
Not including padding will seriously hinder adoption. You want the UX where people can use it widely and it just works. The post talks about wanting one compatible standard, but leaving out how to deal with padding means you now have incompatible ways of doing so. Plus many junior devs won’t even understand the need for padding, and will be extra confused when it doesn’t “just work.”
I do appreciate the many other bits it tries to solve, however.
One "alternative" encoding I've tried is saving an HTML file as UTF-16, then saving the binary data as a UTF-16 string in a javascript block. A few characters need to be escaped, such as the quotation mark, backslash and newline, as well as unmatched surrogate pairs. The unmatched surrogate pairs eat up 12 bytes instead of 2 when you write them as \uXXXX.
How can this have a copyright and also be GPL? Or maybe a better way of asking it is - what does a copyright even do if you give limitless permission to copy?
Quite right. But licensing a specification with a license designed for software is pretty darned strange.
> This Specification is free software;
even though This Specification is not software.
And then you're left with the disturbing need to send the text of the GPL off to corporate lawyers to determine whether there are any bombs in the GPL when applied to a specification instead of a piece of software? Will implementing the specification contaminate our sources? Must we include a GPL 3 notification in our legal notices?
    To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission,
And the corporate lawyers will respond (as they always do) that the text of the GPL wasn't written by a lawyer and nobody really knows what it means because of numerous drafting errors.
Using the GPL here senselessly and needlessly causes stress, and is a more than ample reason in itself not to use the standard.
The one good thing is that they used an MIT license on the reference implementation, so we won't have to use clean room protocols to write our own implementation.
Aside from very specific cases, like if you work for NASA, copyright springs into existence automatically, so even assigning a work to the public domain requires some action to divest yourself of the copyright. And even then, 'moral rights' may survive. The GPL is very much not public domain, but a license to a copyrighted work. Which, I mean, you can read. See 'how to apply': https://www.gnu.org/licenses/gpl-3.0.html