Z85: Format for representing binary data as printable text (zeromq.org)
58 points by ingve on March 17, 2024 | 51 comments


Given that interoperability is one of the explicitly mentioned design criteria, it seems a bit of an oversight that they forgot to specify how the encoding actually works. It says nothing about how you encode the 32-bit integer into the five characters, and the description of the decoding operation is also pretty vague.

The five characters SHALL each be converted into a value 0 to 84, and accumulated by multiplication by 85, from most to least significant.

There is of course the example and the reference implementation and everyone knows how this kind of encoding works, so you can certainly figure out how you are supposed to implement it.


Also, the interoperability rationale isn't future-proof: once someone invents a variant of Z85, say one with a padding rule, Z85 will have the same interoperability problem as base64.


The mapping seems fairly clear, even if an implementation is not provided. Just write the 32-bit integer in base 85, using the provided numerals. Most significant numeral first. Since 2^32 < 85^5 you can do it in five characters, though some sequences cannot be achieved. Just like you would do with any other numeric base.
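For what it's worth, here's a minimal sketch in C of that reading of the spec (the alphabet string is the one from the spec's character table, assuming I've transcribed it correctly):

    /* Sketch: treat the 4 octets as a big-endian 32-bit value and emit its
       5 base-85 digits, most significant first, using the Z85 alphabet. */
    static const char z85_alphabet[] =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    void z85_encode_chunk(const unsigned char in[4], char out[5]) {
        unsigned long v = ((unsigned long)in[0] << 24) |
                          ((unsigned long)in[1] << 16) |
                          ((unsigned long)in[2] << 8)  |
                           (unsigned long)in[3];
        for (int i = 4; i >= 0; i--) {  /* least significant digit goes last */
            out[i] = z85_alphabet[v % 85];
            v /= 85;
        }
    }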


But where does it say that it has to be done like that? There are a lot of functions that map [0, 2³² - 1] to [0, 84]⁵ or to [0, 85⁵ - 1] which they at least imply by mentioning the base-85 representation. There is of course the canonical mapping, the identity function, but it never says that we are supposed to use it.


Given that 85^5 > 2^32, it's a little strange that they also didn't specify what happens if the text encodes a number that doesn't fit in 4 bytes.


You could theoretically put 1, 2, or 3 byte sequences into the space between 4294967296 and 4437053124. That gap holds over 27 bits of information.

Example: 0 to 4294967295 = 4 bytes

4294967296 to 4311744511 = 3 bytes (subtract 4294967296 first)

4311744512 to 4311810047 = 2 bytes (subtract 4311744512 first)

4311810048 to 4311810303 = 1 byte (subtract 4311810048 first)

Then you're still left with 125242821 remaining numbers, over 26 bits.
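If one wanted to decode such a scheme (to be clear, none of this is in the spec), the value of the final 5-character group could be mapped back to a 1-to-4-byte tail roughly like this, using the hypothetical ranges above:

    /* Hypothetical tail decoding: v is the value of the final 5-char group,
       0 <= v < 85^5. Returns the number of tail bytes written, or -1 if v
       falls in the unused gap above 4311810303. */
    int tail_decode(unsigned long long v, unsigned char out[4]) {
        int nbytes;
        if      (v < 4294967296ULL) { nbytes = 4; }
        else if (v < 4311744512ULL) { nbytes = 3; v -= 4294967296ULL; }
        else if (v < 4311810048ULL) { nbytes = 2; v -= 4311744512ULL; }
        else if (v < 4311810304ULL) { nbytes = 1; v -= 4311810048ULL; }
        else return -1;
        for (int i = nbytes - 1; i >= 0; i--) { out[i] = v & 0xFF; v >>= 8; }
        return nbytes;
    }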


That is one way to extend the encoding from 4-byte chunks to arbitrary byte lengths.

Another is to encode 4k+i bytes as 5k+1+i base-85 characters, for 0<i<4. That way the encoding length immediately determines the input length. And there's again plenty of space since 85^{i+1} > 256^i for i < 4.

This leaves encoding lengths of 5k+1 unused. These could be used to support arbitrary bit lengths, i.e. for encoding an input of 4(k-1) bytes + i bits, with 0<i<32. Set the final character to i (it fits, since i < 85), and let the preceding 5 base-85 characters encode the i bits with 0 padding.
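The length bookkeeping for that hypothetical byte-level variant (ignoring the bit-length extension) would look something like this:

    #include <stddef.h>

    /* Hypothetical variant: 4k + i input bytes (0 < i < 4) encode to
       5k + 1 + i characters; exact multiples of 4 bytes encode to 5k
       characters, so the character count alone determines the byte count. */
    size_t encoded_len(size_t nbytes) {
        size_t k = nbytes / 4, i = nbytes % 4;
        return i ? 5 * k + 1 + i : 5 * k;
    }

    /* Returns the decoded byte count, or (size_t)-1 for lengths of the
       form 5k + 1, which this variant leaves unused. */
    size_t decoded_len(size_t nchars) {
        size_t k = nchars / 5, r = nchars % 5;
        if (r == 0) return 4 * k;
        if (r == 1) return (size_t)-1;
        return 4 * k + r - 1;
    }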


Then it’s invalid, same as if it includes unexpected characters or isn’t a multiple of five characters long.


The "reference implementation" in C seems to just let the int wrap around. It also does an out of bounds array access if the string contains non-ASCII chars.


Sounds like the reference implementation only specifies what should be done with valid values then? I’d avoid using it in production without separate validation.


And the author specifies Z85 only for 4 byte chunks of data - not all data comes in 32 bit chunks. If the authors want Z85 to be widely used, specifying padding would be helpful.

Anyway, it all seems a bit https://xkcd.com/927/


It looks like they’re counting on most data being padded to 32 bits. The simplicity due to lack of padding rules seems to be the only real advantage this encoding has over base 64. Otherwise you’re right, offering up a new standard because there are too many competing base 64 standards makes no sense.


All the issues mentioned in these comments are a good indication of why this encoding isn't more widely known/used.

I myself typically compress the binary data and then base64 encode it.


Good to know I'm not the only one who has done this! I once wrote a template generation tool that embedded its input configuration into its output files as base64-encoded gzip'd JSON. Worked quite well at allowing template regeneration.


For your trouble, you save some extra bytes with 85 vs 64: roughly 25% encoding overhead instead of 33%.


With the added bonus of funny characters which are a pain to encode in JSON, XML and URLs. I guess I'll keep using URL-safe Base64, thank you.


Modifying the base alphabet is simple if there are a couple of problem characters.

I use base85 in JSON just fine without escaping overhead. Never tried it for URLs or XML.


And then the webserver probably recompresses it again via gzip.


>The simplicity due to lack of padding rules seems to be the only real advantage

Because there is no padding rule given, it's impossible to use the standard to encode most binary strings.

For example, how would you encode/decode these?

  0x0000000000000000  
  0x00000000000000
  0x000000000000
  0x0000000000
  0x00000000 
  0x000000


If the situation requires decoding strings of unknown length, you simply prefix the length.


It was a rhetorical question. What you describe is not defined in the format. You can't use the format to exchange strings of unknown length.

If people started using this encoding, different users would adopt different solutions (padding, length prefixes, etc.) and it would become a mess.


It's not defined in the format, sure. That doesn't mean the format is bad. The format also doesn't specify how to interpret the binary data, or how to encode multiple strings, or automatically checksum it for you. But you can obviously use the format to exchange strings of variable length, or encode both integers and images, send multiple strings, or add your own checksum.

It's not "impossible to encode" that content just because you need to decide how to represent it. It's not "a mess" if some people use fixed-length strings and some use length-prefixed strings. It's just reality for any encoding scheme - you build layers around it, according to how you want to use it.

The same way the context of your software determines whether this binary string is supposed to represent a float, a name, a hash, or a pixel-art masterpiece, it will also determine the appropriate serialization.


You could only encode the ones whose length was a multiple of 8 hex digits (i.e., 4 bytes). 0x00000000 would be 00000. 0x0000000000000000 would be 00000 00000. How would you encode binary 000000 in hex?


"Hex(adecimal)" is not an encoding scheme, it's a number basis. "Binary 00000" as a number is just zero, which is written equivalently as "0", "00"... in hexadecimal. This is different from a chunk of data consisting of six bits that all happen to be zero. And indeed, you need to decide on how to convey that, and simply saying "hex" - or "binary" or "decimal" for that matter - is not enough, that is correct.


That XKCD is so old now that the alt-text:

> Fortunately, the charging one has been solved now that we've all standardized on mini-USB. Or is it micro-USB? Shit.

is hilariously even more true.


That xkcd comic was the first thing that came to mind as well.

The article explains how Base64 is “problematic because it has more than a dozen variants”. So instead of picking one, let’s invent a new standard.


> The four octets SHALL be treated as an unsigned 32-bit integer in network byte order (big endian). The five characters SHALL be output from most significant to least significant (big endian).

Why oh why??!

If it were little endian, you could probably skip the "must be multiple of 5 chars/4 bytes" requirement, not to mention that 99.9999% of processors out there are running in little-endian mode.

There is nothing "enviable" about network byte order.


You wouldn't be able to skip the 5 char/4 byte requirement, you'd just be able to strip 0x00 bytes from the end. That actually complicates things, since you then need to specify in the spec whether handling that is a requirement for a conforming parser/reader.


I don't quite understand what you're saying, but it should be possible to infer the length from the number of bytes received.

Assuming n is an integer:

  * 5n bytes received = 4n bytes data
  * 5n+1 bytes received is [invalid]
  * 5n+2 bytes received = 4n+1 bytes data
  * 5n+3 bytes received = 4n+2 bytes data
  * 5n+4 bytes received = 4n+3 bytes data
This is like modified Base64, which doesn't need any padding.


You never need padding as long as you know how many input characters are missing. My point is that if you encode the single byte binary input 0x01 as "00001" (big endian) instead of "10000" (little endian), you avoid the temptation for people to trim off the zeroes (leaving "1"). This means your decode() input will always be a multiple of 5 characters by construction.

This comes down to whether there should be 5 valid encodings ("10000", "1000", "100", "10", "1") of a single 0x01 byte, or one. The variable-length encoding of integers in Protocol Buffers has the same malleability problem.

It's also not clear to me why you say 6 char input is invalid.


In your scheme you can't tell the difference between the single byte binary input 0x01, and the four byte binary input 0x00,0x00,0x00,0x01.

Those are the same if you're treating the binary data as a stream of 32 bit numbers, but not if it's a stream of an arbitrary number of octets.

Your parent is suggesting that if after chunking the input into 5s, your last chunk is "10" you would treat that as 0x01, "100" as 0x00,0x01, "1000" as 0x00,0x00,0x01 and only "10000" as 0x00,0x00,0x00,0x01. That's not four encodings of the same value at all.

Treating "1" (or any single leftover character) as invalid in such a scheme makes sense because a single character can only encode 85 values, from 0x00 to 0x54.


I'm not sure if this is a fair question or not, but suppose I have a 6-byte blob I want to send. I can pad it out to 8 bytes, use this scheme to encode it, and send it.

Then I want the receiver to understand that only the first 6 bytes of their decoded results are part of the transmission -- how do I do that?

Base64 has a special character ('=') that is used for encoding padding, but this method doesn't seem to have that. The spec says "it is up to the application to ensure that frames and strings are padded if necessary", which suggests they've scoped this problem out.

I suppose I can always build a little "packet" that starts with the payload length, so that the receiver can infer the existence of padding if there is additional data beyond the advertised payload length, but now the receiver and I need to agree on that protocol.
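For what it's worth, that little packet could be as simple as this sketch (purely an application-level convention, not part of Z85; the names here are made up):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical framing: prepend a 4-byte big-endian payload length and
       zero-pad to a multiple of 4 bytes, ready for a Z85 encoder. The
       receiver decodes, reads the length, and discards the padding. */
    size_t frame_for_z85(const uint8_t *payload, uint32_t len, uint8_t **out) {
        size_t padded = (4 + (size_t)len + 3) & ~(size_t)3;  /* round up to multiple of 4 */
        uint8_t *buf = calloc(padded, 1);                    /* calloc gives zero padding */
        if (!buf) return 0;
        buf[0] = (uint8_t)(len >> 24); buf[1] = (uint8_t)(len >> 16);
        buf[2] = (uint8_t)(len >> 8);  buf[3] = (uint8_t)len;
        memcpy(buf + 4, payload, len);
        *out = buf;
        return padded;
    }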


Padding is actually not really necessary in base64, as you can infer the length from the number of characters received.

Unfortunately for Z85, they made the highly questionable decision to use big-endian, which means it can't take base64's route. You could probably define an incomplete group at the end to be right-aligned or similar, but you may as well be sensible and just go little-endian.


> The binary frame SHALL have a length that is divisible by 4 with no remainder. The string frame SHALL have a length that is divisible by 5 with no remainder. It is up to the application to ensure that frames and strings are padded if necessary.

Not including padding will seriously hinder this. You want that UX so people can use it widely and have it just work. The post talks about wanting one compatible standard, but leaving out how to deal with padding means you now have incompatible ways of doing so. Plus many junior devs won’t even understand the need for padding, and will be extra confused when it doesn’t “just work.”

I do appreciate the many other bits it tries to solve, however.


> Base64 implementations are [...] not necessarily interoperable.

Also this spec:

> It is up to the application to ensure that frames and strings are padded if necessary.

So they didn't even specify the standard way to treat byte strings with the length not divisible by 4.


Seems reasonable. Avoids single and double quotes and the backslash. Though still not usable in some places, like URLs.


There are different versions of base85; it is in general perfect for dense passwords: https://en.m.wikipedia.org/wiki/Ascii85


Also allows & so probably not safe to use in HTML (assuming browsers may helpfully add the omitted ; ).


Why are ` ~ _ | ; excluded? I could maybe see excluding backtick and pipe to be shell-friendly.


Probably because 85 symbols are enough to store 32 bits in 5 characters. You’d need 256 symbols to do better.


But it includes $, which is decidedly shell-unfriendly.


One "alternative" encoding I've tried is just saving an HTML file as a UTF-16 file, then just simply save the binary data as a UTF-16 string in a javascript block. A few characters need to be escaped, such as quotation mark, backslash and newline, as well as unmatched surrogate pairs. The unmatched surrogate pairs eat up 12 bytes instead of 2 bytes when you write them as \uXXXX.


How can this have a copyright and also be GPL? Or maybe a better way of asking it is - what does a copyright even do if you give limitless permission to copy?


Copyright just asserts who gets to set the terms.

GPL is just a particular set of terms.

GPL never meant no copyright or no terms. In fact it sets very strict terms, just not very many, not very complicated, and not the usual ones.

You never knew the basic premise or theory of how the GPL works?


Quite right. But licensing a specification with a license designed for software is pretty darned strange.

> This Specification is free software;

even though This Specification is not software.

And then you're left with the disturbing need to send the text of the GPL off to corporate lawyers to determine whether there are any bombs in the GPL when applied to a specification instead of a piece of software? Will implementing the specification contaminate our sources? Must we include a GPL 3 notification in our legal notices?

    To "modify" a work means to copy from or adapt all or 
    part of the work
    in a fashion requiring copyright permission,
And the corporate lawyers will respond (as they always do) that the text of the GPL wasn't written by a lawyer and nobody really knows what it means because of numerous drafting errors.

Using the GPL here senselessly and needlessly causes stress, and is a more than ample reason in itself not to use the standard.

The one good thing is that they used an MIT license on the reference implementation, so we won't have to use clean room protocols to write our own implementation.


Yeah, a spec should probably use one of the CC licenses.


Aside from very specific cases, like if you work for NASA, copyright springs into existence automatically, so even assigning it to the public domain requires some action to divest yourself of the copyright. And even then, 'moral rights' may survive. The GPL is very much not public domain, but a license to a copyrighted work though. Which, I mean, you can read it. See 'how to apply' https://www.gnu.org/licenses/gpl-3.0.html


Limitless? The GPL 3 is quite specific about what you can and can't do.

This opens up a new question: so this encoding will never be allowed in proprietary software?


You can't copyright an idea. What's copyrighted is the text of the specification. Anybody who's made aware of the idea is free to implement it.


Without a copyright, there is no GPL, because there's nothing to license.


At first glance I thought SHALL stood for SHA-language model :|



