Yeah, the // isn't part of the scheme, it's part of the authority.
That BBC article doesn't say why "net users" find them annoying. I like the double slashes (or at least some unique separator between the scheme and the hostname that marks where the hostname starts), since they allow building useful relative URLs, as mentioned in the original article. As an example not covered in the article: you don't need to serve different CSS files for secure and insecure content when you serve media assets from another domain that is also available via both http and https.
A protocol-relative link like //media.example.com/site.css can be used on both HTTP- and HTTPS-served pages, and the browser will resolve that relative URL by filling in the protocol from the base document. Without the double slashes, you wouldn't be able to distinguish a relative path from a relative URL that carries a hostname. If I remember correctly, a scheme change on the same hostname but with a different path, like:
base document: http://example.com/some/path
relative URL: https:/some/other/path
is possible too (I wonder how the parsing should work with port numbers, if they can be relative too -- I have not read the RFC in a while, and it's such a rare thing to use port numbers anyway).
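The protocol-relative trick can be sketched with Python's `urllib.parse.urljoin` (the hostname `media.example.com` is made up for illustration). Note that for the scheme-only reference above, `urljoin()` follows the strict RFC reading and returns the reference unchanged rather than filling in the base host, so I wouldn't rely on that behavior:

```python
from urllib.parse import urljoin

# A protocol-relative reference picks up only the scheme from the base
# document, so one stylesheet URL works for both http and https pages.
css = "//media.example.com/site.css"

print(urljoin("http://example.com/some/path", css))
# http://media.example.com/site.css
print(urljoin("https://example.com/some/path", css))
# https://media.example.com/site.css

# A scheme-only reference is murkier: urljoin() returns it as-is,
# without filling in the base document's host.
print(urljoin("http://example.com/some/path", "https:/some/other/path"))
```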
Double slashes (well, backslashes) are how Microsoft/CIFS originally designated server names in UNC paths, which I think may have been around before URLs were standardized (don't quote me on that; they're most likely roughly the same age and influenced each other). This is also why the file: scheme "requires" three leading slashes: the "host" between the second and third slash is empty, designating the local machine -- but you could put in a hostname to access network shares (I put "requires" in quotes because file: parsing has always had ambiguities across implementations).
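The empty-host reading is visible in Python's `urlparse` (the hostname `fileserver` below is invented for illustration):

```python
from urllib.parse import urlparse

# file:///etc/hosts -- empty authority between the 2nd and 3rd slash,
# i.e. "the local machine".
local = urlparse("file:///etc/hosts")
print(local.netloc)   # '' (empty host)
print(local.path)     # /etc/hosts

# file://fileserver/share/doc.txt -- a hostname in the authority slot,
# which some platforms map to a network share.
remote = urlparse("file://fileserver/share/doc.txt")
print(remote.netloc)  # fileserver
print(remote.path)    # /share/doc.txt
```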
I find it annoying when people read addresses and call them "backslashes". Talk about wasting time and energy, that's a whole additional syllable said for every path component in a URL!
These tricks of combining parts of URLs to do relative linking on scheme, host, etc. are little known but useful; I was expecting them to show up in the article.
Going back to the debate over Chrome potentially dropping "http://" from the address bar: if they were to show "//" instead, they would have an argument for technical correctness, because the default protocol "http:" could be assumed. But having no leading "//" at all visually confuses the address with a relative path omitting the host, because it breaks the signifier for the authority component of the URL. Just a thought.
Are there any web developers here who didn't know these things? Are there any developers of any type who didn't? I'm not sure who the audience for this article is meant to be but I'm guessing you won't find too many of them here.
I never knew about params. I know a few people who didn't know things like fragments don't get sent to the server, everyone gets encoding wrong at some point, and I know a lot of people don't really know how the base tag works.
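The fragment point is easy to demonstrate: it's a client-side construct, and what goes on the wire is the URL with the fragment stripped off, which is exactly what `urldefrag` splits out:

```python
from urllib.parse import urldefrag

# The browser keeps the fragment to itself; only the part before '#'
# appears in the HTTP request.
url, fragment = urldefrag("http://example.com/docs/page.html#section-2")
print(url)       # http://example.com/docs/page.html
print(fragment)  # section-2
```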
I didn't either -- path params seem like an interesting way to address issues that can come up when devising RESTful URL schemes (rather than relying on query params only).
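For anyone else who hadn't seen them: path params are the semicolon-delimited parameters attached to a path segment, distinct from the query string. Python's `urlparse` exposes the ones on the last segment (the URL below is made up):

```python
from urllib.parse import urlparse

# ";version=2" rides on the path segment itself; "?color=blue" is the
# ordinary query string -- urlparse keeps them in separate fields.
parts = urlparse("http://example.com/widgets;version=2?color=blue")
print(parts.path)    # /widgets
print(parts.params)  # version=2
print(parts.query)   # color=blue
```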
For fuck's sake, there is no such thing as a RESTful URL scheme. 'Pretty' URLs are a major anti-pattern -- they not only don't make you any more RESTful, they actively undermine HATEOAS, which is the true signifier of RESTfulness.
Actual REST treats URL strings (including query parameters) as being completely opaque implementation details. The server is supposed to respond with URLs in the hypertext -- you're never supposed to be formatting them yourself client-side using out of band knowledge. Query parameters are no exception to that: if you want the client to pass them, give the client a form in the response.
If you're expecting a client to munge together "path components" based on foreknowledge of your data model, you're doing it wrong.
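To make the "follow links, don't construct them" point concrete, here's a sketch of a hypermedia-style response; every field name and URL in it is invented for illustration, not any real API:

```python
# The client treats every URL as opaque: it follows links and fills in
# forms from the previous response, instead of assembling paths from
# out-of-band knowledge of the data model.
response = {
    "orders": [{"id": "o1", "total": "9.99"}],
    "links": {
        # Deliberately opaque -- the server can restructure these freely.
        "next": "http://api.example.com/r/b2xkZXItdGhhbi1vMQ",
    },
    "forms": {
        # "Give the client a form": the server names the inputs; the
        # client only supplies values, never URL structure.
        "search": {"href": "http://api.example.com/orders",
                   "inputs": ["customer_id"]},
    },
}

next_url = response["links"]["next"]  # follow, never construct
assert "orders/o1" not in next_url    # no path-munging knowledge needed
```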
Any chance you have articles to back this up? Not that I doubt you, but I've never really understood or looked into all this REST business and if I'm ever going to do so, I'd like to learn what being RESTful really means --- and not just jumping on what sounds like a bandwagon for 'Pretty URLs' as you mentioned. :)
How are you supposed to represent collections, if every link must be provided by the server? I've seen REST examples where you give some results and a link to the next page of results. That's impractical if you have a million pages.
If your collection is truly random access (array, search results, relational sets) then use a form that takes the 'foreign key' as input. Then there's one URL and standard query parameters.
In pretty much all other cases a collection would have data/metadata in the response that far outweighs the links. And given that it's perfectly fine for the links to be completely opaque, there's no reason for them to be very long.
The real problem with pagination is that all but a few brave souls completely fuck up the implementation of it. This is the worst possible way to paginate something, but just about every webapp ever written does it like this:
SELECT * FROM posts ORDER BY date DESC LIMIT x OFFSET n*x
The locations of items on pages change constantly as new items are created and destroyed! You page through the history (usually via links with the worst possible names: prev & next), and the items shift around as you move around. It's OK that the page with the most recent items changes as new ones are created, but having the archives be a pushdown stack is just idiotic.
It would be terrific if people used meaningful pagination instead of arbitrary offsets: posts by year/month/week/day/hour/minute/second/etc. is far better than "Page N" -- you don't even need to give me any options, just use older/newer links that point to the level of granularity that would give an appropriate number of results.
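Even short of date-based URLs, the shifting-pages problem goes away with keyset ("cursor") pagination: remember the sort key of the last item shown and ask for items strictly older than it, instead of counting an offset. A minimal sketch with sqlite3 (schema and data invented for illustration):

```python
import sqlite3

# Keyset pagination: new posts change only the first page; pages you
# already visited stay put, unlike LIMIT x OFFSET n*x.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, date TEXT)")
db.executemany("INSERT INTO posts (date) VALUES (?)",
               [(f"2011-01-{d:02d}",) for d in range(1, 10)])

PAGE = 3

def first_page():
    return db.execute(
        "SELECT id, date FROM posts ORDER BY date DESC, id DESC LIMIT ?",
        (PAGE,)).fetchall()

def page_after(last_date, last_id):
    # "Older than the cursor" -- id breaks ties between equal dates.
    return db.execute(
        "SELECT id, date FROM posts"
        " WHERE date < ? OR (date = ? AND id < ?)"
        " ORDER BY date DESC, id DESC LIMIT ?",
        (last_date, last_date, last_id, PAGE)).fetchall()

page1 = first_page()
last_id, last_date = page1[-1]
page2 = page_after(last_date, last_id)

# A new post lands on page 1 but leaves already-visited pages untouched.
db.execute("INSERT INTO posts (date) VALUES ('2011-01-10')")
assert page_after(last_date, last_id) == page2
```

The cursor (date, id) is also exactly the kind of opaque token a server could hand back in a link, per the REST discussion above.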
I didn't know some of the unusual specifics, but then again, I don't think that I need to either. It seemed less like a tutorial on what every web developer needs to know than it was documentation on what every browser maker needs to know.
I thought I knew these things until I worked on a project to scrape arbitrary websites. We had to follow arbitrary links around a site (easy, right?) but it turns out there are many, many edge cases we had to deal with.
Another developer pretty quickly decided to ditch Ruby's URL parser and write our own, since there are tons of things browsers deal with that you wouldn't think of. For example, relative links starting with "//" inherit only the protocol (http or https) from the current page. Add in vagaries specific to some HTTP servers -- like http://foo/bar being treated the same as http://foo/bar/ -- and we quickly realized it was a much bigger task than we thought.
We ultimately got the thing working OK, but crazy edge cases just kept popping up.
I never knew any of this stuff until I looked at the Javadoc for the Java URL classes. I'm not a web developer, but web developers aren't the only developers who use URLs, of course.
No mention of non-ASCII characters at all? Punycode in the host name may still be uncommon, but passing non-ASCII in the query is important. There's also nothing about spaces being encoded as + rather than %20, which happens a lot.
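Both points in one small example, using Python's quoting helpers:

```python
from urllib.parse import quote, quote_plus, urlencode

# Two legal spellings of a space, depending on context:
print(quote("naive question"))       # naive%20question  (path context)
print(quote_plus("naive question"))  # naive+question    (form/query context)

# Non-ASCII text is UTF-8 encoded, then percent-escaped:
print(quote("café"))                    # caf%C3%A9
print(urlencode({"q": "smörgåsbord"}))  # q=sm%C3%B6rg%C3%A5sbord
```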
I suspect there are a lot of RFCs and W3C specs that could be paraphrased to get you on HN.
How many web developers actually know HTML, for example? In a few discussions here and on Reddit, it seemed like well over 80% did not realize that <!doctype html> is a valid doctype, or that you do not need to close many common tags.
Even the original designer now considers the two slashes a bad design decision: http://news.bbc.co.uk/2/hi/8306631.stm