QDN: How bit.ly ignores the wishes of web authors

How bit.ly ignores the wishes of web authors

Jul 9, 2008 | Bugs | Internet | Search

There’s been quite a bit of press today about bit.ly, a new service from the folks at switchAbit; it’s a service that adds page caching, click-through counting, and a bunch of semantic data analysis atop a URL-shortening service that’s very much like TinyURL and others (and others!). Reading the unveiling announcement, the part that interested me most was the page caching part — they bill it as a service to help prevent link rot (i.e., when a page you’ve linked to or bookmarked then goes away), which would be a great service to those folks who rely on linked content remaining available. (And since they store their cached content on Amazon’s S3 network, robustness and uptime should be great as well.)

That being said, having worked with (and on) a bunch of caching services in the past, I also know that caching is a feature that many developers implement haphazardly, and in a way that isn’t exactly adherent to either specs or the wishes of the page authors. So I set out to test how bit.ly handles page caching, and I can report here that the service does a great job of caching the text of pages, a bad job of caching the non-text contents of pages, and a disappointingly abhorrent job of respecting the wishes of web authors who ask for services like this to not cache their pages.

First, let’s start off with the text content of pages — as you’d expect, bit.ly handles this bit of caching just fine. Take the bit.ly URL for my home page (http://bit.ly/queso), which is cached here; the text is reproduced faithfully for the content at the time of caching, so there are no issues there. Same thing with http://bit.ly/a, which someone used to shorten Apple’s home page; looking at the cached version shows faithful reproduction of this morning’s page text.

Next, move on to the non-text content of the page. Here, bit.ly doesn’t do any caching — meaning that pages which link to images (or CSS files, or scripts, or other media content) at fixed locations will always link to those fixed locations, and if those files move or are deleted, then the cached pages won’t have any copy to fall back on. (For example, on that cached version of my home page, there should be a graphic just underneath the “Elsewhere” header in the right-hand sidebar, but since it’s moved, the cache can’t find it.) Given how much of today’s web is based on images, CSS, Javascript, Flash, and other added content, this means that quite a bit of the content on cached versions of most pages will be missing, so it’s unclear how much utility these cached pages will hold to anyone. The same thing holds for any content that was referenced using relative URLs, references that will be broken when those same relative URLs now point to the root of bit.ly’s Amazon S3 cache. (This is something you can see when looking at that cached version of the Apple home page, which is missing its CSS file.) So all in all, this is pretty bad; a caching system that doesn’t take into account a huge chunk of page content is a caching system that holds limited value for its users.

Finally, let’s take a look at how bit.ly behaves when it comes to deciding how to handle pages that have various cache-control headers applied to them. (As a few-sentence background for those unversed in such things, the standards by which web pages are authored and sent over the internet provide various headers that page authors can use to control how those pages are handled by web browsers, caching servers, and other intermediaries on the wire. If they so desire, page authors can say when a page should expire out of caches, or can specify that the page should never be cached; it’s expected that anything handling the page should respect these headers.) The short version of what I discovered is that bit.ly does an absolutely terrible job here — it doesn’t respect a page’s caching headers one iota.

Take the pragma no-cache meta header, which is supposed to tell a cache server not to cache a page. Bit.ly doesn’t care — when given a page with this header set, it caches away without a care in the world. You might say, “wait, that header was an HTTP 1.0 header that has been deprecated, so bit.ly can do whatever it wants here!” Fine, I’ll allow you that (although that isn’t technically correct); let’s look at the more modern way to say don’t-cache-this-page.

In HTTP 1.1, quite a bit of thought was given to page caching, so much so that there’s a specific HTTP 1.1 document on the new ways to control caching. The take-home point here is that there are a few headers that are supposed to serve as ways for a page to advertise how it should or shouldn’t be cached; meta cache-control headers and meta expires headers are the two biggies. And looking at how bit.ly handles them, it’s equal-opportunity — it ignores them all. Throw it a cache-control no-cache header, and bit.ly caches the page. Throw cache-control no-store into the headers, and bit.ly still caches the page. Combine all the HTTP 1.0 and 1.1 cache-related headers together, and bit.ly still caches the page. Even if you add a header setting the page expiration date to the past, shock of all shocks, bit.ly still caches the page. It appears that there is no way to tell bit.ly not to cache a page, which is the ultimate example of being a terrible web neighbor.

So in the end, it’s nice to see someone thinking about ways to add to the feature set of URL shortening services, but it’s quite disappointing that that one of the most-trumpted additions to that new feature set behaves so poorly when it comes to respecting content authors. It’d be nice to see the folks behind bit.ly commit to fixing this, and fixing it soon — it’s simply unacceptable to ignore the wishes of those folks who are providing the substrate on which your new business model feeds.

(And as a final postscript, this is now the second Betaworks website that was released to the public in the past few weeks without a privacy policy in place. This is a bit disheartening, especially since the first site, Twitabit, asks users to type in their Twitter usernames and passwords. Even more stunning to me is that users are totally OK doing so — giving their usernames and passwords for site A to site B — without their being a strong privacy policy in place, but that’s a story for another time.)

Thanks, this was helpful :)

• Posted by: Enrique Allen on Jul 12, 2008, 5:13 AM

Perhaps I should have commented this here, rather than @switchabit.wordpress.com/2008/07/08/bitly/#comment-68, but… it’s done.

While I’m glad that there are people like you worrying about esoteric points like this, I think you mistakenly confuse web-page authors with web-page publishers. The first seldom ask for their work to be available only in some fleetningly intermittent fashion, as no-cached items; the latter may so require for commercial or administrivial reasons. In any case thar their wish will never be their command, because the powers that be, Tim Berners-Lee, Robert Cailliau, and others of the original info.cern.ch team ecplicitly did not design that into the specs. Period. So any afterthoughts on that no-cache matter are just that, illusory at best. Which is me PRAGMAtic point of view!

• Posted by: Ianf on Jul 13, 2008, 9:00 AM

Ianf, that misses the point entirely. Of *course* people can still save pages that have cache control mechanisms on them — hell, in any browser, you can just do a File/Save and have the page. But page authors and publishers (yes, there’s a distinction, but both can and do use cache controls to state their wishes) DO have a mechanism to tell caching engines what they should do, and good internet neighbors are the ones who listen to those wishes and respect them. So if the folks behind Bit.ly want to be assholes, they’re more than welcome to act in contravention of those expressed wishes… but if, as a new face on the ‘net, they want to behave with proper respect for the folks who create the content that drives their business, then they should listen to them.

(And as a semi-aside, you’re totally wrong when you say that the specs aren’t designed to give voice to page authors’ and publishers’ wishes when it comes to cache controls. You should spend a little time reading the HTTP 1.1 spec; for example, this is what it has to say for the cache-control no-store header:

If sent in a response, a cache MUST NOT store any part of either this response or the request that elicited it. This directive applies to both non- shared and shared caches. “MUST NOT store” in this context means that the cache MUST NOT intentionally store the information in non-volatile storage, and MUST make a best-effort attempt to remove the information from volatile storage as promptly as possible after forwarding it.

That’s as clear and unambiguous as it gets.)

• Posted by: Jason on Jul 13, 2008, 10:35 AM

Jason, IF there was a method to combine delivery of remotely-rendered content AND its permanent “aether-ity” (perhaps “eternal virginity” would be a better term?), THEN I am sure it’d have found its way into the original http specs. Now, once the cat outta bag…. all such later restrictions must therefore be viewed as conventions at best (before that, Ted Nelson envisioned Xanadu, a closed hypertextual granularly byte-accountable docuverse, but even that didn’t assume that a document could be prevented from being cached - it merely “made sure” that local cache would be more expensive than cumulative miniscule payments due for each subsequent viewing of the same server-stored doc).

In any case, you need to loosen up, we’re all in this together. You remind me of linguistic purists who view languages as closed sets of words and rules, never to be soiled by foreign influx. That is not the WWW that Tim has built.

HTTP 1.0-0.12 Pragma

”[…] All pragma directives specify OPTIONAL BEHAVIOR from the viewpoint of the protocol; […] however, any pragma directive not relevant to a recipient SHOULD BE IGNORED by that recipient.”

For some reason that no-tresspass-no-cache-clause means a lot to you (or may be just provides a suitable indignation-fodder for this very debate?) Fine, but, apparently it doesn’t appeal to bit.ly-folks, no need to call them assholes for that. Remember Feldman’s Preamble to all Internet debate laws: “anyone calling Interadversary an asshole is a bigger asshole.”

Besides, no-cache, etc. is as much employed for its once-intended purpose, to provide for continuously-updated, always fresh content, as it is as access control/restriction mechanism for vacuuous commercial, but usually masquerading as involatile intellectual-property, reasons. Perhaps if it weren’t, I might be more sympathetic to your…. er, cache-verbotten crusade?

• Posted by: Ianf on Jul 13, 2008, 2:58 PM

Ianf, this is inane.

First, if you’re saying that everything since HTTP 1.0 is somehow bad, optional, wrong, or invalid, then you’re giving up on a ton of functionality — keep-alive connections, resumption of downloads mid-stream, content compression, digest and proxy authentication, content negotiation, and the list goes on. I think that if this is your contention, you’re in that boat alone.

Second, your examples both times are using a deprecated HTTP 1.0 header, the pragma header — sure, that spec says that it’s optional, which is exactly why HTTP 1.1 went so much further in giving page authors and publishers better cache control mechanisms, mechanisms that are far stronger in what they tell caching engine providers what they should do in response to them.

And finally, looking at my server logs, when Bit.ly makes its page request to generate a cached copy, it makes it as an HTTP 1.1 request. In other words, even Bit.ly’s bots claim to be adhering to HTTP 1.1, which they’re clearly not.

I’d love to see any validation of your claim about the oft-intended uses of cache control; I think you’re just making those up. As it is, and while it actually doesn’t matter to either side of this argument at all, Tim Berners-Lee was on the HTTP 1.1 committee and signed the final spec, so why would I think that he wasn’t all in favor of the new cache control mechanisms introduced in the spec?

In the end, the question is: how hard would it have been for the Bit.ly folks to read the various cache control headers and respect them? They are building a business on providing easy access to content, but are acting as if they don’t give a shit about the producers of that content. Why does this make sense at all, and why should potential users not care about this?

• Posted by: Jason on Jul 13, 2008, 3:32 PM

Please note that comments automatically close after 60 days; the comment spammers love to use the older, rarely-viewed pages to work their magic. If comments are closed and you want to let me know something, feel free to use the contact page!