You are here

The problem of trackback and comment spam in Drupal, and one way to address it

(I just posted a version of the following over at, in their "Drupal Core" forum. I doubt it will make much of an impact, but I had to try...)

I propose that there is a problem with the ways that program function URLs are written in Drupal, that causes Drupal to be a disproportionate target for trackback and comment spammers.

The problem with comment and trackback spam in Drupal is this: It's too easy to guess the URL for comments and trackbacks.

In Drupal, the link for a node has the form "/node/x", where x is the node id. In fact, you can formulate a lot of Drupal URLs that way; for example, to track-back to x, the URI would be "/trackback/x"; to post a comment to x would be "/node/comment/reply/x". So you can see that it would be a trivially easy task to write a script that just walked the node table from top to bottom, trying to post comments.

Which is pretty much what spammers do to my site: They fire up a 'bot to walk my node tree, looking for nodes that are open to comment or accepting trackbacks. I have some evidence that it's different groups of spammers trying to do each thing -- one group seems to be switching IPs after a small number of attempts, and the other tends to use the same IP until I block it, and then takes a day or so to begin again -- but that hardly matters. What does matter is that computational horsepower and network bandwidth cost these guys so little that they don't even bother to stop trying after six or seven hundred failures -- they just keep on going, like the god damned energizer bunny. For the first sixteen days of August this year, I got well over 100,000 page views, of which over 95% were my 404 error page. The "not found" URL in over 90% of those cases was some variant on a standard Drupal trackback or comment-posting URL.

One way to address this would be to use something other than a sequential integer as the node ID. This is effectively what happens with tools like MoveableType and Wordform/WordPress because they use real words to form the path elements in their URIs -- for example, /archives/2005/07/05/wordform-metadata-for-wordpress/, which links to an article on Shelley Powers's site. Whether those real words correspond to real directories or not is kind of immaterial; the important point is that they're impractically difficult to crank out iteratively with a simple scripted 'bot. Having to discover the links would probably increase the 'bot's time per transaction by a factor of five or six. Better to focus on vulnerable tools, like Drupal.

But the solution doesn't need to be that literal. What if, instead of a sequential integer, Drupal assigned a Unix timestamp value as a node ID? That would introduce an order of complexity to the node naming scheme that isn't quite as dramatic as that found in MT or WordPress, but is still much, much greater than what we've got now. Unless you post at a ridiculous frequency, it would guarantee unique node IDs. And all at little cost in human readability (since I don't see any evidence that humans address articles or taxonomy terms by ID number, anyway).

Some people will immediately respond that this is "security through obscurity", and that it's therefore bad. I'd argue that they're wrong on two counts: First, it's not security through obscurity so much as security through economic disincentive; second, it's not bad, because even knowing exactly how it works doesn't help you very much to game it. The problem with security through obscurity, see, is that it's gameable. Once you know that the path to the images directory is "/roger/jessica/rabbit/", then you can get the images whenever you want; even if you know that the URL to post a comment is "/node/timestamp/reply/comment/", you're not a heck of a lot closer to getting a valid trackback URL than you were before you knew that.

Add new comment