Parsing URLs
There are lots of things that you’d want computers to simplify for you, but the obvious methods for simplification don’t actually work. Take finding out whether a user-entered email address is actually valid: the description and code involved (see here and the monstrous chunk of code here) belie the apparent simplicity of the task.
So it was for my own little task. I wanted to write a general-purpose method (in PHP, actually, but the language itself is unimportant) that infers the address of a blog from the permalink of any one of its posts. You’d think that this was a relatively straightforward task. But, if the preamble didn’t alert you already, it wasn’t quite as simple as I first envisaged.
The simplest case - subdomains
The main blog hosts that are likely to send blog addresses out into the wild are Blogspot, Wordpress.com, Typepad and Livejournal. Usage statistics may differ elsewhere, but in the space I was working in, those four were the most commonly seen. Of those, Wordpress.com and Blogspot dominated.
All four providers (and a few others such as Squarespace and Vox) have a simple URL scheme. Each assigns a unique subdomain to each blog, for example http://myblog.wordpress.com or http://myself.blogspot.com. Post URLs then take the form http://myblog.wordpress.com/2007/12/13/all-about-me/ or http://myself.blogspot.com/2007/12/all-about-myself.html respectively.
Discovering the blog address from either of those URL forms? Easy. Use parse_url to find the host and you’re all done.
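A minimal sketch of that idea, assuming the blog address is simply the scheme plus the host. The function name blog_address_from_host is made up for illustration and is not taken from the original code:

```php
<?php
// Sketch only: treat scheme + host as the blog address.
// blog_address_from_host() is a hypothetical name used for illustration.
function blog_address_from_host(string $postUrl): ?string
{
    $parts = parse_url($postUrl);

    // parse_url() returns false for seriously malformed URLs and simply
    // omits components (such as 'host') that are not present.
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Assumption: fall back to http when no scheme is given.
    $scheme = $parts['scheme'] ?? 'http';

    return $scheme . '://' . $parts['host'];
}

echo blog_address_from_host('http://myblog.wordpress.com/2007/12/13/all-about-me/'), "\n";
echo blog_address_from_host('http://myself.blogspot.com/2007/12/all-about-myself.html'), "\n";
```

Both calls should print the bare subdomain address, with the date and post slug dropped.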
Custom domains or subdomains
An increasing number of people opt to run blogs on their own domains (or sometimes a subdomain). For example, this blog is hosted on lair.fierydragon.org. You also have http://sanjiva.weerawarana.org/ and indi.ca, to pick a couple more examples at random. Handling these cases is also surprisingly simple, even if some of the URLs on offer are distinctly hairy. Some of the furrier examples that I’ve seen so far are http://www.kanabona.com:80/www/?q=galle_fort_party and http://www.kulendra.net/index.php?option=com_content&task=view&id=209&Itemid=9. Lots of text and slightly scary looking URLs, but it’s all good. Much to my relief, parse_url still works well here.
So far, so good. A single built-in PHP function can infer the blog address for more than 80% of the URLs that I encounter in my application. It took little effort to figure out, and everything works fine. Unfortunately, the remaining 20% of URLs are slightly more challenging.
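To see why the hairy URLs are harmless, here is a small sketch that asks parse_url for individual components of the two examples above. The port and query string come back as separate pieces, so the host is unaffected by them:

```php
<?php
// Sketch: parse_url() separates the host from the port and query string.
$urls = [
    'http://www.kanabona.com:80/www/?q=galle_fort_party',
    'http://www.kulendra.net/index.php?option=com_content&task=view&id=209&Itemid=9',
];

foreach ($urls as $url) {
    $host  = parse_url($url, PHP_URL_HOST);
    $port  = parse_url($url, PHP_URL_PORT);  // integer, or null when absent
    $query = parse_url($url, PHP_URL_QUERY); // string, or null when absent

    printf("%s (port: %s, query: %s)\n", $host, $port ?? 'none', $query ?? 'none');
}
```

The first line printed should start with www.kanabona.com and the second with www.kulendra.net, however much cruft follows the host.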
Special case 1: Trailing blogname
Some services put the blog name in the path rather than in the host. For example, a case like this: http://www.bloglines.com/blog/sanjiva?id=258. Or perhaps like this: http://www.xanga.com/LadyKiadri/631754561/perth-diary-overheard-at-the-cricket.html. Running parse_url on those post URLs and examining the host doesn’t get you anything useful, because the host is the same for every blog on the service. All the action happens afterwards, in the path.
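A quick sketch of the problem, using the two URLs above: the host only identifies the service, while the blog name sits somewhere among the path segments, at a position that differs from one service to the next.

```php
<?php
// Sketch: the host is shared by every blog on the service; the blog name
// lives in the path, at a position that depends on the service.
$urls = [
    'http://www.bloglines.com/blog/sanjiva?id=258',
    'http://www.xanga.com/LadyKiadri/631754561/perth-diary-overheard-at-the-cricket.html',
];

foreach ($urls as $url) {
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH);

    // Split the path into its non-empty segments.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));

    echo $host, ' => ', implode(' | ', $segments), "\n";
}
```

For Bloglines the blog name is the second segment (after "blog"); for Xanga it is the first, so no single positional rule covers both.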
Special case 2: Horribly munged URLs
Feedburner is the biggest culprit here. Examples: http://feeds.feedburner.com/~r/LankaCricket/~3/202319740/video-sri-lanka-v-england-2nd-test-day_18.html or http://feeds.feedburner.com/~r/Defencenet/~3/202252903/explosion-near-kanthale-railway-station.html. Yes, you can just about pick out the blog name, but there is plenty of cruft in between and afterwards to confuse the issue.
Special case 3: URLs that make no sense at all
A recent traumatic example: http://blog.360.yahoo.com/blog-12rsa5Q4aKewp.lL5PhUGuNBFQ--?cq=1&p=66
Spaces used to produce similarly confusing URLs, but they got a clue. Yahoo 360 on the other hand … well. The funniest thing is that some of the elements of that Yahoo 360 address actually seem to have no discernible function. For example, the -- chars. Remove them or keep them on, they seem to change nothing. Inexplicable.
How to solve it
(liberally channelling G. Polya here)
Clearly, there are a few considerations when writing a monster “find-me-the-blog-address” function of this type. The first and foremost is that I can’t know in advance about all the different URL schemes that various blog hosts are going to produce. Each time a new kind of blog URL appears, I’ll have to deal with it. If I handle the special cases with custom code, each new mapping from post URL to blog address requires a fresh condition in code of the type If new-blog-address-type-is-seen do <make-sense-of-URL>.
Frequent code changes for this purpose being unacceptable, I needed a table-driven solution where predefined blog address patterns are stored externally and read by the host parsing function.
At one point, performance was also an issue because I needed to call this function several hundred times in the course of rendering a page. However, a bit of benchmarking made me realize that this computation was better shifted offline. Flexible, maintainable or fast. Pick one. I picked flexibility over speed.
The one-size-fits-all hammer for all these special cases is that, once the blog domain is identified, a simple regex (regular expression) can be used to determine the blog address. The regular expression itself differs between the special cases. So I devised a simple table to hold all the domains which require special treatment and the corresponding regular expression.
Domain: www.risefromtsunami.com | Regex: /\?p=\d+.*/
Domain: digg.com | Regex: /news.*/
Domain: www.xanga.com | Regex: /\d+.*/
Load this table up, check the domain being examined (with parse_url) and, if the domain matches anything in the table, apply the corresponding regular expression instead of relying on the results of parse_url alone. Simple, extensible and relatively quick, since the table of patterns only needs to be loaded once.
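The lookup can be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: the table is held in a PHP array rather than loaded from external storage, the function name blog_address is made up, and the regex is assumed to match the post-specific tail of the URL, so that deleting the match leaves the blog address.

```php
<?php
// Hypothetical in-memory copy of the pattern table. The post describes
// storing it externally and loading it once.
$patterns = [
    'www.risefromtsunami.com' => '/\?p=\d+.*/',
    'digg.com'                => '/news.*/',
    'www.xanga.com'           => '/\d+.*/',
];

// blog_address() is a hypothetical name used for illustration.
function blog_address(string $postUrl, array $patterns): ?string
{
    $parts = parse_url($postUrl);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    $host = $parts['host'];

    if (isset($patterns[$host])) {
        // Assumption: the regex matches the post-specific tail of the URL,
        // so removing the match leaves the blog address.
        return preg_replace($patterns[$host], '', $postUrl);
    }

    // No special case for this domain: scheme + host is the blog address.
    return ($parts['scheme'] ?? 'http') . '://' . $host;
}

echo blog_address('http://www.xanga.com/LadyKiadri/631754561/perth-diary-overheard-at-the-cricket.html', $patterns), "\n";
echo blog_address('http://myblog.wordpress.com/2007/12/13/all-about-me/', $patterns), "\n";
```

Under those assumptions, the Xanga post URL is cut back to http://www.xanga.com/LadyKiadri/ while the Wordpress.com URL falls through to the plain host. Supporting a new awkward host then means adding a row to the table rather than another branch in the code.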
And that’s how the domains are displayed on Ach.