the escape artist
Been doing large scale data crunching for the past few days - and I’m slowly relearning how to do large scale database imports quickly, but carefully at the same time. That seems like a no-brainer - careful and quick seems to be the obvious attributes for an import job. But where the quality of data is required to be 100% error free (instead of merely adequate), it’s far more time consuming to validate than might otherwise be the case.
In the distant past, I was doing datawarehousing for a (very very) large dataset. Importing the diff of a month’s worth of records into the warehouse took about 2 days of continuous SQL Loader work - so making sure that the import happened without any failures was important. If there is a failure in the middle of the import, the entire job had to be aborted and restarted. Screw up often enough (hey, I was an intern, fer chrissake!) and you’d generally find your spare time being consumed by the need to get that sodding data into the database at all costs.
It didn’t help that the all knowing DBAs I was working with had set up Oracle with the wrong character set and my import data contained quite a few umlauts and essets. Aargh. But anyway… I’m just picking at old scars now.
Had about double the amount of data to get into a database the past couple of days and I encountered a series of unique problems. The first being that the data file is so fricking huge that I couldn’t have another copy on the same disk. Yes, in these days of mammoth hard disks, I still had less than 20 gigabytes of free space left on the disk, so I needed some other way to get a copy. (Yes, the data file is about 20 gigs). For only the second time in my entire life, I had to resort to the wonderful split. Of course, the programmatic equivalent of split is File::Split and that’s what I would use in the actual script.
The objective of this exercise is essentially to get the contents of the data file into a database. I can partition the datafile into chunks. The next problem (as the hahahah-you-are-so-lame title of this post indicates) was escaping. I don’t know what sort of data might be in the file - and escaping proved to be a bit of a bear. For one thing, I was using single quotes to delimit the SQL strings and quotes can occur naturally in the text.
I handled the double quotes properly - but I completely forgot to do single quotes and the importer, quite naturally, threw serveral types of hissy fits. Aaargh.
So, I now need to write a script to clean out the unescaped single quotes. Obviously (or maybe not), some sort of regex needs to be employed here: so my first attempt was something along the lines of
m/[^ ]‘[^,]/gs to catch a case like the one below:
insert into synset_scores (method, word1, word2, score) values ('jcn', '11_November#n#1', 'St_Martin's_Day#n#1', '0');.
See that extra ‘ in St_Martin’s_Day ? That is one place where the importer is going to barf. I can see it, it’s just that I had forgotten to do anything about it when I was generating the SQL. Yeah. Call me a dumbass and be done with it.
Then I figured - hey, I don’t need to do this slow negation shit. All I need is to count the number of single quotes in a sentence. If it’s larger than 8 - I need to run it through my escaping routine and all would be good. So, for the first time outside golfing or obfuscation contests, I used tr///.
if ( scalar tr/'// >= 9) …
The moral of the story: think about escaping first - or you’ll pay for it later. (And I can’t be bothered to make a clevah word play on Sun Tzu’s advice here: When you surround an army, leave an outlet free).
Just say it
Can't post a comment ? Any other commenting problems ? email lair - at - fierydragon . org