will no one rid me of that troublesome …
So, it’s like … I have this series of numbers, right ? And like, these numbers are weirdly clumped together. Most of them are quite small (10^-2 values). Unfortunately, two or three of this series are whopping large numbers (10^8 values). My challenge (since I don’t have a choice about accepting it) is to normalize all these values into a simple range of 0.0 to 1.0 - for purposes of comparison with other numbers.
Yeah. So, I used simple linear reduction to bring things down to scale. That results in the majority of the figures being incredibly, impossibly small. So, lots of dots in the graph near 0 and just one or two near 1. That won’t do at all.
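For anyone following along, here is a minimal sketch of what that linear reduction probably looks like, assuming it is ordinary min-max scaling. The `min_max` helper and the `sample` values are made up for illustration; they are not the real word-similarity data.

```python
def min_max(values):
    """Linearly rescale values so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data: mostly around 1e-2, plus two values near 1e8.
sample = [0.01, 0.02, 0.015, 0.03, 0.012, 1.0e8, 0.9e8]
scaled = min_max(sample)

for original, s in zip(sample, scaled):
    print(f"{original:>14.3f} -> {s:.12f}")
```

With data shaped like this, every small value lands within a hair of 0 because the divisor is dominated by the huge maximum, which is exactly the clumping described above.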
The obvious solutions, using log or sqrt, yielded interesting but unsatisfactory results. Using cos (I was just curious about what a cosine transform might accomplish here) yielded results which verged on the bizarre.
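A rough sketch of the log variant, assuming it means "take log10 first, then rescale linearly". The `log_min_max` name and the `sample` values are hypothetical, and this only works when every value is strictly positive.

```python
import math


def log_min_max(values):
    """Take log10 of each value, then rescale the logs linearly to [0.0, 1.0]."""
    logs = [math.log10(v) for v in values]  # raises ValueError if any v <= 0
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in logs]


# Hypothetical data: mostly around 1e-2, plus two values near 1e8.
sample = [0.01, 0.02, 0.015, 0.03, 0.012, 1.0e8, 0.9e8]
scaled = log_min_max(sample)

for original, s in zip(sample, scaled):
    print(f"{original:>14.3f} -> {s:.4f}")
```

On made-up data like this the small values are no longer indistinguishable from 0, but they still sit in a narrow band at the bottom while the large ones sit at the top, with nothing in between. That may be part of why the result feels unsatisfactory.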
Ideas ? Get a few stats books from the library, maybe. Is this problem even solvable ? If the data is clumped towards the lower half, perhaps that just means I need to suck it up and deal with incredibly small numbers. Or … I could possibly omit the largish numbers as being statistically insignificant - they constitute less than 1% of the total sample size. Hmmm.
On 30-Jan-06 at 3:27 pm,
spizkapa wrote:
You need to ask yourself: are all the numbers necessary?
What I mean is that, if the numbers that are huge are just outliers, you can simply filter the list to keep only those that you want. This may sound like cooking the data, but it isn’t. You’re trying to show what’s going on, not make conclusions. You obviously need to say that you’ve done this in the surrounding text.
If they are necessary (could indeed be the whole result - look, I get huge numbers when these guys get small numbers) then you can still normalise the rest in a standard way and simply call these huge numbers 1s. I know, it’s not clean.
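As a sketch of that "call the huge ones 1" idea: pick a cutoff, scale everything at or below it linearly, and map anything above it straight to 1.0. The `normalize_with_cap` helper, the `cutoff` parameter and the `sample` values are all assumptions made for illustration.

```python
def normalize_with_cap(values, cutoff):
    """Min-max scale the values at or below cutoff; map anything above it to 1.0."""
    rest = [v for v in values if v <= cutoff]
    if not rest:
        # Everything is above the cutoff, so everything is "huge".
        return [1.0 for _ in values]
    lo, hi = min(rest), max(rest)
    span = hi - lo
    out = []
    for v in values:
        if v > cutoff:
            out.append(1.0)  # huge value: simply called 1
        elif span == 0:
            out.append(0.0)  # all remaining values are identical
        else:
            out.append((v - lo) / span)
    return out


# Hypothetical data: mostly around 1e-2, plus two values near 1e8.
sample = [0.01, 0.02, 0.015, 0.03, 0.012, 1.0e8, 0.9e8]
capped = normalize_with_cap(sample, cutoff=1.0)
print(capped)
```

Note that the largest of the "normal" values also ends up at 1.0, so it becomes indistinguishable from the capped ones. That is the not-clean part.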
Other solutions include using a log log plot. It does distort the data but there isn’t a much better way that I know. HTH.
On 30-Jan-06 at 3:47 pm,
drac wrote:
Heh. Damn, you’re fast
I just updated the entry with the thought of dismissing the larger numbers as being outliers, as you suggest.
I probably shouldn’t though. This is word similarity I’m trying to compare - so the large values do mean something, in a sense… as any given word will have a few synonyms (large numbers) and lots of completely unrelated words (small numbers). The large numbers mean something, even if they don’t occur very often.
I actually went this far:
minimum number
and
maximum number.
Of course, I was just applying sqrt till I got to a workable range - but it doesn’t really conform to a nice theory. I think log log is probably less surprising to anyone reading the material though, so I’ll use that instead. Thanks!
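One way to see what "applying sqrt till it fits" amounts to: taking the square root k times is the same as raising the value to the power 1/2^k, so it is a power transform with an exponent picked by trial. This is only an illustrative sketch; the `repeated_sqrt` helper and the choice of k = 3 are assumptions, not the formulas actually used.

```python
import math


def repeated_sqrt(x, k):
    """Apply sqrt to x a total of k times (x must be non-negative)."""
    for _ in range(k):
        x = math.sqrt(x)
    return x


k = 3  # hypothetical number of passes
for value in (0.01, 1.0e8):
    by_loop = repeated_sqrt(value, k)
    by_power = value ** (1 / 2 ** k)  # same transform written as a single exponent
    print(value, by_loop, by_power)
```

Each extra pass halves the exponent, so the range shrinks quickly, but there is no principled stopping point, which fits the "doesn’t really conform to a nice theory" complaint.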
Incidentally, I used exactly your suggestion of calling the larger numbers 1s - but a reviewer (some statistician ?) wasn’t happy with that approximation. That’s why I was searching for a more “standard” method of normalization for the revision.
On 30-Jan-06 at 4:00 pm,
spizkapa wrote:
How about telling the statistician to go screw himself? Usually does it for me… Otherwise, log log is the way as far as I can see.