farmdev

Thoughts on Data Mining

Data mining in Python and beyond?

Dear Lazy Web,

I am a software engineer who hasn't done any college-level math (gasp!). Recently, I've been having a lot of fun transforming data into more meaningful data. This, I believe, is more commonly known as data mining, and I'd like to learn more about it.

Specifically, I've been looking at Internet search data where keywords are buried in some kind of template like {foo}::{bar}::{keywords}, with the placeholders replaced by actual content. There are many disparate template formats and no template ID to go by. So I spent some quality time with my favorite programming language, Python, and identified as many patterns as I could in a sample of 5 million candidate strings. After much tweakage, my pattern recognition became 94.54% accurate. This rate was more than good enough for the users of the data, so I basked in sweet triumph and called it a day :)
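
For anyone curious what this kind of template matching might look like, here is a minimal sketch. It assumes each template can be described by a regular expression with a named `keywords` group. The two patterns, the function names, and the sample strings are all made up for illustration; they are not the actual patterns or data from this project.

```python
import re

# Hypothetical template patterns, tried in order. The real formats are
# not listed in the post; these two are invented for illustration.
TEMPLATE_PATTERNS = [
    re.compile(r"^[^:]+::[^:]+::(?P<keywords>.+)$"),  # {foo}::{bar}::{keywords}
    re.compile(r"^search\?q=(?P<keywords>[^&]+)"),    # search?q={keywords}&...
]


def extract_keywords(candidate):
    """Return the keywords from the first matching template, else None."""
    for pattern in TEMPLATE_PATTERNS:
        match = pattern.match(candidate)
        if match:
            return match.group("keywords").strip()
    return None


def match_rate(candidates):
    """Fraction of candidate strings recognized by at least one pattern."""
    if not candidates:
        return 0.0
    hits = sum(1 for c in candidates if extract_keywords(c) is not None)
    return hits / len(candidates)


# Invented sample data.
sample = [
    "news::sports::world cup scores",
    "search?q=python+regex&page=2",
    "totally unrecognized string",
]

for candidate in sample:
    print(repr(candidate), "->", extract_keywords(candidate))
print("match rate:", match_rate(sample))
```

Note that `match_rate` only measures how many strings some pattern recognized. Measuring accuracy would also need a hand-checked sample to confirm that the extracted keywords are the right ones.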

In another project I used frequency analysis and deduction to turn eBay auction titles into more meaningful identifiers of items up for sale. This worked fairly well because my dataset was large and all the auctions were for a specific kind of item.
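
One simple form of frequency analysis on titles can be sketched as follows. The assumption here is that words appearing in several titles are more likely to describe the item itself, while one-off words are seller noise. The `min_count` threshold, the function names, and the sample titles are hypothetical, and the post does not say which rules were actually used, so treat this as one possible approach rather than the method described above.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_identifiers(titles, min_count=2):
    """Keep only the tokens that occur in at least min_count titles."""
    # Count each token once per title (document frequency).
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)))

    identifiers = []
    for title in titles:
        kept = [t for t in tokenize(title) if doc_freq[t] >= min_count]
        identifiers.append(" ".join(kept))
    return identifiers


# Invented sample titles.
titles = [
    "Fender Stratocaster 1998 sunburst L@@K",
    "RARE fender stratocaster sunburst w/ case",
    "Fender Telecaster 2003 blonde",
]

for title, identifier in zip(titles, build_identifiers(titles)):
    print(repr(title), "->", repr(identifier))
```

With a tiny sample like this the threshold throws away useful words such as "telecaster", which fits the observation that the approach works best on a large dataset of similar items.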

My question to you is: where can I find out more about this kind of fun stuff? Can you recommend a good book on data mining? Any good blogs to read? Should I take some math classes before I get too deep into this? If so, which ones?
