Farm Development

Thoughts on Data Mining


Data mining in Python and beyond?

Dear Lazy Web,

I am a software engineer who hasn't done any college-level math (gasp!). Recently, I've been having a lot of fun transforming data into more meaningful data. This, I believe, is more commonly known as data mining, and I'd like to learn more about it.

Specifically, I've been looking at Internet search data where keywords are buried in some kind of template like {foo}::{bar}::{keywords}, with the placeholders replaced by actual content. There are many disparate template formats and no template ID to go by. So I spent some quality time with my favorite programming language, Python, and identified as many patterns as I could in a sample of 5 million candidate strings. After much tweakage, my pattern recognition became 94.54% accurate. This rate was more than good enough for the users of the data, so I basked in sweet triumph and called it a day :)
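To give a rough idea of the approach, here is a minimal sketch of matching strings against a list of known template patterns. The second template format, the sample strings, and the function names are all made up for illustration; the real formats were far more varied. Note that `match_rate` only measures how many strings some pattern recognizes, which is not the same as accuracy (that would need hand-checked answers to compare against).

```python
import re

# Hypothetical template formats. Only the first one mirrors the
# {foo}::{bar}::{keywords} example; the second is invented.
TEMPLATES = [
    re.compile(r"^[^:]+::[^:]+::(?P<keywords>.+)$"),  # {foo}::{bar}::{keywords}
    re.compile(r"^q=(?P<keywords>[^&]+)&src=.+$"),    # q={keywords}&src={foo}
]


def extract_keywords(candidate):
    """Return the keywords from the first matching template, or None."""
    for pattern in TEMPLATES:
        match = pattern.match(candidate)
        if match:
            return match.group("keywords")
    return None


def match_rate(candidates):
    """Fraction of candidate strings that at least one template recognizes."""
    if not candidates:
        return 0.0
    hits = sum(1 for c in candidates if extract_keywords(c) is not None)
    return hits / len(candidates)


# Invented sample strings, not real search data.
sample = [
    "news::us::python data mining",
    "q=ebay titles&src=toolbar",
    "no template here",
]

for candidate in sample:
    print(repr(candidate), "->", extract_keywords(candidate))
print("match rate:", match_rate(sample))
```

Strings that fall through every pattern are the interesting ones: eyeballing them is how new template formats get discovered and added to the list.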

In another project I used frequency analysis and deduction to turn eBay auction titles into more meaningful identifiers of items up for sale. This worked fairly well because my dataset was large and all the auctions were for a specific kind of item.
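A minimal sketch of the frequency idea, under some loud assumptions: the titles below are invented, the `min_count` threshold is arbitrary, and the real project involved more deduction than this. The gist is that words recurring across many titles are probably part of the item's vocabulary (brand, model), while one-off words are probably seller noise.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_vocabulary(titles, min_count=2):
    """Collect words that appear in at least min_count different titles."""
    counts = Counter()
    for title in titles:
        counts.update(set(tokenize(title)))  # count each word once per title
    return {word for word, n in counts.items() if n >= min_count}


def identify(title, vocabulary):
    """Keep only the recurring words, in their original order."""
    return " ".join(w for w in tokenize(title) if w in vocabulary)


# Invented auction titles for illustration only.
titles = [
    "Nikon D40 camera body LOOK!!",
    "nikon d40 w/ lens *mint*",
    "NIKON D80 body only",
    "Nikon D80 camera + bag",
]

vocab = build_vocabulary(titles)
for title in titles:
    print(repr(title), "->", identify(title, vocab))
```

This only works when the dataset is large and every auction is for the same kind of item; otherwise the recurring words are mostly generic ones like "new" or "free shipping".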

My question to you is: where can I find out more about this kind of fun stuff? Can you recommend a good book on data mining? Any good blogs to read? Should I take some math classes before I get too deep into this? If so, which ones?
