Farm Development

Data mining in Python and beyond?

Dear Lazy Web,

I am a software engineer who hasn't done any college-level math (gasp!). Recently, I've been having a lot of fun transforming data into more meaningful data. This, I believe, is more commonly known as data mining, and I'd like to learn more about it.

Specifically, I've been looking at Internet search data where keywords are buried in some kind of template like {foo}::{bar}::{keywords}, with the placeholders replaced by actual content. There are many disparate template formats and no template ID to go by. So I spent some quality time with my favorite programming language, Python, and identified as many patterns as I could in a sample of 5 million candidate strings. After much tweakage, my pattern recognition became 94.54% accurate. This rate was more than good enough for the users of the data, so I basked in sweet triumph and called it a day :)
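
A minimal sketch of this kind of template matching, for illustration only. The two patterns, the field names, and the helper functions are hypothetical (the real template formats aren't listed here), and the match rate it reports is only coverage, meaning how many strings fit a known template, not whether the extracted keywords are right.

```python
import re

# Hypothetical templates; each one names the group that holds the keywords.
TEMPLATES = [
    # {foo}::{bar}::{keywords}
    re.compile(r"^(?P<foo>[^:]+)::(?P<bar>[^:]+)::(?P<keywords>.+)$"),
    # A made-up second format, just to show several patterns side by side.
    re.compile(r"^q=(?P<keywords>[^&]+)&src=(?P<foo>\w+)$"),
]


def extract_keywords(candidate):
    """Return the keywords from the first template that matches, else None."""
    for pattern in TEMPLATES:
        match = pattern.match(candidate)
        if match:
            return match.group("keywords")
    return None


def match_rate(candidates):
    """Fraction of candidate strings that fit at least one known template."""
    if not candidates:
        return 0.0
    matched = sum(1 for c in candidates if extract_keywords(c) is not None)
    return matched / float(len(candidates))
```

Running match_rate over a sample after each new pattern is added is one way to watch coverage climb while tweaking.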

In another project I used frequency analysis and deduction to turn eBay auction titles into more meaningful identifiers of items up for sale. This worked fairly well because my dataset was large and all the auctions were for a specific kind of item.
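
One way to sketch the frequency idea, under made-up assumptions: count how many titles each word appears in, then keep the words that recur across titles but aren't in nearly all of them. The thresholds (min_count, max_share) and function names are hypothetical, and this is not the actual code from that project.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_doc_freq(titles):
    """Count, for each token, how many titles contain it at least once."""
    counts = Counter()
    for title in titles:
        counts.update(set(tokenize(title)))
    return counts


def identifier_for(title, doc_freq, total, min_count=2, max_share=0.8):
    """Keep tokens that recur across titles but are not near-universal.

    Rare tokens (seller noise like "L@@K") fall below min_count; tokens
    found in almost every title (the item category) exceed max_share.
    """
    kept = [
        t for t in tokenize(title)
        if doc_freq[t] >= min_count and doc_freq[t] / float(total) <= max_share
    ]
    return " ".join(kept)
```

For example, given a handful of camera auctions, a title like "Canon EOS 40D camera body NEW" would reduce to the tokens shared with other listings of the same item, with "camera" dropped because every title has it. This only works when the dataset is large and all about one kind of item, as noted above.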

My question to you is where can I find out more about this kind of fun stuff? Can you recommend a good book on data mining? Any good blogs to read? Should I take some math classes before I get too deep into this? If so, which ones?

  • Re: Data mining in Python and beyond?

    You may want to try out "Programming Collective Intelligence" by Toby Segaran; all the code snippets are in Python (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325)

  • Re: Data mining in Python and beyond?

    Try Orange

    http://magix.fri.uni-lj.si/orange/

  • Re: Data mining in Python and beyond?

    Yeah, Orange looks great. I was having trouble getting it installed on OS X though (I wanted the Python lib version, not the GUI).

  • Re: Data mining in Python and beyond?

    I second the 'Programming Collective Intelligence' recommendation. I have the book and it's a good starting point, and will give you some good ideas.

  • Re: Data mining in Python and beyond?

    ok, I need to get this book ... thanks!

  • Re: Data mining in Python and beyond?

    Have a look at Weka:

    http://www.cs.waikato.ac.nz/ml/weka/

    There's also a book published by the guys that wrote the software.

  • Re: Data mining in Python and beyond?

    And a third for Programming Collective Intelligence. The "collective intelligence" part refers to very much the kind of problems you're talking about: the ways we can derive useful information from large collections of data. And the "programming" part means every chapter uses Python to build concrete examples of a technique, using everyday data available on the web.

    Categorization, prediction, finding exemplar or independent features, detecting similarity: all sides of a similar coin. Programming Collective Intelligence does a good job of covering a decent number of algorithms and approaches, and of illustrating, by its choice of datasets, how they're more or less useful with varying types and quantities of input and output data.

    On the math side, it's something I struggle with as well. Another good book is "Geometry and Meaning" by Dominic Widdows, a survey of primarily graphs and vectors in the context of programming (but not so much code in the text) to derive useful information about words from large corpora. Think of it as the concepts from Programming Collective Intelligence stretched out, diving into a bit more math.

    I'm clearly not a good person to knowledgeably recommend math studies, but personally I wish I'd learned statistics and linear algebra rigorously.

    No time like the present, that's the plan of the moment.

  • Re: Data mining in Python and beyond?

    I have used both Orange and WEKA.

    Orange can be a tad difficult to get going, especially on Linux/Mac.

    WEKA is written in Java and works well on all OSes. It also has an extensive set of algorithms, data manipulation tools, attribute selection methods, etc.

    WEKA has multiple user interfaces (CLI, Explorer, and workflow-based).

    Once you get familiar with the UI, you might even want to explore further using Jython. I've used Jython+Weka with some success.

    But nowadays I'm trying to learn "R", which of course is excellent for any statistics-related work. It has a steep learning curve, however.

  • Re: Data mining in Python and beyond?

    For Python, go for the Orange library and "Programming Collective Intelligence".

    Weka is very good too... and I've found the Java implementations to be very useful...

  • Re: Data mining in Python and beyond?

    YaLE (http://rapid-i.com/) is also great for doing experiments, trying out algorithms and visualizing results. Most of the learners in WEKA are part of it.

  • Re: Data mining in Python and beyond?

    You could help out with ThoughtTrail. It's a cross-platform, open, semantic framework.

    We can easily reuse web data, e.g. getting del.icio.us tags for a URL:

    def toptags(url):
        path = "//div[@class='list']/div[@class='sidebar-inner']/ul[1]/li/ul/li/a"
        return getWebXpath("http://del.icio.us/url/check?url=" + url, path, None, {'Cookie': '_url_tagview=list'})

    # simple Google search
    def google(query):
        return urlxpath.getWebXpath("http://google.com/ie?q=" + urllib.quote(query), "/html/body/nobr/a/@href")

    # YouTube search for the cyborg dude
    getWebXpath("http://youtube.com/results?search_query=kevin%20warwick", "//div[@class='vlshortTitle']/a/@href")[0]

    We can find phrases, and datamine how interrelated topics are, getting real numbers back.

    We're working on adding more data sources and AI/stats/datamining methods.

    Our first proper add will be this: http://thoughttrail.com/minibrowser/

    We were in Techcrunch UK with an experiment, Instopix.

    http://uk.techcrunch.com/2008/05/13/instopix-visualising-chat-as-you-type/

    Email luke@thoughttrail.com if you want to help out! :D

  • Re: Data mining in Python and beyond?

    I want to know more about Orange.
