Steven Lott gave a talk (slides) roughly about implementing the star schema in Python, suggesting that ETL operations, that is, the process of extracting, transforming, and loading data into a dimensional model (the Star Schema), and then analyzing that data, can be done completely in Python memory. No databases. Yes, that's right, he's saying: create these traditional ETL entities as Python objects. Just so you're hearing me correctly: create the Entity, the Dimension, and the Fact AS PYTHON CODE, and report on the facts IN MEMORY!
In case you're not sure what this is all about: the Entity is something used to characterize the fact. Say, for example, a consumer event with regard to advertising: a click on a banner, a submit on the landing page, a product order, etc. Say this comes from an external data source, a CSV file from the client containing simply the session ID, the date the event occurred, and the event type. Then, in Python code, we create an Entity object with a level of ('session ID foo', '2007-02-26 09:25:05', 'submit'). Simple, right? Just a tuple. The level is the hierarchy, the segments of your data.
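To make that concrete, here is a minimal sketch of what such an Entity might look like; the class name, attributes, and CSV loading are my own guesses at the shape of the idea, not Steven's actual code:

import csv

class Entity(object):
    """One row from the client's CSV, identified by its level tuple."""
    def __init__(self, level):
        self.level = level   # e.g. ('session ID foo', '2007-02-26 09:25:05', 'submit')
        self.facts = []      # facts characterized by this entity (used further down)

def load_entities(path):
    # Each CSV row is simply (session_id, occurred_at, event_type).
    with open(path) as src:
        for row in csv.reader(src):
            yield Entity(tuple(row))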
Since this data is important to things other than reporting, you would load this into the ODS (operational data store). In this example, it would end up [roughly] as a "submit" in the Event Log with a foreign key pointing to the session that started the user's activity. But then a separate process would add it to the Event Fact table, so that reports can be generated on the lifecycle of the user's activity, i.e. the fact that they clicked on the banner at 9:15 Monday morning and then submitted the form on the offer page at 9:25.
More specifically, the Fact is "a measurement plus the entities which characterize the measurement"; it has many dimensions, each of which is "a collection of related entities [think: sub-entities]". In our example, we have the Start Date Of The Session as one dimension of the fact, the Campaign that the user clicked on (derived from the session) as another, the Event Type itself, and so on. Great, we have this in a big database because reports are generated off these tables daily.
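Roughly, I picture the relationship like this, building on the Entity sketch above (again, my own illustration, not Steven's classes):

class Dimension(object):
    """A collection of related entities, keyed by their level tuples."""
    def __init__(self, name):
        self.name = name
        self.members = {}   # level tuple -> Entity
    def add(self, entity):
        # Reuse the already-known entity for this level if we have one.
        return self.members.setdefault(entity.level, entity)

class Fact(object):
    """A measurement plus the entities which characterize it."""
    def __init__(self, measure, **entities):
        self.measure = measure     # e.g. 1 for an event count
        self.entities = entities   # e.g. date=..., campaign=..., event_type=...
        for entity in entities.values():
            entity.facts.append(self)   # each entity remembers its facts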
So why would I ever consider doing this all in Python code? That's exactly what I asked Steven after the talk. Several back-and-forths and a whiteboard scribble later, he showed me the light. I saw his proposal not as "do all this in Python memory" but instead as an intelligent caching mechanism for ETL processing, one that is still free to use databases. Caching is very hard to make transparent. Well, only hard because it's usually an afterthought, so you end up with stupid, error-prone functions that say: if the object is in the cache, return it, otherwise create one. With these "dumb" objects instead you end up with a transparent representation of what's going on, and thus a perfect jumping-off point for implementing caching. Funny how it sometimes takes something ridiculously simple to solve a complex problem.
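The kind of bolted-on helper I mean looks roughly like this (a generic sketch of the pattern, not anything from the talk):

def get_or_create(cache, key, factory):
    # The afterthought version of caching: if it's in the cache return it,
    # otherwise build it, stash it, and hope every caller stays in sync.
    if key in cache:
        return cache[key]
    cache[key] = factory(key)
    return cache[key]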
However, to get into the nitty-gritty, I still don't see how you would ever want to run reports off any real business data from within Python code. Steven suggests something like:
def report(dim1):
    for k, e in dim1.items():
        if k[2] == 'criteria':
            total = 0
            for f in e.facts:
                total += f.measure
            yield e.level, total
Bwahh! Yes, that is insane. This is what SQL is for. But I can actually see that when you are loading facts, you can start building dimensions by first creating a link to that dimension (this is where the select query happens), then reusing a cached instance of that dimension for future facts. When it's time to insert the facts, the dimension instances all become foreign keys. Steven says if you must use a database, then this approach should start by caching the entire table! I like the lazy select/cache approach better, but he says dimensions can sometimes be small data sets, and I can see how caching the whole table up front could be efficient in that case.
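Here is a rough sketch of that lazy select/cache idea; the table and column names (campaign_dim, event_fact, and so on) and the sqlite3-style connection and placeholders are my own assumptions for illustration, not anything from the talk:

class CampaignDimension(object):
    """Lazy select/cache: only hit the database on a cache miss."""
    def __init__(self, conn):
        self.conn = conn     # an sqlite3 connection (or similar), for illustration
        self.cache = {}      # natural key -> surrogate key

    def key_for(self, campaign_code):
        if campaign_code not in self.cache:
            row = self.conn.execute(
                "SELECT campaign_id FROM campaign_dim WHERE code = ?",
                (campaign_code,)).fetchone()
            self.cache[campaign_code] = row[0]   # assumes the dimension row exists
        return self.cache[campaign_code]

def insert_fact(conn, campaigns, session_id, occurred_at, campaign_code):
    # At insert time, the cached dimension lookup becomes a plain foreign key.
    conn.execute(
        "INSERT INTO event_fact (session_id, occurred_at, campaign_id) "
        "VALUES (?, ?, ?)",
        (session_id, occurred_at, campaigns.key_for(campaign_code)))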
There's nothing like a good rebellious idea to get one's gears greased up and turning. My jaw was literally on the floor after this talk; I couldn't believe someone would ever suggest such a ridiculous idea. Steven, thank you :)