When is data mining not data mining?
Posted at 4/19/2008 07:51:00 AM
Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high-profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.
With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.
Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.
We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.
The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.
With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.
Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.
We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.
The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.
An inexpensive data warehouse?
Posted at 4/09/2008 08:24:00 AM
What makes a data warehouse system expensive or affordable? Let’s take a quick look at two examples; one based on an RDBMS that is free and another based on a CDBMS that sells for $100K. To keep it simple, I’ll use 100 gigabytes of raw data as the input to these systems and assume a small number of users directly on the warehouse and a larger number of users with OLAP screens, dashboards using star schemas, etc. The costs of the front end systems will be essentially the same for both systems so they are not included.
Design – RDBMS: Six calendar months, about $150,000 in labor cost.
Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.
Design – CDBMS: Three calendar days, about $750 in labor cost
Define aggregations, prepare load and aggregation processes.
Startup – RDBMS: 20 calendar days, about $15,000 in labor cost
Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.
Startup – CDBMS: Four calendar days, about $1000 in labor cost
Load data and validate for accuracy and completeness
Operation – RDBMS: One full time person, about $75,000 per year
Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.
Operation – CDBMS: Less than 1/10 of one full time person
Weekly loads.
It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.
Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.
Design – RDBMS: Six calendar months, about $150,000 in labor cost.
Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.
Design – CDBMS: Three calendar days, about $750 in labor cost
Define aggregations, prepare load and aggregation processes.
Startup – RDBMS: 20 calendar days, about $15,000 in labor cost
Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.
Startup – CDBMS: Four calendar days, about $1000 in labor cost
Load data and validate for accuracy and completeness
Operation – RDBMS: One full time person, about $75,000 per year
Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.
Operation – CDBMS: Less than 1/10 of one full time person
Weekly loads.
It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.
Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.