Queries From Hell
A blog about data warehousing, correlation databases, associative and incremental queries, value-based storage, metadata, on-the-fly indexing, automatic data-driven schemas, BI tools, data mining, visual mapping, pattern recognition, and the limitations of standard SQL in answering "queries from Hell."
Or, how to discover what you don't know you don't know.

How expensive is free software?

Posted at 6/25/2008 09:10:00 AM
I was just looking at a white paper from Greenplum describing how fast, easy and inexpensive a data warehouse is because it runs on commodity hardware. I have seen uncountable comments about how inexpensive the Microsoft data warehouse infrastructure is and even more comments about the low cost of using open source software. It would be easy to believe that for the price of a few computers and a little software you could have a data warehouse running next week.

At a recent TDWI conference in Germany, Larissa Moss delivered a presentation defining the development steps and the required team members in a data warehouse implementation project. The presentation did a beautiful job of defining what has to be done and who has to do it. It involves a minimum team of five or six full time people and a number of part time specialists. The entire project is divided into sixteen steps with each step having multiple activities.

While many organizations will not follow every step and activity of her plan, it is an excellent definition of what really should be done to ensure quality and minimize total cost of ownership. So, using her project definition as a guideline, it is easy to see that the cost of the database software and the server hardware represents only a small portion of the total cost of a data warehouse.

The use of commodity hardware and open source software can reduce the total cost of the project by a couple of percentage points at best. Reducing the amount of human resource required can reduce the cost of the project by half. Some of the tasks identified in her presentation won’t change with any technology but there are some large improvements possible that can result in cost reductions that substantially exceed the total cost of the hardware and software.

For example, the Data Analysis Step includes six activities. Each of these activities involves a number of people and takes a significant amount of time. Using free (or nearly free) relational database software and/or inexpensive commodity hardware, all six steps are still required to have a reasonable prospect of a successful project. If another type of database can reduce the time and expense of these steps by just ten percent, the entire cost of the software and hardware can be offset.

Using the CDBMS structure, two of the activities, “Refine logical data model” and “Expand enterprise logical data model” are no longer needed. The other activities, like “Analyze source data quality”, will remain essentially the same regardless of database structure. Removing two of the six activities will reduce the overall time and cost of just one step by about one third.

As another example, take a look at one of the activities in the first step, “Cost Justification”. If the cost of a project is a few hundred dollars there is no justification needed other than someone saying it will help them with their job. If the cost of a project is a few tens of millions, it will require extremely detailed and robust justification and approval at top executive level. In between, the justification should have a relatively linear relationship to the expected cost. If the Data Analysis step is reduced by one third, and other steps are reduced by a substantial amount, the cost of the entire project is reduced and the cost justification time and effort can also be reduced.

Going through all of the steps, it is easy to see how the total effort required for a successful data warehouse project could be reduced by half with no compromise on the quality of the result.

Add to Technorati Favorites      | Blogger Comments (2) |    

When is an appliance not an appliance?

Posted at 6/03/2008 09:52:00 AM

I recently had an interesting discussion with Richard Hackathorn about data warehouse appliances and he convinced me that the illuminate database is an appliance in spite of the complete absence of hardware. In Data Warehouse Appliances: Evolution or Revolution?, he and Colin White describe the attributes of a DW appliance and the illuminate database aligns well with all points.

So is a software-only database really an appliance? Like so many things in the IT world, the answer is “it depends.” His article states “From the classical definition, an appliance is designed for a specific purpose.” And he defines that purpose, in the case of the DW appliance, as data management. His diagram specifically excludes Data integration services, BI applications and tools and Information delivery from the DW appliance functionality.

So when the illuminate profiling loader provides a report on incoming data quality, is that just data management or is it data integration services? Or perhaps one could call it information delivery?

A display of the extended metadata from the illuminate dictionary shows the range of values in a field, the total count of values and the count of unique values. If the field is numeric, it also shows min, max, average and standard deviation. An ordered display of all values and the associated count of occurrences is also available. Is that information delivery or possibly a BI tool?

The line between the appliance and the services is blurry. The features described above, and others, were included in the illuminate database system to facilitate data management. However, they are frequently used to assist in discovery and analytical processes. BI and data mining users can use these features to enhance their processes.

I would still call the illuminate database an appliance; it just has a few extra features. As the database becomes more aware of the context and meaning of the data it manages, these extra features will broaden until it will be nearly impossible to identify the end of the appliance and the start of related services. Another discussion on this topic a year from now will be very interesting.

Add to Technorati Favorites      | Blogger Comments (0) |    

Denormalizing your world

Posted at 5/29/2008 04:11:00 AM
How many tables are in your data warehouse? 500? 1000? 5000? More? One data warehouse that I know of has approximately 500,000 tables!

Is it any wonder that business people can not use a data warehouse? Even most data marts are more complex than the business activity they represent. The paradox here is that if you try to list all of the real world entities that constitute a business information environment it is very hard to find more than fifty or sixty and a good case could be made for as few as twenty.

Where then do all of these tables come from and what do they mean? Most of them are the result of optimizing data structures to improve the performance of a computer. Good computer performance is gained at the expense of good business performance—that seems like a poor tradeoff.

Normalizing data structures and metadata rather than normalizing data is the first step in making information accessible to business users. A data structure that never repeated a metadata value, including repeats in the form of contrived names to represent the same thing in different contexts and levels of aggregation, would be the most understandable structure for business information users. This means one table for all customer data regardless of source or usage, providing a single customer view. The same goes for product, service, order, invoice, etc.

In an environment where one table represents one entity the average business user could easily navigate through the tables and BI for the masses would become a more realistic concept.

Add to Technorati Favorites      | Blogger Comments (0) |    

B-Eye Podcast from TDWI

Posted at 5/20/2008 05:36:00 AM
During the whirlwind that was last week's Data Warehouse Institute (TDWI) conference in Chicago, I had the opportunity to sit down with David Loshin of the Business Intelligence Network for a rapid-fire discussion on data warehouse design, implementation and use.

In this podcast, David and I talk about how organizations can quickly deploy enterprise data warehouses, eliminate the business requirements definition phase and use all of their data for functions such as finance and performance management. We also discuss how a correlation data warehouse lowers the barrier to entry by reducing both up-front and ongoing maintenance costs, and enables business users to perform ad hoc analysis, discovery and exploration.


Add to Technorati Favorites      | Blogger Comments (0) |    

What's in a Name?

Posted at 5/15/2008 06:04:00 AM
Naming of data objects has always been an interesting issue. Long ago, programmers named files and fields arbitrarily, at their own whim. Later, most IT departments standardized naming rules. Now there are industry standard naming conventions like International Standard ISO/IEC 11179-5 and complete naming systems like the Dublin Core Metadata Initiative. Yet most business users today could not even begin to interpret the meaning of the names used for the hundreds or even thousands of tables and the multiple thousands of columns in an enterprise information management infrastructure. These names identify the information they use every day but the names are so unintelligible that no one could interpret them without a written standard.

Why should objects that are used every day by the business user have names that are incomprehensible?

The primary reason is that data objects are structured to optimize computer performance rather than to reflect the business structures. A common thing like a customer record quickly becomes split into five or ten or more tables in third normal form and business information users can no longer find anything.

The best solution to date has been to build supplementary data structures especially for user access and name all objects in those structures according to a centrally managed naming control system. The business users then only see a data mart with a small subset of all data objects and are usually able to learn the meaning of that subset of names.

A better solution for naming is to keep just one data structure for each real world object and give the attributes of that structure names that align with the business names. In this environment, every business entity is represented in a single table and each attribute occurs once and has a meaningful name. Each customer has one customer record and each product has one product record. All information access systems use these tables and access the same data using the same names. The data marts, star schemas and OLAP cubes of conventional systems are just ordinary queries in this environment.

This concept, as simple as it may seem, has a couple of issues that keep it from being implemented. The first issue is computer performance. The massive redundancy created by this structure would be crippling in an RDBMS environment. The solution for this issue is simply to use the Value Based Structure in the CDBMS. In this structure redundant data storage does not cause any problem and normalizing provides no gain.

The second problem is application design. Applications are still designed to interact with data stores the same way they did thirty years ago. New data storage capabilities provide functions that didn’t exist thirty years ago and applications should be upgraded to use the best that is available. However, that is a discussion for another day.

Add to Technorati Favorites      | Blogger Comments (0) |    

When is data mining not data mining?

Posted at 4/19/2008 07:51:00 AM
Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high-profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.

With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.

Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.

We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.

The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.

Add to Technorati Favorites      | Blogger Comments (0) |    

An inexpensive data warehouse?

Posted at 4/09/2008 08:24:00 AM
What makes a data warehouse system expensive or affordable? Let’s take a quick look at two examples; one based on an RDBMS that is free and another based on a CDBMS that sells for $100K. To keep it simple, I’ll use 100 gigabytes of raw data as the input to these systems and assume a small number of users directly on the warehouse and a larger number of users with OLAP screens, dashboards using star schemas, etc. The costs of the front end systems will be essentially the same for both systems so they are not included.

Design – RDBMS: Six calendar months, about $150,000 in labor cost.
Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.

Design – CDBMS: Three calendar days, about $750 in labor cost
Define aggregations, prepare load and aggregation processes.


Startup – RDBMS: 20 calendar days, about $15,000 in labor cost
Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.

Startup – CDBMS: Four calendar days, about $1000 in labor cost
Load data and validate for accuracy and completeness


Operation – RDBMS: One full time person, about $75,000 per year
Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.

Operation – CDBMS: Less than 1/10 of one full time person
Weekly loads.

It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.

Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.

Add to Technorati Favorites      | Blogger Comments (0) |