Queries From Hell
A blog about data warehousing, correlation databases, associative and incremental queries, value-based storage, metadata, on-the-fly indexing, automatic data-driven schemas, BI tools, data mining, visual mapping, pattern recognition, and the limitations of standard SQL in answering "queries from Hell."
Or, how to discover what you don't know you don't know.

Fabulous new Technology

Posted at 1/27/2010 08:59:00 AM
Add to Technorati Favorites
We just finished some preliminary testing on the new Intel 55xx chip (this is the server version of the I7) with our database. The results were astonishing!
The Illuminate database tends to be processor and memory bound before becoming I/O bound. The chip has new technology for improving memory/processor communication so I expected that it would improve our performance. I was hoping that we would see 10% - 15% improvement on our worst case processes.
Woo Hoo! almost 50% improvement on chips running exactly the same clock speed as our control test system.
Most queries in Illuminate run in sub-second timeframes so it is hard to see any improvement there. However, loading data and aggregating large data sets take big chunks of time and the difference here was obvious.
Congratulations and thanks to Intel!

Add to Technorati Favorites      | Blogger Comments (0) |    


revisiting KISS

Posted at 11/16/2009 09:22:00 AM
Any intelligent fool can make things bigger and more complex... It takes a touch of genius --- and a lot of courage to move in the opposite direction.
Albert Einstein

I didn’t know that Albert Einstein designed databases among all of his other activities but, a good database design is a relative concept so I guess he should understand it. Databases designs have grown in complexity as fast as they have grown in size and I would say 90+% of the complexity is not due simply to the increase in size.
Data normalization has created more complexity than any other single concept in database design. The idea that data should be reduced to its lowest possible level of redundancy to optimize databases has made computerized data incomprehensible to humans. The complexity of these designs is further complicated by stovepipe systems, OLAP cubes, star schemas and other interesting approaches to structuring data. Many of these approaches individually provide an information access point that is better than simply looking at the EDW. Unfortunately, none of these actually relate to storing data for people to understand; they are all methods of optimizing the computers ability to store data and all of them add complexity.
The best database design is one that people can understand. Optimize information design for people – what an interesting idea! Our DBA’s have been very well trained in designing data structures for computers but database design for people is not a class offered in any computer science program that I have seen.
The most obvious structure for human understanding is the object role model (ORM). This concept has been around for quite a while but implementing the model on a relational database has proven to be too difficult and it has never gained a foothold. Consequently, the model has languished in obscurity even though it is a superior design concept.
The basic idea of this model is that there are objects i.e. people, products, etc. and roles those objects play relative to each other such as, people are purchasers of products, etc. There are books written about object role modeling so this is truly a basic description but it illustrates the simplicity of the concept. Model the database using the objects and processes of the business.
With a model like this, data can finally be meaningful information for the human being that needs to use it.
Of course, the old standard SQL has some serious problems in this kind of access, but that is a topic for another discussion.

Add to Technorati Favorites

Add to Technorati Favorites      | Blogger Comments (0) |    


How fast is your train?

Posted at 8/10/2009 06:24:00 AM
Train of thought analysis is the only new solution left for business problems that can not be answered with routine reports. But a new problem that was not foreseen by the clairvoyance of the requirements definition team now can’t be answered by your EDW.
Your train is stopped.
Train of thought analysis is the ability to ask a question, learn from the answer and then ask a new question based on what you learned from the first one. If every question is answered in a second or two, you can proceed through this analysis process as quickly as you can create the next question and, as a result, resolve the business issue quickly. If your competitor has to wait for a new index to be built on their EDW or a new cube to be built or a new dimension to be added, you have the competitive advantage.
Computers and databases have improved information access to the point where they can all provide nearly instant answers to all questions that use the structure that was designed into the system. There is no competitive advantage left in OLAP cubes or ad hoc query systems since nearly every company has them. The only competitive advantage that can be obtained from a computerized information system now is through responding to unplanned questions.
I would like to see a new TPC benchmark test that measures the ability of a system to respond to unplanned questions. The Transaction Processing Performance Council (TPC) could provide a complex set of data with no indication of the questions to be asked. The system under test would then be required to answer an unknown set of questions (note: questions, not SQL queries) that have answers somewhere in the data that was provided. Performance would be measured as the time it takes to work through the complete set of questions and produce the correct answers.
I’ll take on all comers in that test.

Add to Technorati Favorites      | Blogger Comments (1) |    


How many tables in an EDW?

Posted at 6/16/2009 06:37:00 AM
I have heard stories of enterprise data warehouses that have thousands of tables; I talked to one organization that claimed that they had 500,000 tables in their EDW! What on earth can require 500,000 or 50,000 or even 5,000 tables?

I make the argument that it is just bad design forced by the technical constraints of relational database managers.

If an EDW is a central repository of an organization’s information then it should be organized around the information entities of that organization. In any organization, there are real objects like customer, product, policy, account, vendor, etc that represent master data. In a large complex organization that is involved in multiple lines of business, there may be as many as 50 of this class of objects. Then there are logical objects (relationships or transactions) like orders, shipments, payments, invoices, etc. Again, in a complex organization there may be as many as 50 of these.

If we make the organization extremely complex and double the number of information entities, then there could be 200. Double it again and there are still only 400. Where do the thousands come from?

  • Normalization
  • Reporting tables
  • Summary tables
  • Application oriented tables
  • Cubes and other special structure
  • Duplication
  • Poor design
Each of these reasons is a good topic for a paper but the normalization issue is the most difficult. All well trained DBAs have been thoroughly instructed that data must be normalized to optimize the EDW. However, this is only true for record-based structures. A column-based structure gets little advantage from strict third normal form design and a value based system like iLuminate gets no benefit at all.

It is true that a table with 500 columns in an RDBMS used for analytics would be very difficult to work with. Indexing would be a major problem (imagine managing 100 indices on one table for just 20% indexed access) and I/O loads would be extreme. In a value-based structure there is no limit on the number of columns that will affect performance; indexing is always universal. The only limit on the number of columns is the limit of a human’s ability to comprehend the entire set of information. A table with 5,000 columns would probably be incomprehensible for people.

Likewise, redundant values are not much of a problem for column-based systems and no problem at all for the value-based structure. In the value structure, each value is stored once regardless of the number of uses or the location of use. Empty or null values also cause no problem in either column or value structures. Simply the absence of a value defines a null, so large, sparsely filled records are not a problem as they are in record-based systems.

Based on this, I propose that in a column or value structure, adherence to third normal form is unnecessary and a 5,000 table EDW has a serious design problem.

Add to Technorati Favorites      | Blogger Comments (0) |    


"Easy to Consume" BI

Posted at 5/26/2009 06:17:00 AM
Recently Claudia Imhoff wrote in her blog that BI should be “easy to consume,” and I most certainly agree. As she accurately points out, there are many forms of BI that are “easy to use,” but easy to consume is a different thing. We at illuminate have been talking about “time to analytics” and “time to answer.” Time to answer is defined as the time it takes from submitting the first question to the time the business issue is resolved. The ability to select and run the query that you want with one click is easy to use. Working your way through a problem with the computer helping you as you go is easy to consume.

The incremental query process is built around the idea that a person may not know the right question to ask until several learning questions have been answered. This “learn as you go” process is what we call an incremental query and what I would propose is exactly "easy to consume" BI.

Add to Technorati Favorites      | Blogger Comments (0) |    


Chaos

Posted at 5/18/2009 07:08:00 AM
If you never read the original Chaos report from the Standish Group you should stop looking at this screen and immediately read it. When you have finished, come back here and read on.

For those of you who read the report fifteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.

Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!

One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the reduction of project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.

What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.

Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design and optimizing indexing strategy.

Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that a moderate success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.

Labels:


Add to Technorati Favorites      | Blogger Comments (4) |    


Colin White Podcast on The Correlation Database

Posted at 3/13/2009 12:54:00 PM
At the recent TDWI event in Las Vegas, Colin White, president and founder of BI Research and BeyeNETWORK blogger, sat down with Andy Fletcher of illuminate Solutions to talk about the correlation database (CDBMS). A widely recognized expert in database technology and frequent speaker at industry conferences, Mr. White found the correlation database technology "fascinating and different" compared with relational and column databases.

Among the high points of the discussion:
  • There is much less effort prior to load with the CDBMS than with relational or column databases, as the need for physical and logical design is completely eliminated.

  • A CDBMS data warehouse can be depoyed in a fraction of the time of other technologies.

  • Data is "all indexed, all the time, in all dimensions," enabling rapid performance even when doing train-of-thought analysis—asking questions unplanned at the outset of data exploration.

  • A CDBMS is often deployed to supplement an existing enterprise data warehouse, to enable ad hoc analysis on a subset of data.

  • Business analysts are the key users, pursuing in-depth analysis rather than standard reporting.
Listen to the complete podcast.

Add to Technorati Favorites      | Blogger Comments (2) |