How expensive is free software?
Posted at 6/25/2008 09:10:00 AM
I was just looking at a white paper from Greenplum describing how fast, easy and inexpensive a data warehouse is because it runs on commodity hardware. I have seen countless comments about how inexpensive the Microsoft data warehouse infrastructure is, and even more about the low cost of using open source software. It would be easy to believe that for the price of a few computers and a little software you could have a data warehouse running next week.
At a recent TDWI conference in Germany, Larissa Moss delivered a presentation defining the development steps and the required team members in a data warehouse implementation project. The presentation did a beautiful job of defining what has to be done and who has to do it. It involves a minimum team of five or six full time people and a number of part time specialists. The entire project is divided into sixteen steps with each step having multiple activities.
While many organizations will not follow every step and activity of her plan, it is an excellent definition of what really should be done to ensure quality and minimize total cost of ownership. So, using her project definition as a guideline, it is easy to see that the cost of the database software and the server hardware represents only a small portion of the total cost of a data warehouse.
The use of commodity hardware and open source software can reduce the total cost of the project by a couple of percentage points at best. Reducing the amount of staff time required can reduce the cost of the project by half. Some of the tasks identified in her presentation won’t change with any technology, but there are some large improvements possible that can result in cost reductions that substantially exceed the total cost of the hardware and software.
For example, the Data Analysis step includes six activities. Each of these activities involves a number of people and takes a significant amount of time. Using free (or nearly free) relational database software and/or inexpensive commodity hardware, all six activities are still required to have a reasonable prospect of a successful project. If another type of database can reduce the time and expense of these activities by just ten percent, the entire cost of the software and hardware can be offset.
Using the CDBMS structure, two of the activities, “Refine logical data model” and “Expand enterprise logical data model”, are no longer needed. The other activities, like “Analyze source data quality”, will remain essentially the same regardless of database structure. Assuming the activities take roughly equal effort, removing two of the six will reduce the time and cost of that one step alone by about one third.
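The arithmetic behind these two claims can be sketched in a few lines of Python. Every figure below is a hypothetical assumption chosen only to illustrate the reasoning: the total budget, the 2% share for hardware and software (standing in for "a couple of percentage points"), the 20% share for the Data Analysis step, and the idea that all six activities cost about the same. None of these numbers comes from the presentation.

```python
# Hypothetical figures, for illustration only; none come from the presentation.
TOTAL_PROJECT_COST = 1_000_000   # assumed total project budget
HW_SW_SHARE = 0.02               # assumed: "a couple of percentage points"
DATA_ANALYSIS_SHARE = 0.20       # assumed share of budget for the Data Analysis step
ACTIVITIES_IN_STEP = 6           # from the text
ACTIVITIES_REMOVED = 2           # from the text


def step_saving(total_cost, step_share, fraction_saved):
    """Money saved when one step's cost shrinks by fraction_saved."""
    return total_cost * step_share * fraction_saved


hw_sw_cost = TOTAL_PROJECT_COST * HW_SW_SHARE

# Case 1: the step becomes ten percent cheaper.
ten_percent_saving = step_saving(TOTAL_PROJECT_COST, DATA_ANALYSIS_SHARE, 0.10)

# Case 2: two of six equally weighted activities disappear (about one third).
fraction_removed = ACTIVITIES_REMOVED / ACTIVITIES_IN_STEP
one_third_saving = step_saving(TOTAL_PROJECT_COST, DATA_ANALYSIS_SHARE, fraction_removed)

print(f"Hardware and software cost:     {hw_sw_cost:>10,.0f}")
print(f"Saving from a 10% reduction:    {ten_percent_saving:>10,.0f}")
print(f"Saving from removing 2 of 6:    {one_third_saving:>10,.0f}")
```

Under these assumed shares, the ten percent reduction saves about as much as the hardware and software cost, and removing two activities saves roughly three times that. Different shares will shift the break-even point, so the sketch is a way to test your own numbers, not a forecast.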
As another example, take a look at one of the activities in the first step, “Cost Justification”. If the cost of a project is a few hundred dollars, there is no justification needed other than someone saying it will help them with their job. If the cost of a project is a few tens of millions, it will require extremely detailed and robust justification and approval at the top executive level. In between, the justification effort should grow roughly linearly with the expected cost. If the Data Analysis step is reduced by one third, and other steps are reduced by a substantial amount, the cost of the entire project is reduced and the cost justification time and effort can also be reduced.
Going through all of the steps, it is easy to see how the total effort required for a successful data warehouse project could be reduced by half with no compromise on the quality of the result.
When is an appliance not an appliance?
Posted at 6/03/2008 09:52:00 AM

I recently had an interesting discussion with Richard Hackathorn about data warehouse appliances, and he convinced me that the illuminate database is an appliance in spite of the complete absence of hardware. In “Data Warehouse Appliances: Evolution or Revolution?”, he and Colin White describe the attributes of a DW appliance, and the illuminate database aligns well with all of them.
So is a software-only database really an appliance? Like so many things in the IT world, the answer is “it depends.” His article states, “From the classical definition, an appliance is designed for a specific purpose.” He defines that purpose, in the case of the DW appliance, as data management. His diagram specifically excludes data integration services, BI applications and tools, and information delivery from the DW appliance functionality.
So when the illuminate profiling loader provides a report on incoming data quality, is that just data management or is it data integration services? Or perhaps one could call it information delivery?
A display of the extended metadata from the illuminate dictionary shows the range of values in a field, the total count of values and the count of unique values. If the field is numeric, it also shows min, max, average and standard deviation. An ordered display of all values and the associated count of occurrences is also available. Is that information delivery or possibly a BI tool?
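To make that list of statistics concrete, here is a minimal Python sketch of a column profile that computes the same kinds of figures. It is not the illuminate dictionary or its API; the function name `profile_column` and the output keys are hypothetical, the values in a field are assumed to share one sortable type, and the choice of population standard deviation is an assumption, since the post does not say which form is used.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values):
    """Return simple profile statistics for one field (hypothetical helper)."""
    counts = Counter(values)
    ordered = sorted(counts.items())  # (value, occurrences), ordered by value
    profile = {
        "total_count": len(values),
        "unique_count": len(counts),
        "value_counts": ordered,
        # Range of values: lowest and highest value present, if any.
        "range": (ordered[0][0], ordered[-1][0]) if ordered else None,
    }
    # Numeric-only statistics; bool is excluded even though it subclasses int.
    is_numeric = bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        profile.update(
            min=min(values),
            max=max(values),
            average=mean(values),
            std_dev=pstdev(values),  # population form, an assumption
        )
    return profile


ages = profile_column([3, 1, 3, 5])
cities = profile_column(["Oslo", "Bonn", "Oslo"])
print(ages)
print(cities)
```

The same output serves both purposes discussed here: a loader can use it to flag unexpected values, and an analyst can read the ordered value counts as a quick frequency distribution.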
The line between the appliance and the services is blurry. The features described above, and others, were included in the illuminate database system to facilitate data management. However, they are frequently used to assist in discovery and analytical processes. BI and data mining users can use these features to enhance their processes.
I would still call the illuminate database an appliance; it just has a few extra features. As the database becomes more aware of the context and meaning of the data it manages, these extra features will broaden until it is nearly impossible to tell where the appliance ends and the related services begin. Another discussion on this topic a year from now will be very interesting.