Multidimensional   Databases   (MOLAP)     
       
          Sample Data  

 

 

 

Introduction

 

The spreadsheet has long been the “bread and butter” of financial data analysis.  But the isolated spreadsheet raises some significant management issues.  As organizations discovered many years ago, the number of spreadsheets needed to manage business activities growths exponentially with the size of the organization: PC folders soon become littered with vast numbers of spreadsheets containing stale data of dubious provenance.

 

The solution to this problem has been to populate spreadsheets dynamically from a database that contains up to date data, and now every BI application on the market sports the inevitable “Excel Add-in”.

 

The databases that hosted spreadsheet data were at first the same relational databases into which the raw transaction data was stored by OLTP applications – relational OLAP (ROLAP).  Now while relational databases are very flexible and can be almost unlimited in their data storage capacity, they can be slow in serving data, and are ill-equipped to perform the analytic calculations needed for time series analysis and financial forecasting.

 

The multidimensional database – multidimensional OLAP (MOLAP) – arose as a way of consolidating spreadsheet data, while at the same time providing excellent performance and good analytic support (Essbase is an acronym for “Extended Spreadsheet Database”).  Performance gains come from:

 

*  An array-based storage architecture (think of a hypercube as being stored in a multidimensional C array);

 

*  Selective aggregation of the raw relational data to reduce the size of the data set that needs to be stored; and

 

*  The pre-calculation of all the summaries that can be obtained by collapsing various dimensions of the hypercube that contains the data.

 

So when an end user queries a multidimensional database the database engine effectively responds with a, “here’s a report I prepared earlier”.  By anticipating the likely range of end user queries, considerable performance gains are possible.

 

But the limitations of the multidimensional database soon became apparent.  It failed to cope with large data sets containing sparse data.  This problem has been addressed in various ways, by:

 

*  Changing the architecture so that a sparse hypercube can be stored efficiently, but still appear as a dense hypercube to applications;

 

*  Storing only the core hypercube data and dynamically calculating summaries on demand, or by calculating only a subset of the possible summaries based on storage requirements and frequency of use; and

 

*  Storing the metadata needed to allow dynamic drill through to the relational database by applications, so that only a subset of the most frequently used data needs to be placed in the hypercube (the database engine provides a query-rewrite facility, sourcing data quickly from the hypercube, where possible, and more slowly from the relational database, where not).

 

Mature multidimensional products, such as Essbase, support all this varied functionality, which allows contemporary hybrid OLAP (HOLAP) products to offer the “best of both worlds”: superior performance for most queries, together with the flexibility to drill-through to relational data sources when required.

 

 

Sample Relational Tables

 

To illustrate the issues involved in Essbase cube creation in the following articles, we’ll use a simple set of relational tables, each containing a small rowset.  The objective is to ensure that you can easily work out the results of aggregations and ensure that what Essbase is doing with the data is “what you think it should be doing with the data”.

 

We’ll use three relational tables: “Customers”, “Products”, and “Sales”:

Customer IDCustomer NameCustomer Region
1BrudenbackerUS
2SmithersUS
3ContraineEMEA
Customers

 

Product IDProduct NameProduct Category
1Disk DriveElectronics
2DrillElectronics
3Circular SawHardware
Products

 

Customer IDProduct IDSales QuantitySales Value
11102,500
1251,000
13101,000
13202,000
22204,000
23303,000
31102,500
32204,000
33101,000
Sales

 

 

What we can expect from Essbase

 

The following table shows the set of views of the relational data that are likely to be of the greatest interest to end users (the nine different views are shown in different colours):

      *
    Hypercube Views

 

Fact and Dimension Tables

 

In multidimensional terminology, the “Sales” table is called a “fact” table, and the “Customers” and “Products” tables are called dimension tables.

 

The fact table contains two very different kinds of columns: the foreign-key columns, “Customer ID” and “Product ID”, and the measure columns, “Sales Quantity” and “Sales Value”.  The foreign-key columns have no business value and just serve to allow the three tables to be joined together, as they must be to produce the end user views shown above.  The measure columns, which are almost always numeric, represent key business information, and almost always represent information that is “logically additive” – measure values make sense when added together.

 

Dimension Members and Hierarchies

 

The “Customers” and “Products” dimensions each have a distinct hierarchical structure.  The “Customer” dimension contains three base or leaf members: “Brudenbacker”, “Smithers”, and “Contraine”, corresponding to the “Customer Names” level of the “Customers” hierarchy.  But customers are also associated with geographic “Customer Regions”, which contain two members, “US” and “EMEA”.  A customer name can be associated with only one region, but a customer region can be associated with many customer names, thereby forming a natural hierarchy.  All dimensional hierarchies have a master level; in this case “All Customers”.

 

Measure values corresponding to members of a non-leaf hierarchy can be determined by aggregating or adding up the measure values for dimension members that are lower down in the hierarchy.  A multidimensional database can therefore choose to store these aggregated values or to calculate them dynamically depending on the trade-off between performance and storage. 

 

Cardinality of Members, Views, and Cubes

 

We have two dimensions, each of which contains a hierarchy with three levels.  This gives us nine independent views of the data that are likely to be of most interest to end users.  In general, the total number of views equals the product of the number of levels in each hierarchy.  These views are independent in that a user will request one view at a time (perhaps filtered or with additional calculations).  Views towards the bottom right of the table contain fewer data points and will be accessed more frequently by end users.

 

The data points corresponding to each view constitute a cube (a rectangle in two dimensions, a hypercube in four or more).  Because views differ in their importance, hypercube data also differs in its importance.  These variations mean that a multidimensional database can store the data points associated with different hypercubes in different ways.  For example, partial aggregation is a common method in which the smaller hypercubes corresponding to the most frequently requested views are pre-calculated and stored, while the data points for other views are calculated dynamically on demand.  In the case of our sample data, only data points in the “grey” area above need to be stored.  All other data points could be calculated dynamically.  Or some of the aggregated values might be pre-calculated and stored; for example, those in the lower right hand corner (the “light orange”, “orange”, “pink”, and “red” subsets).

 

The “Customers” dimension contains five members, three corresponding to level three and two corresponding to level two.  The total number of data points in a particular view equals the product of the number of members in the dimensional levels that make up that view.  For example, the “light green” view above contains six data points – level three of the “Customers” hierarchy contains three members and level two of the “Products” hierarchy contains two members.

 

The minimum number of data points equals the product of the number of members in the leaf levels of the hierarchies that comprise the view (three by three in the “grey” area above).  The maximum number of data points to be stored (assuming that all aggregates are pre-calculated) equals the products of the total numbers of the members plus one from each dimension (for the sample data, (5+1) by (5+1), or 36 data points).

 

Base Summation and Missing Values

 

The data points corresponding to the intersection of members in the leaf hierarchies are usually aggregated from raw relational data.  We can see that there are two rows for a “Customer ID” of one and a “Product ID” of three, so we have consolidated these rows by aggregating the “Sales Quantity” and “Sales Value” column values when computing the hypercube (see the calculation at the intersection of “Brudenbacker” and “Circular Saw” above).  If an end user wishes to drill down below this level then the multidimensional database cannot provide an answer.  It may however store information about how the data points were derived from the corresponding relational tables.  With this information to hand an application may be able to drill through to the relational data source in what appears to the end user to be a seamless manner (the “30 (3000)” datapoint at the intersection of “Brudenbacker” and “Circular Saw” above can be split apart into its two components).

 

Note, that there may be insufficient values to populate all the data points in the base hypercube: there is no row corresponding to the intersection of customer name “Smithers” and product name “Disk Drive”.  Usually, these missing values can be treated as zeros, though depending on the architecture they may, or may not, be stored in the hypercube.