Horizontal   (Column)   Partitioning     
       
          Mapping Horizontally Striped Measure Columns  

 

 

 

Introduction

 

Horizontal (column) partitioning occurs when the measure data in a table is divided into two or more components by column, with each component being stored in a separate table.  This architecture can arise when two different applications maintain different measures that logically belong in the same table. 

 

For example, let’s suppose that a supply chain manager extracts a subset of table columns from a “suppliers” table to an Excel spreadsheet, and then adds a dozen columns – characterising the suppliers according to a variety of metrics – to the data in the spreadsheet.  For reporting purposes the data in the database is needed as the “suppliers” table joins to others containing, say, order fulfilment details, for example.  But the data in the Excel spreadsheet is also needed for report construction.  We could, of course, merge the Excel data into the “suppliers” table in the database, but this might not be worthwhile if the reporting requirement was a temporary one, or it might not be practical if the Excel spreadsheet was updated at intervals by some other software application that, say, gathered statistics from an external source.

 

 

Source Tables

 

Let’s start with a FK hierarchy diagram of the tables in our source database:

*
Source Database FK Hierarchy Diagram

 

In this article we’ll focus on the PRODUCTS and SALES tables:

*
Products and Sales Tables

 

These tables contain the following simple rowsets:

*
Products and Sales Rowsets

 

Now, we’ll split the SALES table into two tables: SALES_QUANTITIES that contains the QUANTITY_SOLD measure column and SALES_UNIT_PRICES that contains the UNIT_PRICE measure column:

*
Sales Quantities and Sales Unit Prices Tables

 

The data in the SALES table is divided as follows:

*
Horizontally Partitioned Sales Data

 

Each table has the same number of rows as table SALES, the same foreign-key columns, but only one of the pair of measure columns.

 

In the case of this example, the tables will be stored in the same database, but they could also be stored in different databases (of course, if they are stored in the same database it’s far better to combine them at source, rather than using OBIEE to do so during SQL generation).

 

An end user issuing a query from Answers or any other third-party client that connects to the BI Server will want to see a single SALES table.  So, in the rest of this article we’ll examine the repository mappings that we can use to achieve this objective.

 

 

Multiple Fact Table Design

 

Import the PRODUCTS, SALES_QUANTITIES, and SALES_UNIT_PRICES tables into the Physical layer of the repository (eliminating the columns that are not of interest):

*
Physical Layer

 

Construct a Physical layer diagram to reflect the foreign-key relationships between the three tables:

*
Physical Layer Diagram

 

Create a new business model and drag across the three tables.  Run the Rename Wizard to tidy up the Business layer names; add “sum” aggregation rules to the two measure columns, QUANTITY_SOLD and UNIT_PRICE; and create a hierarchy for the “Products” dimension:

*
Business Model Layer

 

Check the Business layer logical diagram to make sure that the correct links were created automatically when the columns were dragged across:

*
Business Model Diagram

 

Drag the Business Model across to the Presentation layer and delete the unwanted identifiers (in practice we would also tidy up the folder structure):

*
Presentation Layer

 

Check the repository for consistency and then save it to disk:

*
Completed Repository

 

Finally, modify file “NQSConfig.INI” to reference the new repository and start up the BI Server service.

 

 

Inconsistent Code Generation

 

A Correct Answer

 

Log into Answers and create a request based on the new catalog:

*
Table Spanning Request

 

which leads to the following correct result:

*
Table Spanning Request Result

 

Some Overly Complex SQL

 

So far so good.  Let’s have a look at the SQL that the BI Server is sending out the back door.  Navigate to “Settings => Administration => Manage Sessions => View Log”.  Two separate queries have been sent to the database:

*
Query Directed to Sales Quantities Table

 

and

*
Query Directed to Sales Unit Prices Table

 

Now, these queries seem rather more complicated than they need to be!

 

Some slightly better SQL

 

Let’s just make a small change to the repository.  Delete the “Product Name” key from the “Productsdim” hierarchy.  The hierarchy doesn’t contain any information that is relevant to query optimization, so a “well designed” code generation engine should produce the same SQL queries as before.  However, we now get two different queries:

*
Simplified Pair of SQL Queries

 

Much simpler this time, but we haven’t made any changes to the repository that merit changing the SQL queries (note that both queries contain the same tables and the same columns).

 

If these tables had resided in separate databases, then the last pair of SQL queries is about the best that we can expect.  With separate databases, the BI Server has to bring the rowsets back from each source and to perform the outer join between the rowsets itself.  If the databases have query rewrite facilities supported by summary tables / materialized views, then each query will be performed efficiently at source, and the overall efficiency with which the request can be processed will depend on the number of products and hence on the size of the rowsets returned to the BI Server.

 

SQL that’s better still

 

In the case of the example, both tables reside in the same database, and therefore the BI Server might reasonably be expected to issue a single query that evaluated the two subqueries, as above, and then performed the outer join in memory within the database server, returning only a single rowset as the final result.  However, the fact that it hasn’t done so with the two versions of the repository we have used to date, doesn’t mean that the BI Server never gets things right.  Make another small, and insignificant, change to the repository – one that provides no useful information for the purposes of query optimization, run the request again, and we get the following SQL:

*
At Last - A Single SQL Query!

 

This time the BI Server has issued a single query so that the full outer join takes place in the database, where it rightly belongs.

 

As these examples illustrate, one of the worst features of the BI Server is its inability to consistently generate the same code for repositories with very similar architectures.  It seems that when it comes to the efficiency of the generated SQL code the BI Server just “tosses a virtual coin”.  Every time we make even a trivial change to the repository – one with no functional significance – then we have to check that the SQL queries being produced by all the most frequently issued requests are still reasonably efficient.  Even worse, the problem we face is that if we adjust the repository to increase the efficiency of one request, the BI Server may well decrease the efficiency of another.  While the ability of the BI Server to integrate complex data sets is certainly welcome, its inability to do so consistently is most certainly not.  In many cases the only practical solution is to rework the data sources into a set of simple star schemas, so that the BI Server doesn’t have to “think too hard”.

 

 

Opaque View Design

 

In the case of horizontal partitioned tables, a single database, and a request that uses all the measures from the partitioned tables we can force the BI Server to do our bidding.  Instead of importing the partitioned tables, we can create an opaque view, or a virtual table, that contains the optimized SQL query that we wish to send to the database.

 

We could use the last SQL statement shown in the previous section.  However, we can do better than that.  That SQL statement contains a full outer join.  The reason why this is so is that the BI Server has to assume that one table may contain “Product Id” values that are not present in the other table and vice versa.  However, we have more information.  We know that each table will contain an identical set of rows as far as the “Product Id” column is concerned, so that we can use a more efficient inner join.  The last SQL statement shown in the previous section is also less efficient than is possible since the join takes place on the “Product Name” column.  We can use the “Product Id” column instead.

 

For this version of the repository just import the PRODUCTS table.  Then create an opaque view (a physical table with a table type of “Select”) and enter the query as the “Default Initialization String”:

*
Opaque View - Query Definition

 

Then add columns to match those listed in the query:

*
Opaque View - Column Definition

 

The Physical layer diagram simplifies to:

*
Physical Layer Diagram

 

with a similarly structured diagram in the Business layer.  The simplified repository is as follows:

*
Completed Repository

 

Now if we issue the same request as before (taking “Product Name” from the “Consolidated Sales” folder) we get the SQL:

*
Optimized SQL Query

 

which is just what we wanted.  So we now control the SQL sent to the database.

 

Note, that if we issue the same request as before (taking “Product Name” from the “Products” folder this time), then a single SQL query is still issued, as before, but now it is less efficient as it contains a redundant join:

*
SQL Query with Redundant Join

 

However, for frequently used queries we can always structure the catalog so that the preferred source for the “Product Name” item is used.

 

We could deploy the opaque view to the database to create a view, but if we have the opportunity to modify the database, then it would be far better to create a single SALES table as the data source.  If we’re operating in an OLTP environment and the two tables must remain separate – perhaps they are populated by separate applications – then we can always add database triggers to the component tables to post any changes to the single SALES table that contains the merged data.