Tuesday, May 5, 2009

Mapping Optimization Techniques

1.     Reduce the number of transformations. There is always overhead involved in moving data between transformations.

2.     Consider more shared memory for large number of transformations. Session shared memory between 12MB and 40MB should suffice.

3.     Calculate once, use many times.

o    Avoid calculating or testing the same value over and over.

o    Calculate it once in an expression, and set a True/False flag.

o    Within an expression, use variable ports to calculate a value than can be used multiple times within that transformation.

4.     Only connect what is used.

o    Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.

o    This is also helpful for maintenance. If a transformation needs to be reconnected,  it is best to only have necessary ports set as input and output to reconnect.

o    In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which  cuts down on the amount of cache necessary and thereby improves performance.

5.     Watch the data types.

o    The engine automatically converts compatible types.

o    Sometimes data conversion is excessive. Data types are automatically converted when types are different between connected ports.  Minimize data type changes between transformations by planning data flow prior to developing the mapping.

6.     Facilitate reuse.

o    Plan for reusable transformations upfront.

o    Use variables. Use both mapping variables as well as ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times

o    Use mapplets to encapsulate multiple reusable transformations.

o    Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.

7.     Only manipulate data that needs to be moved and transformed.

o    Reduce the number of non-essential records that are passed through the entire mapping.

o    Use active transformations that reduce the number of records as early in the mapping as possible (i.e., placing filters, aggregators as close to source as possible).

o    Select appropriate driving/master table while using joins. The table with the lesser number of rows should be the driving/master table for a faster join.

8.     Utilize single-pass reads.

o    Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.

o    Remove or reduce field-level stored procedures.

o    If you use field-level stored procedures, the PowerCenter server has to make a call to that stored procedure for every row, slowing performance.

What are the types of Facts?

Types of Facts

There are three types of facts:

  • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
  • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
  • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date

Store

Product

Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.

Say we are a bank with the following fact table:

Date

Account

Current_Balance

Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

  • Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
  • Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

What is a Degenerated Dimension(DD)?

An item that is in the fact table but is stripped off of its description, because the description belongs in dimension table, is referred to as Degenerated Dimension.  Since it looks like dimension, but is really in fact table and has been degenerated of its description, hence is called degenerated dimension. Now coming to the slowly changing dimensions (SCD) and Slowly Growing Dimensions (SGD):  I would like to classify them to be more of an attributes of dimensions its self.    Although other might disagree to this view but Slowly Changing Dimensions are basically those dimensions whose key value will remain static but description might change over the period of time.  For example, the product id in a companies, product line might remain the same, but the description might change from time to time, hence, product dimension is called slowly changing dimension.    Lets consider a customer dimension, which will have a unique customer id but the customer name (company name) might change periodically due to buy out / acquisitions, Hence, slowly changing dimension, as customer number is static but customer name is changing,  However, on the other hand the company will add more customers to its existing list of customers and it is highly unlikely that the company will acquire astronomical number of customer over night (wouldn’t the company CEO love that) hence, the customer dimension is both a Slowly changing as well as slowly growing dimension. Another definition of Degenerated Dimension: A fact or "degenerate" dimension has a one-to-one relationship with the facts, such as a transaction number. The degenerate dimension is not used for analysis as such, but rather for identification. For example, to locate a specific transaction or to identify the transactions that make up an aggregate cell.

What is a Junk Dimension(JD)?

When you consolidate lots of small dimensions and instead of having 100s of small dimensions, that will have few records in them, cluttering your database with these mini ‘identifier’ tables, all records from all these small dimension tables are loaded into ONE dimension table and we call this dimension table Junk dimension table.  (Since we are storing all the junk in this one table) For example: a company might have handful of manufacture plants, handful of order types, and so on, so forth, and we can consolidate them in one dimension table called junked dimension table

What is a Conformed Dimensions (CD)?

These dimensions are something that is built once in your model and can be reused multiple times with different fact tables.   For example, consider a model containing multiple fact tables, representing different data marts.  Now look for a dimension that is common to these facts tables.  In this example let’s consider that the product dimension is common and hence can be reused by creating short cuts and joining the different fact tables.Some of the examples are time dimension, customer dimensions, product dimension. 

When do u we use dynamic cache and when do we use static cache in an connected and unconnected lookup transformation?

We use dynamic cache only for connected lookup. We use dynamic cache to check whether the record already exists in the target table are not. And depending on that, we insert,update or delete the records using update strategy. Static cache is the default cache in both connected and unconnected. If u select static cache on lookup table in infa, it own't update the cache and the row in the cache remain constant. We use this to check the results and also to update slowly changing records 

Why sorter transformation is an active transformation?

This is type of active transformation which is responsible for sorting the data either in the ascending order or descending order according to the key specifier. the port on which the sorting takes place is called as sortkeyport. As we have a DISTINCT option to process the distinct records, we can call the Sorter Transformation as an ACTIVE transformation

Properties:

if u select distinct eliminate duplicates 

case sensitive valid for strings to sort the data
null treated low null values are given least priority