Friday, June 21, 2013

SCD Types and How Many ways to develope the SCD's




1. What type of SCD you have used so far?

Slowly Changing Dimensions (SCD) is dimensions that have data that slowly changes.
For example, we may have a Dimension in our database that tracks the sales records of your company's salesmen and when sales person is transferred from one regional office to another.
Dealing with these issues involves SCD management methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
-The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered.
-The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming we won't ever need to know how it used to be misspelled in the past.)
-The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
-The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:
- The Type 4 method is usually just referred to as using "history tables", where one table keeps the current data and an additional table is used to keep a record of some or all changes.
- The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6).  It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.

2. How did you implement Type2 SCD in Datastage?

The following steps are required to implement SCD type2 in Datastage.
1) Need to take a snap shot of the WareHouse final target dimensional table and store in a DataSet or Temp Table or directly from the Data warehouse target table itself.
2) Retrieve the new records from the source (Source table/ Flat file/ view/ or any other source) and lookup the snap shot with the help of lookup or Join stage and based on Primary Key
3) Allow to pass both the values for these particular SCD columns (the columns that are affected with change) with a different column name (like SalesTerritory Source and SalesTerritoryLkp.
4) In the next step, in a transformer, compare these two values for every single primary key using stage variables. If these values are different then close the previous record by choosing SalesTerritoryLkp value (coming from target snap shot) for SalesTerritory column ,updating the CURRENT_RECORD='N'  and setting the END_DATE as Current Time ( when the records are being processed) in first link  from the transformer. Simultaneously with the new value (Sales_TeritorySource ) in the SCD columns insert One more records into the target table with START_DATE as current time, CURRENT_RECORD=‘Y’ and setting the END_DATE as Null.
5) To maintain the uniqueness of these two records generate a surrogate key and use this as the primary key in the target for dimensional table


SCD Type 1
Type 1 Slowly Changing Dimension data warehouse architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
SCD 1 implementation in Datastage

The job described and depicted below shows how to implement SCD Type 1 in Datastage. It is one of many possible designs which can implement this dimension. The example is based on the customers load into a data warehouse


Datastage SCD1 job design





The most important facts and stages of the CUST_SCD2 job processing:
           There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
           A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD1 Transformer mapping


           A T002 transformer updates old values with new ones without concerning about the overwritten data.
SCD1 Transformer update old entries



           The database is updated in a target ODBC stage (with the 'update existing rows' update action)


SCD Type 2

Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact table usually stores dimension key and version number.
SCD 2 implementation in Datastage



The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension.
For this example, we will use a table with customers data (it's name is D_CUSTOMER_SCD2) which has the following structure and data:





Datastage SCD2 job design



The most important facts and stages of the CUST_SCD2 job processing:
·  The dimension table with customers is refreshed daily and one of the data sources is a text file. For the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:






·  There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
·  A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer
                                       
                                       
·  A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes through only records that differ.
SCD 2 check discrepancies transformer

                                       

·  A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is updated with current indictator flag set to no and the new record is inserted with current indictator flag set to yes, increased record version by 1 and the current date.
SCD 2 insert-update record transformer



·  ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed where part of an SQL statement.
·  ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and the key column is CUST_ID.

                                   

SCD Type 3

In the Type 3 Slowly Changing Dimension only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex.

To implement SCD Type 3 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to update the old value with a new one and update the previous value field.

SCD Type 4

The Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.

To implement SCD Type 4 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to insert an old value into the destionation stage connected to the historical data table (D_CUSTOMER_HIST for example) and update the old value with a new one.
SCD 2 implementation in Datastage Parallel Jobs 7.5X2.

Change Capture Stage:
            “It is processing stage, that it capture whether a record from table is copy or edited or insert or to delete by keeping the code column name”.

Simple example of change capture:


 





Text Box:  
 
                                                            Change_capture                     
                                   

                                                                       
Properties of Change Capture:
Ø  Change keys
o   Key = EID (key column name)
§  Sort order = ascending order
Ø  Change valves
o   Values =? \\ ENAME
o   Values =? \\ ADD
Ø  Options
o   Change mode = (explicit keys & values / explicit keys, values)
o   Drop output for copy = (false/ true)  “false – default ”
o   Drop output for delete = (false/ true) “false – default”
o   Drop output for edit = (false/ true) “false – default”
o   Drop output for insert = (false/ true) “false – default”
§  Copy code = 0
§  Delete code = 2
§  Edit code = 3
§  Insert code = 1
§  Code column name = <column name>
o   Log statistics = (false/ true) “false – default”

Change Apply Stage:
            “It is processing stage, that it applies the changes of records of a table”.

Text Box:  
 
                                                                                   








 




                                                            Change Apply            
                                   
Properties of Change Apply:
Ø  Change keys
o   Key = EID
§  Sort order = ascending order
Ø  Options
o   Change mode = explicit key & values
o   Check value columns on delete = (false/ true) “true - default”
o   Log statistics = false
o   Code column name = <column name> \\ change capture and this has to be SAME for apply operations


SCD II in version 7.5.x2
Design of that
                                                                                                ESDATE=current date ()
                                                                                                EEDATE= “9999-12-31”
Text Box:   

                                    Key=EID                               ACF= “Y”
Text Box:  Text Box:                          -option: e k & v
Before.txt                                                                 c=3          








 
Text Box:  Text Box:  Text Box:                                                                            c=all
                                                     
after.txt
Text Box:  Text Box:  Text Box:  Text Box:                                                                          key= EID                                                                                                     -option: e k & v                                  
before.txt                                                             

                                                                                                                ESDATE- current date ()
EEDATE- if c=3 then DFJD(JDFD(CD())-1)
                                                                                                                                    else EEDATE = “9999-12-31”                                                                                                                      ACF- if(c=3) then “N” else “Y”      



Example table of SCD data:
SID
CID
CNAME
ADD
AF
ESDATE
EEDATE
RV
UID
1
11
A
HYD
N
03-06-06
29-11-10
1
1
2
22
B
SEC
N
03-06-06
07-09-07
1
2
3
33
C
DEL
Y
03-06-06
9999-12-31
1
3
4
22
B
DEL
N
08-09-07
29-11-10
2
2
5
44
D
MCI
Y
08-09-07
9999-12-31
1
5
6
11
A
GDK
Y
30-11-10
9999-12-31
2
1
7
22
B
RAJ
Y
30-11-10
9999-12-31
3
2
8
55
E
CUL
Y
30-11-10
9999-12-31
1
8

Table: this table is describing the SCD six types and the description is shown above.
DAY 44
SCD I & SCD II (Design and Properties)


SCD – I: Type1 (Design and Properties):
Transfer job                                                                             Load job
Text Box:  Text Box:  10,20,30








 
Text Box:  Text Box:  OE_DIM            before              fact           DS_FACT          10, 20, 40                                            10, 20, 40







Text Box:

 
                                                                                    DS_TRG_DIM                                     OE_UPSERT
Text Box:  Text Box:  10, 20, 40           After                     dim   10,20, 40                                             -update and insert

OE_SRC                                                DS_TRG_DIM

In oracle we have to create table1 and table2,
Table1:
Ø  Create table SRC(SNO number, SNAME varchar2(25)); 
o   Insert into src values(111, ‘naveen’);
o   Insert into src values(222, ‘munna’);
o   Insert into src values(333, ‘kumar’);
Table2:
Ø  Create table DIM(SKID number, SNO number, SNAME varchar2(25));
o   No records to display;

Processes of transform job SCD1:

Step 1: Load plug-in Meta data from oracle of before and after data as shown in the above links that coming from different sources.

Step 2: “SCD1 properties”

fact
 
Fast path 1 of 5:         select output link as:

Fast path 2 of 5:         navigating the key column value between before and after tables


Fast path 3 of 5:         selecting source type and source name.






 
Source type:                                        source name:

NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.

Fast path 4 of 5:         select output in DIM.



For path 5 of 5:           setting the output paths to FACT data set.
Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert  \\ if key column value is already.
   -insert n update        \\ if key column value is new.

Here SCD I result is for the below input





SCD – II:  (Design and Properties):

Text Box:  Transfer job                                                                             Load job
Text Box:  
 
Text Box:  10,20,30
                 before
Text Box:  OE_DIM                                    fact           DS_FACT          10, 20, 20, 30, 40                         10, 20, 20, 30, 40







Text Box:

 
                                                                                    DS_TRG_DIM                                     OE_UPSERT
Text Box:  Text Box:  10, 20, 40           After                     dim   10, 20, 20, 30, 40                         -update and insert

OE_SRC                                                DS_TRG_DIM


Step 1: in transformer stage:
Adding some columns to the to before table – to covert EEDATE and ESDATE columns into time stamp transformer stage to perform SCD II







In TX properties:



In SCD II properties:

fact
 
Fast path 1 of 5:         select output link as:


Fast path 2 of 5:         navigating the key column value between before and after tables

Fast path 3 of 5:         selecting source type and source name.






 
Source type:                                        source name:

NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.



Fast path 4 of 5:         select output in DIM.

For path 5 of 5:           setting the output paths to FACT data set.

Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert  \\ if key column value is already.
   -insert n update        \\ if key column value is new.





Here SCD II result is for the below input

6 comments:

  1. datastage is the best option who ever looking for good option for career.

    Filter Stage

    ReplyDelete
  2. I'm unable to view any of the images here. :(

    ReplyDelete
  3. Thanks & will try to post DWH relevant topics going forward

    ReplyDelete
  4. Your blog is in a convincing manner, thanks for sharing such an information with lots of your effort and time
    datastage online training
    datastage online training India
    datastage online training Hyderabad

    ReplyDelete
  5. I'm truly grateful for this article. This is exactly what I've been looking for. Turkey visa for Mexico is a visa for Mexican people. With the help of this necessary document all the Mexicans visit Turkey

    ReplyDelete