1. What type of SCD you have used so far?
Slowly Changing Dimensions (SCD) is dimensions that have
data that slowly changes.
For example, we may have a Dimension in our database that
tracks the sales records of your company's salesmen and when sales person is
transferred from one regional office to another.
Dealing with these issues involves SCD management
methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also
sometimes called Hybrid SCDs.
-The Type 0 method is a passive approach to
managing dimension value changes, in which no action is taken. Values remain as
they were at the time the dimension record was first entered.
-The Type 1 methodology overwrites old data
with new data, and therefore does not track historical data at all. This is
most appropriate when correcting certain types of data errors, such as the
spelling of a name. (Assuming we won't ever need to know how it used to be
misspelled in the past.)
-The Type 2 method tracks historical data by
creating multiple records in the dimensional tables with separate keys. With
Type 2, we have unlimited history preservation as a new record is inserted each
time a change is made.
-The Type 3 method tracks changes using
separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has
limited history preservation, as it's limited to the number of columns we
designate for storing historical data. Where the original table structure in
Type 1 and Type 2 was very similar, Type 3 will add additional columns to the
tables:
- The Type 4 method is usually just referred
to as using "history tables", where one table keeps the current data
and an additional table is used to keep a record of some or all changes.
- The Type 6 method is one that combines
the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). It is not frequently used because it has the
potential to complicate end user access, but has some advantages over the other
approaches especially when techniques are employed to mitigate the downstream
complexity.
2. How did you implement Type2 SCD in Datastage?
The following steps are required to implement SCD type2 in
Datastage.
1) Need to take a snap shot of the WareHouse final target
dimensional table and store in a DataSet or Temp Table or directly from the
Data warehouse target table itself.
2) Retrieve the new records from the source (Source table/
Flat file/ view/ or any other source) and lookup the snap shot with the help of
lookup or Join stage and based on Primary Key
3) Allow to pass both the values for these particular SCD columns
(the columns that are affected with change) with a different column name (like
SalesTerritory Source and SalesTerritoryLkp.
4) In the next step, in a transformer, compare these two
values for every single primary key using stage variables. If these values are
different then close the previous record by choosing SalesTerritoryLkp value
(coming from target snap shot) for SalesTerritory column ,updating the
CURRENT_RECORD='N' and setting the
END_DATE as Current Time ( when the records are being processed) in first
link from the transformer.
Simultaneously with the new value (Sales_TeritorySource ) in the SCD columns
insert One more records into the target table with START_DATE as current time,
CURRENT_RECORD=‘Y’ and setting the END_DATE as Null.
5) To maintain the uniqueness of these two records generate
a surrogate key and use this as the primary key in the target for dimensional
table
SCD Type 1
Type
1 Slowly Changing Dimension data warehouse architecture applies when no
history is kept in the database. The new, changed data simply overwrites old
entries. This approach is used quite often with data which change over the time
and it is caused by correcting data quality errors (misspells, data
consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
SCD 1 implementation in
Datastage
The job described and depicted below shows how to implement
SCD Type 1 in Datastage. It is one of many possible designs which can implement
this dimension. The example is based on the customers load into a data
warehouse
Datastage SCD1 job design
The most important facts and stages of the CUST_SCD2 job processing:
• There is a
hashed file (Hash_NewCust) which handles a lookup of the new data coming from
the text file.
• A
T001_Lookups transformer does a lookup into a hashed file and maps new and old
values to separate columns.
SCD1 Transformer mapping
• A T002
transformer updates old values with new ones without concerning about the
overwritten data.
SCD1 Transformer update old entries
• The
database is updated in a target ODBC stage (with the 'update existing rows'
update action)
SCD Type 2
Slowly changing dimension Type 2 is a model where the whole
history is stored in the database. An additional dimension record is created
and the segmenting between the old record values and the new (current) value is
easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very
often used in that dimension and the fact table usually stores dimension key
and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement
SCD Type 2 in Datastage. It is one of many possible designs which can implement
this dimension.
For this example, we will use a table with customers data
(it's name is D_CUSTOMER_SCD2) which has the following structure and data:
The most important facts and stages of the CUST_SCD2 job processing:
· The dimension table with customers is
refreshed daily and one of the data sources is a text file. For the purpose of
this example the CUST_ID=ETIMAA5 differs from the one stored in the database
and it is the only record with changed data. It has the following structure and
data:
· There is a hashed file (Hash_NewCust) which
handles a lookup of the new data coming from the text file.
· A T001_Lookups transformer does a lookup into
a hashed file and maps new and old values to separate columns.
· A T002_Check_Discrepacies_exist transformer
compares old and new values of records and passes through only records that
differ.
SCD 2 check discrepancies transformer
· A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is updated with current indictator flag set to no and the new record is inserted with current indictator flag set to yes, increased record version by 1 and the current date.
· ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed where part of an SQL statement.
· ODBC Insert stage (O_DW_Customers_SCD2_Ins) -
insert action 'insert rows without clearing' and the key column is CUST_ID.
SCD Type 3
In the Type 3 Slowly Changing Dimension only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex.
To implement SCD Type 3 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to update the old value with a new one and update the previous value field.
SCD Type 4
The Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.To implement SCD Type 4 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to insert an old value into the destionation stage connected to the historical data table (D_CUSTOMER_HIST for example) and update the old value with a new one.
SCD 2 implementation in
Datastage Parallel Jobs 7.5X2.
Change Capture Stage:
“It is processing
stage, that it capture whether a record from table is copy or edited or insert
or to delete by keeping the code column name”.
Simple example of change capture:
Change_capture
Properties of Change Capture:
Ø Change
keys
o Key
= EID (key column name)
§ Sort
order = ascending order
Ø Change
valves
o Values
=? \\ ENAME
o Values
=? \\ ADD
Ø Options
o Change
mode = (explicit keys & values / explicit keys, values)
o Drop
output for copy = (false/ true) “false –
default ”
o Drop
output for delete = (false/ true) “false – default”
o Drop
output for edit = (false/ true) “false – default”
o Drop
output for insert = (false/ true) “false – default”
§ Copy
code = 0
§ Delete
code = 2
§ Edit
code = 3
§ Insert
code = 1
§ Code
column name = <column name>
o Log
statistics = (false/ true) “false – default”
“It
is processing stage, that it applies the changes of records of a table”.
Change
Apply
Properties of Change Apply:
Ø Change
keys
o Key
= EID
§ Sort
order = ascending order
Ø Options
o Change
mode = explicit key & values
o Check
value columns on delete = (false/ true) “true - default”
o Log
statistics = false
o Code
column name = <column name> \\ change capture and this has to be SAME for
apply operations
SCD II in version 7.5.x2
Design of that
ESDATE=current
date ()
EEDATE=
“9999-12-31”
Key=EID ACF= “Y”
-option: e k & v
Before.txt c=3
c=all
after.txt
key= EID -option: e k & v
before.txt
ESDATE-
current date ()
EEDATE- if
c=3 then DFJD(JDFD(CD())-1)
else EEDATE = “9999-12-31” ACF- if(c=3) then “N” else “Y”
Example table of SCD data:
SID
|
CID
|
CNAME
|
ADD
|
AF
|
ESDATE
|
EEDATE
|
RV
|
UID
|
1
|
11
|
A
|
HYD
|
N
|
03-06-06
|
29-11-10
|
1
|
1
|
2
|
22
|
B
|
SEC
|
N
|
03-06-06
|
07-09-07
|
1
|
2
|
3
|
33
|
C
|
DEL
|
Y
|
03-06-06
|
9999-12-31
|
1
|
3
|
4
|
22
|
B
|
DEL
|
N
|
08-09-07
|
29-11-10
|
2
|
2
|
5
|
44
|
D
|
MCI
|
Y
|
08-09-07
|
9999-12-31
|
1
|
5
|
6
|
11
|
A
|
GDK
|
Y
|
30-11-10
|
9999-12-31
|
2
|
1
|
7
|
22
|
B
|
RAJ
|
Y
|
30-11-10
|
9999-12-31
|
3
|
2
|
8
|
55
|
E
|
CUL
|
Y
|
30-11-10
|
9999-12-31
|
1
|
8
|
Table: this table is describing the
SCD six types and the description is shown above.
DAY 44
SCD I & SCD II (Design and Properties)
SCD – I: Type1 (Design
and Properties):
Transfer job Load
job
10,20,30
OE_DIM before fact DS_FACT 10, 20, 40 10, 20, 40
DS_TRG_DIM OE_UPSERT
10, 20, 40 After dim 10,20, 40 -update
and insert
OE_SRC DS_TRG_DIM
In oracle we have to create table1
and table2,
Table1:
Ø Create
table SRC(SNO number, SNAME
varchar2(25));
o Insert
into src values(111, ‘naveen’);
o Insert
into src values(222, ‘munna’);
o Insert
into src values(333, ‘kumar’);
Table2:
Ø Create
table DIM(SKID number, SNO number,
SNAME varchar2(25));
o No
records to display;
Processes of transform job SCD1:
Step 1: Load plug-in Meta
data from oracle of before and after data as shown in the above links that
coming from different sources.
Step 2: “SCD1 properties”
|
Fast path 2 of 5: navigating
the key column value between before and after tables
Fast path 3 of 5: selecting
source type and source name.
Source type: source name:
NOTE: for every time of
running the program we should empty the source name i.e., empty.txt, else
surrogate key will continue with last stored value.
Fast path 4 of 5: select
output in DIM.
For path 5 of 5: setting the output paths to FACT data
set.
Step 3: In the Next job,
i.e. in load job if we change or edit in the source table and when you are
loading into oracle we must change the write method = upsert in that we have
two options they are, -update n insert \\
if key column value is already.
-insert n update \\ if key column value is new.
Here SCD I result is for the below input
SCD – II: (Design and
Properties):
Transfer job Load
job
10,20,30
before
OE_DIM fact
DS_FACT 10, 20, 20, 30, 40 10, 20, 20, 30, 40
DS_TRG_DIM OE_UPSERT
10, 20, 40 After dim 10, 20, 20, 30, 40 -update
and insert
OE_SRC DS_TRG_DIM
Step 1: in transformer stage:
Adding some columns to the to
before table – to covert EEDATE and ESDATE columns into time stamp transformer
stage to perform SCD II
In TX properties:
In SCD II properties:
|
Fast path 2 of 5: navigating
the key column value between before and after tables
Fast path 3 of 5: selecting
source type and source name.
Source type: source name:
NOTE: for every time of running
the program we should empty the source name i.e., empty.txt, else surrogate key
will continue with last stored value.
Fast path 4 of 5: select
output in DIM.
For path 5 of 5: setting the output paths to FACT data
set.
Step 3: In the Next job,
i.e. in load job if we change or edit in the source table and when you are
loading into oracle we must change the write method = upsert in that we have
two options they are, -update n insert \\
if key column value is already.
-insert n update \\ if key column value is new.
Here SCD II result is for the below input
datastage is the best option who ever looking for good option for career.
ReplyDeleteFilter Stage
I'm unable to view any of the images here. :(
ReplyDeleteThanks & will try to post DWH relevant topics going forward
ReplyDeleteYour blog is in a convincing manner, thanks for sharing such an information with lots of your effort and time
ReplyDeletedatastage online training
datastage online training India
datastage online training Hyderabad
I'm truly grateful for this article. This is exactly what I've been looking for. Turkey visa for Mexico is a visa for Mexican people. With the help of this necessary document all the Mexicans visit Turkey
ReplyDeletenft nasıl alınır
ReplyDeletelisans satın al
en son çıkan perde modelleri
özel ambulans
uc satın al
minecraft premium
en son çıkan perde modelleri
yurtdışı kargo