Thursday, August 11, 2016

DataStage Interview Questions and Answers

1) How can we remove duplicates in a file using UNIX?

Answer: You can use UNIQ to remove duplicates from a file, but it only collapses identical consecutive records.
Ex: if you have records such as
1 vishwa
1 vishwa
2 Ravi
3 Naveen
1 vishwa
and you use the UNIQ command, you will get vishwa, Ravi, Naveen and vishwa (the last vishwa stays because it is not adjacent to the first two).
So perform a sort before using UNIQ to remove all the duplicates.
(Or)
$ sort -u filename    # sorts the data and removes the duplicate lines from the output
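For example, assuming the data is in a file called emp.txt (the file name is only an illustration):

$ sort emp.txt | uniq     # uniq alone only drops adjacent duplicate lines, so sort first
$ sort -u emp.txt         # or let sort drop the duplicate lines itself

Both commands print the de-duplicated data; redirect the output with > if you want to keep it in a new file.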
2) What is the structure of a configuration file? Also: I have two Oracle databases; while loading data from source to target the job takes 30 minutes, but I want it to load in less time. How?
3) I have 2 jobs. I want to run job B only after job A has run 3 times. How can I achieve this through DataStage? (IBM)
Answer
Hi friend, we can handle this scenario with a sequencer. Put the first job A in a loop for 3 iterations. After successful completion of the 3 runs, run the second job B.

Please correct me if I am wrong.
4) Hi, I am getting data from a sequential file with a column like: A&B, C&D, A&E, X&Y&Z, v&x. But I need the target column as: A and B, C and D, A and E, X and Y and Z, v and x.
Answer 1
In the sequential file stage, use the UNIX command sed s/&/and/g as a filter.
OR
Use the DataStage function
convert('&', 'and', expression)
OR
Use an expression: create an output column and in its derivation write
replacestr(1, columnname, '&', 'and')
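For example (src.txt is only an illustrative file name; note the spaces around "and" in the pattern, otherwise the output would be "AandB"):

$ cat src.txt
A&B
C&D
$ sed 's/&/ and /g' src.txt
A and B
C and D

The same sed command can also be placed in the Filter option of the Sequential File stage.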
5) Sequential file: I have one record and I want 100 records in the target. How can we do that? Please explain which stages are needed and what the logic is. (TCS)



6) A flat file contains 200 records. I want to load the first 50 records the first time the job runs, the second 50 records the second time, and so on. How would you develop the job? Please give the steps.
Answer 1
First use a sequence generator on the flat file, then take a rank transformation and keep the top 50 records; run the session 4 times and all 200 records are loaded.

7) If the lookup is giving poor performance, what do you do? 2. I have EMP and DEPT tables; I want the output ename, empno, dname, max(sal). 3. Can anyone tell me the project flow in a real-time scenario? Please mail any queries and answers to yaparala@gmail.com. (iGATE)
Answer 1
Hi,
I do not know whether this is right or wrong, but in my view: a lookup has two modes, cache and non-cache. Non-cache mode gives lower performance than cache mode, because the lookup keeps the connection to the lookup table open until the comparison is completed. In cache mode the lookup table data is loaded into memory and the connection to the lookup table is released, which gives higher performance.
For the second question, as far as I know we can get the result by using joins.

8) I have a file with empid and empname, and I want to load these two fields along with sal into my target. 1) The salary must be the same for all the records. 2) I want to pass the salary at run time. (TCS)



9) The source file has 5 records, but while moving into the target it should become 10 records. How? (IBM)
Answer 2
It is not a big issue: create one extra node, select the Entire partition type and run the job.
OR
Create an additional instance of the target.

10) Can you explain banking domain projects? In banking domain projects how many dimension tables and how many fact tables are there, and how do we load the sources and targets at the mapping level? Please explain with one example.
Answer 1
Read the Ralph Kimball Data Warehouse Toolkit. It describes a number of domains with their business processes, dimensions and facts.

11) How can cookies be tested in QTP?
Answer 2
Cookies can be tested using the WebUtil object.
Eg:
WebUtil.AddCookie
WebUtil.DeleteCookie
WebUtil.GetCookies
OR
WebUtil.DeleteCookies

1.     What are the production issues in DataStage?
2.     1. Identify and discuss the problems which occur during data quality verification for single-source and multi-source problems. 2. Testing has a very important role in a DWH; in the ETL testing phase, how can we perform integration testing and regression testing? 3. What are the prerequisites of system testing and regression testing in the ETL phase?

3.     Can anyone please send me the latest informatica dumps (power center 8.1.1) for all the three exams? My mail id is yokilamai@gmail.com
4.     How to find out the Informatica console password?
5.     Can anyone send me the code for the below: 1) remove '-' from the telephone number "323-237-2323" so the output is "3232372323"; 2) extract only "COMPANY.COM" from the email field "TEST@COMPANY.COM".
Sgr Ans: Use the Field function in the Transformer stage, e.g. field(link.RECORD,'-',1), field(link.RECORD,'-',2), field(link.RECORD,'-',3), and concatenate the fields. 2) Use substring in the transformer: link.record[6,16]. (A UNIX sketch of the same logic is shown after this list.)
6.     We have 1 source table containing 100 records. We have to transfer the first set of 10 records (1-10) to one target table and the next set of 10 records (11-20) to another target table, and continue like that up to the 100th record.
7.     5. How many repositories can we create in Informatica?
6. What is a surrogate key?
7. What is the difference between a mapplet and a reusable transformation?
8. What is aggregate awareness?
9. Explain a reference cursor.
10. What are parallel queries and query hints?
11. DWH architecture?
12. What are cursors?
13. Advantages of denormalized data?
14. What is an operational data source (ODS)?
15. What is metadata and the system catalog?
16. What is a factless fact schema?
17. What is a conformed dimension?
18. What is the capacity of a PowerCube?
19. Difference between PowerPlay Transformer and PowerPlay reports?
20. What is an IQD file?
21. What is the Cognos script editor?
22. What is the difference between macros and prompts?
23. What is the PowerPlay plug-in?
24. Which kind of index is preferred in a DWH?
25. What is hash partitioning?
26. What is a DTM session?
27. How can you define a transformation? What are the different types of transformations in Informatica?
28. What is a mapplet?
29. What is a query panel?
30. What is a lookup function? What is the default transformation for the lookup function?
31. What is the difference between a connected lookup and an unconnected lookup?
32. What is a staging area?
33. What is data merging, data cleansing and sampling?
34. What is update strategy and what are the options for update strategy?
35. OLAP architecture?
36. What is a subject area?
37. Why do we use a DSS database for OLAP tools?
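A UNIX sketch of question 5 above (the DataStage Field/substring functions already shown do the same job inside a transformer):

$ echo "323-237-2323" | tr -d '-'
3232372323
$ echo "TEST@COMPANY.COM" | cut -d'@' -f2
COMPANY.COM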
8.     In DataStage 8.1 the surrogate key stage generates sequence numbers, but there is a limit: I want to reset the sequence after it reaches some highest value, i.e. the output should be 1 2 3 4 5 ... 10000 and then 1 2 3 ... 10000 again.
9.     Difference between a DWH and a data mart; difference between a star schema and a snowflake schema; difference between a fact table and a dimension table; what are the contents of a requirement specification; what documents are needed for testing; draw your project architecture.
10. What are the important characteristics that we need to consider while developing ETL code?
11. Hi, please try to reply with an answer; my mail id is raghu.1028@gmail.com. The source has millions of records but there are three target tables. As per the user, the first record needs to be inserted into the first target, the second record into the second target and the third into the third target, and the process repeats in that order. How?
12. What is the main difference between the Peek stage and the Copy stage?
13. Give details on the sales project: how many dimension and fact tables are used in the project, briefly, and how are the four fact tables handled in the project?
14. How to load session statistics (workflow name, target rows, rejected rows, session start time, session end time, ...) into a database table? Another question: I want to pass values from an Oracle table or flat file into a mapping parameter/variable. What is the method for that? Please provide the solution step by step.
15. My source sequential file has one column, col1, with values 1 2 3 4 5 6 7 8 9, and I have 4 targets t1, t2, t3, t4. How can I distribute the records across them? (POLARIS)
Ans: It is a very simple task: select the partition type as round robin and run the job; the rows are distributed across the four targets in turn.

16. How to retrieve max(sal), deptno and empno in DataStage? (POLARIS)

17. Hi, my source is:
empno, deptno, salary
1,  10,  3.5
2,  20,  8
2,  10,  4.5
1,  30,  5
3,  10,  6
3,  20,  4
1,  20,  9
Then the target should be in the form below:

empno, max(salary), min(salary), deptno
1,  9,  3.5,  20
2,  8,  4.5,  20
3,  6,  4,    10
Can anyone give the data flow in DataStage for the above scenario?
ANS: Group the rows on empno (a Column Generator can be used to create a grouping column if needed) and then use an Aggregator stage to calculate max(salary) and min(salary).




18) Where do we use the Column Generator stage in a real-time scenario?
Answer
#1
The Column Generator stage adds columns to the incoming data and generates mock data for these columns for each data row processed. The new data set is then output.

Say we have two files X and Y and we are doing a funnel. File X has only 2 columns and file Y has 3 columns. While doing a funnel the metadata must be the same, so add one more column to file X and use the Column Generator stage to populate some mock data in that third column.
OR
2. ANS: If we have two tables A and B, where table A has two columns and table B has 3 columns, and the metadata of the two tables has to be the same, we can use the Column Generator stage to get the full set of columns in the target.
OR
3. ANS: Suppose we need to join two tables A and B but we do not have a common key column in both tables. We can use a Column Generator to generate a DUMMY (key) column and perform the join.

OR
Use it when you want to add one more column in the target while populating data from source to target.
19) When do we use a connected lookup and an unconnected lookup?
Answer
# 3
In DataStage, the equivalent concepts are the Normal (connected) lookup and the Sparse (unconnected) lookup.

Use a sparse lookup when the reference data volume is huge and the primary (stream) volume is very small: the key columns of the input records are passed into the WHERE clause of the reference table extraction query and matched directly against the reference table.

20) A+B+C+D = D+E+F+G = G+H+I = 17, where each letter represents a number from 1 to 9. Find out what the letters D and G represent if A = 4.
Answer
# 1
Adding the three expressions: (A+B+C+D)+(D+E+F+G)+(G+H+I) = 51. The letters A to I use each digit 1 to 9 once, and 1+2+...+9 = 45, so 45 + D + G = 51, i.e. D + G = 6.
From D+E+F+G = 17 it follows that E + F = 11.
--
Case 1) D = 1, G = 5 (A = 4): then H + I = 17 - G = 12, which forces (3,9); but E + F = 11 is then impossible from the remaining digits {2,6,7,8}, so this case is wrong.
--
Case 2) D = 5, G = 1 (A = 4): then H + I = 17 - G = 16, i.e. (7,9); E + F = 11, i.e. (3,8); and B + C = 17 - 4 - 5 = 8, i.e. (2,6).

Solution:
4+2+6+5 = 5+3+8+1 = 1+7+9 = 17, so D = 5 and G = 1.
21) Which partitioning is used in Join, Merge and Lookup?
Answer
Join and Merge need key-based partitioning (hash or modulus) on the join/merge key, with the data sorted, so that rows with the same key land on the same node.
OR
The Join stage typically uses hash or modulus partitioning, Merge uses the same key-based partitioning (or Auto), and Lookup uses Entire partitioning on the reference link.
22) How can you call shell scripts / UNIX commands in a job sequence? (IBM)
Answer 1
Hi,
by using the Routine activity you can access UNIX / shell scripts.
OR
There are two scenarios where you might want to call a script.

Scenario 1 (a dependency exists between the script and a job): a job has to be executed first, then the script has to run, and only on completion of the script should the second job be invoked. In this case develop a sequencer job where the first Job activity invokes the first job, then an Execute Command activity calls the script you want to invoke by typing "sh <script name>" in the command property of the activity, and then another Job activity calls the second job.

Scenario 2 (the script and job are independent): in this case, in your parallel job (say job1), under job properties you can find "After-job subroutine"; select "ExecSH" and pass the script name you want to execute. Once job1 completes, the script gets invoked. The job succeeding job1 (say job2) does not wait for the execution of the script.

23) If I make any changes in the parallel job, do I need to implement the changes in the sequencer job, or will the changes be reflected automatically? (IBM)
Answer
# 1
Hi, this is Kiran.
Any changes you make in your parallel job will automatically be reflected in the sequencer job.
OR
As to the "why": the sequence does not contain the underlying code of the job; it picks up the code from the job whose reference is given in the sequence.
OR
Once the modifications are done, simply compile the job and run the sequencer; there is no need to change anything in the sequencer.


24) How can you execute an SQL query through UNIX? What is the primary key for a dimension table? What is the primary key for a fact table? (TCS)
Answer
# 1 The primary key of a dimension table is its own primary key (usually the surrogate key).
The primary key of a fact table is the combination of foreign keys derived from the dimension tables.

OR
Connect to SQL*Plus from UNIX by typing sqlplus,
user name: user_name@database
password:
then type
@<path of the file containing the SQL query>/<query file name>.sql
e.g. @path/file.sql
and press Enter.
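A minimal sketch of the same thing from a shell script (user, password, database and file names are only placeholders):

$ sqlplus -s scott/tiger@orcl @/home/etl/query.sql     # run a saved query file

$ sqlplus -s scott/tiger@orcl <<EOF                    # or pass the SQL inline
select count(*) from emp;
exit;
EOF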
25) What is a time dimension and how do we populate the time dimension? (TCS)
Answer
# 1
Hi,
every DWH has a time dimension; you can load the time dimension through a PL/SQL script.
26) How can I abort the job if I get more than 100 errors in the job log? (ACCENTURE)
Answer
You can use the Transformer stage: each output link has an "Abort after rows" option; set the value and the job aborts once that count is reached.
OR
Compile the job and then run it; in the job run options select "Abort job after" and give 100, so the job aborts once there are more than 100 warnings.
27) What is the difference between a junk dimension and a conformed dimension? (Polaris)
Answer
# 1
A junk dimension is a dimension made up of miscellaneous flag/indicator (e.g. boolean) values that are not used in the main mappings. A conformed dimension is a dimension that is shared by two or more fact tables; conformed dimensions reduce the number of tables.

OR
JUNK DIMENSION
A dimension which cannot itself be used to describe the facts is known as a junk dimension (it provides additional information to the main dimension).
Ex: customer address

CONFORMED DIMENSION
A dimension table which can be shared by multiple fact tables is known as a conformed dimension.
Ex: time dimension
28) How many nodes are supported by one CPU in parallel jobs?
Answer
# 2
A node here is a logical processing unit, not a physical processor. Parallel jobs support uniprocessor, SMP, MPP and clustered systems, so the number of nodes you can define depends on the hardware architecture.
29) I am running a job with 1000 records. If the job aborts after loading 400 records into the target, I want the next run to continue loading from record 401. How can we do it? This scenario is not for a sequence job, only within the job, e.g. Seq file --> Transformer --> Dataset.
Answer
# 5
By using a Lookup stage we can get the answer. There are two tables: the 1000-record (source) table and the 400-record (target) table.

Take the source table as the primary input and the 400-record table as the reference input of the lookup, and pass through only the rows that do not find a match:

                  reference table
                       .
                       .
                       .
source............. look-up......... target
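If the load can also be restarted from the file side, a crude UNIX alternative (a sketch, assuming you know 400 rows were already loaded and the data is in src.txt) is to feed only the remaining rows to the job:

$ tail -n +401 src.txt > remaining.txt     # start output at line 401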
30) How to remove duplicates in the Transformer stage by using stage variables? Give one example.
Answer
# 2
Sort the data on the key column, then define the stage variables in this order (they are evaluated top to bottom):

StageVar1 = StageVar3                                  (the previous row's key)
StageVar2 = If column = StageVar1 Then 0 Else 1        (1 means a new key value)
StageVar3 = column                                     (remember the current key for the next row)

Put StageVar2 as the constraint on the output link, so only the first row of each key value passes to the target.
OR
If you want to remove duplicates before the Transformer stage, use hash partitioning on the key (with sorting and the unique option) so the duplicates can be dropped easily.
31) How to convert rows into columns?
Answer
# 1
By using the Pivot stage we can convert rows into columns in DataStage. In Informatica the Normalizer transformation is used for this.
32) what is factless fact table?
Answer
A fact table without any facts is known as factless fact
table. Always a Fact should be a Numeric value but every
numeric value need not be a fact.
33) How to cleanse data?
Answer
# 1 Data cleansing means converting data that arrives in inconsistent formats into a single consistent format. This can be performed in the Transformer stage.
OR
It means removing unwanted data (bad records or NULL values), finding inconsistent data and making it consistent.

Example:

Loc
---
Hyd
Hyderabad
hyde

After cleansing:

Loc
---
Hyderabad
Hyderabad
Hyderabad
OR
Data cleansing also covers removing unwanted spaces; by using the LTrim and RTrim functions we can remove unwanted spaces.
34) How to calculate sum(sal) without using the Aggregator stage?
Answer
# 2
In an Expression transformation add 2 new ports:
1) a variable port that keeps a running total of sal,
2) an output port that returns this variable.
35) I have one table with one column containing three rows, e.g. 1, 1, 2 (or SHIVA, RAMU, MADHU). I want the first two rows concatenated into one target row and the remaining row as another row, i.e. "SHIVA RAMU" in one row and "MADHU" in another. How is it possible? (Capgemini)
Answer
# 1
seqfile -> sort -> transformer -> remove duplicates -> target
Input file:
col1
1
1
2
In the Sort stage:
key = col1
allow duplicates = true
create key change column = true
In the Transformer:
create a stage variable:
if KeyChange = 1 then col1 else stagevariable : col1
drag col1 into the Transformer and in its derivation put only the stage variable.
In the Remove Duplicates stage:
key = col1
select "last" as the duplicate to retain,
and we get the answer:
col1
11
2
36) Explain how to create SCD-2 in DataStage 7.5x2. Please explain with 4 or 5 example records and the job design. (IBM)
Answer
# 1

Dim_table ---\
              Change Capture --> Transformer --> target dimension
Src_table ---/
37) Can we build half of a project in parallel jobs and half in server jobs? (Infosys)
Answer 1
No.
38) 1. What is a staging area? What is a stage variable?
2. This is my source:
source: id, name        target: id, name
        100, murty              100, madan
        100, madan
        100, saran
We have three duplicate records for the id column; how can we get the required source record into the target?
Ans: A staging area is a temporary storage location. In real-time environments we get the data from the client in the form of files, so we store it in a temporary location and perform all the validations (data validations, field validations and file validations) there; afterwards we load it into the warehouse.

A stage variable holds an expression assigned in the Transformer stage; it can be reused only within that Transformer stage, which is why it is called a stage variable.

For the second question, take name as the key column, so we get the required source record.
39) What UNIX questions are asked for DataStage?
Answer
1. How to kill a process: kill
2. Process status: ps
3. Searching with regular expressions: grep
4. Executing a shell script: ./scriptname
5. Running jobs: dsjob -run
6. Listing files: ls -lrt

These are the basic questions.
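For example (the process ids, file paths and job names below are only placeholders):

$ ps -ef | grep osh                 # look for running parallel engine processes
$ kill -9 12345                     # kill the process with PID 12345
$ grep -i "fatal" job_run.log       # search a log for errors
$ ls -lrt /data/landing             # list files, newest last
$ sh load_stage.sh                  # run a shell script
$ dsjob -run myproject myjob        # start a DataStage job from the command line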
40) I have source data like
empno, ename
11, aa
12, bb

and I want output like
empno, ename
11, aa
12, bb
11, aa
12, bb
(TCS)
Ans: Read the source in a Sequential File stage, send it to two Copy stages, and use a Funnel stage to funnel the data from the two Copy stages into the output file; the output is then the required duplicated set.
41) Tell me about DataStage triggers. (HP)
Answer 1: DataStage triggers are used to control job activities in job sequences. In job sequences the stages are called activities and the links are called triggers.
Triggers are of 3 types: 1) conditional, 2) unconditional, 3) otherwise.
So by using triggers we can control the job activities. Usually large DWH jobs depend on 1 or 2 other jobs to get their data, so by using these triggers we can set the condition that a later job runs only once the previous one has completed successfully.
42) I have a scenario like this: seq file --> transformer --> sort stage --> dataset. In this job which partitioning technique are you using and why? (CTS)
Answer 1
Hash or modulus partitioning. When using sort operations it is always better to go for key-based partitioning techniques.
43) What is a fast changing dimension? (TCS)
Answer
# 1
Almost all projects have a customer dimension, which is a fast growing dimension.

A fast (rapidly) changing dimension is one whose attribute values change frequently.
44) How can we do null handling in sequential files? (Reliance)
Answer
# 1
While importing the data we can specify Nullable = Yes in the column definitions tab,
or
in the Format tab we can specify the final delimiter as 'null'.
45) What is the best stage in DataStage parallel jobs to use for a full outer join, and why? (Virtusa)
Answer
# 2
The Join stage is the best, because the data is read from the buffer area.
Or
The Join stage supports all the joins (inner join, left outer join, right outer join, full outer join), whereas Lookup supports only inner and left outer joins.
46) What is the difference between SQL*Loader and OCI in DataStage? (CSC)
Answer
# 1 SQL*Loader (sqlldr) is a bulk loader utility from Oracle for loading data into Oracle tables, whereas the DataStage OCI (Oracle Call Interface) stage is a passive stage that connects to any Oracle database, collects the necessary data, passes it on to DataStage ETL for processing and transformation where necessary, and loads it into the target table. The DataStage OCI stage offers input and output links for conveniently connecting to Oracle source and target databases.
OR
2. In addition to the above: through sqlldr we load bulk data, for which we use the Oracle bulk load stage. The data is loaded through a script, i.e. as a package, but there is a restriction: when we bulk load, the target table should not have an index. We generally bulk load data into table partitions, e.g. weekly or monthly; it is faster than OCI.

Through OCI we load data row by row in a sequential manner; we can set the buffer size as required to read or write data, and there is no need to remove the index from the table.


47) I have a sequential file that contains 10 million records to load into a target database, and the load takes a lot of time. How do I do performance tuning in that situation? (TCS)
Answer
# 1
1) In the Sequential File stage properties, under options, increase the number of readers.

2) In the DataStage Administrator, the relevant environment variables (for example the reader settings) can also be tuned.

48) How can a string be converted to a date by using only the Sequential File stage, without using other stages? (TCS)
Answer 1
I think this is not possible, but we can load the date into the sequential file stage as a string.
49) How can we create a surrogate key in the Transformer stage? I want it in parallel mode.
 Answer
In transformer Properties, there is a tab called “Surrogate Key”. Here we can define it.
We can select from
a.         Flat file : State file which holds the SK values and generates SK values.
b.         DB Sequence: Database sequence object to use to generate the SK values.
50) How to retrieve max(sal), deptno and empno in DataStage?
Answer
We can retrieve max(sal) per deptno with the Aggregator stage:
set group = deptno
aggregation type = calculation
column for calculation = sal
maximum value output column = max_sal
51) How to find the number of records in a sequential file itself?
Answer
In the Sequential File stage go to the options and select the row count property; it is then updated automatically.
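From the UNIX side the same count can be checked with wc (the path is only an example):

$ wc -l < /data/input/emp.txt     # prints the number of lines; subtract 1 if there is a header row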
52) How can you join a flat file and Oracle as sources?
Answer
First option: populate the flat file data into the Oracle DB, then either

1. write a join query in the source Oracle stage and link it to the target, or

2. take two Oracle sources, say S1 and S2, use a Join stage and link it to the target.

Or the direct method:

3. take the flat file and the Oracle source, use a Join stage and link it to the target.
53) Types of errors in DataStage?
Answer 1
source file not found
null value populated into a not-nullable column
died with signal 11 (e.g. due to the server going down)
fatal errors
NLS warnings
metadata mismatch
data type mismatch
parallel loading not enabled
field size different
data type size differs between source and target
column mismatch
file opening error
mutex error
process time out ...

54) What is pre-load in a hashed file?
Answer
You can speed up the read operation on a reference hashed file link by pre-loading the hashed file into memory.

55) What is an audit table? Have you used an audit table in your project?
Answer
An audit table is essentially a log table; every job should have an audit table.
56) How can you remove the duplicates in a sequential file?
Answer
4
In the source Sequential File stage go to Output -> Properties and under options set reject mode = Output.

This sends the duplicate records to the output; this is the format we are using in my current project.
OR
Sort the incoming data using a Sort stage and use a Remove Duplicates stage on the key columns. This will remove the duplicates in your source file.
57) What are the stages you would use in implementing SCD-2 with date fields?
Answer
# 1
Hi, this is Kiran.
For SCD type 2 we use the Change Capture and Transformer stages.
Date fields are used at the target level as Eff_start_date and Eff_end_date:

Eff_start_date = current date
Eff_end_date of the expired row = current date - 1.
58) Where do you use UNIX commands as an ETL developer? (Wipro)
Answer
# 2
You can call them in the following places in ETL (DataStage):

1 - Sequential File stage (the stage's Filter command option).
2 - Before and after job subroutines in the job properties tab.
3 - Before and after stage subroutines in the Transformer stage.
4 - Job sequences, using the Execute Command activity and the Routine activity.
5 - DataStage routines.
6 - The routines ExecSH and ExecSHSilent.
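For example, a typical Filter command in a Sequential File stage (the patterns are only illustrations); the stage pipes the file through the command before the rows enter the job:

grep -v '^$'          # drop blank lines
cut -d',' -f1,2,5     # or keep only selected comma-separated fields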
59) eno   ename    esal    acct1   acct2   amount1   amount2
100   suresh   10000   sbi1    sbi2    1000      2000
This is our source data; I would like to display it like this:
eno   ename    esal    acct    amount
100   suresh   10000   sbi1    1000
100   suresh   10000   sbi2    2000

Ans:
Use the Pivot stage.
60) In my source I have 10 records in a single column, but I want to split those records into 5 sequential files, each containing 2 records. Can anybody help me?
Answer
# 4 In a Transformer we have the system variable @INROWNUM, which holds the row number of each incoming row. Using this variable in the constraint of each output link we can split the records across the files.
OR
Take a sequential file as source and a sequential file as target; in the target stage add the file property 5 times (one file per output) and run the job.
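The same split can also be done at the UNIX level before the job runs (a sketch, assuming the 10 records are in src.txt; the numeric-suffix option -d is GNU split):

$ split -l 2 -d src.txt part_     # creates part_00 ... part_04 with 2 records each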
61) In my source I have 10 records but I want 11 records in the target. How can I achieve this in server jobs? (IBM)
Answer
Using the Pivot stage you can get the extra target record; otherwise, in the Transformer stage you can also add one record through the output column derivations.





 1. What is the flow of loading data into fact & dimensional tables?

A) Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional table. Consists of fields with numeric values.

Dimension table - Table with Unique Primary Key.

Load - Data should be first loaded into dimensional table. Based on the primary key values in dimensional table, the data should be loaded into Fact table.

2.  What is the default cache size? How do you change the cache size if needed?

A. Default cache size is 256 MB. We can increase it by going into Datastage Administrator and selecting the Tunable Tab and specify the cache size over there.

3.  What are types of Hashed File?
A)  Hashed File is classified broadly into 2 types.
a)  Static - Sub divided into 17 types based on Primary Key Pattern.
b)  Dynamic - sub divided into 2 types
i)  Generic  ii) Specific.

Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify the following (although all of these have default values).

4. What does a Config File in parallel extender consist of?

A) Config file consists of the following. a) Number of Processes or Nodes.

b) Actual Disk Storage Location.

5. What is Modulus and Splitting in a Dynamic Hashed File?

A. In a dynamic hashed file the size of the file keeps changing as data is added and removed. The modulus is the number of groups in the file; when the file grows, groups are split ("splitting") and the modulus increases, and when it shrinks, groups are merged and the modulus decreases.

6. What are Stage Variables, Derivations and Constraints?

A. Stage Variable - an intermediate processing variable that retains its value during the read and does not pass a value into the target column.

Derivation - an expression that specifies the value to be passed on to the target column. Constraint - a condition, evaluating to true or false, that controls the flow of data along a link.

7. Types of views in Datastage Director?
There are 3 types of views in Datastage Director

a)  Job View - Dates of Jobs Compiled.
b)  Log View - Status of Job last run
c) Status View - Warning Messages, Event Messages, Program Generated Messages.

8. Types of Parallel Processing?
A) Parallel Processing is broadly classified into 2 types.

a)  SMP - Symmetrical Multi Processing.
b)  MPP - Massive Parallel Processing.

9. Orchestrate Vs Datastage Parallel Extender?

A) Orchestrate itself is an ETL tool with extensive parallel processing capabilities and running on UNIX platform. Datastage used Orchestrate with Datastage XE (Beta version of 6.0) to incorporate the parallel processing capabilities. Now Datastage has purchased Orchestrate and integrated it with Datastage XE and released a new version Datastage 6.0 i.e Parallel Extender.

10. Importance of Surrogate Key in Data warehousing?

A) Surrogate Key is a Primary Key for a Dimension table. Most importance of using it is it is independent of underlying database. i.e. Surrogate Key is not affected by the changes going on with a database.

11.    How to run a shell script within the scope of a DataStage job?
A) By using the "ExecSH" command in the Before/After job properties.

12.  How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to
yyyy-dd-mm?
A)  We use a) the "Iconv" function - converts to the internal date format.
b)  the "Oconv" function - converts to the external (output) format.

The function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")
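The same conversion can be sanity-checked from UNIX with awk (a sketch; note the target order asked for here is year-day-month):

$ echo "08/11/2016" | awk -F'/' '{print $3"-"$2"-"$1}'
2016-11-08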

13 How do you execute datastage job from command line prompt?

A) Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname
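For example (project and job names are placeholders):

$ dsjob -run -jobstatus dstage_proj load_customers
$ echo $?      # with -jobstatus the exit code reflects the job status, e.g. 1 = ran OK, 2 = ran with warnings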

14. Functionality of Link Partitioner and Link Collector?

Link Partitioner: It actually splits data into various partitions or data flows using various partition methods.

Link Collector: It collects the data coming from the partitions and merges it into a single data flow.

15. Types of Dimensional Modeling?
A) Dimensional modeling is again sub divided into 2 types.

a)  Star Schema - Simple & Much Faster. Denormalized form.
b)  Snowflake Schema - Complex with more Granularity. More normalized form.

16. Differentiate Primary Key and Partition Key?

Primary Key is a combination of unique and not null. It can be a collection of key values called as composite primary key. Partition Key is a just a part of Primary Key. There are several methods of partition like Hash, DB2, and Random etc. While using Hash partition we specify the Partition Key.

17.    Differentiate database data and data warehouse data?
A) Data in a database is
a)  Detailed or transactional
b)  Both readable and writable
c)  Current.
Data in a data warehouse, by contrast, is summarized, historical, subject-oriented and mostly read-only.

18.  Containers Usage and Types?

Container is a collection of stages used for the purpose of Reusability. There are 2 types of Containers.

a)  Local Container: Job Specific
b)   Shared Container: Used in any job within a project.

19. Compare and Contrast ODBC and Plug-In stages?
ODBC: a) Poor Performance.

b)  Can be used for Variety of Databases.
c)  Can handle Stored Procedures.

Plug-In: a) Good Performance.
b)  Database specific. (Only one database)
c)  Cannot handle Stored Procedures.

20. Dimension Modelling types along with their significance
Data Modelling is Broadly classified into 2 types.

a)  E-R Diagrams (Entity - Relationships).
b)  Dimensional Modelling.

Q 21 What are Ascential Dastastage Products, Connectivity
Ans:
Ascential Products

Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions

Connectivity
Files

RDBMS
Real-time
PACKs
EDI
Other

Q 22 Explain Data Stage Architecture in 7.5x2?
Data Stage contains two components,

Client Component.
Server Component.
Client Component:
      Data Stage Administrator.
      Data Stage Manager
      Data Stage Designer
      Data Stage Director

Server Components:
      Data Stage Engine
      Meta Data Repository
      Package Installer

Data Stage Administrator:
Used to create the project.

Contains set of properties
We can increase the buffer size.
We can set the Environment Variables.
In Tunables we have in-process and inter-process row buffering:
In-process: data is read sequentially, row by row, within one process.
Inter-process: data is passed through a shared buffer, so the next stage reads the data as it comes.
It just interfaces to metadata.


Q 23 What is Meta Data Repository?
Meta Data is a data about the data.

It also contains
      Query statistics
      ETL statistics
      Business subject area
      Source Information
      Target Information
      Source to Target mapping Information.

Q 24 What is the DataStage Engine?
It is the server engine, running in the background, that executes DataStage jobs.

Q 25 What is Dimensional Modeling?

Dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows high-performance access.

Q 26 What is Star Schema?

Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded by dimensions table.

Dimension Table: It contains a primary key and description about the fact table.

Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

Q 27 What is surrogate Key?

It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table. We can store up to 2 billion records.

Q 28 Why do we need a surrogate key?
It is used for integrating data from multiple sources and works better than the natural key as a primary key.

It also helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

Q 29  What is a Snowflake schema?

It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.





Q 30 Explain the types of Fact Tables.

Factless Fact: contains only foreign keys to the dimension tables.
Additive Fact: measures can be added across any dimension.

Semi-Additive: measures can be added across some dimensions, e.g. percentages, discounts.
Non-Additive: measures cannot be added across any dimension, e.g. averages.

Conformed Fact: the same measure, defined and measured the same way across the dimensions, used by two or more fact tables.

Q 31 Explain the types of Dimension Tables.

Conformed Dimension: if a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.

Junk Dimension: a dimension table which contains only flags.

Monster Dimension: a very large dimension whose contents change rapidly.
Degenerate Dimension: a dimension attribute (typically a transaction or line-item number) stored in the fact table itself, with no dimension table of its own.

Q 32 What are stage variables?

Stage variables are declaratives in Transformer Stage used to store values. Stage variables are active at the run time. (Because memory is allocated at the run time).

Q 33 What is sequencer?
It sets the sequence of execution of server jobs.

Q 34 What are Active and Passive stages?

Active Stage: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. Eg, Transformer, Aggregator, Sort, Row Merger etc.

Passive Stage: A Passive stage handles access to Database for the extraction or writing of data. Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.

Q 35 What is ODS?
Operational Data Store is a staging area where data can be rolled back.

Q 36 What are Macros?
They are built from Data Stage functions and do not require arguments.

A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and links and stages belonging to the current job. These can be used in expressions (for example for use in Transformer stages), job control routines, filenames and table names, and before/after subroutines.
Q 37 What is keyMgtGetNextValue?

It is a Built-in transform it generates Sequential numbers. Its input type is literal string & output type is string.

Q 38 What are stages?
The stages are either passive or active stages.

Passive stages handle access to databases for extracting or writing data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.

Q 39 What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.

Q 40 What is container?

A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. You can also use shared containers as a way of incorporating server job functionality into parallel jobs.

DataStage provides two types of container: local containers (job specific) and shared containers (usable in any job within the project).
Question: What are Routines and where/how are they written and have you written any routines before?
Answer:

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit.

The following are different types of Routines:
1.      Transform Functions
2.      Before-After Job subroutines
3.      Job Control Routines

Question: How did you handle an 'Aborted' sequencer? Answer:

In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.

Question: What are Sequencers? Answer:

Sequencers are job control programs that execute other jobs with preset Job parameters.

Question: Read the String functions in DS Answer:

Functions like [] -> sub-string function and ':' -> concatenation operator Syntax:

string [ [ start, ] length ]
string [ delimiter, instance, repeats ]

Question: What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job.
Answer:

       Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. May be you can schedule the sequencer around the time the file is expected to arrive.

       Under UNIX: Poll for the file, and once the file has arrived, start the job or sequencer that depends on the file.

Question: What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Answer:

Use the crontab utility to schedule a shell script that starts the job (for example with the dsjob command), passing the proper parameters.
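A minimal sketch (all paths, project and job names are placeholders, and the dsenv location varies by installation). The crontab entry:

0 1 * * * /home/dsadm/scripts/run_nightly.sh >> /home/dsadm/logs/nightly.log 2>&1

and run_nightly.sh:

#!/bin/sh
. /opt/IBM/InformationServer/Server/DSEngine/dsenv        # set up the DataStage environment
$DSHOME/bin/dsjob -run -jobstatus dstage_proj load_customers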
















3. DIMENSION TABLE VS FACT TABLE


Ans


DIMENSION TABLE
It provides the context / descriptive information for the fact table measurements.
Structure: surrogate key, one or more fields that compose the natural key (NK), and a set of attributes.
Size: smaller than the fact table.
A schema usually contains more dimension tables than fact tables.
The surrogate key is used to prevent primary key (PK) violations and to store historical data.
Provides the entry points into the data.
Field values are in both numeric and text form.

FACT TABLE
It provides the measurements of an enterprise (a measurement is an amount determined by observation).
Structure: foreign keys (FK), degenerate dimensions and measurements.
Size: larger than the dimension table.
A schema usually contains fewer fact tables than dimension tables.
The degenerate dimension fields together act as the primary key.
Field values are almost always in numeric or integer form.

DWH Architecture




The main difference between the database architecture in a standard, on-line transaction processing oriented system (usually ERP or CRM system) and a DataWarehouse is that the system’s relational model is usually de-normalized into dimension and fact tables which are typical to a data warehouse database design.

The differences in the database architectures are caused by different purposes of their existence.
In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining usefulness of the application. That kind of a database needs to handle inserting thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE.

Opposite to OLTP systems, a DataWarehouse is a system that should give response to almost any question regarding company performance measure. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly but the thing is that the data doesn’t have to be the freshest possible and in a lowest detail level – usually a data warehouse is refreshed on a daily basis and very often overnight.
The main challenge of DataWarehouse architecture is to store historical, summarized data with a read-only access of the end-users. Again, from a technical standpoint the most SQL queries would start with a SELECT statement.
6.SNOWFLAKE SCHEMA


Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that dimensional tables in a snowflake schema are normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can’t represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the use of snowflake schemas should significantly improve performance by moving out some data to other table (with BRANDS for instance).
The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query them. This is because in order for a query to be answered, many tables need to be joined and aggregates generated.

An example of a snowflake schema architecture is depicted below.


7. Mapping: Incremental data load using SQL Override logic
Explanation:
In this mapping we will see how to implement incremental loading. We go for incremental loading to speed up the load and reduce the amount of data actually loaded.
There are different ways to implement incremental loads; one of them is to write a SQL override in the Source Qualifier using a mapping variable.
Explanation of the mapping: first we have to create one mapping variable of type date, say $$V_LOAD_DATE.
In the Source Qualifier write the SQL override as follows:
SELECT SRC_INCRE_LOAD.EMPNO, SRC_INCRE_LOAD.ENAME, SRC_INCRE_LOAD.HIREDATE FROM
SRC_INCRE_LOAD WHERE SRC_INCRE_LOAD.HIREDATE > '$$V_LOAD_DATE'
In an Expression transformation assign SYSDATE to the mapping variable, so that the next load picks up only records newer than this load:
Output port :: INCRE_LOAD
SETVARIABLE($$V_LOAD_DATE, SYSDATE)
8. Mapping: How to find the number of success, rejected and bad records in the same mapping
Explanation:
In this mapping we will see how to find the number of success, rejected and bad records in one mapping.

  • The source file is a flat file in .csv format. The data appears as shown below:

EMPNO   NAME     HIREDATE    SEX
100     RAJ      21-APR      M
101     JOHN     21-APR-08   M
102     MAON     08-APR      M
103              22-APR-08   M
105     SANTA    22-APR-08   F
104     SMITHA   22-APR-08   F
106                          M
  • In the above table a few values are missing, and the date format of some records is improper. These must be considered invalid records and should be loaded into the BAD_RECORDS table (a relational target table).
  • Other than records 2, 5 and 6, all the records are invalid because of NULL values, an improper DATE format, or both.
  • INVALID & VALID RECORDS ::
  • First we separate this data using an Expression transformation, which flags each row with 1 or 0. The condition is as follows:
  • IIF(NOT IS_DATE(HIREDATE,'DD-MON-YY') OR ISNULL(EMPNO) OR ISNULL(NAME) OR ISNULL(HIREDATE) OR ISNULL(SEX), 1, 0)
  • FLAG = 1 is considered invalid data and FLAG = 0 is considered valid data. The rows are routed onwards with a Router transformation; we add two user-defined groups, one with FLAG = 1 for invalid data and the other with FLAG = 0 for valid data.
  • The FLAG = 1 data is forwarded to an Expression transformation. Here we take one variable port and two output ports: one for incrementing a counter and the other for flagging the row.
  • INVALID RECORDS
  • INCREMENT ::
PORT                          EDIT EXPRESSION
COUNT_INVALID (output port)   V_PORT
V_PORT (variable)             V_PORT + 1
  • INVALID DATE ::
PORT            EDIT EXPRESSION
INVALID_DATE    IIF( IS_DATE(O_HIREDATE,'DD-MON-YY'), O_HIREDATE, 'INVALID DATE')
  • This data will be moved to the BAD_RECORDS table. Look at the table below:

EMPNO   NAME    HIREDATE       SEX   COUNT
100     RAJ     INVALID DATE   M     1
102     MAON    INVALID DATE   M     2
103     NULL    22-APR-08      M     3
106     NULL    NULL           M     4
  • VALID RECORDS ::
  • In this branch we have the valid records. But we do not want the employees whose sex is 'F' (female); our goal is to load the MALE employee information into the SUCCESS_RECORDS target table.
  • For this we need to use a Router transformation and declare the user group as follows:
  • IIF( sex='M', TRUE, FALSE)
  • The default group will capture the rejected records, which are the FEMALE employees.
  • This data is passed to the REUSABLE Expression transformation, where the increment logic is applied to get the count of the success and rejected records passing through it, and the rows are loaded into the target tables.
  • Look at the tables below:
  • SUCCESS_RECORDS::

EMPNO   NAME   HIREDATE    SEX   COUNT
101     JOHN   21-APR-08   M     1
  • REJECTED_RECORDS::

EMPNO   NAME     HIREDATE    SEX   COUNT
105     SANTA    22-APR-08   F     1
104     SMITHA   22-APR-08   F     2
9. Design a mapping as shown below

Source Definition:
Level1   Level2
P1       P2
P2       P3
P3       P4

Target Definition:
level1   level2   level3   level4
p1       p1       p1       p1
p1       p2       p2       p2
p1       p2       p3       p3
p1       p2       p3       p4

Source : Flatfile
Target : Flatfile
Solution:

10. Design a mapping as shown below

Source Definition:
ID   Name
1    100
2    Ramesh
3    INDIA
1    101
2    Rakesh
3    INDIA
1    102
2    Johny
3    USA

Target Definition:
EID   ENAME    COUNTRY
100   Ramesh   INDIA
101   Rakesh   INDIA
102   Johny    USA

Source : Flatfile
Target : Relational

Solution:



11. Source Definition and Target Definition

Source Definition:
EMP_NO   ESAL
10       2Hundred
11       10Thousand
12       15Lakhs
13       170

Target Definition:
EMP_NO   ESAL
10       200.00
11       10000.00
12       1500000.00
13       170.00














