Thursday, August 11, 2016

DataStage Interview Questions and Answers

1) How can we remove duplicates in a file using UNIX?

Answer: You can use UNIQ to remove duplicates from a file, but it only collapses identical consecutive records.
Ex: if you have records such as
1 vishwa
1 vishwa
2 Ravi
3 Naveen
1 vishwa
and you use the UNIQ command, you will get vishwa, Ravi, Naveen and vishwa (the last vishwa stays because it is not adjacent to the first two).
So perform a sort before using UNIQ to remove all the duplicates.
(Or)
$ sort -u filename    # sorts the data and removes the duplicate lines from the output
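For example, assuming the data is in a file called emp.txt (the file name is only an illustration):

$ sort emp.txt | uniq     # uniq alone only drops adjacent duplicate lines, so sort first
$ sort -u emp.txt         # or let sort drop the duplicate lines itself

Both commands print the de-duplicated data; redirect the output with > if you want to keep it in a new file.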
2) What is the structure of a configuration file? Also: I have two Oracle databases; while loading data from source to target the job takes 30 minutes, but I want it to load in less time. How?
3) I have 2 jobs. I want to run job B only after job A has run 3 times. How can I achieve this through DataStage? (IBM)
Answer
Hi friend, we can handle this scenario with a sequencer. Put the first job A in a loop for 3 iterations. After successful completion of the 3 runs, run the second job B.

Please correct me if I am wrong.
4) Hi, I am getting data from a sequential file with a column like: A&B, C&D, A&E, X&Y&Z, v&x. But I need the target column as: A and B, C and D, A and E, X and Y and Z, v and x.
Answer 1
In the sequential file stage, use the UNIX command sed s/&/and/g as a filter.
OR
Use the DataStage function
convert('&', 'and', expression)
OR
Use an expression: create an output column and in its derivation write
replacestr(1, columnname, '&', 'and')
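For example (src.txt is only an illustrative file name; note the spaces around "and" in the pattern, otherwise the output would be "AandB"):

$ cat src.txt
A&B
C&D
$ sed 's/&/ and /g' src.txt
A and B
C and D

The same sed command can also be placed in the Filter option of the Sequential File stage.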
5) Sequential file: I have one record and I want 100 records in the target. How can we do that? Please explain which stages are needed and what the logic is. (TCS)



6) A flat file contains 200 records. I want to load the first 50 records the first time the job runs, the second 50 records the second time, and so on. How would you develop the job? Please give the steps.
Answer 1
First use a sequence generator on the flat file, then take a rank transformation and keep the top 50 records; run the session 4 times and all 200 records are loaded.

7) If the lookup is giving poor performance, what do you do? 2. I have EMP and DEPT tables; I want the output ename, empno, dname, max(sal). 3. Can anyone tell me the project flow in a real-time scenario? Please mail any queries and answers to yaparala@gmail.com. (iGATE)
Answer 1
Hi,
I do not know whether this is right or wrong, but in my view: a lookup has two modes, cache and non-cache. Non-cache mode gives lower performance than cache mode, because the lookup keeps the connection to the lookup table open until the comparison is completed. In cache mode the lookup table data is loaded into memory and the connection to the lookup table is released, which gives higher performance.
For the second question, as far as I know we can get the result by using joins.

8) I have a file with empid and empname, and I want to load these two fields along with sal into my target. 1) The salary must be the same for all the records. 2) I want to pass the salary at run time. (TCS)



9) The source file has 5 records, but while moving into the target it should become 10 records. How? (IBM)
Answer 2
It is not a big issue: create one extra node, select the Entire partition type and run the job.
OR
Create an additional instance of the target.

10) Can you explain banking domain projects? In banking domain projects how many dimension tables and how many fact tables are there, and how do we load the sources and targets at the mapping level? Please explain with one example.
Answer 1
Read the Ralph Kimball Data Warehouse Toolkit. It describes a number of domains with their business processes, dimensions and facts.

11) How can cookies be tested in QTP?
Answer 2
Cookies can be tested using the WebUtil object.
Eg:
WebUtil.AddCookie
WebUtil.DeleteCookie
WebUtil.GetCookies
OR
WebUtil.DeleteCookies

1.     What are the production issues in DataStage?
2.     1. Identify and discuss the problems which occur during data quality verification for single-source and multi-source problems. 2. Testing has a very important role in a DWH; in the ETL testing phase, how can we perform integration testing and regression testing? 3. What are the prerequisites of system testing and regression testing in the ETL phase?

3.     Can anyone please send me the latest informatica dumps (power center 8.1.1) for all the three exams? My mail id is yokilamai@gmail.com
4.     How to find out the Informatica console password?
5.     Can anyone send me the code for the below: 1) remove '-' from the telephone number "323-237-2323" so the output is "3232372323"; 2) extract only "COMPANY.COM" from the email field "TEST@COMPANY.COM".
Sgr Ans: Use the Field function in the Transformer stage, e.g. field(link.RECORD,'-',1), field(link.RECORD,'-',2), field(link.RECORD,'-',3), and concatenate the fields. 2) Use substring in the transformer: link.record[6,16]. (A UNIX sketch of the same logic is shown after this list.)
6.     We have 1 source table containing 100 records. We have to transfer the first set of 10 records (1-10) to one target table and the next set of 10 records (11-20) to another target table, and continue like that up to the 100th record.
7.     5. How many repositories can we create in Informatica?
6. What is a surrogate key?
7. What is the difference between a mapplet and a reusable transformation?
8. What is aggregate awareness?
9. Explain a reference cursor.
10. What are parallel queries and query hints?
11. DWH architecture?
12. What are cursors?
13. Advantages of denormalized data?
14. What is an operational data source (ODS)?
15. What is metadata and the system catalog?
16. What is a factless fact schema?
17. What is a conformed dimension?
18. What is the capacity of a PowerCube?
19. Difference between PowerPlay Transformer and PowerPlay reports?
20. What is an IQD file?
21. What is the Cognos script editor?
22. What is the difference between macros and prompts?
23. What is the PowerPlay plug-in?
24. Which kind of index is preferred in a DWH?
25. What is hash partitioning?
26. What is a DTM session?
27. How can you define a transformation? What are the different types of transformations in Informatica?
28. What is a mapplet?
29. What is a query panel?
30. What is a lookup function? What is the default transformation for the lookup function?
31. What is the difference between a connected lookup and an unconnected lookup?
32. What is a staging area?
33. What is data merging, data cleansing and sampling?
34. What is update strategy and what are the options for update strategy?
35. OLAP architecture?
36. What is a subject area?
37. Why do we use a DSS database for OLAP tools?
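A UNIX sketch of question 5 above (the DataStage Field/substring functions already shown do the same job inside a transformer):

$ echo "323-237-2323" | tr -d '-'
3232372323
$ echo "TEST@COMPANY.COM" | cut -d'@' -f2
COMPANY.COM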
8.     In DataStage 8.1 the surrogate key stage generates sequence numbers, but there is a limit: I want to reset the sequence after it reaches some highest value, i.e. the output should be 1 2 3 4 5 ... 10000 and then 1 2 3 ... 10000 again.
9.     Difference between a DWH and a data mart; difference between a star schema and a snowflake schema; difference between a fact table and a dimension table; what are the contents of a requirement specification; what documents are needed for testing; draw your project architecture.
10. What are the important characteristics that we need to consider while developing ETL code?
11. Hi, please try to reply with an answer; my mail id is raghu.1028@gmail.com. The source has millions of records but there are three target tables. As per the user, the first record needs to be inserted into the first target, the second record into the second target and the third into the third target, and the process repeats in that order. How?
12. What is the main difference between the Peek stage and the Copy stage?
13. Give details on the sales project: how many dimension and fact tables are used in the project, briefly, and how are the four fact tables handled in the project?
14. How to load session statistics (workflow name, target rows, rejected rows, session start time, session end time, ...) into a database table? Another question: I want to pass values from an Oracle table or flat file into a mapping parameter/variable. What is the method for that? Please provide the solution step by step.
15. My source sequential file has one column, col1, with values 1 2 3 4 5 6 7 8 9, and I have 4 targets t1, t2, t3, t4. How can I distribute the records across them? (POLARIS)
Ans: It is a very simple task: select the partition type as round robin and run the job; the rows are distributed across the four targets in turn.

16. How to retrieve max(sal), deptno and empno in DataStage? (POLARIS)

17. Hi, my source is:
empno, deptno, salary
1,  10,  3.5
2,  20,  8
2,  10,  4.5
1,  30,  5
3,  10,  6
3,  20,  4
1,  20,  9
Then the target should be in the form below:

empno, max(salary), min(salary), deptno
1,  9,  3.5,  20
2,  8,  4.5,  20
3,  6,  4,    10
Can anyone give the data flow in DataStage for the above scenario?
ANS: Group the rows on empno (a Column Generator can be used to create a grouping column if needed) and then use an Aggregator stage to calculate max(salary) and min(salary).




18) Where do we use the Column Generator stage in a real-time scenario?
Answer
#1
The Column Generator stage adds columns to the incoming data and generates mock data for these columns for each data row processed. The new data set is then output.

Say we have two files X and Y and we are doing a funnel. File X has only 2 columns and file Y has 3 columns. While doing a funnel the metadata must be the same, so add one more column to file X and use the Column Generator stage to populate some mock data in that third column.
OR
2. ANS: If we have two tables A and B, where table A has two columns and table B has 3 columns, and the metadata of the two tables has to be the same, we can use the Column Generator stage to get the full set of columns in the target.
OR
3. ANS: Suppose we need to join two tables A and B but we do not have a common key column in both tables. We can use a Column Generator to generate a DUMMY (key) column and perform the join.

OR
Use it when you want to add one more column in the target while populating data from source to target.
19) When do we use a connected lookup and an unconnected lookup?
Answer
# 3
In DataStage, the equivalent concepts are the Normal (connected) lookup and the Sparse (unconnected) lookup.

Use a sparse lookup when the reference data volume is huge and the primary (stream) volume is very small: the key columns of the input records are passed into the WHERE clause of the reference table extraction query and matched directly against the reference table.

20) A+B+C+D = D+E+F+G = G+H+I = 17, where each letter represents a number from 1 to 9. Find out what the letters D and G represent if A = 4.
Answer
# 1
Adding the three expressions: (A+B+C+D)+(D+E+F+G)+(G+H+I) = 51. The letters A to I use each digit 1 to 9 once, and 1+2+...+9 = 45, so 45 + D + G = 51, i.e. D + G = 6.
From D+E+F+G = 17 it follows that E + F = 11.
--
Case 1) D = 1, G = 5 (A = 4): then H + I = 17 - G = 12, which forces (3,9); but E + F = 11 is then impossible from the remaining digits {2,6,7,8}, so this case is wrong.
--
Case 2) D = 5, G = 1 (A = 4): then H + I = 17 - G = 16, i.e. (7,9); E + F = 11, i.e. (3,8); and B + C = 17 - 4 - 5 = 8, i.e. (2,6).

Solution:
4+2+6+5 = 5+3+8+1 = 1+7+9 = 17, so D = 5 and G = 1.
21) Which partitioning is used in Join, Merge and Lookup?
Answer
Join and Merge need key-based partitioning (hash or modulus) on the join/merge key, with the data sorted, so that rows with the same key land on the same node.
OR
The Join stage typically uses hash or modulus partitioning, Merge uses the same key-based partitioning (or Auto), and Lookup uses Entire partitioning on the reference link.
22) How can you call shell scripts / UNIX commands in a job sequence? (IBM)
Answer 1
Hi,
by using the Routine activity you can access UNIX / shell scripts.
OR
There are two scenarios where you might want to call a script.

Scenario 1 (a dependency exists between the script and a job): a job has to be executed first, then the script has to run, and only on completion of the script should the second job be invoked. In this case develop a sequencer job where the first Job activity invokes the first job, then an Execute Command activity calls the script you want to invoke by typing "sh <script name>" in the command property of the activity, and then another Job activity calls the second job.

Scenario 2 (the script and job are independent): in this case, in your parallel job (say job1), under job properties you can find "After-job subroutine"; select "ExecSH" and pass the script name you want to execute. Once job1 completes, the script gets invoked. The job succeeding job1 (say job2) does not wait for the execution of the script.

23) If I make any changes in the parallel job, do I need to implement the changes in the sequencer job, or will the changes be reflected automatically? (IBM)
Answer
# 1
Hi, this is Kiran.
Any changes you make in your parallel job will automatically be reflected in the sequencer job.
OR
As to the "why": the sequence does not contain the underlying code of the job; it picks up the code from the job whose reference is given in the sequence.
OR
Once the modifications are done, simply compile the job and run the sequencer; there is no need to change anything in the sequencer.


24) How can you execute an SQL query through UNIX? What is the primary key for a dimension table? What is the primary key for a fact table? (TCS)
Answer
# 1 The primary key of a dimension table is its own primary key (usually the surrogate key).
The primary key of a fact table is the combination of foreign keys derived from the dimension tables.

OR
Connect to SQL*Plus from UNIX by typing sqlplus,
user name: user_name@database
password:
then type
@<path of the file containing the SQL query>/<query file name>.sql
e.g. @path/file.sql
and press Enter.
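A minimal sketch of the same thing from a shell script (user, password, database and file names are only placeholders):

$ sqlplus -s scott/tiger@orcl @/home/etl/query.sql     # run a saved query file

$ sqlplus -s scott/tiger@orcl <<EOF                    # or pass the SQL inline
select count(*) from emp;
exit;
EOF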
25) What is a time dimension and how do we populate the time dimension? (TCS)
Answer
# 1
Hi,
every DWH has a time dimension; you can load the time dimension through a PL/SQL script.
26) How can I abort the job if I get more than 100 errors in the job log? (ACCENTURE)
Answer
You can use the Transformer stage: each output link has an "Abort after rows" option; set the value and the job aborts once that count is reached.
OR
Compile the job and then run it; in the job run options select "Abort job after" and give 100, so the job aborts once there are more than 100 warnings.
27) What is the difference between a junk dimension and a conformed dimension? (Polaris)
Answer
# 1
A junk dimension is a dimension made up of miscellaneous flag/indicator (e.g. boolean) values that are not used in the main mappings. A conformed dimension is a dimension that is shared by two or more fact tables; conformed dimensions reduce the number of tables.

OR
JUNK DIMENSION
A dimension which cannot itself be used to describe the facts is known as a junk dimension (it provides additional information to the main dimension).
Ex: customer address

CONFORMED DIMENSION
A dimension table which can be shared by multiple fact tables is known as a conformed dimension.
Ex: time dimension
28) How many nodes are supported by one CPU in parallel jobs?
Answer
# 2
A node here is a logical processing unit, not a physical processor. Parallel jobs support uniprocessor, SMP, MPP and clustered systems, so the number of nodes you can define depends on the hardware architecture.
29) I am running a job with 1000 records. If the job aborts after loading 400 records into the target, I want the next run to continue loading from record 401. How can we do it? This scenario is not for a sequence job, only within the job, e.g. Seq file --> Transformer --> Dataset.
Answer
# 5
By using a Lookup stage we can get the answer. There are two tables: the 1000-record (source) table and the 400-record (target) table.

Take the source table as the primary input and the 400-record table as the reference input of the lookup, and pass through only the rows that do not find a match:

                  reference table
                       .
                       .
                       .
source............. look-up......... target
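If the load can also be restarted from the file side, a crude UNIX alternative (a sketch, assuming you know 400 rows were already loaded and the data is in src.txt) is to feed only the remaining rows to the job:

$ tail -n +401 src.txt > remaining.txt     # start output at line 401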
30) How to remove duplicates in the Transformer stage by using stage variables? Give one example.
Answer
# 2
Sort the data on the key column, then define the stage variables in this order (they are evaluated top to bottom):

StageVar1 = StageVar3                                  (the previous row's key)
StageVar2 = If column = StageVar1 Then 0 Else 1        (1 means a new key value)
StageVar3 = column                                     (remember the current key for the next row)

Put StageVar2 as the constraint on the output link, so only the first row of each key value passes to the target.
OR
If you want to remove duplicates before the Transformer stage, use hash partitioning on the key (with sorting and the unique option) so the duplicates can be dropped easily.
31) How to convert rows into columns?
Answer
# 1
By using the Pivot stage we can convert rows into columns in DataStage. In Informatica the Normalizer transformation is used for this.
32) what is factless fact table?
Answer
A fact table without any facts is known as factless fact
table. Always a Fact should be a Numeric value but every
numeric value need not be a fact.
33) How to cleanse data?
Answer
# 1 Data cleansing means converting data that arrives in inconsistent formats into a single consistent format. This can be performed in the Transformer stage.
OR
It means removing unwanted data (bad records or NULL values), finding inconsistent data and making it consistent.

Example:

Loc
---
Hyd
Hyderabad
hyde

After cleansing:

Loc
---
Hyderabad
Hyderabad
Hyderabad
OR
Data cleansing also covers removing unwanted spaces; by using the LTrim and RTrim functions we can remove unwanted spaces.
34) How to calculate sum(sal) without using the Aggregator stage?
Answer
# 2
In an Expression transformation add 2 new ports:
1) a variable port that keeps a running total of sal,
2) an output port that returns this variable.
35) I have one table with one column containing three rows, e.g. 1, 1, 2 (or SHIVA, RAMU, MADHU). I want the first two rows concatenated into one target row and the remaining row as another row, i.e. "SHIVA RAMU" in one row and "MADHU" in another. How is it possible? (Capgemini)
Answer
# 1
seqfile -> sort -> transformer -> remove duplicates -> target
Input file:
col1
1
1
2
In the Sort stage:
key = col1
allow duplicates = true
create key change column = true
In the Transformer:
create a stage variable:
if KeyChange = 1 then col1 else stagevariable : col1
drag col1 into the Transformer and in its derivation put only the stage variable.
In the Remove Duplicates stage:
key = col1
select "last" as the duplicate to retain,
and we get the answer:
col1
11
2
36) Explain how to create SCD-2 in DataStage 7.5x2. Please explain with 4 or 5 example records and the job design. (IBM)
Answer
# 1

Dim_table ---\
              Change Capture --> Transformer --> target dimension
Src_table ---/
37) Can we build half of a project in parallel jobs and half in server jobs? (Infosys)
Answer 1
No.
38) 1. What is a staging area? What is a stage variable?
2. This is my source:
source: id, name        target: id, name
        100, murty              100, madan
        100, madan
        100, saran
We have three duplicate records for the id column; how can we get the required source record into the target?
Ans: A staging area is a temporary storage location. In real-time environments we get the data from the client in the form of files, so we store it in a temporary location and perform all the validations (data validations, field validations and file validations) there; afterwards we load it into the warehouse.

A stage variable holds an expression assigned in the Transformer stage; it can be reused only within that Transformer stage, which is why it is called a stage variable.

For the second question, take name as the key column, so we get the required source record.
39) What UNIX questions are asked for DataStage?
Answer
1. How to kill a process: kill
2. Process status: ps
3. Searching with regular expressions: grep
4. Executing a shell script: ./scriptname
5. Running jobs: dsjob -run
6. Listing files: ls -lrt

These are the basic questions.
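For example (the process ids, file paths and job names below are only placeholders):

$ ps -ef | grep osh                 # look for running parallel engine processes
$ kill -9 12345                     # kill the process with PID 12345
$ grep -i "fatal" job_run.log       # search a log for errors
$ ls -lrt /data/landing             # list files, newest last
$ sh load_stage.sh                  # run a shell script
$ dsjob -run myproject myjob        # start a DataStage job from the command line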
40) I have source data like
empno, ename
11, aa
12, bb

and I want output like
empno, ename
11, aa
12, bb
11, aa
12, bb
(TCS)
Ans: Read the source in a Sequential File stage, send it to two Copy stages, and use a Funnel stage to funnel the data from the two Copy stages into the output file; the output is then the required duplicated set.
41) Tell me about DataStage triggers. (HP)
Answer 1: DataStage triggers are used to control job activities in job sequences. In job sequences the stages are called activities and the links are called triggers.
Triggers are of 3 types: 1) conditional, 2) unconditional, 3) otherwise.
So by using triggers we can control the job activities. Usually large DWH jobs depend on 1 or 2 other jobs to get their data, so by using these triggers we can set the condition that a later job runs only once the previous one has completed successfully.
42) I have a scenario like this: seq file --> transformer --> sort stage --> dataset. In this job which partitioning technique are you using and why? (CTS)
Answer 1
Hash or modulus partitioning. When using sort operations it is always better to go for key-based partitioning techniques.
43) What is a fast changing dimension? (TCS)
Answer
# 1
Almost all projects have a customer dimension, which is a fast growing dimension.

A fast (rapidly) changing dimension is one whose attribute values change frequently.
44) How can we do null handling in sequential files? (Reliance)
Answer
# 1
While importing the data we can specify Nullable = Yes in the column definitions tab,
or
in the Format tab we can specify the final delimiter as 'null'.
45) What is the best stage in DataStage parallel jobs to use for a full outer join, and why? (Virtusa)
Answer
# 2
The Join stage is the best, because the data is read from the buffer area.
Or
The Join stage supports all the joins (inner join, left outer join, right outer join, full outer join), whereas Lookup supports only inner and left outer joins.
46) What is the difference between SQL*Loader and OCI in DataStage? (CSC)
Answer
# 1 SQL*Loader (sqlldr) is a bulk loader utility from Oracle for loading data into Oracle tables, whereas the DataStage OCI (Oracle Call Interface) stage is a passive stage that connects to any Oracle database, collects the necessary data, passes it on to DataStage ETL for processing and transformation where necessary, and loads it into the target table. The DataStage OCI stage offers input and output links for conveniently connecting to Oracle source and target databases.
OR
2. In addition to the above: through sqlldr we load bulk data, for which we use the Oracle bulk load stage. The data is loaded through a script, i.e. as a package, but there is a restriction: when we bulk load, the target table should not have an index. We generally bulk load data into table partitions, e.g. weekly or monthly; it is faster than OCI.

Through OCI we load data row by row in a sequential manner; we can set the buffer size as required to read or write data, and there is no need to remove the index from the table.


47) I have a sequential file that contains 10 million records to load into a target database, and the load takes a lot of time. How do I do performance tuning in that situation? (TCS)
Answer
# 1
1) In the Sequential File stage properties, under options, increase the number of readers.

2) In the DataStage Administrator, the relevant environment variables (for example the reader settings) can also be tuned.

48) How can a string be converted to a date by using only the Sequential File stage, without using other stages? (TCS)
Answer 1
I think this is not possible, but we can load the date into the sequential file stage as a string.
49) How can we create a surrogate key in the Transformer stage? I want it in parallel mode.
 Answer
In transformer Properties, there is a tab called “Surrogate Key”. Here we can define it.
We can select from
a.         Flat file : State file which holds the SK values and generates SK values.
b.         DB Sequence: Database sequence object to use to generate the SK values.
50) How to retrieve max(sal), deptno and empno in DataStage?
Answer
We can retrieve max(sal) per deptno with the Aggregator stage:
set group = deptno
aggregation type = calculation
column for calculation = sal
maximum value output column = max_sal
51) How to find the number of records in a sequential file itself?
Answer
In the Sequential File stage go to the options and select the row count property; it is then updated automatically.
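From the UNIX side the same count can be checked with wc (the path is only an example):

$ wc -l < /data/input/emp.txt     # prints the number of lines; subtract 1 if there is a header row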
52) How can you join a flat file and Oracle as sources?
Answer
First option: populate the flat file data into the Oracle DB, then either

1. write a join query in the source Oracle stage and link it to the target, or

2. take two Oracle sources, say S1 and S2, use a Join stage and link it to the target.

Or the direct method:

3. take the flat file and the Oracle source, use a Join stage and link it to the target.
53) Types of errors in DataStage?
Answer 1
source file not found
null value populated into a not-nullable column
died with signal 11 (e.g. due to the server going down)
fatal errors
NLS warnings
metadata mismatch
data type mismatch
parallel loading not enabled
field size different
data type size differs between source and target
column mismatch
file opening error
mutex error
process time out ...

54) What is pre-load in a hashed file?
Answer
You can speed up the read operation on a reference hashed file link by pre-loading the hashed file into memory.

55) What is an audit table? Have you used an audit table in your project?
Answer
An audit table is essentially a log table; every job should have an audit table.
56) How can you remove the duplicates in a sequential file?
Answer
4
In the source Sequential File stage go to Output -> Properties and under options set reject mode = Output.

This sends the duplicate records to the output; this is the format we are using in my current project.
OR
Sort the incoming data using a Sort stage and use a Remove Duplicates stage on the key columns. This will remove the duplicates in your source file.
57) What are the stages you would use in implementing SCD-2 with date fields?
Answer
# 1
Hi, this is Kiran.
For SCD type 2 we use the Change Capture and Transformer stages.
Date fields are used at the target level as Eff_start_date and Eff_end_date:

Eff_start_date = current date
Eff_end_date of the expired row = current date - 1.
58) Where do you use UNIX commands as an ETL developer? (Wipro)
Answer
# 2
You can call them in the following places in ETL (DataStage):

1 - Sequential File stage (the stage's Filter command option).
2 - Before and after job subroutines in the job properties tab.
3 - Before and after stage subroutines in the Transformer stage.
4 - Job sequences, using the Execute Command activity and the Routine activity.
5 - DataStage routines.
6 - The routines ExecSH and ExecSHSilent.
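For example, a typical Filter command in a Sequential File stage (the patterns are only illustrations); the stage pipes the file through the command before the rows enter the job:

grep -v '^$'          # drop blank lines
cut -d',' -f1,2,5     # or keep only selected comma-separated fields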
59) eno   ename    esal    acct1   acct2   amount1   amount2
100   suresh   10000   sbi1    sbi2    1000      2000
This is our source data; I would like to display it like this:
eno   ename    esal    acct    amount
100   suresh   10000   sbi1    1000
100   suresh   10000   sbi2    2000

Ans:
Use the Pivot stage.
60) In my source I have 10 records in a single column, but I want to split those records into 5 sequential files, each containing 2 records. Can anybody help me?
Answer
# 4 In a Transformer we have the system variable @INROWNUM, which holds the row number of each incoming row. Using this variable in the constraint of each output link we can split the records across the files.
OR
Take a sequential file as source and a sequential file as target; in the target stage add the file property 5 times (one file per output) and run the job.
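The same split can also be done at the UNIX level before the job runs (a sketch, assuming the 10 records are in src.txt; the numeric-suffix option -d is GNU split):

$ split -l 2 -d src.txt part_     # creates part_00 ... part_04 with 2 records each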
61) In my source I have 10 records but I want 11 records in the target. How can I achieve this in server jobs? (IBM)
Answer
Using the Pivot stage you can get the extra target record; otherwise, in the Transformer stage you can also add one record through the output column derivations.





 1. What is the flow of loading data into fact & dimensional tables?

A) Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional table. Consists of fields with numeric values.

Dimension table - Table with Unique Primary Key.

Load - Data should be first loaded into dimensional table. Based on the primary key values in dimensional table, the data should be loaded into Fact table.

2.  What is the default cache size? How do you change the cache size if needed?

A. Default cache size is 256 MB. We can increase it by going into Datastage Administrator and selecting the Tunable Tab and specify the cache size over there.

3.  What are types of Hashed File?
A)  Hashed File is classified broadly into 2 types.
a)  Static - Sub divided into 17 types based on Primary Key Pattern.
b)  Dynamic - sub divided into 2 types
i)  Generic  ii) Specific.

Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify the following (although all of these have default values).

4. What does a Config File in parallel extender consist of?

A) Config file consists of the following. a) Number of Processes or Nodes.

b) Actual Disk Storage Location.

5. What is Modulus and Splitting in a Dynamic Hashed File?

A. In a dynamic hashed file the size of the file keeps changing as data is added and removed. The modulus is the number of groups in the file; when the file grows, groups are split ("splitting") and the modulus increases, and when it shrinks, groups are merged and the modulus decreases.

6. What are Stage Variables, Derivations and Constraints?

A. Stage Variable - an intermediate processing variable that retains its value during the read and does not pass a value into the target column.

Derivation - an expression that specifies the value to be passed on to the target column. Constraint - a condition, evaluating to true or false, that controls the flow of data along a link.

7. Types of views in Datastage Director?
There are 3 types of views in Datastage Director

a)  Job View - Dates of Jobs Compiled.
b)  Log View - Status of Job last run
c) Status View - Warning Messages, Event Messages, Program Generated Messages.

8. Types of Parallel Processing?
A) Parallel Processing is broadly classified into 2 types.

a)  SMP - Symmetrical Multi Processing.
b)  MPP - Massive Parallel Processing.

9. Orchestrate Vs Datastage Parallel Extender?

A) Orchestrate itself is an ETL tool with extensive parallel processing capabilities and running on UNIX platform. Datastage used Orchestrate with Datastage XE (Beta version of 6.0) to incorporate the parallel processing capabilities. Now Datastage has purchased Orchestrate and integrated it with Datastage XE and released a new version Datastage 6.0 i.e Parallel Extender.

10. Importance of Surrogate Key in Data warehousing?

A) Surrogate Key is a Primary Key for a Dimension table. Most importance of using it is it is independent of underlying database. i.e. Surrogate Key is not affected by the changes going on with a database.

11.    How to run a shell script within the scope of a DataStage job?
A) By using the "ExecSH" command in the Before/After job properties.

12.  How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to
yyyy-dd-mm?
A)  We use a) the "Iconv" function - converts to the internal date format.
b)  the "Oconv" function - converts to the external (output) format.

The function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")
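The same conversion can be sanity-checked from UNIX with awk (a sketch; note the target order asked for here is year-day-month):

$ echo "08/11/2016" | awk -F'/' '{print $3"-"$2"-"$1}'
2016-11-08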

13 How do you execute datastage job from command line prompt?

A) Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname
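For example (project and job names are placeholders):

$ dsjob -run -jobstatus dstage_proj load_customers
$ echo $?      # with -jobstatus the exit code reflects the job status, e.g. 1 = ran OK, 2 = ran with warnings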

14. Functionality of Link Partitioner and Link Collector?

Link Partitioner: It actually splits data into various partitions or data flows using various partition methods.

Link Collector: It collects the data coming from the partitions and merges it into a single data flow.

15. Types of Dimensional Modeling?
A) Dimensional modeling is again sub divided into 2 types.

a)  Star Schema - Simple & Much Faster. Denormalized form.
b)  Snowflake Schema - Complex with more Granularity. More normalized form.

16. Differentiate Primary Key and Partition Key?

Primary Key is a combination of unique and not null. It can be a collection of key values called as composite primary key. Partition Key is a just a part of Primary Key. There are several methods of partition like Hash, DB2, and Random etc. While using Hash partition we specify the Partition Key.

17.    Differentiate database data and data warehouse data?
A) Data in a database is
a)  Detailed or transactional
b)  Both readable and writable
c)  Current.
Data in a data warehouse, by contrast, is summarized, historical, subject-oriented and mostly read-only.

18.  Containers Usage and Types?

Container is a collection of stages used for the purpose of Reusability. There are 2 types of Containers.

a)  Local Container: Job Specific
b)   Shared Container: Used in any job within a project.

19. Compare and Contrast ODBC and Plug-In stages?
ODBC: a) Poor Performance.

b)  Can be used for Variety of Databases.
c)  Can handle Stored Procedures.

Plug-In: a) Good Performance.
b)  Database specific. (Only one database)
c)  Cannot handle Stored Procedures.

20. Dimension Modelling types along with their significance
Data Modelling is Broadly classified into 2 types.

a)  E-R Diagrams (Entity - Relationships).
b)  Dimensional Modelling.

Q 21 What are Ascential Dastastage Products, Connectivity
Ans:
Ascential Products

Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions

Connectivity
Files

RDBMS
Real-time
PACKs
EDI
Other

Q 22 Explain Data Stage Architecture in 7.5x2?
Data Stage contains two components,

Client Component.
Server Component.
Client Component:
      Data Stage Administrator.
      Data Stage Manager
      Data Stage Designer
      Data Stage Director

Server Components:
      Data Stage Engine
      Meta Data Repository
      Package Installer

Data Stage Administrator:
Used to create the project.

Contains set of properties
We can increase the buffer size.
We can set the Environment Variables.
In Tunables we have in-process and inter-process row buffering:
In-process: data is read sequentially, row by row, within one process.
Inter-process: data is passed through a shared buffer, so the next stage reads the data as it comes.
It just interfaces to metadata.


Q 23 What is Meta Data Repository?
Meta Data is a data about the data.

It also contains
      Query statistics
      ETL statistics
      Business subject area
      Source Information
      Target Information
      Source to Target mapping Information.

Q 24 What is the DataStage Engine?
It is the server engine, running in the background, that executes DataStage jobs.

Q 25 What is Dimensional Modeling?

Dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows high-performance access.

Q 26 What is Star Schema?

Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded by dimensions table.

Dimension Table: It contains a primary key and description about the fact table.

Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

Q 27 What is surrogate Key?

It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table. We can store up to 2 billion records.

Q 28 Why do we need a surrogate key?
It is used for integrating data from multiple sources and works better than the natural key as a primary key.

It also helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

Q 29  What is a Snowflake schema?

It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.





Q 30 Explain the types of Fact Tables.

Factless Fact: contains only foreign keys to the dimension tables.
Additive Fact: measures can be added across any dimension.

Semi-Additive: measures can be added across some dimensions, e.g. percentages, discounts.
Non-Additive: measures cannot be added across any dimension, e.g. averages.

Conformed Fact: the same measure, defined and measured the same way across the dimensions, used by two or more fact tables.

Q 31 Explain the types of Dimension Tables.

Conformed Dimension: if a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.

Junk Dimension: a dimension table which contains only flags.

Monster Dimension: a very large dimension whose contents change rapidly.
Degenerate Dimension: a dimension attribute (typically a transaction or line-item number) stored in the fact table itself, with no dimension table of its own.

Q 32 What are stage variables?

Stage variables are declaratives in Transformer Stage used to store values. Stage variables are active at the run time. (Because memory is allocated at the run time).

Q 33 What is sequencer?
It sets the sequence of execution of server jobs.

Q 34 What are Active and Passive stages?

Active Stage: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. Eg, Transformer, Aggregator, Sort, Row Merger etc.

Passive Stage: A Passive stage handles access to Database for the extraction or writing of data. Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.

Q 35 What is ODS?
Operational Data Store is a staging area where data can be rolled back.

Q 36 What are Macros?
They are built from Data Stage functions and do not require arguments.

A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and links and stages belonging to the current job. These can be used in expressions (for example for use in Transformer stages), job control routines, filenames and table names, and before/after subroutines.
Q 37 What is keyMgtGetNextValue?

It is a Built-in transform it generates Sequential numbers. Its input type is literal string & output type is string.

Q 38 What are stages?
The stages are either passive or active stages.

Passive stages handle access to databases for extracting or writing data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.

Q 39 What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.

Q 40 What is container?

A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. You can also use shared containers as a way of incorporating server job functionality into parallel jobs.

DataStage provides two types of container: local containers (job specific) and shared containers (usable in any job within the project).
Question: What are Routines and where/how are they written and have you written any routines before?
Answer:

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit.

The following are different types of Routines:
1.      Transform Functions
2.      Before-After Job subroutines
3.      Job Control Routines

Question: How did you handle an 'Aborted' sequencer? Answer:

In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.

Question: What are Sequencers? Answer:

Sequencers are job control programs that execute other jobs with preset Job parameters.

Question: Read the String functions in DS Answer:

Functions like [] -> sub-string function and ':' -> concatenation operator Syntax:

string [ [ start, ] length ]
string [ delimiter, instance, repeats ]

Question: What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job.
Answer:

       Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. May be you can schedule the sequencer around the time the file is expected to arrive.

       Under UNIX: Poll for the file, and once the file has arrived, start the job or sequencer that depends on the file.

Question: What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Answer:

Use the crontab utility to schedule a shell script that starts the job (for example with the dsjob command), passing the proper parameters.
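A minimal sketch (all paths, project and job names are placeholders, and the dsenv location varies by installation). The crontab entry:

0 1 * * * /home/dsadm/scripts/run_nightly.sh >> /home/dsadm/logs/nightly.log 2>&1

and run_nightly.sh:

#!/bin/sh
. /opt/IBM/InformationServer/Server/DSEngine/dsenv        # set up the DataStage environment
$DSHOME/bin/dsjob -run -jobstatus dstage_proj load_customers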
















3. DIMENSION TABLE VS FACT TABLE


Ans


DIMENSION TABLE
It provides the context / descriptive information for the fact table measurements.
Structure: surrogate key, one or more fields that compose the natural key (NK), and a set of attributes.
Size: smaller than the fact table.
A schema usually contains more dimension tables than fact tables.
The surrogate key is used to prevent primary key (PK) violations and to store historical data.
Provides the entry points into the data.
Field values are in both numeric and text form.

FACT TABLE
It provides the measurements of an enterprise (a measurement is an amount determined by observation).
Structure: foreign keys (FK), degenerate dimensions and measurements.
Size: larger than the dimension table.
A schema usually contains fewer fact tables than dimension tables.
The degenerate dimension fields together act as the primary key.
Field values are almost always in numeric or integer form.

DWH Architecture




The main difference between the database architecture in a standard, on-line transaction processing oriented system (usually ERP or CRM system) and a DataWarehouse is that the system’s relational model is usually de-normalized into dimension and fact tables which are typical to a data warehouse database design.

The differences in the database architectures are caused by different purposes of their existence.
In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining usefulness of the application. That kind of a database needs to handle inserting thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE.

Opposite to OLTP systems, a DataWarehouse is a system that should give response to almost any question regarding company performance measure. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly but the thing is that the data doesn’t have to be the freshest possible and in a lowest detail level – usually a data warehouse is refreshed on a daily basis and very often overnight.
The main challenge of DataWarehouse architecture is to store historical, summarized data with a read-only access of the end-users. Again, from a technical standpoint the most SQL queries would start with a SELECT statement.
6.SNOWFLAKE SCHEMA


Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that dimensional tables in a snowflake schema are normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can’t represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the use of snowflake schemas should significantly improve performance by moving out some data to other table (with BRANDS for instance).
The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query them. This is because in order for a query to be answered, many tables need to be joined and aggregates generated.

An example of a snowflake schema architecture is depicted below.


7. Mapping: Incremental data load using SQL Override logic
Explanation:
In this mapping we will see how to implement incremental loading. We go for incremental loading to speed up the load and reduce the amount of data actually loaded.
There are different ways to implement incremental loads; one of them is to write a SQL override in the Source Qualifier using a mapping variable.
Explanation of the mapping: first we have to create one mapping variable of type date, say $$V_LOAD_DATE.
In the Source Qualifier write the SQL override as follows:
SELECT SRC_INCRE_LOAD.EMPNO, SRC_INCRE_LOAD.ENAME, SRC_INCRE_LOAD.HIREDATE FROM
SRC_INCRE_LOAD WHERE SRC_INCRE_LOAD.HIREDATE > '$$V_LOAD_DATE'
In an Expression transformation assign SYSDATE to the mapping variable, so that the next load picks up only records newer than this load:
Output port :: INCRE_LOAD
SETVARIABLE($$V_LOAD_DATE, SYSDATE)
8. Mapping: How to find the number of success, rejected and bad records in the same mapping
Explanation:
In this mapping we will see how to find the number of success, rejected and bad records in one mapping.

  • The source file is a flat file in .csv format. The data appears as shown below:

EMPNO   NAME     HIREDATE    SEX
100     RAJ      21-APR      M
101     JOHN     21-APR-08   M
102     MAON     08-APR      M
103              22-APR-08   M
105     SANTA    22-APR-08   F
104     SMITHA   22-APR-08   F
106                          M
  • In the above table a few values are missing, and the date format of some records is improper. These must be considered invalid records and should be loaded into the BAD_RECORDS table (a relational target table).
  • Other than records 2, 5 and 6, all the records are invalid because of NULL values, an improper DATE format, or both.
  • INVALID & VALID RECORDS ::
  • First we separate this data using an Expression transformation, which flags each row with 1 or 0. The condition is as follows:
  • IIF(NOT IS_DATE(HIREDATE,'DD-MON-YY') OR ISNULL(EMPNO) OR ISNULL(NAME) OR ISNULL(HIREDATE) OR ISNULL(SEX), 1, 0)
  • FLAG = 1 is considered invalid data and FLAG = 0 is considered valid data. The rows are routed onwards with a Router transformation; we add two user-defined groups, one with FLAG = 1 for invalid data and the other with FLAG = 0 for valid data.
  • The FLAG = 1 data is forwarded to an Expression transformation. Here we take one variable port and two output ports: one for incrementing a counter and the other for flagging the row.
  • INVALID RECORDS
  • INCREMENT ::
PORT                          EDIT EXPRESSION
COUNT_INVALID (output port)   V_PORT
V_PORT (variable)             V_PORT + 1
  • INVALID DATE ::
PORT            EDIT EXPRESSION
INVALID_DATE    IIF( IS_DATE(O_HIREDATE,'DD-MON-YY'), O_HIREDATE, 'INVALID DATE')
  • This data will be moved to the BAD_RECORDS table. Look at the table below:

EMPNO   NAME    HIREDATE       SEX   COUNT
100     RAJ     INVALID DATE   M     1
102     MAON    INVALID DATE   M     2
103     NULL    22-APR-08      M     3
106     NULL    NULL           M     4
  • VALID RECORDS ::
  • In this branch we have the valid records. But we do not want the employees whose sex is 'F' (female); our goal is to load the MALE employee information into the SUCCESS_RECORDS target table.
  • For this we need to use a Router transformation and declare the user group as follows:
  • IIF( sex='M', TRUE, FALSE)
  • The default group will capture the rejected records, which are the FEMALE employees.
  • This data is passed to the REUSABLE Expression transformation, where the increment logic is applied to get the count of the success and rejected records passing through it, and the rows are loaded into the target tables.
  • Look at the tables below:
  • SUCCESS_RECORDS::

EMPNO   NAME   HIREDATE    SEX   COUNT
101     JOHN   21-APR-08   M     1
  • REJECTED_RECORDS::

EMPNO   NAME     HIREDATE    SEX   COUNT
105     SANTA    22-APR-08   F     1
104     SMITHA   22-APR-08   F     2
9. Design a mapping as shown below

Source Definition:
Level1   Level2
P1       P2
P2       P3
P3       P4

Target Definition:
level1   level2   level3   level4
p1       p1       p1       p1
p1       p2       p2       p2
p1       p2       p3       p3
p1       p2       p3       p4

Source : Flatfile
Target : Flatfile
Solution:

10. Design a mapping as shown below

Source Definition:
ID   Name
1    100
2    Ramesh
3    INDIA
1    101
2    Rakesh
3    INDIA
1    102
2    Johny
3    USA

Target Definition:
EID   ENAME    COUNTRY
100   Ramesh   INDIA
101   Rakesh   INDIA
102   Johny    USA

Source : Flatfile
Target : Relational

Solution:



11. Source Definition and Target Definition

Source Definition:
EMP_NO   ESAL
10       2Hundred
11       10Thousand
12       15Lakhs
13       170

Target Definition:
EMP_NO   ESAL
10       200.00
11       10000.00
12       1500000.00
13       170.00














