Friday, June 21, 2013

Data Warehousing Concepts

What is Data Warehousing?
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently from traditional relational databases.
What are Data Marts?
A data mart is a segment of a data warehouse that provides data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, though they are usually smaller than the corporate data warehouse.
What is an ER Diagram?
ER stands for entity-relationship diagram. It is the first step in the design of a data model, which later leads to the physical database design of a possible OLTP or OLAP database.
What is a Star Schema?
A relational database schema organized around a central table (fact table) joined to a few smaller
tables (dimension tables) using foreign key references. The fact table contains raw numeric items
that represent relevant business facts (price, discount values, number of units sold, dollar value,
etc.)
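As an illustration, a minimal star schema for retail sales might look like the following sketch (all table and column names here are invented for the example):

    -- Dimension tables: descriptive context for the facts
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- surrogate key
        product_name  VARCHAR(100),
        category      VARCHAR(50)
    );

    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,  -- e.g. 20130621
        calendar_date DATE,
        month_name    VARCHAR(20),
        year_no       INTEGER
    );

    -- Fact table: raw numeric measures plus foreign keys to the dimensions
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date (date_key),
        product_key  INTEGER REFERENCES dim_product (product_key),
        units_sold   INTEGER,
        dollar_value DECIMAL(12,2),
        discount     DECIMAL(12,2)
    );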
What is Dimensional Modeling?
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of those measurements, i.e., the dimensions along which the facts are calculated.
What is a Snowflake Schema?
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data
has been grouped into multiple tables instead of one large table. For example, a product
dimension table in a star schema might be normalized into a products table, a product_category
table, and a product_manufacturer table in a snowflake schema. While this saves space, it
increases the number of dimension tables and requires more foreign key joins. The result is more
complex queries and reduced query performance.
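Continuing the product example above, a sketch of how the snowflaked dimension might be declared (names are illustrative, not from any particular system):

    -- In a star schema, category and manufacturer would repeat on every
    -- row of one wide product dimension. Snowflaked, they are split out:
    CREATE TABLE product_category (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE product_manufacturer (
        manufacturer_key  INTEGER PRIMARY KEY,
        manufacturer_name VARCHAR(100)
    );

    CREATE TABLE products (
        product_key      INTEGER PRIMARY KEY,
        product_name     VARCHAR(100),
        category_key     INTEGER REFERENCES product_category (category_key),
        manufacturer_key INTEGER REFERENCES product_manufacturer (manufacturer_key)
    );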
What are the Different methods of loading Dimension tables?
The data in the dimension tables may change over a period of time. Depending upon how you want to treat the historical data in dimension tables, there are three different ways of loading the slowly changing dimensions:
1. Type 1 dimension: do not keep history. Update the record if found, else insert the data.
2. Type 2 dimension: do not update the existing record. Create a new record (with a version number or change date as part of the key) for the dimension, while retaining the old one.
3. Type 3 dimension: keep more than one column for each changing attribute. The new value of the attribute is recorded in the existing record, in a previously empty column.
Type 2 dimensions are the most commonly used.
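A minimal sketch of the Type 1 versus Type 2 load logic, assuming an invented dim_customer table with natural key cust_id, surrogate key cust_key, and effective-date columns:

    -- Type 1: overwrite in place; no history is kept
    UPDATE dim_customer
       SET city = 'Dallas'
     WHERE cust_id = 'C100';

    -- Type 2: expire the current version, then insert a new one
    UPDATE dim_customer
       SET end_date = DATE '2013-06-21'
     WHERE cust_id = 'C100'
       AND end_date IS NULL;              -- NULL end_date marks the current row

    INSERT INTO dim_customer (cust_key, cust_id, city, start_date, end_date)
    VALUES (10457, 'C100', 'Dallas', DATE '2013-06-21', NULL);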
What are Aggregate tables?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the detail table, which may have millions of records, takes more time and affects server performance. To avoid this we can aggregate the table to the required level and use it. These tables reduce the load on the database server, improve query performance, and return results very fast.
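For example, a daily-by-product summary of the detail-level sales facts could be built roughly as follows (Oracle/PostgreSQL-style CREATE TABLE AS; table names follow the star schema sketch earlier):

    -- Pre-aggregate millions of detail rows to one row per day and product
    CREATE TABLE agg_sales_daily AS
    SELECT date_key,
           product_key,
           SUM(units_sold)   AS total_units,
           SUM(dollar_value) AS total_dollars
      FROM fact_sales
     GROUP BY date_key, product_key;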
What is the Difference between OLTP and OLAP?
OLTP
Current data
Short database transactions
Online update/insert/delete
Normalization is promoted
High-volume transactions
Transaction recovery is necessary
OLAP
Current and historical data
Long database transactions
Batch update/insert/delete
Denormalization is promoted
Low-volume transactions
Transaction recovery is not necessary
What is ETL?
ETL stands for extraction, transformation and loading. ETL technology is used to extract information from the source database and load it into the target database, with the necessary transformations done in between.
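Inside the database, the transform and load steps often reduce to a single transform-while-loading statement. A sketch, assuming invented staging (stg_customers), lookup (lkp_states) and target tables and an Oracle-style sequence:

    -- Transform source rows while loading them into the warehouse table
    INSERT INTO dim_customer (cust_key, cust_id, cust_name, state_name)
    SELECT seq_customer.NEXTVAL,          -- surrogate key from a sequence
           s.customer_id,
           UPPER(TRIM(s.customer_name)),  -- cleansing transformations
           st.state_name                  -- code-to-description lookup
      FROM stg_customers s
      JOIN lkp_states st ON st.state_code = s.state_code;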
What is a Fact Table?
A fact table is the central table of a data warehouse schema. Its rows contain the numeric measurements (facts) of the business, together with foreign keys to the dimension tables that give those measurements context.
What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only textual attributes.
What is a lookup table?
A lookup table is exactly what its name says: it supplies values to a referencing table. It is used at run time, and it saves joins and space in transformations. For example, a lookup table called STATES can provide the actual state name ('Texas') in place of the code 'TX' in the output.
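The states example might look like this (a sketch with invented names):

    CREATE TABLE lkp_states (
        state_code VARCHAR(2) PRIMARY KEY,
        state_name VARCHAR(50)
    );

    INSERT INTO lkp_states VALUES ('TX', 'Texas');

    -- Resolve the code to the full name at run time
    SELECT c.customer_name, s.state_name
      FROM customers c
      JOIN lkp_states s ON s.state_code = c.state_code;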
What is real time data-warehousing?
In real-time data warehousing, your warehouse contains completely up-to-date data and is
synchronized with the source systems that provide the source data. In near-real-time data
warehousing, there is a minimal delay between source data being generated and being available
in the data warehouse. Therefore, if you want to achieve real-time or near-real-time updates to
your data warehouse, you’ll need to do three things:
1. Reduce or eliminate the time taken to get new and changed data out of your source
systems.
2. Eliminate, or reduce as much as possible, the time required to cleanse, transform and
load your data.
3. Reduce as much as possible the time required to update your aggregates.
Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time and near-real-time data warehousing.
These features include:
+ Change Data Capture
+ External tables, table functions, pipelining, and the MERGE command, and
+ Fast refresh materialized views
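Of these, the MERGE command is the one most directly visible in load code: it applies inserts and updates from a staging table in a single pass. A sketch with invented table names:

    -- Upsert changed source rows into the warehouse dimension in one pass
    MERGE INTO dim_customer d
    USING stg_customers s
       ON (d.cust_id = s.customer_id)
    WHEN MATCHED THEN
        UPDATE SET d.cust_name = s.customer_name
    WHEN NOT MATCHED THEN
        INSERT (cust_key, cust_id, cust_name)
        VALUES (seq_customer.NEXTVAL, s.customer_id, s.customer_name);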
What is data mining?
Data mining is a process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.
What are Normalization, First Normal Form, Second Normal Form, and Third Normal Form?
Normalization: the process of decomposing tables to eliminate data redundancy is called normalization.
1NF: the table should contain only scalar or atomic values.
2NF: the table should be in 1NF, with no partial functional dependencies.
3NF: the table should be in 2NF, with no transitive dependencies.
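As a tiny illustration (an invented orders table), reaching 3NF means splitting out a transitively dependent attribute such as the customer's city:

    -- Before: orders(order_id, customer_id, customer_city)
    -- customer_city depends on customer_id, not on the key order_id

    -- After: each non-key column depends only on its own table's key
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_city VARCHAR(50)
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id)
    );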
What type of Indexing mechanism do we need to use for a typical data warehouse?
Space requirements for indexes in a warehouse are often significantly larger than the space
needed to store the data, especially for the fact table and particularly if the indexes are B*trees.
Hence, you may want to keep indexing on the fact table to a minimum. Typically, you may have
one or two concatenated B*tree indexes on the fact table; however, most of your indexes should
be bitmap indexes. Bitmap indexes on the foreign key columns on the fact table are often useful
for star query transformation.
The maintenance window will also dictate whether you use partitioned indexes, which can be
faster and easier to maintain.
Bitmap indexes also take up much less space than B*tree indexes and so should be preferred.
On the other hand, dimension tables are much smaller compared with the fact table and could be
indexed much more extensively. Any column of the dimension table that is frequently used in
selections or is a level in a dimension object is a good candidate for indexing.
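In Oracle-flavored SQL that strategy might look like the following (table names follow the earlier star schema sketch):

    -- Bitmap indexes on the fact table's foreign keys; the optimizer
    -- can combine these for star query transformation
    CREATE BITMAP INDEX fact_sales_date_bix    ON fact_sales (date_key);
    CREATE BITMAP INDEX fact_sales_product_bix ON fact_sales (product_key);

    -- Ordinary B*tree index on a frequently filtered dimension column
    CREATE INDEX dim_product_cat_ix ON dim_product (category);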
Which columns go to the fact table and which columns go to the dimension table?
Aggregated or calculated value columns go to the fact table, and descriptive detail information goes to the dimension table.
What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means how much detail you are willing to keep for each transactional fact: for instance, whether you record product sales per individual transaction, or aggregate them up to the minute and store that data.
What are non-additive facts?
A fact table typically has two types of columns: those that contain numeric facts (often called
measurements), and those that are foreign keys to dimension tables.
A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that
contain aggregated facts are often called summary tables. A fact table usually contains facts with
the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can
be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive
facts cannot be added at all.
An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels: they can be summed across warehouses for a given day, but summing them through time is meaningless.
What are Semi-additive and factless facts and in which scenario will you use such kinds of
fact tables?
Semi-additive: semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for others. For example, suppose Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin, by contrast, is non-additive: it cannot meaningfully be summed along any dimension. Factless fact tables, which contain no measures at all, are covered in a separate question below.
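Returning to the semi-additive example, in query terms the distinction looks roughly like this (assuming an invented fact_balances table keyed by date_key and account_key):

    -- Valid: total balance across all accounts on a single day
    SELECT SUM(current_balance)
      FROM fact_balances
     WHERE date_key = 20130621;

    -- Summing over time is meaningless; take a snapshot instead,
    -- e.g. each account's month-end balance
    SELECT account_key, current_balance
      FROM fact_balances
     WHERE date_key = 20130630;   -- month-end snapshot, not a SUM over days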
How do you load the time dimension?
Every data warehouse maintains a time dimension. It is kept at the most granular level at which the business runs (e.g., week, day of the month, and so on). Depending on the data loads, the time dimension is updated: a weekly process updates it every week, a monthly process every month.
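The time dimension is usually populated from a generated calendar rather than from source data. A hedged Oracle-flavored sketch using a row generator (column names follow the earlier dim_date sketch):

    -- Generate one row per day for 2013 and derive calendar attributes
    INSERT INTO dim_date (date_key, calendar_date, month_name, year_no)
    SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),
           d,
           TO_CHAR(d, 'Month'),
           EXTRACT(YEAR FROM d)
      FROM (SELECT DATE '2013-01-01' + LEVEL - 1 AS d
              FROM dual
            CONNECT BY LEVEL <= 365);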
Why are OLTP database designs not generally a good idea for a Data Warehouse?
OLTP systems are not designed to store historical information about the organization. They are used for storing the details of daily transactions, while a data warehouse is a huge store of historical information obtained from different data marts, used for making intelligent decisions about the organization.
Why should you put your data warehouse on a different system than your OLTP system?
A data warehouse is a part of OLAP (On-Line Analytical Processing). It is the source from which BI tools fetch data for analytical, reporting or data mining purposes. It generally contains data spanning the whole life cycle of the company/product. A DWH contains historical, integrated, denormalized, subject-oriented data.
The OLTP system, on the other hand, contains data that is generally limited to the last couple of months, or a year at most. The nature of the data in OLTP is current, volatile and highly normalized. Since the two systems differ in nature and functionality, we should always keep them on different systems.
What are the differences between the Snowflake and Star Schema? In what situations is the Snowflake Schema better than the Star Schema, and when is the opposite true?
Star schema
Contains the dimension tables mapped around one or more fact tables.
It is a denormalized model.
No need to use complicated joins.
Queries return results fast.
Snowflake schema
It is the normalized form of the star schema.
Contains deeper joins, because the tables are split into many pieces. We can easily make modifications directly in the tables.
We have to use complicated joins, since we have more tables.
There will be some delay in processing the query.
What is the main difference between schema in RDBMS and schemas in Data Warehouse?
RDBMS schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot easily solve extraction and complex analytical problems
* Poorly modeled for analysis
DWH schema
* Used for OLAP systems
* New-generation schema
* Denormalized
* Easy to understand and navigate
* Extraction and complex analytical problems can be easily solved
* Very good model for analysis
What is a degenerate dimension table?
In simple terms, it is a column in the fact table that does not map to any dimension, nor is it a measure column.
For example, Invoice_No and Invoice_Line_No in a fact table are degenerate dimension columns, provided you don't have a dimension called Invoice.
What are the possible data marts in Retail sales?
Product information
Store
Time
What is meant by metadata in context of a Data warehouse?
Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
Metadata synchronization is the process of consolidating, relating and synchronizing data elements with the same or similar meaning from different systems. It joins these differing elements together in the data warehouse to allow for easier access.
What is a surrogate key? Where do we use it? Explain.
A surrogate key is a substitute for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the
table. The only requirement for a surrogate primary key is that it is unique for each row in the
table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. It can be generated with an Informatica sequence generator, an Oracle sequence, or SQL Server identity values.
It is useful because the natural primary key (i.e. Customer Number in Customer table) can
change and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better. You could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
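Sketches of the usual generation mechanisms, with invented names (an Oracle sequence, and the SQL Server identity equivalent in comments):

    -- Oracle: a sequence supplies the surrogate key values
    CREATE SEQUENCE seq_airport START WITH 1 INCREMENT BY 1;

    INSERT INTO dim_airport (airport_id, airport_name)
    VALUES (seq_airport.NEXTVAL, 'Heathrow');

    -- SQL Server equivalent: an IDENTITY column
    -- CREATE TABLE dim_airport (
    --     airport_id   INT IDENTITY(1,1) PRIMARY KEY,
    --     airport_name VARCHAR(100)
    -- );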
Another benefit you can get from surrogate keys (SID) is:
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, employee 'E1' belongs to business unit 'BU1' (that's what would be in your Employee dimension). This employee has a turnover allocated to him on business unit 'BU1'. But on the 2nd of June the employee 'E1' is moved from business unit 'BU1' to business unit 'BU2'. All the new turnover has to belong to the new business unit 'BU2', but the old turnover should belong to business unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data warehouse
everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record for the Employee
'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the SID of the
Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' +
'BU2.'
You could consider a slowly changing dimension as an enlargement of your natural key: the natural key of the employee was employee code 'E1', but for you it becomes employee code + business unit, 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural key enlargement process is that you might not have every part of your new key within your fact table, so you might not be able to join on the enlarged key; hence you need another id.
What is Data warehousing Hierarchy?
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy
might aggregate data from the month level to the quarter level to the year level. A hierarchy can
also be used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data values
at lower levels aggregate into the data values at higher levels. A dimension can be composed of
more than one hierarchy. For example, in the product dimension, there might be two hierarchies--
one for product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to
enable you to drill down into your data to view different levels of granularity. This is one of the key
benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures. For
example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level value, a value at
the next higher level is its parent, and values at the next lower level are its children. These familial
relationships enable analysts to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension might have a
hierarchy that represents data at the month, quarter, and year levels. Levels range from general
to specific, with the root level as the highest or most general level. The levels in a dimension are
organized into one or more hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root) to most
specific information. They define the parent-child relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can roll an existing quarterly sales revenue aggregation up to a yearly aggregation when the dimensional dependencies between quarter and year are known.
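In Oracle, such level relationships can be declared explicitly so the optimizer can use them for query rewrite. A hedged sketch, assuming a times table with the listed columns:

    -- Declare day -> month -> quarter -> year so the database knows
    -- quarterly aggregates can be rolled up to the year level
    CREATE DIMENSION times_dim
        LEVEL day     IS times.day_id
        LEVEL month   IS times.month_id
        LEVEL quarter IS times.quarter_id
        LEVEL year    IS times.year_id
        HIERARCHY cal_rollup (
            day CHILD OF month CHILD OF quarter CHILD OF year
        );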
What is a conformed fact?
Conformed facts are facts that have the same definition, units and grain in every fact table (or data mart) in which they appear, so that they can be compared and combined across schemas. More generally, facts and dimensions are said to be "conformed" when they mean the same thing everywhere they are used.
What is the need for a surrogate key? Why is the primary key not used as the surrogate key?
A surrogate key is a system-generated unique number: an identity column in SQL Server, a sequence in Oracle, a sequence generator in Informatica.
For example, you are extracting customer information from an OLTP source and, after the ETL process, loading it into a dimension table in the DW. With SCD Type 1, yes, you can use the source CustomerID primary key as the primary key in the dimension table. But if you would like to preserve the history of the customer in the dimension table (i.e., Type 2), then you need another unique number apart from CustomerID. That is where you have to use a surrogate key.
Another reason: if CustomerID is alphanumeric, you have to use a surrogate key in the dimension table. It is advisable to have a system-generated small integer as the surrogate key in the dimension table, so that indexing and retrieval are much faster.
What is meant by Aggregate Fact table?
A fact table holding aggregated calculations such as SUM, AVG, or SUM(sal) + SUM(comm) is an aggregate fact table.
What is the purpose of "Factless Fact Table"? How it is involved in Many to many
relationship?
Factless fact contains only the key fields and any related dimensional bridge attributes. Doesn't
contain actual measure fields that are used for aggregates or summary. Factless facts can be
used as a bridge tables or to store measures that are used directly without aggregation
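For example, recording student attendance, an event with no natural measure, yields a factless fact table whose "measures" come from counting rows. A sketch with invented names:

    -- Factless fact: the existence of a row is itself the fact
    CREATE TABLE fact_attendance (
        date_key    INTEGER,
        student_key INTEGER,
        course_key  INTEGER
    );

    -- The "measure" is derived by counting events
    SELECT course_key, COUNT(*) AS attendance_count
      FROM fact_attendance
     GROUP BY course_key;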
What are late arriving facts and late arriving dimensions? How do they impact the DW?
Late arriving fact table:
This rarely happens in practice.
For example, an HDFC credit card transaction happened on 25th Mar 2005, but we received the record on 14th Aug 2007. During this period there is a possibility that the dimensional data changed.
There are two tables: a transaction fact table and a customer dimension table.
The transaction fact table has two time stamps: 1. transaction date, 2. load date.
The customer dimension table has two time stamps: 1. enter date (when the customer provided the details), 2. load date.
In the case of late arriving fact records we have to get the proper key from the dimension tables, as these tables maintain the history.
In the above scenario we have to extract the right key from the customer dimension table for the particular customer: the dimension row whose enter date is the latest one on or before the transaction date.
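A sketch of that key lookup, assuming the dimension keeps one row per version with customer_id, customer_key and enter_date, and Oracle-style bind variables:

    -- Pick the dimension version in effect on the transaction date
    SELECT d.customer_key
      FROM dim_customer d
     WHERE d.customer_id = :txn_customer_id
       AND d.enter_date = (SELECT MAX(d2.enter_date)
                             FROM dim_customer d2
                            WHERE d2.customer_id = :txn_customer_id
                              AND d2.enter_date <= :transaction_date);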
Late arriving dimension table:
This is very common practice in consumer finance projects.
Sometimes we get the transactional data without the dimensional data. In this situation we follow one of two methods:
1. Create one dummy record in the dimension table and join it with the fact table. Once we receive the dimensional data, the dummy record in the dimension table is updated using SCD Type 1.
2. Place the transactional data in a staging table until we receive the dimension records. A re-keying process is then used to get the key from the dimension table.
Re-keying process: generally, in the ETL process, we first load the dimension records and then go for the facts. At the time of loading fact records, keying is done against the dimension tables. The unkeyed records are kept in a staging table and the remaining data is stored in the target table. The next day, at the time of the keying process, the unkeyed data is keyed as well; this is called re-keying. The re-keying process continues until we get the key from the dimension tables.
Explain the situations where snowflake is better than star schema
A snowflake schema is a way to handle problems that do not fit within the star schema. It
consists of outrigger tables which relate to dimensions rather than to the fact table.
The amount of space taken up by dimensions is so small compared to the space required for a fact table as to be insignificant. Therefore, tablespace or disk space is not considered a reason to create a snowflake schema.
The main reason for creating a snowflake is to make it simpler and faster for a report writer to
create drop down boxes. Rather than having to write a select distinct statement, they can simply
select * from the code table.
Junk dimensions and mini dimensions are another reason to add outriggers. A junk dimension contains data from a normal dimension that you wish to separate out, such as fields that change quickly. Updates can be so slow that they add hours to the load process. With a junk dimension, it is possible to drop and add records rather than update them.
Mini dimensions contain data that is so dissimilar between two or more source systems that it would cause a very sparse main dimension. The conformed data that can be obtained from all
source systems is contained in the parent dimension and the data from each source system that
does not match is contained in the child dimension.
Finally, if you are unlucky enough to have end users actually adding or updating data in the data warehouse, rather than relying on batch loads alone, it may be necessary to add these outriggers to maintain referential integrity in the data being loaded.
