Question and Answer

Tom Kyte

Thanks for the question, gaurang.

Asked: July 13, 2002 - 5:48 pm UTC

Last updated: January 04, 2013 - 1:34 pm UTC

Version: 8.1.7


You Asked

Hi Tom,

Can you explain the STAR schema, data warehouse, and dimension concepts in a few words? There are books available on OTN, but before I start reading them, please explain the terminology I mention above.

and Tom said...

http://docs.oracle.com/cd/B10501_01/server.920/a96520/logical.htm#97642

does that exactly... with a picture and everything ;)




Comments

Advantage of Snowflaking

Sagar, July 10, 2003 - 2:30 am UTC

Hi Tom,
I was asked at an interview about the advantages of snowflaking. Other than the normalized structure and space savings, I could not come up with an answer. Please let me know if there are other advantages of snowflaking.
NOTE: Even Ralph Kimball and Oracle docs recommend NOT using snowflaking

Tom Kyte
July 10, 2003 - 8:12 am UTC

well, in the city when snow falls -- before the cars start moving, it looks really nice.

I'll let others comment on this, star yes, snowflakes, ouch. not only hard to query by end users, but if you create a view to ease it, you'll wait years for the answers.

Dave, July 10, 2003 - 9:24 am UTC

Snowflaking is generally needed to support that type of aggregation table where you are aggregating to a higher level in a dimension.

e.g., the fact table is at the level of "postal_code", and there is an aggregation level at the "city" level. "City" might be snowflaked out from the "postal_code" table, although you generally benefit from having all the attributes of city and its parent data elements in the postal_code table as well (redundant data).
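
A minimal sketch of what that snowflaked geography could look like (all table and column names here are made up for illustration); in the star alternative Tom describes next, the CITY attributes would simply be carried redundantly in the POSTAL_CODE dimension:

create table city_dim
( city_id    number primary key,
  city_name  varchar2(60),
  region     varchar2(60)
);

create table postal_code_dim
( postal_code_id  number primary key,
  postal_code     varchar2(10),
  city_id         number references city_dim   -- the snowflaked link out to CITY
  -- a pure star would instead repeat city_name/region right here
);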

Tom Kyte
July 10, 2003 - 10:21 am UTC

and I would just roll that redundant data down into the dimension, resulting in a star again.

Star, snowflake, 9i

Dušan, July 11, 2003 - 9:56 am UTC

Tom,

Q: Where and how do 9i bitmap join indexes fit in this story? Can they be used to improve performance of a snowflake, if someone decides to ignore all the warnings about such a design?

Q: Imagine a requirement to design a dimension which is for some queries fully historical, i.e. SCD Type 2 (as defined by Kimball) and for other queries it should just show the current state of a dimension record, just like SCD Type 1.
And then, someone might ask to see a mix of current and historical values of columns in another query.
What sort of hybrid approach would you use to tackle this requirement?
(Simple approach is to review and question such a requirement, surely.)

Many thanks

Now, the informal bit:
--------------------------
A simple test case:

select sum(pints_owed) from all_uk_readers;
*
ERROR at line 1:
ORA-01426: overflow

To me, it looks like this test case shows you should visit the UK more often. What do you reckon?

Tom Kyte
July 11, 2003 - 11:06 am UTC

q) i haven't tested it enough myself (bitmaps+snowflake+star transformation). so much to do, so little time to do it all...

q) might two dimensions -- one with historical values and the other with current values repeated in place of history -- do it? or even one dimension where you have a "current_value" column paired with each "historical_value" column and your load routines maintain the "current_value" column as they add new entries?
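
A rough sketch of that second idea (the table and column names are hypothetical, just to illustrate): each historical attribute is paired with a current-value column that the load routine refreshes whenever it adds a new Type 2 row.

create table product_dim
( product_key    number primary key,   -- surrogate key; a new row per change (Type 2)
  product_id     number,               -- natural key from the source system
  valid_from     date,
  valid_to       date,
  category_hist  varchar2(50),         -- value as it was when this row was current
  category_curr  varchar2(50)          -- latest value, repeated on every historical row
);

-- the load inserts a new row when CATEGORY changes, then updates
-- CATEGORY_CURR on all existing rows for that PRODUCT_ID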



Bitmap joins

Peter, July 11, 2003 - 11:32 am UTC

Dušan

You could use bitmap joins to connect fact data to dimension attributes.

In our DW we have Product number as a FK on the fact tables. Say Product has Product number as the PK and product type as an attribute. The bitmap join could answer product type queries against fact tables.

I can't see a convincing case to use bm joins on just a reference hierarchy though.

Tom

When are you coming to the UK?

Tom Kyte
July 11, 2003 - 12:32 pm UTC

i'm coming to Denmark in january, no current plans to be in the UK anytime soon

What about Brazil ?

Joao Lemos Vasconcellos, July 11, 2003 - 4:59 pm UTC

By any chance, are you coming to South America, perhaps Brazil? We'd love to have you here! There is a big Oracle community that would love to see you. Btw, your book Oracle 101 is a big hit here in Portuguese, although there are a laughable lot of poorly translated terms that break the reading rhythm and the fluency of your agreeable way of writing.

Joao Vasconcellos
Oracle Brasil - SP

Tom Kyte
July 12, 2003 - 9:03 am UTC

Gaja wrote the Oracle 101 book -- I did "Expert One on One Oracle" :)

sorry, no current plans to visit Brazil

Any plans for Australia (Sydney)?

A reader, July 12, 2003 - 12:55 am UTC

Tom,
AUSOUG is planning to hold an Oracle conference in August/September 2003. Last year it was a big success. Why not visit this time? It would also be the perfect time for your new book launch in Australia!
Another fan of yours!

Tom Kyte
July 12, 2003 - 9:28 am UTC

isn't that winter there? brrr.....

Open Invitation

reaz, July 12, 2003 - 5:14 am UTC

First of all, I would like to invite you to visit Bangladesh, where the number of people interested in using Oracle is growing very fast.

Your visit would be a great contribution to the Oracle users & lovers of this distant country.

How can I find out more about the publications of Mr. Kimball?

thanks.
Reaz.

Hmmm... winter in Sydney....

A reader, July 12, 2003 - 6:06 pm UTC

Tom,
Sydney has some of the world's best weather -- no extreme side to any season.
"For many, the best time to visit Sydney is in the Australian spring. This covers the period from September 1 to November 30."
So, I don't think this will be a genuine excuse...

Level Attribute in Star Schema

A reader, August 06, 2003 - 4:13 pm UTC

Hi Tom,

We are in the process of designing a DW. We asked a consultant to help us with the design.
Just to make it easy, I'll give you a small example of what I've done so far.

I have one fact table, which has 3 dimensions, and each dimension has 4 levels -- for example, the Time dimension (Year-Month-Week-Day). I had a Level column in all of the dimensions, because I wanted to pre-calculate all sales, e.g. for the first week of the first month of the year 93, via the ETL process during the night, so that when a query asks "show me the total sales for the first week of the first month of the year 93", we just go to the Time dimension, find the time_key, join it to the fact table, and find the total sales.

Our consultant insists that we don't need a level column in any of the dimensions! But I've read many times that a star schema needs to have a level column. Could you please let me know your opinion? Is it right that we need a level column in the dimensions?

Another thing the consultant mentioned is that we are not going to pre-calculate; we will keep all the detail (daily sales) in the fact table and then use GROUP BY for those kinds of queries. And in case that is slow, we create another fact table (I would assume a summary table) to keep the aggregate data.

Isn't that a poor design?

Could you please, let me know your valuable idea?

Thanks millions of tons,
Arash






Tom Kyte
August 09, 2003 - 12:03 pm UTC

not sure I follow you on the "level" column. A typical dimension might be:

DAY -> WEEK -> QUARTER -> FY

for example?

You use materialized views in an Oracle data warehouse like you might use indexes in OLTP. You would store the details and, in the event of tuning a frequently asked question, you would create materialized views (precomputed summaries) that the optimizer would use for you (like an index). You don't want to set up yet another star.
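
For example, a precomputed weekly summary (a sketch re-using the FACT_TABLE/TIME_DIM names from the example above, and assuming query rewrite is enabled for the session) lets the optimizer answer weekly questions from the summary instead of the detail, much as it would use an index:

create materialized view sales_by_week_mv
  build immediate
  refresh complete on demand
  enable query rewrite
as
select t.year, t.month, t.week, sum(f.total_sales) total_sales
  from fact_table f, time_dim t
 where f.time_key = t.time_key
 group by t.year, t.month, t.week;

-- a query that joins FACT_TABLE to TIME_DIM and groups by week can now be
-- rewritten against SALES_BY_WEEK_MV transparently by the optimizer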

A reader, August 12, 2003 - 12:08 pm UTC

Thanks Tom.

If I understood correctly, the column that shows the 'LEVEL' is logical. I mean, we don't need to have a column in our physical structure (the dimension) that represents 'LEVEL'. For example, like the following:

year number,
month number,
quarter number,
week number,
day number,
LEVEL number (We don't need this column!) Am I right?

Regarding the summary data, I got your point. But in the case that I don't need detail in the DW, isn't it better that I just save summary data in the fact table?

Thanks again for your valuable information.




Tom Kyte
August 12, 2003 - 12:49 pm UTC

i don't know what purpose level would provide here, so I cannot really comment (I can say that LEVEL is a pseudo column in our SQL, so that name would be a bad choice)


of course, you should make your facts as aggregated as you can. I just would not replicate star schemas at differing levels of detail for the same data -- MV's are for that.

Any books that you recomend ..

A reader, August 12, 2003 - 1:52 pm UTC

Hi Tom,
Do you recommend any books other than the Oracle documentation to learn about data warehousing concepts and design?

Thanks in advance,
another fan of yours


Tom Kyte
August 12, 2003 - 2:19 pm UTC

the stuff by Ralph Kimball is fairly well respected.

Coming back to you visiting UK!!!

A reader, August 13, 2003 - 11:09 am UTC

Hi Tom,

While browsing through the Oracle World bookmarks, somewhere I read about Oracle World in Paris from Oct 20 - Oct 23. Guess this is a good reason for you to come to Paris, and your UK/Europe fans will have a chance of meeting you ....

Regards,
Yet another person learning Oracle tips/trick from you!!!

create dimension

bob, August 13, 2003 - 2:43 pm UTC

shouldn't you be using Oracle dimensions, not your own version?
do your dw designs make use of Oracle-defined dimensions, Tom?
I don't think I have ever seen them mentioned here, although if I search I am sure I will.

from the dw guide.

You create a dimension using either the CREATE DIMENSION statement or the Dimension Wizard in Oracle Enterprise Manager. Within the CREATE DIMENSION statement, use the LEVEL clause to identify the names of the dimension levels.

Tom Kyte
August 13, 2003 - 2:55 pm UTC

i think we are confusing terms here -- they seemed to be using level as a column in a dimension table

you WOULD use a create dimension statement to help query rewrite and MV's
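
For reference, a CREATE DIMENSION over the TIME_DIM table discussed above might look roughly like this (a sketch, assuming each column functionally determines the one above it in the rollup):

create dimension times_dim
  level day   is time_dim.day
  level week  is time_dim.week
  level month is time_dim.month
  level year  is time_dim.year
  hierarchy calendar_rollup (
    day child of week child of month child of year
  );

-- this declares the rollup relationships to the optimizer so that query
-- rewrite can use an MV at one level to answer queries at a higher level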

The usage of LEVEL in dimension

A reader, August 19, 2003 - 7:09 pm UTC

Hi Tom,

I am back again -- I didn't have access to the Internet (I was out of town).

You are right, I want to have a column that represents the level in the dimension. Let me ask my question another way.

The End Users want to query:
1- Total sales in a day
2- Total sales in each week of a specific month in a year

Please correct me if I am wrong! To support these queries, I'll create one Time dimension and one fact table as follows:

Create table Time_Dim
(
time_key number,   -- populated from a sequence
Year number,
month number,
week number,
day number,
level number
);

insert into time_dim(time_key,year,level)
values(1,2001,1);

insert into time_dim(time_key,year,month,level)
values (2,2001,1,2);
insert into time_dim(time_key,year,month,level)
values (3,2001,2,2);
insert into time_dim(time_key,year,month,level)
values (4,2001,3,2);

insert into time_dim(time_key,year,month,week,level)
values (5,2001,1,1,3);
insert into time_dim(time_key,year,month,week,level)
values (6,2001,1,2,3);
insert into time_dim(time_key,year,month,week,level)
values (7,2001,1,3,3);
insert into time_dim(time_key,year,month,week,level)
values (8,2001,1,4,3);

insert into time_dim(time_key,year,month,week,day,level)
values (9,2001,1,1,1,4);
insert into time_dim(time_key,year,month,week,day,level)
values (10,2001,1,1,2,4);


Now, the fact table:

Create table fact_table
( fact_key number,   -- populated from a sequence
time_key number,
total_sales number
);

Now, the ETL can calculate and insert the total sales for, e.g., the second week of January 2001, and also for each day. For example:

insert into fact (fact_key,time_key,total_sale)
values (1, 6 , 100000);
insert into fact (fact_key,time_key,total_sale)
values (2, 9 , 2000);
insert into fact (fact_key,time_key,total_sale)
values (3, 10 , 5000);

Now, if the query were 'give me the total sales of each week in January 2001', it would look like this:

Select time_dim.week , total_sales
From fact_table,time_dim
Where fact_table.time_key = time_dim.time_key
AND time_dim.year = 2001
AND time_dim.month = 1
AND time_dim.level = 3
/

As you can see, the 'LEVEL' helped me to:
1- Be able to save aggregate data in the fact table
2- Be able to retrieve it correctly and easily

I've read in a white paper that without 'LEVEL' in the dimension, it's impossible to support these kinds of queries unless you create a snowflake schema!

Now my questions are:
1- Is it true that we need 'Level' in the dimension?
2- Is this the only and the best way to support these kinds of queries? (I mean the design)
3- Is it ok to save aggregate data and daily data in one fact table?

Hope I could explain it better this time. I really don't know how to thank you! Just thanks a million for your time.
Arash



Tom Kyte
August 19, 2003 - 9:19 pm UTC

‘give me the total sales of each week in January
2001’

Create table Time_Dim
(
time_key number,   -- populated from a sequence
Year date,
month date,
week date,
day date
);

Select time_dim.week , total_sales
From fact_table,time_dim
Where fact_table.time_key = time_dim.time_key
AND time_dim.month = to_date( '01-jan-2001' )
/


seems easier to me?

A reader, August 20, 2003 - 12:37 pm UTC

Hi Tom,

Correct me if I am wrong again. The following query

Select time_dim.week , total_sales
From fact_table,time_dim
Where fact_table.time_key = time_dim.time_key
AND time_dim.month = to_date( '01-jan-2001' )
/

retrieves all records for 'Jan 2001' -- I mean daily and weekly data! Because in Time_Dim I inserted one row for each day of each month, and also one row for each week of each month, and one row for each month of each year.

If I define all the attributes as DATE, I don't know what to insert for year, month, and week. Probably:
- The first day of each year represents 'YEAR', e.g. '01-jan-2001' represents the year 2001. I mean, if I want to insert the total sales for the year 2001 into the fact table, I should

Select time_key
From time_dim
where time_dim.year = to_date( '01-jan-2001' )

and then insert into fact_table values (time_key, total_sales)

- The first day of each month of each year represents that month in that year, e.g. the value 01-AUG-2001 in the month column represents August 2001
- Don't know about week!
- Day is clear

I don't know how to write the query with your design for, say, "give me the total sales for the second week of January 2001". Because, as I told you, I want to calculate the total sales for 'the second week of January 2001' via the ETL process during the night and just insert one row into the fact table.

In general, what does the data look like in the time_dim table?

Sorry if I confused you.
Thanks in advance.




Tom Kyte
August 21, 2003 - 4:57 pm UTC

so you want details and aggregates?

if you are fetching the details -- might as well just group by rollup? you don't need to compute the aggregates ahead of time?
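
A minimal sketch of that, using the DATE-based Time_Dim from above: keep only the daily facts, and let ROLLUP produce the weekly subtotals and the monthly total in the same pass (rows where WEEK is null carry the monthly total):

select t.month, t.week, sum(f.total_sales) total_sales
  from fact_table f, time_dim t
 where f.time_key = t.time_key
   and t.month = to_date('01-jan-2001','dd-mon-yyyy')
 group by t.month, rollup(t.week);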

Where are you today??

DD, August 20, 2003 - 12:40 pm UTC

it's 20-AUG-2003. don't see any updates on this page. hope to see you soon.



Tom Kyte
August 21, 2003 - 4:59 pm UTC

geez...

i had to get up at 4:30 that morning...
fly to north carolina...
speak all day long...
then fly back home....

i was really tired. really really tired.

asktom is just a hobby :)

A reader, August 22, 2003 - 1:30 am UTC

Thanks Tom.
Arash


star and snowflake

A reader, November 24, 2003 - 4:56 pm UTC

Hi

I have read several data warehouse books: Wiley's The Data Warehouse Toolkit, 2nd edition, by Kimball, and Essential Oracle 8i Data Warehousing by Gary Dodge/Tim Gorman. I am worried because these two books contradict each other in how they differentiate the star and snowflake schemas.

According to Kimball, a snowflake is a star schema with the dimensions normalized.

According to Gary Dodge/Tim Gorman, a snowflake is a denormalized star schema.

Confusing, isn't it??? According to the Oracle documentation, a snowflake is a star schema with the dimensions normalized.

Who is correct?!?!?!

Tom Kyte
November 24, 2003 - 6:22 pm UTC

did they both have the same pictures? the name isn't really relevant -- the implementation is. I bet they both describe the same thing.

I've had people call implementations normalized whilst others called them denormalized at the same time. At the end of the day -- a rose is a rose by any name.

star query and FKs

A reader, November 25, 2003 - 11:15 am UTC

Hi

Does Oracle need the relationships between fact tables and dimension tables (primary keys, foreign keys, etc.) in order to perform a star query?

I think so?

Tom Kyte
November 25, 2003 - 1:28 pm UTC

it should -- it certainly "makes sense". there would not be any reason why you would not have the primary/foreign keys there.

see
http://www.dbazine.com/jlewis6.html

nice article on them

relationships in DWH

A reader, November 25, 2003 - 1:46 pm UTC

Hi

I always thought relationships are redundant in a DWH because the data coming from the operational database *MUST* always be valid....

Am I wrong?

Tom Kyte
November 25, 2003 - 2:05 pm UTC

there would not be an entire cottage industry built around ETL and cleansing tools if what you said were true :)

the data should be valid.

it rarely is.

but in any case, the relationships are not redundant; they are metadata used by the optimizer to more efficiently optimize your queries -- in addition to validation.
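
In a warehouse where the ETL has already cleansed the data, a common compromise (a sketch only, using a hypothetical CUST dimension referenced by a FACT table) is to declare the keys but mark them RELY, so the optimizer can use the relationship without paying the cost of enforcing it:

alter table cust modify primary key rely;

alter table fact add constraint fact_cust_fk
  foreign key (cust_id) references cust
  rely disable novalidate;

-- with QUERY_REWRITE_INTEGRITY set to TRUSTED, query rewrite and related
-- optimizations can make use of this unenforced, trusted relationship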

question about modifying datawarehouse model

A reader, November 25, 2003 - 2:38 pm UTC

Hello

Recently our company merged with two other companies. We must now change our data warehouse data model to suit this merger.

Our data warehouse design is very simple: one sales fact table, one customer table, one product table and one time table. After the merger, one customer id can appear under three companies (the customers share the same code across the three companies), so basically we now need to add a company table to the data warehouse.

The question is: should this company table be a parent table of the customer table in the data warehouse? Or should it be a new dimension on the fact (sales) table?

The first approach, making the company table a parent of the customer table, would change our model from a star to a snowflake. This controls the data integrity.

The second approach, making the company table a dimension of the sales table, could compromise the data integrity. There might be cases where customers who are only customers of company_1 also appear in the sales table as customers of company_2.

I was thinking of approach 2, since we also assumed the data should always be valid (it comes from SAP), but hmm, after reading your last reply that we shouldn't assume the data is always valid, this puts me in doubt.....

Tom Kyte
November 25, 2003 - 3:27 pm UTC

why would you have multiple company tables? you are one company now?

we are a group

A reader, November 25, 2003 - 3:33 pm UTC

Hello

We are under the same group but are different companies. Each company keeps its brands.

That is why the new customer PK would be the combination of customer_id and company_id....

Can't argue with management about these issues (making one company instead of 3)


Tom Kyte
November 25, 2003 - 3:56 pm UTC

add a company column to the company table?


your customers won't "span" companies? (i would HOPE they would). if they do, you are heading for a big mess.

ah, I think you misunderstood

A reader, November 25, 2003 - 4:03 pm UTC

Hello

I think you misunderstood -- we have a company table which has three rows, one for each company. My question was about these two approaches:

=========================
The first approach, making the company table a parent of the customer table, would change our model from a star to a snowflake. This controls the data integrity.

The second approach, making the company table a dimension of the sales table, could compromise the data integrity. There might be cases where customers who are only customers of company_1 also appear in the sales table as customers of company_2.


Which approach would you use?

Tom Kyte
November 25, 2003 - 4:24 pm UTC

to me -- a customer is a customer is a customer

they would not be in there three times if they were a customer of all three 'companies'

I'm saying "don't duplicate the customer data -- truly bad idea, it'll be 100% a mess"

Oracle DBA Guide to Data Warehousing and Star Schemas

Ferenc Palyi, February 04, 2004 - 9:00 am UTC

I have heard about the book in the title, visited amazon.com, and read reviews saying it is the best book on star schemas ever. So I decided to buy it.

I have serious concerns that buying the book wasn't the best investment of my life.

The book describes the four generations of star optimization.
The third is the star transformation and the fourth is the introduction of bitmap join indexes.

The book comes up with an example:
You can create a bitmap index

create bitmap index pos_day_b1 on pos_day (period_id)

or you can create a bitmap join index

create bitmap index pos_day_bj1 on pos_day (per.period_id)
from pos_day pos, period per
where pos.period_id = per.period_id

per.period_id is the primary key of the period dimension, pos.period_id is the foreign key in the fact table.

Do you know any reason to create a bitmap join index like that? Its structure will be exactly the same as the structure of the simple bitmap index, since the values in the foreign key are the same values as in the primary key.

The author made some tests with this bitmap join index and came to the conclusion that life is better without bitmap join indexes.

I think if he had created the join index on one of the period dimension's higher hierarchy levels, that would be a different story.

It's pretty funny. What do you think? Have you read the book?

Tom Kyte
February 04, 2004 - 9:23 am UTC

bitmap join indexes are a way to denormalize without denormalizing.

consider the emp/dept tables..

people ask "how many people work in RESEARCH?".  "Who in SALES makes more than 5000$?"

Note that they do not ask "how many people work in deptno=10", "who in deptno=20 makes more than 5000$"

so, do you move DNAME into EMP?  No, you can create a bitmap join index.  consider:


ops$tkyte@ORA10G> create table emp as select * from scott.emp;
Table created.
 
ops$tkyte@ORA10G> create table dept as select * from scott.dept;
Table created.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> alter table dept add constraint dept_pk primary key(deptno);
 
Table altered.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> set echo off
Enter to continue
 
=================================================================
 
so, here is how to create the bitmap index -- using a FROM
and WHERE clause
 
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> create bitmap index emp_bm_idx
  2  on emp( d.dname )
  3  from emp e, dept d
  4  where e.deptno = d.deptno
  5  /
 
Index created.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> set echo off
Enter to continue
 
=================================================================
 
of course, to use a bitmap index we must use the CBO, we would need
lots of data to get it to want to use the bitmap as well -- so we fake
it out.
 
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> exec dbms_stats.set_table_stats( user, 'EMP', numrows => 1000000, numblks => 300000 );
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G> exec dbms_stats.set_table_stats( user, 'DEPT', numrows => 100000, numblks => 30000 );
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> set echo off
Enter to continue
 
=================================================================
 
and now we just query...
 
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> set autotrace on
ops$tkyte@ORA10G> select count(*)
  2  from emp, dept
  3  where emp.deptno = dept.deptno
  4  and dept.dname = 'SALES'
  5  /
 
  COUNT(*)
----------
         6
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=13)
   1    0   SORT (AGGREGATE)
   2    1     BITMAP CONVERSION (COUNT) (Cost=1 Card=10000 Bytes=130000)
   3    2       BITMAP INDEX (SINGLE VALUE) OF 'EMP_BM_IDX' (INDEX (BITMAP))
 
Note how neither EMP nor DEPT was actually accessed for this query...
 
 
ops$tkyte@ORA10G> select emp.*
  2  from emp, dept
  3  where emp.deptno = dept.deptno
  4  and dept.dname = 'SALES'
  5  /
 
     EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ------
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300     30
      7521 WARD       SALESMAN        7698 22-FEB-81       1250        500     30
      7654 MARTIN     SALESMAN        7698 28-SEP-81       1250       1400     30
      7698 BLAKE      MANAGER         7839 01-MAY-81       2850                30
      7844 TURNER     SALESMAN        7698 08-SEP-81       1500          0     30
      7900 JAMES      CLERK           7698 03-DEC-81        950                30
 
6 rows selected.
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=8712 Card=10000 Bytes=870000)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (TABLE) (Cost=8712 Card=10000 Bytes=870000)
   2    1     BITMAP CONVERSION (TO ROWIDS)
   3    2       BITMAP INDEX (SINGLE VALUE) OF 'EMP_BM_IDX' (INDEX (BITMAP))

and notice how DEPT is not accessed here either


bitmap join indexes could be used in some cases INSTEAD of a star transformation -- it is not that they "help it", they are an alternative in a way.


I have not read the book, no. 

datawarehousing link on new OTN doc site

A reader, February 04, 2004 - 12:31 pm UTC

great discussion of bitmap join indexes!

Alberto Dell'Era, February 04, 2004 - 12:36 pm UTC

Liked the 'denormalize without denormalizing' hint...

Querying trends

A reader, August 04, 2004 - 10:04 pm UTC

I have a dimension table like

create table cust (cust_id number primary key, cust_name varchar2(100));

create table day (day_id number primary key,
day date,week date, month date,year date);

And a sales fact table like

create table fact
(
cust_id number references cust,
day_id number references day,
sales number
);


How can I write a query that shows me the sales for the last 6 weeks (or months, or years) by customer? Would I have to join the fact table to itself 6 times?

This is a fairly typical question to ask of a data warehouse, right? Surely it can't be something very complicated!

Thanks

Tom Kyte
August 05, 2004 - 12:39 pm UTC

select cust_id, sales
  from fact
 where day_id in (select day_id
                    from day_table
                   where day >= sysdate-6*7 );

would be one approach (or join if you want the "date" in the output)

Actually -- i would say you would not factor DAY all of the way out and use day_id in this case; DAY as the date should be kept in the fact table, and that would be the primary key into the day_table dimension...
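
The join form Tom mentions (to get the date into the output) might look roughly like this, assuming the dimension is the DAY_TABLE used above:

select d.day, f.cust_id, f.sales
  from fact f, day_table d
 where f.day_id = d.day_id
   and d.day >= trunc(sysdate) - 6*7;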


Partitioning fact table

A reader, August 05, 2004 - 7:29 am UTC

Another question about these star thingies, in my fact table above

create table fact
(
cust_id number references cust,
day_id number references day,
sales number
);

This table can get really big, so I want to partition by date to get nice monthly partitions.

But my table above has the "day_id", not the actual date. The day_id is a surrogate key for my "time dimension" which has the actual dates.

So how can I do something like

partition by range(fact.day_id)
values less than 'Jan-2004' -- But the Jan 2004 is coming from joining the day_id with the DAY table

How is this typically done in data warehouses?

Thanks

Tom Kyte
August 05, 2004 - 12:58 pm UTC

perhaps you've "over normalized"? seems the DATE should be in there.

if you want to partition by the DATE, the DATE will need to be there to partition on.

A reader, August 05, 2004 - 1:56 pm UTC

select cust_id, sales
from fact
where day_id in (select day_id
from day_table
where day >= sysdate-6*7 );

No, I mean a result with 7 columns: the cust_id and the sales in each of the last 6 weeks.

?

Tom Kyte
August 05, 2004 - 2:28 pm UTC

so you want to pivot, simple pivot:

select cust_id,
       max(decode(day,trunc(sysdate-0),sales)) day1,
       max(decode(day,trunc(sysdate-1),sales)) day2,
       .....
       max(decode(day,trunc(sysdate-6),sales)) day7
  from fact, dim
 where <join>
   and day >= sysdate minus whatever....
 group by cust_id;


but typically the VISUALIZATION tool does the pivoting (discoverer for example or whatever reporting tool you use)

Regarding partitioning the fact table by day_id ...

Gabe, August 05, 2004 - 2:45 pm UTC

<quote>But my table above has the "day_id", not the actual date. The day_id is a surrogate key for my "time dimension" which has the actual dates.</quote>

day_id is the surrogate key from the "Day" dimension ... surrogate key values usually come from an Oracle sequence ... but it is not a law or anything!

The 'trick' I usually use is to generate the day_id primary key for a given day as to_char(<day>,'J') ... thus you have a predictable, deterministic (hence repeatable ... should you reload/add to the "Day" dimension) _relationship_ between a day and its surrogate PK. Use the Julian date number to define your range partitions on the fact table.
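
For example (illustrative only), the Julian-number key for a given day is just:

select to_char(date '2004-08-06','J') as day_key,
       to_char(date '2004-02-01','J') as feb_boundary
  from dual;

-- TO_CHAR(d,'J') always maps the same day to the same number; 01-Feb-2004
-- gives 2453037, the value that shows up as a partition bound later in this thread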



huh??

Gabe, August 05, 2004 - 3:33 pm UTC

To "A reader" who supplied the folowing:

<quote>
create table day (day_id number primary key,
day date,week date, month date,year date);
</quote>

So how exactly do you represent a week, month or year as a DATE? And how does whatever representations you use make it easy for users to query by them?


Tom Kyte
August 05, 2004 - 8:27 pm UTC

they simply truncate the day to their concept of the week, and the same with month, year, qtr, fiscal year and so on - a very common approach.
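
A sketch of that truncation approach for deriving the dimension values for a given day:

select trunc(sysdate)        as dy,     -- the day itself
       trunc(sysdate,'IW')   as wk,     -- start of the ISO week (Monday)
       trunc(sysdate,'MM')   as mth,    -- first day of the month
       trunc(sysdate,'Q')    as qtr,    -- first day of the quarter
       trunc(sysdate,'YYYY') as yr      -- first day of the year
  from dual;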

Dave Aldridge, August 05, 2004 - 7:03 pm UTC

<quote>The 'trick' I usually use is to generate the day_id primary key for a given day
as to_char(<day>,'J') ... thus you have a predictable, deterministic (hence
repeatable ... should you reload/add to the "Day" dimension) _relationship_
between a day and its surrogate PK. Use the Julian date number to define your
range partitions on the fact table.</quote>

I don't see how this is an improvement on just using the day in date format as the key to the date table, and including it in the fact table. It seems to be the use of a synthetic key just for the sake of using a synthetic key.

I think it is extremely unlikely that you would ever need the convenience of taking all the records that are identified as "01-Jan-2004" and changing them all to "02-Jan-2004", for example. Maybe there is a DW environment where such a capability is desirable, but I would personally just use the date itself as the key.

Tom Kyte
August 05, 2004 - 9:28 pm UTC

concur. seems to be a surrogate in search of a problem.

Dave Aldridge (again), August 05, 2004 - 7:06 pm UTC

... and in addition, if you did modify the date column in your proposed methodology, then you would have broken the deterministic relationship between date and date_id -- not that such a relationship seems to be good practice, anyway.

To Gabe

A reader, August 05, 2004 - 8:08 pm UTC

"So how exactly do you represent a week, month or year as a DATE? And how does whatever representations you use make it easy for users to query by them?"

My mistake -- week, month, year, etc. are not DATEs, they are simple VARCHAR2 fields.

Basically the point is to describe the various "attributes" of that day, such as "day in year", "week in year", "day in week", "day# in week", "quarter in year", etc.

This makes it easier for users, in a classic star schema model, to query by these dimensional attributes, i.e. show me all sales for the 1st quarter, or the 10th week in the year, or the first Monday of every month, etc.


To Dave

A reader, August 05, 2004 - 8:12 pm UTC

"I don't see how this is an improvement on just using the day in date format as the key to the date table, and including it in the fact table. It seems to be the use of a synthetic key just for the sake of using a synthetic key."

I agree that using the actual date itself in the FACT table makes sense, but as I just posted, there are many other descriptive "attributes" of the so-called "time dimension" that are extremely "queryable". Surely we can't put them all in the fact table! Then it wouldn't be a fact table. A FACT table, per the classic definition (read Kimball), is supposed to contain only a multipart PK (each part a FK to its own dimension table) and its own, preferably numeric, summable attributes, and that's it. Query filters are applied to the dimension table (DAY for this discussion).

Tom Kyte
August 05, 2004 - 9:34 pm UTC

then you will not be partitioning on them, since they are by definition *synthetic*.

you get what you get at the end of the day. you can partition by day_id -- but it won't be strictly "by month" or "by day" since day_id doesn't *mean* either of those.

cake -- eating it -- comes to mind. here, i would not have gone to the illogical extreme of turning the DAY field into a surrogate. in other cases -- maybe (maybe maybe maybe - really -- just maybe)





Partitioning

A reader, August 06, 2004 - 7:05 am UTC

"then you will not be partitioning in them since they are by definition *synthetic*"

OK, I see that having the concept of day_id as a surrogate key doesn't make sense.

But I can have the actual date field in my fact table, and that can be a FK to my DAY table where all the other attributes of that day are expanded out. Yes, some can be derived from the fact table's date itself, but I really don't want to be doing TRUNCs on the PK column since that would blow away the index.

Also, there are some attributes that cannot be derived simply from the date itself, like an 'is_holiday' flag, which I can store in the DAY dimension table.

Now, since my fact table has the actual date, I can partition by month like

partition by range (key_date)
partition jan2004 values less than to_date('31-jan-2004','mm-dd-yyy')
partition feb2004 values less than to_date('29-feb-2004','mm-dd-yyy')
...

Sounds good?

Thanks


Tom Kyte
August 06, 2004 - 8:32 am UTC

yes.

probably too long to read ...

Gabe, August 06, 2004 - 12:49 pm UTC

<quote>concur. seems to be a surrogate in search of a problem.</quote>

The Julian date number is one of the candidate keys for defining an entity representing a “Day” … it is not a surrogate. One can use the Julian date number as the PK or something like the number 20040806 or a varchar2(8) like ‘20040806’ … implementing the PK of the “Day” entity on a DATE column is, in itself, not enough (that is, there has to be some _extra_ stuff in there to enforce the integrity/definition of the “Day” table):

create table day (d date primary key);
insert into day (d) values (sysdate);
insert into day select (d+1/86400) from day;
select to_char(d,'DD MON YYYY') from day;

D
---------
06 AUG 2004
06 AUG 2004

The _extra_ could be:

create trigger trg_day before insert or update on day
for each row
begin
:new.d := trunc(:new.d);
end;
/

From an ER modeling perspective, an entity having a PK based solely on an Oracle DATE column can only represent one thing: a “Second”.

A DATE column is never a good choice in a DW environment as a dimension to query by … whether implicit or explicit, the conversion of strings to dates is just a way to ask for trouble (you better hope your ad-hoc query tool is sophisticated enough to generate SQL with the _correct_ formatting masks).

<quote>... and in addition, if you did modify the date column in your proposed methodology then you would have broken the deterministic rrelationship between date and date_id -- not that such a relationship seems to be good practice, anyway.</quote>

I won't pretend to understand what this is supposed to mean … the transformation of a DATE value with either to_number(to_char(<DATE>,'J')) or to_number(to_char(<DATE>,'YYYYMMDD')) has been, is, and will always be deterministic.

<quote>
partition by range (key_date)
partition jan2004 values less than to_date('31-jan-2004','mm-dd-yyy')
partition feb2004 values less than to_date('29-feb-2004','mm-dd-yyy')
...
Sounds good?

Yes
</quote>

Actually no (just look how easy it is for the time part to mess up the days! … throw into the mix some extra software layers and … life is suddenly too complicated for comfort).

partition by range (key_date)
partition jan2004 values less than to_date('01-feb-2004 00:00:00','dd-mon-yyyy hh24:mi:ss')
partition feb2004 values less than to_date('01-mar-2004 00:00:00','dd-mon-yyyy hh24:mi:ss')

How about …

partition by range (day_id)
partition jan2004 values less than 2453037
partition feb2004 values less than 2453066
...

or

partition by range (day_id)
partition jan2004 values less than 20040201
partition feb2004 values less than 20040301
...

Sounds good?

For the scenario listed here (temporal range partitioning the fact table) … ultimately the question is if partition elimination happens for your typical DW query (that is, you never query by FK values ... DATEs or NUMBERs, doesn't matter ... always query by dimensions).


Tom Kyte
August 06, 2004 - 1:59 pm UTC

I'll totally 100% disagree.

DATE is what you must use. it'll not have a time component.

tell me optimizer, how many values between:

20011231 20010101

date = 1
number = 1,130

it is vital to store a date in a date, a number in a number, a string in a string. vital.

A reader, August 06, 2004 - 2:37 pm UTC

Tom,

Can you explain what the star schema is? What is the difference between a star schema and OLTP?

Thank you in advance



Tom Kyte
August 06, 2004 - 2:42 pm UTC

star schema is a data warehouse construct, as far from oltp (transaction processing) as you get.


http://docs.oracle.com/docs/cd/B10501_01/server.920/a96520/logical.htm#97636



A reader, August 06, 2004 - 2:47 pm UTC

Thanks tom. That is a good article.

no value whatsoever in what I'm saying?!

Gabe, August 06, 2004 - 4:38 pm UTC

<quote>
tell me optimizer, how many values between:
20011231 20010101

date = 1
number = 1,130
</quote>

In the fact table, just by looking at the range definition, the optimizer won't be able to specify/estimate the number of rows in the range ... that particular FK column (DATE or NUMBER) is not unique.

The "Day" dimension could/should have 2 DATE columns useful for DATE operations (when I am designing the thing, it always has them):

FROM_DT = DATE '08-AUG-2004 00:00:00'
THRU_DT = DATE '08-AUG-2004 23:59:59'

From the end-user perspective, the choice of what the PK in the "Day" dimension is ... well, is totally irrelevant ... from that angle it is indeed a surrogate/synthetic value with no logic attached to it ... the metadata layer (EUL for Discoverer) won't even expose the thing to be queried by. From the perspective of the ETL designer/developer what exactly is the problem with having the PK of the "Day" table defined as NUMBER? If none, what exactly is the problem with _fashioning_ my NUMBER values as opposed to using an Oracle sequence (as I said, I still have my FROM_DT, THRU_DT date columns in the dimension table)? Well, I didn't find any drawbacks ... arguably, I did find some advantages.

I don't quite see why you "totally 100% disagree" with what I'm saying! Strangely enough, you didn't have the same strong disagreement when presented with "The usage of LEVEL in dimension, August 19, 2003" … a generic time dimension table holding all of these: days, weeks, months and years (how exactly they decide whether the week of [Aug 29, 2004 - Sep 4, 2004] is in the month of August or September is beyond me!) … a generic fact table holding both the detail rows (at day level) and the aggregations at week, month, and year levels (think of it as a table and all its time-based mv aggregations rolled into one!). That is probably the most perverse Oracle implementation of a dimensional model … it is what I call the black-hole implementation and, scarily enough, I did come across such implementations.

Then again, a world with everyone in total agreement would be rather boring.


Tom Kyte
August 07, 2004 - 9:15 am UTC

what does uniqueness have to do with it?  I'm talking "number of rows that meet the criteria"

ops$tkyte@ORA9IR2> create table t ( x date, y int );
 
Table created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> insert into t select sysdate+rownum, to_char( sysdate+rownum, 'yyyymmdd' ) from all_objects;
 
29665 rows created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> analyze table t compute statistics;
 
Table analyzed.
 

ops$tkyte@ORA9IR2> set autotrace on explain
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select * from t where x between to_date('31-dec-2004') and to_date('02-jan-2005' )-1/24/60/60;
 
X                  Y
--------- ----------
31-DEC-04   20041231
01-JAN-05   20050101
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=14 Card=4 Bytes=48)
   1    0   TABLE ACCESS (FULL) OF 'T' (Cost=14 Card=4 Bytes=48)
 
 
 
ops$tkyte@ORA9IR2> select * from t where y between 20041231 and 20050101;
 
X                  Y
--------- ----------
31-DEC-04   20041231
01-JAN-05   20050101
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=14 Card=327 Bytes=3924)
   1    0   TABLE ACCESS (FULL) OF 'T' (Cost=14 Card=327 Bytes=3924)
 
 
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> set autotrace off

this can have a profound effect on the optimization of queries.  Strings do it even worse.


I'm saying humans don't use julian dates, they are not meaningful to us.  We use dates - january 1st, 2005.  To move the date out of the table into the dimension only achieves:

a) less ease of use for the dba/developer doing a partitioning scheme
b) less meaning in the ddl itself -- hey, is 2,453,255 august or july...
c) more work for the database for a large percentage of queries (the end users use dates)


I'm against the julian date here.  that is what I'm 100% against.  

(and yes, I'm aware my example did not use julian dates; i used the format that 99.999% of the implementations I see out there use, incorrectly, time after time after time -- because people understand the number 20050101)
  

Day dimension

A reader, August 07, 2004 - 12:58 pm UTC

If the "grain" of the fact table is indeed a single day, having the actual DATE field as the PK of the DAY dimension (and of course the FK in the fact table) is the best compromise. This allows end users to query directly on the fact table when they know what day(s) they need. This allows DBAs to partition the fact table on the DATE field.

If users need to query by some extended attribute of the time dimension, like quarter, week, or is_holiday, those could be simple stored/indexed in the DAY dimension and that could do a star_transformation-type query.

Yes, I realize that most things can be done simply by TO_CHAR()s on the DATE field in the fact table itself, but then the index on the DATE field will not be used and it is easier to have all these fields pre-calculated in the DAY dimension table and available for simple filtering/querying.

Most DAY dimensiont tables I have seen are like
pk DATE
week varchar2
month varchar2
....
week_in_year, qtr_in_year, day_in_year, etc

i.e. every field can be derived from the pk, but it never changes, so we might as well store it in the table and index it. The index will be very selective if the table contains even one year's worth of dates!
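
A sketch of generating a year of such a dimension (assuming a DAY_DIM table with roughly the columns listed above; the format masks are just one possible choice):

insert into day_dim (pk, week_in_year, month, qtr_in_year, day_in_year)
select d,
       to_char(d,'IW'),            -- ISO week number
       to_char(d,'fmMonth YYYY'),  -- e.g. 'January 2004'
       to_char(d,'Q'),             -- quarter number
       to_char(d,'DDD')            -- day of the year
  from ( select date '2004-01-01' + rownum - 1 d
           from all_objects
          where rownum <= 366 );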

Good discussion, thanks

looking for value ... (too many words though)

Gabe, August 09, 2004 - 2:59 pm UTC

Tom,

[I do have some questions sprinkled in here … hope you’re not already fed up with the “Day” dimension subject.]

I’ve been reading you long enough to have seen the proof of how important (from a performance perspective) is to always pick the right data type (as if just using the right data type is not a good-enough reason in itself). I doubt someone will ever catch me using a number or string for a date.

In our case though, it is not about the data type of some column, it is about choosing the PK for a table (the “Day” dimension) … which has, or could/should have, all of these:

FROM_DT ==> DATE ‘DD-MON-YYYY 00:00:00’
THRU_DT ==> DATE ‘DD-MON-YYYY 23:59:59’
YEAR
YEAR_MONTH
YEAR_MONTH_DAY ==> ‘YYYYMMDD’
JULIAN_DAY_NO
Etc.

You seem to favor FROM_DT … I favor the JULIAN_DAY_NO (the DATE columns are still there in the table to be used for DATE-related operations). When it comes to what gets migrated as a FK into the fact table, I don't see any difference between DATE or NUMBER … the FK column itself should only be used as a join key. I mean, why don't you like to see 20050101 as a FK number value in the fact table (and BTW, I don't like that either)? … I guess because the end-users will recognize that as representing a day and start _using_ it right from the fact table (and that can have performance implications) … I argue the same thing will happen if the FK in the fact table is DATE '2005-01-01'. Sooner or later the less SQL-savvy end-users (for whom we are in fact designing/building this dimensional model) will be overrun with options: "should I use the DATE FK column from the fact table? … or should I use the DATE PK column from the Day table? … or should I use the pre-cooked (DATE-derived) denormalized column from the Day table?".

My worry (justified or not … maybe you can comment on it) is that, by choosing to use the FK DATE from the fact table, the actions of the end-users may preclude the usage of some optimization techniques (the usage of the index, the applicability of the star transformation with bitmap indexes, etc.). The whole point of having this denormalized "Day" table is to eliminate the need, for the less SQL-savvy end-users, to ever use the TO_DATE(<DATE>,<MASK>) constructs … we want them to just pick the value they are interested in … and be done with it; we want to give them the liberty of ad-hoc access, but not to the extent of them writing or generating performance-crippling SQL. Leave them wild … and I'll bet they will soon come back asking why their SQL report/extract _works_ for "September" but not for "August" …

Select <whatever> from <fact_table>
where to_char(<fact_table>.<fk_day_as_date>,’YYYY Month’) = ‘2004 August’;

One could of course explain what “fm” in a date format mask is/does … but wait … wasn’t the whole point of implementing a dimensional model in the first place to establish a self-serve, cost-effective, conceptually simple (to communicate) system for business end-users to obtain “business intelligence” without the constant/expensive assistance from the technical experts?

And, as I tried to explain, there is another reason I personally don't like a DATE column as the PK for the "Day" dimension … some people may find it "annoyingly pedantic or extreme" … I consider it in line with <quote>it is vital to store a date in a date, a number in a number, a string in a string. vital</quote>: the DATE datatype contains both date and time components … for a "Day" entity I only need the date fields. You know, it is as if, for a table with a PK having as values only integer numbers, we would use a NUMBER(26,4) datatype! But I won't push this too far (consider it just an opinion) … I'm sure people will quickly jump in saying that "Day" is fairly static and protected from changes … well, you either totally believe in declarative integrity, asserting and instrumenting your design/code, or you don't … or you do, but are prepared to cut corners.

Some questions:
1. Assuming one also needs a “Month” and/or “Quarter” and/or “Year” dimension tables, would you still use DATE columns as PK?
2. Would you not implement the "Day" table (and all the other time-based dimension tables) as IOT(s)?
3. Talking about the _right_ datatype … for the "YEAR" column of our denormalized "Day" table (values like 2004, 2005, etc.), what datatype do you use? ... NUMBER(4) or VARCHAR(4)?

What I find perplexing when discussing the implementation of a time dimension is that people tend to _forget_ the thought/decision process that brought them to the concept of a time dimension in the first place … <quote>This allows end users to query directly on the fact table when they know what day(s) they need</quote>. Well, the next worst thing to having no option is having too many options … why did they decide they needed a DW in the first place, and who was it built for? … why did they decide a star model is a simple/effective method of communicating and abstracting the intricacies of the database model/schema? … and if there was a need for a "Day" component, why did they decide to have a "Day" table with all this pre-cooked day-related information? … so that we can go back and tell the end-user they should learn about DATE format masks in Oracle? I don't think so. A technology-savvy person can get the job done with just a DATE column (no "Day", "Month", "Quarter", "Year" tables required … if one leaves out of the discussion, for a moment, the information which is not directly derived from a DATE … like "is holiday", "is hurricane Alex day", etc.) … it is not these people we should be worried about; the whole dimensional model concept did not come about because of them. The end-users and/or the BI tool setup should be taught/trained, ideally, just the following:
A. These are the dimensions (things you should query by) and what they represent
B. These are the hierarchies to use for drill-down, roll-up
C. These are the facts (things you measure) … and for each, whether they are additive, semi-additive or non-additive
D. This is how you join dimensions with the fact(s)
E. May have to somehow communicate in plain English about conformed dimensions and drilling-across
Start talking technology-lingo like snowflake, star transformation, CBO, bitmaps, surrogate keys, factless fact table, degenerate dimensions and you have lost your audience … they will be almost ready to go back to their monthly file extract and Excel.

So, this is my perspective/justification on this, apparently so trivial, task of choosing the PK of the “Day” dimension table … hope there is some value to it (99% disagreement from you would leave me with the hope of, at least, grasping the concepts) … surely I didn’t just toss a coin.

Thanks.

Tom Kyte
August 09, 2004 - 4:22 pm UTC

good discussion, you make very valid, reasoned points.



the main difference for me between using the julian date in a number vs a date in a date is the human one.

humans don't think "julian" dates, we think 01-jan-2004.

factoring out the date and using the julian date does not make it so they need not use to_date( 'string', 'mask' ) -- it just means they must join prior to doing so (to get the sales for the last 5 days -- they would be doing a to_date against a column in the dimension table). It could be very useful in many typical queries.

I don't see how querying directly against the attributes of the fact table would preclude anything -- only how it could be beneficial in many cases.

but take:

Select <whatever> from <fact_table>
where to_char(<fact_table>.<fk_day_as_date>,’YYYY Month’) = ‘2004 August’;

how is that different from:

Select <whatever> from <fact_table>, <dim_table>
where <fact_table>.fk_to_dates = <dim_table>.pk
and to_char(<dim_table>.<day>,’YYYY Month’) = ‘2004 August’;

they have the same ability to "make mistakes". And some joker would figure out that the julian_date is a julian date and just do the same thing at the end of the day.

Some questions:
1. Assuming one also needs a “Month” and/or “Quarter” and/or “Year” dimension
tables, would you still use DATE columns as PK?

yes, i would..

2. Would you not implement the “Day” table (and all the other time-based
dimensions) table as IOT(s)?

probably, most of the dimensions -- time based or not actually

3. Talking about the _rigth_ datatype … for the “YEAR” column of our
denormalized “Day” table (values like 2004, 2005, etc.) what dataype do you use?
... NUMBER(4) or VARCHAR(4)?

year seems appropriate as a number here -- it is monotonic. Of course, we are not talking "bc" or anything -- just "normal" years





A reader, August 09, 2004 - 4:48 pm UTC

Um, both you and Gabe seem to be violently in agreement that the "human" factor is the biggest factor to be considered in deciding whether to have the actual DATE vs the julian date as the PK/FK.

In both of your approaches, given that some joker/user would eventually figure out that the fact table really has julian dates and/or real dates and start to write queries that make use of this fact (instead of joining to the DAY dimension and using all the pre-cooked fields)... all the to_char()s would kill the index on this field.

Instead, doesn't it make sense to use a surrogate, totally meaningless sequence# as the PK for the DAY table? This way the users are forced to join to the DAY dimension, and once they do, their pretty GUI tool would show all the nicely pre-cooked fields available, so they wouldn't need to do any to_char()s.

Of course, the (possibly big) disadvantage of using a surrogate key in the fact table is that this would make (range) partitioning by this field impossible! :(

Thanks

Tom Kyte
August 09, 2004 - 8:17 pm UTC

I know, confusing when arguments for both sides of a coin are presented

this is a judgement call.

to me -- I don't think so, not for a date; it is used just too often -- the optimization of being able to forgo the dimension at times is just too appealing to pass up.

not pretending to _know_ the future ... but ...

Gabe, August 09, 2004 - 11:12 pm UTC

<quote>Of course, the (possibly big) disadvantage of using a surrogate key in the fact table is that this would make (range) partitioning by this field impossible!</quote>

The thing about days is that they are 100% predictable … the "Day" table is fairly static and protected from changes … there is probably one script or program, in total control of the DW/ETL team, which maintains the "Day" table … for the most part, one needs to run it once a year, in order to add the next year's worth of records. Done … the PK values are set and immutable. As I said, there is no _law_ saying surrogate keys have to draw their values from an Oracle sequence … in fact, for the "Day" table there aren't even multi-user concurrent writes to worry about … your "Day" maintenance program can safely get the MAX from the table and increment by 1 for every new record being added. In any case, whether one sequentially builds the PK values or uses an Oracle sequence (which may leave gaps because of caching … but so what?), the point is that, once set, the PK values are immutable. Hence, I can partition my fact table because I know the values for 01-Jan-2005, 01-Feb-2005, 01-Mar-2005, etc. As Tom said, <quote>less meaning in the ddl itself</quote> … sure … but impossible? … no … it is possible to implement temporal range partitioning of the fact table even if the PK values of the "Day" table have no meaning … they only need to be assigned in chronological order.


To Gabe

A reader, August 10, 2004 - 9:09 am UTC

Hm, so you are saying something like

partition by range (date_col)
(partition jan2004 values less than 123456
partition feb2004 values less than 234567
...
)

As Tom said, reading this DDL makes no sense, because there is no apparent correlation between "Jan 2004" and "123456", but it will work just fine since, as you said, the PKs are immutable.

Thanks!

just to close ...

Gabe, August 10, 2004 - 11:05 am UTC

<quote>humans don't think "julian" dates, we think 01-jan-2004.</quote>
From the perspective of the end-users the PK from “Day” is just a join key to the fact table … there is no meaning what-so-ever from that angle … we would not publish a Julian number to them, unless their business was/is dependent on it (they explicitly asked for it during the analysis phase).

<quote>
but take:

[#1]
Select <whatever> from <fact_table>
where to_char(<fact_table>.<fk_day_as_date>,’YYYY Month’) = ‘2004 August’;

how is that different from:

[#2]
Select <whatever> from <fact_table>, <dim_table>
where <fact_table>.fk_to_dates = <dim_table>.pk
and to_char(<dim_table>.<day>,’YYYY Month’) = ‘2004 August’;
</quote>

As much as possible, we want them to do the following:

[#3]
Select <whatever> from <fact_table>, <dim_table>
where <fact_table>.fk = <dim_table>.pk
and <dim_table>.<year_month_pre-cooked_column> = ‘2004 August’;

We don’t want to give them too many options … ideally just one way in. Hence, we would have a view on top of the “Day” table which restricts what they see (the _technical_ columns of the table are not published) … they get to select through the view and nothing else.

Now, if there is a category of end-users who do require the flexibility of the FROM_DT, THRU_DT DATE columns, then we do screen and train them … there is a leap of faith with everything; the only thing we can do is to be methodical about unleashing the full power. Kimball uses the publishing metaphor [the article is "An Engineer's View"] to describe what it takes to be successful with a DW effort … the data model and its implementation is just the end of the beginning … the actual publishing of the information is just as important.

Performance-wise, with #2, there isn’t much damage one can do when mucking about with the “Day” dimension … 10 years’ worth of days is still less than 4000 rows … FT scans, index access, whatever … Oracle will go pretty quickly through 4000 rows. Muck around with the FK DATE in the fact table (millions of rows) and there may be a sizeable performance impact … I won’t even 100% trust myself without at least an explain plan … and that kind of contradicts the ad-hoc access ideal.

I’ll try to see if I can materialize some technical (contra)arguments for <quote>I don't see how querying directly against the attributes of the fact table would preclude anything</quote>. By now, it should feel like a breath of fresh air to go back to practical, measurable, SQL test cases.

Someone much more coherent than I can manage (short article … lots of insight) …
</code> http://www.intelligententerprise.com/030422/607warehouse1_1.jhtml?_requestid=320733 <code>

Thanks


Implementing history

A reader, September 08, 2004 - 11:29 pm UTC

The requirement is to capture month-end snapshots of data so as to be able to report on the business state as of any available month in the past.

What are your thoughts on implementing this type of history in the context of a dimensional model? "SCD Type 2" as Kimball calls it. Personally, I would just add an "as_of_date" to the relevant tables, add it to the existing PK of the table, and be done with it. (I would consider partitioning by this as_of_date to ensure that any historical query is no worse than any current query.)

[My problem with the textbook solution for SCD Type 2 is that it doesn't have an effective date, relying instead on joining with the time dimension (via the fact table) to derive the "as_of_date". This makes it impossible to write dimension-only queries as of a certain date]

Thanks



Tom Kyte
September 09, 2004 - 8:01 am UTC

if space is not of a concern, having the same dimension value repeated month after month certainly makes things easier for some things -- harder for others.

The as_of_date would be in the dimensions, and every join out would have to specify the "as_of_date" every time -- but that is easier than formulating the query that finds the record that was current as of a point in time if you just log "changes to the dimension"



Implementing history

A reader, September 09, 2004 - 9:34 am UTC

My thoughts exactly, glad that we agree.

"if space is not of a concern, having the same dimension value repeated month after month certainly makes things easier for some things -- harder for others"

Other than space, what things would this make "harder" for? What did you have in mind?

To address the space issue, I plan to (range) partition by the "as_of_date" and create local indexes. Basically, the table would just contain N independent monthly partitions that can be managed easily. 99% of the queries would access only (a particular month's set of) partitions so partition pruning would kick in big time!
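Something along these lines (a sketch only -- the table, columns and partition boundaries are invented for illustration):

create table monthly_snapshot
( as_of_date  date          not null,
  account_id  number        not null,
  balance     number(14,2)  not null
)
partition by range (as_of_date)
( partition p_2004_07 values less than (to_date('01-08-2004','dd-mm-yyyy')),
  partition p_2004_08 values less than (to_date('01-09-2004','dd-mm-yyyy')),
  partition p_2004_09 values less than (to_date('01-10-2004','dd-mm-yyyy'))
);

-- local index: each monthly partition carries its own index segment, so a
-- query for one month prunes to a single partition and its local index
create index monthly_snapshot_ix1
    on monthly_snapshot (as_of_date, account_id) local;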

What are your thoughts on creating a separate database for this stuff? In the end-state, 95% of the database would contain historical information so creating a new database just for 5% of the data doesn't make sense to me. Management is worried that the DW-type queries (if not properly written or if partition pruning doesn't kick in or whatever) will bog down the server and hurt performance of the current data users. I prefer to have 1 database and manage resources using Resource Manager and the like. Thoughts?

Thanks

Tom Kyte
September 09, 2004 - 11:36 am UTC

"show me the history of this dimension -- when things changed over time for it"

easy with the from/to dated dimensions.

trickier with a constant stream of information (but lag/lead make it pretty easy actually)...
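For example, something like this (a sketch; the table and column names are assumed) turns a stream of monthly snapshots into effective-from/to ranges:

-- a snapshot row is "current" from its as_of_date up to (but not including)
-- the next snapshot date for the same customer
select customer_id,
       customer_name,
       as_of_date                                      as effective_from,
       lead(as_of_date) over (partition by customer_id
                              order by as_of_date)     as effective_to
  from customer_dim_snapshot;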


I prefer as few databases as possible in all cases. Less to manage, less to backup, less to recover, less to worry about, easier to tune (especially if you do not get a new server for this other instance -- the max number of instances per host should be "1")

A reader, September 09, 2004 - 3:03 pm UTC

Snowflake schema is yet another relational model
with fact tables and dimension tables.

So you could have

customer_Fact -> Customer_Dimension -> Category-> etc



Best tool for ETL - PL/SQL vs Informatica/DataStage

A reader, September 23, 2004 - 2:17 pm UTC

With the awesome features in Oracle 9i, from a performance perspective, if code is written in a decent way in PL/SQL then the performance is much better compared to any third-party DW tools like Informatica, DataStage etc., because all these tools need to talk to SQL/PL/SQL in one way or another. Is my understanding correct? If there are strong PL/SQL developers, do we really need any of these third-party DW tools for Oracle 9i?

Thanks


Tom Kyte
September 24, 2004 - 9:25 am UTC

It is in my experience.

but remember, code is not discriminating, you can write bad code in any language. People who say "language X is slow" probably tried to program it using their knowledge of language Y.


However -- in a large-scale environment, the benefit of these tools is in their metadata repositories (like Oracle Warehouse Builder -- that generates sql and plsql to do stuff). The benefit is in the management of the metadata -- even more so than the "code generation".

Code -- I can write lots of code.

but...
Maintain it -- who me?
Share it with 50 other people -- are you kidding?

That is where the tools come into play more than anything.

A reader, September 24, 2004 - 10:33 am UTC

That's a great response, I agree 100%

A reader, September 24, 2004 - 10:37 am UTC

"metadata management by ETL tools"

Nevertheless, the core question remains. From a feature set perspective, most ETL tools have these GUI thingies that let you "map columns", do trivial transformations, lookups, validations, etc. PL/SQL is more than suitable for all of this stuff. And there are good PL/SQL IDEs out there also.

Having said all this, would you say that with 9iR2 and up, given sufficient in-house SQL/PL/SQL expertise, the third-party ETL vendors are out of business?

Thanks



Tom Kyte
September 24, 2004 - 11:52 am UTC

Let me take your sentence and finish it:

From a feature set perspective, most
ETL tools have these GUI thingies that let you "map columns", do trivial
transformations, lookups, validations and most importantly store these trivial -- yet vital -- transformations, lookups and validations in a self documenting fashion so that in the future, when we've moved onto something more exciting and the next wave of developers come onto the scene -- the footprints we've left behind are bold and clear. They won't have to read our code (which we fully document each and every step of the way, right :)) and sort of figure out what we did -- it'll be obvious


You see, an ETL tool probably in many cases makes the entire design process a little slower than if you just sat down and banged out code. but in the implementation, test, production, maintenance cycles they really save your butt.

Conversation overheard in a large shop not using ETL tools and just using "really good programmers":

End User: "why is this number 5"
Programmer: "because the first and third bits are set"
End User: "huh? anyway -- why is this 5 showing up on this report, where does it come from"
Programmer: "Well, originally it was on disk - and we ran a query..."
End User: "no, i mean, what source systems gave us this number, what processes touched it, how was it made -- from which inputs and what did you do to it to make it 5"
Programmer: "oh, well if you look here, on line 201 of this unit, we ...."


Same conversation in a shop using an ETL tool:

<this space intentionally left blank; the end user uses the OWB function from reports to right click on the value in the report and view "genealogy" -- the report of systems that contribute to the value, the functions applied to it, what it means>






Think of an ETL tool as a method of imposing rigor on your development process. Sure, if you are a lone programmer jockey, and you are the only guy doing things and you plan on doing it for the rest of your life -- you don't need one.

The power of the ETL tools is not so much in the "trivial transformation of this column to uppercase", but in the documentation that there is such a transformation, that it takes place on data coming from source A but not source B, and that it was implemented because <fill in the space>.

I agree there

A reader, September 24, 2004 - 12:23 pm UTC

We are just finishing a redevelopment of a 4 TB dwh and have used OWB for almost all of the ETL stuff. In all we have created over 400 mappings (from the most trivial 1 to 1 map to the most complex multi-table validation mapping).
Benefits to us:
1) The code that comes out is 'good' code - it runs well in a live environment
2) With 400 maps, a lot of them do similar things, we can give an 'ideal' map to junior developers and ask them to produce n similar maps for n similar table loads.
3) We can look at the data flows on screen! The business users can see that a value in table y comes from source x.
4) We get time for the senior guys to cut the 'special' code by hand - there will always be some feature or other that an ETL tool cannot handle; in our case it was partition management




A reader, October 18, 2004 - 2:57 pm UTC

Tom see your comment
Followup:
‘give me the total sales of each week in January
2001’

Create table Time_Dim
(
time_key number(sequence),
Year date,
month date,
week date,
day date
);

Select time_dim.week , total_sales
From fact_table,time_dim
Where fact_table.time_key = time_dim.time_key
AND time_dim.month = to_date( '01-jan-2001' )
/
seems easier to me?

If the data is stored for a day within a month, how will the above query yield the correct result if I want to query for a quarter?

Do you mean that we have to do the calculation at the report side?



Tom Kyte
October 19, 2004 - 7:45 am UTC

if you had it stored as dates, sure -- but the original person with the review did not have that, so it could not be done that way.


If the data is stored for a day within a month -- the TIME_DIM is there to turn the facts (the data stored by day) into years, months, qtr's whatever -- that is its sole purpose in life, to rollup data.
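For example, a quarterly rollup is just a join plus a GROUP BY on the dimension's quarter attribute (a sketch, assuming the TIME_DIM also carries YEAR_NO and QTR columns populated during the load):

select t.year_no,
       t.qtr,
       sum(f.total_sales) as qtr_sales
  from fact_table f,
       time_dim   t
 where f.time_key = t.time_key
   and t.year_no  = 2001
 group by t.year_no, t.qtr;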



A reader, October 20, 2004 - 12:56 pm UTC

Ok. This is clear now.

Thanks!

Star optimization

Reader, October 22, 2004 - 11:56 am UTC

Can you please explain this term and how exactly it works? How does it differ from a Star Transformation?
The document says that one is good for dense fact tables whereas the other for sparse fact tables. Looking for more clarity and better understanding on this.

Thanks

Tom Kyte
October 22, 2004 - 6:09 pm UTC

see
</code> http://www.dbazine.com/jlewis6.shtml <code>
for what one does...

a star optimization is a technique whereby the dimensions are cartesian-producted together and then joined to the big fact table (in short). It works OK with small dimensions, but breaks down after a bit; a star transformation (described by JPL in the above link) sort of turns the problem inside out, letting it scale to larger problems.
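as a rough sketch of the setup that lets the optimizer consider the transformation (object names are invented -- the essential ingredients are bitmap indexes on the fact foreign key columns and the parameter):

-- one bitmap index per fact foreign key column
create bitmap index sales_fact_time_bix    on sales_fact (time_key);
create bitmap index sales_fact_product_bix on sales_fact (product_key);
create bitmap index sales_fact_store_bix   on sales_fact (store_key);

-- allow the optimizer to consider the star transformation
alter session set star_transformation_enabled = true;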

Snowflake

Peter, December 20, 2004 - 10:02 am UTC

A long, long way up (second post in fact!) there was some discussion about the merits of SNOWFLAKE. The general consensus was don't go there; I mainly agree!
The only time I had to use a snowflake design was on a DWH with nested MVs, i.e. MVs defined from other MVs and not from the base fact (for refresh performance). In this case we had to use a snowflake to get query rewrite to work all the way up the MV tree. This is completely understandable, as the original query has now been rewritten to a form that no longer has a one-to-one join to the base level of the dimension table.

Life in 10g is much better though, as you can use rewrite equivalences to 'flatten' the nested structure; this also removes the hidden gotcha where an MV in a higher tier cannot be reached because its parent is not selected in the initial round of query rewrite

What is better to use DATE or DATE_KEY

Jack, February 04, 2005 - 1:51 pm UTC

Hi Tom,
Let's say we have a SHIP_DATE (truncated) in a sales table; is it better to store it in the DW as a DATE or a DATE_KEY (number)?
1. Storing it as a date is fine, but if it is NULL we have to put a fake date on it (e.g. '31-DEC-9999'), otherwise, if we join with a date dimension table, it will be skipped; but if we do put a value there, calculating turnaround time will be wrong (ship_date minus order_date), unless you code something special.
2. Storing it as a date_key is fine, too. But then you always have to join with the dim table even if you are looking for transactions on a specific date.
3. Storing both date and date_key can resolve this, but they might take gigabytes just for this (if we have several such dates in a table and the table is large, or if we apply this to many big tables).

I know 1-2 gigabytes is not a big deal today, but I am wondering if there are better ways so we don't have to keep both. Also, do you see any problems with designing a date_dim table with date_key as the primary key, but allowing fact tables to build FKs to the DATE field in the dimension (not the PK)?

Thanks,


Tom Kyte
February 04, 2005 - 2:24 pm UTC

DATE only DATE, date date date

why?

how many numbers between

yyyymmdd yyyymmdd
20001231 20010101

how many DAYS between

yyyymmdd yyyymmdd
20001231 20010101


the optimizer knows the answer to both. however, if you stuff a date into a string or a date into a number -- guess what? it doesn't KNOW "oh ho, that is really a date in disguise" -- so, it gets the cardinality really wrong - and that leads to "wrong plans"

dates -- put them in DATES
numbers -- put them in NUMBERS
strings -- you can put them in strings




Levent

Levent. Atay, February 04, 2005 - 3:14 pm UTC

Let me tell you about snowflakes in dimensional modeling. Do you know what you have if you have a snowflake? A snow storm! Not recommended.

Date as Date

A reader, February 04, 2005 - 3:47 pm UTC

I prefer Date as Date too.
I just need to resolve the NULL value issue when joining with the TIME_DIM (DATE as PK, MONTH, YEAR, ...). Here are my problems:
1) If it is null, the row will be skipped (may not be desirable);

2) If we outer join (specified in the BusinessObjects Universe), will query performance be impacted? (Most of the time, we don't need to return the unshipped records.)

3) If we put '31-Dec-9999' when null (not shipped), it may sometimes be misinterpreted by careless people (e.g. calculating average turnaround time, avg(ship_date - order_date));

4) We already have a DATE dim table (PK is DATE_KEY (number)) with many tables using DATE_KEY as the FK. If we start using DATE (in DATE_DIM) as the referenced column for FKs (not the PK, but unique), do you see any problems?

Let's say we define it as DATE, how should we make it work better?

Thanks in advance.





Tom Kyte
February 05, 2005 - 5:07 am UTC

1) if it is null, what is it then? when you are going to this time dimension, you are going there to turn a date into a week, a qtr, a fiscal year, etc

so, if you have NULL, what week, qtr, etc is that???



so -- you tell me, if there is a null in there, then what? what does it mean to go to this time dimension with a null?

another opinion ...

Gabe, February 04, 2005 - 5:50 pm UTC

<quote>Let's say we have a SHIP_DATE (truncated)in a sales table, is it better to store it in DW as DATE or DATE_KEY (number)?</quote>

Your SHIP_DATE value (for a given Order) from the SALES/ORDERS OLTP table will be mapped, during the ETL, to a record in your DAY dimension (you mentioned there already is a DAY dimension table). The PK implemented in your DAY table will be migrated to the fact as a FK … you do not choose the datatype for this SHIP_DATE in the DW … it has been done already!! … if the PK in the DAY dimension is a DATE then that is the datatype migrated to the fact … if it is a NUMBER then that is the datatype migrated to the fact … and so on.

Tools like BusinessObjects, Discoverer, etc. do not necessarily require a DAY dimension table … they are happy to be pointed to a DATE field and will _build_ the DAY dimension/hierarchies/etc. virtually in their own layer. If this is the case … and if no end-user will ever do analysis right in the database … and there are no other DAY attributes which cannot be derived right from a DATE (like “hurricane flag”) then just including a DATE column in the fact table to represent the Shipping Date for an Order is quite acceptable … it is [a bit like] a degenerate dimension (there are no attributes of interest left to justify adding a dimension table).

Now, about NULL SHIP_DATEs … DWs are for historical analysis … depending on your requirements, you may want to have your star model track only shipped orders … and leave analysis of the un-shipped orders to an ODS … or even to reports in your OLTP systems (they likely do that already … it is a tactical thing). Even from the perspective of the Shipping Department the _knowledge_ comes from analyzing months/years of thousands of shipped orders and not from the last few weeks' comparatively small number of not-yet-shipped orders.

If you still need to have un-shipped orders in your star then … you probably have (or need) an ORDER_STATUS dimension _describing_ the various phases an order goes through your organization. You map the NULL SHIP_DATEs to a _special_ record in the DAY table. You configure the tool meta-layer (and train end-users) such that:
1. For historical analysis you make sure that the predicate ORDER_STATUS = ‘SHIPPED’ is always included in the generated SQL (see the sketch below) … hence, comparing DATEs is OK (the _special_ record in the DAY table is excluded by virtue of predicating on SHIPPED orders only)
2. For analysis of un-shipped orders you make sure not to compare the other DATEs (order_entry, invoice_date, etc.) with the _special_ DAY … in fact, what one does with un-shipped orders is to compare DATEs of pre-shipping phases with SYSDATE (well, not SYSDATE since DWs usually have a delay of sorts … compare with the day the DW was last refreshed).
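A minimal illustration of point 1 (all object names here are hypothetical):

-- historical analysis restricted to shipped orders: the "special" Day record
-- used for NULL ship dates never qualifies because of the status predicate
select d.year_month,
       sum(f.order_amount) as shipped_amount
  from order_fact   f,
       day_dim      d,
       order_status s
 where f.ship_day_key     = d.day_key
   and f.order_status_key = s.order_status_key
   and s.status_code      = 'SHIPPED'
 group by d.year_month;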

<rant>In passing … depending on how you model your star (Day dimension and other helper tables) one may never have to compare DATEs when doing analysis (as in "apply Oracle DATE functions") … the DATE comparisons/operations are only needed during ETL</rant>

Again, I’m not sure tracking an order all the way to the shipping phase belongs in a DW … and the decision of how the SHIP_DATE is represented in the fact table (as in … what datatype) has been made when the PK of the DAY dimension was chosen. One can argue DATE is better than NUMBER but, by now, it doesn’t have anything to do with ORDER_DATE or SHIP_DATE or any other DATE attribute from the OLTP.


Another Opinion

A reader, February 05, 2005 - 12:07 am UTC

Thanks Gabe.


fact, dimension, snowflaking

Raghav, February 05, 2005 - 7:53 am UTC

Hi Tom,

What is a fact table? What is a dimension table? Do these tables have any special characteristics, or are they normal tables where only the treatment (joining tables on specified columns to arrive at the required summaries, which we do normally) is different?

Are these dimension tables mainly used for date/time analysis, or for other areas as well (some examples please)?

Lastly, what is snowflaking? Is it the overhead (like opening a number of sessions, occupying temporary tablespace or memory) on the server while running the queries?

Thanks in advance
Raghav

Tom Kyte
February 05, 2005 - 8:56 am UTC

read the data warehousing guide:

</code> http://docs.oracle.com/docs/cd/B10501_01/server.920/a96520/toc.htm <code>

these are schema designs, types of tables....

definitions, diagrams, examples -- all in that link.

David Aldridge, February 05, 2005 - 10:34 pm UTC

>> Tools like BusinessObjects, Discoverer, etc. do not necessarily require a DAY 
dimension table … they are happy to be pointed to a DATE field and will _build_ 
the DAY dimension/hierarchies/etc. virtually in their own layer. If this is the 
case … and if no end-user will ever do analysis right in the database … and 
there are no other DAY attributes which cannot be derived right from a DATE 
(like “hurricane flag”) then just including a DATE column in the fact table to 
represent the Shipping Date for an Order is quite acceptable … it is [a bit 
like] a degenerate dimension (there are no attributes of interest left to 
justify adding a dimension table). <<

In response to this I'd add that one of the benefits of "materializing" the attributes of the date in a dimension table is that the optimizer can sometimes get much better statistics on the query.

For example if you store day-level dates (ie. no time) in the fact table and rely on the front-end converting these on the fly with expressions such as Trunc(fct.my_day,'MM') then a query that predicates the fact table with ...

Trunc(fct.my_day,'MM') = Trunc(sysdate,'MM')

... is not going to get a good estimate on the cardinality of the result set. It seems that the optimizer just is not capable of inferring cardinality in this situation very well.

However if a dimension table is created in which the expression Trunc(dmn.my_day,'MM') is "materialized" as a column of values then the optimizer can read the cardinality from the dmn table's statistics. There are situations where skewed data in the fct table is going to disrupt this, but given an even distribution of values the optimizer will infer the cardinality in the fct table correctly.

An example ...

SQL*Plus: Release 10.1.0.2.0 - Production on Sat Feb 5 20:12:04 2005

Copyright (c) 1982, 2004, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> create table fct (my_date not null)
  2  pctfree 0 as select trunc(sysdate)-rownum from dba_objects
  3* where rownum < 180
SQL> /

Table created.

SQL> create table dmn (my_date not null,my_month not null,my_year not null)
  2  pctfree 0 as select my_date,trunc(my_date,'MM'),trunc(my_date,'YYYY')
  3  from fct;

Table created.

SQL> insert into fct select * from fct;

179 rows created.

SQL> /

358 rows created.

SQL> /

716 rows created.

SQL> /

1432 rows created.

SQL> /

2864 rows created.

SQL> /

5728 rows created.

SQL> /

11456 rows created.

SQL> /

22912 rows created.

SQL> /

45824 rows created.

SQL> /

91648 rows created.

SQL> /

183296 rows created.

SQL> /

366592 rows created.

SQL> commit;

Commit complete.

SQL> analyze table fct compute statistics for table for all columns;

Table analyzed.

SQL> analyze table dmn compute statistics for table for all columns;

Table analyzed.

SQL> alter table dmn add constraint pk_dmn primary key (my_date);

Table altered.

SQL> alter table fct add constraint fk_dmn foreign key (my_date) references dmn


Table altered.

SQL> set autotrace on


-- note in the query below the even distribution of 
-- my_date values gives the optimizer an exactly correct 
-- estimate of cardinality, "4096"

SQL> select count(*) from fct where my_date = trunc(sysdate-10);

  COUNT(*)
----------
      4096


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=352 Card=1 Bytes=7
          )

   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'FCT' (TABLE) (Cost=352 Card=4096
           Bytes=28672)


Statistics
----------------------------------------------------------
        177  recursive calls
          0  db block gets
       1220  consistent gets
        598  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
          1  rows processed

-- in the next example the cardinality estimate is way off
-- 126,976 rows are found with an estimate of 7,332
SQL> select count(*) from fct where trunc(my_date,'MM') = trunc(sysdate-10,'MM')
;

  COUNT(*)
----------
    126976


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=440 Card=1 Bytes=7
          )

   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'FCT' (TABLE) (Cost=440 Card=7332
           Bytes=51324)





Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1193  consistent gets
          0  physical reads
          0  redo size
        395  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

-- Exactly the correct inference of cardinality here
-- For some reason I find this startling.

SQL> select count(*) from fct,dmn
  2  where fct.my_date = dmn.my_date and
  3* my_month = trunc(sysdate-10,'MM')
SQL> /

  COUNT(*)
----------
    126976


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=277 Card=1 Bytes=2
          1)

   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (Cost=277 Card=126976 Bytes=2666496)
   3    2       TABLE ACCESS (FULL) OF 'DMN' (TABLE) (Cost=3 Card=31 B
          ytes=434)

   4    2       TABLE ACCESS (FULL) OF 'FCT' (TABLE) (Cost=262 Card=73
          3184 Bytes=5132288)





Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1196  consistent gets
          1  physical reads
          0  redo size
        395  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

-- Here is a demonstration that a function-based index doesn't help the fact table

SQL> create index xie_mth on fct(trunc(my_date,'MM'));

Index created.

SQL> analyze index xie_mth compute statistics;

Index analyzed.

SQL> select count(*) from fct where trunc(my_date,'MM') = trunc(sysdate-10,'MM')
;

  COUNT(*)
----------
    126976


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=244 Card=1 Bytes=7
          )

   1    0   SORT (AGGREGATE)
   2    1     INDEX (RANGE SCAN) OF 'XIE_MTH' (INDEX) (Cost=279 Card=7
          332 Bytes=51324)





Statistics
----------------------------------------------------------
         23  recursive calls
          0  db block gets
        341  consistent gets
          1  physical reads
          0  redo size
        395  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>

So anyway, for this reason among others, my own preference is to use a separate dimension table for dates. 

Date Dimension

Jack, February 07, 2005 - 1:05 pm UTC

Hi Tom,

We have been using a DATE dimension table with a numeric field as the PK. As a result, we have DATE_KEYs in our fact tables (e.g. ORDER_DATE_KEY in our ORDER table). Now we are trying to use DATE (truncated) directly (have ORDER_DATE in our fact table). But instead of building a new DATE dimension table (with the PK being the truncated dates), I am thinking of letting all FKs reference the DATE field in the existing date dimension.

Do you see any problems of doing that? I know in a normalized model, we usually build FK referencing the PK column in the parent table.

The DATE dimension table looks like this:
DATE_KEY NUMBER(12) PK
DATE_DATE DATE
YEAR_NO NUMBER(4)
MONTH_NAME VARCHAR2(10)
HOLIDAY_IND
...


Thanks,

Tom Kyte
February 07, 2005 - 1:21 pm UTC

i've no problem putting a real date into a fact table.

i've no problem with having a foreign key to any UNIQUE constraint (make it NOT NULL too -- the date_date field, in fact all fields that are not null should be...)
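something like this, as a sketch (using the poster's DATE_DIM/DATE_DATE names; the fact table and FK column are made up):

-- DATE_KEY stays the primary key; DATE_DATE gets NOT NULL + UNIQUE so new
-- fact tables can reference the real date directly
alter table date_dim modify (date_date not null);
alter table date_dim add constraint date_dim_date_uk unique (date_date);

alter table order_fact
  add constraint order_fact_date_fk
  foreign key (order_date) references date_dim (date_date);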


Date Dimension

Jack, February 07, 2005 - 10:13 pm UTC

Thanks!

the magic 1%?

Gabe, February 08, 2005 - 12:20 am UTC

David:

Regarding that “degenerate Day dimension is quite acceptable” comment of mine … I meant it as “from a dimensional modeling perspective” … I didn’t mean to attach any sort of performance implications to it. If you page up a bit you’ll find that my approach is to always rely on a Day dimension table with a non-DATE PK. I am of the firm belief that:
1. the implementation of the star should be as such that no end-user ever has to hit (directly or through the ad-hoc generated SQL) a DATE conversion/function (especially on the fact table)
2. the migrated FKs from the dimensions should only be used as join keys … that is, they should never be directly predicated against.

In fact, my comment to Tom was:
<quote>My worry (justified or not … maybe you can comment on it) is that, by choosing to use the FK DATE from the fact table, the action of the end-users may preclude the usage of some optimization techniques (the usage of the index, the applicability of the star transformation with bitmap indexes, etc.)</quote>
… your test case seems to show that mucking with a DATE in the fact table may indeed slightly _confuse_ the optimizer … and hence, unless someone points out some weakness in the argument, it actually strengthens my case.

Now, I did mimic your test case and findings and here are my observations:

You loaded 733184 rows in FCT … 179 days * 4096 each. The DMN table had 179 rows. Both tables were analyzed, including histograms.

SQL#1

Indeed the optimizer uses the histograms you had collected to come up with the precise cardinality: Card=4096. If you were to test without histograms on FCT, the cardinality would’ve come out as Card=7332.

SQL#2

The cardinality estimate is way off: Card=7332 vs. 126976 actual rows returned. If you were to test without histograms on FCT, the cardinality would’ve again come out as Card=7332.

SQL#3

Here the optimizer comes with the precise cardinality on DMN, Card=31 and also on the hash join Card=126976. If you were to test without histograms you’ll probably find:
1. No histograms on DMN and FCT ==> Card=2 on DMN, Card=7332 on the hash join
2. Histograms on DMN, not on FCT ==> Card=31 on DMN, Card=126976 on the hash join
3. Histograms on FCT, not on DMN ==> Card=2 on DMN, Card=7332 on the hash join.

The same kind of thing with the SQL#4 with the FBI.

In summary, whenever a function is applied against the FCT.MY_DATE column, the optimizer does what it would do if the histograms weren’t there at all: it assumes a 1% cardinality … you get this Card=7332 which is total-rows-in-FCT=733184/100 and Card=2 which is total-rows-in-DMN=179/100.

Let us say we slightly change your SQL#1 from:

Ver1: select count(*) from fct where my_date = trunc(sysdate-10);
to
Ver2: select count(*) from fct where my_date + 1 = trunc(sysdate-9);

Can the optimizer do anything about this?

SQL> select count(*) from fct where my_date + 1 = trunc(sysdate-9);

  COUNT(*)
----------
      4096


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=371 Card=1 Bytes=7
          )

   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'FCT' (TABLE) (Cost=371 Card=7332
           Bytes=51324)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1193  consistent gets
       1188  physical reads
          0  redo size
        380  bytes sent via SQL*Net to client
        500  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So, it doesn’t rewrite, it doesn’t seem to use the histograms, and it goes back to estimating the cardinality for the full table access on FCT at 1% … Card=7332 … not far from the truth, but still.

Cheers.
 

David Aldridge, February 09, 2005 - 1:33 am UTC

Gabe,

yes, we are indeed singing from the same song sheet then.

This is a great illustration, I think, that histograms are not just for skewed data in this world of data warehousing. I'm in the habit of collecting histograms on every indexed column (which generally means every fk column) in the fact table. The one fly-in-the-ointment for me is that Oracle (and maybe no other DB vendor either) does not have an implementation of multi-column histograms that would enable the optimizer to infer the distribution of a particular column's values based on different values of another column.

For example, the optimizer may infer that 75% of sale_type_cd's are "Retail" and 25% are "Wholesale", and that sales peak on Mondays and Saturdays, but cannot infer that the Saturday peak is 95% Retail, while the Monday peak is only 30% Retail. Now of course with such an extremely small number of values for sale_type_cd we could partition on it easily, and rely on partition level statistics to help us, but with higher numbers of distinct values this wouldn't be the case.

Tom/Gabe, can you see any theoretical (never mind practical for now!) problems with such a feature? Oh OK, practical ones also then.

Dave A.

Star Schema, histograms, et all

Stewart W. Bryson, February 11, 2005 - 1:27 pm UTC

First off, let me ask, how does one make sure histograms are collected using DBMS_STATS? I know how to do it with analyze, but I can't seem to find this in the newer documentation.

As far as the DATE_KEY, DATE controversy, I am in a similar situation. Our dimension tables use surrogate, numeric DATE_KEYS constructed like "20050211" to join to the fact table. However, our fact tables also have DATE keys in them, and this is how the fact tables, and even some of the larger dimension tables, are partitioned.

The problem with this model is that joins across the numeric DATE_KEY column do not allow partition pruning, as Oracle has no understanding of the relation between the numeric value '20050211' in the DATE_KEY column and the FEB2005 partition for the DATE column. The only way to get partition pruning, and possibly even partition-wise joins, is to join on the DATE column anyway. But the primary key goes across the numeric field, so I'm forced to do two joins for every date lookup, unless I want to litter the tables with numerous indexes, all of which take time to build.

The other option is to partition on the numeric DATE_KEY column, but the purist in me is against partitioning on a number key when what I really want to partition by is date. However, I inherited this model, and this is exactly what I am contemplating, as I don't really see a "Best Case" in any of this.

I'm a big fan of surrogate keys and believe completely in them, except when it comes to time or date-based dimensions. It's using surrogates for surrogates' sake, and that's really just logical promiscuity.

Thanks to Tom and every one for this great thread... a lot of smart people have posted here.

Tom Kyte
February 12, 2005 - 8:22 am UTC

method_opt, method_opt is the parameter you use to tell it what to do to the columns.  (dbms_stats talks through this on the gather table/schema stats section)
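for example (a sketch -- adjust the table name, column list and bucket count to your own data):

begin
   -- collect histograms (up to 254 buckets) on all indexed columns of the
   -- fact table, and cascade to the indexes as well
   dbms_stats.gather_table_stats
   ( ownname    => user,
     tabname    => 'SALES_FACT',
     method_opt => 'for all indexed columns size 254',
     cascade    => true );
end;
/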

not only is your partition pruning defeated, but the optimizer is lied to.

ops$tkyte@ORA9IR2> create table t ( str_date, date_date, number_date, data )
  2  as
  3  select to_char( dt+rownum,'yyyymmdd' ),
  4         dt+rownum,
  5         to_number( to_char( dt+rownum,'yyyymmdd' ) ),
  6             rpad('*',45,'*')
  7    from (select to_date('01-jan-1995','dd-mon-yyyy') dt
  8            from all_objects)
  9  /
                                                                                                                                                         
Table created.
                                                                                                                                                         
ops$tkyte@ORA9IR2> create index t_str_date_idx on t(str_date);
                                                                                                                                                         
Index created.
                                                                                                                                                         
ops$tkyte@ORA9IR2> create index t_date_date_idx on t(date_date);
                                                                                                                                                         
Index created.
                                                                                                                                                         
ops$tkyte@ORA9IR2> create index t_number_date_idx on t(number_date);
                                                                                                                                                         
Index created.
                                                                                                                                                         
ops$tkyte@ORA9IR2> begin
  2          dbms_stats.gather_table_stats
  3          ( user, 'T',
  4            method_opt=> 'for all indexed columns',
  5            cascade=> true );
  6  end;
  7  /
                                                                                                                                                         
PL/SQL procedure successfully completed.
                                                                                                                                                         
ops$tkyte@ORA9IR2> set autotrace on explain
ops$tkyte@ORA9IR2> select *
  2    from t
  3   where str_date between '20001231' and '20010101';
                                                                                                                                                         
STR_DATE DATE_DATE NUMBER_DATE DATA
-------- --------- ----------- ---------------------------------------------
20001231 31-DEC-00    20001231 *********************************************
20010101 01-JAN-01    20010101 *********************************************
                                                                                                                                                         
                                                                                                                                                         
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=7 Card=370 Bytes=17390)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=7 Card=370 Bytes=17390)
   2    1     INDEX (RANGE SCAN) OF 'T_STR_DATE_IDX' (NON-UNIQUE) (Cost=3 Card=370)
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
ops$tkyte@ORA9IR2> select *
  2    from t
  3   where number_date between 20001231 and 20010101;
                                                                                                                                                         
STR_DATE DATE_DATE NUMBER_DATE DATA
-------- --------- ----------- ---------------------------------------------
20001231 31-DEC-00    20001231 *********************************************
20010101 01-JAN-01    20010101 *********************************************
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=330 Bytes=15510)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=6 Card=330 Bytes=15510)
   2    1     INDEX (RANGE SCAN) OF 'T_NUMBER_DATE_IDX' (NON-UNIQUE) (Cost=2 Card=330)
 
 
 
ops$tkyte@ORA9IR2> select *
  2    from t
  3   where date_date
  4      between to_date('20001231','yyyymmdd')
  5              and to_date('20010101','yyyymmdd');
 
STR_DATE DATE_DATE NUMBER_DATE DATA
-------- --------- ----------- ---------------------------------------------
20001231 31-DEC-00    20001231 *********************************************
20010101 01-JAN-01    20010101 *********************************************
 
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=1 Bytes=47)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=3 Card=1 Bytes=47)
   2    1     INDEX (RANGE SCAN) OF 'T_DATE_DATE_IDX' (NON-UNIQUE) (Cost=2 Card=1)


<b>Hmm, it seems to the optimizer that between the STRINGS '20001231' and '20010101' there are LOTS of other strings, and likewise between the NUMBERS 20001231 and 20010101 --

but there are very few other unique DATES between them..

(can you tell that I am totally not a fan of using a string for a date, a number for a date, a string for a number, blobs for clobs, varchar2(4000) to store any string "just in case" it gets bigger and so on?)


</b>

I'm a big fan of surrogate keys when a NATURAL KEY does not exist!  (and then you use a sequence).  But here, all you seem to be doing is copying information and putting it into the wrong type? 

just for completness ...

Gabe, February 12, 2005 - 10:01 am UTC

The optimizer won't get "lied to" if we were to use Julian numbers: it will show exactly the same cardinality as with Dates ... in the example above Card=1.

Tom Kyte
February 12, 2005 - 12:47 pm UTC

humans don't use julian numbers -- so it becomes a moot point - same problems arise (I'm having some deja vu, I've had this conversation somewhere)....


dates belong in dates.



7 seconds response time for 50M result set

A reader, August 22, 2005 - 4:14 pm UTC

Hi Tom,

We are developing a query tool for a star schema type of data warehouse. Our sales manager wants our application to get 7-second response time for ad hoc queries summarizing 50 million rows - WITHOUT using aggregates (materialized views).

Which means that he wants the query to finish in 7 seconds for any result set of 50 million rows (maybe out of 200 million total fact rows).

I think he is crazy but I want to ask you first if this is realistic to do.

I know with MV it's possible for such query to finish in less than 1 second.

Also my questions:
1. Is it possible to finish in 7 seconds summarizing 10 million rows out of 200 million rows?

If this kind of performance is not possible, how long would you expect such queries to take?

Tom Kyte
August 24, 2005 - 3:25 am UTC

WITHOUT using MVs

why does a sales manager give a whit what technology you use. What if they said "without using tables, i don't like tables"




Dilbert at work

Bob B, August 24, 2005 - 10:17 am UTC

I think the best way to convince your manager that you need to use MVs is to give him a hardware estimate for his goals:

First, run some of the likely candidates for ad-hoc queries. Focus on ones that will likely run longer than 7 seconds. Take the maximum time of all runs as the query time to be reduced. Then based on your current customer load and projected customer load, estimate the max concurrent users running ad-hoc queries. Multiply these two numbers together (max concurrent users * max query time) and divide by 7. This is a crude (and wrong) estimate of how many copies of your current hardware you'll need.

While the estimate is a complete crock, it might put some sense to the manager that using MVs will save time and money. The manager has essentially told you to put a nail into a piece of wood, but don't use a hammer or a nail gun. There's a great acronym for this kind of problem:

WOMBAT - Waste of Money, Brains, and Time.

What is the quickest way to generate Oracle tables?

P Winfield, January 26, 2006 - 2:50 pm UTC

Hi Tom

What is the best tool (in your opinion) for generating a bunch of tables? I've got a paper document with 25 tables, and doing it with a script is the only way I know.
What would you do? Any tools or 3rd-party products that can make this task quicker?

We are just after standard datatypes (varchar2, number etc) with primary and foreign key constraints, nothing too exotic.

Regards

Phil

Tom Kyte
January 27, 2006 - 8:16 am UTC

sounds like it would be as fast to type it in from "paper" as it would be to point click and press a button (faster probably).

sounds like a "script" to me.

Inversed one (fact table) to many (dimension table)

Wayne, February 03, 2006 - 4:20 pm UTC

The mainstream DW design is to have a fact table pointing to many dimensions (star). For example, a transaction has a customer dimension.
If we don't maintain change history of the customer, we will always report transactions with customer's latest information, name, address, etc. (join with CUSTOMER_ID)
If we do maintain its change history, we will put a customer surrogate key in the transaction table, so we will always report relevant customer information at the time of transaction.
If we want to report all the transactions with the most current customer address, we can put the original customer_id in the transaction table along with the surrogate key, so that we can use the customer_id to join with the customer dimension table and use the most recent indicator (e.g. end_date is null).


Interestingly, some of our developers came up with another design (with changing customer history):
The primary key of the customer dimension table is CUSTOMER_ID + START_DATE.
They put only the CUSTOMER_ID in the transaction table, so the fact table (one) joins to the customer table (many) -- 1-to-many the opposite way.
If they want to report customer information as of the time of the transaction, they join the tables and specify fact.TRANS_DATE between dim.START_DT and dim.END_DATE (see the sketch below).
If they want to report the latest information along with transactions, they join the tables and specify dim.END_DATE IS NULL (or 12/31/9999) -- only one such record will be there.
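A rough sketch of that "as of the transaction" join (the table and column names are made up to match the description above):

-- customer attributes as they were at the time of each transaction: pick the
-- effective-dated customer row whose date range covers TRANS_DATE
select f.trans_id,
       f.trans_date,
       c.customer_name,
       c.address
  from transaction_fact f,
       customer_dim     c
 where f.customer_id = c.customer_id
   and f.trans_date between c.start_date
                        and nvl(c.end_date, date '9999-12-31');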

Do you think this is a viable solution? Are there any performance issues? This is not so much a DW as an operational reporting environment.


RE: DATE_DIM

Bakunian, June 21, 2006 - 8:22 pm UTC

Tom,

I 100% agree with using the DATE data type for a date dimension's PK. However, how would you store dates in the fact table that are not applicable -- say there is no REFUND_DT value available for a particular record? Would you store a non-existent date like 01/01/0001?
There is also a notion that indexes are smaller on NUMBERS than on DATE columns. Is that true?

Tom Kyte
June 22, 2006 - 11:58 am UTC

what is wrong with NULL?

dates are fixed width 7 byte fields.
numbers are varying length fields between 0 and 22 bytes.



Hmm, looks like an index on a number holding a date (which is just so so so so very very very very very bad) is actually LARGER... bummer eh

Notions - no time for "NOTIONS".  Everything can be "shown"...


ops$tkyte@ORA10GR2> create table t
  2  as
  3  select trunc(sysdate)+rownum dt_dt, to_char(trunc(sysdate)+rownum,'yyyymmdd') num_dt
  4    from all_objects;

Table created.

ops$tkyte@ORA10GR2> insert into t select * from t;

49786 rows created.

ops$tkyte@ORA10GR2> insert into t select * from t;

99572 rows created.

ops$tkyte@ORA10GR2> insert into t select * from t;

199144 rows created.

ops$tkyte@ORA10GR2>
ops$tkyte@ORA10GR2> create index dt_idx on t(dt_dt) tablespace manual;

Index created.

ops$tkyte@ORA10GR2> create index num_idx on t(num_dt) tablespace manual;

Index created.

ops$tkyte@ORA10GR2>
ops$tkyte@ORA10GR2> select index_name, leaf_blocks from user_indexes where index_name in ('DT_IDX','NUM_IDX');

INDEX_NAME                     LEAF_BLOCKS
------------------------------ -----------
DT_IDX                                1057
NUM_IDX                               1110

 

RE: Null in Date

Bakunian, June 22, 2006 - 12:17 pm UTC

"what is wrong with NULL?"
Assuming "Sale Date" is nullable and let's say you have no sale on Tuesday. So if you asking following question how would you show that Tuesday had 0 sales.

Select count(*) from T1
Where "sale date" between Mon and Fri
Group By "Sale Date"


Tom Kyte
June 22, 2006 - 2:45 pm UTC

you missed the point - if we DID NOT use null - how would this change anything in the above??????


to show all of the sales for a given range of dates, you typically outer join to a dimension that has them all.


Your question involves "missing rows"

ops$tkyte@ORA10GR2> create table t ( dt date, sales number );

Table created.

ops$tkyte@ORA10GR2>
ops$tkyte@ORA10GR2> insert into t values ( trunc(sysdate), 100 );

1 row created.

ops$tkyte@ORA10GR2> insert into t values ( trunc(sysdate)+2, 200 );

1 row created.

ops$tkyte@ORA10GR2> insert into t values ( trunc(sysdate)+4, 500 );

1 row created.

ops$tkyte@ORA10GR2>
ops$tkyte@ORA10GR2> with dates
  2  as
  3  (select trunc(sysdate)+level-1 dt from dual connect by level <= 7)
  4  select dates.dt, t.sales
  5    from dates, t
  6   where dates.dt = t.dt(+)
  7   order by dates.dt;

DT             SALES
--------- ----------
22-JUN-06        100
23-JUN-06
24-JUN-06        200
25-JUN-06
26-JUN-06        500
27-JUN-06
28-JUN-06

7 rows selected.
 

To Bakunian

Michel Cadot, June 22, 2006 - 12:43 pm UTC

Your query is not asking for what you expected.
Your query means "in which day have I sold something and how many?".
What you want is "how many have I sold in each day?".
You start with a calendar and watch the sales within it.
So in your query you have to start with a calendar (a table or one you dynamically generate) and then outer join with your sales.

Regards
Michel


Referential integrity

Sanjay, November 15, 2006 - 2:21 am UTC

Hi,

In an interview I was asked: if we remove the referential integrity from a Star Transformation (with the bitmap indexes intact), will we see a decrease in performance of the data fetch query?

My answer was YES, but I was not able to give a good explanation for it. Could you please explain?

Many Thanks in advance.

Tom Kyte
November 15, 2006 - 7:08 am UTC

the only sensible answer would be:

it depends.

they were probably looking for: you should have bitmaps on all the dimension keys and of course foreign keys should be asserted so that the optimizer understands that every row in the fact table will find a mate in the dimensions.

If the query plan changes because the integrity constraints are not there - then the query might run

a) faster
b) slower
c) the same

all we can say for sure is that the plans are different - if they are different.



Thanks

Sanjay, November 15, 2006 - 8:18 am UTC

Hi,

Thanks for a prompt response.

But will Oracle still consider a star transformation (or snowflake transformation) once the FKs are removed?

Regards,
Sanjay


Tom Kyte
November 16, 2006 - 2:47 am UTC

you should have the fk's there, it only makes sense. facts out to dimensions implies foreign keys. the bitmaps are on the foreign key columns.

Optimizer and foreign keys...

A reader, November 15, 2006 - 6:31 pm UTC

Tom,

Do you have an example and/or link for how "...foreign keys should be asserted...", in addition to available indexes, in developing an execution plan? I.e. here is an example where the tables have a FK, and here one without a FK defined, with the difference in execution plans/costs.

I know intuitively that the FK and supporting indexes should be there, but I have been looking and trying to create an example for quite some time to refute the "...constraints and definitions are too costly to maintain at the database level..." type of discussion for improving whse/mart performance. That is, a FK should be defined in addition to creating the supporting index, as an index alone just does not provide enough information about the data "...so that the optimizer understands that every row in the fact table will find a mate in the dimensions..."

Something along the lines of what you did in the "Using dimensions for query rewrite" discussion, which went a long way in educating me on how to make better use of materialized views to answer yet other questions.

Thanks,

Tom Kyte
November 16, 2006 - 2:59 pm UTC

</code> http://asktom.oracle.com/Misc/stuck-in-rut.html <code>

not using constraints is the most foolish decision one can make, especially in a data warehouse.

the MOST FOOLISH thing one can do

not a very foolish thing, but one of the most

not null - so important
primary key - so important
foreign key - so important
all check constraints - yes those too are used by the optimizer

it is short sighted of them to not use the constraints, very much so.
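as a sketch, where the load process already guarantees the relationship, the constraint can even be declared without the cost of re-checking existing data (names are invented; whether RELY NOVALIDATE is appropriate depends on how much you trust the loads and on your rewrite integrity setting):

-- declared but not re-validated: cheap to add, and still gives the optimizer
-- the fact-to-dimension relationship to work with
alter table sales_fact
  add constraint sales_fact_time_fk
  foreign key (time_key) references time_dim (time_key)
  rely disable novalidate;

-- not null matters too -- the optimizer should know every fact row has a key
alter table sales_fact modify (time_key not null);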

Still confused

Srini, March 29, 2007 - 4:32 pm UTC

I am posting it here as one of the questions in this chain was how to partition when we do not have a date column in the fact table.
We have a surrogate key in the fact table for a calendar dimension. My goal is to partition the data by month; this way I can enable archiving at the database level by moving partitions. I have 4 such fact tables. It seems to me that I have a few options:
1. Add the date column to the fact table. This makes it a denormalized fact. Partition by date. This would involve converting all the existing data (4+ yrs) and changing our existing processes to load the date as well.
2. Use the trick suggested above of creating a deterministic number key by converting it to a Julian day. Then I can partition by range on the surrogate key.
3. Somewhat similar to 2 above, but without the Julian date logic. Since the calendar surrogate is a sequence, I can take advantage of the fact that we have a non-overlapping sequential range of numbers for a given month. So I can create a range partition on the surrogate key.
Tom Kyte
March 30, 2007 - 1:06 pm UTC

currently, in 10gr2 and before, you can only partition based on actual columns in the table.

seems you have created a surrogate key for your date dimension - another valid approach is to use.... a natural key (the date itself).

Avinash, April 20, 2007 - 6:06 am UTC

I have three dimensions and a fact. Dimension1 and dimension2 have an n*n relationship, and the relation between dimension1 and dimension3 is 1*n. The relation between the fact and dimension1 is 1*n. The existing model (except for these tables) is a star.

So which design do you suggest for this situation? What I think is that dimension1 and dimension2 can be snowflaked, and the same (snowflake) can be done for dimension1 and dimension3.

Please suggest.
Tom Kyte
April 20, 2007 - 7:14 am UTC

dimensions do not have relationships to each other.

OK

Kumar, April 20, 2007 - 7:54 am UTC

Hi Tom,
Could you please provide the link for the LATEST
Oracle Data Warehousing Guide? I hope there is a guide for
Oracle Database 10g Release 2.
Tom Kyte
April 20, 2007 - 10:00 am UTC

teaching to fish:

goto otn.oracle.com
click on documentation
click on database version of choice

find all documentation for that version.

LATEST -- FTM then RTM ?

Duke Ganote, the perplexed, April 20, 2007 - 8:49 am UTC

Avinash, April 24, 2007 - 3:00 am UTC

Hi Tom,

I will describe the business here so that you will be able to decide better.

Suppose I sell a product. For some reason, I need to replace that product. We call this a Replacement (Dimension1). Now for each Replacement, I can have a root cause analysis (Dimension2) as to why the product was replaced. Also, the logistics data for doing the replacement is stored in Dimension3. So the relation between dimension1 and dimension3 is 1*n, and dimension1 and dimension2 have an n*n relationship. And we have a fact table which contains all the installed products, so the relation between the fact and dimension1 is 1*n.

In short,
I have three dimensions and a fact. Dimension1 and dimension2 have an n*n relationship, and the relation between dimension1 and dimension3 is 1*n. The relation between the fact and dimension1 is 1*n. The existing model (except for these tables) is a star.

So which design do you suggest for this situation? What I think is that dimension1 and dimension2 can be snowflaked, and the same (snowflake) can be done for dimension1 and dimension3.

Please suggest.


Tom Kyte
April 24, 2007 - 10:59 am UTC

that sounds like a transactional system implementation/model.

when you say "dimension" I think warehouse - and the rules of OLTP do not always apply.

doesn't seem right that a root cause would be Many to Many.

I'm not going to be able to do a model for you in this forum, the details needed would be larger than I want you to type in here.

dimensions are not really "many to many" - there is a missing dimension in there so you have a 1:M and M:1 relation - but now you have fact to dim to dim to dim - getting a bit "fancy" with the snowflake at that point.

Surrogate keys for dates

Jon Roberts, June 05, 2007 - 2:41 pm UTC

Tom, the strict Kimball readers on here disagree with you that you should put and use the actual date as a natural key on the fact table. They argue it should be a surrogate key and the actual date in a date dimension.

The purpose for doing this is to allow a tool like Discoverer to join fact tables through common dimensions. Time being a very important dimension you would want to work with in this way.

Example:
Suppose I have a SALES_FACT and a BUDGET_FACT with common dimensions like MARKET_DIM and DATE_DIM. Now, a report may want to have actual sales, the budget, and the market. Add in a calculation too to see if we did better or worse than the budget.

create table sales_fact
(market_id number not null,
 date_id number not null,
 order_id number not null,
 quantity number not null,
 dollars_sold number not null);

create table budget_fact
(market_id number not null,
 date_id number not null,
 quantity number not null,
 dollars_sold number not null);

create table time_dim
(date_id number primary key not null,
 day_stamp date not null,
 month_stamp date not null,
 year_stamp date not null);
 
create table market_dim
(market_id number primary key not null,
 market_name varchar2(100) not null,
 sub_market varchar2(100) not null);
 
alter table budget_fact add foreign key (market_id) references market_dim (market_id);
alter table sales_fact add foreign key (market_id) references market_dim (market_id);

alter table budget_fact add foreign key (date_id) references time_dim (date_id);
alter table sales_fact add foreign key (date_id) references time_dim (date_id);



Notice how the budget doesn't have an ORDER_ID. It is at a higher grain than the SALES_FACT.

Now a query that does this would be a "fan trap" and yield incorrect results.

select market.market_name,
       t.month_stamp,
       sum(sales.quantity) total_sales,
       sum(budget.quantity) total_budget,
       sum(sales.dollars_sold) total_sales_dollars,
       sum(budget.dollars_sold) total_budget_dollars,
       sum(sales.quantity)/sum(budget.quantity) ratio_sales_budget
  from market_dim market,
       sales_fact sales,
       budget_fact budget,
       time_dim t
 where budget.market_id = market.market_id
   and sales.market_id = market.market_id
   and t.date_id = sales.date_id
   and t.date_id = budget.date_id
 group by market.market_name,
       t.month_stamp

To write this query properly in one pass, you would have to write it with in-line views and this is how Discoverer does it:
select b.market_name,
       b.month_stamp,
       s.total_sales,
       s.total_sales_dollars,
       b.total_budget,
       b.total_budget_dollars,
       s.total_sales/b.total_budget ratio_sales_budget
  from (select sales_market.market_name,
               t.month_stamp,
               sum(sales.quantity) total_sales,
               sum(sales.dollars_sold) total_sales_dollars
          from market_dim sales_market,
               sales_fact sales,
               time_dim t
         where sales.market_id = sales_market.market_id
           and t.date_id = sales.date_id
         group by sales_market.market_name,
                  t.month_stamp) s,
       (select budget_market.market_name,
               t.month_stamp,
               sum(budget.quantity) total_budget,
               sum(budget.dollars_sold) total_budget_dollars
          from market_dim budget_market,
               budget_fact budget,
               time_dim t
         where budget.market_id = budget_market.market_id
           and t.date_id = budget.date_id
         group by budget_market.market_name,
               t.month_stamp) b
  where b.market_name = s.market_name
    and b.month_stamp = s.month_stamp


This is a valid reason why you should only have surrogate keys and measures on a fact table.

Now the question is, how best to partition this? Should it be done by range on the date_id since people are frequently putting filters on a date range?

Can Oracle star transform AND partition eliminate in the same query?
Tom Kyte
June 06, 2007 - 1:29 pm UTC

eh???

The purpose for doing this is to allow a tool like Discoverer to join fact tables through common dimensions. Time being a very important dimension you would want to work with in this way.

explain that please.



tell me please, what is the difference between using

20070601123323

as the "key" or

42

both are

a) unique
b) immutable
c) can be used to join


but the first one can be used directly without HAVING to join when you want to.
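
To make that concrete - a small sketch (the fact table and its columns here are hypothetical): a key derived from the date can be computed and filtered on directly, while a sequence value like 42 tells you nothing without a join.

-- deriving a yyyymmddhh24miss-style key from the date itself
select to_number(to_char(to_date('01-jun-2007 12:33:23','dd-mon-yyyy hh24:mi:ss'),
                         'yyyymmddhh24miss')) as date_key
  from dual;

-- filtering the fact directly, no join to the time dimension required
select sum(fact.dollars_sold)
  from fact
 where fact.dt_as_a_number between 20070601000000 and 20070630235959;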

Adding additional dimension between fact and dimension table

A reader, July 12, 2007 - 3:51 pm UTC

Hi Tom,

Consider the following scenario:

Fact table Transaction_fact is as follows:

trans_id
trans_date
trans_state_id
trans_amount

The dimension table location_dim is as follows:

location_state_id
location_state_name
location_state_desc

Initially we used to receive records at state level. But now the design has changed and we are going to receive records at city level.

I thought about 2 approaches:

Approach 1:

Create one city dimension table and add a trans_city_id column in transaction_fact table.

Approach 2:

Close the existing records in the location_dim table.
Add a column location_city_desc in the location_dim table.
Add new records for state-city combination records in location_dim table.

Both approaches will produce the same output, but I see approach 2 as slightly better than approach 1.

The reporting layer is going to be modified for this change.

Compared to approach 1, approach 2 has:

1) Less impact, as I just have to modify the dimension table and the derivation logic.
2) The earlier records (records before the change) point to different records in the dimension than the new records do (the new records point to the new rows in the dimension table).

One problem I face with approach 2 is that I will have to convert location_dim from Type 1 to Type 2.

Awaiting your valuable inputs/thoughts.

Regards

FKs in Fact Table

A reader, September 30, 2008 - 6:13 am UTC

Hi Tom,

In a star schema it is said that the keys of all dimensions must be present in the associated fact table as foreign keys, and a bitmap index should be created on each foreign key.
Also, that we should create the PK of the fact table as the combination of all FKs in the fact table.

My question is:

Should the foreign keys in the fact table be defined using the following?

ALTER TABLE fact_table
ADD CONSTRAINT fk_name FOREIGN KEY (pk_of_dimension) REFERENCES dim_table (pk_of_dimension);

If yes, then won't there be a performance issue when we load data into the fact table?

I will appreciate your help in clarifying my doubts.

Thanks

Tom Kyte
September 30, 2008 - 9:16 am UTC

... Also, that we should create the PK of the fact table as the combination of all FKs in the fact table ...

huh? what do you mean by that??



... If yes, then won't there be a performance issue when we load data into the fact table? ...

this always makes me laugh. So worried about something you do once - when you should be worried about something you do over and over (query).

You need constraints to be in place in a warehouse for performance. For optimal performance. The OPTIMIZER uses constraints in order to optimize queries. The fewer constraints you have asserted - the fewer options available to the optimizer.


read this:
http://asktom.oracle.com/Misc/stuck-in-rut.html

and put as many constraints as you can in place.
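
To give that a concrete shape (a sketch only, reusing the made-up names from the question above): the enforced foreign key is simply the ALTER TABLE shown in the question. Where the load process has already validated the data and you only want the optimizer to know about the relationship, the usual data-warehouse idiom is RELY DISABLE NOVALIDATE:

-- the parent key must be RELY before a dependent foreign key can be
-- (assumes the dimension's primary key constraint is named dim_table_pk)
alter table dim_table modify constraint dim_table_pk rely;

alter table fact_table
  add constraint fk_name foreign key (pk_of_dimension)
  references dim_table (pk_of_dimension)
  rely disable novalidate;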

A reader, October 01, 2008 - 3:22 am UTC

Thanks a lot Tom.

"... Also, we should create pk of fact table as combination of all FKs in fact
table. "

I mean by above statement that in DWH it is said that once should created PK of Fact table by combining FKs from dimention tables ..ie PKs is composite key...

Is it a good practice?

Tom Kyte
October 01, 2008 - 11:57 am UTC

that does not make any sense whatsoever.

The combination of all foreign keys would not in general be unique. I don't get what you are saying.



nulls in fact

bakunian, October 01, 2008 - 5:00 pm UTC

Tom,

Interesting topic. So you are saying that it's OK to have nulls in the fact table, meaning that if there is no dimension value available it must be null. Correct?

Kimball recommends having a surrogate key for "nulls", something like 999. I can see his point: otherwise every query will require extra checking for nulls ("...or state_key is null" type of thing), which unnecessarily complicates the query.
And I see your point: if the value is unknown, logically it must be "null", and that helps Oracle by providing extra information. Therefore there is a trade-off of query complexity vs. logic.
What do you think?
Tom Kyte
October 01, 2008 - 5:42 pm UTC

define what you mean by "each query will require extra checking for nulls"

I don't know why the queries would be harder to code - give example using EMP and DEPT - EMP is fact table, DEPT is dimension. DEPTNO in EMP may be null - why is it harder to join and query?

nulls in fact

bakunian, October 01, 2008 - 7:33 pm UTC

I just realized that for facts to be OK with nulls, all that is required is an outer join on the side of the dimension. Correct?
So often we accept concepts as gospel.

Thanks for help refuting that theory.

The funny thing is that Oracle Warehouse Builder 10g generates a surrogate of type "number" for the time_dim key.






Tom Kyte
October 01, 2008 - 10:43 pm UTC

... I just realized that for facts to be OK with nulls, all that is required is an outer join on the side of the dimension. Correct?
...

yes


there are lots of funny things out there. I agree.
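
Using the EMP/DEPT analogy from above (EMP playing the fact, DEPT the dimension), a sketch of that outer join - rows whose DEPTNO is null still show up, with a null department name:

select d.dname, count(*) emp_count, sum(e.sal) total_sal
  from emp e left outer join dept d on (e.deptno = d.deptno)
 group by d.dname;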

Snowflake question

A reader, October 15, 2008 - 12:33 pm UTC

Tom,

Since the fact table represents a many-to-many relationship among dimension tables, why is there ever a need for a snowflake?

For example, if a worker has many skills and the same skill can belong to many workers, then the fact table can represent that by having the employee id repeated with different skill ids in each row. So then what is the point of having a snowflake?

Thank you for you response




Tom Kyte
October 15, 2008 - 6:02 pm UTC

a fact table does not really represent a many to many relationship between dimension tables.


not sure what you mean.


you would not repeat the entire employee record in the fact table, you would have the fact table, a one to many relationship to employee skills and that would have a one to one relationship out to the skills table


create table emp ( empno number primary key, other_data varchar2(100) );
create table skills ( skillno number primary key, other_data varchar2(100) );
create table emp_skills ( empno number references emp, skillno number references skills, primary key(empno, skillno) );



Snowflake question

A reader, October 15, 2008 - 9:37 pm UTC

Now you lost me.
" you would have the fact table, a one to many relationship to employee skills and that would have a one to one relationship out to the skills table "

Did you mean to say that employee skills would have a many-to-one relationship out to the skills table?
Because if you did not, then I have a hard time understanding how a one-to-one between the employee skills and skills tables would represent a many-to-many relationship between employees and skills.

Thank you
Tom Kyte
October 17, 2008 - 8:33 pm UTC

I am talking of the relations in one direction.

emp would have many emp_skills would have one skill

emp_skills is the materialization of a many to many relationship since going the other way

skill would have many emp_skills would have one emp


emp_skills is commonly referred to as an association table.


You wrote:

... Since the fact table represents a many-to-many relationship among dimension tables, why is there ever a need for a snowflake? ....

to which I say "no, it doesn't". I don't know what you mean.

Date date key or artificial key

Jay, October 17, 2008 - 2:09 am UTC

Hi Tom,
If we are using sequences to load the DIM_DATE dimension and the day_id (integer) is ordered on the date sequence, the performance issue you mentioned won't be there, will it? Two consecutive dates have 2 consecutive integers as the key.
I used the SQL you used to create table t, and added one more column to have the artificial key start from 1.
SQL> select * from t where y between 20081018
  2  and 20081019;

X                  Y          Z                                                 
--------- ---------- ----------                                                 
18-OCT-08   20081018          1                                                 
19-OCT-08   20081019          2                                                 


Execution Plan
----------------------------------------------------------                      
Plan hash value: 1601196873                                                     
                                                                                
--------------------------------------------------------------------------      
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |      
--------------------------------------------------------------------------      
|   0 | SELECT STATEMENT  |      |     1 |    16 |    43   (5)| 00:00:01 |      
|*  1 |  TABLE ACCESS FULL| T    |     1 |    16 |    43   (5)| 00:00:01 |      
--------------------------------------------------------------------------      
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - filter("Y"<=20081019 AND "Y">=20081018)                                  

SQL> 
SQL> select * from t where z between 1
  2  and 2;

X                  Y          Z                                                 
--------- ---------- ----------                                                 
18-OCT-08   20081018          1                                                 
19-OCT-08   20081019          2                                                 


Execution Plan
----------------------------------------------------------                      
Plan hash value: 1601196873                                                     
                                                                                
--------------------------------------------------------------------------      
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |      
--------------------------------------------------------------------------      
|   0 | SELECT STATEMENT  |      |     2 |    32 |    43   (5)| 00:00:01 |      
|*  1 |  TABLE ACCESS FULL| T    |     2 |    32 |    43   (5)| 00:00:01 |      
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - filter("Z"<=2 AND "Z">=1)                                                

SQL> 
SQL> 
SQL> select * from t where x between to_date('18-Oct-2008','DD-Mon-YYYY')
  2  and to_date('20-Oct-2008','DD-Mon-YYYY') -1/24/60/60;

X                  Y          Z                                                 
--------- ---------- ----------                                                 
18-OCT-08   20081018          1                                                 
19-OCT-08   20081019          2                                                 


Execution Plan
----------------------------------------------------------                      
Plan hash value: 1601196873                                                     

                                                                                
--------------------------------------------------------------------------      
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |      
--------------------------------------------------------------------------      
|   0 | SELECT STATEMENT  |      |     3 |    48 |    43   (5)| 00:00:01 |      
|*  1 |  TABLE ACCESS FULL| T    |     3 |    48 |    43   (5)| 00:00:01 |      
--------------------------------------------------------------------------      
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - filter("X"<=TO_DATE('2008-10-19 23:59:59', 'yyyy-mm-dd                   
              hh24:mi:ss') AND "X">=TO_DATE('2008-10-18 00:00:00', 'yyyy-mm-dd  
              hh24:mi:ss'))                                                     

Tom Kyte
October 17, 2008 - 9:27 pm UTC

and when would an end user ever be able to use 1 and 2 as input????


1 and 2 are just random numbers - what does the number 42 mean? as a date?

No, it would not work in real life, we'd never really use 42 and 55 in a where clause, we'd be joining the fact to the time_dim and using a where on the functionally dependent DATE data in the where clause

select ...
from fact
where fact.dt_as_a_number between
(select id from time_dim where time_dim.dt = to_date('31-dec-2006','dd-mon-yyyy'))
and
(select id from time_dim where time_dim.dt = to_date('01-jan-2007','dd-mon-yyyy'))


Humans will input dates, dates need to be converted into surrogates in this case.

Artificial date keys

Mike, October 20, 2008 - 10:18 am UTC

Actually, you are placing faith in the idea that your dates are ordered in the same way as your artificial keys.

This query:
select ...
from fact
where fact.dt_as_a_number between
(select id from time_dim where time_dim.dt = to_date('31-dec-2006','dd-mon-yyyy'))
and
(select id from time_dim where time_dim.dt = to_date('01-jan-2007','dd-mon-yyyy'))

would be more reliable as
select ...
from fact
where fact.dt_as_a_number in
(select id from time_dim where time_dim.dt between to_date('31-dec-2006','dd-mon-yyyy') and to_date('01-jan-2007','dd-mon-yyyy'))

You should discourage the assumption that your artificial keys have any meaning, including ordering, continuity, etc.

Artificial keys are not always as easy as they sound.
Tom Kyte
October 21, 2008 - 12:02 pm UTC

yes, another good point, sequence numbers are not gap free, they are not necessarily "orderable" either.

You would have to be *very* careful to ensure your synthetic keys are generated in a 100% serial, predictable fashion.

Fact table update

reader, October 21, 2008 - 3:05 am UTC

Tom,

Can the fact table be updated in this scenario where all dimensions are Type 1?

create table factless_fact (fact_key number, customer_key number, product_key number, service_key number);

create table customer_dim (customer_key number, first_name varchar2(10) );
create table product_dim (product_key number, product_name varchar2(10) );
create table service_dim (service_key number, service_name varchar2(10) );

insert into customer_dim values ( 1, 'Joe' );
insert into product_dim values ( 1, 'Books' );
insert into service_dim values ( 1, 'Books online' );

insert into factless_fact (fact_key, customer_key, product_key) values (1, 1, 1);

The next day Joe subscribed to a service, so should I update the fact table or insert a new row for customer Joe?

update factless_fact set service_key = 1
where fact_key = 1 ?

or

insert into factless_fact (fact_key, customer_key, service_key) values (2, 1, 1 );

Thank you for your reply
Tom Kyte
October 21, 2008 - 3:47 pm UTC

I believe you have to answer this.


Do you

a) need current status of joe
b) joe's history


or can joe have multiple services simultaneously (if so, you might have modeled this wrong)

in short, insufficient data - I don't know your data requirements here.

fact table update

reader from LA, October 21, 2008 - 8:32 pm UTC

Hi Tom,

I thought that the only place history is kept is in the dimension table, and therefore data in Type 1 dimensions is just replaced for a particular row and no new row is inserted into the fact table.

To answer your questions.
Do you

a) need current status of joe
Yes
b) joe's history
No, because its type 1 dimension table

Yes, user joe can have more than one service so what did I do wrong?

Thanks again
Tom Kyte
October 22, 2008 - 8:33 am UTC

Look - you are the only one that knows what your data needs are. You can call them type 1 slowly changing dimensions or type 42 rapidly expanding universes - but at the end of the day - fact tables can absolutely contain history, and many times do.

and you do not really do history in the dimension, the history would typically be associated with the fact. type 1 slowly changing dimensions do not do history.

If user joe has more than one service, then you have three tables - more of a snowflake than a star.

service is the dimension.
person is the fact

a person has many services
a service has many persons ---- needs association object.
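
A minimal sketch of that shape (names invented), mirroring the EMP/EMP_SKILLS/SKILLS example earlier in the thread:

create table service_dim
( service_key   number primary key,
  service_name  varchar2(30) not null );

create table person_fact
( person_key  number primary key,
  first_name  varchar2(30) not null );

-- the association object: one row per person/service pair
create table person_service
( person_key   number references person_fact,
  service_key  number references service_dim,
  primary key (person_key, service_key) );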

Date surrogate keys - again

Jay, October 24, 2008 - 2:58 am UTC

Hi Tom,

In response to my explain plans, you asked
"and when would an end user every be able to use 1 and 2 as input????"

Never. He/she is not expected to either. The user is presented a prompt or drop-down for choosing dates. That is how decent tools like Business Objects, Cognos etc work. The tool uses that to filter data in the dimension table, then joins the records with those in the fact table using day_id as the key.


"1 and 2 are just random numbers - what does the number 42 mean? as a date?"
Nothing - and it does not matter. User sees only dates.

"No, it would not work in real life, we'd never really use 42 and 55 in a where clause, we'd be joining the fact to the time_dim and using a where on the functionally dependent DATE data in the where clause

select ...
from fact
where fact.dt_as_a_number between
(select id from time_dim where time_dim.dt = to_date('31-dec-2006','dd-mon-yyyy'))
and
(select id from time_dim where time_dim.dt = to_date('01-jan-2007','dd-mon-yyyy'))
"

Actually, that works in real life (using surrogate keys). Users do not use surrogate keys. They use dates. The tool/query builder will ensure after filtering on the dates, the join is done on the surrogate key. I have been working with data warehouses for JPMorgan Chase, GE etc which have huge data warehouses running like this.
I am not saying that is the only/best way- it is certainly debatable, that is why we have this post :). But it certainly works in real-life. And the query is not formed the way you wrote it. It will be more like

SELECT
FACT_MEMBERSHIP_ACTIVITY.LOYALTY_PROGRAM_TIER_ID,
FACT_MEMBERSHIP_ACTIVITY.BASE_POINTS_USED,
FACT_MEMBERSHIP_ACTIVITY.POINT_TYPE_2_USED,
DIM_DAY.ACTUAL_DATE
FROM
FACT_MEMBERSHIP_ACTIVITY,
DIM_DAY
WHERE
( DIM_DAY.DAY_ID=FACT_MEMBERSHIP_ACTIVITY.DAY_ID )
AND (
DIM_DAY.ACTUAL_DATE BETWEEN {d '2007-11-24'} AND {d '2009-01-11'}
)

Constrain/filter on dimension tables, then go to the fact table is how I understand these queries are expected to work.

Keys not being in sequence - the date/time dimension would be loaded once (or a few times at most). It is certainly possible to ensure that they are in sequence.




Tom Kyte
October 24, 2008 - 1:14 pm UTC

this is still a horrible idea. why have a TWO STEP process to build a simple query. This is not a good idea. Think about it - why build in complexity *on purpose*. Why obscure the data?? Why generate what is now NOT a surrogate key (if you have to load and assign in sequence, it is NOT a surrogate)

and upon retrieval, you have to join to the dimension any way - because if the guy says "get me stuff between A and B" odds are high (near 100%) they want to see this data (the dates, not the number 42 and 55).

Nope, not buying into this scheme.

Surrogate keys for dates

Mike, October 24, 2008 - 1:54 pm UTC

Does your BI tool truly let you select ranges on a dimension attribute, and generate SQL that specifies a range of surrogate keys? I think it is far more likely that you get SQL that looks like 'select from fact where dimension_key in (select dimension_key from dimension where [range-predicate on some other column])'. I think that aligns with Tom's point that your queries will be joins - not just predicates on the fact table.

Dimensions are sets of discrete things. It is generally not appropriate to say 'where dimension between A and B', like 'where color between red and white' or 'where model between Cobalt and Impala'. It is tempting to do that with dates, but that just shows that they are not a perfect fit for the 'dimension' concept.

Although it is useful to have dates treated as dimensions in many respects, they are not quite a round peg in a round hole.

Date dimension - to surrogate or not to..?

Jay, October 28, 2008 - 12:46 am UTC

Hi Mike,
The query I pasted is generated by Business Objects against our customer loyalty data mart. BI tools are pretty good and user-friendly - otherwise they wouldn't be minting money or having such a loyal customer base.
The rest of the argument - what is an attribute and what is not, is another long debate. Let me not get into that.

Tom,
I fully agree with your observations elsewhere in the page that using surrogate key for time dimension does add to storage - extra column, extra index and so on. Well, so far I have seen data warehouses/ data marts that use surrogates for dates, never thought about it, and have been designing dimensions that way (surrogate keys for dimensions - period). Let me try your approach. Thanks a lot for your input.
Tom Kyte
October 28, 2008 - 7:02 am UTC

@jay

Never. He/she is not expected to either. The user is presented a prompt or drop-down for choosing
dates. That is how decent tools like Business Objects, Cognos etc work. The tool uses that to
filter data in the dimension table, then joins the records with those in the fact table using
day_id as the key.


the tools are not building the query you say - they are building something entirely different. they would NOT build

where x between 42 and 55

they have no knowledge that they can even do that - that there is a direct relationship between the surrogate key and the data values (that 42 represents a date less than 55 does, and that all dates between the one that 42 represents and 55 represents are in the range between 42 and 55).

The action you say is taking place, does not take place. I too have used these tools. They do not make that leap of faith - and that is exactly what using 42 and 55 in the predicate against dates would be - a leap of faith.


Date dimension - surrogate

Jay, October 28, 2008 - 9:01 am UTC

Hi Tom,
The query I pasted - it does not filter on 45 or anything like that.
" DIM_DAY.ACTUAL_DATE BETWEEN {d '2007-11-24'} AND {d '2009-01-11'}" - this the tool formed using values user chose from a list of values
" DIM_DAY.DAY_ID=FACT_MEMBERSHIP_ACTIVITY.DAY_ID " - this the tool figured out (constructed or built or whatever) because the tables were linked using that column in Business Objects Universe. DAY_ID is the surrogate key used for time dimension.
The surrogate key is rarely, if ever, used for filtering/constraining. Surrogate keys are used for joining with the fact table. Filters will always be applied on data that users can understand. I think somewhere in the post, I was understood to convey that tools will convert actual data to surrogate values. No, that was not something I tried to convey. I think that holds for Mike's question - "Does your BI tool truly let you select ranges on a dimension attribute, and generate SQL that
specifies a range of surrogate keys?" - no, it does not go into creating a range of surrogate keys or anything like that.


I used surrogate key values in a query to test if the plan is worse compared to using filters on a date column (You demonstrated so in an example in the beginning of the post - "Followup August 7, 2004 - 9am US/Eastern:"). But in my case, the query with filters applied against the surrogate key column was doing better. I never meant to say that users are going to use surrogate keys to filter data or that tools rewrite queries. Users will use meaningful data, the tool will use surrogate keys to link to fact tables. Maybe we had a long discussion about nothing.

Fact table design

A reader, March 19, 2009 - 6:27 am UTC

Hi Tom,

We have a fact table at month level (calendar month - YYYYMM) with a daily load, thus balances are as of the last successful run. Now there is a requirement for YTD (financial year to date). YTD being at a different level of granularity, do you think
we should put YTD in the same table or create a new table
for YTD? The table will be queried by ad hoc users through a front-end BI tool.

Regards,
Tom Kyte
March 19, 2009 - 10:57 am UTC

if you wanted to have this precomputed - you would use a materialized view

otherwise, the fiscal year to date data ALREADY EXISTS in this table, it is called "aggregation", you just add them up (that is what the materialized view would do - it would do it once and then reuse the precomputed answer over and over transparently like an index works when users query the details)
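
For example (a sketch only, with made-up table and column names), a materialized view that pre-aggregates the detail to fiscal year/month level and is eligible for query rewrite:

create materialized view fiscal_month_mv
  build immediate
  refresh complete on demand
  enable query rewrite
as
select fiscal_year, fiscal_month, sum(balance) month_balance
  from balance_fact
 group by fiscal_year, fiscal_month;

-- a year-to-date query against the detail table can then be
-- transparently rewritten to roll up the materialized view:
select sum(balance)
  from balance_fact
 where fiscal_year  = 2009
   and fiscal_month <= 6;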

Fact table design

A reader, March 19, 2009 - 2:41 pm UTC

Hi Tom,

Thanks.

Data is already available in the DW (in old-style Excel format in the table - monthyear, balance, year-to-date average etc.). For the DM (star schema) we are normalising the fact records. The dilemma is: use the existing one from the DW in the same table, or a separate table?

Your advice will be really helpful.

Regards,
Tom Kyte
March 23, 2009 - 10:15 am UTC

just my opinion:

data warehouse implies "some level of thought, design"
excel format table implies "not a thought was given to much"

it seems funny to see them together.



I do not understand your dilemma however, not sure what you are asking?

Star schema design question

yoav ben moha, July 01, 2009 - 2:32 am UTC

Hi Tom,
Our 10g data warehouse has a staging database on one machine and a user report database on a second machine.
Currently we are not using a star schema at all.
1. I would like to know if I should consider implementing a star schema on the STAGING database as well?
2. Is it bad practice to design a data warehouse instance (e.g. not the staging instance) that does not use a star schema? Is it possible that I am getting worse performance than I could achieve by using a star schema?
Thanks


Tom Kyte
July 06, 2009 - 7:02 pm UTC

this question is unanswerable.

why?

because the answer to:

should every data warehouse be a star schema?

is absolutely, definitely:

NO


It depends. Would you benefit from a star schema (no clue, I know nothing about you at all). Would a star schema be able to support your queries (no clue).


All I can suggest is you read, learn, test and benchmark - or have someone come in and evaluate your current solution and help you understand why or why not "a star schema for you"

Conclusion?

pasko, August 04, 2009 - 6:26 am UTC

Hi Tom,

I have read the entire thread and I still have a question regarding the design of partitions for data warehouses.

Qn 1.
For the time dimension, if we use surrogate keys which will be FKs in the fact table, then how do we partition the fact table?

Possible answer (from this thread):
Do not use surrogate keys for the time dimension and use only a real date field. I note that even the Oracle sample schema (SH) uses this model.

Qn 2.
If we use a real date natural key in the time dimension and make it a foreign key in the fact table, then do we ever need to reference the time dimension at all? I mean, we could get almost everything from the date field: quarters, months, years, days, etc.

Thanks.

Regards,

Pasko



Tom Kyte
August 04, 2009 - 3:15 pm UTC

q1) correct on the possible answer

q2) correct, you might not need to store all of the 'derivable' columns in the dimension. sometimes things like "fiscal" year or non-standard mappings would be more easily implemented as a dimension rather than as a function

And as a dimension, you have precomputed the function - there is something to be said for that in some cases.

and as a formal dimension - you would be able to use the CREATE DIMENSION statement and describe hierarchies to us for use with query rewrite and materialized views

so there are pros and cons to consider.
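
A sketch of what that formal declaration could look like (assuming a TIME_DIM table with DAY_DT, MONTH_CD and FISCAL_YEAR_CD columns - all invented names):

create dimension fiscal_time_dim
  level day         is (time_dim.day_dt)
  level month       is (time_dim.month_cd)
  level fiscal_year is (time_dim.fiscal_year_cd)
  hierarchy fiscal_rollup (
    day child of
    month child of
    fiscal_year );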

Star Schema

Bob, September 30, 2009 - 9:44 pm UTC

Dear Tom,

I'm having some trouble identifying the dimension and fact table(s) in a quite simple relational model. I did understand the theory behind it, but when it's time to put it into practice, I realized I'm kind of a noob. Could you please help me with this, as the examples I found on the web turned out to be quite simple? The relational model is the following:
http://picasaweb.google.pt/r.gorrao/RelationalModels#5387451558529274434
I think the dimensions are: Country, Studio, Director, Movie, Epoch, Award, Actor, Roles and Rime.
I also guess I should put the Ratings in the fact table, but what about the Casts table?! Are my dimension/fact tables even right?
Sorry for bothering you on this, best regards,

BG

Different approach DW X OLTP

Marcos, October 14, 2009 - 3:59 pm UTC

Hi Tom,

Help me convince my DBA that DW/BI needs another Oracle instance, with different parameters, as there are a lot of differences. He insists on using the same approach in both cases: OLTP and DW (star schema).

Thanks in advance.

Index for start schema

A reader, November 01, 2010 - 2:49 am UTC

Hello Sir,

In my data warehousing system, we have used a Start schema data model, in which there are 4 fact tables and 11 dimensions. One of the fact tables contains 20M records, and in that fact table only 4 dimensions are used. 3 of the dimensions have low cardinality (one dimension has 9 records, the second has 112 records, the third has 1028 records); the first 2 dimensions are never going to change, and the 3rd and 4th dimensions can get new records, but their cardinality is always going to be <1% of the total records in the fact table. In this situation, do I need to create indexes on all the dimension columns? When should we create an index on the fact table? This fact table is updated through a nightly job only. If I create indexes, then do I need to drop them before inserting records into the fact table and recreate them again afterwards? I tried to search for articles related to these questions but could not find any.

Thanks in advance.
Ethan

Tom Kyte
November 01, 2010 - 3:28 pm UTC

star schema.... not start ;)

... In this situation, do I need to create indexes on
all the dimension columns? ...

probably, for we NEED primary keys and foreign keys to understand the relationships here. Without the primary keys - we'll never know that the fact table is key preserved in a join to the dimensions.

... when should we create index on fact table? ...

whenever by doing so, some query you execute would greatly benefit (and on the primary key of course - that'll be a must).

... If I create indexes then do I need to
drop them before inserting records into fact table and then recreate them
again? ...

probably not, these tables sound pretty small. 20million records isn't large - and I'll assume you just add a couple thousand records at a time - a very small amount of data indeed. But you don't really tell us - so I'm guessing.


I use tables of 1-10 million records on my LAPTOP for demos (the records by the way are 125 bytes in average length - now you have something I don't - a meaningful number, you can actually compute how big my data is - 20,000,000 records is *MEANINGLESS* when trying to convey "size" to anyone)

question

maneet, January 15, 2011 - 9:33 am UTC

I want to create a star schema for web log file data.
Can you please help me with how to make dimensions and facts for web log data? I want to use it for mining.
Please help.


What if time_key has to deal with seconds

Saad Ahmad, October 31, 2011 - 5:15 pm UTC

You mentioned
Create table Time_Dim
(
time_key number(sequence),
Year number,
month number,
week number,
day number,
level number
).
What if my fact table has a date column that includes seconds? If I take this concept, then the equivalent of this table will need to have 86,400 rows per day?

If I keep this table as one row per day and try a join with BETWEEN, that does not do partition pruning, i.e.
select x.*
from fact x join mydate
on x.trndte between mydate.date and mydate.date||235959
...
The explain does not show partition pruning!


Tom Kyte
November 01, 2011 - 2:53 am UTC

the table only needs to have as many distinct rows as you have distinct date/times.

If you have 86,400 unique times for a given day in your fact table - then yes, you need 86,400 rows in your dimension.

If you have 42 unique times for a given day in your fact table - then no, you only need 42 rows in your dimension.


And sure, it CAN show partition pruning, you'd have to give a more complete example for us to explain why partitioning pruning was either

a) a bad idea for your query
b) not possible for your query


ops$tkyte%ORA11GR2> CREATE TABLE fact
  2  (
  3    dt  date,
  4    x   int,
  5    y   varchar2(30)
  6  )
  7  PARTITION BY RANGE (dt)
  8  (
  9    PARTITION part1 VALUES LESS THAN (to_date('13-mar-2003','dd-mon-yyyy')) ,
 10    PARTITION part2 VALUES LESS THAN (to_date('14-mar-2003','dd-mon-yyyy')) ,
 11    PARTITION junk VALUES LESS THAN (MAXVALUE)
 12  )
 13  /

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> exec dbms_stats.set_table_stats( user, 'FACT', numrows => 1000000000, numblks => 100000 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> create table dim ( dt date, data char(20) );

Table created.

ops$tkyte%ORA11GR2> exec dbms_stats.set_table_stats( user, 'DIM', numrows => 10000, numblks => 1000 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> set autotrace traceonly explain
ops$tkyte%ORA11GR2> select fact.*
  2    from fact, dim
  3   where fact.dt >= dim.dt and fact.dt < dim.dt+1
  4     and dim.data = 'x';

Execution Plan
----------------------------------------------------------
Plan hash value: 950912430

--------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |   250M|    16G|  3503K (23)| 11:40:42 |       |       |
|   1 |  NESTED LOOPS             |      |   250M|    16G|  3503K (23)| 11:40:42 |       |       |
|*  2 |   TABLE ACCESS FULL       | DIM  |   100 |  3100 |   272   (0)| 00:00:04 |       |       |
|   3 |   PARTITION RANGE ITERATOR|      |  2500K|    92M| 35032  (23)| 00:07:01 |   KEY |   KEY |
|*  4 |    TABLE ACCESS FULL      | FACT |  2500K|    92M| 35032  (23)| 00:07:01 |   KEY |   KEY |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DIM"."DATA"='x')
   4 - filter("FACT"."DT">="DIM"."DT" AND "FACT"."DT"<INTERNAL_FUNCTION("DIM"."DT")+1)

ops$tkyte%ORA11GR2> set autotrace off



key key - partition pruning at runtime would happen.

Partition Granules

Amir Riaz, November 04, 2011 - 1:37 pm UTC

Hi Tom,

For a parallel query, the possible granules of parallelism are:

1. block granules
2. partition granules

When does Oracle consider partition granules?

From the Oracle documentation:

Full Partition-Wise Joins

A full partition-wise join divides a large join into smaller joins between a pair of partitions from the two joined tables. To use this feature, you must equipartition both tables on their join keys. For example, consider a large join between a sales table and a customer table on the column customerid. The query "find the records of all customers who bought more than 100 articles in Quarter 3 of 1999" is a typical example of a SQL statement performing such a join. The following is an example of this:

SELECT c.cust_last_name, COUNT(*)
FROM sales s, customers c
WHERE s.cust_id = c.cust_id AND
s.time_id BETWEEN TO_DATE('01-JUL-1999', 'DD-MON-YYYY') AND
(TO_DATE('01-OCT-1999', 'DD-MON-YYYY'))
GROUP BY c.cust_last_name HAVING COUNT(*) > 100;


This large join is typical in data warehousing environments. The entire customer table is joined with one quarter of the sales data. In large data warehouse applications, this might mean joining millions of rows. The join method to use in that case is obviously a hash join. You can reduce the processing time for this hash join even more if both tables are equipartitioned on the customerid column. This enables a full partition-wise join.

When you execute a full partition-wise join in parallel, the granule of parallelism, as described under "Granules of Parallelism", is a partition. As a result, the degree of parallelism is limited to the number of partitions. For example, you require at least 16 partitions to set the degree of parallelism of the query to 16.

You can use various partitioning methods to equipartition both tables on the column customerid with 16 partitions. These methods are described in these subsections.

Why does a partition-wise join use partition granules? Please elaborate with an example.


Tom Kyte
November 07, 2011 - 10:55 am UTC

think about it.

In the partition wise join - two things are true:

both tables are partitioned on the same key attributes
the join condition uses those key attributes.

So, we know that data in table1, partition1 is joined only to data in table2, partition1. And so on.


So, we know that to join T1 with T2 - we can just join t1.partition1 with t2.partition1, t1.partition2 with t2.partition2 and so on.

If we used dynamically generated block granules we wouldn't have that fact anymore - the blocks in granule1 of table 1 would have to be joined to all of the granules of table2 (well, ok, technically just the granules that 'touch' partition1 of table2). So, instead of a 1:1 relationship in the join, it is now a 1:many - so, each parallel execution server would have to read a granule from table1 and the entire partition from table2. You'd have many things reading the same partition of table2 and attempting to buffer it, potentially at the same time. It doesn't make sense.
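
A minimal sketch of the setup that makes a full partition-wise join possible (hypothetical tables; both are hash partitioned the same way on the join key):

create table customers_p
( cust_id    number        not null,
  cust_name  varchar2(40)  not null )
partition by hash (cust_id) partitions 16;

create table sales_p
( cust_id  number  not null,
  time_id  date    not null,
  amount   number  not null )
partition by hash (cust_id) partitions 16;

-- the join on cust_id can now be done partition to partition,
-- with each parallel execution server handling one pair of partitions
select /*+ parallel(s, 8) parallel(c, 8) */ c.cust_name, sum(s.amount)
  from sales_p s, customers_p c
 where s.cust_id = c.cust_id
 group by c.cust_name;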

very impressive .

Amir Riaz, November 08, 2011 - 2:18 am UTC

That's an impressive explanation.

Is the decision between block granules and partition granules based on cost or on some rules?

One rule I can find in the documentation is that the tables must have the structure to support partition granules; the documents say we need equip-partition. What is equip-partitions - do you take hash partition as equip-partition?
Tom Kyte
November 08, 2011 - 7:35 am UTC

Is the decision between block granules and partition granules based on cost or on some rules?


it is based on logic, yes. You cannot do an arbitrary "granule-wise" join since you'd have to join a given random granule to every other possible granule - there is no "1:1"ness there - it would be a cartesian type of join.


give me a pointer to the word "equip-partition" in the documentation please, give some context.

or did you mean "equi-partition" (equal, same partitioning scheme). They need to have their respective join keys be the partition key and employ the same partitioning scheme. That way we know that a given partition is joined to one partition in the other table (the data for a given partition in table 1 has to be contained in a single partition in table 2)


https://blogs.oracle.com/datawarehousing/entry/partition_wise_joins

Thanks

Amir Riaz, November 08, 2011 - 1:00 pm UTC

Please help me understand the following part of the documentation.

http://download.oracle.com/docs/cd/B10501_01/server.920/a96520/parpart.htm#98285

Figure 5-2 illustrates how the hash partitions are formed in the sales table. Each cell represents a subpartition. Each row corresponds to one range partition, for a total of 8 range partitions. Each range partition has 16 subpartitions. Each column corresponds to one hash partition for a total of 16 hash partitions; each hash partition has 8 subpartitions. Note that hash partitions can be defined only if all partitions have the same number of subpartitions, in this case, 16.

Tom, totally lost here.

Case 1:
In the range partitioning, suppose one range partition is selected. Each range partition has 16 hash subpartitions and the second table has 8 hash partitions. Will the hash subpartitions of the first table join to the hash partitions of the second table? What will be the degree of parallelism - 8, 16 or 1?

Case 2:
Will the second table be joined horizontally or vertically if I have to join all those 8 range partitions with the second table, which has 8 hash partitions?

Thanks
Tom Kyte
November 08, 2011 - 1:14 pm UTC

Tom, totally lost here.


is it because of the use of the word "row"? In the picture, when they refer to a row, they are referring to a row of partitions in the picture itself, not a row in the table.

Same with the word column.


they both have 16.

... and then subpartition each partition by hash on s_customerid using 16 subpartitions for each partition, for a total of 128 subpartitions. The customers table can still use hash partitioning with 16 partitions....


the customers table has 16 hash partitions, as does each range partition of the sales table.

star transformation and bind variables

A reader, November 09, 2011 - 1:53 am UTC

How will a star schema behave with and without literals in Oracle 11g? Keep in mind that Oracle 11g uses adaptive cursor sharing. Do we still have to provide literals, or can we go with bind variables?

In PL/SQL code, do we have to convert bind variables to literals?

The main question is: can star transformation now work with bind variables, or does it still need literals? If it needs literals, why?

A reader, April 04, 2012 - 9:53 am UTC

Sir,

I am a big big big fan of yours. I have been using Oracle for the last 9 months (working on SAP BO) and have now started reading your books, articles, blogs etc., and within a span of 9 months I have learned a lot just because of you. One day I will surely be working with you (hopefully). :)

As I am posting for the first time on your forum, please ignore any mistakes of mine.

I have designed a star schema for one of my data marts and my client is after me, suggesting that on top of it I should create an MV to provide a consolidated view. I am trying to convince my client not to do so with the points below:

1. As we have created a star schema in the database, we should take advantage of it and avoid creating another layer of reporting, which in future will increase the complexity of the queries when expanding the functionality of the mart.

2. We would have to create a complete-refresh MV, and during the refresh data will not be available for reporting to users; the refresh duration will increase over time as the data grows.

3. As an MV is a table on disk, using an MV in this case will consume tablespace, which will increase over time.

Can you please suggest any more points or additions? We are using SAP BO as the reporting tool in our organization, in which a Universe can be created easily for reporting.

Cheers,
Shaz



Tom Kyte
April 04, 2012 - 12:19 pm UTC

1) well, it won't increase the complexity of queries since materialized views work like indexes - you do not code to the materialized view, you query the base tables and the optimizer rewrites the query to go against the materialized view when appropriate.

That said - I cannot evaluate whether a materialized view would make sense or not - it will negatively impact data loading (that is obvious) and may or may not make queries "go faster" - that'll be a function of the types of queries you execute.

My suggestion: set up a simulation and benchmark it. get real numbers and then you can either agree with them or defend numerically your decision to not create one

2) not necessarily, there are two ways to refresh materialized views

atomic_refresh => true, the current default in 10g and above. The materialized view is refreshed using delete+insert. The data never disappears

atomic_refresh => false, the materialized view is refreshed using truncate+direct path load.

3) true, but so what. Same can be said about an index.
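
For example (assuming a materialized view named SALES_MV - a sketch only):

-- default in 10g and above: delete + insert, data stays visible throughout
exec dbms_mview.refresh( 'SALES_MV', method => 'C', atomic_refresh => true )

-- truncate + direct path load: usually faster, but the MV is empty while it runs
exec dbms_mview.refresh( 'SALES_MV', method => 'C', atomic_refresh => false )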

About Sales History(SH), Oracle Demo Schema

alip, June 12, 2012 - 6:44 am UTC

Hi Tom,

SH is the Oracle demo schema for data warehousing, in a star schema...

Can we get the source transactional DB (schema) that was used to build this schema?

Thanks.
Tom Kyte
June 12, 2012 - 11:31 am UTC

No, but only because it doesn't exist.

this is a demo schema, it was not built from any production schema or anything, it is just useful for - demos.

Bitmap Index

Asheesh, December 27, 2012 - 8:01 am UTC

Hi Tom,
I am still confused about the STRUCTURE OF BITMAP INDEX.. Can you please help..


Tom Kyte
January 04, 2013 - 11:01 am UTC

STAR_TRANSFORMATION_ENABLED and star query

Arvind, December 31, 2012 - 10:32 pm UTC

Hi Tom,

Wish you a very happy 2013.

We need to create a data mart. We will be using OBIEE for ad hoc reporting.

a) Is it advisable to set star_transformation_enabled = true?

b) Do we need to create a bitmap index on each column of the dimension tables too?

c) Is star transformation always good, and will a star query always run faster compared to the same query when star_transformation_enabled = false?

Regards,

Arvind

Tom Kyte
January 04, 2013 - 1:34 pm UTC

a) do you want or need that type of transformation? are you using a star schema? will you have the prerequisites in place for doing star transformations?

b) http://docs.oracle.com/cd/B28359_01/server.111/b28313/schemas.htm#CIHGHEFB
just the fact table

c) nothing is ever 100% - if a star transformation was always the best way to go - we wouldn't have invented other join techniques. The optimizer will decide when and if to use them.
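
A sketch of the prerequisites for (a) and (b) above, with made-up names - the parameter set, and bitmap indexes on the foreign key columns of the fact table only:

alter session set star_transformation_enabled = true;

create bitmap index sales_fact_time_bix on sales_fact (time_id);
create bitmap index sales_fact_cust_bix on sales_fact (cust_id);
create bitmap index sales_fact_prod_bix on sales_fact (product_id);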

History

A reader, April 24, 2013 - 12:11 pm UTC

Say I have a star-lite schema with just 2 dimension tables and 1 fact table. For discussion purposes, let's take the standard student/test/score tables. Student and Test are the dimensions and Score is the fact. In addition to capturing the test score, the requirement is to capture a snapshot of student and test (dimension) attributes when the test was taken. This would mean duplicating the Student and Test attributes in the Score fact table, right? Are there any other ways to model this requirement?
