Question and Answer

Tom Kyte

Thanks for the question, Rupali.

Asked: November 08, 2000 - 5:02 am UTC

Last updated: July 19, 2011 - 1:23 pm UTC

Version: 8.1.5

Viewed 10K+ times

You Asked

Dear Mr. Tom,

I am in a big dilemma nowadays.

We say that we should normalize data to the highest possible level.
But when web-enabled applications come into the picture, I have seen a general tendency to denormalize the data just to increase speed on the net, and as a result increase data redundancy.

If speed were the only concern, we would just make one table of all the fields...

I have the following problems:

1. If a fully normalized database is used, what will be the speed of
web-enabled applications?
2. Are there any adverse effects of a denormalized database on web-enabled
applications?
3. Which is the best way for a web-enabled application:
a completely normalized database, or
a denormalized database with a lot of redundancy?

Regards,

Rupali


and Tom said...

I normalize all data for OLTP -- I never store redundant data. In an OLTP-type system, denormalizing is slower than normalizing. The cost associated with an UPDATE (of which you typically do as many as you do reads) more than offsets any perceived gain from denormalization.

In a DSS-type situation, I would consider denormalization, as the data never or very infrequently changes. I would use it only when needed.

I'm not aware of any big changes in thoughts or perceptions in this area due to the web -- the web changed nothing in regards to this.


Comments

I have a slightly different twist to the original question

Mohsin, October 25, 2002 - 4:57 pm UTC

Given a database/schema, how can we determine the level of normalization the schema underwent?

Tom Kyte
October 26, 2002 - 12:31 pm UTC

you would reverse engineer it into some tool and study the results I suppose. I've (very frankly) never even considered doing anything like that, ever.

Mike, October 26, 2002 - 11:25 pm UTC

I did not quite understand your statement:
"
The cost associated with an UPDATE (of which you typically do as many as you do reads) more than offsets any perceived gain from denormalization.
"

Could you elaborate on that with a test example?

Thank you very much

Tom Kyte
October 27, 2002 - 11:18 am UTC

A normalized structure might have:

table customer ( cust_id, cust_name, cust_address );
table orders ( order_id, cust_id );


someone might denormalize that to:


table customer ( cust_id, cust_name, cust_address );
table orders ( order_id, cust_id, cust_name, cust_address );

in order to avoid the "join" from customer to orders during a select. However, they have massively impacted:

o inserts into orders (how many times do you call up an order really? does what
you did for SELECTs pay off the penalty you imposed on inserts?)

o updates to customer, which must cascade all over the place


Databases were born to join. My comment was that people try to avoid the join -- not realizing they just moved the "cost" elsewhere. Did they do the right thing? Many times *not*.
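
To make the cost concrete, here is a minimal sketch (hypothetical values, using the tables above) of what a single address change costs in each design:

-- normalized: one row changes
update customer set cust_address = '123 New St' where cust_id = 42;

-- denormalized: the same change must also cascade to every order row
update customer set cust_address = '123 New St' where cust_id = 42;
update orders set cust_address = '123 New St' where cust_id = 42;

If a customer has thousands of orders, the denormalized design turns a one-row update into a thousand-row update.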


How about small lookup tables?

Doug C, October 27, 2002 - 3:33 pm UTC

Tom - if databases are born to join, how do you feel about small lookup tables that show up in a bazillion queries and add an additional join level? What I mean is the kind of table where there is nothing more than a description for a type of order or something like that. I'm guessing you are going to say that if it is read quite frequently then the blocks will be in the buffer cache and the overhead is minimal. Even so, it seems like the kind of thing that can be easily reinforced in the app as well (i.e., forcing values 1-5 or something rather than joining with a lookup table containing 1-5). What do you prefer to do?

Tom Kyte
October 27, 2002 - 3:41 pm UTC

are 1-5 constants forever and ever and ever? (they never are -- you'll add 6 later, remove 1)

If 1-5 are well known, take the approach we did in the data dictionary for many things -- views with DECODEs. Using FBIs (function-based indexes) you could even "where on them" if you like.

Works well for very small lookup tables.

What difference does it make if ...

Pasko, October 28, 2002 - 3:29 am UTC

Hi Tom,

In your example above ...

table customer ( cust_id, cust_name, cust_address );
table orders ( order_id, cust_id );

What difference does it make when I specify or do not specify the foreign key from ORDERS to CUSTOMER?

Some people would still call that denormalized if I omit the foreign key.

What's your take on that, Tom?

Best Regards.

Tom Kyte
October 28, 2002 - 7:29 am UTC

A rose by any other name is still a rose.

If it walks like a duck, talks like a duck -- hey, it must be a duck.


I would not call that denormalized, no. There is no redundant data -- I left out the primary and foreign keys for the example.

What is your opinion?

Thaha Hussain, May 06, 2003 - 3:18 am UTC

Dear TOM,

Please see a few tables from my computerised library:

LIB_MASTER
=========
MASTER_CODE
TITLE
AUTHOR_ID
......
......
......

LIB_CHILD
==========
ITEM_CODE
MASTER_CODE
AVAILABLE DEFAULT 'Y' <<---- Please Note this field


BOOK_ISSUE
==========
ITEM_CODE
MEMBER_ID
ISSUE_DATE
RETURN_DATE
RETURNED_DATE
.........
.........

When a book is issued, the transaction is recorded in the BOOK_ISSUE table and the field AVAILABLE is set to 'N' for the corresponding ITEM_CODE in the LIB_CHILD table.

RETURNED_DATE is filled in when the book is returned. (Suppose there are no renewals!)

My doubt is:

We will know whether a particular book is available or not by querying the BOOK_ISSUE table, so there is no need for the field AVAILABLE in the LIB_CHILD table. If it is there, is it against normalization? What is your opinion?

Tom Kyte
May 06, 2003 - 7:56 am UTC

Normalization is overrated.
Denormalization can be taken to frightening extremes.

The answer usually lies somewhere in between the two extremes.


It depends on the volume of queries against this system and their nature. If you query up "a" row from lib_child to see if it is available, and do not ask questions like "how many books are available" or "how many are not available", then a view like:


create view lib_child
as
select ......,
       nvl( (select 'N'
               from book_issue
              where item_code = lib_child_table.item_code
                and returned_date is null ), 'Y' ) available
  from lib_child_table;

might be easier to maintain.


If I asked those questions frequently however -- I might want Y/N to physically appear in that table and would pay the price on insert/update/delete on book_issue in order to have it be so.

Since there could be mutating table issues if you try to maintain this field via a trigger (assuming referential integrity is in place here), you might lean toward the view -- an index on book_issue(item_code,returned_date) would make it fairly efficient.
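
For reference, the supporting index Tom mentions might look like this (a sketch using the hypothetical table above):

create index book_issue_idx on book_issue( item_code, returned_date );

With that index, the scalar subquery in the view can probe a few index blocks per row instead of scanning BOOK_ISSUE.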

Thank You!!!

Thaha Hussain, May 06, 2003 - 8:53 am UTC

Dear Tom,

Thank You!!

-Thaha Hussain

Which is the best practice?

TH, July 14, 2003 - 2:37 am UTC

Dear Tom,

We are extending one of our software products with a new edition.

Right now we have a table like this
=======================================
TABLE CLASSES:-
class# | column2 | column3
-----------------------------------
Level1 | -- | --
Level2 | -- | --
Level3 | -- | --
Level4 | -- | --
Level5 | -- | --
Level6 | -- | --
Level7 | -- | --

Column2 & Column3 are a few columns which are out of scope for this discussion.

We introduced a new table to group these levels for reporting purposes.

PROGRAM_MASTER:-

Program_id # | Description
-----------------------------
ID1 | First program
ID2 | Second program
ID3 | Third program
ID4 | Fourth program

===================================================================

Question:-
Which is the best practice among the following?

CASE I
Adding a new column (Program_id) to the table CLASSES

TABLE CLASSES:-
class# | column2 | column3 | Program_id
---------------------------------------------------------
Level1 | -- | -- | ID1
Level2 | -- | -- | ID1
Level3 | -- | -- | ID1
Level4 | -- | -- | ID2
Level5 | -- | -- | ID2
Level6 | -- | -- | ID3
Level7 | -- | -- | ID3


CASE - II
Creating a new table as follows?

PROGRAMS_AND_LEVELS:-

Program_id | Level
------------------------------
ID1 | Level1
ID1 | Level2
ID1 | Level3
ID2 | Level4
ID2 | Level5
ID3 | Level6
ID3 | Level7



Thanks a lot!


Tom Kyte
July 14, 2003 - 9:19 am UTC

Since it appears to be a classic "1 to M" relationship -- not a "many to many" -- the first approach, using a foreign key, is the one I would lean towards.
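
A minimal sketch of Case I (column types and constraint names are assumptions based on the tables described):

alter table classes add ( program_id varchar2(10) );

alter table classes add constraint classes_program_fk
  foreign key ( program_id ) references program_master( program_id );

The foreign key documents the 1-to-M relationship and lets the database enforce it.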

In your new book...

Kashif, July 14, 2003 - 3:56 pm UTC

Hi Tom,

Do you cover normalization/denormalization in your new book? I did read somewhere that you discuss how to design efficient schemas, and I assumed one of the sections under that would be normalization. Thanks.

Kashif

Tom Kyte
July 15, 2003 - 1:04 am UTC

Efficient schemas talks about the physics of data.

Me, I've never been too overly concerned about getting my data into the 30th normal form, close to nirvana.

I'm more worried about getting at it *fast*.

In a way, I guess I do talk about normal/denormal -- but I never call it that. I call it "implementing to answer your most frequently asked questions". I'll organize the data in the way that

a) is easy enough to access
b) gives the best performance for the most frequent types of access

e.g., I'll store the data in a format perfect for querying but terrible for writing if 90% of my access is reading the data (I'll penalize the writes for the benefit of the reads).

I'll do the opposite if the opposite is true.

Clarification?

Mark Wooldridge, December 03, 2003 - 9:23 pm UTC

Can you clarify your remark...

If 1-5 are well known, take the approach we did in the data dictionary for many
things -- views with DECODEs. Using FBIs (function-based indexes) you could
even "where on them" if you like.

Works well for very small lookup tables.

I kind of understand the DECODE concept, but the FBI concept is still confusing.

Tom Kyte
December 04, 2003 - 7:43 am UTC

ops$tkyte@ORA920PC> create table t ( cd int, data varchar2(80) );
 
Table created.
 
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> create or replace view v
  2  as
  3  select cd, decode( cd, 1, 'code 1',
  4                         2, 'code 2',
  5                         3, 'code 3',
  6                            'unknown' ) cd_str, data
  7    from t
  8  /
 
View created.
 
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> create index t_idx on t( decode( cd, 1, 'code 1',
  2                                          2, 'code 2',
  3                                          3, 'code 3',
  4                                             'unknown' ) )
  5  /
 
Index created.
 
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> exec dbms_stats.set_table_stats( user, 'T', numrows => 1000000, numblks => 100000 );
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> set autotrace traceonly explain
ops$tkyte@ORA920PC> select * from v
  2  where cd_str = 'unknown'
  3  /
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=10000 Bytes=550000)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=2 Card=10000 Bytes=550000)
   2    1     INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE) (Cost=1 Card=4000)
 
 
 
ops$tkyte@ORA920PC> set autotrace off


what I meant was -- you can index the "decode" and then "where" on the decode and still use the index efficiently.

Do you consider this as denormalization

A reader, March 24, 2005 - 1:40 pm UTC

Hi Tom,

We built school management software which manages hundreds of schools within a district. Most of the data is relevant to one school only.

We have the following tables:
SCHOOLS (SCHOOLID PK)
STUDENTS (STUDENTID PK, SCHOOLID)
STUDENT_GRADES(STUGRADEID PK, STUDENTID)

A STUDENTID always has a fixed SCHOOLID; if a student transfers, we create a new STUDENTID/SCHOOLID for him. Currently a table like STUDENT_GRADES contains 50 million rows. I am thinking of adding SCHOOLID to it and partitioning the table as well. Do you think this is a good idea?

And do you think this is denormalization?

Tom Kyte
March 24, 2005 - 4:02 pm UTC

I don't see the point of student_grades?

RE: Do you consider this as denormalization

Bill Schwartz, March 24, 2005 - 4:22 pm UTC

I have a good question for you, reader:

As I work in education: how could you possibly track your students over time if you are issuing new student IDs each time a student changes schools? How can your elementary school educators get feedback on how well they teach based on their former students' middle school grades?

Seems to me you need to re-think your design there.

Tom Kyte
March 24, 2005 - 4:27 pm UTC

yeah, it seemed a little strange to me too.

oltp + dss

riyaz, March 25, 2005 - 12:47 am UTC

I fully agree with your answer on normalization of the database.

But in our case, we have a fully OLTP system that also updates many tables (for reporting purposes) through triggers, online. So our system is both OLTP and DSS.

E.g., a sales transaction has two tables, header and detail, and the same data is also written to a third table which holds a summary of sales and is used for MIS reporting.

Here the third table, used for MIS purposes, is a denormalized one but is updated online. (The same is the case for the other modules also.)

What is your opinion -- is our design OK or bad?

Tom Kyte
March 25, 2005 - 7:47 am UTC

not sure what you mean here.

explanation

riyaz, March 25, 2005 - 9:07 am UTC

Our system is OLTP + DSS, i.e. many reports are being generated.
We have a table design like this:
1. sales_head (one row per invoice - invoice no, date, party etc.)
2. sales_det (product-wise details - invoice no, date, prod, qty etc.)
3. dss_sales (date, prod code, party code, qty, amount)
The third table holds a summary of the data, which is used for reporting, and it is updated through a trigger on sales_det. (The data is required for reporting online.)

Here, sales_head and sales_det are in line with normalization, but dss_sales is a denormalized one (repeated columns of sales_head/det).

In your view, OLTP should be normalized and DSS can be denormalized.

But in our case, for OLTP, we are forced to have a denormalized table. Is that OK? Please reply.




Tom Kyte
March 25, 2005 - 6:19 pm UTC

You are penalizing your OLTP work to the benefit of the reports.

If you are OK with that, so be it.

One might instead use materialized views, refreshed from time to time, to denormalize for reporting with less impact on the OLTP data.
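
A hedged sketch of that materialized view approach (names and grouping columns are assumptions based on the tables described; refresh options vary by need and version):

create materialized view dss_sales_mv
build immediate
refresh complete on demand
as
select inv_date, prod_code, party_code,
       sum(qty) qty, sum(amount) amount
  from sales_det
 group by inv_date, prod_code, party_code;

-- refresh periodically, e.g. from a nightly job:
exec dbms_mview.refresh( 'DSS_SALES_MV' )

The OLTP transactions then pay no trigger cost; the reports see data as of the last refresh.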

Re: Do you consider this as denormalization

A reader, April 01, 2005 - 8:24 pm UTC

I am the poster of the "Do you consider this as denormalization" review above - sorry, I have been so busy with my project that I had no time to check my favorite website, asktom.com.

I'd better rephrase my question:

We have tables like:
SCHOOLS (SCHOOLID PK, SCHOOLNAME etc...)
+->STUDENTS (STUDENTID PK, SCHOOLID FK, STUDENTNAME etc...)
+->STUDENT_GRADES(STUGRADEID PK, STUDENTID FK, GRADEID FK, ...)
+->STUDENT_GRADE_ITEMS(STUGRADEITEMID PK, STUGRADEID FK,...)
+->CLASSES
+->CLASS_ATTENDANCES
+->SUBJECTS
+->SUBJECT_GRADES
+->PERIODS
+->PERIOD_ATTENDANCES


As you can see, this is a multi-level hierarchy of tables, and the SCHOOLID is only propagated to the second-level tables.

Two things are guaranteed:
1. A row in a second-level table will never change its SCHOOLID. (We have another mechanism to handle student transfers etc.)
2. There are tables in the system where we need to go 4 or 5 steps up the hierarchy to get the SCHOOLID.

Given these conditions, do you think it is a good idea to propagate SCHOOLID to these "grandchild" tables for better performance?

Thanks.

Tom Kyte
April 02, 2005 - 9:10 am UTC

I still don't get student_grades. It seems its primary key should begin with STUDENTID - and what is GRADEID a fk to?

I cannot answer whether it would be a good idea or not, not knowing the questions you ask of the data, or whether you need inserts to perform as well as they possibly could, or what.

Re: Do you consider this as denormalization

A reader, April 02, 2005 - 10:42 am UTC

All our tables use surrogate, single-column, meaningless, autogenerated PKs. We are mainly concerned about the performance of queries; some of them involve joins over 20 tables and take dozens of minutes.

As you can tell, without this SCHOOLID in the "grandchild" tables the joins are very expensive. But if I add the SCHOOLID to the "grandchild" tables and join all participating tables on SCHOOLID too, then the query will be very efficient, because the rows will be filtered by SCHOOLID before being joined together. I improved one single query from 10 minutes to 1 second by doing this. I am just wondering if this is a common technique.

Interestingly, this database is actually a M$ one, but I couldn't find a good forum like yours. I guess the tuning methodology of all databases is similar.

Tom Kyte
April 02, 2005 - 11:33 am UTC

and the addition of these meaningless keys does what for you? HOW can they help?


The addition of these meaningless keys is a bad idea. Ok, one way to look at this, according to your model:

We have tables like:
SCHOOLS (SCHOOLID PK, SCHOOLNAME etc...)
+->STUDENTS (STUDENTID PK, SCHOOLID FK, STUDENTNAME etc...)
+->STUDENT_GRADES(STUGRADEID PK, STUDENTID FK, GRADEID FK, ...)
+->STUDENT_GRADE_ITEMS(STUGRADEITEMID PK, STUGRADEID FK,...)


A school has students. Students are unique within a school (no reason they have to be "uniquely unique" here, in your model, a student is a detail record OF A SCHOOL and you've already said if we move schools, new student)

so, why isn't it:

schools( schoolid pk, .... )
students( (schoolid fk,studentid) pk, ..... )
student_grades( (key from students+whatever) pk, .. )

In your model (which others believe to be a bit "wrong", I agree, a student is a student and you should not be giving them a new id when they move schools), the natural key of a student is their schoolid and studentid within that school.

The tuning methodology of databases is frankly what makes them different. Therein lie our LARGEST differences!!! An implementation that works in Oracle, with its non-blocking reads and multi-versioning, will not work the *same*, nor as well, in SQL Server.

The reverse is frequently true as well.


Re: Do you consider this as denormalization

A reader, April 02, 2005 - 11:02 am UTC

More info: our system contains data for more than 200 schools, and most reports deal with one single school only. Some fact tables contain tens of millions of rows.

GRADEID is a FK to a GRADES lookup table, which references SCHOOLS too (each school has 12 grades), so for 200 schools there will be 2,400 rows in GRADES.

I know it's hard for you to say good or bad. But would you please tell me whether you think this kind of tuning is common or not?



Tom Kyte
April 02, 2005 - 11:35 am UTC

I'd use compound natural keys, they are OK, you are allowed to.

To create a student, you need to know the school. The student's primary key includes the schoolid and the unique student id within that school.

To give a student a grade, you need the student's primary key (schoolid, studentid)

and so on.

Re: Do you consider this as denormalization

A reader, April 02, 2005 - 11:43 am UTC

Thanks Sir.

With the compound PK approach, I am concerned about the number of foreign key columns. With 5 or 6 levels down the hierarchy, you might have 6 columns for a single foreign key, which would be a concern (I don't know why, but I think it doesn't smell good).

Anyway thanks for shedding light on this.

Tom Kyte
April 02, 2005 - 12:00 pm UTC

In Oracle it would not be a problem; in other databases, who knows.

In fact, since foreign keys almost always need to be indexed in Oracle, it would be a good thing.


t1( a pk )
t2( a fk, b, (a,b) pk )
t3( a, b, c, (a,b) fk, (a,b,c) pk )

we would have a t1_pk, t2_pk, t3_pk and since the foreign keys are on the leading edge of the index, all is well *IN ORACLE*.
http://asktom.oracle.com/~tkyte/unindex/index.html

Now, you do this

t1( a pk )
t2( X pk, a fk )
t3( Y pk, a, b, X fk )

so you have t1_pk, t2_pk, t3_pk as indexes AND t2_a, t3_X

as indexes....

AND you have the extra columns. So, since every index adds work, you've just made inserts slower, and we still have to do the fkey check (1 column, 5 columns - just bytes in a key to us).

So, you have fatter tables and more indexes.

I have skinnier tables with fewer indexes.

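For the curious, a runnable version of that compound-key sketch (datatypes are assumptions):

create table t1 ( a int primary key );

create table t2 ( a int references t1,
                  b int,
                  primary key (a,b) );

create table t3 ( a int,
                  b int,
                  c int,
                  primary key (a,b,c),
                  foreign key (a,b) references t2 );

Each primary key index carries the foreign key columns on its leading edge, so no extra indexes are needed to avoid the unindexed-foreign-key issue.
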

I'm having two issues with this conversation (don't take this the wrong way, I'm not rebuking you or anything, I think this is actually instructive)

a) you are asking for Oracle advice on SQL Server implementations. A logical model, sure, we can help there, but when you get to physical schemas, if you are concerned about performance, you had best get a SQL Server guy on board, and fast.

b) You are solving hypothetical issues. I like to work on real problems myself. Before solving a hypothetical issue, I like to be relatively sure it is a real issue.

can this table be normalized

A reader, October 25, 2005 - 9:14 am UTC

Hi

We have a user request application running on Oracle 9.2.0.6.

The workflow is as follows:

1. A user sends a new request to support level 1
2. Support level 1 tries to resolve the request or issue
3. If support level 1 cannot resolve it, it escalates to level 2
4. If level 2 cannot resolve it, it goes to level 3

The main table where all requests go has over 80 columns; there are three columns indicating the level where the request currently is, and some columns indicating when the request is sent from level to level.

I wonder if the level of support can be normalized, i.e. have another level table with three rows indicating levels 1, 2 and 3, so that if in future we have a new level we don't have to add columns.

Is this a technique to tackle this sort of problem?

Tom Kyte
October 26, 2005 - 7:20 am UTC

I don't know why you would have 3 columns for the 3 levels in the first place - I didn't get that part. It seems you would have a single column "level" (don't call it that; there is a pseudo-column LEVEL already).

can this table be normalized

A reader, October 26, 2005 - 10:13 am UTC

Hi

We have three columns because each row in the table is a request; if we inserted a new row every time a request moved from level 1 to 2 or 2 to 3, our natural key, request_id, would be useless. If we used surrogate keys then we could not perform joins!

Tom Kyte
October 27, 2005 - 2:50 am UTC

why would you insert a new row every time a request moves from level 1 to 2 or 2 to 3 - why would you not just

update t set level = 2 where key = value;

to move it from 1 to 2.

can this table be normalized

A reader, November 04, 2005 - 9:22 am UTC

Hi

Sorry for the late reply, I have been travelling.

If we don't insert a new row, we cannot know what level it moved from. If I simply update the level, I don't know what the previous level was.

Tom Kyte
November 04, 2005 - 5:08 pm UTC

well, would it not be "the previous level"??????

(level minus one)

can this table be normalized

A reader, November 05, 2005 - 6:54 am UTC

Hi

It can move from 1 to 2 then 2 to 3, but also from 3 to 1 or 3 to 2 :(

Tom Kyte
November 05, 2005 - 5:55 pm UTC

Well, the first question I would ask is "is it relevant to remember?" (could be a silly question, then again, you might not have asked it of yourself)

And then I might consider "current_status" and "last_status" as attributes, if I needed the last and current status to be available.
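
A minimal sketch of that two-attribute approach (hypothetical names):

alter table requests add ( current_level number, last_level number );

update requests
   set last_level = current_level,
       current_level = :new_level
 where request_id = :id;

If the full movement history matters, a separate history table (request_id, from_level, to_level, moved_date) would be the normalized alternative.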

Normalization

Sujit Mondal, May 04, 2006 - 4:33 am UTC

This discussion is really helpful.
I have a business scenario like the one below:
We have to store and retrieve customer information. It has 3 parts:
1) Customer general information (general details attributes, general address attributes)
2) Customer distribution information (distribution details attributes, distribution address attributes)
3) Customer contract information (contract details attributes, contract address attributes)

The 3 parts, with all their attributes, will be inserted, updated or retrieved separately from the application. In no situation does the business want to see all three sets of information at the same time. The relationships between the 3 entities are one to one.

To model this business requirement we have come up with the following two ways. Can you please guide us in choosing the best way, or, if neither of the two is good, what would be the best way?

Model 1:
---------------------------------------------------
3 separate tables for general, distribution, contract with all their respective related fields (including address attributes, which are common but will have different values in the 3 tables), with the primary key column customer_id and proper FK constraints.

Model 2:
---------------------------------------------------
3 separate tables for general, distribution, contract with all their respective related fields, excluding address.
PK and FK are as in Model 1.
There will be a separate table for address, with customer id and address type as its primary key.
So for each customer there will be 3 records in the address table.


Tom Kyte
May 04, 2006 - 7:09 am UTC

what about model 3:

one table.

Normalization

Sujit Mondal, May 04, 2006 - 7:58 am UTC

Thanks for your reply.
One table is obviously a good solution , but because of some constraints we won't go for it. As 3 parts of information may be endered at separate time with the general as first. So in that case the elements for others we may need to keep null or defaullt values. Again the elements of distribution and contract has some foreign key constraint so it will be difficult and ambiguous to assign some default values to those columns with Fk. Though we can keep not assigned as default value. Also application like to retrive and store it in a modularize way. Which I think , may not be a good idea from database perspective.
Based on those we had to make 3 different tables. Can you please tell me among Model 1 and Model 2 which one shall I choose?

Tom Kyte
May 04, 2006 - 12:23 pm UTC

Tell me the constraints.

Give me a real example where it would be "difficult or ambiguous" to have a single table.

I am still voting for Model 3; I do not like either Model 1 or Model 2.

Model 2 is really not good. Model 1 is simply not good.

Can denormalization be worse than joins for queries

biswa, May 10, 2006 - 12:08 pm UTC

Say we have two tables in an OLTP database: T1 with 5,000 rows and T2 with 100,000 rows.
The relationship of T2 to T1 is many to one.
Both tables are relatively static.

Now suppose we denormalize: we store T1's data in T2 for faster access, so we don't need to join.
Since the T2 row size increases, the database will probably be able to cache fewer rows of the table in memory.
If so, then although we would not be doing a join, we might require more physical I/O.

My question is:
Is it possible that at some point denormalization can reduce performance, because more physical I/O is required as fewer rows fit in memory - or is denormalization always faster for querying than a join?


Tom Kyte
May 11, 2006 - 7:52 am UTC

have you shown that joins are "slow" for you - databases WERE BORN TO JOIN.

normalization

Lakshmi, September 15, 2006 - 7:36 am UTC

Hi Tom

Our requirement is to store the following data in the database:

ScanId
Start time
End time
Granularity
BSC1
BSC1_counterX
BSC1_counterY
BSC2
BSC2_counterZ
BSC2_counterX
BSC2_counterY
.....

[The number of objects (BSC1, BSC2, ...) and the counters for each object are not fixed and can be configured by the user]

We initially thought of normalizing the tables as:

Scanner table:
SCANID STRTTIME ENDTIME GRANULARITY

ScannerBsc table:
SCANID BSCID COUNTER_NAME

Hence data such as "scan 1, start 10, end 12, granularity 1, with object BSC1 (counter xx) and object BSC3 (counters xx, yy)" would be stored as:

SCANID STRTTIME ENDTIME GRANULARITY
------ -------- ------- -----------
     1       10      12           1
     2        3       6           1

SCANID OBJECTNAME COUNTERNAME
------ ---------- -----------
     1 BSC1       xx
     1 BSC3       xx
     1 BSC3       yy

The problem now is that the user should not be allowed to create multiple scanners with the same definition: the same list of object names and counters should not exist for two scan IDs.

It would be nice if you could suggest how to design tables for this particular requirement.

Thanks
Lakshmi

Tom Kyte
September 15, 2006 - 8:51 am UTC

so, (object_name,counter) needs a unique constraint? isn't that all?

Not exactly

Lakshmi, September 18, 2006 - 2:38 am UTC

Actually, the "list" of object_name + counter pairs should be unique.
Example:
If

Scan1: Obj1 counter1, Obj2 counter1, Obj2 counter2

already exists, then

DupScan: Obj1 counter1, Obj2 counter1, Obj2 counter2 - NOT allowed
Scan2: Obj1 counter1, Obj2 counter1 - allowed (counter2 is not present)
Scan3: Obj5 counter1, Obj2 counter1, Obj2 counter2 - allowed

In brief, there should not be two different scanners with the same scanner definition.


Tom Kyte
September 18, 2006 - 2:52 am UTC

Sorry, I did not follow this at all.

Or do you just mean that the LAST 5 columns need a unique constraint?

If so, then you perhaps did pick the wrong structure (you didn't really "normalize" anything - you meant to have 6 columns in a table).

Thanks for the quick response!!

Lakshmi, September 18, 2006 - 7:01 am UTC

Sorry, maybe I was not so clear!

I have the following requirements:
1. The following info is to be stored in the DB:
ScanId/Name
Start time
End time
Granularity
Object
Counters for that object

a. The number of objects can be from 1 to n
b. The counters for each such object can vary from 1 to n
c. The number of objects / counters per object is not fixed

2. "Granularity + object list + counter list" should be unique

How should I design the tables for this requirement?
I guess I cannot create a single table, because the object list and counter list are not fixed!

Tom Kyte
September 18, 2006 - 1:29 pm UTC

sorry, I don't see any sort of effective design for this particular requirement.

This is interesting ... waiting for response from Tom also

Khalid, September 18, 2006 - 1:33 pm UTC

IMHO, my solution would be a 'triggah', FWIW. So I am curious as to what Tom has to say.


Tom Kyte
September 19, 2006 - 2:22 am UTC

Let us see your "triggah" - and it had best include "lock table", which is why I said "I see no effective (efficient) design for this"

Query Performance

Raghav, September 19, 2006 - 3:04 am UTC

Hi Tom,

For example:

select stu_id, stu_name, class, school_name, sport, ....
  from stu_dat, stu_mst, sch_mst
 where stu_dat.stu_id = stu_mst.stu_id
   and stu_dat.schoolid = sch_mst.schoolid
   and sport = 'volley ball'

The tables are fully normalised. On the above tables, indexes are built on stu_id, class, school etc., but there is no index in particular on sport.

The query is taking much more time than expected; we know it is a bad performer. There are so many indexes on those tables already, and we don't want to create one more index just for this query, because in place of sport there may be some other variant(s) against which the information is needed.

From the stats we know this particular query is using the existing indexes, but the performance is still bad. How can we check the query's performance with a different index without physically changing the database (the existing indexes)?

Is there any other mechanism to increase the performance of this query without creating a new index?

Thanks in Advance
Raghav

Tom Kyte
September 19, 2006 - 2:32 pm UTC

I would hope that it would use no indexes in your case!!

The only thing that could benefit from an index would be one on sport, and only if volley ball is not a popular sport.

If this is using indexes - that is probably the problem.

But you give almost nothing to work with here, like "how long is long", "how big is big".

query performance

Raghav, September 21, 2006 - 9:00 am UTC

Hi Tom,

Thank you for your reply. In the example, I am trying to add some more conditions, to put stress on the query.

select stu_id, stu_name, class, school_name, sport, section, gender, mother_tongue
  from stu_dat, stu_mst, sch_mst
 where stu_dat.stu_id = stu_mst.stu_id
   and stu_dat.schoolid = sch_mst.schoolid
   and stu_name like 'A%'
   and sport = 'volley ball'
   and section in ('A','C')
   and gender = 'M'
   and mother_tongue in ('Hindi','Telugu','Bengali')

Please don't go strictly by the exact example above; take a similar example of your own, as you are the best person to give good examples, to identify the following.

Supposing all the tables are fully normalised, and there is no index on the particular column we are filtering on, which causes bad performance as soon as the condition is incorporated:

How can we check the query's performance with a different index without physically changing the database (the existing indexes)?

Is there any other / best mechanism to increase the performance of this query without creating a new index?

Thanks and Regards
Raghav

Tom Kyte
September 22, 2006 - 2:00 am UTC

"more stress"?!?!?! not sure what that means, queries do not get stressed, they remain calm cool and collected through the worst of times.

you cannot "test the performance of some new index structure" without creating the new index structure which would physically change the database.

query performance

Raghav, October 09, 2006 - 3:36 am UTC

Hi Tom,

Thank you for your reply.

I understand that queries do not get "stressed" by adding more conditions; they only cause performance issues like the one in our example.

My question is: suppose the database is fully normalised, and there is no index on a particular column on which we are trying to filter the data, which causes bad performance as soon as the condition is incorporated.

Is there any other / best mechanism to increase the performance of this query without creating a new index on that column?

Hope this time, you can understand my query better.

Thanks and Regards
Raghav


Tom Kyte
October 09, 2006 - 8:35 am UTC

You can only "guess". We can help you with that guess using Enterprise Manager and the index advisor, but it'll be "I guess this is the work I'll do if you add this index, because this is my guess as to the cardinality".

The only way to move beyond a guess (which frankly, if you know your data, you might be able to just "make" in your head) is to add it and test.

Which you can safely do in your TEST environment, of course...

virtual index

Raghav, October 10, 2006 - 8:29 am UTC

Hi Tom,

Thank you for your reply. I agree with your comments. Yes, there is a provision in the Enterprise Manager SQL analyser to get recommendations on indexes, which again are assumptions about how an index might change the performance of the SQL, and which we can try in a test environment too.

There is one more concept, the "virtual index", which the SQL analyser can create in another schema called "AURORA$JIS$UTILITY$" without affecting the current schema (not creating any new object (index) in the current schema). Could you throw some light on how this concept works?

Thanks and Regards
Raghav

Tom Kyte
October 10, 2006 - 12:02 pm UTC

That is what EM (Enterprise Manager) uses - the "nosegment" index. It is not documented; if you want to use it, you would use EM.

virtual index

Raghav, October 14, 2006 - 3:50 am UTC

Hi Tom,

Thank you for your reply.

Can we create these "virtual" / "nosegment" indexes in any schema (for example, TEST) instead of "AURORA$JIS$UTILITY$", in case we don't want to use EM and want to do it manually? Will a query in the main schema (for example, LIVE) use this index while executing? If not, how can we make the query use the virtual index created in the other (test) schema? It may be a silly question - please provide your valuable inputs.

Thanks and Regards
Raghav


Tom Kyte
October 14, 2006 - 8:14 am UTC

they are not documented, caveat emptor.
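
For what it is worth, the virtual index pattern usually looks roughly like this (undocumented and unsupported - caveat emptor, as Tom says; the hidden parameter name is an assumption and may vary by version):

create index stu_dat_virt_idx on stu_dat( sport ) nosegment;

alter session set "_use_nosegment_indexes" = true;

explain plan for
select * from stu_dat where sport = 'volley ball';

A NOSEGMENT index stores no data and takes no space; it exists only so the optimizer can cost plans as if the index were real.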



A reader, October 19, 2006 - 2:15 pm UTC

We have a persons table that has approximately 40 columns. The first few columns are the person's details, such as name, gender etc., and then follow other details such as his arrival flight number, departure airport code, hotel contact person's name etc. The flight and hotel details will be null for many records, because those people are from the same town, so the table ends up having null fields in many records. Should these columns with null values be separated out into a new table (normalized)? Also, this table stores various kinds of data about a person. Can all of this remain in a single table?

Tom Kyte
October 19, 2006 - 2:21 pm UTC

no, you don't want to separate out a one to one relationship like that.

(well, there are two sides to this coin, some would say "never use null" - I'm not of that camp, more of a "pragmatist" when it comes to performance)


One table.

normalization

sam, November 18, 2006 - 8:39 am UTC

Tom:

For the above case, some DB design books say to create a "subset table" for the null columns if they only apply to a few people. You seem to like the idea of having one large table for a 1-to-1 data relationship. Is there an advantage to that versus creating a subset table?

normalization

sam, November 18, 2006 - 8:48 am UTC

Tom:
1. For your above example, let us say the customer address changes frequently. Don't you have to store the customer address associated with each order, rather than normalize it?


table customer ( cust_id, cust_name, cust_address );
table orders ( order_id, cust_id, cust_name, cust_address );


2. For OLTP you normalize to eliminate redundant data.
For DSS (data warehouses) you denormalize to speed up queries.

I am confused. Would normalizing an OLTP system increase performance for UPDATE, DELETE and INSERT and decrease performance for SELECT? Because you only do SELECTs on a DSS, and denormalizing is supposed to speed them up.

3. Can you give us an example of how normalization increases performance? If I have one unnormalized table (1 million records) and a SELECT query takes 20 minutes to run, can it be said that this is because the table is not normalized, or is it more related to the way the query is built, doing a full scan versus an index scan?

Is it normal for some queries to take that long, or must there be something wrong?

thanks,




normalization

sam, November 21, 2006 - 12:42 am UTC

Tom:

With your expertise in DB design, don't you have any comment on the normalization questions above?

Tom Kyte
November 22, 2006 - 3:16 pm UTC

#1 - too vague to address.

If the customer address changes, why would you store it in the orders table? You have to STATE REQUIREMENTS here.

If the cust_address in orders is the current address of the customer, well, then you would not put it there.

If the cust_address in orders is really "the address to ship this here order to", then it is just not named very well.

Answer, therefore: it depends on the requirements, doesn't it.

#2 - you normalize to avoid redundant data, period, regardless of system type.

You might, MIGHT, maybe, SOMETIMES (usually in a materialized view) denormalize in a warehouse. Otherwise no, you wouldn't.

#3 - if you have a 20-minute select, you are not in a transactional system, are you. And you might, just maybe, just MIGHT, sometimes, MAYBE use a materialized view in a warehouse for that 20-minute query.

normalization of database

Zpatel, March 15, 2007 - 8:11 am UTC

Hi Tom,


I am confused here: which one is preferable? What should I do? I am designing a database for hospitality management.

Cuisine Name
Indian
Chinese
Mexican

Cuisine Type
Veg
Non Veg

Item Type
Soup
Main Course
Desserts
Snacks

Item Name
Nachos
Hakka Noodles
Paneer Bhurji

So I have to maintain the hierarchy this way:



CHINESE
  Soup
    Vegetarian: Sweet-corn, Hot & sour
    Non-Vegetarian: Chicken Hot & Sour
  Snacks
    Vegetarian: Hakka Noodles, Man-chow
    Non-Vegetarian: Chicken Chopswey

MEXICAN
  Soup
    Vegetarian: Mexican Tomato Soup
  Main Course
    Vegetarian: Nachos
    Non-Vegetarian: AAAA, BBBB

INDIAN
  South Indian
    Snacks
      Vegetarian: Idli, Dhosa
  Punjabi
    Main Course
      Vegetarian: Paneer Bhurji, Veg Jaipuri
      Non-Vegetarian: Shahi Kabab

The subcategories marked in red (South Indian, Punjabi, etc.) belong only to Indian cuisine - they exist only for India. That is why I made two tables.



Ref_CuisineName
------------------------
1 Chinese
2 Indian
3 Mexican


Ref_CuisineType
---------------------
1 Vegetarian
2 Non Vegetarain

Ref_CuisineTo (only belongs to Indian cuisine)
---------------------

1 South Indian
2 Punjabi
3 Gujarat



Ref_Item
-------------
1 Noodle
2 Paneer Bhurji
3 Veg Jaipuri
4 Shahi kabab
5 Idli
6 Dhosa
7 Mexican Tomato Soup
8 Sweet- corn
9 Hot & sour
10 Chicken Hot & Sour
11 Hakka Noodles
12 Man-chow
13 Chicken Chopswey
14 AAAA
15 BBBB

Which is the best practice among the following?


CASE I

Sec_ItmTypeNm
---------------
CuisineTypeId  CuisineName  CuisineType
1              Chinese      Veg
2              Chinese      Non-Veg
3              Indian       Veg
4              Indian       Non-Veg
5              Mexican      Veg
6              Mexican      Non-Veg

Ref_CuisineTypeItem
-------------------
CuisineTypeItemId  CuisineType        ItemType
1                  Chinese - Veg      Soup
2                  Chinese - Non Veg  Soup
3                  Indian - Veg       Soup
4                  Indian - Non Veg   Soup
5                  Mexican - Veg      Soup
6                  Mexican - Non Veg  Soup
7                  Indian - Veg       Main Course
8                  Indian - Non Veg   Main Course
9                  Chinese - Veg      Snacks
10                 Chinese - Non Veg  Snacks
11                 Indian - Veg       Snacks

Ref_ItemTke
-----------
ItemTkeId  Item                 CuisineTypeItemId
1          Mexican Tomato Soup  5
2          Sweet-corn           1
3          Hot & sour           1
4          Chicken Hot & Sour   2
5          Hakka Noodles        9
6          Man-chow             9
7          Chicken Chopswey     10

Ref_ItemTke1
------------
ItemTke1Id  Item           CuisineTypeItemId  CuisineToId
2           Paneer Bhurji  7                  2
3           Veg Jaipuri    7                  2
4           Shahi kabab    8                  2
5           Idli           11                 1
6           Dhosa          11                 1

CASE II
Ref_ItemDetail
--------------
ItemId  ItemName       CuisineType  CuisineName  ItemType
1       Hakka Noodles  Vegetarian   Chinese      Snacks


Hope you understand what I want to ask:

Which is the best practice among the above?



Zpatel
Tom Kyte
March 15, 2007 - 12:24 pm UTC

seems like "cuisines" have "foods" and "foods" have various attributes like

- vegetarian versus carnivore
- course (snack, main, dessert, whatever)

now, foods may or may not belong to multiple cuisines - I think they do.

and there isn't a "hierarchy" so much as a simple many-to-many relationship.

create table cuisines ( cname primary key );

create table food ( foodname primary key, fkey_to_veg_or_carnivore, fkey_to_course_name );

create table cuisines_to_foods (cname, foodname);



and veg_or_carnivore/ course_name are simple lookup tables with valid values for those attributes.
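
A runnable version of that sketch (datatypes and lookup-table contents are assumptions):

create table diets ( diet_name varchar2(20) primary key );
create table courses ( course_name varchar2(20) primary key );
create table cuisines ( cname varchar2(30) primary key );

create table food
( foodname varchar2(50) primary key,
  diet_name varchar2(20) references diets,
  course_name varchar2(20) references courses
);

create table cuisines_to_foods
( cname varchar2(30) references cuisines,
  foodname varchar2(50) references food,
  primary key ( cname, foodname )
);

A food such as 'Hakka Noodles' is defined once, with its diet and course, and can belong to any number of cuisines via CUISINES_TO_FOODS.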


Normalization

Zpatel, March 16, 2007 - 5:42 am UTC

Thanks, Tom.

But now: only Indian cuisine has subcategories like "South Indian", "Punjabi", "Gujarati", "Rajasthani" etc., and an Indian food resides under one of those categories.

Your suggestion is good for the other cuisines, but what about this situation? I have to take care of all cuisines, so please extend your view to fit my need. What do I have to do?

create table cuisines ( cname primary key );

create table food ( foodname primary key, fkey_to_veg_or_carnivore, fkey_to_course_name );

create table cuisines_to_foods (cname, foodname);

I think I have to add one column to cuisines_to_foods, fkey_to_subcategory_lookup, with a lookup table:

food_subcategory ( foodsubname primary key );

If a food belongs to an Indian cuisine subcategory, I pass that foodsubname to cuisines_to_foods; otherwise I put that field null.



Tom Kyte
March 17, 2007 - 2:37 pm UTC

In general, the subcategories are permitted everywhere.

Are you saying you want to repeat things like 'breakfast', 'lunch' and 'dinner'???

Normalization

Zpatel, March 19, 2007 - 4:43 am UTC

Hi Tom,


"In general, the subcategories are permitted everywhere."
Yes, you are right.

"Are you saying you want to repeat things like 'breakfast', 'lunch' and 'dinner'???"
No, no...
What I mean is:

Cuisines like Mexican and Chinese have subcategories like Soup, Main Course etc.

But Indian cuisine has subcategories like South Indian, Punjabi, Gujarati, and these subcategories are further categorised into sub-subcategories like Soup, Main Course...

So how do I have to handle it?

Thanks

Tom Kyte
March 19, 2007 - 10:13 am UTC

The model I suggested above already "handles" it.

Normalization

Zpatel, March 20, 2007 - 7:36 am UTC

Hi Tom,


Got it!
Thanks

another point

Bunditj, October 28, 2008 - 7:02 am UTC

Tom Kyte
October 28, 2008 - 8:18 am UTC

well, let us see - they state "immutable", but the examples are all data that changes.

They recognize that you'll have to max out the attributes - if you want someone to be able to have 100 affiliations, you'll need 100 columns. So you know what happens then? Some developer says - eh, we'll just store a comma delimited list. Oh boy, then the problems really begin.


I might look at a cluster to keep all data with user_id as a primary or foreign key together, in a single block.


Every time I see someone doing something in-record (attr1, attr2, attr3, ... ) instead of cross record as a relation - bad stuff happens. They get to a point where they need attr(N+1) and then things go downhill fast.


Now, that said - re-read the original answer. OLTP versus DSS/Reporting. There is something to be said that a profile page might be considered a 'report' and we do use things like materialized views (redundant, denormalized) in normal systems to maintain this sort of stuff.

No reason why the transactional data would not be persisted transactionally and things like a profile page assembled and subsequently stored (to be reassembled if something changes).

I do similar things here - I denormalize into my TEXT index (a page is a Question, an Answer, a set of reviews and a set of followups - when you search this site you are searching over 4 columns in 3 tables - but I don't join to do the search, I did the join way ahead of time and indexed the result of the join)
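
Tom's "join ahead of time and index the result" trick can be done with an Oracle Text user datastore; a rough sketch follows (all names are hypothetical, and the real site's schema is certainly more involved - see the Oracle Text documentation for the datastore rules):

create or replace procedure page_text( rid in rowid, tlob in out nocopy clob )
as
begin
    -- concatenate everything searchable for one question "page"
    for x in ( select q.question, q.answer
                 from questions q
                where q.rowid = rid )
    loop
        tlob := x.question || ' ' || x.answer;
        -- ... append the related reviews and followups here as well ...
    end loop;
end;
/

begin
    ctx_ddl.create_preference( 'page_ds', 'USER_DATASTORE' );
    ctx_ddl.set_attribute( 'page_ds', 'PROCEDURE', 'page_text' );
end;
/

create index page_idx on questions( question )
indextype is ctxsys.context
parameters( 'datastore page_ds' );

One CONTAINS() query then searches the pre-joined text of all the related tables at once.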


Data Warehouse and Normalization

Dheeraj, July 28, 2009 - 2:57 am UTC

Hi Tom,

I am working in Data Warehousing and was under the impression that it is De-normalizaed,Yesterday I came across a question that in which normal form is Data Warehouse Model (Star Schema and Snow flake schema).

and the answer given was that it is in Third Normal form.

Can you please throw some light on this,with examples.

Thanks,
Dheeraj


Tom Kyte
July 28, 2009 - 8:20 pm UTC

Well, I'm not at all sure what you are asking here.

Data warehouses do not have a single model that all ascribe to. You can normalize, you can denormalize. It all depends.

But, I'm not at all sure what question you have???

normalization

A reader, July 28, 2009 - 11:16 pm UTC

Tom:

Do you always use materialized views for a read-only page that uses 2-3 tables, or do you only do it when the page renders slowly?

Also, how often do you refresh the MV? Is it real time, every hour, nightly, etc.?

I like your idea of using normalized tables for transactions and indexed MVs for reports or read-only pages.
Tom Kyte
July 29, 2009 - 12:35 am UTC

Never say never,
Never say always,
I always say.....


there are no "always".

materialized views are like indexes, you would use them like indexes.


How often to refresh? The only answer to that is "it depends", else we would not give you an option. In a warehouse it might be weekly, after the load. In a transactional system it might be on commit. It is somewhere in between "right away" and "never".
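
The two extremes Tom mentions, sketched (hypothetical names; fast refresh on commit also requires materialized view logs and has further restrictions - see the documentation):

-- transactional system: refresh on commit
create materialized view log on orders
  with rowid ( cust_id, amount ) including new values;

create materialized view order_totals_mv
  refresh fast on commit
as
select cust_id, count(*) cnt, count(amount) cnt_amt, sum(amount) total
  from orders
 group by cust_id;

-- warehouse: refresh on demand, after the load
create materialized view sales_summary_mv
  refresh complete on demand
as
select prod_code, sum(qty) qty from sales group by prod_code;

exec dbms_mview.refresh( 'SALES_SUMMARY_MV' )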


Normalization in DWH

DK, July 29, 2009 - 6:05 am UTC

Tom,

Refering to the Normilization question in DWH.

What I want to ask is that in the typical Star Schema studded with DIM and Fact tables,is it any way Normalized form (which for of normalization it follows)
Or
Does the Snow Flake Schema wherein the DIM is splitted into two or more tables,is that in any Normalized.

Hope you understand what I am asking

Thanks,
Dheeraj
Tom Kyte
July 29, 2009 - 7:24 am UTC

Star schemas are normalized schemas (data does not repeat), as are snowflakes (just with more relations).

Star schemas can be denormalized as well; a star schema could be a collapsed snowflake for all we know. A snowflake could be denormalized as well.

"It depends."

Give us a documented schema and then we can tell you - these names, 'star' and 'snowflake', do not enforce anything; they are just names.


db

A reader, July 29, 2009 - 4:02 pm UTC

<<In a transactional system it might be on commit. It is somewhere in between "right away" and "never">>

With this you are really inserting several records: one small record into each transaction table and one big record, with several joins, into the materialized view. Correct?

I guess there would be some performance penalty on each commit - but reporting would get real-time updates and be faster. Right?

Codification

Muru B, October 16, 2009 - 12:50 pm UTC

Dear Tom,

Thanks a lot for improving our Oracle skills! Your expert advice is invaluable.

I request your opinion on whether column values in large tables (with more than a million rows) should be codified for better performance. For example, a customer table could have the following columns: gender and card type. In our team many prefer to spell out the entire word in every row (Male, Female, Credit Card, Debit Card etc.), but my recommendation is to codify (M, F, C, D etc.) with a lookup table joined in when needed. Their argument is: hardware is cheap, including storage; it helps readability; and there are fewer joins. My argument is: production hardware is not cheap (we can buy 1TB for home under $100, but production hardware costs thousands); fewer overall blocks used improves performance; and joining lookup tables is a negligible problem. Please give your opinion.

Gender
---------
M vs Male
F vs Female

Card type
-------------
C vs Credit Card
D vs Debit Card
P vs Prepaid Card

help on this

venkata, July 19, 2011 - 11:18 am UTC

Tom,

I was going through something about schema modeling techniques

http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm

i was not able to understand what it means by snowflake schemas and star schemas. Can you please tell us what it means by snowflakes
Tom Kyte
July 19, 2011 - 11:56 am UTC

If you look at a snowflake:

http://www.its.caltech.edu/~atomic/snowcrystals/photos/w031230a113.jpg

it is sort of like there is a center (the fact table), it is surrounded by a layer (the dimensions), and that layer is fancy too - it has its own extensions (more dimensions).

A star schema is a fact table surrounded by dimensions - a single layer.

A snowflake is when your dimensions have dimensions.

http://docs.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm#11631

they have a picture there.
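
To make the difference concrete, a minimal sketch (hypothetical names):

-- star: the fact joins directly to a single layer of dimensions
create table dim_product_star
( product_id number primary key,
  product_name varchar2(50),
  category_name varchar2(30) -- category folded into the dimension
);

-- snowflake: the dimension itself has a dimension
create table dim_category
( category_id number primary key,
  category_name varchar2(30)
);

create table dim_product_flake
( product_id number primary key,
  product_name varchar2(50),
  category_id number references dim_category
);

create table sales_fact
( product_id number references dim_product_flake,
  sale_date date,
  amount number
);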

one example

venkata, July 19, 2011 - 12:40 pm UTC

Tom,

I need one example of each different type of schema, please, for the scenario you mentioned.
Tom Kyte
July 19, 2011 - 1:23 pm UTC

They are right there in the documentation link you sent me????? Did you read it??

Click on the link I gave you - it has a schema *right there* in a nice graphical format.