Modeling graphs inside a relational database

Question and Answer

Tom Kyte

Thanks for the question, Aru.

Asked: September 26, 2003 - 8:03 pm UTC

Last updated: April 09, 2004 - 1:01 pm UTC

Version: 9i2

Viewed 1000+ times

You Asked

Tom,

We have a debate on how to model graphs in a relational database. Our customers unwilling to use the object-relational features of Oracle.

APPLICATION: Participants submit data indicating how much energy they can sell/buy and for what price. This data is submitted in the form of a piecewise-linear curve with 20 segments: First 10 MW at $10, next 30 MW at $11, and so on up to 20 such (quantity, price) points. The data is submitted and utilized as a whole curve. Users do not query or modify individual linear segments of the curve.

We now define columns X1,Y1,...X20,Y20 and enter the curve in a single row of the table. This is convenient for the applications which need to read this data and use it in their linear-program optimization calculations.

The DBAs on our client's side consider this a violation of normal forms and are asking us to create a child table called PriceCurves that has three new columns: SegmentNo, X, Y and then create the curve as twenty rows.

Since individual segments of the curve are never accessed or modified, our model is not a violation of any of the normal forms. Besides, there is some performance advantage to not having a child table and keeping the data in a single row of a single table.
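(Purely for illustration, a minimal sketch of the two shapes being debated; apart from PriceCurves, the table names are made up and only the first 2 of the 20 segments are shown.)

-- shape 1: the current "wide" design, one row per curve
create table curves            -- hypothetical name, the question does not give one
( curve_id integer primary key
, x1 number, y1 number
, x2 number, y2 number
  -- ... through x20 number, y20 number
);

-- shape 2: the parent/child design the client DBAs propose
create table curve_headers     -- hypothetical parent
( curve_id integer primary key
);

create table pricecurves       -- the proposed child table
( curve_id  integer not null references curve_headers
, segmentno integer not null
, x         number  not null
, y         number  not null
, constraint pricecurves_pk primary key (curve_id, segmentno)
);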

I have suggested we switch to using object-relational features and embed the entire curve as a single object in the parent table. But they are not ready for O-R just yet.

What do you think?

and Tom said...


I think normalization is overrated.

I think performance is key.
I think ease of use is paramount.
I think common sense must reign supreme.

I think the model IS normalized. Normalization does not mean "have as many rows in the database as possible".

Tell me -- what rule of normalization are you in violation of? They are thinking this should be modelled as a parent child relationship. I see it as an entity with 40 attributes. X1, Y1 is not any different than "ename", "job" to me in this context.

What sort of dramatic performance decrease would you see if you did what they THINK they want?

What sort of storage increase would you see?

I would not use an object for persistently stored data personally. I use the OR stuff in my PL/SQL code all of the time in order to make it a more productive programming language, but when it comes to storing bits and bytes on disk -- i stick with well designed, performant tables.
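(A minimal sketch of that split, using the hypothetical CURVES table sketched in the question above; the type and function names are made up, and only the first two of the twenty pairs are copied to keep it short.)

create or replace type curve_point_t as object ( x number, y number );
/
create or replace type curve_point_list_t as table of curve_point_t;
/
-- the O-R bits live only in the code: callers get a collection,
-- while the data stays in a plain relational row on disk
create or replace function get_curve( p_id in number )
  return curve_point_list_t
is
  l_points curve_point_list_t := curve_point_list_t();
begin
  for r in ( select x1, y1, x2, y2 from curves where curve_id = p_id )
  loop
    l_points.extend(2);
    l_points(1) := curve_point_t( r.x1, r.y1 );
    l_points(2) := curve_point_t( r.x2, r.y2 );
  end loop;
  return l_points;
end;
/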

Do that which

o makes the most sense
o is fast
o is efficient
o is easy for the applications that consume the data to manipulate




Comments

Limitations of RDBMS

Vikas Khanna, September 28, 2003 - 11:54 pm UTC

Hi Tom,

Recently at a conference a question was posed to the participants:

What are the 3 limitations of an RDBMS compared to an O-RDBMS? In what situations should we use an RDBMS, and in what situations is an O-RDBMS helpful?

Can you please give us a detailed explanation of the above.

Regards,


Tom Kyte
September 29, 2003 - 7:34 am UTC

I know of no limitations -- since O-RDBMS's are effectively built on top of RDBMS's....


I've never felt limited by the rdbms.

Ranting 'bout normalization...

Marc Blum, September 29, 2003 - 8:24 am UTC

Hi,

first a quote from one of Tom's presentations:

<TOMQUOTE>
Statements that should raise your eyebrows:
It is my opinion...
I claim...
I think...
I feel...
</TOMQUOTE>

;-)

My 2 cts:

(1)
Every developer should have a solid understanding of what normalization *means*. Any occurrence of de-normalization should raise his eyebrows: "Gosh, what happened here to the datamodel? I'll find out if there's some strong reason for it." If there's a reason, fine.

(2)
Follow the "Single best place"-Dogma for your datamodel

(3)
Avoid redundancies (in the first order). They are error-prone because developers tend to overlook them; or the staff changes and no one tells the newbies how the datamodel is to be understood (sad, but look around you, that's the way it is). So, when introducing redundancies, make their maintenance bullet-proof and independent of changing application code.

(4)
The above example is typical for a "technical" type of application. Lots of nearly equal attributes. I agree with Tom here. The only point is that every new price segment means a structural change to the database.

Greetings from old europe
Marc


I have 2 cents as well ...

Gabriel, September 29, 2003 - 1:38 pm UTC

Aru,

Do you think there is any remote chance they will move from 20-segment curves? Do you think they will ever consider offering buying/selling on multiple types of segment curves at the same time? (if "Yes" on the second, and if implemented now with 20 pairs of columns ... I would be willing to bet they will invent another SEGMENT_TYPE and, if required, add some more columns to this table)

From your description, I gather you represent the contractor who builds a solution for a client (and their DBA has raised the issue). Have you advised them what would be involved in moving from 20-segment to 21-segment curves? … you had better get them to sign off on it (what am I talking about here, maybe you did that already).

<quote>The data is submitted and utilized as a whole curve. Users do not query or modify individual linear segments of the curve.
</quote>
Do you ever see this changing? I can certainly see the benefit of asking a question like “Get the best price plan available for my average consumption of 300MW per month” … you may have to come back to Tom and ask for the best SQL one could have (the man is a miracle, he can take ANY data model and squeeze every ounce of performance out of it). After he gives his answer, contemplate a change from 20 to 10-segment curves … would his query “scale” (as in, be resilient to the change)?

<quote>
We now define columns X1,Y1,...X20,Y20 and enter the curve in a single row of the table. This is convenient for the applications which need to read this data and use it in their linear-program optimization calculations.
</quote>
So, the whole curve is inserted in one gulp and served as one … nothing more, nothing less. Check it out Tom, they are using the database as a dumping ground (+backup, +whatever). The question that I mentioned above is probably being served by that “linear-program optimization calculation” engine … maybe that application did not hardcode the 20 limit! Do you see your client ever using that data for anything else but the calculation engine? Which modeling approach do you think is more resilient to change?

Bottom line: for the situation you described and based on the hints you gave regarding the architecture of the system and its known intended usage … yes, store the whole curve (segments and all) as one record. If you are mandated or have an Oracle database around (hopefully employed for a better use) then use it … otherwise you can get your functionality (merge curves in a file and serve them whole) for, virtually, $0. And, yes, you’re bound to cause a stir by asking “how to model graphs inside a relational database” when all you need is a file.

Tom,

<quote>X1, Y1 is not any different then "ename", "job" to me in this context.</quote>
Why don’t you consider X1=100MW, X2=20MW, X3=30MW any different from “ename”, “job”? …
The only thing ename and job have in common is “they are attributes of the same Employee” … X1, X2, X3 are attributes of the same curve but there is more: the 20MW represents the segment between 100MW and 120MW. Graph the points (Xi,Yi) with i=1,2,3 in any order and you’ll get the same 3 dots … start connecting the dots and the order in which you do that will give you different shapes (as in “meaning”).
But you are absolutely correct; “in this context” … the app does not need, right now (I may add), to model the segments of the curve. It is questionable whether they need a relational database at all (except for its non-relational features).

<quote>
Tell me -- what rule of normalization are you in violation of?
</quote>
How about repeating groups.
I remember working for a company building a commercial ETL tool and having to implement migrations from Mainframe/Cobol to Unix/Oracle … as in, from “closed” to “open” systems (steady business … Oracle sure rode that wave) … guess Oracle was considered “open” because it was/is offering facilities for implementing relational data models. I remember having to deal with Cobol metadata that said entity E has attributes A, B, C and data coming in as <record header info>,A,(B1,B2, … Bn), C. People were moving to open systems partly because querying (as in “making use”) of data stored as A,B1,B2, … ,Bn,C was inefficient and prone to error. Extracting the whole record in one gulp was fast and easy … anything else wasn’t.

Bottom line: as in many other cases, your practicality gets you scot-free; but “normalization is overrated” is in the same league as “my query is not performing”.

Personally I approach “normalization” very much like “surrogate keys”:
Not “should I use a surrogate key?” but “why shouldn’t I use surrogate keys?” … hence, not “should I normalize?” but “why shouldn’t I normalize?”. In both cases the advantages, long term, are overwhelming. My conclusion may be questionable ... but, hey, I don't really have a reputation to protect.

On a different note … this site is like a magnet … certainly the best marketing effort Oracle has undertaken towards winning mind-share. And we know who the Caesar is, don’t we? :)

Gabriel



Tom Kyte
September 29, 2003 - 2:45 pm UTC

ok, think about first_name, last_name, middle_name. 3 dots -- that when put together give you a name.

think about street_addr, city, state, zip. 4 dots.....

and so on. they are to me as meaningful as ename, job....


If you view it like that, there are no issues with a single, efficient, effective record. Since the application itself that visualizes this is built around "20", no worries there. To add or take away would require fixes there as well.


normalization IS overrated. I see it taken to illogical extremes -- literally killing performance. common sense -- that is what is not overrated.


And what is wrong with "my query is not performing"? as long as you have the supporting information to go with it?

I don't approach surrogates your way. I do it the other way -- should I use one? Not, why shouldn't i use one.





RE: Caesar

Mark A. Williams, September 29, 2003 - 2:49 pm UTC

> And we know who the Caesar is, don’t we? :)

The salad?

- Mark



Regarding comments from Gabriel and Marc

Aru Arunasalam, September 29, 2003 - 3:14 pm UTC

It is possible that the customer may choose to have more than 20 segments in the future. But the change won't come about without long deliberation. As it is, 20 is much more than what most market participants want.

I don't know if the folks who created the normal forms (Codd?) allowed for such anomalies as embedding price-curves in relational databases.

The basic theme to normalization seems to be: make sure you can manipulate stored data without losing data integrity. Clearly, if one were to change one of the segments independently, the entire price curve would be messed up. But the same is possible even if the data were organized in twenty rows instead of in one row.



"column exploded" design is ridiculos

Mikito Harakiri, September 29, 2003 - 4:34 pm UTC

Tom Kyte
September 29, 2003 - 4:50 pm UTC

gee, great argument, thanks for the pointer to a big create table. now what?

did you look at

a) what they need to do?
b) what the application needs to do?
c) the stated constraints

hmmm?

I thought similarity is obvious.

Mikito Harakiri, September 30, 2003 - 12:21 pm UTC

If you look more closely, one of the comments was "I would enjoy querying this table". It applies to both cases, doesn't it?

The problem stated "How do I model graphs", which is not quite what it really means. It should read "How do I model a function?" Now, a (one-argument) function is a set of pairs.

Now, consider

create table Funct1 (
valueAt0 integer,
valueAt1 integer,
valueAt2 integer,
...
)

Admittedly, it's slightly less ridiculous than the design with varchar2 in the link, but close.

Tom Kyte
September 30, 2003 - 12:43 pm UTC

we'll have to agree to disagree.

since the application needs 20 "points", getting them in a SINGLE row is great, that is exactly what the application needs.

I would even say that on the pointed-to table -- yes, I would enjoy querying that table in many cases. Lots better than a parent/child! I can see use cases where that model is the only common sense approach. I can see use cases where that model stinks. Nothing but nothing is black or white.

long ranting

Gabriel, September 30, 2003 - 4:14 pm UTC

Aru,

Here is my final “2 cents” for your original concern:
<quote>We have a debate on how to model graphs in a relational database. Our customers unwilling to use the object-relational features of Oracle.
</quote>
I would avoid the one table design, primarily because of that 20 limit. If necessary, you can put together a little process to denormalize the data and store and sync that image into another table (if it is convenient to serve the calculation engine this way) … so you can get the best of both worlds … all the features you need are there in Oracle. As for O-R, you can invent your own or you can use Oracle Spatial (all about points and lines and polygons). The “meat” of Spatial though revolves around polygons, which you don’t seem to have … so, just look/read at what they do in Spatial for lines, if you are curious about an O-R approach.

Tom,

Entity with PK=(first name, middle_name, last_name) … I can assemble the full name in any order I or anyone pleases and I’m not losing any meaning … it still represents the same physical person (modeled by the Entity uniquely identified by the said PK … I had to mention this so I don’t get the “but we cannot identify all people in the world just by their name” … for anyone jumpy on that one, think my Entity models “My Kids”).

Curve with PK=(dot1,dot2,dot3) … I cannot connect the dots except in a precise order … (dot1,dot2,dot3) is different than (dot2,dot1,dot3) … if I’m not concerned with “connecting the dots” then I’m not modeling a “curve” but a triplet of dots.

Just assume for one minute we normalize a bit here:

Entity 1: Curve_Type (PK=curve_type_id, AK=name, max_no_of_segments, etc.)
Entity 2: Curve (PK=curve_id, FK=curve_type_id, create_date, etc.)
Entity 3: Curve_Segment (PK=?, MW_Amt, Dollar_Amt)

What do you say the natural key for Curve_Segment would be? (let us ignore surrogate keys in this case) … well, it cannot be PK=(FK=curve_id, MW_Amt, Dollar_Amt) since segment 1 and 3 may both be (100MW,$10) … but it is PK=(FK=curve_id, segment_order) … where the segment_order can only be an integer between 1 and the related max_no_of_segments. Have I captured more semantics in or very close to the model? (notice I did not say “all semantics”) In the one entity design the ability to deal with all curves doesn’t exist … also the semantics of the segment_order attribute are somewhere buried in the Oracle metadata (not a good place! … talk about giving water to the mill of those who argue one should get the business logic out of the database and into some upper layer in order to be database independent). Is this little bit of normalization taken to an “illogical extreme”? … YES, I say, if and only if, the only requirement is and will ever be “model 20 pairs of (MW_Amt,Dollar_Amt)”. Do you think I’m taking abstraction to “illogical extremes”?
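(A minimal DDL sketch of those three entities, just to make the key concrete; names and data types are illustrative. Note the cross-table rule “segment_order <= max_no_of_segments of the related curve type” cannot be a plain CHECK constraint, so it would have to live in the maintaining code or a trigger.)

create table curve_type
( curve_type_id      integer      not null
, name               varchar2(30) not null
, max_no_of_segments integer      not null
, constraint curve_type_pk primary key (curve_type_id)
, constraint curve_type_ak unique (name)
);

create table curve
( curve_id      integer not null
, curve_type_id integer not null
, create_date   date    not null
, constraint curve_pk primary key (curve_id)
, constraint curve_fk foreign key (curve_type_id) references curve_type (curve_type_id)
);

create table curve_segment
( curve_id      integer not null
, segment_order integer not null
, mw_amt        number  not null
, dollar_amt    number  not null
, constraint curve_segment_pk  primary key (curve_id, segment_order)
, constraint curve_segment_fk  foreign key (curve_id) references curve (curve_id)
, constraint curve_segment_chk check (segment_order >= 1)
);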

From the little Aru describes, the bit (s)he’s building doesn’t have any “understanding” of curves (I guess, “pricing plans”) … might as well store pairs of (miles, gallons) … data in, data out … hence which is the best design for this sole requirement? … overwhelmingly the one entity.

The problem may be (invariably, is), the curve-meaning would still be out in the user community … your business guy/gal may soon say “I have those price plans in the database to feed my pricing engine … let me get some more juice out of it” … which model do you think looks appropriate now? The business user won’t see any obligation to pay for any re-factoring of the model. The developers will rush to fill in the gap between what the model actually is and what their business owners think it is … based on their own understanding of how things should be … it is anybody’s guess where those extra semantics will now end up: in the model, very close to the data in a PL/SQL trigger or package … or scattered around in various client applications … or some upper layers. If the original architect had any suspicion that data may be used for something else, then (s)he may have missed the opportunity to keep those business rules in the model or very close to it (which is what you recommend, isn’t it? … but after the fact, in many instances, it is too late!).

Now, why am I so against a statement like “normalization is overrated”? In principle I am not, because I can “guess” your reasons and more than willing to give you plenty of leeway on it (you are absolutely correct with the vast majority of your answers after all) … I would agree with you that plenty of projects do not require a big modeling approach … in many cases I just put together some DDL in a script and start from there … but, the fact that I’m writing the DDL directly with all the denormalization I deem appropriate already in, does not mean that I’m discrediting the value of normalization. Check out your very first 5 words of the original response … you already set the stage for what’s to follow … whether your intention or not, you have already associated this big negative connotation to “normalization”. Your, maybe unqualified, “follower” out there may take that to heart to mean “don’t bother to normalize at all” … the critic may have just been given ammunition to challenge you.

Let us talk a bit about normalization and abstraction:

When we embark in this normalization process we can go between 2 extremes: fully normalized and fully denormalized (everything lumped together into one entity). Anything in between can be considered good or bad, depending on circumstances (which also change over time).
One cannot say “normalization is overrated” (and expect not to be challenged) since a valuation of what is appropriate (name it “common sense” if you like) for a subject being modeled cannot be made unless the subject and its context are actually described, in sufficient detail, at that particular time. In other words, one cannot just say “little bit more or little bit less normalization is appropriate” … you also have to describe the subject being modeled and its context. This is akin to saying “my query is not performing” … it just doesn’t make sense unless you specify the actual query (“the subject”) and its context … the same query executing in 30 seconds may be appropriate for a reporting application but may not be for another OLTP application. I remember reading one of your papers and you making essentially the same point: client calling you in and saying, “my system is not performing” … and you saying “Aha … and?” … followed by inquiries about the context: type of application, suspect SQL statements, history of recent changes, what is considered an appropriate response time, etc. What did you just do? You used context to establish a valuation system … after that, talk about common sense, this is worse, this is better, this and that. Someone throwing at you just “my system is not performing” conveys as much value as someone throwing at me just “normalization is overrated”. Of course, you are totally entitled to express your opinions since people come to this site exactly for this reason, to get them (and keep coming back because they are overwhelmingly valuable) … but do not expect not to be challenged if your opinions are generic or suspect. On a case by case basis I have no doubt you can give plenty of examples where too much normalization literally killed performance … but “case by case” is what I am getting at … the “case” should come before the assessment if you are to be fair and objective.

Since you got me going … here is another thread where, I try to prove, you were incorrect …

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:1235228851558

Anyone traumatized by the commoner daring to question his/her idol … check out this thread where I made and recognized a blunder … you could twist the knife even more if that helps with your trauma …

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:2200739488307

One can say “abstraction is overrated” (and you certainly question the real value of so called “universal models” … and I agree with you). Why can one express this opinion though? … surely there are degrees of abstraction as there are degrees of normalization. Because one of the extremes when abstracting is always known and it just doesn’t make sense in any real life scenario … one could abstract everything in this world until all that is left is an entity named “Thing” … it models and represents … well, anything. This is akin to OOM … within the same namespace, the root of any class hierarchy is a class called Object (Object Id) … but that is OK since the class is ABSTRACT … it doesn’t get implemented in real-life. Hence, when one says “abstraction is overrated” without giving much of a context I CANNOT DISAGREE because there is at least one case where that is true: abstracting all the way to Thing just doesn’t make sense … normalizing to 3NF may or may not make sense.

So normalization and abstraction are kind of orthogonal … I can fully normalize a model and then continue to abstract it until I see blue in my face … or I can strike a balance between the level of normalization and performance (enjoy that for a moment) and then abstract till blue again. Conclusion: I don’t necessarily “see blue” because I take a data model to a certain level of denormalization … the evil is not in normalization.

All this talk about context and valuation systems (being able to compare things) … let us have a real example and use what you mentioned: street addr, city, state, zip … better yet, unit#, street#, street_name, street direction (we in North America have Eglinton East and Eglinton West … talk about grids :)), municipality, state, zip, country.

What is appropriate for the Mom&Pop B&B web-reservation system in “old Europe”? They use all this stuff to mail a welcome package to whoever makes a reservation … so, if all they care about is using the address information to print on envelopes, what level of normalization is appropriate? Well, make your choice and I’ll keep agreeing with you. Are they going to start offering on their website some sort of map of the flight path between your address and theirs? Are they going to feed this data to their own reporting system or even warehouse to get some exotic intelligence out of their data? Remote possibilities … but to be sure I’ll ask them. So, yes, having all those attributes denormalized in addr1, addr2, addr3 is not going to raise any of my eyebrows. Of course, if they decide not to build a custom app but rather rent a service, the company offering that service may decide to offer those features … what is now appropriate? Well, something more than addr1, addr2, addr3 may be better. A more extreme example against the Mom&Pop requirements would be that of an organization conducting elections … their business IS addresses and electors … for them an address has at least two roles: mailing address and residential address. One would be well advised to get giggly with normalization for those residential addresses. The electoral address itself is “contained” within some parcel of land … the distribution/density of people determines the boundaries of electoral polls, the density of poll locations, the staffing required to run those polls, etc. Comparing 2 addresses (which is critical in this case in order to avoid duplications) is pretty complex … having those attributes normalized is definitely better than addr1, addr2, addr3. More than that, street, municipality, state now have extra properties … they have polygons associated with them … is it better to always have to parse out the municipality from the addr fields in order to identify the associated polygon?
So we can compare and decide what is appropriate because we know what we are talking about. There is no absolute “full normalization is good” or “full normalization is bad”.

<quote>I don't approach surrogates your way. I do it the other way -- should I use one? Not, why shouldn't i use one.</quote>
I said that is my approach … I did not presume to imply I approach surrogates the same way you do … since I cannot read your mind. One could draw any conclusion from this (in another thread)…

<quote>
I personally use surrogate keys all of the time. Look up in the URL above, there is a big "display_id" number up there (a combination of a sequence, time, and random number. The reason in this case is because the base table didn't have a natural key (your question doesn't have a key really).

Even in tables that appear to have a natural key, I use surrogates. This is to protect myself when people decide the primary key value needs to be changed (and they always do). The only imutable column in the table is really the surrogate key -- the rest are subject to modification. Since I don't want to have to cascade that update to dozens of tables -- i use the surrogate.
</quote>

You use surrogates for entities without a natural key … you even use them for entities that do have a natural key. Guess I’m getting at the same result … just thinking about it differently.

Sorry this has gotten so long … I prefer to justify my points and sometimes it takes more than 2 words (check out Marc below) to be rigorous when there is no exact science about it. I’m not out to prove you wrong for the sake of it … just following your advice of challenging “so called” experts. In my books, you are not “so called” … you are THE expert … but not THE ABSOLUTE one. I’ll take humans over any deity any time. And I’m learning more from you than you’ll ever learn (if anything) from me.

To sarcastic Mark from Indianapolis, IN USA

I don’t know what to make of your 2-word insight. But take another guess … it is not the salad and it is not Tom either.

Gabriel


Tom Kyte
September 30, 2003 - 4:23 pm UTC

You are passionate, I'll say that much.

given what I read in the question, i think a single table makes sense. there is no need to increase the LIO on this system some 20 fold potentially when what they got "works".



Gabriel

Mark A. Williams, September 30, 2003 - 4:29 pm UTC

> To sarcastic Mark from Indianapolis, IN USA

I was neither guessing nor was I attempting to be sarcastic.

- Mark

Chuck Jolley, September 30, 2003 - 4:35 pm UTC

Tell them they can have their child table...
IF they will maintain a lookup table of valid curve segment values.
Otherwise there is no "relationship" involved in their idea.
Many times informing the user of the true nature of what they are asking for will demonstrate why it is less efficient.
Assuming that it really is less efficient of course ;)



Marc

Gabriel, September 30, 2003 - 6:02 pm UTC

Marc,

It certainly felt sarcastic to me … since it hinted at what I’m arguing against … ambiguity. My only excuse … it was deliberate.

It just goes to show that the slightest level of ambiguity, whether deliberate (conscious denormalization in ER modeling) or not, will invariably generate different semantics for different observers.

Given a chance (new code, new requirements, new assumptions, new anything), the actual semantics of a data model will deviate from its origins. This deviation manifests itself through: performance degradation, logical bugs or through legacy-type symptoms (difficult to change, missing or incomplete documentation, etc.). Most of what we do in our business is try to keep this continuous deviation in check. Tom’s approach is pragmatic: deal with the known problems as they come … it is a deterministic path … he can always make it better … since performance is a good indicator of how much deviation is there, he also concludes one should design for performance (which I agree with ... since he's really arguing for solid programming). I, on the other hand, cannot always make a bad system perform … hence, whenever I get a chance, I try to control those run-away semantics close to the model and as soon as possible (worried more about being able to stand up longer rather than running faster) … I find normalization a good tool for that … the downside? … having to convince people (who sometime don’t give a monkey about it) about potential problems … non-deterministic path … exposed to being labeled dogmatic, alarmist, obsessive, trouble maker, whatever. Everyone remembers the fix for the big problem … not having the problem at all? … a non-event. But I’m learning … Tom’s book is finally being delivered … hey, we are slower here in Canada.

Again … sorry man for labeling you.

Gabriel


Gabriel

Mark A. Williams, September 30, 2003 - 6:49 pm UTC

Don't worry, I don't feel "labeled" :)

My intent was to have a tiny bit of fun (though unoriginal I realize), and to give the "simple/obvious/easy" answer. Certainly it was not intended to offend, and I hope that it did not. Though there *are* things you can read into the fact that it was a "simple/obvious/easy" response if you are so inclined (just not too much, ok?) - as you observe - "different semantics to different observers".

I understand and respect your viewpoint, and agree with Tom that you have, if nothing else, a full supply of passion.

"Tom’s book is finally being delivered" - I hope that you enjoy the book as much as I have and find it a worthwhile addition to your library.

"we are slower here in Canada" - you need to make the ice colder then :) and I am sure that is not the case.

Thanks,

Mark

Great Debate!

A reader, October 01, 2003 - 4:05 am UTC

Great Debate Friends,
Good to know some of the expert opinions here. I have another similar situation here:
A database is to be designed to keep the values against the bookmarks of a Word document/template.
Two approaches:
1. Create a Normalized Table
Template_id, field/bookmark_id, field/bookmark_value

2. Create a De-normalized Table
Template_id, field/bookmark_id_1, field/bookmark_value_1, field/bookmark_id_2, field/bookmark_value_2 …

The second approach certainly provides performance, probably at a cost when a field/bookmark is added or deleted. My thought is that the first one, though elegant, would be limited in its scalability due to very high LIO. The modifications required when a field/bookmark is added or deleted would be roughly the same in both approaches.

Can I request your opinion here?
Thanks


Tom Kyte
October 01, 2003 - 8:39 am UTC



here, two tables would be called for -- you do not have a documented upper limit on the number of bookmarks and it does not seem to me that there would be one.

You can reduce the LIOs via the smart use of an IOT (index organized table). It'll clump the bookmarks for a given document together on a block or two -- instead of scattering them all over the place (the same technique could be used in the above question as well, obviously)
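(A minimal sketch of such an IOT for the bookmark case; the names are made up. ORGANIZATION INDEX stores the rows in the primary key index itself, so all bookmarks for one template end up clustered together.)

create table template_bookmark
( template_id    number         not null
, bookmark_name  varchar2(100)  not null
, bookmark_value varchar2(4000)
, constraint template_bookmark_pk primary key (template_id, bookmark_name)
)
organization index;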

Just for the Fun of it …

Gabriel, October 02, 2003 - 3:34 pm UTC

Hi Tom,

With your permission, I’ll play a bit more with our topic here (and promise not to use any of the “Entity”, “Attribute” words).

First of all … I think we are all in agreement that the one table implementation is appropriate when the only stated need is to stage data prior to feeding some consumer. Who cares what the content is when the only thing I want is a small container that moves fast?

But then, we’ve all been duped by the talk of price plans and graphs (the crazy cousins of trees?) and O-R and … ! OK … all with, maybe, one exception.

Let us assume one would model this, in Oracle, when interested in MWs and Dollars. We’ll also keep it simple and not think of things like comparing price plans submitted by sellers and buyers. So, here it is … “give me the cheapest energy plan for my average consumption of 42MW” … or the reverse “give me the best energy plan for my $400 budget”.

The one table implementation (4-segment plans for simplicity):

create table epp_in1
( id integer not null
,name varchar2(10) not null
,segm_offset_1 integer not null
,unit_price_1 integer not null
,segm_offset_2 integer not null
,unit_price_2 integer not null
,segm_offset_3 integer not null
,unit_price_3 integer not null
,segm_offset_4 integer not null
,unit_price_4 integer not null
,constraint epp_in1_pk primary key (id)
,constraint epp_in1_uk unique (name)
);

insert into epp_in1 values (1,'Plan A',10,10,30,15,20,20,140,24);
insert into epp_in1 values (2,'Plan B',20, 8,25,15,25,22,130,24);
commit;

First 10MW @ $10, next 30MW @ $15, next 20MW @ $20, the rest up to 200MW @ $24 … and so on.

One would have to always scan the entire table to evaluate the best match. One would also have to code lots of PL/SQL (or wrappers to external) procedures to do all these CPU-intensive searches … here is such a skeleton (doesn’t do anything worthwhile) …

create or replace package math_pkg as

  function fn_dummy
  ( an_x             in integer
  , an_segm_offset_1 in epp_in1.segm_offset_1%type
  , an_unit_price_1  in epp_in1.unit_price_1%type
  , an_segm_offset_2 in epp_in1.segm_offset_2%type
  , an_unit_price_2  in epp_in1.unit_price_2%type
  , an_segm_offset_3 in epp_in1.segm_offset_3%type
  , an_unit_price_3  in epp_in1.unit_price_3%type
  , an_segm_offset_4 in epp_in1.segm_offset_4%type
  , an_unit_price_4  in epp_in1.unit_price_4%type
  ) return integer;
  pragma restrict_references(fn_dummy, WNDS, WNPS, RNDS, RNPS);
end;
/
show errors

create or replace package body math_pkg as

  function fn_dummy
  ( an_x             in integer
  , an_segm_offset_1 in epp_in1.segm_offset_1%type
  , an_unit_price_1  in epp_in1.unit_price_1%type
  , an_segm_offset_2 in epp_in1.segm_offset_2%type
  , an_unit_price_2  in epp_in1.unit_price_2%type
  , an_segm_offset_3 in epp_in1.segm_offset_3%type
  , an_unit_price_3  in epp_in1.unit_price_3%type
  , an_segm_offset_4 in epp_in1.segm_offset_4%type
  , an_unit_price_4  in epp_in1.unit_price_4%type
  ) return integer is
  begin
    -- complex stuff here
    return an_segm_offset_1 * an_unit_price_1 +
           an_segm_offset_2 * an_unit_price_2 +
           an_segm_offset_3 * an_unit_price_3 +
           an_segm_offset_4 * an_unit_price_4;
  end;
end;
/
show errors

After this a query like …

select math_pkg.fn_dummy(42,segm_offset_1,unit_price_1
,segm_offset_2,unit_price_2
,segm_offset_3,unit_price_3
,segm_offset_4,unit_price_4
) dummy_result
from epp_in1
;

and then DENSE_RANK the results.
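(For completeness, one way that final ranking might look; it simply wraps the query above, so treat it as a sketch.)

select t.id
      ,t.name
      ,t.dummy_result
      ,dense_rank() over (order by t.dummy_result) as rank_by_result
from (
      select id
            ,name
            ,math_pkg.fn_dummy(42,segm_offset_1,unit_price_1
                                 ,segm_offset_2,unit_price_2
                                 ,segm_offset_3,unit_price_3
                                 ,segm_offset_4,unit_price_4) as dummy_result
      from epp_in1
     ) t
;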

Now, this may be appropriate and even faster than the 2 table design … but there are some disadvantages … I have to write all these routines, I may need to load the parameters into PL/SQL arrays before being able to iterate through them (should I need to, of course) … in fact, the only use of Oracle’s analytics is that DENSE_RANK call outside.

The big ?advantage? of the 2-table implementation is that it already has the data organized vertically … is ready for heavy querying as is.

Consider …

create table epp
( id integer not null
,name varchar2(10) not null
,constraint epp_pk primary key (id)
,constraint epp_uk unique (name)
);

create table epp_segm
( epp_id integer not null
,segm_seq integer not null
,segm_offset integer not null
,unit_price integer not null
,constraint epp_segm_pk primary key (epp_id, segm_seq)
,constraint epp_segm_fk foreign key (epp_id) references epp (id)
,constraint epp_segm_chk check (segm_seq between 1 and 4)
);

insert into epp values (1,'Plan A');
insert into epp_segm values (1,1, 10,10);
insert into epp_segm values (1,2, 30,15);
insert into epp_segm values (1,3, 20,20);
insert into epp_segm values (1,4,140,24);

insert into epp values (2,'Plan B');
insert into epp_segm values (2,1, 20, 8);
insert into epp_segm values (2,2, 25,15);
insert into epp_segm values (2,3, 25,22);
insert into epp_segm values (2,4,130,24);

commit;

If I look at the data with this …

select epp_id
,segm_seq
,segm_offset
,unit_price
,sum(segm_offset) over (partition by epp_id order by segm_seq) as x
,sum(segm_offset*unit_price) over (partition by epp_id order by segm_seq) as y
from epp_segm
;

    EPP_ID   SEGM_SEQ SEGM_OFFSET UNIT_PRICE          X          Y
---------- ---------- ----------- ---------- ---------- ----------
         1          1          10         10         10        100
         1          2          30         15         40        550
         1          3          20         20         60        950
         1          4         140         24        200       4310
         2          1          20          8         20        160
         2          2          25         15         45        535
         2          3          25         22         70       1085
         2          4         130         24        200       4205

I can now almost see my graphs …
With MW on the x-axis and $price on the y-axis I can plot these:

Plan 1: (0,0) , (10,100) , (40,550) , (60, 950) , (200,4310)
Plan 2: (0,0) , (20,160) , (45,535) , (70,1085) , (200,4205)

The 2 queries I listed above are now equivalent with:
A. finding the intersection points, with the graphs, for the line given by x=42MW … the one with the lowest y is the cheapest price.
B. finding the intersection points, with the graphs, for the line given by y=400$ … the one with the highest x is the bigger MW volume for the budget.

One handy with trigonometry can even generalize the linear function for price plans (needs to get into COS, SIN, TAN, etc.) and use that for the PL/SQL function mentioned in the 1-table implementation.

With SQL I can do (maybe not the most elegant!) …

select epp_id
,to_char(max(decode(sign(x-42),-1,0,y-(x-42)*unit_price)),'999.99') price
,dense_rank() over (order by max(decode(sign(x-42),-1,0,y-(x-42)*unit_price))) as rank_by_price
from
(
select epp_id
,segm_seq
,segm_offset
,unit_price
,sum(segm_offset) over (partition by epp_id order by segm_seq) as x
,sum(segm_offset*unit_price) over (partition by epp_id order by segm_seq) as y
from epp_segm
) t
group by epp_id
order by epp_id
;

    EPP_ID PRICE   RANK_BY_PRICE
---------- ------- -------------
         1  590.00             2
         2  490.00             1

So, I have all the Oracle analytics to R&R.
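(And a sketch of the reverse question B, “how many MW for a $400 budget”; it reuses the same inner view and relies on the unit prices being non-decreasing along the curve, so take it as an illustration rather than the definitive query. On the sample data it should rank Plan B first: roughly 36MW vs. 30MW for Plan A.)

select epp_id
      ,to_char(min(decode(sign(y-400),-1,null,x-(y-400)/unit_price)),'999.99') mw_for_budget
      ,dense_rank() over (order by min(decode(sign(y-400),-1,null,x-(y-400)/unit_price)) desc) as rank_by_volume
from
(
select epp_id
      ,segm_seq
      ,segm_offset
      ,unit_price
      ,sum(segm_offset) over (partition by epp_id order by segm_seq) as x
      ,sum(segm_offset*unit_price) over (partition by epp_id order by segm_seq) as y
from epp_segm
) t
group by epp_id
order by epp_id
;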

I don’t know which approach is faster (your opinion would be welcomed) … the tradeoff seems to be between built-in vs. built-out functionality rather than normalization. In fact, the “normalization” is not even an issue since this is now very much a calculation, derivation work-horse (it would be like arguing for 3NF for a DW) … the queries will get faster the more I pre-calculate and store. The insert/update OLTP side definitely pales in comparison with the rest.

Now … what is the mentioned “linear-program optimization calculation” engine doing then? Well, probably this and more (kind of curious whose salesman got there first). Hence the whole thing may have just been about staging some data after all.

Hence, I feel I owe you an apology … if nothing else, about my long ranting being “long”.

Given the same conjecture I would still come out strongly against “normalization is overrated”. Why? Because “normalization is rated” also leaves room for the illogical extremes but has this positive connotation to it (excluding any other considerations, the same model normalized vs. denormalized looks to me like a C function with vs. without ASSERTs … hence, at least something is always better). I see it a bit like “positive reinforcement” … I want my kids to learn music … but, while they are young and vulnerable, I’d rather their teacher didn’t use Pink Floyd’s “… I don’t need no education …” as class material. Well, this is it.

I’d rather start concentrating on my current real assignment (… no performance problems, but huge data quality ones … and they come fast … surprise, surprise, the model is over-denormalized and code is all over the place) … the client wants me to “Analyze This” … better do it or no “Analyze That” sequel will follow :)

Thank you.

Gabriel out …


Thanks

A reader, October 03, 2003 - 5:37 am UTC

Gabriel,
A worthwhile presentation. The crux of the matter is that “the database goes beyond the requirements of the current application being built”. A database is considered an organizational asset; the what and how of the data to be stored is decided by what it represents – a recognition made as organizations migrate to an organization-wide data model. Typically such a data model would also have taken into account the possible future use of this data – not to be forgotten here is the fact that applications come and go while the data always remains.

The two possible future scenarios that you’ve pointed out (cheapest energy plan for an average consumption, and best energy plan for a budget) are just the kind of requirements that could not be well catered for with a de-normalized data model.

Tom’s advice, on the other hand, has been very practical: avoid taking de-normalization to an extreme, and an extraordinary way of physically keeping the normalized data items together by using IOTs to keep the LIOs to a minimum.

And, if I can request Tom to confirm if it would be okay to make a third table to keep the “attributes for a table in word template” in the word template example above, it would be great. It is expected that there would be 15 to 55 bookmarks/fields in a word template, about 80 templates and a few templates having a grid (word table).

Thanks and regards,


Tom Kyte
October 03, 2003 - 8:37 am UTC

i don't have a clear enough picture to answer the last part. i don't know what the 3rd table is about?

No question.

Gabriel, October 03, 2003 - 2:54 pm UTC

To “A Reader”,

Thank you for your support ... if nothing else, for at least acknowledging that I may have a point. In my mind, it wasn’t really about all future uses of the data … that would be way more speculation than even I could handle, given the little actual facts that we were presented with.

--------------

The fundamental difference between the 1 vs. 2-table design is where the semantics of that “Segment Sequence” attribute are defined: in the data (2-table) or in the metadata (1-table).

As any pragmatist would’ve been able to easily point out, the answers to the simple questions I listed could be easily obtained from the 1-table design without PL/SQL functions or trigonometric SQL functions (you may need them when searching for a mathematical generalization of those graphs … and that is abstraction that is not really required) … one CAN get the same answers with a CASE statement (or derivatives) and a bit of elementary math ( + - * / ) … and, if one would want to also use the built-in Oracle analytics, getting the data vertically in a temp table is also trivial.

The real issue is the fact that the CASE statement would’ve hard-coded EXACTLY 20 branches … going to 19 would require taking one out, going to 21 would require adding one in, hence changes everywhere in the code … or to be precise, any impact analysis effort would’ve had a wider area to investigate, hence higher chances to miss something … with the 2-table design, the change/search is localized: the little code that handles insert/update/delete in them. It is about ENCAPSULATION … and I don’t even program C++ or Java for my living … I’m SOOOOO “obsolete” :-( ! Quality software like Oracle and cheap CPUs/Memory/Disks can get my data vertical from horizontal or vice-versa faster than I could notice the difference. LONG TERM, PERFORMANCE IS NOT MY PRIMARY CONCERN. The container (software and hardware) for the model gets better all the time … the model doesn’t without re-factoring!
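(To make that concrete, here is what such a CASE looks like against the 4-segment EPP_IN1 table above; with 20 segments there would be 20 hard-coded branches just like these, and every change in the segment count means editing every such statement. On the sample data it returns the same 590/490 as the 2-table query earlier.)

select id
      ,name
      ,case
         when 42 <= segm_offset_1
           then 42 * unit_price_1
         when 42 <= segm_offset_1 + segm_offset_2
           then segm_offset_1 * unit_price_1
              + (42 - segm_offset_1) * unit_price_2
         when 42 <= segm_offset_1 + segm_offset_2 + segm_offset_3
           then segm_offset_1 * unit_price_1
              + segm_offset_2 * unit_price_2
              + (42 - segm_offset_1 - segm_offset_2) * unit_price_3
         else segm_offset_1 * unit_price_1
            + segm_offset_2 * unit_price_2
            + segm_offset_3 * unit_price_3
            + (42 - segm_offset_1 - segm_offset_2 - segm_offset_3) * unit_price_4
       end as price_at_42mw
from epp_in1
;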

Is it likely the enterprise has hard-coded that “20” somewhere else in their processes? Absolutely. I cannot really stop global deforestation single-handed … but I can protect the trees at least on my side of the woods.

So, it all boils down to how important that “20” really is. Well, to use Tom’s own metaphor from the Introduction to his latest book (I just got it … and it is fantastic) … it is about how holistic the “Preventative medicine doctor” should get. Or differently … how important is year 2000 for the programmer of the ‘80s? … it is about only 2 digits after all :-) ! With the 1-table design, the genie is out of the bottle … and the milestone is coming nonetheless ... the only escape is if the company goes bust before it … or just spend the money and/or pray for luck.

The moral of the story: when modeling graphs in a relational database, first think Y2K.

As Tom explained, “Nothing but nothing is black or white”… it is about using the gray (matter) when it really matters (sic!), in your particular context.

BEYOND the fact that the little bit that is pragmatic in the original question has already been pragmatically answered … For the rest, “modeling graphs in a relational database” (and using them), how obvious does the obvious have to get? Draw your own conclusions everyone.

To Mark A Williams: sorry man about the “volume of words” … I know “I Did It Again”, but … simple things do sometimes evolve into controversy. As you may know (just trying to make the point) … Fermat’s Last Theorem took over 350 years to solve … and could be expressed in less than 20 “words” … the search for the FLT proof has nonetheless spurred the development of most modern algebra (where is this Relational Theory coming from?). Some things, IMHO, are worth defending … even to the point of having to explain the totally obvious.

Gabriel … totally out … (of words) …


My question

A reader, October 06, 2003 - 5:34 am UTC

Dear Gabriel,

If the requirements that you list were explicitly stated, they alone would drive a two-table (or normalized) design. Many times, though, the requirements remain hidden or evolve over time and do not get adequately handled by the data model. It takes the experience of the business analysts and the data modelers to see through these things. There are definite benefits to normalization, and, as you say, the container (software and hardware) for the model gets better all the time, so it can be coarsely said that, when in doubt, normalize.

However, I see no clash between your viewpoint and Tom’s. You state the benefits of normalization (rather brilliantly on the above problem) and academics would endorse this as well.

Tom, meanwhile, breaks our shackles, as in this case, by saying a de-normalized table might just be okay. And he does it on a person-to-person (read case-by-case) basis. If Tom sees there is a prejudice to normalize, taking it to excessive limits, he says normalization is overrated, and, if he sees poorly de-normalized tables, he advises normalization. Now, reconciling the two or extracting excerpts for critical examination appears a little gratuitous, without the contexts in which these were said. The context is what Tom derives, by attaching himself to the person, situation and problem, and advises accordingly - the best that the person, situation or the problem deserves. That’s why he makes us understand like nobody else. That’s his USP; that’s why so many of us visit him.

So, what this thread tells us is de-normalization is not a vice, and, in certain cases (like performance) may be a demand too. Now for my scenario:

Dear Tom,

A template would look like:

<Reviewer_Name>
<Address_Field_1>
<Address_Field_2>

Dear <Reviewer_name>,

Please find enclosed the details of the documents still outstanding for review/ forward.

<A Word Table>

Sl. No. Document ID. Document Description Status
<Sl_No> <Document_Id> <Document_Description> <Status>
...

</A Word Table>

Yours truly,
<Consultant>

The word table can have variable number of records. Like this there are many templates.

Now table one would hold:
<template_id> <field_name> <field_type>
t1, reviewer_name <field>
t1, address_field_1 <field>
t1, grid_1 <grid>
...
table two to hold
<template_id> <grid_id> <field_name>
t1 grid_1 sl_no
t1 grid_1 document_id
...

Table two can be a Nested Table, but what would be best?

Thanks for your time.


Tom Kyte
October 06, 2003 - 8:12 am UTC

i only use nested tables as a programming construct in plsql -- i never use them to persistently store data on disk. physically -- they are just a parent/child table implemented for you. You can do it more efficiently yourself and if you like the O-R paradigm, you can certainly use an O-R view to make it appear "as if" it were a nested table.
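(A minimal sketch of such an O-R view over a plain parent/child pair; the type and table names are made up, and the child is the same IOT sketched in an earlier follow-up, repeated here so the example stands alone.)

create table template
( template_id   number primary key
, template_name varchar2(100)
);

create table template_bookmark
( template_id    number         not null references template
, bookmark_name  varchar2(100)  not null
, bookmark_value varchar2(4000)
, constraint template_bookmark_pk primary key (template_id, bookmark_name)
) organization index;

create or replace type bookmark_t as object
( bookmark_name  varchar2(100)
, bookmark_value varchar2(4000)
);
/
create or replace type bookmark_list_t as table of bookmark_t;
/
-- the view makes the child rows appear "as if" they were a nested table
create or replace view template_or_v as
select t.template_id
      ,t.template_name
      ,cast( multiset( select bookmark_t( b.bookmark_name, b.bookmark_value )
                       from template_bookmark b
                       where b.template_id = t.template_id )
             as bookmark_list_t ) as bookmarks
from template t;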

Spatial Objects. Where are they?

Alex, March 08, 2004 - 10:26 am UTC

Hello.
You are talking a lot about different types of geometries.
Could you please tell me if it is possible to visualize them?
(I just want to find a program which could draw Oracle Spatial Objects. It would be wonderful if it could work with Oracle 8i in Windows 2000)
And one more question: is it possible to assign a label and color to a Spatial Object?
Thank you in advance.

Tom Kyte
March 08, 2004 - 1:59 pm UTC

there is a mapviewer included with iAS (app server). it visualizes spatial data on a map.



It does violate 1NF, and normalization is mispracticed (not overrated)

Shouvik, March 09, 2004 - 9:28 am UTC

Please refer to the following link, to get a definition of 1NF.

http://databases.about.com/library/weekly/aa080501a.htm

It says:
Recall that 1NF sets the very basic rules for an organized database:
· Eliminate duplicative columns from the same table.
· Create separate tables for each group of related data and identify each row with a unique column (the primary key).

and then probably goes on to describe the problem described here.

<Aru>
use it in their linear-program optimization calculations.
</Aru>
Tell me, if they wanted to use several of Oracle's functions (regression, best fit, etc.), won't they need the points in row format?
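(To illustrate: with one segment per row, as in the EPP_SEGM example earlier in this thread, Oracle's built-in regression aggregates can be applied directly to the cumulative (x, y) points; a rough sketch only.)

select epp_id
      ,regr_slope(y, x)     as slope      -- slope of the best-fit line, roughly the average $/MW
      ,regr_intercept(y, x) as intercept
from
(
select epp_id
      ,sum(segm_offset) over (partition by epp_id order by segm_seq) as x
      ,sum(segm_offset*unit_price) over (partition by epp_id order by segm_seq) as y
from epp_segm
)
group by epp_id
;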


In college we used to learn the Navier-Stokes equations in fluid mechanics. Every other fluid-mechanics equation is a special case of them, and all of their terms are never needed at once. Does that mean they are overrated?

My advice: please normalize, and denormalize if you think it necessary. Never leave it unnormalized or say "I do not care what the heck normalization is".

Normalization is a theory. Use it, and do not use it where you cannot use it (with existing computing power). But do not despise it.

Tom please take back the word "overrated".

Boolean algebra was introduced long before computers came into being. It could not be practised fully then. Any theory which cannot be practised fully at a given time should not be termed "overrated".

Tom Kyte
March 09, 2004 - 12:28 pm UTC

nope, won't take it back.

if you use it as a mantra, it is overrated.

Look, I was just making a point -- anything taken to an extreme is just that -- something taken to an extreme.

do you always normalize?
no.


just take it for what it was, a comment.

this is the link I meant

Shouvik, March 09, 2004 - 9:34 am UTC

Re: 'I thought similarity is obvious. '

Andy Hardy, March 09, 2004 - 10:07 am UTC

The similarity is obvious to you, because you are inferring something about the data structure based on the names of the columns or you have some underlying knowledge of the data.

We don't.

In both cases, someone has stated that their requirement is that they store x pieces of data against a single entity. Nothing more, nothing less.

database design book

S, April 09, 2004 - 1:01 pm UTC

Hi Tom,
I like your second book - Effective Oracle by Design - very much. Could you recommend a good database design book? From logical design point of view, not physical implementation.

Thank you very much.