Home>Question Details



Alvaro -- Thanks for the question regarding "Query on design", version 8.1.7

Submitted on 25-Jun-2003 19:03 Central time zone
Last updated 26-May-2009 10:09

You Asked

Hi tom,

we have a desing problem ...

we have a table of products like this

PRODUCT
------------------------------
PROD_ID     VARCHAR2(15)
PROD_DESC   VARCHAR2(30)
PROD_TYPE   VARCHAR2(15)
PROD_COST   NUMBER(14,5)
PROD_LEVEL  NUMBER(3)
PROD_STATUS VARCHAR2(2)
PROD_CLASIF VARCHER2(2)

And there is another table with the attributes of each product like this

PRODUCT_ATTRIBUTES
------------------------------
PROD_ID         VARCHAR2(15)
PROD_ATTR       VARCHAR2(15)
PROD_ATTR_VALUE VARCHAR2(30)


And exist another table with all the valid values for the attributes

ATTRIBUTES_VALUES
------------------------------
ATTR_ID        VARCHAR2(15)
ATTR_VALUE     VARCHAR2(30)


SO ... We have products definitions like this

PRODUCT

PROD_ID   PROD_DES    PROD_COST  PROD_LEVEL  PROD_STATUS  PROD_CLASIF
----------------------------------------------------------------------
1000      SINGLE BED  100000     1           ACTIVE       A
1001      DOUBLE BED  200000     1           ACTIVE       A
1002      BABY CHAIR   50000     1           ACTIVE       B
...

PRODUCT_ATTRIBUTES

PROD_ID    PROD_ATTR      PROD_ATTR_VALUE
---------------------------------------------------------------------
1000       CATEGORY       BED ROOM
1000       COLOR          RED
1000       LENGTH         2.00 
1000       WIDTH          1.00
1000       MATERIAL       WOOD
1000       STYLE          CLASIC
1000       GROUP          FIRST CLASS
...
1001       CATEGORY       BED ROOM
1001       COLOR          RED
1001       LENGTH         2.00 
1001       WIDTH          1.80
1001       MATERIAL       IRON
1001       STYLE          CLASIC
1001       GROUP          FIRST CLASS
...
1002       CATEGORY       DINING ROOM
1002       COLOR          BLUE
1002       LENGTH         0.50 
1002       WIDTH          0.50
1002       MATERIAL       WOOD
1002       STYLE          YOUNG
1002       GROUP          FIRST CLASS


then, we want to solve users questions using a query :

*  All the products with  Category "BED ROOM" and Color "RED"  ???
*  All the products with  Category "BED ROOM" and Color "RED" and Material "IRON"  ???
*  All the product  with  Color "RED" and style "CLASIC"  ???

Every condition of the questions is a row in the struture, so we don't Know how to build 
a 
"single" query to solve such questions. We use that structure because the attributes are 
dynamic, 
we don't know  how many features a product will have, product A could have 
color,size,material 
(3 attributes) ... and product b could have color,style,group, and length (4 or more ...) 
and so on. 

Is this a wrong design ??? or Just we don´t know how to build a query ???

please forgive me for my english ...
 
thanks for your help ...

Acueto
 

and we said...

Here is a excerpt from my forthcoming book where I talk about this (and show you how 
ugly, hard and inefficient queries against your very flexible model will be)



(2)Do not use Generic Data Models

Frequently I see applications built on a generic data model for "maximum flexibility" or 
applications built in ways that prohibit performance. Many times - these are one in the 
same thing! For example, it is well known you can represent any object in a database 
using just four tables:

Create table objects ( oid int primary key, name varchar2(255) );

Create table attributes 
( attrId int primary key, attrName varchar2(255), 
datatype varchar2(25) );

Create table object_Attributes 
( oid int, attrId int, value varchar2(4000), 
primary key(oid,attrId) );

Create table Links ( oid1 int, oid2 int, 
primary key (oid1, oid2) );  


That's it - no more CREATE TABLE for me! I can fill the attributes table up with rows 
like this:

insert into attributes values ( 1, 'DATE_OF_BIRTH', 'DATE' );
insert into attributes values ( 2, 'FIRST_NAME',    'STRING' );
insert into attributes values ( 3, 'LAST_NAME',    'STRING' );
commit; 


And now I'm ready to create a PERSON record:

insert into objects values ( 1, 'PERSON' );
insert into object_Attributes values( 1, 1, '15-mar-1965' );
insert into object_Attributes values( 1, 2, 'Thomas' );
insert into object_Attributes values( 1, 3, 'Kyte' );
commit;

insert into objects values ( 2, 'PERSON' );
insert into object_Attributes values( 2, 1, '21-oct-1968' );
insert into object_Attributes values( 2, 2, 'John' );
insert into object_Attributes values( 2, 3, 'Smith' );
commit; 

And since I'm good at SQL, I can even query this record up to get the FIRST_NAME and 
LAST_NAME of all PERSON records:

ops$tkyte@ORA920> select 
     max( decode(attrName, 'FIRST_NAME', value, null )) first_name,
  2  max( decode( attrName, 'LAST_NAME',  value, null ) ) last_name
  3    from objects, object_attributes, attributes
  4   where attributes.attrName in ( 'FIRST_NAME', 'LAST_NAME' )
  5     and object_attributes.attrId = attributes.attrId
  6     and object_attributes.oid = objects.oid
  7     and objects.name = 'PERSON'
  8   group by objects.oid
  9  /

FIRST_NAME           LAST_NAME
-------------------- --------------------
Thomas               Kyte
John                 Smith


Looks great, right? I mean, the developers don't have to create tables anymore, we can 
add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The 
developers can do whatever they want and the DBA can't stop them. This is ultimate 
"flexibility". I've seen people try to build entire systems on this model.

But, how does it perform? Miserably, terribly, horribly. A simple "select first_name, 
last_name from person" query is transformed into a 3-table join with aggregates and all. 
Further, if the attributes are "NULLABLE" - that is, there might not be a row in 
OBJECT_ATTRIBUTES for some attributes, you may have to outer join instead of just joining 
which in some cases can remove more optimal query plans from consideration. 

Writing queries might look pretty straightforward, but it's impossible to do in a 
performant fashion. For example, if we wanted to get everyone that was born in MARCH or 
has a LAST_NAME = 'SMITH', we could simply take the query from above and just wrap an 
inline view around that:


ops$tkyte@ORA920> select *
  2    from (
  3  select 
     max(decode(attrName, 'FIRST_NAME', value, null)) first_name,
  4  max(decode(attrName, 'LAST_NAME',  value, null)) last_name,
  5  max(decode(attrName, 'DATE_OF_BIRTH',  value, null)) 
                                                      date_of_birth
  6    from objects, object_attributes, attributes
  7   where attributes.attrName in ( 'FIRST_NAME', 
                                     'LAST_NAME', 'DATE_OF_BIRTH' )
  8     and object_attributes.attrId = attributes.attrId
  9     and object_attributes.oid = objects.oid
 10     and objects.name = 'PERSON'
 11   group by objects.oid
 12         )
 13   where last_name = 'Smith'
 14      or date_of_birth like '%-mar-%'
 15  /

FIRST_NAME           LAST_NAME            DATE_OF_BIRTH
-------------------- -------------------- --------------------
Thomas               Kyte                 15-mar-1965
John                 Smith                21-oct-1968

So, it looks "easy" to query, but think about the performance! If you had a couple 
thousand OBJECT records, and a couple tens of thousands of OBJECT_ATTRIBUTES - Oracle 
would have to process the entire inner group by query first and then apply the WHERE 
clause. 

This is not a made up data model, one that I crafted just to make a point. This is an 
actual data model that I've seen people try to use. Their goal is ultimate flexibility. 
They don't know what OBJECTS they need, they don't know what ATTRIBUTES they will have. 
Well - that is what the database was written for in the first place: Oracle implemented 
this thing called SQL to define OBJECTS and ATTRIBUTES and lets you use SQL to query 
them. You are trying to put a generic layer on top of a generic layer - and it fails each 
and every time except for the most trivial of applications.
 

Reviews    
2 stars The crux of the question wasn't answered!   June 26, 2003 - 10am Central time zone
Reviewer: A reader 
Tom, the example you used is okay but I am sure such designs are used often in the real world and 
doesn't portray real business problems posed by Alvaro question such as 

"...We use that structure because the attributes are dynamic, we don't know  how many features a 
product will have, product A could have color,size,material (3 attributes) ... and product b could 
have color,style,group, and length (4 or more ...) and so on. "


How would you tackle a design issue such as the one described above? 


Followup   June 26, 2003 - 11am Central time zone:

I answered the crux of the question:

quote:
Is this a wrong design ??? or Just we don´t know how to build a query ???


answer:
wrong design


I thought it was very straightforward.  Not having all of the requirements -- I cannot comment 
further.  But I would betcha there is a finite set of attributes such a thing could have -- at the 
very least, there is a very LARGE set of common attributes.  

They have gone the route of "total flexibility" -- which I many times find is "we didn't gather the 
requirements really".  That could be part of the issue.

I would go with

create table t (
<set columns -- things we know everything has>,
<columns most things have, color, size, cost, shape, etc>,
<couple of flex fields for oddities>
);


a single table, many attributes.  Ability to index.  Ability to query performantly.

I would also bet that most attributes are in fact "well known" 

2 stars Correction!   June 26, 2003 - 10am Central time zone
Reviewer: A reader 
I should have said Tom's example (4 table data structure) is "Not" used in the real world very 
often... 


Followup   June 26, 2003 - 11am Central time zone:

you would be surprised how many systems do this, "flexibility" is their mantra.  excessively poor 
performance is their life.

see
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:6692296628899
for the classic example of this sort of thinking taken to an illogical extreme. 

4 stars Alternatives for that bad design   June 26, 2003 - 10am Central time zone
Reviewer: Alvaro from Barranquilla, Colombia
Ok, Tom we understood your approach ...

If we use that design in our application the performace will be poor ...

So , We are thinking to store all attributes in the same table no matter what king of attribute ..

product table would be

prod_id
prod_atr1
prod_atr2
prod_atr3
...
prod_atr20

With this model we have all the information in one table, but the problem would be the user must 
know that atr1 is color and atr is size and atr3 is last_name ... and so on.
The flexibility is not 100% but the performance would be better ...

what do you think about it ...

Thanks 
Acueto

 


Followup   June 26, 2003 - 11am Central time zone:

I think you'll find MOST attributes are in fact well known.

FEW attributes are not known.

So, I think you have a table with lots of fixed attributes -- color, size, last_name and so on.

You have a couple of columns at the end that vary by "product"

and the end users would have to "know"


But think about how excellent they would have to "know" SQL in order to query the other model at 
all -- look at the "or" example I have -- ugly. 

5 stars Proper data analysis required   June 26, 2003 - 10am Central time zone
Reviewer: A reader from UK
Very good points Tom! Proper analysis of the business procedures and data storage requirements 
should yield the right balance between flexibility and performance. 


5 stars Column names   June 26, 2003 - 10am Central time zone
Reviewer: A reader from UK
Alvaro

From the information you have provided it seems that most products have the same basic information 
about them, eg. colour, category, material. What prevents you from using these as the column names? 
Without a full analysis of the business, though, it would be impossible to determine the correct 
table/column set to use for you data storage. 


4 stars attributes   June 26, 2003 - 11am Central time zone
Reviewer: Alvaro from Colombia
Thanks UK ...

As you said we would use color, material, category and  ...
for column names, but actually, the attributes could be more and in some cases unknow for new 
products.
I could say you our product attributes are this and no more:
(color, material, heigh, width, category , ... ) only 15 attributes but it's not the real world. 
If a new product is added to our company it could have some attribute that does not exits, for this 
reason we are thinking in table fields like 

prod_id
prod_atr1
prod_atr2
...
prod_atr20
  


5 stars Tom's Date of Birth is 15th March 1965   June 26, 2003 - 11am Central time zone
Reviewer: A reader 
Is that true ? 


5 stars Analyse   June 26, 2003 - 11am Central time zone
Reviewer: Paul from UK
Unknown will be just fine for new products, simply leave it NULL.

Ok, now I'm getting to know more about your business, but more information is still required. For 
example what is the scope of the products covered by the business? Could you use the scope then to 
come up with good column names that will cover 99% of all products you would need to store? Might 
it be necessary to sub-divide products into specific groups, perhaps requiring more than one table? 
Maybe two groups of products have totally different attributes. Also, we are probably only looking 
at one tiny part of your whole system, perhaps other parts will affect how this part is designed.

Designing, as with any planning type activity will be more accurate and complete when the 
information provided is accurate and complete. 


5 stars Think of it from the users point of view.   June 26, 2003 - 12pm Central time zone
Reviewer: Chuck Jolley from OKC, OK USA
I try to think of these things from the users point of view.
Numeric attributes like size, cost, number per package etc that virtualy all of your products will 
have
or properties that could use list of values from lookup tables like color, material, group, style, 
etc get columns.
If the LOV would be too long to use for data entry or a drop-down on an inquery screen, why give it 
it's own column?
Just put that stuff in a wide descriptive column. You can still index and search on it.

select * 
  from products
 where desc like '%64 SQUARES%'
   and desc like '%TELESCOPING LEGS%'
   and color = 'RED'
   and width = 17
   and units = 'INCHES'
   and etc;

Gets you all the red 17" chess and checkers tables with telescoping legs.

Without creating a table with 1000 columns or 1000 lines of sql.

If something REALY new comes in that won't fit you can always add another column.
chuck
 


Followup   June 26, 2003 - 12pm Central time zone:

(tagged xml might even be a consideration.... oh, did I really say that out loud)

we can index that (XML) and do section searches amongst other things. 

5 stars Flexible database design   June 26, 2003 - 1pm Central time zone
Reviewer: Yong from CA, USA
The example Tom put out here 
was used/have been useing in a real world.

I involved two projects for two big companies at Bay Area. The java developers int my team try to 
convince me how flexible the data model will be. I gave the exact answer as Tom did: your flexible 
will kill the database. Unfortuantely, I didn't win these wars and development started. I knew what 
was going to happen. 

In the test environment, we have 20 rows on each tables, java developers could show how easy and 
flexible for a java develper to add any new attribute and feature. Then they were vary happy.

When we started the production, and table rows grew to 10,000 records in the tables, we encountered 
performance problem. After a user provide user name and password, our J2EE app loaded search the 
person's profile, it took 40 seconds to show up the next page!

Happy now? 

Finally, I have to re-design the data model 6 month later.

I read many Tom's idea and couldn't agree more. 

 


4 stars Supporting this model   June 26, 2003 - 1pm Central time zone
Reviewer: will from TX
I'm currently cursed with maintaining this type of model.  It supports a web app that generates a 
search on the various attributes.  The queries that are generated are something (evil) to behold.  
No way it will scale.

The question is how to redesign the structures to support the user's ability to add products and 
attributes and efficiently maintain and query the resulting large and varied set of searchable 
attributes.

Above you recommend a product (with flex field) approach.

We could probably get around the "user created model" requirement by giving them some facility to 
assign the flexfields for a given use.  However, the number of flexfields would be very large.  

To support the large and diverse set of attributes,  would you recommend a separate table for each 
product type?  How fine should the distinction be?  For example, should you have a single 
"computer" table or a "PC" and "Server" table?  Do you think a company like Amazon would store the 
data in a table per product type model?

I realize that these are decisions that would have to be made for our specific requirements, but 
any discussion on the alternatives would be greatly appreciated.  Will you discuss alternatives in 
your book?  I know I'll be first in line when it comes out.

Thanks for the great site (wow, cool spellcheck).


 


Followup   June 26, 2003 - 1pm Central time zone:

You might consider XML, you can store attributes in a tagged "document",  Oracle text can index 
that.  You can do section searching on it.

benchmarking such an implementation would be crucial obviously.  

5 stars Also, the generic design prevents enforcement.   June 26, 2003 - 1pm Central time zone
Reviewer: Chuck Jolley from OKC, OK USA
I forgot to add, that doing the design genericaly like that prevents the use of two of the most 
classic types of constraints: foreign keys and lookup tables.
What is to prevent someone from entering 'INCHES' for the value of the attribute 'COLOR'?
Let alone the "attribute" 'RANDOM_STRING'?
A 10,000 line check contraint?
10,000 lines of application code?
In our shop we call this the "Big Chief Tablet" school of design.
The database is just used like a gigantic piece of notebook paper.
In my view all the attribute columns except "desc" should be enforced  by check constraints or 
forgeign keys on lookup tables.
This alone will save the programmers a huge amount of work.
chuck
 


Followup   June 26, 2003 - 1pm Central time zone:

hear hear.

(when I talk of xml above -- it would be for ONLY those things that are definitely NOT KNOWABLE 
until months after the system is fielded -- the exceptions, things to add to version 2) 

5 stars Re: Tom's Date of Birth is 15th March 1965   June 26, 2003 - 2pm Central time zone
Reviewer: A reader 
I've got to say that if it's true, he looks like heck for being only 38 (just kidding, Tom!). 


5 stars Similar problem   June 26, 2003 - 4pm Central time zone
Reviewer: av 
I am facing same problem in our system. We do have these 4 tables to store the template data. 
Mostly the data is from the forms/applications templates.

For example the template is an application when you need to apply for a credit card or auto loan or 
insurance etc., The system doesn’t know the template and its attributes well in advance. Developers 
can create new template and the engine automatically recognizes it as a new template and registers 
its fields and values in the database. The 4-

table structure is:

1. Template_type Table
Template_Type_id number,
Template Type Name varchar2

2. Template Table
Template_id number,
Template Name varchar2,
Template_Type_id number

3. Template_Field Table
Field_id number,
Field_name varchar2,
Template_type_id number

4. Field_value Table
Field_Value_id number,
Number_val number,
String_val varchar2,
Date_Val Date
Template_id number,
Field_id number

With this flexible table structure developers don’t care about the number of templates and the 
number/type of fields in it. It is easy for them to allow users to fill templates and store them in 
database. It is also easy to show all templates and its fields.

But the real problem is with the reporting (both ad-hoc and standard). The field’s sizes in each 
template vary between 100 and 500. Right now we have 40 templates, 8000 fields and 400,000 
field_values.

We designed the datawherehouse table as a flat table with 1000 columns (number, varchar2 and date) 
and transforming the data into columns in ETL process, and we are reusing columns. Because we need 
to give sort by, filter, group by, aggregate functions etc features on each and every fields in 
ad-hoc reporting.  So we thought that de-normalizing and storing the data in columns give 
flexibility at report level.

Now the queries for a simple select become 

Select decode(Template_Type, 1, number1_column, 2, Number15_column, 3, number34_col, null) column1, 
decode(Template_type, 1,  varchar22_col, 5, varchar35_col, null) column2
From dw_table

The reason for this that we want to give a report of all the field values for that user templates 
(if the field exists in multiple templates then we want get all of them like cross-template) is 
with the column reuse let say field “Total_cost” can exists in any column. So using the metadata we 
are preparing the above decode to find out all values.

These queries are taking a very long time. I couldn’t fully understand the XML approach that you 
talked about. Is it a viable solution for my problem? 

Can you please let me know the best approach to solve the above said reporting problem? If you need 
to implement this framework how do you handle this design?
 


Followup   June 26, 2003 - 4pm Central time zone:

It'll help on searches for a document instance (XML will) cause you can 

select ... from t where contains( 'something within attribute' )

and get the instances back.

As for being a data warehouse solution, not so good.  there you need facts, dimensions.

Like I keep saying -- this flexibility = great, good for the developers.  this flexibility = 
stinks, terrible for any ad-hoc queryies, aggreations, "ors", you know  -- what people want to do 
with data.

I think  your data warehouse needs 40 fact tables (40 is NOT a big number, you don't have that many 
templates).

I think your data warehouse needs columns that convey meaning and are appropriately indexed with 
proper datatypes.

I think your "etl" process would convert their "glob of nothing" into something that people can use 
in the warehouse.

Their structure is excellent for working on a document instance ("hey, get me the data for 
document=555, that user is here to input the information")

Their structure is horrible for actually using the data any other way.


I would have changes to a template notify someone so the warehouse can be modified to accomidate it 
properly. 

5 stars Thanks   June 26, 2003 - 7pm Central time zone
Reviewer: A reader 
Thanks a bunch for your feedback. I totally accept your point. Still I have a question.

1.    The reason for merging them to provide cross-template queries, like get the cost field from 
Template 1 and Template 5 or aggregate quantity from Template 10 and Template 20. I have to merge 
the columns in the different table to get this kind of reports.

Another different question, which is not related to above argument is:

Some templates have around 500 columns, so they will become 500 columns in a table (I have this 
problem with current table too).  Is it going to scale well? Please let me know if you can think 
any better approach?
Thanks in advance.
 


Followup   June 27, 2003 - 8am Central time zone:

facts and dimensions are useful to break out "500" columns into more meaningful sets of tables.

It sounds like you need 1 fact table with common attributes, then 40 other "subfact like tables" 
for the non-common attributes, with dimensions off of each.  Practically speaking in a DW 
environment, I would probably have 41 facts -- one for the common with all rows from all templates 
-- and the ability if need be to drill down to the non-common details.

Then I would have 40 more fact tables -- that replicate the common but also have the non-common 
attributes.

In that fashion, I don't have to join common to non-common to query "template 1" , I just query 
template 1. 

4 stars The original question   June 27, 2003 - 9am Central time zone
Reviewer: A support manager from UK
I agree with Tom whole heartily - just don't go down the route of 'common object' tables. It kills 
performance, and confuses support staff (or at least gives them a long learning curve on the 
meaning of data).

I inherited a DB put together by a bunch of dogmatic OO programmers: All data "items" are in a 
single narrow table with a "class descriptor" to tell you if looking at (say) a stock item or an 
outlet. Values associated with item are in an attribute table. Certain items have properties that 
are items in their own right, these are in a link table. Certain items have parents - eg a "steak 
dinner" has a parent of "meat dish" which has a parent of "food". Each of these relationships is 
held in a single two column table of item, parent. Most simple queries run as 3 or more joins to 
the item table.

Performance is a dog. And to make the batch (output) interfaces fly the developers had to create 
temporary tables that contained a conventional (multi-column) form of the data

Performance is designed in - it follows analysis and design, it is not a generic, commodity item. 


5 stars Don't you have to take DB theory to get a programming degree any more?   June 27, 2003 - 11am Central time zone
Reviewer: Chuck Jolley from OKC, OK USA
What's the difference between that and using flat files?
I mean, they could have done the same thing with three flat files and a couple of app maintained 
"index" files.
They would have saved the price of Oracle and support.
It might have even run faster. 


4 stars Generic Database Models, don't try this at home   June 27, 2003 - 11am Central time zone
Reviewer: Jeff Klee, M.O.S. (Man-of-Science) from Research Triangle Park, North Carolina
Tom,

I really enjoy your column. This thread peaked my interest because I'm an application DBA and I 
support Oracle's "Oracle Clinical" application for a large pharmaceutical company. Oracle Clinical 
is a large "system" which is generally used to capture clinical trials data using a flexible and 
generic data structure. I would describe the datamodel in a nutshell by saying that it is built 
around two entities, a question (i.e. "what is your birthday", "what is the barcode on the blood 
sample") and a response (i.e. "16-JUN-1988" and "AA400055H" respectively). In effect, the RESPONSES 
table contains almost all of the dynamic data and the remaining schema objects (~10,000 although I 
didn't count them), are the metadata behind the RESPONSES table and the metadata that is used to 
construct the presentation layer on the fly. As you suggest, building a system like this is not 
like picking fruit and you need an army of folks to build it (read “a large pile of money”) and 
another highly trained group to "fly it". My role in this is to trouble-shoot problems where the 
typical problem statement is something like  “why is the database slow" (usually when someone 
builds a slow custom-query) and “can you make it go" (usually a non-database issue - network 
routers, defective firewalls etc), so, I’ve become familiar with the underlying table structures 
(~400 tables). After working with this product for a year, I have concluded that the generic 
database model probably makes sense in this case because you can build a single product, sell it to 
several end users and there is a large enough market to justify the expense (read “payback the cost 
of building it”). The main benefit of this architecture is that it allows a company to build their 
own presentation layer and capture any kind of data they want, and do this quickly. Although it's 
marketed as a  validated clinical-trials application, you could choose to capture other types of 
data. My intent is not to challenge your premise, "Do not use Generic Data Models" but to emphasize 
it. I would go on to qualify that statement by saying that one should have a sound reason to do it, 
like a business plan that justifies the expense of building such a complex system by weighing the 
benefits of doing so. In summary, Oracle Clinical provides my company with a high-speed mechanism 
for deploying custom applications (on the web). Although Oracle provides a mechanism for addressing 
performance problems and building generic "extract views" that make it easy to query the database, 
designing a system this complicated is probably a lot of "hooey" if you're only deploying one 
instance of it (and then I would call that a career development system). Also note that, since the 
RESPONSES table contains actual entities and their attributes, we extract some of them to a 
relation model where we can build high-performance (and more readable and easier to understand) 
queries. So my conclusion is that this generic model makes sense in this case and it does what it’s 
intended to do, capture data. Don’t try this at home though.
 


5 stars Bad design rant continued   June 27, 2003 - 11am Central time zone
Reviewer: A Support Manager from UK
If I had my way I would take their degrees from them, Chuck!

As I said above, performance is not an accident, it is good design.

In the OO system above, this generic approach to design has made the system far too complicated, 
and too divorced from the meaning of the business data, I would love to know that a widget costs a 
dollar and not that Item 10000324 has an attribute of type 345 with a value of 1.
By the way, this madness also pervades into the application code itself - too many layers of 
abstraction and objects that are way to big. No wonder the consultancy that put this in for our 
customer went bust.

 


4 stars Synthetic primary key and Real Primary Key   August 23, 2003 - 8am Central time zone
Reviewer: A reader 
Tom,

can you tell me what is Synthetic primary key and Real Primary Key.

Thanks for your time.
 


Followup   August 23, 2003 - 10am Central time zone:

synthetic primary keys are made of rubber.

real primary keys are made of numbers, dates, strings :)

(that was a joke... never mind)


We normally call them a surrogate key.

it is a key you use in place of the real or natural key of a table.  Consider a table of documents:

create table documents
( path   varchar2(255),
  name   varchar2(255),
  .... other data ....
)

arguably, the primary key -- the natural key of this table is PATH,NAME.  However:

a) we move documents, path might change
b) we rename documents, name might change
c) path and name are "big"
d) path and name are 2 columns, more typing in queries

So, path and name, while they are the natural keys are not necessarily the best fields to use as 
the primary key.  Especially for A) and B) -- cascading updates to foreign keys is *nasty*.  Just 
because you renamed a document doesn't mean it is a different document.

So, in real life, we would probably:


create table documents
( id     number PRIMARY KEY,
  path   varchar2(255),
  name   varchar2(255),
  .... other data ....,
  constraint name_unq unique( path, name )
)


we would introduce ID, a surrogate (synthetic) key.  It is imutable (never changing) solving the A) 
and B) problem.  It is small solving the C) data storage problem and it is a single simple column 
solving the D) issue.


 

5 stars Excellent   August 23, 2003 - 11am Central time zone
Reviewer: A reader 
***** for the joke
***** for the answer

whenever i read your responses i will be very very serious. Today you made me laugh with the joke 
.. Great..Thanks :) 


3 stars informatica   September 30, 2003 - 1pm Central time zone
Reviewer: kit from england
Re generic designs. I understand there is always a tradeoff 
but reading your comments, your do not seem to like them

Have you seen the underlying informatica design. how do they allow users to create their own source 
and target columns and apply business logic 


Followup   September 30, 2003 - 1pm Central time zone:

the only think i know about informatica is it really likes to do things as deadly slow as it can -- 
a row by slow row process.

beyond that, i haven't looked at their tables/implementation -- I just try to get people to use the 
database to its fullest.  sometimes a simple insert/update/create table as select will do in 
seconds/minutes what otherwise would take hours/days. 

2 stars sql   October 3, 2003 - 2am Central time zone
Reviewer: redmeat from sg
hi, 

it may work if you can build the sql string below.

SELECT * FROM PRODUCT WHERE PROD_ID IN
(
SELECT PROD_ID FROM PRODUCT_ATTRIBUTES A
WHERE A.PROD_ATTR IN (SELECT B.PROD_ATTR
        FROM PRODUCT_ATTRIBUTES B
        WHERE B.PROD_ID = A.PROD_ID
        AND B.PROD_ATTR='GROUP'
        AND B.PROD_ATTR_VALUE = 'FIRST CLASS')
INTERSECT
SELECT PROD_ID FROM PRODUCT_ATTRIBUTES A
WHERE A.PROD_ATTR IN (SELECT B.PROD_ATTR
        FROM PRODUCT_ATTRIBUTES B
        WHERE B.PROD_ID = A.PROD_ID
        AND B.PROD_ATTR='CATEGORY'
        AND B.PROD_ATTR_VALUE = 'BED ROOM')
)
/

PROD_ID         PROD_DESC                       PROD_COST PROD_LEVEL PR PR
--------------- ------------------------------ ---------- ---------- -- --
1000            SINGLE BED                         100000          1 A  A
1001            DOUBLE BED                         200000          1 A  A


 


5 stars question on design   October 24, 2003 - 2pm Central time zone
Reviewer: A reader 
Hi Tom
We have a situation where we have huge files 
stored in the database (Eventually this will run
in 9i/10G)

We need to parse it. Parsing in pl/sql has appeal as
you have something in the DB and you do the parsing,
store the parsed results in tables for the UI to display.

However, pl/sql is not meant to be used (I think )
as a parsing mechanism. perl *perhaps* is better bcos
it is designed upfront for this. (10G regular expressions
in PL/sql mitigate this problem a bit may be for 10G)
But doing it in perl introduces the "extproc" complexity and with it perhaps other complications. 
If you want to do it in the middle tier then you have to ship this file contents over - not a good 
idea.

What are your thoughts?
1. pl.sql
2. perl with extproc
3. middle tier

My choices are as above in the order of preference.
Please let me know what your thoughts are.

Thanx for an amazing site!!

 


Followup   October 24, 2003 - 3pm Central time zone:

what is this document and what does your parsing involve (i've never personally seen the appeal of 
perl, ugh) 

5 stars Thanx Tom!   October 24, 2003 - 3pm Central time zone
Reviewer: A reader 
The document is a text document (if that is what you meant)
It contains information that needs to be parsed and
put in tables for the UI. The information is in a format
that would need reasonable amount of parsing to be 
carried out.

Thanx! 


Followup   October 24, 2003 - 4pm Central time zone:

do you have a "for example" (and why wouldn't the client that stuffs it in do the parsing?) 

5 stars thanx!   October 24, 2003 - 4pm Central time zone
Reviewer: A reader 
Unfortunately, I can not give you an example right now
for confidentiality reasons.

Anyways, I dont think the parsing involved is going to be
*very* heavy duty and can be done in pl/sql with the array
of string manipulations. My concern is that if you are using too much pl/sql processing (due to the 
size of the file) it may bog down the database. The process itself
can be done in background using jobs. - it is not the
"ui" ineractiveness that bothers me but the resource
consumption at the database.

for example (fake)
a=b,c=d,

put it in a table where col_a  = b and col_c = d


Thanx!! 


3 stars Database Design   April 21, 2004 - 6am Central time zone
Reviewer: Bárbara Díaz Vicente from Madrid,Spain
Hi:
I have to design a database with several schemas, such as when we want delete some funcionallity 
only we have delete its schema.
The idea is that there is a schema with generic information for all our schemas and then each 
necessary schema.
For example:

        SCHEMA 1: CSR_GES_PRMT
                                col1
                                col2
                                col3

        SCHEMA 2: PCRO_GES_PRMT
                                col1
                                col4
                                col5

Cardinality between both tables 1:1.-
       ......
        SCHEMA n: ................................

OBJECTIVE: we need get a table - view synonym ???? -
TB_GES_PRMT with the following attributes
col1,col2,col3,col4,col5.

At the begining I had known use view and synonyms but there are something restrictions with DML 
sentences.

How can we do this design ???
Thanks in advance.

P.D: Sorry for my english !!! 


Followup   April 21, 2004 - 8pm Central time zone:

I don't fully understand the concept but it sounds so far like "a bad idea".  not sure what you are 
trying to do exactly. 

3 stars Database Design   April 22, 2004 - 5am Central time zone
Reviewer: Bárbara Díaz Vicente from Madrid,Spain
I'll try explain better the cuestion.
We have two Databases created.Now we want integrate both in only one database and think how to do 
it for future developments where we need reuse some information and create some new data.
We think create a schema with share information and so much schemas that we need for creating our 
funcionallity.
For a concrete example:
 A table for customers ----- Generic Information:CSR_CLIENTS
                              MSISDN
                              CO_EST_CLTS_N
                              CO_EST_SIM_N
                             Specific Information:PCRO_CLIENTS
                               MSISDN
                               FC_TMSEL_V
We have created source for a table PCRO_CLIENTS with the following data: 
MSISDN,CO_EST_CLTS_N,CO_EST_SIM_N,FC_TMSEL_V and we don't want reprograming that source but we need 
separate generic-especific information , Perhaps solution will be a wiew with join both tables but 
we have limitations with DML sentences.

Or perhaps better solution will be another .......
Thanks.

                             


 


Followup   April 22, 2004 - 7am Central time zone:

a single table seems better, unless I'm missing something here.  one table will perform better, be 
easier to insert/update/delete from and so on.   

3 stars OK   April 23, 2004 - 1am Central time zone
Reviewer: Ram from Bangalore,India
Hi Tom,
Any other way to put the following Query?

SQL> select deptno,dname,cursor(select empno,ename,sal from emp e where e.deptno = d.deptno)
     from dept d order by deptno;

Please do reply.
    



 


Followup   April 23, 2004 - 11am Central time zone:

there are an infinite number of ways to "put" the above query.

but if what you want is an order set of dept rows that each have a cursor that is the list of emps 
that work in it, you got what you need. 

3 stars OK   April 24, 2004 - 6am Central time zone
Reviewer: Ram from Bangalore,India
Dear Tom,
Thanks for your reply.But could you please some other ways
to put the above query *regardless of order by of deptno*?
Please do reply.
Bye!
 


Followup   April 26, 2004 - 5am Central time zone:

just outer join.  gives you an 'equivalent' answer, but not with a 'cursor' 

4 stars what to do with the generic data model   November 1, 2004 - 1am Central time zone
Reviewer: philip from Australia
Tom,

Back to your comment about generic data model.  I am a DBA and   i am supporting a database 
designed by our vendor who has a similar generic data model.  

Currently, the Object table has 97000 rows and the object attribute table has 9.1 million rows and 
attribute value table has 9.1 million rows.

Users are always complaining about how slow the system is.  From the Oracle performance manager - 
Top segments performance view, it shows there are high number of physical reads(in 1.4 millions) on 
Objects, Object attribute and attribute values tables.  Logical read is around 20 millions. 

What can i do to improve the performance of system ? 


Thanks

Philip 


Followup   November 1, 2004 - 3am Central time zone:

get another vendor or ask the vendor to solve the problem.

see the queries in my example above?  you cannot ask even the most BASIC SIMPLE QUESTIONS without 
performing a contorted query that even a yoga master would be impressed by.

(this is just a side note.  when we buy technology -- like a database -- we do things like "compare 
benchmarks", "benchmark the system", "make sure the selected database can handle our load with our 
required response times"

I know we do that -- cause part of my job is showing that time after time.

So, why is it when we buy an application the only thing we look at is:

o does the checklist satisfy our wish list more than the other guys
o are the screens really pretty and flashy

we don't seem to include most of the times:

o can it meet our performance goals
o does it use system resources in a reasonable fashion (eg: it isn't a sqlserver application that 
opens 20 databases connections for each "end user" even though it is running on Oracle and you know 
you don't do that on Oracle)
o does it take advantage of the underlying technlogy in a reasonable fashion

An application buy almost never involves the technology team -- yet the technology team is the one 
responsible for making it "go fast")

Just an observation -- I'll bet the technology team responsible for implementing and maintaining 
this system had very little to say about which vendor to use. 

4 stars what to do with generic data model?   November 2, 2004 - 1am Central time zone
Reviewer: Philip from Australia
Tom,

You are right, our tech team have no say about which vendor to use or to review vendor's design.


On the other extreme we have one vendor that promised to deliver a OLTP system running on our 
oracle database.  The first time i
received their SQL install script, it was horrible. It turns out that they developed the 
application on their prefer database 
sever MS SQL.  The install script they gave me was written in MS SQL T-SQL and they expect that to 
work on oracle without any changes. 
Then i told them that oracle sql syntax and ms-sql syntax are quite different, 
I also gave them a copy of Oracle 9i R2 installation CDs and asked them to test their software on 
oracle
before coming back to us for UAT.

Six months later, on the first week of UAT, the performance of the application was terrible.  There 
was one batch update function(update 15000 accounts) which took 5 hours to complete on our oracle 
database server. 
Our oracle database server isn't very powerful, it is SUN Fire V220R with 1 Ghz  CPU , 4 Gb of RAM. 
 Allocated db buffer 1Gb, shared pool 260M. 
The acceptable completion time for the batch update function is 45 mins.   When i traced the 
application(for a duration of 5 mins) while they were running the 5 hrs job i found that the VB 
application was sending
23000 unique sql statements to the db server.  I could see library cache latch contention, 
blocking session (they didn't index any of their foreign keys and there are 267 tables in their 
application), large amount of data is sent to the client (what i can tell is each account update, 
the VB app
will select all the information related to the account (involve 7 select stmts),process it on the 
front end and update the database.

Later on I implemented the cursor_sharing=force parameter and hope that will reduce library latch 
contention because there is no way they are going to change their code due to the
tight timeframe.  But it didn't improve anything.

My management was upset and dissatisfied with Oracle database performance.  I was able to convince 
them that it has nothing to do with Oracle performance but it is the
inability of our vendor to code their application to work with oracle.

The vendor's response to the performance problem was 
our database server is too low spec.  They said they tested their application on a 3Ghz dual 
processes Intel box running MS-SQL server, a high end client PC and it took them around 
37mins to complete the same function.  They also tested their application on a 3Ghz dual processes 
Intel box running Oracle 9i R2 and they said it achieved 1.30 hrs.  
I tried to put the oracle database on one of the high end(dual 3ghz and 1 Gb RAM) Windows 2003 box, 
but it still took 5 hrs to complete the batch update.  That makes me wonder whether the vendor was 
telling the truth about 
running their application on a oracle 3ghz intel box .
I told the vendor about hard parsing in oracle and the requirements to index foreign keys that 
involved joins.  But their response was MS-SQL server doesn't have hard parsing issue or 
performance issue with
non-index foreign keys.    I also read the ms sql server performance tuning documentation, it 
didn't really mention that database performance will be bad if application
didn't use bind variables.

At the end of the story is: management decided to ditch oracle and use ms-sql server because the 
vendor cannot write program to work with oracle db.   Now i am learning ms-sql server...

I have the following questions:
1. Do you think the hard parsing and the un-index foreign keys are the major cause of the 
performance issue? 
2. Do you think it is bad design for application to send 23000 statements to the database and they 
should be replaced with stored procedure?  The vendor is
reluctant to use stored procedure because they are going on about the database independent BS 
again.
3. Also the vendor use a 3 tier approach but really a 2 tier implementation.  The application 
server was doing nothing while the client PC and database was doing 
most of the work.  The vendor also believe that there is no need to handle transaction locking in 
their application because that's the job of the 
application server(they are using MS 2003 server as application server) to handle locking and 
serialization.   I don't think the vendor is correct.  What is the job the application server in 
relates to transaction lockings and other functions?
4.  We now have MS SQL server in place instead of oracle, the batch update function is taking 
37mins to complete. In SQL server, there is no need to configure
database buffer cache, shared pool size etc as they are configured automatically by the engine. 
There is not alot of monitoring i can do on the SQL Server to see whether it has 
bind variable issues.  Is there a major difference between SQL Server and Oracle on how they handle 
parsing/generating execution plan and reuse of sql statements?   
5. Do you think the big performance difference between Oracle 1Ghz UNIX box and
 3 ghz dual processes intel MS SQL box was because parsing/generating execution plan uses alot of 
CPU and if we have a faster machine definitely it will be faster?






Philip
 


Followup   November 2, 2004 - 7am Central time zone:

Your managment is the problem here.  How could they possibly negotiate a contract whereby the 
vendor didn't even develop on the database they targeted?  This is primarily their (mgmt) fault in 
my opinion.  They do not seem to know how to procure software.

there is zero percent chance that an application designed for database X will perform adequately on 
database Y.

That is, if you wanted sql server and they knew only oracle -- it would fail just as miserably.


the locking models between the two databases -- TOTALLY different, wildly different.

the way you program sql server is different from the way you program oracle is different from the 
way you program sql server.

1)
I don't like to speculate -- i hate to guess.  could hard parsing and unindexed fkeys be an issue?  
probably - but was this a multi-user situation?  In a single user mode, easily 2/3rds of the 
runtime of an INSERT of a single row could be parse time (if it takes 1 second to insert -- .66 
seconds would be parse time, .33 would be actual useful code).  In a single user mode, the 
unindexed fkeys would not be relevant.

2) 
to me the BEST and ONLY way to be DB independent is to put 100% of your code into stored procedurs.

why -- cause we lock different, cause we have different features (if you want a bill of 
materialials explosion in Oracle -- use connect by, in sql server -- use temporary tables and a 
recursive function), case we are different.

that and you'll never find an individual with the depth of knowledge on oracle, sqlserver, db2, 
informix, sybase, whatever to make a truly "database independent, works the same in all databases 
and works well in all databases".  You will however find many people proficient in sqlserver OR in 
oracle OR in db2 -- and they can code the perfect transactional api in one of the databases.  Of 
course that means you actually need to have an up front design, some specifications, a clue..... 
(the implementors do).


and if you contracted this vendor to build you a CUSTOM SYSTEM to spec -- why do you care about 
their "database independent needs", they quite simply are not your concern.  Your mgmt should have 
actually litigated perhaps to get their money back (but they probably never negotiated any sort of 
service aggreement either....)

3) transactions and locking are the job of the database

4) huge difference - they do not have a shared pool at all.  they are not built for many people 
working simultaneously in the same database.  

doesn't 37 minutes seem like an absurdly long time?

5) no guessing.

 

4 stars Role of Application Server   November 12, 2004 - 4am Central time zone
Reviewer: Philip from Australia
Thanks for your reply.  


You mentioned about "the locking models between the two databases -- TOTALLY different, wildly 
different.

the way you program sql server is different from the way you program oracle is 
different from the way you program sql server.  "

I had a discussion with the developers who have more knowledge of MS SQL Server than Oracle.  I 
disagreed with one comment made by the developers about the need to understand database
locking model when Application Server is used.  

They are saying if you program in 2 tier model then whether using MS SQL or Oracle, the programmer 
must understand how database 
handles locking.  But if you program in a 3 tier model, the programmer don't have to understand 
about how database 
handles locking because it is the job of the Application server which handles these tasks for you.

Are they correct?


Does SQL server 2000 really supports row level locking, is it as good as Oracle?

What kind of features does the Oracle Application server provides in terms of Database services?  
And what advantages does it have over some open source AS like JBOSS?







  


Followup   November 12, 2004 - 7am Central time zone:

laughed so hard I fell right out of chair

<quote>
ut if you program in a 3 tier model, the programmer don't 
have to understand about how database 
handles locking because it is the job of the Application server which handles 
these tasks for you.
</quote>

oh ho ho ho ho, that is really funny.

that developer has a thing or two or three left to learn. 

4 stars MS SQL Server does recommend bind variables   November 12, 2004 - 10am Central time zone
Reviewer: jim from Portland, OR USA
If one actually reads the MS SQLServer docs and with an eye to improving performance then MS 
recommends using bind variables. (same advice Oracle gives)  So the vendor not using bind variables 
should be shot. (metaphorically, I am not advocating actual violence)  It is amazing how many 
vendors who are "database independent" are "database ignorant".  I am working with a vendor now and 
their documentation is so full of mythology about Oracle best practices that it would have been 
better if they said nothing.  It would have been better if they pointed people to the Oracle 
documentation! (rebuild indexes frequently, put things in one extent, cost based optimizer builds 
indexes in one direction, and rule based builds them in another...)  This is a major CRM vendor 
that is traded on a major stock exchange! 


4 stars great joke that caused Tom to fall off his chair   November 13, 2004 - 1am Central time zone
Reviewer: Philip from Australia
Tom,

I was shocked and in total disbelief when i first heard the statement 'no need to understand 
database locking in 3 tier model' from this 'Senior' developer in my company.  This 'Senior' 
developer also had made the following bizzare statements about database programming:

1. Bind variables in Oracle programming - never heard of it and MS SQL server don't have this 
problem because it is so advance.   
2. integrity constraints - no need to enforce on database level because the application handles 
them already.  
3. Never have concurrency issues when you use all the default feature of the MS SQL database 
statements.  No need to worry about any extra feature(in regards to transaction locking etc) 
because the database  should handle them for you. There is no need to worry about how to specially 
code update/inserts and selects to avoid blocking in a multiuser environment.



When i confronted the 'Senior' developer regarding point 2.  The response was it is a nuisance to 
handle the database exception if integrity constraints were already enforced on the application 
level.  The application is coded so well that it checks for integrity violation before data is put 
into database.   

This arrogant 'Senior' developer doesn't think there could be bugs in the code.  How can any 
developer would think it is a nuisance to handle exception?

Another joke is this 'Senior' developer is developing(alone) a big J2EE application that is going 
to be used by ........ 2 persons.  This project has been going on for 1 year. 


Hope you like these jokes.




 


Followup   November 13, 2004 - 9am Central time zone:

1) sqlserver does not only have them -- they encourage them.

If the performance issue isn't enough to convince, please have their mgr do a search on google for

"sql injection"

(don't have them do it, let their boss do it -- get their boss very scared of sql injection -- as 
they should be -- and have them then ask this Sr developer why he isn't concerned as well)

2) that is the silliest thing I've heard.  It is impossible to enforce cross object (foreign key) 
or cross row integritry constraints without locking large sections of data or entire tables.  And 
what you have to lock differs by database (as we have fundementally different locking models)

3) cause they don't do real transactions by default -- as stated in many places.  Locks are a scary 
thing in sqlserver -- the database quickly grinds to a halt.  

He is right that you in general shouldn't have to worry about how to code update/inserts and 
selects to avoid blocking in a multi-user environment -- but you cannot be using sqlserver at that 
point in time....


"it is a nuisance to handle database exceptions if the integrity constraints were already enforced 
by the application" -- hmmmmmm and what exceptions would be raised by the database if the data were 
clean?  There wouldn't be any!

 

5 stars   November 28, 2004 - 9pm Central time zone
Reviewer: A reader 
Tom,

You often advocate the use of a single query instead of pl/sql if possible. Can the same apply for 
reports? A new index is being added to a table with slow reports. However, I am just looking for a 
quick pointer as to what kind of database features (pl/sql) should be used in reporting. Also, 
should any specific table/schema designs be considered? Thanks in advance as always. 


Followup   November 29, 2004 - 8am Central time zone:

my mantra is simple (and fresh in my mind as I just revised this chapter this weekend :)

*You should do it in a single SQL statement if at all possible.

*If you cannot do it in a single SQL Statement, then do it in PL/SQL (as little PL/SQL as 
possible!).

*If you cannot do it in PL/SQL, try a Java Stored Procedure.  The number of times this is necessary 
is extremely rare today with Oracle9i and above.

*If you cannot do it in Java, do it in a C external procedure.  This is most frequently the 
approach when raw speed, or the use of a 3rd party API written in C is needed.

*If you cannot do it in a C external routine, you might want to seriously think about why it is you 
need to do it...
 

4 stars Generic lookup model or not...   December 1, 2004 - 3pm Central time zone
Reviewer: A reader 
Hi Tom,

Today, we have a document table that has FK reference to 16 lookup tables (with ID and decription 
columns), meaning that to get the decription from the lookup tables I do a join with 16 tables. 
This joins works, but is not too fast though... 

Now the new requirents are to have dynamic lookup definitions. After talking with them they would 
actually accept 30-40 lookup tables (with 50.000 - 100.000 rows in each), the dynamic definition is 
not a must.

The problem is that I think that this join with 30-40 tables would be very slow, on the other hand 
the more generic model has the mentioned disadvantages... Could you give some suggestions on the 
design here? 

Thank you. 


Followup   December 1, 2004 - 7pm Central time zone:

what is a "dynamic lookup definition"?

do you get all of the attributes all of the time?  

is this lookup gone "mad"?  or are these reasonable lookups -- really needed ones (eg: you are not 
changing say M/F for male/female into a look right?) 

4 stars   December 2, 2004 - 4am Central time zone
Reviewer: A reader 
Thanks for the fast reply.

By "dynamic lookup definition" I meant that the user can define new lookup types, and today that 
would mean a new table that must be connected to the document table (new FK column).

<
do you get all of the attributes all of the time?  
>
Not all attributes are registered on all documents, but we have inserted a "default/empty" row in 
each lookup table not to outer join.

<
is this lookup gone "mad"?  or are these reasonable lookups -- really needed 
ones (eg: you are not changing say M/F for male/female into a look right?) 
>

Well, most of the data in these lookups come from ERP systems and are quite static (like accounts, 
currency type, departments etc), and you have lookup like "projects" which gets a new entry when a 
new project starts (this table is still ID and descr because the details about projects are held in 
a different system).

The typical use of the system is: a document of type invoice comes, you register the amount, etc as 
free text and then you have to register which account to use, which department and/or project will 
pay the invoice. The user should not be able to register the invoice on a non-existent project - 
that's the reason for the lookup. 


Followup   December 2, 2004 - 7am Central time zone:

what i meant by "do you get all of the attributes all of the time"

was -- ok, so you have a 16 table join, do you retrieve ALL of the attributes in your application 
MOST of the time -- or do you just get a couple here and there.  If you do not dereference 10 out 
of 16 of the lookups, you did 10 joins that were not "useful".  We can remove them.  but if you 
always "select * from this_nasty_join", you cannot.


 

4 stars   December 2, 2004 - 8am Central time zone
Reviewer: A reader 
I hope I understand you correctly... 

We use this functionality 3 places. In one module (an overview screen) we get data from 3 lookups 
and there the sql has only these 3 lookups.

In the othe 2 (document editor and archive modules) we get all of the attributes with one query. 
The "useless" lookups return "" for the value of "ID" and "Descr" but they are still in the query. 
So in 2 of 3 cases we all the atributes... 


Followup   December 2, 2004 - 11am Central time zone:

if you mostly get all of the attributes, you are in "join heck" -- not too much we can do about it. 

4 stars   December 2, 2004 - 1pm Central time zone
Reviewer: A reader 
So it wouldn't be better with one general and big lookup?

A solution I thought about was instead of this huge 40 tables join, select the references from the 
doc. table and build a batch of 40 sqls (send it as one job from the application or do it in a 
procedure on the DB) that would get the correct row from each table. These single selects would be 
very fast, but do you think it makes sense?

Thanks for your time. 


Followup   December 2, 2004 - 7pm Central time zone:

you still have a 40 table join -- 39 of the tables have the same name is all.

you would have the hottest table on the planet and it would crush you.  


And these lookups still need to go to 39 real tables do they not.  with all of the department 
information, project information and so on. 

5 stars Very useful!   December 2, 2004 - 4pm Central time zone
Reviewer: A reader 
This is a extremely useful discussion on design issues. Top-notch! Thanks, Tom. 


5 stars   December 3, 2004 - 5am Central time zone
Reviewer: A reader 
Case 1.
<
you still have a 40 table join -- 39 of the tables have the same name is all.
>
Sorry, I didn't explain what I thought here. I agree, if the resultset would have the same 
structure as today than I agree, you still need the 39 "tables" except that each of these tables 
have 40 time more data in it. So it's a brainded sollution :). But, to avoid the big join I would 
define a new couplings table with basically 2 FK columns: docID and lookupID (the PK column in he 
big lookup) and one row for each defined lookup on a document (eg 40 row/doc). I would change the 
code to handle that now I get 40 rows and not on row/doc with 40 columns... I would expect that 
this would be fast on "get all data on document X" but would suck on "get all documents with 
department Y or project Z" and reports etc. Correct? 
So I guess this is not an aproach for "real life" apps...

Case 2.
<
And these lookups still need to go to 39 real tables do they not.  with all of the department 
information, project information and so on.
>
Yes, there would be 39 real tables, I would just release the db doing the big bada join but I would 
have to merge 2 result sets in code. I know that you generally mean that databases are built for 
joins but is that still correct for this join? I remember that a while ago we had problems with 
Sybase because they had a 16 table limit on joins...  

So it would be:

Sql1.
Select lookupID1, lookupID2, ... from doc where docId=X

I get lookupID1, lookupID2 in code and build an sql like this:

select lookupDesc1, lookupDesc2, ... from lookup1, lookup2, ... Where lookupID1=X and lookupID2=Y 
and etc. (lookupIDs are PK of course)

The question is if this approach would be faster than the join. It's easy to answer that the join 
is way faster if you have 3 row is the doc table and 10 in the lookups, but is this still correct 
for 10+ million rows? 


Followup   December 3, 2004 - 8am Central time zone:

now you are talking "generic funky data model" -- i wrote about that in "Effective Oracle by 
Design".  ugh. 

5 stars significant CPUOTHER utilized query on statspack report   December 22, 2004 - 5am Central time zone
Reviewer: SREENIVAS from INDIA
hi tom,

can you write this query in effective way on 9.2.0.1.0 datbase.

SQL Statistics
~~~~~~~~~~~~~~
-> CPU and Elapsed Time are in seconds (s) for Statement Total and in
   milliseconds (ms) for Per Execute
                                                       % Snap
                     Statement Total      Per Execute   Total
                     ---------------  ---------------  ------
        Buffer Gets:     170,480,244        557,125.0   15.81
         Disk Reads:         145,721            476.2    3.28
     Rows processed:          22,983             75.1
     CPU Time(s/ms):             985          3,219.1
 Elapsed Time(s/ms):             971          3,172.1
              Sorts:             918              3.0
        Parse Calls:             306              1.0
      Invalidations:               0
      Version count:               1
    Sharable Mem(K):              72
         Executions:             306

SQL Text
~~~~~~~~
SELECT    distinct A.COMP_APPL_ID AS "App No",   A.APPDATE AS "A
pp Date",   B.CUSTOMERNAME AS "Customer Name",   D.DESCRIPTION A
S "Product",   D.SUBTYPEDESCRIPTION AS "Sub-Product",   C.DESCRI
PTION AS   "Location",   E.BPNAME AS "Sourced By" FROM    LOT_CU
STOMER_T B,   LOT_GLOBALLOAN_T A,   COMPANY_GENERIC C,   PRODUCT
 D,   BP E,   LOT_WORKFLOWSTAGE_DTL F WHERE    A.COMP_APPL_ID =
B.COMP_APPL_ID(+) AND   A.SRC_BPID=E.BPID AND   C.GENERICID = A.
CG_LOCATIONCODE  AND   B.GUAR_COAP_FLAG(+)='P'  AND   D.PRODUCTC
ODE = A.PRODUCTCODE  AND   NVL(A.PRODUCTSUBTYPE,'NULL') = NVL(D.
PRODSUBTYPECODE,'NULL') AND   A.COMPANYID=1000 AND   A.COMP_APPL
_ID = F.comp_appl_id AND    F.EMPLOYEECODE='10030'  and F.STAGES
TATUS='P' AND F.STAGE='DDUPDEDS' order by 1

All Optimizer Plan(s) for this Hash Value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shows all known Optimizer Plans for this Hash value, and the Snap Id's they
were first found in the shared pool
-> ordered by Snap Id

    Plan
 Hash Value  Snap Id        Cost Optimizer
------------ -------- ---------- --------------------
  3821711819      120       533  CHOOSE
  3821711819      126       540  CHOOSE
  3821711819      129       546  CHOOSE

-> Rows indicates Cardinality, PHV is Plan Hash Value
-> ordered by Plan Hash Value

--------------------------------------------------------------------------------
| Operation                      | PHV/Object Name     |  Rows | Bytes|   Cost |
--------------------------------------------------------------------------------
|SELECT STATEMENT                |----- 3821711819 ----|       |      |    533 |
|SORT UNIQUE                     |                     |     1 |  261 |    527 |
| NESTED LOOPS                   |                     |     1 |  261 |    520 |
|  NESTED LOOPS OUTER            |                     |     1 |  221 |    519 |
|   NESTED LOOPS                 |                     |     1 |  179 |    517 |
|    NESTED LOOPS                |                     |     1 |  139 |    516 |
|     MERGE JOIN                 |                     |     9 |    1K|    489 |
|      SORT JOIN                 |                     |   409 |   18K|     12 |
|       TABLE ACCESS FULL        |PRODUCT              |   409 |   18K|      2 |
|      SORT JOIN                 |                     |     3K|  259K|    478 |
|       TABLE ACCESS FULL        |LOT_GLOBALLOAN_T     |     3K|  259K|    431 |
|     TABLE ACCESS BY INDEX ROWID|LOT_WORKFLOWSTAGE_DT |     1 |   20 |      3 |
|      INDEX RANGE SCAN          |LOT_WORKFLOWSTAGE_IN |    48 |      |      2 |
|    TABLE ACCESS BY INDEX ROWID |COMPANY_GENERIC      |     1 |   40 |      1 |
|     INDEX UNIQUE SCAN          |COMPANY_GENERIC_PK   |     1 |      |        |
|   TABLE ACCESS BY INDEX ROWID  |LOT_CUSTOMER_T       |     1 |   42 |      2 |
|    INDEX RANGE SCAN            |LOT_CUSTOMER_T_INDX1 |   100 |      |      1 |
|  TABLE ACCESS BY INDEX ROWID   |BP                   |     1 |   40 |      1 |
|   INDEX UNIQUE SCAN            |BP_PK                |     1 |      |        |
--------------------------------------------------------------------------------
could you specify whether indexes are useful or not for these query. 
moreover parameter values 
optimizer_index_caching=0
optimizer_index_cost_adj=100

regards
SREENIVAS
 


Followup   December 22, 2004 - 9am Central time zone:

thats alot of working getting done in 3 seconds isn't it.

how many rows does this query return.

you need to look at the tkprof as much as the explain plan. 

5 stars significant CPUOTHER utilized query in responsetime=serivcetime+waittime   December 23, 2004 - 1am Central time zone
Reviewer: SREENIVAS from INDIA
Thanks for your quick responce.
pls check the tkprof of this query with different sessions.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.24       0.21          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        2      0.48       0.48          0      10534          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6      0.72       0.69          0      10534          0           0

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30  (SCB)

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  SORT UNIQUE
      0   NESTED LOOPS
      0    NESTED LOOPS OUTER
      0     NESTED LOOPS
      0      NESTED LOOPS
    201       MERGE JOIN
    201        SORT JOIN
    201         TABLE ACCESS FULL LOT_GLOBALLOAN_T
    201        SORT JOIN
     32         TABLE ACCESS FULL PRODUCT
      0       TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
    191        INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
      0      TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
      0       INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
      0     TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
      0      INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
      0    TABLE ACCESS BY INDEX ROWID BP
      0     INDEX UNIQUE SCAN BP_PK (object id 23944)
===============================================================================================
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.16       0.15          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        8      7.15       7.00        787    1119033          0          74
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       12      7.31       7.16        787    1119033          0          74

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30  (SCB)

Rows     Row Source Operation
-------  ---------------------------------------------------
     20  SORT UNIQUE
     54   NESTED LOOPS
     54    NESTED LOOPS OUTER
     54     NESTED LOOPS
     54      NESTED LOOPS
 187862       MERGE JOIN
     32        SORT JOIN
     32         TABLE ACCESS FULL PRODUCT
 187862        SORT JOIN
 187862         TABLE ACCESS FULL LOT_GLOBALLOAN_T
     54       TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
 178625        INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
     54      TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
     54       INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
     54     TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
     54      INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
     54    TABLE ACCESS BY INDEX ROWID BP
     54     INDEX UNIQUE SCAN BP_PK (object id 23944)
=================================================================================================
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.24       0.23          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        8      7.06       6.95        157    1119054          0          76
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       12      7.30       7.18        157    1119054          0          76

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30  (SCB)

Rows     Row Source Operation
-------  ---------------------------------------------------
     20  SORT UNIQUE
     56   NESTED LOOPS
     56    NESTED LOOPS OUTER
     56     NESTED LOOPS
     56      NESTED LOOPS
 187862       MERGE JOIN
     32        SORT JOIN
     32         TABLE ACCESS FULL PRODUCT
 187862        SORT JOIN
 187862         TABLE ACCESS FULL LOT_GLOBALLOAN_T
     56       TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
 178625        INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
     56      TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
     56       INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
     56     TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
     56      INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
     56    TABLE ACCESS BY INDEX ROWID BP

pls advice to tune this query further..

from 
SREENIVAS
  


Followup   December 23, 2004 - 11am Central time zone:

investigate a better indexing strategy.  you see your index is

     56       TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
 178625        INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)

you get 178,625 rows from the index, went to the table and found that only 56 of them applied.  
perhaps your index needs to be "fattened up" to include more relevant columns. 

5 stars   December 24, 2004 - 6am Central time zone
Reviewer: sreenivas from india
thanks a lot. 


5 stars Funky model or not?   January 19, 2005 - 8am Central time zone
Reviewer: Mikael from Sweden
Tom,

I am modelling a database that will be used to format certain reports - and every report's look is 
dependent on each contract from each customer.



Anyway the problem is that this is what I got (it is hard to describe, but here is a go):

500,000 unique products that each have 60-70 attributes out of 1500 possible attributes (varying 
types, and overall no attribute is used more than others).

Each attribute have some describing data (of course): unit of measure,  definition, etc.

In addition each product belong to a classification according to 6 of the attributes - and 
depending on this classification:
- the attribute description (in different languages) may change.
- some of the attributes also have defined values that it can take - with a corresponding refined 
description (in different languages). 



This is what I need to do:

I select a couple of attributes that a product must have and a classification it must conform to. I 
receive one product (in this example).

So I have a product which happen to have a classification and 61 attributes. At least one of these 
attributes has a classification matching the product classification - hence I need to select that 
description in the requested language for this product. It also has a defined value for the 
classification it has - so I also need to select the description for that defined value in the 
requested language. Of course I also need to select the attribute describing data.

Depending on the customer contract I shall return some describing data and some or all of the 
attributes depending on the product selected.



I am thinking a view based on two tables with attributes (since number of colums is more than 1024) 
for the product table and an attribute table with attribute describing data and possibly one or two 
subtables with descriptions/classification and values/classsification respectively. (Finally 
possibly a customer contract table for determining what data should be selected (i.e. what 
attributes, language and describing data that should be delivered for a certain product) - if this 
should be done in the DB and not in the final stages of our report construction widget).

Monsterous!

Any non-funky suggestions?

Thanks,
/Mikael 


Followup   January 19, 2005 - 10am Central time zone:

i can only re-iterate the caveat above about generic data models and their lack of maintainability 
(one of their supposed BENEFITS!?!), their lack of performance, their lack of flexibility when 
querying (again, one of their supposed benefits?!)

...
500,000 unique products that each have 60-70 attributes out of 1500 possible 
attributes (varying types, and overall no attribute is used more than others).
.......


you really have 500,000 subtypes?   truly?

...
In addition each product belong to a classification according to 6 of the 
attributes - and depending on this classification:
......

seems to counter that -- sounds like a structure is starting to form there....   

5 stars What about Oracle Objects?   January 19, 2005 - 11am Central time zone
Reviewer: Kerry from Austin, TX USA
Would you recommend using Oracle's object-relational capabilities for the kinds of data discussed 
here (i.e. data requiring extensibility or variant subtypes)?

I am currently using XML in a product catalog model, and was happy to see your endorsement, albeit 
guarded, of the XML approach in some of the responses here.  We chose CLOB storage for the XML, and 
have VERY few function-based indexes to search within the XML.  As you suggest, we have "real" 
database columns for some frequently searched attributes. Our particular needs for the attributes 
stored in the XML are primarily BROWSE rather than SEARCH, making the problem a little more tame. 


Followup   January 20, 2005 - 9am Central time zone:

http://www.oracle.com/technology/oramag/oracle/03-jul/o43xml.html

3 stars Nice   January 21, 2005 - 1pm Central time zone
Reviewer: Catherine from OH,USA
Hi Tom,
Sorry for the disturbance.But I need a query which lists the number of employees working under an 
employee. I mean the query output to be in the format of

Mgr_ename        Mgr_empno   count(*)
---------        ----------  -------
....               ... ..       ...


This is what I have tried.

SQL> select ename,empno from emp
  2  where empno in( select mgr from emp)
  3  /

ENAME           EMPNO
---------- ----------
FORD             7902
BLAKE            7698
KING             7839
JONES            7566
SCOTT            7788
CLARK            7782

6 rows selected.

I need that Count column.Could you please provide the query?

If Possible I would like you to replace count column with
the list of enames.Please help
Bye
 


Followup   January 21, 2005 - 7pm Central time zone:

select mgr, count(*) from emp group by mgr;

gives you the count by mgr and


select emp.ename, mgr.*
  from emp, (select mgr, count(*) from emp group by mgr) mgr
 where emp.empno = mgr.mgr
/

gives you the name of the mgr as well. 

4 stars   February 11, 2005 - 10am Central time zone
Reviewer: A reader 

> you really have 500,000 subtypes?   truly?

Actually it seems as if some of the products conform to the classification but far from all  - so 
at best I'd still have 250,000 subtypes... and now I hear that some new products will be adding to 
the attributes or when they buy companies into their concern with products not conforming to their 
attribute catalogue will be adding 100s of attributes to the data structure ...

I guess what I was fishing for here is: How can you handle companies with these vast classification 
models.
Btw are they used to match 2 (3) tables to get the products - one table with values and one table 
with attribute names and unit of measure (and a third describing the attribute itself and depending 
on the product classification (can hold wild cards) a description).

My thought was that I'd move away from that design to get better performance, but now it feels as 
if I am wondering in circles...

Any ideas on how to handle these kinds of records with 1000's of possible attributes without 
picking up the non maintainable "object" model? (with Oracle). Say that I'd like to search among 
the 500,000 products "WHERE (ATTR143 > 4 OR ATTR567 < 100 OR ATTR1789 < 100) AND ATTR50 = 'AAAA'

Brgds,
/Mikael
 


Followup   February 12, 2005 - 7am Central time zone:

what possible application could use this?  I cannot fathom one.  So, the stuff must be so generic, 
so un-applicable to programming with -- that this information is either used ONLY when looking at 
the object itself (eg: it need not be structured, no one uses it).

so, maybe you can take the N attributes that are consistent -- you will find there are dozens of 
them (really) and glob the rest into an xmltype and forget they exist.  If someone truly wants 
them, they can use extract to get them.


If you have over a 1/4 million unique sets of attributes -- I cannot imagine anyone using that 
query you propose.  Like 2 rows would even have those attributes (eg: they already know precisely 
the product they are going after) 

3 stars Limit on number of columns per table?   February 18, 2005 - 5am Central time zone
Reviewer: Matt from Australia
I was recently asked whether there was any physical limit on the number of columns allowed per 
table.  Having checked this in the reference doco I see that there is a documented logical limit of 
1000 columns per table. 

What is it that restricts the number of columns to this figure? 

Obviously having many columns per table may impact the efficiency of DML. I would say that an 
entity should have as many columns as makes sense within the definition of the business problem. 
However, my friend was wanting to create a table with 10,000 columns. Fundamentally, I see all 
kinds of issues with this surrounding the definition of the business problem he is trying to solve 
and the maintainability of any system created like this (and many others). 

However, ny friend says that this is a small part of a "temporary" stop gap solution required make 
a prototype "work".

I'm interested to see what response you have to this, it seems to me that there is a very 
shortsighted attitude from the business decision makers on this one.

Regards, 


Followup   February 18, 2005 - 8am Central time zone:

that is a physical limit, the logical limit is alot smaller :)

and it used to be 255, 1,000 is relatively new.


It is just a reasonable upper bound.

10,000 columns!!?!?!  

prototypes are supposed to proove the viability of something, this would be an anti-prototype.  a 
prototype that does something totally wrong but appears to succeed will only make them think "this 
might be a good thing"

they can use varrays (really bad idea too if you ask me), but I'd rather have someone who does data 
modelling sit with them for an afternoon and straight this out -- get the real reqirements and 
explain the right way. 

3 stars 9 days too late   March 24, 2005 - 5am Central time zone
Reviewer: A reader 
happy birthday anyway and best wishes
for the next decade 


5 stars and what's wrong with that? i'd like to know...tutudududu   April 10, 2005 - 5am Central time zone
Reviewer: k. 
<quote source="your original reply">
  <part1>
    You are trying to put a generic 
    layer on top of a generic layer
  </part1>
  <part2> 
    and it fails each and every time except for 
    the most trivial of applications.
  </part2>
</quote>

part1:
------
  but you see that happening all the time, don't you?
assembler is a generic layer on top of computer hardware. 
what do we need c++ for?

  and heeey - what do we need pl/sql for? it cannot compete with c++ (as far as performance and 
flexibility are concerened)!!
 
  well maybe pl/sql cannot do all those things that c++ is capable of, but certain people (and they 
are not few) are willing to give up all that flexibility and performance to gain something else. 

  i am a developer - my task is to do my best to fulfil the wishes of my clients. they say "we need 
this and this"
and i have to find (what i think is) the best way to achieve that. oracle is here to serve me - i'm 
not here to serve oracle. and indeed oracle serves me well - does not even complain when i only use 
a subset of its [his/her? :-)]  capabilities - only those that i do need.



part2:
------
http://asktom.oracle.com/pls/ask/f?p=4950:8:1676653721052884756::NO::F4950_P8_DISPLAYID,F4950_P8_CRI
TERIA:38264759390157
<quote source="another thread, a few days ago">
You'll live to regret that generic model, every time I've used them every 
time (and I have, twice, for real applications, 15 and 13 years ago, never 
again)
</quote>

well, it seems to me that the problem lies here:
"15 and 13 years ago"

using java 20 years ago (had it existed) to do things it does today would have come to a similar 
end.


computers are getting faster and faster. there are many ways to get advantage of that - we can just 
let it imporove performance of our applications, but we can say
"hey - our application runs pretty well, what if we use hardware performance gain to add new 
features - something to satisfy our clients - because it's them who we do this all for - we make 
their lives easier , so why not to give them some funky functionality, now that there's enough 
resources?")

typically we try to get the best of both approaches - improve the performance + add functionality, 
you would have not imagined possible in real world a few years ago. 


Followup   April 10, 2005 - 10am Central time zone:

You see, I'm all about the data. 
Applications come, and they go.  
The data stays forever.


Relational databases haven't changed radically in 15 years.  The problem with the model of 15 years 
ago exists today.

We will have to agree to disagree, I doubt in this case either of us will convince the other ;) 

5 stars You are so right.   April 10, 2005 - 7pm Central time zone
Reviewer: Vinayak from NJ, USA
Tom, you are so right about the design. 

In one of the 3rd-party applications I've worked on, this was a performance killer. In their 
application, what you do is that you create your product catalog defining the products and then for 
each product, you define its attributes. To some extent, I understand the need for them to have a 
flexible data model to support this. So they had a table that contained each row for each product 
attribute.

But when we started writing queries/reports, a simple query was involving a 10-way self-join and 
the performance was not acceptable. If you were searching for just one order, it was taking about a 
minute or so which was not acceptable at all.

So what I did was that I created another table that has all the products' attributes as columns in 
this table, initialized it with existing data and then write a trigger to populate it automatically 
through the application. The performance impact of this trigger on the INSERT was neglible, not 
even noticeable, but the performance gain while quering data was HUGE. The query that used to take 
60 seconds was running now in a fraction of a second.

 


5 stars Surrogate Key   December 7, 2005 - 11am Central time zone
Reviewer: A reader 
At what point would you consider using a surrogate key?  You mention if the data is large, but 
would you consider a VARCHAR2(20) field too large (it will be used in tens of millions of rows)?

So, we'd have a table LOOKUP( id NUMBER, key VARCHAR2(20) ) and do lookups when we needed the 
"natural" key.  Now, though, you've got me thinking this might be the wrong approach.

Any thoughts? 


Followup   December 8, 2005 - 1am Central time zone:

varchar2(20) is tiny.

I would, assuming it is not subject to update, use the natural key.  No good reason not to 
(assuming it is not subject to update) 

5 stars Where to draw the line   March 16, 2006 - 2am Central time zone
Reviewer: Loz from Melbourne, Australia
Hi Tom,
I've got a situation that is crying out for the "generic database".  The rest of the database is 
properly modelled and this is just one small aspect of it.  I am reluctant to go down this path but 
can't think of how to model it any other way.

Consider a system where there are some products and people.  Each person on the system has some 
basic, common information about them as well as some additional information that needs to be 
recorded against them when they buy a given product.  For example if they buy product X it requires 
to know their blood group.  Product Y requires the number of bikes they own.  Each product is 
essentially the same thing just with extra information recorded against the purchasers.  This extra 
information is simply used for reporting groups and possibly OLTP searches, although it is very 
likely we will already know the PK of the person in almost every OLTP situation.
My initial thought is that I should have these tables

product
extra_data_types
extra_data_types_allowed_for_a_product
extra_data_for_person_for_product
person
person_to_product

Another problem is if we needed to sort or compare them we would have to have a pretty messy bit of 
SQL (or mangled copy of the data) to determine a suitable value to compare if they are not intended 
to be interpreted as strings (e.g.  2<100 but '2'>'100')

A variation would be to have several reserved columns for various types but I can't see how to do 
this without dynamic SQL which I want to avoid.

One of the goals is that we can add an extra product and additional attributes simply by adding 
data and not changing the data model.  The meta data would allow us to build a screen & reports 
based upon the required extra data.  I know this sounds trivial but it really would be questionable 
as to whether it would be worth our while if we had to re-code anything to accommodate each new 
product (or rather a variation in the extra data).

I know you don't agree with this approach, but how do you think the best way of dealing with this 
would be given there will be only a few items of extra data per product?

I have tested this with some level of success with about 300K people but need to go to about 1M.  
Seems ok so far but it's a very basic test

Thanks.
 


Followup   March 16, 2006 - 11am Central time zone:

how many of these additional attributes do you have?   

5 stars Answer to above   March 16, 2006 - 5pm Central time zone
Reviewer: Loz from Melbourne, Australia
I envisage there will be only a handful of additional attributes per product (say max 5) and not 
all will be mandatory.  So say max 5 million rows for 1M people.
 


Followup   March 17, 2006 - 4pm Central time zone:

so, just add them as columns - you can add 5 columns easily. 

4 stars Query on Design   April 18, 2006 - 8am Central time zone
Reviewer: thirumaran from INDIA
Hi Tom, 

I am designing a model which should be very flexible to support dynamic fields. The user should be 
allowed to create any no of attributes for the template. 

My recommendations are we will have one master table and have 2 child tables to support dynamic 
fields 

Note: VC- varchar datatype
master table : Company_table
COMP_ID VC(32) not null - automatic DATABASE generated
COMP_PICT BFILE
COMMENT CLOB

DYNAMIC_FILED_TABLE1 (child for mater table COMPANY_TABLE)
DYNA_FIELD_ID VC(32) not null - automatic DATABASE generated
FK_COMP_ID 
COLUMN_NAME VC(255) NOT NULL,
COLUMN_DATATYPE VC(30) NOT NULL

DYNAMIC_VALUE_TABLE2 (child for table DYNAMIC_FILED_TABLE1)
DYNA_VALUE_ID VC(32) not null - automatic DATABASE generated
FK_DYNA_FIELD_ID 
COLUMN_VALUE VC(255) NOT NULL,

i am having a separate dynamic_filed & dynamic_value table for every master table which needs to 
support dynamic attributes creations. 

1 )few colleagues are debating these dynamic tables should be a common one for all the tables.(i.e) 
one child multiple parents 

2)few colleagues are debating these 2 dynamic tables should be combined into one .for every master 
have one dynamic table instead of 2 tables (1 for column (name/datatype) and 2-> for values)

Tom which is the best approach any pro's and con's

A) have one common table for all master table (1 child multiple parents)
B) have one master and 2 child tables (one for column name + dataype & one to store values)
C) have one master one child to store the dynamic attributes with data 

Thanks in adv
Thirumaran 


Followup   April 18, 2006 - 10am Central time zone:

... I am designing a model which should be very flexible to support dynamic fields. 
The user should be allowed to create any no of attributes for the template.  ...

ugh. I guess you didn't read my response above? 

4 stars query on design   April 19, 2006 - 7am Central time zone
Reviewer: thirumaran from INDIA
Tom, 

I got it the walked through this entire page and i understood the Advantages & Disadvantages on 
this design.

Thanks for clearing my doubts.

i have just one question 
is it recommended to have multiple parents for one child , what are the disadvantages. 
Can we maintain all FK's for these parent tables , does RDBMS support this.

Thanks in adv
Thirumaran

 


Followup   April 19, 2006 - 9am Central time zone:

the database enforces referential integrity - be that to a single parent (eg: emp.deptno -> 
dept.deptno) or to multiple (add the necessary foreign key from emp.mgr -> emp.empno).  

 

4 stars Multiple Parent for a child Design Consideration   May 9, 2006 - 7am Central time zone
Reviewer: Sujit Mondal from India
Hi Tom,
   Can you please give me your Input for following Data Model.
Parent Tables : Week (week id , start dt , end dt ...)
                Month (month id, start dt , end dt ...)

Child Table : Event
              Columns
              event id ,
              week id (FK to Week)
              Month id (FK to Month)
              ....
              ....
In any situation the rows in Event table will either be the child of Week or Month , so for a 
record either week id or Month id will be Null.

Is the above an acceptable design ? or shall we go for week_event and month_event separate table? 
All other attributes of Event table are same. 


Followup   May 9, 2006 - 8am Central time zone:

it is acceptable to have a single table have "more than one parent" table it is related to.  

Here, parent/child is sort of a misnomer.  It is really more like a lookup or a dimension - not 
like ORDER and ORDER_LINE_ITEMS in the "classic" sense.



but if week and month look like you have them looking - not sure why they are "different" 

4 stars Why optional FKs?   May 9, 2006 - 10am Central time zone
Reviewer: Gabe 
What I don’t understand is why an Event belongs to either a Week or a Month. True, the Week and 
Month time intervals are likely not aligned (a Week may overlap two Months), but they both provide 
full coverage of time. Hence, for any Event one should be able to specify the Week and the Month; 
those FKs should be mandatory … unless?? 


Followup   May 9, 2006 - 11am Central time zone:

(i didn't understand the need for two dates tables myself - seems they are "periods of time") 

5 stars Tom Kyte: Query on Design   August 21, 2006 - 6am Central time zone
Reviewer: Arjan Hölscher from Europe
great review on database development 


5 stars How do you suppose Amazon does it?   September 13, 2006 - 3pm Central time zone
Reviewer: Aaargh! from Netherlands
Great discussion on DB design. I have been creating large databases in MS SQLserver for about 8 
years now. Twice have I used the discussed 'generic data model' structure and both times it killed 
the performance of the application. 

I am now a guru on creating the required, insanely complex SQL statements to gather usefull 
information from the 'generic data model'. You DON'T want to select a set of records filtered on 
multiple 'dynamic columns', sorted by multiple 'dynamic columns' en include paging. 

When you are designing, for example, an application which has to manage ID-cards (all have 
different information on them) and the card-owner information associated with each card, it looks 
like a nice solution at first. 
The customer has a nice interface to create new cards and cardfields and there is one unified way 
the software can handle all card information. 
BUT you end up with an database that doesn't scale at all(!) and drives you nuts when trying to do 
anything 'non trivial' with it.

I do wonder how a company like Amazon solves this problem of variable object properties. If you 
take the products they are offering. Most of them share properties like 'price' but how do they 
deal with the product-category specific properties?

Something like this?:

Product
  ProductPK
  Price
  ProductCategoryFK
  ..

ProductCategoryPK
  ProductCategoryPK
  Name
  ..

BookProductInformation
  ProductFK
  ISBN
  Title
  ..

CDProductInformation
  ProductFK
  Title
  NumberOfDiscs
  ..

Tom, what's your idea about their possible solution?  


3 stars database design multiple parents   February 4, 2008 - 9am Central time zone
Reviewer: Kez from France


I have a question concerning generic vs non-generic design. I understand the reasons you have given for not making a design too generic, but I'm not sure if that's what I'm trying to do. I think that this is relevant to the original question, regarding database design and avoiding generic tables, but if you think it's a new question - please forgive me

I have tables for employee, company, contact, etc. (which do have different columns and cannot be combined - well, they could but there would be a lot of redundant columns in the 'generic' table) and they can all have one or more addresses, so I am facing the decision between tables for employee_address, company_address, contact_address, etc. or one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' column which would define whether the parent table was that of employee, company, etc. Hence, I would have a 'generic' address table, but is this better or worse than 4 or more 'address' tables with the same columns but a foreign key pointing to only one table?


I am trying to find the advantages and disadvantages of each

Followup   February 4, 2008 - 4pm Central time zone:

that is not a generic address table (in fact, nothing you said above would be 'generic', none of it)

You have an entity. It is an "address"

many other entities - employees, contact, company - have a relationship to an address.

there only needs be a single address entity, it is a "master table" - a parent table - it is not "generic" at all. It is very special purpose in fact.

It only holds addresses - it is as specific as you get.



4 stars   February 5, 2008 - 3am Central time zone
Reviewer: Kez from France
Thanks! I was just worried about the multiple keys with the 'type' column. Now I just have to translate your answer into French for the rest of the team........

Followup   February 5, 2008 - 8am Central time zone:

I just re-read that.

...
I have tables for employee, company, contact, etc. (which do have different columns and cannot be combined - well, they could but there would be a lot of redundant columns in the 'generic' table) and they can all have one or more addresses, so I am facing the decision between tables for employee_address, company_address, contact_address, etc. or one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' column which would define whether the parent table was that of employee, company, etc.
...

that is wrong, you would have:
create table address (
id number primary key,
....
);

create table emp (
empno number primary key, 
...
address_id references address );

create table contacts (
....,
address_id references address );



You have the foreign key in the wrong table, it belongs in the emp, contacts and so on.

You just have ADDRESSES - period. Then other entities point to them.
5 stars Address table   February 5, 2008 - 5am Central time zone
Reviewer: Mike from Cleveland, OH USA
From these remarks:

> one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' 
column which would define whether the parent table was that of employee, company, etc. 

>I was just worried about the multiple keys with the 'type' column. 

it sounds like you are planning to have a multi-purpose column which may be an employee ID, or a 
company ID, etc. depending on the context. I would suggest you avoid that. Instead, have separate 
columns: an  Employee ID column PLUS a Company ID column PLUS (whatever else)... 

You can then have proper foreign-key constraints for these columns. You should not attempt to use 
these foreign-key columns in your primary key (as they will sometimes be null).


Followup   February 5, 2008 - 8am Central time zone:

Actually, I think we had the parent table backwards - the parent table here is address - it does not have any foreign keys

(and using separate columns would be a bad idea - sure, today you have emp, contacts, company - tomorrow you have those three AND something else)
4 stars   February 5, 2008 - 9am Central time zone
Reviewer: Kez from France
In fact, if each employee, company, etc. could only have one address, then I would agree, but we can't put the address_id on the employee table because there could be more than one (eg home address, working away from home address, etc.)

In any case, I explained to the rest of the team and they are going with the separate tables anyway *sigh* I tried and I learnt something, so I'll move on.

Thanks

Followup   February 5, 2008 - 10am Central time zone:

then you have a missing entity, a many to many relationship between emps and addresses


create table address( id number primary key, .... );
create table emp( id number primary key, ... );

create table emp_addresses( emp_id references emp, addr_id references address, primary key(emp_id,addr_id) );



4 stars   February 5, 2008 - 9am Central time zone
Reviewer: Kez from France
Oh, and I wouldn't have been using null columns in the primary key as each record would have it's own id, this would be a foreign key constraint

4 stars multiple parents   March 31, 2008 - 6pm Central time zone
Reviewer: Bob from harrisburg, pa usa
In the last post you said: 

then you have a missing entity, a many to many relationship between emps and addresses 

create table address( id number primary key, .... ); 
create table emp( id number primary key, ... ); 

create table emp_addresses( emp_id references emp, addr_id references address, primary 
key(emp_id,addr_id) ); 

----
Just for clarification, are you suggesting that a cross linked table be created for each entity 
that uses an address? (ie. contact_address, client_address, emp_address, etc...)


Followup   April 1, 2008 - 7am Central time zone:

depends on the model. But if Contact has addresses (plural) then yes, there would be a contact to address association object (so that foreign keys and all can be safely in place)
4 stars Percentile Caluclation   May 8, 2008 - 6am Central time zone
Reviewer: Vicky 
Hi Tom,

After reading the "question of the day" stuff, hats off to your patience.

I have a requirement is which I need to calulate percentiles day wise and month wise.

Currently we are planning for a job which would run in midnight and calculate percentile for 
previous day using percentile_cont function.
Everything looks to be fine because we are developing and testing on 1000 records.
I assume that percetile_cont function sorts internally for percentile calculation.Well in 
production we are expecting millions of records.

1) Dont you think this will affect my performance.I am going to benchmark it once my development is 
complete but would be great if you can provide some early thoughts.

Secondly we are informed that we can receive some latent records i.e. back dated records. In such 
cases we would have to recompute percetile for entire day and month i.e. look again for the million 
of records even if only one latent record turns up.

2) Is there any other strategy you know for computing percentiles?  

I am going to benchmark it once development is complete but i dont want to change my entire design 
if some performance issue turns up and hence requesitng your inputs.

Your inputs will highly be appreciated.

Regards


Followup   May 12, 2008 - 8am Central time zone:

... Well in production we are expecting millions of records. ...

We deal with millions of records all of the time. Give us sufficient resources (memory, cpu, IO) and - well - it is what the database does.


1) of course it will affect your performance, but the main thing here - you can and will get your answer - you need to have reasonable expectations of how long it will take to get it, but you will get it.

and if your data arrives lazily over time like that, yes, you'd have to recompute. Sometimes - that is just a fact.

2) since the insertion of a single record will potentially affect the percentile assigned to EVERY OTHER record, you'd need to recompute.


I'm not sure you want percentile_cont - that is an aggregate function:

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm#i1000909


PERCENTILE_CONT is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
4 stars Percentile Calculation   May 17, 2008 - 7pm Central time zone
Reviewer: vicky 
Hi,

Data is received from 3rd party application and we are told since they implement queue mechanism , 
we can receive latent records maximum of 5 days back. I have to design it accrodingly. Also I have 
to report my data next day morning.

I am planning to use bucketized approach.
I have SLA(Sevice Level Agrrement) defined for the incoming data i.e.
if transaction response time is < 1 sec, it has performed well and meeting the SLA.
if transaction response time is >= 1 sec and < 3 sec, it has failed and penalty needs to be 
computed.
if transactions response time is >= 3 sec, it has completely failed and maximum penalty needs to be 
imposed.

I will divide the range 1 to 3 in say 10 buckets and the gap between each bucket would be (3-1)/10 
= 0.2 seconds

1st bucket : count of messages with response time < 1 second and total count
2nd bucket : count of messages with response time < 1.2 seconds and total count
3rd bucket : count of messages with response time < 1.4 seconds and total count
4th bucket : count of messages with response time < 1.6 seconds and total count
.
.
last bucket : count of messages with response time > 3.0 seconds and total count

once that is done i can find out my 90% easily by applying count/total count. Wherever 90% is 
matched that is my 90%ile value.

Diadvantage of this approach : 
1) Will provide an approximate value and not exact value.
2) every record needs to be exploded based on the count of bucket.

Advantage :
1) Fast and can cater to latent records easily.

Your views on this approach will highly be appreciated.



Regards


Followup   May 19, 2008 - 3pm Central time zone:

does it meet your reporting needs - if so, it is something that can be maintained as data arrives, which is good.
5 stars Generic Design - Oracle Clinical   June 25, 2008 - 10am Central time zone
Reviewer: Manish from India
Hi Tom, 

Have been reading through this thread and find it a very informative set of exchanges between you 
and others on this page. 

I am working in the healthcare field and the kind of data that needs to be captured from one 
specialty to another varies. There are a small set of values like the chief complaints, family 
history, social history, review of systems, etc that remains the same for each patient. But once we 
get to the clinical specialty domain, one wants to be able to get the details of the 
- patient's glucose level, 
- see the list of medications that a patient is on, 
- list out the SpO2 values for the patient

And so on, try to correlate the details between multiple other specialties.

To be able to record the case record, there is no set format noticed in my journey through various 
hospital implmentations. And the best way forward seems to be a generic model approach. 

Since Oracle Clinical also perhaps utilises the generic design for the clinical trials, what should 
be the design considerations one must follow to be able to build a robust "generic data model 
design" (since at times it is a business need). 

Please advice. 


Followup   June 25, 2008 - 11am Central time zone:

Oracle Clinical is a generic solution for everyone, you buy it. You customize it. You implement it. They truly don't know what you are going to store.

My argument is "if you are building something for you, you know what you are going to store, so store it"


5 stars Confirmation That Generic Model = Horrible   June 25, 2008 - 1pm Central time zone
Reviewer: A reader 
Tom,
I don't normally post here anonymously but this time I must. The company I work for has - against my strong objections - purchased a product named NetCracker. It uses a variation of the generic model you showed in your first response.

I can confirm for those readers who wonder about a genreic model that it is totally useless. The flexibility offered as a selling point for it is non-existent. Performance, particularly for getting data out of the database, is hideous.

To any objective observer it has been a total failure. The only good news, and this is a stretch, is that it's use has been limited to a pretty small segment of our business. My advice to anyone thinking of implementing a generic model for any non-triviel application: DON'T DO IT!

5 stars Generic Design   July 29, 2008 - 10am Central time zone
Reviewer: archana from Dublin
Hi Tom,

I agree with your comments posted on the site and found it very useful.Infact, I can see the same 
issue being repeated in our current projects by Architects.
Statement by Architect " data field types and sizes can also be categorized into a relatively 
smaller number of generic classes which improves maintainability throughout the application stack"
We are also urged to take this up as a Design policy.

You have mentioned defining field type/length inappropriately or in a generic manner will cause 
*** Data Integrity issues
*** Storage Issues
*** Performance Issues 
Also Varchar2 was being used as an example for lengths.

On a related note, I had a query on NUMBER data type, particularly with respect to Primary keys.

My Database designer has 'generically' applied Number(11,0) for all primary key fields in the 
database 
I am not sure i understand his approach is correct
For example 
Gender table primary key is also Number(11,0) although there will is only 2 records in this table 
Male/Female)

Similarly Country master stores maximum of 50 odd countries but Primary key is defined as 
Number(11,0)

Can you please advise me on this approach whether it is correct or not.Will there be any 
impact(performance, storage) if we generically design all primary keys in the same manner?

I also looked at the default HR schema and here is what i find, unable to draw any particular 
conclusions.

But in 
Countries table, country_id char(2)
Region table, Region_id number
Departments table, Department_id number(4)

Looking forward to your help and advise,
Archana

p.s I have raised this as a related query to the topic.
Please accept my Apologies if this is considered a new question.





Followup   July 29, 2008 - 4pm Central time zone:

using number(11,0) will limit your sequence values to 99,999,999,999 which seems like a willy nilly arbitrary limit.

it will not impact storage (the number 42 whether stored in a number, number(11,0), number(38), number(2) will consume the same amount of storage.

it will possibly impact performance (number(11,0) has a much larger range than number(2) does, high and low values could be much different)

it all comes down to "WHAT THE HECK ARE THE DATA REQUIREMENTS"


what are the requirements of the data, that is what drives the appropriate datatype.
5 stars Generic Design -Number Data Type   July 29, 2008 - 7pm Central time zone
Reviewer: archana from Dublin
Hi Tom,
Many thanks for your kind response.

Here is the requirement

Gender Table description

Gender_Id PK NUMBER(11,0)
Gender_Desc VARCHAR2(10)

I pointed out that Gender_Id will only have 2 rows, so no need to allow for Number(11)

But the database designer/architect came back with the generic model maintainability trade off .


I would like to know what would you define in this case?

Also not clear how performance might be impacted, please clarify.

And, Is it vsize() that I can use to calculate the storage impact?


Thanks and Regards,
Archana


Followup   August 1, 2008 - 11am Central time zone:

what is this trade off? I see no points to discuss?


however, if you use number(11) and you only ever put in there the numbers 1, 2 - it won't matter storage wise (the number "1" stored in ANY number definition is stored the SAME)

a datatype is a data integrity constraint, you use the right type to enforce a rule.
5 stars design   July 29, 2008 - 11pm Central time zone
Reviewer: A reader 
excellent thread


3 stars Good design principle, but is it excessive?   February 13, 2009 - 7am Central time zone
Reviewer: A reader 
If you have lots of "simple" lookup attributes for an entity that is purely used for lookup 
purposes eg. A product can have the attributes "colour", "size", "made_in",etc, and we had just 1 
lookup table;

create table product_lookup (id number primary key, attr varchar2(30), descr varchar2(30));

insert into product_lookup values (1,'COLOUR','Blue');
insert into product_lookup values (2,'COLOUR','Brown');
insert into product_lookup values (3,'SIZE','Big');
insert into product_lookup values (4,'SIZE','Small');
insert into product_lookup values (5,'MADE_IN','China');
insert into product_lookup values (6,'MADE_IN','Taiwan');

Would you classify that as too generic? Would you recommend creating, in our case 50+ tables to 
store each different attribute instead of just 1?

e.g colour_lookup_tbl(id number, descr varchar2(30)),
    size_lookup_tbl(id number, descr varchar2(30)),
    madein_lookup_tbl(id number, descr varchar2(30)), etc


Followup   February 16, 2009 - 10am Central time zone:

... Would you classify that as too generic? ...


yes, I'm sort of into this thing called 'data integrity'. How would you have your foreign keys if you do this? Would you put into your other tables that reference this two columns??? What would prevent you from using a color when you should have been using a size??


absolutely - a table for a lookup.


3 stars   February 13, 2009 - 10am Central time zone
Reviewer: A reader 
...If you have lots of "simple" lookup attributes for an entity that is purely used for lookup
purposes eg....

Actually, I've just read your other thread on this subject;

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1389366900346164087


You've raised the issues there which I think is relevant but I don't think causes us any problems;

data integrity - only administrators have access to screens for updating the lookup table
referential integrity - not a problem with the FK based on the lookup_id
performance - the joins are always of the type; select *
from products_tbl prd, product_lookup lok
where lok.lookup_id = prd.colour_id, I can't see how having 50+ tables with avg less than 6 rows would make much difference compared to 1 table with 300 rows.


Followup   February 16, 2009 - 11am Central time zone:

... data integrity - only administrators have access to screens for updating the lookup table ...

ummm so??? that is not data integrity, that is access control (security) - not part of the problem.



... referential integrity - not a problem with the FK based on the lookup_id ...

umm no, you missed the point. If everything is stored in table T, everything is "T", it is not a color, a size, a whatever. You have no referential integrity - you can easily use a color where a size should be. Data integrity - you get zero points for that.


... performance ....

so????? SO?????


Anyone is free to be as lazy as they want, I cannot change that. I know of no reason to use a single table other than that reason.
2 stars   February 17, 2009 - 4am Central time zone
Reviewer: A reader 
Tom -

Data Integrity/Ref Constraint - I get your point. Yes there is nothing to stop you "from using a 
color when you should have been using a size??". But the input screens where these attributes are 
entered is controlled by us. Yes we can make a mistake is choosing the wrong group (but unlikely as 
we're using a constants package for that), but it's not a high risk and would be corrected fairly 
easily. Effectively we control the data, we control what groups get added, we control what is 
entered, I think hence your assessment of these risk differs from mine.

Performance - looking at the code we've got, many of our screens caches these lookups by groups. 
Hence we have a common function which issue a select statement of this type;

select * from product_lookup where attr = :BIND_VAR;

Now what you're suggesting is to do away with the bind variable with this and change our common 
function to this;

case p_attr_val
  when '1' then select * from size_table;
  when '2' then select * from colour_table;
  when '3' then select * from made_in_table;
  when ...
  when '50' then select * from another_table;
end case;

Now to me this would actually hinder scalability.

I don't agree with it being lazy, it improves developer's productivity and I think there is 
arguably (with benchmarks) that performance is better.

Whilst on performance, maybe you can assist here - if I have a select * from products_tbls and join 
to every one of those lookup tables as you suggested to gets their description, if the CBO chooses 
to perform a HASH join, would it have to maintain 50+ different hash list? How would that compare 
if the lookup was just one table, does the CBO just have to maintain 1 list, which would perform 
better?


Followup   February 17, 2009 - 8am Central time zone:

... But the
input screens where these attributes are entered is controlled by us. ...


that is so very short sighted. Today they are - tomorrow? Are you saying "this data belongs to me, only me, only me and my application and my application is the only thing that will ever interface with this" - if so, I already know you are wrong. That data will be used and reused long after your application is erased. Applications come, applications go - data lives forever.


.... Effectively we control the data, we control what groups get
added, we control what is entered, I think hence your assessment of these risk
differs from mine. ....


Your assessment is way off base - fundamentally flawed. But one that happens time and time again. You believe the data is a side effect of your application. The application is a side effect of the data and the data will be there tomorrow - after your application withers and dies. There will be some "brand new, the last way to build applications, the best way to build applications, the best thing EVER" coming (it always does). If you make it so that the data can only be accessed safely by your application - well - you will have killed that data.


Everything you describe yourself doing - been there, done that. How many ISPF green screens interacting with CICS transactions have I seen that do the same thing - making the invention of the screen scraper so important when this web thing came out (you could not touch the data, you had to pretend to be a 3278 dumb terminal so the application could touch the data safely). You are doing the same thing - you are building the green screen legacy application of 2010.

...
Now what you're suggesting is to do away with the bind variable with this and
change our common function to this;

case p_attr_val
when '1' then select * from size_table;
when '2' then select * from colour_table;
when '3' then select * from made_in_table;
when ...
when '50' then select * from another_table;
end case;


wow, whoa, where did you get that idea. It would be more like

at the appropriate place we issue "select * from size_table", later, at the appropriate place we "select * from made_in_table" and so on. You have a pick list for a field - I see no use for the construct you just gave.

Scalability - you would not have that code, I don't know where that idea even comes from.


and look at your last bit - about the hashing. Umm, you would:

select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;


You would full scan lookup 3 times
You would hash lookup (which is very large - is has EVERYTHING) into memory 3 times
You would hash join to it three time.


Versus:

select * from t, size, color, made_in
where t.color = color.pk and t.made_in = made_in.pk and t.size = size.pk;


I'll full scan three smaller tables and have three smaller hash tables - true, you could put a predicate on yours:

select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and l1.descript = 'COLOR'
and t.made_in = l2.pk and l2.descript = 'MADE_IN'
and t.size = l3.pk and l3.descript = 'SIZE';

you'll hash the smaller bits of data, but full scan the larger....


If you have 50 columns, you'll have 50 joins - regardless of a single table or 50 tables.



It does not improve ANYONES productivity in any way shape or form. It is as bad as using global variables so we don't have to pass parameters - or not commenting code because we can improve our productivity.



Put up the benchmarks - you say you have them, let's see them?
3 stars   February 18, 2009 - 2am Central time zone
Reviewer: A reader 
Tom,

Yes, data does live forever, but I don't agree that our design will somehow "shorten" its useful 
life. Yes other applications could interface with this - but subject to the sets we have defined. 
It's simple, if they're using PL/SQL, they can use the same constants package we use to prevent 
data integrity problems. If they're using Java, they can copy our constants package and turn it 
into a constants class and so on. 

Other applications can safely access our data if they follow some basic rules and good practice 
where hardcoding is necessary. Yes, they can make mistakes "easier" with this hardcoding but that's 
what testing is for! It's hardly difficult to spot on the front screen that COLOR has been set to 
"Hong Kong"!

I made a mistake and should put "do with benchmarks" and have misled you into thinking I'd already 
done the benchmark.

"select * from product_lookup where attr = :BIND_VAR;" is embedded in a COMMON function that the 
front-end calls to cache all the values of a particular set. Hence I presented you with the "CASE" 
statement scenario as that is how they would adapt to the new 50+ table structure. I know you don't 
agree with this and it should be changed at the appropriate place, but the common function is what 
they use and that won't change.

However, whether we have "CASE" statement or your suggestion, I would still have to hard parse an 
extra 50+ statements where I don't need to. I could benchmark this to prove how using a bind 
variable can be beneficial, but because fundamentally you don't agree with this design whereas I'm 
finding that as we're discussing this and I'm appreciating where you're coming from, I'm more 
convinced that we haven't actually made a bad decision, it's kind of pointless even if I produced 
stats showing ten-fold increase in performance and reduced latching.

"select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;" - ok if the CBO is doing 3 full 
scans,then it's not as clever as I thought. Surely, it should recognise that it's the same table 
and since we're joining to it by the PK, then it's doesn't need to create 3 identical hash list? 
Whereas understandly, your statement "select * from t, size, color, made_in
where t.color = color.pk and t.made_in = made_in.pk and t.size = size.pk;" would obviously create 3 
hash list. Programmatically, if you had to write your own hash list, would you create 3 (or 50+) 
identical hash list?

Productivity is increased - no developers here wants to remember 50+ lookup table names. From your 
book (which I have and rate highly), I quote, you "just say no" to CHAR(1) instead of VARCHAR2(1) 
to avoid confusion, well here we say just "lookup".

I think we agree to disagree on this design. I do value your opinion on your input with regards how 
the CBO would handle the query and it looks like it doesn't do what I thought it should.


Followup   February 18, 2009 - 2pm Central time zone:

... It's simple,....

No, it isn't. You have to be educated on how YOU write your code in YOUR install before you can even consider touching this data. Everything is done in the application, every application has to follow a set of rules. It is not automatic and it will be far too easy for someone to say "eh, yeah, I know that is supposed to store colors, but you know what, we have this new thing that doesn't have a color, it has gidgets, we'll just use that field for gidget when the type of record is "X", it'll be great"

Now, you have a mess.


.. if they follow some basic rules
and good practice where hardcoding is necessary....

and that makes it complex, it didn't need to be that way.


... but the common function is
what they use and that won't change.
....

but only because you made this "design" decision, if you had not - it would be different, so please don't impose on me what your design led you to do.....

.... I would still
have to hard parse an extra 50+ statements where I don't need to. ...

You wouldn't, we would - and SO WHAT. I'll give you 50 statements - big deal, in the grand scheme of things - so what? At least I'd know what data you were going after, what each function used, I'd have *more* information, better information and data integrity to boot.

.... "select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;" - ok if the
CBO is doing 3 full scans,then it's not as clever as I thought. ....

think about it, it needs to join to three rows.... three *different* rows.


... Surely, it
should recognise that it's the same table and since we're joining to it by the
PK, then it's doesn't need to create 3 identical hash list? ...

surely you could easily run a tkprof and see, rather then assume what it does?


And still, even if it did it ONCE, it would be full scanning the entire set of 50 tables, I scan 3 tables, 3 short tables. I still win. Regardless.

... no developers here wants to remember 50+ lookup
table names. ...

They have to remember 50 bind variable values... HMMMM.... six one way, half dozen the other. 50 is 50 is 50.


... I quote, you "just
say no" to CHAR(1) instead of VARCHAR2(1) to avoid confusion, well here we say
just "lookup".
...

but - you still have *50* things to remember.

And frankly, if your developers honestly could not figure out how to put the data back together - well, hmmm. I wonder how they remember the API's they use to program?

And one would assume you use a naming convention of sorts, the developers must know that a given field in a table must be "looked up" (right, they have to, they code the join to the lookup table don't they - so they must know what columns are foreign keys to lookups...). One could make the assumption that you would have a naming convention that says "the column COLOR_FK will be used to represent color, that will be joined to the COLOR_LKUP table". They do, well, have access to the column names - they have to type them in

(this productivity concept, very weak)



... I think we agree to disagree on this design. ....

well, not really, you haven't put forth anything that says "this is better", you are conceding

o applications must do the integrity, all of them, every time, hope they do it or we'll have a mess.

o integrity can easily be compromised. And it almost surely will be (gidgets will happen - programmers will hack it - they always do)

o you still have 50 things to remember, you have saved nothing (so your entire "they are more productive" thing is gone, we'll need another example)

o the database will do more overall work, not less, with a single table - regarding IOs.


... how the CBO would handle the query and it looks like it
doesn't do what I thought it should. ...


And this is why I'm steadfast on this, there is the way we envision we would make something work if we wrote it...

And then there is the way it really works.

Of the two, only the latter is relevant, the former would be relevant only if we were to build our own database someday.






3 stars   February 19, 2009 - 12am Central time zone
Reviewer: A reader 
Doesn't Oracle Apps use lookups?

Doesn't Peoplesoft Apps use lookups?

And SAP?

How long have they been around (yes one of them has been taken over!)? Have their been reports of 
this disaster design causing data integrity and performance issue due to a single lookup?


Followup   February 19, 2009 - 7am Central time zone:

so what? do you agree with everything they do? I don't. Do you? Are they perfect examples of precisely how to build an application?

Or are they things built by people who are coders - that don't necessarily know much about databases?

even oracle apps.


especially SAP.


and do they all, ALL, have many reported instances of data integrity issues? yes, well, yes they do.
3 stars   February 19, 2009 - 12am Central time zone
Reviewer: A reader 
o applications must do the integrity, all of them, every time, hope they do it or we'll have a 
mess. -- yes do everything in the database, can't disagree. But the "mess" risk you're describing 
is exaggerated.

o integrity can easily be compromised. And it almost surely will be (gidgets will happen - 
programmers will hack it - they always do) 

-- no it can't, follow SIMPLE rules. Give programmers some credit.

o you still have 50 things to remember, you have saved nothing (so your entire "they are more 
productive" thing is gone, we'll need another example) 

-- select * from product p,lookup l1,lookup l2,lookup l3 where l1.id=p.color and l2.id=p.made_in 
and l3.id = p.size

o the database will do more overall work, not less, with a single table - regarding IOs. -- 300 
rows in our lookups, as oppose to 6 rows on avg. in your 50+ tables, and only then it's more I/O if 
the CBO performs Hash Join and decides that it really needs to create 50 identical hash list. Yes, 
I admit I made an incorrect assumption - it (CBO) really isn't as clever as it ought to be - if I 
find the benchmark is so poor, I will report back.


Followup   February 19, 2009 - 7am Central time zone:

The mess I'm describing is one I see over and over and over. How can you say it is exaggerated?


follow rules? give programmers credit? I give credit where credit is due. You have to train people to follow your rules to access your data. You are assuming everyone is "as good as you". Guess what, in order to have an 'average' coder, 50% of them are below average. Some of them are way below average. They are the ones that will make color become a gidget. It not only MIGHT happen, it FREQUENTLY happens.

After 16 years of begging, documenting, showing the negative effects of not using, saying it day after day - I still cannot get coders to realize that "using bind variables is something we have to understand" - and you think they won't abuse an attribute in a table? You think they won't take the short cut? Hah.


You still haven't provided another example - you give me a query that references color, made_in and size - they had to KNOW that. You still have 50 things to KNOW (or look up in your documentation or get from a describe, whatever)



3 stars   February 19, 2009 - 3am Central time zone
Reviewer: A reader 
"and that makes it complex, it didn't need to be that way." 

-- no it's not complex, you're making it out to be more complex than what it actually is. And yes, 
you've say the same back to me.

"but only because you made this "design" decision, if you had not - it would be different, so 
please don't impose on me what your design led you to do....." 

-- you making incorrect assumption in the same manner that I did, the COMMON function approach was 
already in place before we introduced the lookup. Our design doesn't and did not influence how they 
were performing lookups in the front-end. They may have not taken the approach you wish, but it 
wasn't driven by the alleged "bad" database design. Regardless, our design have facilitated the use 
of bind variables.  Compared to your 50+ lookups - I win.


"think about it, it needs to join to three rows.... three *different* rows."

-- so what? It's the same table. We're talking about if it decides to do a hash join here - that's 
just not smart if it's building X identical hash list.

"And still, even if it did it ONCE, it would be full scanning the entire set of 50 tables, I scan 3 
tables, 3 short tables. I still win. Regardless." 

-- if it did it once, I'd like to see what the effect of creating say 50 different hash list 
compared to 1 hash list, and compare that to the amount of IO that you'll save which makes yours 
the "outright winning" solution. We're talking about 300 rows here in 1 in-memory hash-list, as 
oppose to 50 hash list.

"this productivity concept, very weak)" - really? You like maintaining 50 table 
creation/maintenance scripts as well? Less code,keep it simple the better it is isn't it? Let say 
we want to add a new column to indicate a "short description" for all the lookup data, how many 
tables do you have to amend? How long would it take you to do 50+ compared to 1? Let's say I want 
to move all the reference data to a different tablespace, how many lines of code do you have to 
write? Productivity - I win.

We could go round in circles - like I said, we agree to disagree on this design. You have some 
valid points to consider, but nothing that makes our design inherently "bad" and makes me concede 
anything. I'm not suggesting for 1 minute we have 1 table for the ENTIRE system, just some 
description lookups.


Followup   February 19, 2009 - 8am Central time zone:

Look, if you have to tell a developer how to enforce data integrity for each and every table, you've made it complex. Especially when you could have data integrity enforced

a) safely - you know it is done right, without having to tell anyone
b) all of the time, in every application
c) consistently

without *teaching* anyone how to do it - then you've made it more complex. Coders are coders, they will not follow the rules if they see a way to make a shortcut. Declarative integrity rules out shortcuts. Period.


... the COMMON
function approach was already in place before we introduced the lookup. ...

how could that be so? How could you have a common lookup function that predates having data to lookup?


... We're talking about if it decides to do a hash
join here - that's just not smart if it's building X identical hash list.
...

and you are describing a micro-optimization that works only for this sort of bad design. Whenever you see a table being joined to N times over and over and over again in the same query - you know a mistake was made. This is something that frankly, just doesn't happen that often. So, that the optimizer doesn't recognize that it is dealing with "the single lookup table that satisifies all lookups" doesn't bother the rest of the world really - it is the *only* time that 'optimization' would ever come into play. Show me another case whereby you join to a single table many times in the same query? And the case has to be only "joining to the same set of attributes, there can be NO other predicate on the data being joined to"


... really? You like maintaining 50 table
creation/maintenance scripts as well? ...

absolutely! It is called *documentation*, metadata, more information, commentary, stuff to make it all make more sense.

I'm not afraid of a bit more "code" if it means

a) simple (you cannot screw up the data, I don't have to teach you)
b) always in place (you cannot screw up the data - do you see a common thread here - you cannot screw up the data) - data integrity is going to be there.
c) more information, better information. Having a lookup table with a well formed naming convention tells me something, provides information.

... Let's say I want to move all the reference
data to a different tablespace, how many lines of code do you have to write?
Productivity - I win.
...

*give me a break*. And how often would you do that? And so what - grasping at straws just popped into my head reading that.


... but nothing that makes our
design inherently "bad" and makes me concede anything. ...

it is bad, it is not the correct approach. You have conceded that data integrity can easily be compromised - that in itself is sufficient for me to cry foul. You now see that the optimizer works the way it works, not the way you think it would if you wrote it. And there is no programmer productivity here - just more rules for them to absorb and try to follow.



here is my bottom line.

I see so many 'bad practices' out there from

o lack of binding
o swallowing exceptions
o using autonomous transactions (and not understanding the implication)
o using triggers all over the place (and not understanding the implications)
o committing all over the place (breaking transactions entirely)
o using this single lookup table (and not having data integrity)
o not using declarative constraints
o using an EAV (entity attribute value) model to store all data - just so they are more 'productive' (hah, that one cracks me up)
.... list goes on .....

That I just have to, I am compelled to, point it out - loudly, over and over, every time, every single time.

If you are really smart, if you have a small team, if you have all above average developers - that actually LISTEN TO YOU and FOLLOW YOUR RULES - go for it.

But the average team is - average.
The average team has - people that take shortcuts.
The average team will have those that don't get the implications.

I speak to the average team that typically has 99% of the people not really even thinking that the database is useful - hence they learn nothing about it, they just like to make pretty screens.

And if that means storing a gidget in a color - so be it, they do not really *care*


I've made most all of these mistakes.
I've seen these mistakes made over and over and over and over again and again and again.

If I say to one person "ok, you are ok doing that" then you know what happens? It becomes "he said it is ok"

but it isn't. So, I won't say it.


4 stars Design Approach   May 17, 2009 - 2am Central time zone
Reviewer: A reader 
Hi Tom,
We have a requirement in building a data mart, where we will be getting working version of data, number of times till it becomes approved version. We need to replace working version every time for a specific version ID till it becomes approved. Users want to see current working version in the fact table along with approved versions. My question is:-

The target fact table is quite large, and every time we want to replace specific working version (based upon ID - there are many working versions for different IDs), we either need to delete/insert or update else insert (merge) or is there is any better way of doing? Further, since there is a requirement of aggregated data, we are thinking of creating a materialized view on based table, and delete/insert or merge approach on base table can result into heavy CPU/Memory intensive jobs.

Thanks


Followup   May 23, 2009 - 11am Central time zone:

use list partitioning...

you can then load into an empty partition you add for that version, you can drop that partition or truncate when you want to "unload" it.
4 stars List partition   May 23, 2009 - 8pm Central time zone
Reviewer: A reader 
Hi Tom,

The fact table has version id based upon version dim, which gives whether the version is working or approved. The layout of version dim is like this (some column headings are shortened here for clarity):
Vrsn Type  Code  VersionNo Version
1  PLAN PLAN2008 0    WORKING
2  PLAN PLAN2008 1    APPROVED
3  PLAN PLAN2008 2    APPROVED
......
Only one working version for a type & code, and multiple approved versions.

The data in fact table will be based upon version Keys.

Do you mean this:
Create a temp table (for working version) and load working version data. FACT table is list partitioned with version, and then exchange partition.
For approved version just load the data in the fact table.

Further, there is materialized view on the fact table, which will only show latest 2 approved versions and a working version. When we exchange partition for working version, what refresh options is better - Complete, FAST etc.
PS: We are on 10gR1.
Regards,


Followup   May 23, 2009 - 10pm Central time zone:

no, i mean each thing you load would have its own version number.


you can either load directly into the fact table or use a temp table and exchange, it would not matter. whatever you like to do. each version is unique and independent of every other version.

read about partition change tracking in the data warehousing guide regarding your refresh options.
5 stars Working version   May 24, 2009 - 7am Central time zone
Reviewer: A reader 
Hi Tom,

My apologies, I think I did not make clear.

The working version every time comes for a specific type and code is required to be replaced in the fact table, only approved versions are to appended.

Regards,

Followup   May 26, 2009 - 8am Central time zone:

right - drop old partition with version you don't like anymore

add new partition with new version (and new version id) for the stuff you want.

You would be slicing out entire versions - whenever you want - using DDL

You would be sliding in entire new versions - whenever you want - using DDL
5 stars   May 25, 2009 - 1am Central time zone
Reviewer: A reader 
We have different criteria groups which user can attach with the visit of Patient. If all the requirements of the Group are met in a visit then user can select this Group otherwise we will give message that you can't attach this Group.
For example if user have attached CPT and ICD from the provided list and Place of Service Indicator is not selected then he can attach Group1 with this visit. If user either selects ICD 185 and CPT from provided list OR he selects icd from provided list then he can attach Group 2.

The icd and cpt which user attach with visit are which he selects from the unique common set (as these repeat in different groups) at the time of submission of form we will check for validation. These groups will grow with time and rules will also add up.

Group 1   
  ICD from Provided List
  AND
  CPT From Provided List
  Without
  Place of Service Indicator
   
GROUP 2   
  ICD 185
  AND
  CPT From Provided List
OR 
  ICD from Provided List
 
Measure #143   
  ICD from Provided List
  AND
  CPT From Provide List
  OR
  ICD from Provided List
  AND
  CPT From Provided List1
  And
  CPT From Provided List2
 
I am thinking to make reference data tables for these CPT, ICD, Place of Service and Age and then in procedures write IF Else conditions to check validation. Is this good to do it in this way or there should be some generic or non procedural way should be adopted. Thanks for your time.


Followup   May 26, 2009 - 9am Central time zone:

how many groups - in your wildest dreams - can you imagine there might be some day.

and what is "measure 143" you have there - it just sort of appears, but you don't mention why you typed it in?
3 stars   May 26, 2009 - 9am Central time zone
Reviewer: A reader 
It will be around 30-40 Groups


Followup   May 26, 2009 - 10am Central time zone:

can I ask what should happen when you change the lists (update the "provided lists")

Write a Review
 


All information and materials provided here are provided "as-is"; Oracle disclaims all express and implied warranties, including, the implied warranties of merchantability or fitness for a particular use. Oracle shall not be liable for any damages, including, direct, indirect, incidental, special or consequential damages for loss of profits, revenue, data or data use, incurred by you or any third party in connection with the use of this information or these materials.

About Oracle | Legal Notices and Terms of Use | Privacy Statement