The crux of the question wasn't answered!
A reader, June 26, 2003 - 10:25 am UTC
Tom, the example you used is okay, but I am sure such designs are used often in the real world, and it doesn't portray the real business problem posed by Alvaro's question, such as:
"...We use that structure because the attributes are dynamic, we don't know how many features a product will have, product A could have color,size,material (3 attributes) ... and product b could have color,style,group, and length (4 or more ...) and so on. "
How would you tackle a design issue such as the one described above?
June 26, 2003 - 11:00 am UTC
I answered the crux of the question:
quote:
Is this a wrong design ??? or Just we don't know how to build a query ???
answer:
wrong design
I thought it was very straightforward. Not having all of the requirements -- I cannot comment further. But I would betcha there is a finite set of attributes such a thing could have -- at the very least, there is a very LARGE set of common attributes.
They have gone the route of "total flexibility" -- which I many times find is "we didn't gather the requirements really". That could be part of the issue.
I would go with
create table t (
<set columns -- things we know everything has>,
<columns most things have, color, size, cost, shape, etc>,
<couple of flex fields for oddities>
);
a single table, many attributes. Ability to index. Ability to query performantly.
I would also bet that most attributes are in fact "well known"
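A minimal sketch of that shape of table, purely as an illustration (every column name here is made up):
create table products
( prod_id    number primary key,
  prod_desc  varchar2(100) not null,    -- things we know every product has
  prod_cost  number,
  color      varchar2(30),              -- attributes most products have
  prod_size  varchar2(30),
  material   varchar2(30),
  flex_attr1 varchar2(100),             -- a couple of flex fields for the oddities
  flex_attr2 varchar2(100)
);
create index products_color_idx on products(color);
select * from products where color = 'RED' and material = 'OAK';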
Correction!
A reader, June 26, 2003 - 10:28 am UTC
I should have said Tom's example (4 table data structure) is "Not" used in the real world very often...
Alternatives for that bad design
Alvaro, June 26, 2003 - 10:37 am UTC
Ok, Tom we understood your approach ...
If we use that design in our application, the performance will be poor...
So we are thinking of storing all attributes in the same table, no matter what kind of attribute...
product table would be
prod_id
prod_atr1
prod_atr2
prod_atr3
...
prod_atr20
With this model we have all the information in one table, but the problem would be that the user must know that atr1 is color, atr2 is size, atr3 is last_name... and so on.
The flexibility is not 100% but the performance would be better...
What do you think about it?
Thanks
Acueto
June 26, 2003 - 11:43 am UTC
I think you'll find MOST attributes are in fact well known.
FEW attributes are not known.
So, I think you have a table with lots of fixed attributes -- color, size, last_name and so on.
You have a couple of columns at the end that vary by "product"
and the end users would have to "know"
But think about how well they would have to "know" SQL in order to query the other model at all -- look at the "or" example I have -- ugly.
Proper data analysis required
A reader, June 26, 2003 - 10:44 am UTC
Very good points Tom! Proper analysis of the business procedures and data storage requirements should yield the right balance between flexibility and performance.
Column names
A reader, June 26, 2003 - 10:53 am UTC
Alvaro
From the information you have provided it seems that most products have the same basic information about them, eg. colour, category, material. What prevents you from using these as the column names? Without a full analysis of the business, though, it would be impossible to determine the correct table/column set to use for your data storage.
attributes
Alvaro, June 26, 2003 - 11:15 am UTC
Thanks UK ...
As you said, we would use color, material, category and ...
for column names, but actually the attributes could be more numerous and in some cases unknown for new products.
I could tell you our product attributes are these and no more:
(color, material, height, width, category, ...) -- only 15 attributes -- but it's not the real world.
If a new product is added to our company it could have some attribute that does not exist; for this reason we are thinking of table fields like
prod_id
prod_atr1
prod_atr2
...
prod_atr20
Tom's Date of Birth is 15th March 1965
A reader, June 26, 2003 - 11:19 am UTC
Is that true ?
Analyse
Paul, June 26, 2003 - 11:29 am UTC
Unknown will be just fine for new products, simply leave it NULL.
Ok, now I'm getting to know more about your business, but more information is still required. For example, what is the scope of the products covered by the business? Could you use the scope then to come up with good column names that will cover 99% of all products you would need to store? Might it be necessary to sub-divide products into specific groups, perhaps requiring more than one table? Maybe two groups of products have totally different attributes. Also, we are probably only looking at one tiny part of your whole system; perhaps other parts will affect how this part is designed.
Designing, as with any planning-type activity, will be more accurate and complete when the information provided is accurate and complete.
Think of it from the users point of view.
Chuck Jolley, June 26, 2003 - 12:06 pm UTC
I try to think of these things from the users' point of view.
Numeric attributes like size, cost, number per package, etc. that virtually all of your products will have,
or properties that could use a list of values from lookup tables like color, material, group, style, etc., get columns.
If the LOV would be too long to use for data entry or a drop-down on an inquiry screen, why give it its own column?
Just put that stuff in a wide descriptive column. You can still index and search on it.
select *
from products
where prod_desc like '%64 SQUARES%'
and prod_desc like '%TELESCOPING LEGS%'
and color = 'RED'
and width = 17
and units = 'INCHES'
and etc;
Gets you all the red 17" chess and checkers tables with telescoping legs.
Without creating a table with 1000 columns or 1000 lines of sql.
If something REALLY new comes in that won't fit, you can always add another column.
chuck
June 26, 2003 - 12:29 pm UTC
(tagged xml might even be a consideration.... oh, did I really say that out loud)
we can index that (XML) and do section searches amongst other things.
Flexible database design
Yong, June 26, 2003 - 1:10 pm UTC
The example Tom put out here
is used/has been used in the real world.
I was involved in two projects for two big companies in the Bay Area. The Java developers on my team tried to convince me how flexible the data model would be. I gave the exact answer Tom did: your flexibility will kill the database. Unfortunately, I didn't win those wars and development started. I knew what was going to happen.
In the test environment, with 20 rows in each table, the Java developers could show how easy and flexible it was for a Java developer to add any new attribute and feature. They were very happy.
When we went into production and the tables grew to 10,000 records, we ran into performance problems. After a user provided a user name and password and our J2EE app went to load the person's profile, it took 40 seconds to show the next page!
Happy now?
Finally, I had to re-design the data model 6 months later.
I have read many of Tom's ideas and couldn't agree more.
Supporting this model
will, June 26, 2003 - 1:41 pm UTC
I'm currently cursed with maintaining this type of model. It supports a web app that generates a search on the various attributes. The queries that are generated are something (evil) to behold. No way it will scale.
The question is how to redesign the structures to support the user's ability to add products and attributes and efficiently maintain and query the resulting large and varied set of searchable attributes.
Above you recommend a product (with flex field) approach.
We could probably get around the "user created model" requirement by giving them some facility to assign the flexfields for a given use. However, the number of flexfields would be very large.
To support the large and diverse set of attributes, would you recommend a separate table for each product type? How fine should the distinction be? For example, should you have a single "computer" table or a "PC" and "Server" table? Do you think a company like Amazon would store the data in a table per product type model?
I realize that these are decisions that would have to be made for our specific requirements, but any discussion on the alternatives would be greatly appreciated. Will you discuss alternatives in your book? I know I'll be first in line when it comes out.
Thanks for the great site (wow, cool spellcheck).
June 26, 2003 - 1:49 pm UTC
You might consider XML, you can store attributes in a tagged "document", Oracle text can index that. You can do section searching on it.
benchmarking such an implementation would be crucial obviously.
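A minimal sketch of what that could look like, assuming Oracle Text's CONTEXT index type with the predefined automatic section group (table, column and tag names are made up):
create table product_attrs
( prod_id number primary key,
  attrs   clob   -- e.g. '<attrs><color>RED</color><legs>TELESCOPING</legs></attrs>'
);
create index product_attrs_idx on product_attrs(attrs)
  indextype is ctxsys.context
  parameters ('section group ctxsys.auto_section_group');
-- section search: products whose COLOR tag contains RED
-- (a CONTEXT index is not maintained on commit by default; it has to be synced)
select prod_id
  from product_attrs
 where contains( attrs, 'RED within color' ) > 0;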
Also, the generic design prevents enforcement.
Chuck Jolley, June 26, 2003 - 1:47 pm UTC
I forgot to add that doing the design generically like that prevents the use of two of the most classic types of constraints: foreign keys and lookup tables.
What is to prevent someone from entering 'INCHES' for the value of the attribute 'COLOR'?
Let alone the "attribute" 'RANDOM_STRING'?
A 10,000 line check constraint?
10,000 lines of application code?
In our shop we call this the "Big Chief Tablet" school of design.
The database is just used like a gigantic piece of notebook paper.
In my view all the attribute columns except the descriptive column should be enforced by check constraints or foreign keys on lookup tables.
This alone will save the programmers a huge amount of work.
chuck
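A minimal sketch of the declarative enforcement Chuck describes (table and column names are hypothetical):
create table colors ( color varchar2(30) primary key );
create table products
( prod_id   number primary key,
  color     varchar2(30) references colors,                  -- lookup table via foreign key
  units     varchar2(10) check ( units in ('INCHES','CM') ), -- simple check constraint
  prod_desc varchar2(4000)                                   -- free-form descriptive column
);
-- now 'INCHES' can never end up as a COLOR, and the database rejects it
-- without a single line of application code.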
June 26, 2003 - 1:51 pm UTC
hear hear.
(when I talk of xml above -- it would be for ONLY those things that are definitely NOT KNOWABLE until months after the system is fielded -- the exceptions, things to add to version 2)
Re: Tom's Date of Birth is 15th March 1965
A reader, June 26, 2003 - 2:03 pm UTC
I've got to say that if it's true, he looks like heck for being only 38 (just kidding, Tom!).
Similar problem
av, June 26, 2003 - 4:10 pm UTC
I am facing the same problem in our system. We have these 4 tables to store the template data. Mostly the data comes from form/application templates.
For example, the template is an application you fill in when you need to apply for a credit card or auto loan or insurance, etc. The system doesn't know the template and its attributes well in advance. Developers can create a new template and the engine automatically recognizes it as a new template and registers its fields and values in the database. The 4-table structure is:
1. Template_type Table
Template_Type_id number,
Template Type Name varchar2
2. Template Table
Template_id number,
Template Name varchar2,
Template_Type_id number
3. Template_Field Table
Field_id number,
Field_name varchar2,
Template_type_id number
4. Field_value Table
Field_Value_id number,
Number_val number,
String_val varchar2,
Date_Val Date,
Template_id number,
Field_id number
With this flexible table structure developers don't care about the number of templates or the number/type of fields in them. It is easy for them to allow users to fill in templates and store them in the database. It is also easy to show all templates and their fields.
But the real problem is with the reporting (both ad-hoc and standard). The number of fields in each template varies between 100 and 500. Right now we have 40 templates, 8,000 fields and 400,000 field_values.
We designed the data warehouse table as a flat table with 1000 columns (number, varchar2 and date), transforming the data into columns in the ETL process, and we are reusing columns, because we need to provide sort by, filter, group by, aggregate functions, etc. on each and every field in ad-hoc reporting. So we thought that de-normalizing and storing the data in columns would give flexibility at the report level.
Now the queries for a simple select become
Select decode(Template_Type, 1, number1_column, 2, number15_column, 3, number34_col, null) column1,
       decode(Template_Type, 1, varchar22_col, 5, varchar35_col, null) column2
From dw_table
The reason for this is that we want to report all the field values across a user's templates (if a field exists in multiple templates we want to get all of them, cross-template). With the column reuse, a field such as Total_cost can live in any column, so we use the metadata to build the decode above to find all the values.
These queries are taking a very long time. I couldn't fully understand the XML approach that you talked about. Is it a viable solution for my problem?
Can you please let me know the best approach to solve the reporting problem described above? If you had to implement this framework, how would you handle the design?
June 26, 2003 - 4:39 pm UTC
It'll help on searches for a document instance (XML will) cause you can
select ... from t where contains( 'something within attribute' )
and get the instances back.
As for being a data warehouse solution, not so good. there you need facts, dimensions.
Like I keep saying -- this flexibility = great, good for the developers. this flexibility = stinks, terrible for any ad-hoc queries, aggregations, "ors", you know -- what people want to do with data.
I think your data warehouse needs 40 fact tables (40 is NOT a big number, you don't have that many templates).
I think your data warehouse needs columns that convey meaning and are appropriately indexed with proper datatypes.
I think your "etl" process would convert their "glob of nothing" into something that people can use in the warehouse.
Their structure is excellent for working on a document instance ("hey, get me the data for document=555, that user is here to input the information")
Their structure is horrible for actually using the data any other way.
I would have changes to a template notify someone so the warehouse can be modified to accommodate them properly.
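A minimal sketch of what that ETL step could look like for one template, pivoting the field/value rows into real, typed columns (the fact table name and all field names except Total_cost are made up; it assumes the 4-table structure described above):
-- one fact table per template, with meaningful, properly typed columns
create table fact_credit_card_app as
select fv.template_id,
       max( decode( f.field_name, 'TOTAL_COST',   fv.number_val ) ) as total_cost,
       max( decode( f.field_name, 'APPLICANT',    fv.string_val ) ) as applicant,
       max( decode( f.field_name, 'APPLIED_DATE', fv.date_val   ) ) as applied_date
  from field_value fv, template_field f
 where fv.field_id = f.field_id
 group by fv.template_id;   -- group by whatever identifies one filled-in template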
Thanks
A reader, June 26, 2003 - 7:49 pm UTC
Thanks a bunch for your feedback. I totally accept your point. Still, I have a question.
1. The reason for merging them is to provide cross-template queries, like "get the cost field from Template 1 and Template 5" or "aggregate quantity from Template 10 and Template 20". I have to merge the columns from the different tables to get this kind of report.
Another question, not related to the above:
Some templates have around 500 columns, so they will become 500 columns in a table (I have this problem with the current table too). Is it going to scale well? Please let me know if you can think of any better approach.
Thanks in advance.
June 27, 2003 - 8:31 am UTC
facts and dimensions are useful to break out "500" columns into more meaningful sets of tables.
It sounds like you need 1 fact table with common attributes, then 40 other "subfact like tables" for the non-common attributes, with dimensions off of each. Practically speaking in a DW environment, I would probably have 41 facts -- one for the common with all rows from all templates -- and the ability if need be to drill down to the non-common details.
Then I would have 40 more fact tables -- that replicate the common but also have the non-common attributes.
In that fashion, I don't have to join common to non-common to query "template 1" , I just query template 1.
The original question
A support manager, June 27, 2003 - 9:59 am UTC
I agree with Tom wholeheartedly - just don't go down the route of 'common object' tables. It kills performance and confuses support staff (or at least gives them a long learning curve on the meaning of data).
I inherited a DB put together by a bunch of dogmatic OO programmers: All data "items" are in a single narrow table with a "class descriptor" to tell you whether you are looking at (say) a stock item or an outlet. Values associated with an item are in an attribute table. Certain items have properties that are items in their own right; these are in a link table. Certain items have parents - eg a "steak dinner" has a parent of "meat dish" which has a parent of "food". Each of these relationships is held in a single two column table of item, parent. Most simple queries run as 3 or more joins to the item table.
Performance is a dog. And to make the batch (output) interfaces fly, the developers had to create temporary tables that contained a conventional (multi-column) form of the data.
Performance is designed in - it follows analysis and design, it is not a generic, commodity item.
Don't you have to take DB theory to get a programming degree any more?
Chuck Jolley, June 27, 2003 - 11:07 am UTC
What's the difference between that and using flat files?
I mean, they could have done the same thing with three flat files and a couple of app maintained "index" files.
They would have saved the price of Oracle and support.
It might have even run faster.
Generic Database Models, don't try this at home
Jeff Klee, M.O.S. (Man-of-Science), June 27, 2003 - 11:22 am UTC
Tom,
I really enjoy your column. This thread piqued my interest because I'm an application DBA and I support Oracle's "Oracle Clinical" application for a large pharmaceutical company. Oracle Clinical is a large "system" which is generally used to capture clinical trials data using a flexible and generic data structure. I would describe the data model in a nutshell by saying that it is built around two entities: a question (i.e. "what is your birthday", "what is the barcode on the blood sample") and a response (i.e. "16-JUN-1988" and "AA400055H" respectively). In effect, the RESPONSES table contains almost all of the dynamic data, and the remaining schema objects (~10,000, although I didn't count them) are the metadata behind the RESPONSES table and the metadata that is used to construct the presentation layer on the fly.
As you suggest, building a system like this is not like picking fruit: you need an army of folks to build it (read: a large pile of money) and another highly trained group to "fly it". My role in this is to trouble-shoot problems where the typical problem statement is something like "why is the database slow" (usually when someone builds a slow custom query) and "can you make it go" (usually a non-database issue - network routers, defective firewalls etc), so I've become familiar with the underlying table structures (~400 tables).
After working with this product for a year, I have concluded that the generic database model probably makes sense in this case because you can build a single product, sell it to several end users, and there is a large enough market to justify the expense (read: pay back the cost of building it). The main benefit of this architecture is that it allows a company to build their own presentation layer and capture any kind of data they want, and do this quickly. Although it's marketed as a validated clinical-trials application, you could choose to capture other types of data.
My intent is not to challenge your premise, "Do not use Generic Data Models", but to emphasize it. I would qualify that statement by saying that one should have a sound reason to do it, like a business plan that justifies the expense of building such a complex system by weighing the benefits of doing so. In summary, Oracle Clinical provides my company with a high-speed mechanism for deploying custom applications (on the web). Although Oracle provides a mechanism for addressing performance problems and building generic "extract views" that make it easy to query the database, designing a system this complicated is probably a lot of "hooey" if you're only deploying one instance of it (and then I would call that a career development system). Also note that, since the RESPONSES table contains actual entities and their attributes, we extract some of them to a relational model where we can build high-performance (and more readable and easier to understand) queries. So my conclusion is that this generic model makes sense in this case and it does what it's intended to do: capture data. Don't try this at home, though.
Bad design rant continued
A Support Manager, June 27, 2003 - 11:26 am UTC
If I had my way I would take their degrees from them, Chuck!
As I said above, performance is not an accident, it is good design.
In the OO system above, this generic approach to design has made the system far too complicated and too divorced from the meaning of the business data. I would love to know that a widget costs a dollar, not that Item 10000324 has an attribute of type 345 with a value of 1.
By the way, this madness also pervades the application code itself - too many layers of abstraction and objects that are way too big. No wonder the consultancy that put this in for our customer went bust.
Synthetic primary key and Real Primary Key
A reader, August 23, 2003 - 8:23 am UTC
Tom,
can you tell me what a synthetic primary key and a real primary key are.
Thanks for your time.
August 23, 2003 - 10:36 am UTC
synthetic primary keys are made of rubber.
real primary keys are made of numbers, dates, strings :)
(that was a joke... never mind)
We normally call them a surrogate key.
it is a key you use in place of the real or natural key of a table. Consider a table of documents:
create table documents
( path varchar2(255),
name varchar2(255),
.... other data ....
)
arguably, the primary key -- the natural key of this table is PATH,NAME. However:
a) we move documents, path might change
b) we rename documents, name might change
c) path and name are "big"
d) path and name are 2 columns, more typing in queries
So, path and name, while they are the natural keys are not necessarily the best fields to use as the primary key. Especially for A) and B) -- cascading updates to foreign keys is *nasty*. Just because you renamed a document doesn't mean it is a different document.
So, in real life, we would probably:
create table documents
( id number PRIMARY KEY,
path varchar2(255),
name varchar2(255),
.... other data ....,
constraint name_unq unique( path, name )
)
we would introduce ID, a surrogate (synthetic) key. It is immutable (never changing), solving the A) and B) problems. It is small, solving the C) data storage problem, and it is a single simple column, solving the D) issue.
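A minimal sketch of how the surrogate key is typically populated and referenced (the sequence and child table are made up for illustration):
create sequence doc_seq;
insert into documents ( id, path, name )
values ( doc_seq.nextval, '/home/docs', 'design.txt' );
-- child tables reference the small, immutable surrogate key, so renaming
-- or moving a document never cascades to them:
create table document_versions
( doc_id  number references documents(id),
  version number,
  primary key ( doc_id, version )
);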
Excellent
A reader, August 23, 2003 - 11:36 am UTC
***** for the joke
***** for the answer
Whenever I read your responses I am very, very serious. Today you made me laugh with the joke.. Great.. Thanks :)
informatica
kit, September 30, 2003 - 1:02 pm UTC
Re generic designs: I understand there is always a tradeoff,
but reading your comments, you do not seem to like them.
Have you seen the underlying Informatica design? How do they allow users to create their own source and target columns and apply business logic?
September 30, 2003 - 1:10 pm UTC
the only thing i know about informatica is it really likes to do things as deadly slow as it can -- a row by slow row process.
beyond that, i haven't looked at their tables/implementation -- I just try to get people to use the database to its fullest. sometimes a simple insert/update/create table as select will do in seconds/minutes what otherwise would take hours/days.
sql
redmeat, October 03, 2003 - 2:50 am UTC
hi,
it may work if you can build the sql string below.
SELECT * FROM PRODUCT WHERE PROD_ID IN
(
SELECT PROD_ID FROM PRODUCT_ATTRIBUTES A
WHERE A.PROD_ATTR IN (SELECT B.PROD_ATTR
FROM PRODUCT_ATTRIBUTES B
WHERE B.PROD_ID = A.PROD_ID
AND B.PROD_ATTR='GROUP'
AND B.PROD_ATTR_VALUE = 'FIRST CLASS')
INTERSECT
SELECT PROD_ID FROM PRODUCT_ATTRIBUTES A
WHERE A.PROD_ATTR IN (SELECT B.PROD_ATTR
FROM PRODUCT_ATTRIBUTES B
WHERE B.PROD_ID = A.PROD_ID
AND B.PROD_ATTR='CATEGORY'
AND B.PROD_ATTR_VALUE = 'BED ROOM')
)
/
PROD_ID PROD_DESC PROD_COST PROD_LEVEL PR PR
--------------- ------------------------------ ---------- ---------- -- --
1000 SINGLE BED 100000 1 A A
1001 DOUBLE BED 200000 1 A A
question on design
A reader, October 24, 2003 - 2:39 pm UTC
Hi Tom
We have a situation where we have huge files
stored in the database (Eventually this will run
in 9i/10G)
We need to parse them. Parsing in PL/SQL has appeal because
you have the data in the DB, you do the parsing, and you
store the parsed results in tables for the UI to display.
However, PL/SQL is not really meant (I think)
to be a parsing mechanism. Perl *perhaps* is better because
it was designed up front for this (10G regular expressions
in PL/SQL may mitigate this a bit for 10G).
But doing it in Perl introduces the "extproc" complexity and with it perhaps other complications.
If you want to do it in the middle tier then you have to ship the file contents over - not a good idea.
What are your thoughts?
1. pl/sql
2. perl with extproc
3. middle tier
My choices are as above in the order of preference.
Please let me know what your thoughts are.
Thanx for an amazing site!!
October 24, 2003 - 3:44 pm UTC
what is this document and what does your parsing involve (i've never personally seen the appeal of perl, ugh)
Thanx Tom!
A reader, October 24, 2003 - 3:49 pm UTC
The document is a text document (if that is what you meant).
It contains information that needs to be parsed and
put in tables for the UI. The information is in a format
that would need a reasonable amount of parsing to be
carried out.
Thanx!
October 24, 2003 - 4:08 pm UTC
do you have a "for example" (and why wouldn't the client that stuffs it in do the parsing?)
thanx!
A reader, October 24, 2003 - 4:30 pm UTC
Unfortunately, I cannot give you an example right now
for confidentiality reasons.
Anyway, I don't think the parsing involved is going to be
*very* heavy duty, and it can be done in PL/SQL with the usual
string manipulation functions. My concern is that if you are using too much PL/SQL processing (due to the size of the file) it may bog down the database. The process itself
can be done in the background using jobs - it is not the
"ui" interactiveness that bothers me but the resource
consumption at the database.
for example (fake)
a=b,c=d,
put it in a table where col_a = b and col_c = d
Thanx!!
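Purely as an illustration of the kind of parsing being discussed -- a minimal PL/SQL sketch that splits a "name=value,name=value" string like the fake example above (the procedure and table names are made up):
create table parsed_pairs ( name varchar2(100), value varchar2(4000) );
create or replace procedure parse_pairs( p_text in varchar2 )
as
    l_pair  varchar2(4000);
    l_start pls_integer := 1;
    l_comma pls_integer;
begin
    loop
        -- carve out the next "name=value" chunk between commas
        l_comma := instr( p_text, ',', l_start );
        if l_comma = 0 then
            l_pair := substr( p_text, l_start );
        else
            l_pair := substr( p_text, l_start, l_comma - l_start );
        end if;
        if l_pair is not null then
            insert into parsed_pairs ( name, value )
            values ( substr( l_pair, 1, instr( l_pair, '=' ) - 1 ),
                     substr( l_pair, instr( l_pair, '=' ) + 1 ) );
        end if;
        exit when l_comma = 0;
        l_start := l_comma + 1;
    end loop;
end;
/
-- exec parse_pairs( 'a=b,c=d,' ) would store rows ('a','b') and ('c','d')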
Database Design
Bárbara Díaz Vicente, April 21, 2004 - 6:42 am UTC
Hi:
I have to design a database with several schemas, such that when we want to delete some functionality we only have to delete its schema.
The idea is that there is one schema with generic information for all our schemas, and then each additional schema as necessary.
For example:
SCHEMA 1: CSR_GES_PRMT
col1
col2
col3
SCHEMA 2: PCRO_GES_PRMT
col1
col4
col5
Cardinality between both tables is 1:1.
......
SCHEMA n: ................................
OBJECTIVE: we need to get a table - view? synonym? -
TB_GES_PRMT with the following attributes:
col1, col2, col3, col4, col5.
At the beginning I had planned to use views and synonyms, but there are some restrictions with DML statements.
How can we do this design???
Thanks in advance.
P.S.: Sorry for my English!
April 21, 2004 - 8:10 pm UTC
I don't fully understand the concept but it sounds so far like "a bad idea". not sure what you are trying to do exactly.
Database Design
Bárbara Díaz Vicente, April 22, 2004 - 5:02 am UTC
I'll try to explain the question better.
We have two databases. Now we want to integrate both into one database, and we are thinking about how to do it for future developments where we need to reuse some information and create some new data.
We are thinking of creating one schema with shared information, plus as many schemas as we need for our functionality.
For a concrete example:
A table for customers ----- Generic information: CSR_CLIENTS
MSISDN
CO_EST_CLTS_N
CO_EST_SIM_N
Specific information: PCRO_CLIENTS
MSISDN
FC_TMSEL_V
We have existing source code written against a table PCRO_CLIENTS with the following columns: MSISDN, CO_EST_CLTS_N, CO_EST_SIM_N, FC_TMSEL_V, and we don't want to reprogram that source, but we need to separate generic and specific information. Perhaps the solution would be a view joining both tables, but we have limitations with DML statements.
Or perhaps a better solution would be something else...
Thanks.
April 22, 2004 - 7:46 am UTC
a single table seems better, unless I'm missing something here. one table will perform better, be easier to insert/update/delete from and so on.
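A hedged sketch of what that single-table approach could look like while still letting the existing code see the old names through simple views (the column datatypes are assumed):
create table clients
( msisdn        varchar2(20) primary key,
  co_est_clts_n number,          -- "generic" columns
  co_est_sim_n  number,
  fc_tmsel_v    varchar2(30)     -- "specific" column
);
-- single-table views remain fully updatable, unlike most join views,
-- so existing code can keep using the old names:
create or replace view csr_clients  as select msisdn, co_est_clts_n, co_est_sim_n from clients;
create or replace view pcro_clients as select msisdn, fc_tmsel_v from clients;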
OK
Ram, April 23, 2004 - 1:42 am UTC
Hi Tom,
Any other way to put the following Query?
SQL> select deptno,dname,cursor(select empno,ename,sal from emp e where e.deptno = d.deptno)
from dept d order by deptno;
Please do reply.
April 23, 2004 - 11:45 am UTC
there are an infinite number of ways to "put" the above query.
but if what you want is an ordered set of dept rows, each with a cursor that is the list of emps that work in that dept, you've got what you need.
OK
Ram, April 24, 2004 - 6:09 am UTC
Dear Tom,
Thanks for your reply. But could you please suggest some other ways
to put the above query, *regardless of the order by on deptno*?
Please do reply.
Bye!
April 26, 2004 - 5:06 am UTC
just outer join. gives you an 'equivalent' answer, but not with a 'cursor'
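For example, the outer-join form of the query above would look something like this (same DEPT/EMP tables, classic (+) syntax):
select d.deptno, d.dname, e.empno, e.ename, e.sal
  from dept d, emp e
 where d.deptno = e.deptno (+)
 order by d.deptno;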
what to do with the generic data model
philip, November 01, 2004 - 1:06 am UTC
Tom,
Back to your comment about the generic data model. I am a DBA and I am supporting a database designed by our vendor who has a similar generic data model.
Currently, the Object table has 97,000 rows, the object attribute table has 9.1 million rows and the attribute value table has 9.1 million rows.
Users are always complaining about how slow the system is. The Oracle performance manager - Top segments performance view - shows a high number of physical reads (around 1.4 million) on the Object, Object attribute and attribute value tables. Logical reads are around 20 million.
What can I do to improve the performance of the system?
Thanks
Philip
November 01, 2004 - 3:36 am UTC
get another vendor or ask the vendor to solve the problem.
see the queries in my example above? you cannot ask even the most BASIC SIMPLE QUESTIONS without performing a contorted query that even a yoga master would be impressed by.
(this is just a side note. when we buy technology -- like a database -- we do things like "compare benchmarks", "benchmark the system", "make sure the selected database can handle our load with our required response times"
I know we do that -- cause part of my job is showing that time after time.
So, why is it when we buy an application the only thing we look at is:
o does the checklist satisfy our wish list more than the other guys
o are the screens really pretty and flashy
we don't seem to include most of the times:
o can it meet our performance goals
o does it use system resources in a reasonable fashion (eg: it isn't a sqlserver application that opens 20 database connections for each "end user" even though it is running on Oracle and you know you don't do that on Oracle)
o does it take advantage of the underlying technology in a reasonable fashion
An application buy almost never involves the technology team -- yet the technology team is the one responsible for making it "go fast")
Just an observation -- I'll bet the technology team responsible for implementing and maintaining this system had very little to say about which vendor to use.
what to do with generic data model?
Philip, November 02, 2004 - 1:10 am UTC
Tom,
You are right, our tech team had no say about which vendor to use or in reviewing the vendor's design.
At the other extreme, we have one vendor that promised to deliver an OLTP system running on our Oracle database. The first time I
received their SQL install script, it was horrible. It turned out that they had developed the application on their preferred database
server, MS SQL. The install script they gave me was written in MS SQL T-SQL and they expected it to work on Oracle without any changes.
I told them that Oracle SQL syntax and MS-SQL syntax are quite different.
I also gave them a copy of the Oracle 9i R2 installation CDs and asked them to test their software on Oracle
before coming back to us for UAT.
Six months later, in the first week of UAT, the performance of the application was terrible. There was one batch update function (updating 15,000 accounts) which took 5 hours to complete on our Oracle database server.
Our Oracle database server isn't very powerful; it is a SUN Fire V220R with a 1 GHz CPU and 4 GB of RAM. Allocated db buffer 1 GB, shared pool 260 MB.
The acceptable completion time for the batch update function is 45 mins. When I traced the application (for a duration of 5 mins) while they were running the 5 hr job, I found that the VB application was sending
23,000 unique sql statements to the db server. I could see library cache latch contention and
blocking sessions (they didn't index any of their foreign keys and there are 267 tables in their application), and a large amount of data being sent to the client (from what I can tell, for each account update the VB app
selects all the information related to the account (7 select statements), processes it on the front end and updates the database).
Later on I set the cursor_sharing=force parameter, hoping that would reduce library cache latch contention, because there was no way they were going to change their code given the
tight timeframe. But it didn't improve anything.
My management was upset and dissatisfied with Oracle database performance. I was able to convince them that it had nothing to do with Oracle performance but with the
inability of our vendor to code their application to work with Oracle.
The vendor's response to the performance problem was that
our database server is too low spec. They said they tested their application on a 3 GHz dual processor Intel box running MS-SQL server, with a high end client PC, and it took them around
37 mins to complete the same function. They also tested their application on a 3 GHz dual processor Intel box running Oracle 9i R2 and they said it achieved 1.30 hrs.
I tried putting the Oracle database on one of the high end (dual 3 GHz and 1 GB RAM) Windows 2003 boxes, but it still took 5 hrs to complete the batch update. That makes me wonder whether the vendor was telling the truth about
running their application on an Oracle 3 GHz Intel box.
I told the vendor about hard parsing in Oracle and the requirement to index foreign keys that are involved in joins. But their response was that MS-SQL server doesn't have a hard parsing issue or a performance issue with
non-indexed foreign keys. I also read the MS SQL Server performance tuning documentation; it didn't really mention that database performance will be bad if the application
doesn't use bind variables.
The end of the story is: management decided to ditch Oracle and use MS-SQL server because the vendor cannot write their program to work with an Oracle db. Now I am learning MS-SQL server...
I have the following questions:
1. Do you think the hard parsing and the un-indexed foreign keys are the major cause of the performance issue?
2. Do you think it is bad design for the application to send 23,000 statements to the database, and should they be replaced with stored procedures? The vendor is
reluctant to use stored procedures because they are going on about the database independence BS again.
3. Also, the vendor uses a 3-tier approach but really a 2-tier implementation. The application server was doing nothing while the client PC and database were doing
most of the work. The vendor also believes that there is no need to handle transaction locking in their application because it is the job of the
application server (they are using MS 2003 server as the application server) to handle locking and serialization. I don't think the vendor is correct. What is the job of the application server in relation to transaction locking and other functions?
4. We now have MS SQL server in place instead of Oracle, and the batch update function is taking 37 mins to complete. In SQL server, there is no need to configure
the database buffer cache, shared pool size etc. as they are configured automatically by the engine. There is not a lot of monitoring I can do on SQL Server to see whether it has
bind variable issues. Is there a major difference between SQL Server and Oracle in how they handle parsing/generating execution plans and the reuse of sql statements?
5. Do you think the big performance difference between the Oracle 1 GHz UNIX box and
the 3 GHz dual processor Intel MS SQL box was because parsing/generating execution plans uses a lot of CPU, so that a faster machine would definitely be faster?
Philip
November 02, 2004 - 7:39 am UTC
Your management is the problem here. How could they possibly negotiate a contract whereby the vendor didn't even develop on the database they targeted? This is primarily their (mgmt) fault in my opinion. They do not seem to know how to procure software.
there is zero percent chance that an application designed for database X will perform adequately on database Y.
That is, if you wanted sql server and they knew only oracle -- it would fail just as miserably.
the locking models between the two databases -- TOTALLY different, wildly different.
the way you program sql server is different from the way you program oracle is different from the way you program sql server.
1)
I don't like to speculate -- i hate to guess. could hard parsing and unindexed fkeys be an issue? probably - but was this a multi-user situation? In a single user mode, easily 2/3rds of the runtime of an INSERT of a single row could be parse time (if it takes 1 second to insert -- .66 seconds would be parse time, .33 would be actual useful code). In a single user mode, the unindexed fkeys would not be relevant.
2)
to me the BEST and ONLY way to be DB independent is to put 100% of your code into stored procedures (see the sketch after this list).
why -- cause we lock different, cause we have different features (if you want a bill of materials explosion in Oracle -- use connect by, in sql server -- use temporary tables and a recursive function), cause we are different.
that and you'll never find an individual with the depth of knowledge on oracle, sqlserver, db2, informix, sybase, whatever to make a truly "database independent, works the same in all databases and works well in all databases" application. You will however find many people proficient in sqlserver OR in oracle OR in db2 -- and they can code the perfect transactional api in one of the databases. Of course that means you actually need to have an up front design, some specifications, a clue..... (the implementors do).
and if you contracted this vendor to build you a CUSTOM SYSTEM to spec -- why do you care about their "database independent needs", they quite simply are not your concern. Your mgmt should perhaps have actually litigated to get their money back (but they probably never negotiated any sort of service agreement either....)
3) transactions and locking are the job of the database
4) huge difference - they do not have a shared pool at all. they are not built for many people working simultaneously in the same database.
doesn't 37 minutes seem like an absurdly long time?
5) no guessing.
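As referenced in point 2, a minimal sketch of the "transactional API" idea: the application calls one procedure per business transaction and each database gets its own best implementation behind the same specification (the package and procedure names are hypothetical):
create or replace package account_api
as
    -- one call per business transaction; the Oracle body behind this spec
    -- can use Oracle locking, bulk SQL, connect by, etc., while a SQL Server
    -- implementation of the same interface would be written entirely differently.
    procedure batch_update_accounts( p_batch_id in number );
end account_api;
/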
Role of Application Server
Philip, November 12, 2004 - 4:19 am UTC
Thanks for your reply.
You mentioned "the locking models between the two databases -- TOTALLY different, wildly
different.
the way you program sql server is different from the way you program oracle is
different from the way you program sql server."
I had a discussion with the developers, who have more knowledge of MS SQL Server than Oracle. I disagreed with one comment made by the developers about the need to understand the database
locking model when an Application Server is used.
They are saying that if you program in a 2-tier model then, whether using MS SQL or Oracle, the programmer must understand how the database
handles locking. But if you program in a 3-tier model, the programmer doesn't have to understand how the database
handles locking, because it is the job of the Application Server to handle these tasks for you.
Are they correct?
Does SQL Server 2000 really support row level locking? Is it as good as Oracle's?
What kind of features does Oracle Application Server provide in terms of database services? And what advantages does it have over an open source AS like JBOSS?
November 12, 2004 - 7:12 am UTC
laughed so hard I fell right out of my chair
<quote>
But if you program in a 3-tier model, the programmer doesn't
have to understand how the database
handles locking, because it is the job of the Application Server
to handle these tasks for you.
</quote>
oh ho ho ho ho, that is really funny.
that developer has a thing or two or three left to learn.
MS SQL Server does recommend bind variables
jim, November 12, 2004 - 10:40 am UTC
If one actually reads the MS SQL Server docs with an eye to improving performance, MS recommends using bind variables (the same advice Oracle gives). So the vendor not using bind variables should be shot (metaphorically; I am not advocating actual violence). It is amazing how many vendors who are "database independent" are "database ignorant". I am working with a vendor now and their documentation is so full of mythology about Oracle best practices that it would have been better if they said nothing. It would have been better if they pointed people to the Oracle documentation! (Rebuild indexes frequently, put things in one extent, the cost based optimizer builds indexes in one direction and the rule based optimizer builds them in another...) This is a major CRM vendor that is traded on a major stock exchange!
great joke that caused Tom to fall off his chair
Philip, November 13, 2004 - 1:00 am UTC
Tom,
I was shocked and in total disbelief when I first heard the statement 'no need to understand database locking in a 3 tier model' from this 'Senior' developer in my company. This 'Senior' developer has also made the following bizarre statements about database programming:
1. Bind variables in Oracle programming - never heard of them, and MS SQL server doesn't have this problem because it is so advanced.
2. Integrity constraints - no need to enforce them at the database level because the application handles them already.
3. You never have concurrency issues when you use all the default features of MS SQL database statements. No need to worry about any extra features (in regards to transaction locking etc.) because the database should handle them for you. There is no need to worry about how to specially code updates/inserts and selects to avoid blocking in a multiuser environment.
When I confronted the 'Senior' developer regarding point 2, the response was that it is a nuisance to handle database exceptions if integrity constraints are already enforced at the application level. The application is coded so well that it checks for integrity violations before data is put into the database.
This arrogant 'Senior' developer doesn't think there could be bugs in the code. How can any developer think it is a nuisance to handle exceptions?
Another joke: this 'Senior' developer is developing (alone) a big J2EE application that is going to be used by ........ 2 persons. This project has been going on for 1 year.
Hope you like these jokes.
November 13, 2004 - 9:19 am UTC
1) sqlserver does not only have them -- they encourage them.
If the performance issue isn't enough to convince, please have their mgr do a search on google for
"sql injection"
(don't have them do it, let their boss do it -- get their boss very scared of sql injection -- as they should be -- and have them then ask this Sr developer why he isn't concerned as well)
2) that is the silliest thing I've heard. It is impossible to enforce cross object (foreign key) or cross row integrity constraints without locking large sections of data or entire tables. And what you have to lock differs by database (as we have fundamentally different locking models)
3) cause they don't do real transactions by default -- as stated in many places. Locks are a scary thing in sqlserver -- the database quickly grinds to a halt.
He is right that you in general shouldn't have to worry about how to code update/inserts and selects to avoid blocking in a multi-user environment -- but you cannot be using sqlserver at that point in time....
"it is a nuisance to handle database exceptions if the integrity constraints were already enforced by the application" -- hmmmmmm and what exceptions would be raised by the database if the data were clean? There wouldn't be any!
A reader, November 28, 2004 - 9:08 pm UTC
Tom,
You often advocate the use of a single query instead of pl/sql if possible. Can the same apply for reports? A new index is being added to a table with slow reports. However, I am just looking for a quick pointer as to what kind of database features (pl/sql) should be used in reporting. Also, should any specific table/schema designs be considered? Thanks in advance as always.
November 29, 2004 - 8:03 am UTC
my mantra is simple (and fresh in my mind as I just revised this chapter this weekend :)
*You should do it in a single SQL statement if at all possible.
*If you cannot do it in a single SQL Statement, then do it in PL/SQL (as little PL/SQL as possible!).
*If you cannot do it in PL/SQL, try a Java Stored Procedure. The number of times this is necessary is extremely rare today with Oracle9i and above.
*If you cannot do it in Java, do it in a C external procedure. This is most frequently the approach when raw speed, or the use of a 3rd party API written in C is needed.
*If you cannot do it in a C external routine, you might want to seriously think about why it is you need to do it...
Generic lookup model or not...
A reader, December 01, 2004 - 3:41 pm UTC
Hi Tom,
Today, we have a document table that has FK references to 16 lookup tables (with ID and description columns), meaning that to get the descriptions from the lookup tables I do a join with 16 tables. This join works, but it is not too fast...
Now the new requirements call for dynamic lookup definitions. After talking with them, they would actually accept 30-40 lookup tables (with 50,000 - 100,000 rows in each); the dynamic definition is not a must.
The problem is that I think this join with 30-40 tables would be very slow; on the other hand, the more generic model has the disadvantages mentioned... Could you give some suggestions on the design here?
Thank you.
December 01, 2004 - 7:45 pm UTC
what is a "dynamic lookup definition"?
do you get all of the attributes all of the time?
is this lookup gone "mad"? or are these reasonable lookups -- really needed ones (eg: you are not changing say M/F for male/female into a lookup, right?)
A reader, December 02, 2004 - 4:32 am UTC
Thanks for the fast reply.
By "dynamic lookup definition" I meant that the user can define new lookup types, and today that would mean a new table that must be connected to the document table (new FK column).
<
do you get all of the attributes all of the time?
>
Not all attributes are registered on all documents, but we have inserted a "default/empty" row in each lookup table so we don't have to outer join.
<
is this lookup gone "mad"? or are these reasonable lookups -- really needed
ones (eg: you are not changing say M/F for male/female into a look right?)
>
Well, most of the data in these lookups comes from ERP systems and is quite static (like accounts, currency types, departments etc), and you have lookups like "projects" which get a new entry when a new project starts (this table is still just ID and descr because the details about projects are held in a different system).
The typical use of the system is: a document of type invoice comes in, you register the amount etc. as free text, and then you have to register which account to use and which department and/or project will pay the invoice. The user should not be able to register the invoice on a non-existent project - that's the reason for the lookup.
December 02, 2004 - 7:46 am UTC
what i meant by "do you get all of the attributes all of the time"
was -- ok, so you have a 16 table join, do you retrieve ALL of the attributes in your application MOST of the time -- or do you just get a couple here and there. If you do not dereference 10 out of 16 of the lookups, you did 10 joins that were not "useful". We can remove them. but if you always "select * from this_nasty_join", you cannot.
A reader, December 02, 2004 - 8:59 am UTC
I hope I understand you correctly...
We use this functionality in 3 places. In one module (an overview screen) we get data from 3 lookups, and there the sql has only these 3 lookups.
In the other 2 (the document editor and archive modules) we get all of the attributes with one query. The "useless" lookups return "" for the value of "ID" and "Descr" but they are still in the query. So in 2 of 3 cases we get all the attributes...
December 02, 2004 - 11:20 am UTC
if you mostly get all of the attributes, you are in "join heck" -- not too much we can do about it.
A reader, December 02, 2004 - 1:11 pm UTC
So it wouldn't be better with one big, general lookup?
A solution I thought about was, instead of this huge 40 table join, to select the references from the doc table and build a batch of 40 sqls (sent as one job from the application, or done in a procedure in the DB) that would get the correct row from each table. These single selects would be very fast, but do you think it makes sense?
Thanks for your time.
December 02, 2004 - 7:09 pm UTC
you still have a 40 table join -- 39 of the tables have the same name is all.
you would have the hottest table on the planet and it would crush you.
And these lookups still need to go to 39 real tables do they not. with all of the department information, project information and so on.
Very useful!
A reader, December 02, 2004 - 4:02 pm UTC
This is a extremely useful discussion on design issues. Top-notch! Thanks, Tom.
A reader, December 03, 2004 - 5:15 am UTC
Case 1.
<
you still have a 40 table join -- 39 of the tables have the same name is all.
>
Sorry, I didn't explain what I was thinking here. I agree that if the resultset should have the same structure as today then you still need the 39 "tables", except that each of these tables has 40 times more data in it. So it's a braindead solution :). But, to avoid the big join, I would define a new coupling table with basically 2 FK columns: docID and lookupID (the PK column in the big lookup) and one row for each defined lookup on a document (eg 40 rows/doc). I would change the code to handle that I now get 40 rows and not one row/doc with 40 columns... I would expect this to be fast for "get all data on document X" but to suck for "get all documents with department Y or project Z", reports, etc. Correct?
So I guess this is not an approach for "real life" apps...
Case 2.
<
And these lookups still need to go to 39 real tables do they not. with all of the department information, project information and so on.
>
Yes, there would be 39 real tables; I would just relieve the db of doing the big bad join, but I would have to merge 2 result sets in code. I know that you generally say that databases are built for joins, but is that still correct for this join? I remember that a while ago we had problems with Sybase because they had a 16 table limit on joins...
So it would be:
Sql1.
Select lookupID1, lookupID2, ... from doc where docId=X
I get lookupID1, lookupID2 in code and build an sql like this:
select lookupDesc1, lookupDesc2, ... from lookup1, lookup2, ... where lookupID1=X and lookupID2=Y etc. (the lookupIDs are PKs of course)
The question is whether this approach would be faster than the join. It's easy to answer that the join is way faster if you have 3 rows in the doc table and 10 in the lookups, but is that still correct for 10+ million rows?
December 03, 2004 - 8:06 am UTC
now you are talking "generic funky data model" -- i wrote about that in "Effective Oracle by Design". ugh.
significant CPUOTHER utilized query on statspack report
SREENIVAS, December 22, 2004 - 5:21 am UTC
hi tom,
can you write this query in a more effective way on a 9.2.0.1.0 database?
SQL Statistics
~~~~~~~~~~~~~~
-> CPU and Elapsed Time are in seconds (s) for Statement Total and in
milliseconds (ms) for Per Execute
% Snap
Statement Total Per Execute Total
--------------- --------------- ------
Buffer Gets: 170,480,244 557,125.0 15.81
Disk Reads: 145,721 476.2 3.28
Rows processed: 22,983 75.1
CPU Time(s/ms): 985 3,219.1
Elapsed Time(s/ms): 971 3,172.1
Sorts: 918 3.0
Parse Calls: 306 1.0
Invalidations: 0
Version count: 1
Sharable Mem(K): 72
Executions: 306
SQL Text
~~~~~~~~
SELECT distinct A.COMP_APPL_ID AS "App No", A.APPDATE AS "App Date",
       B.CUSTOMERNAME AS "Customer Name", D.DESCRIPTION AS "Product",
       D.SUBTYPEDESCRIPTION AS "Sub-Product", C.DESCRIPTION AS "Location",
       E.BPNAME AS "Sourced By"
  FROM LOT_CUSTOMER_T B, LOT_GLOBALLOAN_T A, COMPANY_GENERIC C, PRODUCT D,
       BP E, LOT_WORKFLOWSTAGE_DTL F
 WHERE A.COMP_APPL_ID = B.COMP_APPL_ID(+)
   AND A.SRC_BPID = E.BPID
   AND C.GENERICID = A.CG_LOCATIONCODE
   AND B.GUAR_COAP_FLAG(+) = 'P'
   AND D.PRODUCTCODE = A.PRODUCTCODE
   AND NVL(A.PRODUCTSUBTYPE,'NULL') = NVL(D.PRODSUBTYPECODE,'NULL')
   AND A.COMPANYID = 1000
   AND A.COMP_APPL_ID = F.comp_appl_id
   AND F.EMPLOYEECODE = '10030'
   AND F.STAGESTATUS = 'P'
   AND F.STAGE = 'DDUPDEDS'
 ORDER BY 1
All Optimizer Plan(s) for this Hash Value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shows all known Optimizer Plans for this Hash value, and the Snap Id's they
were first found in the shared pool
-> ordered by Snap Id
Plan
Hash Value Snap Id Cost Optimizer
------------ -------- ---------- --------------------
3821711819 120 533 CHOOSE
3821711819 126 540 CHOOSE
3821711819 129 546 CHOOSE
-> Rows indicates Cardinality, PHV is Plan Hash Value
-> ordered by Plan Hash Value
--------------------------------------------------------------------------------
| Operation | PHV/Object Name | Rows | Bytes| Cost |
--------------------------------------------------------------------------------
|SELECT STATEMENT |----- 3821711819 ----| | | 533 |
|SORT UNIQUE | | 1 | 261 | 527 |
| NESTED LOOPS | | 1 | 261 | 520 |
| NESTED LOOPS OUTER | | 1 | 221 | 519 |
| NESTED LOOPS | | 1 | 179 | 517 |
| NESTED LOOPS | | 1 | 139 | 516 |
| MERGE JOIN | | 9 | 1K| 489 |
| SORT JOIN | | 409 | 18K| 12 |
| TABLE ACCESS FULL |PRODUCT | 409 | 18K| 2 |
| SORT JOIN | | 3K| 259K| 478 |
| TABLE ACCESS FULL |LOT_GLOBALLOAN_T | 3K| 259K| 431 |
| TABLE ACCESS BY INDEX ROWID|LOT_WORKFLOWSTAGE_DT | 1 | 20 | 3 |
| INDEX RANGE SCAN |LOT_WORKFLOWSTAGE_IN | 48 | | 2 |
| TABLE ACCESS BY INDEX ROWID |COMPANY_GENERIC | 1 | 40 | 1 |
| INDEX UNIQUE SCAN |COMPANY_GENERIC_PK | 1 | | |
| TABLE ACCESS BY INDEX ROWID |LOT_CUSTOMER_T | 1 | 42 | 2 |
| INDEX RANGE SCAN |LOT_CUSTOMER_T_INDX1 | 100 | | 1 |
| TABLE ACCESS BY INDEX ROWID |BP | 1 | 40 | 1 |
| INDEX UNIQUE SCAN |BP_PK | 1 | | |
--------------------------------------------------------------------------------
could you specify whether the indexes are useful or not for this query.
moreover, the parameter values are
optimizer_index_caching=0
optimizer_index_cost_adj=100
regards
SREENIVAS
December 22, 2004 - 9:51 am UTC
that's a lot of work getting done in 3 seconds, isn't it.
how many rows does this query return?
you need to look at the tkprof as much as the explain plan.
significant CPUOTHER utilized query in responsetime=servicetime+waittime
SREENIVAS, December 23, 2004 - 1:36 am UTC
Thanks for your quick response.
please check the tkprof of this query from different sessions.
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 2 0.24 0.21 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 2 0.48 0.48 0 10534 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 6 0.72 0.69 0 10534 0 0
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30 (SCB)
Rows Row Source Operation
------- ---------------------------------------------------
0 SORT UNIQUE
0 NESTED LOOPS
0 NESTED LOOPS OUTER
0 NESTED LOOPS
0 NESTED LOOPS
201 MERGE JOIN
201 SORT JOIN
201 TABLE ACCESS FULL LOT_GLOBALLOAN_T
201 SORT JOIN
32 TABLE ACCESS FULL PRODUCT
0 TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
191 INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
0 TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
0 INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
0 TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
0 INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
0 TABLE ACCESS BY INDEX ROWID BP
0 INDEX UNIQUE SCAN BP_PK (object id 23944)
===============================================================================================
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 2 0.16 0.15 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 8 7.15 7.00 787 1119033 0 74
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 12 7.31 7.16 787 1119033 0 74
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30 (SCB)
Rows Row Source Operation
------- ---------------------------------------------------
20 SORT UNIQUE
54 NESTED LOOPS
54 NESTED LOOPS OUTER
54 NESTED LOOPS
54 NESTED LOOPS
187862 MERGE JOIN
32 SORT JOIN
32 TABLE ACCESS FULL PRODUCT
187862 SORT JOIN
187862 TABLE ACCESS FULL LOT_GLOBALLOAN_T
54 TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
178625 INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
54 TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
54 INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
54 TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
54 INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
54 TABLE ACCESS BY INDEX ROWID BP
54 INDEX UNIQUE SCAN BP_PK (object id 23944)
=================================================================================================
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 2 0.24 0.23 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 8 7.06 6.95 157 1119054 0 76
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 12 7.30 7.18 157 1119054 0 76
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30 (SCB)
Rows Row Source Operation
------- ---------------------------------------------------
20 SORT UNIQUE
56 NESTED LOOPS
56 NESTED LOOPS OUTER
56 NESTED LOOPS
56 NESTED LOOPS
187862 MERGE JOIN
32 SORT JOIN
32 TABLE ACCESS FULL PRODUCT
187862 SORT JOIN
187862 TABLE ACCESS FULL LOT_GLOBALLOAN_T
56 TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
178625 INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
56 TABLE ACCESS BY INDEX ROWID COMPANY_GENERIC
56 INDEX UNIQUE SCAN COMPANY_GENERIC_PK (object id 24064)
56 TABLE ACCESS BY INDEX ROWID LOT_CUSTOMER_T
56 INDEX RANGE SCAN LOT_CUSTOMER_T_INDX1 (object id 24473)
56 TABLE ACCESS BY INDEX ROWID BP
Please advise how to tune this query further.
from
SREENIVAS
December 23, 2004 - 11:15 am UTC
investigate a better indexing strategy. you see your index is
56 TABLE ACCESS BY INDEX ROWID LOT_WORKFLOWSTAGE_DTL
178625 INDEX RANGE SCAN LOT_WORKFLOWSTAGE_INDX3 (object id 24658)
you get 178,625 rows from the index, go to the table, and find that only 56 of them apply. Perhaps your index needs to be "fattened up" to include more relevant columns.
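For illustration only -- the actual predicates are not shown above, so the column names here are hypothetical -- if the range scan is driven by, say, LOAN_ID but the query also filters LOT_WORKFLOWSTAGE_DTL on STAGE_STATUS, putting the filter column into the index lets most rows be discarded before the table is ever visited:

-- hypothetical column names; the real ones come from your WHERE clause
create index lot_workflowstage_indx3_fat
    on lot_workflowstage_dtl (loan_id, stage_status);

The goal is for the INDEX RANGE SCAN to return something close to the 56 rows the table access keeps, instead of 178,625 rowids that are then mostly thrown away.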
sreenivas, December 24, 2004 - 6:39 am UTC
thanks a lot.
Funky model or not?
Mikael, January 19, 2005 - 8:30 am UTC
Tom,
I am modelling a database that will be used to format certain reports - and every report's look is dependent on each contract from each customer.
Anyway the problem is that this is what I got (it is hard to describe, but here is a go):
500,000 unique products that each have 60-70 attributes out of 1500 possible attributes (varying types, and overall no attribute is used more than others).
Each attribute has some descriptive data (of course): unit of measure, definition, etc.
In addition, each product belongs to a classification according to 6 of the attributes - and depending on this classification:
- the attribute description (in different languages) may change.
- some of the attributes also have defined values they can take - with a corresponding refined description (in different languages).
This is what I need to do:
I select a couple of attributes that a product must have and a classification it must conform to. I receive one product (in this example).
So I have a product which happens to have a classification and 61 attributes. At least one of these attributes has a classification matching the product classification - hence I need to select that description in the requested language for this product. It also has a defined value for the classification it has - so I also need to select the description for that defined value in the requested language. Of course I also need to select the attribute's describing data.
Depending on the customer contract I shall return some describing data and some or all of the attributes depending on the product selected.
I am thinking of a view based on two tables with attributes (since the number of columns is more than 1024) for the product table, plus an attribute table with the attribute describing data, and possibly one or two subtables with descriptions/classification and values/classification respectively. (Finally, possibly a customer contract table for determining what data should be selected, i.e. what attributes, language and describing data should be delivered for a certain product - if this should be done in the DB and not in the final stages of our report construction widget.)
Monstrous!
Any non-funky suggestions?
Thanks,
/Mikael
January 19, 2005 - 10:53 am UTC
I can only reiterate the caveat above about generic data models and their lack of maintainability (one of their supposed BENEFITS!?!), their lack of performance, and their lack of flexibility when querying (again, one of their supposed benefits?!)
...
500,000 unique products that each have 60-70 attributes out of 1500 possible
attributes (varying types, and overall no attribute is used more than others).
.......
you really have 500,000 subtypes? truly?
...
In addition each product belong to a classification according to 6 of the
attributes - and depending on this classification:
......
seems to counter that -- sounds like a structure is starting to form there....
What about Oracle Objects?
Kerry, January 19, 2005 - 11:44 am UTC
Would you recommend using Oracle's object-relational capabilities for the kinds of data discussed here (i.e. data requiring extensibility or variant subtypes)?
I am currently using XML in a product catalog model, and was happy to see your endorsement, albeit guarded, of the XML approach in some of the responses here. We chose CLOB storage for the XML, and have VERY few function-based indexes to search within the XML. As you suggest, we have "real" database columns for some frequently searched attributes. Our particular needs for the attributes stored in the XML are primarily BROWSE rather than SEARCH, making the problem a little more tame.
January 20, 2005 - 9:25 am UTC
Nice
Catherine, January 21, 2005 - 1:24 pm UTC
Hi Tom,
Sorry for the disturbance, but I need a query which lists the number of employees working under an employee. I mean the query output should be in the format of
Mgr_ename Mgr_empno count(*)
--------- ---------- -------
.... ... .. ...
This is what I have tried.
SQL> select ename,empno from emp
2 where empno in( select mgr from emp)
3 /
ENAME EMPNO
---------- ----------
FORD 7902
BLAKE 7698
KING 7839
JONES 7566
SCOTT 7788
CLARK 7782
6 rows selected.
I need that count column. Could you please provide the query?
If possible, I would like you to replace the count column with
the list of enames. Please help.
Bye
January 21, 2005 - 7:36 pm UTC
select mgr, count(*) from emp group by mgr;
gives you the count by mgr and
select emp.ename, mgr.*
from emp, (select mgr, count(*) from emp group by mgr) mgr
where emp.empno = mgr.mgr
/
gives you the name of the mgr as well.
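The second half of the question - replacing the count with the list of names - was not addressed; one possible way (a sketch, assuming Oracle 11gR2 or later, which post-dates this exchange and provides LISTAGG) is:

select mgr.ename, mgr.empno, x.cnt, x.emps
  from emp mgr,
       (select mgr, count(*) cnt,
               listagg(ename, ', ') within group (order by ename) emps
          from emp
         where mgr is not null
         group by mgr) x
 where mgr.empno = x.mgr
/

That returns, per manager, the count of direct reports plus a comma-separated list of their names.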
A reader, February 11, 2005 - 10:42 am UTC
> you really have 500,000 subtypes? truly?
Actually it seems as if some of the products conform to the classification, but far from all - so at best I'd still have 250,000 subtypes... and now I hear that some new products will add to the attributes, and when they buy companies into their concern whose products do not conform to their attribute catalogue, hundreds of attributes will be added to the data structure ...
I guess what I was fishing for here is: How can you handle companies with these vast classification models.
By the way, they are used to matching 2 (3) tables to get the products - one table with values and one table with attribute names and unit of measure (and a third describing the attribute itself and, depending on the product classification (which can hold wild cards), a description).
My thought was that I'd move away from that design to get better performance, but now it feels as if I am wandering in circles...
Any ideas on how to handle these kinds of records with 1000's of possible attributes without picking up the non-maintainable "object" model? (with Oracle). Say that I'd like to search among the 500,000 products WHERE (ATTR143 > 4 OR ATTR567 < 100 OR ATTR1789 < 100) AND ATTR50 = 'AAAA'
Brgds,
/Mikael
February 12, 2005 - 7:49 am UTC
what possible application could use this? I cannot fathom one. So, the stuff must be so generic, so inapplicable to programming with, that this information is only used when looking at the object itself (eg: it need not be structured; effectively, no one uses it).
so, maybe you can take the N attributes that are consistent -- you will find there are dozens of them (really) and glob the rest into an xmltype and forget they exist. If someone truly wants them, they can use extract to get them.
If you have over a 1/4 million unique sets of attributes -- I cannot imagine anyone using that query you propose. Like 2 rows would even have those attributes (eg: they already know precisely the product they are going after)
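A minimal sketch of that suggestion (table and attribute names hypothetical; extractvalue() is shown because it matches the era of this exchange, newer releases would use XMLQUERY/XMLCAST):

create table product
( prod_id     number primary key,
  name        varchar2(100),
  color       varchar2(30),     -- the well-known, frequently queried attributes
  made_in     varchar2(30),
  extra_attrs xmltype           -- the long tail of rarely used attributes
);

-- dig an oddball attribute out only when someone really asks for it
select p.prod_id,
       extractvalue(p.extra_attrs, '/attrs/attr[@name="ATTR143"]') attr143
  from product p
 where p.made_in = 'CHINA';

The frequently used attributes stay as real, indexable columns; the rest ride along in the XML and cost nothing until someone asks for them.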
Limit on number of columns per table?
Matt, February 18, 2005 - 5:19 am UTC
I was recently asked whether there was any physical limit on the number of columns allowed per table. Having checked this in the reference doco I see that there is a documented logical limit of 1000 columns per table.
What is it that restricts the number of columns to this figure?
Obviously having many columns per table may impact the efficiency of DML. I would say that an entity should have as many columns as makes sense within the definition of the business problem. However, my friend was wanting to create a table with 10,000 columns. Fundamentally, I see all kinds of issues with this surrounding the definition of the business problem he is trying to solve and the maintainability of any system created like this (and many others).
However, my friend says that this is a small part of a "temporary" stop-gap solution required to make a prototype "work".
I'm interested to see what response you have to this, it seems to me that there is a very shortsighted attitude from the business decision makers on this one.
Regards,
February 18, 2005 - 8:49 am UTC
that is a physical limit, the logical limit is a lot smaller :)
and it used to be 255; 1,000 is relatively new.
It is just a reasonable upper bound.
10,000 columns!!?!?!
prototypes are supposed to prove the viability of something; this would be an anti-prototype. A prototype that does something totally wrong but appears to succeed will only make them think "this might be a good thing".
they can use varrays (really bad idea too if you ask me), but I'd rather have someone who does data modelling sit with them for an afternoon and straighten this out -- get the real requirements and explain the right way.
9 days too late
A reader, March 24, 2005 - 5:58 am UTC
happy birthday anyway and best wishes
for the next decade
and what's wrong with that? i'd like to know...tutudududu
k., April 10, 2005 - 5:15 am UTC
<quote source="your original reply">
<part1>
You are trying to put a generic
layer on top of a generic layer
</part1>
<part2>
and it fails each and every time except for
the most trivial of applications.
</part2>
</quote>
part1:
------
but you see that happening all the time, don't you?
assembler is a generic layer on top of computer hardware.
what do we need c++ for?
and heeey - what do we need pl/sql for? it cannot compete with c++ (as far as performance and flexibility are concerned)!!
well maybe pl/sql cannot do all those things that c++ is capable of, but certain people (and they are not few) are willing to give up all that flexibility and performance to gain something else.
i am a developer - my task is to do my best to fulfil the wishes of my clients. they say "we need this and this"
and i have to find (what i think is) the best way to achieve that. oracle is here to serve me - i'm not here to serve oracle. and indeed oracle serves me well - does not even complain when i only use a subset of its [his/her? :-)] capabilities - only those that i do need.
part2:
------
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:38264759390157
<quote source="another thread, a few days ago">
You'll live to regret that generic model, every time I've used them every
time (and I have, twice, for real applications, 15 and 13 years ago, never
again)
</quote>
well, it seems to me that the problem lies here:
"15 and 13 years ago"
using java 20 years ago (had it existed) to do things it does today would have come to a similar end.
computers are getting faster and faster. there are many ways to take advantage of that - we can just let it improve the performance of our applications, or we can say
"hey - our application runs pretty well, what if we use the hardware performance gain to add new features - something to satisfy our clients - because it's them who we do all this for - we make their lives easier, so why not give them some funky functionality, now that there are enough resources?"
typically we try to get the best of both approaches - improve the performance + add functionality you would not have imagined possible in the real world a few years ago.
April 10, 2005 - 10:31 am UTC
You see, I'm all about the data.
Applications come, and they go.
The data stays forever.
Relational databases haven't changed radically in 15 years. The problem with the model of 15 years ago exists today.
We will have to agree to disagree, I doubt in this case either of us will convince the other ;)
You are so right.
Vinayak, April 10, 2005 - 7:58 pm UTC
Tom, you are so right about the design.
In one of the 3rd-party applications I've worked on, this was a performance killer. In their application, what you do is that you create your product catalog defining the products and then for each product, you define its attributes. To some extent, I understand the need for them to have a flexible data model to support this. So they had a table that contained each row for each product attribute.
But when we started writing queries/reports, a simple query involved a 10-way self-join and the performance was not acceptable. If you were searching for just one order, it was taking about a minute or so, which was not acceptable at all.
So what I did was create another table that has all the products' attributes as columns, initialize it with the existing data, and then write a trigger to populate it automatically through the application. The performance impact of this trigger on the INSERT was negligible, not even noticeable, but the performance gain while querying data was HUGE. The query that used to take 60 seconds now runs in a fraction of a second.
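A hedged sketch of that kind of flattening trigger (all names hypothetical; a real one would cover every attribute the reports need):

create or replace trigger product_attr_flatten
after insert or update on product_attribute      -- the generic name/value table
for each row
begin
    merge into product_flat f                    -- the pivoted, report-friendly copy
    using (select :new.prod_id as prod_id from dual) s
       on (f.prod_id = s.prod_id)
    when matched then update set
         f.color  = case when :new.attr_name = 'COLOR' then :new.attr_value else f.color  end,
         f.p_size = case when :new.attr_name = 'SIZE'  then :new.attr_value else f.p_size end
    when not matched then insert (prod_id, color, p_size)
         values (:new.prod_id,
                 case when :new.attr_name = 'COLOR' then :new.attr_value end,
                 case when :new.attr_name = 'SIZE'  then :new.attr_value end);
end;
/

The reports then query product_flat with plain column predicates instead of a 10-way self-join.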
Surrogate Key
A reader, December 07, 2005 - 11:55 am UTC
At what point would you consider using a surrogate key? You mention if the data is large, but would you consider a VARCHAR2(20) field too large (it will be used in tens of millions of rows)?
So, we'd have a table LOOKUP( id NUMBER, key VARCHAR2(20) ) and do lookups when we needed the "natural" key. Now, though, you've got me thinking this might be the wrong approach.
Any thoughts?
December 08, 2005 - 1:24 am UTC
varchar2(20) is tiny.
I would, assuming it is not subject to update, use the natural key. No good reason not to (assuming it is not subject to update)
Where to draw the line
Loz, March 16, 2006 - 2:29 am UTC
Hi Tom,
I've got a situation that is crying out for the "generic database". The rest of the database is properly modelled and this is just one small aspect of it. I am reluctant to go down this path but can't think of how to model it any other way.
Consider a system where there are some products and people. Each person on the system has some basic, common information about them, as well as some additional information that needs to be recorded against them when they buy a given product. For example, if they buy product X we need to know their blood group; product Y requires the number of bikes they own. Each product is essentially the same thing, just with extra information recorded against the purchasers. This extra information is simply used for reporting groups and possibly OLTP searches, although it is very likely we will already know the PK of the person in almost every OLTP situation.
My initial thought is that I should have these tables
product
extra_data_types
extra_data_types_allowed_for_a_product
extra_data_for_person_for_product
person
person_to_product
Another problem is that if we needed to sort or compare them, we would need a pretty messy bit of SQL (or a mangled copy of the data) to determine a suitable value to compare when they are not intended to be interpreted as strings (e.g. 2<100 but '2'>'100').
A variation would be to have several reserved columns for various types but I can't see how to do this without dynamic SQL which I want to avoid.
One of the goals is that we can add an extra product and additional attributes simply by adding data and not changing the data model. The meta data would allow us to build a screen & reports based upon the required extra data. I know this sounds trivial but it really would be questionable as to whether it would be worth our while if we had to re-code anything to accommodate each new product (or rather a variation in the extra data).
I know you don't agree with this approach, but how do you think the best way of dealing with this would be given there will be only a few items of extra data per product?
I have tested this with some level of success with about 300K people but need to go to about 1M. Seems ok so far but it's a very basic test
Thanks.
March 16, 2006 - 11:36 am UTC
how many of these additional attributes do you have?
Answer to above
Loz, March 16, 2006 - 5:12 pm UTC
I envisage there will be only a handful of additional attributes per product (say max 5) and not all will be mandatory. So say max 5 million rows for 1M people.
March 17, 2006 - 4:59 pm UTC
so, just add them as columns - you can add 5 columns easily.
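Concretely, with the two example attributes from the question (column names illustrative only), that is just:

alter table person_to_product add
( blood_group varchar2(3),   -- only populated for product X purchases
  bike_count  number         -- only populated for product Y purchases
);

New optional attributes for future products get added the same way, as nullable columns.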
Query on Design
thirumaran, April 18, 2006 - 8:29 am UTC
Hi Tom,
I am designing a model which should be very flexible to support dynamic fields. The user should be allowed to create any number of attributes for the template.
My recommendation is that we have one master table and 2 child tables to support the dynamic fields.
Note: VC = VARCHAR2 datatype
master table : Company_table
COMP_ID VC(32) not null - automatic DATABASE generated
COMP_PICT BFILE
COMMENT CLOB
DYNAMIC_FIELD_TABLE1 (child of master table COMPANY_TABLE)
DYNA_FIELD_ID VC(32) not null - automatic DATABASE generated
FK_COMP_ID
COLUMN_NAME VC(255) NOT NULL,
COLUMN_DATATYPE VC(30) NOT NULL
DYNAMIC_VALUE_TABLE2 (child of table DYNAMIC_FIELD_TABLE1)
DYNA_VALUE_ID VC(32) not null - automatic DATABASE generated
FK_DYNA_FIELD_ID
COLUMN_VALUE VC(255) NOT NULL,
I have a separate dynamic_field and dynamic_value table for every master table that needs to support dynamic attribute creation.
1) A few colleagues argue that these dynamic tables should be common to all the tables (i.e. one child, multiple parents).
2) A few colleagues argue that these 2 dynamic tables should be combined into one: for every master table, have one dynamic table instead of 2 (one for column name/datatype and one for values).
Tom, which is the best approach, and what are the pros and cons?
A) have one common table for all master table (1 child multiple parents)
B) have one master and 2 child tables (one for column name + datatype and one to store values)
C) have one master one child to store the dynamic attributes with data
Thanks in adv
Thirumaran
April 18, 2006 - 10:01 am UTC
... I am designing a model which should be very flexible to support dynamic fields.
The user should be allowed to create any no of attributes for the template. ...
ugh. I guess you didn't read my response above?
query on design
thirumaran, April 19, 2006 - 7:18 am UTC
Tom,
I got it. I walked through this entire page and I understood the advantages and disadvantages of this design.
Thanks for clearing my doubts.
I have just one question:
is it recommended to have multiple parents for one child, and what are the disadvantages?
Can we maintain all the FKs to these parent tables - does the RDBMS support this?
Thanks in adv
Thirumaran
April 19, 2006 - 9:02 am UTC
the database enforces referential integrity - be that to a single parent (eg: emp.deptno -> dept.deptno) or to multiple (add the necessary foreign key from emp.mgr -> emp.empno).
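A quick sketch of that point, using the classic schema Tom references:

create table dept
( deptno number(2) primary key,
  dname  varchar2(14)
);

create table emp
( empno  number(4) primary key,
  ename  varchar2(10),
  deptno number(2) references dept,        -- parent #1
  mgr    number(4) references emp(empno)   -- parent #2: self-referencing
);

Both foreign keys are enforced by the database at the same time; "multiple parents" is not a problem.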
Multiple Parent for a child Design Consideration
Sujit Mondal, May 09, 2006 - 7:06 am UTC
Hi Tom,
Can you please give me your Input for following Data Model.
Parent Tables : Week (week id , start dt , end dt ...)
Month (month id, start dt , end dt ...)
Child Table : Event
Columns
event id ,
week id (FK to Week)
Month id (FK to Month)
....
....
In any situation, a row in the Event table will be the child of either a Week or a Month, so for a given record either week id or month id will be null.
Is the above an acceptable design? Or shall we go for separate week_event and month_event tables? All other attributes of the Event table are the same.
May 09, 2006 - 8:29 am UTC
it is acceptable to have a single table have "more than one parent" table it is related to.
Here, parent/child is sort of a misnomer. It is really more like a lookup or a dimension - not like ORDER and ORDER_LINE_ITEMS in the "classic" sense.
but if week and month look like you have them looking - not sure why they are "different"
Why optional FKs?
Gabe, May 09, 2006 - 10:27 am UTC
What I don't understand is why an Event belongs to either a Week or a Month. True, the Week and Month time intervals are likely not aligned (a Week may overlap two Months), but they both provide full coverage of time. Hence, for any Event one should be able to specify both the Week and the Month; those FKs should be mandatory
unless??
May 09, 2006 - 11:04 am UTC
(i didn't understand the need for two dates tables myself - seems they are "periods of time")
Tom Kyte: Query on Design
Arjan Hölscher, August 21, 2006 - 6:37 am UTC
great review on database development
How do you suppose Amazon does it?
Aaargh!, September 13, 2006 - 3:02 pm UTC
Great discussion on DB design. I have been creating large databases in MS SQLserver for about 8 years now. Twice have I used the discussed 'generic data model' structure and both times it killed the performance of the application.
I am now a guru at creating the required, insanely complex SQL statements to gather useful information from the 'generic data model'. You DON'T want to select a set of records filtered on multiple 'dynamic columns', sorted by multiple 'dynamic columns', and including paging.
When you are designing, for example, an application which has to manage ID-cards (all have different information on them) and the card-owner information associated with each card, it looks like a nice solution at first.
The customer has a nice interface to create new cards and cardfields and there is one unified way the software can handle all card information.
BUT you end up with a database that doesn't scale at all(!) and drives you nuts when you try to do anything 'non trivial' with it.
I do wonder how a company like Amazon solves this problem of variable object properties. Take the products they are offering: most of them share properties like 'price', but how do they deal with the product-category specific properties?
Something like this?:
Product
ProductPK
Price
ProductCategoryFK
..
ProductCategory
ProductCategoryPK
Name
..
BookProductInformation
ProductFK
ISBN
Title
..
CDProductInformation
ProductFK
Title
NumberOfDiscs
..
Tom, what's your idea about their possible solution?
database design multiple parents
Kez, February 04, 2008 - 9:49 am UTC
I have a question concerning generic vs non-generic design. I understand the reasons you have given for not making a design too generic, but I'm not sure if that's what I'm trying to do. I think that this is relevant to the original question, regarding database design and avoiding generic tables, but if you think it's a new question - please forgive me
I have tables for employee, company, contact, etc. (which do have different columns and cannot be combined - well, they could but there would be a lot of redundant columns in the 'generic' table) and they can all have one or more addresses, so I am facing the decision between tables for employee_address, company_address, contact_address, etc. or one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' column which would define whether the parent table was that of employee, company, etc. Hence, I would have a 'generic' address table, but is this better or worse than 4 or more 'address' tables with the same columns but a foreign key pointing to only one table?
I am trying to find the advantages and disadvantages of each
February 04, 2008 - 4:10 pm UTC
that is not a generic address table (in fact, nothing you said above would be 'generic', none of it)
You have an entity. It is an "address"
many other entities - employees, contact, company - have a relationship to an address.
there only needs be a single address entity, it is a "master table" - a parent table - it is not "generic" at all. It is very special purpose in fact.
It only holds addresses - it is as specific as you get.
Kez, February 05, 2008 - 3:37 am UTC
Thanks! I was just worried about the multiple keys with the 'type' column. Now I just have to translate your answer into French for the rest of the team........
February 05, 2008 - 8:15 am UTC
I just re-read that.
...
I have tables for employee, company, contact, etc. (which do have different columns and cannot be combined - well, they could but there would be a lot of redundant columns in the 'generic' table) and they can all have one or more addresses, so I am facing the decision between tables for employee_address, company_address, contact_address, etc. or one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' column which would define whether the parent table was that of employee, company, etc.
...
that is wrong, you would have:
create table address (
id number primary key,
....
);
create table emp (
empno number primary key,
...
address_id number references address );
create table contacts (
....,
address_id number references address );
You have the foreign key in the wrong table, it belongs in the emp, contacts and so on.
You just have ADDRESSES - period. Then other entities point to them.
Address table
Mike, February 05, 2008 - 5:56 am UTC
From these remarks:
> one address table with a foreign key of id (employee, company, contact...) plus an 'addresstype' column which would define whether the parent table was that of employee, company, etc.
>I was just worried about the multiple keys with the 'type' column.
it sounds like you are planning to have a multi-purpose column which may be an employee ID, or a company ID, etc. depending on the context. I would suggest you avoid that. Instead, have separate columns: an Employee ID column PLUS a Company ID column PLUS (whatever else)...
You can then have proper foreign-key constraints for these columns. You should not attempt to use these foreign-key columns in your primary key (as they will sometimes be null).
February 05, 2008 - 8:16 am UTC
Actually, I think we had the parent table backwards - the parent table here is address - it does not have any foreign keys
(and using separate columns would be a bad idea - sure, today you have emp, contacts, company - tomorrow you have those three AND something else)
Kez, February 05, 2008 - 9:44 am UTC
In fact, if each employee, company, etc. could only have one address, then I would agree, but we can't put the address_id on the employee table because there could be more than one (eg home address, working away from home address, etc.)
In any case, I explained to the rest of the team and they are going with the separate tables anyway *sigh* I tried and I learnt something, so I'll move on.
Thanks
February 05, 2008 - 10:25 am UTC
then you have a missing entity, a many to many relationship between emps and addresses
create table address( id number primary key, .... );
create table emp( id number primary key, ... );
create table emp_addresses( emp_id number references emp, addr_id number references address, primary key(emp_id,addr_id) );
Kez, February 05, 2008 - 9:46 am UTC
Oh, and I wouldn't have been using null columns in the primary key, as each record would have its own id; this would be a foreign key constraint.
multiple parents
Bob, March 31, 2008 - 6:48 pm UTC
In the last post you said:
then you have a missing entity, a many to many relationship between emps and addresses
create table address( id number primary key, .... );
create table emp( id number primary key, ... );
create table emp_addresses( emp_id references emp, addr_id references address, primary key(emp_id,addr_id) );
----
Just for clarification, are you suggesting that a cross linked table be created for each entity that uses an address? (ie. contact_address, client_address, emp_address, etc...)
April 01, 2008 - 7:18 am UTC
depends on the model. But if Contact has addresses (plural) then yes, there would be a contact to address association object (so that foreign keys and all can be safely in place)
Percentile Calculation
Vicky, May 08, 2008 - 6:44 am UTC
Hi Tom,
After reading the "question of the day" stuff, hats off to your patience.
I have a requirement in which I need to calculate percentiles day-wise and month-wise.
Currently we are planning for a job which would run in midnight and calculate percentile for previous day using percentile_cont function.
Everything looks to be fine because we are developing and testing on 1000 records.
I assume that the percentile_cont function sorts internally for the percentile calculation. Well, in production we are expecting millions of records.
1) Don't you think this will affect my performance? I am going to benchmark it once my development is complete, but it would be great if you could provide some early thoughts.
Secondly, we are informed that we can receive some latent records, i.e. back-dated records. In such cases we would have to recompute the percentile for the entire day and month, i.e. look again at the millions of records even if only one latent record turns up.
2) Is there any other strategy you know for computing percentiles?
I am going to benchmark it once development is complete, but I don't want to change my entire design if some performance issue turns up, hence requesting your inputs.
Your inputs will highly be appreciated.
Regards
May 12, 2008 - 8:28 am UTC
... Well in production we are expecting millions of records. ...
We deal with millions of records all of the time. Give us sufficient resources (memory, cpu, IO) and - well - it is what the database does.
1) of course it will affect your performance, but the main thing here - you can and will get your answer - you need to have reasonable expectations of how long it will take to get it, but you will get it.
and if your data arrives lazily over time like that, yes, you'd have to recompute. Sometimes - that is just a fact.
2) since the insertion of a single record will potentially affect the percentile assigned to EVERY OTHER record, you'd need to recompute.
I'm not sure you want percentile_cont - that is an aggregate function:
http://docs.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm#i1000909
"PERCENTILE_CONT is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation."
Percentile Calculation
vicky, May 17, 2008 - 7:06 pm UTC
Hi,
Data is received from a 3rd party application and we are told that, since they implement a queue mechanism, we can receive latent records up to a maximum of 5 days back. I have to design accordingly. Also, I have to report my data the next morning.
I am planning to use a bucketized approach.
I have an SLA (Service Level Agreement) defined for the incoming data, i.e.
if the transaction response time is < 1 sec, it has performed well and meets the SLA.
if the transaction response time is >= 1 sec and < 3 sec, it has failed and a penalty needs to be computed.
if the transaction response time is >= 3 sec, it has completely failed and the maximum penalty needs to be imposed.
I will divide the range 1 to 3 in say 10 buckets and the gap between each bucket would be (3-1)/10 = 0.2 seconds
1st bucket : count of messages with response time < 1 second and total count
2nd bucket : count of messages with response time < 1.2 seconds and total count
3rd bucket : count of messages with response time < 1.4 seconds and total count
4th bucket : count of messages with response time < 1.6 seconds and total count
.
.
last bucket : count of messages with response time > 3.0 seconds and total count
Once that is done I can find my 90th percentile easily by computing count/total count; wherever 90% is reached, that is my 90th percentile value.
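For what it's worth, a hedged sketch of building those bucket counts in a single pass (table and column names hypothetical):

select trunc(txn_time)                        as txn_day,
       width_bucket(response_secs, 1, 3, 10)  as bucket_no,
       count(*)                               as cnt
  from transactions
 group by trunc(txn_time), width_bucket(response_secs, 1, 3, 10);

width_bucket returns 0 for responses under 1 second and 11 for 3 seconds or more, so the "met SLA" and "completely failed" buckets come for free; a running sum of cnt over bucket_no divided by the day's total gives the approximate percentile, and latent records only require incrementing one day's counters.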
Disadvantages of this approach:
1) It will provide an approximate value and not an exact value.
2) Every record needs to be exploded based on the count of buckets.
Advantage :
1) Fast and can cater to latent records easily.
Your views on this approach will highly be appreciated.
Regards
May 19, 2008 - 3:54 pm UTC
does it meet your reporting needs - if so, it is something that can be maintained as data arrives, which is good.
Generic Design - Oracle Clinical
Manish, June 25, 2008 - 10:01 am UTC
Hi Tom,
Have been reading through this thread and find it a very informative set of exchanges between you and others on this page.
I am working in the healthcare field and the kind of data that needs to be captured from one specialty to another varies. There are a small set of values like the chief complaints, family history, social history, review of systems, etc that remains the same for each patient. But once we get to the clinical specialty domain, one wants to be able to get the details of the
- patient's glucose level,
- see the list of medications that a patient is on,
- list out the SpO2 values for the patient
And so on, try to correlate the details between multiple other specialties.
To be able to record the case record, there is no set format that I have noticed in my journey through various hospital implementations. And the best way forward seems to be a generic model approach.
Since Oracle Clinical also perhaps utilises a generic design for clinical trials, what design considerations should one follow to be able to build a robust "generic data model design" (since at times it is a business need)?
Please advise.
June 25, 2008 - 11:31 am UTC
Oracle Clinical is a generic solution for everyone, you buy it. You customize it. You implement it. They truly don't know what you are going to store.
My argument is "if you are building something for you, you know what you are going to store, so store it"
Confirmation That Generic Model = Horrible
A reader, June 25, 2008 - 1:32 pm UTC
Tom,
I don't normally post here anonymously but this time I must. The company I work for has - against my strong objections - purchased a product named NetCracker. It uses a variation of the generic model you showed in your first response.
I can confirm for those readers who wonder about a generic model that it is totally useless. The flexibility offered as a selling point for it is non-existent. Performance, particularly for getting data out of the database, is hideous.
To any objective observer it has been a total failure. The only good news, and this is a stretch, is that its use has been limited to a pretty small segment of our business. My advice to anyone thinking of implementing a generic model for any non-trivial application: DON'T DO IT!
Generic Design
archana, July 29, 2008 - 10:55 am UTC
Hi Tom,
I agree with your comments posted on the site and found them very useful. In fact, I can see the same issue being repeated in our current projects by architects.
Statement by Architect " data field types and sizes can also be categorized into a relatively smaller number of generic classes which improves maintainability throughout the application stack"
We are also urged to take this up as a Design policy.
You have mentioned defining field type/length inappropriately or in a generic manner will cause
*** Data Integrity issues
*** Storage Issues
*** Performance Issues
Also Varchar2 was being used as an example for lengths.
On a related note, I had a query on NUMBER data type, particularly with respect to Primary keys.
My Database designer has 'generically' applied Number(11,0) for all primary key fields in the database
I am not sure his approach is correct.
For example:
The Gender table primary key is also Number(11,0), although there will only ever be 2 records in this table (Male/Female).
Similarly, the Country master stores a maximum of 50-odd countries, but its primary key is defined as Number(11,0).
Can you please advise me on this approach, whether it is correct or not? Will there be any impact (performance, storage) if we generically design all primary keys in the same manner?
I also looked at the default HR schema and here is what i find, unable to draw any particular conclusions.
But in
Countries table, country_id char(2)
Region table, Region_id number
Departments table, Department_id number(4)
Looking forward to your help and advise,
Archana
p.s I have raised this as a related query to the topic.
Please accept my Apologies if this is considered a new question.
July 29, 2008 - 4:07 pm UTC
using number(11,0) will limit your sequence values to 99,999,999,999 which seems like a willy nilly arbitrary limit.
it will not impact storage (the number 42, whether stored in a number, number(11,0), number(38), or number(2), will consume the same amount of storage).
it will possibly impact performance (number(11,0) has a much larger range than number(2) does, high and low values could be much different)
it all comes down to "WHAT THE HECK ARE THE DATA REQUIREMENTS"
what are the requirements of the data, that is what drives the appropriate datatype.
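For illustration only (DDL sketched here, not taken from the HR schema scripts): if the requirement says a country code is always exactly two characters and a gender key only ever holds a handful of values, let the types say so:

create table country
( country_id   char(2) primary key,     -- mirrors the HR COUNTRIES.COUNTRY_ID idea
  country_name varchar2(60) not null
);

create table gender
( gender_id   number(1) primary key,
  gender_desc varchar2(10) not null
);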
Generic Design -Number Data Type
archana, July 29, 2008 - 7:19 pm UTC
Hi Tom,
Many thanks for your kind response.
Here is the requirement
Gender Table description
Gender_Id PK NUMBER(11,0)
Gender_Desc VARCHAR2(10)
I pointed out that Gender_Id will only have 2 rows, so no need to allow for Number(11)
But the database designer/architect came back with the generic-model maintainability trade-off.
I would like to know what would you define in this case?
Also not clear how performance might be impacted, please clarify.
And is it vsize() that I can use to calculate the storage impact?
Thanks and Regards,
Archana
August 01, 2008 - 11:01 am UTC
what is this trade off? I see no points to discuss?
however, if you use number(11) and you only ever put in there the numbers 1, 2 - it won't matter storage wise (the number "1" stored in ANY number definition is stored the SAME)
a datatype is a data integrity constraint, you use the right type to enforce a rule.
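Regarding the vsize() question above: yes, vsize() reports the stored length, and a quick sketch like this confirms that precision does not change storage for the same value:

select vsize(cast(1 as number))       as plain_number,
       vsize(cast(1 as number(11,0))) as number_11_0,
       vsize(cast(1 as number(2)))    as number_2
  from dual;
-- all three columns return 2 (bytes) for the value 1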
design
A reader, July 29, 2008 - 11:45 pm UTC
excellent thread
Good design principle, but is it excessive?
A reader, February 13, 2009 - 7:11 am UTC
If you have lots of "simple" lookup attributes for an entity that are purely used for lookup purposes, e.g. a product can have the attributes "colour", "size", "made_in", etc., and we had just 1 lookup table:
create table product_lookup (id number primary key, attr varchar2(30), descr varchar2(30));
insert into product_lookup values (1,'COLOUR','Blue');
insert into product_lookup values (2,'COLOUR','Brown');
insert into product_lookup values (3,'SIZE','Big');
insert into product_lookup values (4,'SIZE','Small');
insert into product_lookup values (5,'MADE_IN','China');
insert into product_lookup values (6,'MADE_IN','Taiwan');
Would you classify that as too generic? Would you recommend creating, in our case 50+ tables to store each different attribute instead of just 1?
e.g colour_lookup_tbl(id number, descr varchar2(30)),
size_lookup_tbl(id number, descr varchar2(30)),
madein_lookup_tbl(id number, descr varchar2(30)), etc
February 16, 2009 - 10:50 am UTC
... Would you classify that as too generic? ...
yes, I'm sort of into this thing called 'data integrity'. How would you have your foreign keys if you do this? Would you put into your other tables that reference this two columns??? What would prevent you from using a color when you should have been using a size??
absolutely - a table for a lookup.
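A short sketch of the difference (names hypothetical): with one table per lookup, the foreign key itself guarantees a colour column can only ever hold a colour:

create table colour_lookup ( id number primary key, descr varchar2(30) not null );
create table size_lookup   ( id number primary key, descr varchar2(30) not null );

create table product
( prod_id   number primary key,
  colour_id number references colour_lookup,
  size_id   number references size_lookup
);

With the single generic lookup table, both foreign keys would point at the same table, and nothing in the database stops colour_id from referencing a 'SIZE' row.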
A reader, February 13, 2009 - 10:52 am UTC
...If you have lots of "simple" lookup attributes for an entity that is purely used for lookup
purposes eg....
Actually, I've just read your other thread on this subject;
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1389366900346164087
You've raised issues there which I think are relevant, but I don't think they cause us any problems;
data integrity - only administrators have access to screens for updating the lookup table
referential integrity - not a problem with the FK based on the lookup_id
performance - the joins are always of the type:
select *
from products_tbl prd, product_lookup lok
where lok.lookup_id = prd.colour_id
I can't see how having 50+ tables with, on average, fewer than 6 rows each would make much difference compared to 1 table with 300 rows.
February 16, 2009 - 11:25 am UTC
... data integrity - only administrators have access to screens for updating the lookup table ...
ummm so??? that is not data integrity, that is access control (security) - not part of the problem.
... referential integrity - not a problem with the FK based on the lookup_id ...
umm no, you missed the point. If everything is stored in table T, everything is "T", it is not a color, a size, a whatever. You have no referential integrity - you can easily use a color where a size should be. Data integrity - you get zero points for that.
... performance ....
so????? SO?????
Anyone is free to be as lazy as they want, I cannot change that. I know of no reason to use a single table other than that reason.
A reader, February 17, 2009 - 4:02 am UTC
Tom -
Data Integrity/Ref Constraint - I get your point. Yes, there is nothing to stop you "from using a color when you should have been using a size". But the input screens where these attributes are entered are controlled by us. Yes, we can make a mistake in choosing the wrong group (but it's unlikely, as we're using a constants package for that), but it's not a high risk and it would be corrected fairly easily. Effectively we control the data, we control what groups get added, we control what is entered; I think that is why your assessment of these risks differs from mine.
Performance - looking at the code we've got, many of our screens caches these lookups by groups. Hence we have a common function which issue a select statement of this type;
select * from product_lookup where attr = :BIND_VAR;
Now what you're suggesting is to do away with the bind variable with this and change our common function to this;
case p_attr_val
when '1' then select * from size_table;
when '2' then select * from colour_table;
when '3' then select * from made_in_table;
when ...
when '50' then select * from another_table;
end case;
Now to me this would actually hinder scalability.
I don't agree with it being lazy; it improves developers' productivity and I think it is arguable (with benchmarks) that performance is better.
Whilst on performance, maybe you can assist here - if I have a select * from products_tbl and join to every one of those lookup tables as you suggested to get their descriptions, and the CBO chooses to perform a HASH join, would it have to maintain 50+ different hash lists? How would that compare if the lookup was just one table - does the CBO just have to maintain 1 list, and which would perform better?
February 17, 2009 - 8:50 am UTC
... But the
input screens where these attributes are entered is controlled by us. ...
that is so very short sighted. Today they are - tomorrow? Are you saying "this data belongs to me, only me, only me and my application and my application is the only thing that will ever interface with this" - if so, I already know you are wrong. That data will be used and reused long after your application is erased. Applications come, applications go - data lives forever.
.... Effectively we control the data, we control what groups get
added, we control what is entered, I think hence your assessment of these risk
differs from mine. ....
Your assessment is way off base - fundamentally flawed. But one that happens time and time again. You believe the data is a side effect of your application. The application is a side effect of the data and the data will be there tomorrow - after your application withers and dies. There will be some "brand new, the last way to build applications, the best way to build applications, the best thing EVER" coming (it always does). If you make it so that the data can only be accessed safely by your application - well - you will have killed that data.
Everything you describe yourself doing - been there, done that. How many ISPF green screens interacting with CICS transactions have I seen that do the same thing - making the invention of the screen scraper so important when this web thing came out (you could not touch the data, you had to pretend to be a 3278 dumb terminal so the application could touch the data safely). You are doing the same thing - you are building the green screen legacy application of 2010.
...
Now what you're suggesting is to do away with the bind variable with this and
change our common function to this;
case p_attr_val
when '1' then select * from size_table;
when '2' then select * from colour_table;
when '3' then select * from made_in_table;
when ...
when '50' then select * from another_table;
end case;
wow, whoa, where did you get that idea. It would be more like
at the appropriate place we issue "select * from size_table", later, at the appropriate place we "select * from made_in_table" and so on. You have a pick list for a field - I see no use for the construct you just gave.
Scalability - you would not have that code, I don't know where that idea even comes from.
and look at your last bit - about the hashing. Umm, you would:
select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;
You would full scan lookup 3 times
You would hash lookup (which is very large - it has EVERYTHING) into memory 3 times
You would hash join to it three times.
Versus:
select * from t, size, color, made_in
where t.color = color.pk and t.made_in = made_in.pk and t.size = size.pk;
I'll full scan three smaller tables and have three smaller hash tables - true, you could put a predicate on yours:
select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and l1.descript = 'COLOR'
and t.made_in = l2.pk and l2.descript = 'MADE_IN'
and t.size = l3.pk and l3.descript = 'SIZE';
you'll hash the smaller bits of data, but full scan the larger....
If you have 50 columns, you'll have 50 joins - regardless of a single table or 50 tables.
It does not improve ANYONE'S productivity in any way, shape or form. It is as bad as using global variables so we don't have to pass parameters - or not commenting code because it "improves our productivity".
Put up the benchmarks - you say you have them, let's see them?
A reader, February 18, 2009 - 2:53 am UTC
Tom,
Yes, data does live forever, but I don't agree that our design will somehow "shorten" its useful life. Yes other applications could interface with this - but subject to the sets we have defined. It's simple, if they're using PL/SQL, they can use the same constants package we use to prevent data integrity problems. If they're using Java, they can copy our constants package and turn it into a constants class and so on.
Other applications can safely access our data if they follow some basic rules and good practice where hardcoding is necessary. Yes, they can make mistakes "easier" with this hardcoding but that's what testing is for! It's hardly difficult to spot on the front screen that COLOR has been set to "Hong Kong"!
I made a mistake and should have written "do with benchmarks"; I misled you into thinking I'd already done the benchmark.
"select * from product_lookup where attr = :BIND_VAR;" is embedded in a COMMON function that the front-end calls to cache all the values of a particular set. Hence I presented you with the "CASE" statement scenario as that is how they would adapt to the new 50+ table structure. I know you don't agree with this and it should be changed at the appropriate place, but the common function is what they use and that won't change.
However, whether we have the "CASE" statement or your suggestion, I would still have to hard parse an extra 50+ statements where I don't need to. I could benchmark this to prove how using a bind variable can be beneficial, but fundamentally you don't agree with this design, whereas the more we discuss this (and I do appreciate where you're coming from), the more convinced I am that we haven't actually made a bad decision - so it's kind of pointless even if I produced stats showing a ten-fold increase in performance and reduced latching.
"select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;" - ok if the CBO is doing 3 full scans,then it's not as clever as I thought. Surely, it should recognise that it's the same table and since we're joining to it by the PK, then it's doesn't need to create 3 identical hash list? Whereas understandly, your statement "select * from t, size, color, made_in
where t.color = color.pk and t.made_in = made_in.pk and t.size = size.pk;" would obviously create 3 hash list. Programmatically, if you had to write your own hash list, would you create 3 (or 50+) identical hash list?
Productivity is increased - no developers here wants to remember 50+ lookup table names. From your book (which I have and rate highly), I quote, you "just say no" to CHAR(1) instead of VARCHAR2(1) to avoid confusion, well here we say just "lookup".
I think we agree to disagree on this design. I do value your opinion on your input with regards how the CBO would handle the query and it looks like it doesn't do what I thought it should.
February 18, 2009 - 2:53 pm UTC
... It's simple,....
No, it isn't. You have to be educated on how YOU write your code in YOUR install before you can even consider touching this data. Everything is done in the application, every application has to follow a set of rules. It is not automatic and it will be far too easy for someone to say "eh, yeah, I know that is supposed to store colors, but you know what, we have this new thing that doesn't have a color, it has gidgets, we'll just use that field for gidget when the type of record is "X", it'll be great"
Now, you have a mess.
.. if they follow some basic rules
and good practice where hardcoding is necessary....
and that makes it complex, it didn't need to be that way.
... but the common function is
what they use and that won't change.
....
but only because you made this "design" decision, if you had not - it would be different, so please don't impose on me what your design led you to do.....
.... I would still
have to hard parse an extra 50+ statements where I don't need to. ...
You wouldn't, we would - and SO WHAT. I'll give you 50 statements - big deal, in the grand scheme of things - so what? At least I'd know what data you were going after, what each function used, I'd have *more* information, better information and data integrity to boot.
.... "select * from t, lookup l1, lookup l2, lookup l3
where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;" - ok if the
CBO is doing 3 full scans,then it's not as clever as I thought. ....
think about it, it needs to join to three rows.... three *different* rows.
... Surely, it
should recognise that it's the same table and since we're joining to it by the
PK, then it's doesn't need to create 3 identical hash list? ...
surely you could easily run a tkprof and see, rather than assume what it does?
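A hedged sketch of doing exactly that check:

alter session set sql_trace = true;

select *
  from t, lookup l1, lookup l2, lookup l3
 where t.color = l1.pk and t.made_in = l2.pk and t.size = l3.pk;

alter session set sql_trace = false;
-- then format the trace file on the server:  tkprof <tracefile>.trc report.txt

The row source operations in the report show how many times LOOKUP was really scanned and hashed - measured, not assumed.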
And still, even if it did it ONCE, it would be full scanning the entire set of 50 tables, I scan 3 tables, 3 short tables. I still win. Regardless.
... no developers here wants to remember 50+ lookup
table names. ...
They have to remember 50 bind variable values... HMMMM.... six one way, half dozen the other. 50 is 50 is 50.
... I quote, you "just
say no" to CHAR(1) instead of VARCHAR2(1) to avoid confusion, well here we say
just "lookup".
...
but - you still have *50* things to remember.
And frankly, if your developers honestly could not figure out how to put the data back together - well, hmmm. I wonder how they remember the API's they use to program?
And one would assume you use a naming convention of sorts, the developers must know that a given field in a table must be "looked up" (right, they have to, they code the join to the lookup table don't they - so they must know what columns are foreign keys to lookups...). One could make the assumption that you would have a naming convention that says "the column COLOR_FK will be used to represent color, that will be joined to the COLOR_LKUP table". They do, well, have access to the column names - they have to type them in
(this productivity concept, very weak)
... I think we agree to disagree on this design. ....
well, not really, you haven't put forth anything that says "this is better", you are conceding
o applications must do the integrity, all of them, every time, hope they do it or we'll have a mess.
o integrity can easily be compromised. And it almost surely will be (gidgets will happen - programmers will hack it - they always do)
o you still have 50 things to remember, you have saved nothing (so your entire "they are more productive" thing is gone, we'll need another example)
o the database will do more overall work, not less, with a single table - regarding IOs.
... how the CBO would handle the query and it looks like it
doesn't do what I thought it should. ...
And this is why I'm steadfast on this, there is the way we envision we would make something work if we wrote it...
And then there is the way it really works.
Of the two, only the latter is relevant, the former would be relevant only if we were to build our own database someday.
A reader, February 19, 2009 - 12:17 am UTC
Doesn't Oracle Apps use lookups?
Doesn't Peoplesoft Apps use lookups?
And SAP?
How long have they been around (yes, one of them has been taken over!)? Have there been reports of this disastrous design causing data integrity and performance issues due to a single lookup table?
February 19, 2009 - 7:31 am UTC
so what? do you agree with everything they do? I don't. Do you? Are they perfect examples of precisely how to build an application?
Or are they things built by people who are coders - that don't necessarily know much about databases?
even oracle apps.
especially SAP.
and do they all, ALL, have many reported instances of data integrity issues? yes, well, yes they do.
A reader, February 19, 2009 - 12:44 am UTC
o applications must do the integrity, all of them, every time, hope they do it or we'll have a mess. -- yes do everything in the database, can't disagree. But the "mess" risk you're describing is exaggerated.
o integrity can easily be compromised. And it almost surely will be (gidgets will happen - programmers will hack it - they always do)
-- no it can't, follow SIMPLE rules. Give programmers some credit.
o you still have 50 things to remember, you have saved nothing (so your entire "they are more productive" thing is gone, we'll need another example)
-- select * from product p,lookup l1,lookup l2,lookup l3 where l1.id=p.color and l2.id=p.made_in and l3.id = p.size
o the database will do more overall work, not less, with a single table - regarding IOs. -- 300 rows in our lookups, as opposed to 6 rows on average in your 50+ tables, and it's only more I/O if the CBO performs a hash join and decides that it really needs to create 50 identical hash lists. Yes, I admit I made an incorrect assumption - it (the CBO) really isn't as clever as it ought to be - if I find the benchmark is that poor, I will report back.
February 19, 2009 - 7:49 am UTC
The mess I'm describing is one I see over and over and over. How can you say it is exaggerated?
follow rules? give programmers credit? I give credit where credit is due. You have to train people to follow your rules to access your data. You are assuming everyone is "as good as you". Guess what, in order to have an 'average' coder, 50% of them are below average. Some of them are way below average. They are the ones that will make color become a gidget. It not only MIGHT happen, it FREQUENTLY happens.
After 16 years of begging, documenting, showing the negative effects of not using, saying it day after day - I still cannot get coders to realize that "using bind variables is something we have to understand" - and you think they won't abuse an attribute in a table? You think they won't take the short cut? Hah.
You still haven't provided another example - you give me a query that references color, made_in and size - they had to KNOW that. You still have 50 things to KNOW (or look up in your documentation or get from a describe, whatever)
A reader, February 19, 2009 - 3:26 am UTC
"and that makes it complex, it didn't need to be that way."
-- no, it's not complex; you're making it out to be more complex than it actually is. And yes, you'd say the same back to me.
"but only because you made this "design" decision, if you had not - it would be different, so please don't impose on me what your design led you to do....."
-- you're making an incorrect assumption in the same manner that I did: the COMMON function approach was already in place before we introduced the lookup. Our design doesn't and didn't influence how they were performing lookups in the front-end. They may not have taken the approach you wish, but it wasn't driven by the alleged "bad" database design. Regardless, our design has facilitated the use of bind variables. Compared to your 50+ lookups - I win.
"think about it, it needs to join to three rows.... three *different* rows."
-- so what? It's the same table. We're talking about whether it decides to do a hash join here - that's just not smart if it's building X identical hash lists.
"And still, even if it did it ONCE, it would be full scanning the entire set of 50 tables, I scan 3 tables, 3 short tables. I still win. Regardless."
-- if it did it once, I'd like to see the effect of creating, say, 50 different hash lists compared to 1 hash list, and compare that to the amount of IO you'd save that makes yours the "outright winning" solution. We're talking about 300 rows here in 1 in-memory hash list, as opposed to 50 hash lists.
"this productivity concept, very weak)" - really? You like maintaining 50 table creation/maintenance scripts as well? Less code,keep it simple the better it is isn't it? Let say we want to add a new column to indicate a "short description" for all the lookup data, how many tables do you have to amend? How long would it take you to do 50+ compared to 1? Let's say I want to move all the reference data to a different tablespace, how many lines of code do you have to write? Productivity - I win.
We could go round in circles - like I said, we agree to disagree on this design. You have some valid points to consider, but nothing that makes our design inherently "bad" and makes me concede anything. I'm not suggesting for 1 minute we have 1 table for the ENTIRE system, just some description lookups.
February 19, 2009 - 8:37 am UTC
Look, if you have to tell a developer how to enforce data integrity for each and every table, you've made it complex - especially when you could instead have data integrity enforced
a) safely - you know it is done right, without having to tell anyone
b) all of the time, in every application
c) consistently
without *teaching* anyone how to do it. Coders are coders; they will not follow the rules if they see a way to take a shortcut. Declarative integrity rules out shortcuts. Period.
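For what it's worth, here is a minimal sketch of the declarative alternative being argued for (table and column names are invented for illustration):

create table colors
( color_code  varchar2(10) primary key,
  description varchar2(40)
);

create table product
( prod_id number primary key,
  color   varchar2(10) not null references colors
);

-- no common function, no rules to teach: any attempt to store a value in
-- PRODUCT.COLOR that is not a known color fails with ORA-02291, every time,
-- from every application.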
... the COMMON
function approach was already in place before we introduced the lookup. ...
how could that be so? How could you have a common lookup function that predates having data to lookup?
... We're talking about if it decides to do a hash
join here - that's just not smart if it's building X identical hash list.
...
and you are describing a micro-optimization that works only for this sort of bad design. Whenever you see a table being joined to N times over and over and over again in the same query - you know a mistake was made. This is something that, frankly, just doesn't happen that often. So, the fact that the optimizer doesn't recognize that it is dealing with "the single lookup table that satisfies all lookups" doesn't really bother the rest of the world - this is the *only* time that 'optimization' would ever come into play. Show me another case whereby you join to a single table many times in the same query - and the case has to be only "joining to the same set of attributes, there can be NO other predicate on the data being joined to".
... really? You like maintaining 50 table
creation/maintenance scripts as well? ...
absolutely! It is called *documentation*, metadata, more information, commentary, stuff to make it all make more sense.
I'm not afraid of a bit more "code" if it means
a) simple (you cannot screw up the data, I don't have to teach you)
b) always in place (you cannot screw up the data - do you see a common thread here - you cannot screw up the data) - data integrity is going to be there.
c) more information, better information. Having a lookup table with a well formed naming convention tells me something, provides information.
... Let's say I want to move all the reference
data to a different tablespace, how many lines of code do you have to write?
Productivity - I win.
...
*give me a break*. And how often would you do that? And so what - "grasping at straws" just popped into my head reading that.
... but nothing that makes our
design inherently "bad" and makes me concede anything. ...
it is bad, it is not the correct approach. You have conceded that data integrity can easily be compromised - that in itself is sufficient for me to cry foul. You now see that the optimizer works the way it works, not the way you think it would if you wrote it. And there is no programmer productivity here - just more rules for them to absorb and try to follow.
here is my bottom line.
I see so many 'bad practices' out there from
o lack of binding
o swallowing exceptions
o using autonomous transactions (and not understanding the implication)
o using triggers all over the place (and not understanding the implications)
o committing all over the place (breaking transactions entirely)
o using this single lookup table (and not having data integrity)
o not using declarative constraints
o using an EAV (entity attribute value) model to store all data - just so they are more 'productive' (hah, that one cracks me up)
.... list goes on .....
That I just have to, I am compelled to, point it out - loudly, over and over, every time, every single time.
If you are really smart, if you have a small team, if you have all above average developers - that actually LISTEN TO YOU and FOLLOW YOUR RULES - go for it.
But the average team is - average.
The average team has - people that take shortcuts.
The average team will have those that don't get the implications.
I speak to the average team that typically has 99% of the people not really even thinking that the database is useful - hence they learn nothing about it, they just like to make pretty screens.
And if that means storing a gidget in a color - so be it, they do not really *care*
I've made most all of these mistakes.
I've seen these mistakes made over and over and over and over again and again and again.
If I say to one person "ok, you are ok doing that" then you know what happens? It becomes "he said it is ok"
but it isn't. So, I won't say it.
Design Approach
A reader, May 17, 2009 - 2:20 am UTC
Hi Tom,
We have a requirement for building a data mart where we will receive a working version of the data a number of times until it becomes the approved version. We need to replace the working version every time for a specific version ID until it is approved. Users want to see the current working version in the fact table along with the approved versions. My question is:
The target fact table is quite large, and every time we want to replace a specific working version (based upon ID - there are many working versions for different IDs), we either need to delete/insert, or update-else-insert (merge) - or is there any better way of doing this? Further, since there is a requirement for aggregated data, we are thinking of creating a materialized view on the base table, and a delete/insert or merge approach on the base table can result in heavy CPU/memory-intensive jobs.
Thanks
May 23, 2009 - 11:00 am UTC
use list partitioning...
you can then load into an empty partition you add for that version, you can drop that partition or truncate when you want to "unload" it.
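As a rough sketch (all names here are hypothetical, not from your system):

create table fact_t
( version_id number not null,
  measure    number,
  load_dt    date
)
partition by list (version_id)
( partition p_ver_1 values (1),
  partition p_ver_2 values (2)
);

-- each load gets its own partition keyed by its version id; a working
-- version can be truncated or dropped and reloaded without touching the
-- approved versions sitting in the other partitions.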
List partition
A reader, May 23, 2009 - 8:33 pm UTC
Hi Tom,
The fact table has version id based upon version dim, which gives whether the version is working or approved. The layout of version dim is like this (some column headings are shortened here for clarity):
Vrsn Type Code VersionNo Version
1 PLAN PLAN2008 0 WORKING
2 PLAN PLAN2008 1 APPROVED
3 PLAN PLAN2008 2 APPROVED
......
Only one working version for a type & code, and multiple approved versions.
The data in fact table will be based upon version Keys.
Do you mean this:
Create a temp table (for working version) and load working version data. FACT table is list partitioned with version, and then exchange partition.
For approved version just load the data in the fact table.
Further, there is a materialized view on the fact table, which will only show the latest 2 approved versions and the working version. When we exchange the partition for the working version, which refresh option is better - COMPLETE, FAST, etc.?
PS: We are on 10gR1.
Regards,
May 23, 2009 - 10:39 pm UTC
no, i mean each thing you load would have its own version number.
you can either load directly into the fact table or use a temp table and exchange, it would not matter. whatever you like to do. each version is unique and independent of every other version.
read about partition change tracking in the data warehousing guide regarding your refresh options.
Working version
A reader, May 24, 2009 - 7:05 am UTC
Hi Tom,
My apologies, I think I did not make clear.
Each time a working version comes in for a specific type and code it is required to be replaced in the fact table; only approved versions are to be appended.
Regards,
May 26, 2009 - 8:38 am UTC
right - drop old partition with version you don't like anymore
add new partition with new version (and new version id) for the stuff you want.
You would be slicing out entire versions - whenever you want - using DDL
You would be sliding in entire new versions - whenever you want - using DDL
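Continuing the hypothetical FACT_T sketch above, the cycle would look like:

-- slice out the working version you no longer want
alter table fact_t drop partition p_ver_2;

-- slide in its replacement under a brand new version id
alter table fact_t add partition p_ver_3 values (3);
insert /*+ append */ into fact_t
select 3, measure, load_dt from fact_stage;   -- fact_stage is a made-up staging table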
A reader, May 25, 2009 - 1:33 am UTC
We have different criteria groups which a user can attach to a patient's visit. If all the requirements of the group are met in a visit then the user can select this group; otherwise we will give a message that this group can't be attached.
For example, if the user has attached a CPT and an ICD from the provided list and the Place of Service Indicator is not selected, then he can attach Group 1 to this visit. If the user either selects ICD 185 and a CPT from the provided list, OR he selects an ICD from the provided list, then he can attach Group 2.
The ICD and CPT which the user attaches to a visit are selected from a unique common set (as these repeat in different groups); at the time of form submission we will check for validation. These groups will grow with time and rules will also be added.
Group 1
ICD from Provided List
AND
CPT From Provided List
Without
Place of Service Indicator
GROUP 2
ICD 185
AND
CPT From Provided List
OR
ICD from Provided List
Measure #143
ICD from Provided List
AND
CPT From Provide List
OR
ICD from Provided List
AND
CPT From Provided List1
And
CPT From Provided List2
I am thinking of making reference data tables for these CPT, ICD, Place of Service and Age values and then writing IF/ELSE conditions in procedures to check the validation. Is it good to do it this way, or should some generic or non-procedural approach be adopted? Thanks for your time.
May 26, 2009 - 9:07 am UTC
how many groups - in your wildest dreams - can you imagine there might be some day.
and what is "measure 143" you have there - it just sort of appears, but you don't mention why you typed it in?
A reader, May 26, 2009 - 9:07 am UTC
It will be around 30-40 Groups
May 26, 2009 - 10:09 am UTC
can I ask what should happen when you change the lists (update the "provided lists")
database design
Matt, January 25, 2011 - 1:54 pm UTC
Hello Tom
I am supporting an application that has performance problems early on in its release; I believe there are serious design flaws.
For example there is one main/central table to the application, and a PL/SQL procedure operating on this table of the form;
proc MyProc ( param1 in number, ..., paramN in varchar)
begin
...
insert into "MainTable"
select col1, col2, ..., colN
from ( select from in line view here
querying various tables )
where not exists ( select 0 from "MainTable"
where someColA = :someValue
and someColB = :someOtherValue );
COMMIT;
end;
To be honest I was somewhat confused and could not understand the thought process of the person writing that statement. Why have they not defined a unique constraint?
The table has about 7 or 8 indexes many of them have 4 or 5 columns - but one index has 14 columns! The app needs to write tons of data to this table (about 10 million rows after an initial departmental pilot over a year) and as it is a central table there are lots of queries too. But 14 columns, and row by row inserts.
Would a select have to use the leading edge column of the index in either its select clause or where clause to use the index?
Sorry my question is not well formed - I'm just hoping for a view on this kind of design. What can we infer about the designer? The developers agree there is impact on write speeds but say the index is required as the index blocks will always be in buffer cache and that the index satisfies certain select statements without having to read data blocks (potentially from disk). But the index is probably nearly as large as the table anyway (it is not a hundred column table, only about 25 columns). Are the index blocks more likely to be cached than the data blocks?
best regards
Matt
February 01, 2011 - 12:10 pm UTC
... Why have they not defined a unique constraint? ...
well, a unique constraint would not necessarily make sense. It looks like they just want to add rows that don't already exist in MainTable.
... The table has about 7 or 8 indexes many of them have 4 or 5 columns - but one index has 14 columns ...
that could make sense if this is a "spreadsheet reporting table".
... But 14 columns, and row by row inserts. ...
I don't see any row by row? It is a single bulk SQL statement.
... Would a select have to use the leading edge column of the index in either its select clause or where clause to use the index?
...
not necessarily. We can do INDEX SKIP SCANS where by a query such as
"... where column2 = ?"
can use an index on (low_cardinality_column,column2)
Additionally, a query
"select b, c from t where d = ..."
can use an index on (x,y,z,a,b,c,d) INSTEAD of using the table.
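Sketching those two cases out (a simplified, hypothetical table):

create table t
( low_cardinality_column varchar2(1),
  column2                number,
  b                      varchar2(30),
  c                      number,
  d                      number
);

create index t_idx1 on t(low_cardinality_column, column2);
-- "where column2 = :x" can use T_IDX1 via an INDEX SKIP SCAN even though the
-- leading, low cardinality column is never referenced

create index t_idx2 on t(b, c, d);
-- "select b, c from t where d = :x" can be answered entirely from T_IDX2
-- (e.g. via an index fast full scan) - the table itself need not be read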
... the index is required as the index blocks will always be in buffer cach ...
there they are wrong. It is doubtful that all blocks of every index would be in the buffer cache - however - the insert would need them all to be since it randomly inserts into many of them.
... Are the index blocks more likely to be cached than the data blocks? ...
Frequently accessed blocks are more likely to be cached than non-frequently accessed blocks. sometimes a table block is more frequently accessed than an index block and vice versa.
"It depends"
Bad designs for query performance
abc, February 26, 2011 - 12:40 am UTC
Oracle recently bought a product CC&B from SPL and its new name assigned by oracle is Oracle Utilities.
There are two tables with this structure in it.
CI_BILL
--------
BILL_ID
etc
etc
CI_BILL_CHAR
-------------
BILL_ID
BILL_CHARACTERISTIC_CODE
BILL_CHARACTERISTIC_VALUE
etc
etc
There are many things printed on a bill, including the billing month. Note that there is no DATE datatype column and the billing month is not stored in CI_BILL.
All things printed on bills are called bill characteristics according to SPL, and they are stored in the detail/child table; the child table also stores the billing month of the bill as a bill characteristic.
The problem is that both tables have grown very large, and now we are thinking of partitioning the CI_BILL_CHAR table on the basis of billing month, so as to separate recent bills along with their characteristics from old bills. But we were unable to range partition it for this purpose.
This is one example; there are many tables in this application which are designed the same way and are not partitionable on the basis of date/time.
1- Do you think it's a good design?
2- Is there any solution to this problem in any version of Oracle - existing versions or future versions?
3- Is there anything like "join partitioning" in the Oracle database now, or in future versions - like there are bitmap join indexes?
In my opinion, the only solution is to redesign the whole application - what do you say?
Thanks and Regards
February 28, 2011 - 8:46 am UTC
1) it looks like an EAV - entity attribute value - model. My opinion of such a "design" is well known.
I'm not a fan.
2) there is nothing to partition here - the design would need to support partitioning.
3) not that I'm aware of.
mfz, February 28, 2011 - 9:21 am UTC
Hash partitioning is the only partitioning method I can think of for this data model;
your queries may get partition pruning if they all look up by bill_id. But that may not be the issue here.
Tom has spent a section in the first chapter of "Effective Oracle by Design" on these data models.
question to mfz
abc, March 01, 2011 - 4:19 am UTC
mfz,
can you please elaborate on how hash partitioning could solve this problem? I don't think even that would help the performance of queries based on date/time/month ranges.
Generally, most third party products have more or less the same generic model, so this is a general problem which limits Oracle users' ability to use the partitioning feature. But Tom, tell me - is it also Oracle's policy to design models this way? If this product had been made by Oracle, would Oracle also have designed it this way? Or, now that Oracle has bought this product, will Oracle change the design in the future to support time based partitioning?
mfz, March 01, 2011 - 10:04 am UTC
If the queries are based on the hash key (in this case, BILL_ID) ONLY, it might help. Otherwise it will not.
I was throwing out a generic suggestion for a generic problem :-)
Also, with hash partitioning you can mitigate hot spots (contention due to monotonically increasing sequences - probably the case with BILL_ID).
Re: Bad designs for query performance
Jakub Illner, March 01, 2011 - 1:23 pm UTC
Hi,
I agree that with a packaged application you cannot do much, since introducing any partitioning scheme is probably prohibited by the application vendor.
However if you are free to partition these tables I would consider range partitioning of bills on bill date, possibly using hash subpartitioning on bill id and also using the new reference partitioning for the bill characteristics.
This partitioning scheme will work well if you primarily access or load the data by bill date and if you always join both tables by bill id.
create table ci_bill (
bill_id integer not null
, bill_date date not null
, bill_filler varchar2(200)
, constraint ci_bill_pk primary key (bill_id)
)
partition by range (bill_date)
subpartition by hash (bill_id) subpartitions 16
(
partition p2010 values less than (to_date('20110101','yyyymmdd'))
, partition p2011 values less than (to_date('20120101','yyyymmdd'))
)
/
create table ci_bill_char (
bill_id integer not null
, bill_char_code varchar2(40) not null
, bill_char_value varchar2(256)
, bill_char_filler varchar2(200)
, constraint ci_bill_char_pk primary key (bill_id, bill_char_code)
, constraint ci_bill_char_bill_fk foreign key (bill_id) references ci_bill (bill_id)
)
partition by reference (ci_bill_char_bill_fk)
/
Jakub
March 01, 2011 - 1:56 pm UTC
there is no bill date in the master table, it is stored as an "entity attribute value" in the child table.
from the original poster:
Note that there is no date
datatype column and billing month is not stored in ci_bill,
bill date
abc, March 01, 2011 - 2:32 pm UTC
Yes, there is no bill date column in ci_bill, the parent/master table.
But Tom, even if the bill date were in the master table as a column - in addition to being a characteristic in the detail table - and we then used reference partitioning as mentioned above, how would partition pruning take effect when we query the detail table?
March 01, 2011 - 4:16 pm UTC
the presumption would be that you always drive from the parent into the child - the foreign key to the parent is the partition key for the child table.
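For example, using the (hypothetical) CI_BILL / CI_BILL_CHAR definitions Jakub posted above:

select c.bill_char_code, c.bill_char_value
  from ci_bill b, ci_bill_char c
 where c.bill_id   = b.bill_id
   and b.bill_date >= to_date('20110101','yyyymmdd')
   and b.bill_date <  to_date('20120101','yyyymmdd');

-- the range predicate prunes CI_BILL down to the relevant partitions, and
-- because CI_BILL_CHAR is reference partitioned on that parent, only the
-- corresponding child partitions need be touched by the join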
For Tom
abc, March 01, 2011 - 2:37 pm UTC
Tom, please comment on this.
Generally, most third party products have more or less the same generic model, so this is a general problem which limits Oracle users' ability to use the partitioning/partition pruning features. But Tom, tell me - is it also Oracle's policy to design models this way? If this product had been made by Oracle, would Oracle also have designed it this way? Or, now that Oracle has bought this product, will Oracle, in future versions, change the design of the tables to support time based partitioning?
March 01, 2011 - 4:19 pm UTC
In general, we do not use entity attribute values in our data models. IN GENERAL we do not.
Sometimes we do.
In general we use 'flex fields' - additional columns (that can be indexed and queried efficiently).
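In this particular case, a purely hypothetical flex-column version would look something like:

alter table ci_bill add ( bill_month date );
create index ci_bill_month_idx on ci_bill(bill_month);

select bill_id
  from ci_bill
 where bill_month >= date '2011-01-01'
   and bill_month <  date '2011-02-01';

-- a real, typed column can be indexed, constrained and even used as a range
-- partition key - none of which is available when "billing month" is just
-- another name/value row in the characteristics table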
Re: Bad designs for query performance
Jakub Illner, March 02, 2011 - 1:54 am UTC
I did not realize that even the bill date (month) is not available in the table. But maybe in this case you could use the bill id as a proxy? I.e. if the bill id is an ascending sequence, then probably each bill run will generate ids within a given range and the ranges will not overlap. So you could still use range partitioning - on bill id - and each partition could correspond to a single month. I am not sure if this will help for queries and loading; however, for data aging and archiving this could be a valid approach.
Jakub
March 02, 2011 - 7:34 am UTC
... I.e. if
bill id is ascending sequence, then probably each bill run will generate ids
within a given range and the ranges will not overlap. ...
it is not a good idea to assume that sequence value implies date or ordering of data.
In RAC in particular.
A really bad idea to assume anything regarding a sequence number other than:
it is unique.
Re: Bad designs for query performance
Jakub Illner, March 02, 2011 - 9:02 am UTC
Tom,
you are perfectly correct that there is no guarantee that sequence numbers are always ascending.
However if I am really desperate and need to partition data somehow in order to manage the growth of data over time I think that I could do it like I proposed:
1. run the billing process for the given month (or cycle)
2. find the max(bill_id)
3. split the last range partition for bill_id less than the max(bill_id)+1
4. increase the sequence to > max(bill_id) for all instances in the RAC
The last step will need a small procedure that will have to be executed on all instances of the RAC, since it is not possible to simply reset minvalue.
This could work since the billing is typically a batch process which has a clear start/end so there is (usually) time window for finding the max(bill_id) and increasing the sequence.
FK will impact performance ?
Rajeshwaran, Jeyabal, September 30, 2011 - 12:15 am UTC
Tom:
We have a table (say t_user) which holds a list of application user information. (Production data details: num_rows = 173 and total_blocks = 8.)
We are planning to design a new table which has a column called updated_user_key which is an FK to the t_user table. This table is initially populated with some existing data (say 114M records, so I loaded the data into this table and enabled the FK with the ENABLE NOVALIDATE option), and once deployed into production each transaction will insert roughly 1 to 20 records into this new table.
Questions:
1) I got feedback from a developer saying "Need to remove the FK on update_user_key because the FK check will slow down the insert transaction". I did a benchmark in the TEST environment and I don't see any slowdown in the insert transaction. Please let me know your thoughts on having this foreign key.
2) Why is the number of buffers in current mode for the second query (182714) about half that of the first query (354517)? I don't understand - can you please explain this?
create table t_user
(
user_id number,
user_name varchar2(30)
);
insert into t_user
select level,dbms_random.string('x',30)
from dual
connect by level <= 172;
commit;
alter table t_user add constraint t_pk primary key(user_id);
exec dbms_stats.gather_table_stats(user,'T_USER');
create table t_with_ri
(
user_id number references t_user,
user_name varchar2(30)
);
create table t_without_ri
(
user_id number,
user_name varchar2(30)
);
alter session set timed_statistics=true;
alter session set events '10046 trace name context forever, level 12';
declare
type array is table of t_user%rowtype
index by pls_integer;
l_array array;
begin
select *
bulk collect into
l_array
from t_user;
for i in 1..1000
loop
for k in 1..l_array.count
loop
insert into t_with_ri
values( l_array(k).user_id,
l_array(k).user_name);
insert into t_without_ri
values( l_array(k).user_id,
l_array(k).user_name);
end loop;
end loop;
commit;
end;
/
And the Tkprof shows me this.
********************************************************************************
INSERT INTO T_WITH_RI
VALUES
( :B1 , :B2 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 172000 30.37 31.59 0 1109 354517 172000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 172001 30.37 31.59 0 1109 354517 172000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
log file switch completion 2 0.07 0.13
********************************************************************************
INSERT INTO T_WITHOUT_RI
VALUES
( :B1 , :B2 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 172000 29.46 30.85 0 1130 182714 172000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 172001 29.46 30.85 0 1130 182714 172000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
log file switch completion 2 0.07 0.12
log buffer space 2 0.05 0.08
********************************************************************************
September 30, 2011 - 6:22 pm UTC
1) I got feedback from a developer saying "Need to remove the FK on update_user_key because the FK check will slow down the insert transaction". I did a benchmark in the TEST environment and I don't see any slowdown in the insert transaction. Please let me know your thoughts on having this foreign key.
tell the developer "so what, the data has to be clean - your application will have bugs, we will have orphans, data integrity is important"
A foreign key will add some runtime - but it will be orders of MAGNITUDE less runtime than if the developer attempts to do it themselves.
And - this is important - the developer won't be able to do the checking without locking the table. Ask them how fast will that be. We do it at the row level, they can only do it at the table level.
And if they say "no way, I don't have to lock the table" - post their logic and I'll tell you why it is wrong.
I get fed up with this attitude.
Do this
DEMAND from your developer the hard numbers to back up their malicious and incorrect CLAIM. Period.
Make sure that they present numbers that not only compare no fkey to a fkey - but also to their code *attempting* (but failing) to do so as well. Or if they say "data integrity is not an issue", have them present those facts that support that claim.
2) you are seeing the effect of doing the foreign key checking and the locking that has to take place (getting a block in current mode prevents others from doing so - it is a very short term lock there)
FK will impact performance ?
Rajeshwaran, Jeyabal, October 01, 2011 - 8:26 am UTC
Tom:
Thanks for your response.
2) you are seeing the effect of doing the foreign key checking and the locking that has to take place (getting a block in current mode prevents others from doing so - it is a very short term lock there)
I don't understand this - can you please explain in detail?
October 03, 2011 - 12:15 pm UTC
when we get a block in current mode, no one else can get that block in current mode. It shows us what exactly is on the block *right now* and can also act as a serialization device (since no one else can get it).
That is us serializing at the lowest level - for the foreign key checks. Your developer would have to serialize at a much higher level (a table lock, duration = length of the transaction).
You are seeing the effect of us doing a bit of extra work to enforce the foreign key relationship.
Note: if you update the parent table's primary key (even to the same value!) or delete/merge against the parent table, you almost certainly want to index the foreign keys! That will change your numbers - you'll see individual executions increase in IOs a bit (reading the index), at the expense of marginally increased CPU for each individual execution - but the payoff is that we WILL NOT lock the child table for the duration of the delete/update/merge operation on the parent table.
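For the test case above, that would simply be (index name made up):

create index t_with_ri_fk_idx on t_with_ri(user_id);

-- with this in place, a delete or primary-key update on T_USER no longer
-- needs to lock T_WITH_RI for the duration of the operation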
vinay, May 22, 2012 - 1:35 am UTC
Create table EMP with the following columns and constraints. Note: give descriptive enough names to the constraints.

Name          Type          Constraint
EMPNO         NUMBER(4)     Make this the primary key after creating the table
ENAME         VARCHAR2(10)
JOB           VARCHAR2(9)   CHECK CLRK/MGR/A.MGR/GM/CEO CLRK
MGR_ID        NUMBER(4)     References EMP
DATE_BIRTH    DATE
SAL           NUMBER(7,2)   More than 20000
COMM          NUMBER(7,2)   DEFAULT 1000
DEPTNO        VARCHAR2(3)   References DEPT
DATE_OF_JOIN  DATE          Default SYSDATE

Add a primary key constraint to the EMP table after creating the table.
May 22, 2012 - 8:15 am UTC
ok, go ahead. homework is fun.
Design for loading ref data on the fly
A reader, November 23, 2012 - 3:42 pm UTC
I have a 3rd party product which provides XML messages that we store in a relational form.
Say we have 2 tables;
ref (id number primary key, ref_value varchar2(30));
create unique index refval_x on ref(ref_value);
master(id number primary key, fld1 varchar2(10), ref_id references ref);
Both ids above generated using sequence
The XML message comes in this format;
<root>
<mastrec fld1="Hello" refval="ABCDEF"/>
<mastrec fld1="There" refval="ABCDEF"/>
<mastrec fld1="Bye" refval="HIJKLM"/>...etc
</root>
This XML structure is loaded into a temporary table to begin with.
We would create 3 MASTER records above. We only know about some of the possible valid values in refval and because we have a normalised design, we create REF records on the fly where needed. In the example above we end up with 2 REF records.
I've started to look at the existing code, which does this by calling a PL/SQL function that reads the REF table using REFVAL; if a record exists it returns the REF.ID, and if not, it creates a REF record as part of an AUTONOMOUS transaction and returns the newly created REF.ID;
e.g. update TEMP_TBL set REF_ID = get_ref(refval);
I thought this was bad in terms of context switching and could lead to data inconsistency. So I started to rewrite it using a 2-step process instead;
insert into ref (id, ref_value)
select ref_seq.nextval, refval        -- assuming a sequence ref_seq behind ref.id
  from (select distinct refval
          from temp_tbl
       ) t
 where not exists (select *
                     from ref
                    where ref.ref_value = t.refval
                  )
   log errors into err$_ref_table reject limit unlimited;
insert into master (id, fld1, ref_id)
select master_seq.nextval, t.fld1, r.id   -- assuming a sequence master_seq behind master.id
  from temp_tbl t
  join ref r
    on r.ref_value = t.refval;
All looked good until I found out that due to the huge volumes of messages, the load process is multi-threaded and therefore it could be inserting into REF the same REFVAL at the same time. At this point, whichever thread gets to do the insert into REF first would lock out the other thread which is then also trying to insert into REF.
Can you explain why? And what's the best way to get round this issue?
November 29, 2012 - 7:23 am UTC
At this point, whichever thread gets to do the insert into REF first would lock out the other thread which is then also trying to insert into REF.
that is not true - not unless
a) there was a unique constraint and all threads tried to insert the same values - I don't see that here
b) they actually were using insert /*+ APPEND */ into ref...
give example to reproduce with otherwise.
Design for loading ref data on the fly
A reader, November 27, 2012 - 1:42 am UTC
No issue after all, the "hang" occurs because both threads trying to lock resource (both inserting the same unique ref_value at the same time, which has a unique index on it)
So no matter which method I used, the hang would persist.
I guess the only way is to try and lock the ref_value using dbms_lock
November 29, 2012 - 7:50 am UTC
or use bulk operations and parallel inside the database instead of having a bunch of java jockeys try to do it outside the database with tons of code that isn't necessary to write in the first place of course!!!!
Design for loading ref data on the fly
A reader, November 30, 2012 - 2:53 am UTC
Well, we have what we have, the Java jockeys are here to stay.
The bulk operations I've implemented are better than the context switching and autonomous transactions used previously.
I've got another issue which is related: instead of using insert into .. log errors, I've switched the implementation to use forall .. save exceptions.
procedure process_data is
   type id_list_typ    is table of number;
   type descr_list_typ is table of varchar2(100);
   l_id_list    id_list_typ;
   l_descr_list descr_list_typ;
begin
   select ref_id_seq.nextval,
          inp.refval
     bulk collect into l_id_list,
                       l_descr_list
     from (select distinct
                  refval
             from temp_tbl
          ) inp
    where not exists (select * from ref_table ref where ref.refval = inp.refval);
   ..forall..save exceptions...
end;
What I need to achieve is to be able to perform the same select...nextval + processing afterwards but with the ability to accept different datasets from different tables e.g. so instead of (select distinct refval from temp_tbl) above, it could be (select distinct some_other_val from temp_data2 where x=1), etc
I know I can do it dynamically, but I would like to keep it static without having to do a UNION ALL i.e. (select distinct refval from temp_tbl UNION ALL select distinct some_other_val from temp_data2 where x=1)
Is there a way to parameterise the procedure to accept different input sources ?
November 30, 2012 - 4:09 am UTC
java jockeys can be educated as to better approaches - they can stay, but they can learn.
why not just use DML error logging and a single insert as select?
You'd have to either use dynamic sql here to change the identifier or the union all trick - since identifiers cannot be bound into the sql statement really.
why not use a single table with an identifier column so you can just
select distinct refval from temp_tbl where some_identifier = :x
and you bind in 'data1', 'data2', 'data3', ... and so on?
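Something along these lines (a sketch only, names invented):

create table temp_all
( some_identifier varchar2(30) not null,
  refval          varchar2(30)
);

select distinct refval
  from temp_all
 where some_identifier = :x;   -- bind in 'data1', 'data2', 'data3', ...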
A reader, November 30, 2012 - 4:41 am UTC
Thanks Tom.
There are many REF tables to be loaded; instead of having to create many DML error logging tables and then query every one for errors, it's easier in the exception handler code to loop round SQL%BULK_EXCEPTIONS - in fact the error handling code is exactly the same for every REF table, so it can be made into a procedure.
In our case, I'm finding the performance difference between FORALL and single DML to be negligible and also the chances of encountering errors are minimal - therefore FORALL was the better choice.
Single table as the "source" sounds good.
Yes, java jockeys can be educated - but we old fashioned database bods will be out of a job if they're so well trained that they no longer need rescuing!!
November 30, 2012 - 6:08 am UTC
you only need one DML error logging table, they can all use the same one, just bind in a different tag for each insert to be able to identify it!
you do not need a dml error logging table per table, one will do!!!
so, arguably, dml error logging will be a lot easier to code...
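A sketch of that - assuming the REF-style tables all share the same basic (id, ref_value) shape; the names ERR$_COMMON, REF_SEQ and TEMP_TBL are illustrative:

begin
   dbms_errlog.create_error_log( dml_table_name     => 'REF',
                                 err_log_table_name => 'ERR$_COMMON' );
end;
/

insert into ref (id, ref_value)
select ref_seq.nextval, t.refval
  from (select distinct refval from temp_tbl) t
 where not exists (select null from ref where ref.ref_value = t.refval)
   log errors into err$_common ('REF load ' || to_char(sysdate,'yyyymmdd hh24:mi:ss'))
   reject limit unlimited;

-- afterwards, the tag tells you which load rejected which rows:
select ora_err_tag$, ora_err_mesg$, count(*)
  from err$_common
 group by ora_err_tag$, ora_err_mesg$;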
Query on Design for new tables
A reader, November 30, 2012 - 9:10 am UTC
We have a new requirement to store some text data;
text_tbl (id number primary key, text_val varchar2 or clob)
It is anticipated that virtually all text_val values will be under 4k bytes long; an extremely small % will be >4k.
Having done a comparison between CLOB (securefiles and basicfiles, enable storage in row for both) and VARCHAR2(4000), we've found that VARCHAR2 is faster for both INSERTs and SELECTs when dealing with 4k or less (which the majority of cases will be).
We're tempted to change the design to have 3 columns;
text_tbl (id number primary key, text_val_vc varchar2(4000), text_len number, text_val_clob clob)
so that the app can select text_val_vc and text_len most of the time, but if it sees one > 4k, it gives the user the option to retrieve the text_val_clob field.
Could you please give me some advice on above?
December 03, 2012 - 6:24 am UTC
are you sure about your findings?
ops$tkyte%ORA11GR2> create table t1 ( x int primary key, y clob );
Table created.
ops$tkyte%ORA11GR2> create table t2 ( x int primary key, y varchar2(4000) );
Table created.
ops$tkyte%ORA11GR2> create sequence s;
Sequence created.
ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> create or replace procedure p1
2 as
3 l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
4 l_rec t1%rowtype;
5 begin
6 for i in 1 .. 10000
7 loop
8 insert into t1 (x,y) values (i,l_data);
9 end loop;
10 commit;
11 for i in 1 .. 10000
12 loop
13 select * into l_rec from t1 where x = i;
14 end loop;
15 end;
16 /
Procedure created.
ops$tkyte%ORA11GR2> create or replace procedure p2
2 as
3 l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
4 l_rec t2%rowtype;
5 begin
6 for i in 1 .. 10000
7 loop
8 insert into t2 (x,y) values (i,l_data);
9 end loop;
10 commit;
11 for i in 1 .. 10000
12 loop
13 select * into l_rec from t2 where x = i;
14 end loop;
15 end;
16 /
Procedure created.
ops$tkyte%ORA11GR2> @trace
ops$tkyte%ORA11GR2> alter session set events '10046 trace name context forever, level 12';
Session altered.
ops$tkyte%ORA11GR2> exec runStats_pkg.rs_start;
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> exec p1
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> exec runStats_pkg.rs_middle;
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> exec p2
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> exec runStats_pkg.rs_stop(10000);
Run1 ran in 268 cpu hsecs
Run2 ran in 246 cpu hsecs
run 1 ran in 108.94% of the time
Name Run1 Run2 Diff
LATCH.row cache objects 15,390 1,711 -13,679
LATCH.cache buffers chains 311,562 325,967 14,405
STAT...calls to get snapshot s 40,192 20,164 -20,028
STAT...redo entries 66,394 46,244 -20,150
STAT...file io wait time 24,793 64,695 39,902
STAT...physical read bytes 23,240,704 21,733,376 -1,507,328
STAT...cell physical IO interc 23,240,704 21,733,376 -1,507,328
STAT...physical read total byt 23,240,704 21,733,376 -1,507,328
STAT...logical read bytes from 856,498,176 854,392,832 -2,105,344
STAT...redo size 40,791,372 37,173,332 -3,618,040
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
466,160 483,679 17,519 96.38%
PL/SQL procedure successfully completed.
INSERT INTO T1 (X,Y) VALUES (:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 1.18 1.23 1 4884 64543 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 1.18 1.23 1 4884 64543 10000
********************************************************************************
SELECT * FROM T1 WHERE X = :B1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 0.41 0.40 0 0 0 0
Fetch 10000 0.25 0.23 2834 34729 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 20001 0.66 0.64 2834 34729 0 10000
********************************************************************************
INSERT INTO T2 (X,Y) VALUES (:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 0.98 0.99 1 4879 64471 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 0.98 0.99 1 4879 64471 10000
********************************************************************************
SELECT * FROM T2 WHERE X = :B1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 0.40 0.44 0 0 0 0
Fetch 10000 0.24 0.25 2648 34628 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 20001 0.65 0.70 2648 34628 0 10000
I don't see anything here so far that would make me consider using two columns...
Query on Design for new tables
A reader, December 04, 2012 - 3:44 am UTC
Thanks, I'm going to try your benchmark
What does your @trace do?
December 07, 2012 - 4:46 am UTC
turns on sql trace, the command you see right after @trace...
Assaf
Assaf Soudry, December 23, 2012 - 9:21 am UTC
Hello
This is a very long running thread and I am interested in the original question from 2003 and the responses at that time.
It's now 10 years later. Is the advice still the same? Tom, do you today still counsel against the Generic Data Model?
For example, we now have the Pivot function which can make readable/query-able views of the generic style tables. Does this change matters at all?
Also:
What if the database is envisaged to be relatively small in terms of numbers of records? Does this make a difference to your counsel?
For example, if the database defines the pages, fields and controls to be used within a system, their size, colours, captions, locations etc. The total amount of data here will not be great; it's not a customer database or a bills processing system, just a list of pages, fields, controls and their relationships.
Would a Generic approach such as the following be acceptable? (Pivot() views would be added).
<<sorry, I don't have an SQL editor on my internet station>>
Table: Control_Types(Control_ID, Control_Name);
Table: Control_Types_Properties(Control_ID, Control_Property, Property_Value);
Example Control_Properties: Height, Width, Type, is_data_control, triggers_special_functionality, and various other application specific properties
Table: Pages(Page_ID, Page_Name);
Table: Page_Properties(Page_ID, Page_Property, Property_Value);
Example Page Properties: Background Colour, Boundary Colour, display_as_single_column, and various other application specific properties
Table: Page_Controls(Page_ID, Control_ID) -- relationship table
Table: Page_Controls_Properties(Page_ID, Control_ID, Page_Control_Property, Property_Value);
Example properties: X_Position, Y_Position, and various other application specific properties
As in the original question, we are not sure what properties may be added to our system in the future.
Many thanks for your time
Assaf
January 04, 2013 - 10:29 am UTC
Tom, do you today still counsel against the Generic Data Model?
yes.
For example, we now have the Pivot function which can make readable/query-able
views of the generic style tables. Does this change matters at all?
no, that is good for reporting and such - but not for OLTP queries. You might use PIVOT against EMP to print deptnos across the page instead of down the page. But you wouldn't use it to query in an OLTP fashion for data.
What if the database is envisaged to be relatively small in terms of numbers of
records? Does this make a difference to your counsel?
depends on your definition of small and depends on whether it is a 100% reality that it will be small. Anything "good" gets big (and bigger and bigger)
if the database defines the pages, fields and controls to be used
within a system, their size, colours, captions, locations etc.
sounds a lot like rows and columns to me, a traditional parent child relationship. I see no reason to even consider a generic model yet.
The time a "generic" model might make sense is when you always query it by the OBJECT_ID. "get me all of the attributes of object=1234". That is very efficient.
if on the other hand you ask questions like "find me all of the objects that have red pages or were authored by frank or were (created before yesterday and are yellow)" - the generic model isn't going to be very "friendly"
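Using your Control_Types_Properties table as the example, the two query shapes look like this (the property names come from your post; the values are placeholders):

-- efficient: "everything about control 1234"
select control_property, property_value
  from control_types_properties
 where control_id = 1234;

-- unfriendly: "which controls are 20 high AND are data controls?"
select p1.control_id
  from control_types_properties p1,
       control_types_properties p2
 where p1.control_id       = p2.control_id
   and p1.control_property = 'Height'          and p1.property_value = '20'
   and p2.control_property = 'is_data_control' and p2.property_value = 'Y';

-- every additional attribute in the question means yet another self-join,
-- and every comparison is a string comparison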
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 02, 2013 - 2:21 am UTC
Tom,
I'm not getting similar results to yours for CLOB/VARCHAR2. I've tried your benchmark without runstats/trace, just using dbms_utility.get_time to get a simple elapsed time for insert/select on the CLOB and VARCHAR2 tables;
INSERT CLOB took 37.23 secs
SELECT CLOB took 0.51 secs
INSERT VARCHAR2 took 1.98 secs
SELECT VARCHAR2 took 0.42 secs
Getting the above times consistently, no one else on the system. What could be the difference between your CLOB and my CLOB for INSERTS ?
Are you storing CLOB as basicfile or securefile? Are there any other DB or storage parameters to consider, LOGGING/NOLOGGING,etc?
I'll try TRACE next to see if anything obvious.
January 04, 2013 - 1:54 pm UTC
please post things exactly like I did. using sql trace, tkprof. I don't know how you ran your test, what you used, how you gathered your numbers.
I use a vanilla database (I do happen to be in noarchivelog mode) and you have my create table statements (I on the other hand have nothing from you...)
In archivelog mode:
ops$tkyte%ORA11GR2> exec runStats_pkg.rs_stop(10000);
Run1 ran in 267 cpu hsecs
Run2 ran in 260 cpu hsecs
run 1 ran in 102.69% of the time
Name Run1 Run2 Diff
STAT...undo change vector size 1,831,444 1,814,544 -16,900
LATCH.simulator hash latch 6,665 26,334 19,669
STAT...calls to get snapshot s 40,289 20,174 -20,115
STAT...redo entries 65,552 45,227 -20,325
STAT...session uga memory 0 65,512 65,512
STAT...session uga memory max 123,452 0 -123,452
STAT...cell physical IO interc 622,592 98,304 -524,288
STAT...physical read total byt 622,592 98,304 -524,288
STAT...physical read bytes 622,592 98,304 -524,288
STAT...redo size 40,763,928 37,103,132 -3,660,796
STAT...logical read bytes from 853,000,192 847,192,064 -5,808,128
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
398,064 407,959 9,895 97.57%
PL/SQL procedure successfully completed.
INSERT INTO T1 (X,Y) VALUES (:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 1.30 1.50 4 5001 64534 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 1.30 1.50 4 5001 64534 10000
********************************************************************************
SELECT * FROM T1 WHERE X = :B1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 0.42 0.38 0 0 0 0
Fetch 10000 0.11 0.15 70 33677 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 20001 0.53 0.53 70 33677 0 10000
********************************************************************************
INSERT INTO T2 (X,Y) VALUES (:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 1.20 1.44 1 4989 64480 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 1.20 1.44 1 4989 64480 10000
********************************************************************************
SELECT * FROM T2 WHERE X = :B1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 0.40 0.39 0 0 0 0
Fetch 10000 0.10 0.12 0 33608 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 20001 0.51 0.52 0 33608 0 10000
just about the same
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 08, 2013 - 5:09 am UTC
Ok Tom, ran the same tests as you, RUNSTATS output as follows;
Run1 ran in 4112 hsecs
Run2 ran in 590 hsecs
run 1 ran in 696.95% of the time
Name Run1 Run2 Diff
STAT...redo entries 41,220 51,455 10,235
STAT...db block changes 61,948 72,234 10,286
STAT...session logical reads 133,252 122,162 -11,090
LATCH.cache buffer handles 262 13,958 13,696
STAT...non-idle wait count 20,432 3,238 -17,194
LATCH.enqueue hash chains 21,996 3,354 -18,642
LATCH.gcs resource hash 688 19,591 18,903
LATCH.enqueues 21,150 1,596 -19,554
LATCH.cache buffers chains 320,763 340,488 19,725
LATCH.ASM map operation hash t 20,485 627 -19,858
LATCH.object queue header oper 56,930 33,007 -23,923
LATCH.gc element 705 28,109 27,404
STAT...db block gets from cach 50,664 22,243 -28,421
STAT...calls to get snapshot s 50,096 20,089 -30,007
STAT...physical read bytes 139,264 106,496 -32,768
STAT...physical read total byt 139,264 106,496 -32,768
LATCH.loader state object free 40,000 0 -40,000
LATCH.gcs partitioned table ha 115,603 51,344 -64,259
STAT...session cursor cache hi 20,038 -45,496 -65,534
STAT...session pga memory -65,536 262,144 327,680
STAT...redo size for direct wr 520,000 0 -520,000
STAT...undo change vector size 2,279,024 1,520,316 -758,708
STAT...redo size 7,837,888 47,385,556 39,547,668
STAT...physical write bytes 81,920,000 0 -81,920,000
STAT...physical write total by 81,920,000 0 -81,920,000
STAT...cell physical IO interc 82,059,264 106,496 -81,952,768
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
679,887 573,339 -106,548 118.58%
I'm trying to get the trace file over to you (I'm not allowed on the DB server!), but in the meantime - can you explain why the CLOB is so much slower given the above stats?
January 14, 2013 - 11:37 am UTC
looks like your lobs are not cached - see the physical IO? all writes to disk.
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 08, 2013 - 5:41 am UTC
Ok, trace file out now as well;
INSERT INTO T1 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 9999 11.16 37.84 17 10288 92693 9999
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10000 11.16 37.84 17 10288 92693 9999
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=17 pr=13 pw=1 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
row cache lock 20 0.00 0.00
ges message buffer allocation 123 0.00 0.00
enq: HW - contention 59 0.00 0.03
db file sequential read 17 0.00 0.01
CSS initialization 2 0.00 0.00
CSS operation: action 2 0.00 0.00
Disk file operations I/O 10 0.00 0.00
CSS operation: query 6 0.00 0.00
kfk: async disk IO 9999 0.00 0.05
direct path write 9999 0.02 27.15
enq: FB - contention 5 0.00 0.00
asynch descriptor resize 498 0.00 0.00
********************************************************************************
INSERT INTO T2 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 9999 3.43 5.86 13 7605 84387 9999
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10000 3.44 5.87 13 7605 84387 9999
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=6 pr=4 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
ges message buffer allocation 758 0.00 0.00
library cache lock 2 0.00 0.00
library cache pin 2 0.00 0.00
row cache lock 14 0.00 0.00
enq: HW - contention 66 0.00 0.02
db file sequential read 13 0.00 0.01
enq: FB - contention 322 0.00 0.14
Disk file operations I/O 1 0.00 0.00
gc current multi block request 365 0.01 0.21
KJC: Wait for msg sends to complete 21 0.00 0.00
gc remaster 16 0.16 0.47
gcs drm freeze in enter server mode 8 0.20 1.58
January 14, 2013 - 11:38 am UTC
direct path write 9999 0.02 27.15
make your lob 'cache'
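e.g., for the test table above:

alter table t1 modify lob (y) ( cache );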
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 08, 2013 - 6:12 am UTC
And finally, I saw on another thread that you suggest changing NOCACHE to CACHE, and it did improve;
Run1 ran in 1155 cpu hsecs
Run2 ran in 618 cpu hsecs
run 1 ran in 186.89% of the time
So a lot better, but not as good as the results you're showing
January 14, 2013 - 11:38 am UTC
so, post the findings from that please
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 15, 2013 - 6:05 am UTC
Findings below.
I'm guessing it's the "undo change vector size" or "redo size" ? If so, what could be causing this and how can I fix it?
The only LOB storage parameter I changed was to set it to CACHE; the others are default: ENABLE STORAGE IN ROW, CHUNK 8192, LOGGING
Run1 ran in 1788 hsecs
Run2 ran in 1211 hsecs
run 1 ran in 147.65% of the time
Name Run1 Run2 Diff
STAT...gcs messages sent 10,149 0 -10,149
STAT...gc remote grants 10,149 0 -10,149
LATCH.gc element 20,618 10,374 -10,244
STAT...db block changes 71,970 61,574 -10,396
LATCH.simulator hash latch 17,246 5,999 -11,247
LATCH.gcs resource hash 20,880 5,396 -15,484
LATCH.object queue header oper 43,723 26,150 -17,573
LATCH.gcs partitioned table ha 70,722 43,343 -27,379
STAT...non-idle wait count 29,820 541 -29,279
STAT...db block gets 92,762 62,845 -29,917
STAT...db block gets from cach 92,762 62,845 -29,917
STAT...calls to get snapshot s 50,085 20,084 -30,001
STAT...db block gets from cach 50,660 16,253 -34,407
STAT...session logical reads 133,169 97,108 -36,061
STAT...session uga memory 28,736 65,488 36,752
LATCH.cache buffers chains 359,361 273,487 -85,874
STAT...undo change vector size 2,278,968 1,555,732 -723,236
STAT...redo size 68,360,528 36,528,392 -31,832,136
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
596,751 444,541 -152,210 134.24%
Tracefile output
================
INSERT INTO T1 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 9.87 11.65 0 10271 92682 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 9.87 11.65 0 10271 92682 10000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=4 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
ges message buffer allocation 9910 0.00 0.03
gc current grant 2-way 9872 0.00 3.07
gc current multi block request 32 0.00 0.02
row cache lock 35 0.00 0.00
enq: FB - contention 3 0.00 0.00
********************************************************************************
INSERT INTO T2 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 2.79 2.88 0 4126 62772 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 2.79 2.88 0 4126 62772 10000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=0 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
DFS lock handle 10 0.00 0.01
row cache lock 7 0.00 0.00
ges message buffer allocation 164 0.00 0.00
enq: FB - contention 160 0.00 0.07
KJC: Wait for msg sends to complete 2 0.00 0.00
January 15, 2013 - 12:40 pm UTC
interesting, wondering if RAC related. can you test on a single instance somewhere?
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 16, 2013 - 4:27 am UTC
Findings on a single instance below.
I can only infer from the findings that undo/redo could be the issue - is that right?
Run1 ran in 761 hsecs
Run2 ran in 682 hsecs
run 1 ran in 111.58% of the time
Name Run1 Run2 Diff
STAT...consistent gets from ca 10,701 577 -10,124
LATCH.object queue header oper 21,525 11,286 -10,239
STAT...db block changes 71,844 61,532 -10,312
STAT...IMU Redo allocation siz 476 11,212 10,736
STAT...db block gets from cach 92,672 62,820 -29,852
STAT...db block gets 92,672 62,820 -29,852
STAT...calls to get snapshot s 50,081 20,052 -30,029
STAT...physical read bytes 139,264 106,496 -32,768
STAT...cell physical IO interc 139,264 106,496 -32,768
STAT...physical read total byt 139,264 106,496 -32,768
STAT...db block gets from cach 50,668 16,238 -34,430
STAT...session logical reads 133,543 97,472 -36,071
STAT...session cursor cache hi 20,044 -45,510 -65,554
LATCH.cache buffers chains 359,484 272,527 -86,957
STAT...session pga memory -131,072 65,536 196,608
STAT...undo change vector size 2,279,104 1,561,916 -717,188
STAT...redo size 68,356,188 36,537,316 -31,818,872
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
414,914 326,139 -88,775 127.22%
Tracefile output
================
INSERT INTO T1 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 10000 3.22 3.27 17 10707 92626 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10000 3.22 3.27 17 10707 92626 10000
Misses in library cache during parse: 0
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 17 0.00 0.01
********************************************************************************
INSERT INTO T2 (X,Y)
VALUES
(:B2 , :B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 10000 1.99 1.98 13 4555 62780 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10000 1.99 1.98 13 4555 62780 10000
Misses in library cache during parse: 0
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 2532 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 13 0.00 0.01
January 16, 2013 - 12:18 pm UTC
I cannot remember if you told me your version or not - I don't see it after a quick glance.
There is a big difference in the number of query mode gets and current mode gets as well - the redo size difference is huge, as is the undo generated.
those would be the reasons why the times are so different, but I'd like to try to figure out why they are different in the first place...
so, please select * from v$version for us.
Query on Design for new tables - CLOB vs VARCHAR2
A reader, January 17, 2013 - 2:14 am UTC
We're on 11g Enterprise Edition 11.2.0.1.0 - 64bit Production
I was somewhat dubious of someone here telling me that CLOBs were slower than VARCHAR2, so I decided to do some tests myself. Now I'm surprised by the results of the INSERT comparisons and would like to find out why there's so much difference in UNDO/REDO; it'll be an education if you could shed any light on this.
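(For anyone reproducing this, a minimal sketch of how to watch those two counters for your own session around each insert loop, assuming select access on V$MYSTAT and V$STATNAME:)
select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where sn.statistic# = ms.statistic#
   and sn.name in ('redo size', 'undo change vector size');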
January 17, 2013 - 8:42 am UTC
If anyone else out there reading this has an 11.2.0.1 database (or an 11.2.0.3 database) - would you mind running the test case and posting the final results? I'm curious to see if it was an issue in 11.2.0.1 that was changed in a subsequent patch set.
I've tested in 10.2.0.5 and 11.2.0.3 - repeatedly see the same amount of redo, same amount of query/current mode gets for both p1 and p2.
runstats:
http://asktom.oracle.com/pls/asktom/ASKTOM.download_file?p_file=6551378329289980701
create table t1 ( x int primary key, y clob );
create table t2 ( x int primary key, y varchar2(4000) );
create sequence s;
create or replace procedure p1
as
l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
l_rec t1%rowtype;
begin
for i in 1 .. 10000
loop
insert into t1 (x,y) values (i,l_data);
end loop;
commit;
for i in 1 .. 10000
loop
select * into l_rec from t1 where x = i;
end loop;
end;
/
create or replace procedure p2
as
l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
l_rec t2%rowtype;
begin
for i in 1 .. 10000
loop
insert into t2 (x,y) values (i,l_data);
end loop;
commit;
for i in 1 .. 10000
loop
select * into l_rec from t2 where x = i;
end loop;
end;
/
alter session set events '10046 trace name context forever, level 12';
exec runStats_pkg.rs_start;
exec p1
exec runStats_pkg.rs_middle;
exec p2
exec runStats_pkg.rs_stop(10000);
Database 11g Enterprise Edition Release 11.2.0.1.0
Winston, January 17, 2013 - 10:45 pm UTC
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE 11.2.0.1.0 Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production
SQL>
SQL>
SQL>
SQL> alter session set events '10046 trace name context forever, level 12';
exec runStats_pkg.rs_start;
exec p1
exec runStats_pkg.rs_middle;
exec p2
exec runStats_pkg.rs_stop(10000);
Session altered.
SQL>
PL/SQL procedure successfully completed.
SQL>
PL/SQL procedure successfully completed.
SQL>
PL/SQL procedure successfully completed.
SQL>
PL/SQL procedure successfully completed.
SQL> Run1 ran in 1161 hsecs
Run2 ran in 587 hsecs
run 1 ran in 197.79% of the time
Name Run1 Run2 Diff
STAT...physical write total IO 10,017 0 -10,017
STAT...non-idle wait count 11,134 865 -10,269
STAT...consistent gets from ca 13,142 817 -12,325
LATCH.JS slv state obj latch 1 -16,096 -16,097
LATCH.row cache objects 24,273 2,014 -22,259
STAT...db block gets from cach 86,955 64,657 -22,298
LATCH.object queue header oper 2,164 25,843 23,679
STAT...calls to get snapshot s 51,464 20,228 -31,236
STAT...db block gets 96,955 64,657 -32,298
STAT...db block gets from cach 52,473 17,023 -35,450
STAT...session logical reads 141,751 102,783 -38,968
LATCH.cache buffers chains 341,323 302,328 -38,995
LATCH.loader state object free 40,004 2 -40,002
STAT...table scan rows gotten 42,476 580 -41,896
STAT...IMU undo allocation siz 42,964 916 -42,048
STAT...session uga memory 131,024 0 -131,024
STAT...file io wait time 262,087 12,189 -249,898
STAT...redo size for direct wr 520,220 0 -520,220
STAT...undo change vector size 2,396,012 1,584,940 -811,072
STAT...session pga memory max 4,521,984 0 -4,521,984
STAT...physical read total byt 688,128 7,004,160 6,316,032
STAT...physical read bytes 327,680 7,004,160 6,676,480
STAT...redo size 8,224,028 36,896,064 28,672,036
STAT...physical write bytes 81,920,000 0 -81,920,000
STAT...cell physical IO interc 91,889,664 7,004,160 -84,885,504
STAT...physical write total by 91,201,536 0 -91,201,536
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
473,206 370,458 -102,748 127.74%
PL/SQL procedure successfully completed.
********************************************************************************
SQL ID: 82n4rty9fvwt4
Plan Hash: 0
INSERT INTO T1 (X,Y)
VALUES
(:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 4.87 7.35 12 10700 95860 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 4.87 7.35 12 10700 95860 10000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 92 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=0 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
asynch descriptor resize 994 0.00 0.00
Disk file operations I/O 6 0.00 0.00
direct path write 10000 0.06 0.63
direct path sync 1 0.00 0.00
db file sequential read 14 0.00 0.00
reliable message 11 0.00 0.00
enq: CR - block range reuse ckpt 12 0.43 0.44
control file sequential read 21 0.00 0.00
Data file init write 11 0.31 0.50
db file single write 1 0.00 0.00
control file parallel write 3 0.00 0.00
buffer busy waits 1 1.07 1.07
********************************************************************************
********************************************************************************
SQL ID: ammarsd6fhd2j
Plan Hash: 0
INSERT INTO T2 (X,Y)
VALUES
(:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 2.42 2.55 0 4387 64502 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 2.42 2.55 0 4387 64502 10000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 92 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=0 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
log file switch completion 2 0.01 0.03
log buffer space 1 0.09 0.09
********************************************************************************
Results
Hoek, January 18, 2013 - 9:02 am UTC
From an ancient HP/WinXP laptop:
SQL> create table t1 ( x int primary key, y clob );
Table created.
SQL> create table t2 ( x int primary key, y varchar2(4000) );
Table created.
SQL> create sequence s;
Sequence created.
SQL>
SQL> create or replace procedure p1
2 as
3 l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
4 l_rec t1%rowtype;
5 begin
6 for i in 1 .. 10000
7 loop
8 insert into t1 (x,y) values (i,l_data);
9 end loop;
10 commit;
11 for i in 1 .. 10000
12 loop
13 select * into l_rec from t1 where x = i;
14 end loop;
15 end;
16 /
Procedure created.
SQL> create or replace procedure p2
2 as
3 l_data varchar2(3000) := rpad( 'x', 3000, 'x' );
4 l_rec t2%rowtype;
5 begin
6 for i in 1 .. 10000
7 loop
8 insert into t2 (x,y) values (i,l_data);
9 end loop;
10 commit;
11 for i in 1 .. 10000
12 loop
13 select * into l_rec from t2 where x = i;
14 end loop;
15 end;
16 /
Procedure created.
SQL> alter session set events '10046 trace name context forever, level 12';
Session altered.
SQL> exec runStats_pkg.rs_start;
PL/SQL procedure successfully completed.
SQL> exec p1
PL/SQL procedure successfully completed.
SQL> exec runStats_pkg.rs_middle;
PL/SQL procedure successfully completed.
SQL> exec p2
PL/SQL procedure successfully completed.
SQL> exec runStats_pkg.rs_stop(10000);
Run1 ran in 49895 hsecs
Run2 ran in 36545 hsecs
run 1 ran in 136,53% of the time
Name Run1 Run2 Diff
LATCH.begin backup scn array 10,022 0 -10,022
STAT...physical write total IO 10,115 47 -10,068
STAT...enqueue requests 11,017 814 -10,203
STAT...enqueue releases 11,017 814 -10,203
STAT...recursive calls 32,649 22,333 -10,316
LATCH.cache buffers lru chain 3,470 14,089 10,619
STAT...consistent gets 48,083 35,993 -12,090
STAT...consistent gets from ca 48,083 35,993 -12,090
STAT...min active SCN optimiza 0 12,488 12,488
STAT...Elapsed Time 49,907 36,556 -13,351
STAT...consistent gets from ca 16,596 2,645 -13,951
STAT...DB time 50,346 36,171 -14,175
LATCH.simulator hash latch 16,020 31,038 15,018
STAT...user I/O wait time 4,255 24,718 20,463
STAT...non-idle wait time 6,099 27,297 21,198
STAT...db block gets from cach 86,041 64,752 -21,289
LATCH.shared pool 44,955 22,976 -21,979
LATCH.enqueues 30,181 7,031 -23,150
LATCH.enqueue hash chains 31,940 8,748 -23,192
LATCH.AWR Alerted Metric Eleme 25,340 0 -25,340
LATCH.object queue header oper 8,163 33,820 25,657
STAT...non-idle wait count 30,485 3,470 -27,015
STAT...calls to get snapshot s 51,080 20,266 -30,814
STAT...db block gets 96,041 64,752 -31,289
STAT...db block gets from cach 52,157 17,104 -35,053
LATCH.loader state object free 40,038 14 -40,024
STAT...session logical reads 144,124 100,745 -43,379
LATCH.row cache objects 61,648 7,932 -53,716
STAT...session uga memory max 196,536 123,452 -73,084
LATCH.cache buffers chains 399,013 323,308 -75,705
STAT...table scan rows gotten 145,523 63,330 -82,193
STAT...session uga memory 196,536 0 -196,536
STAT...redo size for direct wr 520,132 0 -520,132
STAT...session pga memory 524,288 0 -524,288
STAT...undo change vector size 2,256,288 1,504,608 -751,680
STAT...session pga memory max 3,407,872 0 -3,407,872
STAT...physical read total byt 5,398,528 28,622,848 23,224,320
STAT...physical read bytes 1,433,600 26,820,608 25,387,008
STAT...redo size 7,918,248 36,625,144 28,706,896
STAT...physical write bytes 81,920,000 0 -81,920,000
STAT...cell physical IO interc 118,636,544 35,708,928 -82,927,616
STAT...physical write total by 113,238,016 7,086,080 -106,151,936
STAT...file io wait time 4,066,925 245,289,060 241,222,135
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
719,182 496,591 -222,591 144.82%
PL/SQL procedure successfully completed.
SQL> disc
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL>
TKPROF output:
...
INSERT INTO T1 (X,Y)
VALUES
(:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 9999 3.52 446.29 1 10701 95740 9999
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10000 3.52 446.29 1 10701 95740 9999
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 91 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=0 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
Disk file operations I/O 36 2.99 4.59
asynch descriptor resize 19997 0.00 0.18
direct path write 9993 0.49 30.16
latch: shared pool 4 0.05 0.10
db file sequential read 23 2.82 3.80
enq: CR - block range reuse ckpt 1 0.17 0.17
control file sequential read 231 3.94 13.44
Data file init write 11 0.00 0.00
db file single write 11 0.01 0.05
control file parallel write 33 0.07 0.17
rdbms ipc reply 11 0.35 2.09
buffer busy waits 2 1.50 1.91
log file switch completion 1 0.29 0.29
********************************************************************************
...
INSERT INTO T2 (X,Y)
VALUES
(:B2 ,:B1 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 10000 6.96 53.12 3 1899 64625 10000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10001 6.96 53.13 3 1899 64625 10000
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 91 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
0 LOAD TABLE CONVENTIONAL (cr=0 pr=0 pw=0 time=0 us)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
asynch descriptor resize 1 0.00 0.00
control file sequential read 105 0.91 5.37
db file sequential read 13 0.18 0.85
Disk file operations I/O 15 0.44 1.28
Data file init write 5 0.00 0.00
db file single write 5 0.01 0.02
control file parallel write 15 0.00 0.02
rdbms ipc reply 5 0.28 0.99
buffer busy waits 5 7.55 19.28
********************************************************************************
...
********************************************************************************
OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 11 0.03 0.04 0 0 0 0
Execute 14 8.38 47.59 0 0 1 14
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 25 8.41 47.63 0 0 1 14
Misses in library cache during parse: 5
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 27 0.00 0.00
SQL*Net message from client 27 24.80 28.87
log file sync 1 0.01 0.01
OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 826 0.45 1.27 0 4 10 0
Execute 41405 16.46 550.09 33 13346 161065 23630
Fetch 21874 2.08 267.50 3449 71955 0 21915
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 64105 18.99 818.87 3482 85305 161075 45545
Misses in library cache during parse: 180
Misses in library cache during execute: 156
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 3395 3.96 253.37
Disk file operations I/O 51 2.99 5.87
asynch descriptor resize 20027 0.00 0.18
direct path write 9994 0.49 30.17
latch: shared pool 5 0.05 0.15
db file scattered read 21 0.09 1.01
enq: CR - block range reuse ckpt 1 0.17 0.17
control file sequential read 336 3.94 18.82
Data file init write 16 0.00 0.00
db file single write 16 0.01 0.08
control file parallel write 48 0.07 0.19
rdbms ipc reply 16 0.35 3.08
buffer busy waits 7 7.55 21.20
log file switch completion 1 0.29 0.29
latch: cache buffers lru chain 1 0.11 0.11
40058 user SQL statements in session.
1338 internal SQL statements in session.
41396 SQL statements in session.
********************************************************************************
Trace file: D:\app\diag\rdbms\orcl\orcl\trace\orcl_ora_2416.trc
Trace file compatibility: 11.1.0.7
Sort options: default
2 sessions in tracefile.
40058 user SQL statements in trace file.
1338 internal SQL statements in trace file.
41396 SQL statements in trace file.
192 unique SQL statements in trace file.
677446 lines in trace file.
895 elapsed seconds in trace file.
...
January 18, 2013 - 10:36 am UTC
I guess that was 11.1?
Looks like something was changed in 11.1/11.2 that negatively affected inline clobs (I don't see these issues on 10.2.0.5 or 11.2.0.3) but was fixed by the 11.2.0.3 patchset.
Nope, it was 11.2
Hoek, January 18, 2013 - 2:02 pm UTC
from an ancient laptop... ;)
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
You're right, but you can't over-generalize
Prakash Nadkarni, July 17, 2015 - 5:16 am UTC
Hello Tom - I'm a fan of your books, which I've used to educate myself. However, I also happen to be one of the persons commonly cited in association with the term "EAV".
It's interesting that you're a vice president at Oracle: Oracle's health sciences division has two products, Oracle Clinical and ClinTrial (acquired in 2010), that employ EAV for the clinical data sub-component only - most of the tables correctly use traditional columnar modeling.
Maybe you should talk to the development teams for these products and find out what special circumstances exist that made them employ it.
I would be the first to agree with you that arguments like "flexibility" and "you don't have to worry about data structure" are idiotic. (In transactional systems, that is a recipe for disaster in terms of interactive validation of data elements.) Clinical data, however, happens to be a circumstance where attributes are highly sparse. So EAV is not *universally* bad.
Cheers,