Question and Answer

Tom Kyte

Thanks for the question, Mark.

Asked: September 19, 2007 - 12:18 pm UTC

Last updated: January 28, 2009 - 8:17 am UTC

Version: n/a

Viewed 1000+ times

You Asked

Hi Tom,

Michael Stonebraker, co-creator of Ingres and Postgres, is pushing his new database that stores data by column. Here is the article: http://www.computerworld.com/action/article.do?command=printArticleBasic&articleId=9034619
Though I don't necessarily agree with his claims (my BS alarm goes off whenever someone claims database x is k times faster than database y), it does sound like column databases could be useful in data warehousing. My question: Is there a way to organize Oracle data warehouse tables in a similar way? I suppose indexing each column would be similar enough, but if you have a large number of columns and/or a large table, build-time could be very high and manageability difficult.

Also, do you have any general opinions on such a column-organized database?

Thanks,
Mark

and Tom said...

There is a lot of confusion over them - and I would agree with you: anyone who says "N times faster" isn't being very intelligent.

We do have OLAP cubes - different from a columnar store, but similar in some respects
http://docs.oracle.com/cd/B19306_01/olap.102/b14349/toc.htm

And - if columnar data stores are "the wave of the future" (e.g., become a proven technology), you can be sure Oracle will have them. We have data mining built in, we have OLAP built in, we have relational, we have in memory, we have text, we have XML, we have <.... big list ....>

The article is somewhat self contradictory as well, for example:

... Stonebraker wrote. "Since many warehouse users are in considerable pain (can't load in the available load window, can't support ad-hoc queries, can't get better performance without a "fork-lift" upgrade), I expect this transition to column stores will occur fairly quickly." ...

so, load times are problematic - but later:

... Organizing data by rows does have its advantages. Writing data to disk in row format is faster than doing so by columns. ...

columnar databases have certain "loading" issues that are special


Stonebraker has announced "revolutionary breakthroughs that will forever change the way computing is done" many times in the past.


I remember the mid 1990's - "warning: objects are closer than they appear" was the tone back then....

In fact, here we have the death of traditional relational databases forecasted:

... Relational database vendors running "ancient code" will have a hard time extending their databases, said Gartner's Feinberg. "Building extensibility into old code paths is an almost impossible task. ...

http://findarticles.com/p/articles/mi_m0SMG/is_n3_v16/ai_18030941
That was over 11 years ago... If you notice, the companies spotlighted in that article are not really database competitors anymore. Informix no longer exists as a company, and Sybase is not really considered a competitor in the database field anymore (it is Microsoft, IBM and Oracle now...)

Here is a better one:
http://aurora.regenstrief.org/doc/OO/int9402.html

"STONEBRAKER: That's a great question. The implementation of all the relational systems is totally screwed up."

"STONEBRAKER: You've got to build a system that makes no built-in assumptions about what types exist. It has to be a thing we call a "razor," in which "blades" can be inserted. That's the only way you can make the system extendable, and it must be architected in the following way."



It didn't quite pan out that way - but he was as zealous about objects in 1994 as he is about columns in 2007.


I'm going to publish this one for fun :) I have a feeling the comments on it will be interesting.


Comments

Stonebraker is right - sort of...

Tomas Arvidson, September 20, 2007 - 7:31 pm UTC

Regarding the Stonebraker quotes at the end; Stonebraker is actually right in a way.

> "STONEBRAKER: That's a great question. The implementation of all the relational systems is totally screwed up."

Most existing "relational" DBMS are SQL-based and therefore they are not actually relational from a formal point of view, and none separate the logical and physical layers to the extent that they should.

SQL is not relational because, as a simple example, a SQL table may contain duplicates. A relation, on the other hand, is defined in the relational model as a (mathematical) set, and therefore may only contain unique items - no duplicates. The consequence is that what is true within relational theory may not be true for SQL, and to date no one has provided a verified, formal and unambiguous model for SQL. This mostly has consequences for the implementor of the language component (SQL) of the DBMS.
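(A quick sketch of the bag-versus-set point - the table name is made up purely for illustration: a SQL table with no key happily accepts duplicate rows, which a relation, being a set, cannot.)

create table t_dup ( x number );   -- no primary key or unique constraint

insert into t_dup values ( 1 );
insert into t_dup values ( 1 );    -- accepted: a SQL table is a bag, not a set

select count(*) from t_dup;        -- returns 2; a relation could hold only one such tuple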

Relational theory has nothing to say on how relations, their contents, or other structures are physically stored. However, most (SQL-based) systems make a 1-to-1 mapping between table rows and records on disk. If there are N columns in a table then there will be N consecutive fields stored for each row in the table. These fields are also stored in physical blocks/pages of some fixed size. This is clearly an implementation trade-off and it may not be the best trade-off in all circumstances. It would be preferable if you could specify the storage structure, that is, as rows, columns, or something else, when the table is defined or redefined - perhaps as an extended version of the physical storage clause of Oracle's CREATE TABLE statement.
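(As a point of reference - a sketch with invented names, not part of the comment above: Oracle already lets you pick among a few physical organizations at definition time, heap or index-organized for example; a column-oriented organization is simply not one of the choices in the releases discussed here.)

create table orders_heap
( order_id    number primary key,
  customer_id number,
  amount      number
);
-- heap-organized (the default): each row's column values are stored together
-- as consecutive fields inside fixed-size blocks

create table orders_iot
( order_id    number primary key,
  customer_id number,
  amount      number
)
organization index;
-- index-organized: the same logical table, physically stored as a b*tree
-- keyed on the primary key - a different storage choice, declared up front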

> "STONEBRAKER: You've got to build a system that makes no built-in assumptions about what types exist. ...

The relational model provides for this through the notion of domains (data types/classes), but most implementations do not, or do so only to a limited extent. Oracle supports this to some extent through its "object-relational" features - which, incidentally, is the solution Stonebraker suggests in the article. Unfortunately, one of Oracle's omissions is that it disallows the use of object types as keys, where they could have been a great aid for modeling natural keys.

N.B. None of these arguments say that current systems are unusable or "totally screwed up", only that they could be much better than they are.

Entity-Attribute-Value (EAV) modeling

Raj Kathamuthu, September 21, 2007 - 11:50 am UTC

The bio-informatics community would love to see an effective EAV implementation in Oracle. They have had this on their wish list for quite some time. It would definitely add value to specific application domains if Oracle could somehow provide this dynamic table (column-based storage) feature as an option, like Partitioning, OLAP, etc.

http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

www.jamia.org/cgi/reprint/14/1/86.pdf
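(For readers unfamiliar with EAV, a minimal sketch of the pattern being requested - table and column names are made up for illustration only.)

create table eav_data
( entity_id  number         not null,  -- which "thing" is being described
  attr_name  varchar2(30)   not null,  -- which attribute of that thing
  attr_value varchar2(4000),           -- the value, squeezed into one generic column
  constraint eav_data_pk primary key ( entity_id, attr_name )
);

-- one logical "row" becomes many physical rows
insert into eav_data values ( 1, 'GENE',       'BRCA1' );
insert into eav_data values ( 1, 'ORGANISM',   'Homo sapiens' );
insert into eav_data values ( 1, 'EXPRESSION', '0.82' );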


Thanks!

Mark, September 21, 2007 - 12:25 pm UTC

I appreciate your insight, Tom and Tomas.

Maybe just a storm in a cup ...

Gabe, September 26, 2007 - 12:55 pm UTC

Tomas:

Most existing "relational" DBMS are SQL-based and therefore they are not actually relational from a formal point of view

Sure ... but maybe you're giving Stonebraker a bit too much credit!

"The Third Manifesto" was, in good part, a response to Stonebraker's "Third Generation Database System Manifesto" which equated relational DBMS implementations to SQL-based ones.

http://www.dbmsmag.com/int9410.html
The most sensible argument in this whole saga comes, not surprisingly, from Jonathan Lewis over at Kevin Closson's blog:

http://kevinclosson.wordpress.com/2007/09/13/the-death-of-row-oriented-rdbms-technology/
Tom Kyte
September 26, 2007 - 10:06 pm UTC

Storm in a cup ...

EAV: UGH!

Dave Hemming, September 27, 2007 - 7:11 am UTC


The bio-informatics community would love to see an effective EAV implementation in Oracle. They
have had this on their wish list for quite some time. It would definitely add value to specific
application domains if Oracle could somehow provide this dynamic table (column-based storage)
feature as an option, like Partitioning, OLAP, etc.

http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

www.jamia.org/cgi/reprint/14/1/86.pdf



I just threw up a little in my mouth. A database has objects that have attributes - using those objects and attributes to hold objects that have attributes is like taking flour, eggs and milk* and baking a mixing bowl and whisk.

*Disclaimer: I am not a cook.

EAV - been there

JM, September 29, 2007 - 7:56 pm UTC

search asktom for "generic data model".

I remember quite a few discussions on asktom over the years about how terrible this EAV concept is for scaling a system.

Here is one of them. It isn't the one I was thinking of, but it makes many of the same points.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:74676260920442


Reg Stonebraker's Claims

Gokul, April 03, 2008 - 1:52 am UTC

Actually there is not much contradiction in what Stonebraker claims. Rowstore does have the advantage of faster inserts than Columnstore. But column-oriented databases mostly advocate trickle feeds instead of the nightly batch loading concept. They form an intermediate staging table kind of stuff, where they form the data for each column file and compress it and load it at one shot.
Tom Kyte
April 03, 2008 - 7:49 pm UTC

... They form an intermediate staging table kind of stuff, where they form
the data for each column file and compress it and load it at one shot. ...

um, please describe again how that is a trickle feed. That sounds like a batch load - again.

column stores do not do trickle feeds - if they could, they would be great for OLTP (they must support concurrent modifications and reads very well - IF they did trickle feeds)

Column Store Loading

Gokul, April 15, 2008 - 8:25 pm UTC

Ok. I think the term "loading at one shot" is somewhat misleading. What I meant was grouping the data to form one or a few database pages (say 8k or 16k) and loading them at one shot. I don't mean loading the entire table at one shot (nightly batch). The trickle feed goes into staging tables. After a time interval t, we will have n tuples/rows, which would be sufficient to form a column store page. So all the tuples are passed through the compression algorithm and the result gets copied into the column store as a single page (equivalent to a database block in Oracle).
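(In Oracle terms, the pattern described above would look roughly like the familiar staging-table micro-batch - a sketch with made-up table names, assuming the target table was created with COMPRESS; it says nothing about how a column store physically builds its pages.)

-- trickle inserts land in a small, uncompressed staging table
insert into sales_stage ( sale_id, sale_date, amount )
values ( 101, sysdate, 19.99 );

-- a periodic job drains the stage into the compressed target table
-- with a direct-path (append) insert, then clears the stage
insert /*+ append */ into sales
select * from sales_stage;

commit;

truncate table sales_stage;   -- assumes no trickle writers are active at this moment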

Tom Kyte
April 16, 2008 - 3:14 pm UTC

sorry, still doesn't compute.

Column Store Loading

Gokul, April 26, 2008 - 4:40 pm UTC

Please read the information in the link
http://www.databasecolumn.com/2008/02/insert-performance.html

Tom Kyte
April 28, 2008 - 1:09 pm UTC

I saw a bunch of words - and my favorite kinds of words:

... By implementing all these strategies, it is possible to have a column store with INSERT performance that is at least competitive in performance with that of the major row stores. In fact, in many cases, benchmarks have shown that load performance for a column store is typically better than that of a row store. ...

you know, the ones where they reference magic benchmarks they have done in house themselves. I saw someone ask for a pointer to said information (benchmarks) and all they got were "crickets" (the sound you hear in the forest when no one says anything....)

you know, using generalizations - we go from "it is possible to have a column store ... that is at least competitive" to "is typically better than that of a row store".

it also says....

... One approach to significantly mitigating the performance problem is to batch INSERTs and to perform the sorts, compression, and disk writes in large groups. ....

well, pardon me for not equating that statement (which I read a long time ago) with your "equivalent to a database block in oracle".



Hey, given that, I'll just close with this:

I've run benchmarks that show the column data store in question is so slow compared to Oracle as to appear to be standing still, hanging - relatively speaking.


Now, we are even.


I love their approach to "ACID". The use of the word "in general" is sort of amusing. I guess they never had both nodes of a cluster fail - I'm sure that never happens.

Sorry for the sarcasm here, but show us something - a fair and unbiased evaluation performed by a team that knows how to get the most out of both the row and the column storage databases.


Column Store Loading

Gokul, May 08, 2008 - 1:15 pm UTC

Let me clarify myself. I haven't compared the performance of both and I am not claiming that Column Store performs better than RowStore. All I am saying is that, since inserts are slower in a column store compared to a row store in a nightly batch load, the column-store folks (Stonebraker etc.) are recommending a staging table kind of approach for loading tables (as explained above). They have also made a case against nightly batch loading, saying that many nightly batches don't finish in time.

In one part of your review, you have written like this
"The article is somewhat self contradictory as well, for example:

... Stonebraker wrote. Since many warehouse users are in considerable pain (can't load in the available load window, can't support ad-hoc queries, can't get better performance without a "fork-lift" upgrade), I expect this transition to column stores will occur fairly quickly. ... "

Here Stonebraker has made one such case against nightly batch loading. Hope I am slightly clearer now..

But coming to the performance comparison of row stores vs. column stores, I am convinced of the fundamental concept. I think even Oracle follows that concept in the fast full index scan. Oracle goes to the index on a non-null column to get the count, instead of going to the table. It performs better because you read less (less sequential I/O). This is the same concept employed in column stores. You form projections on those related columns which might be queried together. But I really don't know about other complications which might accompany it.
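(The fast full scan behaviour referred to here is easy to see for yourself - a minimal sketch with made-up names, assuming an index on a NOT NULL column; the plan in the comments is the expected one, not a guarantee.)

create table t_demo ( id number not null, padding varchar2(100) );

insert into t_demo
select level, rpad('x',100,'x') from dual connect by level <= 100000;
commit;

create index t_demo_idx on t_demo ( id );

-- gather statistics so the optimizer can cost the alternatives
exec dbms_stats.gather_table_stats( user, 'T_DEMO' );

set autotrace traceonly explain
select count(*) from t_demo;
-- expected plan: SORT AGGREGATE over INDEX (FAST FULL SCAN) of T_DEMO_IDX -
-- the index is read as a skinny, row-oriented copy of the table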

By the way, I adore your writing style - please don't lose that sarcasm!!!!
Tom Kyte
May 12, 2008 - 9:57 am UTC

.... I haven't compared the performance of both and I am not
claiming that Column Store performs better than RowStore. ...

Really???!??!?!

quotes from you:

o Rowstore does have the advantage of faster inserts than Columnstore.

o Please read the information in the link
http://www.databasecolumn.com/2008/02/insert-performance.html


... But coming to the performance comparison of row stores vs. column stores, I am convinced of the fundamental concept. ...

without ever having touched one, without ever having compared one.

I am convinced we will one day travel the galaxies, in my lifetime, I've read about it, I'm convinced of the fundamental concept.

Our bitmap indexes are much closer to the concept of a columnar store than a b*tree with a fast full scan is - and the fast full scan is nothing like a columnar store. A fast full index scan is just the optimizer realizing "hey, I have a skinny version of a RELATIONAL (row oriented) table over here" - nothing more. A fast full scan is row oriented, a skinny table, nothing more.

... You form projections on those related columns which might be queried
together. ...

No, you have it a bit wrong there - that is a relational - row oriented thing you just described....


.... Here Stonebraker has made one such case against nightly batch loading. Hope I am ...


Yes, you are right, he points out the problem, proposes a hypothetical solution and says "this will be better". But we see nothing to back it up.


Stonebraker also sent us down the alley of "object relational" with the Illustra datablades. They were going to replace relational databases...

About Column Stores

Gokul, May 13, 2008 - 6:17 am UTC

OK! When I said "I haven't compared the performance", I meant I haven't done it experimentally. I quoted the link to make the process of inserts in a column store clear.
Anyway, apart from the arguments, I wanted to convince you that it was a case against nightly batch loading and I can claim a partial success there :)

"No, you have it a bit wrong there - that is a relational - row oriented thing you just described.... "

I hope you know that Stonebraker's product is Vertica. I hope you might also know that it started as a university project called C-Store. There is a white paper on C-Store; I got my knowledge of column-oriented DBs from reading it. I hope this link might be useful to you.
http://db.lcs.mit.edu/projects/cstore/vldb.pdf
I find C-Store's projections very similar to the skinny table you just described. Please correct me if you see some more differentiating concepts over there...

Of course, the famous column-oriented DB with which I have done some SQL is MonetDB. But their architecture is slightly different: they use a concept called BATs. Projections in C-Store need beforehand knowledge of the class of queries, but you don't need that in MonetDB.
Again, going by your argument, I am looking at the TPC-H performers. Hmm.. the top performer in most of the categories is EXASOL (as of May 13, 2008). I went to their website and I am looking at their whitepaper (you have to register in order to read it). It is at https://www.exasol.com/fileadmin/template/exasol/img/pool/BARC_EXASolution__WhitePaper.pdf
There is a statement like this on page 7 of the document: "The DBMS uses a column-based approach to storing data (similar to Sybase IQ)". I also find that they claim they are an RDBMS. I can't understand why you say that the concepts of RDBMS and column store can't co-exist.

Expecting your comments on the same.....
Tom Kyte
May 13, 2008 - 11:09 am UTC

... I meant I haven't done it
experimentally. I quoted the link to make the process of inserts in a column
store clear. ....

My car which I designed and built goes 200 miles per hour (safely, no worries there, you cannot die in this car) and gets 500 miles to the gallon. Would you be interested in it?

I see nothing in that article to back up any of that - no metrics, nothing.

....
Anyway, apart from the arguments, I wanted to convince you that it was a case
against nightly batch loading and I can claim a partial success there :)
....

No, you did not.

...
I hope you know that Stonebraker's product is Vertica....

current product, yes, Illustra and Postgres were before that - I am aware of his background.

... I find C-Store's projections very similar to the skinny table you just
described. Please correct me if you see some more differentiating concepts
over there...
...

In general, the c-stores are single column (can be more) but are physically stored sorted by some attribute - b*trees are not, the nodes on a leaf block are sorted, but the set as a whole- not so, not at all.


The top performer in the small databases is currently that one. (Where did I say these cannot co-exist? I said they are *different* and you cannot say a b*tree is like a c-store, but I never said they are orthogonal concepts that cannot coexist.)

And if you look at the sizing, their sizing (database sizing) is interesting. One of the things that people keep saying is "hey, these column stores will fit on my USB pen drive, they are so efficient"

Not so, apparently
Total Storage/Database Size Ratio:   32.63  1tb exasol
Total Storage/Database Size Ratio:   35.60  1tb oracle


and there you are comparing their 80 CPU/320 core implementation to a 32 CPU/64 core implementation - so it is really difficult to compare - but whatever (I don't see them in the big benchmarks, just the small ones that even SQL Server does...)


David Aldridge, May 13, 2008 - 1:15 pm UTC

A technique closer to column storage that I use quite frequently is to identify the columns that are most commonly of interest and load them into one or more separate "skinny" tables. A multitable insert in SQL, or any ETL tool, makes this simple. They are invoked transparently through query rewrite. With data segment compression it's generally more compact than indexing, and very cacheable.
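(A minimal sketch of the kind of structure described here - table and column names are invented, and it assumes query rewrite is enabled for the instance/session.)

-- load the commonly queried columns into a compressed "skinny" table
create table sales_skinny compress as
select sale_date, product_id, amount
  from sales;

-- register it as a materialized view on the prebuilt table so the optimizer
-- can transparently rewrite matching queries against SALES to use it
create materialized view sales_skinny
  on prebuilt table
  enable query rewrite
as
select sale_date, product_id, amount
  from sales;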

About Column Stores

Gokul, May 14, 2008 - 9:33 pm UTC

"In general, the c-stores are single column (can be more) but are physically stored sorted by some attribute - b*trees are not, the nodes on a leaf block are sorted, but the set as a whole- not so, not at all. "

I still can't understand what you are getting at. So the difference stated is:
a) Column stores don't have non-leaf nodes and b-trees do. Non-leaf nodes are not sorted in a B-tree. I don't think the results of a fast full index scan are sorted.

I think I need not convince you that fast full index scans won't care about non-leaf nodes. I have studied a bit about the Lehman and Yao B-tree algorithm. When you do a fast full index scan, you come across far fewer non-leaf nodes than leaf nodes (of course it depends on the size of the column data being indexed, and the above argument doesn't hold for very large columns). So column stores are closer to the skinny table concept than the b-tree is, since they don't have non-leaf nodes. So my argument remains the same: the skinny table approach, which Oracle tried to follow in the fast full index scan, is being followed in column stores.

.....

I really can't argue much when you switch to Marketing mode from technical mode. :). I wonder how a person who saw the difference in cores could miss differences like 512 GB RAM vs. 16 GB RAM, a price/performance ratio of 28:1, etc. Exasol used mostly commodity hardware, whereas Oracle used enterprise hardware. They didn't need external RAID, whereas Oracle did. Certainly the Exasol design has a lot we can learn from.
Going by Boolean logic, you have to place the blame on one thing: either on the Oracle software or on the Oracle professionals who participated in the TPC benchmarks, since we can't question the benchmarks themselves.


.....

But I accept that the claim on column-store compression should be looked at again. Compression algorithms exploit the similarity of data, and column data is more similar to itself than row data is. So I am still convinced of the claim theoretically. I accept that theoretical knowledge is nowhere near practical proof in the database world.
Tom Kyte
May 16, 2008 - 12:20 pm UTC

column stores - if you full scan them (multi-block IO), data comes out sorted.

b*trees - if you fast full scan them (multi-block IO), they are NOT sorted at all.

so, no, the statement:

The skinny table approach, which Oracle tried to follow in the fast full
index scan, is being followed in column stores.

is false. Column stores can rely on data being retrieved in sorted order during a full scan. We cannot, not with b*trees, not with IOTs, not with clusters, not with anything - we don't have the concept as of 11gr1.
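(The sorted-versus-unsorted distinction is easy to see in the plans - a sketch with made-up names, assuming a table T with an index T_IDX on a NOT NULL column ID; the comments describe the expected plans.)

set autotrace traceonly explain

-- fast full scan: the index is read with multiblock I/O in whatever order
-- the blocks sit in the segment, so the ORDER BY needs an explicit sort step
select /*+ index_ffs(t t_idx) */ id from t where id is not null order by id;

-- index full scan: the leaf blocks are walked in key order, so the same
-- query can come back already sorted and the SORT ORDER BY step disappears
select /*+ index(t t_idx) */ id from t where id is not null order by id;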

... I really can't argue much when you switch to Marketing mode from technical
mode. :)....


they used clusters; you'd best multiply some of those numbers (they actually used something like 80 cores versus 32).

And give it a couple of weeks, tpc's are like the stock market.
Interesting, you point to a marketing paper.... and you say I switched? We were already well there.

About Column Stores

Gokul, May 15, 2008 - 2:30 am UTC

"..... With data segment compression it's generally more compact than indexing, and very
cacheable"

Nice try.... If you can post more data, like the size ratio between the index and your mat. view, it might be more helpful. An index may not be suitable if the number of columns exceeds a critical number. Also, I hope you can understand that if Oracle provided this feature built in, then you would not spend the extra cycles updating a view, nor pay the extra storage cost for it. This might even help Tom put across a case to implement column store features in some future version of Oracle....
Tom Kyte
May 19, 2008 - 1:54 pm UTC

Gokul -

what do you mean by "nice try"

If you wanted a skinny version of a table - then a create table compress as select <columns you want> order by <something> will generally be smaller than a create index <columns you want> compress.

Why? The index contains all of the columns AND the rowid.
The table contains all of the columns, but no rowids (rowids are implied).

And if the index was compressible because the leading edge repeated a lot - then the table itself would be very compressible, because we'd see LOTS OF repeated information.

Here is a quick and dirty example. But - if the index is compressible, then a table created sorted by that key would be very compressible as well - and since it does not contain the rowid, the branch blocks, or other index structure overhead - it would in general be MORE COMPACT than an index.

big_table%ORA10GR2> create index big_table_idx on big_table(owner,object_type,object_name) compress 3;



Index created.

big_table%ORA10GR2> create table big_table_skinny compress as select owner,object_type,object_name from big_table order by owner,object_type,object_name;

Table created.

big_table%ORA10GR2> exec show_space( 'BIG_TABLE');
Unformatted Blocks .....................             416
FS1 Blocks (0-25)  .....................         298,014
FS2 Blocks (25-50) .....................               1
FS3 Blocks (50-75) .....................               2
FS4 Blocks (75-100).....................             314
Full Blocks        .....................       1,542,414
Total Blocks............................       1,848,576
Total Bytes.............................  15,143,534,592
Total MBytes............................          14,442
Unused Blocks...........................           5,120
Unused Bytes............................      41,943,040
Last Used Ext FileId....................               5
Last Used Ext BlockId...................       1,843,209
Last Used Block.........................           3,072

PL/SQL procedure successfully completed.

big_table%ORA10GR2> exec show_space( 'BIG_TABLE_SKINNY');
Unformatted Blocks .....................               0
FS1 Blocks (0-25)  .....................               0
FS2 Blocks (25-50) .....................               0
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................               0
Full Blocks        .....................         177,015
Total Blocks............................         180,224
Total Bytes.............................   1,476,395,008
Total MBytes............................           1,408
Unused Blocks...........................           2,549
Unused Bytes............................      20,881,408
Last Used Ext FileId....................               5
Last Used Ext BlockId...................       2,228,105
Last Used Block.........................           5,643

PL/SQL procedure successfully completed.

big_table%ORA10GR2> exec show_space( 'BIG_TABLE_IDX',user,'INDEX');
Unformatted Blocks .....................               0
FS1 Blocks (0-25)  .....................               0
FS2 Blocks (25-50) .....................               1
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................               0
Full Blocks        .....................         199,094
Total Blocks............................         204,800
Total Bytes.............................   1,677,721,600
Total MBytes............................           1,600
Unused Blocks...........................           5,020
Unused Bytes............................      41,123,840
Last Used Ext FileId....................               5
Last Used Ext BlockId...................       2,047,881
Last Used Block.........................           3,172

PL/SQL procedure successfully completed.




The table is smaller, more compact, than the corresponding 'compressed index' because of the segment compression and the fact that there is no rowid.

About Column Stores

Gokul, May 16, 2008 - 4:20 pm UTC

"Column stores can rely on data being retrieved in sorted order during a full scan."

Hmm.... yes. That's a difference I missed, because not all column stores rely on that technique. As I have already pointed out, MonetDB doesn't rely on it. But since we were discussing Vertica, I should have noticed that.

"Interesting, you point to a marketing paper.... and you say I switched?"

Well, I tried to point out some information in the paper. All I pointed out was that they are using column stores, they are an RDBMS, and they have put that under their technical white paper section. Maybe you can say I am marketing column-level storage to Oracle. Nothing more than that.

"We were already well there."

I am ignoring this, as you have no way of knowing that I am in no way related to Exasol/Vertica.

Nice discussion though... Thanks..
Tom Kyte
May 19, 2008 - 3:14 pm UTC

there is nothing concrete in that paper. There are no numbers.


When I said "we were already well there" - we were, we were discussing a marketing paper with no concrete facts in it.

Sri, May 18, 2008 - 8:09 pm UTC

That conversation between Gokul and Tom was wonderful. (Pssst. What was all that about? What is a row store anyway?) For illiterates like me, some light on this subject would be greatly appreciated.

Thanks

Tom Kyte
May 19, 2008 - 4:16 pm UTC

a row store stores data in rows....

like a spreadsheet

About Column Stores

Gokul, May 20, 2008 - 4:05 pm UTC

"If you wanted a skinny version of a table - than a create table compress as select <columns you want> order by <somthing> will generally be small than create index <columns you want> compress. "

I think I should have been clearer in my statement. I accept that the compressed table occupying less space is obvious, but it involves periodically updating it from the main table. Creating an index/mat. view avoids the need for a periodic update (through a manual stored proc/job). But if Oracle could provide column stores, there would be twin advantages:
a) The space needed for the index/mat. view is saved, because the required projection is both part of the table and also serves the purpose of grouping related columns.
b) There is only one insert into the column store projection. Right now we have to insert into the main table and update the auxiliary structure (index/m. view).

Hope I am slightly clearer this time..

Your example clearly shows the space advantages, but I think it's only because of the non-leaf nodes in the index. You pointed out that the index also stores rowid information. I think the table should also have some space for an undo-log pointer in each row. A heap tuple (table row) should probably have more per-tuple control information than an index tuple. At least that's what I have observed in the open source databases.
Maybe I am wrong about Oracle..


Tom Kyte
May 20, 2008 - 4:20 pm UTC

in a warehouse, where these would be used, the "involves an update process" part is not a big deal - e.g., not a show stopper


if you look at column stores, you'll see they tend to store the same data redundantly (hence their c-store was about the same size as the bad old relational data), they store them over and over sorted differently - just like we do with indexes and materialized views.


and they maintain all of the redundant c-stores just like we do indexes and materialized views.

as for:

... but I think it's only because
of the non-leaf nodes in the index. ..

no, that is wrong. It is the rowid. an index block has the same overheads as a table block does for undo and transaction information.

The index has the key PLUS the rowid.

The table has the key - the rowid is INFERRED from the file the block is in, the block the row is on, and the slot on the block where the row is to be found; the rowid is missing in the table.
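(The "rowid is inferred" point can be seen by decomposing a rowid with the DBMS_ROWID package - a minimal sketch, assuming any existing table T.)

select rowid,
       dbms_rowid.rowid_relative_fno( rowid ) as file_no,
       dbms_rowid.rowid_block_number( rowid ) as block_no,
       dbms_rowid.rowid_row_number( rowid )   as row_slot
  from t
 where rownum <= 3;
-- the table never stores these three values; they are derived from where the row
-- physically lives, whereas an index entry must carry a full rowid to point back at it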



...A heap tuple (table row) should probably have more per-tuple
control information than an index tuple...

umm - why? I would think the opposite, actually. And how have you derived this from "open source databases" - got a pointer?

Column Stores

Gokul, May 21, 2008 - 3:15 pm UTC

"if you look at column stores, you'll see they tend to store the same data redundantly (hence their c-store was about the same size as the bad old relational data), they store them over and over sorted differently - just like we do with indexes and materialized views."

I accept the argument, but then the redundant storage provides the fail-over part. The designer can still design projections without duplicates. Nevertheless, if Oracle provides this, it's an optional feature which might come in handy in some circumstances.

"no, that is wrong. It is the rowid. an index block has the same overheads as a table block does for undo and transaction information. "
Hmm... Oracle has an undo log pointer from the index tuple?? That really confuses me a little. I have referred to the Postgres source (8.3), where a heap tuple carries 24 bytes of overhead and an index tuple carries 8 bytes of overhead. I haven't counted the Postgres null bitmap storage, which is not present in Oracle tuple storage. But they don't store the timestamp information in the index tuple, whereas you guys do (that's how Oracle can support scans which go only through the index).

Say there is an index tuple and I make an update to the indexed column. According to your statement, I believe the old index tuple is moved to the undo log and there is a pointer to it from the new index tuple. This won't work out, as index-only scans will then become largely inefficient. I believe an index structure should maintain all the versions of the index tuple in the b-tree itself. I am unable to think of an alternate design where an index tuple might need an undo log pointer.

Can you explain this to me?? Thanks
Tom Kyte
May 21, 2008 - 3:29 pm UTC

... I accept the argument, but then the redundant storage provides the fail-over
part. ...

the failover part? hardly, it provides them the same data over and over sorted. And if they designed without duplicates then they would have a c-store per column and would be joining massively - which means in real life they would not.

they use redundancy for performance, just like we do with indexes, and the indexes of the warehouse - the materialized view.

... Oracle has an undo log pointer from the index tuple?? ...

absolutely - how could it *not*? The index block needs to be rolled back (sometimes independent of the table blocks!!!!! we roll back for queries, remember... we do not store locks in a "list", we store them on the block itself). I'm sure Postgres blocks - all of them - have a BLOCK HEADER. That is where this ITL (interested transaction list) stuff is stored in Oracle. All blocks have this.


I don't get your last paragraph at all. Your statement "I think" doesn't come into play here - it works the way it works, not the way you think it should or could work.

When you modify a block (any block) we store information to UNDO that modification in the undo tablespace. We store information to REDO that change in the redo logs. When you rollback, we process that stuff. Index blocks, table blocks, cluster blocks, blocks is blocks is blocks.


Say you did a query such as:

select a from t where a is not null;

and say there was an index on A.

Now, the optimizer would likely pick that index on A as either fast full scan or index full scan.

It will NOT hit the table at all.

Now, as we traverse the index structure, we need to do it consistently - just like a table. If we hit a block that was *modified* since our transaction began - IN THE INDEX - we need to rollback that modification to make the block appear as it did right before the modification. So, we use the transaction information in the block header to find the undo trail for that block and we undo the changes (which include undo-ing the changes to the block header itself). We look to see if the block is old enough now and if not, we do it again, and again, and again - until we are done.

The table is never consulted for this query, the index is used as a skinny version of the table, the table could not be the only place where this information is stored - that would be a horrendous performance impact.


... I am unable to think of an alternate design where an index tuple
might need an undo log pointer. ...

You need to learn about Oracle then - multi-versioning, read consistency.
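(For readers following along: the same undo machinery can be exercised directly from SQL with a flashback query - a minimal sketch, assuming a table T with a numeric column X and enough undo retention; it is only meant to show that past versions of data are rebuilt from undo.)

update t set x = x + 1;
commit;

-- read the table as it was five minutes ago: Oracle rebuilds the blocks from
-- undo, the same mechanism a long-running query relies on for read consistency
select * from t as of timestamp ( systimestamp - interval '5' minute );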

Column Stores

Gokul, June 02, 2008 - 9:08 pm UTC

"the failover part? hardly, it provides them the same data over and over sorted."

If the C-Store has the same column in two projections and one goes offline, then the optimizer might be wise enough to provide the query results from the other projection. This was what I meant by failover. Am I missing something here??

And if they designed without duplicates then they would have a c-store per column and would be joining massively - which means in real life they would not.

I accept this argument for the C-Store way of storing the data. But if you look at MonetDB (another column store, open source), they store each column in a separate file. Yet their joining mechanism is efficient.

"The index block needs to be rolled back (sometimes independent of the table blocks!!!!! )"

Actually, undo log is used in Oracle for two purposes - time-travel (to go to the previous version of records) and rollback. My last argument was that time-travel can't be achieved in the index through undo log. All the previous versions of a record have to be in the B-tree structure. Hence there is no need to store an undo-log pointer for time-travel purposes. If the statement needs more explanation, let me know.
But I wonder why we can't roll back the table and index together and have an undo-log pointer only for the table. If we are going to roll back the table, then we can anyway reach the undo log block and from there we can reach the index block.

"we roll back for queries remember... we do not store locks in a "list", we store them on the block itself). I'm sure postgress blocks - all of them - have a BLOCK HEADER. That is where this ITL (interested transaction list) stuff is stored in Oracle. All blocks have this. "
Yeah.. i know this. You have mentioned it in your book, when explaining initrans and maxtrans.

"Now, as we traverse the index structure, we need to do it consistently - just like a table. If we hit a block that was *modified* since our transaction began - IN THE INDEX - we need to rollback that modification to make the block appear as it did right before the modification."

Let me reiterate my previous point. Say we have an index on column "A". Let's say column A has values 1-5000 in 5000 rows (uniformly distributed). Now we issue an update like "update table set A=5001 where A=5". If you visualize the B-tree leaf pages, the updated row should be in the leaf page at the right end and the old row should be in the leaf page at the left end. That's how it can support a query "select * from table where A=5" (issued before the update) and a query "select * from table where A=5001" (issued after the update commits) simultaneously. If you can think of how the b-tree is getting navigated, you can easily get my point. Instead, if you say that the older version is reached only through the undo log, then you would have to scan all the index blocks and reach their undo blocks to verify whether at least one row has the value of "5" in its older version. That's why I said time-travel/version-travel in the index can't happen through the undo log and all the versions have to be stored in the b-tree. This can work for an index full scan, but not for normal index scans, and we have only one index structure which supports both scans.

Hope I am clear now.

"You need to learn about Oracle then - multi-versioning, read consistency."

I have a fair idea of this. I am just saying that this can't work for an index; it can only work for a table.

Tom Kyte
June 03, 2008 - 11:06 am UTC

please take the comment in the context of your entire paragraph:

...
I accept the argument, but then the redundant storage provides the fail-over part. The designer can
still design projections without duplicates. Nevertheless, if Oracle provides this, it's an optional
feature which might come in handy in some circumstances.

.....

you seem to be saying "but they use redundancy for failover" (the first time failover has been mentioned). They do not use redundancy for failover, they use it for the same reason we use materialized views - to avoid WORK. Further, now you write:

...then the optimizer might be wise enough to provide the query results from the
other projection....

meaning - it might not be, in fact probably isn't. They do not store redundant copies for failover, they do it for performance. If there is some higher availability feature by accident there - it is by accident, it was not their GOAL (if it were their goal, the redundant copy would be a mirror image - so that when you fail over, performance doesn't go down the tubes)


... Yet their joining mechanism is efficient.
...


unfounded, where is the science, the numbers (and what does "store each in a different file" have to do with anything?)



... My last argument was that
time-travel can't be achieved in the index through undo log ....

you are wrong there. You are entirely, utterly, completely wrong there.

Undo is used for one thing - rolling back (introducing a fancy term like 'time travel' just confuses the issue). Any structure can use the undo generated for it to put it back the way it was.


stop visualizing how you think it has to be done - it doesn't work that way. Just because something could have been done the way you think it could be does not mean that it was.

Oracle does not work the way you describe at all.



....
"You need to learn about Oracle then - multi-versioning, read consistency."

I have a fair idea of this. I am just saying that this can't work for an index; it
can only work for a table.
......


You have a long way to go yet; you persist in thinking it must work the way "I (meaning you)" think it should work. Until you lose that perspective - you will get it wrong.


....
If you can think of how the b-tree is getting navigated,
you can easily get my point.
........

you need to rethink that yourself.

ops$tkyte%ORA9IR2> create table t tablespace USERS as select * from all_objects;

Table created.

ops$tkyte%ORA9IR2> create index t_idx on t(object_id) tablespace TOOLS;

Index created.

ops$tkyte%ORA9IR2> exec dbms_stats.gather_table_stats(user,'T');

PL/SQL procedure successfully completed.

ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2> set autotrace traceonly explain
ops$tkyte%ORA9IR2> select count(*) from t;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=1)
   1    0   SORT (AGGREGATE)
   2    1     INDEX (FAST FULL SCAN) OF 'T_IDX' (NON-UNIQUE) (Cost=4 Card=30755)



ops$tkyte%ORA9IR2> set autotrace off
ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2> delete from t;

30755 rows deleted.

ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2> declare
  2          pragma autonomous_transaction;
  3  begin
  4          execute immediate 'alter tablespace users offline';
  5  end;
  6  /

PL/SQL procedure successfully completed.

ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2> select count(*) from t;

  COUNT(*)
----------
         0

ops$tkyte%ORA9IR2> select * from t;
select * from t
              *
ERROR at line 1:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9: '/home/ora9ir2/oradata/ora9ir2/users01.dbf'


ops$tkyte%ORA9IR2>
ops$tkyte%ORA9IR2> declare
  2          pragma autonomous_transaction;
  3          n number;
  4  begin
  5          select count(*) into n from t;
  6          dbms_output.put_line( n );
  7  end;
  8  /
30755

PL/SQL procedure successfully completed.

ops$tkyte%ORA9IR2> declare
  2          pragma autonomous_transaction;
  3  begin
  4          execute immediate 'alter tablespace users online';
  5  end;
  6  /

PL/SQL procedure successfully completed.

ops$tkyte%ORA9IR2> rollback;

Rollback complete.

ops$tkyte%ORA9IR2> select count(*) from t;

  COUNT(*)
----------
     30755




think about that for a while.


delete all records from t - NO COMMIT
offline the datafiles for the table (but not the index)

queries that hit the index are OK; the query that hits the table fails - it cannot see the table.

so, we use an autonomous transaction - and hit the index and the index is as it was BEFORE the delete (we used UNDO for that....)

we online the datafiles and rollback - and do the count again; it is greater than zero - showing that undo was applied to the table - the transaction had never closed - we used UNDO on the index to roll it back.


I have other ways to show this using tkprof as well - but - undo does not work the way you ENVISION it should.

You are missing the point of how multiversioning (which works at the block level) *WORKS*

Stop thinking "rows", start thinking "traverse all of the blocks as of the same consistent committed point in time"



Column Stores and Indexes

Gokul, June 03, 2008 - 2:20 pm UTC

They do not store redundant copies for failover, they do it for performance. If there is some higher availability feature by accident there - it is by accident, it was not their GOAL (if it were their goal, the redundant copy would be a mirror image - so that when you fail over, performance doesn't go down the tubes)
Yes, you are correct. (Probably I need to spend some time experimenting on this.)

unfounded, where is the science, the numbers (and what does "store each in a different file" have to do with anything?)

If you download MonetDB, you would see that it stores only one column in a BAT (equivalent to a projection). For technical reasoning, please read this
http://www.cwi.nl/themes/ins1/publications/docs/MaBoKe:VLDB:00.pdf

I went through your example, but it's not what I am trying to say. I don't say that index-only scans can't provide a read consistent view. I say that they can't provide the read consistent view through undo log. I am looking at the figure of the b-tree index which you have drawn in your "Expert Oracle" book (8i). In the figure, the leftmost leaf page contains the following contents: (100, rid), (101, rid)....
and the rightmost leaf page contains (0, rid), (1, rid)....

But if we look at it more closely, it will look like this: (100, tran. info, rid), (101, tran. info, rid).... etc. Say someone updates the record with 101 to 10; the index tuple won't get removed from the index block - it would have gotten the extra information of which SCN deleted the record (1. Do you contend this?). Also, a new tuple will get inserted into the rightmost leaf page (according to your diagram). So it would mean that the older version of the record exists in the index block (corresponding to the leftmost leaf page) and the new version also exists in the index block (corresponding to the rightmost leaf page). So if one scans through the index, they would get to see all the versions; they need not go to the undo log.

If you are contending 1), then it means that the old record gets removed completely from the index and is only found in the undo log. If someone has fired the query with condition = 101, the index scan will only go to the leftmost leaf page by the b-tree navigation mechanism, and it will never go to the rightmost leaf page to reach the undo log from there. Are you able to understand my point??






Tom Kyte
June 03, 2008 - 4:29 pm UTC

... If you download MonetDB, you would see that it stores only one column in a
BAT (equivalent to a projection). ...

I do not doubt that - you just threw out "Yet their joining mechanism
is efficient." - that is what I was talking about.


... I say that they
can't provide the read consistent view through undo log...


well STOP SAYING THAT BECAUSE YOU ARE WRONG. How is that. I don't know how else to say this to you - you are *wrong*. Stop thinking that "because I think it cannot be, it cannot be". Start thinking "maybe it could be and my concepts are getting in the way of my thought process"

Let me try again.

Say I delete the number 1 from the index.

Say YOU open a query that will "select * from t where column = 1". You will use the index to query this table. Your query cannot see the row I just deleted. You have done NO IO yet, you have just opened the result set, the cursor.


Say I commit.

Say someone else inserts the number 1. Guess where it will tend to go? Right where mine was ( we cannot keep all versions in the index forever, can you ... no....)

Say they commit.

Now, you start fetching.... Guess what will happen? You'll use read consistency and not see the row they inserted even though they very likely overwrote the index entry that used to be there way back when.



So, let me try again.... Would you believe me if......

a) we build an index
b) we measure its size
c) we delete everything from it
d) we commit
e) we populate the index again with new values (noting that the index size stays the same, therefore we MUST have overwritten everything ...)
f) we commit

g) we then run a query that uses the index and only the index AS OF (b) and a query that uses the index AS OF (f) and we see differing results.





ops$tkyte%ORA11GR1> create table t
  2  tablespace users
  3  as
  4  select level id, rpad('*',10,'*') data
  5    from dual
  6  connect by level <= 100000
  7  /
Table created.

<b>100,000 rows, all integer values for ID</b>

ops$tkyte%ORA11GR1> exec dbms_monitor.session_trace_enable
PL/SQL procedure successfully completed.

<b>this will be used to show that we used the index - the row source operation will CONFIRM that for us</b>


ops$tkyte%ORA11GR1> create index t_idx on t(id) tablespace tools;
Index created.

<b>our index...</b>

ops$tkyte%ORA11GR1> analyze index t_idx validate structure;
Index analyzed.

<b>this will gather detailed information on the index structure...</b>

ops$tkyte%ORA11GR1> create table idxstats as select 'before' when, index_stats.* from index_stats;
Table created.

<b>which we just saved...</b>

ops$tkyte%ORA11GR1> select sum(blocks) from user_extents where segment_name = 'T_IDX';

SUM(BLOCKS)
-----------
        256

<b>index has allocated 256 blocks</b>

ops$tkyte%ORA11GR1> variable before_cursor refcursor
ops$tkyte%ORA11GR1> exec open :before_cursor for select count(*) from t t1 where id > 0 and id = trunc(id);

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR1>
ops$tkyte%ORA11GR1> variable before_cursor2 refcursor
ops$tkyte%ORA11GR1> exec open :before_cursor2 for select count(*) from t t2 where id > 0 and id = trunc(id);

PL/SQL procedure successfully completed.

<b> these two cursors return the number 100,000 as the answer - regardless of what DML we do to the table - it is 100,000.  I've opened two cursors here just so I can demonstrate below that we are ONLY READING FROM THE INDEX and that we haven't read from the index here - we will read from the index below (eg: NO IO HAS BEEN PERFORMED FOR THIS QUERY YET)
</b>

ops$tkyte%ORA11GR1>
ops$tkyte%ORA11GR1> delete from t;

100000 rows deleted.

ops$tkyte%ORA11GR1> commit;

Commit complete.

<b>data is gone, but the two ref cursors above will still say "100,000 due to read consistency ON THE INDEX</b>


ops$tkyte%ORA11GR1> insert into t
  2  select level+0.01 id, rpad('*',10,'*') data
  3    from dual
  4  connect by level <= 100000
  5  /

100000 rows created.

ops$tkyte%ORA11GR1> commit;

Commit complete.

<b>note that this data is slightly different - it is 1.01, 2.01 and so on.  So:</b>

ops$tkyte%ORA11GR1> variable after_cursor refcursor
ops$tkyte%ORA11GR1> exec open :after_cursor for select count(*) from t t3 where id > 0 and id = trunc(id);

PL/SQL procedure successfully completed.

<b>that query will be.... 0 rows - id = trunc(id) won't be satisfied..</b>


ops$tkyte%ORA11GR1> analyze index t_idx validate structure;
Index analyzed.

ops$tkyte%ORA11GR1> insert into idxstats
  2  select 'after' when, index_stats.* from index_stats;

1 row created.

<b>save the current stats for the index now...</b>

ops$tkyte%ORA11GR1> select sum(blocks) from user_extents where segment_name = 'T_IDX';

SUM(BLOCKS)
-----------
        256

<b>index is STILL 256 blocks</b>

ops$tkyte%ORA11GR1> alter tablespace users offline;
Tablespace altered.

ops$tkyte%ORA11GR1> print before_cursor

  COUNT(*)
----------
    100000

ops$tkyte%ORA11GR1> print after_cursor

  COUNT(*)
----------
         0

<b>what we expected - and just to show that we are reading the index RIGHT NOW, we'll make before_cursor2 fail</b>


ops$tkyte%ORA11GR1> alter tablespace tools offline;

Tablespace altered.

ops$tkyte%ORA11GR1> print before_cursor2
ERROR:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9:
'/home/ora11gr1/oradata/ora11gr1/ORA11GR1/datafile/o1_mf_tools_44c8k90l_.dbf'



no rows selected

ops$tkyte%ORA11GR1> alter tablespace tools online;

Tablespace altered.

ops$tkyte%ORA11GR1> alter tablespace users online;

Tablespace altered.

<b>and now, look at the index information</b>

ops$tkyte%ORA11GR1>
ops$tkyte%ORA11GR1> with data
  2  as
  3  ( select when, thing, val
  4      from idxstats
  5    unpivot ( val for thing in
  6              ( LF_ROWS, LF_BLKS, LF_ROWS_LEN, LF_BLK_LEN,
  7                BR_ROWS, BR_BLKS, BR_ROWS_LEN, BR_BLK_LEN,
  8                DEL_LF_ROWS, DEL_LF_ROWS_LEN, DISTINCT_KEYS,
  9                MOST_REPEATED_KEY, BTREE_SPACE, USED_SPACE,
 10                PCT_USED, ROWS_PER_KEY, BLKS_GETS_PER_ACCESS,
 11                PRE_ROWS, PRE_ROWS_LEN, OPT_CMPR_COUNT, OPT_CMPR_PCTSAVE )
 12            )
 13  )
 14  select THING, before, after, after-before delta
 15    from data
 16   pivot( max(val) for when in ( 'before' as before, 'after' as after )
 17        )
 18   order by thing
 19  /

THING                    BEFORE      AFTER      DELTA
-------------------- ---------- ---------- ----------
BLKS_GETS_PER_ACCESS          3          3          0
BR_BLKS                       1          1          0
BR_BLK_LEN                 8028       8028          0
BR_ROWS                     221        221          0
BR_ROWS_LEN                2630       2630          0
BTREE_SPACE             1783140    1783140          0
DEL_LF_ROWS                   0          0          0
DEL_LF_ROWS_LEN               0          0          0
DISTINCT_KEYS            100000     100000          0
LF_BLKS                     222        222          0
LF_BLK_LEN                 7996       7996          0
LF_ROWS                  100000     100000          0
LF_ROWS_LEN             1588892    1689902     101010
MOST_REPEATED_KEY             1          1          0
OPT_CMPR_COUNT                0          0          0
OPT_CMPR_PCTSAVE              0          0          0
PCT_USED                     90         95          5
PRE_ROWS                      0          0          0
PRE_ROWS_LEN                  0          0          0
ROWS_PER_KEY                  1          1          0
USED_SPACE              1591522    1692532     101010

21 rows selected.

<b>so, the index size did not change.  256 blocks before, 256 blocks after - same number of leaf blocks before and after...

unless we did some real magic or used a pctfree of 60% (we did not, we defaulted to 10%) - we must have OVERWRITTEN THE INDEX ENTRIES

that in fact is what we did

and we reconstructed them using read consistency....</b>

SELECT COUNT(*) FROM T T1 WHERE ID > 0 AND ID = TRUNC(ID)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          2          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.34       0.34          0     100842          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.35       0.35          0     100844          0           1

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=100842 pr=0 pw=0 time=0 us)
 100000   INDEX FAST FULL SCAN T_IDX (cr=100842 pr=0 pw=0 time=2332 us cost=66 size=1648868 card=126836)(object id 79911)
********************************************************************************
SELECT COUNT(*) FROM T T3 WHERE ID > 0 AND ID = TRUNC(ID)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          2          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.01       0.02          0        229          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.02       0.02          0        231          0           1

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=229 pr=0 pw=0 time=0 us)
      0   INDEX FAST FULL SCAN T_IDX (cr=229 pr=0 pw=0 time=0 us cost=66 size=39 card=3)(object id 79911)

<b>and the tkprof shows we used the index and that we did a TON of IO for the "before_cursor" query - more IO than seems possible - this was the processing of the UNDO to put the index back into the state where we could use it.

The after_cursor - it just read the blocks as they were - they were good enough and needed NO UNDO whatsoever.  </b>






About Column Stores and Indexes

Gokul, June 04, 2008 - 3:06 am UTC

"I do not doubt that - you just threw out "Yet their joining mechanism
is efficient." - that is what I was talking about."

I made the point because their partitioned hash join was designed to be more in sync with modern hardware. I became convinced that it is more cache-conscious than the traditional row store.


"So, let me try again.... Would you believe me if......
see differing results"

Yes, now I understand Oracle's index design. I was more convinced before because the b-tree algorithm doesn't have any provision for maintaining versions. But relating records in the undo log with the ones in the index based on their position is a nice concept. Hats off to the person who figured it out, and to you for explaining it to me... Thanks..
Tom Kyte
June 04, 2008 - 9:58 am UTC

unless you have numbers, proof points, something concrete - statements like "their joining mechanism is efficient" are no different from me saying "ours is better".


they found it (multi-versioning) out some 25 years ago.... version 3 of Oracle...

But why so much I/O?

Matthew McPeak, January 26, 2009 - 9:32 pm UTC

Tom,

In your example explaining how undo is used to rebuild the original version of the index, why does printing the first cursor require SO MUCH I/O?

In your example, it seems that a full version of the index takes only 250+ blocks and that there are only 2 versions (one with IDs 1, 2, 3, ... and one with IDs 1.01, 2.01, 3.01, ...). So, even if Oracle goes to UNDO to reconstruct the first version, why does it take more than about 500 (250x2) I/Os?

I'm not questioning your example. I'd just like an explanation because it seems like it could be very instructive.

Thanks in advance,
Matt

Tom Kyte
January 28, 2009 - 8:17 am UTC

we are reading undo in order to roll back the changes to the index. Indexes are complex structures, infinitely more complex than a table. Rolling back an index has to deal with the entire index structure - root, branch and so on. It just takes more work on more blocks to do it.