Skip to Main Content
  • Questions
  • Possible arcane question on B*Tree indexes as Oracle sees them

Breadcrumb

May 4th

Question and Answer

Tom Kyte

Thanks for the question, Harrison.

Asked: June 16, 2001 - 1:32 pm UTC

Last updated: June 06, 2011 - 9:56 am UTC

Version: 8.1.7

Viewed 1000+ times

You Asked

Hi Tom:

I just got a book on Oracle 8i Application Programming
and it seems to have some odd bits. In a chapter "Database
Foundations", a B*Tree index is pictured as (I am just doing part):

Ivanov
/ \
/ \
Gogi Karamazov
Gorky Lemontov
/ / \
/ / \
Adreev Bulgakov Leskov Rasputin
Belinsky Chekhov Lomonosov Romanov
Blok Dostoevsky Pushkin Tchaikovsky

First, what may be a help to anyone else not familiar
with Oracle's B*Trees, it seems apparent that by using
blocks as nodes, there are many data elements (names) in
each node and not just one (as you might expect if you
are familiar with traditional C style B*Tree nodes).

What is still confusing, after you accept that there can be
many data elements in any node, is how Oracle knows which
pointer to follow next; in this example both the left and
right nodes below "Gogi" and "Gorky" have elements which
seem to be to the left of the parent node, and both of
the other two nodes at this level seem to contain data
that apparently would be to the right of their parent
node.

At the third level, does Oracle just do a brute force search,
and if you have a million records, would that not entail a
search of (at the third level) 250,000 records? It seems
to me that even if many records exist in each node, there still
needs to be two L&R specific records that lead to data which is
greater or lessor than the indicators.

Other than that the concept of using a data block as a node is
interesting as it could return a complete index for a small table
in just a few blocks, so that later queries on different data would
be satisfied from the shared area.

Also, if you had the chance to read this book, what was your
favorite part? Not that I won't read it all, I just like to
read the best parts first (actually, I might not read all of
any book; I was lying).

Last, your book reviews are valuable, I bought "Oracle8i Internal
Services" by Steve Adams and found it a good read (not everybody
that has a good idea can also write, but Steve can) and the modest
cost was welcome. It has as much on diagnosis and tuning as it
does on understanding what we are diagnosing, so it is well
worth reading.

Thanks

Harrison

and Tom said...

They probably should have used a different diagram. It seems to imply there are two entries on a branch block -- there are dozens (or more) actually. That branch block with (gogol, gorky) would have MANY other entries pointing us to the right place -- that block may point to andreev/belinsky/blok -- but the entry on that branch block that did that pointing wasn't depicted.

Here is a tiny extract from my book from the chapter on indexes:

B*Tree, or what I call 'conventional', indexes are the most commonly used type of indexing structure in the database. They are similar in implementation to a binary search tree you might have learned about in a computer science course. Their goal is to minimize the amount of time Oracle spends searching for data. Loosely speaking, the structure might look like this, if you have an index on a number column:

+---------------+
| 50 and less--|--------+
+------|--more than 50 | |
| +---------------+ |
| |
| |
| |
+---------------+ +---------------+
+-----|-->= 100 | | >40 to 50 |
| | >90 to 100---|---+ | >30 to 40 |
| | >80 to 90 | | +---|-->20 to 30 |
| | >70 to 80 | | | | >10 to 20----|---+
| | >60 to 70 | | | | >0 to 10 | |
| | >50 to 60 | | | | less than 0 | |
| +---------------+ | | +---------------+ |
| | | |
| | | |
| | | |
| | | |
+---------------+ +---------------+ +---------------+ +---------------+
| 101, rowid | | 91, rowid | | 21, rowid | | 11, rowid |
| 102, rowid | | 92, rowid | | 22, rowid | | 12, rowid |
| 103, rowid |<->| 93, rowid |<->| 23, rowid |<->| 13, rowid |
| 104, rowid | | 94, rowid | | 24, rowid | | 14, rowid |
| .... | | .... | | .... | | .... |
+---------------+ +---------------+ +---------------+ +---------------+


The lowest level blocks in the tree, called LEAF NODES, contain every indexed key and a rowid (rid in the picture) that points to the row it is indexing. The interior blocks, above the leaf nodes, are known as branch blocks. They are used to navigate through the structure. For example, if we wanted to find the value '42' in the index, we would start at the top of the tree and go to the right. We would inspect that block and discover we needed to go to the block in the range 'less than 50 to 40'. This block would be the leaf block and would point us to the rows that contained the number 42. It is interesting to note that the leaf nodes of the index are actually a doubly linked list. Once we find out where to 'start' in the leaf nodes – once we have found that first 42 value – doing an ordered scan (also known as an INDEX RANGE SCAN) of values is very easy. We don’t have to navigate the structure any more, we just go forward through the leaf nodes. That makes solving a predicate, such as the following, pretty simple:

where x between 20 and 30

Oracle finds the first index block that contains 20 and then just walks horizontally, through the linked list of leaf nodes until it finally hits a value that is greater than 30.
---------------------------- end of quote ---------------------------------

Hopefully that is more clear. Each branch block has a whole bunch of ranges on it. We navigate from branch to branch finding the right range. Each range points to the next block (either yet another branch block or a leaf node).

It is not a simple LEFT/RIGHT -- its an N-Way tree with each branch block pointing to many other blocks).


I haven't actually read the book from start to finish myself (well, I definitely did read chapter 16 since I wrote that one and chapter 8 which Sean Dillon who works with me wrote).

I read it piecemeal as a technical reviewer and only read the first drafts ;) It has be well recieved so far though



A book I highly recommend -- Practical Oracle8i by Jonathan Lewis.



Rating

  (20 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Harrison Picot, June 18, 2001 - 12:35 am UTC


Anantha, March 02, 2002 - 4:15 pm UTC


A reader, September 16, 2002 - 5:43 pm UTC


Cvdeepak, September 16, 2002 - 5:43 pm UTC


Cvdeepak, September 16, 2002 - 5:44 pm UTC


B*Tree indexes as Oracle sees them

Dinesh, October 28, 2003 - 6:29 pm UTC

Excellent! This is the best explanantion one can get on Index Structure

Reverse Key index

A reader, July 11, 2004 - 11:45 am UTC

But if i have a verly large table with reverse key index on it and then if i query the table for a specific record then the result will be faster but if i give a range of values like 'beween 9000 and 9500' then normal btree index will return the result faster then reverse key as the later will have to search for each and every record matching in the query.

Tom Kyte
July 11, 2004 - 1:32 pm UTC

well, that is a statement of fact of course.

not sure of the relevance here specifically...

Balancing of B*Tree Index when we use Sequences/Surrogate Key for PK

Moorthy Rekapalli, July 12, 2004 - 10:15 am UTC

Tom,

I think that it is a common practice to use the Surrogate Key (fetching sequence_name.nextval) to generate a value for Primary Key. In that case, one will always get a higher value - thus going to the left of the tree.

If we use reverse key indexes, they are good for single lookups but not efficient for Index Range Scans. From an application standpoint, it is difficult to imagine that one will do an index range scan on surrogate keys - as they don't carry any business significance.

From data structure point of view, just curious in knowing whether B*Tree needs balancing of the tree OR it is just fine in these cases.

Thanks,
Moorthy.


Tom Kyte
July 12, 2004 - 11:45 am UTC

the b in b*tree stands for balanced. it is impossible for an Oracle b*tree index to not be balanced.

Thanks for the quick response

Moorthy Rekapalli, July 12, 2004 - 2:35 pm UTC

Tom,

Thanks for the quick response. So far, I did not realize that B stands for Balanced in B*Tree index. I learn new things/techniques from your site all the time.

I just want to drive this point home. That's why, doing a follow-up with you on when to use reverse key indexes. So far, I was under the impression that reverse key indexes are useful in two scenarios.

1) To balance B*Tree indexes on surrogate key columns. After your explanation, I figured that we don't need to do that as it balances itself.

2) Quote from your expert one-on-one red cover book (page 275) - "Reverse Key Indexes were implemented to reduce contention on index leaf blocks in an Oracle Parallel Server environment". Is this point still valid in RAC with cache fusion?

Thanks in advance,
Moorthy.


Tom Kyte
July 12, 2004 - 11:19 pm UTC

1) correct

2) absolutely - can work in non "rac" environments as well. instead of everyone going for the same block (contention) they go all over the place.

However ....

Alex Voinea, February 16, 2005 - 4:57 pm UTC

From the Oracle9i Database Performance Planning manual:
</code> http://download-east.oracle.com/docs/cd/B10501_01/server.920/a96532/ch1.htm#21629 <code>

"Use of sequences, or timestamps, to generate key values that are indexed themselves can lead to database hotspot problems, which affect response time and throughput. This is usually the result of a monotonically growing key that results in a right-growing index. To avoid this problem, try to generate keys that insert over the full range of the index. This results in a well-balanced index that is more scalable and space efficient. You can achieve this by using a reverse key index or using a cycling sequence to prefix and sequence values."

So this means that Oracle indexes can become unbalanced after all ...



Tom Kyte
February 16, 2005 - 5:23 pm UTC

hahaha, no it doesn't, it is not "true"

but i will "bug" the doc.

right growing indexes are very super efficient at packing space in actually, very tight.

hotspot, possibly.

Sometime later....

Tony Bass, February 17, 2005 - 6:54 am UTC

A common commercial application will use a sequence to add a unique primary key, lets says a sales order table and the index will be a moving window on that sequence with new orders being added and old ones being dropped. The table row count can be fairly stable but the index blocks grow as some old orders can hang around holding on to blocks and the pct_used sinks. Analyze index .. validate structure; gives us the information on the state of the index but only the latest is in the Index_stats view. I wrote a stored proc to analyze my database and run an index rebuild. Is there any other way of getting pct_used?


Tom Kyte
February 17, 2005 - 9:22 am UTC

i "snapshot" index stats after analyze validate structure to save the data off to somewhere else myself.

Index Structure Graphically

Jagjeet Singh, May 12, 2005 - 4:00 pm UTC

Hi,

Is there any way to see a b*tree index structure in tree format as you shown.

Is there any tool available or any other way.

Thanks,
Js

Tom Kyte
May 12, 2005 - 9:12 pm UTC

no, you can dump blocks but as the b*tree is a very fluid structure (changes with inserts/updates/deletes), even dumping blocks isn't going to give you a very "stable" picture if one is changing them...

of what use would this be? (curious)

Index structure

A reader, May 13, 2005 - 8:46 am UTC

Tom,
My understanding is that though the index structure is shown hierarchical, this is not the way it is laid out on the disk. The hierarchical structure is an internal implementation and a visual depiction. On the disk, Oracle will find the root block wherever it is, then the head will move to read the node block pointed to by the root block and then onto the leaf block pointed to by the node block...maybe traversing different disks if datafiles are located on different spindles. The root block, node blocks and leaf blocks may or may not be right next to each other. However, the total number of blocks examined by Oracle can be inferred from the hierarchical structure. So it really doesn't add any value by being able to see the tree structure. Is that correct?

Thanks

Tom Kyte
May 13, 2005 - 10:36 am UTC

correct, each block is separate and could be on different disks even.


I cannot imagine what you would get from a 'picture', mostly what we need to know is height (how many io's to get to leaf)

Index structure

Jonathan Lewis, May 13, 2005 - 10:50 am UTC

One minor detail about the physical layout.
The root block is always kept as the block
after the segment header block.


height of b* tree

Jianhui, June 23, 2005 - 5:10 pm UTC

Hello Tom,

My db block size is 16KB, I created a testing table with a column char(1600) 
then put index on it. After analyzed the index i got following result.


SQL>desc a
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 CHAR(1600)
 N                                                  NUMBER(38)

SQL>select count(*), count(distinct id)
SQL>from a;

  COUNT(*) COUNT(DISTINCTID)
---------- -----------------
      7955              7955

SQL>select height, blocks, name, lf_rows, lf_blks, br_rows, br_blks, 
br_rows_len, distinct_keys, rows_per_key
SQL>from index_stats
SQL>/

    HEIGHT     BLOCKS NAME                              LF_ROWS    LF_BLKS
---------- ---------- ------------------------------ ---------- ----------
   BR_ROWS    BR_BLKS BR_ROWS_LEN DISTINCT_KEYS ROWS_PER_KEY
---------- ---------- ----------- ------------- ------------
         2        888 A_ID                                 7955        884
       883          1       10559          7955            1

My question is since there is only 1 branch blocks, which means less than 10 
index keys(16K block size/1600 key length) can be stored in one block, how could 
it be possible to point to 884 leaf blocks? Shouldnt it be more levels but each 
level only has less than 10 blocks? These keys are unique as we can see from 
above. I guess I do not under B*Tree structure correctly, could you explain how 
root, branch blocks are stored in order to point to leaf blocks?

Thanks. 

 

Tom Kyte
June 25, 2005 - 5:53 pm UTC

has to do with leading edge compression of the branch data.  it only has to know where the leading edge of the key varies.  If you put into the table

'1                                '
'2                                '
'3                                '

vs

'                                1'
'                                2'
'                                3'

you'll see big time differences:


ops$tkyte@ORA9IR2> create table t ( id1 char(1600), id2 char(1600) );
 
Table created.
 
ops$tkyte@ORA9IR2> insert /*+ append */ into t select rownum, rpad( ' ', 1600-length(rownum), ' ' ) || rownum
  2  from all_objects where rownum <= 7955;
 
7955 rows created.
 
ops$tkyte@ORA9IR2> create index t_idx1 on t(id1);
 
Index created.
 
ops$tkyte@ORA9IR2> create index t_idx2 on t(id2);
 
Index created.
 
ops$tkyte@ORA9IR2> analyze index t_idx1 validate structure;
 
Index analyzed.
 
ops$tkyte@ORA9IR2> create table idx_stats as select * from index_stats;
 
Table created.
 
ops$tkyte@ORA9IR2> analyze index t_idx2 validate structure;
 
Index analyzed.
 
ops$tkyte@ORA9IR2> insert into idx_stats select * from index_stats;
 
1 row created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create or replace function is_number( p_str in varchar2 ) return number
  2  as
  3          l_number number;
  4  begin
  5          l_number := p_str;
  6          return 1;
  7  exception
  8          when others then return 0;
  9  end;
 10  /
 
Function created.
 
ops$tkyte@ORA9IR2> column cname format a25
ops$tkyte@ORA9IR2> column t1 format a20
ops$tkyte@ORA9IR2> column t2 format a20
ops$tkyte@ORA9IR2> select a.cname,
  2         decode( is_number(a.val),0,a.val,round(a.val,2)) t1,
  3         decode( is_number(b.val),0,b.val,round(b.val,2)) t2,
  4             case when is_number(a.val) = 1 and is_number(b.val) = 1
  5                  then to_char( decode(a.val,'0',null,round(b.val/a.val*100,2) ), '9,999.00' )
  6                   end pct
  7    from table( cols_as_rows( 'select *
  8                                 from idx_stats
  9                                where name = ''T_IDX1'' ' ) ) a,
 10         table( cols_as_rows( 'select *
 11                                 from idx_stats
 12                                where name = ''T_IDX2'' ' ) ) b
 13       where a.cname = b.cname
 14  /
 
CNAME                     T1                   T2                   PCT
------------------------- -------------------- -------------------- ---------
BLKS_GETS_PER_ACCESS      3                    5                       166.67
BLOCKS                    896                  1024                    114.29
BR_BLKS                   1                    90                    9,000.00
BR_BLK_LEN                16220                16220                   100.00
BR_ROWS                   883                  883                     100.00
BR_ROWS_LEN               10478                1420648              #########
BTREE_SPACE               14326412             15769992                110.08
DEL_LF_ROWS               0                    0
DEL_LF_ROWS_LEN           0                    0
DISTINCT_KEYS             7955                 7955                    100.00
HEIGHT                    2                    4                       200.00
LF_BLKS                   884                  884                     100.00
LF_BLK_LEN                16188                16188                   100.00
LF_ROWS                   7955                 7955                    100.00
LF_ROWS_LEN               12831415             12831415                100.00
MOST_REPEATED_KEY         1                    1                       100.00
NAME                      T_IDX1               T_IDX2
OPT_CMPR_COUNT            0                    1
OPT_CMPR_PCTSAVE          0                    8
PARTITION_NAME
PCT_USED                  90                   91                      101.11
PRE_ROWS                  0                    0
PRE_ROWS_LEN              0                    0
ROWS_PER_KEY              1                    1                       100.00
USED_SPACE                12841893             14252063                110.98
 
25 rows selected.
 
ops$tkyte@ORA9IR2> set pause off



<b>
Neat Stuff</b>








ops$tkyte@ORA9IR2> create or replace type myScalarType as object
  2  ( rnum number, cname varchar2(30), val varchar2(4000) )
  3  /
 
Type created.
 
ops$tkyte@ORA9IR2> create or replace type myTableType as table of myScalarType
  2  /
 
Type created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create or replace
  2  function cols_as_rows( p_query in varchar2 ) return myTableType
  3  -- this function is designed to be installed ONCE per database, and
  4  -- it is nice to have ROLES active for the dynamic sql, hence the
  5  -- AUTHID CURRENT_USER
  6  authid current_user
  7  -- this function is a pipelined function -- meaning, it'll send
  8  -- rows back to the client before getting the last row itself
  9  -- in 8i, we cannot do this
 10  PIPELINED
 11  as
 12      l_theCursor     integer default dbms_sql.open_cursor;
 13      l_columnValue   varchar2(4000);
 14      l_status        integer;
 15      l_colCnt        number default 0;
 16      l_descTbl       dbms_sql.desc_tab;
 17      l_rnum          number := 1;
 18  begin
 19          -- parse, describe and define the query.  Note, unlike print_table
 20          -- i am not altering the session in this routine.  the
 21          -- caller would use TO_CHAR() on dates to format and if they
 22          -- want, they would set cursor_sharing.  This routine would
 23          -- be called rather infrequently, I did not see the need
 24          -- to set cursor sharing therefore.
 25      dbms_sql.parse(  l_theCursor,  p_query, dbms_sql.native );
 26      dbms_sql.describe_columns( l_theCursor, l_colCnt, l_descTbl );
 27      for i in 1 .. l_colCnt loop
 28          dbms_sql.define_column( l_theCursor, i, l_columnValue, 4000 );
 29      end loop;
 30
 31          -- Now, execute the query and fetch the rows.  Iterate over
 32          -- the columns and "pipe" each column out as a separate row
 33          -- in the loop.  increment the row counter after each
 34          -- dbms_sql row
 35      l_status := dbms_sql.execute(l_theCursor);
 36      while ( dbms_sql.fetch_rows(l_theCursor) > 0 )
 37      loop
 38          for i in 1 .. l_colCnt
 39          loop
 40              dbms_sql.column_value( l_theCursor, i, l_columnValue );
 41              pipe row
 42              (myScalarType( l_rnum, l_descTbl(i).col_name, l_columnValue ));
 43          end loop;
 44          l_rnum := l_rnum+1;
 45      end loop;
 46
 47          -- clean up and return...
 48      dbms_sql.close_cursor(l_theCursor);
 49      return;
 50  end cols_as_rows;
 51  /
 
Function created.










 

above question

jianhui, June 24, 2005 - 11:57 pm UTC

Hello Tom,
Would you mind taking a look at my question above when you get a chance?
Best Regards,

Multiple columns

Todd, January 11, 2008 - 11:51 am UTC

Hi Tom,

Could you explain how the structure of B-Tree columns is stored for multiple column indexes?

IE. create index IX_MULTI (col1, col2) on table12
Tom Kyte
January 13, 2008 - 10:57 pm UTC

the same as when it is on a single column

In a b*tree index, we store

a) the key
b) the rowid

the key could be one, two or more columns....


So, there is a "key place" and a "rowid place"

Actually, there is a "key place" and a "not part of the key" place

The key is always unique, so in a non-unique index, the rowid is actually part of the key :)

Thanks

Stefano Trallori, February 22, 2008 - 8:07 am UTC

Thanks for your answers, they made me understand B*tree indexes in a easier way than which documentation did.

But I'd like to ask you some questions.

1) Do B*Tree (balanced tree?) indexes have something to do in "some way" with Binary Indexes?

2) As per your ASCII art in the first reply, is HEIGHT the count of rows in leaf blocks? (excluding "...")

3) In which way is a "text" (char, varchar2 etc..) column ordered before creating an index? Is this the same we can achieve by doing select table.column order by column?

Thank you very much!!
Tom Kyte
February 22, 2008 - 12:03 pm UTC

1) sure, they just are not binary. binary trees are very simple data structures. A B*tree is more of an "n-way" tree - the root block points to MANY branch blocks, not just two. Each branch block points to many other branch or eventually leaf blocks.

2) height is the height of the btree, the number of IO's from the root to the leaf block

my ascii art has a btree with 3 levels, 3 IOs to get from root to leaf.

3) depends on your NLS settings, you can create an index using a binary sort or a character set sort, or on a function (nls_sort())

How branch block gets created

Bhaskar, June 06, 2011 - 9:30 am UTC

Hi Tom,

First of all thanks for your inputs to this discussion.
I have the below question on btree index branch node.

I read somewhere the below

"Branch blocks store the minimum key prefix needed to make a branching decision between two keys. This technique enables the database to fit as much data as possible on each branch block...”

Now what algorithm ORACLE takes to decide which branch will point which values?

In your fist description say I have a column with 90 values and I have created index on this column. Now as per the below picture how ORACLE will decide which branch block will point to which values in different levels? Is there any specific internal algorithm for this?

+---------------+
| 50 and less--|--------+
+------|--more than 50 | |
| +---------------+ |
| |
| |
| |
+---------------+ +---------------+
+-----|-->= 100 | | >40 to 50 |
| | >90 to 100---|---+ | >30 to 40 |
| | >80 to 90 | | +---|-->20 to 30 |
| | >70 to 80 | | | | >10 to 20----|---+
| | >60 to 70 | | | | >0 to 10 | |
| | >50 to 60 | | | | less than 0 | |
| +---------------+ | | +---------------+ |
| | | |
| | | |
| | | |
| | | |
+---------------+ +---------------+ +---------------+ +---------------+
| 101, rowid | | 91, rowid | | 21, rowid | | 11, rowid |
| 102, rowid | | 92, rowid | | 22, rowid | | 12, rowid |
| 103, rowid |<->| 93, rowid |<->| 23, rowid |<->| 13, rowid |
| 104, rowid | | 94, rowid | | 24, rowid | | 14, rowid |
| .... | | .... | | .... | | .... |
+---------------+ +---------------+ +---------------+ +---------------+


Tom Kyte
June 06, 2011 - 9:56 am UTC

we use a very simple "leading edge redundant data removal" technique. It is a very simple form of compression. The 'algorithm' to decide which branch will point to what values isn't really an algorithm, it is a state of being. A branch block will point to the values it points to - I'm not sure how else to state it. We store as many branch values as we can (so a branch points to as many lower level blocks - be they other branch or leaf blocks - as possible). And the values are stored in a sorted order - which mandates their positions in the index.

Bhaskar, June 06, 2011 - 10:14 am UTC

thanks.