Database, SQL and PL/SQL

On Uniqueness, Space, and Numbers

Our technologist discusses determining space required, randomizing, and partitioned outer joins.

By Tom Kyte

January/February 2004

I'm looking for information on the creation time and performance of a unique versus a nonunique index on a unique column of a large (260-million-row) table. Can you help me?

The only difference between a unique and a nonunique index is whether the ROWID is considered part of the key for uniqueness. In fact, there is only one kind of conventional B*Tree index: a unique one. In a nonunique index, Oracle Database simply considers the ROWID to be part of the key.

Since a ROWID is unique within a table, including it ensures that all of the keys in the B*Tree are unique. So the key in a nonunique B*Tree index includes the ROWID value. In a unique index, Oracle Database does not consider the ROWID as part of the key but rather just as data; in that fashion, the database ensures that your indexed columns are, in fact, unique in the table. In both cases, Oracle Database is storing your indexed columns and the ROWID. It just uses the ROWID as part of the key in a nonunique index.
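
As a minimal sketch of that difference (the table and index names here are hypothetical; note that the first index must be dropped before the second is created, because two indexes on the same column list are not allowed):

create table t ( x number );

-- Nonunique index: the key is effectively (X, ROWID)
create index t_x_idx on t(x);

drop index t_x_idx;

-- Unique index: the key is just X, the ROWID rides along as data,
-- and duplicate values of X are rejected
create unique index t_x_idx on t(x);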

You will find that unique and nonunique indexes are more or less the same in terms of creation time and performance. I loaded up a table with the unique series of numbers from 1 to 258,564,672:

SQL> select count(*)
  2    from tall_but_skinny;

  COUNT(*)
----------
 258564672

Then, with SQL_TRACE enabled, I created unique and nonunique indexes. The TKPROF report showed me the following:

create index tall_but_skinny_2
on tall_but_skinny(r);

call        cpu    elapsed       disk
-------  -------  ---------   --------
Parse       0.00      0.16           0
Execute  2000.30   3375.34     1175887
Fetch       0.00      0.00           0
-------  -------  ---------   --------
total    2000.30   3375.51     1175887

create UNIQUE index tall_but_skinny_3
on tall_but_skinny(r);

call        cpu    elapsed       disk
-------  -------  ---------   --------
Parse       0.00      0.24           0
Execute  1969.09   2857.14     1175908
Fetch       0.00      0.00           0
-------  -------  ---------   --------
total    1969.10   2857.39     1175908

The results are roughly the same. This was done serially (not in parallel) with LOGGING (and it generated lots of redo), so your mileage may vary, but you have very similar structures at the end of the day, and you'll find the CREATE times to be about the same.
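
If you want to reproduce a timing test like this one, a minimal sketch would be to trace the session and then format the resulting trace file with TKPROF (the trace file name below is hypothetical; look in USER_DUMP_DEST for yours):

alter session set sql_trace = true;

create index tall_but_skinny_2 on tall_but_skinny(r);
drop index tall_but_skinny_2;
create UNIQUE index tall_but_skinny_3 on tall_but_skinny(r);

alter session set sql_trace = false;

-- then, from the operating system:
--   tkprof ora_1234.trc report.txt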


Estimate TEMP Space Needed

I have a 4.2-billion-row table with eight columns that I need to group by five columns. This table is 380GB. How can I estimate how much space in TEMP I will need? Will it be close to 380 x 2 = 760GB? I need to use parallel query (degree 8 or 16, probably).

In Oracle9i Release 2, there is a nifty new supplied package, DBMS_XPLAN, that formats the contents of the plan table and shows lots of information, including an estimate of the temporary space a query will need. I'll show how to use this by explaining a query that will do a big sort. In this example, BIG_TABLE is a copy of ALL_OBJECTS and has about 1.8 million rows in it.

SQL> delete from plan_table;
4 rows deleted.
SQL> explain plan for
  2  select object_name, owner,
  3        object_id, created,
  4        last_ddl_time, count(*)
  5      from big_table
  6  group by object_name, owner,
  7          object_id, created,
  8          last_ddl_time;
Explained.
SQL> select * from
  2  table( dbms_xplan.display );

Listing 1 shows the DBMS_XPLAN output for this query. As you can see, the plan as shown by DBMS_XPLAN includes this estimate: it is saying "the sort group by step will need about 225MB of temporary space." DBMS_XPLAN shows only the information relevant to your query, so don't be surprised if you run an explain plan and don't see any TempSpc information. That just means the query doesn't actually need any TEMP space, so DBMS_XPLAN has nothing to show. Consider this small example using the standard EMP table:

SQL> analyze table emp
  2  compute statistics;
Table analyzed.
SQL> delete from plan_table;
3 rows deleted.
SQL> explain plan for
  2  select deptno, count(*)
  3    from emp group by deptno;
Explained.
SQL> select * from
  2  table( dbms_xplan.display );

Listing 2 shows the DBMS_XPLAN output for this query. It shows that for 14 rows with three distinct values of DEPTNO, you won't need any TEMP space. Let's now pretend that EMP has lots of rows, with lots of values:

SQL> begin
  2  dbms_stats.set_table_stats
  3  ( user, 'EMP', numrows => 2000000,
  4     numblks => 1000000 );
  5  dbms_stats.set_column_stats
  6  ( user, 'EMP', 'DEPTNO',
  7    distcnt => 2000000 );
  8  end;
  9  /
PL/SQL procedure successfully completed.

Now, if you delete from the plan table and re-explain exactly the same query, DBMS_XPLAN will show you the result in Listing 3.

What you see here is that when the optimizer thinks there are lots of rows with lots of distinct values, it'll need about 30MB of temporary disk space to complete its job.
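
To apply this to the original question, explain the actual five-column GROUP BY against the 4.2-billion-row table and read the TempSpc column on the SORT GROUP BY step. The names FACT_TABLE and C1 through C5 below are hypothetical stand-ins for your table and columns:

delete from plan_table;

explain plan for
select c1, c2, c3, c4, c5, count(*)
  from fact_table
 group by c1, c2, c3, c4, c5;

select * from table( dbms_xplan.display );

Since the estimate comes from the optimizer, it is only as good as your statistics, so make sure they are current before trusting the number.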


Generating Some Random Data

How could you create six unique random numbers between 1 and 49 with one SQL statement?

I'd generate the set of numbers to pick from (see the innermost query that follows); any table with 49 or more records would do. I could write a PIPELINED function as well, and I will explain that too. First, the quick-and-dirty answer:

select r
  from ( select r
           from ( select rownum r
                    from all_objects
                   where rownum < 50 )
          order by dbms_random.value )
 where rownum <= 6;

That query works by generating the numbers 1 .. 49 in the innermost inline view. I wrap that query in another inline view and sort it by a random value, using DBMS_RANDOM.VALUE. I wrap that result set in yet another inline view and just take the first six rows. If I run that query over and over, I'll get a different set of six rows each time.

This sort of question comes up frequently—maybe not about how to generate a set of six random numbers but rather, "how can we get N rows?" For example, I'd like the inclusive set of all dates between 01-JAN-2003 and 15-JAN-2003. I can generate that easily with a 15-row table; in fact, select to_date('01-jan-2003')+rownum-1 from all_objects where rownum <= 15 does that, but it incurs some amount of overhead (ALL_OBJECTS is a view, and a complex one at that). I could create a table and stuff some rows into it, but that is not always appealing either.

The question becomes how to do this without a "real" table, and the answer lies in Oracle9i with its PIPELINED function capability. I can write a PL/SQL function that will operate like a table. I need to start with a SQL collection type; this describes what my PIPELINED function will return. In this case, I'm choosing a table of numbers; the virtual table I am creating will simply return the numbers 1, 2, 3, ... N:

SQL> create type array
  2  as table of number
  3  /
Type created.

Next, I create the actual PIPELINED function. This function will accept an input to limit the number of rows returned. If no input is provided, this function will just keep generating rows for a very long time (so be careful and make sure to use ROWNUM or some other limit in the query itself!). The PIPELINED keyword on line 4 allows this function to work as if it were a table:

SQL> create function
  2  vtable( n in number default null )
  3  return array
  4  PIPELINED
  5  as
  6  begin
  7     for i in 1 .. nvl(n,999999999)
  8     loop
  9         pipe row(i);
 10     end loop;
 11     return;
 12  end;
 13  /
Function created.

Suppose I needed three rows for something. I can now do that in one of two ways:

SQL> select *
  2    from TABLE(vtable(3))
  3  /
 COLUMN_VALUE
 ------------
           1
           2
           3

or

SQL> select *
  2    from TABLE(vtable)
  3   where rownum <= 3
  4  /
 COLUMN_VALUE
 ------------
           1
           2
           3

Now I am ready to re-answer the original question, using the following functionality:

SQL> select *
  2    from (
  3  select *
  4    from (
  5  select *
  6    from table(vtable(49))
  7         )
  8   order by dbms_random.random
  9         )
 10   where rownum <= 6
 11  /
 COLUMN_VALUE
 ------------
          47
          42
          40
          15
          48
          23
6 rows selected.

And I can use this vtable functionality for many things, such as generating that range of dates:

SQL> select to_date('01-jan-2003')+
  2        column_value-1
  3  from TABLE(vtable(15));
TO_DATE('
---------
01-JAN-03
...
15-JAN-03
15 rows selected.

Note the name of the column I used: COLUMN_VALUE. That is the default name for the column coming back from my PIPELINED function.
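
If you'd rather not carry that default name through your queries, you can alias it like any other column:

select column_value dt
  from TABLE(vtable(15));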


What's Cool in Oracle Database 10g?

Can you just name a few important features that you like most in Oracle Database 10g?

The three Oracle Database 10g features I'll pick today are the following:

  • Automatic Storage Management (ASM)

  • The manageability features in general and Automatic Database Diagnostic Monitor (ADDM) in particular

  • New SQL features in general and partitioned "sparse" outer joins in particular

Since it is hard to demonstrate ASM and ADDM in text, I'll let you read about them on Oracle Technology Network, but partitioned "sparse" outer joins are easy to show.

Data is normally stored in sparse form. That is, if no value exists for a given time, no row exists in the fact table. However, time series calculations are most easily performed (year-over-year calculations, for example) when data is "dense" along the time dimension, because dense data has a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets.

Take the example of a SALES result set that has the attributes PRODUCT_NAME, DATE_OF_SALE, and TOT_SALES. Your goal is to show the TOT_SALES for every row against the TOT_SALES for that same row a year ago. If all of the months were available for every product, you would just use the LAG() analytic function to look back 12 rows. But what happens when there are missing months? Suppose there is no entry for JUNE of last year. You'd end up comparing JUNE of this year to MAY of last year, when the right answer is NULL (because the JUNE data is in fact missing). This is the case where partitioned outer joins make sense.
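
With dense data, that year-over-year comparison is just a 12-row lookback; here is a sketch using the SALES attributes named above:

select product_name,
       date_of_sale,
       tot_sales,
       lag(tot_sales,12)
         over (partition by product_name
               order by date_of_sale) sales_year_ago
  from sales;

With sparse data, that same LAG(tot_sales,12) silently slides onto the wrong month, which is exactly the problem partitioned outer joins address.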

Let's set up a small example. All I want to do is compare this month's sales with last month's sales. So, I need to partition my result set by PRODUCT_NAME and order it by DATE_OF_SALE and use LAG() to look back a single row. Start with this data:

ops$tkyte@ORA10G> select *
  2    from sales
  3   order by product_name,
  4           date_of_sale;
PRODUCT_NAME   DATE_OF_SALE   TOT_SALES
------------   ------------   ---------
TV               01-JUL-03        496
TV               01-AUG-03        993
TV               01-OCT-03        468
TV               01-NOV-03        535
TV               01-DEC-03        665
VCR              01-JUL-03        617
VCR              01-AUG-03        984
VCR              01-SEP-03        308
VCR              01-OCT-03        954
VCR              01-DEC-03        947
10 rows selected.

Notice how I am missing data here. There are no sales for VCRs in NOV, and there are no TVs sold in SEP. If I were to try the LAG() analysis now:

ops$tkyte@ORA10G> select product_name,
  2    date_of_sale,
  3    lag(date_of_sale)
  4    over (partition by product_name
  5         order by date_of_sale)
  6         last_month,
  7    tot_sales,
  8    lag(tot_sales)
  9    over (partition by product_name
 10         order by date_of_sale)
 11        last_months_sales
 12    from sales
 13  /

I would get the results shown in Listing 4.

Notice how the output in Listing 4 shows TVs in OCT being compared to TVs in AUG and VCRs in DEC being compared to VCRs in OCT. That isn't what I wanted: I want each month compared to the month immediately before it, even when that month is missing (NULL). What I need is a result set that looks like this:

PRODUCT_NAME  DATE_OF_SALE    TOT_SALES
------------  ------------    ---------
TV              01-JUL-03         496
TV              01-AUG-03         993
TV              01-SEP-03
TV              01-OCT-03         468
TV              01-NOV-03         535
TV              01-DEC-03         665
VCR             01-JUL-03         617
VCR             01-AUG-03         984
VCR             01-SEP-03         308
VCR             01-OCT-03         954
VCR             01-NOV-03
VCR             01-DEC-03         947
12 rows selected.

That is, I want a result set with the missing dates filled in for me. The way to achieve this is with the following query:

with dates as
( select add_months
        ( to_date( '01-jul-2003' ),
          column_value-1) dt
    from table( vtable(6) )
)
select product_name,
      dt,
      tot_sales
  from dates left outer join
      sales partition by (product_name)
      on ( dates.dt =
           sales.date_of_sale )

I am borrowing the previous VTABLE concept, because I need to build a result set that has all of the dates I need (the six months from JUL through DEC 2003) represented in it. That subquery, DATES, returns six rows: one date for each month I want to report on. Now I simply outer join that DATES subquery with the SALES data to synthesize the missing rows. But notice the use of the PARTITION BY clause in the outer join. What I've done is break the SALES table into N partitions, one per PRODUCT_NAME, and outer join each of those partitions to the DATES data. In this fashion, each partition of SALES will have the missing data "made up" and the partition key carried along. The result above, the one with the made-up SEP and NOV data, was created by using exactly that query.

Before this partition clause existed, I had to generate the set of all distinct PRODUCT_NAME values, perform a Cartesian product between it and the set of DATES, and then outer join to that. In general, that is not desirable, because I would have to scan all of the SALES records just to generate this unique set of PRODUCT_NAME values. That query could have looked like this:

with dates as
( select add_months
        ( to_date( '01-jul-2003' ),
          column_value-1) dt
    from table( vtable(6) )
),
products as
( select distinct product_name
    from sales
),
dates_products as
( select * from dates, products
)
select dates_products.product_name,  -- not sales.product_name, which is NULL on the made-up rows
      dates_products.dt,
      sales.tot_sales
  from dates_products left outer join
      sales
      on ( dates_products.dt =
                   sales.date_of_sale
          and
          dates_products.product_name =
                   sales.product_name)

As the number of dimensions you want to analyze goes up, so does the raw size of this Cartesian product I would need to generate. The partition clause on the outer join allows queries of this nature to scale more readily, and it avoids the need to pregenerate that unique set of product names in this case as well.

The complete query to perform the analysis I wanted would simply be a logical extension of the partitioned outer join query. Once I have the data as above, I can use an inline view to do the LAG() analysis easily:

with dates as
( select add_months
        ( to_date( '01-jul-2003' ),
         column_value-1) dt
    from table( vtable(6) )
)
select product_name,
      dt,
      lag(dt)
      over (partition by product_name
            order by dt) last_dt,
      tot_sales,
      lag(tot_sales)
      over (partition by product_name
            order by dt) last_sales
  from (
select product_name,
      dt,
      tot_sales
  from dates left outer join
     sales partition by (product_name)
     on (dates.dt = sales.date_of_sale)
       )

This query compares each month's sales with the previous month's, even when the previous month's sales are "missing."


Next Steps

ASK Tom
Oracle Vice President Tom Kyte answers your most difficult technology questions. Highlights from that forum appear in this column.
asktom.oracle.com

READ more Tom
asktom.oracle.com
Effective Oracle by Design

READ more on Oracle Database 10g
oracle.com/technetwork/database
