March/April 2008
I've received some good questions since the last issue of Oracle Magazine, and they are all online at asktom.oracle.com. However, I've been using Oracle Database 11g for so many tasks recently that I feel compelled to address more new feature tidbits I keep stumbling on that just make the new release that much easier to use.
From SQL that pivots to virtual columns, Oracle Database 11g continues to impress.
How to Pivot

A frequently asked question on Ask Tom and virtually every online Oracle forum is, "How can I pivot a result set?" For example, suppose I have the query
SQL> select deptno, job, sum(sal) sal
  2    from emp
  3   group by deptno, job
  4  /

    DEPTNO JOB              SAL
---------- ---------- ---------
        20 CLERK           1900
        30 SALESMAN        5600
        20 MANAGER         2975
        30 CLERK            950
        10 PRESIDENT       5000
        30 MANAGER         2850
        10 CLERK           1300
        10 MANAGER         2450
        20 ANALYST         6000

9 rows selected.
and I need to have the JOB attributes become the column headings, showing just three rows (for DEPTNO 10, 20, and 30) with columns CLERK, ANALYST, and so on. Well, we've been able to do this before Oracle Database 11g, and it looked like Listing 1.
Code Listing 1: Pivot with DECODE
SQL> select deptno,
  2         sum( decode( job, 'CLERK', sal ) ) clerk,
  3         sum( decode( job, 'SALESMAN', sal ) ) salesman,
  4         sum( decode( job, 'MANAGER', sal ) ) manager,
  5         sum( decode( job, 'ANALYST', sal ) ) analyst,
  6         sum( decode( job, 'PRESIDENT', sal ) ) president
  7    from emp
  8   group by deptno
  9   order by deptno
 10  /

    DEPTNO      CLERK   SALESMAN    MANAGER    ANALYST  PRESIDENT
---------- ---------- ---------- ---------- ---------- ----------
        10       1300                  2450                  5000
        20       1900                  2975       6000
        30        950       5600       2850
Now that certainly works, but it was never considered intuitive or easy. Starting with Oracle Database 11g, there is new PIVOT syntax for this operation. Documented and a bit more intuitive to use, this syntax is shown in Listing 2.
Code Listing 2: Pivot with PIVOT

SQL> select *
  2    from ( select deptno, job, sal
  3             from emp ) e
  4   pivot( sum(sal) for job in
  5        ( 'CLERK', 'SALESMAN', 'MANAGER', 'ANALYST', 'PRESIDENT' ) )
  6   order by deptno
  7  /

    DEPTNO    'CLERK' 'SALESMAN'  'MANAGER'  'ANALYST' 'PRESIDENT'
---------- ---------- ---------- ---------- ---------- -----------
        10       1300                  2450                   5000
        20       1900                  2975       6000
        30        950       5600       2850
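Note that the generated column names include the single quotes ('CLERK' and so on), so referencing them downstream requires quoted identifiers. The PIVOT clause also lets you alias each pivot value to produce ordinary column names. Here is a minimal sketch against the same EMP table (I have not shown its output; it is the same data as Listing 2, just with unquoted headings):

select *
  from ( select deptno, job, sal
           from emp ) e
 pivot( sum(sal) for job in
      ( 'CLERK'     as clerk,
        'SALESMAN'  as salesman,
        'MANAGER'   as manager,
        'ANALYST'   as analyst,
        'PRESIDENT' as president ) )
 order by deptno;
-- the result set now has plain columns: DEPTNO, CLERK, SALESMAN, ...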
We have a data model based on a legacy system, and one of our fields is actually a concatenation of many other fields. We cannot modify this field right now, but we would like to enforce referential integrity on the values embedded within it. For example, if we applied the SUBSTR(column,5,2) function to the column, that value should appear in our lookup table. How can we achieve this?
In the past, there was no good answer for this. Referential integrity mandated that the child table had a value in a real column that pointed to a parent table. Here, there is no real column but rather a function on a column that needs to exist in this parent table. In Oracle Database 10g Release 2 and before, the only reasonable solution was to create an additional column and to define a trigger to maintain this value by applying the SUBSTR function. The solution might look like the code in Listing 3.
Code Listing 3: Referential integrity via a new column and trigger
SQL> create table lookup
2 ( status_code varchar2(2) primary key,
3 other_data varchar2(20)
4 );
Table created.
SQL> create table data
2 ( the_key varchar2(10) primary key,
3 status_code varchar2(2)
4 NOT NULL
5 references lookup
6 );
Table created.
SQL> create trigger data
2 before insert or update of the_key on data
3 for each row
4 begin
5 :new.status_code := substr(:new.the_key,5,2);
6 end;
7 /
Trigger created.
SQL> insert into lookup (status_code) values ( 'AA' );
1 row created.
SQL> insert into data (the_key) values ( '1234AA7890' );
1 row created.
SQL> insert into data (the_key) values ( '1234XX7890' );
insert into data (the_key) values ( '1234XX7890' )
*
ERROR at line 1:
ORA-02291: integrity constraint (OPS$TKYTE.SYS_C009710) violated - parent key not found
Now, the problems with this solution are twofold. One is an obvious storage overhead issue, because we are storing the data twice: once in the THE_KEY column and once in the STATUS_CODE column. The other issue is the performance implication of the trigger: it will necessarily add to the runtime of every INSERT and at least some UPDATEs.
Starting with the introduction of virtual columns in Oracle Database 11g, we can define referential integrity based on functions of columns, not just on real columns themselves. So, in this new release, we can get rid of the trigger and instead just code a virtual column, as shown in lines 3 through 5 of Listing 4.
Code Listing 4: Referential integrity via a virtual column
SQL> create table data
2 ( the_key varchar2(10) primary key,
3 status_code
4 as (substr( the_key, 5, 2 ))
5 references lookup
6 );
Table created.
SQL> insert into data (the_key) values ( '1234AA7890' );
1 row created.
SQL> insert into data (the_key) values ( '1234XX7890' );
insert into data (the_key) values ( '1234XX7890' )
*
ERROR at line 1:
ORA-02291: integrity constraint (OPS$TKYTE.SYS_C009730) violated - parent key not found
The virtual column is simply metadata: it does not store the data redundantly, and it does not require a trigger to maintain the value. And it can now be used to enforce the integrity constraint, further increasing the times when you can use declarative integrity in your applications.
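Although STATUS_CODE stores nothing, it can be queried like any other column, and it can even be indexed; in Oracle Database 11g, an index on a virtual column is, under the covers, essentially a function-based index on the underlying expression. A minimal sketch against the DATA table above (the index name is made up for illustration):

-- the virtual column is computed on the fly when queried
select the_key, status_code
  from data;

-- indexing the virtual column is allowed; this is effectively
-- a function-based index on SUBSTR(THE_KEY,5,2)
create index data_status_idx on data(status_code);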
But Wait, There's More to Virtual Columns

Virtual columns can do even more for you, and the most important new capability associated with virtual columns has to do with the optimizer.
The cost-based optimizer is a complex piece of software that takes many bits of input, including statistics on the tables, columns, and indexes involved in a query; system statistics, such as CPU speed and I/O performance; and the settings of various initialization parameters.
Based on the input, the optimizer creates a plan for executing a SQL statement. It does this in large part by estimating cardinality values (the number of rows) for each step of a plan and coming up with a cost for executing that step of the plan, based on how many rows it anticipates being returned. For example, for even the most trivial of queries,
select * from t where x = 5;
Oracle Database will attempt to compute the cardinality of the WHERE clause (WHERE X=5) and, based on that, it will figure out how to execute the query.
For example, suppose a table T has 100,000 records and X is indexed. The query optimizer will attempt to figure out the cardinality returned by WHERE X=5 when it decides whether to use the index. If X=5 is estimated to return 2 rows, the optimizer will almost certainly use the index. On the other hand, if X=5 is estimated to return 50,000 rows, the optimizer is likely to go for the more efficient full-table scan. For intermediate values, there will be a point up to which the optimizer will use the index, and after that, it will go for the full-table scan.
So, getting the right cardinality estimate is crucial. If the optimizer guesses "2 rows" but the query actually returns 50,000 rows, we will have a performance issue. An index will be used when it should not be. Likewise, if the optimizer guesses "50,000 rows" but the query actually returns only 2 rows, we will have run a full-table scan when using an index would have been the obviously correct choice.
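If you want to see the estimate the optimizer is working with, EXPLAIN PLAN shows it directly. Here is a minimal sketch, assuming a hypothetical table T with an indexed column X (these objects are not part of the article's examples):

-- hypothetical table and index, for illustration only
create table t ( x number, y varchar2(30) );
create index t_x_idx on t(x);

-- ...load data and gather statistics, then:
explain plan for
select * from t where x = 5;

-- the Rows column in the output is the cardinality estimate
select * from table( dbms_xplan.display );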
Well, given the above really simple query, the optimizer is typically pretty good. Collect statistics on the table, column, and index, and the optimizer would probably estimate the right cardinality in most cases. But what about when the query is a bit more complex:
select * from emp where salary+commission > 50000;
This is a more typical query, involving an expression: a function applied to various columns. The result of this expression is what we need to know about to correctly estimate the cardinality here. But because the optimizer has information only on individual things (the table, the SALARY column, the COMMISSION column, and so on), it cannot correctly estimate the number of rows returned. If we were using this query in a larger, more complex one, where it was in a subquery or being joined to, a wrong cardinality estimate could lead to a disastrous plan in terms of performance.
In Oracle Database 10g Release 2 and before, we had a couple of tools we could use for this. We could have created a function-based index on EMP(SALARY+COMMISSION) and gathered statistics on that index. That would have given the optimizer some information, but at the cost of maintaining the index at runtime and the increased storage that goes with it. We could also have used dynamic sampling, whereby the optimizer, at hard-parse time, would have dynamically sampled the underlying table with the predicates it needed to "guess" about (our WHERE SALARY+COMMISSION > 50000 predicate above, for example). But this approach would come at the cost of increased parse times and increased resource utilization, because a mini "gather statistics" (similar to DBMS_STATS.GATHER_TABLE_STATS) would be executed each time we hard-parsed the query. Both workarounds are sketched below.
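For reference, here is a minimal sketch of those two pre-11g workarounds; the index name EMP_COMP_IDX and the sampling level are assumptions for illustration, not from the original column:

-- workaround 1: a function-based index on the expression,
-- with statistics gathered on it
create index emp_comp_idx on emp(salary+commission);

begin
   dbms_stats.gather_table_stats
   ( user, 'EMP',
     cascade => true   -- gathers statistics on the index as well
   );
end;
/

-- workaround 2: ask the optimizer to sample at hard-parse time
-- (level 4 here is an arbitrary choice for illustration)
select /*+ dynamic_sampling(emp 4) */ *
  from emp
 where salary+commission > 50000;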
Enter virtual columns and the ability to gather statistics against them in Oracle Database 11g. By way of example, I'll create a small EMP table with some SALARY and COMMISSION data. I'll use ALL_OBJECTS to generate the data, using the OBJECT_ID column to populate SALARY and using DBMS_RANDOM to make up some COMMISSION values:
SQL> create table emp_data
  2  as
  3  select all_objects.*,
  4         object_id salary,
  5         round(
  6           dbms_random.value(1000,100000))
  7           commission
  8    from all_objects;

Table created.

and then gather statistics on it:

SQL> begin
  2     dbms_stats.gather_table_stats
  3     ( user,
  4       'EMP_DATA',
  5       method_opt=>'for columns ' ||
  6                   'salary,commission size 254'
  7     );
  8  end;
  9  /

PL/SQL procedure successfully completed.
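We can confirm that the histograms were collected by querying the data dictionary; a quick sketch (the reported values will vary with the contents of your ALL_OBJECTS):

select column_name, num_distinct, histogram
  from user_tab_col_statistics
 where table_name = 'EMP_DATA'
   and column_name in ('SALARY','COMMISSION');

With statistics in place, Listing 5 compares each row's compensation with the averages.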
Code Listing 5: Determine averages and total compensation
SQL> column avg_salary new_val ASALARY
SQL> column avg_commission new_val ACOMMISSION
SQL> column avg_comp new_val ACOMP
SQL> select count( case when salary+commission > avg_comp then 1 end ) above_comp, avg_comp,
  2         count( case when salary > avg_salary then 1 end ) above_salary, avg_salary,
  3         count( case when commission > avg_commission then 1 end )
  4              above_commission, avg_commission,
  5         count(*) total
  6    from emp_data, ( select avg(salary) avg_salary,
  7                            avg(commission) avg_commission,
  8                            avg(salary+commission) avg_comp
  9                       from emp_data )
 10   group by avg_salary, avg_commission, avg_comp
 11  /

ABOVE_COMP   AVG_COMP ABOVE_SALARY AVG_SALARY ABOVE_COMMISSION AVG_COMMISSION TOTAL
---------- ---------- ------------ ---------- ---------------- -------------- -----
     33829 85039.7057        33603 34564.1069            33739     50475.5988 67477
So far, so good. As expected, roughly 50 percent of the 67,477 rows come out above average in each case: 33,829 above the average total compensation (SALARY+COMMISSION), 33,603 above the average SALARY, and 33,739 above the average COMMISSION. If we ask Oracle Database how many rows it expects for each of these predicates, however, we'll see a different story. For the discrete columns SALARY and COMMISSION, Oracle Database is very accurate, as shown in Listing 6. But for the SALARY+COMMISSION expression, the cardinality story is very different, as shown in Listing 7.
Code Listing 6: Average cardinality estimate for SALARY and COMMISSION
SQL> set autotrace traceonly explain
SQL> select * from emp_data where salary > &Asalary;
old   1: select * from emp_data where salary > &Asalary
new   1: select * from emp_data where salary > 34564.1069

Execution Plan
----------------------------------------------------------
Plan hash value: 4079136661

------------------------------------------------------------------------------
| Id | Operation         | Name     | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |          | 33315 | 3611K |   307   (1) | 00:00:04 |
|* 1 | TABLE ACCESS FULL | EMP_DATA | 33315 | 3611K |   307   (1) | 00:00:04 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("SALARY">34564.1069)

SQL> select * from emp_data where commission > &Acommission;
old   1: select * from emp_data where commission > &Acommission
new   1: select * from emp_data where commission > 50475.5988

Execution Plan
----------------------------------------------------------
Plan hash value: 4079136661

------------------------------------------------------------------------------
| Id | Operation         | Name     | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |          | 34663 | 3757K |   307   (1) | 00:00:04 |
|* 1 | TABLE ACCESS FULL | EMP_DATA | 34663 | 3757K |   307   (1) | 00:00:04 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COMMISSION">50475.5988)
Code Listing 7: Low estimate for rows returned for SALARY+COMMISSION
SQL> select * from emp_data where salary+commission > &ACOMP;
old   1: select * from emp_data where salary+commission > &ACOMP
new   1: select * from emp_data where salary+commission > 85039.7057

Execution Plan
----------------------------------------------------------
Plan hash value: 4079136661

------------------------------------------------------------------------------
| Id | Operation         | Name     | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |          | 3374 |  365K |   307   (1) | 00:00:04 |
|* 1 | TABLE ACCESS FULL | EMP_DATA | 3374 |  365K |   307   (1) | 00:00:04 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("SALARY"+"COMMISSION">85039.7057)
As you can see, the cardinality estimate is way off. The value reported, 3,374, is about 5 percent of the rows, very far from the actual 50 percent. This is because the optimizer knows a lot about the SALARY column and the COMMISSION column individually but nothing about the SALARY+COMMISSION expression.
So, what can we do in this case? We need to "feed" the optimizer information about the SALARY+COMMISSION expression. In Oracle Database 11g, we can add a virtual column
SQL> alter table emp_data add (comp as (salary+commission));

Table altered.

and then gather statistics on it:

SQL> begin
  2     dbms_stats.gather_table_stats
  3     ( user,
  4       'EMP_DATA',
  5       method_opt=> 'for columns COMP'
  6     );
  7  end;
  8  /

PL/SQL procedure successfully completed.
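We can confirm that COMP is purely virtual, and that it now has statistics, via the data dictionary; a quick sketch:

-- virtual columns are flagged in USER_TAB_COLS;
-- DATA_DEFAULT holds the defining expression ("SALARY"+"COMMISSION")
select column_name, virtual_column, data_default
  from user_tab_cols
 where table_name = 'EMP_DATA'
   and column_name = 'COMP';

-- and the freshly gathered column statistics
select column_name, num_distinct, histogram
  from user_tab_col_statistics
 where table_name = 'EMP_DATA'
   and column_name = 'COMP';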
Now the optimizer has the same amount of information on this virtual COMP column as it does on the SALARY and COMMISSION columns, and the cardinality estimates are dead on, as shown in Listing 8.
Code Listing 8: With virtual column, better estimate for SALARY+COMMISSION
SQL> select * from emp_data where salary+commission > &ACOMP;
old   1: select * from emp_data where salary+commission > &ACOMP
new   1: select * from emp_data where salary+commission > 85039.7057

Execution Plan
----------------------------------------------------------
Plan hash value: 4079136661

------------------------------------------------------------------------------
| Id | Operation         | Name     | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |          | 34162 | 3869K |   307   (1) | 00:00:04 |
|* 1 | TABLE ACCESS FULL | EMP_DATA | 34162 | 3869K |   307   (1) | 00:00:04 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("EMP_DATA"."COMP">85039.7057)