Database, SQL and PL/SQL

On Separation and Counting

Our technologist promotes more than one schema, counts partitions, and uses TKPROF.

By Tom Kyte
Oracle Employee ACE

November/December 2008

A data architect at work has proposed that we start using separate database accounts to hold the code (packages, procedures, views, and so on) and the data (tables, materialized views, indexes, and so on) for an application. I’ve never come across this idea before, and it seems to be contrary to the concepts of encapsulation, in that the application will be spread across at least two schemas and require more administrative overhead to maintain the necessary grants between them.

Are there any situations you can think of where this would be a recommended approach? And if you did this, how would you recommend referencing objects in the data schema from the application schema? Finally, would you put any views into the code or data schema?

This separate application (code) and data schema approach is hardly contrary to encapsulation. In fact, it promotes encapsulation, modularization, and, most importantly, least privilege.

Maintaining the grants is a positive aspect of this approach, because you’ll have to state why you need a privilege, and that can be documented. Auditors can inspect this setup and see what access you have to the data.

It is a really good idea, in fact, to set up more than one schema for application code to give even finer-grained control over the grants.

Right now, developing the application code in the same schema as the data, you could drop the table, truncate it, perform any data manipulation language (DML) operation on it, and alter it in any way you see fit. You will (fortunately) lose all of that when this really good separate-schema idea is put in place. The new separate-schema approach will ultimately lead to a better-documented system, with great security controls in place.

Will you lose some “flexibility”? Sure, you will think so, but a development team should not necessarily have this flexibility. With so many privileges, there are too many things you can do wrong in this environment.

With the separate application and data schema approach in place, you’ll be able to query the data dictionary and see the least set of privileges you need for your application to execute—which is very good—and you’ll understand why you need those privileges, because you had to ask for them—which is also very good.
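For example, a quick look at the data dictionary shows exactly which privileges the application schema holds. Here is a minimal sketch of such a check; the APP_CODE schema name is hypothetical:

-- List every object privilege held by the (hypothetical) application schema.
select owner, table_name, privilege, grantable
  from dba_tab_privs
 where grantee = 'APP_CODE'
 order by owner, table_name, privilege;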

This is a very common implementation, becoming more common today with the need to be more accountable—needing to know who does what and when, who can do what and when, and so on.

Remember, the data is the data, and the application is something that accesses data. They are not one and the same, and they should be separate.

As for the question of how to reference the objects in the data schema in your application code, I prefer the following, in order:

  • Use fully qualified references: schema.object_name. If you are worried about the schema name changing in the future, you can always make the schema name a SQL*Plus substitution variable. (Both approaches are sketched after this list.)
  • Use private synonyms, so the schema name is not referenced in the code. Private synonyms incur less overhead than public synonyms, and public synonyms are something to be avoided at all costs, because there can be only one public synonym “SOME_NAME” in a database. The use of public synonyms can easily prevent server consolidation if two applications both need to use the public synonym “SOME_NAME” for two different objects.
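Here is a minimal sketch of both approaches; the APP_DATA schema name and the EMPLOYEES table are hypothetical:

-- Fully qualified reference, with the schema name held in a SQL*Plus
-- substitution variable so it can change without editing the code.
define data_schema = APP_DATA

select empno, ename
  from &&data_schema..employees
 where deptno = 10;

-- Private synonym in the code schema, so the schema name never appears
-- in the application code at all.
create synonym employees for &&data_schema..employees;

Note the double period after the substitution variable: the first period terminates the variable name, and the second separates the schema from the object.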

As for the question of where the views go, the answer is that views can go in either schema. They make sense in both places.

The case for putting views into the application schema. A view in the application schema is a view of convenience; it joins N tables, selects needed data, and formats a couple of columns. It is in place because its SQL is used here, there, and everywhere. The application was granted access to the base tables without the grant option/admin option, so the application cannot grant access to this view to others (it is usable only by the application logic in the database), so you haven’t opened up any security holes.

In this case, the view is very much like a subroutine—it accesses the data to which the application schema was granted access.
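A minimal sketch of this case, using hypothetical APP_DATA and APP_CODE schemas and hypothetical tables:

-- In the data schema: plain grants, deliberately without GRANT OPTION,
-- so the code schema cannot pass the access on to anyone else.
grant select on app_data.employees   to app_code;
grant select on app_data.departments to app_code;

-- In the code schema: a view of convenience that joins the base tables
-- and formats a couple of columns for reuse throughout the application.
create or replace view emp_dept_v
as
select e.empno,
       initcap(e.ename) employee_name,
       d.dname          department_name
  from app_data.employees   e,
       app_data.departments d
 where e.deptno = d.deptno;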

The case for putting views into the data schema. A view in the data schema is used to further restrict access to data. It includes a predicate, for example, that limits which data can be seen. Suppose there is a large base table and a given application schema needs to see only half of that data. You create a view that exposes the correct set of data and grant access on the view to that application schema, and the application schema uses the view as if it were a table. That application schema will not have access to the base table. You can use the “with check option” and so on to enforce INSERT/UPDATE restrictions as well. Note that fine-grained access control (DBMS_RLS) can also do this.
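A minimal sketch of that approach, again with hypothetical schema, table, and column names; the predicate and the WITH CHECK OPTION clause are the point:

-- In the data schema: a view that exposes only the rows this application
-- is allowed to see and that blocks INSERTs and UPDATEs falling outside
-- that subset.
create or replace view west_region_orders
as
select *
  from orders
 where region = 'WEST'
  with check option;

-- Grant access on the view only; the application never touches the base table.
grant select, insert, update on west_region_orders to app_code;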

So the developers are free to create their views in the application schema, and the data owners, the people securing the data, create views in the data schema—to protect the data.


The Right Number of Partitions

I have a table with about one million records, but the table has the potential to grow to three million or more records. I am planning to partition this table, using hash partitioning on the most queried column, but before I do that, I have a few questions:

1. I have read somewhere that due to the hashing algorithm Oracle Database uses, it is better to use a power-of-2 number of partitions. Is that correct?
2. Is there a recommendation on how many partitions are optimal for a given number of records? What would you take into consideration when deciding on the number of partitions?
3. Do you have any other recommendations?

First, when you’re using hash partitioning, it is paramount to use a power of 2—2, 4, 8, 16, 32, 64 . . . —and no other numbers in between. If you do not, the data will be skewed across partitions with most of the data in the “middle” partitions and little data in the “end” partitions. Also, a hash key should have lots and lots of distinct values, or else you won’t achieve even distribution of data. In Listing 1, I create a four-partition table and a five-partition table using hash partitions. Note that the data distribution is clearly skewed in the five-partition table.

Code Listing 1: Four partitions (OK) versus five partitions (not OK)

SQL> CREATE TABLE t1 ( x )
   2   PARTITION BY hash(x)
   3   ( partition part1 ,
   4     partition part2 ,
   5     partition part3 ,
   6     partition part4
   7   )
   8   as
   9   select rownum from all_objects;
Table created.
SQL> CREATE TABLE t2 ( x )
   2   PARTITION BY hash(x)
   3   ( partition part1 ,
   4     partition part2 ,
   5     partition part3 ,
   6     partition part4,
   7     partition part5
   8   )
   9   as
  10  select rownum from all_objects;
Table created.
SQL> select h,
   2         count(*) cnt,
   3         sum(count(*)) over () totcnt,
   4         substr( rpad('*',100,'*'), 1,
   5         100*ratio_to_report(count(*)) over ()) hist
   6    from
   7   (select 1 h from t1 partition (part1)
   8    union all select 2 h from t1 partition(part2)
   9    union all select 3 h from t1 partition(part3)
  10   union all select 4 h from t1 partition(part4)
  11  ) group by h order by h
 12   /
         H        CNT     TOTCNT  HIST
         ---    ------    ------ ------------------------
         1      12409      49877 ************************
         2      12413      49877 ************************
         3      12697      49877 ************************
         4      12358      49877 ************************
SQL> select h,
   2         count(*) cnt,
   3         sum(count(*)) over () totcnt,
   4         substr( rpad('*',100,'*'), 1,
   5         100*ratio_to_report(count(*)) over ()) hist
   6    from
   7   (select 1 h from t2 partition (part1)
   8    union all select 2 h from t2 partition(part2)
   9    union all select 3 h from t2 partition(part3)
  10   union all select 4 h from t2 partition(part4)
  11   union all select 5 h from t2 partition(part5)
  12   ) group by h order by h
  13   /
         H        CNT     TOTCNT  HIST
         ---    ------    ------ ------------------------
         1       6234      49883 ************
         2      12413      49883 ************************
         3      12698      49883 ************************
         4      12358      49883 ************************
         5       6180      49883 ************

As for the question about determining the right number of partitions for a number of records, the determination is not record-driven. A million records might be stored in 10MB, 100MB, or 1TB, so determining the number and size of partitions is about volume. You have to ask yourself

  • What is the biggest segment I want to have?
  • What am I trying to accomplish with partitioning?

The answers to these questions will drive your partition sizes and your partitioning scheme.

You need to first understand why you are partitioning. Then, and only then, apply partitioning in a manner that will achieve what you want.

When you apply partitioning, make sure you understand the ramifications of doing so and consider each and every index independently of every other index. For example, suppose you have an employees table and you hash-partition it into 16 partitions by EMPNO. Further suppose you have an index on LAST_NAME and you simply partition that index locally, so the index is equipartitioned with the table by EMPNO even though the index itself is on LAST_NAME.

Now, when you query

select * from employees
where last_name=:x

you will do 16 index range scans—16 times the work!

In this case, you would have wanted to do one of the following (sketched after this list):

  • Not partition that index at all. The size of the index in a single partition was fine with you.
  • Partition that index by range. For example, A-K go into one partition and L-Z go into another. Use whatever ranges make sense for your data.
  • Hash-partition it. This presumes that you always use equality in your searches, because hash-partitioned indexes are not suitable for LIKE and <, > searches.
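To make those options concrete, here is a minimal sketch, assuming the hypothetical EMPLOYEES table hash-partitioned by EMPNO described above. The index names and range boundaries are illustrative, and you would create only one of these:

-- Option 1: an ordinary, nonpartitioned index on LAST_NAME.
create index emp_lname_idx
    on employees(last_name);

-- Option 2: a globally range-partitioned index on LAST_NAME, so a query
-- on LAST_NAME probes a single index partition instead of all 16.
create index emp_lname_range_idx
    on employees(last_name)
    global partition by range (last_name)
    ( partition p_a_to_k values less than ('L'),
      partition p_l_to_z values less than (maxvalue)
    );

-- Option 3: a globally hash-partitioned index (equality searches only).
create index emp_lname_hash_idx
    on employees(last_name)
    global partition by hash (last_name) partitions 16;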

To gain an understanding of partitioning, you will want to review the Oracle Database Data Warehousing Guide 11g Release 1, especially Chapter 5 (download.oracle.com/docs/cd/B28359_01/server.111/b28313/parpart.htm#i1007993). Also, in my book Expert Oracle Database Architecture, I discuss all of this in some detail, including why you want to know why you are partitioning, how to approach partitioning, and what not to do.


Using TKPROF

How can I use TKPROF to find problematic queries? The only point I know is that when elapsed time is longer, it means that the query spent more time waiting for something. But I am not sure how to use TKPROF for query tuning.

First, you use TKPROF in general to identify queries in your application that consume “lots of resources,” including CPU time, elapsed time, and so on. Suppose you do this (enable tracing in your application and then run bits of your application):

SQL> begin
  2  dbms_monitor.session_trace_enable
  3  ( waits => true );
  4  end;
  5  /
PL/SQL procedure successfully completed.
SQL> select count(subobject_name)
  2    from big_table.big_table;
COUNT(SUBOBJECT_NAME)
-------------------------------
                         688256
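To turn that trace into a report, stop tracing, locate the trace file, and run the TKPROF command-line utility against it. A minimal sketch follows; the trace file path is illustrative, V$DIAG_INFO is available in Oracle Database 11g, and the sort option is optional:

SQL> exec dbms_monitor.session_trace_disable;

SQL> select value
  2    from v$diag_info
  3   where name = 'Default Trace File';

$ tkprof /path/to/trace/orcl_ora_12345.trc report.txt sys=no sort=fchela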

TKPROF will produce a report that looks much like the content of Listing 2. Now we have a bunch of facts:

Code Listing 2: TKPROF report, first run

select count(subobject_name)
from
big_table.big_table
call      count       cpu      elapsed     disk    query    current   rows
-------   -------     -----    -------     ----    -----    -------   ----
Parse         1        0.00       0.01        0        3          0      0
Execute       1        0.00       0.00        0        0          0      0
Fetch         2       99.36     262.10  1840758  1840797          0      1
-------   -------     -----    -------     ----    -----    -------   ----
total         4       99.36     262.11  1840758  1840800          0      1
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 344
Rows              Row Source Operation
----------       ----------------------------------------------------------------
        1         SORT AGGREGATE (cr=1840797 pr=1840758 pw=0 time=262104893 us)
128000000         TABLE ACCESS FULL BIG_TABLE (cr=1840797 pr=1840758 pw=0 time=384004887 us)
Elapsed times include waiting on following events:
  Event waited on               Times     Max. Wait    Total Waited
  -------------------------     Waited    ----------   -------------
  SQL*Net message to client          2         0.00          0.00
  db file scattered read         14425         0.22        195.87
  db file sequential read           13         0.01          0.06
  SQL*Net message from client        2         0.00          0.00

  • The query took a long time—about four and a half minutes.
  • We did a lot of physical I/O.
  • We did a lot of logical I/O.
  • We used about 100 CPU seconds.
  • The query took about 262 seconds of elapsed time but only 99 seconds of CPU time, so we spent roughly 160 seconds waiting on something.
  • We waited for DB file scattered read for a long time—this was the read of a full scan.
  • We did a full scan on a huge table.
  • We can see our query.
  • We have the plan that was used.

Now, what can we do with these facts? We can use our knowledge of the data and how Oracle Database works to “tune.” What are some obvious things to think about there?

Well, we see that the query counted 688,256 non-null SUBOBJECT_NAME values in a table of 128,000,000 rows. That in itself gives us another fact or two:

  • We needed a very small subset of rows from this table to answer this query.
  • The table’s high-water mark is probably OK. Given that we read 1.8 million blocks and processed 128 million rows, we must have about 70 rows per block, so the table seems to be well packed. So, shrinking or reorganizing the table isn’t going to do anything.

These facts make it possible to rule things out. That is as useful as using facts to rule something in.

OK, so what are some possible “tuning” options?

1. Make the full scan faster, maybe by compressing the table (a rough sketch follows this list).
2. Because we need only about 0.54 percent of the table’s rows to answer our query (count the non-null occurrences of subobject_name), maybe indexing would improve query performance.
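For reference, option 1 could look something like the following minimal sketch. It assumes basic table compression and an offline rebuild, and any indexes on the table would be marked UNUSABLE by the MOVE and would have to be rebuilt afterward (the index name shown is hypothetical):

-- Rebuild the table segment with basic (direct-path) compression.
alter table big_table.big_table move compress;

-- Then rebuild each index on the table, for example:
-- alter index big_table.big_table_pk rebuild;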

Let’s try option 2 first, because it will have the least impact.

SQL> create index big_table_so_idx
on big_table(subobject_name);
Index created.

After we do that and rerun the query, TKPROF shows the report in Listing 3.

Code Listing 3: TKPROF report, second run

select count(subobject_name)
from
big_table.big_table
call       count       cpu     elapsed     disk    query    current   rows
-------   -------     -----    -------     ----    -----    -------   ----
Parse           1      0.00       0.04        0        0          0      0
Execute         1      0.00       0.00        0        0          0      0
Fetch           2      0.28       0.52     3342     3355          0      1
-------   -------     -----    -------     ----    -----    -------   ----
total           4     0.28        0.56     3342     3355          0      1
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 344
Rows        Row Source Operation
-------     -------------------------------------------------------
      1     SORT AGGREGATE (cr=3355 pr=3342 pw=0 time=520528 us)
 688256      INDEX FAST FULL SCAN BIG_TABLE_SO_IDX (cr=3355 pr=3342 pw=0 time=730570 us)
                 (object id 161278)
Elapsed times include waiting on following events:
  Event waited on              Times     Max. Wait   Total Waited
  --------------------------   Waited    ---------   ------------
  SQL*Net message to client         2         0.00           0.00
  db file scattered read           42         0.01           0.27
  SQL*Net message from client       2         0.00           0.00

Now in real life, it’ll be more complicated than this, of course. You’ll have a multitable query to deal with and complex data relationships. To tune that query, you can do one of two things:

  • Submit the query to a tool (such as the tuning/performance pack of Oracle Enterprise Manager) and let the tool process it. It will apply the rules and tell you which indexes, materialized views, query rewrites, and—in Oracle Database 11g—partitioning schemes would be useful for that query. (A rough DBMS_SQLTUNE sketch follows this list.)
  • Use the facts you can derive from TKPROF and other sources and apply your knowledge of the data, the data patterns, and Oracle Database—including the indexes it has to offer, the schema structures (clusters, index-organized tables, heap tables, partitioning, and so on) it has to offer, and the SQL it supports (because you can rewrite a query more efficiently in many cases).
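For the first approach, the SQL Tuning Advisor can also be driven from SQL*Plus through the DBMS_SQLTUNE package. A rough sketch, assuming you hold the required Oracle Tuning Pack license; the SQL_ID and task name are hypothetical:

declare
  l_task varchar2(64);
begin
  -- Create and run a tuning task for a statement already in the shared pool.
  l_task := dbms_sqltune.create_tuning_task
            ( sql_id => 'aabbccddeeff0', task_name => 'tune_my_query' );
  dbms_sqltune.execute_tuning_task( task_name => 'tune_my_query' );
end;
/

-- Review the advisor's findings and recommendations.
select dbms_sqltune.report_tuning_task( 'tune_my_query' ) from dual;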

Nonselective Columns in an Index

My company runs a third-party application on Oracle Database 10g Release 2 (10.2.0.2). A developer created a huge index, with seven columns, on a huge table, some time ago. Analyzing the usefulness of the index, I found out that it would be much better if the index had only two columns (because the other five are not selective).

I want to know if there is a simple way to find out which queries use the index, because I want to re-create the index with fewer columns, but I do not want to degrade the performance of the queries.

As for the question about how to find out which queries use an index, you can query V$SQL_PLAN to see which queries use that index—but note that V$SQL_PLAN shows only the queries that are in the shared pool right now (that last bit is important).
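A minimal sketch of that check; the owner and index names are hypothetical:

-- Which cursors currently in the shared pool reference this index?
select distinct sql_id
  from v$sql_plan
 where object_owner = 'APP_DATA'
   and object_name  = 'BIG_SEVEN_COLUMN_IDX';

You can then join SQL_ID back to V$SQL to see the full statement text.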

But more importantly, you wrote, “. . . the other five [columns] are not selective. . . .” That fact has nothing to do with whether they are useful or not. I don’t care if they are constant or have two distinct values—that they are not selective has nothing to do with whether they should be in the index. Selectivity has nothing to do, really, with whether a column should be indexed. The questions you ask and the data patterns dictate what columns make sense in an index.

For example, I’ll create a table in which ID is unique (very selective) and ID2 has about five values (not very selective):

SQL> create table t
  2  as
  3  select rownum id, trunc(rownum/10000) id2, a.*, rpad( 'x', 2000, 'x' ) data
  4    from all_objects a
  5  /
Table created.
SQL> create index t_idx1 on t(id);
Index created.
SQL> create index t_idx2 on t(id,id2);
Index created.
SQL> select count(distinct id), count(distinct id2) from t;
COUNT(DISTINCTID) COUNT(DISTINCTID2)
----------------- ------------------
            49719                  5

Now, according to your approach, we should not want ID2 in the index, because it is “not very selective.” But look at what can happen if you leave it out (we’ll use a hint to use the index just on ID):

SQL> set autotrace traceonly
SQL> select
  2  /*+ index( t t_idx1 ) */ *
  3  from t
  4  where id between 1
  5              and 5000
  6  and id2 = 1;
no rows selected
Execution Plan
----------------------------------------------
| Id  | Operation                   | Name   |
----------------------------------------------
|   0 | SELECT STATEMENT            |        |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T      |
|*  2 |   INDEX RANGE SCAN          | T_IDX1 |
----------------------------------------------
Statistics
----------------------------------------------
          0  recursive calls
          0  db block gets
       1679  consistent gets
        968  physical reads
          0  redo size
       1154  bytes sent via SQL*Net to client
        373  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed

That was a lot of work to find nothing, wasn’t it? The index on just ID found 5,000 candidate rows (WHERE id BETWEEN 1 AND 5000) but had to go to the table for each of them to see if ID2 was equal to 1, only to find that it never was.

So here’s what happens if we use the index on (ID,ID2) instead:

SQL> set autotrace traceonly
SQL> select
   2  /*+ index( t t_idx2 ) */ *
   3  from t
   4  where id between 1
   5           and 5000
   6  and id2 = 1;
no rows selected
Execution Plan
----------------------------------------------
| Id  | Operation                   | Name   |
----------------------------------------------
|   0 | SELECT STATEMENT            |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| T      |
|*  2 |   INDEX RANGE SCAN          | T_IDX2 |
----------------------------------------------
Statistics
----------------------------------------------
          0  recursive calls
          0  db block gets
         13  consistent gets
          0  physical reads
          0  redo size
       1154  bytes sent via SQL*Net to client
        373  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed

We did significantly less work. We did a range scan of all the ID values between 1 and 5000 in the index, but we were able to avoid hitting the table over and over to see if ID2 was equal to 1!

So forget selectivity. Instead, when you consider the columns in an index, ask whether including them will let you avoid going to the table. If you have a predicate such as WHERE X = ? AND Y = ?, it might be advantageous to have both X and Y in the index, so you can avoid having to go from the index to the table over and over to evaluate the WHERE clause.


Next Steps

ASK Tom
Tom Kyte answers your most difficult technology questions. Highlights from that forum appear in this column. asktom.oracle.com

READ more about Oracle Database 11g

READ more Tom
Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions

 DOWNLOAD Oracle Database 11g

 
