Database, SQL and PL/SQL

On Sorts, Selecting, and Selectivity

 

Our technologist looks at sorting, cursors, architecting, selecting, and extracting.

By Tom Kyte Oracle Employee ACE

May/June 2004

 

Can you provide an example of how to read from a file in PL/SQL and then sort it by, say, two fields and read the result into variables?

My answer is: Don't use PL/SQL to do this. You should use an external table; it can't get any easier than that. Using an external table allows you to use the full power of SQL (including ORDER BY ) on the file easily.

To use an external table, I need to set up a directory object in the database that points to the directory in which the file exists:

SQL> create or replace directory
  2  data_dir as '/tmp/'
  3  /
Directory created.

And then I need some data to work with. I'll use my flat utility—available from asktom.oracle.com:

SQL> host flat scott/tiger -
>    emp > /tmp/emp.dat
SQL> host head /tmp/emp.dat
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81, ...

So, I have a very simple delimited file in the operating system, and now I would like to sort it (or just query it in general). I can use a table with an ORGANIZATION of EXTERNAL , which tells Oracle that the table data will reside not in the database but rather in a file outside of the database. The syntax for the EMP table above can be

SQL> create table external_table
  2 (EMPNO NUMBER(4) ,
  3  ENAME VARCHAR2(10),
  4  JOB VARCHAR2(9),
  5  MGR NUMBER(4),
 6   HIREDATE DATE,
 7   SAL NUMBER(7, 2),
 8   COMM NUMBER(7, 2),
 9   DEPTNO NUMBER(2)
10  )
11  ORGANIZATION EXTERNAL
12  ( type oracle_loader
13    default directory data_dir
14    access parameters
15    ( fields terminated by ',' )
16    location ('emp.dat')
17  )
18  /
Table created.

The syntax for the external table includes something that looks a lot like a SQL Loader control file (lines 12 through 16 in the preceding syntax). This is no accident; having an external table is very much like having SQL Loader inside the database. In fact, an external table can do pretty much everything SQL Loader can do, plus a whole lot more, because you have the full power of SQL at your disposal now.

Now that I have created the external table, sorting the file and processing the results becomes as easy as using SQL:

SQL> select empno, ename
   2   from external_table
   3  order by empno;
     EMPNO  ENAME
-----------  ----------
      7369  SMITH
      7499  ALLEN
...
      7934  MILLER
14 rows selected.
SQL> select empno, ename
  2    from external_table
  3   order by ename;
     EMPNO  ENAME
-----------  ----------
      7876  ADAMS
      7499  ALLEN
...
      7521  WARD
14 rows selected.

Cursor or Ref Cursor

At an interview for an Oracle PL/SQL developer job, I was asked to describe the difference between cursor and ref cursor and when you would appropriately use each of these. Could you please tell me the answer?

Well, technically, under the covers, at the most basic level, they are the same. A typical PL/SQL cursor is static by definition. Ref cursors, on the other hand, can be dynamically opened (the query is not known until runtime) or opened by use of a set of static SQL statements, the choice of which is based on logic (an IF/THEN/ELSE block will open one or the other query). For example, the following block of code demonstrates a typical static PL/SQL cursor, cursor C. Also, it shows how you can use a ref cursor ( L_CURSOR in this example) to open a query by using dynamic SQL or static SQL:

Declare
  type rc is ref cursor;
  cursor c is select * from dual;
  l_cursor rc;
begin
  if (to_char(sysdate,'dd') = 30)
  then
    -- ref cursor with dynamic sql
    open l_cursor for
      'select * from emp';
  elsif (to_char(sysdate,'dd') = 29)
  then
    -- ref cursor with static sql
    open l_cursor for
      select * from dept;
  else
    -- with ref cursor with static sql
    open l_cursor for
      select * from dual;
  end if;
  -- the "normal" static cursor
  open c;
end;
/

In this block of code, you see perhaps the most salient difference: No matter how many times you run that block, cursor C will always be select * from dual . The ref cursor, on the other hand, can be any result set whatsoever, because the ' select * from emp ' character string literal can be replaced with a variable containing virtually any query.

Other differences that are not obvious from this example include the ability of a ref cursor to be returned to a client. This is how you return result sets from stored procedures in Oracle (see asktom.oracle.com for more details on that topic). A PL/SQL static cursor, on the other hand, cannot be returned to a client; only PL/SQL can work with it.

Additionally, a PL/SQL static cursor can be global and a ref cursor cannot. That is, you cannot define ref cursors outside of a procedure or function in a package specification or body. Ref cursors can only be processed in the defining procedure or returned to a client application.

Also, a ref cursor can be passed from subroutine to subroutine and a cursor cannot. To share a static cursor like that, you would have to define it globally in a package specification or body. Because using global variables is not a very good coding practice in general, ref cursors can be used to share a cursor in PL/SQL without having global variables getting into the mix.

Last, using static cursors—with static SQL (and not using a ref cursor) —is much more efficient than using ref cursors, and the use of ref cursors should be limited to

  • Returning result sets to clients

  • Sharing cursors across many subroutines (very similar to the above point, actually)

  • Achieving your goal when there is no other efficient/effective means of doing so, such as when dynamic SQL must be used

In short, you want to use static SQL first and use a ref cursor only when you absolutely have to.


Architectural Issue

We have lots of production databases on lots of production servers catering to various business needs. Now there is a need to use J2EE architecture to design and build Web-based applications that need information from these multiple databases. We are discussing options for achieving this result.

We are divided into two groups with different approaches. Because it is a homogeneous data source—Oracle—one group proposes to use Oracle database links to get the data from multiple databases, which would give better access control and concurrency and has great transparency for the application. The other group opposes creating any database links and wants to have joins and data captured at the Java pool connection level and let the application do the work. In this way, the need for tight integration with vendor-based software is completely avoided.

I would like to get your thoughts on this. I would like to know what would be the best way to design and build a Web-based application that uses multiple data sources.

I'm surprised at how often questions like this come up. Really. It seems like such a simple decision: Use what you bought and paid for (the database), or write a ton of custom code that will need to be designed, debugged, tuned, and maintained for the rest of your life. I've always had a hard time understanding why this question even comes up.

In short, the people in group 2 want to write lots of code and have job security for the rest of their natural lives. They will never actually finish, however, and responding to requests for new functionality will take 10 times as long as it should. The end users will be wondering why this technology thing is so hard. Group 2 will make all technical people look bad.

The people in group 1 want to use the more cost-effective, faster method. They'll also be able to support "new technology 532143," which will come out next year. J2EE is "new technology 532140" and is just a middle-tier architecture.

I quit my last job because of people like those in group 2! They were dead set on writing their own two-phase commit (even though the database already had it) and writing their own gateways to other databases (even though the database already had it). They were going as far as to implement their own stored procedure language—all in the name of database independence. In retrospect, I can see they accomplished none of that. To get anything done, they ultimately had to drop down to the database and use some of the features. The layers they appended on top added no functionality not already provided; in fact, they hid functionality from the end users, preventing them from using the database.

I don't see how pretending you have one Oracle database (via database links) instead of n of them is going to force you into "tight integration with vendor-based software"—whatever that means. To me, that tight integration means you are actually using what you paid for! It means that you are not reinventing an optimizer, a join engine, and a database. It means you can actually use the advanced features of the product you paid good money for.

Would these people in group 2 invent their own steering mechanism simply because they drive more than one car?

This topic has generated lots of online discussion (see asktom.oracle.com for the original question and answer on the Web). One of the really good points made was as follows:

Case II has many problems. If you don't use database links and try to implement everything in J2EE, performance will be a nightmare. All the query screens should have a message saying, "Please come back tomorrow to see the result...."

Consider a case in which Query1 from Database1 returns 300,000 records and Query2 from Database2 returns 10,000 records. The final output you want to return to the user is the join of both the queries, which should return only 1,000 records at the end. If you use database links, then Oracle Database will do all the optimization and return you 1,000 records with the shortest possible time.

If someone in your group starts writing this code, then he will fetch all 300,000+ records into server memory and paging will kick in. This will kill performance of the entire system. All users will suffer. If your application is heavily data-centric, implementing at connection level is not the right choice.

But forgetting about performance for a moment—have you thought about how complex this situation is? Do you really want your business application developers writing code that is, in fact, emulating all the functions of a database? They will have to create temporary spaces for swapping result sets in and out. They will be implementing nested loops, joining techniques, hash joins, sort merges—and so on. These are things a database (every database, pretty much) does out of the box. It would be just about the worst decision possible to re-create a database on the J2EE layer. Consider what happens as data sources, requirements, and the needs of the end user change. How can these business application developers respond to that while 100 percent of their attention is devoted to creating a new database engine of their own? (It takes a while to build one of those things; Oracle has been at it for more than 25 years now.)


Selective Indexing

I use Oracle8i Release 3 (8.1.7). Is it possible with this release to create indexes only on selected rows? For example, when the column being indexed is NULL, I've seen lots of examples in various applications I've worked with where I know for a fact that I'm never ever going to look at rows using this index that have NULL in the column. It strikes me that there's a potential space-saving feature (memory as well as disk, I suppose) if it is possible to tell the index, "Index only columns that have these values..." or the converse, "Don't index these columns...."

This is something we've been able to do since Oracle8i Release 1 (8.1.5), using function-based indexes (available with the Enterprise Edition of Oracle Database). It works as follows: You index a function that returns the data values you want indexed and returns NULL otherwise. Because an entirely NULL key is not entered into B*Tree indexes, you can, in fact, selectively index only the rows of interest, using this technique. Consider this small example:

SQL> create table t ( x int );
Table created.
SQL> create index t_idx on t(x);
Index created.
SQL> insert into t
  2  select null from all_users;
51 rows created.
SQL> analyze index t_idx
  2  validate structure;
Index analyzed.
SQL> select name, lf_rows
  2    from index_stats;
NAME                      LF_ROWS
-------------------------  --------
T_IDX                          0

That index is empty. There are 51 rows in the table and 0 rows in the index structure, because the key is entirely NULL. Now, if I put in a non-NULL value:

SQL> insert into t values ( 1 );
1 row created.
SQL> analyze index t_idx
  2  validate structure;
Index analyzed.
SQL> select name, lf_rows
  2    from index_stats;
NAME                       LF_ROWS
-------------------------  ---------
T_IDX                            1

I finally have an index entry. Next I make all of the values not NULL:

SQL> update t set x = rownum;
52 rows updated.
SQL> analyze index t_idx
  2  validate structure;
Index analyzed.
SQL> select name, lf_rows
  2    from index_stats;
NAME                       LF_ROWS
-------------------------  ---------
T_IDX                           53

And all of the rows are in the index. Now, in order to selectively index, I can use CASE (or DECODE , or any other function, in fact) with a complex predicate to identify the values I want. The following example indexes all rows in which the value of X is less than 25. This would be useful if the table had many rows and most of the values of X were, say, 100, indicating that the record was processed and completed in some fashion, and a small percentage of the values of X were 1, 2, 3, ... 25, indicating they were at some stage of processing. This would not only save space but the index would also be very appealing to the optimizer, because it is small and very selective. In the following example, only the rows where x < 25 are in the index:

SQL> create index t_idx2 on
  2  t( case when x < 25 then x end );
Index created.
SQL> analyze index t_idx2
  2  validate structure;
Index analyzed.
SQL> select name, lf_rows
  2    from index_stats;
NAME                        LF_ROWS
-------------------------  ----------
T_IDX2                           24

Further, I can (and should, I think) use a view to hide the complexity of querying via this index:

SQL> create or replace view v
  2  as
  3  select x,
  4  case when x<25 then x end another_x
  5  from t;
View created.
SQL> analyze table t
  2  compute statistics;
Table analyzed.
SQL> set autotrace traceonly explain
SQL> select * from v
  2   where another_x = 5;
Execution Plan
-----------------------------------
SELECT STATEMENT Optimizer=CHOOSE
  TABLE ACCESS (BY INDEX ROWID) OF 'T'
    INDEX (RANGE SCAN) OF 'T_IDX2'
SQL> set autotrace off

As you can see, I can selectively index just the rows I am interested in and I can provide the applications access to this index to transparently. Please see asktom.oracle.com for more in-depth information on using function-based indexes.


Extracting DDL

One of my coworkers was working on a script to extract DDL. He asked me if there is any way, from a data dictionary point of view, to determine the difference between a NOT NULL constraint, created as create table (a number NOT NULL) and a check constraint with the condition IS NOT NULL. We have not been able to determine the difference by looking at data dictionary views. Can you help?

Sometimes the easiest way to do something is not to do it at all. Extracting DDL is one of those things. You don't need to write DDL extraction scripts. You can use one of the two following approaches:

  • Use the EXP (export) and IMP (import) utilities.

  • Use Oracle9i and later—the DBMS_METADATA package turns this more-than-weeklong exercise into a simple query against dual!

The following example uses EXP and IMP:

$ exp userid=/ tables=t rows=n
Export: Release 9.2.0.4.0 -
...
About to export specified tables
. . exporting table           T
Export terminated successfully.
$ imp userid=/ full=y indexfile=t.sql
Import: Release 9.2.0.4.0
...
Import terminated successfully.
$ cat t.sql
REM  CREATE TABLE "OPS$TKYTE"."T"
REM  ("X" NUMBER(*,0), "Y" NUMBER(*,0))
REM  PCTFREE 10 PCTUSED 40 INITRANS
REM  1 MAXTRANS 255 STORAGE(INITIAL
REM  65536 FREELISTS 1 FREELIST GROUPS
REM  1) TABLESPACE "USERS" LOGGING
REM  NOCOMPRESS ;
CONNECT OPS$TKYTE;
CREATE INDEX "OPS$TKYTE"."T_IDX" ON
"T" ("X" ) PCTFREE 10 INITRANS 2
MAXTRANS 255 STORAGE(INITIAL 65536
FREELISTS 1 FREELIST GROUPS 1)
TABLESPACE "USERS" LOGGING ;

As you can see, T.SQL has everything you need. In Oracle9i this is even easier with the DBMS_METADATA

package:
SQL> select dbms_metadata.get_ddl
  2         ( 'TABLE', 'T' ) ddl
  3   from dual;
DDL
------------------------------------
CREATE TABLE "OPS$TKYTE"."T"
(       "X" NUMBER(*,0),
"Y" NUMBER(*,0)
) PCTFREE 10 PCTUSED 40 INITRANS 1
MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576
MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST
GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "USERS"

But really, you don't want to write any code; the tools already do that!


To Hash or not to Hash

I am starting a project to encrypt passwords and store them in the database. You mention on your Web site that this should never be done and that hashing is the way to go. What benefit does hashing have over encrypting? Can the hash be reverse-engineered?

The most basic benefit is that hashes cannot be reversed and encryption must be reversable. The steps of the logon process are:

  1. User gives a username and password.

  2. You verify they are correct.

  3. If correct, you grant access.

  4. If not correct, you fail the attempt.

Do we need to store their password in order to perform step 2? Absolutely not. We can do it the way that Oracle and UNIX and most other systems do: We can use a hash. The steps for using a hash are simple. We only need to pass the supplied username and password using a standard hashing function. (Oracle provides DBMS_OBFUSCATION_TOOLKIT.MD5 in Oracle8i and Oracle9i and DBMS_CRYPTO.HASH in Oracle 10g for just such a purpose.) We take the hash returned by such a function and compare it to the stored hash for that user. In that fashion, we can perform the authentication but we never have to worry about someone stealing our passwords, because we are not storing them! There is no way to deduce the passwords from the stored hash.

Everyone responsible for security will sleep much better knowing they do not have a table full of passwords (encrypted or not) that can be stolen. See asktom.oracle.com for more on this important topic.

Next Steps

 DOWNLOAD Tom's flat utility

READ
about external tables
 Oracle9i Database Utilities Guide

 returning queries from result sets

 function-based indexes

 

DISCLAIMER: We've captured these popular historical magazine articles for your reference. Links etc contained within these article possibly won't work. These articles are provided as-is with no guarantee that the information within them is the best advice for current versions of the Oracle Database.