Database, SQL and PL/SQL

Persistence Doesn’t Always Pay

Sometimes you don’t want a database table and SQL. Sometimes you just need to collect your thoughts.

By Steven Feuerstein

September/October 2014

In my last article, “The Joy of Low-Hanging Fruit,” you met a team of developers at extremememe.info (all names changed to protect the innocent), an up-and-coming Web 3.0 paradigm shifter that analyzes internet memes, tracks them to their sources, and best of all predicts new memes.

Answer to Last Issue’s Challenge The PL/SQL Challenge quiz in last issue’s “The Joy of Low-Hanging Fruit” article asked whether four statements about BULK COLLECT and FORALL were correct. Answers (a) and (b) were incorrect because the major impact of a switch from row-by-row processing to bulk processing is on process global area (PGA), and you can use BULK COLLECT with both static and dynamic queries. Answers (c) and (d) were correct. You can have only one nonquery DML statement inside a FORALL statement, and to make sure the SQL engine attempts to execute every DML statement generated by FORALL, add the SAVE EXCEPTIONS clause.

The developers at extremememe.info are very knowledgeable about Oracle Database and PL/SQL, but as is the case with many developers, they have little time to explore new technology. They mostly struggle to keep up with the demands of their users. And even when they discover a great new feature, it can be a challenge to convince management to commit the resources to apply the new feature to stable production code.

As demonstrated in my last article, I helped the developers at extremememe.info improve the performance of a critical batch job by applying bulk processing (BULK COLLECT and FORALL).

A week later, I visited extremememe.info to see how they were doing. Joe, one of their newer developers, pulled me aside to show me a program he’d been working on.

“It’s another batch job,” he explained. “I need to keep track of memes that were referenced in the job, and then avoid processing those memes again within the job. We’ve also expanded the memes that we are tracking; instead of just memes for humans we are also working with memes for dogs, so we have to keep those two lists separate. I remember you saying that I should do everything I possibly can in SQL, and that I should build APIs with packages, so that’s what I did.”

And then he showed me. I was most impressed with the clarity and readability of his code, and horrified by his approach.

First, he created a table to hold the memes that had already been processed:

CREATE TABLE temp_processed_memes 
(meme VARCHAR2(1000) UNIQUE)
/

Then he put together the simple and intuitive package specification shown in Listing 1.

Code Listing 1: The original (Joe’s) meme_tracker package specification

PACKAGE meme_tracker
   IS
      PROCEDURE reset_memes;
      FUNCTION already_saw_meme (
            meme_target_in VARCHAR2, meme_in IN VARCHAR2)
         RETURN BOOLEAN;
      PROCEDURE add_meme (
         meme_target_in VARCHAR2, meme_in IN VARCHAR2);
   END;

“Congratulations, Joe,” I told him. “Your specification doesn’t have a single comment, and it doesn’t need one. You came up with clear, understandable names so that the code tells its own story. Show me more.”

So Joe showed me the package body (in Listing 2), and then he opened up the script he was using (in Listing 3) to test the package and demonstrate its intended usage.

Code Listing 2: The meme_tracker package body

PACKAGE BODY meme_tracker
IS
   PROCEDURE reset_memes
   IS
   BEGIN
      DELETE FROM temp_processed_memes;
      COMMIT;
   END;
   FUNCTION already_saw_meme (
         meme_target_in VARCHAR2, meme_in IN VARCHAR2)
      RETURN BOOLEAN
   IS
   BEGIN
      SELECT 'x' INTO l_dummy
        FROM temp_processed_memes
       WHERE meme = meme_target_in || '-' || meme_in;
      RETURN TRUE;
   EXCEPTION WHEN NO_DATA_FOUND THEN RETURN FALSE;
   END;
   PROCEDURE add_meme (meme_target_in VARCHAR2, meme_in IN VARCHAR2)
   IS
   BEGIN
      INSERT INTO temp_processed_memes (meme) VALUES (
         meme_target_in || '-' || meme_in);
      COMMIT;
   EXCEPTION WHEN DUP_VAL_ON_INDEX THEN NULL;
   END;
END;

Code Listing 3: Test/demonstration script for meme_tracker package

BEGIN
   meme_tracker.reset_memes;
   meme_tracker.add_meme ('DOG', 'CAT');
   meme_tracker.add_meme ('HUMAN', 'CAT VIDEO');
   meme_tracker.add_meme ('HUMAN', 'HONEY BOO BOO');
   IF meme_tracker.already_saw_meme ('DOG', 'CAT')
   THEN
      DBMS_OUTPUT.PUT_LINE ('DOG-CAT-FOUND');
   END IF;
   IF NOT meme_tracker.already_saw_meme ('DOG', 'CAT VIDEO')
   THEN
      DBMS_OUTPUT.PUT_LINE ('DOG-CAT VIDEO-NOT FOUND');
   END IF;
END;

I asked Joe three questions:

“Do you always reset the processed memes table—temp_processed_memes—before starting the job?” His answer: Yes.
“Do you ever access the processed memes table after ending the job?” His answer: No.
“What do you want first—the good news or the bad news?” Joe asked for the good news.

“OK, the good news is that you took an important piece of advice, and you applied it: you used lots of SQL. The bad news, unfortunately, is also that you used lots of SQL.”

Joe looked rather confused, so I took a step back (conceptually) and explained the situation as follows:

Without doubt, the general strategy that Oracle Database developers should follow is to do everything they can in SQL. What they cannot do in SQL, they should do in PL/SQL. And what they cannot do in PL/SQL, they should do in whatever language they are using to build their user interfaces. (This is a “mantra” created by Tom Kyte and to which I wholeheartedly subscribe.)

There is, however, a hidden assumption in this set of language priorities: the task at hand should be well suited to and require SQL. To offer an obvious example, if you need to create a radio group to display on a web page, you don’t want to use SQL for that.

In the case of the meme_tracker package, however, it is code running in the database, implemented in PL/SQL. So why wouldn’t it make sense to use SQL? The prefix “temp” on the table name pretty well gives it away: Joe decided to use a persistent relational table as the container for his processed memes. But, in his mind, it was just a temporary table. Insert the data from the current job, use it during the job, and empty the table.

No doubt about it, Joe was using the table as a container for temporary, session-specific data. That’s why Joe called it a temporary table. Unfortunately, calling it temporary doesn’t make it temporary. When you execute a CREATE TABLE statement, that table is permanent (until you drop it), even if you take all the steps needed to ensure that there is nothing in the table.

So what’s wrong with that? If the data is really only needed within your session, then you are paying a substantial penalty in performance to read and write tables. Plus, you have created a database object that is not needed: unnecessary “noise” in your schema. Finally, this could create substantial confusion in the future when other developers need to work with this package and try to figure out where and how this table is used besides this one process.

Use a Different Kind of Table

“If I don’t use SQL and don’t use a database table, how am I supposed to keep track of my memes?” asked Joe.

“Simple,” I replied. “Use a different kind of table.”

I imagine some readers are now thinking: well, of course. Use a global temporary table. That way the data in the table will not persist past the session. It’s true that a global temporary table would address a part of the problem with Joe’s approach. But it still relies on SQL when it is not needed, and it still will not offer the best performance for in-session data.

No, for this kind of situation, Joe should take a look at PL/SQL-specific tables, otherwise known as collections. PL/SQL developers can declare, populate, and work with data in a collection entirely within PL/SQL. That data exists only for the duration of the session, and it is much faster to read/write collections than it is to read/write relational tables.

I pointed to one more concern I had about Joe’s meme_tracker package approach. His package subprograms accept a meme target and a meme, but his table includes just a single meme column. Joe concatenates the two values with a dash as a separator. But what if a meme or meme target contains a dash? Joe could add rules and code to ensure that this doesn’t happen, but it would make much more sense to build these two distinct pieces of data into the design.

For a relational table, this means having two columns (and in many cases, two tables). In the case of collections, you can instead take advantage of a very nifty feature: nested collections, or collections within collections.

Backward-Compatibility Concerns

Before reimplementing an existing piece of code, you should always check to see where and how it is being used. If it is not yet in use, then you have complete freedom to change the package specification and body. If other developers have already been making calls to subprograms in a package, you have a responsibility to those developers to make sure that all existing invocations remain valid.

In the case of the meme_tracker package, I will be changing the way Joe implemented the data structures used to support the package API. Because Joe had already broken out the meme target from the meme in his parameter lists, it shouldn’t be necessary to make any changes to the specification. All existing invocations, therefore, will not be affected.

Collections in the Package Body

When I am done with the meme_tracker package rewrite, I will no longer need the temp_processed_memes table. But I don’t want to drop it until the rewrite is complete, because it would disrupt existing users.

So let’s begin the rewrite by removing all references to that table!

Listing 4 contains the reimplementation of the meme_tracker package body.

Code Listing 4: meme_tracker package body reimplemented

PACKAGE BODY meme_tracker
IS
   SUBTYPE meme_target_t IS VARCHAR2(100);
   SUBTYPE meme_t IS VARCHAR2(1000);
   c_was_processed CONSTANT BOOLEAN := TRUE;
   TYPE memes_for_target_t IS TABLE OF BOOLEAN
      INDEX BY meme_t;
   TYPE processed_memes_t IS TABLE OF memes_for_target_t
      INDEX BY meme_target_t;
   g_processed_memes processed_memes_t;
   PROCEDURE reset_memes
   IS
   BEGIN
      g_processed_memes.DELETE;
   END;
   FUNCTION already_saw_meme (
         meme_target_in VARCHAR2, meme_in IN VARCHAR2)
      RETURN BOOLEAN
   IS
   BEGIN
      RETURN g_processed_memes 
               (meme_target_in)
                   (meme_in) = c_was_processed;
   EXCEPTION 
      /* PL/SQL raises NDF if you try to "read" a collection
         element that does not exist */
      WHEN NO_DATA_FOUND THEN RETURN FALSE;
   END;
   PROCEDURE add_meme (
      meme_target_in VARCHAR2, meme_in IN VARCHAR2)
   IS
   BEGIN
      g_processed_memes (meme_target_in)(meme_in) 
         := c_was_processed;   
   END;
END;

Defining the collections. Collections are data structures, just like relational tables. I have learned over the years, mostly through painful lessons (that is, mistakes), that if I don’t get the design of my tables right, the resulting coding exercise is much more difficult. Conversely, if I do take the time to think through my table design, I end up writing a much smaller amount of much more intuitive code. The same holds true with collections.

And when you are working with nested collections, sorting out the structures can be tricky. In this regard, I have two suggestions:

Use subtypes to give names to all application-specific data with which you will be working.
Employ the same data design techniques with collections as you would with tables.

Here is the code I wrote in the new meme_tracker package body (in Listing 4) to define my collections:

1  SUBTYPE meme_target_t IS 
      VARCHAR2(100);
2  SUBTYPE meme_t IS VARCHAR2(1000);
3  c_was_processed 
      CONSTANT BOOLEAN := TRUE;
4  TYPE memes_for_target_t IS 
      TABLE OF BOOLEAN
5     INDEX BY meme_t;
6  TYPE processed_memes_t IS 
        TABLE OF memes_for_target_t
7     INDEX BY meme_target_t;
8  g_processed_memes processed_memes_t;

Let’s take the code one line at a time.

1  SUBTYPE meme_target_t IS 
      VARCHAR2(100);

I declare my first subtype, which is nothing more than an alias for an existing or base type. In this case, I am simply “hiding” the fact that a meme target is a string with a maximum size of 100 characters. You might ask: Why bother? And the answer lies in lines 4 through 7.

2  SUBTYPE meme_t IS VARCHAR2(1000);

I declare another subtype, this time “hiding” the fact that a meme is a string with a maximum size of 1,000 characters.

3  c_was_processed 
      CONSTANT BOOLEAN := TRUE;

I need to indicate that a meme has been processed, so I create a constant to hold that indicator value. As you will see in a later section, neither the value nor its type matters; I could have declared this constant (and the following collection type) to be DATE or NUMBER.

4  TYPE memes_for_target_t IS 
      TABLE OF BOOLEAN
5     INDEX BY meme_t;

My first collection type declaration! It’s a list of TRUE-FALSE values: has a meme been processed or not? But which meme? The INDEX BY clause of the type declaration gives me the answer: elements in collections of this type are indexed by the name of the meme.

When I showed Joe these declarations, he was surprised by that INDEX BY clause. “I thought you could only index by PLS_INTEGER with collections,” he said. “It looks like you are indexing by string—indirectly.”

Joe is right: my INDEX BY clause uses a SUBTYPE that’s based on a VARCHAR2 declaration. And that’s not only perfectly valid but highly efficient, at least for relatively small strings (less than 100 characters or so).

I took this approach for two reasons:

A collection declared using this type is indexed by a string, not an integer (the usual way that collections in PL/SQL and arrays are indexed). This means that once I have the meme, I can very quickly and easily find out if it is in the collection and therefore has been processed.
I use the subtype to make my code self-documenting. Without the subtype, I would have needed to add a comment if I had any hope of a person understanding the significance of the INDEX BY clause:
```
/* Indexing by the meme */
TYPE memes_for_target_t IS 
   TABLE OF BOOLEAN
   INDEX BY VARCHAR2(1000);
```

Comments have their place in code, and they should certainly be used to explain complex algorithms. I always prefer, however, to let the code speak for itself—tell its own story—and the subtype can play an oversize role (compared to the simplicity of the feature itself) in this storytelling.

So I have a collection (or list) of memes. Am I ready to write my subprograms now? No. My memes are categorized by their “target” (HUMAN or DOG). I don’t see any evidence of a target in that collection declaration.

Does this mean I am going to concatenate the target to the meme, separating them with a colon? I suppose I could, but that would be ugly and error prone. Instead, I place this meme collection inside another collection.

6  TYPE processed_memes_t IS 
      TABLE OF memes_for_target_t
7     INDEX BY meme_target_t;

My second collection type declaration is the “outer” collection. Each element in this collection is a collection of type memes_for_target_t. Not only that, this new collection type is indexed by the meme target, as you can see by my use of the appropriate subtype (meme_target_t ) in the INDEX BY clause.

8  g_processed_memes processed_memes_t;

And, finally, I declare the actual collection variable that will hold all my lists of processed memes. I define it at the package level, rather than inside any particular packaged subprogram, which means that the state of that collection will persist in my session in between calls to the subprograms. It is, in essence, a global variable that can be referenced only inside the package body, and so I use the “g_” prefix.

What’s a Method?

The DELETE procedure is referred to as a method because it is called using object-oriented syntax, rather than procedural syntax. The syntax for deleting elements from a collection using a standard procedure looks like this:

DELETE (my_collection);

The syntax for deleting elements from a collection using the DELETE method looks like this:

my_collection.DELETE;

My collection design is complete. Now it's time to implement the subprograms and test the results.

Resetting the processed memes. In the original SQL-based implementation of the meme_tracker package, Joe dropped all the rows in the table and then committed that transaction. (And, by the way, the commit and rollback in that original implementation were completely unnecessary, because the rows were being used only in that session.)

With the reset_memes procedure in my collection implementation, I simply use the DELETE method: a procedure that is defined for all collections and is invoked by attaching it to the collection name with dot notation.

PROCEDURE reset_memes
  IS
  BEGIN
     g_processed_memes.DELETE;
 END;

Adding a processed meme. To indicate that a meme has been processed, I need to add it to the list of memes for the specified target. Again, because of my careful collection design, I get the job done with a minimum of code in the meme_tracker package body:

PROCEDURE add_meme (
   meme_target_in VARCHAR2, 
   meme_in IN VARCHAR2)
  IS
  BEGIN
     g_processed_memes 
        (meme_target_in)(meme_in) 
        := c_was_processed;   
END;

If you are new to nested collections, the syntax in this procedure may look quite odd. So let’s parse the add_meme procedure, step-by-step:

g_processed_memes is the identifier that “points” to the collection.
(meme_target_in) is the index value of a single element in g_processed_names. But, remember, each element is actually a collection in its own right: a list of processed memes for a given target.
(meme_in) is the index value of a single element in g_processed_names (meme_target_in). Each element in this collection is the name of a meme.
c_was_processed is the value assigned to the list of processed memes. At first glance, this might seem a little odd. I always assign the same value to the collection? Then how do I know which meme has been processed? By looking at the index value, which is the name of the meme.

That’s why it doesn’t matter what value I use to add each new collection element—I need and care only about the index value.

Checking for a processed meme. It doesn’t take much code to determine if a meme has already been processed, when using a nested collection. For reference, here is the header of the already_saw_meme function from the meme_tracker procedure specification:

FUNCTION already_saw_meme (
      meme_target_in VARCHAR2, 
      meme_in IN VARCHAR2)
   RETURN BOOLEAN

Before I show you the new implementation, let’s review the three scenarios I need to handle in the function:

The combination of meme target and meme passed to the function is defined in the collection.
No meme with the specified target (meme_target_in) has yet been processed.
At least one meme with the specified meme target has been processed, but not the meme passed to already_saw_meme in the current call to the function (meme_in).

For scenario 1, the function will find a defined element and can return TRUE.

For scenario 2, there will not be an element in g_processed_memes for that meme target, and therefore certainly no element in g_processed_memes (meme_target_in) for the meme_in value.

For scenario 3, there will be an element in g_processed_memes for that meme target, but no index value in g_processed_memes (meme_target_in) for the meme_in value.

Now I am ready for the function body:

BEGIN
   RETURN g_processed_memes 
            (meme_target_in)
                (meme_in) = 
           c_was_processed;
   EXCEPTION 
      WHEN NO_DATA_FOUND 
         THEN RETURN FALSE;
   END;

The RETURN statement takes care of scenario 1: the left side of the Boolean equality expression repeats the same construct used to record the meme as processed:

g_processed_memes 
   (meme_target_in)
       (meme_in)

The right side references the same value I used to define that element in the nested collection: c_was_processed.

The EXCEPTION section handles both scenarios 2 and 3. It might seem odd to see a handler for NO_DATA_FOUND when there is no SELECT-INTO in the body of the function. It makes perfect sense, however, when you know that the PL/SQL runtime engine raises this same exception when you try to “read” an element in a collection that has not been defined.

So if scenario 2 is true, there is no element defined in g_processed_memes for index value meme_target_in, and this code

g_processed_memes 
    (meme_target_in)

results in a NO_DATA_FOUND exception being raised.

If scenario 3 is true, g_processed_memes(meme_target_in) identifies an element in the outer collection, but g_processed_memes(meme_target_in)(meme_in) remains undefined and so PL/SQL raises the NO_DATA_FOUND exception. In both cases, this meme has not yet been processed, so FALSE is returned.

And that’s the reimplemented meme_tracker package.

After I finished showing it to Joe, I could see him wrestling with the fact that such a small amount of relatively trivial code could do so much. The elegance of the code is, however, a direct consequence of my up-front analysis and the resulting decision to use nested collections and string indexing.

But Is It Faster?

Both implementations of the meme_tracker package (SQL-based and collection-based) require a small amount of code. Furthermore, the collection implementation is likely to be a bit more difficult for someone unfamiliar with collections to understand, compared to the SQL solution. So why use collections for this requirement?

You do not need SQL, because the data does not need to persist beyond the duration of the job. Yet you pay a price in performance for using SQL. That may come as a surprise, because Oracle Database so thoroughly optimizes the performance of SQL statements.

No matter how much SQL performance is optimized, however, there are a certain number of CPU cycles that must be consumed to use SQL from PL/SQL., due to the following:

Oracle Database must switch from the PL/SQL runtime engine to the SQL runtime engine, passing the SQL statement and bind variables. This has a cost associated with it.
The SQL engine must parse the SQL statement or confirm that the SQL statement has already been parsed.
The SQL statement must change or retrieve data from the system global area (SGA).

If I use collections instead, then Oracle Database doesn’t have to switch to the SQL engine; no SQL statement parsing is needed; and the data is stored in the program global area (PGA) for that job’s session. Data can be changed and retrieved in the PGA much more quickly than in the SGA.

I put together a script to compare the performance of the original SQL and collection implementations of the meme_tracker package (meme_tracker.sql, available at bit.ly/1mLLesO); the results are shown in Table 1.

Activity	SQL Time	Collections Time
Add 100,000 memes	17 seconds	.1 seconds
Check to see if processed (100,000 calls to already_saw_meme)	4 seconds	.08 seconds

Table 1: Performance comparison

Use SQL Before PL/SQL—But . . .

“So,” said Joe, after all was said and done, “the lesson I have learned from this rewrite is to stop relying on relational tables and SQL, and instead write programs that use collections.”

A part of me wanted to howl in frustration. But I stayed calm and reminded myself that any lack of comprehension by Joe is my fault, not his.

“May I suggest a slight modification to that lesson?” Joe nodded. I summed it up as follows:

“Use the right tool for the right job. We programmers are paid to solve problems, not to come up with the most elegant SQL statement or the coolest PL/SQL algorithm.”

SQL is an incredibly powerful and efficient data manipulation language, and it is also optimized to move data in and out of relational tables, providing full ACID (atomicity, consistency, isolation, durability) support. You should take full advantage of SQL, but also recognize that there is a price to be paid for all that functionality.

If you have a requirement that does not need the full power of SQL, such as manipulation of data that is created and discarded entirely within your session, PL/SQL offers alternatives that are faster.

Take the Challenge

Each PL/SQL article offers a quiz to test your knowledge of the information provided in it. The quiz appears below and also at PL/SQL Challenge (plsqlchallenge.com), a website that offers online quizzes on the PL/SQL language as well as SQL, Oracle Application Express, database design, and deductive logic.

Here is your quiz for this article:

Which of the choices provide a replacement for the #REPLACE# tag so that the resulting block displays “1-Papua New Guinea” after execution?

DECLARE
   #REPLACE#
BEGIN
   l_population_by_country ('Papua New Guinea') := 7200000;
   l_howmany := l_population_by_country.COUNT;
   l_first := l_population_by_country.FIRST;
   DBMS_OUTPUT.put_line (l_howmany || '-' || l_first);
END;
/

SUBTYPE people_count_t IS INTEGER;
SUBTYPE country_t IS VARCHAR2 (1000);
TYPE population_t IS TABLE OF people_count_t
   INDEX BY country_t;
l_population_by_country   population_t;
l_howmany                 people_count_t;
l_first                   country_t;

TYPE population_t IS TABLE OF INTEGER
   INDEX BY VARCHAR2 (1000);
l_population_by_country   population_t;
l_howmany                 INTEGER;
l_first                   VARCHAR2 (1000);

/* Each row contains the population for a country.
   The country name is used as the index value. */
TYPE population_t IS TABLE OF INTEGER
   INDEX BY VARCHAR2 (1000);
l_population_by_country   population_t;
/* Population counts always integer: no partial people */
l_howmany                 INTEGER;
/* First, last etc. methods return country names */
l_first                   VARCHAR2 (1000);

Next Steps

Share

Persistence Doesn’t Always Pay

Sometimes you don’t want a database table and SQL. Sometimes you just need to collect your thoughts.