March/April 2017
In the next few columns, I’ll spend some time looking at new features in Oracle Database 12c Release 2. These features come from the “12 Things About Oracle Database 12c” presentation series that Chris Saxon and I, the Ask Tom team, gave at Oracle OpenWorld 2016 in San Francisco. (See asktom.oracle.com, under the Resources tab.) In this article, I’ll take a look at a suite of Oracle Database 12c Release 2 improvements for validating data that make loading or querying data from unreliable datasources easier.
Background
At the UK Oracle User Group conference (December 2016), during my presentation on Oracle Database 12c Release 2 features, I projected a slide containing the acronym QTFWC. I could see many attendees in the audience nodding, acknowledging that this was the next acronym to be digested in an information technology industry already filled to the brim with acronyms.
The acronym was fictional; there’s no such product or feature as QTFWC. It was merely a humorous reflection on a real issue that faces all developers dealing with data whose quality and correctness are unknown: producing Queries That Finish Without Crashing.
In times gone by, data that wasn’t in your database was in someone else’s database. But datasources now vary widely, and equally variable are the quality and correctness of that data. Developers tasked with querying, cleansing, and loading this data face a chicken-and-egg problem when it comes to datatypes: the only way to know if a string can be converted to, say, a valid number is to attempt the conversion. But to attempt the conversion is to also run the risk of the conversion’s being invalid and having the entire query crash.
Consider a load process that must load data from a table called STAGING_SALES, representing sales data staged from several external sources. The task seems simple enough: create an INSERT statement to transfer the raw data into the target table ANNUAL_SALES.
SQL> insert into ANNUAL_SALES
     select *
     from STAGING_SALES;
A developer may patiently watch this statement execute for minutes, or possibly hours, eagerly waiting for its completion, only to see the following:
Elapsed: 06:12:34.00
ERROR at line 1:
ORA-01847: day of month must be between 1 and last day of month
The error suggests a problem with a DATE conversion, but there are no clues as to what data caused it. Adding to the developer’s frustration is that the data conversion error probably occurred some three hours into the total elapsed time of six hours, with the last three hours representing the effort of undoing all the changes applied so far. (See the “Instead of Waiting” sidebar for information on how to determine whether a data manipulation language [DML] statement has commenced a rollback before it completes.)
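The sidebar itself is not reproduced here. As a minimal sketch of one common way to check, and not necessarily the sidebar's method, a second session can watch the undo held by the loading session's transaction. The bind variable :sid is an assumption standing for the SID of the session running the INSERT, and the monitoring session needs access to the V$ views.

```sql
-- Run repeatedly from a second session while the DML executes.
-- :sid (hypothetical bind variable) is the SID of the loading session.
select s.sid,
       t.used_urec,   -- undo records currently held by the transaction
       t.used_ublk    -- undo blocks currently held by the transaction
from   v$session s
       join v$transaction t
         on t.addr = s.taddr
where  s.sid = :sid;
```

If USED_UREC keeps rising between executions, the statement is still applying changes. If it has started to fall, the statement is most likely rolling back.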
An examination of some of the data in the STAGING_SALES table reveals the conversion issue. A column named CREATED_DATE contains string data, some of which cannot be correctly converted to a date in the equivalent ANNUAL_SALES table.
SQL> select CREATED_DATE
     from STAGING_SALES;

CREATED_DATE
--------------------
01-FEB-2016
12-MAR-2012
54-AUG-2013
09-SEP-2014
23-OCT-2012
...
I am trivializing the true troubleshooting effort here. Analyzing source data for correctness, especially if the source data is millions of rows over dozens of columns, is an arduous task.
Historical Solutions
In the past, intercepting an error in data conversion typically required a PL/SQL function to act as a wrapper around standard facilities. In the insert-into-ANNUAL_SALES example, to check for a valid date in the CREATED_DATE column and prevent the statement from failing, I first create a date_checker PL/SQL wrapper function:
SQL> create or replace
     function date_checker(p_str varchar2) return date is
       l_dte date;
     begin
       -- attempt the conversion with the expected format mask
       l_dte := to_date(p_str,'dd-mon-yyyy');
       return l_dte;
     exception
       -- any conversion failure yields NULL instead of an error
       when others then return null;
     end;
     /

Function created.
The date_checker function returns the source data as a DATE datatype if the conversion can be performed and returns NULL otherwise. In Oracle Database 12c Release 1, the PL/SQL code can be folded directly into the SQL statement itself to avoid cluttering the data dictionary. For more information on using PL/SQL functions within a WITH clause, refer to the documentation.
SQL> insert /*+ with_plsql */ into ANNUAL_SALES
     with
       function date_checker(p_str varchar2) return date is
         dte date;
       begin
         dte := to_date(p_str,'dd-mon-yyyy');
         return dte;
       exception
         when others then return null;
       end;
     select date_checker(created_date) valid_date, ...
     from staging_sales;
Although this solves the data conversion problem, there is increased complexity in the SQL code as well as the performance overhead of calling a PL/SQL function potentially millions of times.
More Validation Control with Oracle Database 12c Release 2
Oracle Database 12c Release 2 adds attempt-to-convert-and-catch-errors functionality natively to the database via the new VALIDATE_CONVERSION function and extensions to the existing CAST and TO_datatype suite of conversion functions.
Returning to the data loading example, here’s how the new VALIDATE_CONVERSION can be used in the INSERT statement:
SQL> insert into ANNUAL_SALES
     select to_date(created_date, 'dd-mon-yyyy'), ...
     from STAGING_SALES
     where validate_conversion(
             created_date as date, 'dd-mon-yyyy'
           ) = 1;
Instead of returning an error because of string data that could not be converted to a DATE datatype, the new VALIDATE_CONVERSION predicate in the WHERE clause picks up only the rows in which the CREATED_DATE column can be converted to a DATE datatype with the supplied format mask, 'dd-mon-yyyy'. Success is indicated by a return value of 1. (If the conversion would not have succeeded, the function returns 0.) Because only the rows that could be converted are returned, I can now apply the TO_DATE function in the SELECT portion of the INSERT with the assurance that it will not cause the statement to fail.
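The same function can be turned around to locate the offending data before the load starts. The following is a minimal sketch that assumes only the STAGING_SALES table and CREATED_DATE column used above; the ROW_COUNT alias is my own.

```sql
-- List the distinct CREATED_DATE strings that cannot be converted
-- with the 'dd-mon-yyyy' mask, most frequent first.
-- VALIDATE_CONVERSION returns 0 for these values.
select created_date,
       count(*) as row_count
from   staging_sales
where  validate_conversion(created_date as date, 'dd-mon-yyyy') = 0
group  by created_date
order  by row_count desc;
```

With the sample data shown earlier, 54-AUG-2013 would be among the values listed. Note that VALIDATE_CONVERSION returns 1 for a NULL input, so rows with a missing CREATED_DATE will not appear in this report and need a separate IS NULL check if they matter.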
Using VALIDATE_CONVERSION as a predicate ensures that the statement will not crash, but it also keeps complete rows from the source data from being loaded into the target table. What about other requirements for row handling and values? What if you must replace erroneous data with a default value but retain the row, so that before and after row counts are consistent in the load process?
To address that requirement, the TO_datatype conversion functions have been extended in Oracle Database 12c Release 2 to optionally return a default value if the data conversion fails.
The SALES_AMT column in the STAGING_SALES table also contains string data, but that data should be loaded as number values into the ANNUAL_SALES table. A sample of the SALES_AMT data shows that one of the rows has an erroneous comma, which would typically cause an error for the standard TO_NUMBER function call.
SQL> select SALES_AMT
     from STAGING_SALES;

SALES_AMT
----------
120000
172125
128000
125,000
99500
...
By using the new extended syntax for TO_NUMBER, however, you can nominate a default value for use whenever the TO_NUMBER function fails:
SQL> select SALES_AMT,
            TO_NUMBER(SALES_AMT
              DEFAULT -1 ON CONVERSION ERROR) conv_sale
     from STAGING_SALES;

SALES_AMT   CONV_SALE
---------- ----------
120000         120000
172125         172125
128000         128000
125,000            -1
99500           99500
...
The other TO_datatype functions support the same extended functionality. Similarly, the CAST function supports the same extension for casting from one datatype to another.
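As a sketch of how these extensions might combine in the load itself, the statement below keeps every source row and substitutes defaults for values that fail to convert. The two-column target list and the sentinel values ('01-JAN-1900' and -1) are assumptions for illustration; the real ANNUAL_SALES table presumably has more columns.

```sql
-- Load every staged row; unconvertible values become sentinel defaults.
-- Target column names and sentinel values are hypothetical.
insert into annual_sales (created_date, sales_amt)
select to_date(created_date
                 default '01-JAN-1900' on conversion error,
               'dd-mon-yyyy'),
       cast(sales_amt as number
              default -1 on conversion error)
from   staging_sales;
```

For TO_DATE, the default is supplied as a string and converted with the same format mask as the source value. Choose sentinels that cannot occur in legitimate data, or use DEFAULT NULL, so that substituted values can still be told apart from real ones after the load.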
Summary
With Oracle Database 12c Release 2, extensions to data conversion functions and the new VALIDATE_CONVERSION function make data validation via SQL a breeze for database developers. SQL statements can self-validate data without the need for additional PL/SQL wrappers or nondatabase code to guard against conversion errors.
DISCLAIMER: We've captured these popular historical magazine articles for your reference. Links contained within these articles may no longer work. These articles are provided as-is, with no guarantee that the information within them is the best advice for current versions of the Oracle Database.