July/August 2017
In the next few columns, I’ll spend some time looking at new features in Oracle Database 12c Release 2. These features come from the “12 Things About Oracle Database 12c” presentation series that Chris Saxon and I (the AskTom team) gave at Oracle OpenWorld 2016 in San Francisco, California. (You can find the slides for these presentations on asktom.oracle.com, under the Resources tab.) In this article, I’ll take a look at some enhancements to external tables in Oracle Database 12c Release 2.
A Polyglot World

The dictionary definition for polyglot is "a person who knows and is able to use several languages." This naturally crept over into the IT software world, where the term polyglot programming refers to the use of multiple programming languages. More recently the term polyglot persistence has come to refer to the use of multiple database technologies to best meet the needs of various business requirements.
For example, to meet the stringent data integrity needs of financial transaction processing applications, Oracle Database is a natural technology fit, but for simpler key/value pair requirements, a business may opt for Oracle NoSQL Database. The fundamental unit of storage in many NoSQL databases is a document, and rather than using multiple relational structures, a NoSQL database stores data in much larger chunks, typically in documents of a proprietary format or an open standard such as JavaScript Object Notation (JSON). It is already possible to query data from, for example, a Hadoop Distributed File System (HDFS) datastore by creating and using an external table definition in Oracle Database:
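For illustration, a single customer record in such a document store might look like the following JSON sketch. The field names here are hypothetical, chosen to mirror the columns of the external table definition that follows:

```json
{
  "customer_number": 150,
  "customer_name": "ACME Corp",
  "postal_code": "90210"
}
```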
SQL> create table CUSTOMER_RAWDATA
     ( customer_number number,
       customer_name   varchar2(50),
       postal_code     char(5)
     )
     organization external
     ( type oracle_hdfs
       default directory TEMP
       access parameters
       (
         com.oracle.bigdata.cluster   = hadoop_clust
         com.oracle.bigdata.rowformat = delimited fields terminated by ','
       )
       location ('hdfs/p1a.dat',
                 'hdfs/p1b.dat',
                 'hdfs/p2.dat',
                 'hdfs/p3.dat'
                )
     );
But using this external table definition may lose the benefits of the parallelism that large NoSQL clusters afford. If the information for customer 150, for example, is contained only in the document p2.dat, a query to the external table will not have that insight, and all documents will have to be scanned. Similarly, if the data was natively partitioned in the source system, for example, via Hive partitions, the knowledge of those partitions will not be conveyed to the corresponding external table definition in the Oracle Database instance.
Partitioned External Tables with Oracle Database 12c Release 2

Oracle Database 12c Release 2 removes this previous "missing insight" limitation by introducing partitioned external tables to allow a partitioning definition to be included within the external table definition:
SQL> create table CUSTOMER_RAWDATA
     ( customer_number number,
       customer_name   varchar2(50),
       postal_code     char(5)
     )
     organization external
     ( type oracle_hdfs
       default directory TEMP
       access parameters
       (
         com.oracle.bigdata.cluster   = hadoop_clust
         com.oracle.bigdata.rowformat = delimited fields terminated by ','
       )
     )
     partition by range (customer_number)
     (
       partition p1 values less than (100) location ('hdfs/p1a.dat', 'hdfs/p1b.dat'),
       partition p2 values less than (200) location ('hdfs/p2.dat'),
       partition p3 values less than (300) location ('hdfs/p3.dat')
     );
Partitioned external tables are not solely for administrators who need to access nonrelational datasources. They can also be used by relational database developers for accessing local relational datasources to achieve performance benefits, because all the usual optimizations for partitions, such as partition elimination, are available for partitioned external tables. In the following example, I have two datasets of employee data, one file for each region—Australia (AU) and New Zealand (NZ).
File: emp_au.dat

"AU",7369,"SMITH","CLERK",7902,17/DEC/80,800,,20
"AU",7499,"ALLEN","SALESMAN",7698,20/FEB/81,1600,300,30
"AU",7521,"WARD","SALESMAN",7698,22/FEB/81,1250,500,30
"AU",7566,"JONES","MANAGER",7839,02/APR/81,2975,,20
...

File: emp_nz.dat

"NZ",8369,"SMITH","CLERK",7902,17/DEC/80,800,,20
"NZ",8499,"ALLEN","SALESMAN",7698,20/FEB/81,1600,300,30
"NZ",8521,"WARD","SALESMAN",7698,22/FEB/81,1250,500,30
"NZ",8566,"JONES","MANAGER",7839,02/APR/81,2975,,20
...
To obtain the set of all employees in both regions, I can define a partitioned external table by using the conventional list partition syntax:
SQL> create table ext_emp
     ( region   varchar(2),
       empno    number(4),
       ename    varchar2(10),
       job      varchar2(9),
       mgr      number(4),
       hiredate date,
       sal      number(7,2),
       comm     number(7,2),
       deptno   number(2)
     )
     organization external
     ( type oracle_loader
       default directory temp
       access parameters
       ( records delimited by newline
         badfile 'emp.bad'
         logfile 'emp.log'
         fields terminated by "," optionally enclosed by '"' ldrtrim
         reject rows with all null fields
         ( region,
           empno,
           ename,
           job,
           mgr,
           hiredate,
           sal,
           comm,
           deptno
         )
       )
     )
     partition by list (region)
     (
       partition p_au values ('AU') location ('emp_au.dat'),
       partition p_nz values ('NZ') location ('emp_nz.dat')
     ) reject limit unlimited;
If I defined the external table as nonpartitioned, with both files listed in a single LOCATION clause, all queries would need to scan both files. But with a partitioned external table definition, partition elimination enables access to each file in isolation if the predicates allow it. For example, a query to just the employees in region AU yields the following execution plan, showing that only the first partition, P_AU, was required to satisfy the query.
SQL> set autotrace traceonly explain
SQL> select * from ext_emp where region = 'AU';

Execution Plan
----------------------------------------------------------
Plan hash value: 2018056040

--------------------------------------------------------------------------------------------------
| Id | Operation                    | Name    | Rows | Bytes | Cost (%CPU)| Time     |Pstart|Pstop|
--------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT             |         |   82 |  7380 |     29 (0) | 00:00:01 |      |     |
|  1 |  PARTITION LIST SINGLE       |         |   82 |  7380 |     29 (0) | 00:00:01 |    1 |   1 |
|  2 |   EXTERNAL TABLE ACCESS FULL | EXT_EMP |   82 |  7380 |     29 (0) | 00:00:01 |    1 |   1 |
--------------------------------------------------------------------------------------------------
Note that the database does not validate the external data to ensure that it aligns with the partition definition. For example, I added an invalid row (REGION = ‘XX’) into the emp_au.dat file. Then when I query for REGION = ‘XX’, I get no rows returned, because the partition definition, and hence subsequent optimizer partition elimination, defines that data to be impossible. However, scanning all the data by using a nonpartition key predicate reveals that the data is present in the file. It is your responsibility to ensure that the file data maps correctly to the partition definition.
SQL> select * from ext_emp where region = 'XX';

no rows selected

SQL> select * from ext_emp where empno = 1934;

RE      EMPNO ENAME      JOB             MGR HIREDATE        SAL       COMM     DEPTNO
-- ---------- ---------- --------- --------- --------- --------- ---------- ----------
XX       1934 MILLER     CLERK          7782 23-JAN-82      1300                    10

Process Control
As long as software applications have existed, process controls on the management, deployment, and maintenance of those applications have also existed. Entire practice management frameworks such as the Information Technology Infrastructure Library (ITIL), which have established a suite of standards and procedures for any part of the application development lifecycle, are now commonplace in most organizations. As part of current standards and procedures, structural changes to objects via data definition language (DDL) are much more tightly controlled. But unlike the DDL for a standard Oracle Database table, the success of a DDL statement defining an external table is not a guarantee that the definition is valid. As such, there are pros and cons to such tight controls over DDL deployment for external tables.
In this example, I’ve created an external table to read from a daily extract of employee data:
SQL> create table ext_emp
     ( empno    number(4),
       ename    varchar2(10),
       job      varchar2(9),
       mgr      number(4),
       hiredate date,
       sal      number(7,2),
       comm     number(7,2),
       deptno   number(2)
     )
     organization external
     ( type oracle_loader
       default directory TMP
       access parameters
       ( records delimited by newline
         fields terminated by ','
         missing field values are null
         ( empno, ename, job, mgr, hiredate, sal, comm, deptno )
       )
       location ('emp-2016-10-01.dat'));

Table created.
Although this definition may appear valid, given that no error was returned, the true test comes when I attempt to query the external table, which is when the column mappings and parsing of the external file take place. In my case, something is awry:
SQL> select * from ext_emp;
select * from ext_emp
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-30653: reject limit reached
I have forgotten to add the REJECT LIMIT clause to my table definition, and hence even a single invalid row in the source file data immediately terminates my SQL query. I probably should have taken advantage of the facility in the SQL*Loader feature of Oracle Database where a control file can be used to generate the DDL for an external table, using the EXTERNAL_TABLE=GENERATE_ONLY clause, which would have added the REJECT LIMIT clause for me automatically. (See “On Better Loading and Fact-Checking” for an example of this facility.)
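As a sketch, assuming a SQL*Loader control file named emp.ctl that describes the same employee extract, the invocation would look something like the following; with EXTERNAL_TABLE=GENERATE_ONLY, the generated CREATE TABLE statement (complete with its REJECT LIMIT clause) is written to the log file rather than executed:

```
sqlldr scott/tiger control=emp.ctl external_table=generate_only log=emp.log
```

The DDL can then be copied from emp.log and run as-is or adjusted first.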
So my problem is easily resolved; I just need to add a clause to my table DDL. But in a world of strict process control, DDL changes are typically seen as high-risk changes, requiring the full gamut of validations and checks to be exercised before approval. This is not a criticism of such process controls; they are critical for avoiding scenarios in which small errors lead to catastrophic consequences. But there should be recognition that not all DDL presents the same risk.
Another external table DDL risk example is the filename attribute. In my example above, the format of the filename emp-2016-10-01.dat suggests that this filename may need to be adjusted in the external table definition every day as a new employee file is generated with a fresh date stamp in its name. But demanding a complete suite of checks and balances for such a small DDL change would be overkill.
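Keeping such a definition current has traditionally meant running a small piece of DDL every day, along these lines (a sketch using the next day's hypothetical filename):

```sql
SQL> alter table ext_emp location ('emp-2016-10-02.dat');
```

It is exactly this class of tiny, repetitive change for which heavyweight DDL approval processes feel disproportionate.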
More Flexibility with Oracle Database 12c Release 2

Oracle Database 12c Release 2 recognizes that some modifications to external table definitions can be considered a lower-risk exercise than more-typical DDL changes to tables.
In Oracle Database 12c Release 2, without any DDL changes to the table, some external table parameters can be modified directly within the execution of a query against that external table. To solve the problem of my missing REJECT LIMIT clause, for example, I can add an EXTERNAL MODIFY clause directly to my SELECT statement:
SQL> select * from ext_emp
     external modify ( reject limit unlimited );

     EMPNO ENAME      JOB              MGR HIREDATE         SAL
---------- ---------- --------- ---------- --------- ----------
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600
      7521 WARD       SALESMAN        7698 22-FEB-81       1250
      7566 JONES      MANAGER         7839 02-APR-81       2975
Similarly, if the LOCATION filename changes from day to day, I can override the filename each time I query the external table:
SQL> select * from ext_emp
     external modify ( location ('emp-2016-10-02.dat') );

     EMPNO ENAME      JOB              MGR HIREDATE         SAL
---------- ---------- --------- ---------- --------- ----------
      7902 FORD       ANALYST         7566 03-DEC-81       3000
      7934 MILLER     CLERK           7782 23-JAN-82       1300
      7566 JONES      MANAGER         7839 02-APR-81       2975
The EXTERNAL MODIFY clause permits modification of the external table definition without requiring DDL changes and the process control implications that typically follow. The EXTERNAL MODIFY clause allows the modification or override of the following clauses only:

- DEFAULT DIRECTORY
- LOCATION
- ACCESS PARAMETERS
- REJECT LIMIT
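Multiple overrides can also be combined in a single EXTERNAL MODIFY clause. For example, this sketch (using a hypothetical filename) overrides both the location and the reject limit in one query:

```sql
SQL> select * from ext_emp
     external modify ( location ('emp-2016-10-03.dat') reject limit unlimited );
```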
With Oracle Database 12c Release 2, extensions to external table functionality provide more-flexible data access across the datasources that are becoming commonplace in a polyglot persistence landscape. Partitioned external tables deliver a more direct mapping to the partitioning mechanism in those nonrelational datasources as well as realizing conventional partitioning performance benefits via partition elimination. Also, changing some of the defining attributes of external tables dynamically at query time removes the need for convoluted process controls that can impede developer productivity.
LEARN more about Oracle Database 12c Release 2
DOWNLOAD Oracle Database 12c Release 2