Question and Answer

Chris Saxon

Thanks for the question, Vikas.

Asked: October 09, 2015 - 1:20 pm UTC

Last updated: March 13, 2019 - 2:33 pm UTC

Version: 11g

Viewed 1000+ times

You Asked

Hi Team

My csv looks like below

123456,20,20,1500 --- this is the header
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs

I need to read the header info. Can you please tell me the best way to do this?

For loading the data, I am using an external table and skipping the first row. Now I need only the header info for some validations.

and Chris said...

Remove the skip clause from your external table and fetch the first row:

create table t (
  c1 varchar2(10),
  c2 varchar2(10),
  c3 varchar2(10),
  c4 varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    fields terminated by ','
  )
  location ('test.csv')
);

select * from t
where  rownum = 1;

C1         C2         C3         C4       
---------- ---------- ---------- ----------
123456     20         20         1500      


Uses test.csv with the following contents:

123456,20,20,1500
abc,efg,hij,klm
abc,efg,hij,klm
abc,efg,hij,klm
abc,efg,hij,klm


Comments

Vikas N, October 09, 2015 - 2:26 pm UTC

It doesn't work for my csv because I have set the limit for the 2nd field as varchar2(3 char), since I expect only 3 characters there, as a kind of validation. Also, my header has only 4 fields while the rest of the rows have 49 fields, so I have to use MISSING FIELD VALUES ARE NULL. Is there any other way to read the header info?

Example data

123456 2015110900 2015110910 1750
ABC123456789012 ABC 1.1 NEW XYZ000000041123 XYZ..49
ABC123456789012 ABC 1.1 NEW XYZ000000041123 XYZ..49
ABC123456789012 ABC 1.1 NEW XYZ000000041123 XYZ..49
ABC123456789012 ABC 1.1 NEW XYZ000000041123 XYZ..49


Chris Saxon
October 09, 2015 - 3:48 pm UTC

If some of the header values are constant, or you can identify them in some way, you can use the "load when" clause:

create table t (
  c1 varchar2(10),
  c2 varchar2(10),
  c3 varchar2(10),
  c4 varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    records delimited by newline
    load when (c1 = '123456')
    fields terminated by ','
  )
  location ('test.csv')
);

select * from t;

C1         C2         C3         C4       
---------- ---------- ---------- ----------
123456     20         20         1500      


File contents:

123456,20,20,1500
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs
abc,efg,hij,klm,nop,qrs


You'll need to create a separate external table for this though.
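
Another option, sketched here for illustration (this table and query are not from the original answer): give the header-only external table a single catch-all column, so the field count and field lengths of the data rows don't matter, then pick the header apart in SQL:

create table t_header_line (
  header_line varchar2(4000)
) organization external (
  default directory tmp
  access parameters (
    records delimited by newline
    fields ( header_line position(1:4000) char(4000) )
  )
  location ('test.csv')
);

select regexp_substr(header_line, '[^,]+', 1, 2) second_field
from   t_header_line
where  rownum = 1;

This is subject to the row-order caveat discussed in the comments below.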

Header assumes line order kept

Norbert, February 27, 2019 - 4:15 pm UTC

Hi,

If I understand correctly, reading the first line assumes that Oracle returns the file content in the original line order. But according to the question below, there is an issue with that:

https://asktom.oracle.com/pls/apex/asktom.search?tag=rownum-for-external-table

So, can we trust the first returned line to be the original file header, with and without parallel hints?

Thank you,
Norbert
Connor McDonald
March 02, 2019 - 6:33 am UTC

To quote from that referenced question, I said:

My point is this.

a) My *observation* is that the file data is always returned in the order that the data appears in the file.

b) Various bugs on MOS (which describe the data being returned *not* in that order) suggest that our intention is that (even with parallel) the data be returned in file order.

But both of those mean *nothing* unless you can find a definitive statement either on MOS or in the docs saying "Yes, the file data comes back in order".

It's quite possible that this is just an omission in the docs, and you could log a support call for that omission...but until it's *in* there, you can't be 100% sure.

The *only* place in the docs where I can find a reference to a guaranteed means of knowing the file sequence of the data is RECNUM, namely:

"Setting a Column to the Datafile Record Number
Use the RECNUM parameter after a column name to set that column to the number of the logical record from which that record was loaded. Records are counted sequentially from the beginning of the first datafile, starting with record 1. RECNUM is incremented as each logical record is assembled. Thus it increments for records that are discarded, skipped, rejected, or loaded. If you use the option SKIP=10, the first record loaded has a RECNUM of 11."

*That* is why I'm saying use RECNUM. Because (to my knowledge) that's the only option where if the data order in the file does *not* match the value assigned by RECNUM, you would have an avenue to raise it with Support.


So whilst I'm *confident* that the rows come back in order as per the file, I don't have a reference to *guarantee* that to you.

I'm trying to follow up internally whether we can get a note explicitly guaranteeing it.

Addenda: Checked with the group responsible for it:

No default order should *ever* be assumed. If you want the rows as per the file, add the RECNUM attribute and order by that.
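
For illustration, a sketch of that (assuming the four-column test.csv from above; the rec_no column and field names are arbitrary):

create table t_ordered (
  rec_no number,
  c1     varchar2(10),
  c2     varchar2(10),
  c3     varchar2(10),
  c4     varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    records delimited by newline
    fields terminated by ','
    missing field values are null
    ( c1 char(10), c2 char(10), c3 char(10), c4 char(10),
      rec_no recnum )
  )
  location ('test.csv')
);

-- the header is the record with recnum 1, whatever order the rows come back in
select * from t_ordered where rec_no = 1;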

pipe dream

Racer I., February 28, 2019 - 7:10 am UTC

Hi,

Although I can't try it yet because our DBAs are too restrictive, I'm fantasizing about quick data loading using external tables (with a preprocessor, maybe someday even from a client-side source), read and processed in parallel through pipelined cursor functions with autonomous transactions (I would use a single varchar field and do all the parsing/converting/mapping/rejecting in the function).
One key aspect there for me has always been no inter-row dependencies, i.e. each row must be processable on its own. So the order would essentially be random.
Of course you can assign a rownum/group id (for example in the preprocessor script via, say, awk, or in the external table definition) and then use that in the pipelined function's PARALLEL_ENABLE clause (grouping/clustering/sorting). But the result would be an arbitrary slow-down and very likely bad skew.
I don't know how Oracle handles parallelizing from a serial source like this. Would it assign individual rows in round-robin fashion among the parallel sessions, or would it assign chunks of rows? Either way there would be weaving/discontinuities, even if earlier rows will generally appear before later ones because the file is read serially. At least I can't think of a technique for reading the file in parallel, because you would still have to scan the whole file to find the row starts: jumping in and scanning for the next line break might find one inside a quoted field. And it's impossible for preprocessor output (or if the file is actually a named pipe).
Then spreading it out inside the server onto many parallel processes does make sense if you do a lot of processing.
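
For what it's worth, a minimal sketch of that pipelined construct (hypothetical type, function, and table names; the real parsing/converting/rejecting would go where the comment sits):

create type line_row as object (c1 number, c2 varchar2(100));
/
create type line_tab as table of line_row;
/
create or replace function parse_lines (src sys_refcursor)
  return line_tab pipelined
  parallel_enable (partition src by any)  -- ok because rows are independent
as
  l_line varchar2(4000);
begin
  loop
    fetch src into l_line;
    exit when src%notfound;
    -- per-row parsing/converting/mapping/rejecting happens here
    pipe row (line_row(
      to_number(regexp_substr(l_line, '[^,]+', 1, 1)),
      regexp_substr(l_line, '[^,]+', 1, 2)));
  end loop;
  return;
end;
/

-- ext_lines = a one-varchar2-column external table over the file
select * from table(parse_lines(cursor(select line from ext_lines)));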

Ok, this was kinda off-topic rambling.
What might help in the OP's case would be a second external table with a preprocessor script which just contains

head -1 %1
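
For illustration, the whole construct might look like this (a sketch; note that on a Unix server the preprocessor receives the data file path as $1, the script must be executable, and binaries should be referenced by full path):

-- get_header.sh in the tmp directory:
--   #!/bin/sh
--   /usr/bin/head -1 "$1"

create table t_header (
  c1 varchar2(10),
  c2 varchar2(10),
  c3 varchar2(10),
  c4 varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    records delimited by newline
    preprocessor tmp:'get_header.sh'
    fields terminated by ','
  )
  location ('test.csv')
);

select * from t_header;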

regards,
Connor McDonald
March 01, 2019 - 1:11 am UTC

I don't know how Oracle handles the parallelizing from a serial source like this

Multiple files = each slave gets a file

Single file = all work will be done by one slave *unless* the external table definition explicitly says fixed-width rows. Then we can split the file with simple sizing arithmetic.
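
For instance, a sketch with a made-up record layout (two 10-byte fields plus a newline, so 21-byte records the driver can split by arithmetic):

create table t_fixed (
  c1 varchar2(10),
  c2 varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    records fixed 21
    fields ( c1 position( 1:10) char(10),
             c2 position(11:20) char(10) )
  )
  location ('fixed.dat')
)
parallel 4;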

please tell me it ain't so

Racer I., March 07, 2019 - 10:28 am UTC

Hi,

I just read this (I should of course have done it sooner):
https://docs.oracle.com/en/database/oracle/oracle-database/19/sutil/oracle_loader-access-driver.html#GUID-AFD2D4C7-08FA-4B80-A137-E60BAE7BB150

about restrictions on parallel processing with external tables in general and preprocessors in particular. It reads as if parallel is switched off altogether in many cases where I would think it should be possible.
I.e. the driver reads/processes a single stream but passes it to X processing sessions (like with a SERIAL-TO-PARALLEL broadcast). Is this still possible, or do all the restrictions on the driver itself affect the rest of the statement too?
What if you MATERIALIZE (hint) the input stream and then read that in parallel (into the aforementioned PIPELINED construct)? It would be horribly inefficient to write all that data unnecessarily to disk, though.

regards,
Connor McDonald
March 09, 2019 - 2:28 am UTC

The problem there is we don't know what is in your pre-processor.

What if it was something as simple as:

awk '{print NR,$0}'

ie, to print the record number concatenated with the record. Parallelise that and you're toast.

If it's getting toward those kinds of boundary needs for parallel, I'd be more inclined to do that outside the external table (e.g. file splitting) and then nominate 'n' files in the definition.
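
For example (a sketch with hypothetical file and chunk names):

-- outside the database, e.g.:  split -l 250000 big.csv chunk_
-- which produces chunk_aa, chunk_ab, ... in the tmp directory

create table t_chunks (
  c1 varchar2(10),
  c2 varchar2(10),
  c3 varchar2(10),
  c4 varchar2(10)
) organization external (
  default directory tmp
  access parameters (
    records delimited by newline
    fields terminated by ','
  )
  location ('chunk_aa', 'chunk_ab', 'chunk_ac', 'chunk_ad')
)
parallel 4;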

fan dance

Racer I., March 11, 2019 - 8:22 am UTC

Hi,

I think I get the various cases where the external table driver itself cannot run in parallel. My hope was that this doesn't stop the whole statement from being parallelized inside the server. You can have a statement that reads table A with a NOPARALLEL hint and inserts with PARALLEL X (with PARALLEL DML enabled in the session), or that does a hash join to table B which has PARALLEL X, where the hash join itself is PARALLEL X too. The plan will show A being read serially and then fanning out with some serial-to-parallel step.

I was kinda hoping this could still happen in this case. I can't test it, though (it's unlikely that LiveSQL allows this somehow?).

regards,
Chris Saxon
March 13, 2019 - 2:33 pm UTC

The rest of the plan can run in parallel (on 18c at least):

create or replace directory tmp as '/tmp';
declare
  f utl_file.file_type;
begin
  f := utl_file.fopen ('TMP', 'test.txt', 'w');
  utl_file.put_line(f, 'col1,col2');
  for i in 1 .. 1000 loop
    utl_file.put_line(f, i || ',data');
  end loop;
  utl_file.fclose(f);
end;
/
create table t ( 
  c1 int, c2 varchar2(10)
) organization external ( 
  default directory tmp
  access parameters ( 
    records delimited by newline
    skip 1
    preprocessor tmp:'preprocessor.sh'
    fields terminated by ',' (
      c1, c2
    )
  )
  location ( 'test.txt' )
)
reject limit unlimited ;

create table t2 as 
  select level c1 from dual
  connect by level <= 100;

select /*+ gather_plan_statistics parallel ( t2 2 ) */
       mod ( t2.c1, 4 ), count(*)
from   t
join   t2
on     t.c1 = t2.c1
group  by mod ( t2.c1, 4 );

select * 
from   table(dbms_xplan.display_cursor(null, null, 'BASIC ALL +PARALLEL'));

PLAN_TABLE_OUTPUT                                                                                                                
EXPLAINED SQL STATEMENT:                                                                                                         
------------------------                                                                                                         
select /*+ gather_plan_statistics parallel ( t2 2 ) */        mod (                                                              
t2.c1, 4 ), count(*) from   t join   t2 on     t.c1 = t2.c1 group  by                                                            
mod ( t2.c1, 4 )                                                                                                                 
                                                                                                                                 
Plan hash value: 3286847861                                                                                                      
                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------   
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |   
------------------------------------------------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT                     |          |       |       |   345 (100)|          |        |      |            |   
|   1 |  PX COORDINATOR                      |          |       |       |            |          |        |      |            |   
|   2 |   PX SEND QC (RANDOM)                | :TQ10002 |     1 |    16 |   345   (1)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |   
|   3 |    HASH GROUP BY                     |          |     1 |    16 |   345   (1)| 00:00:01 |  Q1,02 | PCWP |            |   
|   4 |     PX RECEIVE                       |          |     1 |    16 |   345   (1)| 00:00:01 |  Q1,02 | PCWP |            |   
|   5 |      PX SEND HASH                    | :TQ10001 |     1 |    16 |   345   (1)| 00:00:01 |  Q1,01 | P->P | HASH       |   
|   6 |       HASH GROUP BY                  |          |     1 |    16 |   345   (1)| 00:00:01 |  Q1,01 | PCWP |            |   
|*  7 |        HASH JOIN                     |          |   102K|  1595K|   343   (1)| 00:00:01 |  Q1,01 | PCWP |            |   
|   8 |         PX BLOCK ITERATOR            |          |   100 |   300 |     2   (0)| 00:00:01 |  Q1,01 | PCWC |            |   
|*  9 |          TABLE ACCESS FULL           | T2       |   100 |   300 |     2   (0)| 00:00:01 |  Q1,01 | PCWP |            |   
|  10 |         PX RECEIVE                   |          |   102K|  1296K|   341   (1)| 00:00:01 |  Q1,01 | PCWP |            |   
|  11 |          PX SEND BROADCAST           | :TQ10000 |   102K|  1296K|   341   (1)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |   
|  12 |           PX SELECTOR                |          |       |       |            |          |  Q1,00 | SCWC |            |   
|  13 |            EXTERNAL TABLE ACCESS FULL| T        |   102K|  1296K|   341   (1)| 00:00:01 |  Q1,00 | SCWP |            |   
------------------------------------------------------------------------------------------------------------------------------   
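
Note lines 11-13 of the plan: the external table is still read serially (PX SELECTOR, S->P), but its rows are broadcast to the parallel slaves, which then do the join and group by in parallel.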

far out

Racer I., March 14, 2019 - 6:25 am UTC

Hi,

That's a relief.
Thanks for the effort to check.

regards,