Skip to Main Content
  • Questions
  • How to extract substrings enclosed within double quotes from a string

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Joyan.

Asked: February 12, 2017 - 7:42 pm UTC

Last updated: February 15, 2017 - 3:43 am UTC

Version: 11.0.2

Viewed 10K+ times! This question is

You Asked

I have a requirement to extract column names (there could be multiple instances) from a SQL string.
The column names will always be enclosed within double quotes.
e.g. "le code" = 802 AND "principal amount" > 1200 OR "branch id" = 'ABC'

Requirement 1:
In the above example I would have to extract the following:
le code
principal amount
branch id

How do I achieve that?

Requirement 2:
An extension to the above problem I want to replace any instance of double quoted pattern with REPLACE(logical_col_name, "<pattern>", physical_col_name).
Below should be the input and output:

Input:
"le code"

Output:
REPLACE(logical_name, "le code", physical_name)

I am not able to solve this using REGEXP_REPLACE?
I need to do this for > 1000 strings and store the results. Please suggest an efficient approach.

Thanks as always.

and Connor said...

If you need to parse out the columns you can use some PL/SQL for this

SQL> create or replace
  2  function word_finder(p_str varchar2) return sys.odcivarchar2list pipelined is
  3    in_quotes boolean := false;
  4    l_word varchar2(100);
  5  begin
  6    for i in 1 .. length(p_str) loop
  7      if substr(p_str,i,1) = '"' then
  8        if in_quotes then
  9           pipe row ( l_word );
 10           l_word := null;
 11        end if;
 12        in_quotes := not in_quotes;
 13      else
 14        if in_quotes then
 15          l_word := l_word || substr(p_str,i,1);
 16        end if;
 17      end if;
 18    end loop;
 19    return;
 20  end;
 21  /

Function created.

SQL>
SQL>
SQL> select *
  2  from table(word_finder('"le code" = 802 AND "principal amount" > 1200 OR "branch id" = ''ABC'''));

COLUMN_VALUE
-----------------------------
le code
principal amount
branch id


For second one, we can emulate nested replaces with some recursive SQL


SQL> create table t ( id int, str varchar2(100));

Table created.

SQL> insert into t values (1, '"le code" = 802 AND "principal amount" > 1200 OR "branch id" = ''ABC''');

1 row created.

SQL> insert into t values (2, '"le code" = 801 AND "principal amount" > 1000');

1 row created.

SQL>
SQL> create table t1 ( seq int, col varchar2(20), real_col varchar2(20));

Table created.

SQL> insert into t1 values (1, 'le code','CODE');

1 row created.

SQL> insert into t1 values (2, 'principal amount','AMOUNT');

1 row created.

SQL> insert into t1 values (3, 'branch id','BRANCH');

1 row created.

SQL>
SQL> with
  2  x(id,str,seq,col,real_col)  as
  3  (
  4    select t.id, replace(t.str,t1.col,t1.real_col) str, t1.seq, t1.col, t1.real_col
  5    from   t, t1
  6    where  t1.seq = 1
  7    union all
  8    select x.id, replace(x.str,t1.col,t1.real_col) str, t1.seq, t1.col, t1.real_col
  9    from   x, t1
 10    where  x.seq +1 = t1.seq
 11  )
 12  select * from x
 13  order by 1,3;

        ID STR                                                                 SEQ COL                  REAL_COL
---------- ------------------------------------------------------------ ---------- -------------------- --------------------
         1 "CODE" = 802 AND "principal amount" > 1200 OR "branch id" =           1 le code              CODE
           'ABC'

         1 "CODE" = 802 AND "AMOUNT" > 1200 OR "branch id" = 'ABC'               2 principal amount     AMOUNT
         1 "CODE" = 802 AND "AMOUNT" > 1200 OR "BRANCH" = 'ABC'                  3 branch id            BRANCH
         2 "CODE" = 801 AND "principal amount" > 1000                            1 le code              CODE
         2 "CODE" = 801 AND "AMOUNT" > 1000                                      2 principal amount     AMOUNT
         2 "CODE" = 801 AND "AMOUNT" > 1000                                      3 branch id            BRANCH

6 rows selected.

--
-- So you can see above, each pass does a replace on an item.  Hence
-- we just the final string for the high water mark for SEQ
--

SQL>
SQL>
SQL> with
  2  x(id,str,seq,col,real_col)  as
  3  (
  4    select t.id, replace(t.str,t1.col,t1.real_col) str, t1.seq, t1.col, t1.real_col
  5    from   t, t1
  6    where  t1.seq = 1
  7    union all
  8    select x.id, replace(x.str,t1.col,t1.real_col) str, t1.seq, t1.col, t1.real_col
  9    from   x, t1
 10    where  x.seq +1 = t1.seq
 11  )
 12  select * from x
 13  where seq = ( select max(seq) from t1 )
 14  order by 1,3;

        ID STR                                                                 SEQ COL                  REAL_COL
---------- ------------------------------------------------------------ ---------- -------------------- --------------------
         1 "CODE" = 802 AND "AMOUNT" > 1200 OR "BRANCH" = 'ABC'                  3 branch id            BRANCH
         2 "CODE" = 801 AND "AMOUNT" > 1000                                      3 branch id            BRANCH

2 rows selected.

SQL>
SQL>



Rating

  (7 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Joyan Sil, February 13, 2017 - 7:06 am UTC

This was most helpful. It solves my problem :-)
Thanks very much...
Just one more clarification. I need to do this validation for around 5k records. So hoping this will scale up for larger no.of validations.

I will do the analysis. Thanks again

Regexp

Rajeshwaran, Jeyabal, February 13, 2017 - 9:01 am UTC

demo@ORA11G> variable l_str varchar2(100)
demo@ORA11G> exec :l_str := q'| "le code" = 802 AND "principal amount" > 1200 OR "branch id" = 'ABC' |';

PL/SQL procedure successfully completed.

demo@ORA11G> print l_str

L_STR
---------------------------------------------------------------------------------------------------------------
 "le code" = 802 AND "principal amount" > 1200 OR "branch id" = 'ABC'

demo@ORA11G> select trim('"' from regexp_substr( :l_str, '"\S+\s+\S+"',1,level))  x
  2  from dual
  3  connect by level <= regexp_count( :l_str, '"\S+\s+\S+"' ) ;

X
---------------------------------------------------------------------------------------------------------------
le code
principal amount
branch id

demo@ORA11G>


Team - Any specific reason here to avoid "REGEX" and switching into Pipelined Table functions?

RE: REGEX

Duke Ganote, February 13, 2017 - 8:56 pm UTC

Good question. I'd've guessed REGEX was slow, but it didn't seem that way on 12c.

Either at a summary level:

 select owner, count(view_name) cnt
      , max( regexp_count( text_vc, '"[^"]{1,30}"' )
           ) as max_column_cnt
   from dba_views
  where regexp_count( text_vc, '"[^"]{1,30}"' ) > 0
    and owner >= 'SYS'
  group by owner
  order by 1
SQL> /

OWNER                                        CNT MAX_COLUMN_CNT
----------------------------------- ------------ --------------
SYS                                        3,223            110
SYSTEM                                         6              9
WMSYS                                         12             22
XDB                                            2             12

Elapsed: 00:00:04.03


or detail:

WITH X AS (
select owner, view_name, text_vc, regexp_count( text_vc, '"[^"]{1,30}"' ) cnt
  from dba_views
 where owner = 'SYS'
   and regexp_count( text_vc, '"[^"]{1,30}"' ) = 110
   and view_name = 'CDB_HIST_ACTIVE_SESS_HISTORY'
),
rCTE ( owner, view_name, text_vc, cnt, column_name, column_id ) AS (
SELECT owner
     , view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"') column_name
     , 1
  from x
union all
SELECT owner
     , view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"', 1, column_id+1) column_name
     , column_id + 1
  from rCTE
 where column_id < cnt
)
select view_name, column_id, column_name--, text_vc
  from rCTE
 order by 1,2;

VIEW_NAME                       COLUMN_ID COLUMN_NAME
------------------------------ ---------- ------------------------
CDB_HIST_ACTIVE_SESS_HISTORY            1 "SNAP_ID"
CDB_HIST_ACTIVE_SESS_HISTORY            2 "DBID"
CDB_HIST_ACTIVE_SESS_HISTORY            3 "INSTANCE_NUMBER"
CDB_HIST_ACTIVE_SESS_HISTORY            4 "SAMPLE_ID"
CDB_HIST_ACTIVE_SESS_HISTORY            5 "SAMPLE_TIME"
...
CDB_HIST_ACTIVE_SESS_HISTORY          104 "TEMP_SPACE_ALLOCATED"
CDB_HIST_ACTIVE_SESS_HISTORY          105 "DBOP_NAME"
CDB_HIST_ACTIVE_SESS_HISTORY          106 "DBOP_EXEC_ID"
CDB_HIST_ACTIVE_SESS_HISTORY          107 "CON_DBID"
CDB_HIST_ACTIVE_SESS_HISTORY          108 "CON_ID"
CDB_HIST_ACTIVE_SESS_HISTORY          109 "SYS"
CDB_HIST_ACTIVE_SESS_HISTORY          110 "DBA_HIST_ACTIVE_SESS_HISTORY"

110 rows selected.

Elapsed: 00:00:00.68

Regex | To : Duke

Rajeshwaran, Jeyabal, February 14, 2017 - 7:13 am UTC

Good question. I'd've guessed REGEX was slow, but it didn't seem that way on 12c.

Yes, REGEX was slow in 12c too, that is its nature.

So my general principle would be this.
Check if things can be done WITHOUT REGEX, if yes skip regexp, if not then use REGEX

drop table t purge;
create table t as select * from all_objects;
insert /*+ append */ into t select * from t;
commit;
insert /*+ append */ into t select * from t;
commit;
insert /*+ append */ into t select * from t;
commit;
exec dbms_stats.gather_table_stats(user,'T');


Tkprof shows this.

select object_name
from t
where regexp_replace( object_name, '[A-Z]*[0-9]*') is null 

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       54      4.66       4.67          0      10575          0        7808
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       56      4.66       4.67          0      10575          0        7808

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 112  
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
      7808       7808       7808  TABLE ACCESS FULL T (cr=10575 pr=0 pw=0 time=216463 us cost=2896 size=203008 card=7808)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                      54        0.00          0.00
  SQL*Net message from client                    54        0.00          0.01
********************************************************************************

select object_name
from t
where replace( translate(object_name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789','0'),'0') is null

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       54      0.15       0.16          0      10572          0        7808
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       56      0.15       0.16          0      10572          0        7808

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 112  
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
      7808       7808       7808  TABLE ACCESS FULL T (cr=10572 pr=0 pw=0 time=19464 us cost=2900 size=805246 card=30971)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                      54        0.00          0.00
  SQL*Net message from client                    54        0.00          0.01



Connor McDonald
February 15, 2017 - 3:43 am UTC

"\S+\s+\S+"

breaks for non-space items, eg

SQL> select  regexp_count('"lecode" = 802 AND "principalamount" > 1200 OR "branchid" = ''ABC''', '"\S+\s+\S+"')
  2  from dual;

REGEXP_COUNT('"LECODE"=802AND"PRINCIPALAMOUNT">1200OR"BRANCHID"=''ABC''','"\S+\S+\S+"')
---------------------------------------------------------------------------------------
                                                                                      0


And when it comes to performance, because regex has to be all things to all people, it sometimes is not optimal. And similarly, if performance is critical its easy enough to tweak my function to be quicker

SQL> create table t as select owner, view_name, text_vc
  2  from dba_views
  3  where owner = 'SYS'
  4  and   mod(regexp_count( text_vc, '"[^"]{1,30}"' ),2) = 0
  5  and regexp_count( text_vc, '"[^"]{1,30}"' ) > 0;

Table created.

SQL> create or replace
  2  function word_finder2(p_str varchar2) return sys.odcivarchar2list pipelined is
  3    idx1 pls_integer;
  4    idx2 pls_integer;
  5  begin
  6    for i in 1 .. length(p_str) loop
  7      idx1 := instr(p_str,'"',1,i);
  8      exit when idx1 = 0;
  9      if mod(i,2) = 1 then
 10        idx2 := idx1;
 11      else
 12         pipe row ( substr(p_str,idx2,idx1-idx2+1 ) );
 13      end if;
 14    end loop;
 15    return;
 16  end;
 17  /

Function created.

SQL> set timing on
SQL> WITH
  2  rCTE ( owner, view_name, text_vc, cnt, column_name, column_id ) AS (
  3  SELECT owner
  4       , view_name
  5       , text_vc
  6       , regexp_count( text_vc, '"[^"]{1,30}"' ) cnt
  7       , regexp_substr( text_VC, '"[^"]{1,30}"') column_name
  8       , 1
  9    from t
 10  union all
 11  SELECT owner
 12       , view_name
 13       , text_vc
 14       , regexp_count( text_vc, '"[^"]{1,30}"' ) cnt
 15       , regexp_substr( text_VC, '"[^"]{1,30}"', 1, column_id+1) column_name
 16       , column_id + 1
 17    from rCTE
 18   where column_id < regexp_count( text_vc, '"[^"]{1,30}"' )
 19  )
 20  select max(column_name)
 21    from rCTE
 22  ;

MAX(COLUMN_NAME)
------------------------------------------------------------------------------------------------
"table"

Elapsed: 00:00:03.61
SQL>
SQL> select max(column_value)
  2  from t,
  3       table(word_finder2(t.text_vc));

MAX(COLUMN_VALUE)
------------------------------------------------------------------------------------------------
"table"


General principle of Rajeshwaran, Jeyabal

Duke Ganote, February 15, 2017 - 3:16 am UTC

"RE: So my general principle would be this. Check if things can be done WITHOUT REGEX, if yes skip regexp, if not then use REGEX"

I'd probably start with whatever approach is easiest first, then explore other options if performance is insufficient, or just for a check on the other.

For example, I tried both REGEX and normal SQL. Performance was adequate for data sets similar in size to that claimed for the original question. While REGEX performed a bit slower, it was so much simpler to code, I'd probably use it. I used the REGEX to debug my other approach, which was much more coding and debugging (and still not perfect).

CONVENTIONAL SQL
WITH setup2 AS (
select view_name
     , text_vc
     , TRUNC((length(text_vc)-length(replace(text_vc,'"'))) / 2 ) as cnt
  from dba_views_copy
 where owner = 'SYS'
), rCTE2 ( view_name, text_vc, cnt, column_name, occurrence#, pp ) AS (
select view_name
     , text_vc
     , cnt
     , ''
     , 1
     , instr(text_vc,'"',1) pp
  FROM setup2
union all
select view_name
     , text_vc
     , cnt
     , case when mod(occurrence#+1,2) = 0 and instr(text_vc,'"',1,occurrence#+1)-pp+1 <= 30
            then substr(text_vc,pp,instr(text_vc,'"',1,occurrence#+1)-pp+1)
        end as column_name
     , occurrence#+1
     , instr(text_vc,'"',1,occurrence#+1) pp
  from rCTE2
 where occurrence# <= cnt*2
) select /*+ gather_plan_statistics */ count(*), count(unique view_name)
    , count(unique column_name)
 from rCTE2 where column_name is not null;

COUNT(*) COUNT(UNIQUEVIEW_NAME) COUNT(UNIQUECOLUMN_NAME)
-------- ---------------------- ------------------------
  43,003                   3206                     8753

Elapsed: 00:00:05.10


REGEX SQL
WITH setup AS (
SELECT view_name, text_vc, regexp_count( text_vc, '"[^"]{1,30}"' ) cnt
  FROM dba_views_copy
 WHERE owner = 'SYS'
), rCTE ( view_name, text_vc, cnt, column_name, column_id ) AS (
SELECT view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"', 1, 1) as column_name
     , 1
  FROM setup
UNION ALL
SELECT view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"', 1, column_id + 1) as column_name
     , column_id + 1
  FROM rCTE
 WHERE column_id <= cnt
) select /*+ gather_plan_statistics */ count(*)
    , count(unique view_name), count(unique column_name)
 from rCTE
 where column_name is not null;

OUNT(*) COUNT(UNIQUEVIEW_NAME) COUNT(UNIQUECOLUMN_NAME)
------- ---------------------- ------------------------
 43,232                   3223                     8872

Elapsed: 00:00:07.55

Breaks for non-space items

Rajeshwaran, Jeyabal, February 17, 2017 - 2:24 pm UTC

"\S+\s+\S+" 

breaks for non-space items, eg 

SQL> select  regexp_count('"lecode" = 802 AND "principalamount" > 1200 OR "branchid" = ''ABC''', '"\S+\s+\S+"')
  2  from dual;

REGEXP_COUNT('"LECODE"=802AND"PRINCIPALAMOUNT">1200OR"BRANCHID"=''ABC''','"\S+\S+\S+"')
---------------------------------------------------------------------------------------
                                                                                      0


We could look for all possible combinations and "tweek" the patterns. some thing like this.

demo@ORA12C> variable l_str varchar2(100)
demo@ORA12C> exec :l_str := q'| "le code" = 802 AND "principal-amount" > 1200 OR "branch id" = 'ABC' |';

PL/SQL procedure successfully completed.

demo@ORA12C> print l_str

L_STR
---------------------------------------------------------------------------------------------------------
 "le code" = 802 AND "principal-amount" > 1200 OR "branch id" = 'ABC'

demo@ORA12C>
demo@ORA12C> select regexp_count(:l_str ,'"\S+\s+\S+"|"\S+"' ) cnt
  2  from dual;

       CNT
----------
         3

demo@ORA12C>

Rules for Column Names

Duke Ganote, February 20, 2017 - 3:50 pm UTC

The rules for Oracle column names in quotes are pretty general, for example:

Quoted identifiers can contain any characters and punctuations marks as well as spaces. However, neither quoted nor nonquoted identifiers can contain double quotation marks or the null character (\0).
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm

It's entirely possible and entirely annoying to do the following:

SQL> create table " " ( " " integer );

Table created.

Elapsed: 00:00:00.31
SQL> ed
Wrote file c:/users/DGANOTE/sqlplusedit.sql

  1* BEGIN execute immediate 'CREATE TABLE "'||chr(7)||'" ( "'||chr(7)||'" integer )'; end;
SQL> /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.11
SQL> create table "http://xmlns.oracle.com/xdb"
  2  ( "http://xmlns.oracle.com/xdb" varchar2(35) );

Table created.

Elapsed: 00:00:00.09
SQL> insert into "http://xmlns.oracle.com/xdb"
  2  values ( '"http://xmlns.oracle.com/xdb"');

1 row created.

Elapsed: 00:00:00.10
SQL> select * from "http://xmlns.oracle.com/xdb";

http://xmlns.oracle.com/xdb
-----------------------------------
"http://xmlns.oracle.com/xdb"


That's why I used the pattern '"[^"]{1,30}"' which covers almost all cases -- although to be thorough, I should have explicitly excluded the null character (\0).

More to Explore

PL/SQL demos

Check out more PL/SQL tutorials on our LiveSQL tool.

PL/SQL docs

PL/SQL reference manual from the Oracle documentation library