Parse string then flatten into columns

Question and Answer
Chris Saxon

Thanks for the question, Rutger.

Asked: September 13, 2018 - 11:10 am UTC

Last updated: April 29, 2022 - 10:47 am UTC

Version: SQL Developer 18.1

Viewed 10K+ times

You Asked

Hi,

LiveSQL link not accepted by the form:
https://livesql.oracle.com/apex/livesql/s/g88hb5van1r4ctc65yp4lq9gb

I have this situation (see link):

ID String
Id1 Thing1: Sub1, <br>Thing2: Sub7 ,Sub8 , Sub9 <br>Thing3: Sub12
Id1 Thing2: Sub6, Sub7, Sub8, <br>Thing3: Sub12 ,Sub13
Id1 Thing1: Sub1, Sub2 <br>Thing2: Sub7 ,Sub8 , Sub9 <br>Thing3: Sub12
Id2 Thing1: Sub2, Sub3 <br>Thing2: Sub7 <br>Thing3: Sub13
Id3 Thing2: Sub7

So each Id can have multiple strings and those strings contain two levels of information. I want to split the strings into two columns so that the records are at 'Level2' level, which would look like this:

ID Level1 Level2
Id1 Thing1 Sub1
Id1 Thing2 Sub7
Id1 Thing2 Sub8
Id1 Thing2 Sub9
Id1 Thing3 Sub12
Id1 Thing2 Sub6
Id1 Thing2 Sub7
Id1 Thing2 Sub8
Id1 Thing3 Sub12
Id1 Thing3 Sub13
Id1 Thing1 Sub1
Id1 Thing1 Sub2
Id1 Thing2 Sub7
Id1 Thing2 Sub8
Id1 Thing2 Sub9
Id1 Thing3 Sub12
Id2 Thing1 Sub2
Id2 Thing1 Sub3
Id2 Thing2 Sub7
Id2 Thing3 Sub13
Id3 Thing2 Sub7

I have actually got it to work using this syntax (twice) after parsing out the two levels into comma separated substrings:

regexp_substr(table_name,'[^,]+', 1, level) AS level1_flat,
connect by regexp_substr(table_name, '[^,]+', 1, level) is not null


but it is too slow to work on a large data set. I don't fully understand the code above, I guess it's doing a cross join and I'm applying a distinct select to ignore the dupes but that is obviously not efficient.

Any help would be greatly appreciated.

Thanks,
Rutger


and Chris said...

It's not a cross join.

But for every row, you're building the complete tree of:

- Its string split into rows
- Every other row's string split into rows!

So for row 1, you link each of the other four rows to it as a child. Then for each of those four, you create the tree of the three remaining rows below it. And so on.

So for each row you create a tree with 89 rows - 5 times over.

This leads to an explosion in the number of rows. As your table grows, the row count increases exponentially.

Avoid this by ensuring you only add child rows that link to the original parent. You can do this by:

- Using connect by inside a lateral subquery (needs 12.1+)

select id, regexp_substr(string,'[^,]+', 1, x) AS level1_flat 
from   test, lateral (
  select level x from dual 
  connect by regexp_substr(string, '[^,]+', 1, level) is not null
);


- Checking that the prior string is the same as the current

select id,
       regexp_substr(string,'[^,]+', 1, level) AS level1_flat
from   test
connect by regexp_substr(string, '[^,]+', 1, level) is not null
and     string = prior string
and     prior sys_guid() is not null; -- avoid loops


If this still isn't fast enough, you could try using xmltable with tokenize operations:

select id, str 
from   test, xmltable(
    'if (contains($X,",")) then ora:tokenize($X,"\,") else $X'
  passing string as X
  columns str varchar2(4000) path '.'
);


Finally, you could switch regular expressions for substr/instr.
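For example, here's a minimal sketch of that swap (using the same test table as above): pad the string with a leading and trailing comma so every element sits between two delimiters, then carve out each piece with substr/instr.

```sql
-- Sketch only: comma-splitting with substr/instr instead of regexp_substr.
-- Padding with ',' means element x lies between the xth and (x+1)th commas.
select id,
       substr ( ','||string||',',
                instr ( ','||string||',', ',', 1, x ) + 1,
                instr ( ','||string||',', ',', 1, x + 1 )
              - instr ( ','||string||',', ',', 1, x ) - 1 ) as level1_flat
from   test, lateral (
  select level x from dual
  connect by level <= length ( string ) - length ( replace ( string, ',' ) ) + 1
);
```

As with the lateral solution above, the connect by stays local to each row's string, so there's no row explosion.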

Rating

  (16 ratings)


Comments

If on 12c...

Stew Ashton, September 14, 2018 - 4:14 pm UTC

I have found good results by reformatting to a JSON string and using JSON_TABLE:
with json_ized as (
  select id, '[' ||
  replace(
    replace(
      replace(
        json_array(string),
        ',',
        '","'
      ),
      ': ',
      '",["'
    ),
    ' <br>',
    '"]],["'
  )
  || ']' json_string
  from test
)
select id, level1, trim(level2) level2
from json_ized,
json_table(
  json_string,
  '$[*]'
  columns (
    level1 path '$[0]',
    nested path '$[1][*]' columns (
      level2 path '$'
    )
  )
)
where trim(level2) is not null;

ID    LEVEL1   LEVEL2   
Id1   Thing1   Sub1     
Id1   Thing2   Sub7     
Id1   Thing2   Sub8     
Id1   Thing2   Sub9     
Id1   Thing3   Sub12    
Id1   Thing2   Sub6     
Id1   Thing2   Sub7     
Id1   Thing2   Sub8     
Id1   Thing3   Sub12    
Id1   Thing3   Sub13    
Id1   Thing1   Sub1     
Id1   Thing1   Sub2     
Id1   Thing2   Sub7     
Id1   Thing2   Sub8     
Id1   Thing2   Sub9     
Id1   Thing3   Sub12    
Id2   Thing1   Sub2     
Id2   Thing1   Sub3     
Id2   Thing2   Sub7     
Id2   Thing3   Sub13    
Id3   Thing2   Sub7 

Best regards,
Stew

on benchmarking

Rajeshwaran, Jeyabal, September 18, 2018 - 7:50 am UTC

Just comparing the traditional approach (substr/instr) with the JSON approach, and things look promising for the JSON-based approach.

demo@ORA12C> set feedback off
demo@ORA12C> drop table test purge;
demo@ORA12C> create table test(x int,id varchar2(3),string varchar2(100));
demo@ORA12C> insert into test(x,id,string) values (1,'Id1','Thing1:  Sub1, <br>Thing2: Sub7 ,Sub8 , Sub9 <br>Thing3: Sub12');
demo@ORA12C> insert into test(x,id,string) values (2,'Id1','Thing2:  Sub6, Sub7, Sub8, <br>Thing3: Sub12 ,Sub13');
demo@ORA12C> commit;
demo@ORA12C> set feedback 6
demo@ORA12C>
demo@ORA12C> col level1 format a20
demo@ORA12C> col level2 format a20
demo@ORA12C> with json_ized as (
  2    select x, id, '[' ||
  3    replace(
  4      replace(
  5        replace(
  6          json_array(string),
  7          ',',
  8          '","'
  9        ),
 10        ': ',
 11        '",["'
 12      ),
 13      ' <br>',
 14      '"]],["'
 15    )
 16    || ']' json_string
 17    from test
 18  )
 19  select x,id, level1, trim(level2) level2
 20  from json_ized,
 21  json_table(
 22    json_string,
 23    '$[*]'
 24    columns (
 25      level1 path '$[0]',
 26      nested path '$[1][*]' columns (
 27        level2 path '$'
 28      )
 29    )
 30  )
 31  where trim(level2) is not null;

         X ID  LEVEL1               LEVEL2
---------- --- -------------------- --------------------
         1 Id1 Thing1               Sub1
         1 Id1 Thing2               Sub7
         1 Id1 Thing2               Sub8
         1 Id1 Thing2               Sub9
         1 Id1 Thing3               Sub12
         2 Id1 Thing2               Sub6
         2 Id1 Thing2               Sub7
         2 Id1 Thing2               Sub8
         2 Id1 Thing3               Sub12
         2 Id1 Thing3               Sub13

10 rows selected.

demo@ORA12C> select x,id,
  2    substr( txt, 1,instr(txt,':')-1) as level1 ,
  3    trim( substr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,
  4            instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2)+1 ,
  5            instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2+1)-
  6            instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2)-1) ) as level2
  7  from (
  8  select x, id, string , r,
  9     trim(',' from trim(substr( '<br>'||string||'<br>' , instr( '<br>'||string||'<br>' ,'<br>',1,r)+4,
 10                             instr( '<br>'||string||'<br>' ,'<br>',1,r+1) -
 11                             instr( '<br>'||string||'<br>' ,'<br>',1,r) -4) ) ) txt
 12  from test, lateral( select level r
 13     from dual
 14     connect by level <= (length(string) - length(replace(string,'<br>')))/4 +1 )
 15        ) t1, lateral( select level r2
 16            from dual
 17            connect by level <= length(txt) - length( replace(txt,','))+1) ;

         X ID  LEVEL1               LEVEL2
---------- --- -------------------- --------------------
         1 Id1 Thing1               Sub1
         1 Id1 Thing2               Sub7
         1 Id1 Thing2               Sub8
         1 Id1 Thing2               Sub9
         1 Id1 Thing3               Sub12
         2 Id1 Thing2               Sub6
         2 Id1 Thing2               Sub7
         2 Id1 Thing2               Sub8
         2 Id1 Thing3               Sub12
         2 Id1 Thing3               Sub13

10 rows selected.


Now benchmarking with huge data sets:

demo@ORA12C> declare
  2     l_cnt int;
  3  begin
  4     select count(*) into l_cnt
  5     from test;
  6
  7     while ( l_cnt <= 1000000)
  8     loop
  9             dbms_application_info.set_client_info('l_cnt ='||l_cnt);
 10             insert into test(x,id,string)
 11             select l_cnt + rownum,id,string
 12             from test
 13             where rownum <= ( 1000000 - l_cnt);
 14             exit when sql%rowcount = 0;
 15             l_cnt := l_cnt + sql%rowcount;
 16     end loop;
 17     commit;
 18     dbms_application_info.set_client_info('Stats gather');
 19     dbms_stats.gather_table_stats(user,'TEST');
 20  end;
 21  /

PL/SQL procedure successfully completed.

demo@ORA12C> @tkfilename
D:\APP\VNAMEIT\VIRTUAL\diag\rdbms\ora12c\ora12c\trace\ora12c_ora_11796.trc
demo@ORA12C> @tktrace

PL/SQL procedure successfully completed.

demo@ORA12C> begin
  2     for k in ( with json_ized as (
  3                       select x, id, '[' ||
  4                       replace(
  5                             replace(
  6                               replace(
  7                                     json_array(string),
  8                                     ',',
  9                                     '","'
 10                               ),
 11                               ': ',
 12                               '",["'
 13                             ),
 14                             ' <br>',
 15                             '"]],["'
 16                       )
 17                       || ']' json_string
 18                       from test
 19                     )
 20                     select x,id, level1, trim(level2) level2
 21                     from json_ized,
 22                     json_table(
 23                       json_string,
 24                       '$[*]'
 25                       columns (
 26                             level1 path '$[0]',
 27                             nested path '$[1][*]' columns (
 28                               level2 path '$'
 29                             )
 30                       )
 31                     )
 32                     where trim(level2) is not null)
 33     loop
 34             null ;
 35     end loop;
 36  end;
 37  /

PL/SQL procedure successfully completed.

demo@ORA12C> begin
  2     for k in ( select x,id,
  3               substr( txt, 1,instr(txt,':')-1) as level1 ,
  4               trim( substr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,
  5                               instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2)+1 ,
  6                               instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2+1)-
  7                               instr( ','||trim(substr( txt, instr(txt,':')+1))||',' ,',',1,r2)-1) ) as level2
  8             from (
  9             select x, id, string , r,
 10                     trim(',' from trim(substr( '<br>'||string||'<br>' , instr( '<br>'||string||'<br>' ,'<br>',1,r)+4,
 11                                             instr( '<br>'||string||'<br>' ,'<br>',1,r+1) -
 12                                             instr( '<br>'||string||'<br>' ,'<br>',1,r) -4) ) ) txt
 13             from test, lateral( select level r
 14                     from dual
 15                     connect by level <= (length(string) - length(replace(string,'<br>')))/4 +1 )
 16                       ) t1, lateral( select level r2
 17                               from dual
 18                               connect by level <= length(txt) - length( replace(txt,','))+1) )
 19     loop
 20             null ;
 21     end loop;
 22  end;
 23  /

PL/SQL procedure successfully completed.

demo@ORA12C> exit
Disconnected from Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production


And TKPROF shows this:


WITH JSON_IZED AS ( SELECT X, ID, '[' || REPLACE( REPLACE( REPLACE( 
  JSON_ARRAY(STRING), ',', '","' ), ': ', '",["' ), ' <br>', '"]],["' ) || 
  ']' JSON_STRING FROM TEST ) SELECT X,ID, LEVEL1, TRIM(LEVEL2) LEVEL2 FROM 
  JSON_IZED, JSON_TABLE( JSON_STRING, '$[*]' COLUMNS ( LEVEL1 PATH '$[0]', 
  NESTED PATH '$[1][*]' COLUMNS ( LEVEL2 PATH '$' ) ) ) WHERE TRIM(LEVEL2) IS 
  NOT NULL


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    50001     22.88      22.78          0      65927          0     5000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50003     22.88      22.79          0      65928          0     5000000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 108     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
   5000000    5000000    5000000  NESTED LOOPS  (cr=65927 pr=0 pw=0 time=20561168 us starts=1 cost=2771 size=72 card=1)
   1000000    1000000    1000000   TABLE ACCESS FULL TEST (cr=65927 pr=0 pw=0 time=880870 us starts=1 cost=2741 size=68 card=1)
   5000000    5000000    5000000   JSONTABLE EVALUATION  (cr=0 pr=0 pw=0 time=8799197 us starts=1000000)


SELECT X,ID, SUBSTR( TXT, 1,INSTR(TXT,':')-1) AS LEVEL1 , TRIM( SUBSTR( ',
  '||TRIM(SUBSTR( TXT, INSTR(TXT,':')+1))||',' , INSTR( ','||TRIM(SUBSTR( TXT,
   INSTR(TXT,':')+1))||',' ,',',1,R2)+1 , INSTR( ','||TRIM(SUBSTR( TXT, 
  INSTR(TXT,':')+1))||',' ,',',1,R2+1)- INSTR( ','||TRIM(SUBSTR( TXT, 
  INSTR(TXT,':')+1))||',' ,',',1,R2)-1) ) AS LEVEL2 
FROM
 ( SELECT X, ID, STRING , R, TRIM(',' FROM TRIM(SUBSTR( 
  '<br>'||STRING||'<br>' , INSTR( '<br>'||STRING||'<br>' ,'<br>',1,R)+4, 
  INSTR( '<br>'||STRING||'<br>' ,'<br>',1,R+1) - INSTR( 
  '<br>'||STRING||'<br>' ,'<br>',1,R) -4) ) ) TXT FROM TEST, LATERAL( SELECT 
  LEVEL R FROM DUAL CONNECT BY LEVEL <= (LENGTH(STRING) - 
  LENGTH(REPLACE(STRING,'<br>')))/4 +1 ) ) T1, LATERAL( SELECT LEVEL R2 FROM 
  DUAL CONNECT BY LEVEL <= LENGTH(TXT) - LENGTH( REPLACE(TXT,','))+1) 


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    50001     60.06      59.63          0      59608          0     5000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50003     60.06      59.63          0      59609          0     5000000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 108     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
   5000000    5000000    5000000  NESTED LOOPS  (cr=59608 pr=0 pw=0 time=42165332 us starts=1 cost=2745 size=94 card=1)
   2499999    2499999    2499999   NESTED LOOPS  (cr=59608 pr=0 pw=0 time=7747816 us starts=1 cost=2743 size=81 card=1)
   1000000    1000000    1000000    TABLE ACCESS FULL TEST (cr=59608 pr=0 pw=0 time=980984 us starts=1 cost=2741 size=68 card=1)
   2499999    2499999    2499999    VIEW  VW_LAT_4DB60E85 (cr=0 pr=0 pw=0 time=4367489 us starts=1000000 cost=2 size=13 card=1)
   2499999    2499999    2499999     CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=3783943 us starts=1000000)
   1000000    1000000    1000000      FAST DUAL  (cr=0 pr=0 pw=0 time=150859 us starts=1000000 cost=2 size=0 card=1)
   5000000    5000000    5000000   VIEW  VW_LAT_4DB60E85 (cr=0 pr=0 pw=0 time=22797732 us starts=2499999 cost=2 size=13 card=1)
   5000000    5000000    5000000    CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=21648264 us starts=2499999)
   2499999    2499999    2499999     FAST DUAL  (cr=0 pr=0 pw=0 time=355530 us starts=2499999 cost=2 size=0 card=1)

Chris Saxon
September 18, 2018 - 1:37 pm UTC

A big chunk of the extra time is coming from the extra connect by.

I'm sure you can find a way to do this with one connect by ;)

siva, September 19, 2018 - 5:27 am UTC

We can try this

with a as(
select id, regexp_substr(REGEXP_REPLACE(string,'<.+?>',';'),'[^;]+', 1, level) AS level1_flat
from test
connect by regexp_substr(REGEXP_REPLACE(string,'<.+?>',';'),'[^;]+', 1, level) is not null),
b as (
select distinct id ,
substr(level1_flat, 1, instr(level1_flat,':',1)-1) AS level1_flat1,
trim(regexp_substr(substr(level1_flat, instr(level1_flat,':',1)+1),'[^,]+', 1, level)) AS level1_flat2
from a
connect by level<=LENGTH(substr(level1_flat, instr(level1_flat,':',1)+1))- LENGTH(REPLACE(substr(level1_flat, instr(level1_flat,':',1)+1),','))+1
)
select * from b
where level1_flat2 is not null ;
Chris Saxon
September 19, 2018 - 10:26 am UTC

Nice try, but still has two connect by clauses...

with one connect by

Rajeshwaran, Jeyabal, September 20, 2018 - 9:59 am UTC

....
I'm sure you can find a way to do this with one connect by ;)
....


Yes, here it is:

demo@ORA12C> select * from test;

         X ID  STRING
---------- --- ----------------------------------------------------------------------------------------
         1 Id1 Thing1:  Sub1, <br>Thing2: Sub7 ,Sub8 , Sub9 <br>Thing3: Sub12
         2 Id1 Thing2:  Sub6, Sub7, Sub8, <br>Thing3: Sub12 ,Sub13

demo@ORA12C> col x1 format a10
demo@ORA12C> col x2 format a10
demo@ORA12C> select x,id,
  2      last_value( case when regexp_substr( x2, '[^,]+',1,r) like '@%' then
  3        substr( regexp_substr( x2, '[^,]+',1,r) ,2, instr( regexp_substr( x2, '[^,]+',1,r) ,':')-1)
  4        end ignore nulls ) over( partition by x order by r ) x1,
  5      case when instr( regexp_substr( x2, '[^,]+',1,r) ,':') > 0 then
  6          substr( regexp_substr( x2, '[^,]+',1,r) , instr( regexp_substr( x2, '[^,]+',1,r) ,':')+1)
  7          else regexp_substr( x2, '[^,]+',1,r) end x2
  8  from (
  9  select x,id,string,r,x1,
 10    regexp_replace(x1,'([^,])@','\1,@',1,0,'i') x2
 11  from (
 12  select x,id, string,
 13    replace( regexp_replace('<br>'||string,'\s+') ,'<br>','@') x1
 14  from test
 15       ) ,  lateral( select level r
 16    from dual
 17    connect by level  <= regexp_count( x1,'@')+
 18            regexp_count(x1,',')-1 )
 19       )
 20  /

         X ID  X1         X2
---------- --- ---------- ----------
         1 Id1 Thing1:    Sub1
         1 Id1 Thing2:    Sub7
         1 Id1 Thing2:    Sub8
         1 Id1 Thing2:    Sub9
         1 Id1 Thing3:    Sub12
         2 Id1 Thing2:    Sub6
         2 Id1 Thing2:    Sub7
         2 Id1 Thing2:    Sub8
         2 Id1 Thing3:    Sub12
         2 Id1 Thing3:    Sub13

10 rows selected.

demo@ORA12C>

siva, September 21, 2018 - 11:41 am UTC

Which approach is best in terms of performance?
1. JSON
2. With clause and two connect by clauses
3. Single connect by clause

Is there any other way besides connect by and JSON?


Chris Saxon
September 24, 2018 - 12:50 pm UTC

Well, you can always use XML as discussed in the first answer.

Jess, August 06, 2019 - 5:21 pm UTC

Hi Chris,

I think I'm similarly suffering from aggregating in the wrong place. Trying with 1 string works, but trying to select from a table dies eventually with too long a concatenation...

Select in question isn't from a table, but a bunch of nested sub-selects with complex conditions. But the inner bit spits out a bunch of columns along with an "attendance" string: "00011100101011" type thing that's 40 characters long.

(but for test purposes, this mimics what comes back from the inner query... the result set there is about 10K records
create table testvals (col1 varchar2(10), col2 varchar2(10), attend varchar2(40));
insert into testvals values ('val11', 'val12', '0000111110011010000110000000000000100000');
insert into testvals values ('val21', 'val22', '0011111110000011111111111111111111000100');
insert into testvals values ('val31', 'val32', '1110101010101011000011111000011000100000');
)

We need to start by breaking this string into 3 chunks (each will become a separate column). Then we get the position of all the 1s and present them as a list. The 1-string example I cobbled together looks like this (just to check the output is right):


select col1, col2,
    listagg((case when x.res1 != '0' then x.res1 else null end), ',') within group (order by x.res1) first_set,
    listagg((case when x.res2 != '0' then x.res2 else null end), ',') within group (order by x.res2) second_set,
    listagg((case when x.res3 != '0' then x.res3 else null end), ',') within group (order by x.res3) third_set
from
    (select 'abc' col1, 'xyz' col2, 
        instr(substr('0000111110011010000110000000000000100000', 1, 15),'1', 1, level) res1,
        instr(substr('0000111110011010000110000000000000100000', 16, 15),'1', 1, level) res2,
        instr(substr('0000111110011010000110000000000000100000', 31, 10),'1', 1, level) res3
    from dual
        connect by level <= regexp_count('0000111110011010000110000000000000100000', '1')
    ) x
group by col1, col2;




That select returns the following, which is absolutely the correct result.
COL1 COL2 FIRST_SET SECOND_SET THIRD_SET
abc xyz 5,6,7,8,9,12,13,15 5,6 5


But when I try to swap in the table, it goes wrong - I don't think it ends up splitting each string using the regexp_count of that specific string...


with colchunks as (select 1 as chunk1, 16 as chunk2, 31 as chunk3 from dual)
select col1, col2,
    listagg((case when x.res1 != '0' then x.res1 else null end), ',') within group (order by x.res1) first_set,
    listagg((case when x.res2 != '0' then x.res2 else null end), ',') within group (order by x.res2) second_set,
    listagg((case when x.res3 != '0' then x.res3 else null end), ',') within group (order by x.res3) third_set
from
    (select col1, col2, 
        instr(substr(testvals.attend, colchunks.chunk1, 15),'1', 1, level) res1,
        instr(substr(testvals.attend, colchunks.chunk2, 15),'1', 1, level) res2,
        instr(substr(testvals.attend, colchunks.chunk3, 10),'1', 1, level) res3
    from colchunks, testvals
        connect by level <= regexp_count(testvals.attend, '1')
    ) x
group by col1, col2;



Could you please help me out with this one? Or perhaps there is a better/more efficient way of achieving the same result given the data and output we want?

Thank you always!





Chris Saxon
August 12, 2019 - 9:12 am UTC

The problem is the connect by clause builds a tree starting with every row in testvals. Then adds in every other row in the table that matches the expression. This leads to an explosion in the number of rows processed!

You can avoid this by placing the connect by clause within a lateral join (needs 12c or higher):

with colchunks as (select 1 as chunk1, 16 as chunk2, 31 as chunk3 from dual)
select col1, col2,
    listagg((case when x.res1 != '0' then x.res1 else null end), ',') within group (order by x.res1) first_set,
    listagg((case when x.res2 != '0' then x.res2 else null end), ',') within group (order by x.res2) second_set,
    listagg((case when x.res3 != '0' then x.res3 else null end), ',') within group (order by x.res3) third_set
from
    (select col1, col2, 
        instr(substr(testvals.attend, colchunks.chunk1, 15),'1', 1, lvl) res1,
        instr(substr(testvals.attend, colchunks.chunk2, 15),'1', 1, lvl) res2,
        instr(substr(testvals.attend, colchunks.chunk3, 10),'1', 1, lvl) res3
    from testvals, lateral ( 
        select c.*, level lvl from colchunks c
        connect by level <= regexp_count(testvals.attend, '1')
      ) colchunks
    ) x
group by col1, col2;

COL1     COL2     FIRST_SET               SECOND_SET                             THIRD_SET   
val11    val12    5,6,7,8,9,12,13,15      5,6                                    5            
val21    val22    3,4,5,6,7,8,9,15        1,2,3,4,5,6,7,8,9,10,11,12,13,14,15    1,2,3,4,8    
val31    val32    1,2,3,5,7,9,11,13,15    1,6,7,8,9,10,15                        1,5

Jess, September 05, 2019 - 8:24 pm UTC

Thanks Chris, that worked like a charm (and fast to boot).
It's a really nice trick (if one can work out the right place for that lateral!)
Chris Saxon
September 06, 2019 - 11:01 am UTC

Cool, glad this helped!

Slavon, April 13, 2022 - 7:55 am UTC

Hi Chris,

Regarding your first answer to the post, about the use of "tokenize":
I have a similar task, except there are blocks enclosed in brackets that should be treated as a whole.
I can't figure out how to handle that. The goal is 2 rows with 2 columns:
col1 col2
Att1 Val1
[Txt1,Txt2:Txt3] Val2

with WTBL as
(
    select 'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,' as WCLN
    from dual
)
select lvl, substr1, substr2, substr3, WCLN
from WTBL
cross join xmltable('if (contains($PRM,",[")) 
        then
            let $list := ora:tokenize($PRM, ","),
                $cnt := count($list)
          for $val at $r in $list 
          where $r < $cnt
          return $val
        else $PRM'
  passing WCLN as PRM
  columns substr1 varchar2(4000) path '.'
    ,substr2 varchar2(4000) path 'if (contains( . , ":")) then
            let $list := ora:tokenize( . ,":"),
                $cnt := count($list)
          for $val at $r in $list
          where $r = $cnt - 1
          return $val
        else . '
    ,substr3 varchar2(4000) path 'if (contains( . , ":")) then
            let $list := ora:tokenize( . ,":"),
                $cnt := count($list)
          for $val at $r in $list
          where $r = $cnt
          return $val
        else . '
     ,lvl FOR ORDINALITY
) xm

Chris Saxon
April 20, 2022 - 5:35 pm UTC

Hmmm, you've got a horrible mix of separators there. This makes parsing the string tricky - I can't see a way to do it at this point.

Do you have any control over the string format?

UPDATE

Having thought about this further, I believe it is possible with pure SQL using MATCH_RECOGNIZE!

Basic idea is:

- Split the string into rows, with one character/row
- Use MATCH_RECOGNIZE to search for either:
* A left bracket, then non-bracket characters, then a right bracket, then more characters up to a comma
* Or any run of non-bracket characters up to a comma
- This splits the rows into groups
- Use LISTAGG to combine the rows back into values

with wtbl as (
  select 1 id, 'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,' as WCLN
  from dual union all
  select 2 id, 'Att9:Val9,Att8:Val8,' as WCLN
  from dual union all
  select 3 id, '[Txt1,Txt2:Txt3]:Val2,Att1:Val1,' as WCLN
  from dual union all
  select 4 id, '[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val2,' as WCLN
  from dual 
), rws as (
  select id, 
         substr ( WCLN, rn, 1 ) v, 
         rn
  from   WTBL, lateral ( 
    select level rn from dual
    connect by level <= length ( WCLN )
  )
), grps as (
  select * from rws 
    match_recognize (
      partition by id
      order by rn
      measures 
        match_number() as grp,
        classifier() as cls
      all rows per match
      pattern ( 
        ( leftb ch+? rightb ch+? comma ) | 
        ch+? comma 
      )
      define
        leftb as v = '[',
        rightb as v = ']',
        ch as v not in ( '[', ']' ),
        comma as v = ','
    )
)
  select id, 
         listagg ( v ) 
           within group ( order by rn ) str
  from   grps
  group  by id, grp;
  
        ID STR                           
---------- ------------------------------
         1 Att1:Val1,                    
         1 [Txt1,Txt2:Txt3]:Val2,        
         2 Att9:Val9,                    
         2 Att8:Val8,                    
         3 [Txt1,Txt2:Txt3]:Val2,        
         3 Att1:Val1,                    
         4 [Txt1,Txt2:Txt3]:Val2,        
         4 [Txt1,Txt2:Txt3]:Val2,


You still need to split these into columns. You could do this either by refining the pattern or a regex on the final values. But this should be enough to get you started.
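One hypothetical way to do that final split (my own sketch, not part of the answer): anchor a bracket-aware alternation on each side of the colon, trying a bracketed token before a plain one, and pull out the two groups.

```sql
-- Sketch only: split each "key:value," string into two columns,
-- treating a [bracketed] block as a single token on either side of the colon.
select regexp_substr ( str, '^(\[[^]]+\]|[^:,]+):(\[[^]]+\]|[^,]+),$', 1, 1, null, 1 ) att,
       regexp_substr ( str, '^(\[[^]]+\]|[^:,]+):(\[[^]]+\]|[^,]+),$', 1, 1, null, 2 ) val
from   ( select '[Txt1,Txt2:Txt3]:Val2,' str from dual );
```

Run against the strings produced by the LISTAGG above, this should return the bracketed block as att and the remainder as val.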

That said, this method generates (potentially loads of) rows only to combine them back together again. So you're probably better off with other techniques.

Tokenizing while respecting blocks enclosed between brackets

Stew Ashton, April 20, 2022 - 7:25 pm UTC

In an earlier comment, I suggested converting the input to JSON format, then using JSON_TABLE to split rows and columns. In this case, I would need to convert the text that is not within brackets. I don't know how to do this directly with REGEXP_REPLACE, so I do the following:

1) For the text in brackets, replace ',' and ':' with special characters (that should not exist anywhere in the original string)
2) Do the same conversion to JSON format that I did above, creating an array of arrays
3) Change the special characters back to ',' and ':' .
with data as
(
  select
    'Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val2,X:Y'
    as x
  from dual
)
, json_data(jx) as (
  select '[["' ||
  replace(
    replace(
      replace(
        replace(
          regexp_replace(
            regexp_replace(
              trim(',' from x),
              '(\[[^]]*),', '\1§'
            ),
            '(\[[^]]*):', '\1£'
          ),
          ',', '"],["'
        ),
        ':', '","'
      ),
      '£', ':'
    ),
    '§', ','
  )
  || '"]]'
  from data
)
select att, val
from json_data,
json_table(
  jx, '$[*]' columns
    att path '$[0]',
    val path '$[1]'
);

ATT                 VAL         
Att1                [Val1:,]    
[Txt1,Txt2:Txt3]    Val2        
[Txt1,Txt2:Txt3]    Val2        
X                   Y 

Chris Saxon
April 21, 2022 - 12:24 pm UTC

Nicely done Stew

Hans Schelbeck-Pedersen, April 20, 2022 - 7:26 pm UTC

The following is a "solution" to the comment of April 13th, 2022.

with input as
(select 'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,' input from dual 
),
input_split(item,rest) as
(
select regexp_substr(input,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'')item,regexp_substr(input,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),(.*)',1,1,'',3) from input
union all
select regexp_substr(rest,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'')item,regexp_substr(rest,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),(.*)',1,1,'',3) from input_split
where rest is not null
)
select regexp_substr(item,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'',1)item,regexp_substr(item,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'',2) value from input_split
;


The idea is first to split the input into rows of items, each a "key":"value" string (input_split).
Then split the items into the actual "key" and value columns.


Chris Saxon
April 21, 2022 - 12:26 pm UTC

Nice - though it looks like it needs some work when there's more than one set of brackets:

with input as (
  select '[Txt1,Txt2:Txt3]:Val2,[Att2,Attr3:test]:Val1,' input from dual 
),
input_split(item,rest) as
(
select regexp_substr(input,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'')item,regexp_substr(input,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),(.*)',1,1,'',3) from input
union all
select regexp_substr(rest,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'')item,regexp_substr(rest,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),(.*)',1,1,'',3) from input_split
where rest is not null
)
select regexp_substr(item,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'',1)item,regexp_substr(item,'(^[^\[][^:]+|^\[[^\]+\]):([^,]+),',1,1,'',2) value from input_split;

[Txt1,Txt2:Txt3]:Val2,[Att2,Attr3:test]       Val1  

Tokenizing while respecting blocks enclosed between brackets: CORRECTION

Stew Ashton, April 21, 2022 - 8:10 pm UTC

Sorry, my previous solution does not work if there are multiple commas or colons within the bracketed text. Here is an alternative that uses XMLTABLE with some XQUERY code. We have definitely thrown everything and the kitchen sink at this problem ;)

The basic idea is still to convert the string to an equivalent JSON string that can be parsed by JSON_TABLE. The conversion is done by:
1) splitting the string using opening / closing brackets as delimiters
2) when the substring had brackets, restoring the brackets but leaving the text alone
3) when the substring did not have brackets, doing the replacements that JSON needs
4) joining the updated substrings back together.
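As a rough sketch of those four steps outside the database, here is a minimal Python translation of the XQuery (assuming the input contains no double quotes or nested brackets; `to_json_rows` is a name invented for this sketch):

```python
import json
import re

def to_json_rows(s):
    """Convert 'att:val,att:val' text (where [...] protects , and :)
    into a list of [att, val] rows, mimicking the XQuery conversion."""
    s = s.strip(',')                       # like trim(',' from x)
    # 1) split on [ or ]; odd-indexed pieces were inside brackets
    parts = re.split(r'[\[\]]', s)
    out = []
    for i, p in enumerate(parts):
        if not p:
            continue
        if i % 2 == 1:
            out.append('[' + p + ']')      # 2) restore brackets, text untouched
        else:
            # 3) comma ends a pair, colon separates att from val
            out.append(p.replace(',', '"],["').replace(':', '","'))
    # 4) join and wrap into a JSON array of [att, val] pairs
    return json.loads('[["' + ''.join(out) + '"]]')

print(to_json_rows('Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,X:Y,'))
```

The even/odd index test plays the role of `$i mod 2` in the XQuery: because brackets are not nested, pieces alternate between plain and bracketed text.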
with data(x) as
(
  select 'Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val2,X:Y,' from dual
  union all
  select '[Txt1,Txt2:Txt3]:[Val2,,::],a:b' from dual
  union all
  select 'a:b' from dual
)
, json_data(jx) as (
  select xmlcast(column_value as varchar2(4000))
  from data,
  xmltable('
    let $seq := fn:tokenize($X, "\[|\]")
    let $out :=
      for $i in (1 to count($seq))
      where string-length($seq[$i]) > 0
        return if ($i mod 2 = 0) 
          then concat("[", $seq[$i], "]") 
          else replace(replace($seq[$i], ",", """],["""), ":", """,""")
    return concat("[[""", string-join($out, ""), """]]")
    ' passing trim(',' from x) as x
  )
)
select ord, att, val
from json_data,
json_table(
  jx, '$[*]' columns
    ord for ordinality,
    att path '$[0]',
    val path '$[1]'
);

ORD  ATT                 VAL           
  1  Att1                [Val1:,]      
  2  [Txt1,Txt2:Txt3]    Val2          
  3  [Txt1,Txt2:Txt3]    Val2          
  4  X                   Y             
  1  [Txt1,Txt2:Txt3]    [Val2,,::]    
  2  a                   b             
  1  a                   b   

Chris Saxon
April 25, 2022 - 1:27 pm UTC

Great work Stew

A pure regex solution

Andrew Sayer, April 21, 2022 - 11:31 pm UTC

Just to add my regex solution using the same test data as Stew:

with data(pk,x) as
(
select 1,'Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val2,X:Y,' from dual
union all
select 2,'[Txt1,Txt2:Txt3]:[Val2,,::],a:b' from dual
union all
select 3,'a:b' from dual
)
select pk
,level ord
,regexp_substr(x,'((\[[^]]+\])|([^:,]+)):((\[[^]]+\])|([^,]+))',1,level,null,1) att
,regexp_substr(x,'((\[[^]]+\])|([^:,]+)):((\[[^]]+\])|([^,]+))',1,level,null,4) val
from data r
connect by prior pk = pk
and prior sys_guid() is not null
and regexp_substr(x,'((\[[^]]+\])|([^:,]+)):((\[[^]]+\])|([^,]+))',1,level,null,1) is not null
/

It won't handle nested square brackets (yet?), so hopefully that won't be necessary.
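For what it's worth, the same alternation (a bracketed block, or else a plain token, on both sides of the colon) works outside Oracle too. A Python sketch, noting that POSIX `[^]]` must be written `[^\]]` in Python's `re`, and the groups are kept non-capturing so `findall` returns just the att/val pairs (`split_pairs` is an invented name):

```python
import re

# Same idea as the Oracle pattern: each side of the ':' is either
# a [bracketed block] or a run of characters with no delimiter.
pair_re = re.compile(r'((?:\[[^\]]+\])|(?:[^:,]+)):((?:\[[^\]]+\])|(?:[^,]+))')

def split_pairs(s):
    """Return (att, val) tuples; nested square brackets are not handled."""
    return pair_re.findall(s)

print(split_pairs('Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,X:Y,'))
```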
Chris Saxon
April 25, 2022 - 1:28 pm UTC

Thanks for sharing

Regarding @Stew Ashton's contribution: Tokenizing while respecting blocks enclosed between brackets

AndyP, April 25, 2022 - 11:00 am UTC

It's a great approach, but there is something wrong in this example - just a cut'n'paste error I suppose - that doesn't help with clarity of method

with data as
(
  select
    'Att1:[Val1:,],[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val2,X:Y'
    as x
  from dual
)

doesn't produce the quoted result:

ATT                            VAL
------------------------------ ------------------------------
Att1                           [Val1:,]    
...

Instead you get:

ATT                            VAL
------------------------------ ------------------------------
Att1                           [Val1::]
...

but neither does it match the OP's suggested, rather minimal, data set of:

select 'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,' as WCLN from dual

I'm guessing that maybe the test input data should have been something like this:

with data as
(
  select
    'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val3,X:Y'
    as x
  from dual
)

which, with the same code approach, produces:

ATT                            VAL
------------------------------ ------------------------------
Att1                           Val1
[Txt1:Txt2:Txt3]               Val2
[Txt1:Txt2:Txt3]               Val3
X                              Y

What I meant to say

AndyP, April 25, 2022 - 11:31 am UTC

Sorry, my own clarity was sadly lacking there. What I should have made clearer is that the result is not right, not just that the specified input data doesn't match the listed output. Notice that [Txt1,Txt2:Txt3] gets turned into [Txt1:Txt2:Txt3], i.e. the comma changes to a colon.
Chris Saxon
April 25, 2022 - 1:32 pm UTC

I'm not seeing what the problem is - it looks correct to me:

with data(x) as
(
  select
     'Att1:Val1,[Txt1,Txt2:Txt3]:Val2,[Txt1,Txt2:Txt3]:Val3,X:Y'
  from dual
)
, json_data(jx) as (
  select xmlcast(column_value as varchar2(4000))
  from data,
  xmltable('
    let $seq := fn:tokenize($X, "\[|\]")
    let $out :=
      for $i in (1 to count($seq))
      where string-length($seq[$i]) > 0
        return if ($i mod 2 = 0) 
          then concat("[", $seq[$i], "]") 
          else replace(replace($seq[$i], ",", """],["""), ":", """,""")
    return concat("[[""", string-join($out, ""), """]]")
    ' passing trim(',' from x) as x
  )
)
select ord, att, val
from json_data,
json_table(
  jx, '$[*]' columns
    ord for ordinality,
    att path '$[0]',
    val path '$[1]'
);

       ORD ATT                  VAL                 
---------- -------------------- --------------------
         1 Att1                 Val1                
         2 [Txt1,Txt2:Txt3]     Val2                
         3 [Txt1,Txt2:Txt3]     Val3                
         4 X                    Y

And what I should have done

AndyP, April 25, 2022 - 11:44 am UTC

Refresh my page first! Sorry about that

wow

Slavon, April 28, 2022 - 2:17 pm UTC

super many thanks, Chris and Stew !
Chris Saxon
April 29, 2022 - 10:47 am UTC

You're welcome
