Skip to Main Content
  • Questions
  • Improving SQL that deliberately generates cartesian product

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Kurt.

Asked: April 26, 2016 - 1:19 pm UTC

Last updated: May 10, 2016 - 1:55 pm UTC

Version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit

Viewed 1000+ times

You Asked

A colleague asked me to help implement a SQL solution to generate a set of aliases from a known set of values.

1. Assume a table containing a set of phrases.
2. Assume a second table containing a list of words and their abbreviations.
3. For each row in the first table, generate one row that has the main word replaced by its abbreviation(s).

For example, assuming the phrase "Fort Saint James" and abbreviations:

FORT FT
FORT FT.
SAINT ST
SAINT ST.

Generate these rows:

Fort Saint James
Fort St James
Fort St. James
Ft Saint James
Ft St James
Ft St. James
Ft. Saint James
Ft. St James
Ft. St. James

To mock this up, I created two tables:

create table ncy_phrases
(
phrase varchar2( 100 ) not null
, primary key (
phrase
)
) ;

insert into ncy_phrases values ( 'Fort Saint James' ) ;
insert into ncy_phrases values ( 'Fort Mount Saint Helens' ) ;

create table ncy_abbreviations
(
base_word varchar2( 100 ) not null
, abbreviation varchar2( 100 ) not null
, primary key (
base_word
, abbreviation
)
) ;

insert into ncy_abbreviations values ( 'Fort' , 'Ft.' ) ;
insert into ncy_abbreviations values ( 'Fort' , 'Ft' ) ;
insert into ncy_abbreviations values ( 'Mount', 'Mt.' ) ;
insert into ncy_abbreviations values ( 'Mount', 'Mt' ) ;
insert into ncy_abbreviations values ( 'Saint', 'St.' ) ;
insert into ncy_abbreviations values ( 'Saint', 'St' ) ;

The query I came up with is:

with word as (
-- --------------------------------------------------------------
-- Split the phrase into N blank-delimited words
-- --------------------------------------------------------------
select phrase
, phrase_no
, substr(
txt
, instr( txt, ' ', 1, level ) + 1
, instr( txt, ' ', 1, level + 1 ) - instr( txt, ' ', 1, level ) - 1
) as token
, level as token_no
from (
-- --------------------------------------------------------------
-- select each phrase from the table and append/prepend spaces
-- --------------------------------------------------------------
select fraz.phrase as phrase
, rownum as phrase_no
, ' ' || fraz.phrase || ' ' as txt
from ncy_phrases fraz
where 1 = 1
)
connect by level <= length( txt ) - length( replace( txt, ' ', null ) ) + 1
)
, toke as (
-- --------------------------------------------------------------
-- Join each word from "word" to the abbreviation table to get zero
-- or more abbreviations for the word.
-- --------------------------------------------------------------
select base.phrase
, base.phrase_no
, base.token
, base.token_no
, coalesce( abbr.abbreviation, base.token ) as abbreviation
from word base
left outer join ncy_abbreviations abbr
on (
abbr.base_word = base.token
)
where 1 = 1
and base.token is not null
union
-- --------------------------------------------------------------
-- Union to generate each word as its own abbreviation
-- --------------------------------------------------------------
select base.phrase
, base.phrase_no
, base.token
, base.token_no
, base.token as abbreviation
from word base
where 1 = 1
order by 1 -- base.phrase_no
, 4 -- base.token_no
)
-- -------------------------------------------------------------------------------------------
-- Join the "toke" subquery to itself as many times as needed to generate the expanded strings
-- trim leading/trailing blanks from the resulting string
-- -------------------------------------------------------------------------------------------
select w1.phrase
, trim( w1.abbreviation
|| ' ' || w2.abbreviation
|| ' ' || w3.abbreviation
|| ' ' || w4.abbreviation
|| ' ' || w5.abbreviation
) as match_val
from toke w1
left outer join toke w2
on (
w2.phrase_no = w1.phrase_no
and w2.token_no = 2
)
left outer join toke w3
on (
w3.phrase_no = w1.phrase_no
and w3.token_no = 3
)
left outer join toke w4
on (
w4.phrase_no = w1.phrase_no
and w4.token_no = 4
)
left outer join toke w5
on (
w5.phrase_no = w1.phrase_no
and w5.token_no = 5
)
where 1 = 1
and w1.token_no = 1
order by 1
, 2
;

Given this set up, how can I improve the SQL that generates the result set? I specifically don't like
the need to add joins to the "toke" subquery as the number of words in the phrase increases. Any ideas?

My database is Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit.

Thanks in advance!

and Chris said...

First up, to make things a bit easier I'm going to add each base word as its own abbreviation:

insert into ncy_abbreviations (base_word,abbreviation) values ('Fort','Fort');
insert into ncy_abbreviations (base_word,abbreviation) values ('Mount','Mount');
insert into ncy_abbreviations (base_word,abbreviation) values ('Saint','Saint');


Are you sitting comfortably? Then let's begin...

To start, let's join the phrases and abbreviations together. This will give all the words we need to replace.

Do this using instr, to check if the base word is in the phrase:

select p.*,  n.*
from   NCY_PHRASES p
join   NCY_ABBREVIATIONS n
on     instr(p.phrase, n.base_word) > 0;


Using this, we want to find the position of each word in the phrase we want to replace. I've done this by counting many spaces there are after the base word with the following:

regexp_count(
  substr(
    p.phrase, 
    1, 
    case 
      when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
        length(phrase) 
      else
        instr(p.phrase, ' ', instr(p.phrase, n.base_word))
    end
    ), 
  ' '
)


The logic for the above is:

- Find the location of the first space after the abbreviation in the phrase. If at the end, return the string length.
- Substring the phrase from the start up to this point
- Count how many spaces there are in this

This gives the number of the base word within the phrase.

In addition to this, we want to know:

- The position of the first occurrence of a base word in the phrase
- How many base words there are in total

We can use min/max analytics on the results of the above to calculate these:

with abbrs as (
  select p.*,  n.*, 
         regexp_count(
           substr(
             p.phrase, 
             1, 
             case 
               when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
                 length(phrase) 
               else
                 instr(p.phrase, ' ', instr(p.phrase, n.base_word))
             end
           ), 
           ' '
         ) word_pos
  from   NCY_PHRASES p
  join   NCY_ABBREVIATIONS n
  on     instr(p.phrase, n.base_word) > 0
), vals as (
  select phrase, base_word, abbreviation, word_pos,
         min(word_pos) over (partition by phrase) first_replace,
         max(word_pos) over (partition by phrase) tot_replaces
  from abbrs
)
  select * from vals;


Using this we can then recursively walk along the string.

- Start with the location of the first base word
- Replace the base word with the abbreviation
- Then move onto the next base word and repeat until all words are replaced

Then you just need to return the results of the last replacements. This is where the word position = max word position:

with abbrs as (
  select p.*,  n.*, 
         regexp_count(
           substr(
             p.phrase, 
             1, 
             case 
               when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
                 length(phrase) 
               else
                 instr(p.phrase, ' ', instr(p.phrase, n.base_word))
             end
           ), 
           ' '
         ) word_pos
  from   NCY_PHRASES p
  join   NCY_ABBREVIATIONS n
  on     instr(p.phrase, n.base_word) > 0
), vals as (
  select phrase, base_word, abbreviation, word_pos,
         min(word_pos) over (partition by phrase) first_replace,
         max(word_pos) over (partition by phrase) tot_replaces
  from abbrs
), phrases (phrase, base_word, abbreviation, word_pos, str, tot_replaces) as (
  select phrase, base_word, abbreviation, word_pos, 
         replace(phrase, base_word, abbreviation) str,
         tot_replaces
  from   vals
  where  word_pos = first_replace
  union all
  select p.phrase, a.base_word, a.abbreviation, a.word_pos, 
         replace(str, a.base_word, a.abbreviation) str,
         a.tot_replaces
  from   vals a
  join   phrases p
  on     a.phrase = p.phrase
  and    a.word_pos = p.word_pos + 1
)
  select phrase, str
  from   phrases
  where  word_pos = tot_replaces
  order  by phrase, str;

PHRASE                         STR                          
------------------------------ ------------------------------
Fort Mount Saint Helens        Fort Mount Saint Helens       
Fort Mount Saint Helens        Fort Mount St Helens          
Fort Mount Saint Helens        Fort Mount St. Helens         
Fort Mount Saint Helens        Fort Mt Saint Helens          
Fort Mount Saint Helens        Fort Mt St Helens             
Fort Mount Saint Helens        Fort Mt St. Helens            
Fort Mount Saint Helens        Fort Mt. Saint Helens         
Fort Mount Saint Helens        Fort Mt. St Helens            
Fort Mount Saint Helens        Fort Mt. St. Helens           
Fort Mount Saint Helens        Ft Mount Saint Helens         
Fort Mount Saint Helens        Ft Mount St Helens            
Fort Mount Saint Helens        Ft Mount St. Helens           
Fort Mount Saint Helens        Ft Mt Saint Helens            
Fort Mount Saint Helens        Ft Mt St Helens               
Fort Mount Saint Helens        Ft Mt St. Helens              
Fort Mount Saint Helens        Ft Mt. Saint Helens           
Fort Mount Saint Helens        Ft Mt. St Helens              
Fort Mount Saint Helens        Ft Mt. St. Helens             
Fort Mount Saint Helens        Ft. Mount Saint Helens        
Fort Mount Saint Helens        Ft. Mount St Helens           
Fort Mount Saint Helens        Ft. Mount St. Helens          
Fort Mount Saint Helens        Ft. Mt Saint Helens           
Fort Mount Saint Helens        Ft. Mt St Helens              
Fort Mount Saint Helens        Ft. Mt St. Helens             
Fort Mount Saint Helens        Ft. Mt. Saint Helens          
Fort Mount Saint Helens        Ft. Mt. St Helens             
Fort Mount Saint Helens        Ft. Mt. St. Helens            
Fort Saint James               Fort Saint James              
Fort Saint James               Fort St James                 
Fort Saint James               Fort St. James                
Fort Saint James               Ft Saint James                
Fort Saint James               Ft St James                   
Fort Saint James               Ft St. James                  
Fort Saint James               Ft. Saint James               
Fort Saint James               Ft. St James                  
Fort Saint James               Ft. St. James


Ta-da!

Note this excludes phrases that have no words to abbreviate. So you may need to union the result of this with:

select * from ncy_phrases p
where  not exists (
   select * from NCY_ABBREVIATIONS a
   where  instr(p.phrase, a.base_word) > 0
)


Rating

  (1 rating)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Great idea!

Kurt Arthur, May 10, 2016 - 12:19 pm UTC

I really like this answer. It eliminated the need to add joins and seems quicker than my original attempt. Thanks!
Chris Saxon
May 10, 2016 - 1:55 pm UTC

Thanks!

It certainly performs better in my tests - only accessing each table once instead of N times can lead to big savings...

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.