A colleague asked me to help implement a SQL solution to generate a set of aliases from a known set of values.
1. Assume a table containing a set of phrases.
2. Assume a second table containing a list of words and their abbreviations.
3. For each row in the first table, generate one row that has the main word replaced by its abbreviation(s).
For example, assuming the phrase "Fort Saint James" and abbreviations:
FORT FT
FORT FT.
SAINT ST
SAINT ST.
Generate these rows:
Fort Saint James
Fort St James
Fort St. James
Ft Saint James
Ft St James
Ft St. James
Ft. Saint James
Ft. St James
Ft. St. James
To mock this up, I created two tables:
create table ncy_phrases
(
phrase varchar2( 100 ) not null
, primary key (
phrase
)
) ;
insert into ncy_phrases values ( 'Fort Saint James' ) ;
insert into ncy_phrases values ( 'Fort Mount Saint Helens' ) ;
create table ncy_abbreviations
(
base_word varchar2( 100 ) not null
, abbreviation varchar2( 100 ) not null
, primary key (
base_word
, abbreviation
)
) ;
insert into ncy_abbreviations values ( 'Fort' , 'Ft.' ) ;
insert into ncy_abbreviations values ( 'Fort' , 'Ft' ) ;
insert into ncy_abbreviations values ( 'Mount', 'Mt.' ) ;
insert into ncy_abbreviations values ( 'Mount', 'Mt' ) ;
insert into ncy_abbreviations values ( 'Saint', 'St.' ) ;
insert into ncy_abbreviations values ( 'Saint', 'St' ) ;
The query I came up with is:
with word as (
-- --------------------------------------------------------------
-- Split the phrase into N blank-delimited words
-- --------------------------------------------------------------
select phrase
, phrase_no
, substr(
txt
, instr( txt, ' ', 1, level ) + 1
, instr( txt, ' ', 1, level + 1 ) - instr( txt, ' ', 1, level ) - 1
) as token
, level as token_no
from (
-- --------------------------------------------------------------
-- select each phrase from the table and append/prepend spaces
-- --------------------------------------------------------------
select fraz.phrase as phrase
, rownum as phrase_no
, ' ' || fraz.phrase || ' ' as txt
from ncy_phrases fraz
where 1 = 1
)
connect by level <= length( txt ) - length( replace( txt, ' ', null ) ) + 1
)
, toke as (
-- --------------------------------------------------------------
-- Join each word from "word" to the abbreviation table to get zero
-- or more abbreviations for the word.
-- --------------------------------------------------------------
select base.phrase
, base.phrase_no
, base.token
, base.token_no
, coalesce( abbr.abbreviation, base.token ) as abbreviation
from word base
left outer join ncy_abbreviations abbr
on (
abbr.base_word = base.token
)
where 1 = 1
and base.token is not null
union
-- --------------------------------------------------------------
-- Union to generate each word as its own abbreviation
-- --------------------------------------------------------------
select base.phrase
, base.phrase_no
, base.token
, base.token_no
, base.token as abbreviation
from word base
where 1 = 1
order by 1 -- base.phrase_no
, 4 -- base.token_no
)
-- -------------------------------------------------------------------------------------------
-- Join the "toke" subquery to itself as many times as needed to generate the expanded strings
-- trim leading/trailing blanks from the resulting string
-- -------------------------------------------------------------------------------------------
select w1.phrase
, trim( w1.abbreviation
|| ' ' || w2.abbreviation
|| ' ' || w3.abbreviation
|| ' ' || w4.abbreviation
|| ' ' || w5.abbreviation
) as match_val
from toke w1
left outer join toke w2
on (
w2.phrase_no = w1.phrase_no
and w2.token_no = 2
)
left outer join toke w3
on (
w3.phrase_no = w1.phrase_no
and w3.token_no = 3
)
left outer join toke w4
on (
w4.phrase_no = w1.phrase_no
and w4.token_no = 4
)
left outer join toke w5
on (
w5.phrase_no = w1.phrase_no
and w5.token_no = 5
)
where 1 = 1
and w1.token_no = 1
order by 1
, 2
;
Given this set up, how can I improve the SQL that generates the result set? I specifically don't like
the need to add joins to the "toke" subquery as the number of words in the phrase increases. Any ideas?
My database is Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit.
Thanks in advance!
First up, to make things a bit easier I'm going to add each base word as its own abbreviation:
insert into ncy_abbreviations (base_word,abbreviation) values ('Fort','Fort');
insert into ncy_abbreviations (base_word,abbreviation) values ('Mount','Mount');
insert into ncy_abbreviations (base_word,abbreviation) values ('Saint','Saint');
Are you sitting comfortably? Then let's begin...
To start, let's join the phrases and abbreviations together. This will give all the words we need to replace.
Do this using instr, to check if the base word is in the phrase:
select p.*, n.*
from NCY_PHRASES p
join NCY_ABBREVIATIONS n
on instr(p.phrase, n.base_word) > 0;
Using this, we want to find the position of each word in the phrase we want to replace. I've done this by counting many spaces there are after the base word with the following:
regexp_count(
substr(
p.phrase,
1,
case
when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
length(phrase)
else
instr(p.phrase, ' ', instr(p.phrase, n.base_word))
end
),
' '
)
The logic for the above is:
- Find the location of the first space after the abbreviation in the phrase. If at the end, return the string length.
- Substring the phrase from the start up to this point
- Count how many spaces there are in this
This gives the number of the base word within the phrase.
In addition to this, we want to know:
- The position of the first occurrence of a base word in the phrase
- How many base words there are in total
We can use min/max analytics on the results of the above to calculate these:
with abbrs as (
select p.*, n.*,
regexp_count(
substr(
p.phrase,
1,
case
when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
length(phrase)
else
instr(p.phrase, ' ', instr(p.phrase, n.base_word))
end
),
' '
) word_pos
from NCY_PHRASES p
join NCY_ABBREVIATIONS n
on instr(p.phrase, n.base_word) > 0
), vals as (
select phrase, base_word, abbreviation, word_pos,
min(word_pos) over (partition by phrase) first_replace,
max(word_pos) over (partition by phrase) tot_replaces
from abbrs
)
select * from vals;
Using this we can then recursively walk along the string.
- Start with the location of the first base word
- Replace the base word with the abbreviation
- Then move onto the next base word and repeat until all words are replaced
Then you just need to return the results of the last replacements. This is where the word position = max word position:
with abbrs as (
select p.*, n.*,
regexp_count(
substr(
p.phrase,
1,
case
when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
length(phrase)
else
instr(p.phrase, ' ', instr(p.phrase, n.base_word))
end
),
' '
) word_pos
from NCY_PHRASES p
join NCY_ABBREVIATIONS n
on instr(p.phrase, n.base_word) > 0
), vals as (
select phrase, base_word, abbreviation, word_pos,
min(word_pos) over (partition by phrase) first_replace,
max(word_pos) over (partition by phrase) tot_replaces
from abbrs
), phrases (phrase, base_word, abbreviation, word_pos, str, tot_replaces) as (
select phrase, base_word, abbreviation, word_pos,
replace(phrase, base_word, abbreviation) str,
tot_replaces
from vals
where word_pos = first_replace
union all
select p.phrase, a.base_word, a.abbreviation, a.word_pos,
replace(str, a.base_word, a.abbreviation) str,
a.tot_replaces
from vals a
join phrases p
on a.phrase = p.phrase
and a.word_pos = p.word_pos + 1
)
select phrase, str
from phrases
where word_pos = tot_replaces
order by phrase, str;
PHRASE STR
------------------------------ ------------------------------
Fort Mount Saint Helens Fort Mount Saint Helens
Fort Mount Saint Helens Fort Mount St Helens
Fort Mount Saint Helens Fort Mount St. Helens
Fort Mount Saint Helens Fort Mt Saint Helens
Fort Mount Saint Helens Fort Mt St Helens
Fort Mount Saint Helens Fort Mt St. Helens
Fort Mount Saint Helens Fort Mt. Saint Helens
Fort Mount Saint Helens Fort Mt. St Helens
Fort Mount Saint Helens Fort Mt. St. Helens
Fort Mount Saint Helens Ft Mount Saint Helens
Fort Mount Saint Helens Ft Mount St Helens
Fort Mount Saint Helens Ft Mount St. Helens
Fort Mount Saint Helens Ft Mt Saint Helens
Fort Mount Saint Helens Ft Mt St Helens
Fort Mount Saint Helens Ft Mt St. Helens
Fort Mount Saint Helens Ft Mt. Saint Helens
Fort Mount Saint Helens Ft Mt. St Helens
Fort Mount Saint Helens Ft Mt. St. Helens
Fort Mount Saint Helens Ft. Mount Saint Helens
Fort Mount Saint Helens Ft. Mount St Helens
Fort Mount Saint Helens Ft. Mount St. Helens
Fort Mount Saint Helens Ft. Mt Saint Helens
Fort Mount Saint Helens Ft. Mt St Helens
Fort Mount Saint Helens Ft. Mt St. Helens
Fort Mount Saint Helens Ft. Mt. Saint Helens
Fort Mount Saint Helens Ft. Mt. St Helens
Fort Mount Saint Helens Ft. Mt. St. Helens
Fort Saint James Fort Saint James
Fort Saint James Fort St James
Fort Saint James Fort St. James
Fort Saint James Ft Saint James
Fort Saint James Ft St James
Fort Saint James Ft St. James
Fort Saint James Ft. Saint James
Fort Saint James Ft. St James
Fort Saint James Ft. St. James
Ta-da!
Note this excludes phrases that have no words to abbreviate. So you may need to union the result of this with:
select * from ncy_phrases p
where not exists (
select * from NCY_ABBREVIATIONS a
where instr(p.phrase, a.base_word) > 0
)