Skip to Main Content
  • Questions
  • parsing recusirvely "#" from strings without PL/SQL procedure

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Salord.

Asked: November 14, 2018 - 10:34 am UTC

Last updated: November 27, 2018 - 9:47 am UTC

Version: Oracle Database version 11.2.4

Viewed 1000+ times

You Asked

Dear Tom,
First of all thanks for the site, the tips... always useful... I faced recently an sql issue and i designed a beginning of answears but i think it should be optimized.
I have a table containing posts from a social network platforms. Those posts are identified by an id, it contains text (so, strings) and i want to extract(to parse) from those strings # or html url. I want to work without pl/SQL procedure and with no regex (or the less possible). My hypothesis is that an # start with "#" and ends with a " "(a blank space). So if i know the start position of a given # within my string and the position of the next blank space by a simple soustraction i will be able to retrieve my hahstag with SUBSTR command. Here is my work:

--- create intermediate view with targeted pattern position
create or replace view Start_Index_position as
with post as
(select id_post, text from "ma_table")
select id_post, instr(text,'#', 1, level) as position, text
from post
connect by level <= regexp_count(text, '#');


It gives me a view with starting position of each times SQL found "#" in my strings

Then i create a working view from which i will be able to calculate string length of each #, and of course extract with "SUBSTR" the #:

--- create working table view with full references and blank position for each pattern match and string_lenght for each one
create or replace view _#_index as
select id_post, position as hashtag_pos, INSTR(text,' ', position) as blank_position, INSTR(text,' ', position) - position as string_length, text
from start_index_position;


At the end with a simple query as:
select substr(text, hashtag_pos, string_length)from _#_index;


... i can retrieve my #.

There is few mistakes with this process:
- i don't think it's optimized.
- if my # or whatever patterns i'm looking for is at the end of the string it will retrieve a null value cause there will be no blank space at the end of the string.
- i think i m not using at his best level the "CONNECT BY" "LEVEL" Oracle hierarchical suery system. True no?

Thanks a lot again for the site and for your answear

I apologize for my english, i am french

Tristan


Complementary informations:

Test sample:

create table test (a int, b varchar2(5000), c date);

with this data sample:
insert into test values( 1, "#PROTESTOBR: Protestos em várias cidades brasileiras são destaque nos principais sites internacionais! O mundo começa a conhecer o Brasil de verdade!  http://www.guardian.co.uk/world/2013/jun/18/brazil-protests-erupt-huge-scale  http://www.washingtonpost.com/sports/crowds-of-protesters-demonstrate-in-at-least-8-brazilian-cities-venting-complaints-about-life/2013/06/17/4b4d27e2-d7b5-11e2-b418-9dfa095e125d_story.html  http://www.corriere.it/esteri/13_giugno_18/brasile-proteste-rio-trasporti_b593f39c-d7ab-11e2-a4df-7eff8733b462.shtml  http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html  http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html  http://www.bbc.co.uk/portuguese/noticias/130617_protestos_live.shtml  http://elpais.com/elpais/portada_america.html  #vemprarua #vempraruarj #vempraruabrasil #ogiganteacordou #geracaoinvencivel  @J.Oliveira  Curta a Página Geração Invencível: http://www.facebook.com/GeracaoInvencive", to_date('17-june-2013'));
insert into test values( 2, "#AoVivo no acampamento #BrasilComLula com o Deputado Paulo Pimenta  #BrasilComLula #MoroPersegueLula", to_date('10-may-2017');


Okay so let's try my solution with this tiny little table of two data sets. It will works. But as soon as i have much more data (my actual table contains 960741 posts, it means 960741 strings for which i want to duplicate the methodology i explain above).
So how can i optimize/how can i deploy it in a larger scale? the idea is to use substr/instr more than costy regex to extract all # from strings (recursively, a string could contain more than one #)?

Thank you tom!

and Connor said...

Something like this? Yes, it uses a regex, but I would benchmark it before moving onto more complex options

SQL> create table test (a int, b varchar2(4000), c date);

Table created.

SQL>
SQL> insert into test values(
  2  1,
  3  '#PROTESTOBR: Protestos em várias cidades brasileiras são destaque nos principais sites internacionais! O mundo começa a conhecer o Brasil de verdade!      http://www.guardian.co.uk/world/2013/jun/18/b
razil-protests-erupt-huge-scale  http://www.washingtonpost.com/sports/crowds-of-protesters-demonstrate-in-at-least-8-brazilian-cities-venting-complaints-about-life/2013/06/17/4b4d27e2-d7b5-11e2-b418-9dfa095
e125d_story.html    http://www.corriere.it/esteri/13_giugno_18/brasile-proteste-rio-trasporti_b593f39c-d7ab-11e2-a4df-7eff8733b462.shtml  http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au
-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html  http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html  http:/
/www.bbc.co.uk/portuguese/noticias/130617_protestos_live.shtml  http://elpais.com/elpais/portada_america.html  #vemprarua #vempraruarj #vempraruabrasil #ogiganteacordou #geracaoinvencivel  @J.Oliveira  Curt
a a Página Geração Invencível: http://www.facebook.com/GeracaoInvencive',
  4  to_date('17-june-2013')
  5  );

1 row created.

SQL>
SQL> insert into test values( 2, '#AoVivo no acampamento #BrasilComLula com o Deputado Paulo Pimenta  #BrasilComLula #MoroPersegueLula', to_date('10-may-2017'));

1 row created.

SQL>
SQL>
SQL> select a, r.column_value, trim(regexp_substr(b,'#.*? ', 1, r.column_value)) str
  2  from   test,
  3         table(cast(
  4           multiset(
  5             select level from dual
  6             connect by level <= regexp_count(test.b,'#.*? ')
  7           ) as sys.OdciNumberList
  8         )) r;

         A COLUMN_VALUE STR
---------- ------------ ----------------------------------------
         1            1 #PROTESTOBR:
         1            2 #vemprarua
         1            3 #vempraruarj
         1            4 #vempraruabrasil
         1            5 #ogiganteacordou
         1            6 #geracaoinvencivel
         2            1 #AoVivo
         2            2 #BrasilComLula
         2            3 #BrasilComLula

9 rows selected.

SQL>



Is this answer out of date? If it is, please let us know via a Comment

More to Explore

PL/SQL demos

Check out more PL/SQL tutorials on our LiveSQL tool.

PL/SQL docs

PL/SQL reference manual from the Oracle documentation library