Dear Tom,
First of all thanks for the site, the tips... always useful... I faced recently an sql issue and i designed a beginning of answears but i think it should be optimized.
I have a table containing posts from a social network platforms. Those posts are identified by an id, it contains text (so, strings) and i want to extract(to parse) from those strings # or html url. I want to work without pl/SQL procedure and with no regex (or the less possible). My hypothesis is that an # start with "#" and ends with a " "(a blank space). So if i know the start position of a given # within my string and the position of the next blank space by a simple soustraction i will be able to retrieve my hahstag with SUBSTR command. Here is my work:
--- create intermediate view with targeted pattern position
create or replace view Start_Index_position as
with post as
(select id_post, text from "ma_table")
select id_post, instr(text,'#', 1, level) as position, text
from post
connect by level <= regexp_count(text, '#');
It gives me a view with starting position of each times SQL found "#" in my strings
Then i create a working view from which i will be able to calculate string length of each #, and of course extract with "SUBSTR" the #:
--- create working table view with full references and blank position for each pattern match and string_lenght for each one
create or replace view _#_index as
select id_post, position as hashtag_pos, INSTR(text,' ', position) as blank_position, INSTR(text,' ', position) - position as string_length, text
from start_index_position;
At the end with a simple query as:
select substr(text, hashtag_pos, string_length)from _#_index;
... i can retrieve my #.
There is few mistakes with this process:
- i don't think it's optimized.
- if my # or whatever patterns i'm looking for is at the end of the string it will retrieve a null value cause there will be no blank space at the end of the string.
- i think i m not using at his best level the "CONNECT BY" "LEVEL" Oracle hierarchical suery system. True no?
Thanks a lot again for the site and for your answear
I apologize for my english, i am french
Tristan
Complementary informations:
Test sample:
create table test (a int, b varchar2(5000), c date);
with this data sample:
insert into test values( 1, "#PROTESTOBR: Protestos em várias cidades brasileiras são destaque nos principais sites internacionais! O mundo começa a conhecer o Brasil de verdade! http://www.guardian.co.uk/world/2013/jun/18/brazil-protests-erupt-huge-scale http://www.washingtonpost.com/sports/crowds-of-protesters-demonstrate-in-at-least-8-brazilian-cities-venting-complaints-about-life/2013/06/17/4b4d27e2-d7b5-11e2-b418-9dfa095e125d_story.html http://www.corriere.it/esteri/13_giugno_18/brasile-proteste-rio-trasporti_b593f39c-d7ab-11e2-a4df-7eff8733b462.shtml http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html http://www.bbc.co.uk/portuguese/noticias/130617_protestos_live.shtml http://elpais.com/elpais/portada_america.html #vemprarua #vempraruarj #vempraruabrasil #ogiganteacordou #geracaoinvencivel @J.Oliveira Curta a Página Geração Invencível: http://www.facebook.com/GeracaoInvencive", to_date('17-june-2013'));
insert into test values( 2, "#AoVivo no acampamento #BrasilComLula com o Deputado Paulo Pimenta #BrasilComLula #MoroPersegueLula", to_date('10-may-2017');
Okay so let's try my solution with this tiny little table of two data sets. It will works. But as soon as i have much more data (my actual table contains 960741 posts, it means 960741 strings for which i want to duplicate the methodology i explain above).
So how can i optimize/how can i deploy it in a larger scale? the idea is to use substr/instr more than costy regex to extract all # from strings (recursively, a string could contain more than one #)?
Thank you tom!
Something like this? Yes, it uses a regex, but I would benchmark it before moving onto more complex options
SQL> create table test (a int, b varchar2(4000), c date);
Table created.
SQL>
SQL> insert into test values(
2 1,
3 '#PROTESTOBR: Protestos em várias cidades brasileiras são destaque nos principais sites internacionais! O mundo começa a conhecer o Brasil de verdade! http://www.guardian.co.uk/world/2013/jun/18/b
razil-protests-erupt-huge-scale http://www.washingtonpost.com/sports/crowds-of-protesters-demonstrate-in-at-least-8-brazilian-cities-venting-complaints-about-life/2013/06/17/4b4d27e2-d7b5-11e2-b418-9dfa095
e125d_story.html http://www.corriere.it/esteri/13_giugno_18/brasile-proteste-rio-trasporti_b593f39c-d7ab-11e2-a4df-7eff8733b462.shtml http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au
-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html http://www.lemonde.fr/ameriques/article/2013/06/17/manifestations-au-bresil-le-gouvernement-menace-du-carton-rouge_3431786_3222.html http:/
/www.bbc.co.uk/portuguese/noticias/130617_protestos_live.shtml http://elpais.com/elpais/portada_america.html #vemprarua #vempraruarj #vempraruabrasil #ogiganteacordou #geracaoinvencivel @J.Oliveira Curt
a a Página Geração Invencível: http://www.facebook.com/GeracaoInvencive',
4 to_date('17-june-2013')
5 );
1 row created.
SQL>
SQL> insert into test values( 2, '#AoVivo no acampamento #BrasilComLula com o Deputado Paulo Pimenta #BrasilComLula #MoroPersegueLula', to_date('10-may-2017'));
1 row created.
SQL>
SQL>
SQL> select a, r.column_value, trim(regexp_substr(b,'#.*? ', 1, r.column_value)) str
2 from test,
3 table(cast(
4 multiset(
5 select level from dual
6 connect by level <= regexp_count(test.b,'#.*? ')
7 ) as sys.OdciNumberList
8 )) r;
A COLUMN_VALUE STR
---------- ------------ ----------------------------------------
1 1 #PROTESTOBR:
1 2 #vemprarua
1 3 #vempraruarj
1 4 #vempraruabrasil
1 5 #ogiganteacordou
1 6 #geracaoinvencivel
2 1 #AoVivo
2 2 #BrasilComLula
2 3 #BrasilComLula
9 rows selected.
SQL>