Skip to Main Content

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Ahmad.

Asked: June 29, 2016 - 2:52 am UTC

Last updated: July 29, 2016 - 1:46 pm UTC

Version: 11g

Viewed 1000+ times

You Asked

Dir Sir:
I am developing an anti-plagiarism system, I am using Oracle text to search for my text.
my corpus contains more than 30 million of records.
I want to check my document against this corpus.
I want to fetch the highest score only, not all the records, so I ordered the resutls by score ascending.
but the order by clause is killing my performance. how can I get the results in high performance query?
what to do? thanks in advanced.
here is my query:

select sentence_id ,sc ,isn ,sentence_length ,processed_text from
(select /*+ First_Rows(5) */ sentence_id,score(1) as sc, isn ,sentence_length ,processed_text from plag_docsentences2
where contains(PROCESSED_TEXT,'<query>
<textquery>' || OriginalSentence ||'
<progression>
<seq><rewrite>transform((TOKENS, "{", " }", ","))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>',1)>0 order by score(1) desc )
where rownum <=5

and Chris said...

You can define your index to include an order by. This enables Oracle Database to push the sort into the index. This can be a big saving compared to sorting after accessing the table.

create table t (
  id int,
  x  varchar2(50)
);

insert into t
  select rownum,
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3) || ' ' || 
         dbms_random.string('l', 3)
  from dual
  connect by level <= 20000;

commit;

create index i on t(x) indextype is ctxsys.context;
exec dbms_stats.gather_table_stats(user, 't');

select /*+ gather_plan_statistics */* 
from ( 
  select id, x, score(1) from t
  where  contains(x, 'abc,def,ghi', 1) > 0
  order  by score(1) desc
)
where  rownum <= 3;

select * from table(dbms_xplan.display_cursor(null, null, 'BASIC LAST +IOSTATS'));

-------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |      1 |        |      3 |00:00:00.01 |      90 |
|*  1 |  COUNT STOPKEY                 |      |      1 |        |      3 |00:00:00.01 |      90 |
|   2 |   VIEW                         |      |      1 |     40 |      3 |00:00:00.01 |      90 |
|*  3 |    SORT ORDER BY STOPKEY       |      |      1 |     40 |      3 |00:00:00.01 |      90 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T    |      1 |     40 |     35 |00:00:00.01 |      90 |
|*  5 |      DOMAIN INDEX              | I    |      1 |        |     35 |00:00:00.01 |      59 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)
   5 - access("CTXSYS"."CONTAINS"("X",'abc,def,ghi',1)>0)

drop index i;
create index i on t(x) indextype is ctxsys.context order by x;

select /*+ gather_plan_statistics */* 
from ( 
  select id, x, score(1) from t
  where  contains(x, 'abc,def,ghi', 1) > 0
  order  by score(1) desc
)
where  rownum <= 3;

select * from table(dbms_xplan.display_cursor(null, null, 'BASIC LAST +IOSTATS'));

------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |      |      1 |        |      3 |00:00:00.01 |      62 |
|*  1 |  COUNT STOPKEY                |      |      1 |        |      3 |00:00:00.01 |      62 |
|   2 |   VIEW                        |      |      1 |      3 |      3 |00:00:00.01 |      62 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T    |      1 |     40 |      3 |00:00:00.01 |      62 |
|*  4 |     DOMAIN INDEX              | I    |      1 |        |      3 |00:00:00.01 |      59 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   4 - access("CTXSYS"."CONTAINS"("X",'abc,def,ghi',1)>0)

Note how in the ordered version, the buffers drops from 90 -> 62. The "SORT ORDER BY STOPKEY" operation is also gone, because Oracle was able to do this sorting in the index itself. It also only accesses the three rows from our table the query returns, instead of all 35 that match the contains expression.

For further details on how this works and the restrictions, see:

http://docs.oracle.com/cd/E11882_01/text.112/e24436/csql.htm#sthref147

If this doesn't help, please post the execution plan for your query.

Rating

  (10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

how to order by score(1)

Ahmad Sawalha, July 04, 2016 - 11:24 am UTC

Dir Sir,
thank you for you response, I read the solution, and followed the link but I need to know:
1- can I order by score(1), in creating my index?
....CREATE INDEX foox ON foo(D) INDEXTYPE IS CTXSYS.CONTEXT
FILTER BY B, C
ORDER BY score(1);
2- If not, what columns to include in the filter by clause? and how they will enhance my performance?
thanks.
Chris Saxon
July 04, 2016 - 12:26 pm UTC

1. No, you have to reference columns in the table.

2. Just include the column you're indexing and "order by score" won't need a table access:

create index i on t(x) indextype is ctxsys.context order by x;


See the code in my example above.

about cluase && score(0)> threshold

Ahmad Sawalha, July 13, 2016 - 6:57 pm UTC

Thank you very much
but I have another question:
1- which is better for performance using " about " clause or the syntax above?
2- what if I use in the "contains" cluase, instead of >0 , I use > 60 like this
<score datatype="INTEGER" algorithm="COUNT"/> </query>',1)>60
is it better for performance
thank you very very very much.
Chris Saxon
July 14, 2016 - 9:13 am UTC

1. ABOUT and your syntax are functionally different! The "," matches for exact matches of the strings in your input sentence. The more different terms there are, the higher the score.

ABOUT searches for text related to your term:

For example, an ABOUT query on heat might return documents related to temperature, even though the term temperature is not part of the query.

So if anything ABOUT will be slower, because it has more matches. But it depends on your data and search terms.

But decide what functionality you need first!

2. The contains() clause returns the score of each row. So comparing this to a higher value will filter more rows in the index. This means you access fewer table blocks. So yes, having:

contains(...) > 60


Should be more efficient than

contains(...) > 0


Other enhancements

Ahmad Sawalha, July 23, 2016 - 9:39 pm UTC

Dear Sir, Thank you for help, but I need more help please,
what other settengs can I do to get the best performance?
I read in Oracle documentation, I found some other settengs like , parallel, partitioning... etc.
what is the best hardware and software settings in my case?
what is the best parallel degree? I have 8 CPUs
how can I partition my table? there is no column to partition by( like price , order_date ) I have just plain text with ids. can I partition my table logically?
any suggestions?
thanks a lot.
Connor McDonald
July 25, 2016 - 8:26 am UTC

The first question to ask is "is my query fast enough"? To determine this, speak with the users to find out what "fast enough" is!

If the answer is yes, then you can stop!

I don't have enough information about your situation to answer your other questions. If you want to read more about these, check out the performance tuning guide:

http://docs.oracle.com/database/121/TGDBA/toc.htm

Processed_text column is greater than 249

Ahmad Sawalha, July 24, 2016 - 9:53 pm UTC

Dir Sir
when I executed the create index with order by col.
it gave me an error
when I checked the documentation, it was that my column is greater than 249 chars
what to do please ? I am stuck with this project
thanks
Chris Saxon
July 25, 2016 - 8:27 am UTC

What exactly is your error?

Can you post a test-case demonstrating it?

error in col. size

Ahmad Sawalha, July 26, 2016 - 9:33 am UTC

Dir sir
here is the error ( my col. PROCESSED_TEXT size is 200 char, but contains Arabic text) :

SQL Error: ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-20000: Oracle Text error:
DRG-11310: invalid order by column: PROCESSED_TEXT
ORA-06512: at "CTXSYS.DRUE", line 160
ORA-06512: at "CTXSYS.TEXTINDEXMETHODS", line 366
29855. 00000 - "error occurred in the execution of ODCIINDEXCREATE routine"
*Cause: Failed to successfully execute the ODCIIndexCreate routine.
*Action: Check to see if the routine has been coded correctly.
Chris Saxon
July 26, 2016 - 8:31 pm UTC

And what is your index create statement? And the create table while you're at it! ;)

explain plan for index

Ahmad Sawalha, July 26, 2016 - 11:23 pm UTC

Dir Sir: thanks for you effort, I solved the error but I have another issue ( no performance enhancements), here is my steps
1- create index PLAG_DOCSEN3_IDX on plag_docsentences3(PROCESSED_TEXT) indextype is ctxsys.context order by PROCESSED_TEXT;

processed_text is varchar2(200).
2- DBMS_STATS.GATHER_TABLE_STATS('plagdb','PLAG_DOCSENTENCES3') ;

3- my query is :
select sentence_id ,sc ,isn ,sentence_length ,processed_text from
(select /*+ First_Rows(5) */ sentence_id,score(1) as sc, isn ,sentence_length ,processed_text from plag_docsentences3
where contains(PROCESSED_TEXT,'<query>
<textquery>
هدف درس حدد درج نمي نظم اعضاء هيأ درس درأ جمع نجح وطن قدس ومد أثر غور جنس أهل علم
<progression>

<seq><rewrite>transform((TOKENS, "{", " }", ","))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>',1)>0 order by score(1) desc )
where rownum <=5;

4- the explain plan :
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 1167 | 1807 (1)| 00:00:22 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 1167 | 1807 (1)| 00:00:22 |
| 3 | TABLE ACCESS BY INDEX ROWID| PLAG_DOCSENTENCES3 | 1 | 215 | 1807 (1)| 00:00:22 |
|* 4 | DOMAIN INDEX | PLAG_DOCSEN3_IDX | | | 1806 (0)| 00:00:22 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(ROWNUM<=5)
4 - access("CTXSYS"."CONTAINS"("PROCESSED_TEXT",'<query> <textquery> هدف درس حدد
درج نمي نظم اعضاء هيأ درس درأ جمع نجح وطن قدس ومد أثر غور جنس أهل علم <progression>
<seq><rewrite>transform((TOKENS, "{", " }", ","))</rewrite></seq>
</progression> </textquery> <score datatype="INTEGER" algorithm="COUNT"/>
</query>',1)>0)


I created the index another time without order by
and repeated the whole steps
and this is the explain plan
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 1167 | 1456 (1)| 00:00:18 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 1167 | 1456 (1)| 00:00:18 |
| 3 | TABLE ACCESS BY INDEX ROWID| PLAG_DOCSENTENCES3 | 1 | 215 | 1456 (1)| 00:00:18 |
|* 4 | DOMAIN INDEX | PLAG_DOCSEN3_IDX | | | 1455 (0)| 00:00:18 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(ROWNUM<=5)
4 - access("CTXSYS"."CONTAINS"("PROCESSED_TEXT",'<query> <textquery> هدف درس حدد
درج نمي نظم اعضاء هيأ درس درأ جمع نجح وطن قدس ومد أثر غور جنس أهل علم <progression>
<seq><rewrite>transform((TOKENS, "{", " }", ","))</rewrite></seq>
</progression> </textquery> <score datatype="INTEGER" algorithm="COUNT"/>
</query>',1)>60)

it takes the same time.
what I am doing wrong
thanks
Connor McDonald
July 29, 2016 - 1:46 pm UTC

What does the execution (not explain!) plan show?

https://blogs.oracle.com/sql/entry/how_to_create_an_execution

..continue

Ahmad Sawalha, July 26, 2016 - 11:34 pm UTC

my table is 3 million rows now, and is growing toward 100 million, and the query takes 0.2 seconds.
my case was that when the table has 100M records, and I sent 300 query, it takes 45 minutes.

Connor McDonald
July 29, 2016 - 1:46 pm UTC

Please show us the execution plan.

execution plan ( without order by, with order by)

Ahmad Sawalha, August 01, 2016 - 5:38 pm UTC

without order by:
1- create index PLAG_DOCSEN3_IDX on plag_docsentences3(processed_text) indextype is ctxsys.context;

2-DBMS_STATS.GATHER_TABLE_STATS('plagdb','plag_docsentences3') ;
3-
  
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                    |      1 |        |      3 |00:00:01.99 |     309K|  19926 |
|*  1 |  COUNT STOPKEY                   |                    |      1 |        |      3 |00:00:01.99 |     309K|  19926 |
|   2 |   VIEW                           |                    |      1 |      2 |      3 |00:00:01.99 |     309K|  19926 |
|*  3 |    COUNT STOPKEY                 |                    |      1 |        |      3 |00:00:01.99 |     309K|  19926 |
|   4 |     VIEW                         |                    |      1 |      2 |      3 |00:00:01.99 |     309K|  19926 |
|*  5 |      SORT ORDER BY STOPKEY       |                    |      1 |      2 |      3 |00:00:01.99 |     309K|  19926 |
|   6 |       TABLE ACCESS BY INDEX ROWID| PLAG_DOCSENTENCES3 |      1 |      2 |   1642K|00:00:01.72 |     309K|  19926 |
|*  7 |        DOMAIN INDEX              | PLAG_DOCSEN3_IDX   |      1 |        |   1642K|00:00:00.75 |   28177 |     11 |
--------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=5)
   5 - filter(ROWNUM<=5)
   7 - access("CTXSYS"."CONTAINS"("PROCESSED_TEXT",'<query>    <textquery>     أثر خدم ظمأ شكل حسن حصل طلب كلأ 
              علم ربأ تبع وكل غوثالأردن ميد روض خلد حلم خشن يمن حلم      <progression>             
              <seq><rewrite>transform((TOKENS,  "{", " }", ","))</rewrite></seq>      </progression>    </textquery>   <score 
              datatype="INTEGER" algorithm="COUNT"/> </query>',1)>0)
 

with order by

Ahmad Sawalha, August 01, 2016 - 5:42 pm UTC

1-create index PLAG_DOCSEN3_IDX on plag_docsentences3(processed_text) indextype is ctxsys.context order by processed_text;
2- DBMS_STATS.GATHER_TABLE_STATS('plagdb','plag_docsentences3') ;
3-

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                    |      1 |        |      3 |00:00:01.89 |     310K|   6832 |
|*  1 |  COUNT STOPKEY                   |                    |      1 |        |      3 |00:00:01.89 |     310K|   6832 |
|   2 |   VIEW                           |                    |      1 |      2 |      3 |00:00:01.89 |     310K|   6832 |
|*  3 |    COUNT STOPKEY                 |                    |      1 |        |      3 |00:00:01.89 |     310K|   6832 |
|   4 |     VIEW                         |                    |      1 |      2 |      3 |00:00:01.89 |     310K|   6832 |
|*  5 |      SORT ORDER BY STOPKEY       |                    |      1 |      2 |      3 |00:00:01.89 |     310K|   6832 |
|   6 |       TABLE ACCESS BY INDEX ROWID| PLAG_DOCSENTENCES3 |      1 |      2 |   1642K|00:00:01.63 |     310K|   6832 |
|*  7 |        DOMAIN INDEX              | PLAG_DOCSEN3_IDX   |      1 |        |   1642K|00:00:00.77 |   28861 |      0 |
--------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=5)
   5 - filter(ROWNUM<=5)
   7 - access("CTXSYS"."CONTAINS"("PROCESSED_TEXT",'<query>    <textquery>     أثر خدم ظمأ شكل حسن حصل طلب كلأ 
              علم ربأ تبع وكل غوثالأردن ميد روض خلد حلم خشن يمن حلم      <progression>             
              <seq><rewrite>transform((TOKENS,  "{", " }", ","))</rewrite></seq>      </progression>    </textquery>   <score 
              datatype="INTEGER" algorithm="COUNT"/> </query>',1)>0)
 

sql statement

Ahmad Sawalha, August 01, 2016 - 5:46 pm UTC

sorry, for the 3 replies, just to format.
and thank you very much for help.
select /*+ gather_plan_statistics */* 
from ( 
select sentence_id ,sc ,isn ,sentence_length ,processed_text from 
         (select  /*+ First_Rows(5)  */ sentence_id,score(1) as sc, isn ,sentence_length ,processed_text from plag_docsentences3 
              where contains(PROCESSED_TEXT,'<query>
   <textquery>
    أثر خدم ظمأ شكل حسن حصل طلب كلأ علم ربأ تبع وكل غوثالأردن ميد روض خلد حلم خشن يمن حلم
     <progression>
      
     <seq><rewrite>transform((TOKENS,  "{", " }", ","))</rewrite></seq>
     </progression>
   </textquery>
  <score datatype="INTEGER" algorithm="COUNT"/>
</query>',1)>0 order by score(1) desc ) 
     where rownum <=5)
     where rownum <=3;

More to Explore

Hints

Need more information on hints? Check out the complete hint descriptions