Thanks for the question, Pedro.

Asked: July 20, 2022 - 10:45 am UTC

Last updated: July 28, 2022 - 2:05 pm UTC

Version: 12.1.0.1

Viewed 10K+ times! This question is

You Asked

We have two tables, a and b, with with an oracle text JSON full text index as described in https://oracle-base.com/articles/12c/indexing-json-data-in-oracle-database-12cr1#full-text-search

Table b receives about 100.000 inserts a day, where before each insert, several queries like the following are run to find a "matching" parent entry in table a:

select * from a join b on a.id = b.a_id where contains(a.json, ...) > 0 and contains(b.json,...) > 0

If no matching entry is found in a, a new one is inserted, which happens 20.000 to 90.000 times.

We are gathering statistics on a and b once a day, as well as optimizing the full text indices with ctx_ddl.optimize_index.

The selectivity of the search criteria on a is high, greater than 50%, while the selectivity of the search criteria on b is low, a dozen or so records. So we would expect oracle to do a nested loop, with the outer loop on b, but we have observed the opposite. From the explain plan it seems, that oracle is underestimating the selectivity on a.

Is there more maintenance necessary that gathering statistics and optimizing the full text indices? Should we do it more often? Should we consider using an index hint?

and Chris said...

To help us understand what's going on here we really need to see the query's execution plan - including the actual row & I/O stats.

Get this by running:

alter session set statistics_level = all;

<your query>

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

And post your findings here.

Details of what the contains expressions are will help too. Do you really need a full text index? Can you create regular BTree indexes over JSON_VALUE expressions instead?

It's also worth reviewing the process. Instead of querying first, then running an insert if necessary does using INSERT ... SELECT ... WHERE NOT EXISTS help? e.g.:

insert into ...
  select json from ... 
  where not exists ( ... );

Rating

(2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Pedro, July 23, 2022 - 6:06 pm UTC

Thanks for the quick reply, Chris.
Here is an execution plan (from a test environment, where we observe the same behaviour):

SQL_ID  83g1jtz1pbtjh, child number 0
-------------------------------------
select * from evtprd_event r2  inner join evtprd_normalized_message r1  
on (r2.id = r1.event_id) where  contains(r1.json,'{<token1>} 
INPATH(<path for token1>)') > 0  and contains(r1.json,'{<token2>} 
INPATH(<path for token2>)') > 0  and 
contains(r1.json,'{<token3>} INPATH(<path for token3>)') 
> 0  and contains(r2.json,'{<token4>} INPATH(<path for token4>)') 
> 0 and r1.event_status = 10 and r2.status not in 
('ARCHIVED','CLOSED')
 
Plan hash value: 1663880685
 
--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name                      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                           |      1 |        |      0 |00:00:02.96 |   33977 |    237 |
|   1 |  NESTED LOOPS                               |                           |      1 |      1 |      0 |00:00:02.96 |   33977 |    237 |
|*  2 |   TABLE ACCESS BY INDEX ROWID               | EVTPRD_EVENT              |      1 |      1 |   1353 |00:00:00.11 |    1772 |      3 |
|*  3 |    DOMAIN INDEX                             | EVTPRD_EVENT_I_JSN        |      1 |        |   1427 |00:00:00.10 |     615 |      3 |
|*  4 |   TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| EVTPRD_NORMALIZED_MESSAGE |   1353 |      1 |      0 |00:00:02.85 |   32205 |    234 |
|*  5 |    INDEX RANGE SCAN                         | EVTPRD_NRMAL_MSG_EVNT_ID  |   1353 |      5 |   7428 |00:00:00.10 |    2122 |      0 |
--------------------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(("R2"."STATUS"<>'CLOSED' AND "R2"."STATUS"<>'ARCHIVED'))
   3 - access("CTXSYS"."CONTAINS"("R2"."JSON",'{<token4>} INPATH(<path for token4>)')>0)
   4 - filter(("R1"."EVENT_STATUS"=10 AND "CTXSYS"."CONTAINS"("R1"."JSON",'{<token1>} INPATH(<path for token1>)')>0 
              AND "CTXSYS"."CONTAINS"("R1"."JSON",'{<token3>} INPATH(<path for token3>)')>0 AND 
              "CTXSYS"."CONTAINS"("R1"."JSON",'{<token2>} INPATH(<path for token2>)')>0))
   5 - access("R2"."ID"="R1"."EVENT_ID")

The actual tokens and paths in the contains expression have been anonymized in the hope that the concrete values are not relevant for the moment.

The reason why we are using a full text index, instead of function based indices on json_value expressions, is that the actual tokens and paths should be configurable by the users and are subject to change (They describe the rules of how "normalized messages" are grouped together to form "events". There is a set of rules and for each rule a query similar to the above is executed). We don't want to adjust the indices when there is a change in the rules, and also we would need quite a lot of indices, so a json full text index seemed to be well suited for our requirements.

Now rule regarding <token4> of the "event" table in the above query, for which the cardinality issue seems to occur, is not expected to be changed, so we could use a function based index for it, or even a normal column with an index. Also this token can only have a small set of values, so maybe one could even think of partitioning, if we move it to a regular column.

Nevertheless we would like to have more insight into how the cardinality estimates of the optimizer can be checked and influenced.

Regarding the tokens queried in the contains clauses for the "normalized message" table: here we expect that the cardinality for the combined expression always to be small. A stylized example would be something like:

contains(json, '{<source>} INPATH(/source)') > 0 and contains(json, '{<id>} INPATH(/source_id)') > 0

Here the combination of source and source_id would have a small expected cardinality but source alone would have a high one.

July 25, 2022 - 2:34 pm UTC

more insight into how the cardinality estimates of the optimizer can be checked

By getting a plan like the one you submitted here!

For each operation, you want to check, does:

Starts * E-rows ~ A-rows

If these are similar (within an order of magnitude), then generally the estimates are good enough. Looking lines 2-5 of the plan:

2: 1 * 1 << 1353 - this estimate is far too low, but is a consequence of the index operation below it
3: 1 * ??? => 1427? The E-rows column is missing for this! I don't know why this is; this may be way line 2 estimates one row
4: 1353 * 1 >> 0 - the estimate is way over here, but the minimum row estimate shown is 1, and again this is a consequence of the operation below it
5: 1353 * 5 = 6765 ~ 7428 - this estimate is close enough

Another important thing to check is:

How much work is done by each operation?

The Buffers column is the primary stat to look at here. From this plan, operation 4 does the most at 30,083 gets (32,205 - 2,122; you have to subtract the work from its children). This is a huge amount of work!

Overlooking the fact this returns zero rows for now, the database is doing ~7,400 index lookups of the table. This is a little over four I/O operations per row. Ideally operation 4 would be doing one get per row at most. Again, I don't know why this based on the information available.

So questions I have are:

Why are the E-rows missing for operation 3 & does this help account for the large underestimate for line 2?
Why is operation 4 doing so much work? This is also the slowest part of the plan (From the A-Time).

Now rule regarding <token4> of the "event" table in the above query, for which the cardinality issue seems to occur, is not expected to be changed, so we could use a function based index for it, or even a normal column with an index.

I would definitely try that.

Just as important - and perhaps more - is to get to the bottom of why line 4 does so much work.

A reader, July 27, 2022 - 3:01 pm UTC

So questions I have are:

Why are the E-rows missing for operation 3 & does this help account for the large underestimate for line 2?
Why is operation 4 doing so much work? This is also the slowest part of the plan (From the A-Time).

Good questions, I can only speculate:

operation 4: in our case the json column is a clob. Could this add the additional overhead? Also: I'm not sure how the contains expression is actually evaluated by oracle text. Would it parse the columns value again or lookup the parsed results in some internally managed tables?

operation 3: I did the following tests and notitced a difference between gather_table_stats and analyze table compute statistics, where the e-rows estimated seem to accurate after analyze table, only.
According to the Oracle Text Developer Guide both analyze sstatement and use of dbms_stats are valid: https://docs.oracle.com/database/121/CCAPP/GUID-E4F013A8-E83C-44AF-B9A4-CA1FBF717730.htm#CCAPP9634

alter session set statistics_level = all;
create table t_json(json clob, constraint t_json_chk_json check (json is json) enable); 
create index t_json_i_json on t_json(json) 
   indextype is ctxsys.context  parameters ('section group ctxsys.json_section_group sync (on commit)'); 

insert into t_json values ('{}');
insert into t_json select '{"type": "t1", "id": 1}' from dual connect by level <= 10;
insert into t_json select '{"type": "t2", "id": 2}' from dual connect by level <= 10000;
commit;

execute dbms_stats.gather_table_stats(user, 'T_JSON', estimate_percent=>50) ;

select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>SQL_ID  0rwwkj8hq0526, child number 0
>------------------------------------
>select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |      83 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |      83 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |    502 |  10000 |00:00:00.01 |      83 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
>   2 - access("CTXSYS"."CONTAINS"("JSON",'{t2} INPATH(/type)')>0)
 
insert into t_json select '{"type": "t3", "id": 3}' from dual connect by level <= 30000;
commit;

select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>SQL_ID  7ucz1zm29hw3v, child number 0
>-------------------------------------
>select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |     231 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |     231 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |    502 |  30000 |00:00:00.01 |     231 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
> 2 - access("CTXSYS"."CONTAINS"("JSON",'{t3} INPATH(/type)')>0)
   
-- regather statistics - no change in plan

execute dbms_stats.gather_table_stats(user, 'T_JSON', estimate_percent=>50) ;

select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>   SQL_ID  0rwwkj8hq0526, child number 0
>-------------------------------------
>select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |      79 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |      79 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |    502 |  10000 |00:00:00.01 |      79 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
>   2 - access("CTXSYS"."CONTAINS"("JSON",'{t2} INPATH(/type)')>0)

select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>SQL_ID  7ucz1zm29hw3v, child number 0
>-------------------------------------
>select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |     231 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |     231 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |    502 |  30000 |00:00:00.01 |     231 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
>   2 - access("CTXSYS"."CONTAINS"("JSON",'{t3} INPATH(/type)')>0)
   
-- use analyze instead of gather_table_stats - estimates in plan are changing
   
analyze table t_json compute statistics;

select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>SQL_ID  0rwwkj8hq0526, child number 0
>-------------------------------------
>select count(1) from t_json where contains(json,'{t2} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |      79 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |      79 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |  10000 |  10000 |00:00:00.01 |      79 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
>   2 - access("CTXSYS"."CONTAINS"("JSON",'{t2} INPATH(/type)')>0)
    
select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 0;
select * from  table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

>SQL_ID  7ucz1zm29hw3v, child number 0
>-------------------------------------
>select count(1) from t_json where contains(json,'{t3} INPATH(/type)') > 
>0
> 
>Plan hash value: 3915049437
> 
>--------------------------------------------------------------------------------------------
>| Id  | Operation        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
>--------------------------------------------------------------------------------------------
>|   0 | SELECT STATEMENT |               |      1 |        |      1 |00:00:00.01 |     231 |
>|   1 |  SORT AGGREGATE  |               |      1 |      1 |      1 |00:00:00.01 |     231 |
>|*  2 |   DOMAIN INDEX   | T_JSON_I_JSON |      1 |  29999 |  30000 |00:00:00.01 |     231 |
>--------------------------------------------------------------------------------------------
> 
>Predicate Information (identified by operation id):
>---------------------------------------------------
> 
>   2 - access("CTXSYS"."CONTAINS"("JSON",'{t3} INPATH(/type)')>0)

July 28, 2022 - 2:05 pm UTC

OP 4: I guessed you're using a clob/blob ;)

There are lots of possible reasons - row chaining/migration, lob storage settings, getting read consistent data, ... - you'll need to investigate to find out exactly why.

One other thing I just noticed - it's using GLOBAL index to access the data. Is the table partitioned? Do you see similar effects (lots of work to fetch few rows) on non-partitioned tables?

I'm not sure how the contains expression is actually evaluated by oracle text

I'm not sure either to be honest! You could try tracing the session to see what's happening when you run CONTAINS.

OP 3: ANALYZE is obsolete, you should use dbms_stats.

You should also set the estimate percent to AUTO

https://blogs.oracle.com/optimizer/post/setting-estimate-percent

(though it's possible Oracle Text behaves differently).

That said, if ANALYZE enables the optimizer to come up with better estimates and therefore plans, you could use it while finding a better/more permanent solution.

Have you tried creating a function-based index to assist the event lookup?

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

More to Explore

SQL

JSON index not applied as expected

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

More to Explore

SQL