The Oracle DB version I am working on is : Oracle Database 12c Enterprise Edition Release 18.104.22.168.0 - 64bit Production
Its a RAC with 2 instances.
This is about a query which runs quickly (under 15s) under instance #1 but takes between 10-11 mins when it runs under Instance #2.
The query is utilizing indexes as expected.
User provided me the query and I got the sql_ids of its historical runs via DBA_HIST_SQLTEXT.
That gave me 5 sql_ids of the same query run between 20-Jul and 28-Jul.
select sql_id from dba_hist_sqltext where sql_text like '%CLAIMTYP%claimtype%');Observation 1:
Via following query, I discovered that slower runs are on Instance #2, & faster ones on #1.
select distinct h.instance_number,trunc(h.sample_time) run_date,h.sql_exec_start, h.sql_id , h.session_id, h.sql_child_number, h.sql_full_plan_hash_value,
to_char(trunc(elapsed_time_total/1000000/60/60/24),'9') || ':' ||to_char(trunc(MOD(elapsed_time_total/1000000/60/60, 24)),'09') ||':' ||
to_char(trunc(MOD(elapsed_time_total/1000000,3600)/60),'09') || ':' ||to_char(MOD(MOD(elapsed_time_total/1000000, 3600),60),'09') as "Time Taken [dd:hh:mi:ss]"
from dba_hist_active_sess_history h,DBA_HIST_SQLSTAT s
where s.sql_id=h.sql_id and h.instance_number=s.instance_number
and h.sql_id in ('73b4smcjjk38s','bt8sw7vj42sp3',........);Observation 2:
In dba_hist_active_sess_history, under column DELTA_INTERCONNECT_IO_BYTES, I found several non-null values under Instance #2 run, which as per Oracle documentation means "Number of I/O bytes sent over the I/O interconnect over the last DELTA_TIME microseconds".
Value of that column is NULL in all other cases where same query ran on Instance #1.
And there are many rows in ASH for slower run sql_ids but faster runs have only couple of rows (each row indicating 10s interval).
Does above observations indicate some issue with server of Instance #2?
In RAC we have the concept of "node affinity", namely, in a perfect world, a particular piece of data would always be accessed from the same node
Because in a "just started" RAC system, the cost of accessing the data is the same in all instances, ie, reading it from the shared disk.
However, in an actively running system, the costs can be different across nodes because accessing data from buffer cache of the node a query is running on will will be faster than accessing data from the buffer cache of another node. In particular, if there is a LOT of changes to blocks on "remote" nodes, then the workload on remote nodes can be have an impact on queries on a single node, because
a) cross-instance locks take more coordination, and
b) we need to ship data across the interconnect...which adds overhead, and
b) the remote node might be called upon to construct a consistent read of a block
So (for example) you might be running a query on node A and that node has plenty of spare CPU.
Node B on the other hand is being absolutely smashed by some other job.
If the query on Node A needs consistent reads of blocks from node B, then this could take a long time (because Node B is under stress).
Huge values for your interconnect bytes stats indicate that a lot of cross-node chatter is going on to satisfy that request.