Hi Team,
We are doing some testing on heavy loads, but we are experiencing strange behavior.
We have a process that is doing a lot of updates and inserts, primarily on 2 tables. But some of the aggregate queries have a tendency to go from being very fast to being slow, doing far more logical I/O than seems normal.
During testing we are running on quite homogeneous data, so we should not see much difference between runs. I have also tested the scripts on similar data and never got slow runs.
All code is run from PL/SQL here. During the test, we run 5 sessions that each work with one entity, which will have from 2000 to 400000 transactions in the run (always the same number for each entity in each run, although this will not be the case when we come to production).
For 2000, 10000, 20000 and 40000 per entity (10000, 50000, 100000 and 200000 runs), we have no problems. Most of the select and update scripts run within seconds (often less than a second).
When we move to a 500000 run (100000 per entity), we start seeing problems. The first 2-3 times the scripts are run, they take seconds, and then the following runs take minutes (for this amount, up to 12-18 minutes). Logical I/O rises from almost 0 to 60000000.
When we move to a 1000000 run (200000 per entity), things get even worse. Almost all of the runs take more than 20 minutes, and more often than not some of the runs never finish properly, as we have a 1 hour timeout.
Here is an example of the code:
WITH source AS (
SELECT
MAX(lpad(t2.column6, 7, '0') || '0000') AS aggregate_reference,
t1.column1,
t1.column2,
t1.column3,
t1.column4,
t1.column5
FROM
table1 t1
INNER JOIN table2 t2 ON t1.column4 = t2.column4
AND t1.initiatorbankid = t2.initiatorbankid
AND t1.column3 = t2.column3
AND t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.column5 = t2.column5
WHERE
t1.column1 <> 'T'
AND t1.column7 = '2544'
AND t1.column4 = 2081
GROUP BY
t1.column3,
t1.column1,
t1.column2,
t1.column4,
t1.column5
)
SELECT
appdata.type_for_reference(pk_column, aggregate_reference)
BULK COLLECT
INTO v_ref
FROM
table1 tx
INNER JOIN source ON tx.column4 = source.column4
AND tx.column3 = source.column3
AND tx.column1 = source.column1
AND tx.column2 = source.column2
AND tx.column5 = source.column5;
In a 500000 run, table1 has 500000 records (100000 per entity) and table2 has 200000 records (40000 per entity). The WITH clause would return about 50000 records (10000 per entity) and the total statement about 375000 (75000 per entity).
Previously we had the whole statement, including the update, combined with this one. That never worked.
We have now split it into 3 runs (2 selects and 1 insert): we run the WITH clause first into another type, and then, using that type, we get the result from the second query below.
SELECT
MAX(lpad(t2.column6, 7, '0') || '0000') AS aggregate_reference,
t1.column1,
t1.column2,
t1.column3,
t1.column4,
t1.column5
BULK COLLECT
INTO v_info
FROM
table1 t1
INNER JOIN table2 t2 ON t1.column4 = t2.column4
AND t1.initiatorbankid = t2.initiatorbankid
AND t1.column3 = t2.column3
AND t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.column5 = t2.column5
WHERE
t1.column1 <> 'T'
AND t1.column7 = '2544'
AND t1.column4 = 2081
GROUP BY
t1.column3,
t1.column1,
t1.column2,
t1.column4,
t1.column5;
SELECT
appdata.type_for_reference(pk_column, aggregate_reference)
BULK COLLECT
INTO v_ref
FROM
table1 tx
INNER JOIN TABLE ( v_info ) source ON tx.column4 = source.column4
AND tx.column3 = source.column3
AND tx.column1 = source.column1
AND tx.column2 = source.column2
AND tx.column5 = source.column5;
This seems to work.
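For reference, the declarations behind v_info, v_ref and type_for_reference are roughly along these lines. This is a simplified sketch with invented names and datatypes, not our exact definitions, and it assumes Oracle 12c or later so that TABLE() accepts a package-level collection of records:
-- Sketch only: record/collection types for the intermediate result.
-- Declared in a package specification so TABLE(v_info) works from PL/SQL (12c+).
CREATE OR REPLACE PACKAGE appdata.agg_types AS
  TYPE info_rec IS RECORD (
    aggregate_reference VARCHAR2(11),   -- lpad(...,7,'0') || '0000'
    column1             VARCHAR2(30),   -- real datatypes differ
    column2             VARCHAR2(30),
    column3             VARCHAR2(30),
    column4             NUMBER,
    column5             VARCHAR2(30)
  );
  TYPE info_tab IS TABLE OF info_rec;
END agg_types;
/
-- Sketch only: SQL object type pairing the primary key with the reference,
-- so it can be constructed directly in the SELECT list.
CREATE OR REPLACE TYPE appdata.type_for_reference AS OBJECT (
  pk_column           NUMBER,
  aggregate_reference VARCHAR2(11)
);
/
-- In the PL/SQL block (again, simplified):
--   v_info appdata.agg_types.info_tab;
--   TYPE ref_tab IS TABLE OF appdata.type_for_reference;
--   v_ref  ref_tab;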
My DBA and I differ in opinion about what is happening. Our system has a lot of CPUs, a lot of memory and a powerful machine (Exadata). But I have a feeling that the memory settings, especially those used for the aggregation work, are not properly set up. My DBA thinks our scripts are badly designed.
The 5 columns used in the join to table2 and in the GROUP BY clause are indexed together on table1. The combination of column7 and column1 is indexed on table2. I can't see any way to rewrite the scripts (and the scripts run quickly anyway when run independently).
Now the question.
How can I see what might be causing this, and whether some memory settings are not properly set up?
Sorry for not providing a LiveSQL example. We are only seeing this with a lot of data, so setting it up there would be difficult.
Regards,
Ingimundur Gudmundsson
I've no idea what you mean by a 500000 run. What exactly is involved in a run? One update changing 500,000 rows? 500,000 updates changing one row each? Something else? You mention inserts and updates, but I only see selects in the examples provided.
We'll need more clarity around what the process is doing to give good insights. But here's a complete guess that may be relevant.
Statements can only see changes committed by other sessions. If you run a query and another session has uncommitted DML changes on that table, the database has to reconstruct how the table looked before these changes.
It does this by working back through the undo. If there are lots of uncommitted changes, it has to do lots of work.
For example, this creates a table, inserts a row into it, and then counts its rows:
create table t ( c1 int );
insert into t values ( 1 );
commit;
set serveroutput off
alter session set statistics_level = all;
select count(*) from t;
select * from dbms_xplan.display_cursor( format => 'ALLSTATS LAST');
-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 7 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 7 |
| 2 | TABLE ACCESS FULL| T | 1 | 1 | 1 |00:00:00.01 | 7 |
-------------------------------------------------------------------------------------
As expected, this only needs a few I/Os to read the table.
Now start a second session and update the table 1,000 times:
begin
for i in 1 .. 1000 loop
update t set c1 = c1 + 1;
end loop;
end;
/
Leave these changes uncommitted. Go back to the first session and rerun the count. The database has to undo all 1,000 of these changes, so the query does just over 1,000 I/Os:
select count(*) from t;
select * from dbms_xplan.display_cursor( format => 'ALLSTATS LAST');
-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 1008 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 1008 |
| 2 | TABLE ACCESS FULL| T | 1 | 1 | 1 |00:00:00.01 | 1008 |
-------------------------------------------------------------------------------------
If you increase the number of updates in the second session, you'll see the query in the first do more and more work => slower queries. Eventually it'll start reading from disk which will slow things even further.
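If you want to see this extra work from the querying session itself, one option (assuming you can read the v$ views) is to check the session statistics before and after running the count and compare the values, e.g.:
-- How much consistent-read work has this session done so far?
-- Run before and after the query; the deltas show the work that query did.
select sn.name, ms.value
from   v$mystat ms
join   v$statname sn
on     sn.statistic# = ms.statistic#
where  sn.name in (
         'consistent gets',
         'data blocks consistent reads - undo records applied'
       );
The second statistic counts the undo records applied to build read-consistent blocks, so a big jump in it between runs points straight at this effect.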
Like I said - this is just a guess, but is consistent with the symptoms you describe. Please share more details on what the process is doing to confirm (or deny!) this and discuss solutions.