How to Improve the Performance of Group By with Having

Connor McDonald

Thanks for the question, NARENDRA.

Asked: February 14, 2018 - 8:41 am UTC

Last updated: March 30, 2018 - 4:37 am UTC

Version: 11.2.0.1.0

Viewed 10K+ times

You Asked

I have a table t containing three fields: accountno, tran_date and amount; amount is always > 0.
There are many records for each accountno. I want to select all the accountnos where the sum(amount) > 100.

The simple query is like this

select accountno from t group by accountno having sum(amount) > 100;


This query is taking time, as there are many records for each account. I want to optimise this query in such a way
that as soon as the sum of some of the records is > 100, it stops processing that accountno further. Is this possible?


with LiveSQL Test Case:

and Chris said...

A group by with a having clause is the best way to write this query.

But there is a trick available:

Materialized views!

These store the results of a query. Provided each account has "many" rows in the table, this can significantly reduce the number of rows you process. And, everything else being equal, fewer rows => faster query.

The best part is:

You don't need to change your SQL :)

Provided you have query rewrite enabled, Oracle Database can use this automagically. Notice how the plan for the query has "MAT_VIEW REWRITE ACCESS FULL" and only processes 43 rows (A-rows column):

create table t as
  select owner accountno,
         trunc( created ) as tran_date,
         object_id as amount
  from   all_objects;

create materialized view mv 
  enable query rewrite as 
  select accountno,
         count(*) ct,
         sum( amount ) sm
  from   t
  group by accountno;
  
set serveroutput off
select /*+ gather_plan_statistics */accountno
from   t
group by accountno 
having sum( amount ) > 1000;

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));

PLAN_TABLE_OUTPUT                                                          
SQL_ID  74naf2w6wvqh1, child number 0                                      
-------------------------------------                                      
select /*+ gather_plan_statistics */accountno from   t group by            
accountno  having sum( amount ) > 1000                                     
                                                                           
Plan hash value: 572630632                                                 
                                                                           
------------------------------------------------------------------------   
| Id  | Operation                    | Name | Starts | E-Rows | A-Rows |   
------------------------------------------------------------------------   
|   0 | SELECT STATEMENT             |      |      1 |        |     43 |   
|*  1 |  MAT_VIEW REWRITE ACCESS FULL| MV   |      1 |     43 |     43 |   
------------------------------------------------------------------------   
                                                                           
Predicate Information (identified by operation id):                        
---------------------------------------------------                        
                                                                           
   1 - filter("MV"."SM">1000)
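
By the way, if query rewrite happens to be off in your environment, you can switch it on at the session level (query_rewrite_enabled defaults to true in recent releases):

alter session set query_rewrite_enabled = true;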


Further, if you expect the number of accounts having a sum > 100 to be "small", you can create an index on the sum column in the MV. And the database can use that.
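
For example, something along these lines (a sketch using the demo MV above; the index name is illustrative, and including accountno means the rewritten query can be answered from the index alone):

create index mv_sm_i on mv ( sm, accountno );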

But to do this the MV has to be "fresh". That is, the data stored in it must exactly match the data in the base table.

Add one row and there's a mismatch. So it accesses the table again:

insert into t values ('CHRIS', sysdate, 1);
commit;

select /*+ gather_plan_statistics */accountno
from   t
group by accountno 
having sum( amount ) > 1000;

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));

PLAN_TABLE_OUTPUT                                                  
SQL_ID  74naf2w6wvqh1, child number 1                              
-------------------------------------                              
select /*+ gather_plan_statistics */accountno from   t group by    
accountno  having sum( amount ) > 1000                             
                                                                   
Plan hash value: 1381620754                                        
                                                                   
---------------------------------------------------------------    
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |    
---------------------------------------------------------------    
|   0 | SELECT STATEMENT    |      |      1 |        |     43 |    
|*  1 |  FILTER             |      |      1 |        |     43 |    
|   2 |   HASH GROUP BY     |      |      1 |      1 |     43 |    
|   3 |    TABLE ACCESS FULL| T    |      1 |  73095 |  73096 |    
---------------------------------------------------------------    
                                                                   
Predicate Information (identified by operation id):                
---------------------------------------------------                
                                                                   
   1 - filter(SUM("AMOUNT")>1000)


Notice how the full scan of T in line 3 accesses 73,096 rows? That's going to take a wee bit more effort to process than 43...

So you need to keep it up-to-date. Ideally by defining materialized view logs. And making it "fast refresh on commit".
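
For example, a sketch against the demo table above (fast refresh of an aggregate MV also wants count(*) and count(amount) alongside the sum, plus a log capturing the aggregated columns; mv_fast is an illustrative name):

create materialized view log on t
  with rowid, sequence ( accountno, amount )
  including new values;

create materialized view mv_fast
  refresh fast on commit
  enable query rewrite as
  select accountno,
         count(*)        ct,
         count( amount ) ct_amt,
         sum( amount )   sm
  from   t
  group by accountno;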

Or, if you're lucky enough to be on 12.2, create a real-time materialized view with the "enable on query computation" clause:

https://blogs.oracle.com/sql/12-things-developers-will-love-about-oracle-database-12c-release-2#real-time-mv
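
A sketch of what that looks like (assuming the materialized view logs above are in place; mv_rt is an illustrative name):

create materialized view mv_rt
  refresh fast on demand
  enable on query computation
  enable query rewrite as
  select accountno,
         count(*)        ct,
         count( amount ) ct_amt,
         sum( amount )   sm
  from   t
  group by accountno;

Queries can then use the MV even when it's stale: the database applies the outstanding changes from the MV logs on the fly, so the results are always current.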

Comments

NARENDRA GUPTA, February 15, 2018 - 8:52 am UTC

Thanks Chris for the prompt reply.
But I think creating the MV itself will take time (and space), as the size of my table t is in the terabytes. Also, millions of records are added every day, so the refresh will also take time.

I have the following alternative using a PL/SQL function, but it is still not giving the desired result.


create table ac as
  select username as accountno
  from   all_users
/

create or replace function checkamt ( lac in varchar2 ) return number
as
  totalamt number := 0;

  cursor c1 is
    select amount
    from   t
    where  t.accountno = lac;
begin
  -- stop fetching as soon as the running total reaches the threshold
  for k in c1 loop
    totalamt := totalamt + k.amount;
    if totalamt >= 1000 then
      return 0;
    end if;
  end loop;

  return 1;
end checkamt;
/

select accountno from ac where checkamt( accountno ) = 0
/


Does the cursor fetch all the rows? Would adding a hint like FIRST_ROWS help?

Thanks
Chris Saxon
February 15, 2018 - 11:16 am UTC

Yes, creating the MV will take time. But once it's in place fast refreshes should be quick provided you do them on commit/regularly.

And remember: for the size of the MV it doesn't matter how many rows you insert to the table. It's how many new, distinct account numbers you insert. If you insert 10 million rows in the table, but they're all for accounts that already exist, then the MV will still have the same number of rows after you refresh it.
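
A quick way to convince yourself (a sketch against the demo objects above; dbms_mview.refresh does a complete refresh here for simplicity):

select count(*) from mv;   -- one row per distinct account

insert into t
  select accountno, sysdate, 1
  from   t
  where  rownum <= 1000;   -- new rows, but only for existing accounts

commit;

exec dbms_mview.refresh( 'MV' )

select count(*) from mv;   -- same count as before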

Does the cursor fetch all the rows? Would adding a hint like FIRST_ROWS help?

Nooooooo! A looped PL/SQL solution will almost certainly be significantly slower!

Martin Rose, March 27, 2018 - 3:25 pm UTC

If you were to partition table 't' by 'accountno', the GROUP BY would be faster, as the data is essentially being stored already GROUPed.
Connor McDonald
March 28, 2018 - 1:56 am UTC

I'm not entirely convinced by that one

SQL> exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

SQL> create table t nologging
  2  partition by hash(acct) partitions 32
  3  as select
  4    trunc(dbms_random.value(1,1000)) acct,
  5    rownum r
  6  from
  7    ( select 1 from dual connect by level <= 1000 ),
  8    ( select 1 from dual connect by level <= 10000 );

Table created.

SQL> exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

SQL> create table t1 nologging
  2  partition by hash(acct) partitions 32
  3  as select
  4    trunc(dbms_random.value(1,1000)) acct,
  5    rownum r
  6  from
  7    ( select 1 from dual connect by level <= 1000 ),
  8    ( select 1 from dual connect by level <= 10000 );

Table created.

SQL>
SQL>
SQL> set timing on
SQL> set autotrace traceonly stat
SQL> select acct
  2  from   t
  3  group by acct
  4  having sum(r) < 50000000000;

466 rows selected.

Elapsed: 00:00:01.36

Statistics
----------------------------------------------------------
         46  recursive calls
         11  db block gets
      20617  consistent gets
      20474  physical reads
       2100  redo size
       8882  bytes sent via SQL*Net to client
        949  bytes received via SQL*Net from client
         33  SQL*Net roundtrips to/from client
          6  sorts (memory)
          0  sorts (disk)
        466  rows processed

SQL>
SQL> set timing on
SQL> set autotrace traceonly stat
SQL> select acct
  2  from   t1
  3  group by acct
  4  having sum(r) < 50000000000;

466 rows selected.

Elapsed: 00:00:01.34

Statistics
----------------------------------------------------------
          6  recursive calls
         10  db block gets
      20462  consistent gets
      20453  physical reads
       2100  redo size
       8882  bytes sent via SQL*Net to client
        949  bytes received via SQL*Net from client
         33  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        466  rows processed

SQL>
SQL>


Martin Rose, March 28, 2018 - 3:13 pm UTC

I meant LIST PARTITION it by ACCOUNTNO.
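
For reference, a list-partitioned version would look something like this (a sketch with made-up partition values; having to map every account value explicitly is what makes it laborious for 1,000 accounts):

create table t_list (
  accountno varchar2(30),
  tran_date date,
  amount    number
)
partition by list ( accountno ) (
  partition p_acc1 values ( 'ACC1' ),
  partition p_acc2 values ( 'ACC2' ),
  partition p_rest values ( default )
);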

Martin Rose, March 28, 2018 - 3:39 pm UTC

But in your tests, both tables were partitioned.
One shouldn't have been. ;-)

Then the results are quite different.

Using your hash example, the non-partitioned table takes 1.5 secs to return the first 50 rows, whilst the partitioned one takes only 0.22 secs for the same. Returning all rows, however, takes roughly the same time (the partitioned table being 0.1 secs slower).
Connor McDonald
March 30, 2018 - 4:37 am UTC

Hee hee...I'm chuckling at my cut-paste fail here :-) But even with a corrected one

SQL>  exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL>  create table t nologging
  2   partition by hash(acct) partitions 32
  3   as select
  4     trunc(dbms_random.value(1,1000)) acct,
  5     rownum r
  6   from
  7     ( select 1 from dual connect by level <= 1000 ),
  8     ( select 1 from dual connect by level <= 10000 );

Table created.

SQL>
SQL>
SQL>  exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL>  create table t1 nologging
  2   as select
  3     trunc(dbms_random.value(1,1000)) acct,
  4     rownum r
  5   from
  6     ( select 1 from dual connect by level <= 1000 ),
  7     ( select 1 from dual connect by level <= 10000 );

Table created.

SQL>
SQL>
SQL>  set timing on
SQL>  set autotrace traceonly stat
SQL>  select acct
  2   from   t
  3   group by acct
  4   having sum(r) < 50000000000;

466 rows selected.

Elapsed: 00:00:02.63

Statistics
----------------------------------------------------------
         15  recursive calls
         12  db block gets
      20539  consistent gets
      20469  physical reads
       2060  redo size
       8882  bytes sent via SQL*Net to client
        949  bytes received via SQL*Net from client
         33  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        466  rows processed

SQL>
SQL>  set timing on
SQL>  set autotrace traceonly stat
SQL>  select acct
  2   from   t1
  3   group by acct
  4   having sum(r) < 50000000000;

466 rows selected.

Elapsed: 00:00:02.14

Statistics
----------------------------------------------------------
          6  recursive calls
         10  db block gets
      20470  consistent gets
      20453  physical reads
       2100  redo size
       8882  bytes sent via SQL*Net to client
        949  bytes received via SQL*Net from client
         33  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        466  rows processed

SQL>
SQL>


And as I can see from your further investigations, it looks like our GROUP BY does not "go hunting" for opportunities to minimize work.




Martin Rose, March 29, 2018 - 3:31 pm UTC

Since you've gone away for your Easter holidays, I thought I'd take the time to do a more complete test and give you something for when you get back.

When I suggested the LIST PARTITIONing, I did so because I believed each 'accountno' would go into its own partition, thereby eliminating any work on the actual grouping at SELECT time (since the data is stored pre-grouped) - though CPU time would still be needed for the HAVING SUM(r).

It's a bit laborious to write out the definition for 1000 LIST PARTITIONs, so I assumed that increasing the HASH PARTITIONs to 1024 would achieve the same thing. I ran the code below in SQL Developer (with the fetch array set at 50 records), threw away the first result of each one (as that would skew the results by having to bring the data into the buffer cache, which subsequent runs didn’t) and obtained the average times for each test. Here’s what I got (running on Oracle 11gR2).
exec dbms_random.seed(0)

drop   table t32;
create table t32 nologging pctfree 0 PARALLEL compress
    partition by hash(acct) partitions 32
    as select
      trunc(dbms_random.value(1,1000)) acct,
      rownum r
    from
      ( select 1 from dual connect by level <= 1000 ),
      ( select 1 from dual connect by level <= 10000 );

execute dbms_stats.gather_table_stats (OWNNAME => USER, TABNAME => 'T32',  granularity => 'ALL', DEGREE => DBMS_STATS.DEFAULT_DEGREE);


exec dbms_random.seed(0)

drop   table t1024;
create table t1024 nologging pctfree 0 PARALLEL compress
    partition by hash(acct) partitions 1024
    as select
      trunc(dbms_random.value(1,1000)) acct,
      rownum r
    from
      ( select 1 from dual connect by level <= 1000 ),
      ( select 1 from dual connect by level <= 10000 );

execute dbms_stats.gather_table_stats (OWNNAME => USER, TABNAME => 'T1024',  granularity => 'ALL', DEGREE => DBMS_STATS.DEFAULT_DEGREE);
      
      
exec dbms_random.seed(0)      

drop   table t;
create table t nologging pctfree 0 PARALLEL compress
    as select
      trunc(dbms_random.value(1,1000)) acct,
      rownum r
    from
      ( select 1 from dual connect by level <= 1000 ),
      ( select 1 from dual connect by level <= 10000 );

execute dbms_stats.gather_table_stats (OWNNAME => USER, TABNAME => 'T', granularity => 'ALL', DEGREE => DBMS_STATS.DEFAULT_DEGREE);


select /*+ NOPARALLEL */ acct
    from   t
    group by acct
    having sum(r) < 50000000000;
    
select /*+ NOPARALLEL */ acct
    from   t32
    group by acct
    having sum(r) < 50000000000;
    
select /*+ NOPARALLEL */ acct
    from   t1024
    group by acct
    having sum(r) < 50000000000;   
    
select /*+ PARALLEL */ acct
    from   t
    group by acct
    having sum(r) < 50000000000;
    
select /*+ PARALLEL */ acct
    from   t32
    group by acct
    having sum(r) < 50000000000;
    
select /*+ PARALLEL */ acct
    from   t1024
    group by acct
    having sum(r) < 50000000000;


        NOPARALLEL     NOPARALLEL      PARALLEL        PARALLEL
     (First 50 rows)   (All Rows)   (First 50 rows)   (All Rows)
T          1.8            1.8             0.5            0.6  secs
T32        0.2            1.8             0.4            0.5  secs
T1024      0.5            4.5             0.2            0.25 secs

So what went wrong with my assumption that putting each ‘acct’ into its own partition would mean no work would be done to perform the GROUPing?
T1024 NOPARALLEL (All Rows) @ 4.5 secs clearly shows it isn’t happening as I’d expect (although the parallel version of the same works fine).
