How to Group Rows into Buckets Based on Chunk Size

Question and Answer

Chris Saxon

Thanks for the question.

Asked: September 21, 2017 - 8:51 am UTC

Last updated: September 21, 2017 - 1:25 pm UTC

Version: 11g

Viewed 1000+ times

You Asked

Hi Chris/Connor,

LiveSQL: https://livesql.oracle.com/apex/livesql/s/flfiwlt4m3yqb6v2tync6gr4o

I have a table test, and from it I want to get the following:
the number of chunks, where each chunk has a total file size <= 2GB.
E.g. the file sizes of id = 1,2,3 sum to <= 2GB, so those three rows are treated as one chunk;
likewise id = 4,5 form a chunk,
id = 6,7,8,9 form a chunk,
and id = 10 forms a chunk.

So in total, for the given created_date = TO_DATE('09/21/2017','MM/DD/YYYY'), we have 4 chunks, each with total file size <= 2GB.

So I want a SQL query that returns either of the results below. Can you please help with this SQL?

id  created_date     no_of_chunks   Chunk_Id
--  ---------------  -------------  ---------
1    21-SEP-17       4              1 
2    21-SEP-17       4              1
3    21-SEP-17       4              1
4    21-SEP-17       4              2
5    21-SEP-17       4              2
6    21-SEP-17       4              3
7    21-SEP-17       4              3
8    21-SEP-17       4              3
9    21-SEP-17       4              3
10   21-SEP-17       4              4


OR

no_of_chunks  id  Chunk_Id   
-----------   --  ---------  
4             1   1          
              2   1
              3   1
              4   2
              5   2
              6   3
              7   3
              8   3
              9   3
              10  4


With LiveSQL test case (linked above).
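The script itself isn't reproduced here; a minimal setup reconstructed from the result sets below would be (the column types and 'file_N' naming are assumptions inferred from the queries that follow):

create table test (
  id           number,
  created_date date,
  file_name    varchar2(30),
  file_size    number        -- in MB, so 2048 = 2GB
);

insert into test values (  1, date '2017-09-21', 'file_1',  1024 );
insert into test values (  2, date '2017-09-21', 'file_2',   512 );
insert into test values (  3, date '2017-09-21', 'file_3',   512 );
insert into test values (  4, date '2017-09-21', 'file_4',  1024 );
insert into test values (  5, date '2017-09-21', 'file_5',  1024 );
insert into test values (  6, date '2017-09-21', 'file_6',   512 );
insert into test values (  7, date '2017-09-21', 'file_7',   512 );
insert into test values (  8, date '2017-09-21', 'file_8',   512 );
insert into test values (  9, date '2017-09-21', 'file_9',   512 );
insert into test values ( 10, date '2017-09-21', 'file_10', 2048 );
commit;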

and Chris said...

You can find a similar problem discussed at:

https://asktom.oracle.com/pls/apex/asktom.search?tag=how-to-split-rows-into-balanced-sets-based-on-a-running-total-limited-to-2000

IMO the best solution uses match_recognize... which needs 12c.
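Something like this sketch (not part of the original answer; it assumes the test table above, orders by id rather than the file_name suffix used below, and treats 2048 as the 2GB limit). Each match greedily absorbs rows while the running sum of file_size within the match stays at or under 2048, and match_number() numbers the chunks:

select id, created_date, file_size, chunk_total, chunk_id
from   test
match_recognize (
  order by id
  measures
    sum(file_size) as chunk_total,  -- running total within the current chunk
    match_number() as chunk_id      -- each match is one chunk
  all rows per match
  pattern ( fits+ )
  define
    -- a row "fits" if adding it keeps the chunk's total <= 2GB
    fits as sum(file_size) <= 2048
);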

If you're on 11.2, you could generate the chunk windows with a connect by row generator in the with clause:

with data as (
  -- running total of file sizes, in file-name order
  select sum(file_size) over(order by to_number(substr(file_name, 6))) as running_total,
         to_number(substr(file_name, 6)) as ord,
         a.*
  from   test a
), row_generator as (
  -- one row per 2048-unit window: 1-2048 => chunk 1, 2049-4096 => chunk 2, ...
  select level as lvl,
         (level-1)*2048 + 1 as lo_lvl,
         level*2048 as hi_lvl
  from   dual
  connect by level <= (select ceil(max(data.running_total)/2048) from data)
)
select a1.id,
       a1.created_date,
       a1.file_size,
       b1.lvl as chunk_id,
       sum(a1.file_size) over(order by a1.ord) as running_tot
from   data a1
join   row_generator b1
on     a1.running_total >= b1.lo_lvl
and    a1.running_total <= b1.hi_lvl
order  by a1.id;

ID  CREATED_DATE          FILE_SIZE  CHUNK_ID  RUNNING_TOT  
1   21-SEP-2017 00:00:00  1024       1         1024         
2   21-SEP-2017 00:00:00  512        1         1536         
3   21-SEP-2017 00:00:00  512        1         2048         
4   21-SEP-2017 00:00:00  1024       2         3072         
5   21-SEP-2017 00:00:00  1024       2         4096         
6   21-SEP-2017 00:00:00  512        3         4608         
7   21-SEP-2017 00:00:00  512        3         5120         
8   21-SEP-2017 00:00:00  512        3         5632         
9   21-SEP-2017 00:00:00  512        3         6144         
10  21-SEP-2017 00:00:00  2048       4         8192  


Otherwise model. One note: model resets the running total per chunk, so it enforces the 2GB limit exactly; the windowed join above buckets rows by fixed 2048 boundaries of the overall running total, which gives the right answer for this data but can, in general, put more than 2048 into a single bucket:

select id, created_date, file_size, fs_tot, chunk_id
from   test
model
  -- address rows by their position in file-name order
  dimension by (row_number() over(order by to_number(substr(file_name, 6))) rn)
  measures (id, created_date, file_size, 0 fs_tot, 1 chunk_id)
  rules(
    -- fs_tot = running total within the current chunk; it restarts
    -- from the current file_size when adding the row would pass 2048
    fs_tot[any] = file_size[cv()] +
      case when fs_tot[cv()-1] + file_size[cv()] <= 2048
           then fs_tot[cv()-1] else 0 end,
    -- increment the chunk number exactly when the running total restarts
    chunk_id[rn>1] = chunk_id[cv()-1] +
      case when fs_tot[cv()-1] + file_size[cv()] <= 2048
           then 0 else 1 end
  );

ID  CREATED_DATE          FILE_SIZE  FS_TOT  CHUNK_ID  
1   21-SEP-2017 00:00:00  1024       1024    1         
2   21-SEP-2017 00:00:00  512        1536    1         
3   21-SEP-2017 00:00:00  512        2048    1         
4   21-SEP-2017 00:00:00  1024       1024    2         
5   21-SEP-2017 00:00:00  1024       2048    2         
6   21-SEP-2017 00:00:00  512        512     3         
7   21-SEP-2017 00:00:00  512        1024    3         
8   21-SEP-2017 00:00:00  512        1536    3         
9   21-SEP-2017 00:00:00  512        2048    3         
10  21-SEP-2017 00:00:00  2048       2048    4 
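
Neither result set includes the no_of_chunks column the question asked for. If you need it, a sketch is to wrap either query (here, the model one) and take the max chunk_id over all rows:

with chunked as (
  -- model query from above, unchanged
  select id, created_date, file_size, chunk_id
  from   test
  model
    dimension by (row_number() over(order by to_number(substr(file_name, 6))) rn)
    measures (id, created_date, file_size, 0 fs_tot, 1 chunk_id)
    rules(
      fs_tot[any] = file_size[cv()] +
        case when fs_tot[cv()-1] + file_size[cv()] <= 2048
             then fs_tot[cv()-1] else 0 end,
      chunk_id[rn>1] = chunk_id[cv()-1] +
        case when fs_tot[cv()-1] + file_size[cv()] <= 2048
             then 0 else 1 end
    )
)
select id, created_date, max(chunk_id) over () as no_of_chunks, chunk_id
from   chunked
order  by id;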


HT to Duke Ganote and Stew Ashton for these solutions.


Comments

Wow!! That's Awesome

A reader, September 21, 2017 - 1:17 pm UTC

Thanks a lot Chris, Duke Ganote and Stew Ashton for this!!
Looking at these SQLs, I feel like I don't even reach basic Level 1 of SQL... :)

In my 7 years of experience, I never understood how the MODEL clause (a great SQL feature) works.

If Chris/Connor visit India and I get a chance to meet them, I would like to understand this feature :)

Thanks a lot again!!
Chris Saxon
September 21, 2017 - 1:25 pm UTC

Model's neat, but match_recognize is much cooler. Put your effort into understanding that instead ;)
