Grouping and updating column with equal parts

Question and Answer

Chris Saxon

Thanks for the question, Chiran.

Asked: January 24, 2023 - 6:43 am UTC

Last updated: January 26, 2023 - 1:25 pm UTC

Version: 12c

Viewed 10K+ times

You Asked

Hi,

I have a table that holds about 2 million records. No column in the table can be used to divide the rows equally. When the data is first loaded into the table, the bucket_number column is null. After the data is loaded, I would like to divide the total number of records into equal parts and update the bucket_number column accordingly. Later, the buckets will be processed in parallel by another batch process. I could use NTILE to bucketize the data, but I want to persist the bucket number in the table.
What is the best way to bucketize the data and update the column in Oracle?

and Chris said...

What exactly is the bucket_number column and what do you use it for?

You can split rows into equal-sized buckets with NTILE. If there's no suitable column to sort by, you could sort by ROWID.

For example, this splits the rows into 10 buckets:

ntile(10) over ( order by rowid )
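In full, a sketch of such a query; the table and column names are assumptions based on the employee example that comes up later in this thread:

```sql
-- Sketch: assign each row of an assumed employee table to one of
-- 10 near-equal buckets, ordering by physical row address (ROWID).
select e.id,
       e.emp_id,
       e.dept,
       ntile(10) over ( order by rowid ) as bucket_no
from   employee e;
```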


Later the buckets will be processed parallelly by another batch process

Look into DIY parallelism with DBMS_parallel_execute. This includes procedures to automatically split the rows into N buckets (chunks) based on ROWID, a column value, or another query. It may also simplify what you're trying to do:

https://oracle-base.com/articles/11g/dbms_parallel_execute_11gR2

For more on row-bucketing methods, watch the recording of my Office Hours session on this topic.

Comments

Chiran P, January 24, 2023 - 7:47 pm UTC

Thanks Chris,

To illustrate it in a simple way, below is the original state of my table. Data is loaded into this table with a null bucket_no.
        ID  EMP_ID  DEPT  BUCKET_NO
----------  ------  ----  ---------
         1    2000  A
         2    2001  A
         3    2002  B
         4    2003  C
         5    2004  D

Expected state of the table after dividing the records into 3 (configurable) parts. They need not be perfectly equal, just near equal.

        ID  EMP_ID  DEPT  BUCKET_NO
----------  ------  ----  ---------
         1    2000  A             1
         2    2001  A             1
         3    2002  B             2
         4    2003  C             2
         5    2004  D             3

NTILE might give me the grouping in a SELECT, but I need to persist the bucket_no in the table, i.e. I need to update the bucket_no column for all 2 million rows. Some batch processes outside of Oracle (say, a Java application) use this bucket_no for parallel processing. Oracle does not have to process the rows after bucket_no is updated.

Is there any cost-efficient SQL to update all rows with bucket numbers?



Chiran P, January 25, 2023 - 5:12 am UTC

How will the piece of code below fare for updating over 2 million records? Is there any better way of doing it?

update employee f
set    bucket_no = (
         select mybucket
         from   ( select ntile(4) over ( order by id ) as mybucket,
                         id
                  from   employee ) e
         where  e.id = f.id
       );

Chris Saxon
January 25, 2023 - 1:23 pm UTC

Updating all the rows in a table can take a long time. Here are some options to consider:

Parallel DML


Use in-built database parallelism to run the update in parallel
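As a sketch, assuming the employee table and NTILE update from the comment above (the parallel degree of 8 is an arbitrary choice):

```sql
-- Parallel DML is disabled by default; enable it for the session first.
alter session enable parallel dml;

update /*+ parallel(f 8) */ employee f
set    f.bucket_no = (
         select mybucket
         from   ( select ntile(4) over ( order by id ) as mybucket, id
                  from   employee ) e
         where  e.id = f.id
       );

commit;
```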

dbms_redefinition.execute_update 


This is an optimized, online version of UPDATE; just paste in your update statement
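A sketch of the call, again assuming the employee update above (dbms_redefinition.execute_update is a newer API; check that your release supports it before relying on it):

```sql
begin
  -- The whole UPDATE statement is passed in as a string.
  dbms_redefinition.execute_update(
    q'[update employee f
       set    f.bucket_no = ( select mybucket
                              from   ( select ntile(4) over ( order by id ) as mybucket, id
                                       from   employee ) e
                              where  e.id = f.id )]' );
end;
/
```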

Create-table-as-select


Use create table to "update" the data:

create table tmp as 
  select .., ntile(N) over (...) bucket
  from ...


You can use dbms_redefinition to do this online; see

https://connor-mcdonald.com/2016/11/16/performing-a-large-correlated-update/
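Putting the create-table-as-select approach together for the employee example, a sketch (the drop-and-rename swap assumes you can take the table offline and recreate its dependent objects):

```sql
-- Build a replacement table with the bucket numbers precomputed.
create table employee_new as
  select e.id, e.emp_id, e.dept,
         ntile(4) over ( order by e.id ) as bucket_no
  from   employee e;

-- Swap the tables. Indexes, constraints, grants, and triggers
-- on the original table must be recreated on the new one.
drop table employee;
rename employee_new to employee;
```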

DBMS_parallel_execute


Use this to split the rows into equal-sized chunks and do the updates in parallel.
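A sketch of that approach for the employee update; the task name and chunk size are assumptions. Each chunk re-runs the NTILE subquery, so this trades extra reads for restartable, parallel writes:

```sql
begin
  dbms_parallel_execute.create_task( task_name => 'bucket_update' );

  -- Split the table into ROWID-range chunks of roughly 10,000 rows.
  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'bucket_update',
    table_owner => user,
    table_name  => 'EMPLOYEE',
    by_row      => true,
    chunk_size  => 10000 );

  -- :start_id and :end_id are bound to each chunk's ROWID range.
  dbms_parallel_execute.run_task(
    task_name      => 'bucket_update',
    sql_stmt       => 'update employee f
                       set    f.bucket_no = ( select mybucket
                                              from   ( select ntile(4) over ( order by id ) as mybucket, id
                                                       from   employee ) e
                                              where  e.id = f.id )
                       where  f.rowid between :start_id and :end_id',
    language_flag  => dbms_sql.native,
    parallel_level => 4 );

  dbms_parallel_execute.drop_task( task_name => 'bucket_update' );
end;
/
```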

Persist bucket number??

Narendra, January 25, 2023 - 9:46 pm UTC

NTILE might give me the grouping in a SELECT, but I need to persist the bucket_no in the table, i.e. I need to update the bucket_no column for all 2 million rows. Some batch processes outside of Oracle (say, a Java application) use this bucket_no for parallel processing. Oracle does not have to process the rows after bucket_no is updated.

Chiran,
You haven't said so explicitly, but I really hope that persisting bucket numbers is an original requirement with some justification (I believe you mentioned in the original post something like "I want to persist the bucket number").
You also state above that Oracle (i.e. database code) is not going to use the bucket number at all. While I understand that some batch processes outside the database need to process this table's data in near-equal-sized chunks, that does not necessarily mean the bucket number has to be STORED.
For example, if the outside batch process is only ever going to process all the data in this table once, you still don't need to persist the bucket number. All that is needed is to present the table's data with an extra column containing the bucket number, which can be achieved by creating a view over the table's columns and adding a derived column using the NTILE function.
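A sketch of such a view, assuming the employee table from this thread and 3 buckets:

```sql
-- Bucket numbers are computed on the fly; nothing is stored.
create or replace view employee_buckets as
select e.id, e.emp_id, e.dept,
       ntile(3) over ( order by e.id ) as bucket_no
from   employee e;
```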

I would be tempted to revisit the real requirement behind the bucket number column. Is the same batch process (or some other process) going to use this bucket number to create and store data in the database that must reconcile with the original source data? Or is someone worried about the case where the batch process fails without processing all the data and needs to be restarted, either from the beginning or resuming from the point of failure (after fixing the underlying cause)?

As someone rightly said, the best way to speed something up is to not do it at all. It would help if you could revisit and confirm that the requirement to persist the bucket number is valid.
Chris Saxon
January 26, 2023 - 1:25 pm UTC

Good points

Why buckets by row?

Stew Ashton, January 26, 2023 - 9:57 am UTC

You say you want the bucket numbers in order to do batch processing in parallel later on.

Batch processing is usually more efficient if you have buckets based on blocks, not on rows. DBMS_PARALLEL_EXECUTE allows you to divide a table into buckets based on blocks; you dictate the number of blocks per bucket, not the number of buckets.

If for some reason you want to dictate the number of buckets, see https://stewashton.wordpress.com/category/chunking-tables/
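For block-based chunks specifically, a sketch (with by_row => false, chunk_size is a number of blocks rather than rows; the task name and the value 128 are arbitrary choices):

```sql
begin
  dbms_parallel_execute.create_task( task_name => 'block_chunks' );

  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'block_chunks',
    table_owner => user,
    table_name  => 'EMPLOYEE',
    by_row      => false,  -- chunk_size counts blocks, not rows
    chunk_size  => 128 );
end;
/
```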
Chris Saxon
January 26, 2023 - 1:25 pm UTC

Thanks for sharing, Stew
