You Asked

Hello Tom,
I would need to prepare data for Oracle Datamining.
I have in a table about 60 000 data and I would need to randomly split data to the three groups A,B,C.

ideally:
Group A -> contains 50% of all data
Group B -> contains 25% of all data
Group C -> contains 25% of all data

My question is general: how is it possible to do it in Oracle Database?

and Tom said...

depends on how you want to approach "random" here.

would it be sufficient to take the first two rows and say that is group a, next one - group b, next one group c and so on.

ops$tkyte%ORA11GR1> select empno, case when mod(rownum-1,4) in (0,1) then 1 else mod(rownum-1,4) end grp from scott.emp;

     EMPNO        GRP
---------- ----------
      7369          1
      7499          1
      7521          2
      7566          3
      7654          1
      7698          1
      7782          2
      7788          3
      7839          1
      7844          1
      7876          2
      7900          3
      7902          1
      7934          1

14 rows selected.

or you could assign a random number to each row

ops$tkyte%ORA11GR1> select empno, case when rnd <= .5 then 1 when rnd <= .75 then 2 when rnd <= 1.0 then 3 end grp, rnd
  2  from (select /*+ materialize */ empno, dbms_random.value rnd from scott.emp )
  3  /

     EMPNO        GRP        RND
---------- ---------- ----------
      7369          1 .032468662
      7499          1 .133789451
      7521          1 .344913779
      7566          3 .974843528
      7654          3 .828559909
      7698          2 .529104438
      7782          2 .641904567
      7788          1 .027464852
      7839          3 .926926609
      7844          3 .793143602
      7876          1  .39843107
      7900          2 .682357302
      7902          3 .823775602
      7934          3 .911027465

14 rows selected.

or, if the groups can be overlapping, you can use sampling

http://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i79194

Or, you could use ntile, width_bucket, ora_hash

ops$tkyte%ORA11GR1> select empno, ntile(4) over (order by dbms_random.random) from scott.emp;

     EMPNO NTILE(4)OVER(ORDERBYDBMS_RANDOM.RANDOM)
---------- ---------------------------------------
      7844                                       1
      7902                                       1
      7782                                       1
      7934                                       1
      7499                                       2
      7900                                       2
      7876                                       2
      7839                                       2
      7788                                       3
      7369                                       3
      7566                                       3
      7521                                       4
      7654                                       4
      7698                                       4

14 rows selected.


ops$tkyte%ORA11GR1> select ora_hash(empno,3), empno from scott.emp;

ORA_HASH(EMPNO,3)      EMPNO
----------------- ----------
                0       7369
                0       7499
                3       7521
                3       7566
                2       7654
                0       7698
                2       7782
                3       7788
                2       7839
                3       7844
                0       7876
                1       7900
                1       7902
                2       7934

14 rows selected.

http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions214.htm#SQLRF06163

You'd have to be more specific on what you mean by random - or what is good enough to be random.

for 60,000 rows - dbms_random would be reasonable - that is teeny tiny.

Rating

(2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

amazing

jiri, November 13, 2009 - 8:42 pm UTC

I think you listed pretty much every single option ... I'm speechless.

two tiny add-ons (if I can)

1. you can use DBMS_CRYPTO instead of dbms_random (i dont think I should say it's "better" random generator right?) ;-)

2. if you have very large amount of data and you want to do quick split, you can use HASH partitioning and then select from a partition

November 15, 2009 - 3:19 pm UTC

1) yes.

2) yes.

additional info

Martin, November 16, 2009 - 5:11 am UTC

Hello Tom,
thank you for quick response.

Q: You'd have to be more specific on what you mean by random - or what is good enough to be random.

A: The each group must have same number of members, but the mebers must be distributed to the group randomly.

The background is, I have outlines from a cut mashine, each row is an unique record about mashine activities.
The goal is to find out by the help of datamining methods the information about improvement the capacity of production.
Therefore I need to split data to three groups uniformly and randomly.

Thank you,
Martin

November 23, 2009 - 9:34 am UTC

ok, I demonstrated a plethora of ways to do it, did any of them strike you as useful for your issue?

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

amazing

additional info

More to Explore

Analytics

I would need ramdomly split data in a table

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

amazing

additional info

More to Explore

Analytics