Question and Answer

Tom Kyte

Thanks for the question, Eugene.

Asked: October 24, 2004 - 9:38 pm UTC

Last updated: September 28, 2009 - 1:07 pm UTC

Version: 8.1.6

Viewed 10K+ times

You Asked

Hi Tom,
I came across one of the Oracle 10G built-in functions, ORA_HASH. I believe I can use it as I would DBMS_OBFUSCATION_TOOLKIT. Could you please give me an example of where this function would be most useful? Also, I can't find it in 8i or 9i. Does it exist only in 10G?

Thanks,
Eugene

and we said...

ora_hash is a new 10g function -- therefore, it is not available in 8i or 9i.


http://docs.oracle.com/cd/B14117_01/server.101/b10759/functions097.htm#sthref1515

suggests a use for it (to generate a semi-random subset of data for example).
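That sampling use can be sketched outside the database as well: hash each key into N buckets and keep one bucket, giving a repeatable ~1/N subset. A minimal sketch in Python, using zlib.crc32 as an illustrative stand-in for ORA_HASH (the function and bucket count are assumptions, not Oracle's implementation):

```python
import zlib

def in_sample(key, buckets=100, keep_bucket=0):
    """Deterministically keep ~1/buckets of all keys: same key, same answer."""
    return zlib.crc32(key.encode()) % buckets == keep_bucket

rows = [f"row-{i}" for i in range(10_000)]
sample = [r for r in rows if in_sample(r)]
# Roughly a 1% subset, and re-running selects exactly the same rows.
print(len(sample))
```

Because the hash is deterministic, the "random" subset is stable across runs, which is exactly what makes it useful for repeatable test extracts.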

In 8i, 9i -- you can use:

http://docs.oracle.com/cd/B14117_01/appdev.101/b10802/d_util.htm#1003647

to "hash"

Rating

  (34 ratings)


Comments

Be careful with hashes

Michael Friedman, December 19, 2004 - 8:23 pm UTC

You have to remember to be very careful with hashes.

For example, we're planning an application that will implement complex keys similar to EBS Flexfields. Since we don't know in advance which columns will be used, since there can be as many as 20, and since the concatenated value can be well over 2,000 bytes we will not use a concatenated unique index.

Instead we're going to hash the concatenated string, store that in a column, and do a unique index on it.

But we can't use Ora_Hash.

Remember the birthday paradox? If you have N people in a room, what are the odds that two share a birthday? It turns out the odds reach even roughly when N is on the order of sqrt(365) (about 23 people in practice).

Ora_Hash generates a value between 0 and 2^32-1. That means by the time we have on the order of 2^16 = 65,536 rows in our table we are near even odds of a hash collision preventing us from adding a valid row.

Therefore we're using MD5. It was designed to be cryptographically secure and is presumably slower than Ora_Hash, but it has a 128-bit output. That means even odds of a collision at around 2^64 ~= 1.8 x 10^19 rows... we can live with that.

It is also nice to know that the only collisions ever found with MD5 required serious cryptographic work and that they are collision attacks, not preimage attacks, meaning that it would take a wild coincidence for this to be a problem in our system. Since I am far more likely to win the lottery twice in a row I'm not worrying about it.

http://www.cryptography.com/cnews/hash.html
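The "even odds near the square root of the hash space" rule of thumb is easy to check numerically with the standard birthday approximation P ≈ 1 - exp(-n(n-1)/2H). A Python sketch (note the exact crossover sits near 1.18·sqrt(H), slightly later than sqrt(H)):

```python
import math

def collision_prob(n, space):
    """P(at least one collision) among n uniform draws from `space` values,
    via the approximation 1 - exp(-n*(n-1)/(2*space)).
    expm1 keeps precision when the probability is tiny."""
    return -math.expm1(-n * (n - 1) / (2.0 * space))

# 32-bit hash (ORA_HASH's full range): already ~39% at 2^16 rows
print(round(collision_prob(2**16, 2**32), 2))    # 0.39
# even odds arrives near 1.18 * sqrt(2^32), about 77,000 rows
print(round(collision_prob(77_000, 2**32), 2))   # 0.5
# 128-bit hash (MD5): negligible even at a billion rows
print(collision_prob(10**9, 2**128) < 1e-18)     # True
```

This matches the comment above: a 32-bit hash is unsafe as a uniqueness surrogate well before a few hundred thousand rows, while 128 bits leaves collisions below any practical concern.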

Tom Kyte
December 19, 2004 - 8:44 pm UTC

this is a binning function -- it hashes a given value into one of N bins.

Very few hashes are collision-free (there are "perfect hashes", but those take the input data into consideration) -- almost all hashes have collisions, purposely so -- it is part of their design.

so, you would not use an MD5 hash to select a random subset :)
but you could use ora_hash to do so.

Hash

hans, July 25, 2005 - 1:33 pm UTC

My requirement is to compare the records in two tables (say Table A and Table B) before flagging them for update or insert, for history reasons.
I am planning to write a Java stored procedure to compute a hash for the records in A and B and compare the hash values. If the hash values are not the same, the record goes into a history table.

I would appreciate any examples of a Java stored procedure doing hash comparison.
Thanks in advance

Tom Kyte
July 25, 2005 - 2:24 pm UTC

why not just use SQL and let SQL do the comparison? That is what databases were written to do: process data.

Comparing data in two tables

Mikito Harakiri, July 25, 2005 - 6:01 pm UTC

Wow! I was using DBMS_UTILITY.get_hash_value for quick table difference comparison (hinted by somebody that Microsoft has a hash-value-based table comparison). DBMS_UTILITY.get_hash_value was no contender due to the PL/SQL-SQL context switch. ora_hash is fast!


Test case:

create table A as
select obj# id, name from sys.obj$
where rownum < 100000;

create table B as
select obj# id, name from sys.obj$
where rownum < 100010;

--Symmetric difference:
select * from A
union all
select * from B
minus (
select * from B
intersect
select * from A
);

3.5 sec

select sum(DBMS_UTILITY.get_hash_value(id||name,
1,
POWER(2,16)-1)
)
from A
union all
select sum(DBMS_UTILITY.get_hash_value(id||name,
1,
POWER(2,16)-1)
)
from B;

2.8 sec

select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
from A
union all
select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
from B;

650 msec!

Tom Kyte
July 25, 2005 - 7:44 pm UTC

Yeah, but since the next question is typically "what is different?", I myself (assuming the two tables are on the same machine) would prefer just to ask that question:

ops$tkyte@ORA10G> /*
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> drop table a;
ops$tkyte@ORA10G> drop table b;
ops$tkyte@ORA10G>
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> create table A as
ops$tkyte@ORA10G> select obj# id, name from sys.obj$
ops$tkyte@ORA10G> /
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> create table B as
ops$tkyte@ORA10G> select obj# id, name from sys.obj$
ops$tkyte@ORA10G> /
ops$tkyte@ORA10G> */
ops$tkyte@ORA10G> set timing on
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
  2    from A
  3   union all
  4  select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
  5    from B;
 
SUM(ORA_HASH(ID||'|'||NAME,POWER(2,16)-1))
------------------------------------------
                                1646226153
                                1646240054
 
Elapsed: 00:00:00.16
ops$tkyte@ORA10G>
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> select id, name,
  2         count(src1) CNT1,
  3          count(src2) CNT2
  4     from
  5   ( select a.*,
  6          1 src1,
  7          to_number(null) src2
  8       from a
  9      union all
 10     select b.*,
 11          to_number(null) src1,
 12          2 src2
 13       from b
 14    )
 15   group by id, name
 16  having count(src1) <> count(src2)
 17  /
 
        ID NAME                                 CNT1       CNT2
---------- ------------------------------ ---------- ----------
     75766 B                                       0          1
 
Elapsed: 00:00:00.18
ops$tkyte@ORA10G>


 

hash

hans, July 25, 2005 - 8:45 pm UTC

Tom,
By comparison of records I mean: I would like to concatenate all the fields of a record in table 'A' and create a hash value of it, and similarly create a hash for the corresponding record in table 'B', given that both tables have the same structure and the record in both tables is selected by the same ID#.

How do I do this in a Java stored procedure for hash-code comparison?

I have oracle 8.1.7 environment
Thanks

Tom Kyte
July 25, 2005 - 8:51 pm UTC

question "why"

why not just use sql?

but interestingly, just see "above", there are hash functions shown there as well as the sql that shows you WHAT is different about the tables instead of just saying "yes, they are different"



Test case

Mikito Harakiri, July 25, 2005 - 9:01 pm UTC

By the hash value returned, I would guess that there are about 50K objects in your database. The difference between the methods becomes more pronounced as the data volume increases. In fact sort becomes the bottleneck, so that your query would slow down as n*log(n), while hash value comparison speed is proportional to n.

Tom Kyte
July 26, 2005 - 7:42 am UTC

Not necessarily (1,000,000 rows)

You are assuming it must sort, as things evolve, you'll find that to not be the case.

True, the hashing is a useful technique - I use it over dblinks. But many times, I use the compare table query technique - because the very very next question is typically "so, what was different"

For truly large datasets, the hashing is useful.

big_table@ORA10GR2> select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
2 from A
3 union all
4 select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
5 from B;

SUM(ORA_HASH(ID||'|'||NAME,POWER(2,16)-1))
------------------------------------------
3.2744E+10
3.2744E+10

Elapsed: 00:00:02.16
big_table@ORA10GR2>
big_table@ORA10GR2> select id, name,
2 count(src1) CNT1,
3 count(src2) CNT2
4 from
5 ( select a.*,
6 1 src1,
7 to_number(null) src2
8 from a
9 union all
10 select b.*,
11 to_number(null) src1,
12 2 src2
13 from b
14 )
15 group by id, name
16 having count(src1) <> count(src2)
17 /

ID NAME CNT1 CNT2
---------- ------------------------------ ---------- ----------
1 ICOL$ 0 1

Elapsed: 00:00:02.89
big_table@ORA10GR2> set autotrace traceonly explain
big_table@ORA10GR2> /
Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 737071912

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2418K| 83M| 2505 (16)| 00:00:31 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 2418K| 83M| 2505 (16)| 00:00:31 |
| 3 | VIEW | | 2418K| 83M| 2176 (4)| 00:00:27 |
| 4 | UNION-ALL | | | | | |
| 5 | TABLE ACCESS FULL| A | 1274K| 36M| 1090 (4)| 00:00:14 |
| 6 | TABLE ACCESS FULL| B | 1143K| 32M| 1086 (3)| 00:00:14 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(COUNT("SRC1")<>COUNT("SRC2"))


Hash value of table ?

Parag J Patankar, February 08, 2006 - 6:01 am UTC

Hi Tom,

I am using Oracle 9.2. Can I get a hash value of a complete table in a simple way, e.g. like cksum of a file in unix?

Is it possible in 10g ?

regards & thanks
pjp

Tom Kyte
February 08, 2006 - 8:13 am UTC

see example above, it shows how to do this.

select sum(hash(the_columns)) from the_table


dbms_utility - all releases, ora_hash, new 10g

Is it correct ?

Parag J Patankar, February 10, 2006 - 1:59 am UTC

Hi Tom,

To get hash value of table like "cksum" like unix I have written following script

column xx format 999,999,999,999,999,999,999,999,999

select sum(dbms_utility.get_hash_value(x1, 1, power(2,16)-1)) xx
from
(
select column_name||'||'x1 from user_tab_columns where table_name=upper('&1')
union
select 'x' x1 from dual
), &1
/

Is this correct way ?

kindly guide me.

regards & thanks
pjp

Tom Kyte
February 10, 2006 - 12:43 pm UTC

no, not sure what you were trying with a query against user_tab_columns and dual?

You sort of want to query YOUR tables - were you by any chance trying to write SQL to generate the SQL?

you would do something like:


select sum(dbms_utility.get_hash_value(c1||'/'||c2||'/'||...., 1, power(..)))
from T1
union all
select sum(dbms_utility.get_hash_value(c1||'/'||c2||'/'||...., 1, power(..)))
from T2

and if you got back two rows with different numbers - you would know they differ (you would have no clue "how", but you would "know" they did)
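The order-independence of this sum-of-hashes comparison can be illustrated outside SQL as well. A sketch in Python, with zlib.crc32 standing in for get_hash_value (the function and table data are illustrative assumptions):

```python
import zlib

def table_checksum(rows, sep="/"):
    """Order-independent table checksum: sum of per-row hashes,
    mirroring sum(get_hash_value(c1||'/'||c2||..., ...)) in the SQL above."""
    return sum(zlib.crc32(sep.join(map(str, r)).encode()) for r in rows)

t1 = [(1, "a"), (2, "b"), (3, "c")]
t2 = [(3, "c"), (1, "a"), (2, "b")]   # same rows, different order
t3 = [(1, "a"), (2, "x"), (3, "c")]   # one changed row

print(table_checksum(t1) == table_checksum(t2))  # True
print(table_checksum(t1) == table_checksum(t3))  # False
```

Summing makes row order irrelevant, which is why the SQL version needs no ORDER BY; the cost, as Tom notes, is that a mismatch tells you only that the tables differ, not how.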

get_hash_value

Parag J Patankar, February 13, 2006 - 2:46 am UTC

Hi Tom,

I was running following query

column xx format 999,999,999,999,999,999,999,999,999

select sum(dbms_utility.get_hash_value(x1, 1, power(2,16)-1)) xx
from
(
select column_name||'||'x1 from user_tab_columns where table_name=upper('&1')
union
select 'x' x1 from dual
), &1
/

To have a generic query to get the hash value of any table, I am using user_tab_columns to get all the columns of a table. I am using the dual table just to have a properly "concatenated" value for the last column.

Kindly tell me: is this SQL written in the "CORRECT" way (in terms of efficiency) to get a hash value of any table's data in a schema?

thanks & regards
pjp




Tom Kyte
February 13, 2006 - 8:31 am UTC

I already answered this. read above.

That query makes absolutely no sense to me whatsoever.

You are querying User_tab_columns.
You are querying dual.
You cartesian-join this to your table.
you hash x1 - which is some junk from user_tab_columns.
You never touch your table (except to use it as a way to generate tons of X1 values, all of which come from user_tab_columns)


It makes no sense. No.

I tried to show you the correct way right above.

It does not involve user_tab_columns.
It does not involve dual.


It *appears* you are trying to write SQL to write SQL (eg: run a query against user_tab_columns to generate a SQL statement) but you are very far away from that (that would take two queries: one to generate the SQL, another to be the SQL you just generated).

Murali

Murali, March 13, 2006 - 3:38 am UTC

Tom:

I have a table tb_curr_policy_ld with 4 million records. When I try to generate the hash value using dbms_utility it now takes an hour. Our current production DB version is 8.1.7.4.

Is there a better way to do the update process? The to_source_system column is already indexed; not sure why it takes 60 minutes.

Please suggest a better way to do this update:

update
tb_curr_policy_ld
set to_hash_value =
dbms_utility.get_hash_value(
to_pol_no ||
to_ben_no ||
to_source_system ||
to_ben_la_nric_no ||
to_ben_la_nric_type ||
to_payer_nric_no ||
to_payer_nric_type ||
to_assignee_nric_no ||
to_assignee_nric_type ||
to_addr_no ||
to_main_plan ||
to_pol_serv_br ||
to_pol_prem_due_date ||
to_pol_currency ||
to_pol_pay_mode ||
to_pol_pay_method ||
to_pol_tot_inst_prem ||
to_pol_tot_annual_prem ||
to_pol_next_so_ded_date ||
to_pol_bank_acc_no ||
to_servicing_agent ||
to_commission_agent ||
to_issue_agent ||
to_ben_type ||
to_ben_status ||
to_ben_status_date ||
to_ben_issue_date ||
to_ben_start_date ||
to_ben_cover_end_date ||
to_ben_prem_expiry_date ||
to_ben_cvr_year ||
to_ben_prem_year ||
to_ben_sum_assd ||
to_ben_inst_prem ||
to_ben_ann_prem ||
to_apl_os ||
to_loan_os ||
to_survival_ben_amt ||
to_survival_due_date ||
to_survival_ben_opt ||
to_net_cash_value ||
to_loan_value,
1000,
power(2, 16) - 1)
where to_source_system = 'LIFEASIA'

Tom Kyte
March 13, 2006 - 10:04 am UTC

why are you trying to create a hash value here -- perhaps you meant to do it as you loaded the data?

computing a hash is cpu intensive.

the dbms_utility package is not going to be fast when called 4million times. It is doing about 1,111 per second (not bad eh?)


If you have additional CPU resources on this machine, you could use do it yourself parallelism to facilitate doing this in parallel

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:10498431232211


Murali

A reader, March 14, 2006 - 5:00 am UTC

Tom:
Thanks for your quick response.This is a great site - very informative. I also read your book - Expert One-On-One. I've already recommended it to my team-mates. You are doing a great job, keep it up!

Let me come to the point. The reason we are creating a hash value is that every day we get a current-image policy flat file from the AS/400. On the AS/400 side they can't detect which policy details changed from day to day, so they always send all current policy records with the current image. We have to identify the updated and new records by comparing the current image table (tb_curr_policy_ld) and the previous image table (tb_previous_policy_ld).

The steps we do are:

1) truncate tb_previous_policy_ld
2) move tb_curr_policy_ld records into tb_previous_policy_ld
3) truncate tb_curr_policy_ld
4) load the current policy image AS/400 file into tb_curr_policy_ld using SQL*Loader
5) generate the hash value in tb_curr_policy_ld
6) compare tb_curr_policy_ld and tb_previous_policy_ld using two key columns: policy_number and hash value.

If the hash_value is the same in tb_curr_policy_ld and tb_previous_policy_ld then there is no change in the policy record; if the hash_value changed then we assume the policy record changed.

We use the following update statement to set the processed indicator to 'X' (i.e. no need to process: the hash value is the same yesterday and today):

update tb_curr_policy_ld
set to_processed_ind = 'X'
where to_source_system = v_ip_srce_sys
and exists( select 1
from tb_last_policy_ld
where ye_pol_no = to_pol_no
and ye_ben_no = to_ben_no
and ye_hash_value = to_hash_value
and ye_source_system = to_source_system);

and we select the records where to_processed_ind is null to process the new and updated records.

The problem we are facing is that generating the hash value takes 1 hour to update all records.

We have only one CPU, so even when we use parallel hints there is not much improvement.

Thanks in advance

Tom Kyte
March 14, 2006 - 10:57 am UTC

why not just update a join?


update ( select columns from existing table,
columns from new table
where JOIN
and <at least one column is different> )
set columns in existing table = columns from new table;



A reader, March 15, 2006 - 12:54 pm UTC

Hi Tom, one question: I was reading in this thread about the possibility of getting duplicate hash values, and I don't understand it.
Could you explain that, and the risks of using hash values and getting duplicates?
Please :)

Tom Kyte
March 15, 2006 - 5:38 pm UTC

two strings can possibly hash to the same value - it is called a hash collision. It happens when you take an infinite set of values (all strings) and perform a function that maps them into a finite set (the size of your hash table)


There are no "risks", there is just the "fact"
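A tiny demonstration of that fact: map more distinct strings into a bin count than there are bins, and a collision is forced by the pigeonhole principle. A Python sketch, with zlib.crc32 as an illustrative stand-in for a binning hash like ora_hash:

```python
import zlib
from collections import defaultdict

def bucket(s, n_buckets=16):
    """Map any string into one of n_buckets bins, ORA_HASH-style."""
    return zlib.crc32(s.encode()) % n_buckets

bins = defaultdict(list)
for word in (f"value-{i}" for i in range(17)):  # 17 values, only 16 bins
    bins[bucket(word)].append(word)

# Pigeonhole: at least one bin must now hold two different strings.
print(any(len(b) > 1 for b in bins.values()))  # True
```

In practice collisions appear long before the bin count is exceeded (the birthday effect discussed earlier in this thread); the pigeonhole case is just the point at which they become unavoidable.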

A reader, March 15, 2006 - 4:44 pm UTC

The second question is: when would you suggest using that hash function on an Oracle database? Some ideas please :)

Tom Kyte
March 16, 2006 - 7:36 am UTC

whenever you need to in your applications? I mean - hashing is just a "computer science 101" sort of thing.

You might take a user's password for your own custom authentication scheme and "hash it", so you don't store a clear-text password.

actually, this page here seems to have some ideas when you might use them too.
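As a hedged illustration of that password idea (not the scheme Tom describes, and modern practice rather than anything in the database releases discussed here): store a salted, deliberately slow hash instead of the clear-text password. A Python sketch:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest): a salted, deliberately slow hash of the password."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def check_password(password, salt, stored):
    """Recompute with the stored salt and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], stored)

salt, stored = hash_password("s3cret")
print(check_password("s3cret", salt, stored))  # True
print(check_password("wrong", salt, stored))   # False
```

The salt prevents precomputed-table attacks and the iteration count slows brute force; the one-way property of the hash is what lets you verify a password without ever storing it.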

A reader, March 16, 2006 - 8:18 am UTC

Thanks Tom

what else is different ?

Sokrates, January 02, 2007 - 7:03 am UTC

ora_hash is deterministic
dbms_utility.GET_HASH_VALUE is not

When to use ora_hash vs MD5

ozjbrown, January 04, 2007 - 5:53 pm UTC

Hi Tom,

Could you provide me with some guidelines as to when to use ora_hash or MD5? We update a large # of records into our data warehouse. We would like to determine what records are changed. The source table has an update_date column but it is highly unreliable as it is not consistently stamped. So in lieu of this column, I thought about using the ora_hash after reading your comments. However, someone is recommending MD5 as ora_hash has a limitation of 65,000 records.

If I use ora_hash on the table (approximately 500,000 to 1,000,000 changed records), would it help to avoid collisions if I use the sum with the ora_hash and compare that value to the hash value on the source table where I join the two tables through the primary key columns?

Or is MD5 the way to go? What is the best approach (in terms of speed, reliability in obtaining changed records, etc)?

Thanks


Tom Kyte
January 05, 2007 - 9:44 am UTC

md5 has a much larger "namespace", it would make sense to use that to avoid false positives and potential misses (two different rows hashing to the same value)

both are still possible with md5, but much less likely.

Ora_hash limitations

Michael Friedman, January 05, 2007 - 11:53 am UTC

I'm the guy who mentioned the limitation... but it's a bit more complicated than that.

If you're familiar with the birthday paradox, you know that if you select N random numbers from a possible set of M total numbers, then you have roughly a 50% chance of two of those N numbers being the same when N is on the order of sqrt(M).

That means that since the ora_hash hash is at most 32 bits, if you choose around 2^16 different random values and apply ora_hash to them, you have roughly a 50% chance of a collision - a false positive.

For many applications that's not acceptable - if you have 300,000 values and you are looking for any duplicate pairs this is probably not acceptable.

In that case, you need a hash function which is less likely to generate collisions.

You could play around with ora_hash (for example, ora_hash(value) || ora_hash(value || 'x') ) but without knowing the details of the ora_hash algorithm you have no idea if this approach will give you significantly better results.

The other alternative is a hash like MD5 which is 128 bits. Then it takes 2^64 values before you have a 50% chance of a false collision. For most people's data volumes it is far more likely that they will drop dead, their computer will explode, and their Oracle database will corrupt all their data and their backups will not be recoverable than that they will get a collision with 128 bits. In other words, you have far worse and more likely problems to worry about than a collision.

ora_hash

ozjbrown, January 10, 2007 - 10:07 pm UTC

I am confused on how I would get a false positive.

If my tables contain 5 million rows each and I am trying to find the differences (what records have been inserted and/or updated), could I not use the primary key column to help in the checksum comparison -

select col1, sum(ora_hash(col1||'|'|| col2||'|'|| col3||'|'||col4||'|'||col5||'|'||coln, POWER(2,16)-1))
from A
minus
select col1, sum(ora_hash(col1||'|'|| col2||'|'|| col3||'|'||col4||'|'||col5||'|'||coln, POWER(2,16)-1))
from B

In the sql above, col1 is the primary key in table A and B and there are 30 columns in each table that will be concatenated in calculating the checksum.

So with the above sql using ora_hash, would I still get a false positive? The reason that I am asking is that from the thread on ora_hash, this 10g function is much faster than md5. Is this true?

Thanks again!

hash

ozjbrown, January 25, 2007 - 8:23 pm UTC

Hi Tom,

Still waiting on a response on my last query. Hope you can clarify my confusion.

Thanks

What chance of clash

Richard, February 15, 2007 - 12:24 pm UTC

Thanks to Michael, as well as to Tom, for clarifying the use of the larger hash size of MD5.

I have two follow up questions:
(a) To turn this question on its head, the question that many of us need to answer is "given n rows and using MD5, what are the chances of a collision?" In other words, if we have 10m rows and this gives (say) a 1-in-1m chance of a collision, we can assess whether the risk is acceptable. Is there a formula for this?
(b) Am I right in thinking that the nature of the hashing is such that the input values have no bearing on the output, e.g. if two input values differ only slightly, the hashed values are no more likely to be the same?
Tom Kyte
February 16, 2007 - 12:44 pm UTC

a) md5, with a 128-bit output, has 3.4x10^38 combinations. The chance of a collision is rather small (but exists).

b) yes - google around on md5 and you'll get a feel for what it does and how it works.

Collisions

Michael Friedman, February 17, 2007 - 12:07 am UTC

Ozjbrown,

How big are your tables? 10 records each? No worries - a false hit is wildly unlikely. 2^32 records each? Collisions are guaranteed even with the same table.

Richard,

If you are matching two sets of size M and N using hash values and your question is what are the odds that a random value in N will have the same hash as a random value in M and your hash space is size H then the odds are:

1/H - chance that a random pair from M and N match

1 - 1/H - chance that a random pair from M and N do not match

(1 - 1/H) ^ M - chance that a random value from N does not match any value in M

(1 - 1/H) ^ (M x N) - chance that no value from N matches any value from M

So if you're using OraHash, H is 2^32.

If your tables are 1MM records each ~= 2^20 and you think it is reasonable that there are 16 extra rows in table 2 then M = 2^20 and N = 2^4.

So (1 - 1/2^32) ^ (2^24) and you have a .39% chance of a false match for one of the extra rows. That might be acceptable.

But if you have 256 extra rows and 16 million records then your odds of a false match go up to 63%.

With MD5 the odds would still be considerably smaller than your chance of winning the lottery several times in a row.
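Those two figures can be checked directly from the formula above. A quick Python sketch:

```python
# Hash space H; chance that at least one of N extra rows falsely matches
# one of M hashed rows is 1 - (1 - 1/H)**(M*N), per the derivation above.
H = 2**32

def false_match_prob(m, n, space=H):
    return 1.0 - (1.0 - 1.0 / space) ** (m * n)

# 2^20 rows with 16 extras: about 0.39% chance of a false match
print(round(false_match_prob(2**20, 2**4) * 100, 2))  # 0.39
# 16M rows with 256 extras: M*N = 2^32 = H, so about 1 - 1/e ~ 63%
print(round(false_match_prob(2**24, 2**8) * 100))     # 63
```

The second case makes the scaling concrete: once M x N reaches the size of the hash space, a false match is more likely than not.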


ora_hash

ozjbrown, April 04, 2007 - 9:22 am UTC

Hi Michael,

The two tables have over 14 million rows each. If you notice, I use the primary key to group the ora_hash value calculated over the remaining columns. For example, table A and table B have 4 columns, with column1 being the primary key for both tables. Table A is the current snapshot of the data while Table B is the previous snapshot. So I was using the following SQL to determine if a record changed in Table A since Table B -

select column1, sum(ora_hash(column2 || '|' || column3
|| '|' || column4
,power(2,16)-1))
from A
group by column1
minus
select column1, sum(ora_hash(column2 || '|' || column3
|| '|' || column4
,power(2,16)-1))
from B
group by column1

Would it be faster this way or doing a straight column compare?

Regards,

jb
Tom Kyte
April 04, 2007 - 10:48 am UTC

when you timed it, what did you observe?

the answer will always be "it depends" - but for just 3 columns - the expense of hashing might outweigh the speed of compares

but actually, in looking at your example - we cannot compare your query to comparing column values - you are summing a hash over column1 - it would be very different from comparing columns

How to unhash

Vipin, August 24, 2007 - 10:32 am UTC

Tom,

Is there a method to "unhash" the hashed value returned by the ora_hash function?


Tom Kyte
August 27, 2007 - 3:20 pm UTC

no, the hashing is a polymorphic function, not isomorphic

Man, I've been waiting 21 years to say that :)

hashes go "one way" since two different strings can hash to the same value.

is ORA_HASH what Oracle uses internally with hash partitioning?

Adam Musch, December 07, 2007 - 9:36 am UTC

I'm looking to parallelize a process that includes a range-hash partitioned table. The process is driven by the range-partitioning column, and I'd like to parallelize it by the hash column without getting into the partitioning nomenclature. If ORA_HASH were used for both, then ORA_HASH(hash_partitioning_column, number_of_hash_partitions) = [thread] would be perfect.

A new kind of ORDER BY

Andrew, January 24, 2008 - 1:21 am UTC

I wish Oracle had a kind of a special HASH BY statement which would perform ordering, or I should say clustering by a hash value of a combination of columns. To illustrate say if you had

Column | A | B |
-------------
| 3 | 10 |
| 5 | 20 |
| 5 | 15 |
| 3 | 50 |
| 15 | 100 |
| 5 | 25 |

You could do a HASH BY column A, which would produce something like
| 3 | 50 |
| 3 | 10 |
| 15 | 100 |
| 5 | 20 |
| 5 | 15 |
| 5 | 25 |
Effectively, all the distinct values of column A would get grouped with each other, but there's no particular ascending or a descending order.

I'd imagine this should be quite easy to implement, and as opposed to O(NlogN) for a normal ORDER BY this would have a complexity of O(N) just like for any hash-based operations.


This would be quite useful in many real-life situations, where an exact ORDER BY is not a strict requirement.
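The proposed HASH BY is essentially one bucketing pass through a dictionary: O(N), rows with equal keys come out contiguous, but the buckets themselves appear in no sorted order. A Python sketch of the idea (the data is the example from the comment above):

```python
from collections import defaultdict

rows = [(3, 10), (5, 20), (5, 15), (3, 50), (15, 100), (5, 25)]

# Single O(N) pass: bucket each row by the hash of column A.
buckets = defaultdict(list)
for row in rows:
    buckets[hash(row[0])].append(row)

# Flatten: equal A-values are now contiguous, but there is no
# ascending/descending order across buckets.
clustered = [row for bucket in buckets.values() for row in bucket]
print(clustered)
```

This is the same trick a HASH GROUP BY plan uses internally: grouping without the n*log(n) cost of a sort, at the price of giving up any meaningful output order.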

compare sysdate with update_date column in table

A reader, April 30, 2008 - 12:23 am UTC

Tom,

We have an ETL job which performs a DELTA load every day by comparing sysdate with update_date in the table.

What happens is, when we compare SYSDATE with MV_UPDATE_DATE, it sometimes returns 0 rows as the date timestamp does not match sysdate.

Is there an efficient and correct way to compare updated_date in the table with SYSDATE? If we use trunc(update_date)=trunc(sysdate), then it does not use the index on update_date.

our code compare sysdate like this:
===================================
...
WHERE MS_CLIENT_REQUEST.MV_UPDATE_DATE=SYSDATE
AND MS_JOB.MV_UPDATE_DATE=SYSDATE
AND MS_MASTER_TASK_CATALOG.MV_UPDATE_DATE=SYSDATE
AND ASSIGNED_TO.MV_UPDATE_DATE=SYSDATE
AND MS_SERVICE_REQUEST.MV_UPDATE_DATE=SYSDATE
AND MS_TASK.MV_UPDATE_DATE=SYSDATE
AND MS_WORKGROUP.MV_UPDATE_DATE=SYSDATE
Tom Kyte
April 30, 2008 - 10:05 am UTC

where update_date >= trunc(sysdate) and update_date < trunc(sysdate+1)

gets everything from "today"

A reader, April 30, 2008 - 12:27 am UTC

Sorry Tom, I posted this question under an irrelevant original question. Please delete it. I will post it under the Delta processing original question.

Insert based on ora_hash

reader, January 07, 2009 - 9:38 pm UTC

Tom,

I have two tables, stage and target. The stage table has an extract of all data that changed in the transactional system within the last 24 hours. I am trying to insert into the target table only the data that changed in certain columns. How can I insert into the target table the rows identified by the result set of this query?

select * from
(
select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
from stage
minus
select sum(ora_hash(id||'|'||name, POWER(2,16)-1))
from target
)

Thank you in advance
Tom Kyte
January 08, 2009 - 9:02 am UTC

that query identifies nothing


select sum(number) from stage - returns a single row
select sum(number) from target - returns a single row


in order to determine if a column has changed, you need to be able to join it to another row to compare it to.

So, what is the JOIN criteria for joining stage to target.

It will be

insert into target
select stage.*
  from stage, target
 where stage.KEY = target.KEY(+)
   and (target.KEY is null /* new row */ OR
        decode(stage.c1,target.c1,1,0) = 0 OR  /* one of the columns */
        decode(stage.c2,target.c2,1,0) = 0 OR  /* was changed */
        ... )



decode is nice to use in this case since it is "NULL friendly"


for example:

ops$tkyte%ORA10GR2> create table stage
  2  as
  3  select object_id, object_name, owner, object_type
  4    from all_objects;

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> create table target
  2  as
  3  select * from stage;

Table created.


our join key is object_id, we are interested in new rows in stage or rows in stage such that object_name OR object_type were modified (but not owner, we don't care about that one)

So, we modify some stage data:

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> update stage set object_name = 'x'||substr(object_name,2) where object_id between 0 and 10;

9 rows updated.

ops$tkyte%ORA10GR2> update stage set owner = 'y'||substr(owner,2) where object_id between 100 and 110;

11 rows updated.

ops$tkyte%ORA10GR2> update stage set object_type = 'z'|| substr(object_Type,2) where object_id between 200 and 210;

9 rows updated.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> insert into stage (object_id,object_name,owner,object_type) values (0,'x','y','z');

1 row created.


we expect 9+9+1 = 19 rows and sure enough:

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select stage.*
  2    from stage, target
  3     where stage.object_id = target.object_id(+)
  4        and (target.object_id is null /* new row */ OR
  5             decode(stage.object_name,target.object_name,1,0) = 0 OR  /* one of the columns */
  6                     decode(stage.object_type,target.object_type,1,0) = 0 )
  7  /

 OBJECT_ID OBJECT_NAME                    OWNER                          OBJECT_TYPE
---------- ------------------------------ ------------------------------ -------------------
         3 x_OBJ#                         SYS                            INDEX
         9 x_FILE#_BLOCK#                 SYS                            INDEX
         7 x_TS#                          SYS                            INDEX
         6 x_TS#                          SYS                            CLUSTER
         5 xLU$                           SYS                            TABLE
         8 x_FILE#_BLOCK#                 SYS                            CLUSTER
        10 x_USER#                        SYS                            CLUSTER
         2 x_OBJ#                         SYS                            CLUSTER
         4 xAB$                           SYS                            TABLE
       200 I_NTAB2                        SYS                            zNDEX
       201 I_NTAB3                        SYS                            zNDEX
       202 REFCON$                        SYS                            zABLE
       203 I_REFCON1                      SYS                            zNDEX
       204 I_REFCON2                      SYS                            zNDEX
       205 OPQTYPE$                       SYS                            zABLE
       206 I_OPQTYPE1                     SYS                            zNDEX
       207 VIEWCON$                       SYS                            zABLE
       210 I_VIEWCON1                     SYS                            zNDEX
         0 x                              y                              z

19 rows selected.

As for "ora_hash is deterministic" (see above)

Max, June 02, 2009 - 2:02 am UTC

As of 10gR2, "ora_hash" is NOT deterministic -- as documented by internal bug 5204050 ("ORA_HASH IS UNDETERMINISTIC W/ LONG COLUMNS"), there are datatype restrictions, including LOB, LONG and
UDT (user-defined datatypes, unless they are of nested table type).
Tom Kyte
June 02, 2009 - 7:35 am UTC

it was a bug that longs and lobs were accidentally allowed to be passed; the documentation was updated to reflect that.

ora_hash is deterministic
ora_hash is to be used only on the base scalar types.

ORA_HASH can be used as a fast alternative to MD5

Jeroen, August 12, 2009 - 5:56 am UTC

In my opinion it is fairly easy to reach much higher precision just by concatenating several ORA_HASH results with different seeds. The second argument is the bucket size, which defaults to 2^32-1, so about 10 decimal digits. The third argument sets the seed. I use the following code to get more precision:
ORA_HASH(expression, 999999999, 0) * 1000000000 +
ORA_HASH(expression, 999999999, 1)

This results in an 18-digit hash value, about 2^60. Just by adding two more hashes with seeds 2 and 3, you can reach the maximum precision of a NUMBER (2^128, 38 digits).
Every call to ORA_HASH costs extra time, but I think it will cost less than the context switch to PL/SQL that MD5 requires.
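The collision trade-off behind this can be checked with the standard birthday-bound approximation. A quick Python sketch (the function name is ours, for illustration only):

```python
import math

def birthday_collision_prob(bits, n):
    """Approximate probability of at least one collision among n values
    hashed uniformly into a 2**bits space: 1 - exp(-n*(n-1) / 2**(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2.0 ** (bits + 1))

# A single 32-bit ORA_HASH reaches roughly even odds of a collision
# at only about 2**16 = 65,536 rows:
print(birthday_collision_prob(32, 2**16))   # ~0.39

# Two concatenated hash results (~60 bits combined) make a collision
# at the same row count vanishingly unlikely:
print(birthday_collision_prob(60, 2**16))   # ~2e-9
```

This is the same birthday-paradox arithmetic raised earlier in the thread as the reason a 32-bit hash is unsafe as a surrogate unique key at scale.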


Tom Kyte
August 13, 2009 - 11:43 am UTC

...but I think it will cost less than the PL/SQL switch ...

we don't have to "think" or "hypothesize" about that - we can verify and validate.

Also, MD5 has lots of attributes/qualities about it (cryptology speaking) which this would not have.



Use nested table type instead of concatenation

Jeroen, August 12, 2009 - 6:14 am UTC

I ran into a problem when I used the ORA_HASH function as presented in several of the examples above:
ORA_HASH(
  column1 || '-' ||
  column2 || '-' || 
  ...
  columnN
)

The result of this expression can exceed 4,000 characters, which raises 'ORA-01489: result of string concatenation is too long'.
To solve this problem, I used a nested table type:
CREATE TYPE columns_as_tab IS TABLE OF VARCHAR2(4000);

And then:
ORA_HASH(
  columns_as_tab(
    column1
  , column2
  , ...
  , columnN
  )
)
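A side benefit of passing a collection instead of a separator-joined string is that column boundaries are preserved. With plain concatenation, two different rows can hash identically whenever the separator character also occurs in the data. A Python sketch of the pitfall (MD5 stands in for any hash function here, and the helper names are ours):

```python
import hashlib

def hash_concat(cols, sep='-'):
    """Hash a separator-joined concatenation, like col1 || '-' || col2 || ..."""
    return hashlib.md5(sep.join(cols).encode()).hexdigest()

def hash_framed(cols):
    """Hash with explicit length framing so column boundaries survive --
    analogous to hashing a collection type rather than one long string."""
    framed = ''.join('%d:%s' % (len(c), c) for c in cols)
    return hashlib.md5(framed.encode()).hexdigest()

# Two distinct rows collide under naive concatenation because the
# separator character also occurs inside the data:
assert hash_concat(['a-b', 'c']) == hash_concat(['a', 'b-c'])

# Length framing keeps them distinct:
assert hash_framed(['a-b', 'c']) != hash_framed(['a', 'b-c'])
```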

ora-hash

Sanjiv, September 08, 2009 - 10:57 am UTC

Is there a way to generate one hash value for one column of table data? I mean, for an employee table, can we get a single hash value for a column (emp_name) across all records?

I have 2,500 tables; after an application patch I have to check whether the patch has changed any data inside user tables. Comparing each row of each table will take a lot of time.

Please suggest (the Oracle version is 9.2.0.8).
Tom Kyte
September 08, 2009 - 9:29 pm UTC

what would you be comparing to? You say "after application patch i have to check if patch has changed any data inside user tables." - where will the original data be?

why not just enable auditing - then you can ask "what tables were modified"?

Sanjiv, September 22, 2009 - 4:35 am UTC

let me explain with an example: we have a database for an ERP application, and every time we apply a patch/bug fix we need to validate afterwards that the patch/fix has not touched/changed user data. We have not yet developed any script/process to perform this data validation check.

As we have the same database to validate (pre and post), we are looking for a better option to validate the data. Right now the suggestion is to spool the data to ASCII files before the patch, spool the same table data after the patch is applied, and compare.

That process will take a lot of time, as the user record count is around 1.7 billion rows (summed over all tables). We are also looking at random sample data collection, but nothing has been finalized because of the data volume.


Tom Kyte
September 28, 2009 - 1:07 pm UTC

... fix we need to validate after patch/fix that patch has
not touched / changed user data. ...

turn on auditing, if the patch/fix touched data - an AUDIT TRAIL will tell you.

flashback query can be used to see what the data USED to be if you need - after reviewing the audit trail

fine grained auditing can capture not only "this table was modified" but the SQL and the inputs to the SQL that modified the table.

there you go. *SAME ANSWER* as before


Sanjiv, October 01, 2009 - 9:09 am UTC

Thank You Tom

Non-deterministic for LONG datatypes

Jay, October 04, 2009 - 4:15 pm UTC

On columns having the LONG datatype, ora_hash does not yield consistent values.

For example:
SQL> select ora_hash(trigger_body) from user_triggers where trigger_name='REPCATLOGTRIG';
             904287141
SQL> select ora_hash(trigger_body) from user_triggers where trigger_name='REPCATLOGTRIG';
             904287141
SQL> select ora_hash(trigger_body) from user_triggers where trigger_name='REPCATLOGTRIG';
            1354605623
SQL> select ora_hash(trigger_body) from user_triggers where trigger_name='REPCATLOGTRIG';
             758348682

Unfortunate, because it looks like the alternatives are a lot more complicated.