Fastest way to diff a table from two databases


Question and Answer

Tom Kyte

Thanks for the question, jc.

Asked: May 28, 2005 - 5:09 pm UTC

Last updated: July 09, 2008 - 6:29 pm UTC

Version: 10.1.0.3

Viewed 10K+ times

You Asked

Tom,

To diff the contents of the same table on two databases:

Method 1: (the old/easy way)

select * from tab minus select * from tab@remote_db;
select * from tab@remote_db minus select * from tab;

Method 2: I learned from you; this is faster:

select count(1) from
(select 1, count(src1),count(src2)
from (select a.*, 1 src1, to_number(null) src2 from tt a
union all
select b.*, to_number(null) src1, 2 src2 from tt@optical1 b)
group by
OWNER,OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,
CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,TEMPORARY,GENERATED,SECONDARY
having count(src1) <> count(src2)
);

How frequently is the table diff done?
-- Only intermittently, mostly in maintenance hours. There can be up to 200 tables being compared.

How wide are the rows?
-- Most tables have fewer than 20 columns. Fewer than 10 tables have more than 100 columns, and in those not all columns are filled with data.

How many rows are in these tables?
-- About 10 tables may have 1-2 million rows; all others are less than 250K rows.

Question: Do you know of any faster way to diff tables than your Method 2? (Thanks!!!)

------------------------

This is my test case:

On remote_db:

10g> create table tt as select * from dba_objects;

On local_db:

10g> create table tt as select * from tt@remote_db;
10g> update tt set status='local' where object_id=28;

On remote_db:

10g> update tt set status='remote' where object_id=48;

Then I compare the difference with this:

10g> variable start number;

exec :start := dbms_utility.get_time;

select count(1) from
(select 1, count(src1),count(src2)
from (select a.*, 1 src1, to_number(null) src2 from tt a
union all
select b.*, to_number(null) src1, 2 src2 from tt@remote_db b)
group by OWNER,OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,DATA_OBJECT_ID,
OBJECT_TYPE,CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,TEMPORARY,
GENERATED,SECONDARY
having count(src1) <> count(src2)
);

exec dbms_output.put_line('NEW:'||(dbms_utility.get_time-:start) || ' hsecs' );

prompt ============== MINUS ===================

exec :start := dbms_utility.get_time;

select count(1) from (select * from tt minus select * from tt@remote_db) ;
select count(1) from (select * from tt@remote_db minus select * from tt) ;

exec dbms_output.put_line('OLD:'||(dbms_utility.get_time-:start) || ' hsecs' );


and Tom said...

questions for you --

o how frequently do you need to do this.
o how wide are the rows.
o how many rows are there.

Ok, there are two answers. I know of nothing "faster" to compare two arbitrary tables than the group by trick.

However, if I knew in advance I'd be wanting to do this over and over, and I owned the table, I might store a hash with each row similar to what I've stated here:

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:41742134580019

Now, the query to find "possible duplicates" (false positives are extremely unlikely) becomes very easy, we ship very little data over the network (far less than the raw data itself), and the hashes become a key to find the differing rows for display purposes.
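(Purely as an illustration of "store a hash with each row" -- a minimal sketch, assuming a made-up ROW_HASH column and trigger; the linked question describes the actual approach in detail:)

alter table tt add ( row_hash raw(16) );

create or replace trigger tt_hash_biu
before insert or update on tt
for each row
begin
   -- only a few of TT's columns are shown; the real thing must cover every
   -- column, in a fixed order, with dates in a fixed format
   :new.row_hash := dbms_crypto.hash(
       utl_raw.cast_to_raw(
           :new.owner     || '/' || :new.object_name || '/' ||
           :new.object_id || '/' || :new.status      || '/' ||
           to_char(:new.created,'yyyymmddhh24miss') ),
       dbms_crypto.hash_md5 );
end;
/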

If the network were the bottleneck, and CPU was freely available, it might even be worthwhile to have a view on each site that just returned the hash (a PL/SQL pipelined function that returns hashes from the table) -- that way we could "compare" but not have to send tons of data back and forth.


ops$tkyte@ORA10G> create or replace type myTabletype as table of raw(16)
2 /

Type created.

ops$tkyte@ORA10G>
ops$tkyte@ORA10G> create or replace function hash_emp_row return myTableType
2 PIPELINED
3 is
4 l_data long;
5 begin
6 for p_rec in ( select * from emp )
7 loop
8 l_data := p_rec.empno || '/' || p_rec.ename || '/' ||
9 p_rec.job || '/' || p_rec.mgr || '/' ||
10 to_char(p_rec.hiredate,'yyyymmddhh24miss')||'/'||
11 p_rec.sal || '/' || p_rec.comm || '/' || p_rec.deptno;
12
13 pipe row( dbms_crypto.hash( src => utl_raw.cast_to_raw(l_data), typ=>dbms_crypto.hash_sh1 ) );
14 end loop;
15 return;
16 end;
17 /

Function created.

ops$tkyte@ORA10G>
ops$tkyte@ORA10G> select * from table( hash_emp_row() );

COLUMN_VALUE
--------------------------------
938038876BA25B11B04E2C049AC5C2A2
CFC778EDE45768BEB61B167592B96F84
8C86638EDDD66C1E87CC4B5F543A3C58
979BF6885C038CAA467988806BC17ABB
D3CA2C3A8AADB09F6E2AFD9C8C908E7E
75F0F6F868387E36ED4EE822A9E1AC5D
D3ED8DB9B3A6089195ABEACFC94F1CC6
68AACD0483116F4FACCC47E2FD2F4574
93FB3DA7CCC7D56F71E7E42CC2BB90DD
EF70145942A55E5469621EBDC409D373
37D8CD7AFA0BC0F89E3F62F9EBCF7E3A
82AC70476C9E28A648166E88BFA246DB
598385613321685FF202CC20B3C86205
92E16DE14F8C430E7F06436D54B9E466

14 rows selected.


That uses the SHA-1 hash instead of MD5 as an example.


Comments

Not sure if I get it right..

A reader, May 31, 2005 - 4:42 pm UTC

Tom, thanks for your prompt response.

I can't add a raw(16) hash column to the existing tables and populate it at insert time, so the hashing has to be done during the diff operation. I tried it, modified your sample program a little, and created a view on top of the table. Here are my test samples:

-- conn / as sysdba
-- grant execute on DBMS_CRYPTO to public;

10g> conn myuser/myuser

create or replace type rawTableType as table of raw(16)
/

create or replace function hashrow(p_tname varchar2) return rawTableType
PIPELINED
is
l_stmt long;
l_data long;
TYPE refcur IS REF CURSOR;
c1 refcur;
begin

-- concatenate the column list into column1||column2||... for the select

select regexp_replace(stragg(column_name), ',','||') into l_stmt from user_tab_columns where table_name=p_tname;

open c1 for 'select ' || l_stmt || ' thestr from ' || p_tname;
loop

fetch c1 into l_data;
exit when c1%notfound;

pipe row( sys.dbms_crypto.hash( src => utl_raw.cast_to_raw(l_data), typ => sys.dbms_crypto.hash_md5 ) );
-- using MD5
end loop;
return;
end;
/

10g> alter session set nls_date_format='YYYYMMDDHH24MISS';

Then, create view for the table (on both databases):

10g> create view v_tt as select * from table(hashrow('TT'));

Then I tried to compare the contents of the tables, first with select-minus-select on the raw rows and then on the hashed rows:

-- variable start number
set serverout on size 10000
exec :start := dbms_utility.get_time;

select count(1) from (select * from tt minus select * from tt@remote_db) ;
select count(1) from (select * from tt@remote_db minus select * from tt) ;

exec dbms_output.put_line('OLD:'||(dbms_utility.get_time-:start) || ' hsecs' );

exec :start := dbms_utility.get_time;

select count(1) from (select * from v_tt minus select * from v_tt@remote_db) ;
select count(1) from (select * from v_tt@remote_db minus select * from v_tt) ;

exec dbms_output.put_line('HASH:'||(dbms_utility.get_time-:start) || ' hsecs' );

COUNT(1)
----------
0

COUNT(1)
----------
0

OLD:833 hsecs

COUNT(1)
----------
10705

COUNT(1)
----------
10705

HASH:2229 hsecs

I am not sure if I get it right. Apparently the hash values generated on the two databases are always different. How do I compare the hashed row values? It also takes longer when the hashing is done at the same time as the compare.


Tom Kyte
June 01, 2005 - 8:37 am UTC

you've made a mistake somewhere. Here is my example (big SUGGESTION: use static SQL and only static SQL -- if need be, write a generic routine that writes the pipelined function for a given table -- but don't be generic at runtime. Otherwise you'll miss out on the 10g array fetch of 100 rows at a time for implicit cursors.)

created this at both sites:

scott@ORA10GR1> create or replace type rawTableType as table of raw(16)
2 /

Type created.

scott@ORA10GR1> create or replace function hashrow return rawTableType
2 PIPELINED
3 is
4 l_stmt long;
5 l_data long;
6 TYPE refcur IS REF CURSOR;
7 c1 refcur;
8 begin
9 for x in ( select empno || '/' ||
10 ename || '/' ||
11 job || '/' ||
12 mgr || '/' ||
13 to_char(hiredate,'yyyymmddhh24miss') || '/' ||
14 sal || '/' ||
15 comm || '/' ||
16 deptno data
17 from emp )
18 loop
19 pipe row( sys.dbms_crypto.hash
20 ( src => utl_raw.cast_to_raw(x.data),
21 typ=>sys.dbms_crypto.hash_md5 )
22 );
23 end loop;
24 return;
25 end;
26 /
Function created.

scott@ORA10GR1> create or replace view emp_view
2 as
3 select * from table(hashrow());
View created.

scott@ORA10GR1> select * from emp_view;

COLUMN_VALUE
--------------------------------
96C7AF1AED37858C2D38CA955975069D
68EBC1E1B2B47F1BEBB0FC4BCB6E3408
D3A52FD8A5405A26370066A226A9C437
78B82C7CFAD7955CC4F2BAB58F54CBFC
3A99DDF9C17F035E633FBA085C941CF5
EE21B46747B90BA605C36723C2DA1557
9C982E8673B37F3EE865F8EBD74C0669
243A88DA780E7AF590C2D1F945891A50
F2BCD4F03869C4AAD16D848937B993BD
4785D553EDFE95B5D5832215A10AA36B
324D92832667B0102C12204751998514
C571808ABC6659F44626DDD3B7BAA67E
48C27BED93FC95F851B46C7E25E03C73
80B89547CEF50BF29989962FC2DF6384

14 rows selected.

scott@ORA10GR1> select * from emp_view@ora10g;

COLUMN_VALUE
--------------------------------
96C7AF1AED37858C2D38CA955975069D
68EBC1E1B2B47F1BEBB0FC4BCB6E3408
D3A52FD8A5405A26370066A226A9C437
78B82C7CFAD7955CC4F2BAB58F54CBFC
3A99DDF9C17F035E633FBA085C941CF5
EE21B46747B90BA605C36723C2DA1557
9C982E8673B37F3EE865F8EBD74C0669
243A88DA780E7AF590C2D1F945891A50
F2BCD4F03869C4AAD16D848937B993BD
C29DC8DBDA15E4FDC8BBA359A3389C14
324D92832667B0102C12204751998514
C571808ABC6659F44626DDD3B7BAA67E
48C27BED93FC95F851B46C7E25E03C73
5C9553DB1344BEBA2402D9D263B12772

14 rows selected.

scott@ORA10GR1>
scott@ORA10GR1> select * from emp_view minus select * from emp_view@ora10g;

COLUMN_VALUE
--------------------------------
4785D553EDFE95B5D5832215A10AA36B
80B89547CEF50BF29989962FC2DF6384

scott@ORA10GR1> select * from emp_view@ora10g minus select * from emp_view;

COLUMN_VALUE
--------------------------------
5C9553DB1344BEBA2402D9D263B12772
C29DC8DBDA15E4FDC8BBA359A3389C14


And in fact there are differences:

scott@ORA10GR1> select empno,ename,hiredate,
2 count(src1) CNT1,
3 count(src2) CNT2
4 from
5 ( select a.*,
6 1 src1,
7 to_number(null) src2
8 from emp a
9 union all
10 select b.*,
11 to_number(null) src1,
12 2 src2
13 from emp@ora10g b
14 )
15 group by empno,ename,job,mgr,hiredate,sal,comm,deptno
16 having count(src1) <> count(src2)
17 /

EMPNO ENAME HIREDATE CNT1 CNT2
---------- ---------- --------- ---------- ----------
7788 SCOTT 09-DEC-82 1 0
7788 SCOTT 19-APR-87 0 1
7876 ADAMS 12-JAN-83 1 0
7876 ADAMS 23-MAY-87 0 1


Now, a thought that came to mind -- if the character sets are DIFFERENT, hashing won't work -- but MINUS could (stress COULD, as the character set is converted for the fields over the wire, so pulling from an 8-bit character set into a 7-bit database would compare OK, but going the other way might not).

PLSQL %ROWTYPE Record --> BLOB --> Hash

Alex H, July 09, 2008 - 2:24 pm UTC

Hi Tom,

Do you know of any tricks that would help in converting a complete record (e.g. passed as a %ROWTYPE parameter) to a BLOB, so that we can then use DBMS_CRYPTO.HASH on the BLOB value?
This would save us the "hassle" of manually generating the l_data string (i.e. all that consistent converting of the date and number columns to strings, dealing with NULL values and strings containing control characters, etc.).

TIA,
Alex

P.S. If I have my source data as flat files, a simple Perl script will add the digest to each record (e.g. MD5, but you can pick any other flavor) so I can do the complete snapshot compare later in the DB. Here is a simplified version of such a script, which you call before your ETL tool (I have used it with both Informatica and DataStage):
----
#!/usr/bin/perl -w
use Digest::MD5 qw(md5_hex);
my $infile = shift || "g:/src_data.dat";
my $outfile = "g:/stg_data/src_data_MD5.dat";

open(FILE, $infile) or die "Can't open '$infile': $!";
open(OFILE, '>', $outfile) or die "Can't open '$outfile': $!";

binmode(FILE);
while (<FILE>) {
  # prepend the MD5 hex digest of each record, pipe-delimited
  print OFILE md5_hex($_) . "|" . $_;
}
close(FILE);
close(OFILE);
----
Tom Kyte
July 09, 2008 - 3:07 pm UTC

no, there is not. And even if the record could be overlaid by a "blob", you would have the control information in there (length bytes, null flags and such), so it would never compare to what was in your file in the first place...

that, and the number representation would be up to 22 bytes of stuff, a date would be 7 bytes, a timestamp 7-11 bytes and so on. They would not be 'strings' in the rowtype variable -- but they would be in your file.

it would be easy to generate this from the dictionary, but you would have to deal with the proper formatting of numbers and dates... that gets tricky, of course; your input files might have

01-jan-2008 in one place

01012008000000 in another

but they are the "same" date.

%ROWTYPE record to BLOB

Alex H, July 09, 2008 - 6:09 pm UTC

Thank you for the quick response Tom!

Let me clarify my follow-up, because I don't want to compare flat-file records with table content. The scenario is:
- A weekly snapshot is produced from a UniVerse DB by PICK Basic code and "dumped" into an Oracle table;
- The layout of the "table" is 300+ columns, where a large number of the original UniVerse fields are multivalued; as a result, in the Oracle table one varchar2(4000) column contains a bunch of character strings delimited by high ASCII values like ASC(253);
- The number of records in each snapshot is around 1 million; modified records since the last snapshot are ~1% of the total; new records since the last snapshot are less than 1%;
- Today a DataStage ETL job compares the new and old snapshots by matching the keys and then performs a field-by-field compare to determine the modified records.
BOTTOM LINE: it takes a while to finish...

What I'm looking for is to get one MD5 digest from each snapshot table record, stored in the staging area with the record itself, so that eventually the field-by-field compare concentrates only on the one percent and ignores the rest. However, the hassle of building one string from these 300+ columns is too great; that's why my follow-up about turning a complete %ROWTYPE record into one BLOB value.
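(For illustration, a minimal sketch of that digest-driven delta -- the REC_KEY and MD5_DIGEST column names are made up, assuming each snapshot table carries the key and a digest computed at load time:)

-- only the rows flagged here (roughly the 1%) need the field-by-field compare
select n.rec_key
  from new_snap n
  left join old_snap o on o.rec_key = n.rec_key
 where o.rec_key is null             -- brand-new records
    or o.md5_digest <> n.md5_digest  -- changed records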

If I have two Oracle tables (e.g. OLD_SNAP and NEW_SNAP) and I pick two records from the respective tables with matching PKs and 100% identical column content, are you saying that their %ROWTYPE records by design contain internal differences that would make any attempt at converting the entire %ROWTYPE structure to a BLOB pointless?

TIA,
Alex
Tom Kyte
July 09, 2008 - 6:29 pm UTC

write code to write code.

create a procedure that reads the dictionary and creates a view that selects the columns and a hash of them -- after you convert dates/timestamps/numbers to some standard format. A sketch of such a generator is below.
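(An illustrative sketch of that generator -- the name GEN_HASH_VIEW and the <table>_HASH_VW view it creates are made up, it assumes execute on DBMS_CRYPTO as granted earlier, and the MD5 constant is hard-coded as 2 because the package constant cannot be referenced inside SQL before 12c. For very wide rows the generated concatenation can exceed SQL's varchar2 limit, in which case the pipelined-function approach above is the fallback.)

create or replace procedure gen_hash_view( p_tname in varchar2 )
as
    l_expr long;
begin
    -- build col1 || '/' || col2 || ... with dates, timestamps and numbers
    -- rendered in a fixed, NLS-independent format, in a stable column order
    for c in ( select column_name, data_type
                 from user_tab_columns
                where table_name = upper(p_tname)
                order by column_id )
    loop
        if l_expr is not null then
            l_expr := l_expr || q'[ || '/' || ]';
        end if;
        if c.data_type = 'DATE' then
            l_expr := l_expr || 'to_char(' || c.column_name || q'[,'yyyymmddhh24miss')]';
        elsif c.data_type like 'TIMESTAMP%' then
            l_expr := l_expr || 'to_char(' || c.column_name || q'[,'yyyymmddhh24missff')]';
        elsif c.data_type = 'NUMBER' then
            l_expr := l_expr || 'to_char(' || c.column_name || q'[,'tm9')]';
        else
            l_expr := l_expr || c.column_name;
        end if;
    end loop;

    -- 2 = dbms_crypto.hash_md5
    execute immediate
        'create or replace view ' || p_tname || '_hash_vw as ' ||
        'select dbms_crypto.hash( utl_raw.cast_to_raw(' || l_expr || '), 2 ) row_hash ' ||
        'from ' || p_tname;
end;
/

Run it on both sites (exec gen_hash_view('TT')) and compare the resulting views with MINUS, just like EMP_VIEW above.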


A reader, July 10, 2008 - 4:34 am UTC

The DBMS_COMPARISON package in 11.1 provides interfaces to compare and converge database objects at different databases.
http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_comparison.htm
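(A rough sketch of using it -- the comparison name CMP_TT is made up, it assumes TT has an index, typically the primary key, that DBMS_COMPARISON can bucket rows on, and the database link is named REMOTE_DB as in the original test case:)

declare
    l_scan dbms_comparison.comparison_type;
    l_same boolean;
begin
    dbms_comparison.create_comparison(
        comparison_name => 'CMP_TT',
        schema_name     => user,
        object_name     => 'TT',
        dblink_name     => 'REMOTE_DB' );

    l_same := dbms_comparison.compare(
        comparison_name => 'CMP_TT',
        scan_info       => l_scan,
        perform_row_dif => true );

    if not l_same then
        dbms_output.put_line( 'differences found, scan id = ' || l_scan.scan_id );
        -- row-level differences are then visible in USER_COMPARISON_ROW_DIF
        -- and can optionally be synchronized with dbms_comparison.converge
    end if;
end;
/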
