Francesco Cavalli, August 15, 2019 - 4:59 pm UTC
Thank you for the answer Chris.
Do you know if the md5 is better then natural unique key?
If I use a tool (Penthao PDI), with a natural unique key is explicit which is the way to read in the output table, but with md5 I can't understand, because I have only a string. For exemple:
Table1 (col1, col2, col3, col4, col5) Unique on col1, col2
With PDI I search the row by col1 and col2 and I complete the second grid with the fields that I want to ins or to upd.
If I use md5 which is my search key? We need to add notes in all steps to explain.
I hope you understand...
And in term of speed?
Thanks
August 16, 2019 - 8:46 am UTC
I'm not familiar with Pentaho, so can't comment on how this works.
If I use md5 which is my search key? We need to add notes in all steps to explain.
You tell me! What are you searching on? What do you do if there's a match?
If values for ( col1, col2 ) exist in the source, what are you planning on doing in the target? Overwriting all the other values? Or inserting a new row and setting some form of version columns (e.g. start/end dates, is_active flag, ...) in the new & old rows?
If you're searching on all the columns of a unique constraint, then the database can use the unique index to do the lookup. This will be fast.
If you're using a hash, you have to:
- Compute the hash on the source
- Compute the hash on the target
- Do the lookup
Unless you've already computed, stored, and indexed the hash on the target doing these steps will (probably) be more work.
But really, there are too many unknowns for me to say what the "best" approach is here.
Give us your worked examples. e.g.
- Read into source these values
- If a given business key doesn't exist in the target, load it
- If a given business key DOES exist in the target, you do... what in the target table?
Umm...
Chuck Jolley, August 18, 2019 - 3:55 am UTC
"...because md5 is the good practice...",
Why do you think that?
Surrogate keys are meaningless by definition.
Just use a sequence. It's way less resource incentive, will definitely never conflict, and will be better for concurrence.
August 19, 2019 - 2:20 pm UTC
I believe the OP is referring to taking an MD5 hash of the values in each row. And using this to see if those values already exist.
Which is a separate concept to surrogate keys & sequences.
Interesting idea.
Chuck Jolley, August 19, 2019 - 3:43 pm UTC
Interesting idea.
That might make sense if you have a lot of columns in the logical primary key.
August 19, 2019 - 3:52 pm UTC
Yep, it can be useful when searching on many columns.
Whether it's the way forward for the OP isn't clear from the information given...