Skip to Main Content

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Francesco.

Asked: August 15, 2019 - 3:01 pm UTC

Last updated: August 19, 2019 - 3:52 pm UTC

Version: 11

Viewed 1000+ times

You Asked

Hello,
I start a new data warehouse project. In the ETLs for the ins/upd steps, I suggest to use MD5 and not the natural key (my sources are txt files from a mainframe).

I need to explain to my project manager why I want to use md5 and not the natural key. He don't accept "...because md5 is the good practice...", he don't accepta dogmas, he want some exemples.

We need to do the rules for all ins/upd steps, for exemple:
In staging area I copy the txt files in an Oracle table, without rules. One file=one table.
This tables are the input for the DW. I want to write the stock area, the DB in 3NF. This is the first use of ins/upd key. It's here were I want to use the md5 and not the natural key.

Can you give me exemples?

Many thanks.

and Chris said...

I'm not clear on exactly how you plan to use the MD5 hash. Or how exactly you plan to insert/update rows.

Assuming you're talking about managing the slowly changing dimension tables, there are several approaches you can take to change these.

Which makes it hard to give examples!

If you're going for a simple overwrite of values, then a merge on the business key should work fine:

merge into target_tab t
using source_tab s
on  ( t.business_key = s.business_key ) 
when matched then
  update ...
when not matched then
  insert ...


If you're doing something else, you'll need to explain - with examples - what you're trying to do!

Rating

  (3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Francesco Cavalli, August 15, 2019 - 4:59 pm UTC

Thank you for the answer Chris.

Do you know if the md5 is better then natural unique key?
If I use a tool (Penthao PDI), with a natural unique key is explicit which is the way to read in the output table, but with md5 I can't understand, because I have only a string. For exemple:

Table1 (col1, col2, col3, col4, col5) Unique on col1, col2
With PDI I search the row by col1 and col2 and I complete the second grid with the fields that I want to ins or to upd.
If I use md5 which is my search key? We need to add notes in all steps to explain.
I hope you understand...

And in term of speed?

Thanks
Chris Saxon
August 16, 2019 - 8:46 am UTC

I'm not familiar with Pentaho, so can't comment on how this works.

If I use md5 which is my search key? We need to add notes in all steps to explain.

You tell me! What are you searching on? What do you do if there's a match?

If values for ( col1, col2 ) exist in the source, what are you planning on doing in the target? Overwriting all the other values? Or inserting a new row and setting some form of version columns (e.g. start/end dates, is_active flag, ...) in the new & old rows?

If you're searching on all the columns of a unique constraint, then the database can use the unique index to do the lookup. This will be fast.

If you're using a hash, you have to:

- Compute the hash on the source
- Compute the hash on the target
- Do the lookup

Unless you've already computed, stored, and indexed the hash on the target doing these steps will (probably) be more work.

But really, there are too many unknowns for me to say what the "best" approach is here.

Give us your worked examples. e.g.

- Read into source these values
- If a given business key doesn't exist in the target, load it
- If a given business key DOES exist in the target, you do... what in the target table?

Umm...

Chuck Jolley, August 18, 2019 - 3:55 am UTC

"...because md5 is the good practice...",

Why do you think that?
Surrogate keys are meaningless by definition.
Just use a sequence. It's way less resource incentive, will definitely never conflict, and will be better for concurrence.
Chris Saxon
August 19, 2019 - 2:20 pm UTC

I believe the OP is referring to taking an MD5 hash of the values in each row. And using this to see if those values already exist.

Which is a separate concept to surrogate keys & sequences.

Interesting idea.

Chuck Jolley, August 19, 2019 - 3:43 pm UTC

Interesting idea.
That might make sense if you have a lot of columns in the logical primary key.

Chris Saxon
August 19, 2019 - 3:52 pm UTC

Yep, it can be useful when searching on many columns.

Whether it's the way forward for the OP isn't clear from the information given...

More to Explore

Design

New to good database design? Check out Chris Saxon's full fundamentals class.