Incremental
Gh, August 08, 2018 - 6:51 pm UTC
You can build a data lake where all changes are logged as simple rows in fact tables.
Then build a set of materialized views on top of these, refreshed incrementally, using analytic functions.
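A minimal sketch of that approach, assuming a hypothetical change-log fact table (all table and column names here are illustrative, not from the original post):

```sql
-- Change-log fact table: every change lands as a new row
create table account_changes (
  change_id    number generated always as identity,
  account_id   number,
  change_type  varchar2(1),   -- I/U/D
  balance      number,
  changed_at   timestamp default systimestamp
);

-- MV log so the view below can fast refresh
create materialized view log on account_changes
  with rowid, sequence (account_id, balance)
  including new values;

-- Incrementally maintained aggregate over the change stream
create materialized view account_totals
  refresh fast on commit
as
  select account_id,
         sum(balance)   total_balance,
         count(balance) balance_count,  -- required for fast refresh of sum
         count(*)       change_count
  from   account_changes
  group  by account_id;
```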
A reader, August 08, 2018 - 9:27 pm UTC
Thanks for your response Chris.
"So you want your consumers to join changes with the relevant parent data?"
This is the first time we are looking to use the CDC design pattern. We are in very early stages and getting mixed up between traditional batch data extracts and the CDC design pattern. So far we have discussed having batch data extracts as the initial data loads, but these batch extracts won't have primary keys since they are de-normalized. So any subsequent CDC updates will still need to be in de-normalized format; hence my original question about getting the PK values of the parent tables.
After your response, I read a little about the CDC design pattern. In every case, CDC is described in use with data warehouses or analytics data (i.e. internal systems only). Is it common practice to send CDC data changes to external clients and ask them to replicate part of our data tables and do the joins themselves? How about future schema/table enhancements? Thanks.
- Prasad P
August 09, 2018 - 9:55 am UTC
What exactly are your consumers? What is their purpose and why do they need to reconstruct the full data set?
these batch extracts won't have primary keys since they are de-normalized
Any particular reason to exclude the primary keys? And, assuming you mean sequence assigned PKs (or similar), the data will still include the business key, right?
Can you join on that?
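For example, assuming the change feed carries the parent's business key (all names here are hypothetical):

```sql
-- Recover the parent's sequence-assigned PK by joining on the business key
select chg.*, acc.account_id
from   account_changes_feed chg
join   accounts acc
on     acc.account_number = chg.account_number;  -- business key
```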
Prasad Panchagnula, August 09, 2018 - 1:47 pm UTC
"What exactly are your consumers? What is their purpose and why do they need to reconstruct the full data set?"
The data is financial in nature - investments, stocks, assets, etc. Sometimes the data is sent to marketing teams with performance figures. They don't need to reconstruct the full data set. But with the CDC implementation pattern, the setup seems to have mirrored tables in both the source and target systems, which makes it easy to push (synchronise) changed data in one table to the target system and let the target system join the data. In this way we hope to send our data changes as quickly as possible. We are still contemplating whether to send data at table level and let the target systems do the joins, or to join the data up ourselves and then send it.
"Any particular reason to exclude the primary keys? And, assuming you mean sequence assigned PKs (or similar), the data will still include the business key, right?"
Correct, they still have the business keys but not the sequence-number primary keys. In some tables the business keys are the PKs; in others they are not. So, in the tables where sequence numbers are used as PKs, we need to put some context around the data to define what it relates to; hence the need to traverse back up to the parent tables for the parent's business key.
- Prasad P
August 09, 2018 - 4:05 pm UTC
It depends on where you want to put your effort.
If you're doing the joins on the source system, it's
- More load on the source system
- More work for you to put the propagation in place
- If the consumers want to change the data in their report, you need to recode the propagation.
Whereas if you do the joins on the consumers, it's
- Less load on the source system
- Less work for you to put the propagation in place
- Consumers have the flexibility to write their own reports (assuming they don't want data from new tables).
I'm struggling to see what benefit you get from doing the joins yourself on the source.
Prasad Panchagnula, August 09, 2018 - 4:22 pm UTC
"I'm struggling to see what benefit you get from doing the joins yourself on the source. "
This is what we have at the moment: we send de-normalized data as a batch at the end of the month (or fortnight).
We believe (or someone within our organisation believes) that we can push data changes faster by adopting the CDC design pattern. The only thing I'm still not sure about is whether anyone has implemented this CDC pattern to push data changes to external parties, and whether anyone has implemented it for data of the nature I'm dealing with, i.e. financial/asset related.
August 10, 2018 - 8:22 am UTC
CDC should allow the consumers to get changes faster.
Forget your current approach. Push changes for all the tables each consumer cares about as they come in. You may want/need to build a transformation step at the end to convert the data to a format the consumer wants.
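For instance, the transformation step could be as simple as a view over the replicated change tables that joins them back into the de-normalized shape the consumer already expects. A sketch only; all names are hypothetical:

```sql
-- Consumer-side transformation: de-normalize the replicated change rows
create or replace view asset_changes_denorm as
  select acc.account_number,   -- business key from the parent table
         chg.change_type,      -- I/U/D flag from the CDC feed
         chg.asset_value,
         chg.changed_at
  from   asset_changes chg
  join   accounts acc
  on     acc.account_id = chg.account_id;
```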
This is the basis of ETL & data warehousing. There are various ETL design patterns available. But the exact details of how you do this depends on your goals. Understand what you're trying to achieve by doing this before rushing in with CDC!
A reader, August 10, 2018 - 10:33 am UTC
Thank you Chris for all your input, and for taking the time to respond to my queries. Much appreciated.
- Prasad P
August 10, 2018 - 2:30 pm UTC
Happy to help