You Asked
In my data management class, we've started talking about OLTP vs. OLAP and the access patterns of each. The main points are that OLTP requires high performance, since these are the systems our users interact with, and there's often a large number of concurrent transactions. OLAP, on the other hand, requires flexibility to view the data in different ways, is often read-only, and performance isn't as critical. This all makes sense to me.
From these points, our professor then draws the following conclusions: OLTP should be denormalized, and OLAP should be normalized. On the surface this still makes sense to me - normalization would carry a speed and complexity penalty from the extra joins (bad for OLTP), and would enable more sophisticated views of the data (good for OLAP).
But when I think about it more, it doesn't really seem like that would work. If we denormalize OLTP, wouldn't that leave the door open for data redundancy and inconsistency? And if we normalize OLAP, wouldn't aggregations become much more difficult to write? Then I read about star schemas, which are somewhat denormalized - that seems to contradict what we're learning in lecture. The deeper I dig, the more it seems like every resource I can find on the internet says the opposite of what we're being taught. I seriously doubt my professor is just straight up wrong, but it's very confusing to sort through when I'm being told directly conflicting things.
So what are y'all's thoughts on this? Is there a right answer to (de)normalization and OLTP/OLAP? Or is this a case of differing schools of thought / it depends on the situation?
and Chris said...
The primary reason for normalization is data quality/consistency. As you normalize to higher normal forms, more types of data errors become impossible. Staying at lower normal forms means you're trading data quality for something else.
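Here's a minimal sketch of that point (the table and column names are invented for illustration). In the denormalized design nothing stops two order rows disagreeing about the same customer's email; the normalized design makes that error impossible to store at all:

-- Hypothetical denormalized design: customer details repeated on every order.
-- Nothing stops two rows for customer 42 holding different emails.
create table orders_denorm (
  order_id       integer primary key,
  customer_id    integer not null,
  customer_email varchar(320) not null,
  order_total    numeric(10,2) not null
);

-- Hypothetical normalized design: the email is stored exactly once,
-- so conflicting copies of it cannot exist.
create table customers (
  customer_id    integer primary key,
  customer_email varchar(320) not null
);

create table orders (
  order_id       integer primary key,
  customer_id    integer not null references customers (customer_id),
  order_total    numeric(10,2) not null
);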
Performance is the "something else" most often quoted (as in denormalization => faster), but it's more complicated than that.
Denormalizing optimizes your data for a particular use case. Provided your queries match this use case, you can get better performance. Query the data in unexpected ways and performance can be worse than on normalized tables.
For example, star schemas are designed for aggregating many rows and work well for this use case. But OLTP-style lookups (e.g. get the customer, invoice, and product details for order X) can be less efficient than they would be against a normalized schema.
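To make that concrete, here's a rough, hypothetical sketch (the fact and dimension names are invented): a star schema hangs narrow dimension tables off one wide fact table, which suits scanning and aggregating many rows, while a single-order lookup against the same structure does more work than it might against normalized OLTP tables:

-- Hypothetical star schema: one wide fact table plus a dimension.
create table dim_product (
  product_key  integer primary key,
  product_name varchar(100) not null,
  category     varchar(50)  not null
);

create table fact_sales (
  sale_id      integer primary key,
  product_key  integer not null references dim_product (product_key),
  sale_date    date    not null,
  quantity     integer not null,
  amount       numeric(10,2) not null
);

-- The use case star schemas are built for: scan and aggregate lots of rows.
select d.category,
       sum(f.amount) as total_sales
from   fact_sales f
join   dim_product d on d.product_key = f.product_key
where  f.sale_date >= date '2024-01-01'
group  by d.category;

-- An OLTP-style point lookup against the same schema; dragging wide fact
-- rows through dimension joins like this can be slower than reading a few
-- narrow rows from a normalized design.
select f.sale_id, f.sale_date, f.quantity, f.amount, d.product_name
from   fact_sales f
join   dim_product d on d.product_key = f.product_key
where  f.sale_id = 12345;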
Denormalizing also tends to make writes more expensive and reduce their concurrency. If you need to change duplicated data, you must update more rows than in a normalized database. This increases the chances that two sessions need to update the same row at the same time, leading to blocking.
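To sketch that with the hypothetical tables from the earlier example: the same change is a single-row update in the normalized design, but touches (and locks) every duplicated copy in the denormalized one:

-- Normalized: the email lives in one row, so one row is locked and changed.
update customers
set    customer_email = 'new.address@example.com'
where  customer_id = 42;

-- Denormalized: every order row carrying the old copy must be rewritten,
-- locking many rows and raising the chance of blocking other sessions.
update orders_denorm
set    customer_email = 'new.address@example.com'
where  customer_id = 42;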
Allowing many users to write data at the same time is a key requirement for many OLTP systems. This means more normalized designs are common in OLTP applications.
There's always a trade-off. Denormalizing will make some operations faster while making others slower, and either way you've given up the data quality guarantees that come from being in a particular normal form. So, as always, the question is whether the gains are worth the costs for your use case.