You Asked
Recently we observed that cached sequence values were lost significantly, appearing as large gaps in persisted values.
Our system does not expect gapless sequence values, however we are trying to understand root cause.
gv$rowcache shows high getmisses in histogram, objects, segments, followed by sequences.
SGA dump shows many grows in shared pool and shrinks in buffer cache, and gv$db_object_cache shows high loads.
We did a trend of sequence value jump or loss from cache based on gaps found in persisted values across time.
I will list the events that may be contributory to the gap. Please let me know if this is a incorrect hypothesis.
1. Onset of moderately large large sequence jumps aligns with migration to multitenant database
2. Prior to multitenant migration, we never had histogram collection as part of stats, it appears a DBA has run stats with histogram collection turned on at the time of multitenant migration, sequence gaps are silently occuring
3. Few months after multitenant migration, a tablespace rebuild activity followed by stats collection was done. This time the standard stats collection script was run and removed stats from many tables, but not few core tables that are extensively used in the application. So never before seen histogram traffic is still continuing to dictionary cache.
4. After multitenant migration, another effort started where many tables started to get partitioned. There were 2 large on-time efforts that created several thousands of partitions, followed by regular scheduled jobs creating few hundreds of partitions for historical data management
5. The sequence jump (and loss from cache) seems to continue for many months, un-noticed as the application is not affected by gaps.
6. Some of the regular scheduled batch jobs were missed, so there was a large gap where tables were not monthly partitioned properly as expected. At discovery a one-time catchup activity was performed where around 800 partitions were created.
7. The sequence jump phenomenon exploded uncontrollably and was discovered by a partner system.
We pinned the sequence in memory to calm it down.
Here is my draft hypothesis for a root cause, please correct if it does not hold:
a) Multitenant migration increased traffic to dictionary cache, sequence metadata is evicted and reload constantly, so values have jumps/gaps
b) As many partitions are created, more traffic is arriving to dictionary cache and pressure increases, sequence jumps occur silently meanwhile.
c) Cumulative traffic to dictionary cache is increasing every month as 100s of partitions are added by monthly job and pressure is getting intolerable, sequence evictions are on rise
c) One-time gap-covering exercise for missed partition creation pushed the traffic over a threshold, sequence jump skyrocketed and now visible in the application as a very large gap.
Does these dots connect as a root cause ?
and Connor said...
Your hypothesis is definitely plausible.
Another potential candidate here is dynamic sequence caching (depending on when you last applied RUs etc etc).
It may well be the case that your dictionary cache has *not* really changed much, but the *size* of the cache has silently been increased and thus the gaps look a lot larger.
https://connor-mcdonald.com/2021/08/06/we-killed-sequence-contention/