A reader, September 09, 2015 - 2:04 am UTC
Hi Tom
Thank You.
Yes, HW resources are there to be consumed, and they should be used to the maximum possible extent. When I run a load test I also run the Linux sar command in parallel. At the beginning I had a bad SQL, so the %user column of the sar output showed 99% utilization as the load (the number of "business transactions") increased. Now that I have removed the bad SQL, just for testing, I see the %system column go high in the sar output. Investigating further, I see LGWR consuming 55% of the entire CPU through the whole duration of the test, and OEM DB Control shows a spike on log file sync.
So can I conclude that log file sync is maxing out the CPU, that CPUs are therefore not available for my business transactions, and that this is why I cannot get a better business TPS?
Sandeep
Commit frequency?
Hemant K Chitale, September 10, 2015 - 8:31 am UTC
Sandeep,
How frequently are you committing in those 30 SQL operations?
What is the number of rows / size of redo generated per commit?
Are there other transactions also executing on the same server? (They'd be generating redo that can "piggyback" on your commits.)
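One rough way to answer these questions is to take the deltas of the 'user commits' and 'redo size' statistics between two snapshots and divide. A minimal sketch of that arithmetic, with hypothetical numbers (substitute the deltas from your own report):

```python
# Hypothetical deltas between two statistics snapshots
# ('user commits' and 'redo size'); replace with your own values.
snapshot_seconds = 600            # 10-minute sample window
user_commits_delta = 1_800_000    # assumed commit count in the window
redo_bytes_delta = 3_600_000_000  # assumed redo generated (bytes)

commits_per_sec = user_commits_delta / snapshot_seconds
redo_per_commit = redo_bytes_delta / user_commits_delta

print(f"{commits_per_sec:.0f} commits/s, {redo_per_commit:.0f} bytes of redo per commit")
```

A very small redo-per-commit figure combined with a high commit rate usually points at row-by-row commits that could be batched.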
commit frequency
A reader, September 10, 2015 - 10:48 am UTC
The below is for a duration of 10 minutes:
------------------------------------------------
Top User Events
------------------------------------------------
Event                Event Class   % Event   Avg Active Sessions
log file sync        Commit          88.83                 49.20
CPU + Wait for CPU   CPU              9.72                  5.38
September 14, 2015 - 3:03 am UTC
Take a look also at 'log file parallel write'. That's the portion of your log file sync time that is being spent doing the IO.
If that's small (redo writes on modern hardware should be almost sub-millisecond) and you've still got high log file sync, then it's either a CPU problem (the box can't cope) or a concurrency problem (lots and lots of people all committing at high frequency).
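To put numbers on that split, compare the average 'log file sync' wait with the average 'log file parallel write' wait from the same report window. A minimal sketch with hypothetical totals (substitute your own):

```python
# Hypothetical totals from one report window; substitute your own.
lfs_waits = 120_000       # 'log file sync' wait count
lfs_time_s = 360.0        # 'log file sync' total time (s)
lfpw_waits = 90_000       # 'log file parallel write' wait count
lfpw_time_s = 45.0        # LGWR IO time (s)

avg_sync_ms = lfs_time_s / lfs_waits * 1000     # average sync wait
avg_write_ms = lfpw_time_s / lfpw_waits * 1000  # average write (IO) time

# A sub-millisecond write average alongside a much larger sync average
# points at CPU or concurrency, not the IO path.
gap_ms = avg_sync_ms - avg_write_ms
print(f"avg sync {avg_sync_ms:.1f} ms, avg write {avg_write_ms:.1f} ms, non-IO gap {gap_ms:.1f} ms")
```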
Some related info here
http://goo.gl/6vigmw
Hope this helps.
A reader, September 25, 2015 - 4:20 am UTC
Hi Tom.
Thank You.
I read the link; what I understood from the blog is below.
1) There is one LGWR and many consumers of its service.
2) On a busy system, when process-1 posts LGWR, the odds are good that LGWR is already busy flushing a batch of redo for some other group of processes (process-2 to process-n) who are themselves waiting in log file sync.
3) Before LGWR can flush process-1's redo, it has to finish flushing the prior batch and post all of those waiters (process-2 to process-n).
4) LGWR fires semaphores (waking up one more process, yielding the CPU to the process it just posted) and then waits to get the CPU back. This is what shows up as the log file sync wait. It is all about CPU starvation.
5) Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR.
6) Without preemption protection, LGWR can lose the CPU before it has exhausted its fair time slice.
7) Once LGWR loses its CPU it may be quite some time until it gets it back, and this waiting time is attributed to the log file sync wait.
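The batching in points 2 and 3 is also what makes group commit efficient: while LGWR is busy with one flush, new committers queue up and are serviced together by the next write. A toy model of this (the session count and batch size are made up for illustration, not Oracle internals):

```python
from collections import deque

pending = deque(range(1, 11))   # 10 sessions that have posted LGWR
BATCH_LIMIT = 4                 # assumed max commits LGWR picks up per flush
writes = 0
served = 0

while pending:
    # One 'log file parallel write' covers every commit in the batch,
    # so n committers can share far fewer than n physical writes.
    batch = [pending.popleft() for _ in range(min(BATCH_LIMIT, len(pending)))]
    served += len(batch)
    writes += 1

print(f"{served} commits serviced by {writes} writes")
```

This is why a high commit rate does not necessarily mean an equally high physical write rate, and why everyone in a batch waits on the same flush.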
If all of these things happen, and this is how it has been designed, then there is virtually no solution other than adding CPUs; even a solid-state I/O device would not help.
So what can be the solution for a heavily committing application?
I would be happy to be wrong.
Thanks and Regards,
Sandeep
September 25, 2015 - 7:03 am UTC
From your 10-minute sample, can you email the entire report to asktom_us@oracle.com?
We can then have a look at it.
But if you are committing at a frequency faster than your CPUs can handle, you could try running LGWR at a higher priority. (You need to speak to Support about the "_high_priority_processes" parameter and whether it's supported on your platform.)
If that doesn't work, then you might just not have enough CPU
Sandeep, September 26, 2015 - 7:24 pm UTC
Hi Tom,
Many Thanks for the reply.
Which report do you want me to send: the ASH for a 10-minute duration during rush hour? Which parts of it, at a minimum, should I post? But I would like to figure it out myself, with your help. Where should I check in an AWR report for evidence that, for the number of commits I am doing, I don't have enough CPU? Like searching for the fountain of youth, I have already spent a significant amount of time without any positive outcome.
Thanks and Regards,
Sandeep
September 28, 2015 - 3:53 am UTC
Either report is fine. And we'll advise on where to look, etc., without simply spoon-feeding the answer.
A reader, October 06, 2015 - 2:18 pm UTC
Hi Tom,
I am sorry, but my client has reservations, so I may not be able to share a report. Could you please let me know which sections of an AWR report I should look at and the math involved? I just want to know how the calculation is done, the same way you answer many "how is it done by Oracle" questions. Sorry for the trouble.
Regards,
Sandeep
October 07, 2015 - 3:08 am UTC
In that case, tell the client to peruse support note Doc ID 1376916.1
It covers this topic in good detail.
Hope this helps.
so what if CPU is 90%+ idle and parallel write wait is low?
John Cantu, November 10, 2016 - 12:01 pm UTC
Hello Chris,
I have a follow-up question to your response:
"Take a look also at 'log file parallel write'. That's the portion of your log file sync time that is being spent doing the IO.
If that's small (redo writes on modern hardware should be almost sub-millisecond) and you've still got high log file sync, then it's either a CPU problem (the box can't cope) or a concurrency problem (lots and lots of people all committing at high frequency).
Some related info here
http://goo.gl/6vigmw"
If only 692 out of 3764 seconds is spent on "log file parallel write", the CPU is 99% idle, and the concurrency hasn't changed, what else could explain only 692 of the 3764 seconds being spent on I/O? In other words, what else can the database be doing with the 3764 - 692 seconds?
Thanks
November 11, 2016 - 8:30 am UTC
There could be lots of *potential* causes:
- excessive commit frequency (across the instance)
- LGWR has been swapped out (memory issue)
- LGWR contention for something else (i.e., if LGWR itself is blocked, then a user session sees this as 'log file sync')
If none of these is seen, then it might be time to talk to Support for some lower-level tracing.
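Plugging in the figures from the question, the split can be made explicit (a sketch of the arithmetic only):

```python
log_file_sync_s = 3764            # total 'log file sync' time in the window
log_file_parallel_write_s = 692   # the portion that was LGWR IO

non_io_s = log_file_sync_s - log_file_parallel_write_s
io_fraction = log_file_parallel_write_s / log_file_sync_s

# Roughly 82% of the sync time is unaccounted for by IO, which is what
# makes causes like a blocked or starved LGWR worth ruling out.
print(f"{non_io_s} s ({1 - io_fraction:.0%}) of sync time was not IO")
```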
John Cantu, November 10, 2016 - 12:02 pm UTC
sorry, forgot to mention that 3764 seconds is spent on "log file sync".
John Cantu, November 11, 2016 - 3:18 pm UTC
Thanks, Connor.
Chris mentioned that the wait event "log file parallel write" can be used to figure out how much of the "log file sync" time was spent doing I/O. Are there similar companion wait events for other I/O waits, like "db file scattered read" and "db file sequential read", that would provide insight into how much time was spent doing I/O and how much was spent doing non-I/O work?
November 14, 2016 - 2:29 am UTC
No - those are different, because there the waiting session is the *same* session doing the I/O.
For commit:
session 12 asks LGWR to do something, and then waits on log file sync. But they are waiting on *another* session (lgwr) to do its work. LGWR will then have the 'log file parallel write' waits for its session.
For normal read:
session 12 *does* the I/O, hence its waits (db file seq... etc) *are* the I/O cost (as perceived by Oracle).
Now, in *all* cases, an I/O request by Oracle to the OS *might* have some delays that are not *true* I/O (e.g. queueing, etc.), but that's no different to anything doing I/O.
John Cantu, November 16, 2016 - 5:36 pm UTC
Thanks, Connor. That makes sense.
You mentioned that an I/O request may have time spent on things like queueing at the OS level, so how do we find out how much time was spent queueing? Is that figured out through OS tools only, like sar or iostat?
November 17, 2016 - 2:31 am UTC
Yes, you have to go into lower level tools. Not even sure if sar/iostat will pick it up.
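One rough way to estimate OS-level queueing from an `iostat -x` sample is to subtract an assumed device service time from the reported average write latency, since `w_await` includes time spent in the queue (the old `svctm` column is deprecated in sysstat and only ever approximated service time). A sketch of that estimate with made-up numbers:

```python
# Hypothetical figures from an iostat -x sample; substitute your own.
w_await_ms = 4.2   # average write latency seen by the OS (queue + service)
service_ms = 0.6   # assumed pure device service time for this storage

# The difference is only a rough queueing estimate, not a measurement.
queue_ms = w_await_ms - service_ms
print(f"estimated queue time ~{queue_ms:.1f} ms per write")
```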
John Cantu, November 17, 2016 - 10:09 am UTC
Thank you
Dana Day, November 23, 2022 - 12:51 am UTC
Just a comment. I remember attending an Oracle conference and watching Tom Kyte answer askTom questions between sessions.
Tom was the only person on a panel of 30 who immediately answered, and proved the answer to, a question from the audience that stumped all the experts on stage.
I mention this because
a) Tom is the epitome of test/prove.
b) The spirit of Tom Kyte lives on in AskTom.
c) Connor and company continue to provide awesome Oracle advice.
Thank you all for your efforts and excellence!
Dana
November 25, 2022 - 5:20 am UTC
Thanks for the feedback!
Thank you Connor
A reader, November 28, 2022 - 7:07 am UTC
Thank you Connor, will try these out.
Cheers!
November 30, 2022 - 1:22 am UTC
Glad to help