Skip to Main Content

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Sandeep.

Asked: September 08, 2015 - 3:21 pm UTC

Last updated: November 17, 2016 - 2:31 am UTC

Version: 11.2.0.4.0

Viewed 50K+ times! This question is

You Asked

Hi Tom,

1 ) As per my understanding, Log-File-Sync wait is the duration, after which I receive an acknowledgement that my commit is successful. In my database the only prominent issue I see is huge log-file-sync waits.
2 ) There are 32 cpu and each one is 8 core.
3 ) One of my business operation takes 300ms (select + insert + update +select+ select etc... total 30 sql statements), lets refer it as one business transaction (BT , I am aware of the difference, and what actually one db transaction)
4 ) So my one BT takes 300 ms, and to complete 1 BT os+network+hw takes rest of the 700ms
5 ) 300ms of one CPU is enough for 1 BT
6 ) so in any given one second I should be able to do 32*8 = 256 BT
7 ) but I am only able to complete 32 BT, and spikes of Log-File-Sync consumes everything else
8 ) Considering my server is like a desktop machine, and no SAN storage is configured
9 ) How to find out, the disk capacity is only Xmb/sec write speed, which may be resulting in less number of data writes

Regards

and Chris said...

You're correct, log file sync is wait for LGWR to write to the redo files.

It's not clear what you mean by "prominent issue".

Is this noticeably slowing down all commits, or is it just the top wait event on your system?

Remember that there's always something that has to be "top".

If log file sync is a problem, then bear in mind the following:

Disk throughput is only one aspect that affects LGWR. It consumes CPU while executing too. If you've maxed out your CPU capacity processing "business transactions", then it will be starved for resource. This can lead to you seeing a lot of "log file sync" waits.

If your datafiles are on the same disks as the redo logs, then DBWR will also be contending for the same disk.

So there's a number of things that may mean LGWR is not able to write at maximum capacity.

For further discussion and things you can do to reduce the wait, take a look at:

http://logicalread.solarwinds.com/oracle-log-file-sync-wait-event-dr01

More info on LGWR:

https://docs.oracle.com/cd/B19306_01/server.102/b14220/process.htm#i7261

Rating

  (11 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

A reader, September 09, 2015 - 2:04 am UTC

Hi Tom
Thank You.
Yes, HW resources are there to be consumed, and they should be used to the maximum possible extent. when I am doing a load testing I am also running linux sar command in parallel. At the begining I had a bad sql, so the %user column of the sar output was showing 99% utilization with the increase of load, the increase of number of "business transactions". Now when I removed the bad sql just for a testing, I see the %system column goes high in the sar output. On investigating further I see the LGWR consuming 55% of entire CPU, through the entire duration of testing. And OEM-db-control shows a spike on log-file-sync.

So can I conclude that log-file-sync is maxing out CPU, and CPUs are not available for my business transactions, and I am not able to get a better business-TPS

Sandeep

Commit frequency ?

Hemant K Chitale, September 10, 2015 - 8:31 am UTC

Sandeep,
How frequently are you committing in those 30 SQL operations ?
What is the number of rows / size of redo generated per commit ?
Are there other transactions also executing on the same server ? (They'd be generating redo that can "piggyback" on your commits).

commit frequency

A reader, September 10, 2015 - 10:48 am UTC

The below is for a duration of 10 minutes
------------------------------------------------

Top User Events
-------------------------
Event Event Class % Event Avg Active Sessions
log file sync Commit 88.83 49.20
CPU + Wait for CPU CPU 9.72 5.38


Chris Saxon
September 14, 2015 - 3:03 am UTC

Take a look also at 'log file parallel write'. Thats the portion of your log file sync time that is being spent doing the IO.

If that's small (redo writes on modern hardware should be almost sub millisecond), and you've still got high log file sync, then its either a CPU problem (the box cant cope), or a concurrency problem (lots and lots of people all committing at high frequency).

Some related info here http://goo.gl/6vigmw

Hope this helps.

A reader, September 25, 2015 - 4:20 am UTC

Hi Tom.

Thank You.
I saw the link, what I understood from the blog is as below.
1) There is 1 LGWR and a lot of consumers of his service
2) On a busy system, when process-1 posted LGWR it is in the process of servicing a batch of prior processes from (2 to n). That is, when process-1 posted LGWR the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS
3) Before LGWR can flush process-1’s redo, he has to finish flushing the prior batch and post all of those waiters (process-2 to ...process-n).
4) LGWR is firing semaphores (waking up one more process, lending the cpu occupied by it to the semaphore) and waiting to get back the CPU. This is what reflects as log file sync wait. It is all about CPU starvation.
5) Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR
6) Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
7) Once LGWR loses his CPU it may be quite some time until he gets it back, and this waiting time attributes to LFSW

If all these things are happen to happen. And this is how it has been designed then virtually there is no solution, other than increasing CPUs, instead of having a solid state I/O device.

So what can be a solution for a heavily committing, application.

I would happy to be wrong.

Thanks and Regards,
Sandeep


Connor McDonald
September 25, 2015 - 7:03 am UTC

From your 10min sample, can you email the entire report to asktom_us@oracle.com

We can then have a look at it.

But if you are committing at a frequency faster than your CPU's can handle, you coudl try running LGWR at a higher priority. (You need to speak to Support about the "_high_priority_processes" and whether its supported on your platform)

If that doesn't work, then you might just not have enough CPU

Sandeep, September 26, 2015 - 7:24 pm UTC

Hi Tom,

Many Thanks for the reply.
Which report you want me to send it to you, is it the ASH for a 10mins duration, during rush hour. Which part of the ASH in a minimum should post. But I would like to figure it out by myself, with your help. Where I should check in an AWR report, that would tell me for the number of commits I am doing I dont have enough of CPU? Like searching for the "youth-fountain", I have already spent a significant amount of time, without any positive outcomes.

Thanks and Regards,
Sandeep
Connor McDonald
September 28, 2015 - 3:53 am UTC

Either report is fine. And we'll advise on where to look etc without trying to simply spoon feed the answer.


A reader, October 06, 2015 - 2:18 pm UTC

Hi Tom,
I am sorry, but my client is having reservations. So may not be able share a report. Could you please let me know which all sections in an awr report I should look for and the math involved in it. I just want to know "How the calculation is done" ? The same way you provide many of the answers for questions "How it is done by oracle". Sorry for the trouble.

Regards,
Sandeep
Chris Saxon
October 07, 2015 - 3:08 am UTC

In that case, tell the client to peruse support note Doc ID 1376916.1

It covers this topic in good detail.

Hope this helps.

so what if CPU is 90%+ idle and parallel write wait is low?

John Cantu, November 10, 2016 - 12:01 pm UTC

Hello Chris,

I have a follow-up question to your response:

"Take a look also at 'log file parallel write'. Thats the portion of your log file sync time that is being spent doing the IO.

If that's small (redo writes on modern hardware should be almost sub millisecond), and you've still got high log file sync, then its either a CPU problem (the box cant cope), or a concurrency problem (lots and lots of people all committing at high frequency).

Some related info here http://goo.gl/6vigmw"
"

If only 692 out of 3764 seconds is spent on “log file parallel write” and the CPU is 99% idle and the concurrency hasn’t changed, what else could be the reason for only 692 out of 3764 seconds being spent on I/O? In other words, what other thing can the database do with the 3764-692 seconds?

Thanks

Connor McDonald
November 11, 2016 - 8:30 am UTC

Could be lots of *potential* causes

- excessive commit frequency (across the instance)
- lgwr has been swapped out (memory issue)
- lgwr contention for something else (ie, if lgwr itself is blocked, then a user session see this as 'log file sync'

if none of these seen, then might be time to talk to Support for some lower level tracing

John Cantu, November 10, 2016 - 12:02 pm UTC

sorry, forgot to mention that 3764 seconds is spent on "log file sync".

John Cantu, November 11, 2016 - 3:18 pm UTC

Thanks, Connor.

Chris mentioned that the wait event "log file parallel write" can be used to figure out how much of the "log file sync" time was spent doing I/O. Are there similar wait events for other I/O wait events like "db file scattered read" and "db file sequential read" that will provide insight into how much time was spent doing I/O and how much time was spent doing nonI/O?

Connor McDonald
November 14, 2016 - 2:29 am UTC

No - they are different because they are the *same* session

For commit:

session 12 asks LGWR to do something, and then waits on log file sync. But they are waiting on *another* session (lgwr) to do its work. LGWR will then have the 'log file parallel write' waits for its session.

For normal read:

session 12 *does* the I/O, hence its waits (db file seq... etc) *are* the I/O cost (as perceived by Oracle).

Now, in *all* cases, an I/O request by Oracle to the OS *might* have some delays that are not *true* I/O (eg queueing etc) ,but thats no different to anything doing I/O.

John Cantu, November 16, 2016 - 5:36 pm UTC

Thanks, Connor. That makes sense.

You mentioned that I/O request may have time spent on items like "queueing" on the OS, so how do we find out how much time was spent "queueing"? Is that is figured out through OS tools only like sar or iostat?

Connor McDonald
November 17, 2016 - 2:31 am UTC

Yes, you have to go into lower level tools. Not even sure if sar/iostat will pick it up.

John Cantu, November 17, 2016 - 10:09 am UTC

Thank you