Skip to Main Content
  • Questions
  • Can RAC and Oracle Clusterware restart a batch process on another node if an instance/node it is running on failed?

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Milo.

Asked: May 03, 2007 - 11:30 am UTC

Last updated: July 02, 2012 - 1:43 pm UTC

Version: 9.2.0.8 and 10.2.0.3

Viewed 1000+ times

You Asked

We are in the process of planning to move our application to a 10g RAC environment, and our Database Architect proposes to make some changes in our batch application to take advantage of RAC, and what he suggested for us to do, made me wonder if RAC can handle it automatically or if we do need to modify our batch process to handle it programmatically in case of a node failure, as suggested by our DB architect.
Here are the background information on our batch process, which on a typical day we would need to process at least a million records.
1. A Java process kicks off which calls a stored plsql program that determines what records we need
to process and assemble these records into group of records;
2. The same java process would then assigns a group of record to a java thread for processing.
Multiple java threads is started to process these records in parallel.
3. Each java thread calls a PLSQL stored program, which in turn process the records assigned to it.
Our Database Architect proposes we modify our batch program so that each java thread would be
executed on different Oracle instance/node to essentially spread the load across a number of nodes.
My question is this, If the RAC node on which one or more of our batch threads is running on, happened to crash or fail, will RAC/Oracle Clusterware be intelligent enough to reconnect these java threads and resume
running on a surviving node? Or do we need to handle this by changing our batch program to recover
and restart one or more new java thread to continue processing on the surviving nodes?

Thanks in advance for taking and answering my question.

and Tom said...

I question why you have java in there at all. Seems to be misplaced.

Why not just invoke a stored procedure that submits N jobs using dbms_job, each procedure submitted would have inputs directing it to process a given "batch". The database does the rest.

Including giving you the ability to have each process run on a given node, or just run on any node - you control that as you submit the job.



Rating

  (12 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

RajaRajan, May 05, 2007 - 12:15 am UTC

The purpose of going to RAC would be for High Availability ,and your design (Clientside or Server side load balancing) will take care of the load balancing part..
The developer need not bother on balancing the load across the nodes ..that should have been addressed during the RAC implementation phase. And if the jobs are instance specific there is noway that it is going to be HA ,and if a node fails, you got to have a shell script or so,to redirect the job to be executed on another node .\

Regds
Raj

Why use Java for sake of using Java...?

Sagar, May 07, 2007 - 2:42 am UTC

Hi Milo,
I fully agree and follow Toms philosophy....
In the order of preference SQL>>PLSQL>>JAVA>>C
Why make things complicated?...
If plsql stored procedure can do that use it. Anyway right now you are not making your code "database independent" as you are invoking plsql stored procedure from Java.


Thanks!

Milo Tan, May 08, 2007 - 1:05 pm UTC

Thanks to everyone for their inputs. Unfortunately, this is an existing batch process, and was deployed even before I got assigned to this group. For our RAC proof-of-concept, the mandate is not to make any drastic change in the existing code and design. I did some further reading on RAC, and do correct me If I wrong here, Tom. But for what we are doing in our batch process, I don't think RAC can continue processing (i.e., automatic failover) on another node/instance, in the case the instance the thread is processing on failed.





Tom Kyte
May 11, 2007 - 8:18 am UTC

sure it can, but any in flight transaction would be rolled back and have to be restarted.

Create Service and configure TAF

Mwakuye, May 11, 2007 - 2:45 pm UTC

Hi Milo,

RAC can automatically fail over to the surviving instance like Tom said.

Create a Database Service with dbms_service Package or Enterprize Manager and enable TAF on it.

Tell the Batch process to connect using that service.
You can even enable load-balancing on that service.

I have not tried this but here is a step-by-step Example:

http://www.oracle.com/technology/obe/10gr2_db_vmware/ha/rac/rac.htm

Regards,

Mwakuye

you can failover

Fernando Madrigal, May 13, 2008 - 1:37 pm UTC

but your transacion no,
it resides in the shared_pool, log_buffer, ... etc.

You get a new session, but your batch job loose its transaction state.
Tom Kyte
May 13, 2008 - 3:28 pm UTC

your current in process read-write transaction will be aborted - but you'll still be connected, yes. You resubmit the transaction and it would immediately "work" that time - you are 'failed over'

your current read only transaction can failover and pick up where it left off if you want.

connection pooling

Anil, October 21, 2009 - 11:46 pm UTC

My Java application uses connection pooling machanism to connect to the databases. We can see that both the instances in a cluster are having 100 connections each. However the DBA recently reported that only 1 instance of the cluster is getting used heavily. What could be the reason for this. Thanks

Tom Kyte
October 23, 2009 - 1:19 pm UTC

depends on how you configured your stuff entirely.

I'll assume:

your connection pool extends 100 connections at a time.
you are using load balancing (not round robin - but load balanced)

you start your app server and it initiates 100 connections, they'll all tend to go to node1 since both nodes have told the listener "I am idle". It'll take a while for node 1 to tell the listener "I'm not busy", so your logon storm gets all 100 connected.

At some point, the need to have 101 connections arises, so you get 100 more, they all go to node 2.


But mostly, you stay under 100 needed connections - they are all on node1, everynow and then you hit node2, but not often.


If you used the Oracle connection pool - it is rac aware and does load balancing back at the application server.

Otherwise, you'll probably want to use the round robin set up so that the first connection goes to node1, then node2 then node1 and so on.

Alexander, May 02, 2012 - 12:29 pm UTC

Hi,

I have the same situation as stated above, I have a few follow up questions.

1) If we allocate 100 connections at a time, and both nodes are idle, and we are using load balancing, why wouldn't it distribute 50/50 between the nodes?

2) Does this mean if we have 100 connections and one node is really hot it will not distribute existing connections within the pool to the idle node, only if new connections need to be made?

3) What is the round-robin setting for the client connection string? Is that the default?

Just to give some background, we have web services that load rows one at a time. So because each row is basically its own transaction I would have thought it would be distributed evenly between our two nodes.

I've already instructed the developers to bulk it up and stop thrashing our lgwr. But I'm trying to anticipate even if they're coming through 1000 at a time, how they will be directed and weigh the pros/cons of load balancing and the additional interconnect traffic.
Tom Kyte
May 02, 2012 - 2:20 pm UTC

1) because it takes a while to notice that a given node is "busy" - pmon reports that to the listener - after a while.

you'd use round robin to get the 50/50.

2) if you use a "non-rac aware" connection pool - you'll have a good chance of that - yes. because your connection pool doesn't understand which instance it is connected to and doesn't understand what the load on each node is and doesn't know how to load balance.

if you are using a connection pool - your connection pool is the thing mediating access to the database - NOT the listener anymore. Once it (your connection pool) established its connections - we are out of the picture - your connection pool is doling out the connections, the pathways to the database - not us.

That is why I said you'd probably want to use a rac aware connection pool.

3) yes, just don't enable load balancing


In many cases - when you are doing a large load like that - you might want to limit it to a single node as you'd otherwise have index blocks flying around between the nodes so as to update them and only one node at a time really can modify a block (only one node can have the block in current mode).

see this
http://www.youtube.com/watch?v=2jWq-VUeOGs

and the section on resource management right at the beginning.

Alexander, May 03, 2012 - 9:39 am UTC

Thanks this is very help. Something else I don't quite understand though, if I looked on both instances the connections numbers are about even. That's not really in agreement with what you're saying, so I'm trying to connect that dot.
Tom Kyte
May 03, 2012 - 2:26 pm UTC

are you using load balancing, how does your connection pool connect?

do your connection pools start up N connections immediately or over time grow to N?

are the number of active sessions typically about the same?

Alexander, May 03, 2012 - 3:42 pm UTC

are you using load balancing, how does your connection pool connect?

Yes, load balancing.

do your connection pools start up N connections immediately or over time grow to N?

I believe they start up with N.

are the number of active sessions typically about the same?

Typically yes. But when this "batch" (really slow by slow) occurs, it's likely uneven due to the fact one server is pegged and one is 30%.
Tom Kyte
May 06, 2012 - 2:03 pm UTC

If you run this test (run in test if possible :) )

a) stop app server
b) note connections on each instance
c) start app server
d) report on connections to each instance


Are you using the Oracle jdbc connection pool (can be RAC aware)?


when this 'batch' goes - does the work load ever even out - if not, it would be an indication that you are not RAC aware and just throwing load at the servers in any fashion.

Bulk Load

A reader, May 06, 2012 - 5:12 am UTC

Hi Tom, You said "In many cases -when you are doing a large load like that -you might want to limit it to a single node as you'd otherwise have index blocks flying around between the nodes so as to update them".
But wouldn't it be good, if we drop / disable indexes before the load and rebuild indexes after the load. In this case shouldn't multiple nodes provide a better response time. However I do understand that in cases where indexes cannot be dropped e.g incremental loads, a single node may be more efficient.
Tom Kyte
May 06, 2012 - 3:29 pm UTC

depends on your needs.
depends on what you are doing.
it depends.


but even in light of no indexes - it might still make sense to isolate the workload. Especially if the other nodes are doing query/reporting work. If you are doing bulk/direct path loading - you can do it mostly in isolation without impacting the other workload too much.

How to ?

A, July 02, 2012 - 9:12 am UTC

Hello Tom,
How to to limit the load to a single node?

Thanks
A
Tom Kyte
July 02, 2012 - 1:43 pm UTC

describe to me this batch - it is a job in the job queue, is it a client connecting to an instance, is it a parallel query - what is it?

typically - you would configure a service, that service would have only one instance register for it - hence all connections for that service will go to that one instance.

Marc, December 03, 2013 - 12:10 am UTC

From your response :
Are you using the Oracle jdbc connection pool (can be RAC aware)?

What do you mean by that? You mean the below sample TNS

jdbc:oracle:thin:@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = tmg100-vip.svr.us.naya.net)(PORT = 1582))(ADDRESS = (PROTOCOL = TCP)(HOST = tsrdcudboq100-vip.svr.us.naya.net)(PORT = 1582))(LOAD_BALANCE = yes)(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = NAYA)))"

To be honest, I am findind it difficult to understand the client side load balancing in connection with java connection pooling. I would appreciate if you can elaborate a bit on this topic.
Thanks for your wonderful service to the oracle community.


More to Explore

PL/SQL demos

Check out more PL/SQL tutorials on our LiveSQL tool.

PL/SQL docs

PL/SQL reference manual from the Oracle documentation library