
Saturday, February 11, 2006

So what was the answer, Part II

Consolidation… From Merriam-Webster Online we discover:

Main Entry: con·sol·i·da·tion
Pronunciation: kən-ˌsä-lə-ˈdā-shən
Function: noun
1 : the act or process of consolidating : the state of being consolidated
2 : the process of uniting : the quality or state of being united; specifically : the unification of two or more corporations by dissolution of existing ones and creation of a single new corporation
3 : pathological alteration of lung tissue from an aerated condition to one of solid consistency

I will definitely not be discussing the third definition; that does not sound very good at all. However, the second definition, “the process of uniting” – that sounds good.

And to me – it is in general a good thing. Many of the people I talked with had a similar problem. Lots of distributed sites, running basically the same application (well, in fact the same application) and a perceived need to share/replicate data. They wanted to discuss data sharing techniques.

I did not. I’m not a big fan of replication – especially bi-directional, update anywhere replication (which is what they all thought they were interested in). The complexity this adds in terms of design, development, testing, and administration/maintenance is huge. I don’t care whose replication software/process/method/magic you use, it is complex. Almost all of them thought they wanted all data centrally aggregated – but each remote site would have the ability to work autonomously, queuing changes while doing so and synchronizing later. If they were not working autonomously, then updates would either happen at the central site and propagate out – or use a distributed two phase commit transaction updating both locations.

I don’t really like either of those approaches. The update anywhere concept – in an application of any appreciable size (and these were non-trivial, in-place/legacy applications) – involves a rather complex design (or redesign). It is not something you can just “turn on” and expect to work. An application has to be designed to replicate in this fashion – and if it must replicate over three or more databases, the design becomes even more complicated. Two is hard enough; three or more is harder. The problem is, many people try to approach this without the design/redesign phase. That is doomed to failure. Update conflicts will happen (the same data gets modified in two places). The developers of the application had to have thought of this fact and had to have designed “what happens then”. The problem is, many developers don’t understand “lost update” issues in a single database, let alone “update conflicts” in a distributed, replicated, update anywhere database.
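
The single-database “lost update” problem is the baseline here, so a tiny sketch may help. This is a hypothetical schema using SQLite purely as a stand-in: two sessions read the same row, each computes a new value from its stale read, and the second write silently wipes out the first – unless the UPDATE re-checks what was originally read (optimistic locking).

```python
# Sketch of a "lost update" in a single database, and the optimistic
# re-check that prevents it. Schema and values are illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (part TEXT PRIMARY KEY, qty INTEGER)")
db.execute("INSERT INTO inventory VALUES ('tire', 10)")

def read_qty():
    return db.execute("SELECT qty FROM inventory WHERE part='tire'").fetchone()[0]

# Two sessions both read qty = 10, then each subtracts its own order.
seen_by_a = read_qty()
seen_by_b = read_qty()

# Naive writes: B blindly overwrites A's work -- A's order is "lost".
db.execute("UPDATE inventory SET qty=? WHERE part='tire'", (seen_by_a - 3,))
db.execute("UPDATE inventory SET qty=? WHERE part='tire'", (seen_by_b - 4,))
print(read_qty())  # 6, not 3: A's decrement vanished

# Optimistic version: only write if the row still holds what we read.
db.execute("UPDATE inventory SET qty=10 WHERE part='tire'")   # reset
cur = db.execute("UPDATE inventory SET qty=? WHERE part='tire' AND qty=?",
                 (seen_by_a - 3, seen_by_a))
assert cur.rowcount == 1   # first writer wins
cur = db.execute("UPDATE inventory SET qty=? WHERE part='tire' AND qty=?",
                 (seen_by_b - 4, seen_by_b))
assert cur.rowcount == 0   # second writer is told "you lose" -- re-read and retry
```

In one database the conflict is detected at write time and one session is told to retry; update anywhere replication defers that collision to synchronization time, when both users have already gone home thinking they won.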

Take a simple inventory application. There is a tire, a single tire, in stock right now. I, having access to site 1, order that tire. You, having access to site 2, do the same. Eventually, our updates cross each other on the network. Now the inventory has “negative one” tires in it. What happens here? The update conflict detection was easy enough – the database does that for us. The conflict resolution was trivial (just keep taking things out of inventory). However, we have just violated some sort of business rule. In a single system, one of us would have been told “Sorry, you lose”. In the distributed system, we both think we won. How do you pick who loses? How do you notify them? What is the maximum time someone might be deluded into thinking they have the tire on order? What else can go wrong in this application (remember, there are literally hundreds of tables – maybe hundreds or thousands of transactions – any of which could “go wrong”)?
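
The tire scenario can be reduced to a few lines. This is a deliberately simplified sketch (dictionaries standing in for replicas, a list for the change queue): each site validates the order against its own local copy, both checks pass, and the later merge happily applies both decrements.

```python
# Toy model of update anywhere replication gone wrong: local checks pass
# at both sites, the queued changes merge later, and the business rule
# (qty >= 0) is violated. All names and structures are illustrative.
central = {"tire": 1}        # master copy: one tire in stock
site1 = dict(central)        # each site works from a local replica
site2 = dict(central)
queued = []

def order(site, part, queue):
    if site[part] > 0:             # local check: "yes, we have one"
        site[part] -= 1
        queue.append((part, -1))   # queue the delta for later sync
        return "order accepted"
    return "sorry, you lose"

print(order(site1, "tire", queued))   # order accepted
print(order(site2, "tire", queued))   # order accepted -- both of us "won"

for part, delta in queued:            # asynchronous merge at the center
    central[part] += delta

print(central["tire"])                # -1: negative inventory
```

Conflict detection and mechanical resolution both “worked” here; the hard part – deciding who loses, telling them, and unwinding whatever they did in the meantime – is exactly what the replication machinery cannot do for you.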

Now, some 15 years ago (early to mid 1990’s), we may have considered this a necessary evil. Networking was a shaky proposition. Wide Area Networking was really shaky and unreliable. But today, in 2006?

I was at the Western Wall in Jerusalem the week I was talking with all of these people about their four questions. As far as I was concerned, I was about as far away as I’ve ever been from “my systems” – the computer systems I use and rely on (email is one of Oracle’s mission critical applications). At the Western Wall, I was in line for a tour – nothing to do for a couple of minutes. So what did I do to amuse myself? I checked my email, naturally; I could have browsed the web, instant messaged with someone, whatever. I remember when I would go to Europe from the US in 1995 – it was like going to another planet connectivity-wise. No mobile phone. I couldn’t even use the phone jacks to dial into a network there without a converter – and even then, I had no phone numbers to dial. I was effectively off the network unless I was in an Oracle office. Now it seems that no matter where in the world I am, I have access to a network and to “my systems”.

Whether it be my phone with GPRS/EDGE, my Aircard with EVDO or 1xRTT, a line-of-sight wireless network, a hotspot (they seem to be popping up everywhere), a wired network, a satellite connection – whatever. In Tel Aviv, my wireless connection for the 24 hour period expired as I was writing an email in the hotel. I was leaving in a couple of hours and didn’t want to pay again. No big deal – I simply failed over to my phone (plug in the hot sync cable, fire up PdaNet, I’m on).

My thought on replication, therefore: don’t spend the money on the design, development, testing, maintenance, and administration (which will be quite huge, but only if you want the application to actually work) – rather, invest in a redundant networking infrastructure, a failover solution. That is something that will be useful for everything. Not just this one little application – everything.

In some cases, however, the problem wasn’t necessarily technical – it was political. People don’t like to give up “their” data (and here I thought the data belonged to the company…). This would mean centralization, coordination, a perceived loss of control. In that case, all I can do is spell out exactly what it entails to build a distributed, replicated application. It isn’t easy.

Our mission critical application in Oracle – email – is a single centralized system (with failover of course: RAC in a room to keep the main server going, Data Guard to a remote site so that a catastrophe doesn’t wipe it out). It used to not be that way. It was many little distributed systems all over the world. We had the same arguments internally – the network is the problem, loss of ‘control’ is the problem, you cannot take ‘our’ data away from us was the problem. Funny thing, years later – none of these are a problem anymore. It runs, it runs well, it runs with a lot less overhead than before. It is easier.

Replication technologies of the unidirectional, read-only type have a place perhaps – in warehousing. But to build an application with – not unless there is a really compelling technical reason (a submarine, for example, has a really compelling technical reason; there, data sharing technologies and update anywhere just might be appropriate).

Anytime I’m asked about synchronous replication (this table must be the same in both databases all of the time), my answer is “you’ve done this really wrong”. Even if asynchronous replication is permitted – but the data is modifiable in more than one place, I would answer basically the same. I know how the replication technology works, I’ve used it, I can describe to you what it does – but I personally don’t like to promote it for most applications. It is the path of last resort – not my first choice.
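
Why “this table must be the same in both databases all of the time” is a design smell can be shown with a toy two-phase commit. This sketch is purely illustrative (the `Site`, `prepare`, and `commit` names are made up, not any real replication API): keeping two sites synchronously identical means every write is hostage to every site being reachable.

```python
# Toy two-phase commit: synchronous "identical everywhere" replication
# couples each write to the availability of every participant.
class Site:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}
    def prepare(self, key, value):          # phase 1: vote
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        self.pending = (key, value)
    def commit(self):                       # phase 2: apply
        k, v = self.pending
        self.data[k] = v

def synchronous_write(sites, key, value):
    for s in sites:          # ALL sites must vote yes...
        s.prepare(key, value)
    for s in sites:          # ...before ANY site commits
        s.commit()

a, b = Site("new_york"), Site("tel_aviv")
synchronous_write([a, b], "tire_qty", 1)
print(a.data, b.data)        # both sites agree

b.up = False                 # the WAN hiccups...
try:
    synchronous_write([a, b], "tire_qty", 0)
except ConnectionError as e:
    print("write blocked:", e)   # even purely local users cannot update
```

The availability you hoped to buy with two copies is exactly what the synchrony requirement takes away – which is why, if the data really must be identical everywhere at all times, it probably belongs in one database.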

So, back to consolidation – I believe “like systems” should be consolidated. If you were going to replicate between two or more systems – you probably really meant to build a single system. Distributed complexity is just that – complex.

I believe that the maximum number of instances on a single server is one (in my world that is also the minimum, but that is another story…). If you are running 10 instances with 10 applications, you really meant to run a single instance with 10 applications inside of it. It is the only way you’ll really be able to tune, to control, to manage, to keep them all on the same release (it sort of forces the issue). If you can run 10 instances on that server with 10 applications, you could really run 11 or 12 or more applications on that same server with a single instance. You don’t have multiple SGAs with their redundancies and “oversizing”; you don’t have multiple pmons, smons, lgwrs, dbwrs, and so on (and the contention caused by having multiple lgwrs and multiple archs, all thinking they are operating in isolation). You don’t have one of the instances consuming all of the CPU at the expense of the others (you can use profiles and resource manager in a single instance to control resource utilization).
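
The “oversizing” point is just arithmetic, so here is a back-of-the-envelope sketch with purely illustrative numbers: when every instance carries its own just-in-case headroom, the padding is paid ten times over; consolidate, and the headroom can be pooled once.

```python
# Illustrative memory math for 10 instances vs. 1 consolidated instance.
# The sizes below are made-up round numbers, not measurements.
per_app_working_set_gb = 2
headroom_gb = 1        # each separate SGA is oversized "just in case"
apps = 10

# Separate instances: every one of the 10 carries its own headroom.
separate = apps * (per_app_working_set_gb + headroom_gb)

# One instance: the workloads share a single pooled headroom.
consolidated = apps * per_app_working_set_gb + headroom_gb

print(separate, consolidated)   # 30 vs 21 GB -- the redundancy adds up
```

The same pooling argument applies to background processes and CPU: one set of pmon/smon/lgwr/dbwr, and one resource manager arbitrating between the applications instead of ten instances fighting blindly.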

So, that is my take on consolidation. It does not mean “you will run a single database for everything”. It means you will run as few databases as you can – one instance per machine, and try to avoid distributed complexity. Data sharing has its place – in warehousing – but update anywhere replication is hard. It complicates the design (or at least it should, but many times does not, leading to applications that don’t work correctly in the field).

14 Comments:

Anonymous Anonymous said....

Well, if the telcos start selling gigabit connectivity at reasonable prices. And provide it in all locations, including the semi-rural sites that mid-sized manufacturing companies prefer (I mentioned in your home T1 post the difficulties involved in getting just a 56k data circuit to a manufacturing facility in western South Carolina). And the telcos agree not to use adverse QoS prioritization to control my traffic / charge me more, as they are currently threatening to do.

And perhaps most importantly, if application designers start taking performance considerations in the real world seriously. As an example, go to http://www.oracle.com/careers and try to search for jobs and/or submit a resume. One of the slowest most frustrating sites I deal with.

Many of the people who push these things (and I am not including Tom here; he seems quite practically minded) are consultants who fly from large city airport hotel meeting room to large city strongly connected office, from Starbucks in Chicago to Starbucks in London - and who think as a result that there is high quality global connectivity. But when you get just 50 km outside the large cities, connectivity starts to fail very quickly; often you are dealing with telcos whose (even today) engineers have less experience with the datacomm equipment than you do. And rural telcos that simply don't have the cash flow necessary to put in the pipe and interface equipment you need.

Color me a bit skeptical as of yet.

Sat Feb 11, 04:43:00 PM EST  

Blogger Thomas Kyte said....

Color me a bit skeptical as of yet.

I was talking about applications, running in major cities, 0km outside of large cities, in the middle of it all.

Network connectivity is not the issue with these applications. "I own 'my' data" was the biggest issue - with the "but what if the network disappears tomorrow" being one of the reasons.

At a corporate level, having satellite at the very least as a failover - not a big deal for a web based application.

And, I'd have a much easier time teaching developers how to build a page that works under high latency, high bandwidth - than I would teaching them how to deal with update anywhere conflict resolution. Of that - I am sure (and even if I couldn't, at least it would just mean "slow" instead of "wrong").

Sat Feb 11, 08:07:00 PM EST  

Anonymous Anonymous said....

Hi Tom

As misfortune would have it, I'm in the midst of redesigning one of those submarine-type situations where each site absolutely requires its own working copy of the data, but the data is in fact centrally "owned" and needs to be aggregated at some point (ideally at the first opportunity following each update). I've come up with the outline of a ridiculously complex replication plan – some materialized views are read-only at the remote sites, some are updatable, some data originates at and might be owned by the remote site and needs to be replicated to the central site, etc. Connectivity is frequently available but by no means guaranteed; it will be down for several hours per working day and might be down for months at a time, and of course the site must be able to work regardless of the state of its connectivity.

As you’ve pointed out, Oracle can be set up to perform as needed, but writing the application to handle the exceptions and to generally be “replication environment aware” is going to be non-trivial. Do you know of an application development reference geared towards that specific problem?

For that matter, managing all those Oracle instances around the world, with spotty connectivity, is going to be something of a challenge. They are already out there in the current version of the application, but very little Oracle replication is happening and I feel like that is going to increase the need to try to monitor the database jobs and logs, which isn’t done today. Any advice on that matter? My thought is to post a resume on that slow Oracle careers site, find a better problem to work on, and let these guys find a better Oracle dev to work on their problem.

Sat Feb 11, 10:50:00 PM EST  

Blogger Thomas Kyte said....

Do you know of an application development reference geared towards that specific problem?

Not specifically, no. Most of what has been written on replication is geared more towards "admin" than development. The development stuff, however, is sort of like any database problem. In a single database, you have lost updates, multi-user race conditions, how to do integrity checks correctly and in a scalable fashion - and so on. But there is no cookbook to "do your application in a single database", just a bunch of concepts and ideas you have to take into consideration while building your application.

Update anywhere replication just makes that list a tad longer. And the problems are harder to solve, harder to simulate (and therefore harder to test), harder to even conceive (the old "how the HECK did that happen" syndrome).

Sun Feb 12, 08:17:00 AM EST  

Blogger DBA King said....

I myself prefer the Active-Passive / RAC + Data Guard/*Plex for DR to keep things simple. IMHO the bad designs are due to the Biz Tech Groups getting carried away by hype and arm twisting the architects to add features that they have just read about in a major magazine.

I stand clueless on why architects/developers try to design and develop OLite solutions for CRM/Sales Force, when they are better off investing in a good GPRS/EDGE/WiFi infrastructure... should I say Job Security???

Oracle King

Sun Feb 12, 08:00:00 PM EST  

Blogger Tim... said....

Tom, I agree with what you say regarding the reduction of instances throughout the organisation, but it's very difficult when using a range of Oracle products.

Just think of a simple stack like, DB10g (for your own applications), AS10g (with infrastructure), Oracle eBusiness Suite, Oracle Collaboration Suite and 10g Grid Control. Now try to consolidate all those elements, including infrastructure databases, into a single instance.

If you manage to get it working, is it a supported configuration? What happens when you need to upgrade or patch a database and that breaks one application's certification?

The reality is that Oracle's current product stack requires many instances. Each product has its own set of requirements and the only way to make them run in a stable manner is to separate them off.

When you add in several third party applications, with their own specific requirements, it's very easy to find yourself with a whole bunch of servers and instances.

When we started our current system upgrade our management wanted a single RAC database for everything. We now have 12 distinct Oracle databases. All of our internal applications run on a single RAC database. The others are required to support numerous Oracle and 3rd party applications.

Mon Feb 13, 02:50:00 AM EST  

Blogger Thomas Kyte said....

but it's very difficult when using a range of Oracle products.

Rather, I would say "it is very difficult when using a range of products" - period.

How many ebusiness suites do you need to have? 1 or 10? I say 1, others say 1 per [whatever].

How many copies of YOUR application do you need to have? 1 or 1000? I say 1, others say 1 per [whatever].

How many instances do you need to run two of your in house developed applications (those things you own, you built)?

I say one, others say two.

I tried to summarize:

So, that is my take on consolidation. It does not mean “you will run a single database for everything”. It means you will run as few databases as you can – one instance per machine, and try to avoid distributed complexity.

Meaning - don't add to the problem by purposely creating an application that compounds the problem.

Mon Feb 13, 07:12:00 AM EST  

Anonymous Anonymous said....

The stovepipes have not gone away, they've just been outsourced.


Mon Feb 13, 09:53:00 AM EST  

Anonymous Anonymous said....

completely agree about replication. as a developer that supports an application with remote users, replication is by far the greatest time consuming task in our development cycle. it seems common in sales organizations that ownership of data is perceived to be important. reps competing with each other is the main reason for this behavior, in my opinion. i blame management for not implementing a unifying message.

we are currently looking at some sort of blackberry/wireless laptop solution which i think will be just as difficult to maintain. i'd prefer an internet based solution, but my lowly developer status doesn't have much weight around here.

any chance you'd come to hartford for a talk?

Mon Feb 13, 03:14:00 PM EST  

Blogger Thomas Kyte said....

any chance you'd come to hartford

I've been there a couple of times already in fact - with the local user group. I'm sure I'll be there again sometime. (Was just there October 12th, 2005..)

Mon Feb 13, 03:56:00 PM EST  

Anonymous Anonymous said....

Hi Tom,

We have a great discussion in our company here is Israel.
“To replicate or not to replicate” that is a question...
We have Authentication Server (Some kind of Radius or RSA Authentication Server)
We use Oracle for user’s status database and some more.
There is relatively small number of tables that may be replicated, small amount of data.

Arguments to make 3 sites with 3 Authentication Servers and 3 DataBases that are replicated with Oracle 10g Streams or something else:

- Instead of Authentication via US site’s DataBase Israel users will be Authenticated via Israel’s site DataBase
- Instead of make complex distributed deployment we install one simple installation Unit ( Server + DataBase ) on each site and just use some tools for replication (or even In house development tools). Almost no changes in the servers and deployment;
- Active-Active DataBase instead of Active-Passive. And if one site is down, users will continue to Authenticate via Remote site;
- Replication is complex, they say, only if amount of data to be replicated is really great and number of possible conflicts is also great;

As alternative one central Data Base with RAC and Data Guard… For scalability Authentication Servers on each site and still 1 central DataBase

Personally, I am not sure that we have a “submarine”

Are you familiar with any commercial Authentication System that uses central Oracle Database instead of replication?
What additional problems expected with replication design in our case?

Thanks,
Natanel

Tue Feb 14, 05:18:00 AM EST  

Blogger Thomas Kyte said....

Natanel

You are correct, you do not have a submarine.

The number of update conflict opportunities you need to make something complex:

One.

It is not the magnitude of the number of "off the top of our heads" conflicts.

This bullet point of yours:

...
- Instead of make complex distributed deployment we install one simple installation Unit ( Server + DataBase ) on each site and just use some tools for replication (or even In house development tools). Almost no changes in the servers and deployment;
.....


How can MAKING a complex distributed deployment (replication + three sites) be the opposite of a complex distributed deployment?

If you deploy once, you have, well, the definition of "non-distributed", don't you?

Most people use a central Oracle database instead of replication - because most people do not use replication.

Tue Feb 14, 06:34:00 AM EST  

Anonymous Anonymous said....

rats - sorry i missed you in october. i'm migrating my skills to oracle and am hoping to enter the oracle world full-time sometime in the spring. your books have been an inspiration to me and i hope to see you sometime in the future.

Tue Feb 14, 02:37:00 PM EST  

Blogger Andy Campbell said....

Totally agree replication should be a last resort. I've supported many replicated databases and they are your worst nightmare when they go out of sync - and they all seem to eventually, some replication products more regularly than others.

In the worst case you have to take downtime on all the databases to sync up the data - and availability is often the reason replication is put in to start with. It's even more annoying when it's one-way replication to a reporting database and you need an outage on the production OLTP.

I've always liked OAR - because it's complicated. All the API calls that are required make architects/developers think twice about using it!

Andy

Tue Feb 14, 05:02:00 PM EST  
