Question and Answer

Tom Kyte

Thanks for the question, AJ.

Asked: August 31, 2009 - 4:47 pm UTC

Last updated: February 01, 2011 - 3:37 pm UTC

Version: 11.1.0

Viewed 1000+ times

You Asked

I was recently pointed to the article about DB Sharding (Shared Nothing).
http://www.dbta.com/Articles/Editorial/Trends-and-Applications/Database-Sharding--The-Key-to-Database-Scalability--55615.aspx

It seems to me that sharding is to Oracle RAC a bit like SQL Server partitioning is to Oracle Partitioning.

Your app had better know exactly where to find the data (or at least where to find where to find the data).

What is your take on sharding?

Thanks.

and Tom said...

I didn't get (understand) the reference to SQL Server, however.

RAC is not about sharding at all - it is the opposite: we are shared everything with RAC.

Exadata, however ( http://www.oracle.com/database/exadata.html ), is all about that - but without you having to remember where the data was put. It gives you the opportunity to store and retrieve (rapidly) more data than most applications would ever need - without having to 'program to it'.


With true data sharding - you'd need some directory services and very specific types of applications, it is not "general purpose", it solves a particular problem.
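The "directory services" requirement above can be sketched concretely. Below is a hypothetical Python sketch of the lookup layer a sharded application has to carry around; every name in it (SHARD_DIRECTORY, shard_for, the host names) is invented for illustration and is not any real product's API.

```python
# Hypothetical sketch of a shard "directory service": a lookup layer that
# maps a sharding key to the database that holds that row. All names here
# are illustrative, not a real product API.

import hashlib

SHARD_DIRECTORY = {
    0: "db-shard-0.example.com",
    1: "db-shard-1.example.com",
    2: "db-shard-2.example.com",
}

def shard_for(customer_id: str) -> str:
    """Hash the sharding key to pick one of the shard databases."""
    # md5 is used only because it is deterministic across runs,
    # unlike Python's salted built-in hash().
    h = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
    return SHARD_DIRECTORY[h % len(SHARD_DIRECTORY)]

print(shard_for("customer-42"))
```

The point of the sketch: the application must consult this map before every query - the database itself no longer knows where a given customer's rows live, which is exactly the "where to find where to find the data" problem from the question.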


Comments

Sharding

lds, September 02, 2009 - 6:43 am UTC

In my experience, sharding works if the "partitioning" of data is static enough and clearly defined from the start, because changes are much harder to implement and have deep impacts down to the hardware level.
It also usually requires application-level knowledge of where data are and how to integrate them - up to the risk of having client-side joins and aggregations of large data sets.
What the article says about maintenance is just half the truth: it is true that managing a large database is complex, but managing data scattered among multiple systems has issues as well - consolidation was meant to avoid them, because they exist.
Take the "issue" of SAN, for example - one gets large, fast, fault-tolerant storage. Sharding and storing data locally just means replicating fault tolerance at every node. A SAN can allocate space as needed, while local storage may have wasted space, or may lack it and outgrow the available hardware.
In some applications it could work well - Google indexes words in web pages, and that "domain" is static enough: alphabets, words and languages don't change quickly over time, and there is good statistical analysis of their distribution, so there could be good ways to "partition" those data across several machines, while merging results is simple enough - it returns web addresses in a simple display.
Other databases may need to present data to different applications and allow for more complex data manipulation - IMHO sharding in those cases just adds a layer of complexity that has to be managed at the application level, while the "shared everything" approach is more flexible.

really?

Jiri, September 02, 2009 - 12:46 pm UTC

Shared nothing vs. shared everything is like a religion - Inmon vs. Kimball... We will never find the truth, simply because there is no "better".

The Teradata database is shared nothing, and although the cost is very high because of this, you can scale better and it can fit better in some situations. I've seen Teradata perform great in a pretty fast-changing active data warehouse environment, and I disagree that shared nothing adds big complexity for DBAs or developers (it may be a bit more complex to design the indexes, however it may be easier to manage load balancing). Again, there is no better or worse - it's simply different.

Tom Kyte
September 02, 2009 - 1:06 pm UTC

we are shared everything in RAC

but at the storage level we can go shared nothing (transparently, with great flexibility and the ability to dynamically distribute and redistribute the data - unlike the static partitioning of a Teradata).

We do both these days. At the same time if you need.

Please elaborate

Jayashree Subramanian, March 02, 2010 - 2:25 am UTC

You mentioned "at the storage level we can go shared nothing (transparently, with great flexibility and the ability to dynamically distribute and redistribute the data - unlike the static partitioning of a teradata)". I am very curious to know which feature you are referring to.
Tom Kyte
March 02, 2010 - 8:27 am UTC

teradata uses a shared nothing architecture. If you add resources, you have to redistribute the data. Oracle RAC is a shared everything architecture. If you add resources (more nodes), you don't have to move a byte of data, we dynamically distribute access to the data based on the currently available nodes (which might change over the course of day)
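The redistribution cost described above can be illustrated with a toy model (my own, not Oracle's or Teradata's internals): with static modulo placement, growing the cluster reassigns the owning node for most keys - and in a shared-nothing system a changed owner means physically moving those rows, whereas in shared everything only the access map changes.

```python
# Toy illustration (not real product internals): with static hash placement,
# adding a node reassigns most keys. In a shared-nothing system those rows
# must physically move; in shared everything only the access mapping changes
# and the data itself stays put.

def owner(key: int, n_nodes: int) -> int:
    # static modulo placement, a stand-in for any fixed hash distribution
    return key % n_nodes

keys = range(10_000)
moved = sum(1 for k in keys if owner(k, 4) != owner(k, 5))
print(f"keys whose owner changes when growing 4 -> 5 nodes: {moved} of 10000")
```

With these numbers, 8,000 of 10,000 keys change owner just by adding one node - that is the "you have to redistribute the data" cost Tom contrasts with RAC's remapping of access alone.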



A little bit of insight into tricks of RAC would have been helpful.

Shouvik Basu, November 16, 2010 - 7:55 am UTC

Shared everything RAC is only logically true.
There is a notorious thing called interconnect traffic which eats up most of the horizontal scalability gains.
I should not have said "notorious" - in fact, the interconnect is the reason RAC is shared everything.

In practice most RAC installations will centralize specific modules on specific nodes. The loose term for this is "application partitioning". This ensures one node rarely needs data handled by another node.

This is indeed a bias towards shared nothing.

RAC has got other benefits like Transparent Failover for some basic SQLs. They are good from availability standpoint.

And of course I disagree with Tom on share-nothing at storage level. This will be of very limited or no use, if there is heavy interconnect traffic. Global Cache series of waits in a RAC AWR is a very interesting study.
Tom Kyte
November 17, 2010 - 5:09 am UTC

... Shared everything RAC is only logically true. ...

huh? It is physically true as well.

Physically - all of the resources are shared - everyone has access to all of the disk - 100% of the time. That is "shared everything", physically and otherwise.


...In practice most RAC installations will centralize specific modules on specific
node. ...

that is not true, does it happen? do some people do it that way? yes. However, MOST do not, because MOST use more than one node for any given application. As soon as they do that (and MOST do), then they are still doing application partitioning but in a clustered fashion.

RAC has been successful in lots of cases (no application partitioning - the same application running on multiple nodes).

RAC has been unsuccessful in other cases (if you have an application that doesn't scale well on a single machine, it will scale even less well on many machines)


You can take the above two sentences and replace RAC with <any technology goes here> and they will still be true.


if it were of very limited or no use, we wouldn't be having this discussion at all. for RAC wouldn't be anything anyone talked about - it wouldn't be used.

A reader, January 14, 2011 - 7:45 am UTC

Tom,

I have heard claims from IBM that though Oracle RAC is shared disk there is a master node for each block in the database. They claim that because each block has a master node any modifications to a block have to first check with the master node and then with the node that is currently holding the latest block. They further claim that this communication between all the nodes is what inhibits RAC scalability. They have also claimed that when the master node goes down all the locks must be remastered during which the data is unavailable.

So I have some questions for you in this regard.

Is it true that there is a master node that keeps track of all the blocks?

Is it true that a single block has only one master node - meaning if that node fails, that block's mastering must be propagated to another surviving node?

Thanks
Tom Kyte
January 14, 2011 - 12:10 pm UTC

... They claim that because each block
has a master node any modifications to a block have to first check with the
master node and then with the node that is currently holding the latest block. ...

that is not technically accurate. If the node that wishes to make the change has the block already - it doesn't really need to do that.

... They have also claimed that when the master node
goes down all the locks must be remastered during which the data is
unavailable. ...

a process that happens rapidly. When their (IBM's) shared nothing solution goes down - what happens then? Ask IBM also why their *mainframe* parallel sysplex stuff works in a manner very similar to RAC. I know their "open systems" implementations do not - but their big iron, mainframe, big time stuff does.


The mastering of the block must be propagated to another node - yes, but that is a rather fast operation - we are not talking minutes here in the event of a failure - not even close.


A reader, January 14, 2011 - 2:04 pm UTC

Thanks for answering the questions Tom. Just to clarify - when a block needs to be modified by instance A and the latest block is in instance C and the master instance is D what are the series of events that take place before that block is modified?

IBM has a new solution for Linux/unix/Windows called DB2 Purescale which is shared-disk architecture similar to the Sysplex solution. The difference according to them is that all the lock and cache management happens on a Central node (this is not a DB node but one that is dedicated to lock/cache management). And because of this they claim the inter-node communication is minimized.
Tom Kyte
January 14, 2011 - 7:01 pm UTC

A, since it doesn't have the current version, would ask D where it is and request it from C.

I fail to see how a centralized service (can you spell single point of failure?) would cut down on the traffic. Explain that, or have them explain it. And not just in theory - in practical terms, with math behind it.

In IBM land - A would speak to Central Services (think Brazil - the movie :)) and Central Services would tell them C has it - go there. So what is the difference again?

How is having ONE THING do all of the work more scalable than each node doing a percentage of the work?

IBM Purescale vs Oracle RAC

THRJG, January 20, 2011 - 8:50 pm UTC

Dear reader ,

Thought I'd shed some light on this, since I am also an Ask Tom fan and an Oracle DB fan.

For a couple of years Oracle RAC was the only transparently horizontally scalable database in the market.

IBM PureScale doesn't have many technical documents out there yet, but I managed to figure out a few things:

- they took some concepts out of the z-series mainframe
- boxed it in P-series HW with AIX/Linux
- of course there is always a mainframe version

So as per my understanding, IBM DB2 PureScale allows two (could be more) centralized locking nodes to be present for HA reasons in their configurations; it requires a minimum of one PureScale node for the solution to work in cluster mode.
The PureScale node acts as a dedicated message box for all other nodes. It is constructed with low-latency IB/RDMA interconnects.

What they have basically done is created an additional node in the middle dedicated to handling what Oracle RAC does in Cache Fusion. As per the currently available literature, it seems to be limited to IBM HW. The clever point to recognize is that they centralized the caching to a single dedicated node and made that node highly available, which dramatically reduces the inter-node synchronization traffic. Whereas in Oracle RAC everybody knows everything (to a great extent)... but this causes some traffic.

Whereas Oracle RAC is a generic solution that lets us build clusters on many platforms. PureScale, I believe, will more or less fight at the level of Oracle Exadata, not with vanilla RAC.

On the HW side, they have invested in InfiniBand with RDMA and exposed the DB2 code to the cache sync via that. A couple of years back Oracle RAC also took steps towards improving interconnect speed by adopting the RDS protocol over IB (instead of UDP) on Linux. As per IBM, they can do the scaling via this dedicated RDMA concept by bypassing the OS layer and other additional layers (interrupts).

I am sure that if there is market competition from IBM for Exadata, it will only be a temporary setback for Oracle. Decades ago Oracle and Sun did some really innovative features like Solaris KAIO... which made a difference when it came to IO load.

I am sure the upcoming Solaris critical-application API and some focus on RDMA-based optimization will allow Oracle to do the same much better - or they might do the whole Cache Fusion operation in HW, dedicated solely to cache coherency.

Another point is that most of these approaches were materialized some time back in the clustering world... it looks like it is all coming back - instead of proprietary interfaces, vendors are joining the IB bandwagon.

Your scenarios about what happens to a data block, how it gets transferred, and how Cache Fusion keeps this information are explained very well in two very good books on the subject:

- Real Application Clusters Handbook - K. Gopalakrishnan
- Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance - Julian Dyke, Steve Shaw

Last but not least, there is no master instance or master node concept in RAC; sometimes one can get confused by terms like "master instance", which simply refers to the state of a RAC memory structure in a given time quantum under a specific cache-coherence state.

If you are interested in more on IBM DB2 PureScale, you can also follow this blog:
http://it.toolbox.com/blogs/db2luw/db2-purescale-scalability-part-1-35173


Thanks and good luck! Nothing can beat the Oracle innovation!




Tom Kyte
January 24, 2011 - 7:04 am UTC

... which dramatically reduce the inter node
synchronization traffic. ...


how so? the situation is still "I need block X" - in Oracle we would:

a) ask ourselves - do we already 'own' block X - if so, done.
b) if not, we don't own it, then hash block X to figure out which node is the master
c) communicate with that node to figure out who has it, if the master 'owns' the block - it can give it to us and skip (d)
d) get it

In IBM we would

a) ask ourselves - do we already 'own' block X - if so, done.
b) if not, we don't own it, then go to the same node everyone else does - generate yet more traffic to that single thing (instead of spreading the load out over all available hardware) to figure out which node is the master
c) get it


How does that cut down on traffic?
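The difference between the two flows above can be put in a toy load count (my own model, not a measurement of either product). Both designs pay one directory lookup per remote block request; centralization does not remove those messages, it concentrates them on one node instead of hashing them across all nodes.

```python
# Toy load model: count where the directory-lookup messages land.
# N_NODES and REQUESTS are assumed numbers for illustration only.

from collections import Counter

N_NODES = 4
REQUESTS = 100_000

def hashed_master(block_id: int) -> int:
    # RAC-style: block mastering is hashed across all nodes
    return block_id % N_NODES

def central_master(block_id: int) -> int:
    # PureScale-style: one dedicated node answers every lookup
    return 0

hashed_load = Counter(hashed_master(b) for b in range(REQUESTS))
central_load = Counter(central_master(b) for b in range(REQUESTS))

print("per-node lookup load, hashed mastering:", dict(hashed_load))
print("per-node lookup load, central node:   ", dict(central_load))
```

The total message count is identical in both cases; the hashed design gives each of the four nodes a quarter of the lookups, while the central design puts all of them on one node - which is the contention point Tom is asking about.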

IBM Purescale vs Oracle RAC

THRJG, January 28, 2011 - 6:55 pm UTC

Hi Tom,

The primary reason behind my statement is that, as per my understanding, the Cache Fusion architecture consists of a distributed cache and its management. Oracle adheres to a cache-coherence protocol. So in addition to the data block sharing, synchronization information (info kept in the GRD) is also shared between nodes; concepts like the resource master, and who masters what resource, are also kept synchronized across instances. And the way Oracle currently implements the interconnect is very generic. I believe it could be optimized further, but then people might start saying it's proprietary and it is more costly.

IBM is trying a short-term workaround; by no means do I believe the current approach is as sophisticated as, or scales better than, Oracle's - e.g. because there is no information on dynamic resource mastering and workload balancing. In the IBM world, that distributed nature of the memory structure is missing to a great extent and there is always a single point of contact (everything is in the PureScale node). As you said, this could be a negative point. I believe the IBM model uses a very finite messaging mechanism and it is currently optimized for OLTP as well, so the current PureScale vision works to some extent.

Maybe Mr. Cary Millsap could help us here with a small queuing model.

Hope this explains my initial thoughts. Of course it is yet to be proven.
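As a rough stand-in for the queuing model suggested above, here is a back-of-envelope M/M/1 sketch (my own; the arrival and service rates are assumed numbers, not benchmarks): if N nodes each generate lookup traffic at rate lam and a single server can answer lookups at rate mu, a central lock node sees N*lam while hashed mastering gives each node only lam.

```python
# Back-of-envelope M/M/1 comparison of centralized vs. hashed directory
# lookups. lam and mu are assumed illustrative rates, not measurements.

def mm1_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue; requires utilization < 1."""
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

n_nodes = 4
lam = 200.0   # lookups/sec generated per node (assumed)
mu = 1000.0   # lookups/sec one server can answer (assumed)

central = mm1_wait(n_nodes * lam, mu)   # one node absorbs all n_nodes*lam
distributed = mm1_wait(lam, mu)         # each node absorbs only lam

print(f"central : {central * 1000:.2f} ms per lookup")
print(f"hashed  : {distributed * 1000:.2f} ms per lookup")
```

With these assumed rates the central node runs at 80% utilization while each hashed node runs at 20%, so the central design waits several times longer per lookup. A dedicated node could of course be given a much faster mu (RDMA, bypassing OS interrupts, as described above) - the model only shows that the concentration has to be bought back with a faster server.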

Tom Kyte
February 01, 2011 - 3:37 pm UTC

... in addition to the Data block
sharing , synchronization information (info kept in GRD) is also shared between
nodes, ...

just in the same fashion as purescale - sure. Meaning - the nodes need to know who has the block - we do it using all nodes (split the work out) - they do it on a single node (contention).


... because
there is no information on dynamic resource mastering and workload balancing. ..

and that is not a feature.

simulating database sharding in RAC

sasanka, March 15, 2014 - 6:31 pm UTC

I believe I understand shared nothing / shared everything architectures; each has pros and cons. In Oracle, if large, highly concurrent tables are partitioned into different tablespaces in different disk groups/grid disks, will it not give the benefit of both worlds? In ASM I always see only one disk group for +DATA, and that is the best practice in RAC. What about manual placement?
