Join a large table (70 Million Records) with two other tables

Question and Answer

Chris Saxon

Thanks for the question, Radek.

Asked: August 17, 2016 - 1:03 pm UTC

Last updated: August 18, 2016 - 2:33 am UTC

Version: 11G

Viewed 10K+ times

You Asked

Hi,

I have a requirement where I have to join three tables, say Employee (~70 million records) with Department (~2,000 records) and Cities (~2,000 records).

The query looks something like

select /*20 fields*/ from employee e, department d, cities c where /*conditions*/

which produces an output of almost a billion records.

The query takes around two hours to run, which brings me to my question: is there any way to reduce the time the query takes to execute?

I have thought of fetching the records in batches using rownum, but updates or deletes that happen in between could make the data inconsistent.

Any reference would also be of great help!

Thanks in advance!!

//Radek



and Chris said...

If your largest table has 70 million rows but your results have billions, you've missed a join somewhere!

Fixing this will go part of the way to making your query faster.
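To illustrate, here's a sketch using the tables from your question (the column names are my assumptions). Joining three tables generally needs at least two join conditions; leave one out and every matched row is multiplied by every row of the unjoined table:

-- Missing join: nothing relates cities, so each employee/department
-- row is repeated once per city (~2,000 times), and 70 million rows
-- balloon towards billions.
select e.emp_id, d.dept_name, c.city_name
from   employee e, department d, cities c
where  e.dept_id = d.dept_id;

-- Fixed: every table is joined (e.city_id is an assumed column).
select e.emp_id, d.dept_name, c.city_name
from   employee e, department d, cities c
where  e.dept_id = d.dept_id
and    e.city_id = c.city_id;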

But if you're selecting all 70 million rows from employees, your query will still take a while.

So how can you make it faster?

If it does return everything, first you'll need to add some conditions to your where clause that reduce the number of rows you return.

Next you'll want to see if there are any indexes you can create. These may (not will, may) make the query faster. If you want to know more about indexing, I recommend reading:

http://use-the-index-luke.com/
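For example (a sketch only; hire_date and the other column names are assumptions, not taken from your question), a predicate plus a matching index may let Oracle avoid reading all 70 million rows:

-- Index the column you filter on; it *may* help, depending on how
-- selective the predicate is.
create index employee_hire_date_i on employee (hire_date);

select e.emp_id, d.dept_name, c.city_name
from   employee e, department d, cities c
where  e.dept_id   = d.dept_id
and    e.city_id   = c.city_id
and    e.hire_date >= date '2016-01-01';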

Once you've done this, check the execution plan for your query. Use this to see what Oracle is doing, then take appropriate action. If you need help with this, create an execution plan (repeat: execution, not explain). Post it here along with the query and details of the tables and their indexes. Then we'll see what we can do.

If you're not sure how to create an execution plan, you can find instructions at:

https://blogs.oracle.com/sql/entry/how_to_create_an_execution
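In short (using assumed table and column names), one way to capture an actual execution plan is to run the statement with the gather_plan_statistics hint, then display the plan of the last executed cursor:

select /*+ gather_plan_statistics */ e.emp_id, d.dept_name
from   employee e, department d
where  e.dept_id = d.dept_id;

select *
from   table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));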


Comments

A lightning fast response!!

Radek, August 17, 2016 - 4:00 pm UTC

Hi Chris,

Thanks & that was a fast reply!

I used employee and the other tables to keep the question simple; here is what my actual query looks like:

select /*some columns*/ from
(select a,b,c,d,e,f from p1,p2,p3 where /*conditions*/
union
select e,f,g,h,i from q1,q2,q3 where /*conditions*/) t1,
(select a,b,c,d,e,f from m1,m2,m3,m4 where /*conditions*/) t2,
(select a,b,c,d,e,f from n1,n2,n3 where /*conditions*/) t3
where t1.a = t2.a and t1.b = t3.b and t2.c = t3.c

t2 and t3 have ~2,000 records each
t1 has ~70 million

and I have made sure that all the tables are correctly joined.

This query is used to fetch data from the past 20 years (and new data arrives every day, so the volume keeps growing).

I have checked the query plan and there was no problem with it; the issue is simply that the data is huge.

Say the DBA of stackoverflow had the odd idea of checking all the questions posted in the last 5 years. How could they fetch and view all those questions quickly?

Chris Saxon
August 18, 2016 - 2:33 am UTC

For any SQL, there is a "perfect" execution, namely, the minimum amount of work that has to be done to get all the results. An "inefficient" query is one that does more work than that theoretical minimum.

The closer you can get the actual work done to the minimal work required, the better your query is.

This all sounds self-evident, but people often overlook that the minimum amount of work *might* actually be a lot. If I want to aggregate (say) the length of every comment on stackoverflow over past years, then I must either

a) visit in some way every single comment, OR
b) change the physical implementation to mitigate that (eg store the length of each comment in a separate table as it is created; see the sketch after this list).
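As a minimal sketch of option (b) (all table and column names here are made up for illustration), you could maintain the lengths as rows are inserted, eg with a trigger, so the aggregation reads a small summary table instead of every comment:

create table comment_lengths (
  comment_id  number primary key,
  created_on  date   not null,
  body_len    number not null
);

create or replace trigger comment_len_trg
after insert on comments
for each row
begin
  -- capture the length once, at creation time
  insert into comment_lengths (comment_id, created_on, body_len)
  values (:new.comment_id, :new.created_on, length(:new.body));
end;
/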

It is *always* this choice - accept the *current* minimal amount of work required, and see how to do this efficiently, OR, change the minimal amount of work required.

So returning to your original premise: if you've got 70 million records to wade through, the task is either

a) find a way of scanning those 70 million rows *faster* (eg parallel, better execution plan), OR
b) change the implementation to scan less data (eg materialized view, indexing, compression etc); both options are sketched below.
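Hedged sketches of both options (the object names are hypothetical):

-- a) Scan faster: ask for a parallel full scan.
select /*+ parallel(e 8) */ count(*)
from   employee e;

-- b) Scan less: precompute the heavy join/aggregation once and let
--    queries read the stored, much smaller result.
create materialized view emp_by_dept_mv
build immediate
refresh complete on demand
as
select dept_id, count(*) as emp_count
from   employee
group by dept_id;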

Hope this helps.

Connor