Data Retrieval from very large tables

Question and Answer

Connor McDonald

Thanks for the question, Sandeep.

Asked: December 07, 2014 - 4:44 am UTC

Last updated: October 07, 2015 - 2:39 am UTC

Version: 11.2.0

Viewed 100K+ times!

You Asked

Hi Tom,

I am seeing a new trend. Even the transaction tables are growing so fast that the batch jobs (which summarize the data of the transaction tables into a smaller version, for faster data retrieval), though scheduled at non-business hours (I wonder, in today's world, which hour is not a business hour :) ), are stretching close to business hours. I understand that for decision making a separate infrastructure should be in place, but my scenario goes as below.

As a customer, I subscribe to a voice/data package and make a payment for it, but I also have the option to check whether my payment was received by the service provider: I send a predefined code as a text message, and the very next moment I get a text message in return containing the last payment I made. I may be the millionth customer trying to retrieve data from the same table at the same time. The traditional ways of table tuning (partitioning, indexing, separate tablespaces + datafiles for index and table, partition-level indexing, sub-partitioning, and making sure the statistics are not stale) do not look like much help here; rather, I would say the amount of data involved is too large. And there is not much to do to tune the SQL, because it is a straightforward query.

Since the batch job to summarize the data for that day is yet to run, I am supposed to query the transaction table directly. If no third-party tool is involved, how is it done so that the response is immediate - in other words, so that the end user receives the return message (concentrating on the query output only) as quickly as possible? Yes, there may be data design concepts like "divide and rule" that could help, but sticking to my scenario of a "very large table", or maybe "very large data from which the filtering is to be done", how should this be handled using only the Oracle database software?

Hope my scenario is not a funny one. :)

Thanks and Regards

and Tom said...

It doesn't matter how large the table is - if I'm going to get just a few rows (as I would for the above request; looking up the last payment would be trivial), it comes back about as fast from a billion row table as it would from a hundred thousand row table.

Partitioning would not help. (you are not full scanning)
Summarization would not help. (you want the last record)

All you need is an index to find the last transaction for a given customer.

select *
from (select payment_amount
      from txn
      where customer_id = ? and trans_type = 'PMT'
      order by trans_date desc)   -- "date" is a reserved word; trans_date is assumed
where rownum = 1;

that would take very little work at all.
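A supporting index sketch (column names taken from the query above; the index name is just illustrative):

create index txn_cust_pmt_ix on txn (customer_id, trans_type, trans_date);

-- with this index, Oracle can range scan descending straight to the
-- customer's newest payment and stop after one row (COUNT STOPKEY),
-- regardless of how many rows the table holds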


Not sure what exact issue you are trying to point out - but the scenario you lay out would not be an issue.



Comments

Data Retrieval from very large tables

Sandeep, December 08, 2014 - 6:14 pm UTC

Hi Tom,
Many Thanks for the response.

What I am trying to point out is: how should I deal with it when I receive a statement like this from the end user - "At the beginning it was very fast (for a new business proposal there will always be fewer subscribers at the beginning); after a month, why do we see a delay of seconds and sometimes minutes, when other than the data growth nothing else has changed - same query, same application layer code, same network bandwidth (or maybe it is even better than before)?" One may point out that this is a classic case of a full table scan, but if that is not the case, then where should I begin? And what if I am retrieving more than 30% of the table's data, to generate statistics (I came across suggestions, from different schools of thought, that indexing helps only when retrieving 5, 10 or maybe 15% of the rows)?

Regards,
Tom Kyte
December 09, 2014 - 3:19 pm UTC

if that is what you were trying to point out, you picked an opposite example, an ANTI-example.

Something has changed - the data is larger. Something else probably changed too: the volume of queries has changed, the volume of users has changed (someone created all of that data), so the amount of CPU, memory and network available to the query has in fact changed (more things using them, less for you to use).


there are two types of queries in the world.


Ones that take about the same amount of time to retrieve data from a table(s) regardless of the size of the table. These generally need to process and return the same number of rows regardless of the table size.

Examples:

a) you gave us a classic one above: find the last transaction (1 record) for a given customer. That'll be an index min/max scan. It will perform N IOs where N = height of the index. The height of an index is generally very small (1, 2, 3, 4, 5) even for millions or billions of rows - see the dictionary query after this list.

b) give me the last 10 visits for a patient

c) look up this primary key
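(To see that height for yourself, a quick sketch against the data dictionary - TXN is just the example table name used in this thread; BLEVEL is the number of branch levels, so height = BLEVEL + 1:)

select index_name,
       blevel + 1 as height,   -- root-to-leaf block visits per lookup
       num_rows
from   user_indexes
where  table_name = 'TXN';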


Others tend to take longer as the volume of data grows, examples:


a) what is the sum of sales over time (no predicate, sum all rows)

b) how many page views have I done (a predicate, look at my records, but over time I have to count more and more records)

c) please find all of my records in this unindexed table (full scan, takes longer and longer)


Each of those has solutions - I might use a materialized view for (a) and (b); I might add an index for (c).
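For example, for (a) a materialized view could keep the running sums current without rescanning the detail rows (a sketch - the sales table and its columns are assumed for illustration):

create materialized view log on sales
  with rowid, sequence (sale_date, amount)
  including new values;

create materialized view sales_by_day
  refresh fast on commit
  enable query rewrite
as
select trunc(sale_date)  sale_day,
       sum(amount)       total_amount,
       count(amount)     cnt_amount,   -- the count columns are required
       count(*)          cnt_all       -- for fast refresh of an aggregate MV
from   sales
group  by trunc(sale_date);

Queries summing sales then read this small summary (automatically, via query rewrite) instead of the ever-growing detail table.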



Indexing by the way might be good for up to 0.5% (zero point five) in some cases, up to 100% in others. It depends.

Suppose you have a table that gets 100 rows per block. Suppose you want to retrieve 1% of the rows. Do you want to use an index?

If you say "yes, obviously", then I will tell you that the 1% of the rows you want are each on a different block, there are no blocks such that 2 rows on there are both wanted by you - each block has 1 row. Now..... do you want to use an index? NOPE - it would be really bad to read every block a single block at a time via an index. Full scan would rock and roll.

If you say "no", then I will tell you that this table has 100 blocks, and the first block has 100 rows with a value of 1, the second block has a 100 rows with 2 and so on. The query "where id = 1" should of course use an index to find the single block....


You see - it depends, there isn't a single answer. You have to actually understand how the data arrives and understand the nature of the question you are trying to answer (the queries). Every problem might be slightly different from every other problem. There is no silver bullet.

What you can (and must) do is... Understand how the database works. Understand data patterns (1% - not as simple as it seems). Understand what data structures you have available to you (hash cluster, btree cluster, index organized tables, partitioning, the ability to reorg a table with order by using dbms_redefinition, covering indexes, function based indexes, materialized views, etc etc etc) and then after understanding the problem you are facing - apply the right tool(s) in the right fashion.
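As one concrete illustration from that list, an index-organized table stores the rows physically sorted by the primary key, so "recent transactions for a customer" lookups touch very few blocks (a sketch; the table and column names are made up, and (customer_id, trans_date) is assumed unique):

create table txn_iot (
  customer_id     number not null,
  trans_date      date   not null,
  payment_amount  number,
  constraint txn_iot_pk primary key (customer_id, trans_date)
) organization index;   -- the table *is* the index: no separate table segment to visit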

A reader, December 10, 2014 - 9:24 am UTC

Hi Tom,

Many Thanks, for the reply, accepting your suggestion.

Regards

A reader, September 17, 2015 - 6:14 pm UTC

Hi Tom,

I have

case-1: 1000 rows
case-2: 100,000 rows
case-3: 10,000,000 rows

As per my understanding, the "order by" clause is a sorting operation.

Maybe I have to read more about how Oracle works, but I think case-1 will take less time than case-3. And if this is correct, then as the number of rows in the table increases, the query is going to take more time, purely because of the sorting operation.

Thanks and Regards,
Sandeep
Chris Saxon
September 18, 2015 - 4:28 am UTC

The answer is "it depends".

If it takes 3 seconds to find those 1,000,000 rows, then the 10 seconds to sort them is a large percentage of the execution time.

But if it takes 10 hours to find those 1,000,000 rows, then the 10 seconds to sort them is irrelevant.


The same holds true for *any* number of records - you understand the data, understand how it needs to be accessed, and design/implement the physical design to meet those goals.

Hope this helps.

A reader, September 19, 2015 - 5:34 am UTC

Hi Tom,
Thank You.

From your inputs, could I please conclude the below?

1) Generally, if there is an index in place and only a specific number of records is to be fetched (like the last 5 transactions), the retrieval time will be low, but there is a chance the sorting will consume additional time.
The question: if the table size is 30GB and each record is only a few hundred bytes, would scanning the index add a few more seconds or not?

2) What if this table is a transaction table, i.e. the number of records keeps increasing every second? Since the table is huge and it is a transaction table, I don't have the liberty to collect statistics whenever I want, so the execution plan will keep referring to the statistics collected at midnight. Why is such a design in place? Because we need to provide real-time data: a customer should be able to see the transaction they did a few seconds ago when checking their last 5 transactions. And why is the table so huge? Simply because we cannot miss the scenario where a customer does a transaction today but their previous transaction was 15 days back, so we have to keep all the data (at least one month's worth, which is why the table is so large). Yes, partitioning by date is in place, just to purge data older than 30 days and limit the table and tablespace size, and local indexing is in place, to avoid the locking caused by rebuilding a global index after the daily partition purge.

I can't think of a summary table or a materialized view here, as everything is changing every second and the table size is also huge.

I am lost - not sure which technology I am missing here, or whether I am not understanding the technology itself and therefore not implementing it properly, though I am aware of it.

Regards,
Sandeep
Connor McDonald
September 21, 2015 - 3:27 am UTC

1) The sorting cost will be proportional to the number of rows (and their size) to be sorted. If you only fetch 5 rows (even from a 10,000G table), it will be quick to sort them.

2) If a table is growing *steadily*, then why bother *collecting* statistics? Just set them manually - eg if you add 100,000 rows per day, just bump up the row counts and block counts accordingly each day (or even more frequently if you need to). Similarly, you know the high water mark on (say) the transaction time will be "now"... A sketch of that manual bump is below.
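(A minimal sketch, assuming the TXN table from earlier in the thread and growth of roughly 100,000 rows / 2,000 blocks per day; the numbers are placeholders:)

declare
  l_numrows number;
  l_numblks number;
  l_avgrlen number;
begin
  -- read the current statistics...
  dbms_stats.get_table_stats(ownname => user, tabname => 'TXN',
                             numrows => l_numrows,
                             numblks => l_numblks,
                             avgrlen => l_avgrlen);
  -- ...and bump them by the known daily growth; no scan of the big table needed
  dbms_stats.set_table_stats(ownname => user, tabname => 'TXN',
                             numrows => l_numrows + 100000,
                             numblks => l_numblks + 2000,
                             avgrlen => l_avgrlen);
end;
/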

A reader, September 22, 2015 - 2:48 am UTC

Hi Tom,
Thank You.

1) Say the table size is 10TB, we have an index in place, and the indexed column is used in the where clause.

The query below:-

select *
from (select payment_amount
      from txn
      where customer_id = '12345'
      order by transaction_date desc)
where rownum <= 5;

So here, the first thing done is filtering all the records for the given customer_id. Though the number of blocks scanned is 3 to 5 when retrieving a single record, for a 10TB transaction table the inner query could be retrieving more than a million records. Only once that is done are 5 records fetched out of this million sorted records, directly from the 10TB table.
Just wanted to understand if this is the efficient way.

2) A little more about collecting statistics: how is frequently collecting statistics on a 10TB table going to impact the customers?

Regards,
Sandeep


Connor McDonald
September 22, 2015 - 6:34 am UTC

If

where customer_id = '12345'

will return 1 million rows, then that will be slow to retrieve AND slow to sort. In the case above, a better index choice would probably be: customer_id, transaction_date

For example

SQL> drop table TXN purge;

Table dropped.

SQL> create table TXN as select * from dba_objects;

Table created.

SQL> create index IX on TXN ( owner, created ) ;

Index created.

SQL> set autotrace traceonly explain
SQL> select *
  2  from ( select * from TXN where owner = 'SYS' order by created desc )
  3  where rownum <= 10;

Execution Plan
----------------------------------------------------------
Plan hash value: 4197688684

---------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |    10 |  3680 |     4   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                 |      |       |       |            |          |
|   2 |   VIEW                         |      |    10 |  3680 |     4   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID | TXN  |  2716 |   307K|     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN DESCENDING| IX   |    10 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=10)
   4 - access("OWNER"='SYS')


A reader, September 22, 2015 - 3:26 pm UTC

Hi Tom,

Thank You.

What if the data type is timestamp and not date?

Regards,
Sandeep
Chris Saxon
September 23, 2015 - 12:05 am UTC

I think it's probably not too strenuous an exercise for you to take the example I just posted, use a timestamp instead of a date, and see what happens.

Do a "describe" on TXN and you'll see what I mean.

A reader, September 24, 2015 - 3:28 am UTC

Hi Tom,

I will definitely give it a try and get back to you. The reason I asked straight away is that I had been advised not to index a timestamp column in which milliseconds are recorded.

Regards,
Sandeep

A reader, September 24, 2015 - 5:47 am UTC

Hi Tom,

The cost was 4 when statistics were not collected.
Once I collected statistics after creating the table, the cost is now 3 and the index is also being used.

So if the seek time (ioseektim) is 10ms, iotfrspeed is 4KB and the block size is 8KB, the single block read time is 10 + 8/4 = 12ms, and the total time for a cost of 3 would be about 36ms.

create index IX on TXN ( owner, timestamp ) ;

select *
from ( select * from TXN where owner = 'SYS' order by timestamp desc )
where rownum <= 10;

Execution Plan
----------------------------------------------------------
Plan hash value: 4197688684

---------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |    10 |  2070 |     3   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                 |      |       |       |            |          |
|   2 |   VIEW                         |      |    10 |  2070 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID | TXN  |  9209 |   809K|     3   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN DESCENDING| IX   |    10 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=10)
   4 - access("OWNER"='SYS')

I see now what you meant.
Regards,
Sandeep

A reader, October 06, 2015 - 1:51 pm UTC

Hi Tom,

Thank You so much.
I guess, my question is finally answered.
I was under the impression that only the columns in the where clause should be indexed, and was not aware that I can also index the column I am doing the order by on. So could you please also help me understand which other clauses' columns (like order by) I can put indexes on?

Regards,
Sandeep
Connor McDonald
October 07, 2015 - 2:39 am UTC

It could be *anything*...

eg

select col1, col2
from my_table;

Let's say "my_table" has 300 columns. It *might* be beneficial to index (col1, col2) simply so the query above can scan through the whole index, which will be "thinner" (ie, smaller) than the whole table (with its 298 extra columns).
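(A sketch of that "thin" index - my_table, col1 and col2 are the hypothetical names from above:)

create index my_table_thin_ix on my_table (col1, col2);

-- if at least one of col1/col2 is NOT NULL, every row appears in the
-- index, so the optimizer can answer "select col1, col2 from my_table"
-- with an index fast full scan instead of scanning the 300-column table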

Indexes are all the same in that regard, ie, does the *benefit* of having it outweigh the *cost* of having it?

Hope this helps.

A reader, October 10, 2015 - 5:15 am UTC

Hi Tom,
Many Thanks
It is helpful in understanding indexing better.

Thanks and Regards,
Sandeep
