Data Retrieval from very large tables
Sandeep, December 08, 2014 - 6:14 pm UTC
Hi Tom,
Many Thanks for the response.
What I am trying to point out is: how should I deal with it when I receive a statement like this from an end user: "At the beginning it was very fast (for a new business proposal there would naturally be fewer subscribers at the start), but after a month we see delays of seconds, sometimes minutes, when nothing other than the volume of information has changed - same query, same application-layer code, same network bandwidth (or maybe better than before)." One may point out that this is a classic case of a full table scan, but if it is not the case, then where should I begin? And what if I am retrieving more than 30% of the table's data to generate statistics (I came across suggestions, from different schools of thought, that indexing helps only when retrieving 5, 10, or maybe 15% of the rows)?
Regards,
December 09, 2014 - 3:19 pm UTC
If that is what you were trying to point out, you picked the opposite example - an ANTI-example.
Something has changed - the data is larger. Something else probably changed too - the volume of queries has changed, the volume of users has changed (someone created all of that data), so the amount of CPU, memory, and network available to your query has in fact changed (more things using them, less for you to use).
there are two types of queries in the world.
Ones that take about the same amount of time to retrieve data from a table(s) regardless of the size of the table. These generally need to process and return the same number of rows regardless of the table size.
Examples:
a) You gave us a classic one above. Find the last transaction (1 record) for a given customer. That'll be an index min/max scan. It will perform N IOs, where N = the height of the index. The height of an index is generally very small (1, 2, 3, 4, 5) even for millions or billions of rows. (A sketch follows below this list.)
b) give me the last 10 visits for a patient
c) look up this primary key
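For instance, here is a minimal sketch of (a); the table, index, and customer id are hypothetical, invented purely for illustration:

create table txn_demo
( customer_id  number  not null,
  txn_date     date    not null,
  amount       number );

create index txn_demo_ix on txn_demo ( customer_id, txn_date );

-- "last transaction for customer 12345": the optimizer can probe the
-- right-hand edge of this customer's index entries (an index range scan
-- min/max), touching only blevel+1 blocks however large the table grows
select max(txn_date)
from   txn_demo
where  customer_id = 12345;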
Others tend to take longer as the volume of data grows. Examples:
a) what is the sum of sales over time (no predicate, sum all rows)
b) how many page views have I done (a predicate, look at my records, but over time I have to count more and more records)
c) please find all of my records in this unindexed table (full scan, takes longer and longer)
Each of those has solutions - I might use a materialized view for (a) and (b), and I might add an index for (c). A sketch of the materialized view idea follows.
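A minimal sketch for (a), assuming a hypothetical SALES table (the names are illustrative, not from the question):

create materialized view sales_by_day_mv
as
select trunc(sale_date) as sale_day,
       sum(amount)      as total_sales
from   sales
group  by trunc(sale_date);

-- the summary holds one row per day, so reading it stays cheap no matter
-- how large SALES grows (the refresh strategy - on demand, on commit,
-- scheduled - is a separate design decision)
select sum(total_sales) from sales_by_day_mv;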
Indexing, by the way, might be good for up to 0.5% (zero point five percent) of the rows in some cases, and up to 100% in others. It depends.
Suppose you have a table that holds 100 rows per block, and you want to retrieve 1% of the rows. Do you want to use an index?
If you say "yes, obviously", then I will tell you that the 1% of the rows you want are each on a different block - there is no block containing two rows you want; every block holds exactly one row you need. Now... do you want to use an index? NOPE - it would be really bad to read every block in the table one block at a time via an index. A full scan would rock and roll.
If you say "no", then I will tell you that this table has 100 blocks: the first block has 100 rows with a value of 1, the second block has 100 rows with a value of 2, and so on. The query "where id = 1" should of course use an index to find that single block....
You see - it depends, there isn't a single answer. You have to actually understand how the data arrives and understand the nature of the question you are trying to answer (the queries). Every problem might be slightly different from every other problem. There is no silver bullet.
What you can (and must) do is... Understand how the database works. Understand data patterns ("1%" - not as simple as it seems). Understand what data structures are available to you (hash clusters, b-tree clusters, index-organized tables, partitioning, the ability to reorganize a table with an ORDER BY using dbms_redefinition, covering indexes, function-based indexes, materialized views, etc.) and then, after understanding the problem you are facing, apply the right tool(s) in the right fashion.
A reader, December 10, 2014 - 9:24 am UTC
Hi Tom,
Many thanks for the reply; I accept your suggestion.
Regards
A reader, September 17, 2015 - 6:14 pm UTC
Hi Tom,
I have
case-1: 1,000 rows
case-2: 100,000 rows
case-3: 10,000,000 rows
As per my understanding, the "order by" clause is a sorting operation.
Maybe I have to read more about how Oracle works, but I think case-1 will take less time than case-3. And if this is correct, then as the number of rows in the table increases, the query is going to take more time purely because of the sorting operation.
Thanks and Regards,
Sandeep
September 18, 2015 - 4:28 am UTC
The answer is "it depends".
If it takes 3 seconds to find those 1,000,000 rows, then the 10 seconds to sort them is a large percentage of the execution time.
But if it takes 10 hours to find those 1,000,000 rows, then the 10 seconds to sort them is irrelevant.
The same holds true for *any* number of records - you understand the data, understand how it needs to be accessed, and design/implement the physical design to meet those goals.
Hope this helps.
A reader, September 19, 2015 - 5:34 am UTC
Hi Tom,
Thank You.
From your inputs, could I please conclude the below?
1) Generally, if there is an index in place and only a specific number of records are to be fetched (like the last 5 transactions), the retrieval time will be low, but there is a chance that the sorting may consume additional time.
The question: if the table size is 30 GB but each record is only a few hundred bytes, would scanning the index add a few more seconds, or not?
2) What if this table is a transaction table, i.e. the number of records is continuously increasing every second? Since the table is huge and it is a transaction table, I don't have the liberty to collect statistics whenever I want.
The execution plan will keep referring to the statistics collected at midnight. Why is such a design in place? Because we need to provide real-time data: a customer should be able to see the transaction he did a few seconds ago when checking his last 5 transactions. And why is the table so huge? Simply so as not to miss the scenario where a customer does a transaction today but his previous transaction was 15 days back - we have to keep all the data (at least one month's worth, which is why the table is so large). Yes, partitioning by date is in place, purging data older than 30 days to limit the table and tablespace size, with local indexes to avoid the locking caused by rebuilding a global index after the daily partition purge.
I can't think of a summary table or a materialized view here, as everything is changing every second and the table is also huge.
I am lost - not sure which technology I am missing here, or whether I am misunderstanding the technology itself and so am unable to implement it properly, though I am aware of it.
Regards,
Sandeep
September 21, 2015 - 3:27 am UTC
1) The sorting cost will be proportional to the number of rows (and their size) to be sorted. If you only get 5 rows (even from a 10,000 GB table), it will be quick to sort them.
2) If a table is growing *steadily*, then why bother *collecting* statistics? Just set them manually. E.g., if you add 100,000 rows per day, just bump up the row counts and block counts accordingly each day (or even more frequently if you need to). Similarly, you know the high water mark on (say) the transaction time will be "now"...
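A minimal sketch of setting statistics manually, assuming the hypothetical TXN table from this thread (the numbers are invented - plug in whatever your growth rate dictates):

begin
   dbms_stats.set_table_stats(
      ownname => user,
      tabname => 'TXN',
      numrows => 200000000,    -- yesterday's count plus today's expected rows
      numblks => 2500000 );    -- bumped proportionally
end;
/

Column-level details (such as the high value of the transaction time) can be adjusted in the same spirit with dbms_stats.set_column_stats.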
A reader, September 22, 2015 - 2:48 am UTC
Hi Tom,
Thank You.
1) Say the table size is 10 TB, we have an index in place, and the indexed column is used in the WHERE clause.
The query below:
select *
from ( select payment_amount
       from   txn
       where  customer_id = '12345'
       order  by transaction_date desc )
where rownum <= 5;
So here, the first thing done is filtering all the records for the given customer_id. Though only 3 to 5 blocks are scanned to retrieve a single record, for a 10 TB transaction table the inner query could be retrieving more than a million records.
Once the above is done, only 5 records are fetched out of this million sorted records, directly from the 10 TB table.
I just wanted to understand whether this is the efficient way.
2) A little more about collecting statistics: how would collecting statistics frequently on a 10 TB table impact the customers?
Regards,
Sandeep
September 22, 2015 - 6:34 am UTC
If
where customer_id = '12345'
will return 1 million rows, then that will be slow to retrieve AND slow to sort. In the case above, a better index choice would probably be (customer_id, transaction_date).
For example
SQL> drop table TXN purge;
Table dropped.
SQL> create table TXN as select * from dba_objects;
Table created.
SQL> create index IX on TXN ( owner, created ) ;
Index created.
SQL> set autotrace traceonly explain
SQL> select *
2 from ( select * from TXN where owner = 'SYS' order by created desc )
3 where rownum <= 10;
Execution Plan
----------------------------------------------------------
Plan hash value: 4197688684
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 3680 | 4 (0)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 10 | 3680 | 4 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID | TXN | 2716 | 307K| 4 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN DESCENDING| IX | 10 | | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=10)
4 - access("OWNER"='SYS')
A reader, September 22, 2015 - 3:26 pm UTC
Hi Tom,
Thank You.
What if the data type is timestamp and not date?
Regards,
Sandeep
September 23, 2015 - 12:05 am UTC
I think it's probably not too strenuous an exercise for you to take the example I just posted, use a timestamp instead of a date, and see what happens.
Do a "describe" on TXN and you'll see what I mean.
A reader, September 24, 2015 - 3:28 am UTC
Hi Tom,
I will definitely give it a try and get back to you. The reason I asked back immediately is that I had been advised not to index a timestamp column in which milliseconds are recorded.
Regards,
Sandeep
A reader, September 24, 2015 - 5:47 am UTC
Hi Tom,
The cost was 4 when statistics had not been collected.
Once I collected statistics after creating the table, the cost is 3 and the index is still being used.
So if ioseektim is 10 ms, iotfrspeed is 4 KB/ms, and the block size is 8 KB, then the single-block read time is 10 + 8/4 = 12 ms, and the total time for a cost of 3 would be about 3 x 12 = 36 ms.
create index IX on TXN ( owner, timestamp ) ;
select *
from ( select * from TXN where owner = 'SYS' order by timestamp desc )
where rownum <= 10;
Execution Plan
----------------------------------------------------------
Plan hash value: 4197688684
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 2070 | 3 (0)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 10 | 2070 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID | TXN | 9209 | 809K| 3 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN DESCENDING| IX | 10 | | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=10)
4 - access("OWNER"='SYS')
I see now what you meant.
Regards,
Sandeep
A reader, October 06, 2015 - 1:51 pm UTC
Hi Tom,
Thank You so much.
I guess my question is finally answered.
I was under the impression that only columns in the WHERE clause should be indexed, and I was not aware that I can put an index on the column I am ordering by as well. So could you please also help me understand which other clauses' columns (like ORDER BY) I can index?
Regards,
Sandeep
October 07, 2015 - 2:39 am UTC
It could be *anything*...
eg
select col1, col2
from my_table;
Lets say "my_table" has 300 columns. It *might* be benefical to index col1,col2 simply so the query above can scan through the whole index, which will be "thinner" (ie, smaller) than the whole table (with its 298 extra columns).
Indexes always come down to the same question, i.e., does the *benefit* of having one outweigh the *cost* of having it?
Hope this helps.
A reader, October 10, 2015 - 5:15 am UTC
Hi Tom,
Many Thanks
It is helpful in understanding indexing better.
Thanks and Regards,
Sandeep