and we said...
well, I've written about such things myself in the past... Way back in Effective Oracle by Design for example...
These two (Jonathan's idea and mine below) are very similar - except for the fact that I refuse to return 300,000 rows to the client! I agree with Jonathan at the start of his paper - that there is probably some huge mistake being made there. I limited hits to 500 at most - and even then, I thought 500 was probably an order of magnitude too large.
so, I would have fought long and hard, hard and long with the fact that 300,000 rows are coming out - I'd need a really good justification.
And then, if I lost such a battle, and the sort was the absolute killer (see last paragraph below, same sentiment shown there - optimize to the resource that will kill you - if CPU and RAM for sortspace is your weak point...) then his query would be ok.
I like to call his approach "understanding the physics of the data", you have to have a really good understanding of how data is accessed, what is happening (NOT INTERNALS, I am not talking INTERNALS - I'm talking about how indexes are used, what a sort does, how the data is laid out, and so on) and what you can do about it.
Just as a note: you'd have to be very very careful with that hinted approach. I've said many times - the only way to get sorted data from Oracle is to use an ORDER BY statement. Jonathan is using a set of hints which may or may not work in the future - or even today under certain circumstances (see his link, he caveats that). As long as the plan turns out to be what he is expecting - it should work - but things change - so any patches to the database would have to include a heavy test of this query - to ensure a change to the optimizer doesn't change the plan.
Optimize to Your Most Frequently Asked Questions
If you have a system you know will be executing some given query or set of queries hundreds of times per minute, or even per second, that system must be designed, optimized, and built around those queries.
For example, my workplace once had an internal system called Phone. You could telnet into any email machine (back when email was character mode), and on the command line, type phone <search string>. It would return data like this:
$ phone tkyte
TKYTE Kyte, Tom 703/555 4567 Managing Technologies RESTON:
When the Web exploded in about 1995/1996, our group wrote a small web system that loaded this phone data into a table and let people search it. Now that it was in a database and had a little GUI to go with it, it started becoming the de-facto standard within the company for looking up information about people. Over time, we started adding more data and more fields to it. It really started to catch on.
At some point, we decided to add a lot more fields to the system and rebuild it with more features. The first thing we did, based on our knowledge of how people would use this simple little system, was to design the tables to hold this data. We had a read-only repository of data that people would be searching, and it needed to be fast. This system was growing in popularity every day and was threatening to consume our machine with its resources. We had a single 67-column, 75,000-row table upon which we wanted to perform a simple string search against various fields. So, if a user put in ABC, it would find ABC in the email address, or the first name, or the last name, or the middle name, and so on. Even worse, it would be interpreted as %ABC%, and the data was in mixed case.
There wasn¿t an index in the world that could help us here (we tried them all), so we built our own. Every night, as we refreshed the data from our human resources system (a complete refresh of data), we would also issue the following after we were finished:
CREATE TABLE FAST_EMPS
SELECT upper(last_name)||'/'||upper(first_name)||'/' ¿. || '/' ||
substr( phone, length(phone)-4) SEARCH_STRING,
In effect, we built the most dense, compact table possible (PCTFREE 0) and asked that it be cached if possible. Then we would query:
where rowid in ( select row_id
where search_string like :bv
and rownum <= 500 )
This query would always full-scan the FAST_EMP table, which is what we wanted, because that table was, in fact, our ¿index.¿ Given the types of questions we were using, that was the only choice. Our goal from the outset was to minimize the amount of data that would be scanned, to limit the amount of data people would get back, and to make it as fast as possible. The query shown here accomplishes all three goals.
The FAST_EMP table is typically always in the buffer cache. It is small (less than 8% the size of the original table) and scans very fast. It has already done the work of the case-insensitive searching for us once (instead of once per query) by storing the data in uppercase. It limits the number of hits to 500 (if your search is broader than that, refine it; you¿ll never look at 500 hits). In effect, that table works a lot like an index, since it stores the ROWID in EMPLOYEES. There are no indexes employed on this system in order to do this search¿just two tables.
Before settling on this design, we tried a couple of alternative approaches:
· We tried using a fast full scan on a function-based index (close, but not as fast).
· We tried interMedia text (not useful due to the %ABC% requirement).
· We tried having just an extra field in the base EMPLOYEES table (wiped out the buffer cache because it was too big to full-scan).
It may seem foolish to have spent so much time on this detail. However, this one query is executed between 150,000 and 250,000 times a day. This is two to three times a second, every second, all day long, assuming a constant flow of traffic (and we cannot assume that, since we frequently have spikes in activity, as most systems do). If this single query performed poorly, our entire system would fall apart, and it is just one of thousands of queries we need to do. By determining where our weak points would be¿the lowest hanging fruit, so to say¿and concentrating on them, we were able to build an application that scales very well. If we had tried the tuning-after-the-fact principle, we would have found ourselves rewriting after the fact.
The point is that when developing a system, you need to give a lot of thought to how people will actually use that system. Consider how physical organization can matter, and realize the impact a simple two times increase in logical I/O might have on your system¿s performance.
Is this answer out of date? If it is, please let us know via a Comment