first_value when doing joins with multiple tables

Question and Answer

Tom Kyte

Thanks for the question, Steve.

Asked: March 14, 2002 - 11:02 am UTC

Last updated: July 27, 2006 - 12:11 pm UTC

Version: 8.1.7

You Asked


Hi Tom,

I am trying to better understand the first_value function when multiple
joins/tables are involved. The problem I'm having is making sure
I specify the right "partition by" clause. I created some test
tables and I did manage to get the result I wanted, but it took
a few attempts. With test data it's easy to know when the query
is correct; it's not so obvious or easy to validate with live data.

The example I used involved 5 tables:

people: a list of people
jobs: a list of jobs
job_history: for a given person, the list of jobs they held
job_address_history: for a given job/company, their address history (i.e. the company moved a few times)
address_history: for a given person, their address history (i.e. they moved a few times)

Note: The SQL to create/populate these tables is at the
end of this message.


Query Goal:
-----------

List all users, with their current job, current home address, and,
for their current job, the company's current address.


I think this is the query:

col address format a22
col name format a14

select distinct
first
,last
,first_value(name)
over (partition by job_history.person
order by end_date desc nulls first
) name
--
,first_value(start_date)
over (partition by job_history.person
order by end_date desc nulls first
) start_date
--
,first_value(end_date)
over (partition by job_history.person
order by end_date desc nulls first
) end_date
--
,first_value(address_history.address)
over (partition by people.id
order by address_history.move_out_date desc nulls first
) address
--
,first_value(job_address_history.address)
over (partition by people.id
order by end_date desc nulls first
) Company_address
from people
,job_history
,jobs
,address_history
,job_address_history
where people.id = job_history.person
and job_history.JOB_ID = jobs.job_id
and jobs.job_id = job_address_history.job_id
and people.id = address_history.person;


FIRST  LAST   NAME     START_DAT END_DATE  ADDRESS                COMPANY_ADDRESS
------ ------ -------- --------- --------- ---------------------- ------------------------
John   Doe    Oracle   01-APR-98           John's current house   Oracle's current address
Mary   Smith  Oracle   01-NOV-00           Mary's current house   Oracle's current address
Mike   Jones  Oracle   01-JAN-01           Mike's current house   Oracle's current address




But when I first tried the query, I used the following:


select distinct
first
,last
,first_value(name)
over (partition by job_history.person
order by end_date desc nulls first
) name
--
,first_value(start_date)
over (partition by job_history.person
order by end_date desc nulls first
) start_date
--
,first_value(end_date)
over (partition by job_history.person
order by end_date desc nulls first
) end_date
--
,first_value(address_history.address)
over (partition by people.id
order by address_history.move_out_date desc nulls first
) address
--
--,first_value(job_address_history.address)
-- over (partition by people.id
--
,first_value(job_address_history.address)
over (partition by job_address_history.job_id
order by end_date desc nulls first
) Company_address
from people
,job_history
,jobs
,address_history
,job_address_history
where people.id = job_history.person
and job_history.JOB_ID = jobs.job_id
and jobs.job_id = job_address_history.job_id
and people.id = address_history.person



FIRST  LAST   NAME     START_DAT END_DATE  ADDRESS                COMPANY_ADDRESS
------ ------ -------- --------- --------- ---------------------- ----------------------------------------
John   Doe    Oracle   01-APR-98           John's current house   Burger King's current address
John   Doe    Oracle   01-APR-98           John's current house   McDonalds's current address
John   Doe    Oracle   01-APR-98           John's current house   Oracle's current address
Mary   Smith  Oracle   01-NOV-00           Mary's current house   Can. Tire's first and current address
Mary   Smith  Oracle   01-NOV-00           Mary's current house   Oracle's current address
Mike   Jones  Oracle   01-JAN-01           Mike's current house   Oracle's current address
Mike   Jones  Oracle   01-JAN-01           Mike's current house   Wal-mart's current address.

7 rows selected.


Which is obviously wrong.

Can you give me some guidelines or "rules" for properly
defining the partitions, to avoid the "guessing game" I
seem to always be going through?

Thanks Tom.

Steve


Here are the scripts to create/populate the tables in case you need them.






drop table people;
create table people (id int,first varchar2(20),last varchar2(20));
insert into people values (91,'Mary','Smith');
insert into people values (92,'John','Doe');
insert into people values (93,'Mike','Jones');

drop table jobs;
create table jobs (job_id int,name varchar2(30),address varchar2(40));
insert into jobs values (1,'McDonalds','Maple Road');
insert into jobs values (2,'Burger King','Main Street');
insert into jobs values (3,'Oracle','500 Oracle Parkway Redwood Shores, CA');
insert into jobs values (4,'Wal-mart','President Drive');
insert into jobs values (5,'Canadian Tire','Maple Leaf drive');
insert into jobs values (6,'Sybase','22 cul-de-sac ave');


drop table job_history;
create table job_history (person int,start_date date,end_date date, job_id int);
insert into job_history values (92,'01-jan-1985','01-jan-1989',1);
insert into job_history values (92,'01-jan-1989','01-apr-1998',2);
insert into job_history values (92,'01-apr-1998',NULL,3);
insert into job_history values (91,'01-jan-1990','01-nov-2000',5);
insert into job_history values (91,'01-nov-2000',null,3);
insert into job_history values (93,'01-jan-1988','01-jan-1990',6);
insert into job_history values (93,'01-jan-1990','01-jan-2001',4);
insert into job_history values (93,'01-jan-2001',null,3);


drop table address_history;
create table address_history(person int, move_in_date date, move_out_date date, address varchar2(40));
insert into address_history
values (91,'01-jan-1990','01-jan-1995','Mary''s first house');
insert into address_history
values (91,'01-jan-1995','01-jan-1997','Mary''s second house');
insert into address_history
values (91,'01-jan-1997','01-jan-2000','Mary''s third house');
insert into address_history
values (91,'01-jan-2000',NULL,'Mary''s current house');
insert into address_history
values (92,'28-apr-1966','27-jul-1991','John''s first house');
insert into address_history
values (92,'27-jul-1991','15-sep-1994','John''s second house');
insert into address_history
values (92,'15-sep-1994',NULL,'John''s current house');
insert into address_history
values (93,'01-jan-1964','01-jan-1977','Mike''s first house');
insert into address_history
values (93,'01-jan-1977',NULL,'Mike''s current house');

drop table job_address_history;
create table job_address_history(job_id int, move_in_date date,
move_out_date date, address varchar2(40));
insert into job_address_history
values (1,'01-jan-1979','01-jan-1992','McDonalds''s first address');
insert into job_address_history
values (1,'01-jan-1992','01-jan-1999','McDonalds''s second address');
insert into job_address_history
values (1,'01-jan-1999',NULL,'McDonalds''s current address');
insert into job_address_history
values (2,'01-jan-1985','01-jan-1997','Burger King''s first address');
insert into job_address_history
values (2,'01-jan-1997',NULL,'Burger King''s current address');
insert into job_address_history
values (3,'01-jan-1980','01-jan-2000','Oracle''s first address');
insert into job_address_history
values (3,'01-jan-2000','01-dec-2000','Oracle''s second address');
insert into job_address_history
values (3,'01-dec-2000',NULL,'Oracle''s current address');
insert into job_address_history
values (4,'01-jan-1966','01-jan-2000','Wal-mart''s first address.');
insert into job_address_history
values (4,'01-jan-2000',NULL,'Wal-mart''s current address.');
insert into job_address_history
values (5,'28-apr-1966',NULL,'Can. Tire''s first and current address');
insert into job_address_history
values (5,'27-jul-1991','15-sep-1994','Sybase''s First address');
insert into job_address_history
values (5,'15-sep-1994',NULL,'Sybase''s current address');


and Tom said...

first, thanks so much for an easy to run example -- create tables, insert statements -- everything I needed. That was most excellent (hint hint rest of world)

well, if you have my book -- I think I cover this topic fairly well & verbosely. It's hard to visualize. Once you learn how to "read it aloud", it might help.

For example -- the only difference between your two queries is the last column, Company_address.

You had in the first (correct) query:

,first_value(job_address_history.address)
over (partition by people.id
order by end_date desc nulls first
) Company_address

In "english" that says -- "take all of the ROWS in the ENTIRE result set, group it by person.id. Within each group of person.id, sort the data desc by END_DATE, putting NULLS first. Then, take the FIRST_VALUE of that group and assign it to each row in that group (every row with the same person id)"

In effect that finds the most recent (null being most most recent) job_address_history.address for a person.id (a person)



The second (incorrect) one asks for:


,first_value(job_address_history.address)
over (partition by job_address_history.job_id
order by end_date desc nulls first
) Company_address

that says: partition the entire result set into groups by JOB_ADDRESS_HISTORY.JOB_ID. Now, sort each of these groups (which contain rows for possibly many different PEOPLE) by end_date desc (nulls first). Now, for each row with the same JOB_ID, assign the most recent company address. So each PERSON in the database who had the same job as some other person will get the SAME address -- the address for, perhaps, some other people.id.




In short, you wanted "first values" by people.id -- you always wanted to create a partition (a window, if you will) that looks at people.id and sorts the data in that window looking for the "first value" for that PERSON.

Grouping the data by JOB_ID performed the operation across people -- something you didn't want to do.

It can be hard to visualize but maybe this will help:

o the analytics happen on the ENTIRE result set (they are done last)
o they work within the partitions -- the groups -- you define in that result set
o if you needed something "by person", people.id would be your partition key
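
To make that concrete, here is a minimal sketch (using only the test tables from the question; nothing new is assumed) that drops the DISTINCT and puts both partition choices side by side, so you can see which rows each window groups together and which address FIRST_VALUE hands back for each:

select people.id
      ,people.first
      ,jobs.name                     job_name
      ,job_history.end_date
      ,job_address_history.address   raw_address
      ,first_value(job_address_history.address)
         over (partition by people.id
               order by job_history.end_date desc nulls first) by_person
      ,first_value(job_address_history.address)
         over (partition by job_address_history.job_id
               order by job_history.end_date desc nulls first) by_job
  from people
      ,job_history
      ,jobs
      ,job_address_history
 where people.id          = job_history.person
   and job_history.job_id = jobs.job_id
   and jobs.job_id        = job_address_history.job_id
 order by people.id, job_history.end_date desc nulls first;

Reading the output this way shows why BY_PERSON is constant for all of a person's rows while BY_JOB repeats the same value for every person who ever held that job.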





Comments

I'll sleep easier tonight!

Steve McKee, March 15, 2002 - 6:46 am UTC

Yes, it did help. I think the comment that really made it clear was:

o the analytics happen on the ENTIRE result set (they are done last)

So if I understand correctly, you

o execute the query,
o get all the possible combinations for a given person,
o define the partition window by person id,
o then, for each column, decide how you want to order the result set to nail down the column you want.

The other important thing to remember is that it's the "distinct" clause that wraps it all up and returns the single result row.

(sigh)

Would it be safe to say that in general, for a given query, the partition key is always the same? What might change, though, is the order by (in some cases, it may contain a combination of columns to nail down the row value you really want).

Thanks a lot Tom (book is on order).

Tom Kyte
March 15, 2002 - 2:08 pm UTC

No, it's not safe to assume that.

Suppose - you wanted a report that shows:

a person
what they make
the salary of the highest paid person in their dept
the salary of the highest paid person with their same job
the salary of the highest paid person with the same mgr.

Well, that would be:

scott@ORA817DEV.US.ORACLE.COM> select ename || ' a ' || job "A person",
2 sal "makes",
3 first_value(sal) over ( partition by deptno order by sal desc ) "Highest in Dept",
4 first_value(sal) over ( partition by job order by sal desc ) "Highest in Job",
5 first_value(sal) over ( partition by mgr order by sal desc ) "Highest/same mgr"
6 from emp
7 /

A person                  makes Highest in Dept Highest in Job Highest/same mgr
-------------------- ---------- --------------- -------------- ----------------
KING a PRESIDENT           5000            5000           5000             5000
CLARK a MANAGER            2450            5000           2975             2975
MILLER a CLERK             1300            5000           1300             1300
SCOTT a ANALYST            3000            3000           3000             3000
FORD a ANALYST             3000            3000           3000             3000
JONES a MANAGER            2975            3000           2975             2975
ADAMS a CLERK              1100            3000           1300             1100
SMITH a CLERK               800            3000           1300              800
BLAKE a MANAGER            2850            2850           2975             2975
ALLEN a SALESMAN           1600            2850           1600             1600
TURNER a SALESMAN          1500            2850           1600             1600
WARD a SALESMAN            1250            2850           1600             1600
MARTIN a SALESMAN          1250            2850           1600             1600
JAMES a CLERK               950            2850           1300             1600

14 rows selected.

In general -- the partition, order by, and range can and will change. In your case it did not, since you were "people" focused.

Just when I thought I had it all figured out...

Steve Mckee, April 24, 2002 - 8:40 pm UTC

Hi Tom,

I have yet another problem with the first_value/last_value function. Here's the query:


select distinct
substr(th_th_id,1,8),
first_value(account_after)
over (partition by substr(th_th_id,1,8)
order by nvl(account_after,'0')||th_th_id --to_char(start_time,'yyyymmddhh24miss')
desc NULLs first)
from pp_calls
where account = 522229650001




When I use the first_value, I get results like:




20020306 148.84
20020307 72.68


So far, so good. When I use last_value (the exact same query,
just with last_value instead of first_value) I get:


20020306 0
20020306 96.2
20020306 106.2
20020306 111.2
20020306 116.2
20020306 122.08
20020306 127.08
20020306 132.08
20020306 137.08
20020306 148.84
20020307 0
20020307 1.38
20020307 3.88
20020307 55.92
20020307 72.68

I was expecting to get different values of course,
but not more rows. I am yet again confused.

You'll probably spot the problem right away without sample data.
But if you require sample data, I will gladly forward it to
you.


Thanks.

Steve

Tom Kyte
April 24, 2002 - 8:56 pm UTC

If you have my book, I go through the analytic stuff in detail.  One comment I make in there is:

...
Jumping ahead just a little to the very next section on windowing clauses -- it can be said that the existence of an ORDER BY in an analytic function will add a default window clause of "RANGE UNBOUNDED PRECEDING". What that means is that the set of rows to be used in the computation is the current and all preceding rows in the current partition. Without the ORDER BY, the default window is the entire partition.
....


-- the computation is the current and ALL PRECEDING (but not following) rows!!


So, sometimes it helps to just dump the entire thing and see what you got:

ops$tkyte@ORA817DEV.US.ORACLE.COM> select substr( th_th_id, 1, 8 ),
  2         account_after,
  3         first_value( account_after) 
              over ( partition by substr( th_th_id, 1, 8 )
  4           order by nvl(account_after,'0') || th_th_id desc nulls first ) fv,
  5         last_value( account_after) 
              over ( partition by substr( th_th_id, 1, 8 )
  6           order by nvl(account_after,'0') || th_th_id desc nulls first ) lv1,
  7         last_value( account_after) 
              over ( partition by substr( th_th_id, 1, 8 )
  8           order by nvl(account_after,'0') || th_th_id desc nulls first
  9           rows between unbounded preceding and unbounded following ) lv
 10    from t
 11   where account = 522229650001
 12   order by 1, nvl(account_after,'0') || th_th_id desc nulls first
 13  /

SUBSTR(T ACCOUNT_AFTER         FV        LV1         LV
-------- ------------- ---------- ---------- ----------
20020306           9.9        9.9        9.9        1.1
20020306           8.8        9.9        8.8        1.1
20020306           7.7        9.9        7.7        1.1
20020306           6.6        9.9        6.6        1.1
20020306           5.5        9.9        5.5        1.1
20020306           4.4        9.9        4.4        1.1
20020306           3.3        9.9        3.3        1.1
20020306           2.2        9.9        2.2        1.1
20020306         15.15        9.9      15.15        1.1
20020306         14.14        9.9      14.14        1.1
20020306         13.13        9.9      13.13        1.1
20020306         12.12        9.9      12.12        1.1
20020306         11.11        9.9      11.11        1.1
20020306          10.1        9.9       10.1        1.1
20020306           1.1        9.9        1.1        1.1
20020307       115.115    115.115    115.115    101.101
20020307       114.114    115.115    114.114    101.101
20020307       113.113    115.115    113.113    101.101
20020307       112.112    115.115    112.112    101.101
20020307       111.111    115.115    111.111    101.101
20020307        110.11    115.115     110.11    101.101
20020307       109.109    115.115    109.109    101.101
20020307       108.108    115.115    108.108    101.101
20020307       107.107    115.115    107.107    101.101
20020307       106.106    115.115    106.106    101.101
20020307       105.105    115.115    105.105    101.101
20020307       104.104    115.115    104.104    101.101
20020307       103.103    115.115    103.103    101.101
20020307       102.102    115.115    102.102    101.101
20020307       101.101    115.115    101.101    101.101

30 rows selected.

That shows that while FV is "fixed" in the window (the default range is unbounded preceding), LV1 is the "last value UP TO the current row". We have to specify the RANGE to get the true last value in the partition!
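
An equivalent approach -- sketched here against the same pp_calls query from the comment above, nothing new assumed -- is to reverse the sort and keep FIRST_VALUE. The first row of an ordered partition always falls inside the default "range unbounded preceding" window, so FIRST_VALUE is fixed for the whole partition no matter which direction you sort; sorting ascending therefore returns the descending query's "last" value without any windowing clause:

select distinct
       substr(th_th_id, 1, 8),
       first_value(account_after)
         over (partition by substr(th_th_id, 1, 8)
               order by nvl(account_after,'0') || th_th_id asc nulls last) lv
  from pp_calls
 where account = 522229650001;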

Tom, you kill me!

Steve McKee, April 24, 2002 - 9:23 pm UTC

You are amazing. Answers fast and with data!! Thanks!!!

This was not obvious and I didn't spot it in
the Oracle manuals. I have your book but it's
at home and I'm at a customer site in Mexico.

A huge thanks.

That use of DISTINCT

Dave Hemming, May 18, 2005 - 7:41 am UTC

Thanks Tom, all this analytic knowledge has made me look extremely smart to my coworkers!

I'm wondering about one thing, though. A lot of the time we want the details of the latest record in a history table for a given key - traditionally we did

select ....
from HISTORY
where key=:key
and last_updated = (
select MAX(last_updated)
from HISTORY
where key=:key)

but what is the best way to do this using these new features? A DISTINCT and FIRST_VALUEs (or LAST_VALUEs) as in the original question, a where clause equating last_updated to the MAX(last_updated) OVER (partition by key) column, or something else? It's primarily when joining the history table back onto the main table (amongst others) and we want one HISTORY record for one MAIN_TABLE key...

Tom Kyte
May 18, 2005 - 9:15 am UTC

If you want ONE record (you have a key)


select *
from (select *
from history
where key = :key
order by key DESC, last_updated DESC)
where rownum = 1;

might even be better (assuming an index on (key, last_updated) -- read the index backwards and get one record)


If you want ALL of the last records from the history table -- then analytics would be appropriate.
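
For that "all of the last records" case, one common analytic pattern is a ROW_NUMBER filter -- a minimal sketch, assuming the HISTORY(key, last_updated, ...) shape from the comment above:

select *
  from (select h.*,
               row_number() over (partition by key
                                  order by last_updated desc) rn
          from history h)
 where rn = 1;

Comparing last_updated to MAX(last_updated) OVER (PARTITION BY key) works too, but it keeps every row that ties for the latest timestamp, whereas ROW_NUMBER keeps exactly one row per key.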

A reader, July 27, 2006 - 10:04 am UTC

SELECT empno, deptno, job, mgr
     , dense_rank() OVER ( PARTITION BY job ORDER BY sal DESC ) rnk
     , SUM ( sal ) OVER ( PARTITION BY job ORDER BY sal DESC ) "Run1"
     , SUM ( sal ) OVER ( PARTITION BY job ORDER BY sal DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) "Run2"
  FROM emp
 ORDER BY job ASC, sal DESC
/



     EMPNO     DEPTNO JOB              MGR        RNK       Run1       Run2
---------- ---------- --------- ---------- ---------- ---------- ----------
      1001         10 Clerk                         1        990        990
      1002         10 Manager                       1      11125      21126
      1005         11 Manager                       2      21126      21126
      1003         10 Secretary                     1       2202       5063
      1004         11 Secretary                     1       2202       5063
      1010         10 Secretary       1002          2       2942       5063
      1013         10 Secretary       1005          3       3657       5063
      1012         10 Secretary       1005          4       4362       5063
      1011         10 Secretary       1002          5       5063       5063
      1006         11 Security                      1       1001       1001
1006 11 Security 1 1001 1001

10 rows selected.


Tom, it's difficult to reproduce the actual data, but this is not dissimilar to your example -- I hope this is enough to ask a question.
The question is about the Run1 values for empno 1003 and 1004: why are they the same (2202)?

1) Is that because they have the same rank in the ORDER BY and so get treated like a mini window and summed up together? Or is it because of the default 'range unbounded preceding' somehow looking at them simultaneously?

2) When we say range unbounded preceding and we have ordered desc, and the top 3 rows have the same value, do they all rank the same, so that if we do an avg or sum on them they are treated together?


Tom Kyte
July 27, 2006 - 12:11 pm UTC

because the SAL column is not unique. order by sal, empno (make it unique) and you'll see the difference.

neither of those sal rows "comes first", so they both do.
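
A sketch of that fix against the reader's emp example above (same table and column names assumed): with empno as a tie-breaker, each row is its own peer group under the default window, so the running total advances row by row and the two tied Secretary rows no longer share the same Run1 value.

SELECT empno, job, sal
     , SUM(sal) OVER ( PARTITION BY job ORDER BY sal DESC, empno ) "Run1"
  FROM emp
 ORDER BY job, sal DESC, empno
/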

To "A reader"

Michel Cadot, July 27, 2006 - 12:41 pm UTC

A reader, July 31, 2006 - 2:57 am UTC

Thanks Tom and Michel, that sorted it.
