You Asked
Hi Tom,
I am trying to better understand the first_value funtion when multiple
joins/tables are involved. The problem I'm having is making sure
I specify the right "partition by" clause. I created some test
tables and I did manage to get the result I wanted but it took
a few attempts. It's easy with test data to know when the query
is correct. It's not so obvious or easly to validate with live data.
The example I used involved 5 tables:
people: a list of people
jobs: a list of jobs
job_history: for a given person, the list of
jobs they held.
job_address_history: For a given job/company, their address history
(i.e. the company moved a few times)
address_history: For a given person, their address history
(i.e. they moved a few times).
Note: The SQL to create/populate these tables is at the
end of this message.
Query Goal:
-----------
List all users, with their current job, current home address, and,
for their current job, the company's current address.
I think this is the query:
col address format a22
col name format a14
select distinct
first
,last
,first_value(name)
over (partition by job_history.person
order by end_date desc nulls first
) name
--
,first_value(start_date)
over (partition by job_history.person
order by end_date desc nulls first
) start_date
--
,first_value(end_date)
over (partition by job_history.person
order by end_date desc nulls first
) end_date
--
,first_value(address_history.address)
over (partition by people.id
order by address_history.move_out_date desc nulls first
) address
--
,first_value(job_address_history.address)
over (partition by people.id
order by end_date desc nulls first
) Company_address
from people
,job_history
,jobs
,address_history
,job_address_history
where people.id = job_history.person
and job_history.JOB_ID = jobs.job_id
and jobs.job_id = job_address_history.job_id
and people.id = address_history.person;
FIRST LAST NAME START_DAT END_DATE ADDRESS COMPANY_ADDRESS
------ ------ -------- --------- --------- ---------------------- ------------------------
John Doe Oracle 01-APR-98 John's current house Oracle's current address
Mary Smith Oracle 01-NOV-00 Mary's current house Oracle's current address
Mike Jones Oracle 01-JAN-01 Mike's current house Oracle's current address
But when I first tried the query, I used the following:
select distinct
first
,last
,first_value(name)
over (partition by job_history.person
order by end_date desc nulls first
) name
--
,first_value(start_date)
over (partition by job_history.person
order by end_date desc nulls first
) start_date
--
,first_value(end_date)
over (partition by job_history.person
order by end_date desc nulls first
) end_date
--
,first_value(address_history.address)
over (partition by people.id
order by address_history.move_out_date desc nulls first
) address
--
--,first_value(job_address_history.address)
-- over (partition by people.id
--
,first_value(job_address_history.address)
over (partition by job_address_history.job_id
order by end_date desc nulls first
) Company_address
from people
,job_history
,jobs
,address_history
,job_address_history
where people.id = job_history.person
and job_history.JOB_ID = jobs.job_id
and jobs.job_id = job_address_history.job_id
and people.id = address_history.person
FIRST LAST NAME START_DAT END_DATE ADDRESS COMPANY_ADDRESS
------ ------ -------- --------- --------- ---------------------- ----------------------------------------
John Doe Oracle 01-APR-98 John's current house Burger King's current address
John Doe Oracle 01-APR-98 John's current house McDonalds's current address
John Doe Oracle 01-APR-98 John's current house Oracle's current address
Mary Smith Oracle 01-NOV-00 Mary's current house Can. Tire's first and current address
Mary Smith Oracle 01-NOV-00 Mary's current house Oracle's current address
Mike Jones Oracle 01-JAN-01 Mike's current house Oracle's current address
Mike Jones Oracle 01-JAN-01 Mike's current house Wal-mart's current address.
7 rows selected.
Which is obviously wrong.
Can you give me some guidelines or "rules" to properly
define the partitions to avoid the "guessing game" I
seem to always be going through.
Thanks Tom.
Steve
Here's the scripts to create/populate the tables in case you need them.
drop table people;
create table people (id int,first varchar2(20),last varchar2(20));
insert into people values (91,'Mary','Smith');
insert into people values (92,'John','Doe');
insert into people values (93,'Mike','Jones');
drop table jobs;
create table jobs (job_id int,name varchar2(30),address varchar2(40));
insert into jobs values (1,'McDonalds','Maple Road');
insert into jobs values (2,'Burger King','Main Street');
insert into jobs values (3,'Oracle','500 Oracle Parkway Redwood Shores, CA');
insert into jobs values (4,'Wal-mart','President Drive');
insert into jobs values (5,'Canadian Tire','Maple Leaf drive');
insert into jobs values (6,'Sybase','22 cul-de-sac ave');
drop table job_history;
create table job_history (person int,start_date date,end_date date, job_id int);
insert into job_history values (92,'01-jan-1985','01-jan-1989',1);
insert into job_history values (92,'01-jan-1989','01-apr-1998',2);
insert into job_history values (92,'01-apr-1998',NULL,3);
insert into job_history values (91,'01-jan-1990','01-nov-2000',5);
insert into job_history values (91,'01-nov-2000',null,3);
insert into job_history values (93,'01-jan-1988','01-jan-1990',6);
insert into job_history values (93,'01-jan-1990','01-jan-2001',4);
insert into job_history values (93,'01-jan-2001',null,3);
drop table address_history;
create table address_history(person int, move_in_date date, move_out_date date, address varchar2(40));
insert into address_history
values (91,'01-jan-1990','01-jan-1995','Mary''s first house');
insert into address_history
values (91,'01-jan-1995','01-jan-1997','Mary''s second house');
insert into address_history
values (91,'01-jan-1997','01-jan-2000','Mary''s third house');
insert into address_history
values (91,'01-jan-2000',NULL,'Mary''s current house');
insert into address_history
values (92,'28-apr-1966','27-jul-1991','John''s first house');
insert into address_history
values (92,'27-jul-1991','15-sep-1994','John''s second house');
insert into address_history
values (92,'15-sep-1994',NULL,'John''s current house');
insert into address_history
values (93,'01-jan-1964','01-jan-1977','Mike''s first house');
insert into address_history
values (93,'01-jan-1977',NULL,'Mike''s current house');
drop table job_address_history;
create table job_address_history(job_id int, move_in_date date,
move_out_date date, address varchar2(40));
insert into job_address_history
values (1,'01-jan-1979','01-jan-1992','McDonalds''s first address');
insert into job_address_history
values (1,'01-jan-1992','01-jan-1999','McDonalds''s second address');
insert into job_address_history
values (1,'01-jan-1999',NULL,'McDonalds''s current address');
insert into job_address_history
values (2,'01-jan-1985','01-jan-1997','Burger King''s first address');
insert into job_address_history
values (2,'01-jan-1997',NULL,'Burger King''s current address');
insert into job_address_history
values (3,'01-jan-1980','01-jan-2000','Oracle''s first address');
insert into job_address_history
values (3,'01-jan-2000','01-dec-2000','Oracle''s second address');
insert into job_address_history
values (3,'01-dec-2000',NULL,'Oracle''s current address');
insert into job_address_history
values (4,'01-jan-1966','01-jan-2000','Wal-mart''s first address.');
insert into job_address_history
values (4,'01-jan-2000',NULL,'Wal-mart''s current address.');
insert into job_address_history
values (5,'28-apr-1966',NULL,'Can. Tire''s first and current address');
insert into job_address_history
values (5,'27-jul-1991','15-sep-1994','Sybase''s First address');
insert into job_address_history
values (5,'15-sep-1994',NULL,'Sybase''s current address');
and Tom said...
first, thanks so much for an easy to run example -- create tables, insert statements -- everything I needed. That was most excellent (hint hint rest of world)
well, if you have my book -- I think I cover this topic fairly well & verbosely. It's hard to visualize. Once you learn how to "read it aloud", it might help.
For example -- the only differences in your queries is the last column Company Address.
You had in the first (correct) query:
,first_value(job_address_history.address)
over (partition by people.id
order by end_date desc nulls first
) Company_address
In "english" that says -- "take all of the ROWS in the ENTIRE result set, group it by person.id. Within each group of person.id, sort the data desc by END_DATE, putting NULLS first. Then, take the FIRST_VALUE of that group and assign it to each row in that group (every row with the same person id)"
In effect that finds the most recent (null being most most recent) job_address_history.address for a person.id (a person)
The second one (incorrect one) ask for:
,first_value(job_address_history.address)
over (partition by job_address_history.job_id
order by end_date desc nulls first
) Company_address
that says partition the entire result set into groups by JOB_ADDRESS_HISTORY.JOB_ID. Now, sort each of these groups (which contain rows for possibly many different PEOPLE) by end_date desc (nulls first). Now for each row with the same JOB_ID, assign the address of the most recent company address. So, each PERSON in the database who had the same job as some other person will get the SAME address - the address for perhaps some other people.id.
In short, you wanted "first values" by people.id -- you always wanted to create a partition (a window if you will) that looks as people.id and sorts the data in that window looking for the "first value" for that PERSON.
Grouping the data by JOB_ID did some operations across people -- something you didn't want to do.
It can be hard to visualize but maybe this will help:
o the analytics happen on the ENTIRE result set (they are done last)
o they work within the partitions, the groups, you define in that result set
o if you needed something "by person", person.id would be your partition key
Rating
(7 ratings)
Is this answer out of date? If it is, please let us know via a Comment