Skip to Main Content
  • Questions
  • Analytic Function - Count Distinct in Unbounded Preceding Window

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Betsy.

Asked: June 09, 2008 - 5:07 pm UTC

Last updated: June 10, 2008 - 10:56 am UTC

Version: 10g

Viewed 10K+ times! This question is

You Asked

I want to run the following code, but Oracle doesn't like the distinct in the second column build. Essentially, I want to develop a running (cumulative) patient count for every incremental row. While pt_id is unique, com_pt_id is not; therefore there will be instances where the edw_pts_cum_2 starts to be less than the pts_cum count.

Here is sample of what I'm trying to accomplish. Do have any thoughts on how to develop column edw_pts_cum_2?

Cum Distance pts_cum edw_pts_cum_2 Pt_id com_pt_id
10 1 1 101 1
20 2 2 201 2
30 3 3 301 3
40 4 3 102 1


create table lab_data_res_sub_xy_com_sub
nologging
nocache
pctfree 0
as
select a.*,
count(*) over (order by distance asc
rows between unbounded preceding and current row)
as pts_cum,
count(distinct com_pt_id) over (order by distance
rows between unbounded preceding and current row)
as edw_pts_cum_2
from lab_data_res_sub_xy_com_dda a
order by distance asc, test_epatient_id;

and Tom said...

well, the problem is..... fairly hard....

think about this, you order by distance and say the com_pt_id data looks like this


1
2
1
2
3
1
2
4

You have to remember what values you've already hit - for example, we keep hitting 1 over and over - we need to remember "1" has already been hit. Even if there are 1,000,000 intervening values - we need to remember "1 has been hit" so as to not count it again


Here are two approaches - one is a scalar subquery (rt_2b). I do not think (well, I know) it won't scale very well if the result set is huge. The other is an analytic on an analytic. We "mark" the first com_pt_id by distance and only count the first one.


ops$tkyte%ORA10GR2> create table t
  2  as
  3  select hiredate distance, empno pt_id, mgr com_pt_id
  4    from scott.emp;

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select pt_id, com_pt_id, distance,
  2         count(*) over (order by distance) rt_1,
  3             count(case when rn=1 then com_pt_id end) over (order by distance) rt_2a,
  4             (select count(distinct com_pt_id) from t t2 where t2.distance <= t1.distance) rt_2b
  5    from (select pt_id, com_pt_id, distance,
  6                 row_number() over (partition by com_pt_id order by distance) rn
  7                    from t) t1
  8   order by distance
  9  /

     PT_ID  COM_PT_ID DISTANCE        RT_1      RT_2A      RT_2B
---------- ---------- --------- ---------- ---------- ----------
      7369       7902 17-DEC-80          1          1          1
      7499       7698 20-FEB-81          2          2          2
      7521       7698 22-FEB-81          3          2          2
      7566       7839 02-APR-81          4          3          3
      7698       7839 01-MAY-81          5          3          3
      7782       7839 09-JUN-81          6          3          3
      7844       7698 08-SEP-81          7          3          3
      7654       7698 28-SEP-81          8          3          3
      7839            17-NOV-81          9          3          3
      7902       7566 03-DEC-81         11          4          4
      7900       7698 03-DEC-81         11          4          4
      7934       7782 23-JAN-82         12          5          5
      7788       7566 09-DEC-82         13          5          5
      7876       7788 12-JAN-83         14          6          6

14 rows selected.




Rating

  (2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

"Analytic Function - Count Distinct in Unbounded Preceding Window", version 10g

Betsy McNair, June 10, 2008 - 1:30 pm UTC

Both solutions work; however, you are corrent in that the solution that produces rt_2b doesn't scale well. The code that produces result rt_2a runs in about 2-6 minutes against 1.6 million records whereas the code that produces result rt_2b being tested against the same data set is still running after more than an hour. Thanks for your help.

model

Laurent Schneider, June 10, 2008 - 1:59 pm UTC

In 10g you could consider a model approach

select * from t model 
dimension by (distance,pt_id) measures (com_pt_id, 0 c) 
rules (c[distance,any]=count(distinct com_pt_id)[distance<=cv(distance),any])
order by distance
/ 

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.