Skip to Main Content
  • Questions
  • Cross join having 10 billion intermediate records but 200K Final records

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, PRABHU.

Asked: April 28, 2017 - 6:17 pm UTC

Last updated: April 29, 2017 - 2:02 am UTC

Version: 11GR2

Viewed 1000+ times

You Asked

Hello   
  
I working on the doing an application to customer match , where an application can have multiple customers  (corporate)  and customer can have multiple applications.  
so i am doing a cartesian join and using UTL_MATCH .JARO_WINKLER_SMILIARTY for string matching.   
  
Table A (applications) Approximatly 20,000 per day  
Table B (CusomterPfile) approximatly 400,000 per month.  
  
I have to match apps for month, so intermeidate volume is 20,000*400,000 = 8,000,000,000 and final data set with JW_SCORE>970  
Its time consuming to create the table of 8billion and fillter for200K Approx having  jw_score>970.  
And also to filter in sub querry. please suggest a better approach  .  
  
 CREATE TABLE tablea  
(  
  APP_ID           NUMBER(14),  
  APP_LOAD_DATE  DATE,  
  SSN            VARCHAR2(20 BYTE),  
  DOB            DATE,  
  FIRST_NAME     VARCHAR2(50 BYTE),  
  LAST_NAME      VARCHAR2(50 BYTE),  
  ADDR           VARCHAR2(92 BYTE),  
  CITY           VARCHAR2(50 BYTE),  
  STATE          VARCHAR2(20 BYTE),  
  ZIP            VARCHAR2(20 BYTE)  
);  
CREATE TABLE tableb  
(  
  ACCOUNT_NUMBER                 NUMBER(9)                 NOT NULL,  
  DOB                 DATE,  
  SSN                 NUMBER(9),  
  FIRST_NAME          VARCHAR2(32 BYTE),  
  LAST_NAME           VARCHAR2(60 BYTE),  
  ADDR                VARCHAR2(100 BYTE),  
  CITY                VARCHAR2(26 BYTE),  
  STATE               CHAR(2 BYTE),  
  ZIP                 CHAR(5 BYTE),  
  ZIP_4               CHAR(4 BYTE)  
);   
 
 
 SELECT A.* , JW_SSN+JW_DOB+JW_FNAME+JW_LNAME+JW_ADDR AS JW_SCORE   FROM  (  
SELECT /*+ PARALLEL(16) */  
      a.appid  
     , a.ssn a_ssn  
     , b.ssn b_ssn   
     , a.dob AS a_dob  
     , b.dob AS b_dob   
     , a.first_name AS a_first_name  
     , b.first_name AS b_first_name   
     , a.last_name AS a_last_name  
     , b.last_name AS b_last_name   
     , A.FIRST_NAME ||' '|| A.LAST_NAME AS A_FULLNAME  
     , B.FIRST_NAME ||' ' ||B.LAST_NAME  AS B_FULLNAME    
     , a.addr AS a_addr  
     , b.addr b_addr   
     , a.city AS a_city  
     , b.city b_city   
     , a.state AS a_state  
     , b.state b_state   
     , a.zip AS a_zip  
     , b.zip b_zip  
     , ( CASE WHEN ( A.ssn IS NULL or B.SSN IS NULL ) THEN 100  
            WHEN ( ( A.SSN = 111111111 OR B.SSN = 111111111)) THEN 100  
            ELSE UTL_MATCH.jaro_winkler_similarity (  LPAD( A.ssn,9,0) , LPAD( b.ssn ,9,0))  
      END)* 3 as JW_SSN   
       ,( CASE WHEN ( A.dob IS NULL OR B.dob IS NULL ) THEN 100   
            ELSE UTL_MATCH.jaro_winkler_similarity ( A.dob ,B.dob)  
      END )*2 as JW_DOB   
      ,( CASE WHEN ( A.first_name IS NULL OR B.first_name IS NULL ) THEN 100   
            ELSE UTL_MATCH.jaro_winkler_similarity ( A.first_name ,B.first_name)  
      END ) *2as JW_FNAME  
      , (CASE WHEN ( A.last_name IS NULL OR B.last_name IS NULL ) THEN 100   
            ELSE UTL_MATCH.jaro_winkler_similarity ( A.last_name ,B.last_name)  
      END) *2 as JW_LNAME  
      , CASE WHEN (  A.FIRST_NAME || A.LAST_NAME IS OR AND B.FIRST_NAME||B.LAST_NAME IS NULL ) THEN 100   
            ELSE UTL_MATCH.jaro_winkler_similarity (  A.FIRST_NAME ||' '|| A.LAST_NAME , B.FIRST_NAME ||' ' ||B.LAST_NAME)  
      END as JW_FULLNAME  
     , ( CASE WHEN ( A.addr IS NULL AND B.addr IS NULL ) THEN 100   
            ELSE UTL_MATCH.jaro_winkler_similarity ( A.addr ,B.addr)  
      END) * 1 as JW_ADDR   
  FROM tablea a, tableb b  
 WHERE     1 = 1  
 AND b.start_service_date >= a.app_load_date  
  
  ) A  
 WHERE JW_SSN+JW_DOB+JW_FNAME+JW_LNAME+JW_ADDR > 970; 

Plan hash value: 1977758127
 
----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |   733M|   242G|       |  5713  (98)| 00:00:12 |        |      |            |
|   1 |  PX COORDINATOR FORCED SERIAL|              |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10001     |   733M|   242G|       |  5713  (98)| 00:00:12 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    MERGE JOIN                |              |   733M|   242G|       |  5713  (98)| 00:00:12 |  Q1,01 | PCWP |            |
|   4 |     SORT JOIN                |              |   479K|    72M|   170M|    74   (5)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE              |              |   479K|    72M|       |    71   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX SEND BROADCAST      | :TQ10000     |   479K|    72M|       |    71   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   7 |        PX BLOCK ITERATOR     |              |   479K|    72M|       |    71   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL    | JW_BAN_TABLE |   479K|    72M|       |    71   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  9 |     FILTER                   |              |       |       |       |            |          |  Q1,01 | PCWP |            |
|* 10 |      SORT JOIN               |              |   611K|   114M|   258M|    94   (5)| 00:00:01 |  Q1,01 | PCWP |            |
|  11 |       PX BLOCK ITERATOR      |              |   611K|   114M|       |    91   (2)| 00:00:01 |  Q1,01 | PCWC |            |
|  12 |        TABLE ACCESS FULL     | JW_APP       |   611K|   114M|       |    91   (2)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   9 - filter(CASE  WHEN (("A"."SSN" IS NULL) OR ("B"."SSN" IS NULL)) THEN 100 WHEN ((TO_NUMBER("A"."SSN")=111111111) OR
              ("B"."SSN"=111111111)) THEN 100 ELSE "UTL_MATCH"."JARO_WINKLER_SIMILARITY"(LPAD("A"."SSN",9,'0'),LPAD(TO_CHAR("B"."SSN"),9
              ,'0')) END *3+CASE  WHEN (("A"."DOB" IS NULL) OR ("B"."DOB" IS NULL)) THEN 100 ELSE
              "UTL_MATCH"."JARO_WINKLER_SIMILARITY"(INTERNAL_FUNCTION("A"."DOB"),INTERNAL_FUNCTION("B"."DOB")) END *2+CASE  WHEN
              (("A"."FIRST_NAME" IS NULL) OR ("B"."FIRST_NAME" IS NULL)) THEN 100 ELSE
              "UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."FIRST_NAME","B"."FIRST_NAME") END *2+CASE  WHEN (("A"."LAST_NAME" IS NULL) OR
              ("B"."LAST_NAME" IS NULL)) THEN 100 ELSE "UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."LAST_NAME","B"."LAST_NAME") END
              *2+CASE  WHEN (("A"."ADDR" IS NULL) AND ("B"."ADDR" IS NULL)) THEN 100 ELSE
              "UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."ADDR","B"."ADDR") END *1>970)
  10 - access(INTERNAL_FUNCTION("B"."START_SERVICE_DATE")>=INTERNAL_FUNCTION("A"."APP_LOAD_DATE"))
       filter(INTERNAL_FUNCTION("B"."START_SERVICE_DATE")>=INTERNAL_FUNCTION("A"."APP_LOAD_DATE"))
 
Note
-----
   - Degree of Parallelism is 16 because of hint



thanks in Advance

and Connor said...

You need "more" to join on to give a better reduction of rows through the join.

So you can provide some assumptions that *you* are probably making through the database. For example, you might assume that we'll match on first letter of each name, and the state they live in.

AND b.start_service_date >= a.app_load_date
and substr(b.first_name,1,1) = substr(a.first_name,1,1)
and substr(b.last_name,1,1) = substr(a.last_name,1,1)
and b.state = a.state



Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Hints

Need more information on hints? Check out the complete hint descriptions