Hi Tom,
I have scenario where I'm indexing (comma separated values) CSV records and XML records, and storing the actual record on a Write-Once-Read-Many storage EMC Centera, and storing the index on Oracle database. The index facilitates searching and retrieval of record from EMC Centera.
For each record we store metadata of search fields. For CSV the search fields are simply defined as field positioning within the record e.g. CustomerID is at 5th position. For XML the search fields are defined as XPATH e.g. CustomerID is at a /Transact/CustomerID/. For XML its is possible that an XPATH is pointing to repeating group i.e. multiple values are returned when XPATH is evaluated.
Metadata Structure
==================
Field Key - 101
Field Name - 'CustomerID'
Field Position - '/Transact/CustomerID/text()' (assuming XML document)
When I read a XML document with many thousand records, for each record I create entry like below in two tables:-
Main Index Table (not to confuse with database Index)
=====================================================
EMC Centera Record Address - 'GHD37RT098f5'
Index Key - 123
File Name - 'abc.xml'
File Record Number - 1
<few other attributes>
EAV Table
=========
Index Key - 123
Field Key - 101
Field Value - '784598'
I have given above a simple example with one search attribute, but around 5-20 search attributes are likely.
Also the EAV table can have more than one value for same "Index Key" and "Field Key", in case of XPATH pointing to an XML repeating group .
I have seen in other posts on asktom where EAV model is painful to query and I'm facing the same situation, it is very slow.
As per your suggestion in other posts, I did thought of flattening my "Main Index Table" by adding CustomerID as standard column:-
EMC Centera Record Address - 'GHD37RT098f5'
Index Key - 123
File Name - 'abc.xml'
File Record Number - 1
CustomerID - '784598'
Works great for CSV case where a value is only 1. But how to model multiple values?
At the time of metadata creation I know if a search field can take multiple values or not. Based on this knowledge I can create a standard column for single values and for multiple values I'm thinking of either XMLType or Nested Table.
Nested Table behind the scenes anyway would end up as EAV data model. Also more than one Nested table could be required if there are more than one search field which requires multiple values storage.
What about XMLType? I thought of this, but couldn't see how I can store multiple values and make each of them separately indexed or in where clause filter. I tried a simple XML structure like this for storing two value.
<Values><Value>1234</Value><Value>7890</Value></Values>
But couldn't get my way around how to individually index the values and using them in filter condition.
Looking for an advise.
I don't fully understand what you're trying to achieve. Could you create a child table and the fields with multiple values become rows in the this child? You can then use standard database indexing to find the values you need.
If you want to go down the XML method, you can index it using an XMLIndex. This enables Oracle to use the index in the XML functions XMLQuery, XMLTable, XMLExists, XMLCast, extract, extractValue, and existsNode:
SQL>create table t (
2 x xmltype
3 );
Table created.
Elapsed: 00:00:00.26
SQL>
SQL>insert into t values (xmltype('<Values><Value>1234</Value><Value>7890</Value></Values>'));
1 row created.
Elapsed: 00:00:00.24
SQL>insert into t values (xmltype('<Values><Value>4321</Value><Value>321</Value></Values>'));
1 row created.
Elapsed: 00:00:00.22
SQL>insert into t values (xmltype('<Values><Value>0000</Value><Value>654</Value></Values>'));
1 row created.
Elapsed: 00:00:00.22
SQL>insert into t values (xmltype('<Values><Value>5467</Value><Value>987</Value></Values>'));
1 row created.
Elapsed: 00:00:00.22
SQL>commit;
Commit complete.
Elapsed: 00:00:00.22
SQL>
SQL>exec dbms_stats.gather_table_stats(user, 'T');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.25
SQL>
SQL>set autotrace trace exp
SQL>select * from t
2 where xmlexists ('/Values[Value=1234]' passing x);
Elapsed: 00:00:00.24
Execution Plan
----------------------------------------------------------
Plan hash value: 1605161290
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 4 | 492 | 8 (0)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL | T | 4 | 492 | 3 (0)| 00:00:01 |
| 3 | NESTED LOOPS SEMI | | 2 | 8 | 5 (0)| 00:00:01 |
| 4 | XPATH EVALUATION | | | | | |
|* 5 | XPATH EVALUATION | | | | | |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT /*+ <not feasible>)
5 - filter(TO_BINARY_DOUBLE("P1"."C_01$")=1.234E+003D)
Note
-----
- Unoptimized XML construct detected (enable XMLOptimizationCheck for more information)
SQL>
SQL>create index xi on t (x) indextype is xdb.XMLIndex;
Index created.
Elapsed: 00:00:00.57
SQL>exec dbms_stats.gather_table_stats(user, 'T');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.33
SQL>
SQL>select * from t
2 where xmlexists ('/Values[Value=1234]' passing x);
Elapsed: 00:00:00.23
Execution Plan
----------------------------------------------------------
Plan hash value: 2895750719
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 159 | 4 (25)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 159 | 4 (25)| 00:00:01 |
| 2 | SORT UNIQUE | | 1 | 36 | 2 (0)| 00:00:01 |
|* 3 | TABLE ACCESS BY INDEX ROWID BATCHED| SYS98172_XI_PATH_TABLE | 1 | 36 | 2 (0)| 00:00:01 |
|* 4 | INDEX SKIP SCAN | SYS98172_XI_PIKEY_IX | 6 | | 1 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY USER ROWID | T | 1 | 123 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(TO_BINARY_DOUBLE("SYS_P2"."VALUE")=1.234E+003D AND
SYS_XMLI_LOC_ISNODE("SYS_P2"."LOCATOR")=1)
4 - access("SYS_P2"."PATHID"=HEXTORAW('5EBC'))
filter("SYS_P2"."PATHID"=HEXTORAW('5EBC'))The access goes from a full table scan of T, to a table access by index rowid.
More details are in the docs:
http://docs.oracle.com/cd/B28359_01/appdev.111/b28369/xdb_indexing.htm#CHDFCGGI