I asked Joel Kallman, our resident interMedia expert, to take a look at this. He gave this very detailed example that shows how to use interMedia to index XML and do section searching on the attributes contained within an XML document:
interMedia Text supports indexing of XML via the specification of a section group. A section group is a collection of predefined sections of a document. interMedia Text sections let you search for text within a particular named section, rather than across an entire document. This can dramatically improve the accuracy of searching across a set of tagged documents.
1) First, create the table to store our XML documents:
create table employee_xml(
id number primary key,
xmldoc clob )
/
2) Insert a sample document (the DTD is not required):
insert into employee_xml values( 1,
'<?xml version="1.0"?>
<!DOCTYPE employee [
<!ELEMENT employee (Name, Dept, Title)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Dept (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
]>
<employee>
<Name>Joel Kallman</Name>
<Dept>Oracle Service Industries Technology Group</Dept>
<Title>Technologist</Title>
</employee>');
3) Create our interMedia Text section group called 'xmlgroup', and add the tags Name and Dept to the section group. (Caution: in XML, tag names are case-sensitive, but tag names in section groups are case-insensitive.)
begin
ctx_ddl.create_section_group('xmlgroup', 'XML_SECTION_GROUP');
ctx_ddl.add_zone_section( 'xmlgroup', 'Name', 'Name' );
ctx_ddl.add_zone_section( 'xmlgroup', 'Dept', 'Dept' );
end;
/
4) Create our interMedia Text index, specifying the section group we created above. Also, specify the null_filter, as the Inso filter is not required:
create index employee_xml_index on employee_xml( xmldoc )
indextype is ctxsys.context
parameters( 'filter ctxsys.null_filter section group xmlgroup' )
/
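If the index is created but documents seem to be missing from query results, one place to look is the CTX_USER_INDEX_ERRORS view. A minimal sketch, assuming the index name used above:

```sql
-- List any per-document errors recorded while building the index.
-- err_textkey holds the rowid of the row that failed to index.
select err_index_name, err_timestamp, err_textkey, err_text
from ctx_user_index_errors
where err_index_name = 'EMPLOYEE_XML_INDEX'
order by err_timestamp desc;
```

An empty result suggests every row was indexed without error.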
5) Now, execute a query, searching for a name within a specific section:
select id from employee_xml where
contains(xmldoc, 'Joel within Name') > 0;
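The same WITHIN syntax applies to the other section we defined, and section searches can be combined with the usual query operators. The following is a sketch against the sample row; given the document inserted in step 2, both queries should match id 1:

```sql
-- Search the Dept zone section defined in 'xmlgroup'.
select id from employee_xml
where contains(xmldoc, 'Technology within Dept') > 0;

-- Combine two section searches; parentheses keep the grouping explicit.
select id from employee_xml
where contains(xmldoc, '(Joel within Name) and (Oracle within Dept)') > 0;
```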
6) The text inside non-empty tags is indexed, but the tag names themselves are not. Thus, the following queries will return zero rows:
select id from employee_xml
where contains(xmldoc, 'title') > 0;
select id from employee_xml
where contains(xmldoc, 'employee') > 0;
7) But the following query will locate our document, even though we have not defined Title as a section:
select id from employee_xml where contains(xmldoc, 'Technologist') > 0;
Let's say you want to get going right away with indexing XML, and don't want to have to specify sections for every element in your XML document collection. You can do this very easily by using the predefined AUTO_SECTION_GROUP. This section group is exactly like the XML section group, but the pre-definition of sections is not required. For all non-empty tags in your document, a zone section will be created with the section name the same as the tag name. Use of the AUTO_SECTION_GROUP is also ideal when you may not know in advance all of the tag names that will be a part of your XML document set.
8) Drop our existing interMedia Text index:
drop index employee_xml_index
/
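With the index gone, the 'xmlgroup' section group from step 3 is no longer referenced and can optionally be dropped as well. This is a housekeeping sketch rather than one of the original steps; skip it if you intend to reuse the group:

```sql
-- Optional cleanup: remove the user-defined section group created in step 3.
begin
ctx_ddl.drop_section_group('xmlgroup');
end;
/
```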
9) And this time, recreate it specifying the AUTO_SECTION_GROUP. We do not need to predefine the sections of our group; they are handled for us automatically:
create index employee_xml_index on employee_xml( xmldoc )
indextype is ctxsys.context
parameters( 'filter ctxsys.null_filter section group ctxsys.auto_section_group' )
/
10) And once again, we should be able to locate our document using a section search:
select id from employee_xml
where contains(xmldoc, 'Technologist within Title') > 0;
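Because AUTO_SECTION_GROUP creates a zone section for every non-empty tag, elements we never declared, including the enclosing employee element, become searchable too. The following is a sketch against the same sample row; the first two queries should find id 1, and the last should return no rows because 'Joel' does not appear inside Title:

```sql
select id from employee_xml
where contains(xmldoc, 'Joel within Name') > 0;

-- The root element is also a zone section, so this covers all child text.
select id from employee_xml
where contains(xmldoc, 'Kallman within employee') > 0;

-- No match expected: the word exists in the document, but not in this section.
select id from employee_xml
where contains(xmldoc, 'Joel within Title') > 0;
```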
interMedia Text in 8.1.6 and 8.1.7 has added many exciting features in support of XML. For example, AUTO_SECTION_GROUP also automatically indexes tag attributes. Additionally, in 8.1.6 and later, you can define sections qualified with a doctype delimiter. This avoids the problem of tag "collisions" when you may have different document types with identical tag names. These and many other new features can be reviewed at:</code>
http://technet.oracle.com/sample_code/products/text/htdocs/text_samples/Lite/Samples/imt_816_techover.html <code>
...