Skip to Main Content
  • Questions
  • Remove RTF Tags from a Text Field Containing RTF

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Tim.

Asked: October 17, 2004 - 12:44 pm UTC

Last updated: April 30, 2012 - 8:21 am UTC

Version: 9i

Viewed 10K+ times! This question is

You Asked

Hi Tom,
In our database we have some varchar2(4000) fields (could later be a clob). Where we store RTF data. I would like to know if there is anyway I can remove the RTF tags returning only the text again. For Example a Function RTFtoTEXT that takes the RTF:

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue255;}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\lang1033\ul\b\i\f0\fs20 This is a test.\par
\cf1\ulnone\i0 This is a test.\cf0\b0\par
\par
\par
\par
\par
}

And Returns:

This is a test.
This is a test.

Can you do this with Oracle Text or Something?

TIA
Tim

and Tom said...

Yes, we can do this with text -- you can

a) filter to plaintext if you have an index
b) filter to html with or without an index
c) call ctxhx directly from the command line to filter the text and load it

I'll demo a) and b). you can play with ctxhx from the command line from $ORACLE_HOME/ctx/bin if you want (run it, it'll tell you the inputs it takes)



ops$tkyte@ORA9IR2> create table demo
2 ( id int primary key,
3 theblob blob,
4 theclob clob
5 )
6 /

Table created.

ops$tkyte@ORA9IR2> create table filter ( query_id number, document clob );

Table created.

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create index demo_idx on demo(theblob) indextype is ctxsys.context;

Index created.

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create sequence s;

Sequence created.

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create or replace directory my_files as '/home/tkyte/Desktop/'
2 /

Directory created.

ops$tkyte@ORA9IR2> @trace
ops$tkyte@ORA9IR2> alter session set events '10046 trace name context forever, level 12';

Session altered.

ops$tkyte@ORA9IR2> declare
2 l_blob blob;
3 l_clob clob;
4 l_id number;
5 l_bfile bfile;
6 begin
7 insert into demo values ( s.nextval, empty_blob(), empty_clob() )
8 returning id, theblob, theclob into l_id, l_blob, l_clob;
9
10 l_bfile := bfilename( 'MY_FILES', 'asktom.rtf' );
11 dbms_lob.fileopen( l_bfile );
12
13 dbms_lob.loadfromfile( l_blob, l_bfile,
14 dbms_lob.getlength( l_bfile ) );
15
16 dbms_lob.fileclose( l_bfile );
17
18 ctx_doc.ifilter( l_blob, l_clob );
19 commit;
20 ctx_doc.filter( 'DEMO_IDX', l_id, 'FILTER', l_id, TRUE );
21 end;
22 /

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> set long 500
ops$tkyte@ORA9IR2> select utl_raw.cast_to_varchar2(dbms_lob.substr(theblob,500,1)) from demo;

UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(THEBLOB,500,1))
-------------------------------------------------------------------------------
{\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1033\deflangfe1033{\fonttbl{\f0\froma
n\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fswiss\f
charset0\fprq2{\*\panose 020b0604020202020204}Arial;}
{\f2\fmodern\fcharset0\fprq1{\*\panose 02070309020205020404}Courier New;}{\f23\
froman\fcharset128\fprq1{\*\panose 00000000000000000000}MS Mincho{\*\falt MS ??
};}{\f28\froman\fcharset128\fprq1{\*\panose 00000000000000000000}@MS Mincho;}
{\f29\froman\fcharset238\fprq2 Times New Roman CE;


ops$tkyte@ORA9IR2> select theclob from demo;

THECLOB
-------------------------------------------------------------------------------
<HTML><BODY>
<h1><font size="5" face="Arial">Primary key index in Partitioning</font>
</h1>
<p><font size="3" face="Times New Roman">I have a table accounts whic
h has 80 million records (OLTP system). I would like to partition the table by
acct_by_date column.&nbsp;I will be going with range partition and global index
es. My concern is regd the primary key
acct_id. The index that will be created for primary key should it be local or g
lobal and which should I opt for?
</font></


ops$tkyte@ORA9IR2> select document from filter;

DOCUMENT
-------------------------------------------------------------------------------




Primary key index in Partitioning



I have a table accounts which has 80 million records (OLTP system). I would lik
e to partition the table by acct_by_date column.&#65533;I will be going with range par
tition and global indexes. My concern is regd the primary key acct_id. The inde
x that will be created for primary key should it be local or global and which s
hould I opt for?



Well, this is an easy one.&#65533;The primary key index can be local IF and ONLY IF, t
he primary key is in fact the (or part of th



ifilter works without a ctxsys.context index, but only lets you get HTML, filter works only with an index -- but lets you get plain text OR html


Rating

  (11 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Thanks but a small question

Tim, October 19, 2004 - 10:30 am UTC

Hi Tom,
Thanks for that it put me on the right track.

Just a quick followup question or three :)

1) I am assuming the trace commands are not needed for anything special or is there a reason for this?

2) When I do the filter on a clob nothing happens. Why? I'm guessing that this is already considered plain text or what?

create table workflow_temprtf
( id int primary key,
theblob blob,
theclob clob
);

create table workflow_rtffilter (query_id number,document clob);

create index workflow_rtfidx on workflow_temprtf(theblob) indextype is ctxsys.context;
create index workflow_rtfidx2 on workflow_temprtf(theclob) indextype is ctxsys.context;

create sequence workflow_rtfseq;

declare
l_blob blob;
l_clob clob;
l_id number;
l_bfile bfile;
lrtf varchar2(400);
lresult clob;
begin
lrtf:='{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 Arial;}}'||CHR(13)||CHR(10);
lrtf:=lrtf||'\viewkind4\uc1\pard\lang1033\fs17 this is a test'||CHR(13)||CHR(10);
lrtf:=lrtf||'\par }';

select workflow_rtfseq.nextval into l_id from dual;
insert into workflow_temprtf values (l_id, utl_raw.cast_to_raw(lrtf) , lrtf );

ctx_doc.filter('WORKFLOW_RTFIDX2', l_id, 'WORKFLOW_RTFFILTER', l_id, TRUE );
select document into lresult from workflow_rtffilter where query_id=l_id;
dbms_output.put_line(lresult);
end;

Returns RTF still ? why ?

3) I did succeed in getting it to work with:

create table workflow_temprtf
( id int primary key,
theblob blob,
theclob clob
);

create table workflow_rtffilter (query_id number,document clob);

create index workflow_rtfidx on workflow_temprtf(theblob) indextype is ctxsys.context;
create index workflow_rtfidx2 on workflow_temprtf(theclob) indextype is ctxsys.context;

create sequence workflow_rtfseq;

declare
l_blob blob;
l_clob clob;
l_id number;
l_bfile bfile;
lrtf varchar2(400);
lresult clob;
begin
lrtf:='{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 Arial;}}'||CHR(13)||CHR(10);
lrtf:=lrtf||'\viewkind4\uc1\pard\lang1033\fs17 this is a test'||CHR(13)||CHR(10);
lrtf:=lrtf||'\par }';

select workflow_rtfseq.nextval into l_id from dual;
insert into workflow_temprtf values (l_id, utl_raw.cast_to_raw(lrtf) , lrtf );

ctx_doc.filter('WORKFLOW_RTFIDX', l_id, 'WORKFLOW_RTFFILTER', l_id, TRUE );
select document into lresult from workflow_rtffilter where query_id=l_id;
dbms_output.put_line(lresult);
end;

However, and as also in your example there is a lot of extra lines in the result. Whats with those any idea?

Thanks for the help.




Tom Kyte
October 19, 2004 - 10:55 am UTC

1) doh, left that in by accident...

2) yes, the clob is just text -- blobs can have pdfs, xls, ppt, docs, whatever. the inso filter can recognize these. clobs cannot have those objects as they are binary in nature.

3) they are just "there", you could turn them into spaces easily -- but the filters just produce TEXT -- that is what gets fed into the indexing engine -- it just needed words.

No formatting is saved with the text filter, just the text. If formatting is relevant, the HTML output works.

Dmitry, March 01, 2005 - 2:34 pm UTC

Excellent! But can we do the same without Text?
Oracle CTX produce strange results with russian encodings.

Tom Kyte
March 01, 2005 - 3:30 pm UTC

it should not produce strange results, but this entire answer is predicated on using text specifically.

Oracle Text on 64 bit Linux

Dmitry, May 05, 2005 - 8:16 am UTC

Since russian text in rtf looks like: "\'ea\'e0\'ea \'ec\'e8\'ed\'e8\'ec\'f3\'ec" I have to use INSO filter for indexes, but INSO filter not implemented on my production 64 bit linux box. Don't you know why? I cannot find anything about this on OTN or metalink.

Tom Kyte
May 05, 2005 - 9:20 am UTC

the latest releases use a new filtering technology -- suggest you open an itar (we are dropping the inso filters) to get information on this.

Embedded images in RTF

Reed, October 31, 2008 - 5:13 pm UTC

Sorry to resurrect an old thread but the ctx_doc.ifilter function helped me get past a hurdle in migrating old data to a new system but there is one part missing and I may need to deal with is outside of oracle. I cant find what I am looking for in the documentation. (Probably just not looking in the right place)

I am importing several notes from a legacy system that are stored in varchar2(4000) fields in RTF format. If the RTF document is longer than 4000 characters then another record was created with an incremented sequence. I appended the notes into a blob field in a new table and used ifilter to convert the RTF tags to HTML and stored the results in a clob. This is working great although I had to use a slow-by-slow cursor loop to accomplish all this.

What I am now faced with are those notes that have embedded images. Is it possible to extract those images and store them somewhere? If so can you point me to some documentation that will help me figure this out and maybe provide an example?

Thanks
Tom Kyte
November 02, 2008 - 4:54 pm UTC

but this is a question for "ask-about-the-rtf-file-format.com" site.

I have no information about the internal format of a RTF file, sorry.

RTFEditorKit and Java stored procedure turn rtf to plain text

Doug, November 03, 2009 - 3:48 pm UTC

RTFEditorKit is apparently not a complete implementation of the RTF specification, but if your text is generated by RTFEditorKit this should work. I find that it is far faster than Oracle Text when converting text from multiple records on the fly. I was helped in this by http://bit.ly/1G6ArH and chapter 19 of Tom's Oracle Expert book.

I would be happy for improvements to this as I am not a java programmer.

The following was run on Oracle 9.2.0.8.0

ORA92080>CREATE TABLE test_rtf
2 AS
3 SELECT 1 AS id,
4 '{\rtf1\ansi\ansicpg1252\deff0\deflang1033'||
5 '{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}'||
6 '{\f1\fswiss\fprq2\fcharset0 System;}}\viewkind4'||
7 '\uc1\pard\f0\fs20 This is a sample rec'||
8 'eiving comment\b\f1\fs20\par }' AS rtf_text
9 FROM DUAL
10 UNION ALL
11 SELECT 2,
12 '{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\'||
13 'fcharset0 Arial;}}'||CHR(13)||CHR(10)||
14 '\viewkind4\uc1\pard\lang1033\fs17 this is a test'||
15 CHR(13)||CHR(10)||'\par }'
16 FROM DUAL
17 UNION ALL
18 SELECT 3,
19 '{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}}'||
20 '{\colortbl ;\red0\green0\blue255;}'||
21 '{\*\generator Msftedit'||
22 '5.41.15.1507;}\viewkind4\uc1\pard\lang1033\ul\b\i\f0\fs20 This is a test.\par'||
23 '\cf1\ulnone\i0 This is a test.\cf0\b0\par'||
24 '\par'||
25 '\par'||
26 '\par'||
27 '\par'||
28 '}'
29 FROM DUAL
30 UNION ALL
31 SELECT 4, NULL
32 FROM DUAL;

Table created.
ORA92080>CREATE OR REPLACE PACKAGE rtf
2 IS
3 PROCEDURE convertToText ( rtf_in IN VARCHAR2, plain_text OUT VARCHAR2 )
4 IS
5 language java
6 name 'rtf.convertToText( java.lang.String, java.lang.String[] )';
7
8 FUNCTION to_text ( rtf_in IN VARCHAR2 )
9 RETURN VARCHAR2
10 IS
11 language java
12 name 'rtf.to_text( java.lang.String ) return java.lang.String';
13
14 END rtf;
15 /

Package created.

ORA92080> CREATE OR REPLACE AND COMPILE
2 JAVA SOURCE NAMED "rtf"
3 AS
4 import javax.swing.text.BadLocationException;
5 import javax.swing.text.Document;
6 import javax.swing.text.rtf.RTFEditorKit;
7 import java.io.*;
8
9 public class rtf extends Object
10 {
11 public static void convertToText( java.lang.String p_in,
12 java.lang.String[] p_out )
13 throws IOException, BadLocationException
14 {
15 // test for null inputs to avoid java.lang.NullPointerException when input is null
16 if ( p_in != null )
17 { RTFEditorKit kit = new RTFEditorKit();
18 Document doc = kit.createDefaultDocument();
19 kit.read(new StringReader(p_in), doc, 0);
20 p_out[0] = doc.getText(0, doc.getLength());
21 }
22 else p_out[0] = null;
23 }
24
25 public static String to_text(String p_in)
26 throws IOException, BadLocationException
27 {
28 // test for null inputs to avoid java.lang.NullPointerException when input is null
29 if (p_in != null)
30 { RTFEditorKit kit = new RTFEditorKit();
31 Document doc = kit.createDefaultDocument();
32 kit.read(new StringReader(p_in), doc, 0);
33 return doc.getText(0, doc.getLength());
34 }
35 else return null;
36 }
37 }
38 /

Java created.

ORA92080>SET LINESIZE 90
ORA92080>COLUMN id FORMAT 90
ORA92080>COLUMN rtf_text FORMAT A59 WORD_WRAP
ORA92080>COLUMN plain_text FORMAT A20 WORD_WRAP
ORA92080>SELECT id
2 , rtf_text
3 , rtf.to_text( rtf_text ) AS plain_text
4 FROM test_rtf;

ID RTF_TEXT PLAIN_TEXT
--- ----------------------------------------------------------- --------------------
1 {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil This is a sample
\fcharset0 MS Sans Serif;}{\f1\fswiss\fprq2\fcharset0 receiving comment
System;}}\viewkind4\uc1\pard\f0\fs20 This is a sample
receiving comment\b\f1\fs20\par }

2 {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 this is a test
Arial;}}
\viewkind4\uc1\pard\lang1033\fs17 this is a test
\par }

3 {\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier This is a test.
New;}}{\colortbl ;\red0\green0\blue255;}{\*\generator This is a test.
Msftedit5.41.15.1507;}\viewkind4\uc1\pard\lang1033\ul\b\i\f
0\fs20 This is a test.\par\cf1\ulnone\i0 This is a
test.\cf0\b0\par\par\par\par\par}

4

4 rows selected.

Elapsed: 00:00:00.01

ORA92080>DECLARE
2 l_text_out VARCHAR2(4000);
3 BEGIN
4 rtf.convertToText
5 (
6 '{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl'||
7 '{\f0\fnil\fcharset0 MS Sans Serif;}'||
8 '{\f1\fswiss\fprq2\fcharset0 System;}}'||
9 '\viewkind4\uc1\pard\f0\fs20 This is a sample rec'||
10 'eiving comment\b\f1\fs20\par }'
11 , l_text_out
12 );
13
14 DBMS_OUTPUT.PUT_LINE( l_text_out );
15
16 END;
17 /
This is a sample receiving comment

PL/SQL procedure successfully completed.

"ctxsys.context " why its required??

Sunny, September 29, 2010 - 3:32 am UTC

Hi Tom,

1) Can you please explain the reason you have used ctxsys.context in "create index demo_idx on demo(theblob) indextype is ctxsys.context;"

2) When i am trying to execute the same statement in SQL Navigator 4.2 on Oracle 9ithe error I am getting is "ORA-29833: indextype does not exist".Also ,I am having "CREATE INDEXTYPE" privilege. What can be possible reason behind this?

Thanks
Tom Kyte
September 29, 2010 - 8:47 am UTC

1) because I wanted a context index on the field so I can use text functions on it.

2) did your DBA install text in your database?

$ oerr ora 29833
29833, 00000, "indextype does not exist"
// *Cause: There is no indextype by the specified name.
// *Action: <b>Use public views for the indextypes to see if an
//          indextype by the specified name has been created.</b>

different characters in rtf txt

Ico, November 25, 2010 - 12:25 am UTC

everything works perfectly until when You got in rtf text characters like "...\u1054?\u1090?\u1084?\u1077?\u1085?\u1072?...", they will become as "...??????..."

Is there any easy way to save these types on character?

thx

Tom Kyte
November 25, 2010 - 8:44 am UTC

RTF is a binary format, you would be using a blob and we wouldn't touch the data or change it at all.

i have one problem

satish kumar, March 13, 2012 - 7:23 am UTC

i tried this i am getting the html but not getting value in document column of filter table. Please suggest.
Tom Kyte
March 13, 2012 - 8:13 am UTC

huh?

Turkish and other characters not correctly converted

A reader, April 18, 2012 - 8:38 am UTC

First of all, Many many thanks for this piece of code to convert to plain text.

I know this is previously asked but i am not sure how to fix it.

We have rtf text in many languages saved in our database, when I used the above code to convert to plain text all of it correctly converts but the characters which have are like \u287? format do not convert correctly

Example
The below rtf saved in database

{\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Arial;}{\f1\fnil Arial Unicode MS;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1\pard\ltrpar\cf1\fs20 g\'fcnl\'fc\u287?\'fc \cf0\f1\par}

it actually is
günlüğü

but when using this code it converts to
günlüğğfc

Is there a way to fix this incorrect conversion?

ORA-00904: "SYS"."DBMS_JAVA"."SHORTNAME": invalid identifier

Rakesh Khandelwal, April 29, 2012 - 10:56 pm UTC

Hi Tom,

CREATE OR REPLACE AND COMPILE
    JAVA SOURCE NAMED "rtf"
    AS
    import javax.swing.text.BadLocationException;
    import javax.swing.text.Document;
    import javax.swing.text.rtf.RTFEditorKit;
    import java.io.*;
  
   public class rtf extends Object
   {
      public static void convertToText( java.lang.String p_in,
                                        java.lang.String[] p_out )
      throws IOException, BadLocationException
      {
         // test for null inputs to avoid java.lang.NullPointerException when input is null
        if ( p_in != null )
         { RTFEditorKit kit = new RTFEditorKit();
           Document doc = kit.createDefaultDocument();
           kit.read(new StringReader(p_in), doc, 0);
           p_out[0] = doc.getText(0, doc.getLength());
         }
         else p_out[0] = null;
      }
 
      public static String to_text(String p_in)
      throws IOException, BadLocationException
      {
         // test for null inputs to avoid java.lang.NullPointerException when input is null
         if (p_in != null)
         { RTFEditorKit kit = new RTFEditorKit();
           Document doc = kit.createDefaultDocument();
           kit.read(new StringReader(p_in), doc, 0);
           return doc.getText(0, doc.getLength());
         }
         else return null;
      }
   }
   /


While executing above code I am getting below error.

ORA-00904: "SYS"."DBMS_JAVA"."SHORTNAME": invalid identifier


Can you please help me on this.
Tom Kyte
April 30, 2012 - 8:21 am UTC

I'd sort of need a complete example to reproduce with

All results throwing ORA:22835 Issue

Nivedan Basak, March 03, 2015 - 10:05 pm UTC

Hi Tom,
With respect to the above issue, we have created the RTF Procedure as well as the Java source as mentioned by you. We have created a view on which we are trying to convert clob field to string using the functions provided. The view creates fine and no errors received. The particular select also works fine. But while we query all results from the view the solution fails with ORA-22835.

ERROR:
ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual:
5083, maximum: 4000)

More to Explore

VLDB

If you are new to partitioning, check out Connor McDonald's introduction series here.

VLDB

Documentation set on VLDB and Partitioning.